(We noticed the faux pas this morning; the sherry was still active.)
While I was testing the Cloudatics platform as a service (PaaS), I wanted to do something big, real big, and random. A billion would have been nice, but due to time constraints a mere 10 million would have to suffice in the meantime.
So I generated some random data…
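Something along these lines does the trick; a minimal sketch, with an illustrative filename:

```python
import random

# Write 10 million coin flips, one "H" or "T" per line, to the text
# file that gets uploaded to Dropbox (filename is illustrative).
with open("heads_and_tails.txt", "w") as f:
    for _ in range(10_000_000):
        f.write(random.choice("HT") + "\n")
```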
The Cloudatics platform takes raw data and processes it down to whatever you want; billing is done per second of processing time.
Connectors define what the data is (a text file, a spreadsheet or a database connection). In this case, all my heads-and-tails data is in a text file on Dropbox.
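As a rough sketch, a connector definition might look something like this (the field names and path are illustrative, not Cloudatics' actual schema):

```python
# Hypothetical connector definition; the field names are illustrative only.
connector = {
    "name": "heads-and-tails",
    "type": "textfile",              # or a spreadsheet, or a database connection
    "source": "dropbox",
    "path": "/heads_and_tails.txt",  # hypothetical Dropbox path
}
```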
Workers instruct Cloudatics what to do with the data. For this connector I want to run a basic word count, which finds every unique word (in this case either H or T) and keeps a running total. Workers for finding Twitter @names and #hashtags are included to get you started.
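Conceptually, the word-count worker boils down to something like this (a local Python sketch, not Cloudatics' actual implementation):

```python
from collections import Counter

def word_count(path):
    """Tally every unique word in the file, keeping a running total."""
    totals = Counter()
    with open(path) as f:
        for line in f:
            totals.update(line.split())
    return totals

print(word_count("heads_and_tails.txt"))
# e.g. Counter({'H': 5000525, 'T': 4999475})
```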
All that remains is to create a job definition:
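Roughly speaking, the job definition just ties the connector and the worker together; again, the field names here are illustrative rather than Cloudatics' real schema:

```python
# Hypothetical job definition; the field names are illustrative only.
job = {
    "name": "coin-flip-word-count",
    "connector": "heads-and-tails",  # the Dropbox text file defined above
    "worker": "word-count",          # the built-in word-count worker
}
```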
And then queue it for processing.
Once the job is queued, an email is sent to you; two further emails go out when the job starts and when it completes.
The results are then waiting for you on your account page on the Cloudatics platform.
The full results are available for you to download; coming soon is storage to data warehouses or NoSQL data stores (HBase, MongoDB and Cassandra, for example).
The result?
H 5000525
T 4999475
The random seeding might account for such an even split, but the point was to prove that Cloudatics could process a large quantity of data. I believe it worked 🙂
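For what it's worth, the split is well within what a fair coin should produce: over 10 million flips the standard deviation is roughly 1,581, so landing 525 away from an even split is unremarkable. A quick check:

```python
import math

n = 10_000_000                    # total flips
heads = 5_000_525                 # observed heads
sigma = math.sqrt(n * 0.5 * 0.5)  # std. dev. of a fair binomial, ~1581

print((heads - n / 2) / sigma)    # ~0.33 standard deviations: perfectly plausible
```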