As I’ve had a few people asking me about this, I thought a blog post would be more appealing. As Hadoop intros go this is basic, but it gets the point across.
1. Download Hadoop
The demo was originally built against Cloudera’s Hadoop distribution (CDH3), which you can find here: https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH3Packa…
2. Read the installation instructions and install the package.
3. Download the sample code
The code I presented at the BigData Bash can be downloaded or cloned here: https://github.com/jasebell/devbash
4. Have a read of the documentation on the GitHub site about how to compile the jar file.
You’ll need Ant installed if you want to do things the easy way.
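Steps 3 and 4 together look something like this at the command line. This is a sketch assuming you have git and Ant installed, and that the repository has a build.xml at its root that produces devbash.jar (check the repo’s own docs for the actual build targets):

```shell
# Clone the demo code from the URL above
git clone https://github.com/jasebell/devbash.git
cd devbash

# Build the jar with Ant - assumes a build.xml sits at the repo root
ant
```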
5. Create a directory called “input”
This is where you place the input data that Hadoop is going to Map and then Reduce. There’s no need to create an output directory; Hadoop will create it for us (in fact, the job will fail if the output directory already exists, so remove it between runs).
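In shell terms, step 5 is just this. The file name sample-tweets.txt is hypothetical; drop in whatever tweet export you actually have, one tweet per line:

```shell
# Create the input directory Hadoop will read from
mkdir -p input

# Put some tweet data in it (hypothetical sample file)
echo 'Loving the #bigdata demo at the #devbash tonight' > input/sample-tweets.txt
```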
6. Read the docs again to run Hadoop.
hadoop jar devbash.jar com.cloudatics.devbash.TwitterHashtagJob input output
The demo I ran was in standalone mode; with the amount of data we are talking about there’s no need for pseudo-distributed or fully distributed mode. The MTV data was fewer than 900,000 tweets and, as you saw if you were there, it didn’t take that long.
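For a dataset that size, you can even sanity-check the job’s results without Hadoop at all. Assuming the job counts hashtag occurrences (my reading of the class name TwitterHashtagJob, so treat this as an approximation, not the job’s exact logic) and that your input is one tweet per line, a quick pipeline gives you a comparable tally:

```shell
# Rough non-Hadoop hashtag count for sanity-checking the job's output:
# extract every hashtag (-o: one match per line, -h: no filenames),
# then count occurrences and sort by frequency, highest first
grep -ohE '#[A-Za-z0-9_]+' input/*.txt | sort | uniq -c | sort -rn
```

On a few hundred thousand tweets this runs in seconds, which is a handy way to spot-check that the MapReduce output looks sane.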