BigData Bash – The Hadoop Demo #devbash #dataissexy #hadoop

A few people have asked me about this, so I thought a blog post would be the easiest way to share it.

As Hadoop intros go this one is basic, but it gets the point across.

1. Download Hadoop
The demo was originally built against Cloudera’s Hadoop distribution (CDH3) which you can find here: https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH3Packa…

2. Read the installation instructions and install the package.

3. Download the sample code 
The code I presented at the BigData Bash can be downloaded or cloned here: https://github.com/jasebell/devbash

4. Have a read of the documentation on the GitHub site about how to compile the jar file.
You’ll need Ant installed if you want to do things the easy way.

5. Create a directory called “input”
This is where you place the input data that Hadoop is going to Map and then Reduce.  There’s no need to create an output directory as Hadoop will do that for us.

6. Read the docs again to run Hadoop.

hadoop jar devbash.jar com.cloudatics.devbash.TwitterHashtagJob input output
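
For anyone who wants a feel for what a job like this does before reading the repo, here is a rough sketch of the map and reduce steps for counting hashtags. It is plain Python running in memory, not Hadoop and not the code from the devbash repository. The function names, the regular expression and the lowercasing of hashtags are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def map_tweet(tweet):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one tweet."""
    for tag in HASHTAG.findall(tweet):
        # Assumption: treat #MTV and #mtv as the same hashtag.
        yield tag.lower(), 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all the values for one hashtag into a total."""
    return key, sum(values)


def count_hashtags(tweets):
    """Run map, shuffle and reduce over an iterable of tweet strings."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "Watching the awards #MTV #music",
        "Best show ever #mtv",
        "No hashtags here",
    ]
    print(count_hashtags(sample))  # {'#mtv': 2, '#music': 1}
```

The real thing splits the same three stages across mapper and reducer classes and lets Hadoop handle the grouping, but the shape of the work is the same.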

The demo I ran was in standalone mode; there is no need for pseudo-distributed or fully clustered mode with the amount of data we are talking about. The MTV data was fewer than 900,000 tweets and, as you saw if you were there, it didn’t take that long.
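
Once the job has finished, the results land in the output directory. The sketch below shows one way to pull out the most frequent entries. It assumes the job writes Hadoop's default text output, meaning files named part-* with one tab-separated key and count per line; check the repo if your output looks different. The helper names parse_lines and read_output are made up for this example.

```python
from collections import Counter
from pathlib import Path


def parse_lines(lines):
    """Parse 'key<TAB>count' lines into a Counter, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        key, sep, value = line.rstrip("\n").partition("\t")
        if sep and value.strip().isdigit():
            counts[key] += int(value)
    return counts


def read_output(directory="output"):
    """Merge every part-* file in the job's output directory."""
    counts = Counter()
    # Assumption: result files are named part-*; marker files such as
    # _SUCCESS do not match this pattern and are ignored.
    for part in sorted(Path(directory).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            counts.update(parse_lines(handle))
    return counts


if __name__ == "__main__":
    # Print the ten most frequent keys with their totals.
    for tag, total in read_output("output").most_common(10):
        print(f"{total}\t{tag}")
```

Run it from the same directory you ran the hadoop command in, so that "output" resolves to the directory Hadoop created.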
