Just jumping in at part 3? You can read part 1 and part 2 to get up to speed.

Bringing in Hadoop
So far we’ve used Spring XD to pull a stream of tweets into a file. And, normally, at this point we’d be looking to copy the file into HDFS and then run some MapReduce job on it.
Historically you'd run:
hadoop fs -put /tmp/xd/output/tweetFashion.out tweetFashion.out
Run the job, then pull the results from HDFS back into a local file:
hadoop fs -getmerge output output.txt
If you need a reminder on setting up a single node Hadoop cluster then you can read a quick intro here.
Spring XD and Hadoop
By creating a stream from within the shell we can send the data straight into HDFS without the hassle of doing it manually.
First we must tell Spring XD to default to Hadoop 1.2.1 (the distribution used by XD build M4 at the time of writing).
Assuming that Hadoop is installed and running you can run the single command:
xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion' | twitterstreamtransformer | hdfs"
With the hdfs sink's rollover parameter we can ensure that writes to HDFS happen periodically and that not too much data is held in memory.
xd:>stream create --name tweetLouboutins --definition "twitterstream --track='#fashion' | twitterstreamtransformer | hdfs --rollover=1M"
Within the HDFS directory you’ll start to see the data appear very much in the same way as it was being dumped out to the tmp directory.
The MapReduce Job
The Mapper and Reducer make up a very simple word count demo, except that we're looking for hashtags instead of words. The difference here is the output: I don't want tab-delimited output, as I'm not a great fan of it; I want a comma-separated file.
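As a sketch of the map-side change (not the exact code from this series), the only real difference from the classic word count is that the mapper keeps only tokens that look like hashtags. The `HashtagExtractor` class and its regex below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper: pulls hashtag tokens out of a tweet's text.
// In the real Mapper this logic would sit inside map(), emitting
// (hashtag, 1) pairs to the context instead of returning a list.
public class HashtagExtractor {

    // A hashtag: '#' followed by one or more word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    public static List<String> extract(String tweetText) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(tweetText);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }
}
```

The reducer then sums the 1s per hashtag exactly as in the standard word count.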
The version of Hadoop you're running determines which property you need to set, so the easiest way to sort this is to cover all bases.
static void setTextoutputformatSeparator(final Job job, final String separator) {
    final Configuration conf = job.getConfiguration();
    conf.set("mapred.textoutputformat.separator", separator);           // prior to Hadoop 2 (YARN)
    conf.set("mapreduce.textoutputformat.separator", separator);
    conf.set("mapreduce.output.textoutputformat.separator", separator); // Hadoop 2+ (YARN)
    conf.set("mapreduce.output.key.field.separator", separator);
    conf.set("mapred.textoutputformat.separatorText", separator);       // seen in some distributions
}
Job job = new Job();
job.setJarByClass(TwitterHashtagJob.class);
setTextoutputformatSeparator(job, ",");
#Aksesuar,1
#AkshayJewellery,1
#AlSalamanty,1
#Alabama,1
#Aladin,1
#AlaskanWomen,2
#Alaventa,1
#Albright,8
#Albuquerque,1
#Alcohol,1
#Aldo,1
#AldoShoes,2
#AlecBaldwin,3
#Alejandro…,1
#Alencon,1
#Alert,2
#AlessandraAmbrosio!,2
#AlessiaMarcuzzi,2
#AlexSekella,1
#AlexTurner,3
#Alexa,3
#AlexaChung,8
#AlexaChungIt,1
#AlexanderMcQueen,7
#AlexanderWang,7
#Alexandra,1
#Alexandria,2
#Alexis,1
#Alfani,1
#Alfombraroja,1
#AlfredDunner,1
#Ali,1
#Alice,1
With a little command-line refining we can start to pick out the data we actually want. A visualisation of every single hashtag would hardly be pleasing on the eye, so for example I just want to see the counts for Asos, Primark and TopShop.
jason@bigdatagames:~$ egrep "#(asos|primark|topshop)," output20131117.txt
#asos,42
#primark,16
#topshop,64
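Beyond grep, a little more command-line work gives a quick top ten. This is just a sketch, assuming the one-hashtag-count-per-line format shown above:

```shell
# Rank hashtags by count, highest first, and keep the top ten.
# -t',' splits each line on the comma; -k2,2 -rn sorts the count
# field numerically in reverse order.
sort -t',' -k2,2 -rn output20131117.txt | head -10
```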
Next time….
We’ll get into the visualisation side of things and also look at maintaining a process to keep the output up to date.