London Fashion Week (#lfw/#londonfashionweek) shows us the perils of tweet mining.


London Fashion Week kicked off today, which makes it a fine excuse to run some more mining algorithms.  One thing that becomes apparent very quickly is that hashtags have no formal definition.  A hashtag (#thosethings) is whatever the author of the tweet decides it is.  That’s all very well for human readers, who can read past the hashtags (a bit like bad spelling, you read the sentence and skip over them).

For machine learning programs it causes all sorts of issues.  Consider Antoni & Alison, who did a digital catwalk today.  Here are the hashtags we pulled out with the Cloudatics platform (we gathered our data via the excellent Repknight):

#antoni&alison #antoniandalison #antonialison #antoni&alison. #alison&antoni #antoni #antoni+alison #antoniaandalison #antonio #antonioalison

As you may have noticed, there’s an issue with the word “and”: there are multiple ways of writing the same thing.  “And”, “+” and “&” all mean “and”.  The problem comes when you start to mine the data with the likes of Hadoop/MapReduce: each of the tags above is treated as a separate key with its own count.
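To make the counting problem concrete, here is a minimal sketch of a word-count style map and reduce step in plain Python. It is not Hadoop and not the Cloudatics pipeline; the function names and the four-tag sample are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical sample: four of the hashtag variants listed above.
tags = ["#antoni&alison", "#antoniandalison", "#antoni+alison", "#antoni&alison."]

def map_phase(tags):
    """Emit a (key, 1) pair per hashtag, as a word-count mapper would."""
    for tag in tags:
        yield tag, 1

def reduce_phase(pairs):
    """Sum the values for each distinct key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

raw_counts = reduce_phase(map_phase(tags))

# Every spelling ends up as its own key with its own count.
for tag, count in raw_counts.items():
    print(tag, count)
```

All four tags refer to the same show, yet the reducer sees four different keys, because it compares the strings exactly.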

Extract, transform and load (ETL) now becomes very important in order to get a good working set of data.

So a mapping should connect “+” and “&” to “and” within the context of the tweet.
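One way such a mapping could look as a transform step is sketched below. This is an assumption about the approach, not the Cloudatics implementation: the `normalise_tag` helper, the choice to lowercase, and the set of trailing punctuation stripped are all hypothetical.

```python
import re
from collections import Counter

# Symbols treated as a written-out "and" (assumption for this sketch).
CONNECTORS = re.compile(r"[&+]")

def normalise_tag(tag: str) -> str:
    """Map a hashtag to a canonical form before counting."""
    body = tag.lower().lstrip("#")
    body = body.rstrip(".,!?;:")          # drop trailing punctuation
    body = CONNECTORS.sub("and", body)    # "&" and "+" both become "and"
    return "#" + body

variants = [
    "#antoni&alison",
    "#antoniandalison",
    "#antoni+alison",
    "#antoni&alison.",
    "#antonialison",
]

counts = Counter(normalise_tag(t) for t in variants)
print(counts)
```

With this sample, the four connector variants collapse into the single key `#antoniandalison`, while `#antonialison` stays separate. The sketch only handles the “and” problem; misspellings such as `#antoniaandalison` and reordered names such as `#alison&antoni` would need further rules.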

I’ll be looking at this in some detail over the next few blog posts, including how we can put it to use with Hadoop and Cloudatics in general.

 
