Apache SparkML’s Biggest Pain Point – #MachineLearning #Spark #SparkML #Data #BigData

One of the highlights of my job as a Data Engineer (I’m not a data scientist) is that I get to do some very cool stuff with text mining and all that data schizz.

To that end I use Apache Spark, Clojure and Sparkling a lot. I do plenty of bag-of-words and word-vector work to get topics and classifications out of documents. And it’s at this point that SparkML fails like a complete worn-out donkey, thanks to one of those small, overlooked details that you only come across once in a while.

In topic modelling it’s nice to know (actually pretty important) which document was labelled with which terms. But the output of SparkML’s hashingTF function carries no trace of which document each term frequency came from, which is rather pointless and, let’s face it, pretty annoying.
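For what it’s worth, the thing I actually want is easy to sketch outside Spark. The snippet below is plain Python, not SparkML and not Sparkling. It applies the hashing trick per document and keeps each result keyed by a document id. The names (`bucket`, `hashed_tf`, `tf_by_document`), the bucket count and the sample documents are all made up for illustration, and CRC32 is only a deterministic stand-in for whatever hash a real implementation uses.

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical bucket count, kept tiny for readability


def bucket(term, num_features=NUM_FEATURES):
    """Map a term to a bucket index (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Return {bucket index: count} for one document's tokens."""
    return dict(Counter(bucket(t, num_features) for t in tokens))


def tf_by_document(docs, num_features=NUM_FEATURES):
    """Hash every document but keep the document id attached to its counts."""
    return {doc_id: hashed_tf(tokens, num_features)
            for doc_id, tokens in docs.items()}


# Made-up sample data: document id -> tokens
docs = {
    "doc-1": ["spark", "ml", "spark"],
    "doc-2": ["topic", "model"],
}

result = tf_by_document(docs)
for doc_id, counts in result.items():
    print(doc_id, counts)
```

The only point here is that the id travels with the counts the whole way through, so a topic or label can always be traced back to the document it came from.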

There, got that off my chest.


