So you want to be a data scientist? #BigData

During the Global Irish Economic Forum yesterday my friend and Digital Circle chair Mary McKenna (@MMaryMcKenna) tweeted:



Five years on from the original quote by D.J. Patil and Jeff Hammerbacher in 2008, it’s still brought out, especially at conference-like events where suits reign supreme. The one thing that never really got nailed down was what the job really is. Now I don’t know who at #GIEF2013 actually said it, but Mary was right to tweet it and Dara was right to nod, vigorously.

A Kaggle of Data Scientists

In July 2013 Kaggle reached its 100,000-user milestone. It’s fairly safe to assume that each of these scientists has a broad range of skills. And this is the core question that’s still asked: what skills define a Data Scientist?

First things first, domain knowledge

Knowledge of the domain you are working in is pretty key to all of this. Before data (BigData, SmallData and all the other data), Hadoop and reporting (or visualising if you really have to) comes the basic question: do you really know this industry? This becomes pivotal in what you’re doing, and it’s something that consultants fall down on. The numbers alone don’t stand up.

With domain knowledge in that industry sector you’re on to a winner. “In the kingdom of the blind the one-eyed man is King.” I’ve had the privilege of playing with an awful lot of retail loyalty data and supply chain data since the mid-90s (I worked for a data science company in 2002 and it was fascinating and damn hard work; I loved it).

Add a lot of theory

Yes, statistics knowledge is very, very, very helpful. This is where things get really interesting, because the argument over the qualifications of a data scientist gets wheeled out and gets really messy. I’ve heard many different stories…

“You’ve got to have a statistics degree”

“You need at least a PhD with a computer science background.”


I’ve heard many “You need”s over many, many years. Theory is helpful, learning is very helpful, qualifications help a bit more but only prove you can learn. I’ve worked with some PhDs and their working methods left me with hair loss and, more problematically for a very large client, the wrong results. Yes, PhDs get it wrong too.

During September I took one of those long sit-downs and looked in the mirror again. This was the first year I didn’t ask myself if I should do a degree. Felt quite nice actually. Then Nate Silver made my day even better when he was interviewed on the Harvard Business Review blog.

“I think the best training is almost always going to be hands on training. In some ways the book is fairly abstract, partly because you’re trying to look at a lot of different fields. You’re trying not to make crazy generalizations across too many spheres.

But my experience is all working with baseball data, or learning game theory because you want to be better at poker, right? Or [you] want to build better election models because you’re curious and you think the current products out there aren’t as strong as they could be.  So, getting your hands dirty with the data set is, I think, far and away better than spending too much time doing reading and so forth.”

Basically, find a good mentor and “don’t sweat the degree”. The key takeaway is to get your hands on data, and with the wash of the stuff out there that’s now pretty easy to do (as long as the US Government doesn’t shut down).

Then the tools….

Tools are just that, tools. So my mantra has always been “whatever it takes to get the job done”. I don’t mind if it’s Java, Python, Ruby or R (though R is damn helpful). Hadoop gets wheeled out an awful lot in name checks; it’s great for batch processing. Thing is, there are as many tools now as there are data scientists. Get comfortable with a few of the main ones; there’s plenty of online courseware to get you up to speed on the basics.

For the big picture it’s worth looking at Swami Chandrasekaran’s Tube Map-style picture of the data science field.


Each line on the map is a learning track, and it’s a pretty handy way to plan your learning.


It’s a BIG field out there and it takes many different skills to cover it. I don’t call myself a “data scientist”; I’m still part analyst, part architect and part data miner. For me it’s a case of using the tools to get the job done.

The thing is, for most organisations these skills are spread over many employees. So instead of the “sexiest job”, perhaps the “sexiest department” is the Data Science Team.
