This is not a post about Spark vs Hadoop and all that; each tool is worthy of use in the “BigData” field. It is worth having a little talk about Spark though, as its working ways can catch folk out in the early stages.*
I like Spark.
I like Hadoop.
There are, though, differences.
It’s a Programmer’s Tool
First things first, Spark is really a programmer’s tool. There are no nice IDEs, graphical output or anything like that. If you don’t have a good grounding in Scala, Java or Python then you will struggle to grasp what’s going on.
While Hadoop has had time to mature and tools have been built around it, spawning a whole ecosystem of utilities (“you won’t need SQL on Hadoop”; six months later, “here’s SQL on Hadoop, you’ll need this”), Spark is plugging in to those tools and having its own pie too. Some of the libraries, like MLlib for machine learning, bring extra dependencies of their own. Ultimately, if you want full Spark power then you’ll be programming applications.
Many blog posts and articles centre around one aspect of Spark: how quick it is to do MapReduce-style work from the Spark shell (I’ll be using it in this example). I’ve got no problem with that, but once you get beyond that point we’re talking applications, and that means coding and using either sbt or Maven to build the projects before we run them.
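For reference, a bare-bones sbt build file for a Spark application looks something like the sketch below; the project name, Scala version and Spark version are placeholders, so match them to whatever your cluster is actually running.

// build.sbt: a minimal sketch, names and versions are placeholders
name := "spark-example"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"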
If you are a Python programmer, it’s worth noting that some features in the Spark Streaming library aren’t 100% supported in Python (there’s no Kafka or Flume support, for example; only basic text and TCP streams).
Greedy For Cores, Greedy For Memory
Unless you tell it otherwise, Spark is insanely greedy for processor cores and memory. It will use anything that it can get its hands on. I’ll show you how in a moment, and then how to control it too.
Spark isn’t fussy: for this post I’ve used my main Hadoop training server (an HP Gen8 Microserver) and my Toshiba Satellite C70 Windows 8.1 laptop, and they both worked like a treat.
The Master, Workers and Tasks
Spark can run a cluster either standalone, on YARN under Hadoop 2, or on Apache Mesos. The standalone cluster is easy enough to understand: you have a master and the workers connect to it. When jobs are sent for processing you have control over how many cores will do the work; if none are specified then all of them are allocated.

This allocation is important, as cores are not freed up until the worker has completed the job (or it has been halted or has crashed out).
Starting the master is a trivial task. It’s just a case of going to the directory you installed Spark in and running:
sbin/start-master.sh
This will start the master and also the web UI on port 8080.

Now you need to add some workers. I have two cores on the server where the master is residing, with 16GB of memory, but I don’t want Spark to use all the memory (just in case the server is asked to do anything else). In the conf folder there is a file called spark-env.sh and this sets up the various settings for that node.
So for my machine:
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=4g
This will give my worker two processor cores and 4GB of working memory. I could run two workers with one core and 2GB of memory each if I wished (a sketch of that follows below); all I have to do is change the settings and restart the master.
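If I did go down the two-smaller-workers route, the spark-env.sh settings would look something like this sketch; SPARK_WORKER_INSTANCES controls how many worker processes the node starts, and the figures simply mirror the split described above.

# conf/spark-env.sh: two workers on this node, one core and 2GB each
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2g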
From the command line I’ll start my worker:
bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.10:7077
After a short while you can refresh the web UI and you’ll see the worker now listed in the “Workers” table.

Adding Cores From Other Machines
I want to add workers from my laptop; it’s got four cores, so I might as well make use of them. First of all I need to tell the master the IP address of the new machine. The slaves file in the conf folder holds a list of the addresses of the machines in the cluster.
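The format is nothing fancy, just one address (or hostname) per line; in this sketch 192.168.1.10 is the server from earlier and 192.168.1.11 stands in for the laptop (your addresses will obviously differ).

# conf/slaves
192.168.1.10
192.168.1.11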
Once I’ve saved that and restarted the master (and the workers on the same machine as the master), I can add the new worker to the cluster by running the same worker command on the laptop, pointed at the master:
bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.10:7077
A look at the UI and you can see the new worker listed. In the top left you’ll see the total number of cores and memory available to the cluster.

So finally we have a cluster with 10GB of RAM and six cores available to Spark jobs. Now we can do some work.
Running Programs
The Spark shell is handy for seeing what happens on the cluster when an application is run.
bin/spark-shell --master spark://192.168.1.10:7077
Look at the UI now and you’ll see all the cores working on that application. If anyone tries to run another application, it will be held until resources become available.
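Even a throwaway job typed into the shell will put those cores to work; a minimal sketch in Scala, using the sc the shell already provides:

// doubles a million numbers and sums them, just to exercise the executors
val nums = sc.parallelize(1 to 1000000)
val total = nums.map(n => n * 2L).reduce(_ + _)
println(total)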

I can reduce the number of cores a program uses by adding the following flag when I start it:
--total-executor-cores <numCores>
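For example, to cap the shell at two of the cluster’s six cores (same master address as before):

bin/spark-shell --master spark://192.168.1.10:7077 --total-executor-cores 2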
Once the program has finished processing, or I’ve closed it down, the cores will be made available to the master again.

Concluding
The standalone cluster is fairly straightforward to set up, but thought must be given to the number of cores and the amount of memory available. When putting the cluster together it’s worth planning on paper how it’s going to be allocated. Programmers also need to take note of how they expect their applications to run within the cluster. The Spark context can be set within the code or at runtime; it’s always prudent to have a clearly communicated plan for how you expect these things to run.
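As a rough sketch of the in-code route, a SparkConf along the lines below would pin an application to two cores; the application name is made up, and the master address and core cap are just the values used in this post, so in practice you’d agree these with whoever runs the cluster.

import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch: the app name is hypothetical, the master and core cap match this post's cluster
val conf = new SparkConf()
  .setAppName("CoreLimitedApp")
  .setMaster("spark://192.168.1.10:7077")
  .set("spark.cores.max", "2")
val sc = new SparkContext(conf)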
Jason Bell is a Hadoop consultant based in Northern Ireland who helps companies globally with various BigData, Hadoop and Spark projects. He also offers training on Hadoop, the Hadoop ecosystem and Spark to developers and anyone interested in what these technologies can do. He’s also the author of “Machine Learning: Hands-On for Developers and Technical Professionals”.
* I have battle scars.