#DataScience for the confused. NI software jobs. #unix #excel

Hypothesis: Java is a dead language in Northern Ireland. 

Let’s consider that for a second.

I’ve heard it for far too long: the “why did you do it in that!” or “no one ever does a startup in Java, it’s a dead language”. So, while teaching some basic data science, let’s put this sordid tale to bed once and for all.

Get Thy Data!

I need job numbers, and as most NI job postings don’t end up on JobServe there’s no point in me looking there. NIJobs, on the other hand, does carry them.

 

 

Now I know a search brings up an HTML page of results. So if I feed NIJobs a programming language I’ll get some jobs in return. I’m not interested in who’s looking; I am interested in one number, the total number of jobs.

If I want to look for Java jobs directly with employers in the IT category then I’d use the following URL:

http://www.nijobs.com/ShowResults.aspx?Keywords=java&Location=&Category=3&Recruiter=Company

You can see the key/value pairs the NIJobs site is looking for in the query string. The one I’m interested in is Keywords.

Keywords=java
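
To see those pairs without squinting at the URL, the query string can be split on & from the shell. This is only a small sketch; query_pairs is a helper name I’ve made up for illustration, not something the NIJobs site or the script needs.

```shell
#!/bin/bash
# query_pairs is a hypothetical helper: print each key=value pair of a URL's
# query string on its own line.
query_pairs() {
  local url="$1"
  local query="${url#*[?]}"   # drop everything up to and including the first "?"
  printf '%s\n' "$query" | tr '&' '\n'
}

query_pairs 'http://www.nijobs.com/ShowResults.aspx?Keywords=java&Location=&Category=3&Recruiter=Company'
```

Run against the URL above it lists Keywords=java, Location=, Category=3 and Recruiter=Company, one per line.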

Thou Shalt Not Copy/Paste HTML

I need to automate this in some way. I don’t want to repeatedly type in the URL and copy/paste the data I’m looking for.

Now that I’ve established the URL I can manipulate it a little. It’s not just Java I’m looking for; I want to compare a range of skills.

  • Java
  • PHP
  • Perl
  • Python
  • Ruby
  • iOS
  • C++
  • Scala

Unix will help me here. Using curl and a for loop I can craft something quick and dirty to pull the information I need.

for i in java php perl python ruby ios c++ scala; do echo -n "$i" >> jobsoutput.txt ; curl -s "http://www.nijobs.com/ShowResults.aspx?Keywords=$i&Location=&Category=3&Recruiter=Company" | grep "Total Jobs Found:" >> jobsoutput.txt ; done

In one line I’m doing the following:

  • Creating a for loop, each time it loops the value of $i will be that of one of the programming languages I listed.
  • I echo the name of the language but not a newline.
  • Then I use curl to retrieve the HTML page and pipe it through grep. The only thing I want is the line containing “Total Jobs Found:”, which is appended to the file. Notice how I’m changing the Keywords value to $i, so each time the loop runs the language name will be inserted.
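
The same loop can be spread over a few lines. The sketch below makes two assumptions of my own. First, every language should end with a newline even when grep finds nothing, so a language with no results can’t run into the next one. Second, the keyword should be URL-encoded, since a literal + in a query string is normally decoded as a space and c++ may not be searched as typed. I haven’t checked how NIJobs treats either case, and fetch_results and collect are made-up names.

```shell
#!/bin/bash
# fetch_results and collect are hypothetical helper names.

# Download the results page for one keyword. curl's -G and --data-urlencode
# build the query string and percent-encode the keyword (c++ becomes c%2B%2B).
fetch_results() {
  curl -s -G "http://www.nijobs.com/ShowResults.aspx" \
    --data-urlencode "Keywords=$1" \
    -d "Location=" -d "Category=3" -d "Recruiter=Company"
}

# Print one line per language: the name followed by whatever grep matched.
# The newline is always written, even when grep finds nothing.
collect() {
  local lang line
  for lang in "$@"; do
    line=$(fetch_results "$lang" | grep "Total Jobs Found:")
    printf '%s%s\n' "$lang" "$line"
  done
}
```

Calling collect java php perl python ruby ios c++ scala > jobsoutput.txt would produce the same kind of raw file as the one-liner. It needs network access, so the block only defines the functions.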

Cleaning the File Up Further

Currently what I have is:

java <label style="margin-right:30px;">Total Jobs Found: 34</label>
php <label style="margin-right:30px;">Total Jobs Found: 7</label>
perl <label style="margin-right:30px;">Total Jobs Found: 10</label>
python <label style="margin-right:30px;">Total Jobs Found: 14</label>
ruby <label style="margin-right:30px;">Total Jobs Found: 5</label>
ios <label style="margin-right:30px;">Total Jobs Found: 4</label>
c++ <label style="margin-right:30px;">Total Jobs Found: 9</label>
scala

There are tabs or spaces in there, and then <label> tags, which need to come out. What I want is a comma-delimited file with the language and the number of jobs, nothing else.

I’m not an awk expert by any stretch of the imagination but I know enough to get me by, and that counts for a lot. We’re not looking for scripting perfection, just a way to get the results I want.

awk '{sub(/[ \t]+<label style="margin-right:30px;">Total Jobs Found: /,",")};1' jobsoutput.txt > jobstemp.txt
awk '{sub(/<\/label>/,"")};1' jobstemp.txt > jobstemp2.txt
awk '{sub(/\n/,",")};1' jobstemp2.txt > nijobsoutput.txt

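The first two awk commands remove the tags and clean up the white space. The third never matches anything, because awk reads a line at a time and strips the newline before sub() sees it, so it just copies the file. Being a complete noob on awk (I’m still an old Perl hack at heart), I’m outputting to a new file each time, then I’ll clean up afterwards.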
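
For comparison, the same clean-up can be done in one awk pass with no temporary files. This is a sketch rather than a replacement for the commands above; to_csv is a made-up name, and it assumes the raw lines look like the sample shown earlier.

```shell
#!/bin/bash
# to_csv is a hypothetical helper: read the raw grep output on stdin and
# print "language,count" lines. Lines without a <label> pass through unchanged.
to_csv() {
  awk '{
    sub(/[ \t]*<label[^>]*>Total Jobs Found: /, ",")  # opening tag and text become a comma
    sub(/<\/label>.*/, "")                            # drop the closing tag and anything after it
    print
  }'
}

printf '%s\n' 'java <label style="margin-right:30px;">Total Jobs Found: 34</label>' 'scala' | to_csv
```

With the two sample lines it prints java,34 and then scala on its own.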

So, now that I’ve figured out what I’m doing Unix-command-wise, I can wrap all this up in a shell script and run it every time I want to update the data.

#!/bin/bash

# Iterate through the languages and pull the info from nijobs.com
# The echo -n means the loop won't output a newline.
for i in java php perl python ruby ios c++ scala; do
  echo -n "$i" >> jobsoutput.txt
  curl -s "http://www.nijobs.com/ShowResults.aspx?Keywords=$i&Location=&Category=3&Recruiter=Company" | grep "Total Jobs Found:" >> jobsoutput.txt
done

# Clean up the grep'd html to remove the tags and replace with a comma.
awk '{sub(/[ \t]+<label style="margin-right:30px;">Total Jobs Found: /,",")};1' jobsoutput.txt > jobstemp.txt
awk '{sub(/<\/label>/,"")};1' jobstemp.txt > jobstemp2.txt
awk '{sub(/\n/,",")};1' jobstemp2.txt > nijobsoutput.txt

# Remove the files we don't need anymore.
rm jobsoutput.txt jobstemp.txt jobstemp2.txt

I’m saving this file and calling it ‘nijobsscript.sh’. A quick change to the file’s permissions means I can run it from the command line.

chmod 755 nijobsscript.sh

Now I can test it.

Running the Script

A quick run of the script…

./nijobsscript.sh

…and it whirls into action. Once it’s finished I can look at the output in nijobsoutput.txt and see what I’ve got.

java,34
php,7
perl,10
python,14
ruby,5
ios,4
c++,9
scala

It’s worked nicely. Scala doesn’t have any job listings, so grep finds no “Total Jobs Found:” line and there’s no number after its name. I do want to leave it in the script just in case any jobs do pop up in the future.

Going All Florence Nightingale

Right, something that will put a pie chart together quickly. I’m not proud, Excel will do me fine. Yes I could use R, D3, Google Charts or Tableau but all is well with the world and Excel will do what I want it to do. Path of least resistance.

Excel opens the file and imports the text easily. Selecting the whole data set and clicking on charts, then pie chart, gives me an instant visualisation. No messing about.

Back To That Hypothesis…

We started with a statement of “fact” fused into the psyche of developers in the land.

Java is a dead language in Northern Ireland. 

Well, I think the data from NIJobs suggests otherwise. Java accounts for 41% of the listings returned across the eight keywords I searched for (34 of 83), all posted directly by employers in the IT category. The interesting part is the bottom ones, Ruby and iOS with 6% and 5% respectively.
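
Those percentages can be checked from the shell as well as from Excel. A small sketch, assuming the CSV has the shape shown earlier and treating a missing count as 0; shares is a made-up name.

```shell
#!/bin/bash
# shares is a hypothetical helper: read "language,count" lines on stdin and
# print "language,count,percent", where percent is the share of the total.
shares() {
  awk -F, '
    { name[NR] = $1; count[NR] = $2 + 0; total += count[NR] }  # missing count becomes 0
    END {
      for (i = 1; i <= NR; i++)
        printf "%s,%d,%.0f%%\n", name[i], count[i], (total ? 100 * count[i] / total : 0)
    }'
}

printf 'java,34\nphp,7\nperl,10\npython,14\nruby,5\nios,4\nc++,9\nscala\n' | shares
```

With the numbers from this run the first line comes out as java,34,41% and the last as scala,0,0%. Bear in mind a listing that mentions two languages is counted once under each keyword.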

To be fair, this is one jobs website. You can’t measure word of mouth, and I’ve not looked at the likes of Twitter or any other social media outlet. As a hypothesis it needs more refinement and investigation.

This, though, is only the start of the conversation. We know there’s a healthy freelance market for Ruby and iOS, but it seems that the more established enterprise companies are on the lookout for more “traditional” languages.

One thing I do know, we just covered some simple data science. So even if you don’t agree you have the tools to do some yourself.

 

 
