I received a question from Boris Drakemandrillsquirrelhugger*: “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”
“Bless My Cotton Socks I’m In The News”
First we need the headlines, and one line of shell on Linux will fetch the whole lot.
for i in {1..314}; do curl "http://www.investni.com/news/index.html?page=$i" > "news_$i.html"; done
This is exactly how I pulled the nijobs.com data in a previous blog post. Each page holds 10 headlines and there are 3138 headlines, so 314 pages will cover them all. While that’s pulling all the HTML down you may as well get a cuppa…
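The page count is just a ceiling division, which is easy to check in the shell. A minimal sketch using the 3138 headlines and 10 per page quoted above; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Figures from the post: 3138 headlines, 10 headlines per page.
headlines=3138
per_page=10

# Integer ceiling division: round up so the final, partly filled page is included.
pages=$(( (headlines + per_page - 1) / per_page ))
echo "$pages"   # 314
```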
Messing With The Output
The output is just a pile of HTML pages. You could fire up Python and BeautifulSoup, or any other parser that takes your fancy, or just use good old command line data science.
egrep -ohi "[0-9]+ new jobs" *.html | egrep -o "[0-9]+" | awk '{ sum += $1 } END { print sum }'
I’m piping three Linux commands together: two egreps and an awk. The first egrep pulls out “[a number] new jobs”. The -o flag shows only the part of the line that matched the regular expression, -i ignores case (otherwise “New jobs” and “new jobs” would be treated as different) and -h drops the filename from the output.
58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs
The second egrep keeps just the figure.
30
30
30
40
82
82
15
300
300
23
540
540
36
125
125
And the exciting part is the awk command at the end, which adds up the stream of numbers.
70758
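To watch the three stages work without waiting for the download, the same pipeline can be run on a couple of throwaway files. This is only a sketch: the headlines and file names below are invented, and it uses [0-9] rather than \d because POSIX extended regular expressions don’t define \d.

```shell
#!/usr/bin/env bash
# Made-up sample pages standing in for the downloaded news_*.html files.
tmp=$(mktemp -d)
printf '%s\n' '<h2>Acme to create 58 new jobs</h2>' '<h2>Office opens in Derry</h2>' > "$tmp/news_1.html"
printf '%s\n' '<h2>61 New Jobs for Belfast</h2>' > "$tmp/news_2.html"

# Stage 1: keep only the "<number> new jobs" matches (-o), any case (-i), no filenames (-h).
# Stage 2: keep only the number.
# Stage 3: add the numbers up.
total=$(grep -Eohi '[0-9]+ new jobs' "$tmp"/*.html \
  | grep -Eo '[0-9]+' \
  | awk '{ sum += $1 } END { print sum }')
echo "$total"   # 119

# Tidy up the sample files.
rm -r "$tmp"
```

The headline with no jobs figure simply drops out at the first stage, which is why no HTML parsing is needed for a rough count.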
That 70758 is the figure we’re after. One caveat: any headline with a comma in the figure is undercounted. The first regexp only sees the digits after the last comma, so “1,000 new jobs” is counted as 0. It will need tweaking… you can play with that. So a rough estimate is that since June 2003 there have been over 70,000 new jobs announced in INI headlines.
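One possible tweak for the comma problem, offered as a sketch rather than something tested against the real pages: let the first pattern accept commas inside the number, then strip them with tr before adding up. The sample headlines are invented.

```shell
#!/usr/bin/env bash
# Invented headlines, one with a thousands separator.
sample='Firm to create 1,000 new jobs
Call centre brings 58 new jobs'

# [0-9][0-9,]* accepts digits with embedded commas;
# tr -d ',' removes the commas so awk sees plain integers.
total=$(printf '%s\n' "$sample" \
  | grep -Eoi '[0-9][0-9,]* new jobs' \
  | grep -Eo '[0-9][0-9,]*' \
  | tr -d ',' \
  | awk '{ sum += $1 } END { print sum }')
echo "$total"   # 1058
```

The pattern is deliberately loose (it would also accept oddities like “1,2,3”), which seems a fair trade for a rough headline count.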
The number you won’t get is how many were filled.
* The names have been changed to protect the innocent. In fact, they were just made up… no one asked at all.
One response to “Invest NI’s “new jobs” headlines…. how many in a lifetime?”
Current regex doesn’t correctly parse comma-numbers (eg 1,000).
(?<!\S)(\d*\.?\d+|\d{1,3}(,\d{3})*(\.\d+)?)(?!\S)
works better using the first returned capture group.