Invest NI’s “new jobs” headlines…. how many in a lifetime?

I received a question for Boris Drakemandrillsquirrelhugger*, “Jase, you do data science, how many new jobs have Invest Northern Ireland announced in total?”.

“Bless My Cotton Socks I’m In The News”

First we need headlines and in one line of Linux we can have the whole lot.

$for i in {1..314}; do curl http://www.investni.com/news/index.html?page=$i > news_$i.html; done

This is exactly the same as how I pulled nijobs.com data in a previous blog post. Each page is 10 headlines and there’s 3138 headlines, so 314 pages will be fine. While that’s pulling all the html down you may as well get a cuppa….

1950s-woman-smiling-holding-platter-of-hors-d-oeuvres-snacks

Messing With The Output

The output is basically html pages. You could fire up Python and BeautifulSoup parsers and anything else that takes your fancy, or just use good old command line data science.

egrep -ohi "\d+ new jobs" *.html | egrep -o "\d+" | awk '{ sum+=$1} END {print sum}'

I’m piping three Linux commands, two egreps, the first to pull out “[a number] new jobs”. The -o flag is to only show the matching string from the regular expression, -i ignores the case, “New jobs” and “new jobs” is different otherwise and -h drops the filename in the output.

58 new jobs
61 new jobs
61 new jobs
84 new jobs
84 new jobs
84 new jobs
30 new jobs
30 new jobs
10 new jobs

The second just to get the figure.

30
30
30
40
82
82
15
300
300
23
540
540
36
125
125

And the exciting part is the awk command at the end where it adds up the stream numbers.

70758

Now that last figure is what we’re after. One caveat to that, any headline with a comma in the figure got ignored…. the first regexp will need tweaking…. you can play with that. So a rough estimate is to say that since June 2003 there have been over 70,000 new jobs announced in INI headlines.

The number you won’t get is how many were filled.

* The names have been changed to protect the innocent, in fact, just made up….. no one asked at all.

One response to “Invest NI’s “new jobs” headlines…. how many in a lifetime?”

  1. Current regex doesn’t correctly parse comma-numbers (eg 1,000).

    (?<!\S)(\d*\.?\d+|\d{1,3}(,\d{3})*(\.\d+)?)(?!\S)

    works better using the first returned capture group.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: