I’ve been reading up on R and playing with it here and there. With little time on my hands I can only turn to small snippets of self-made challenges to enable me to learn something new. I also thought it might be nice to share the process and see where we go…
There be data in thems hills.
First of all we need data. So, off I tramps to the excellent Lee Munroe creation www.tweetni.com to rip the backside out of it… in a nice way. I don’t want names, I want numbers. So let’s scrape the follower numbers from the tweeters listed.
A simple shell script will pull the data and save it to html files for us. All good data mining starts with good data, and with finding a way to make mundane, repetitive processes easier to do. So……
jasonbell$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ; do curl "http://tweetni.com/people?order=followers&page=$i" > page$i.html ; done
The shell script loops over the page numbers we’ve defined. I know there are 19 pages of data, so I can use curl to pull each html page and save it to a file. Easy. One gotcha: the quotes around the URL matter, because without them the shell treats the & as “run this in the background” and the page number never actually reaches curl.
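By the way, if typing out nineteen numbers by hand feels like a chore, seq will generate them for you; the same loop, just a touch tidier:

jasonbell$ for i in $(seq 1 19) ; do curl "http://tweetni.com/people?order=followers&page=$i" > page$i.html ; done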
That gives us the following files to play with:
jasonbell$ ls -l
total 1888
-rw-r--r-- 1 jasonbell staff 51184 24 Aug 21:04 page1.html
-rw-r--r-- 1 jasonbell staff 51054 24 Aug 21:04 page10.html
-rw-r--r-- 1 jasonbell staff 51227 24 Aug 21:05 page11.html
-rw-r--r-- 1 jasonbell staff 51265 24 Aug 21:05 page12.html
-rw-r--r-- 1 jasonbell staff 50695 24 Aug 21:05 page13.html
-rw-r--r-- 1 jasonbell staff 50403 24 Aug 21:05 page14.html
-rw-r--r-- 1 jasonbell staff 50367 24 Aug 21:05 page15.html
-rw-r--r-- 1 jasonbell staff 49950 24 Aug 21:05 page16.html
-rw-r--r-- 1 jasonbell staff 49987 24 Aug 21:05 page17.html
-rw-r--r-- 1 jasonbell staff 48480 24 Aug 21:05 page18.html
-rw-r--r-- 1 jasonbell staff 11798 24 Aug 21:05 page19.html
-rw-r--r-- 1 jasonbell staff 51008 24 Aug 21:04 page2.html
-rw-r--r-- 1 jasonbell staff 51031 24 Aug 21:04 page3.html
-rw-r--r-- 1 jasonbell staff 50878 24 Aug 21:04 page4.html
-rw-r--r-- 1 jasonbell staff 50493 24 Aug 21:04 page5.html
-rw-r--r-- 1 jasonbell staff 51293 24 Aug 21:04 page6.html
-rw-r--r-- 1 jasonbell staff 50982 24 Aug 21:04 page7.html
-rw-r--r-- 1 jasonbell staff 50991 24 Aug 21:04 page8.html
-rw-r--r-- 1 jasonbell staff 51165 24 Aug 21:04 page9.html
Getting the data we need.
If you open one of the html files you’ll see the table row with the user, twitter name, their bio and, most important to us, the number of followers.
<td class="alignright">41221</td>
This is the only piece of data we’re interested in so it makes sense to make life easier for ourselves and extract the stuff out.
Once again a quick shell script will help us out:
jasonbell$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19; do grep "alignright" page$i.html >> collate.html ; done
The >> in the shell script appends to our file collate.html and doesn’t create a new file each time. Not perfect but fine for our needs…
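As it happens, grep will take a whole list of files in one go, so the loop and the append trick can be skipped entirely; the same result in one line (the -h flag stops grep printing the file name in front of each match):

jasonbell$ grep -h "alignright" page*.html > collate.html

The glob won’t come out in strict numeric order (page1, page10, page11…), but since we only care about the numbers and not their order, that’s fine.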
So we’re left with….
<td class="alignright">188246</td>
<td class="alignright">41221</td>
<td class="alignright">26779</td>
<td class="alignright">23372</td>
<td class="alignright">18931</td>
<td class="alignright">18197</td>
<td class="alignright">17950</td>
<td class="alignright">17069</td>
<td class="alignright">16692</td>
<td class="alignright">15816</td>
<td class="alignright">12373</td>
<td class="alignright">10611</td>
<td class="alignright">9974</td>
<td class="alignright">9550</td>
<td class="alignright">9413</td>
<td class="alignright">9398</td>
Hack out the HTML….
With a text editor of choice we need to strip out the td tags; we don’t need them.
I’m using vi so my search and replace goes like:
:%s/\s*<td class="alignright">//g
:%s/<\/td>//g
It doesn’t matter much but I rename collate.html to collate.txt…. it’s not html anymore.
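If vi isn’t your thing, sed will do the same surgery from the shell and drop the cleaned-up numbers straight into collate.txt; a rough equivalent, assuming the same file names as above:

jasonbell$ sed -e 's/.*<td class="alignright">//' -e 's/<\/td>.*//' collate.html > collate.txt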
Doing the funky stuff with R
R is an open source program for statistical analysis. You can download it for your machine here.
The interface is unforgiving but there is plenty of documentation about. The opening screen is pictured below.
Loading the data in R
We want to read our newly created collate.txt file into R and associate all that data with a name. So we’ll have an object called “tweetni” and load the data in with:
> tweetni <- read.table(file("[path to your data files]/collate.txt"))
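read.table will also take the file path directly, and naming the single column saves squinting at R’s default V1 label later on; a small variation on the same line (the column name followers is just my choice):

> tweetni <- read.table("[path to your data files]/collate.txt", col.names = "followers")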
Fun with numbers in no time at all.
We can get the real quick basics of our data like the mean (average), median and other goodies with one simple command.
> summary(tweetni)
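summary() rolls the minimum, quartiles, median and mean into one little table. If you’d rather have any of those numbers on their own, R has a function for each; a quick sketch, assuming the column kept its default name V1:

> mean(tweetni$V1)     # the average follower count
> median(tweetni$V1)   # the middle value
> quantile(tweetni$V1) # min, quartiles and max in one go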
Standard Deviation
Standard Deviation is a doddle as well…
> sd(tweetni)
Gives us the standard deviation of our follower counts….
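One caveat: newer versions of R insist that sd() gets a numeric vector rather than a whole data frame, so if it grumbles, hand it the column instead (again assuming the default column name V1):

> sd(tweetni$V1)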
Scratching the surface (or perhaps barrel), well yes. But there’s some shell scripting, some basic vi and some basic R. Not bad for free…. more later.