Now that all the prerequisites are in place, we can turn our thoughts to the sentiment itself. Basic sentiment analysis is pretty easy to do, and I’ve put the bare bones of what you’ll need in a Dropbox folder so you can have a look at it. Improving on it won’t be that difficult.
Basic Sentiment Analysis
The principle is to take the text of the tweet and run each word through a scoring routine. This routine holds positive and negative words in two separate arrays. It runs through each word of the tweet: if a positive word is matched then the positive score is increased, and likewise if a negative word is found then the negative score is increased. The final sentiment score sent back is the positive score minus the negative score.
For most people’s needs that is more than enough; it gives us the basics. There’s plenty of scope to build on it, but for now let’s work with this.
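To make the principle concrete, here’s a minimal sketch on a toy example. The word lists and the sentence are made up for illustration (they are not the real word lists used later), and the sentence is already split into lower-case words.

```r
# Toy word lists, invented for illustration only
toy.pos <- c("good", "great", "best")
toy.neg <- c("bad", "awful", "horrible")

# A sentence that has already been cleaned and split into words
words <- c("the", "food", "was", "great", "but", "the",
           "service", "was", "bad", "really", "bad")

# Count every word that appears in each list
pos.score <- sum(words %in% toy.pos)  # "great"
neg.score <- sum(words %in% toy.neg)  # "bad" twice

# Sentiment is the positive count minus the negative count
score <- pos.score - neg.score
score
```

One positive hit and two negative hits give a score of -1. Note that a repeated word counts every time it appears.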
Loading Positive and Negative Words
The words come from word lists which are stored as text files, so the first job is to load these into memory. I have a function to load each set and return an array of words.
    LoadPosWordSet <- function() {
      # Lines starting with ";" in the word file are comments and get skipped
      iu.pos = scan("/Users/Jason/work/data/positive-words.txt",
                    what = 'character', comment.char = ";")
      pos.words = c(iu.pos)
      return(pos.words)
    }
The same happens for the negative words.
    LoadNegWordSet <- function() {
      # Same as above, but for the negative word file
      iu.neg = scan("/Users/Jason/work/data/negative-words.txt",
                    what = 'character', comment.char = ";")
      neg.words = c(iu.neg)
      return(neg.words)
    }
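The two loaders differ only in the file path, so they could be folded into one. Here’s a rough sketch of how that might look. LoadWordSet is a hypothetical name of mine and isn’t in the Dropbox files; quiet = TRUE just stops scan() printing its “Read N items” message.

```r
# Hypothetical helper: one loader for any word list file
LoadWordSet <- function(path) {
  # Fail early with a readable message if the path is wrong
  if (!file.exists(path)) {
    stop("Word list not found: ", path)
  }
  # Text after ";" is treated as a comment, as in the original loaders
  scan(path, what = "character", comment.char = ";", quiet = TRUE)
}

# Try it on a small temporary file standing in for a real word list
demo.file <- tempfile(fileext = ".txt")
writeLines(c("; a comment line", "good", "great"), demo.file)
demo.words <- LoadWordSet(demo.file)
demo.words
```

With the real files you’d call it once per list, passing in the same paths the two functions above use.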
The last piece for this post is the routine that calculates the score.
    library(stringr)  # provides str_split()

    GetScore <- function(sentence, pos.words, neg.words) {
      # strip punctuation, control characters and digits
      sentence = gsub('[[:punct:]]', '', sentence)
      sentence = gsub('[[:cntrl:]]', '', sentence)
      sentence = gsub('\\d+', '', sentence)
      # and convert to lower case:
      sentence = tolower(sentence)
      # split on whitespace into a vector of words
      word.list = str_split(sentence, '\\s+')
      words = unlist(word.list)
      # match() returns a position for a hit and NA for a miss
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }
What’s happening here is quite basic but I’ll explain it all the same. The first job is to clean up the sentence text, so punctuation and control characters are removed. Digits get taken out too, as they are no use to us in this exercise. Then the sentence is converted to lower case.
The sentence is then split up into an array of words, and we look for matches of positive and negative words against this newly created word array. The final job is to count up the positive and negative matches. What’s returned is the positive score minus the negative score.
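If you want to see what each step does, here’s a sketch that walks one made-up sentence through the same clean-up. I’ve assumed base R’s strsplit() in place of str_split() so it runs without loading stringr, and the one-word negative list is only there for the demo.

```r
sentence <- "I've got 2 bad, bad tickets!"

# Punctuation goes first: the apostrophe, comma and "!" disappear
sentence <- gsub("[[:punct:]]", "", sentence)
# Control characters (none in this example) and digits are removed next
sentence <- gsub("[[:cntrl:]]", "", sentence)
sentence <- gsub("\\d+", "", sentence)
# Lower case so "Bad" and "bad" would match the same list entry
sentence <- tolower(sentence)

# Splitting on runs of whitespace absorbs the gap left by the digit
words <- unlist(strsplit(sentence, "\\s+"))
words

# match() gives NA for a miss, so !is.na() flags the hits
demo.neg <- c("bad")
neg.hits <- sum(!is.na(match(words, demo.neg)))
neg.hits
```

The words come out as "ive", "got", "bad", "bad" and "tickets", and the negative count is 2.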
Testing what we have
We can test what’s been written so far in R, assuming the .r file has been read in with the source() command. First we load in the positive words.
    > pos.words <- LoadPosWordSet()
    Read 2006 items
Then do the same with the negative words.
    > neg.words <- LoadNegWordSet()
    Read 4783 items
So we have 2006 positive words and 4783 negative words loaded. Using the GetScore function we can now get a score for some text, and we can make some up to test with. Here’s a positive one.
    > testscore <- GetScore("This concert is the best thing I've been to!", pos.words, neg.words)
    > testscore
    [1] 1
This gave us a score of 1, so it’s positive. Let’s try a negative one.
    > testscore2 <- GetScore("That's bad real bad, horrible", pos.words, neg.words)
    > testscore2
    [1] -3
Now we see how the negative one fared, with a score of -3. “Bad”, “bad” and “horrible” are the culprits there.
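When a score looks odd it helps to see which words actually matched. Here’s one possible way to do that. ExplainScore is a hypothetical helper of mine, not part of the posted code, and the tiny word lists stand in for the real ones so the example runs on its own.

```r
# Hypothetical helper: report the words that hit each list
ExplainScore <- function(sentence, pos.words, neg.words) {
  # Same clean-up as GetScore, squeezed into one pattern
  cleaned <- tolower(gsub("[[:punct:][:cntrl:][:digit:]]", "", sentence))
  words <- unlist(strsplit(cleaned, "\\s+"))
  list(positive = words[words %in% pos.words],
       negative = words[words %in% neg.words])
}

# Tiny stand-in lists, not the real word files
toy.pos <- c("best", "good")
toy.neg <- c("bad", "horrible")

hits <- ExplainScore("That's bad real bad, horrible", toy.pos, toy.neg)
hits$negative
```

For the negative test sentence this lists "bad", "bad" and "horrible", which is where the -3 comes from.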
False Positives
Okay, so here’s the issue that you need to take into account. The scoring above has no way of spotting a false match, because it knows nothing about context. A sentence like “yeah man, he’s sick on the skateboard!” means one thing to one group of people and something else to another. The score here would be negative (triggered by the word “sick”) even though the real sentiment of the sentence is positive. No one really writes, “By jove! He’s really good on that plank with wheels!”.
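One crude workaround, if you know the audience you’re measuring, is to move slang words from one list to the other before scoring. This is only a sketch under that assumption: the slang list and the toy word lists below are invented for illustration.

```r
# Toy lists; "sick" sits in the negative list as in the example above
toy.pos <- c("good", "best")
toy.neg <- c("bad", "sick", "horrible")

# Hypothetical slang words this audience uses in a positive sense
slang.pos <- c("sick")

# Add the slang to the positive list and drop it from the negative list
adj.pos <- union(toy.pos, slang.pos)
adj.neg <- setdiff(toy.neg, slang.pos)

# "yeah man, he's sick on the skateboard!" after clean-up and splitting
words <- c("yeah", "man", "hes", "sick", "on", "the", "skateboard")

before <- sum(words %in% toy.pos) - sum(words %in% toy.neg)
after  <- sum(words %in% adj.pos) - sum(words %in% adj.neg)
c(before = before, after = after)
```

The score flips from -1 to 1. The catch is that “I feel sick” would now score as positive too, so this shifts the problem around and doesn’t solve it. Word lists alone can’t read context.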
Next time…
Time to start processing in volume. We’ll work on pulling many tweets, getting the score for each, storing it all in the database and putting all our code into some logical form of order. In the meantime you can look at the files for the R code, the positive word set and the negative word set.