Designing an analytic engine has always been difficult, and with Big Data and machine learning the complexity has reached an unprecedented level. This article addresses one important point: how to use traditional software engineering systems to build an analytic engine. The major advantage is that there is no new skill to learn, which lets you focus on the product rather than on the complex statistical concepts underneath.
In the following sections, I describe a sentiment mining system that can be implemented on an RDBMS (SQL Server, Oracle or MySQL) with a compatible front end in .NET, Java or a language of your choice.
Problem: Predict the sentiment of a sentence or phrase using traditional software methods.
Approach: To build a sentiment mining system, we will use the Naive Bayes principle of conditional probability, in which the probability of an outcome depends on an earlier event or variable. Suppose we run a two-bag experiment where both bags contain red and blue balls. A ball is drawn from one of the bags and turns out to be of a particular colour; what are the chances that it was drawn from the first bag rather than the second? Scenarios like this are common in statistics, and the same approach can be followed for a text mining system.
Naive Bayes in the context of sentiment mining: A training set is used to train a classifier, which is basically a couple of tables. Let's say we have the following sentences available along with their sentiments (positive or negative), which will act as training data.
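The two-bag question can be worked through with Bayes' rule. The sketch below is only an illustration: the bag contents (3 red and 1 blue in the first bag, 1 red and 3 blue in the second) and the name posterior_bag are assumptions made for this example, not values from the article.

```python
from fractions import Fraction


def posterior_bag(prior, likelihood):
    """Return P(bag | observed colour) for every bag using Bayes' rule."""
    # Joint probability of choosing a bag and then drawing the observed colour.
    joint = {bag: prior[bag] * likelihood[bag] for bag in prior}
    # Total probability of seeing that colour from any bag.
    evidence = sum(joint.values())
    return {bag: joint[bag] / evidence for bag in joint}


# Hypothetical setup: each bag is equally likely to be picked.
prior = {"bag1": Fraction(1, 2), "bag2": Fraction(1, 2)}
# Hypothetical contents: bag1 holds 3 red / 1 blue, bag2 holds 1 red / 3 blue.
p_red = {"bag1": Fraction(3, 4), "bag2": Fraction(1, 4)}

posterior = posterior_bag(prior, p_red)
print(posterior)
```

With these assumed contents, a red ball points to the first bag three times out of four. In sentiment mining the bags become the sentiment types and the drawn ball becomes a word.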
Sentence 1 – Agent Ram has solved the technical glitch nicely. Thanks to him. [POSITIVE]
Sentence 2 – Agent Ravana was not at all willing to listen to me. He could do better. [NEGATIVE]
Sentence 3 – Agent Bharat was awesome. [POSITIVE]
Sentence 4 – Agent Meghnad did not do any good to me. [NEGATIVE]
Sentence X – Agent Laxman solved my issue. God bless him [UNKNOWN]
So how will Naive Bayes help in determining the sentiment of Sentence X?
In a nutshell, we calculate the probability of the unknown sentence for both cases, positive and negative (p-pos and p-neg). Whichever has the higher value is the sentiment type of the unknown sentence.
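Written compactly, and treating the words as independent given the sentiment (the "naive" assumption), this decision rule can be sketched as follows. The symbols s, w_i and n are notation introduced here for illustration only.

```latex
\hat{s} \;=\; \arg\max_{s \,\in\, \{\text{pos},\ \text{neg}\}} \; P(s) \prod_{i=1}^{n} P(w_i \mid s)
```

Here P(s) is the share of training sentences with sentiment s, and P(w_i | s) is the probability of the i-th word of the unknown sentence for that sentiment.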
Going further, we know that “Agent” occurs in both positive sentences (Sentence 1 and Sentence 3), so its probability in positive sentences is 1 (2 out of 2), while the word “solved” occurs in only one of them and has a probability of ½. Remember, we are determining an individual word’s probability for a particular sentiment.
Similarly, “willing” has a probability of ½ in negative sentences.
When Sentence X (an unknown sample) is encountered:
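p-pos is found by determining the probability of each word in positive sentences and multiplying them together: 1 * 0.01 * 0.5 * 0.01 * 0.01 * 0.01 * 0.01 * 0.5, for Agent, Laxman, solved, my, issue, God, bless and him, in that order [0.01 is used for a term that never occurs with the sentiment, so that the product does not become zero]. This is then multiplied by the probability of occurrence of a positive sentiment, or prior probability, which is 0.5 (2 out of 4), giving 1.25 * 10^-11.
Similarly, p-neg for Sentence X is determined as:
1 * 0.01 * 0.01 * 0.01 * 0.01 * 0.01 * 0.01 * 0.01 * 0.5, where only “Agent” occurs in the negative sentences and the final 0.5 is the prior probability of a negative sentiment (2 out of 4). This gives 5 * 10^-15.
Thus p-pos is higher than p-neg, and we can conclude that Sentence X has a positive sentiment.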
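The same arithmetic can be checked in a few lines of Python. This is a minimal sketch under stated assumptions: words are lower-cased and stripped of punctuation, a word's probability is the fraction of sentences of that sentiment containing it, and unseen words get 0.01. The names tokenize, score and UNSEEN are hypothetical helpers, not part of the article's design.

```python
import re

# The four training sentences from the article with their sentiments.
TRAINING = [
    ("Agent Ram has solved the technical glitch nicely. Thanks to him.", "positive"),
    ("Agent Ravana was not at all willing to listen to me. He could do better.", "negative"),
    ("Agent Bharat was awesome.", "positive"),
    ("Agent Meghnad did not do any good to me", "negative"),
]
UNSEEN = 0.01  # stand-in probability for a word never seen with a sentiment


def tokenize(text):
    """Lower-case the text and keep alphabetic words only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def score(sentence, label):
    """Prior of the sentiment times the product of per-word probabilities."""
    docs = [set(tokenize(s)) for s, sentiment in TRAINING if sentiment == label]
    p = len(docs) / len(TRAINING)  # prior probability, e.g. 2 out of 4
    for word in tokenize(sentence):
        hits = sum(word in doc for doc in docs)  # sentences containing the word
        p *= hits / len(docs) if hits else UNSEEN
    return p


sentence_x = "Agent Laxman solved my issue. God bless him"
p_pos = score(sentence_x, "positive")
p_neg = score(sentence_x, "negative")
print(p_pos, p_neg, "positive" if p_pos > p_neg else "negative")
```

Products of many small probabilities shrink quickly; for longer sentences a common variation is to add logarithms of the probabilities instead of multiplying them.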
Once the basics of Naive Bayes are clear, we can get to actually implementing the system. You may be surprised that only two tables are required:
t_sentence_sentiment, with columns sentence and sentiment_type, along with a sentence identifier.
t_term_sentiment, with columns term, sentiment_type and count.
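As a rough sketch, the two tables could be created as shown here. SQLite (through Python's built-in sqlite3 module) stands in for the RDBMS so the example runs anywhere; the column name sentence_id, the column types and the primary keys are assumptions, since the article only names the columns.

```python
import sqlite3

# An in-memory SQLite database stands in for SQL Server, Oracle or MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table 1: the labelled sentences (the available data).
    CREATE TABLE t_sentence_sentiment (
        sentence_id    TEXT PRIMARY KEY,   -- assumed name for the identifier
        sentence       TEXT NOT NULL,
        sentiment_type TEXT NOT NULL
    );
    -- Table 2: how often each term was seen with each sentiment type.
    CREATE TABLE t_term_sentiment (
        term           TEXT NOT NULL,
        sentiment_type TEXT NOT NULL,
        "count"        INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (term, sentiment_type)
    );
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The column count is quoted because COUNT is also a SQL function name; on another database you may prefer a different column name.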
TRAINING THE CLASSIFIER
The first table holds the available data, and the second is the result of training the algorithm. Let's examine the following sentence.
Sentence1 -> Agent Ram has solved the technical glitch nicely. Thanks to him
For table 1, this information is saved as “Sentence1”, “Agent Ram has solved the technical glitch nicely. Thanks to him”, “Positive”
With the help of the above information, we update the second (training) table. This table stores the term, the sentiment type and the number of times the term has appeared in that sentiment type. After Sentence 1 it contains rows such as (Agent, 1, Positive).
When Sentence 3 is then processed, the row for the word “Agent” is updated to reflect a count of 2: (Agent, 2, Positive).
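The training step might look like the following sketch, again using SQLite as a stand-in database. The function name train, the tokenizer, the sentence_id column and the decision to count a term once per sentence are all assumptions made for illustration.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_sentence_sentiment (
        sentence_id TEXT PRIMARY KEY, sentence TEXT, sentiment_type TEXT);
    CREATE TABLE t_term_sentiment (
        term TEXT, sentiment_type TEXT, "count" INTEGER,
        PRIMARY KEY (term, sentiment_type));
""")


def train(conn, sentence_id, sentence, sentiment_type):
    """Store one labelled sentence and update the term counts for its label."""
    conn.execute("INSERT INTO t_sentence_sentiment VALUES (?, ?, ?)",
                 (sentence_id, sentence, sentiment_type))
    # Assumption: lower-cased alphabetic words, counted once per sentence.
    for term in set(re.findall(r"[a-z]+", sentence.lower())):
        cur = conn.execute(
            'UPDATE t_term_sentiment SET "count" = "count" + 1 '
            "WHERE term = ? AND sentiment_type = ?", (term, sentiment_type))
        if cur.rowcount == 0:  # first time this term is seen with this label
            conn.execute("INSERT INTO t_term_sentiment VALUES (?, ?, 1)",
                         (term, sentiment_type))
    conn.commit()


train(conn, "Sentence1",
      "Agent Ram has solved the technical glitch nicely. Thanks to him", "Positive")
train(conn, "Sentence3", "Agent Bharat was awesome.", "Positive")

agent_count = conn.execute(
    'SELECT "count" FROM t_term_sentiment WHERE term = ? AND sentiment_type = ?',
    ("agent", "Positive")).fetchone()[0]
print(agent_count)
```

After both positive sentences are loaded, the row for "agent" holds a count of 2. The update-then-insert pattern is used because it carries over to most relational databases without vendor-specific upsert syntax.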
DETERMINING AN UNKNOWN SENTIMENT
After the training part is completed, we can attack our problem. With the above information stored, we need to find two things: the probability of each individual term for a sentiment type, and the probability of the sentiment type itself.
1. Find the probability of each term occurring in positive samples. The word “Agent” has probability 1 (it occurs in 2 of the 2 positive samples) and “solved” has 0.5 (it occurs in one of the two). The probabilities of all terms are found this way. If a term does not appear, assume a low probability (0.01 in the example above).
2. Similarly, find the probability of each term occurring in negative sentences.
3. Find the overall probability of positive sentiments. This comes from the first table.
4. Find the overall probability of negative sentiments.
5. Calculate the positive probability of the whole sentence by multiplying the term probabilities from step 1 together with the result of step 3.
6. Calculate the negative probability of the whole sentence by multiplying the term probabilities from step 2 together with the result of step 4.
7. Whichever of 5 and 6 is higher is the sentiment of the unknown sentence.
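Put together, the procedure above can be sketched against the two tables. As before, SQLite stands in for the RDBMS, and the names build, classify, tokenize and UNSEEN, the sentence_id column and the per-sentence counting are assumptions for illustration rather than the article's exact implementation.

```python
import re
import sqlite3

UNSEEN = 0.01  # assumed fallback for a term never seen with a sentiment

TRAINING = [
    ("Agent Ram has solved the technical glitch nicely. Thanks to him.", "Positive"),
    ("Agent Ravana was not at all willing to listen to me. He could do better.", "Negative"),
    ("Agent Bharat was awesome.", "Positive"),
    ("Agent Meghnad did not do any good to me", "Negative"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build(conn):
    """Create both tables and load the training sentences."""
    conn.executescript("""
        CREATE TABLE t_sentence_sentiment (
            sentence_id TEXT PRIMARY KEY, sentence TEXT, sentiment_type TEXT);
        CREATE TABLE t_term_sentiment (
            term TEXT, sentiment_type TEXT, "count" INTEGER,
            PRIMARY KEY (term, sentiment_type));
    """)
    for i, (sentence, label) in enumerate(TRAINING, start=1):
        conn.execute("INSERT INTO t_sentence_sentiment VALUES (?, ?, ?)",
                     (f"Sentence{i}", sentence, label))
        for term in set(tokenize(sentence)):  # count once per sentence
            cur = conn.execute(
                'UPDATE t_term_sentiment SET "count" = "count" + 1 '
                "WHERE term = ? AND sentiment_type = ?", (term, label))
            if cur.rowcount == 0:
                conn.execute("INSERT INTO t_term_sentiment VALUES (?, ?, 1)",
                             (term, label))
    conn.commit()


def classify(conn, sentence):
    """Return the most probable sentiment type and the score of each type."""
    total = conn.execute("SELECT COUNT(*) FROM t_sentence_sentiment").fetchone()[0]
    per_type = conn.execute(
        "SELECT sentiment_type, COUNT(*) FROM t_sentence_sentiment "
        "GROUP BY sentiment_type").fetchall()
    scores = {}
    for label, n in per_type:
        p = n / total  # overall probability of this sentiment type
        for term in tokenize(sentence):
            row = conn.execute(
                'SELECT "count" FROM t_term_sentiment '
                "WHERE term = ? AND sentiment_type = ?", (term, label)).fetchone()
            p *= row[0] / n if row else UNSEEN  # term probability or fallback
        scores[label] = p
    return max(scores, key=scores.get), scores


conn = sqlite3.connect(":memory:")
build(conn)
label, scores = classify(conn, "Agent Laxman solved my issue. God bless him")
print(label, scores)
```

In a production database the per-term lookups could be folded into a single join, but the loop keeps each step visible.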
We have seen that it is possible to build a classifier with traditional software methods. Data is important here. The design can also be extended beyond a traditional RDBMS to big data platforms such as Hive, so it can scale to big data scenarios.