It is very easy to create eye-catching graphics with R. An example of a barplot is shown below.
Well, designing an analytic engine has always been difficult, and with Big Data and machine learning the complexity has grown further. This article addresses one important point: how to use traditional software engineering systems to build an analytic engine. This has one major advantage: no new skill needs to be learned, which lets the team focus on the product rather than on complex underlying statistical concepts.
In the following sections, I am going to describe a sentiment mining system which can be implemented on an RDBMS (SQL Server, Oracle or MySQL), with a compatible front-end system in .NET, Java or a language of your choice.
Problem: Predicting the sentiment of a sentence or phrase using traditional software methods.
Approach: For building a sentiment mining system, we will use the Naive Bayes principle of conditional probability. Here the outcome of an event is determined by an earlier event or variable. Suppose we run a two-bag experiment, where both bags contain red and blue balls. A ball is drawn and turns out to be a particular colour; what are the chances that it came from the first bag rather than the second? Such scenarios are common in statistics, and the same approach can be followed for a text mining system.
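To make the two-bag experiment concrete, here is a small sketch of Bayes' rule with hypothetical ball counts (the counts are my own illustration, not from the article):

```python
# Hypothetical bags: bag 1 holds 3 red / 1 blue, bag 2 holds 1 red / 3 blue.
# A bag is picked at random (prior 0.5 each) and a red ball is drawn.
p_red_given_bag1, p_red_given_bag2 = 3 / 4, 1 / 4
prior = 0.5

# Total probability of drawing red, over both bags.
p_red = p_red_given_bag1 * prior + p_red_given_bag2 * prior

# Bayes' rule: P(bag1 | red) = P(red | bag1) * P(bag1) / P(red)
p_bag1_given_red = p_red_given_bag1 * prior / p_red
print(p_bag1_given_red)  # 0.75
```

The same reversal, from "probability of the word given the sentiment" to "probability of the sentiment given the words", is what the sentiment classifier below performs.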
Naive Bayes in the context of sentiment mining: Here a training set is used to train a classifier, which is basically a couple of tables. Let's say we have the following sentences available to us along with their sentiments (positive and negative), which act as the training data.
Sentence 1 – Agent Ram has solved the technical glitch nicely. Thanks to him. [POSITIVE]
Sentence 2 – Agent Ravana was not at all willing to listen to me. He could do better. [NEGATIVE]
Sentence 3 – Agent Bharat was awesome. [POSITIVE]
Sentence 4 – Agent Meghnad did not do any good to me. [NEGATIVE]
Sentence X – Agent Laxman solved my issue. God bless him [UNKNOWN]
So, how will Naive Bayes help in determining the sentiment of Sentence X?
In a nutshell, we will calculate the probability of the unknown sentence for both cases, positive and negative (p-pos and p-neg). Whichever has the higher value determines the sentiment of the unknown sentence.
Going further, we know from the training data that “Agent” occurs in both positive sentences (Sentences 1 and 3), so its probability in positive sentences is 1 (2 out of 2), while the word “solved” has a probability of ½ (it appears in 1 of the 2 positive sentences). Remember, we are determining an individual word’s probability for a particular sentiment.
Similarly, “willing” has a probability of ½ in negative sentences.
When Sentence X (an unknown sample) is encountered:
p-pos is determined for every word of Sentence X and the values are multiplied together: 1 for “Agent”, ½ for “solved”, and 0.01 for words never seen in a positive sentence, such as “Laxman” (0.01 is used instead of zero so that a single absent term does not set the whole product to zero). The product is then multiplied by the prior probability of a positive sentiment, which is 0.5 (2 out of 4 training sentences).
Similarly, p-neg for Sentence X is determined: 1 for “Agent” (it occurs in both negative sentences), 0.01 for every other word, multiplied by 0.5, the prior probability of a negative sentiment (2 out of 4).
Since p-pos is higher than p-neg, we conclude that Sentence X carries a positive sentiment.
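The whole calculation can be sketched in a few lines of Python (a minimal stand-in for the RDBMS implementation described below; the 0.01 floor for unseen words is taken straight from the worked example):

```python
# Training sentences from the article, with their labelled sentiments.
training = [
    ("Agent Ram has solved the technical glitch nicely. Thanks to him", "POSITIVE"),
    ("Agent Ravana was not at all willing to listen to me. He could do better.", "NEGATIVE"),
    ("Agent Bharat was awesome.", "POSITIVE"),
    ("Agent Meghnad did not do any good to me", "NEGATIVE"),
]

def tokenize(sentence):
    # Lower-case and strip punctuation so "him." and "him" count as one term.
    return [w.strip(".,").lower() for w in sentence.split()]

def sentiment(sentence, floor=0.01):
    scores = {}
    for label in ("POSITIVE", "NEGATIVE"):
        docs = [tokenize(s) for s, lab in training if lab == label]
        p = len(docs) / len(training)  # prior, e.g. 2 out of 4 = 0.5
        for word in tokenize(sentence):
            seen = sum(1 for d in docs if word in d)
            # Probability of the word given the sentiment; 0.01 when never
            # seen, so one absent term does not zero out the whole product.
            p *= (seen / len(docs)) if seen else floor
        scores[label] = p
    return max(scores, key=scores.get), scores

label, scores = sentiment("Agent Laxman solved my issue. God bless him")
print(label)  # POSITIVE, matching the worked example
```

Here p-pos works out to roughly 1.25e-11 against a p-neg of 5e-15, so the positive label wins comfortably.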
Once the basics of Naive Bayes are clear, we can actually implement the system. You may be surprised that only two tables are required:
t_sentence_sentiment, with a sentence identifier plus sentence and sentiment_type columns.
t_term_sentiment, with term, sentiment_type and count columns.
TRAINING THE CLASSIFIER
The first table holds the available data; filling the second table is what trains the algorithm. Let's examine the following sentence.
Sentence1 -> Agent Ram has solved the technical glitch nicely. Thanks to him
For table 1, this information is saved as “Sentence1”, “Agent Ram has solved the technical glitch nicely. Thanks to him”, “Positive”
With the help of the above information, we update the second (training) table. This table stores each term, the number of times it has appeared in a sentiment type, and the type.
If we then process Sentence 3, the row for the term Agent is updated to reflect a count of 2: (Agent, 2, Positive)
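The update logic for the training table can be sketched as follows (a plain-Python stand-in for the SQL insert/update; the key structure mirrors the article's t_term_sentiment schema):

```python
# In-memory stand-in for t_term_sentiment: (term, sentiment_type) -> count.
term_counts = {}

def train(sentence, sentiment_type):
    # One row per (term, sentiment_type). Bumping an existing entry mirrors
    # an UPDATE ... SET count = count + 1; a new entry mirrors an INSERT.
    for word in sentence.replace(".", "").lower().split():
        key = (word, sentiment_type)
        term_counts[key] = term_counts.get(key, 0) + 1

train("Agent Ram has solved the technical glitch nicely. Thanks to him", "POSITIVE")
train("Agent Bharat was awesome", "POSITIVE")
print(term_counts[("agent", "POSITIVE")])  # 2, as in the (Agent, 2, Positive) row
```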
DETERMINING AN UNKNOWN SENTIMENT
After the training part is completed, we can attack our problem. With the above information stored, we need to find two things: the probability of each individual term under a sentiment, and the prior probability of the sentiment type.
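Both quantities can be read off the two tables. The sketch below uses hypothetical in-memory stand-ins for t_sentence_sentiment and t_term_sentiment, populated with the example training data:

```python
# Minimal stand-ins for the two tables described above.
t_sentence_sentiment = [
    ("S1", "POSITIVE"), ("S2", "NEGATIVE"), ("S3", "POSITIVE"), ("S4", "NEGATIVE"),
]
t_term_sentiment = {("agent", "POSITIVE"): 2, ("solved", "POSITIVE"): 1}

def prior(sentiment_type):
    # Prior probability of the sentiment type itself, e.g. 2 out of 4 = 0.5.
    rows = [s for s in t_sentence_sentiment if s[1] == sentiment_type]
    return len(rows) / len(t_sentence_sentiment)

def term_probability(term, sentiment_type, floor=0.01):
    # Probability of a term within sentences of that type; 0.01 when unseen.
    n = len([s for s in t_sentence_sentiment if s[1] == sentiment_type])
    count = t_term_sentiment.get((term, sentiment_type), 0)
    return count / n if count else floor

print(prior("POSITIVE"))                       # 0.5
print(term_probability("solved", "POSITIVE"))  # 0.5
```

On an RDBMS each function would be a simple aggregate query (a COUNT grouped by sentiment_type, and a lookup on t_term_sentiment).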
We have seen that it is possible to build a classifier with traditional software methods; the data is what matters here. The design can also easily be extended to NoSQL scenarios such as Hive, so it scales to a big data scenario.
I have recently done a thorough analysis of publicly available diagnostic data on breast cancer, using a number of statistical and machine learning techniques to draw inferences from the data. Following is an excerpt of the medical diagnostic analysis.
Analysis of the “Breast Cancer Wisconsin (Diagnostic) Data Set”, available from the “UCI Machine Learning Repository”. The data used is “breast-cancer-wisconsin.data” (1) and “breast-cancer-wisconsin.names” (2).
The dataset has 11 variables with 699 observations; the first variable is an identifier and has been excluded from the analysis. Thus there are 9 predictors and a response variable (class), which denotes malignant or benign cases.
Class – (2 for benign, 4 for malignant)
There are 16 observations with incomplete data. In further analysis these cases are either imputed (substituted by the most likely values) or ignored. In total there are 241 cases of malignancy, whereas benign cases number 458.
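Imputation by the most likely value can be sketched as simple mode imputation per column (an illustration of the idea only; the actual analysis may have used a different imputation method):

```python
def impute_mode(column):
    # Replace missing entries (None) with the most frequent observed value.
    observed = [v for v in column if v is not None]
    mode = max(set(observed), key=observed.count)
    return [mode if v is None else v for v in column]

# A toy column with one missing entry; the mode of the observed values is 1.
print(impute_mode([1, 1, None, 3, 1]))  # [1, 1, 1, 3, 1]
```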
Three different classification algorithms were tried, both with all variables and with variable selection. Generally, errors increased when variables were removed, with one exception: with KNN, a model with fewer variables performed better than all the other models. As this is a small dataset, that could be a case of overfitting. A second option is the full Random Forest classifier, which does better than the KNN classifier with all variables. Decision trees show a high error rate, so they can be ignored.
Another analysis done with the help of R is available here: Cost Factor Analysis.
While doing linear regression using R, the following error is obtained:
Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, …) :
plot.new has not been called yet
As a result, the line cannot be seen on the scatterplot, even though the summary of the model confirms it has been built correctly and the coefficient values can be obtained. This is not really an error; it boils down to the fact that we need to use R a little differently. The following lines make this clear.
Say, we want to forecast number of calls based on few months of data. The data is available as following:
Step 1 . Save the data in CSV format. (data.csv)
Step 2. Read the data in the R console – c = read.csv("data.csv", header=TRUE)
Step 3. Assign x and y variables from c. x <- c$Month and y <- c$Calls
Step 4. Create the model. mod1 <- lm(y ~ x)
Step 5. Scatterplot – plot(x, y, type = "p")
The window above will be shown. Closing that window and then calling
abline(mod1, lwd=2) results in the error "plot.new has not been called yet".
The reason is that once the scatterplot window is closed, R cannot find a place to draw the line. The solution is to click the button "Return focus to Console" (leaving the plot window open) and then call abline(mod1, lwd=2)
on the R console. The result is as follows.
Quite easy, isn't it?
Taking this further, the linear equation obtained is Calls = 6.5 × Months + 20.00. As per this estimate, 65.5, 72 and 78.5 calls are expected in the next three months.
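Plugging the next three months into the fitted line confirms those numbers (assuming, as the figures suggest, that the training data covered months 1–6, so the forecasts are for months 7, 8 and 9):

```python
def predicted_calls(month, slope=6.5, intercept=20.0):
    # The linear model obtained above: Calls = 6.5 * Months + 20.00
    return slope * month + intercept

print([predicted_calls(m) for m in (7, 8, 9)])  # [65.5, 72.0, 78.5]
```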
The same analysis can be done through Excel, and the resulting model is as follows.
The benefits of Agile methodologies have been realized in software product development, as this approach attempts to break the complexity of software development by regularly releasing working software.
As this model aligns itself with ever-changing software requirements, these methodologies are increasingly adopted in software engineering. Various methodologies fall under the Agile umbrella, such as DSDM, ASD, Scrum and Extreme Programming. This article will refer to Scrum while discussing ETL development in conjunction with Agile.
The challenge for scrum masters and product owners is to define the scope of a sprint which produces a working model at the end of it. ETL development comprises logical pieces of components (ETL jobs), which makes it convenient to plan a number of ETL jobs in a sprint.
A sprint is normally two or three weeks in duration. On the basis of the resources available, the scrum master can plan a sprint which is realistic and achievable. The suggestion is not that Scrum or Agile can totally replace all phases (e.g. requirements for ETL normally flow from the "upstream" systems), but the build phase is an area which can definitely be exploited.
Many would see it as a challenge to follow the two approaches (SDLC and Agile) at the same time. What is being advocated here is that these methodologies are not intended to replace one another; rather, they should be used judiciously to leverage the strengths of each.
A scenario: a software vendor wins a contract (maybe fixed-price) to deliver an ETL project. They follow the traditional approach for the requirements and high-level design phases. During the build phase, they can internally use Agile principles to drive the ETL development. The advantages are:
1. Better planning, as each sprint plans for less.
2. Better estimation.
3. Better use of lessons learnt, as lessons from one sprint can be applied to the next.
4. Minimized risk, as working software is rolled out from the early stages, providing an opportunity for early identification of errors.
5. A chance to get customer feedback early.
The planning for scheduling software can also be left to later sprints, as by then useful information is available in the form of jobs and dependencies, which again helps in planning more effectively.
In a T&M model, using Agile for ETL development is even better, as the customer finds it easy to relate to the number of jobs delivered in a sprint. This in turn helps with early feedback and builds the customer's confidence, as they see the product much earlier and can provide invaluable feedback.
Agile methodologies bring the team together and provide a platform to engage the customer. Combined with their ease of use in ETL development, they let you build a project management model which is effective.
A lot of buzz is created nowadays around "Big Data". There are some staunch believers, and there are some who dismiss it as a bubble destined to subside. Companies such as Facebook and Amazon are already using it. Let's try to understand the whole thing.
The world today is generating data at a frenetic pace. Statistics available on the net claim that the volume of data doubles every X days. The good thing about this data is that it is available and potentially contains useful information. What's not so good is that it is hard to process this information and use it effectively.
The marketplace has become competitive and will become more so in the years to come. The key to growth (and survival) is innovation. Big Data can help uncover the strategies that make the difference. Big Data, though, is not a "magic wand" or a "one size fits all" approach; rather, it requires careful strategising and planning to achieve something tangible. The following sections discuss its various facets.
Big data can be defined as data which has:
1. volume,
2. velocity,
3. variety and
4. value.
The world is producing data at great velocity; nowadays even petabytes do not amaze us anymore. Today's data is varied as well, e.g. structured, semi-structured and complex. The fourth one, value, is not so straightforward: value is something that has to be unlocked. Organizations can do well with Big Data today if they succeed in defining:
1. What they want.
2. What they understand about the data they need.
3. What data they already have which can be used.
4. Where to look for the data they need.
First and foremost on the list is defining the problem. Once you have this, the next stage of understanding and getting the data can be achieved. This requires investment: intellectual, financial and of time.
The next stage, once you have your data, is analysis. This analysis should aid you in getting what you want and help you with the business.
The above is a very brief look at Big Data, about which there is a lot of buzz nowadays. Consider the example of the dotcom bubble: it did not do well at first, but today e-commerce is not only thriving, it has become a vehicle of business. The same could be the case with Big Data, because it has the potential to transform today's businesses. The trick lies in getting that value unlocked.
Twelve principles underlie the Agile Manifesto, which include:
Customer satisfaction by rapid delivery of useful software
Welcome changing requirements, even late in development
Working software is delivered frequently (weeks rather than months)
Working software is the principal measure of progress
Sustainable development, able to maintain a constant pace
Close, daily co-operation between business people and developers
Face-to-face conversation is the best form of communication (co-location)
Projects are built around motivated individuals, who should be trusted
Continuous attention to technical excellence and good design
Regular adaptation to changing circumstances
Agile methodologies have primarily been used to help in product development, but slowly this trend is catching on for software project management as well. These methodologies are driven by the principles of the Agile Manifesto.
The Agile Manifesto emphasizes:
Individuals rather than processes.
Response to change.
These methodologies address one basic problem which causes many projects to fail: the involvement of the users/stakeholders. Waterfall is predictive; Agile is adaptive, and risk is minimized by the frequent release of working software.
Agile methodologies include Extreme Programming, DSDM, etc.
A project is defined as follows:
1. It has a start and an end date, so it is temporary.
2. It creates a unique product or service.
An operation is an ongoing endeavour. Operations and projects differ primarily in that operations are ongoing and repetitive while projects are temporary and unique.
As per PMBOK (PMI), a project can thus be defined in terms of its distinctive characteristics – a project is a temporary endeavor undertaken to create a unique product or service.