12 Dec

Building a simple analytic engine from traditional technologies (RDBMS based)


Well, designing an analytic engine has always been difficult, and with Big Data and machine learning the complexity has taken an unprecedented direction. This article aims to address one important point: how to use traditional systems in software engineering to build an analytic engine. This has one major advantage, namely that no new skill set needs to be learned, which helps the team focus on the product rather than on complex underlying statistical concepts.

In the following sections, I am going to describe a sentiment mining system which can be implemented on an RDBMS (SQL Server, Oracle or MySQL), with a compatible front-end system (.NET, Java or a language of your choice).

Problem: Predicting the sentiment of a sentence or phrase using traditional software methods.

Approach: To build a sentiment mining system, we will use the Naive Bayes principle of conditional probability, in which the probability of an outcome is conditioned on an earlier event or variable. Suppose we are doing a two-bag experiment, where both bags contain red and blue balls. A ball is drawn and turns out to be of a particular colour; what are the chances that it was drawn from the first bag rather than the second? This kind of scenario is very common in statistics, and the same approach can be followed for a text mining system.

Naive Bayes in the context of sentiment mining: a training set will be used to train a classifier, which will essentially be a couple of tables. Let's say that the following sentences, along with their sentiments (positive or negative), are available to us as training data.

 Sentence 1 – Agent Ram has solved the technical glitch nicely. Thanks to him. [POSITIVE]

Sentence 2 – Agent Ravana was not at all willing to listen to me. He could do better. [NEGATIVE]

Sentence 3 – Agent Bharat was awesome. [POSITIVE]

Sentence 4 – Agent Meghnad did not do any good to me [NEGATIVE].

Sentence X – Agent Laxman solved my issue. God bless him [UNKNOWN]

So, how will Naive Bayes help in determining the sentiment of Sentence X?

In a nutshell, we calculate the probability of the unknown sentence for both cases, positive and negative (p-pos and p-neg). Whichever has the higher value gives the sentiment of the unknown sentence.

Going further, from the training data we know that the word “Agent” occurs in both positive sentences (Sentences 1 and 3), so its probability in positive sentences is 1 (2 out of 2), while the word “solved” has a probability of ½ in positive sentences (it occurs in 1 of the 2). Remember, we are determining an individual word’s probability for a particular sentiment.

Similarly, “willing” has a probability of ½ in negative sentences.

 When Sentence X (an unknown sample) is encountered:

p-pos for all the words is determined and the values are multiplied together. For the eight words of Sentence X this gives 1 × 0.01 × 0.5 × 0.01 × 0.01 × 0.01 × 0.01 × 0.5 (“Agent”, “solved” and “him” take the probabilities worked out above; 0.01 is used as a floor so that the non-occurrence of a term does not set the whole probability to zero). This is then multiplied by the probability of occurrence of a positive sentiment, the prior probability, which is 0.5 (2 out of 4).

Similarly, p-neg for Sentence X is determined as:


1 × 0.01 × 0.01 × 0.01 × 0.01 × 0.01 × 0.01 × 0.01 (only “Agent” appears in the negative sentences), multiplied by the prior probability of a negative sentiment, which is again 0.5 (2 out of 4).

Thus we see that p-pos is higher than p-neg, and we can conclude that Sentence X carries a positive sentiment.
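To make the arithmetic concrete, here is the whole comparison in a few lines of R, a minimal sketch using the per-word probabilities worked out above:

# Per-word probabilities for the eight words of Sentence X
# ("Agent", "Laxman", "solved", "my", "issue", "God", "bless", "him");
# 0.01 is the floor for words never seen with that sentiment
p_pos <- 0.5 * prod(c(1, 0.01, 0.5, 0.01, 0.01, 0.01, 0.01, 0.5))
p_neg <- 0.5 * prod(c(1, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01))
p_pos > p_neg   # TRUE: Sentence X is classified as positive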

Implementation

Once the basics of Naive Bayes are clear, we can get to actually implementing the system. You may be surprised that only two tables are required.

t_sentence_sentiment, with columns sentence and sentiment_type, along with a sentence identifier.

t_term_sentiment, which will have term, sentiment_type and count columns.

TRAINING THE CLASSIFIER

The first table holds the available data; the second is populated by training the algorithm. Let's examine the following sentence.

Sentence1 -> Agent Ram has solved the technical glitch nicely. Thanks to him

 For table 1, this information is saved as “Sentence1”, “Agent Ram has solved the technical glitch nicely. Thanks to him”, “Positive”

With the help of the above information, we will update the second (training) table. This table stores the term, the number of times it has appeared in a sentiment type, and the type.

Agent,1, Positive

Ram,1,Positive

If we then consider Sentence 3, the row for the word “Agent” is updated to reflect a count of 2: (Agent, 2, Positive).

 DETERMINING AN UNKNOWN SENTIMENT

After the training part is completed, we can attack our problem. With the above information stored, we need to find two things: the probability of an individual term within a sentiment, and the overall probability of each sentiment type.

  1. Probability of a term occurring in positive samples: “Agent” has probability 1 (it occurs in 2 of the 2 positive samples) and “solved” has 0.5 (it occurs in 1 of the 2). In this way the probabilities of all terms are found. If a term does not appear at all, assume a small floor probability (say 0.005).
  2. Similarly, the probabilities of all terms occurring in negative sentences are found.
  3. Overall probability of positive sentiments; this can be found from the first table.
  4. Overall probability of negative sentiments.
  5. Calculate the positive probability of the whole sentence by multiplying the results of 1 and 3.
  6. Calculate the negative probability of the whole sentence by multiplying the results of 2 and 4.
  7. From 5 and 6, whichever is higher gives the sentiment of the unknown sentence (see the sketch after this list).
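As a sketch of steps 1–7, here is how the classification could look if the t_term_sentiment counts were pulled into R. The counts mirror the toy training data above (only the terms discussed are included, for brevity); the function and variable names are illustrative assumptions, not a prescribed implementation.

# t_term_sentiment pulled into R: term, sentiment_type and count
term_sentiment <- data.frame(
  term           = c("agent", "solved", "agent", "willing"),
  sentiment_type = c("positive", "positive", "negative", "negative"),
  count          = c(2, 1, 2, 1)
)
# Number of training sentences per sentiment, from t_sentence_sentiment
n_sentences <- c(positive = 2, negative = 2)

classify <- function(sentence, floor_p = 0.005) {
  words <- strsplit(tolower(sentence), " ")[[1]]
  scores <- sapply(names(n_sentences), function(s) {
    # Steps 1-2: each term's probability within this sentiment,
    # floored so that unseen terms do not zero out the product
    p_terms <- sapply(words, function(w) {
      hit <- term_sentiment$term == w & term_sentiment$sentiment_type == s
      if (any(hit)) term_sentiment$count[hit] / n_sentences[s] else floor_p
    })
    # Steps 5-6: multiply by the overall (prior) probability of the sentiment
    prod(p_terms) * (n_sentences[s] / sum(n_sentences))
  })
  names(which.max(scores))   # Step 7: the higher score wins
}

classify("agent laxman solved my issue god bless him")   # returns "positive"

In a production system the same arithmetic can of course stay inside the database as a query over the two tables; the R version is only meant to make the steps concrete.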

Further information: 

We have seen that it is possible to build a classifier with traditional software methods; the training data is what matters most here. The design can also easily be extended to big-data scenarios, for example by moving the two tables to Hive, so the approach scales as well.

01 Sep

71 Points you would like to know before writing the PMP Exam


1. Shutting down a project is a risk avoidance strategy.
2. Beta and triangular distributions are used in Quantitative Risk Analysis.
3. The lowest level of the RBS can be used as a risk checklist – used in Identify Risks.
4. Sigma Values –
Three Sigma – 93.3
Four Sigma – 99.38
Five Sigma – 99.97
Six Sigma – 99.99966
5. RAM – Responsibility Assignment Matrix; RACI is a type of RAM.
6. A process is out of control when a data point exceeds a control limit or if seven consecutive points are above or below the mean.
7. Difference between control limits and specification limits?
8. Value of control limits – generally between plus/minus three sigma.

9. Failure costs are costs of poor quality. Cost of quality includes all costs incurred over the life of the product by investment in preventing non-conformance to requirements and by failing to meet requirements (rework).

10. Cost of conformance – Build a quality product + Appraisal costs

11. Cost of non-conformance – internal failure costs + external failure costs.
12. To perform a bottom-up ETC, additional costs are incurred for the project and there is no budget to do so. EAC = AC + bottom-up ETC (see the worked example after this list).
13. The cost performance baseline is an authorized time-phased budget at completion (BAC) used to measure, monitor and control overall cost performance on the project.

14 Cost baseline excludes management reserves.

15. Bar charts are frequently used in management presentations. For control and management communication, a broader, more comprehensive summary activity, referred to as a HAMMOCK ACTIVITY, is used.

16 Resource Leveling can often cause critical path to change.

17. Difference between quality control and verify scope – ensure (correctness) vs. accept (deliverables).
18. Verify Scope includes reviewing deliverables with the customer/sponsor to ensure that they are completed satisfactorily.
19. 100 Percent Rule: the WBS represents the total work, including project management work; this is called the 100 percent rule.

20. A control account may contain one or more work packages but each work
package must be associated with only one control account.

21. Project Scope Management processes are preceded by some planning
effort which produces a scope management plan.

22. Change requests are not an input but an output of the Control Scope process – why?

23. Close Project or Phase

24. The project manager will review and ensure that all project work has been completed and has met its objectives.

25 Project scope needs to be reviewed to ensure completion.

26. Project closure should document the reasons if a project is terminated before completion.
Project closure should address:
– activities that satisfy completion or exit criteria for the phase or project;
– activities necessary to transfer the project’s products, services or results to the next phase or to production and/or operations;
– activities needed to collect project or phase records, audit project success or failure, gather lessons learned and archive project information for future use.

27. Change requests should pass through change management and/or
configuration control systems. Configuration control is focused on
specification of both deliverables and processes while change control
is focused on identifying, documenting and controlling changes to
project and product baselines. Configuration management activities
carried out in integrated change control processes are:

Configuration identification
Configuration Status accounting
Configuration verification and audit

28 Performance Measurement Baseline – scope, schedule and cost baseline

29. Kill points, phase exits, milestones, phase gates and decision gates are the same thing (when phases are sequential).
30. Functional, Weak Matrix, Strong Matrix etc. (organizational structures).
31. Difference between PDM and AON – they are the same thing. PDM vs. ADM – the placement of the activity on the logic diagram (on the node vs. on the line).

32. Problem solving and compromise are the two best conflict resolution techniques, in that order.

33. Expert power is the best form of power.

34. Reward power is the next best form of power, then formal and then referent.

35. Net Present Value (formula) – uses a discount rate (related to the internal rate of return); it is enough to remember to select the project with the maximum NPV when choosing between projects.

36 Staffing management plan contains
– Training needs
– Recognition and Rewards
– Release criteria

37. Point of Total Assumption: the point at which the seller assumes the cost. In a fixed-price incentive fee (FPIF) contract, the costs have become so large that there is no further benefit for the seller.

38. Motivational theories:
– McClelland – Achievement Theory
– McGregor – Theory X and Y
– Maslow – hierarchy of needs:
Self-transcendence
Self-actualization
Esteem
Love and belonging
Safety needs
Physiological needs
– Vroom – Expectancy Theory
– Herzberg – Hygiene Theory

39. Funding Limit Reconciliation

40. Return on Investment
41. Cost baseline.
42. Network templates (schedule network templates) – expedite the preparation of networks of project activities; useful when a project includes identical or nearly identical deliverables.
43. Constructive change request: a direction by the buyer, or an action taken by the seller, that the other party considers an undocumented change to the contract. This can lead to a claim.
44. A Risk Management Plan includes:
Methodology
Roles and Responsibilities
Budgeting
Timing
Risk Categories
Definitions of Risk categories and Impact
Probability and Impact Matrix
Revised Stakeholders’ tolerances
Reporting Formats (strange but yes)
Tracking

45. Liquidated damages (LDs) are contractually agreed payments in
order to cover the customer’s costs caused by late completion or
failure to meet specifications by the contractor.

46. Perform Integrated Change Control: this process includes configuration management activities such as
– Configuration Identification.
– Configuration Status Accounting.
– Configuration verification and Audit.

47. When is Requirements Documentation done?

48. What do the Basis of Estimates include?
49. Assumptions Analysis is used during the Identify Risks process.
50. Quality Control:
Prevention – keeping errors out of the process
Inspection – keeping errors out of the hands of the customer
Attribute sampling – the result either conforms or does not conform
Variables sampling – the result is rated on a continuous scale (measures the degree of conformity)
Tolerance – a specified range of acceptable limits
Control limits – thresholds indicating whether the process is out of control.
51. The PM team is a subset of the project team (core, executive or leadership team). The sponsor works with the PM team, assisting in matters such as funding, clarifying scope, monitoring progress and influencing others for the benefit of the project. Ground rules establish clear expectations regarding acceptable behaviour by team members.

52. Weighting System can be used to select a Seller. (Subjective)

53. Communications Management Plan: the communication needs and expectations of the project; how and in what format information will be communicated; when and where communication will be made; and who is responsible for providing the information.

54. Stretch Assignment

55. Scope Creep – Uncontrolled Changes.
56. Active Risk Acceptance: the most common way is to build a contingency reserve (time, money or resources).
57. Burn Rate – Inverse of CPI.
58. CPAF – the award is decided by the buyer and is generally not subject to appeals.
59 Sensitivity Analysis – tornado diagrams.
60. Passive acceptance of risk requires no action except to document the strategy, leaving the project team to deal with the risk when it occurs. Accept is a risk strategy for both POSITIVE and NEGATIVE risks.
61. A joint quality policy can help in a multiple-vendor environment. The project team should ensure that the participating organisations are aware of the policy.
62. Variance analysis is used in the control processes – Control Scope, Control Schedule etc. – and also in the Report Performance process.

63. In the earned value management technique, the cost performance baseline is also referred to as the PMB (Performance Measurement Baseline). The cost performance baseline is used to assess funding requirements.

64 A heuristic is BEST described as a Rule of Thumb.

65. Conflicts on projects are caused by:
– Resources
– Project priorities
– Schedule
– Technical opinions
66. Who determines the role of a stakeholder?
67. Read up on risk reserves.
68. Matrix organisations – read up on weak vs. strong.
69. What is an RPN number?
70. Project constraints are Scope, Quality, Schedule, Budget, Resources and Risk.
71. Hard logic – mandatory dependencies.
Soft logic – discretionary or preferential (such as best practice).
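A few of these points (12 and 57 in particular) boil down to simple earned-value arithmetic. Here is the worked example promised above, in R, with purely illustrative figures:

# Earned-value arithmetic with illustrative (made-up) figures
EV <- 40000                   # earned value
AC <- 50000                   # actual cost to date
bottom_up_ETC <- 65000        # re-estimated cost of the remaining work

CPI <- EV / AC                # cost performance index = 0.8 (over budget)
burn_rate <- 1 / CPI          # point 57: burn rate is the inverse of CPI = 1.25
EAC <- AC + bottom_up_ETC     # point 12: EAC from a bottom-up re-estimate = 115000
c(CPI = CPI, burn_rate = burn_rate, EAC = EAC)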
09 Dec

Analysis of Wisconsin data set for Breast Cancer (Using R)

Diagnostic Data Analysis for Wisconsin Breast Cancer Data

I have recently done a thorough analysis of publicly available diagnostic data on breast cancer, using a number of statistical and machine learning techniques to draw inferences from the data. Following is an excerpt of the Medical Diagnostic Analysis.

Introduction:

Analysis of the “Breast Cancer Wisconsin (Diagnostic) Data Set”. The dataset is available from the UCI Machine Learning Repository. The data used is “breast-cancer-wisconsin.data” (1) and “breast-cancer-wisconsin.names” (2).

About the data:

The dataset has 11 variables with 699 observations; the first variable is an identifier and has been excluded from the analysis. Thus there are 9 predictors and a response variable (class). The response variable denotes “malignant” or “benign” cases.

Predictor variables:

  • Clump Thickness
  • Uniformity of Cell Size
  • Single Epithelial Cell Size
  • Bare Nuclei
  • Uniformity of Cell Shape
  • Bland Chromatin
  • Mitoses
  • Marginal Adhesion
  • Normal Nucleoli

Response variable:

Class – (2 for benign, 4 for malignant)

There are 16 observations where the data is incomplete. In the analysis these cases are imputed (substituted by the most likely values) or ignored. In total there are 241 cases of malignancy, whereas there are 458 benign cases.
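A minimal R sketch of loading and cleaning the data follows; the column names, the na.strings convention for the “?” markers and the median imputation are my assumptions for illustration, not necessarily the choices of the original analysis.

# Load the UCI data file; missing values are coded as "?"
cols <- c("Id", "ClumpThickness", "CellSize", "CellShape", "MarginalAdhesion",
          "EpithelialCellSize", "BareNuclei", "BlandChromatin",
          "NormalNucleoli", "Mitoses", "Class")
wbc <- read.csv("breast-cancer-wisconsin.data", header = FALSE,
                col.names = cols, na.strings = "?")
wbc$Id <- NULL                         # drop the identifier (excluded above)
wbc$Class <- factor(wbc$Class, levels = c(2, 4),
                    labels = c("benign", "malignant"))

# Impute the 16 incomplete cases (all in the Bare Nuclei attribute)
wbc$BareNuclei[is.na(wbc$BareNuclei)] <- median(wbc$BareNuclei, na.rm = TRUE)
table(wbc$Class)                       # 458 benign, 241 malignant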

 Medical Diagnostic Analysis.

[Figure: classifier performance comparison]

Inference

Three different classification algorithms were tried, both with all variables and with variable selection. Generally, errors increased when variables were removed, with one exception: the KNN model with fewer variables performed better than all the other models. As this is a small dataset, that could be a case of overfitting. The second option is to use the full random forest classifier, as it does better than the KNN classifier with all variables. The decision trees show a high error rate, so they can be ignored.
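For reference, an illustrative way to reproduce this kind of comparison in R is sketched below, continuing from the loading sketch above; the train/test split, k = 5 and the default tree and forest settings are assumptions, not the exact models behind the figures reported here.

library(randomForest)   # random forest classifier
library(rpart)          # decision tree
library(class)          # k-nearest neighbours

set.seed(1)
idx   <- sample(nrow(wbc), 0.7 * nrow(wbc))   # 70/30 train/test split
train <- wbc[idx, ]
test  <- wbc[-idx, ]

rf   <- randomForest(Class ~ ., data = train)
tree <- rpart(Class ~ ., data = train, method = "class")
knn_pred <- knn(train[, -10], test[, -10], cl = train$Class, k = 5)

err <- function(pred) mean(pred != test$Class)   # misclassification rate
c(rf   = err(predict(rf, test)),
  tree = err(predict(tree, test, type = "class")),
  knn  = err(knn_pred))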

 

Another analysis done with the help of R is available here: Cost Factor Analysis

25 Oct

R – resolving error – plot.new has not been called yet (R Statistical package)

Linear Regression (R)

While doing linear regression using R, the following error is obtained:

Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, …) :
  plot.new has not been called yet

As a result, the regression line cannot be seen on the scatterplot, although examining the summary of the model confirms that it has been built correctly and the coefficient values can be obtained. This is not really an error; rather, it boils down to the fact that we need to use R slightly differently. The following lines will make this clear.

Say we want to forecast the number of calls based on a few months of data. The data is available as follows:

Month   Calls
1       27
2       40
3       37
4       39
5       46
6       69

Step 1. Save the data in CSV format (data.csv).

Step 2. Read the data into the R console – c <- read.csv("data.csv", header=TRUE)

Step 3. Assign the x and y variables from c: x <- c$Month and y <- c$Calls

Step 4. Create the model: mod1 <- lm(y ~ x)

Step 5. Draw the scatterplot: plot(x, y, type = "p")

A scatterplot window will be shown. Closing that window and then making the call abline(mod1, lwd=2) results in the error “plot.new has not been called yet”.

The reason is that once the scatterplot window is closed, R has no open graphics device on which to draw the line. The solution is to click the button “Return focus to Console” (leaving the plot window open) and then call abline(mod1, lwd=2) on the R console.
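Put together, the whole session looks like this; a minimal sketch, with the file name data.csv and the columns Month and Calls as above (the variable is renamed from c to calls_data only because c shadows a built-in R function):

# Read the data and fit a simple linear model
calls_data <- read.csv("data.csv", header = TRUE)
x <- calls_data$Month
y <- calls_data$Calls
mod1 <- lm(y ~ x)

# Draw the scatterplot and, with the plot device still open, add the line
plot(x, y, type = "p")
abline(mod1, lwd = 2)   # only fails with "plot.new has not been called yet"
                        # if the plot window was closed first
summary(mod1)           # intercept ~20.0, slope ~6.57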

Quite easy, isn’t it?

Taking this further, the fitted linear equation is approximately Calls = 6.57 × Month + 20.0. As per this estimate, about 66, 73 and 79 calls are expected in the next three months.

 

The same analysis can also be done through Excel, which yields the same model.

10 Jul

Why are Agile methodologies a good fit for ETL development?

The benefits of Agile methodologies have been realized in software product development, as this approach attempts to break the complexity of software development by frequently releasing working software.

As this model aligns itself with ever-changing software requirements, these methodologies are increasingly adopted in software engineering. There are various methodologies which fall under the Agile umbrella, such as DSDM, ASD, Scrum and Extreme Programming. This article will refer to Scrum while discussing ETL development in conjunction with Agile.

The challenge for the Scrum master and product owner is to define the scope of a sprint so that it produces a working model at its end. ETL development comprises logical pieces of components (ETL jobs), which makes it convenient to plan a number of ETL jobs in a sprint.

A sprint is normally two or three weeks in duration. On the basis of the number of resources available at hand, the Scrum master can plan a sprint which is realistic and achievable. This is not to suggest that Scrum or Agile can totally replace all phases (requirements for ETL normally flow from the “upstream” systems), but the build phase is an area which can definitely be exploited.

Many would see it as a challenge to follow two different approaches (SDLC and Agile) at the same time. What is being advocated here is that these methodologies are not intended to replace one another; rather, they should be used judiciously to leverage the strengths of each.

A scenario: a software vendor wins a contract (maybe fixed price) to deliver an ETL project. They follow the traditional approach for the requirements and high-level design phases. During the build phase, they can internally use Agile principles to drive the ETL development. The advantages are:

1. Better planning, as each sprint plans for a smaller scope.

2. Better estimation.

3. Better use of lessons learnt, as lessons from one sprint can be applied to the next.

4. Minimised risk, as working software is rolled out from the early stages, providing an opportunity for early identification of errors.

5. A chance to get customer feedback early.

The planning of the scheduling software can also be left to later sprints, as by then useful information will be available in the form of jobs and dependencies, which again helps in planning more effectively.

In a T&M model, using Agile for ETL development is even better, as the customer can easily relate to the number of jobs delivered in a sprint. This in turn helps with early feedback and builds the customer’s confidence, as they see the product much earlier and can provide invaluable feedback.

Agile methodologies bring the team together and provide a platform to engage the customer. Combine this with the ease of use in ETL development and you can develop a project management model which is truly effective.

06 Jul

Making sense of “Big Data”

A lot of buzz is created nowadays around “Big Data”. There are some staunch believers, and there are some who dismiss it as a bubble destined to subside. Companies such as Facebook and Amazon are already using it. Let’s try to understand this whole thing.

The world today is generating data at a frenetic pace. Statistics available on the net suggest that the volume of data doubles every X days. The good thing about this data is that it is available and potentially contains useful information. What is not so good is how hard it is to process this information and use it effectively.

The marketplace has become competitive and will become more so in the years to come. The key to growth (and survival) is innovation. Big Data can help uncover the strategies which make the difference. Big Data, though, is not a “magic wand” or a “one size fits all” approach; rather, it requires careful strategising and planning to achieve something tangible. The following sections discuss its various facets.

Big data can be defined as data which has:

1. velocity

2. volume

3. variety and

4. value.

The world is producing data with great velocity; nowadays even petabytes do not amaze us anymore. The data is varied as well, e.g. structured, semi-structured and complex. The fourth one, value, is not so straightforward: it is the value that must be unlocked. Organizations can do well with Big Data today if they succeed in defining:

1. What they want.

2. What data they need.

3. What data they already have that can be used.

4. Where to look for the data they need.

The first and foremost item on the list is defining the problem. Once you have this, the next stage of understanding and getting the data can be achieved. This requires investment: intellectual, financial and in terms of time.

The next stage, once you have your data, is analysis. This analysis should aid you in getting what you want and help you with the business.

The above is a very brief overview of Big Data, and there is a lot of buzz around it nowadays. If we look at the example of the dotcom bubble: it did not do well at first, but today e-commerce is not only thriving, it has become the vehicle of business. The same could be the case with Big Data, because it has the potential to transform today’s businesses. The trick lies in getting that value unlocked.

09 Mar

Agile Manifesto

Learning never ends...

 Agile Manifesto

Twelve principles underlie the Agile Manifesto:

Customer satisfaction by rapid delivery of useful software

Welcome changing requirements, even late in development

Working software is delivered frequently (weeks rather than months)

Working software is the principal measure of progress

Sustainable development, able to maintain a constant pace

Close, daily co-operation between business people and developers

Face-to-face conversation is the best form of communication (co-location)

Projects are built around motivated individuals, who should be trusted

Continuous attention to technical excellence and good design

Simplicity

Self-organizing teams

Regular adaptation to changing circumstances

03 Mar

Agile

Agile methodologies have primarily been used to help in product development, but this trend is slowly catching on for software project management as well. These methodologies are driven by the principles of the Agile Manifesto.

The Agile Manifesto emphasizes:

Individuals rather than processes.

Working software

Customer collaboration

Response to change

These methodologies address one basic problem which causes many projects to fail: lack of involvement of the users/stakeholders. Waterfall is predictive; Agile is adaptive, and risk is minimized by the frequent release of working software.

Agile methodologies include Extreme Programming, DSDM, etc.

03 Mar

What is a project?


A project is defined as follows:

1. It has a start and an end date, so it is temporary.

2. It creates a unique product or service.

An operation is an ongoing endeavour.  Operations and projects differ primarily in that operations are ongoing and repetitive while projects are temporary and unique.

As per PMBOK (PMI), a project can thus be defined in terms of its distinctive characteristics – a project is a temporary endeavor undertaken to create a unique product or service.
