Diagnostic Data Analysis for Wisconsin Breast Cancer Data
I recently completed a thorough analysis of publicly available diagnostic data on breast cancer, using a number of statistical and machine learning techniques to draw inferences from the data. The following is an excerpt from that Medical Diagnostic Analysis.
The analysis covers the “Breast Cancer Wisconsin (Diagnostic) Data Set”, which is available from the UCI Machine Learning Repository. The files used are “breast-cancer-wisconsin.data” (1) and “breast-cancer-wisconsin.names” (2).
About the data:
The dataset has 11 variables and 699 observations. The first variable is an identifier and has been excluded from the analysis. That leaves 9 predictors and a response variable (class), which marks each case as “Malignant” or “Benign”. The predictors are:
- Clump Thickness
- Uniformity of Cell Size
- Single Epithelial Cell Size
- Bare Nuclei
- Uniformity of Cell Shape
- Bland Chromatin
- Marginal Adhesion
- Normal Nucleoli
- Mitoses
Class – (2 for benign, 4 for malignant)
There are 16 observations where the data is incomplete. In the further analysis, these cases are either imputed (substituted by the most likely values) or ignored. In total, there are 241 malignant cases and 458 benign cases.
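To make the two options for incomplete cases concrete, here is a minimal Python sketch. It is not the code used in the analysis. It assumes the file is comma-separated with the identifier first and the class last, and that missing entries are coded as `?`. The sample rows are made up for illustration, and median imputation is only a stand-in for whatever "most likely value" rule is preferred.

```python
import csv
import io
from collections import Counter
from statistics import median

# Made-up rows in the assumed layout: id, nine predictors, class.
# "?" marks a missing value (an assumption about the file's coding).
SAMPLE = """\
1,5,1,1,1,2,1,3,1,1,2
2,8,10,10,8,7,10,9,7,1,4
3,6,8,8,1,3,?,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2
"""


def load_rows(text):
    """Parse rows, drop the identifier column, and turn '?' into None."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue
        rows.append([None if v == "?" else int(v) for v in record[1:]])
    return rows


def drop_incomplete(rows):
    """Option 1: ignore every row that has a missing value."""
    return [r for r in rows if None not in r]


def impute_median(rows):
    """Option 2: fill each gap with the median of that column's observed values."""
    medians = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        medians.append(median(observed))
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


rows = load_rows(SAMPLE)
complete = drop_incomplete(rows)
imputed = impute_median(rows)

# Class balance (last column: 2 = benign, 4 = malignant).
class_counts = Counter(r[-1] for r in imputed)
print(len(rows), len(complete), class_counts)
```

On the real file, the same functions could be applied to the file's contents instead of `SAMPLE`. Dropping rows is simpler, while imputation keeps all 699 observations available for training.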
Three different classification algorithms were tried (KNN, Random Forest, and decision trees), each with all variables and with variable selection. In general, error rates increased as variables were removed, with one exception: the KNN model with fewer variables performed better than all other models. As this is a small dataset, that result could be a case of overfitting. The second option is the Random Forest classifier with all variables, which does better than the KNN classifier with all variables. Decision trees show a high error rate, so they can be set aside.
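The comparison between a full model and a reduced model can be sketched in a few lines of plain Python. This is an illustration only: the toy data is made up, and both the value of `k` and the leave-one-out error estimate are assumptions, not the settings or validation scheme used in the analysis.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query
    (squared Euclidean distance)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def loo_error(X, y, columns, k=3):
    """Leave-one-out error rate of KNN using only the given feature columns."""
    reduced = [[row[c] for c in columns] for row in X]
    mistakes = 0
    for i in range(len(reduced)):
        train_X = reduced[:i] + reduced[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, reduced[i], k) != y[i]:
            mistakes += 1
    return mistakes / len(reduced)


# Made-up toy data on the 1-10 scale: the first two features separate
# the classes, the third is noise. Labels: 2 = benign, 4 = malignant.
X = [
    [1, 1, 9], [2, 1, 2], [1, 2, 10], [2, 2, 1],
    [9, 8, 1], [8, 9, 10], [9, 9, 2], [8, 8, 9],
]
y = [2, 2, 2, 2, 4, 4, 4, 4]

err_all = loo_error(X, y, columns=[0, 1, 2])
err_subset = loo_error(X, y, columns=[0, 1])
print("all variables:", err_all, "subset:", err_subset)
```

Because the error is estimated on held-out points, a reduced model that wins by a small margin on a dataset this size is worth checking again with a different split before trusting it, which is the overfitting concern raised above.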
Another analysis, done with the help of R, is available here: Cost Factor Analysis.