The goal of this project is to develop a classification model that can assign cervical dysplasia samples to one of two main categories, normal and abnormal. The dataset I'm working with has a target divided into 7 categories according to how serious the dysplasia is. To regroup these categories, I will try two unsupervised methods aiming to find:
- The number of clusters.
- How I would split up these clusters.
In addition, I will apply some techniques to identify highly correlated features, together with a dimensionality reduction method, in order to find patterns and divide the target column into fewer categories.
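One common way to approach the cluster-count question is to fit k-means for several values of k and compare silhouette scores. The sketch below is only an illustration under assumptions: the two unsupervised methods used in this project are not named here, so k-means and the silhouette score are stand-ins, `best_k_by_silhouette` is a hypothetical helper, and the data is a synthetic placeholder with the same shape as my dataset (500 rows, 26 features).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def best_k_by_silhouette(X, k_values=range(2, 8), random_state=0):
    """Return the k with the highest silhouette score, plus all scores."""
    # Scale first so that no single feature dominates the distances.
    X_scaled = StandardScaler().fit_transform(X)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in for the real data (assumption: two well-separated groups).
X_demo, _ = make_blobs(n_samples=500, n_features=26, centers=2, random_state=0)
best_k, scores = best_k_by_silhouette(X_demo)
print("best k:", best_k)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Once a value of k is chosen, cross-tabulating the cluster labels against the original 7 categories is one way to see how the categories would be split across clusters.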
My dataset has 26 features and 500 rows. Although there are many features, the number of rows is small, and I will try to increase it later in order to develop another model and compare the results. First, I used some statistical techniques to examine the distributions of the features, and scatter plots to find which variables are highly correlated.
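As a complement to reading scatter plots by eye, highly correlated feature pairs can also be listed directly from the correlation matrix. This is a minimal sketch, assuming the features sit in a pandas DataFrame; `highly_correlated_pairs`, the 0.9 threshold, and the column names `f1` to `f3` are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    pairs = pairs[pairs > threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Synthetic example: f2 is almost a linear function of f1, f3 is independent.
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
demo = pd.DataFrame({
    "f1": f1,
    "f2": 2 * f1 + rng.normal(scale=0.1, size=500),
    "f3": rng.normal(size=500),
})
for a, b, r in highly_correlated_pairs(demo):
    print(f"{a} ~ {b}: |r| = {r:.3f}")
```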
[Figures: feature plots; panels labeled Kerne_Short, Kyto_Short, Cyto_Long]
From the scatter plots above, we can conclude that categories 1-4 could be classified into one group (normal cells) and categories 5-7 into another group (abnormal cells).
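The grouping described above amounts to a simple relabeling of the target column. A minimal sketch, assuming the categories are stored as integers 1 to 7; the helper name `to_binary_target` and the encoding 0 = normal, 1 = abnormal are my own choices for illustration.

```python
import numpy as np


def to_binary_target(categories, normal_categories=(1, 2, 3, 4)):
    """Map the 7 original categories to 0 (normal) or 1 (abnormal)."""
    categories = np.asarray(categories)
    return np.where(np.isin(categories, normal_categories), 0, 1)


print(to_binary_target([1, 2, 3, 4, 5, 6, 7]))
```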
Another technique that can help separate the target classes and verify the above conclusion is Principal Component Analysis (PCA). Looking at the explained variance ratio of the first two components, 80% of the dataset's variance lies along the first principal component and 14% along the second. Together they capture 94% of the variance, so let's plot them.
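The variance ratios quoted above are the kind of values scikit-learn exposes as `explained_variance_ratio_` after fitting `PCA`. The sketch below shows the mechanics on synthetic data, so its numbers will not match the 80% and 14% from my dataset; `project_to_2d` is a hypothetical helper, and whether to scale the features before PCA is a choice I leave open here.

```python
import numpy as np
from sklearn.decomposition import PCA


def project_to_2d(X):
    """Fit a 2-component PCA; return the projection and the variance ratios."""
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)
    return X_2d, pca.explained_variance_ratio_


# Synthetic stand-in: 26 features with decreasing spread (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 26)) * np.linspace(5.0, 0.5, 26)

X_2d, ratios = project_to_2d(X_demo)
print("explained variance ratio:", ratios)
print("captured by first two components:", ratios.sum())
```

The two columns of `X_2d` are what would go on the x and y axes of the plot, with each point colored by its target category.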
This plot does confirm the split of our target into normal cells (1-4) and abnormal cells (5-7). It would also be interesting to make a 3D plot to examine this classification further.
Lastly, before building the prediction models, I trained a supervised algorithm to examine the predictions on the 7 different cell categories. I trained a KNN model; the results are below.
The model identified the third and fourth categories almost perfectly, but it was less accurate on the fifth and sixth.
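A per-category breakdown like the one described above can be read from a confusion matrix. This is a sketch on synthetic 7-class data, not my actual experiment: the number of neighbors, the scaling step, and the train/test split are all assumptions, and the printed recalls say nothing about the real categories.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 500 rows, 26 features and 7 classes labeled 1..7.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=26, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
y_demo = y_demo + 1

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# KNN is distance based, so the features are scaled inside the pipeline.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

labels = np.arange(1, 8)
cm = confusion_matrix(y_test, knn.predict(X_test), labels=labels)
# Recall per category: correct predictions divided by true count of that category.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for label, recall in zip(labels, per_class_recall):
    print(f"category {label}: recall = {recall:.2f}")
```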
After those steps, I started training. I trained and optimized 4 supervised models under 4 different setups: 2 feature selection methods combined with 2 feature scaling methods (standard scaler and normalizer), to find out how these choices affect the results.
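The scaling part of this comparison can be sketched with scikit-learn pipelines and cross-validation. Several things here are assumptions: I take "standard scaler" and "normalizer" to mean scikit-learn's `StandardScaler` and `Normalizer`, the feature selection step is left out because the two methods are not named, the hyperparameters are defaults rather than the optimized ones, and the data is synthetic. `compare_models` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5):
    """Mean cross-validated accuracy for each (model, scaler) combination."""
    scalers = {"standard": StandardScaler, "normalizer": Normalizer}
    models = {
        "Logistic Regression": lambda: LogisticRegression(max_iter=1000),
        "SVM": lambda: SVC(),
        "KNN": lambda: KNeighborsClassifier(),
        "Decision Tree": lambda: DecisionTreeClassifier(random_state=0),
    }
    results = {}
    for scaler_name, make_scaler in scalers.items():
        for model_name, make_model in models.items():
            # The scaler is fitted inside each fold, which avoids leakage.
            pipeline = make_pipeline(make_scaler(), make_model())
            scores = cross_val_score(pipeline, X, y, cv=cv)
            results[(model_name, scaler_name)] = scores.mean()
    return results


# Synthetic binary problem standing in for normal vs. abnormal.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=26, n_informative=10, random_state=0
)
results = compare_models(X_demo, y_demo)
for (model_name, scaler_name), score in sorted(results.items()):
    print(f"{model_name:20s} {scaler_name:10s} accuracy = {score:.3f}")
```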
After comparing all the setups and optimized parameters, and also taking the cross-validation scores into account, the results are below:
[Figures: results for Logistic Regression, SVM, KNN, and Decision Tree, followed by a second set of results for KNN and Decision Tree]
Considering the models' performance on the training, test, and cross-validation sets, the model that performed best is the optimized version of the Decision Tree classifier.
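For readers who want to reproduce the "optimized" step, a grid search over the tree's hyperparameters is one common way to do it. This is a sketch under assumptions: the parameter grid is hypothetical and not the one I actually searched, and the data is a synthetic binary problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for normal vs. abnormal.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=26, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Hypothetical grid: limits on tree size are the usual guard against overfitting.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validation accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Comparing the cross-validation accuracy with the held-out test accuracy gives a quick check on whether the tuned tree is overfitting.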