This repository contains the code and documentation for a data mining and machine learning project focused on predicting cardiovascular disease (CVD). The project explores data mining techniques such as feature engineering, binning, clustering, and outlier removal. It also evaluates models such as Gaussian Naive Bayes, Logistic Regression, Random Forest, and an artificial neural network (ANN) for predicting CVD.
The dataset used in this project contains 70,000 records with 13 numerical columns, including age, gender, height, weight, blood pressure values (ap_hi and ap_lo), cholesterol levels, glucose levels, smoking and alcohol habits, physical activity, and the target variable, cardiovascular disease (CVD). The data was preprocessed, including outlier removal and feature engineering, before experimentation.
Dataset source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
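The preprocessing steps mentioned above (outlier removal, feature engineering, binning) can be sketched in plain Python. This is only an illustration, not the project's actual pipeline: the blood pressure thresholds, the BMI feature, the age bins, and the helper names (`load_rows`, `is_plausible`, `add_features`, `preprocess`) are all assumptions, as is the file layout (semicolon-separated, age in days, height in cm, weight in kg).

```python
import csv
import io

# Tiny inline sample in the ASSUMED layout of the Kaggle CSV
# (semicolon-separated, age in days, height in cm, weight in kg).
SAMPLE = """id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0;18393;2;168;62.0;110;80;1;1;0;0;1;0
1;20228;1;156;85.0;140;90;3;1;0;0;1;1
2;18857;1;165;64.0;130;70;3;1;0;0;0;1
3;17623;2;169;82.0;14000;100;1;1;0;0;1;1
4;17474;1;156;56.0;100;1100;1;1;0;0;0;0
"""


def load_rows(handle):
    """Read semicolon-separated rows and convert every value to float."""
    reader = csv.DictReader(handle, delimiter=";")
    return [{key: float(value) for key, value in row.items()} for row in reader]


def is_plausible(row):
    """Hypothetical outlier rule: keep physiologically plausible pressures."""
    return (
        60 <= row["ap_hi"] <= 250
        and 30 <= row["ap_lo"] <= 200
        and row["ap_hi"] > row["ap_lo"]
    )


def age_bin(years):
    """Hypothetical binning of age into decade-style groups."""
    if years < 40:
        return "<40"
    if years < 50:
        return "40-49"
    if years < 60:
        return "50-59"
    return "60+"


def add_features(row):
    """Return a copy of the row with engineered features added."""
    out = dict(row)
    out["age_years"] = row["age"] / 365.25
    out["bmi"] = row["weight"] / (row["height"] / 100) ** 2
    out["age_bin"] = age_bin(out["age_years"])
    return out


def preprocess(rows):
    """Drop implausible rows, then add engineered features to the rest."""
    return [add_features(row) for row in rows if is_plausible(row)]


rows = load_rows(io.StringIO(SAMPLE))
cleaned = preprocess(rows)
print(f"kept {len(cleaned)} of {len(rows)} rows")
```

To try this on the real file, `load_rows` can be given an opened file handle instead of the inline sample, provided the file matches the assumed layout.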
Five machine learning models were applied for cardiovascular disease prediction:
- Gaussian Naive Bayes
- Decision Tree Classifier
- Random Forest
- Logistic Regression
- Artificial Neural Network
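A comparison of the models listed above could look roughly like the following scikit-learn sketch. It is not the project's training code: the hyperparameters, the 80/20 split, the use of `MLPClassifier` as the neural network, and the helper names `build_models` and `evaluate` are assumptions, and synthetic data stands in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_models(seed=42):
    """Return the classifiers to compare (hyperparameters are assumptions)."""
    return {
        "Gaussian Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        # Scaling helps the gradient-based models converge.
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Artificial Neural Network": make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=seed),
        ),
    }


def evaluate(X, y, seed=42):
    """Fit each model on a stratified 80/20 split and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    scores = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for the preprocessed CVD data (11 predictors, binary target).
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
for name, accuracy in evaluate(X, y).items():
    print(f"{name}: {accuracy:.3f}")
```

Accuracy is used here only for brevity; for a medical screening task, metrics such as recall or ROC AUC may be more informative.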
To run the code and reproduce the results:
- Clone this repository.
- Install the required dependencies using `pip install -r requirement.txt`.
- Execute the main script or Jupyter notebooks to train and evaluate the models.
If you would like to contribute to this project, please follow the standard Git workflow:
- Fork the repository.
- Create a new branch for your feature or bug fix: `git checkout -b feature/new-feature`.
- Commit your changes: `git commit -m "Add new feature"`.
- Push to your branch: `git push origin feature/new-feature`.
- Open a pull request, describing the changes and the rationale.