This repository contains an analysis of the German Credit Risk dataset. The objective of this project is to evaluate credit risk and predict whether a customer falls into the "good" or "bad" risk category using machine learning models.
The repository is organized as follows:
.
├── data
│ ├── raw # Raw data files (original dataset)
│ ├── processed # Processed data files (encoded)
├── notebooks
│ ├── german_credit_risk_analysis.ipynb # Jupyter notebook with the analysis
├── reports
│ ├── figures # Visualizations generated during analysis
├── requirements.txt # Python dependencies for the project
├── README.md # Project overview (this file)
The German Credit Risk dataset contains records for 1,000 customers, each labeled with a credit risk class. Key features of the dataset include:
- Target Variable: Risk, which indicates whether the customer is a good or bad credit risk.
- Input Variables (the selected attributes):
- Age (numeric)
- Sex (text: male, female)
- Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
- Housing (text: own, rent, or free)
- Saving accounts (text: little, moderate, quite rich, rich)
- Checking account (numeric, in DM - Deutsche Mark)
- Credit amount (numeric, in DM)
- Duration (numeric, in months)
- Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
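The missing values that the README mentions in Saving accounts and Checking account could be handled along the following lines. This is a minimal sketch, not the notebook's actual code: the `fill_missing` helper, the `"unknown"` placeholder, the median strategy, and the sample values are all assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df, columns, placeholder="unknown"):
    """Return a copy of df with missing values in `columns` filled.

    Hypothetical helper: numeric columns get their median,
    text columns get the `placeholder` category.
    """
    out = df.copy()
    for col in columns:
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(placeholder)
    return out


# Made-up sample rows using the README's column names (not real data).
df = pd.DataFrame(
    {
        "Age": [67, 22, 49, 45],
        "Saving accounts": ["little", None, "rich", "little"],
        "Checking account": [250.0, None, 1200.0, 80.0],
        "Risk": ["good", "bad", "good", "good"],
    }
)

# Count missing values per column before cleaning.
print(df.isna().sum())

cleaned = fill_missing(df, ["Saving accounts", "Checking account"])
print(cleaned)
```

Keeping a separate "unknown" category for text columns preserves the fact that a value was missing, which may itself carry information about risk; whether that is preferable to dropping rows depends on the analysis.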
- Data Preprocessing:
  - Handling missing values in features like Saving accounts and Checking account.
- Exploratory Data Analysis (EDA):
  - Univariate analysis: plots, treemaps, and creation of categories for Purpose
  - Bivariate analysis
  - Overview: pairplot
- Encoding categorical variables for compatibility with machine learning algorithms:
  - Encoding categorical data
  - Correlation heatmap
- Model Training and Evaluation:
  - Splitting the dataset
  - Standardization
  - Model building:
    - Naive Bayes
    - k-Nearest Neighbors (KNN)
    - XGBoost (XGB)
  - Evaluation metrics:
    - Accuracy
    - F1-score
    - ROC-AUC
- Model synthesis and conclusion:
  - ROC curve
  - The XGBoost model achieved the best performance on all three metrics (accuracy, F1-score, ROC-AUC), making it the recommended choice for deployment.
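The training and evaluation steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the notebook's code: the data is a synthetic stand-in generated with `make_classification`, the split ratio, `GaussianNB` variant, `n_neighbors=5`, and random seeds are arbitrary choices, and scikit-learn's `GradientBoostingClassifier` is swapped in if `xgboost` is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBClassifier

    boosted_name, boosted = "XGBoost", XGBClassifier(n_estimators=100, random_state=42)
except ImportError:
    # Stand-in (assumption): a different gradient boosting implementation.
    boosted_name = "GradientBoosting (stand-in)"
    boosted = GradientBoostingClassifier(random_state=42)

# Synthetic stand-in for the encoded dataset: 1,000 rows, 9 features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=9, n_informative=5, random_state=42
)

# Splitting the dataset (stratified so both classes keep their proportions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardization: fit on the training split only to avoid leaking test data.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    boosted_name: boosted,
}

# Fit each model and collect the three metrics named in the README.
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    proba = model.predict_proba(X_test_s)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Because the data here is synthetic, the scores and the ranking of the models say nothing about the results reported for the real dataset; the sketch only shows how the steps fit together.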
- Clone this repository:
  git clone https://github.com/clemcoste/german_credit_risk.git
  cd german_credit_risk
- Create a virtual environment, activate it, and install the required dependencies:
  python -m venv venv
  source venv/bin/activate   # on Windows: venv\Scripts\activate
  pip install -r requirements.txt
- Open the notebook in Visual Studio Code and select the venv environment as the kernel.
All visualizations and figures generated during the analysis are stored in the reports/figures directory. These include:
- Feature distributions
- Correlation heatmaps
- Model performance comparisons
For the full list of dependencies, see the requirements.txt file.
Contributions are welcome! Feel free to submit issues or pull requests if you have suggestions for improvement.
This project is licensed under the MIT License. See the LICENSE file for more details.