04.11.2024 - 26.11.2024
Author: Till Meineke
<style>img[src$="#fungi"] { display: block; margin: 0 auto; border-radius: 10%; width: 300px; } </style>source: Giphy
> [!IMPORTANT]
> Work in progress. (For questions, bugs, hints, and improvements, just email me.)
Missing parts:

- full and clean EDA
  - feature encoding
  - ranges of values
  - missing values
  - analysis of target variable
  - feature importance analysis
- models
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - XGBoost
  - LightGBM
  - CatBoost
- model selection and tuning / MLflow?
- test reproducibility
- fix model deployment / styling
- web application styling
- update README with new information, images and videos
You can rate this version. Basic functionality is working.

You can test the running EB instance with `make test_deploy` or, in the provided conda environment, with `python predict_test.py`.

I made a video of the local deployment with Docker (`make deploy`) and of testing it with `make test_deploy`.
Walking through the woods collecting mushrooms can be a fun activity. However, it can also be dangerous if you don't know which mushrooms are edible. The goal of this project is to build and deploy a model that predicts which mushroom species you picked based on a few simple characteristics.
To better understand the problem, I will use the Classification Mushroom Data 2020 dataset. Its primary data describes 173 mushroom species, which can be used for simulating hypothetical mushrooms. Since the provided secondary data contains 61,069 hypothetical mushrooms for binary classification but without species names, I have to generate simulated data that includes the names. This will ensure that the generated dataset is of high quality and relevant for the task I am attempting to solve.
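For a quick sanity check of the generated file, something like the following can be used (a hypothetical snippet: pandas is assumed, the path matches the project tree below, and the column name `name` is an assumption):

```python
import pandas as pd

# The generator writes semicolon-separated values (see write_to_csv further below)
df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";")

print(df.shape)              # should report 61,069 rows
print(df["name"].nunique())  # should report 173 distinct species names
```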
```
.
├── data
│   ├── raw
│   │   ├── primary_data_edited.csv              <-- Raw data from paper
│   │   ├── primary_data_meta.txt                <-- Raw data from paper (description)
│   │   ├── secondary_data_generated.csv         <-- Raw data from paper
│   │   └── secondary_data_meta.txt              <-- Raw data from paper (description)
│   └── secondary_data_generated_with_names.csv  <-- Generated data
│
├── images                    <-- Images for readme and "Learning in public"
│
├── models
│   └── model_md=20_msl=5.bin <-- Trained model
│
├── notebooks
│   └── 01_eda.ipynb          <-- Exploratory data analysis
│
├── references
│   ├── 'Collins Mushroom Miscellany.epub'  <-- Book with mushroom images
│   ├── 'Mushroom data creation.pdf'        <-- Main paper for creating mushroom data
│   ├── 'Mushroom data creation_sup.pdf'    <-- Supplementary material
│   └── mushrooms-collins-gem.pdf           <-- Book with mushroom images
│
├── src                       <-- Source code for use in this project
│   ├── services
│   │   ├── Images            <-- Images from book
│   │   ├── Text              <-- Text from book
│   │   └── rename_images.py  <-- Script to rename images
│   ├── __init__.py           <-- Python package initializer file
│   ├── data_cat.py           <-- Script to categorize data
│   ├── gen_corr_norm.py      <-- Script to generate correlated and normalized data
│   ├── mushroom_class_fix.py
│   ├── primary_data_gen.py
│   ├── secondary_data_gen.py <-- Modified script to generate secondary data
│   ├── stats_graphics.py
│   ├── text_attr_match.py
│   └── util_func.py
│
├── .dockerignore             <-- Docker ignore file
├── .gitignore                <-- Git ignore file
├── Dockerfile                <-- Docker file
├── environment.yml           <-- Conda environment file
├── LICENSE
├── Makefile
├── Pipfile                   <-- Pipenv file
├── Pipfile.lock              <-- Pipenv lock file
├── predict.py                <-- Prediction script
├── predict_test.py           <-- Prediction test script
├── README.md                 <-- The file you are currently reading
└── train.py                  <-- Training script
```
Working with synthetic data, I was asked to consider the following points:
- Clearly document how you generated the synthetic dataset and the reasoning behind its design.
- Provide sufficient context about the dataset and the model you are building for your peers who will review your project.
In the repository I found several scripts belonging to the paper "Mushroom data creation" by D. Wagner et al. (2020). The scripts are written in Python and are used to generate the secondary data. I had to modify them so that the generated data includes the species names; for example, `write_to_csv` now writes the family and name columns as well:
```python
import data_cat  # repo module providing the dataset header constants


def write_to_csv(file_name, funghi_entry_list, use_intervals):
    """
    Parameters
    ----------
    file_name: str
        name of the written csv file
    funghi_entry_list: list of FunghiEntry
        list of mushrooms, each element corresponding to one simulated mushroom
    use_intervals: bool
        uses the interval borders as values for the metrical attributes
        instead of a simulated float value

    Functionality
    ------------
    writes each simulated mushroom as a line in a csv file
    """
    # "with" ensures the file is closed even if a write fails
    with open(file_name, "w") as file:
        if not use_intervals:
            # file.write(data_cat.PRIMARY_DATASET_HEADER.replace("family;name;", "") + "\n")
            file.write(data_cat.PRIMARY_DATASET_HEADER + "\n")
        else:
            file.write(data_cat.DATASET_HEADER_MIN_MAX.replace("name;", "") + "\n")
        for funghi_entry in funghi_entry_list:
            # funghi_str = funghi_entry.is_edible
            funghi_str = funghi_entry.family + ";" + funghi_entry.name + ";" + funghi_entry.is_edible
            for category in funghi_entry.categories:
                funghi_str += ";" + str(category)
            file.write(funghi_str + "\n")
```
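A quick usage sketch (the `FunghiEntry` stand-in and its attribute values are purely illustrative, not taken from the paper's code):

```python
# Illustrative stand-in for the paper's FunghiEntry class
class FunghiEntry:
    def __init__(self, family, name, is_edible, categories):
        self.family = family
        self.name = name
        self.is_edible = is_edible
        self.categories = categories


entries = [FunghiEntry("Amanitaceae", "Fly Agaric", "p", ["x", "o", "r"])]
write_to_csv("demo.csv", entries, use_intervals=False)
# demo.csv now holds the header line followed by "Amanitaceae;Fly Agaric;p;x;o;r"
```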
The result is a synthetic dataset of 61,069 hypothetical mushrooms covering 173 named species, with the following features:
- `family`
- `class`
- `cap-diameter`
- `cap-shape`
- `cap-surface`
- `cap-color`
- `does-bruise-or-bleed`
- `gill-attachment`
- `gill-spacing`
- `gill-color`
- `stem-height`
- `stem-width`
- `stem-root`
- `stem-surface`
- `stem-color`
- `veil-type`
- `veil-color`
- `has-ring`
- `ring-type`
- `spore-print-color`
- `habitat`
- `season`
You can find the EDA in this notebook (WIP) and in this improved second notebook (WIP).
Since we want a simple model with few features to predict the name of the mushroom, we will drop the features with missing values for our first round of modelling.
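A minimal sketch of that first pass (the semicolon separator is taken from the generator above; the real selection lives in the notebooks):

```python
import pandas as pd

df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";")

# First modelling round: keep only the columns without any missing values
df_simple = df.dropna(axis=1)
print(f"kept {df_simple.shape[1]} of {df.shape[1]} columns")
```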
I trained a Logistic Regression model and a Decision Tree model and evaluated them with cross-validation.
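In spirit, the evaluation looks like this sketch. The target column and the hyperparameters are assumptions (the model filename `model_md=20_msl=5.bin` hints at `max_depth=20` and `min_samples_leaf=5` for the tree); `train.py` holds the actual setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";").dropna(axis=1)

# Assumed setup: predict the species name from one-hot encoded features
X = pd.get_dummies(df.drop(columns=["name"]))
y = df["name"]

for model in (
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=20, min_samples_leaf=5),
):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(float(scores.mean()), 3))
```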
You can train the model with:

```bash
make train
```

or in your conda environment with:

```bash
python train.py
```

or with pipenv (if the environment is not activated):

```bash
pipenv run python "./train.py"
```

You can also activate the environment first and then run the script:

```bash
pipenv shell
python train.py
```
See `train.py`. The model is saved in the `models` folder.
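Assuming `train.py` pickles the fitted model (a guess based on the `.bin` extension), it can be loaded back like this:

```python
import pickle

# Assumption: the .bin file is a pickled scikit-learn model
with open("models/model_md=20_msl=5.bin", "rb") as f_in:
    model = pickle.load(f_in)
```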
Preferably, use the make commands (from the `Makefile`) or run the scripts directly from `src`.
Refer to the section below for descriptions of the make commands. Before running them, consider creating a virtual environment. Try out the make commands (see `make help`).
> [!IMPORTANT]
> `make grow_fungi` will overwrite the generated data in `data/secondary_data_generated_with_names.csv`.
Development was done in the conda environment defined in `./environment.yml`. Install it with:

```bash
make new_conda_environment
```

or

```bash
conda env create -f environment.yml
```
For the Docker container, I used pipenv. See `Pipfile` and `Pipfile.lock`. You can install and activate the environment with:

```bash
make new_pipenv_environment
```

or

```bash
pipenv install
# for activation
pipenv shell
```
Currently, the repository contains a single Docker file: the `Dockerfile` is used to build an image for running the model. You can build and run the image with:

```bash
make deploy
```

This will start a Flask server on port 9696.
`predict.py` is the script that runs when the container starts. It is currently configured for testing with Docker. To test locally without Docker, comment/uncomment the following lines:

```python
# MODEL_FILE = "./models/model_md=20_msl=5.bin"  # local testing without docker
MODEL_FILE = "./model_md=20_msl=5.bin"  # testing with docker
```

and run it with:

```bash
python predict.py
```
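A hedged example of what such a test request could look like (the endpoint path and the payload fields are assumptions; `predict_test.py` is the authoritative version):

```python
import requests

url = "http://localhost:9696/predict"  # assumed route of the Flask app

# Hypothetical mushroom, described with the dataset's encoded feature values
mushroom = {
    "cap-shape": "x",
    "cap-color": "n",
    "does-bruise-or-bleed": "f",
    "habitat": "d",
    "season": "a",
}

response = requests.post(url, json=mushroom)
print(response.json())
```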
The EB instance is running at fungi-classifier.eba-rpcwcrqg.eu-central-1.elasticbeanstalk.com. Try it with:

```bash
make test_deploy
```
I created a web application to classify mushrooms. The user can enter the characteristics of the mushroom.
The application will then classify the mushroom species and provide the user with full information about the species and a picture for reference.
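A minimal sketch of how such an `app.py` could be structured (the widgets and option values are assumptions, not the actual implementation):

```python
import streamlit as st

st.title("Fungi classifier")

# A few example inputs; the real app would cover all model features
cap_shape = st.selectbox("Cap shape", ["bell", "conical", "convex", "flat"])
cap_color = st.selectbox("Cap color", ["brown", "red", "white", "yellow"])
season = st.selectbox("Season", ["spring", "summer", "autumn", "winter"])

if st.button("Classify"):
    features = {"cap-shape": cap_shape, "cap-color": cap_color, "season": season}
    # The real app would encode the features and call the trained model here
    st.write("Collected features:", features)
```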
Start the app with:

```bash
streamlit run app.py
```

and open your browser at http://localhost:8501.
Docker implementation is pending.
Done.
<style>img[src$="#fungi2"] { display: block; margin: 0 auto; border-radius: 10%; width: 300px; } </style>Source: Giphy