04.11.2024 - 26.11.2024
Author: Till Meineke
<style>img[src$="#fungi"] { display: block; margin: 0 auto; border-radius: 10%; width: 300px; } </style>source: Giphy
> [!IMPORTANT]
> Work in progress. (For questions, bugs, hints, and improvements, just email me.)
Missing parts:

- full and clean EDA
  - feature encoding
  - ranges of values
  - missing values
  - analysis of target variable
  - feature importance analysis
- models
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - XGBoost
  - LightGBM
  - CatBoost
- model selection and tuning / MLflow?
- test reproducibility
- fix model deployment / styling
- web application styling
- update README with new information, images and videos
You can rate this version. Basic functionality is working.

You can test the running EB instance with `make test_deploy` or, in the provided conda environment, with `python predict_test.py`.

I made a video of the local deployment with Docker (`make deploy`) and of testing it with `make test_deploy`.
Walking through the woods collecting mushrooms can be a fun activity. However, it can also be dangerous if you don't know which mushrooms are edible. The goal of this project is to build and deploy a model that predicts which mushroom species you picked based on a few simple characteristics.
To better understand the problem, I will use the Classification Mushroom Data 2020 dataset. Its primary data describes 173 mushroom species, which can be used for simulating hypothetical mushrooms. Since the provided secondary data contains 61,069 hypothetical mushrooms for binary classification but without species names, I have to generate simulated data that includes the names. This will ensure that the generated dataset is of high quality and relevant for the task I am attempting to solve.
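For a quick sanity check of the generated file, something like the following can be used (a hypothetical snippet: pandas is assumed, the path matches the project tree below, and the column name `name` is an assumption):

```python
import pandas as pd

# The generator writes semicolon-separated values (see write_to_csv further below)
df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";")

print(df.shape)              # should report 61,069 rows
print(df["name"].nunique())  # should report 173 distinct species names
```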
```
.
├── data
│   ├── raw
│   │   ├── primary_data_edited.csv              <-- Raw data from paper
│   │   ├── primary_data_meta.txt                <-- Raw data from paper (description)
│   │   ├── secondary_data_generated.csv         <-- Raw data from paper
│   │   └── secondary_data_meta.txt              <-- Raw data from paper (description)
│   └── secondary_data_generated_with_names.csv  <-- Generated data
│
├── images                    <-- Images for readme and "Learning in public"
│
├── models
│   └── model_md=20_msl=5.bin <-- Trained model
│
├── notebooks
│   └── 01_eda.ipynb          <-- Exploratory data analysis
│
├── references
│   ├── 'Collins Mushroom Miscellany.epub'  <-- Book with mushroom images
│   ├── 'Mushroom data creation.pdf'        <-- Main paper for creating mushroom data
│   ├── 'Mushroom data creation_sup.pdf'    <-- Supplementary material
│   └── mushrooms-collins-gem.pdf           <-- Book with mushroom images
│
├── src                       <-- Source code for use in this project
│   ├── services
│   │   ├── Images            <-- Images from book
│   │   ├── Text              <-- Text from book
│   │   └── rename_images.py  <-- Script to rename images
│   ├── __init__.py           <-- Python package initializer file
│   ├── data_cat.py           <-- Script to categorize data
│   ├── gen_corr_norm.py      <-- Script to generate correlated and normalized data
│   ├── mushroom_class_fix.py
│   ├── primary_data_gen.py
│   ├── secondary_data_gen.py <-- Modified script to generate secondary data
│   ├── stats_graphics.py
│   ├── text_attr_match.py
│   └── util_func.py
│
├── .dockerignore             <-- Docker ignore file
├── .gitignore                <-- Git ignore file
├── Dockerfile                <-- Docker file
├── environment.yml           <-- Conda environment file
├── LICENSE
├── Makefile
├── Pipfile                   <-- Pipenv file
├── Pipfile.lock              <-- Pipenv lock file
├── predict.py                <-- Prediction script
├── predict_test.py           <-- Prediction test script
├── README.md                 <-- The file you are currently reading
└── train.py                  <-- Training script
```
Working with synthetic data, I was asked to consider the following points:
- Clearly document how you generated the synthetic dataset and the reasoning behind its design.
- Provide sufficient context about the dataset and the model you are building for your peers who will review your project.
In the repository I found several scripts belonging to the paper "Mushroom data creation" by D. Wagner et al. (2020). The scripts are written in Python and are used to generate the secondary data. I had to modify them so that the generated data includes the species names; for example, `write_to_csv` now writes the family and name columns as well:
```python
import data_cat  # repo module providing the dataset header constants


def write_to_csv(file_name, funghi_entry_list, use_intervals):
    """
    Parameters
    ----------
    file_name: str
        name of the written csv file
    funghi_entry_list: list of FunghiEntry
        list of mushrooms, each element corresponding to one simulated mushroom
    use_intervals: bool
        uses the interval borders as values for the metrical attributes
        instead of a simulated float value

    Functionality
    ------------
    writes each simulated mushroom as a line in a csv file
    """
    # "with" ensures the file is closed even if a write fails
    with open(file_name, "w") as file:
        if not use_intervals:
            # file.write(data_cat.PRIMARY_DATASET_HEADER.replace("family;name;", "") + "\n")
            file.write(data_cat.PRIMARY_DATASET_HEADER + "\n")
        else:
            file.write(data_cat.DATASET_HEADER_MIN_MAX.replace("name;", "") + "\n")
        for funghi_entry in funghi_entry_list:
            # funghi_str = funghi_entry.is_edible
            funghi_str = funghi_entry.family + ";" + funghi_entry.name + ";" + funghi_entry.is_edible
            for category in funghi_entry.categories:
                funghi_str += ";" + str(category)
            file.write(funghi_str + "\n")
```
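A quick usage sketch (the `FunghiEntry` stand-in and its attribute values are purely illustrative, not taken from the paper's code):

```python
# Illustrative stand-in for the paper's FunghiEntry class
class FunghiEntry:
    def __init__(self, family, name, is_edible, categories):
        self.family = family
        self.name = name
        self.is_edible = is_edible
        self.categories = categories


entries = [FunghiEntry("Amanitaceae", "Fly Agaric", "p", ["x", "o", "r"])]
write_to_csv("demo.csv", entries, use_intervals=False)
# demo.csv now holds the header line followed by "Amanitaceae;Fly Agaric;p;x;o;r"
```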
The result is a synthetic dataset of 61,069 hypothetical mushrooms covering 173 named species, with the following features:
- `family`
- `class`
- `cap-diameter`
- `cap-shape`
- `cap-surface`
- `cap-color`
- `does-bruise-or-bleed`
- `gill-attachment`
- `gill-spacing`
- `gill-color`
- `stem-height`
- `stem-width`
- `stem-root`
- `stem-surface`
- `stem-color`
- `veil-type`
- `veil-color`
- `has-ring`
- `ring-type`
- `spore-print-color`
- `habitat`
- `season`
You can find the EDA in this notebook (WIP) and in this improved second notebook (WIP).
Since we want a simple model with few features to predict the name of the mushroom, we will drop the features with missing values for our first round of modelling.
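A minimal sketch of that first pass (the semicolon separator is taken from the generator above; the real selection lives in the notebooks):

```python
import pandas as pd

df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";")

# First modelling round: keep only the columns without any missing values
df_simple = df.dropna(axis=1)
print(f"kept {df_simple.shape[1]} of {df.shape[1]} columns")
```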
I trained a Logistic Regression model and a Decision Tree model and evaluated them with cross-validation.
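In spirit, the evaluation looks like this sketch. The target column and the hyperparameters are assumptions (the model filename `model_md=20_msl=5.bin` hints at `max_depth=20` and `min_samples_leaf=5` for the tree); `train.py` holds the actual setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("data/secondary_data_generated_with_names.csv", sep=";").dropna(axis=1)

# Assumed setup: predict the species name from one-hot encoded features
X = pd.get_dummies(df.drop(columns=["name"]))
y = df["name"]

for model in (
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=20, min_samples_leaf=5),
):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(float(scores.mean()), 3))
```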
You can train the model with:

```bash
make train
```

or in your conda environment with:

```bash
python train.py
```

or with pipenv (if the environment is not activated):

```bash
pipenv run python "./train.py"
```

You can also activate the environment first and then run the script:

```bash
pipenv shell
python train.py
```
See `train.py`. The model is saved in the `models` folder.
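Assuming `train.py` pickles the fitted model (a guess based on the `.bin` extension), it can be loaded back like this:

```python
import pickle

# Assumption: the .bin file is a pickled scikit-learn model
with open("models/model_md=20_msl=5.bin", "rb") as f_in:
    model = pickle.load(f_in)
```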
Preferably, use the make commands (from the `Makefile`) or run the scripts directly from `src`.
Refer to the section below for descriptions of the make commands. Before running them, consider creating a virtual environment. Try out the make commands (see `make help`).
> [!IMPORTANT]
> `make grow_fungi` will overwrite the generated data in `data/secondary_data_generated_with_names.csv`.
Development was done in the conda environment defined in `./environment.yml`. Install it with:

```bash
make new_conda_environment
```

or

```bash
conda env create -f environment.yml
```
For the Docker container, I used pipenv. See `Pipfile` and `Pipfile.lock`. You can install and activate the environment with:

```bash
make new_pipenv_environment
```

or

```bash
pipenv install
# for activation
pipenv shell
```
Currently, the repository contains a single Docker file: the `Dockerfile` is used to build an image for running the model. You can build and run the image with:

```bash
make deploy
```

This will start a Flask server on port 9696.
`predict.py` is the script that runs when the container starts. It is currently configured for testing with Docker. To test locally without Docker, comment/uncomment the following lines:

```python
# MODEL_FILE = "./models/model_md=20_msl=5.bin"  # local testing without docker
MODEL_FILE = "./model_md=20_msl=5.bin"  # testing with docker
```

and run it with:

```bash
python predict.py
```
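A hedged example of what such a test request could look like (the endpoint path and the payload fields are assumptions; `predict_test.py` is the authoritative version):

```python
import requests

url = "http://localhost:9696/predict"  # assumed route of the Flask app

# Hypothetical mushroom, described with the dataset's encoded feature values
mushroom = {
    "cap-shape": "x",
    "cap-color": "n",
    "does-bruise-or-bleed": "f",
    "habitat": "d",
    "season": "a",
}

response = requests.post(url, json=mushroom)
print(response.json())
```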
The EB instance is running at fungi-classifier.eba-rpcwcrqg.eu-central-1.elasticbeanstalk.com. Try it with:

```bash
make test_deploy
```
I created a web application to classify mushrooms. The user can enter the characteristics of the mushroom.
The application will then classify the mushroom species and provide the user with full information about the species and a picture for reference.
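A minimal sketch of how such an `app.py` could be structured (the widgets and option values are assumptions, not the actual implementation):

```python
import streamlit as st

st.title("Fungi classifier")

# A few example inputs; the real app would cover all model features
cap_shape = st.selectbox("Cap shape", ["bell", "conical", "convex", "flat"])
cap_color = st.selectbox("Cap color", ["brown", "red", "white", "yellow"])
season = st.selectbox("Season", ["spring", "summer", "autumn", "winter"])

if st.button("Classify"):
    features = {"cap-shape": cap_shape, "cap-color": cap_color, "season": season}
    # The real app would encode the features and call the trained model here
    st.write("Collected features:", features)
```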
Start the app with:

```bash
streamlit run app.py
```

and open your browser at http://localhost:8501.
Docker implementation is pending.
Done.
<style>img[src$="#fungi2"] { display: block; margin: 0 auto; border-radius: 10%; width: 300px; } </style>Source: Giphy