This repository was archived by the owner on Jan 22, 2025. It is now read-only.

Enhance Data Preprocessing and Feature Engineering Pipeline #149

Open
sanchitc05 opened this issue Oct 15, 2024 · 0 comments

sanchitc05 commented Oct 15, 2024

Description:

The current data preprocessing pipeline in the Predictive Calc repository can be improved to streamline and automate common data preparation tasks, leading to more efficient model training and improved performance.

I would like to contribute by enhancing the data preprocessing pipeline in the following ways:

  1. Handle Missing Data Efficiently:

    • Implement methods to handle missing data more robustly, such as:
      • Mean/Median/Mode imputation for numerical data.
      • Forward/backward fill for time-series data.
      • More advanced techniques like KNN imputation.
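For example, a rough sketch of these imputation options (the column names and toy data here are just placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":  [25.0, np.nan, 32.0, 40.0],   # numerical column
    "temp": [20.1, 20.3, np.nan, 20.2],   # time-series style reading
    "city": ["NY", "LA", np.nan, "NY"],   # categorical column
})

# Mean imputation for numerical data
df["age"] = df["age"].fillna(df["age"].mean())

# Forward fill for time-series data
df["temp"] = df["temp"].ffill()

# Mode imputation for categorical data
df["city"] = df["city"].fillna(df["city"].mode()[0])

# KNN imputation as the more advanced alternative (numeric columns only)
numeric = KNNImputer(n_neighbors=2).fit_transform(df[["age", "temp"]])
```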
  2. Outlier Detection and Handling:

    • Introduce automated detection of outliers using techniques such as:
      • Z-score or IQR (Interquartile Range) methods.
    • Implement an option to either remove or cap the outliers to reduce their impact on model performance.
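A possible shape for the IQR-based detector with the remove/cap switch (the function name and `k` multiplier are my suggestions, not existing code):

```python
import pandas as pd

def handle_outliers(s: pd.Series, method: str = "cap", k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and cap or remove them."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    if method == "cap":
        # Clamp extreme values to the IQR fence
        return s.clip(lower, upper)
    # method == "remove": keep only in-range values
    return s[(s >= lower) & (s <= upper)]

values = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier
capped = handle_outliers(values, method="cap")
removed = handle_outliers(values, method="remove")
```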
  3. Feature Scaling:

    • Add standardized feature scaling methods such as:
      • Min-Max scaling.
      • Standardization (z-score normalization).
    • This will ensure that features with large magnitudes do not dominate others during model training, especially for algorithms like logistic regression, SVM, or neural networks.
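The two scaling options side by side, on synthetic data for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Second feature is 200x the first in magnitude
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Min-Max scaling: maps each feature into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score): zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)
```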
  4. Automated Feature Engineering:

    • Introduce automated feature engineering steps such as:
      • One-hot encoding for categorical variables.
      • Polynomial features to introduce non-linear interactions between variables.
      • Feature extraction techniques like Principal Component Analysis (PCA) for dimensionality reduction and better model performance.
  5. Pipeline Automation:

    • Build a robust pipeline using scikit-learn’s Pipeline functionality to automate these steps, ensuring consistency and reproducibility across different datasets and models.
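A minimal end-to-end sketch of how these steps could compose via Pipeline and ColumnTransformer (the column names, toy data, and logistic-regression estimator are placeholders, not actual repository code):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Separate imputation + transform branches per column type
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression()),
])

# Toy data to show the pipeline runs end to end
df = pd.DataFrame({
    "age":    [25, np.nan, 32, 40],
    "income": [50_000, 60_000, np.nan, 80_000],
    "city":   ["NY", "LA", np.nan, "NY"],
})
y = [0, 1, 0, 1]
model.fit(df, y)
preds = model.predict(df)
```

Because the same fitted pipeline object handles imputation, scaling, and encoding, train/test preprocessing stays consistent and the whole thing is reusable across datasets.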

These improvements will not only streamline the workflow but also enhance the accuracy and robustness of the machine-learning models by improving the quality of input data. I propose creating a modular, reusable pipeline that can be adapted to different datasets and use cases.

Tech Stack:

  • Python: pandas, scikit-learn
  • Jupyter Notebooks for testing and experimentation
  • pandas-profiling for exploratory data analysis (if needed)

Would love to get assigned to this and start working on the implementation!
