update ml training tutorial pages, other minor fixes (#295)
Signed-off-by: cosmicBboy <[email protected]>
cosmicBboy authored Jun 15, 2021
1 parent 2f68075 commit 018bfbf
Showing 7 changed files with 104 additions and 85 deletions.
50 changes: 26 additions & 24 deletions cookbook/case_studies/ml_training/house_price_prediction/README.rst
@@ -1,15 +1,17 @@
Multiregion House Price Prediction Model Using XGBoost & Dynamic Workflows
--------------------------------------------------------------------------
A house price prediction model is used to predict a house’s price, given the required parameters. Multiregion here relates to predicting the price of a house in multiple regions.
House Price Regression
-----------------------

A house price prediction model is used to predict a house’s price, given the required inputs.
In this example we'll train a model to predict the price of a house in multiple regions.

Generally, this could be accomplished using any regression model in machine learning.

Where does Flyte fit in?
========================
- Orchestrates the machine learning pipeline
- Can cache the output state between the steps (tasks as per Flyte)
- Easier backtracking to the error source
- Provides a Rich UI (if the Flyte backend is enabled) to view and manage the pipeline
- Orchestrates the machine learning pipeline
- Can cache the output state between the steps (tasks as per Flyte)
- Easier backtracking to the error source
- Provides a Rich UI (if the Flyte backend is enabled) to view and manage the pipeline

A typical house price prediction model isn’t dynamic, but a task has to be dynamic when multiple regions are involved.

@@ -20,31 +22,31 @@ Dataset
There is no built-in dataset that could be employed to build this model. A dataset has to be created, possibly using this reference model on `Github <https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.ipynb>`__.

The dataset will have the following columns:
Price
House Size
Number of Bedrooms
Year Built
Number of Bathrooms
Number of Garage Spaces
Lot Size
- Price
- House Size
- Number of Bedrooms
- Year Built
- Number of Bathrooms
- Number of Garage Spaces
- Lot Size

Steps to Build the Machine Learning Pipeline
============================================
- Generate dataset and split it into train, validation, and test datasets
- Fit the XGBoost model on to the data
- Generate predictions
- Generate dataset and split it into train, validation, and test datasets
- Fit the XGBoost model on to the data
- Generate predictions
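
The first step above (splitting the generated data into train, validation, and test sets) can be sketched in plain Python. This is only an illustration under stated assumptions: the 60/30/10 ratios mirror the ``SPLIT_RATIOS`` constant in the example code, while ``split_rows`` is a hypothetical helper and not the ``split_data`` function from the repository, which works on DataFrames.

```python
import random
from typing import List, Sequence, Tuple

# Train, validation, test fractions (same values as SPLIT_RATIOS in the example).
SPLIT_RATIOS = [0.6, 0.3, 0.1]


def split_rows(
    rows: Sequence, ratios: Sequence[float], seed: int
) -> Tuple[List, List, List]:
    """Shuffle rows deterministically and cut them into three consecutive parts.

    Hypothetical helper for illustration only.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n = len(shuffled)
    train_end = round(ratios[0] * n)
    val_end = train_end + round(ratios[1] * n)
    # Whatever remains after the train and validation cuts becomes the test set.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = split_rows(range(1000), SPLIT_RATIOS, seed=7)
print(len(train), len(val), len(test))
```

Seeding the shuffle keeps the split reproducible, which matters when task outputs are cached and reused between runs.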

Steps to Make the Pipeline Flyte-Compatible
===========================================
- Create two Python files to segregate the house price prediction logic. One consists of the logic per region, and the other is for multiple regions
- Define a couple of helper functions that are to be used while defining Flyte tasks and workflows
- Define three Flyte tasks -- to generate and split the data, fit the model, and generate predictions. If there are multiple regions, the tasks are dynamic
- Define a workflow to call the dynamic tasks in a specified order
- Create two Python files to segregate the house price prediction logic. One consists of the logic per region, and the other is for multiple regions
- Define a couple of helper functions that are to be used while defining Flyte tasks and workflows
- Define three Flyte tasks -- to generate and split the data, fit the model, and generate predictions. If there are multiple regions, the tasks are dynamic
- Define a workflow to call the dynamic tasks in a specified order

Takeaways
=========
- An in-depth dive into dynamic workflows
- How the Flyte type-system works
- An in-depth dive into dynamic workflows
- How the Flyte type-system works

Code Walkthrough
================
================
@@ -1,6 +1,6 @@
"""
Predicting the House Price in a Region Using an XGBoost Model and Flytekit (Python)
-----------------------------------------------------------------------------------
Predicting House Price in a Region with XGBoost
------------------------------------------------
"""

# %%
@@ -13,8 +13,8 @@
# pip install xgboost

# %%
# Step 1: Importing the Libraries
# ===============================
# Importing the Libraries
# ========================
# First, import all the required libraries.
import typing

@@ -27,8 +27,8 @@
from flytekit.types.file import FlyteFile

# %%
# Step 2: Initializing the Variables
# ==================================
# Initializing the Variables
# ===========================
# Initialize the variables to be used while building the model.
NUM_HOUSES_PER_LOCATION = 1000
COLUMNS = [
@@ -44,8 +44,8 @@
SPLIT_RATIOS = [0.6, 0.3, 0.1]

# %%
# Step 3: Defining the Data Generation Functions
# ==============================================
# Defining the Data Generation Functions
# =======================================
# Define a function to generate the price of a house.
def gen_price(house) -> int:
_base_price = int(house["SQUARE_FEET"] * 150)
@@ -142,8 +142,8 @@ def split_data(


# %%
# Step 4: Task -- Generating & Splitting the Data
# ===============================================
# Task: Generating & Splitting the Data
# ======================================
# Call the previously defined helper functions to generate and split the data. Finally, return the DataFrame objects.
dataset = typing.NamedTuple(
"GenerateSplitDataOutputs",
@@ -160,8 +160,8 @@ def generate_and_split_data(number_of_houses: int, seed: int) -> dataset:


# %%
# Step 5: Task -- Training the XGBoost Model
# ==========================================
# Task: Training the XGBoost Model
# =================================
# Serialize the XGBoost model using joblib and store the model in a dat file.
model_file = typing.NamedTuple("Model", model=FlyteFile[typing.TypeVar("joblib.dat")])

@@ -186,8 +186,8 @@ def fit(loc: str, train: pd.DataFrame, val: pd.DataFrame) -> model_file:


# %%
# Step 6: Task -- Generating the Predictions
# ==========================================
# Task: Generating the Predictions
# ===================================
# Deserialize the XGBoost model using joblib and generate the predictions.
@task(cache_version="1.0", cache=True, limits=Resources(mem="600Mi"))
def predict(
@@ -208,8 +208,8 @@ def predict(


# %%
# Step 7: Workflow -- Defining the Workflow
# =========================================
# Defining the Workflow
# ======================
# Include the following three steps in the workflow:
#
# #. Generate and split the data (Step 4)
@@ -1,13 +1,13 @@
"""
Predicting the House Price in Multiple Regions Using an XGBoost Model and Flytekit (Python)
-------------------------------------------------------------------------------------------
Predicting House Price in Multiple Regions with XGBoost and Dynamic Workflows
------------------------------------------------------------------------------
In this example, you'll use the house price prediction model for one region to expand it to multiple regions.
"""

# %%
# Step 1: Importing the Libraries
# ===============================
# Importing the Libraries
# ========================
# First, import all the required libraries.
import typing

@@ -30,8 +30,8 @@
)

# %%
# Step 2: Initializing the Variables
# ==================================
# Initializing the Variables
# ===========================
# Initialize the variables to be used while building the model.
NUM_HOUSES_PER_LOCATION = 1000
COLUMNS = [
@@ -57,8 +57,8 @@
]

# %%
# Step 3: Task -- Generating & Splitting the Data for Multiple Regions
# ====================================================================
# Task: Generating & Splitting the Data for Multiple Regions
# ============================================================
# Call the previously defined helper functions to generate and split the data. Finally, return the DataFrame objects.

dataset = typing.NamedTuple(
@@ -95,9 +95,8 @@ def generate_and_split_data_multiloc(


# %%
# Step 4: Dynamic Task -- Training the XGBoost Model & Generating the Predictionsfor Multiple Regions
# ===================================================================================================
# (A "Dynamic" Task (aka Workflow) spins up internal workflows)
# Dynamic Workflow: Training the XGBoost Model & Generating the Predictions for Multiple Regions
# ===============================================================================================
#
# Fit the model to the data and generate predictions (two functionalities in a single task to make it more powerful!)
#
@@ -118,8 +117,8 @@ def parallel_fit_predict(


# %%
# Step 5: Workflow -- Defining the Workflow
# =========================================
# Defining the Workflow
# ======================
# Include the following three steps in the workflow:
#
# #. Generate and split the data (Step 3)
34 changes: 18 additions & 16 deletions cookbook/case_studies/ml_training/pima_diabetes/README.rst
@@ -1,40 +1,42 @@
PIMA Indians diabetes prediction using XGBoost
-----------------------------------------------
Diabetes Classification
------------------------

The workflow demonstrates how to train an XGBoost model. The workflow is designed for the `Pima Indian Diabetes dataset <https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.names>`__.

An example dataset is available `here <https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv>`__.

Why a Workflow?
================
One common question when you read through the example might be - is it really required to split the training of xgboost into multiple steps. The answer is COMPLICATED!, but let us try and understand what advantages and disadvantages of doing so,
A common question when reading through the example is whether it is really necessary to split the training of XGBoost into multiple steps. The answer is complicated, but let us try to understand the advantages and disadvantages of doing so.

Pros:
^^^^^

- Each task/step is standalone and can be used for various other pipelines
- Each step can be unit tested
- Data splitting, cleaning etc can be done using a more scalable system like Spark
- State is always saved between steps, so it is cheap to recover from failures, especially if caching=True
- Visibility is high
- Each task/step is standalone and can be used for various other pipelines
- Each step can be unit tested
- Data splitting, cleaning etc can be done using a more scalable system like Spark
- State is always saved between steps, so it is cheap to recover from failures, especially if caching=True
- Visibility is high

Cons:
^^^^^

- Performance for small datasets is a concern, because the intermediate data is durably stored and the state is recorded. Each step is essentially a checkpoint
- Performance for small datasets is a concern, because the intermediate data is durably stored and the state is recorded. Each step is essentially a checkpoint

Steps of the Pipeline
======================

- Step1: Gather data and split it into training and validation sets
- Step2: Fit the actual model
- Step3: Run a set of predictions on the validation set. The function is designed to be more generic, it can be used to simply predict given a set of observations (dataset)
- Step4: Calculate the accuracy score for the predictions
1. Gather data and split it into training and validation sets
2. Fit the actual model
3. Run a set of predictions on the validation set. The function is designed to be generic: it can also be used to simply predict on a given set of observations (dataset)
4. Calculate the accuracy score for the predictions
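
The last step above reduces to comparing predicted labels against true labels. A minimal sketch in plain Python follows, assuming binary 0/1 labels. ``accuracy`` is a hypothetical helper written for illustration; it is not the scoring code used in ``diabetes.py``.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


labels = [1, 0, 1, 1]
predictions = [1, 0, 0, 1]
score = accuracy(labels, predictions)
print(score)  # 0.75
```

Keeping scoring in its own step means the same function can be reused against any set of predictions, which matches the "each task is standalone" point made in the list of pros.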


Takeaways
===========

- Usage of the FlyteSchema type. The schema type allows passing a type-safe vector from one task to another. The vector is also loaded directly into a pandas DataFrame. We could use an unstructured schema (by simply omitting the column types), which allows any data to be accepted by the training algorithm.

- We pass the file as a CSV input. The file is auto-loaded.
- Usage of the FlyteSchema type. The schema type allows passing a type-safe vector from one task to another. The vector is also loaded directly into a pandas DataFrame. We could use an unstructured schema (by simply omitting the column types), which allows any data to be accepted by the training algorithm.
- We pass the file as a CSV input. The file is auto-loaded.


Walkthrough
4 changes: 2 additions & 2 deletions cookbook/case_studies/ml_training/pima_diabetes/diabetes.py
@@ -1,6 +1,6 @@
"""
Train an XGBoost model and validate it
----------------------------------------
Train and Validate a Diabetes Classification XGBoost Model
-----------------------------------------------------------
"""
import typing
18 changes: 8 additions & 10 deletions cookbook/core/control_flow/dynamics.py
@@ -1,14 +1,12 @@
"""
Dynamic Tasks
--------------
Dynamic Workflows
------------------
A workflow is typically static where the directed acyclic graph's (DAG) structure is known at compile-time. However, scenarios exist where a run-time parameter (e.g. the output of an earlier task) determines the full DAG structure.
A workflow is typically static where the directed acyclic graph's (DAG) structure is known at compile-time. However,
scenarios exist where a run-time parameter (e.g. the output of an earlier task) determines the full DAG structure.
In such cases, dynamic workflows can be used. Here's a code example that counts the common characters between any two strings.
Inputs: s1 = "Pear", s2 = "Earth"
Output: 3
Dynamic workflows can be used in such cases. Here's a code example that counts the common characters between any two
strings.
"""
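
For orientation, the computation that the docstring describes (counting the common characters between two strings) can be sketched without Flyte at all. The version below assumes that "common characters" means distinct letters present in both strings, compared case-insensitively, which gives 3 for "Pear" and "Earth" as stated in the earlier docstring. ``char_frequencies`` and ``count_common`` are hypothetical names, not the functions defined in ``dynamics.py``.

```python
from typing import List


def char_frequencies(s: str) -> List[int]:
    """Return a 26-slot list counting each letter a-z in s (case-insensitive)."""
    freq = [0] * 26
    for ch in s.lower():
        if "a" <= ch <= "z":
            freq[ord(ch) - ord("a")] += 1
    return freq


def count_common(s1: str, s2: str) -> int:
    """Count the letters that occur at least once in both strings."""
    freq1, freq2 = char_frequencies(s1), char_frequencies(s2)
    return sum(1 for a, b in zip(freq1, freq2) if a > 0 and b > 0)


print(count_common("Pear", "Earth"))  # 3
```

The loops here run over the characters of the inputs, so their length is only known once the strings are supplied. That is the property that makes the Flyte version a dynamic workflow rather than a static one.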

@@ -63,8 +61,8 @@ def derive_count(freq1: typing.List[int], freq2: typing.List[int]) -> int:
# The looping depends on the number of characters in both strings, which isn't known until run time. If the ``@task`` decorator is used to encapsulate the calls mentioned above, compilation will fail very early on because the literal values are absent.
# Therefore, the ``@dynamic`` decorator has to be used.
#
# Dynamic workflow is effectively both a task and a workflow. The key thing to note is that the ``body of tasks is run at run time and the
# body of workflows is run at compile (aka registration) time``. Essentially, this is what a dynamic workflow leverages -- it’s a workflow that is compiled at run time (the best of both worlds)!
# Dynamic workflow is effectively both a task and a workflow. The key thing to note is that the _body of tasks is run at run time and the
# body of workflows is run at compile (aka registration) time_. Essentially, this is what a dynamic workflow leverages -- it’s a workflow that is compiled at run time (the best of both worlds)!
#
# At execution (run) time, Flytekit runs the compilation step, and produces
# a ``WorkflowTemplate`` (from the dynamic workflow), which Flytekit then passes back to Flyte Propeller for further running, exactly how sub-workflows are handled.
26 changes: 22 additions & 4 deletions cookbook/docs/ml_training.rst
@@ -1,16 +1,34 @@
:nosearch:

################
ML Training
################

.. panels::
:header: text-center

.. link-button:: auto/case_studies/ml_training/pima_diabetes/index
:type: ref
:text: Diabetes Classification
:classes: btn-block stretched-link
^^^^^^^^^^^^
Train an XGBoost model on the Pima Indians Diabetes Dataset

---

.. link-button:: auto/case_studies/ml_training/house_price_prediction/index
:type: ref
:text: House Price Regression
:classes: btn-block stretched-link
^^^^^^^^^^^^
Use dynamic workflows to train a multiregion house price prediction model.


.. toctree::
:maxdepth: -1
:caption: Contents
:hidden:

auto/case_studies/ml_training/pima_diabetes/index
auto/case_studies/ml_training/house_price_prediction/index

.. admonition:: Coming Soon!

Data Parallel Training, Distributed Training, and Single Node Training
.. TODO: write tutorials for data parallel training, distributed training, and single node training
