update ml training tutorial pages, other minor fixes (#295)

Signed-off-by: cosmicBboy <[email protected]>
flyteorg · Jun 15, 2021 · commit 018bfbf
1 parent 2f68075

Showing 7 changed files with 104 additions and 85 deletions.
diff --git a/cookbook/case_studies/ml_training/house_price_prediction/README.rst b/cookbook/case_studies/ml_training/house_price_prediction/README.rst
@@ -1,15 +1,17 @@
-Multiregion House Price Prediction Model Using XGBoost & Dynamic Workflows
---------------------------------------------------------------------------
-A house price prediction model is used to predict a house’s price, given the required parameters. Multiregion here relates to predicting the price of a house in multiple regions.
+House Price Regression
+-----------------------
+
+A house price prediction model is used to predict a house’s price, given the required inputs.
+In this example we'll train a model to predict the price of a house in multiple regions.
 
 Generally, this could be accomplished using any regression model in machine learning.
 
 Where does Flyte fit in?
 ========================
- - Orchestrates the machine learning pipeline
- - Can cache the output state between the steps (tasks as per Flyte)
- - Easier backtracking to the error source
- - Provides a Rich UI (if the Flyte backend is enabled) to view and manage the pipeline
+- Orchestrates the machine learning pipeline
+- Can cache the output state between the steps (tasks as per Flyte)
+- Easier backtracking to the error source
+- Provides a Rich UI (if the Flyte backend is enabled) to view and manage the pipeline
 
 A typical house price prediction model isn’t dynamic, but a task has to be dynamic when multiple regions are involved. 
 
@@ -20,31 +22,31 @@ Dataset
 There is no built-in dataset that could be employed to build this model. A dataset has to be created, possibly using this reference model on `Github <https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.ipynb>`__.
 
 The dataset will have the following columns:
-Price
-House Size
-Number of Bedrooms
-Year Built
-Number of Bathrooms
-Number of Garage Spaces
-Lot Size
+- Price
+- House Size
+- Number of Bedrooms
+- Year Built
+- Number of Bathrooms
+- Number of Garage Spaces
+- Lot Size
 
 Steps to Build the Machine Learning Pipeline
 ============================================
- - Generate dataset and split it into train, validation, and test datasets 
- - Fit the XGBoost model on to the data
- - Generate predictions 
+- Generate dataset and split it into train, validation, and test datasets
+- Fit the XGBoost model to the data
+- Generate predictions
 
 Steps to Make the Pipeline Flyte-Compatible
 ===========================================
- - Create two Python files to segregate the house price prediction logic. One consists of the logic per region, and the other is for multiple regions
- - Define a couple of helper functions that are to be used while defining Flyte tasks and workflows
- - Define three Flyte tasks -- to generate and split the data, fit the model, and generate predictions. If there are multiple regions, the tasks are dynamic
- - Define a workflow to call the dynamic tasks in a specified order
+- Create two Python files to segregate the house price prediction logic. One consists of the logic per region, and the other is for multiple regions
+- Define a couple of helper functions that are to be used while defining Flyte tasks and workflows
+- Define three Flyte tasks -- to generate and split the data, fit the model, and generate predictions. If there are multiple regions, the tasks are dynamic
+- Define a workflow to call the dynamic tasks in a specified order
 
 Takeaways
 =========
- - An in-depth dive into dynamic workflows
- - How the Flyte type-system works
+- An in-depth dive into dynamic workflows
+- How the Flyte type-system works
 
 Code Walkthrough
-================
+================
diff --git a/cookbook/case_studies/ml_training/house_price_prediction/house_price_predictor.py b/cookbook/case_studies/ml_training/house_price_prediction/house_price_predictor.py
@@ -1,6 +1,6 @@
 """
-Predicting the House Price in a Region Using an XGBoost Model and Flytekit (Python)
------------------------------------------------------------------------------------
+Predicting House Price in a Region with XGBoost
+------------------------------------------------
 """
 
 # %%
@@ -13,8 +13,8 @@
 #       pip install xgboost
 
 # %%
-# Step 1: Importing the Libraries
-# ===============================
+# Importing the Libraries
+# ========================
 # First, import all the required libraries.
 import typing
 
@@ -27,8 +27,8 @@
 from flytekit.types.file import FlyteFile
 
 # %%
-# Step 2: Initializing the Variables
-# ==================================
+# Initializing the Variables
+# ===========================
 # Initialize the variables to be used while building the model.
 NUM_HOUSES_PER_LOCATION = 1000
 COLUMNS = [
@@ -44,8 +44,8 @@
 SPLIT_RATIOS = [0.6, 0.3, 0.1]
 
 # %%
-# Step 3: Defining the Data Generation Functions
-# ==============================================
+# Defining the Data Generation Functions
+# =======================================
 # Define a function to generate the price of a house.
 def gen_price(house) -> int:
     _base_price = int(house["SQUARE_FEET"] * 150)
@@ -142,8 +142,8 @@ def split_data(
 
 
 # %%
-# Step 4: Task -- Generating & Splitting the Data
-# ===============================================
+# Task: Generating & Splitting the Data
+# ======================================
 # Call the previously defined helper functions to generate and split the data. Finally, return the DataFrame objects.
 dataset = typing.NamedTuple(
     "GenerateSplitDataOutputs",
@@ -160,8 +160,8 @@ def generate_and_split_data(number_of_houses: int, seed: int) -> dataset:
 
 
 # %%
-# Step 5: Task -- Training the XGBoost Model
-# ==========================================
+# Task: Training the XGBoost Model
+# =================================
 # Serialize the XGBoost model using joblib and store the model in a dat file.
 model_file = typing.NamedTuple("Model", model=FlyteFile[typing.TypeVar("joblib.dat")])
 
@@ -186,8 +186,8 @@ def fit(loc: str, train: pd.DataFrame, val: pd.DataFrame) -> model_file:
 
 
 # %%
-# Step 6: Task -- Generating the Predictions
-# ==========================================
+# Task: Generating the Predictions
+# ===================================
 # Unserialize the XGBoost model using joblib and generate the predictions.
 @task(cache_version="1.0", cache=True, limits=Resources(mem="600Mi"))
 def predict(
@@ -208,8 +208,8 @@ def predict(
 
 
 # %%
-# Step 7: Workflow -- Defining the Workflow
-# =========================================
+# Defining the Workflow
+# ======================
 # Include the following three steps in the workflow:
 #
 # #. Generate and split the data (Step 4)
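
For readers following the diff, the role of ``SPLIT_RATIOS = [0.6, 0.3, 0.1]`` can be illustrated with a minimal, standard-library sketch. This is not the tutorial's ``split_data`` implementation (its body is not shown in this commit); ``split_rows`` is a hypothetical helper, and the mapping of the three ratios to train, validation, and test is an assumption based on the README's step list.

```python
import random
import typing

# Values taken from the tutorial; the train/validation/test order is assumed.
NUM_HOUSES_PER_LOCATION = 1000
SPLIT_RATIOS = [0.6, 0.3, 0.1]

Rows = typing.List[typing.Dict[str, int]]


def split_rows(
    rows: Rows, ratios: typing.List[float], seed: int
) -> typing.Tuple[Rows, Rows, Rows]:
    """Hypothetical helper: shuffle the rows, then cut three consecutive slices."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = round(n * ratios[0])
    val_end = train_end + round(n * ratios[1])
    # The test slice takes whatever remains, so no row is dropped.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Stand-in data: one dict per house with a single made-up feature.
houses = [{"SQUARE_FEET": 1000 + i} for i in range(NUM_HOUSES_PER_LOCATION)]
train, val, test = split_rows(houses, SPLIT_RATIOS, seed=7)
print(len(train), len(val), len(test))
```

Passing the seed explicitly keeps the split reproducible, which matters when a task's output is cached.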

diff --git a/...book/case_studies/ml_training/house_price_prediction/multiregion_house_price_predictor.py b/...book/case_studies/ml_training/house_price_prediction/multiregion_house_price_predictor.py
@@ -1,13 +1,13 @@
 """
-Predicting the House Price in Multiple Regions Using an XGBoost Model and Flytekit (Python)
--------------------------------------------------------------------------------------------
+Predicting House Price in Multiple Regions with XGBoost and Dynamic Workflows
+------------------------------------------------------------------------------
 
 In this example, you'll use the house price prediction model for one region to expand it to multiple regions. 
 """
 
 # %%
-# Step 1: Importing the Libraries
-# ===============================
+# Importing the Libraries
+# ========================
 # First, import all the required libraries.
 import typing
 
@@ -30,8 +30,8 @@
     )
 
 # %%
-# Step 2: Initializing the Variables
-# ==================================
+# Initializing the Variables
+# ===========================
 # Initialize the variables to be used while building the model.
 NUM_HOUSES_PER_LOCATION = 1000
 COLUMNS = [
@@ -57,8 +57,8 @@
 ]
 
 # %%
-# Step 3: Task -- Generating & Splitting the Data for Multiple Regions
-# ====================================================================
+# Task: Generating & Splitting the Data for Multiple Regions
+# ============================================================
 # Call the previously defined helper functions to generate and split the data. Finally, return the DataFrame objects.
 
 dataset = typing.NamedTuple(
@@ -95,9 +95,8 @@ def generate_and_split_data_multiloc(
 
 
 # %%
-# Step 4: Dynamic Task -- Training the XGBoost Model & Generating the Predictionsfor Multiple Regions
-# ===================================================================================================
-# (A "Dynamic" Task (aka Workflow) spins up internal workflows)
+# Dynamic Workflow: Training the XGBoost Model & Generating the Predictions for Multiple Regions
+# ===============================================================================================
 #
 # Fit the model to the data and generate predictions (two functionalities in a single task to make it more powerful!)
 #
@@ -118,8 +117,8 @@ def parallel_fit_predict(
 
 
 # %%
-# Step 5: Workflow -- Defining the Workflow
-# =========================================
+# Defining the Workflow
+# ======================
 # Include the following three steps in the workflow:
 #
 # #. Generate and split the data (Step 3)

diff --git a/cookbook/case_studies/ml_training/pima_diabetes/README.rst b/cookbook/case_studies/ml_training/pima_diabetes/README.rst
@@ -1,40 +1,42 @@
-PIMA Indians diabetes prediction using XGBoost
------------------------------------------------
+Diabetes Classification
+------------------------
+
 The workflow demonstrates how to train an XGBoost model. The workflow is designed for the `Pima Indian Diabetes dataset <https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.names>`__.
 
 An example dataset is available `here <https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv>`__.
 
 Why a Workflow?
 ================
-One common question when you read through the example might be - is it really required to split the training of xgboost into multiple steps. The answer is COMPLICATED!, but let us try and understand what advantages and disadvantages of doing so,
+A common question when reading through the example is whether it is really necessary to split the training of XGBoost into multiple steps. The answer is complicated, but let us try to understand the advantages and disadvantages of doing so.
 
 Pros:
+^^^^^
 
- - Each task/step is standalone and can be used for various other pipelines
- - Each step can be unit tested
- - Data splitting, cleaning etc can be done using a more scalable system like Spark
- - State is always saved between steps, so it is cheap to recover from failures, especially if caching=True
- - Visibility is high
+- Each task/step is standalone and can be used for various other pipelines
+- Each step can be unit tested
+- Data splitting, cleaning, etc. can be done using a more scalable system like Spark
+- State is always saved between steps, so it is cheap to recover from failures, especially if ``cache=True``
+- Visibility is high
 
 Cons:
+^^^^^
 
- - Performance for small datasets is a concern. The reason is, the intermediate data is durably stored and the state recorded. Each step is essnetially a checkpoint
+- Performance for small datasets is a concern because the intermediate data is durably stored and the state is recorded. Each step is essentially a checkpoint
 
 Steps of the Pipeline
 ======================
 
- - Step1: Gather data and split it into training and validation sets
- - Step2: Fit the actual model
- - Step3: Run a set of predictions on the validation set. The function is designed to be more generic, it can be used to simply predict given a set of observations (dataset)
- - Step4: Calculate the accuracy score for the predictions
+1. Gather data and split it into training and validation sets
+2. Fit the actual model
+3. Run a set of predictions on the validation set. The function is designed to be generic; it can also be used simply to predict given a set of observations (dataset)
+4. Calculate the accuracy score for the predictions
 
 
 Takeaways
 ===========
 
- - Usage of FlyteSchema Type. Schema type allows passing a type safe vector from one task to task. The vector is also directly loaded into a pandas dataframe. We could use an unstructured Schema (By simply omiting the column types). this will allow any data to be accepted by the train algorithm.
-
- - We pass the file as a CSV input. The file is auto-loaded.
+- Usage of the FlyteSchema type. The schema type allows passing a type-safe vector from one task to another. The vector is also loaded directly into a pandas DataFrame. We could use an unstructured schema (by simply omitting the column types), which would allow the train algorithm to accept any data.
+- We pass the file as a CSV input. The file is auto-loaded.
 
 
 Walkthrough
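
The last pipeline step listed in the README, calculating the accuracy score, and the claim that each step can be unit tested, can be sketched in plain Python. This is an illustration only: the tutorial's own task may well rely on a library function instead, and ``accuracy`` below is a hypothetical helper, not code from ``diabetes.py``.

```python
import typing


def accuracy(predictions: typing.List[int], labels: typing.List[int]) -> float:
    """Hypothetical helper: fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# A standalone step like this can be checked without running the whole pipeline.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)
```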

diff --git a/cookbook/case_studies/ml_training/pima_diabetes/diabetes.py b/cookbook/case_studies/ml_training/pima_diabetes/diabetes.py
@@ -1,6 +1,6 @@
 """
-Train an XGBoost model and validate it
-----------------------------------------
+Train and Validate a Diabetes Classification XGBoost Model
+-----------------------------------------------------------
 
 """
 import typing

diff --git a/cookbook/core/control_flow/dynamics.py b/cookbook/core/control_flow/dynamics.py
@@ -1,14 +1,12 @@
 """
-Dynamic Tasks
---------------
+Dynamic Workflows
+------------------
 
-A workflow is typically static where the directed acyclic graph's (DAG) structure is known at compile-time. However, scenarios exist where a run-time parameter (e.g. the output of an earlier task) determines the full DAG structure.
+A workflow is typically static where the directed acyclic graph's (DAG) structure is known at compile-time. However,
+scenarios exist where a run-time parameter (e.g. the output of an earlier task) determines the full DAG structure.
 
-In such cases, dynamic workflows can be used. Here's a code example that counts the common characters between any two strings.
-
-Inputs: s1 = "Pear", s2 = "Earth"
-
-Output: 3
+Dynamic workflows can be used in such cases. Here's a code example that counts the common characters between any two
+strings.
 
 """
 
@@ -63,8 +61,8 @@ def derive_count(freq1: typing.List[int], freq2: typing.List[int]) -> int:
 # The looping is dependent on the number of characters of both the strings which aren't known until the run time. If the ``@task`` decorator is used to encapsulate the calls mentioned above, the compilation will fail very early on due to the absence of the literal values.
 # Therefore, ``@dynamic`` decorator has to be used.
 #
-# Dynamic workflow is effectively both a task and a workflow. The key thing to note is that the ``body of tasks is run at run time and the
-# body of workflows is run at compile (aka registration) time``. Essentially, this is what a dynamic workflow leverages -- it’s a workflow that is compiled at run time (the best of both worlds)!
+# A dynamic workflow is effectively both a task and a workflow. The key thing to note is that the *body of tasks is run at run time and the
+# body of workflows is run at compile (aka registration) time*. Essentially, this is what a dynamic workflow leverages -- it’s a workflow that is compiled at run time (the best of both worlds)!
 #
 # At execution (run) time, Flytekit runs the compilation step, and produces
 # a ``WorkflowTemplate`` (from the dynamic workflow), which Flytekit then passes back to Flyte Propeller for further running, exactly how sub-workflows are handled.
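
The docstring example removed in this commit (``s1 = "Pear"``, ``s2 = "Earth"``, output ``3``) can be reproduced with a plain-Python sketch of the counting logic, without Flyte decorators. Only the ``derive_count`` signature appears in the diff; its body here, the ``char_frequencies`` and ``count_common_characters`` helpers, and the case-insensitive comparison are assumptions made for illustration.

```python
import typing


def char_frequencies(s: str) -> typing.List[int]:
    """Hypothetical helper: occurrences of each letter a-z, ignoring case."""
    freq = [0] * 26
    for ch in s.lower():
        if "a" <= ch <= "z":
            freq[ord(ch) - ord("a")] += 1
    return freq


def derive_count(freq1: typing.List[int], freq2: typing.List[int]) -> int:
    """Assumed body: per letter, count the occurrences both strings share."""
    return sum(min(a, b) for a, b in zip(freq1, freq2))


def count_common_characters(s1: str, s2: str) -> int:
    """Hypothetical driver; the tutorial wraps this logic in a dynamic workflow."""
    return derive_count(char_frequencies(s1), char_frequencies(s2))


print(count_common_characters("Pear", "Earth"))  # 3 (shared letters: e, a, r)
```

In the Flyte version, the loops over the characters depend on input values that are only known at run time, which is why the tutorial uses ``@dynamic`` for the driver.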

diff --git a/cookbook/docs/ml_training.rst b/cookbook/docs/ml_training.rst
@@ -1,16 +1,34 @@
-:nosearch:
-
 ################
 ML Training
 ################
 
+.. panels::
+    :header: text-center
+
+    .. link-button:: auto/case_studies/ml_training/pima_diabetes/index
+       :type: ref
+       :text: Diabetes Classification
+       :classes: btn-block stretched-link
+    ^^^^^^^^^^^^
+    Train an XGBoost model on the Pima Indians Diabetes Dataset.
+
+    ---
+
+    .. link-button:: auto/case_studies/ml_training/house_price_prediction/index
+       :type: ref
+       :text: House Price Regression
+       :classes: btn-block stretched-link
+    ^^^^^^^^^^^^
+    Use dynamic workflows to train a multiregion house price prediction model.
+
+
 .. toctree::
     :maxdepth: -1
     :caption: Contents
+    :hidden:
 
     auto/case_studies/ml_training/pima_diabetes/index
     auto/case_studies/ml_training/house_price_prediction/index
 
-.. admonition:: Coming Soon!
 
-    Data Parallel Training, Distributed Training, and Single Node Training 
+.. TODO: write tutorials for data parallel training, distributed training, and single node training