This lab demonstrates the integrated, end-to-end Azure Machine Learning and Azure Cognitive Services experience in Azure Synapse Analytics. You will learn how to connect an Azure Synapse Analytics workspace to an Azure Machine Learning workspace using a Linked Service and then trigger an Automated ML experiment that uses data from a Spark table. You will also learn how to use trained models from Azure Machine Learning or Azure Cognitive Services to enrich data in a SQL pool table and then serve prediction results using Power BI.
After completing the lab, you will understand the main steps of an end-to-end Machine Learning process that builds on top of the integration between Azure Synapse Analytics and Azure Machine Learning.
This lab has the following structure:
- Lab 01 - The Integrated Machine Learning Process in Synapse Analytics
- Before the hands-on lab
- Exercise 1 - Create an Azure Machine Learning linked service
- Exercise 2 - Trigger an Auto ML experiment using data from a Spark table
- Exercise 3 - Enrich data using trained models
- Exercise 4 - Serve prediction results using Power BI
- After the hands-on lab
- Resources
- Report issues
Before stepping through the exercises in this lab, make sure you have properly configured your Azure Synapse Analytics workspace. Perform the tasks below to configure the workspace.
NOTE
If you have already created and configured the Synapse Analytics workspace while running one of the other labs available in this repo, you do not need to perform this task again and can move on to the next task. The labs are designed to share the Synapse Analytics workspace, so you only need to create it once.
Follow the instructions in Deploy your Azure Synapse Analytics workspace to create and configure the workspace.
Follow the instructions in Deploy resources for Lab 01 to deploy additional resources for this lab. Once deployment is complete, you are ready to proceed with the exercises in this lab.
In this exercise, you will create and configure an Azure Machine Learning linked service in Synapse Studio. Once the linked service is available, you will explore the Azure Machine Learning integration features in Synapse Studio.
The Synapse Analytics linked service authenticates with Azure Machine Learning using a service principal. The service principal is based on an Azure Active Directory application named **Azure Synapse Analytics GA Labs** and has already been created for you by the deployment procedure. The secret associated with the service principal has also been created and saved in the Azure Key Vault instance, under the name `ASA-GA-LABS`.
NOTE
In the labs provided by this repo, the Azure AD application is used in a single Azure AD tenant, which means it has exactly one service principal associated with it. Consequently, we will use the terms Azure AD application and service principal interchangeably. For a detailed explanation of Azure AD applications and service principals, see Application and service principal objects in Azure Active Directory.
To view the service principal, open the Azure portal and navigate to your instance of Azure Active Directory. Select the **App registrations** section and you should see the **Azure Synapse Analytics GA Labs** application under the **Owned applications** tab.

Select the application to view its properties and copy the value of the **Application (client) ID** property (you will need it in a moment to configure the linked service).
To view the secret, open the Azure Portal and navigate to the Azure Key Vault instance that has been created in your resource group. Select the **Secrets** section and you should see the `ASA-GA-LABS` secret:
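If you prefer to verify the secret programmatically, here is a minimal sketch (not part of the lab steps), assuming the `azure-identity` and `azure-keyvault-secrets` packages and an identity with `Get` permission on secrets; the vault name follows the lab's `asagakeyvault<unique_suffix>` convention:

```python
# Minimal verification sketch; the vault URL is an assumption based on the
# lab's naming convention.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://asagakeyvault<unique_suffix>.vault.azure.net"
client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

secret = client.get_secret("ASA-GA-LABS")
print(secret.name)  # avoid printing secret.value outside a secure context
```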
First, you need to make sure the service principal has permissions to work with the Azure Machine Learning workspace. Open the Azure Portal and navigate to the Azure Machine Learning workspace that has been created in your resource group. Select the **Access control (IAM)** section on the left, then select **+ Add** and **Add role assignment**. In the **Add role assignment** dialog, select the **Contributor** role, select the **Azure Synapse Analytics GA Labs** service principal, and then select **Save**.
You are now ready to create the Azure Machine Learning linked service.
To create a new linked service, open your Synapse workspace, open Synapse Studio, select the **Manage** hub, select **Linked services**, and then select **+ New**. In the search field of the **New linked service** dialog, enter `Azure Machine Learning`. Select the **Azure Machine Learning** option and then select **Continue**.
In the **New linked service (Azure Machine Learning)** dialog, provide the following properties:

- **Name**: enter `asagamachinelearning01`.
- **Azure subscription**: make sure the Azure subscription containing your resource group is selected.
- **Azure Machine Learning workspace name**: make sure your Azure Machine Learning workspace is selected.
- Notice how **Tenant identifier** has already been filled in for you.
- **Service principal ID**: enter the application client ID that you copied earlier.
- Select the **Azure Key Vault** option.
- **AKV linked service**: make sure your Azure Key Vault service is selected.
- **Secret name**: enter `ASA-GA-LABS`.
Next, select **Test connection** to make sure all settings are correct, and then select **Create**. The Azure Machine Learning linked service will now be created in the Synapse Analytics workspace.
IMPORTANT

The linked service is not complete until you publish it to the workspace. Notice the indicator near your Azure Machine Learning linked service. To publish it, select **Publish all** and then **Publish**.
First, we need to create a Spark table as a starting point for the Machine Learning model training process. In Synapse Studio, select the **Data** hub and then the **Linked** section. In the primary **Azure Data Lake Storage Gen 2** account, select the `wwi-02` file system, and then select the `sale-small-20191201-snappy.parquet` file under `wwi-02\sale-small\Year=2019\Quarter=Q4\Month=12\Day=20191201`. Right-click the file and select **New notebook > New Spark table**.
Replace the content of the notebook cell with the following code and then run the cell:
```python
import pyspark.sql.functions as f

df = spark.read.load('abfss://wwi-02@<data_lake_account_name>.dfs.core.windows.net/sale-small/Year=2019/Quarter=Q4/Month=12/*/*.parquet',
    format='parquet')
df_consolidated = df.groupBy('ProductId', 'TransactionDate', 'Hour').agg(f.sum('Quantity').alias('TotalQuantity'))
df_consolidated.write.mode("overwrite").saveAsTable("default.SaleConsolidated")
```
NOTE:

Replace `<data_lake_account_name>` with the actual name of your Synapse Analytics primary data lake account.
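If you want to confirm the table before moving on, you can run an optional check in the same notebook session (this is not part of the generated code):

```python
# Optional sanity check on the newly created Spark table.
spark.sql("SELECT * FROM default.SaleConsolidated LIMIT 10").show()
```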
The code takes all data available for December 2019 and aggregates it at the `ProductId`, `TransactionDate`, and `Hour` level, calculating the total product quantities sold as `TotalQuantity`. The result is then saved as a Spark table named `SaleConsolidated`. To view the table in the **Data** hub, expand the **default (Spark)** database in the **Workspace** section. Your table will show up in the **Tables** folder. Select the three dots at the right of the table name to view the **Machine Learning** option in the context menu.
The following options are available in the **Machine Learning** section:

- **Enrich with new model**: allows you to start an AutoML experiment to train a new model.
- **Enrich with existing model**: allows you to use an existing Azure Cognitive Services model.
In this exercise, you will trigger the execution of an Auto ML experiment and view its progress in Azure Machine Learning studio.
To trigger the execution of a new AutoML experiment, select the **Data** hub and then select the `...` area on the right of the `saleconsolidated` Spark table to activate the context menu. From the context menu, select **Enrich with new model**.
The **Enrich with new model** dialog allows you to set the properties for the Azure Machine Learning experiment. Provide values as follows:

- **Azure Machine Learning workspace**: leave unchanged; it should be automatically populated with your Azure Machine Learning workspace name.
- **Experiment name**: leave unchanged; a name will be automatically suggested.
- **Best model name**: leave unchanged; a name will be automatically suggested. Save this name, as you will need it later to identify the model in the Azure Machine Learning studio.
- **Target column**: select `TotalQuantity (long)` - this is the feature you are looking to predict.
- **Spark pool**: leave unchanged; it should be automatically populated with your Spark pool name.
Notice the Apache Spark configuration details:
- The number of executors that will be used
- The size of the executor
Select **Continue** to advance with the configuration of your Auto ML experiment.
Next, you will choose the model type. In this case, the choice will be **Regression**, as we are trying to predict a continuous numerical value. After selecting the model type, select **Continue** to advance.
On the **Configure regression model** dialog, provide values as follows:

- **Primary metric**: leave unchanged; `Spearman correlation` should be suggested by default.
- **Training job time (hours)**: set to 0.25 to force the process to finish after 15 minutes.
- **Max concurrent iterations**: leave unchanged.
- **ONNX model compatibility**: set to `Enable` - this is very important, as currently only ONNX models are supported in the Synapse Studio integrated experience.
Once you have set all the values, select **Create run** to advance.
As your run is being submitted, a notification will pop up instructing you to wait until the Auto ML run is submitted. You can check the status of the notification by selecting the **Notifications** icon in the top right part of your screen.
Once your run is successfully submitted, you will get another notification that will inform you about the actual start of the Auto ML experiment run.
NOTE

Alongside the **Create run** option you might have noticed the **Open in notebook** option. Selecting that option allows you to review the actual Python code that is used to submit the Auto ML run. As an exercise, try re-doing all the steps in this task, but instead of selecting **Create run**, select **Open in notebook**. You should see a notebook similar to this:

Take a moment to read through the code that is generated for you. The experiment run will take around 20 minutes in total.
To view the experiment run you just started, open the Azure Portal, select your resource group, and then select the Azure Machine Learning workspace from the resource group. Locate and select the **Launch studio** button to start the Azure Machine Learning studio.
In Azure Machine Learning studio, select the **Automated ML** section on the left and identify the experiment run you have just started. Note the experiment name, the **Running** status, and the **local** compute target.
The reason you see **local** as the compute target is that you are running the AutoML experiment on the Spark pool inside Synapse Analytics. From the point of view of Azure Machine Learning, you are not running your experiment on Azure Machine Learning's compute resources, but on your "local" compute resources.
Select your run, and then select the **Models** tab to view the current list of models being built by your run. The models are listed in descending order of the metric value (which is `Spearman correlation` in this case), the best ones being listed first.
Select the best model (the one at the top of the list), then click **View Explanations** to open the **Explanations (preview)** tab and see the model explanation. You can select one of the explanations, then choose **Aggregates** to see the global importance of the input features. For your model, the feature with the greatest influence on the predicted value is `ProductId`.
Next, select the **Models** section on the left in Azure Machine Learning studio and see your best model registered with Azure Machine Learning. This allows you to refer to this model later on in this lab.
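If you prefer to inspect the registered models from code rather than the studio UI, a minimal sketch with the `azureml-core` package could look like this (the workspace parameters are placeholders):

```python
# List the models registered in the Azure ML workspace; the best model
# registered by the AutoML run should appear in the output.
from azureml.core import Workspace, Model

ws = Workspace(subscription_id="<subscription_id>",
               resource_group="<resource_group>",
               workspace_name="<aml_workspace_name>")

for m in Model.list(ws):
    print(m.name, m.version, m.id)
```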
In this exercise, you will use existing trained models to perform predictions on data. Task 1 uses a trained model from Azure Machine Learning services while Task 2 uses one from Azure Cognitive Services. Finally, you will include the prediction stored procedure created in Task 1 in a Synapse pipeline.
In Synapse Studio, select the **Data** hub, then select the **Workspace** tab, and then locate the `wwi.ProductQuantityForecast` table in the **SQLPool01 (SQL)** database (under **Databases**). Activate the context menu by selecting `...` on the right side of the table name, and then select **New SQL script > Select TOP 100 rows**. The table contains the following columns:
- **ProductId**: the identifier of the product for which we want to predict
- **TransactionDate**: the future date for which we want to predict
- **Hour**: the hour of the future date for which we want to predict
- **TotalQuantity**: the value we want to predict for the specified product, day, and hour
Notice that `TotalQuantity` is zero in all rows, as this is the placeholder for the predicted values we are looking to get.
To use the model you just trained in Azure Machine Learning, activate the context menu of the `wwi.ProductQuantityForecast` table, and then select **Machine Learning > Enrich with existing model**. This will open the **Enrich with existing model** dialog where you can select your model. Select the most recent model and then select **Continue**.
Next, you will manage the input and output column mappings. Because the column names in the target table and the table used for model training match, you can leave all mappings as suggested by default. Select **Continue** to advance.
The final step presents you with options to name the stored procedure that will perform the predictions and the table that will store the serialized form of your model. Provide the following values:

- **Stored procedure name**: `[wwi].[ForecastProductQuantity]`
- **Select target table**: `Create new`
- **New table**: `[wwi].[Model]`

Select **Deploy model + open script** to deploy your model into the SQL pool.
From the new SQL script that is created for you, copy the ID of the model:
The T-SQL code that is generated will only return the results of the prediction, without actually saving them. To save the results of the prediction directly into the `[wwi].[ProductQuantityForecast]` table, replace the generated code with the following:
```sql
CREATE PROC [wwi].[ForecastProductQuantity] AS
BEGIN

-- Stage the rows that still need a forecast (TotalQuantity = 0),
-- casting the inputs to the types expected by the ONNX model.
SELECT
    CAST([ProductId] AS [bigint]) AS [ProductId],
    CAST([TransactionDate] AS [bigint]) AS [TransactionDate],
    CAST([Hour] AS [bigint]) AS [Hour]
INTO #ProductQuantityForecast
FROM [wwi].[ProductQuantityForecast]
WHERE TotalQuantity = 0;

-- Score the staged rows with the ONNX model stored in wwi.Model.
SELECT
    ProductId
    ,TransactionDate
    ,Hour
    ,CAST(variable_out1 AS INT) AS TotalQuantity
INTO #Pred
FROM PREDICT (MODEL = (SELECT [model] FROM wwi.Model WHERE [ID] = '<your_model_id>'),
    DATA = #ProductQuantityForecast,
    RUNTIME = ONNX) WITH ([variable_out1] [real])

-- Write the predictions back into the forecast table in-place.
MERGE [wwi].[ProductQuantityForecast] AS target
    USING (SELECT * FROM #Pred) AS source (ProductId, TransactionDate, Hour, TotalQuantity)
ON (target.ProductId = source.ProductId AND target.TransactionDate = source.TransactionDate AND target.Hour = source.Hour)
WHEN MATCHED THEN
    UPDATE SET target.TotalQuantity = source.TotalQuantity;

END
GO
```
In the code above, make sure you replace `<your_model_id>` with the actual ID of the model (the one you copied in the previous step).
NOTE:

Our version of the stored procedure uses the `MERGE` command to update the values of the `TotalQuantity` field in-place in the `wwi.ProductQuantityForecast` table. The `MERGE` command has recently been added to Azure Synapse Analytics. For more details, read New MERGE command for Azure Synapse Analytics.
You are now ready to perform the forecast on the `TotalQuantity` column. Open a new SQL script and run the following statement:
```sql
EXEC wwi.ForecastProductQuantity

SELECT
    *
FROM
    wwi.ProductQuantityForecast
```
Notice how the values in the `TotalQuantity` column have changed from zero to non-zero predicted values:
First, we need to create a Spark table to be used as the input for the Cognitive Services model. In Synapse Studio, select the **Data** hub and then the **Linked** section. In the primary **Azure Data Lake Storage Gen 2** account, select the `wwi-02` file system, and then select the `ProductReviews.csv` file under `wwi-02\sale-small-product-reviews`. Right-click the file and select **New notebook > New Spark table**.
Replace the content of the notebook cell with the following code and then run the cell:
```python
%%pyspark
df = spark.read.load('abfss://wwi-02@<data_lake_account_name>.dfs.core.windows.net/sale-small-product-reviews/ProductReviews.csv',
    format='csv', header=True)
df.write.mode("overwrite").saveAsTable("default.ProductReview")
```
NOTE:

Replace `<data_lake_account_name>` with the actual name of your Synapse Analytics primary data lake account.
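As an optional check, you can preview the new table from the same notebook session:

```python
# Optional preview of the product reviews table (truncate=False shows full text).
spark.sql("SELECT * FROM default.ProductReview LIMIT 5").show(truncate=False)
```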
To view the table in the **Data** hub, expand the **default (Spark)** database in the **Workspace** section. Your table will show up in the **Tables** folder. Select the three dots at the right of the table name to view the **Machine Learning** option in the context menu, and then select **Machine Learning > Enrich with existing model**.
In the **Enrich with existing model** dialog, select **Text Analytics - Sentiment Analysis** under **Azure Cognitive Services** and then select **Continue**.
Next, provide values as follows:

- **Azure subscription**: select the Azure subscription of your resource group.
- **Cognitive Services account**: select the Cognitive Services account that has been provisioned in your resource group. The name should be `asagacognitiveservices<unique_suffix>`, where `<unique_suffix>` is the unique suffix you provided when deploying the Synapse Analytics workspace.
- **Azure Key Vault linked service**: select the Azure Key Vault linked service that has been provisioned in your Synapse Analytics workspace. The name should be `asagakeyvault<unique_suffix>`, where `<unique_suffix>` is the unique suffix you provided when deploying the Synapse Analytics workspace.
- **Secret name**: enter `ASA-GA-COGNITIVE-SERVICES` (the name of the secret that contains the key for the specified Cognitive Services account).
Select **Continue** to advance.
Next, provide values as follows:

- **Language**: select `English`.
- **Text column**: select `ReviewText (string)`.

Select **Open notebook** to view the generated code.
NOTE:

When you created the `ProductReview` Spark table by running the notebook cell, you started a Spark session on that notebook. The default settings of your Synapse Analytics workspace will not allow you to start a new notebook that runs in parallel with it. You will need to copy the contents of the two cells that contain the Cognitive Services integration code into that notebook and run them in the Spark session that you have already started. After copying the two cells, you should see a screen similar to this:
NOTE: To run the notebook generated by Synapse Studio without copying its cells, you can use the **Apache Spark applications** section of the **Monitor** hub, where you can view and cancel running Spark sessions. For more details, see Use Synapse Studio to monitor your Apache Spark applications. In this lab, we used the copy-cells approach to avoid the extra time required to cancel the running Spark session and start a new one afterwards.
Run cells 2 and 3 in the notebook to get the sentiment analysis results for your data.
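For reference, the two generated cells are conceptually similar to the sketch below. The exact module path and methods depend on the MMLSpark/SynapseML version bundled with your Spark pool, and the linked service name is a placeholder, so treat the generated notebook as authoritative:

```python
# Conceptual sketch of Cognitive Services sentiment scoring on a Spark table;
# the module path and linked service name are assumptions.
from synapse.ml.cognitive import TextSentiment

df = spark.sql("SELECT * FROM default.ProductReview")

sentiment = (TextSentiment()
    .setLinkedService("<cognitive_services_linked_service>")  # placeholder
    .setTextCol("ReviewText")
    .setLanguage("en")
    .setOutputCol("sentiment")
    .setErrorCol("error"))

display(sentiment.transform(df))
```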
In Synapse Studio, select the **Integrate** hub on the left side and then select **+ > Pipeline** to create a new Synapse pipeline.
Enter `Product Quantity Forecast` as the name of the pipeline on the right side.
From the **Move & transform** section, add a **Copy data** activity and name it `Import forecast requests`.
In the **Source** section of the copy activity properties, provide the following values:

- **Source dataset**: select the `wwi02_sale_small_product_quantity_forecast_adls` dataset.
- **File path type**: select `Wildcard file path`.
- **Wildcard paths**: enter `sale-small-product-quantity-forecast` in the first textbox and `*.csv` in the second.
In the **Sink** section of the copy activity properties, provide the following values:

- **Sink dataset**: select the `wwi02_sale_small_product_quantity_forecast_asa` dataset.
In the **Mapping** section of the copy activity properties, select **Import schemas** and check the field mappings between source and sink.
In the **Settings** section of the copy activity properties, provide the following values:

- **Enable staging**: select the option.
- **Staging account linked service**: select the `asagadatalake<unique_suffix>` linked service (where `<unique_suffix>` is the unique suffix you provided when deploying the Synapse Analytics workspace).
- **Storage path**: enter `staging`.
From the **Synapse** section, add a **SQL pool stored procedure** activity and name it `Forecast product quantities`. Connect the two pipeline activities to ensure the stored procedure runs after the data import.
In the **Settings** section of the stored procedure activity properties, provide the following values:

- **Azure Synapse dedicated SQL pool**: select `SQLPool01`.
- **Stored procedure name**: select `[wwi].[ForecastProductQuantity]`.
Select **Debug** to make sure the pipeline works correctly.
Select **Publish all** to publish the pipeline.
Create a new SQL script and execute the following script against the `SQLPool01` pool:
```sql
SELECT
    *
FROM
    wwi.ProductQuantityForecast
```
In the results, you should now see forecasted values for `Hour = 11` (which correspond to the rows imported by the pipeline):
In this exercise you will view the prediction results in a Power BI report. You will also trigger the forecast pipeline with new input data and view the updated quantities in the Power BI report.
First, you will publish a simple Product Quantity Forecast report to Power BI.
Download the `ProductQuantityForecast.pbix` file from the GitHub repo: https://github.com/solliancenet/azure-synapse-analytics-ga-content-packs/blob/main/hands-on-labs/lab-01/ProductQuantityForecast.pbix (select **Download** on the GitHub page).
Open the file with Power BI Desktop (ignore the warning about missing credentials). Also, if you are first prompted to update credentials, ignore the messages and close the pop-ups without updating the connection information.
In the **Home** section, select **Transform data**, then edit the **Source** entry in the **APPLIED STEPS** list of the `ProductQuantityForecast` query. Change the name of the server to `asagaworkspace<unique_suffix>.sql.azuresynapse.net` (where `<unique_suffix>` is the unique suffix you provided when deploying the Synapse Analytics workspace).
The credentials window will pop up and prompt you to enter the credentials to connect to the Synapse Analytics SQL pool (in case it doesn't, select **Data source settings** on the ribbon, select your data source, select **Edit Permissions...**, and then select **Edit...** under **Credentials**).
In the credentials window, select **Microsoft account** and then select **Sign in**. Use your Power BI Pro account to sign in.
Close all open popup windows, select **Close & Apply**, and then publish the file to your Power BI workspace.
To view the results of the report, in Synapse Studio, select the **Develop** hub on the left side, expand the **Power BI** section, and select the `ProductQuantityForecast` report under the **Power BI reports** section of your workspace.
In Synapse Studio, select the **Integrate** hub on the left side and open the `Product Quantity Forecast` pipeline. Select **+ Add trigger** and specify that you want to create a new pipeline trigger. In the **New trigger** window, provide the following values:

- **Name**: enter `New data trigger`.
- **Type**: select `Event`.
- **Azure subscription**: ensure the right Azure subscription is selected (the one containing your resource group).
- **Storage account name**: select the `asagadatalake<unique_suffix>` account (where `<unique_suffix>` is the unique suffix you provided when deploying the Synapse Analytics workspace).
- **Container name**: enter `wwi-02`.
- **Blob path begins with**: enter `sale-small-product-quantity-forecast/ProductQuantity`.
- **Event**: select `Blob created`.
Select **Continue** to create the trigger, then select **Continue** once more in the **Data preview** dialog, and then select **OK** to publish the trigger.
In Synapse Studio, select **Publish all** and then **Publish** to publish all changes.
Download the `ProductQuantity-20201209-12.csv` file from https://solliancepublicdata.blob.core.windows.net/wwi-02/sale-small-product-quantity-forecast/ProductQuantity-20201209-12.csv.
In Synapse Studio, select the **Data** hub on the left side, navigate to the primary data lake account in the **Linked** section, and open the `wwi-02 > sale-small-product-quantity-forecast` path. Delete the existing `ProductQuantity-20201209-11.csv` file and upload the `ProductQuantity-20201209-12.csv` file. This will trigger the `Product Quantity Forecast` pipeline, which will import the forecast requests from the CSV file and run the forecasting stored procedure.
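If you would rather upload the file programmatically (the resulting Blob created event fires the same trigger), a minimal sketch with the `azure-storage-file-datalake` package could look like this, assuming the CSV was downloaded to the current directory:

```python
# Upload the forecast request file to the data lake; the "Blob created" event
# fires the Product Quantity Forecast trigger. The account name is an
# assumption based on the lab's naming convention.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://asagadatalake<unique_suffix>.dfs.core.windows.net"
service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())
fs = service.get_file_system_client("wwi-02")

file_client = fs.get_file_client(
    "sale-small-product-quantity-forecast/ProductQuantity-20201209-12.csv")
with open("ProductQuantity-20201209-12.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```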
In Synapse Studio, select the **Monitor** hub on the left side, and then select **Trigger runs** to see the newly activated pipeline run. Once the pipeline is finished, refresh the Power BI report in Synapse Studio to view the updated data.
Follow the instructions in Clean-up your subscription to clean up your environment after the hands-on lab.
To learn more about the topics covered in this lab, use these resources:
- Quickstart: Create a new Azure Machine Learning linked service in Synapse
- Tutorial: Machine learning model scoring wizard for dedicated SQL pools
In case you encounter any issues with the content in this repository, please follow the How to report issues guideline. We will try to address them as soon as possible. Please check your open issues to learn about their status.