Lab 1 - Azure Machine Learning Model Training

This lab covers clustering with Azure Machine Learning, Automated ML, and model explainability.

Task 1 - Explore dashboard of COVID-19 data

Understanding the source datasets is very important in AI and ML. To help you expedite the process, we have created a Power BI dashboard you can use to explore them at the beginning of each lab.

To get more details about the source datasets, check out the Data Overview section.

To explore the dashboard of COVID-19 data, open the Azure-AI-in-a-Day-Data-Overview.pbix file located on the desktop of the virtual machine provided with your environment.

Task 2 - Explore lab scenario

Given the magnitude of the COVID-19 problem, it comes naturally to have a lot of research on the topic. In fact, in 2020 alone, tens of thousands of papers have been published on COVID-19 alone. The sheer amount of communication on the subject makes it difficult for a researcher to grasp and structure all the relevant topics and details. Furthermore, pre-defined catalogs and papers classification might not always reflect their content in the most effective way possible.

Based on a set of existing research papers, we will use Natural Language Processing and Machine Learning to identify these papers' natural grouping. For each new document that gets into our system, we will use Machine Learning to classify it into one of the previously identified groups. We will use Automated ML (a feature of Azure Machine Learning) to train the best classification model and explain its behavior.

The following diagram highlights the portion of the general architecture covered by this lab.

The high-level steps covered in the lab are:

Explore dashboard of COVID-19 data
Explore lab scenario
Run word embedding process on natural language content of research papers
Explore results of word embedding
Run clustering of research papers and explore results
Use the newly found clusters to label the research document and run the Auto ML process to train a classifier
Run the classifier on "new" research papers
Explain the best model produced by AutoML

Task 3 - Prepare Azure Machine Learning workspace

Open the Azure Portal and sign-in with your lab credentials.
In the list of your recent resources, locate the Azure Machine Learning workspace, select it, and then select Launch studio. If you are prompted to sign-in again, use the same lab credentials you used at the previous step.
In Azure Machine Learning Studio, select Compute (1) from the left side menu and verify that your compute instance is running (2).

Note: If you launched Azure Machine Learning Studio right after your lab environment was provisioned, you might find the compute instance in a provisioning state. In this case, wait a few minutes until it changes its status to Running.
From the Application URI section associated with the compute instance, select Jupyter (3).
In the Jupyter notebook environment, navigate to the root folder.

WARNING: If the root folder does not have a file with the extension w2v, look for nested folders under the Users folder. Ideally, your notebooks should be in the same folder as the w2v file.
If the folder does not contain any notebooks, download the following items to your local machine:

1. Data Preparation.ipynb

3. Document Classification.ipynb

covid_embeddings_model_500_docs.w2v

Upload each file by selecting the Upload (1) button from the top right corner of the screen and then selecting the blue Upload (2) button to confirm.

Task 4 - Prepare data for the Machine Learning process

With the Azure Machine Learning Studio and the Jupyter notebook environment open, select the 1. Data Preparation.ipynb notebook.
Execute the notebook cell by cell (using either Ctrl + Enter to stay on the same cell, or Shift + Enter to advance to the next cell) and observe the results of each cell execution.

Task 5 - Train a Machine Learning model with Automated ML

In this task, we'll use Azure Automated ML to train a machine learning model capable of determining the best cluster for a COVID-19 scientific article. It builds upon the work done in the Data Preparation notebook.

In the Azure Machine Learning studio, switch to the Automated ML (1) section and select + New Automated ML run (2) to start the A
In the Create a new Automated ML run wizard pick COVID19Articles_Train_Vectors (1) as your dataset and select Next (2) to proceed.
In order to be able to launch an Automated ML run we need to provision a compute cluster. On the Configure run step select aml-compute-cpu (1) from the list of clusters. If the list is empty select Create a new compute (2) link.

Note: If you already have aml-compute-cpu cluster provisioned, feel free to skip to step 6.
On the Create compute cluster screen set the values listed below:
- Virtual machine priority (1): Dedicated
- Virtual machine type (2): CPU
- Virtual machine Size (3): Standard_DS3_v2
Select Next (4) to continue.
To configure cluster settings set the values given below:
- Compute name (1): aml-compute-cpu
- Minimum number od nodes (2): 0
- Maximum number of nodes (3): 4
Setting the number of maximum nodes to a higher value will allow Automated ML to run more experiments in parallel, but will also increase your costs

Select Create (4) to proceed.
Set the experiment name to COVID19_Classification (1) and Target column to cluster (2). The values we're trying to predict are in the cluster column. If your compute is not yet selected, make sure aml-compute-cpu (3) is selected as your compute for the experiment. Select Next (4) to continue.
On the Select task type screen select Classification (1) as the machine learning task type for the experiment and select View additional configuration settings (2) to open a new panel of settings.
On the Additional configurations panel, fill in the values listed below:
- Primary metric (1): AUC weighted
- Training job time (hours) (2): 0.25
- Metric score threshold (3): 0.95
- Validation type (4): k-fold cross validation
- Number of cross validations (5): 5
- Max concurrent iterations (6): 4
Thanks to the 0.25 hours set for training job time, the experiment will stop after 15 minutes to minimize cost. When it comes to Max concurrent iterations, Automated ML can try at most four models at the same time, this is also limited by the compute instance's maximum number of nodes.

Select Save (7) to continue.
When you are back on the Select task type screen, select Finish (2) to kick off the Automated ML experiment run. If this is the first time you are launching an experiment run in the Azure Machine Learning workspace, the total experiment time will longer than the training job time we have set. This is because of the time needed to start the Compute Cluster and deploy the container images required to execute.
On the following screen, you will see the progress of your experiment run.
Now that you understand the process of launching an AutoML run, let's explore in the next task the results of an already completed AutoML run.

Note: We have already executed in this environment an AutoML run that is very similar to the one that you've just launched. This allows you to explore AutoML results without having to wait for the completion of the run.

Task 6 - Explore AutoML results

In the Azure Machine Learning Studio, navigate to the Experiments (1) section and locate the COVID19_Classification experiment (2). Select the experiment name link.
You will navigate to the experiment details page where you should see the list of experiment runs. Locate the first run (1) listed here, the one that has the status Completed. Choose the option to Include the existing child runs (2) as illustrated bellow.
Now you should be able to see the list of child runs executed in order to train multiple machine learning models using various classification algorithms. Select the first run (1) with the type Automated ML (2).
On the Run details page, navigate to the Models (1) section. Check the values on the AUC weighted column (2), which is the primary metric selected in the AutoML run configuration. See how the best model was selected, this is the one with the maximum metric value. This is also the model for which the explanation was generated. Select View explanation (3).
On the Explanations (1) section, browse the available explanations (2) and investigate the Model performance (3) representation.

Task 7 - Classify new research documents

With the Azure Machine Learning Studio and the Jupyter notebook environment open, select the 3. Document Classification.ipynb notebook.
Execute the notebook cell by cell (using either Ctrl + Enter to stay on the same cell, or Shift + Enter to advance to the next cell) and observe the results of each cell execution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Lab 1 - Azure Machine Learning Model Training

Task 1 - Explore dashboard of COVID-19 data

Task 2 - Explore lab scenario

Task 3 - Prepare Azure Machine Learning workspace

Task 4 - Prepare data for the Machine Learning process

Task 5 - Train a Machine Learning model with Automated ML

Task 6 - Explore AutoML results

Task 7 - Classify new research documents

Files

README.md

Latest commit

History

README.md

File metadata and controls

Lab 1 - Azure Machine Learning Model Training

Task 1 - Explore dashboard of COVID-19 data

Task 2 - Explore lab scenario

Task 3 - Prepare Azure Machine Learning workspace

Task 4 - Prepare data for the Machine Learning process

Task 5 - Train a Machine Learning model with Automated ML

Task 6 - Explore AutoML results

Task 7 - Classify new research documents