This lab covers clustering with Azure Machine Learning, Automated ML, and model explainability.
Understanding the source datasets is very important in AI and ML. To help you expedite the process, we have created a Power BI dashboard you can use to explore them at the beginning of each lab.
To get more details about the source datasets, check out the Data Overview section.
To explore the dashboard of COVID-19 data, open the Azure-AI-in-a-Day-Data-Overview.pbix file located on the desktop of the virtual machine provided with your environment.
Given the magnitude of the COVID-19 problem, it is natural that a lot of research addresses the topic. In 2020 alone, tens of thousands of papers were published on COVID-19. The sheer volume of publications makes it difficult for a researcher to grasp and structure all the relevant topics and details. Furthermore, predefined catalogs and paper classifications might not always reflect the papers' content in the most effective way.
Based on a set of existing research papers, we will use Natural Language Processing and Machine Learning to identify these papers' natural grouping. For each new document that gets into our system, we will use Machine Learning to classify it into one of the previously identified groups. We will use Automated ML (a feature of Azure Machine Learning) to train the best classification model and explain its behavior.
The following diagram highlights the portion of the general architecture covered by this lab.
The high-level steps covered in the lab are:
- Explore dashboard of COVID-19 data
- Explore lab scenario
- Run word embedding process on natural language content of research papers
- Explore results of word embedding
- Run clustering of research papers and explore results
- Use the newly found clusters to label the research documents and run the Automated ML process to train a classifier
- Run the classifier on "new" research papers
- Explain the best model produced by AutoML
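
The embedding, clustering, and classification steps listed above can be illustrated with a minimal, self-contained sketch. It rests on several assumptions that are not taken from the lab notebooks: documents are embedded by averaging toy two-dimensional word vectors (the `WORD_VECTORS` values are made up), clustering uses plain k-means, and a new paper is assigned to the nearest cluster centroid as a simple stand-in for the classifier that Automated ML trains in the lab.

```python
import math
import random

# Toy 2-dimensional "word vectors" (hypothetical values chosen so that two
# topics separate clearly; the lab uses trained embeddings instead).
WORD_VECTORS = {
    "virus": (1.0, 0.1), "protein": (0.9, 0.2), "genome": (0.8, 0.0),
    "hospital": (0.1, 1.0), "patient": (0.0, 0.9), "icu": (0.2, 0.8),
}


def embed(text):
    """Average the vectors of the known words to get one vector per document."""
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        raise ValueError("no known words in document")
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


papers = [
    "virus genome protein",
    "protein virus",
    "hospital patient icu",
    "patient icu",
]
vectors = [embed(p) for p in papers]
centroids = kmeans(vectors, k=2)
labels = [nearest(v, centroids) for v in vectors]  # cluster id per paper

# A "new" paper is embedded the same way and assigned to an existing cluster.
new_paper = "genome of the virus"
new_label = nearest(embed(new_paper), centroids)
print(labels, new_label)
```

The cluster ids themselves are arbitrary; what matters is that papers on the same topic share an id and that the new paper lands in the cluster of the papers it resembles.
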
- Open the Azure Portal and sign in with your lab credentials.
- In the list of your recent resources, locate the Azure Machine Learning workspace, select it, and then select Launch studio. If you are prompted to sign in again, use the same lab credentials you used in the previous step.
- In Azure Machine Learning Studio, select Compute (1) from the left side menu and verify that your compute instance is running (2).

  Note: If you launched Azure Machine Learning Studio right after your lab environment was provisioned, you might find the compute instance in a provisioning state. In this case, wait a few minutes until its status changes to Running.
- From the Application URI section associated with the compute instance, select Jupyter (3).
- In the Jupyter notebook environment, navigate to the root folder.

  WARNING: If the root folder does not have a file with the extension w2v, look for nested folders under the Users folder. Ideally, your notebooks should be in the same folder as the w2v file.
- If the folder does not contain any notebooks, download the following items to your local machine:

  Upload each file by selecting the Upload (1) button from the top right corner of the screen and then selecting the blue Upload (2) button to confirm.
- With the Azure Machine Learning Studio and the Jupyter notebook environment open, select the 1. Data Preparation.ipynb notebook.
- Execute the notebook cell by cell (using either Ctrl + Enter to stay on the same cell, or Shift + Enter to advance to the next cell) and observe the results of each cell execution.
In this task, we'll use Azure Automated ML to train a machine learning model capable of determining the best cluster for a COVID-19 scientific article. It builds upon the work done in the Data Preparation notebook.
- In the Azure Machine Learning studio, switch to the Automated ML (1) section and select + New Automated ML run (2).
- In the Create a new Automated ML run wizard, pick COVID19Articles_Train_Vectors (1) as your dataset and select Next (2) to proceed.
- To launch an Automated ML run, we need to provision a compute cluster. On the Configure run step, select aml-compute-cpu (1) from the list of clusters. If the list is empty, select the Create a new compute (2) link.

  Note: If you already have the aml-compute-cpu cluster provisioned, skip ahead to the step where you set the experiment name.
- On the Create compute cluster screen, set the values listed below:
  - Virtual machine priority (1): Dedicated
  - Virtual machine type (2): CPU
  - Virtual machine size (3): Standard_DS3_v2

  Select Next (4) to continue.
- To configure the cluster settings, set the values given below:
  - Compute name (1): aml-compute-cpu
  - Minimum number of nodes (2): 0
  - Maximum number of nodes (3): 4

  Setting the maximum number of nodes to a higher value allows Automated ML to run more experiments in parallel, but also increases your costs.

  Select Create (4) to proceed.
- Set the experiment name to COVID19_Classification (1) and Target column to cluster (2). The values we're trying to predict are in the cluster column. If your compute is not yet selected, make sure aml-compute-cpu (3) is selected as the compute for the experiment. Select Next (4) to continue.
- On the Select task type screen, select Classification (1) as the machine learning task type for the experiment, and select View additional configuration settings (2) to open a new panel of settings.
- On the Additional configurations panel, fill in the values listed below:
  - Primary metric (1): AUC weighted
  - Training job time (hours) (2): 0.25
  - Metric score threshold (3): 0.95
  - Validation type (4): k-fold cross validation
  - Number of cross validations (5): 5
  - Max concurrent iterations (6): 4

  Because training job time is set to 0.25 hours, the experiment stops after 15 minutes to minimize cost. With Max concurrent iterations set to 4, Automated ML can try at most four models at the same time; this is also limited by the compute cluster's maximum number of nodes.

  Select Save (7) to continue.
- When you are back on the Select task type screen, select Finish (2) to kick off the Automated ML experiment run. If this is the first time you are launching an experiment run in the Azure Machine Learning workspace, the total experiment time will be longer than the training job time we have set. This is because of the time needed to start the compute cluster and deploy the container images required to execute the run.
- On the following screen, you will see the progress of your experiment run.
- Now that you understand the process of launching an AutoML run, the next task explores the results of an AutoML run that has already completed.

  Note: An AutoML run very similar to the one you just launched has already been executed in this environment. This allows you to explore AutoML results without having to wait for your run to complete.
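
For reference, the values entered in the wizard during this task can be summarized in code. The sketch below is plain Python and does not call Azure. The dictionary keys are modeled on parameter names from the Azure Machine Learning Python SDK (v1) and should be treated as assumptions to verify against the SDK version you use. The two helper functions are hypothetical and only illustrate the arithmetic described above: the 0.25-hour limit expressed in minutes, and the concurrency bound imposed by the cluster size.

```python
# Cluster values from the Create compute cluster screens.
# Key names are assumptions modeled on the Azure ML SDK (v1).
cluster_settings = {
    "name": "aml-compute-cpu",
    "vm_size": "Standard_DS3_v2",
    "vm_priority": "dedicated",
    "min_nodes": 0,
    "max_nodes": 4,
}

# Experiment values from the Additional configurations panel.
# Key names are assumptions modeled on the Azure ML SDK (v1).
automl_settings = {
    "task": "classification",
    "label_column_name": "cluster",
    "primary_metric": "AUC_weighted",
    "experiment_timeout_hours": 0.25,
    "experiment_exit_score": 0.95,
    "n_cross_validations": 5,
    "max_concurrent_iterations": 4,
}


def timeout_minutes(settings):
    """Training job time converted from hours to minutes."""
    return settings["experiment_timeout_hours"] * 60


def effective_concurrency(settings, cluster):
    """Models trained at once: bounded by both the setting and the node count."""
    return min(settings["max_concurrent_iterations"], cluster["max_nodes"])


print(timeout_minutes(automl_settings))                          # 15.0
print(effective_concurrency(automl_settings, cluster_settings))  # 4
```

With a smaller cluster, for example a maximum of two nodes, the same experiment settings would train at most two models at a time.
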
- In the Azure Machine Learning Studio, navigate to the Experiments (1) section and locate the COVID19_Classification experiment (2). Select the experiment name link.
- This opens the experiment details page, where you should see the list of experiment runs. Locate the first run (1) listed here, the one that has the status Completed. Choose the option to Include the existing child runs (2), as illustrated below.
- You should now see the list of child runs executed to train multiple machine learning models using various classification algorithms. Select the first run (1) with the type Automated ML (2).
- On the Run details page, navigate to the Models (1) section. Check the values in the AUC weighted column (2), which is the primary metric selected in the AutoML run configuration. The best model is the one with the maximum value of this metric. This is also the model for which the explanation was generated. Select View explanation (3).
- In the Explanations (1) section, browse the available explanations (2) and investigate the Model performance (3) representation.
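
To make the primary metric more concrete, here is a minimal sketch of how a weighted AUC can be computed and used to pick the best model. It assumes the common definition: a one-vs-rest AUC is computed for each class and the per-class values are averaged with weights equal to the number of true instances of each class. The labels, the probabilities, and the model names `model_a` and `model_b` are made up for illustration, and the Azure implementation may differ in detail.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_weighted(y_true, proba):
    """One-vs-rest AUC per class, averaged with weights equal to class counts."""
    total = 0.0
    for c in sorted(set(y_true)):
        scores = [row[c] for row in proba]      # predicted probability of class c
        is_pos = [y == c for y in y_true]       # one-vs-rest ground truth
        total += binary_auc(scores, is_pos) * sum(is_pos)
    return total / len(y_true)


# Hypothetical true cluster labels for six papers.
y_true = [0, 0, 0, 1, 1, 2]

# Hypothetical predicted probabilities per cluster from two candidate models.
model_a = [
    {0: 0.8, 1: 0.1, 2: 0.1},
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.6, 1: 0.3, 2: 0.1},
    {0: 0.2, 1: 0.7, 2: 0.1},
    {0: 0.1, 1: 0.6, 2: 0.3},
    {0: 0.1, 1: 0.2, 2: 0.7},
]
model_b = [
    {0: 0.8, 1: 0.1, 2: 0.1},
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.3, 1: 0.6, 2: 0.1},  # a cluster-0 paper scored highest for cluster 1
    {0: 0.2, 1: 0.7, 2: 0.1},
    {0: 0.4, 1: 0.5, 2: 0.1},
    {0: 0.1, 1: 0.2, 2: 0.7},
]

results = {
    "model_a": auc_weighted(y_true, model_a),
    "model_b": auc_weighted(y_true, model_b),
}
best = max(results, key=results.get)  # model with the highest primary metric
print(results, best)
```

Here `model_a` ranks every paper correctly and is selected, which mirrors how the Models list identifies the best model by the highest value in the AUC weighted column.
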
- With the Azure Machine Learning Studio and the Jupyter notebook environment open, select the 3. Document Classification.ipynb notebook.
- Execute the notebook cell by cell (using either Ctrl + Enter to stay on the same cell, or Shift + Enter to advance to the next cell) and observe the results of each cell execution.