ETL with Azure Databricks – Setup Instructions

Overview

In this lab, you will create an end-to-end pipeline in Azure Data Factory (ADF) containing Lookup, Copy, and Databricks notebook activities.
• The Lookup (or Get Metadata) activity checks that the source dataset is ready for downstream consumption before the copy and analytics job are triggered.
• The Copy activity copies the source file/dataset to the sink storage. The sink storage is mounted as DBFS in the Databricks notebook so that the dataset can be consumed directly by Spark.
• The Databricks notebook activity triggers the Databricks notebook that transforms the dataset and writes it to a processed folder or SQL DW.
For simplicity, this template does not create a scheduled trigger; you can add one if required.

Prerequisites:

  1. Create a blob storage account and a container called sinkdata to be used as the sink. Keep a note of the storage account name, container name, and access key; they will be referenced later in the lab.

  2. Go to Storage Explorer (preview) -> BLOB CONTAINERS -> sinkdata -> + New Folder and fill it in as below:
    a. Name: staged_sink, then click OK

  3. Now upload the Product.csv file into the staged_sink folder. You will find this file locally at "C:\staged_sink".

  4. Ensure you have an Azure Databricks workspace or create a new one.
  • 4.1- Import the Transform notebook below into the Databricks workspace. (It does not need to be in the same location as shown; just remember the path for later use.)
    ACTION REQUIRED – Import the notebook for ETL. Choose to import the notebook from URL and enter the following URL in the URL field: https://adflabstaging1.blob.core.windows.net/share/Transformations.html. Select Import.

  • 4.2- Now update the Transformation notebook with your storage connection information (account name and access key). Go to command 5 in the imported notebook and replace it with a mount snippet such as the sketch shown after this prerequisites list, after substituting your own values. Ensure this is the same storage account created earlier and that it contains the sinkdata container.

  • 4.3- Generate a Databricks access token so that ADF can access Databricks. Save the token for later use when creating the Databricks linked service; it looks something like 'dapi32db32cbb4w6eee18b7d87e45exxxxxx'.
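
For reference, here is a minimal sketch of what command 5 in the notebook might look like after the update in 4.2. The mount point name /mnt/sinkdata and the placeholder values are assumptions for illustration; keep whatever names the imported notebook already uses.

```python
# Databricks notebook cell (command 5) – mount the sink storage container as DBFS.
# Replace the placeholders with the storage account name, container name, and
# access key noted in prerequisite 1. dbutils is predefined in Databricks notebooks.
storage_account_name = "<your-storage-account-name>"   # placeholder
storage_account_key  = "<your-storage-access-key>"     # placeholder
container_name       = "sinkdata"
mount_point          = "/mnt/sinkdata"                  # assumed mount point name

# Mount only if not already mounted, so the cell can be re-run safely.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": storage_account_key
        },
    )
```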


Create Linked Services and datasets

  1. Create new linked services in the ADF UI by going to Connections -> Linked services -> + New
  • 1.1- Source – for accessing source data. You can use the public blob storage containing the source files for this sample.
    Select Azure Blob Storage and use the SAS URI below to connect to the source storage (read-only access). An optional local sanity check of this SAS URI is sketched at the end of this section.
https://storagewithdataformdw.blob.core.windows.net/?sv=2019-02-02&ss=bfqt&srt=sco&sp=rwdlacup&se=2020-05-30T18:40:47Z&st=2020-01-03T10:40:47Z&sip=0.0.0.0-255.255.255.255&spr=https,http&sig=u4%2BQnUWSsVw64B2%2FHDCvPsfRDx9QsLxgJZ0sBRyRCPc%3D


  • 1.2- Sink – for copying data into the destination.
    Create a new linked service and select the storage account created in prerequisite 1 as the sink.

  • 1.3- Databricks – for connecting to the Databricks cluster.
    Create a Databricks linked service using the access token generated in prerequisite 4.3. If you have an interactive cluster, you may select that. (In this case we use the New job cluster option.)


  2. Create datasets
  • 2.1- Add a dataset as shown in the image:
  • 2.2- Create 'sourceAvailability_Dataset' to check if source data is available.
    • 2.2.1- Select Azure Blob Storage as the data store
    • 2.2.2- Select DelimitedText as the format type of your data.
    • 2.2.3- Now name it 'sourceAvailability_Dataset' and set the properties as below:
      Linked service - select sourceBlob_LS
      File path - data/source/_success


  3. Source dataset – for copying the source data (using binary copy)
  • 3.1- Select Azure Blob Storage as the data store

  • 3.2- Select Binary as the format type of your data.

  • 3.3- Now name it 'sourceFiles_Dataset' and set the properties as below:
    Linked service - select sourceBlob_LS
    File path - data/source/Product.csv


  4. Sink dataset – for copying into the sink destination location
  • 4.1- Select Azure Blob Storage as the data store

  • 4.2- Select Binary as the format type of your data.

  • 4.3- Now name it 'sinkRawFiles_Dataset' and set the properties as below:
    Linked service - select sinkBlob_LS
    File path - sinkdata/staged_sink
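
Optionally, you can sanity-check the source SAS URI from step 1.1 locally before running the pipeline. The sketch below uses the azure-storage-blob Python package (not part of the ADF setup); the container name data comes from the dataset file paths above, and the check assumes the SAS token has not expired.

```python
# Optional local check that the source SAS URI can list the sample files.
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

# Paste the full SAS URI from step 1.1 (account URL including the SAS query string).
sas_url = "https://storagewithdataformdw.blob.core.windows.net/?sv=..."  # truncated here

service = BlobServiceClient(account_url=sas_url)
container = service.get_container_client("data")  # container used by the source datasets

# Should list blobs such as source/Product.csv and source/_success.
for blob in container.list_blobs(name_starts_with="source/"):
    print(blob.name)
```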


Create activities

a. Create a Lookup activity named Availability flag for performing the source availability check (either Lookup or Get Metadata can be used). Select the sourceAvailability_Dataset created in step 2.2.

b. Create a Copy activity named file-to-blob for copying the dataset from source to sink. In this case the data is a binary file. Refer to the screenshots below for the source and sink configurations of the copy activity.


c. Define pipeline parameters

inputPath    =  /staged_sink
outputPath   =  /processed_sink
filename     =  Product.csv


d. Create a Databricks notebook activity
• Select the Databricks linked service created in step 1.3

• Configure the settings. Create Base Parameters as shown in the screenshot; these are the parameters passed from ADF to the Databricks notebook. Browse and select the notebook path imported in prerequisite 4.1. (A sketch of how the notebook might read these parameters follows the parameter list below.)

input = @pipeline().parameters.inputPath
output = @pipeline().parameters.outputPath
filename = @pipeline().parameters.filename
pipelineRunid = @pipeline().RunId
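
For orientation, here is a minimal sketch (not the actual code of the imported Transformations notebook) of how these base parameters surface inside the Databricks notebook, assuming the widget names match the parameter names above:

```python
# ADF base parameters arrive in the notebook as widgets; dbutils and spark are
# predefined in Databricks notebooks, so no imports are needed.
input_path  = dbutils.widgets.get("input")          # e.g. /staged_sink
output_path = dbutils.widgets.get("output")         # e.g. /processed_sink
file_name   = dbutils.widgets.get("filename")       # e.g. Product.csv
run_id      = dbutils.widgets.get("pipelineRunid")  # the Data Factory run ID

print(input_path, output_path, file_name, run_id)
```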



• Run the pipeline. You can follow the link to the Databricks logs for more detailed Spark logs.

• You can also verify the output data file using Storage Explorer. (To correlate output with Data Factory pipeline runs, the pipeline run ID from Data Factory is appended to the output folder name; this way you can trace the files generated by each run. A sketch of how the notebook might do this is shown below.)
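
A minimal, self-contained sketch of how the transform-and-write step could append the run ID to the output folder, assuming the /mnt/sinkdata mount from prerequisite 4.2 and the widget names above (the actual transformation logic lives in the imported Transformations notebook):

```python
# Read the staged file, apply transformations, and write the result to a
# run-specific folder so each pipeline run can be traced back.
input_path  = dbutils.widgets.get("input")          # /staged_sink
output_path = dbutils.widgets.get("output")         # /processed_sink
file_name   = dbutils.widgets.get("filename")       # Product.csv
run_id      = dbutils.widgets.get("pipelineRunid")

df = spark.read.option("header", "true").csv(f"/mnt/sinkdata{input_path}/{file_name}")

# ... transformation logic from the imported Transformations notebook goes here ...

(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv(f"/mnt/sinkdata{output_path}/{run_id}/"))
```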