ETL with Azure Data Factory (Dataflow) – Setup Instructions

Overview

In this tutorial, you'll use the Azure Data Factory user interface (UX) to create a pipeline that copies and transforms data from an Azure Blob Storage source to a Blob Storage sink using mapping data flow. The configuration pattern in this tutorial can be expanded upon when transforming data using mapping data flow.

Also, we will go through the necessary data flow transformations and actions to return a DataFrame that satisfies the following requirements (a rough PySpark sketch of the equivalent logic follows the list):

  • Select just the product_id, category, brand, model, size and price columns

  • Rename product_id to prodID

  • Sort by the price column in descending order

  • Create a new column doublePrice and multiply the price column by 2. This is for demonstration; in reality, you would populate the field using some business logic to make it more meaningful

  • Write the transformed DataFrame to CSV as Product.csv
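
The lab builds these steps visually in the ADF UI, but since mapping data flows execute on Spark clusters, the same logic can be sketched in PySpark. The snippet below is an illustration only; the file paths and SparkSession setup are assumptions, not part of the lab.

```python
# Rough PySpark sketch of the transformations this lab builds in the ADF mapping
# data flow. Paths are illustrative assumptions; the lab does all of this in the UI.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ProductETL").getOrCreate()

# Source: Product.csv, first row as header (what sourceFiles_Dataset points at)
df = spark.read.option("header", True).csv("data/source/Product.csv")

result = (
    df.select("product_id", "category", "brand", "model", "size", "price")  # keep required columns
      .withColumnRenamed("product_id", "prodID")                            # product_id -> prodID
      .withColumn("doublePrice", col("price").cast("float") * 2)            # demonstration column
      .orderBy(col("price").cast("float").desc())                           # sort by price, descending
)

# Sink: write the transformed data back out as CSV (the sinkdata container in the lab)
result.write.mode("overwrite").option("header", True).csv("sinkdata/")
```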

Exercise 1.1 - Create Linked Services and Datasets

  1. In the Azure portal, go to your resource group and click on the Azure Data Factory resource.

  2. Select Author & Monitor to launch the Data Factory UI in a separate tab.

  3. Create new linked services in the ADF UI by going to Connections -> Linked services -> + New
    a. Source – for accessing the source data. You can use the public blob storage containing the source files for this sample.
    Select Blob Storage and use the account key below to connect to the source storage.
Storage Account Name: <-Given in your environment page->
Storage Account Key: <-Given in your environment page->


    b. Sink – for copying data into.
    Create a new linked service. Select the storage account created in the prerequisites as the sink linked service. (A scripted equivalent of both linked services is sketched below for reference.)
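
For reference only, the two linked services this step creates through the UI amount to configuration like the following if scripted with the azure-mgmt-datafactory Python SDK. The resource group, factory name, credentials, and connection string are placeholders or assumptions; the lab itself uses only the UI.

```python
# Hedged sketch: creating the source and sink Blob Storage linked services with the
# Python SDK instead of the UI. All names and credentials below are placeholders.
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureStorageLinkedService, LinkedServiceResource, SecureString)

credential = ClientSecretCredential(
    tenant_id="<tenant-id>", client_id="<client-id>", client_secret="<client-secret>")
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

conn_str = ("DefaultEndpointsProtocol=https;AccountName=<storage-account-name>;"
            "AccountKey=<storage-account-key>;EndpointSuffix=core.windows.net")

# Source linked service (the storage account given on your environment page)
source_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(connection_string=SecureString(value=conn_str)))
adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "sourceBlob_LS", source_ls)

# Sink linked service (the storage account created in the prerequisites)
sink_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(connection_string=SecureString(value=conn_str)))
adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "sinkBlob_LS", sink_ls)
```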

  4. Create datasets
    a. Add a new dataset.
    b. Create 'sourceFiles_Dataset' to check if the source data is available.
      i. Select Azure Blob Storage as the data store.
      ii. Select DelimitedText as the format type of your data.
      iii. Now, name it 'sourceFiles_Dataset' and configure it as below:
        Linked service - select sourceBlob_LS
        File path - data/source/Product.csv
        Select First Row as Header


    c. Sink dataset – for copying into the sink destination location.
      i. Select Azure Blob Storage as the data store.
      ii. Select DelimitedText as the format type of your data.
      iii. Now, name it 'sinkRawFiles_Dataset' and configure it as below:
        Linked service - select sinkBlob_LS
        File path - sinkdata
        Select First Row as Header
    (A scripted equivalent of both datasets is sketched below for reference.)
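
Again for reference only, the two DelimitedText datasets correspond roughly to the following SDK models, continuing the scripted sketch above. The split of the file path into container, folder, and file name, and the exact model parameters, are assumptions; follow the UI instructions for the lab itself.

```python
# Hedged sketch: the source and sink DelimitedText datasets expressed with the
# azure-mgmt-datafactory models. Parameter details are assumptions; prefer the UI.
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation, DatasetResource, DelimitedTextDataset, LinkedServiceReference)

source_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="sourceBlob_LS"),
    location=AzureBlobStorageLocation(
        container="data", folder_path="source", file_name="Product.csv"),  # data/source/Product.csv
    column_delimiter=",",
    first_row_as_header=True))
adf_client.datasets.create_or_update(
    "<resource-group>", "<data-factory-name>", "sourceFiles_Dataset", source_ds)

sink_ds = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="sinkBlob_LS"),
    location=AzureBlobStorageLocation(container="sinkdata"),
    column_delimiter=",",
    first_row_as_header=True))
adf_client.datasets.create_or_update(
    "<resource-group>", "<data-factory-name>", "sinkRawFiles_Dataset", sink_ds)
```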


Exercise 1.2 Build a mapping data flow with six transformations

  1. In the factory top bar, turn the Data Flow debug slider on. Debug mode allows interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, so it's recommended to turn on debug first if you plan to do Data Flow development. For more information, see Debug Mode.

  2. In Factory Resources, select Data flows and add a New Data flow. Select Mapping Data Flow when creating the data flow. Once you create your data flow, you'll be automatically sent to the data flow canvas. In this step, you'll build a data flow that takes the Product.csv file from Blob storage.


  3. In the data flow canvas, add a source by clicking on the Add Source box.

  4. Name your source Source and select sourceFiles_Dataset from the dropdown.

  5. If your debug cluster has started, go to the Data Preview tab of the source transformation and click Refresh to get a snapshot of the data. You can use data preview to verify your transformation is configured correctly.
  6. Next to your source node on the data flow canvas, click on the plus icon to add a new transformation. The first transformation you're adding is a Derived Column. The Derived Column transformation in ADF Data Flows is a multi-use transformation. While it is generally used for writing expressions for data transformation, you can also use it for data type casting, and you can even modify metadata with it.

  7. Name your Derived Column transformation Casting, then add the product_id column and define its expression.

  8. Add the below expression in the Expression field:
    toInteger(product_id)

  9. Add the other columns (product_id, category, brand, model, size, and price). Then add a new column doublePrice and set its expression to toFloat(price). (A rough Spark equivalent of this casting step is sketched below.)
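
For readers who think in DataFrames, the Casting derived column corresponds roughly to the Spark casts below. The sample row is made up purely for illustration; the lab itself uses ADF data flow expressions such as toInteger and toFloat, not PySpark.

```python
# Rough Spark equivalent of the "Casting" derived column (illustration only).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Made-up sample row; columns arrive as strings when read from a CSV source
df = spark.createDataFrame(
    [("1", "categoryA", "brandA", "modelA", "M", "10.50")],
    ["product_id", "category", "brand", "model", "size", "price"])

casted = (
    df.withColumn("product_id", col("product_id").cast("int"))  # toInteger(product_id)
      .withColumn("doublePrice", col("price").cast("float"))    # toFloat(price); replaced by multiply(price, 2) later
)
casted.printSchema()
```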

  10. The next transformation you'll add is a Select transformation under Schema modifier.

  11. Use the Select transformation to rename product_id to prodID. Name your Select transformation UpdateName.

  12. The next transformation you'll add is another Derived Column transformation. You'll use it to set doublePrice by multiplying the price column by 2. This is for demonstration; in reality, you would populate the field using some business logic to make it more meaningful.

  13. Name the transformation Double Price. In the Columns section, select doublePrice.

  14. Add the following expression and save it:

multiply(price, 2)

  15. The next transformation you'll add is a Sort transformation under Row modifier. You'll use it to sort the price column in descending order.

  16. Name the transformation Sort and set the sort condition to the price column in descending order.

  17. Next, add a Sink transformation under Destination.

  18. Name your output stream Sink and select the sinkRawFiles_Dataset dataset from the dropdown.

  19. Click Publish All to save the whole solution.

Now you've finished building your data flow. You're ready to run it in your pipeline.

Exercise 1.3 Running and monitoring the Data Flow

You can debug a pipeline before you publish it. In this step, you're going to trigger a debug run of the data flow pipeline. While data preview doesn't write data, a debug run will write data to your sink destination.

  1. Add a new pipeline under Factory Resources.

  2. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.

  3. In the Adding Data Flow pop-up, select Use existing Data Flow, choose your data flow from the dropdown, and click Finish.

  4. Go to the pipeline canvas. Click Debug to trigger a debug run.

  5. Pipeline debug of Data Flow activities uses the active debug cluster but still takes at least a minute to initialize. You can track the progress via the Output tab. Once the run is successful, click on the eyeglasses icon to open the monitoring pane.
  6. In the monitoring pane, you can see the number of rows and time spent in each transformation step.
  7. Click on a transformation to get detailed information about the columns and partitioning of the data.
  8. Go back to the Azure portal and open Blob Storage -> Containers -> sinkdata. Download the Product.csv output using its URL and check the resulting CSV file. An optional sanity-check script follows below.
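
Purely optional and not part of the lab: if you have Python handy, a short pandas script can confirm the downloaded output meets the requirements listed in the Overview. The local file name is an assumption; use whatever name you saved the download under.

```python
# Optional sanity check of the downloaded output (illustrative; file name is assumed).
import pandas as pd

out = pd.read_csv("Product.csv")

expected = {"prodID", "category", "brand", "model", "size", "price", "doublePrice"}
assert expected.issubset(out.columns)                                # selected + derived columns present
assert "product_id" not in out.columns                               # product_id was renamed to prodID
assert ((out["doublePrice"] - out["price"] * 2).abs() < 1e-6).all()  # doublePrice is price * 2
assert out["price"].is_monotonic_decreasing                          # sorted by price, descending
print(out.head())
```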