In this module, students will learn how to work with files stored in the data lake and external file sources, through T-SQL statements executed by a serverless SQL pool in Azure Synapse Analytics. Students will query Parquet files stored in a data lake, as well as CSV files stored in an external data store. Next, they will create Azure Active Directory security groups and enforce access to files in the data lake through Role-Based Access Control (RBAC) and Access Control Lists (ACLs).
In this module, the student will be able to:
- Query Parquet data with serverless SQL pools
- Create external tables for Parquet and CSV files
- Create views with serverless SQL pools
- Secure access to data in a data lake when using serverless SQL pools
- Configure data lake security using Role-Based Access Control (RBAC) and Access Control Lists (ACLs)
Module 2 - Run interactive queries using serverless SQL pools
Tailwind Trader's Data Engineers want a way to explore the data lake, transform and prepare data, and simplify their data transformation pipelines. In addition, they want their Data Analysts to explore data in the lake and Spark external tables created by Data Scientists or Data Engineers, using familiar T-SQL language or their favorite tools, which can connect to SQL endpoints.
- You have successfully completed setup to create your lab environment.
- You must have permissions to create new Azure Active Directory security groups and assign members to them.
Understanding data through exploration is one of the core challenges faced today by data engineers and data scientists alike. Depending on the underlying structure of the data, as well as the specific requirements of the exploration process, different data processing engines offer varying degrees of performance, complexity, and flexibility.
In Azure Synapse Analytics, you can use either SQL, Apache Spark for Synapse, or both. Which service you use mostly depends on your personal preference and expertise. When conducting data engineering tasks, both options can be equally valid in many cases. However, there are certain situations where harnessing the power of Apache Spark can help you overcome problems with the source data. This is because in a Synapse Notebook, you can import from a large number of free libraries that add functionality to your environment when working with data. There are other situations where it is much more convenient and faster to use a serverless SQL pool to explore the data, or to expose data in the data lake through a SQL view that can be accessed from external tools, like Power BI.
In this exercise, you will explore the data lake using both options.
When you query Parquet files using serverless SQL pools, you can explore the data with T-SQL syntax.
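The general pattern looks like the following. This is only a minimal sketch; the placeholder storage account, container, and file path must be replaced with the values used in the steps below:

```sql
-- Minimal OPENROWSET pattern for exploring a Parquet file with the Built-in serverless SQL pool.
-- The account, container, and path below are placeholders, not the lab's exact values.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<data-lake-account>.dfs.core.windows.net/<container>/<path>/<file>.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];
```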
- Open Synapse Studio (https://web.azuresynapse.net/), and then navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231` folder (4). Right-click the `sale-small-20191231-snappy.parquet` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Ensure Built-in is selected (1) in the `Connect to` dropdown list above the query window, then run the query (2). Data is loaded by the serverless SQL endpoint and processed as if it were coming from any regular relational database.

  The cell output shows the query results from the Parquet file.
- Modify the SQL query to perform aggregates and grouping operations to better understand the data. Replace the query with the following, replacing SUFFIX with the unique suffix for your Azure Data Lake store and making sure that the file path in `OPENROWSET` matches the current file path:

  ```sql
  SELECT
      TransactionDate, ProductId,
      CAST(SUM(ProfitAmount) AS decimal(18,2)) AS [(sum) Profit],
      CAST(AVG(ProfitAmount) AS decimal(18,2)) AS [(avg) Profit],
      SUM(Quantity) AS [(sum) Quantity]
  FROM
      OPENROWSET(
          BULK 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02/sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231/sale-small-20191231-snappy.parquet',
          FORMAT='PARQUET'
      ) AS [r]
  GROUP BY r.TransactionDate, r.ProductId;
  ```
- Let's move on from this single file and look at the full data set for 2019. We want to figure out how many records are contained within the Parquet files for all 2019 data. This information is important for planning how we optimize importing the data into Azure Synapse Analytics. To do this, replace the query with the following (be sure to update the suffix of your data lake in the BULK statement):

  ```sql
  SELECT
      COUNT(*)
  FROM
      OPENROWSET(
          BULK 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02/sale-small/Year=2019/*/*/*/*',
          FORMAT='PARQUET'
      ) AS [r];
  ```
Notice how we updated the path to include all Parquet files in all subfolders of `sale-small/Year=2019`. The output should be 4124857 records.
Rather than creating a script with `OPENROWSET` and a path to the root 2019 folder every time we want to query the Parquet files, we can create an external table.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231` folder (4). Right-click the `sale-small-20191231-snappy.parquet` file (5), select New SQL script (6), then Create external table (7).
- Make sure `Built-in` is selected for the SQL pool (1). Under Select a database, select + New and enter `demo` (2). For External table name, enter `All2019Sales` (3). Under Create external table, select Using SQL script (4), then select Create (5).

  Note: Make sure the script is connected to the serverless SQL pool (`Built-in`) (1) and the database is set to `demo` (2).

  The generated script contains the following components (an illustrative sketch of the full script appears after this list):
  - 1) The script begins by creating the `SynapseParquetFormat` external file format with a `FORMAT_TYPE` of `PARQUET`.
  - 2) Next, the external data source is created, pointing to the `wwi-02` container of the data lake storage account.
  - 3) The `CREATE EXTERNAL TABLE` `WITH` statement specifies the file location and refers to the new external file format and data source created above.
  - 4) Finally, we select the top 100 results from the `All2019Sales` external table.
  - 5) You may receive the message `Potential conversion error while reading VARCHAR column 'TransactionId' from UTF8 encoded text. Change database collation to a UTF8 collation or specify explicit column schema in WITH clause and assign UTF8 collation to VARCHAR columns.` If this is the case, add `COLLATE Latin1_General_100_BIN2_UTF8` after `[TransactionId] varchar(8000)` (before the trailing comma) to remove the error. The finished line will read `[TransactionId] varchar(8000) COLLATE Latin1_General_100_BIN2_UTF8,`.
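For reference, the generated script roughly follows the shape below. This is only an illustrative sketch: Synapse Studio derives the exact data source name from your storage account and generates the full column list and data types from the Parquet schema, so your script will differ in those details.

```sql
-- Illustrative sketch of the generated script (names abbreviated, column list truncated)
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat')
    CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat]
    WITH (FORMAT_TYPE = PARQUET);
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'wwi-02_asadatalakeSUFFIX_dfs_core_windows_net')
    CREATE EXTERNAL DATA SOURCE [wwi-02_asadatalakeSUFFIX_dfs_core_windows_net]
    WITH (LOCATION = 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02');
GO

CREATE EXTERNAL TABLE All2019Sales (
    -- Only a few of the generated columns are shown; types are illustrative
    [TransactionId] varchar(8000) COLLATE Latin1_General_100_BIN2_UTF8,
    [ProductId] int,
    [Quantity] int,
    [ProfitAmount] float,
    [TransactionDate] varchar(8000)
)
WITH (
    LOCATION = 'sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231/sale-small-20191231-snappy.parquet',
    DATA_SOURCE = [wwi-02_asadatalakeSUFFIX_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
);
GO

SELECT TOP 100 * FROM dbo.All2019Sales;
GO
```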
- Replace the `LOCATION` value in the `CREATE EXTERNAL TABLE` statement with `sale-small/Year=2019/*/*/*/*.parquet`.
- Run the script.
After running the script, we can see the output of the SELECT query against the `All2019Sales` external table. This displays the first 100 records from the Parquet files located in the `Year=2019` folder.
Tailwind Traders found an open data source for country population data that they want to use. They do not want to merely copy the data since it is regularly updated with projected populations in future years.
You decide to create an external table that connects to the external data source.
- Replace the SQL script with the following:

  ```sql
  IF NOT EXISTS (SELECT * FROM sys.symmetric_keys) BEGIN
      declare @pasword nvarchar(400) = CAST(newid() as VARCHAR(400));
      EXEC('CREATE MASTER KEY ENCRYPTION BY PASSWORD = ''' + @pasword + '''')
  END

  CREATE DATABASE SCOPED CREDENTIAL [sqlondemand]
  WITH IDENTITY='SHARED ACCESS SIGNATURE',
  SECRET = 'sv=2018-03-28&ss=bf&srt=sco&sp=rl&st=2019-10-14T12%3A10%3A25Z&se=2061-12-31T12%3A10%3A00Z&sig=KlSU2ullCscyTS0An0nozEpo4tO5JAgGBvw%2FJX2lguw%3D'
  GO

  -- Create external data source secured using credential
  CREATE EXTERNAL DATA SOURCE SqlOnDemandDemo WITH (
      LOCATION = 'https://sqlondemandstorage.blob.core.windows.net',
      CREDENTIAL = sqlondemand
  );
  GO

  CREATE EXTERNAL FILE FORMAT QuotedCsvWithHeader
  WITH (
      FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (
          FIELD_TERMINATOR = ',',
          STRING_DELIMITER = '"',
          FIRST_ROW = 2
      )
  );
  GO

  CREATE EXTERNAL TABLE [population]
  (
      [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
      [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
      [year] smallint,
      [population] bigint
  )
  WITH (
      LOCATION = 'csv/population/population.csv',
      DATA_SOURCE = SqlOnDemandDemo,
      FILE_FORMAT = QuotedCsvWithHeader
  );
  GO
  ```
At the top of the script, we create a `MASTER KEY` with a random password (1). Next, we create a database-scoped credential for the containers in the external storage account (2), using a shared access signature (SAS) for delegated access. This credential is used when we create the `SqlOnDemandDemo` external data source (3) that points to the location of the external storage account that contains the population data.

Database-scoped credentials are used when any principal calls the OPENROWSET function with a DATA_SOURCE or selects data from an external table that doesn't access public files. The database-scoped credential doesn't need to match the name of the storage account because it is explicitly used in the DATA SOURCE that defines the storage location (an illustrative query using this pattern follows below).

In the next part of the script, we create an external file format called `QuotedCsvWithHeader`. Creating an external file format is a prerequisite for creating an external table. By creating an external file format, you specify the actual layout of the data referenced by an external table. Here we specify the CSV field terminator and string delimiter, and set the `FIRST_ROW` value to 2 since the file contains a header row.

Finally, at the bottom of the script, we create an external table named `population`. The `WITH` clause specifies the relative location of the CSV file, points to the data source created above, and references the `QuotedCsvWithHeader` file format.
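The same credential-backed data source can also be used directly with OPENROWSET. The following query is only an illustrative sketch of that pattern and is not part of the lab script: because it passes DATA_SOURCE, the relative path is resolved against SqlOnDemandDemo and the sqlondemand credential is applied automatically.

```sql
-- Illustrative only: query the same CSV through the credential-backed data source
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'csv/population/population.csv',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    )
    WITH (
        [country_code] VARCHAR (5),
        [country_name] VARCHAR (100),
        [year] smallint,
        [population] bigint
    ) AS [r];
```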
- Run the script.

  Please note that there are no data results for this query.
- Replace the SQL script with the following to select from the population external table, filtered to 2019 data where the population is greater than 100 million:

  ```sql
  SELECT [country_code]
      ,[country_name]
      ,[year]
      ,[population]
  FROM [dbo].[population]
  WHERE [year] = 2019 and population > 100000000
  ```
- Run the script.
- In the query results, select the Chart view, then configure it as follows:
  - Chart type: Select `Bar`.
  - Category column: Select `country_name`.
  - Legend (series) columns: Select `population`.
  - Legend position: Select `bottom - center`.
Let's create a view to wrap a SQL query. Views allow you to reuse queries and are needed if you want to use tools, such as Power BI, in conjunction with serverless SQL pools.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `customer-info` folder (4). Right-click the `customerinfo.csv` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Select Run to execute the script (1). Notice that the first row of the CSV file is the column header row (2).
- Update the script with the following, and make sure you replace YOUR_DATALAKE_NAME (1) (your primary data lake storage account) in the OPENROWSET BULK path with the value from the previous select statement. Set the Use database value to `demo` (2) (use the refresh button to the right if needed):

  ```sql
  CREATE VIEW CustomerInfo AS
      SELECT * FROM
      OPENROWSET(
          BULK 'https://YOUR_DATALAKE_NAME.dfs.core.windows.net/wwi-02/customer-info/customerinfo.csv',
          FORMAT = 'CSV',
          PARSER_VERSION='2.0',
          FIRSTROW=2
      )
      WITH (
          [UserName] NVARCHAR (50),
          [Gender] NVARCHAR (10),
          [Phone] NVARCHAR (50),
          [Email] NVARCHAR (100),
          [CreditCard] NVARCHAR (50)
      ) AS [r];
  GO

  SELECT * FROM CustomerInfo;
  GO
  ```
- Select Run to execute the script.

  We just created the view to wrap the SQL query that selects data from the CSV file, then selected rows from the view.

  Notice that the first row no longer contains the column headers. This is because we used the `FIRSTROW=2` setting in the `OPENROWSET` statement when we created the view.
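Because the view now lives in the `demo` database, any tool that connects to the workspace's serverless SQL endpoint (the Built-in pool), such as Power BI, can query it like a regular table. The query below is illustrative only and is not a lab step:

```sql
-- Run against the demo database on the Built-in serverless SQL pool
USE demo;
GO
SELECT TOP 10 [UserName], [Email] FROM dbo.CustomerInfo;
```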
- Within the Data hub, select the Workspace tab (1). Select the actions ellipses (...) to the right of the Databases group (2), then select Refresh (3).
- Expand the `demo` SQL database.

  The database contains the following objects that we created in our earlier steps:

  - 1) External tables: `All2019Sales` and `population`.
  - 2) External data sources: `SqlOnDemandDemo` and `wwi-02_asadatalakeinadayXXX_dfs_core_windows_net`.
  - 3) External file formats: `QuotedCsvWithHeader` and `SynapseParquetFormat`.
  - 4) Views: `CustomerInfo`.
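If you prefer to confirm these objects with T-SQL rather than the object explorer, you can query the catalog views while connected to the `demo` database. This is an illustrative check, not a lab step:

```sql
-- Each query should list the objects created in this exercise
SELECT name FROM sys.external_tables;        -- All2019Sales, population
SELECT name FROM sys.external_data_sources;  -- SqlOnDemandDemo, wwi-02_..._dfs_core_windows_net
SELECT name FROM sys.external_file_formats;  -- QuotedCsvWithHeader, SynapseParquetFormat
SELECT name FROM sys.views WHERE name = 'CustomerInfo';
```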
Tailwind Traders wants to ensure that modifications to sales data can happen only for the current year, while allowing all authorized users to query the entirety of the data. They have a small group of admins who can modify historic data if needed.
- Tailwind Traders should create a security group in AAD, for example called `tailwind-history-owners`, with the intent that all users who belong to this group will have permissions to modify data from previous years.
- The `tailwind-history-owners` security group needs to be assigned to the Azure Storage built-in RBAC role `Storage Blob Data Owner` for the Azure Storage account containing the data lake. This allows AAD users and service principals that are added to this role to modify all data from previous years.
- They need to add the user security principals who will have permissions to modify all historical data to the `tailwind-history-owners` security group.
- Tailwind Traders should create another security group in AAD, for example called `tailwind-readers`, with the intent that all users who belong to this group will have permissions to read all contents of the file system (`prod` in this case), including all historical data.
- The `tailwind-readers` security group needs to be assigned to the Azure Storage built-in RBAC role `Storage Blob Data Reader` for the Azure Storage account containing the data lake. This allows AAD users and service principals that are added to this security group to read all data in the file system, but not to modify it.
- Tailwind Traders should create another security group in AAD, for example called `tailwind-2020-writers`, with the intent that all users who belong to this group will have permissions to modify data only from the year 2020.
- They would create another security group, for example called `tailwind-current-writers`, with the intent that only security groups would be added to this group. This group will have permissions to modify data only from the current year, set using ACLs.
- They need to add the `tailwind-readers` security group to the `tailwind-current-writers` security group.
- At the start of the year 2020, Tailwind Traders would add `tailwind-current-writers` to the `tailwind-2020-writers` security group.
- At the start of the year 2020, on the `2020` folder, Tailwind Traders would set the read, write, and execute ACL permissions for the `tailwind-2020-writers` security group.
- At the start of the year 2021, to revoke write access to the 2020 data, they would remove the `tailwind-current-writers` security group from the `tailwind-2020-writers` group. Members of `tailwind-readers` would continue to be able to read the contents of the file system because they have been granted read and execute (list) permissions not by the ACLs, but by the RBAC built-in role at the level of the file system.
- This approach takes into account that changes to ACLs are not inherited by existing child items, so revoking write access through ACLs alone would require code that traverses all of the content, removing the permission at each folder and file object.
- This approach is relatively fast. RBAC role assignments may take up to five minutes to propagate, regardless of the volume of data being secured.
In this segment, we will create security groups as described above. However, our data set ends in 2019, so we will create a `tailwind-2019-writers` group instead of the 2020 group described above.
- Switch back to the Azure portal (https://portal.azure.com) in a different browser tab, leaving Synapse Studio open.
- Select the Azure menu (1), then select Azure Active Directory (2).
- Select Groups in the left-hand menu.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-history-owners-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-readers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-current-writers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-2019-writers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
To test out the permissions, we will add our own account to the `tailwind-readers-<suffix>` group.
- Open the newly created `tailwind-readers-<suffix>` group.
- Select Members (1) on the left, then select + Add members (2).
- Add your user account that you are signed into for the lab, then select Select.
- Open the `tailwind-2019-writers-<suffix>` group.
- Select Members (1) on the left, then select + Add members (2).
- Search for `tailwind`, select the `tailwind-current-writers-<suffix>` group, then select Select.
- Select Overview in the left-hand menu, then copy the Object Id.

  Note: Save the Object Id value to Notepad or a similar text editor. It will be used in a later step when you assign access control in the storage account.
- Open the Azure resource group for this lab, which contains the Synapse Analytics workspace.
- Open the default data lake storage account.
- Select Access Control (IAM) in the left-hand menu.
- Select the Role assignments tab.
- Select + Add, then Add role assignment.
- For Role, select `Storage Blob Data Reader`. Search for `tailwind-readers` and select `tailwind-readers-<suffix>` from the results, then select Save.

  Since our user account is added to this group, we will have read access to all files in the blob containers of this account. Tailwind Traders will need to add all users to the `tailwind-readers-<suffix>` security group.

- Select + Add, then Add role assignment.
- For Role, select `Storage Blob Data Owner`. Search for `tailwind` and select `tailwind-history-owners-<suffix>` from the results, then select Save.

  The `tailwind-history-owners-<suffix>` security group is now assigned to the Azure Storage built-in RBAC role `Storage Blob Data Owner` for the Azure Storage account containing the data lake. This allows Azure AD users and service principals that are added to this role to modify all data.

  Tailwind Traders needs to add the user security principals who will have permissions to modify all historical data to the `tailwind-history-owners-<suffix>` security group.

- In the Access Control (IAM) list for the storage account, select your Azure user account under the Storage Blob Data Owner role (1), then select Remove (2).

  Notice that the `tailwind-history-owners-<suffix>` group is assigned to the Storage Blob Data Owner role (3), and `tailwind-readers-<suffix>` is assigned to the Storage Blob Data Reader role (4).

  Note: You may need to navigate back to the resource group, then come back to this screen to see all of the new role assignments.
- Select Storage Explorer (preview) in the left-hand menu (1). Expand CONTAINERS and select the wwi-02 container (2). Open the sale-small folder (3), right-click the Year=2019 folder (4), then select Manage Access... (5).
- Paste the Object Id value you copied from the `tailwind-2019-writers-<suffix>` security group into the Add user, group, or service principal text box, then select Add.
- Now you should see that the `tailwind-2019-writers-<suffix>` group is selected in the Manage Access dialog (1). Check the Access and Default check boxes and the Read, Write, and Execute check boxes for each (2), then select Save.

  Now the security ACLs have been set to allow any users added to the `tailwind-current-writers-<suffix>` security group to write to the `Year=2019` folder, by way of the `tailwind-2019-writers-<suffix>` group. These users can only manage current (2019, in this case) sales files.

  At the start of the following year, to revoke write access to the 2019 data, they would remove the `tailwind-current-writers-<suffix>` security group from the `tailwind-2019-writers-<suffix>` group. Members of `tailwind-readers-<suffix>` would continue to be able to read the contents of the file system because they have been granted read and execute (list) permissions not by the ACLs, but by the RBAC built-in role at the level of the file system.

  Notice that we configured both the access ACLs and the default ACLs in this configuration.

  Access ACLs control access to an object. Files and directories both have access ACLs.

  Default ACLs are templates of ACLs associated with a directory that determine the access ACLs for any child items that are created under that directory. Files do not have default ACLs.

  Both access ACLs and default ACLs have the same structure.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder (4). Right-click the `sale-small-20161231-snappy.parquet` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Ensure Built-in is selected (1) in the `Connect to` dropdown list above the query window, then run the query (2). Data is loaded by the serverless SQL pool endpoint and processed as if it were coming from any regular relational database.

  The cell output shows the query results from the Parquet file.
The read permission on the Parquet file, granted to us through the `tailwind-readers-<suffix>` security group, which in turn receives RBAC permissions on the storage account through the Storage Blob Data Reader role assignment, is what enabled us to view the file contents.

However, since we removed our account from the Storage Blob Data Owner role, and we did not add our account to the `tailwind-history-owners-<suffix>` security group, what happens if we try to write to this directory?

Let's give it a try.
- In the Data hub, once again select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder (4). Right-click the `sale-small-20161231-snappy.parquet` file (5), select New Notebook (6), then select Load to DataFrame (7).
- Attach your Spark pool to the notebook.
- In the notebook, select +, then </> Code cell underneath Cell 1 to add a new code cell.
- Enter the following in the new cell, then copy the Parquet path from cell 1 and paste the value to replace `REPLACE_WITH_PATH` (1). Rename the Parquet file by adding `-test` to the end of the file name (2):

  ```python
  df.write.parquet('REPLACE_WITH_PATH')
  ```
- Select Run all in the toolbar to run both cells. After a few minutes, when the Spark pool starts and the cells run, you should see the file data in the output from cell 1 (1). However, you should see a 403 error in the output of cell 2 (2).

  As expected, we do not have write permissions. The error returned by cell 2 is `This request is not authorized to perform this operation using this permission.`, with a status code of 403.
- Leave the notebook open and switch back to the Azure portal (https://portal.azure.com) in another tab.
- Select the Azure menu (1), then select Azure Active Directory (2).
- Select Groups in the left-hand menu.
- Type `tailwind` in the search box (1), then select `tailwind-history-owners-<suffix>` in the results (2).
- Select Members (1) on the left, then select + Add members (2).
- Add your user account that you are signed into for the lab, then select Select.
- Switch back to the open Synapse Notebook in Synapse Studio, then Run cell 2 once more (1). You should see a status of Succeeded (2) after a few moments.

  The cell succeeded this time because we added our account to the `tailwind-history-owners-<suffix>` group, which is assigned the Storage Blob Data Owner role.

  Note: If you encounter the same error this time, stop the Spark session on the notebook, then select Publish all, then Publish. After publishing your changes, select your user profile in the top-right corner of the page and log out. Close the browser tab after logout is successful, then re-launch Synapse Studio (https://web.azuresynapse.net/), re-open the notebook, and re-run the cell. This may be needed because you must refresh the security token for the authorization changes to take effect.
Now let's verify that the file was written to the data lake.
- Navigate back to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder. You should now see a folder for the new `sale-small-20161231-snappy-test.parquet` file we wrote from the notebook (1). If you don't see it listed here, select ... More in the toolbar (2), then select Refresh (3).