In this module, students will learn how to work with files stored in the data lake and external file sources, through T-SQL statements executed by a serverless SQL pool in Azure Synapse Analytics. Students will query Parquet files stored in a data lake, as well as CSV files stored in an external data store. Next, they will create Azure Active Directory security groups and enforce access to files in the data lake through Role-Based Access Control (RBAC) and Access Control Lists (ACLs).
In this module, the student will be able to:
- Query Parquet data with serverless SQL pools
- Create external tables for Parquet and CSV files
- Create views with serverless SQL pools
- Secure access to data in a data lake when using serverless SQL pools
- Configure data lake security using Role-Based Access Control (RBAC) and Access Control Lists (ACLs)
Module 2 - Run interactive queries using serverless SQL pools
Tailwind Trader's Data Engineers want a way to explore the data lake, transform and prepare data, and simplify their data transformation pipelines. In addition, they want their Data Analysts to explore data in the lake and Spark external tables created by Data Scientists or Data Engineers, using familiar T-SQL language or their favorite tools, which can connect to SQL endpoints.
- You have successfully completed setup to create your lab environment.
- You must have permissions to create new Azure Active Directory security groups and assign members to them.
Understanding data through exploration is one of the core challenges faced today by data engineers and data scientists alike. Depending on the underlying structure of the data, as well as the specific requirements of the exploration process, different data processing engines offer varying degrees of performance, complexity, and flexibility.
In Azure Synapse Analytics, you can use either SQL, Apache Spark for Synapse, or both. Which service you use mostly depends on your personal preference and expertise. When conducting data engineering tasks, both options can be equally valid in many cases. However, there are certain situations where harnessing the power of Apache Spark can help you overcome problems with the source data. This is because in a Synapse Notebook, you can import from a large number of free libraries that add functionality to your environment when working with data. There are other situations where it is much more convenient and faster to use a serverless SQL pool to explore the data, or to expose data in the data lake through a SQL view that can be accessed from external tools, like Power BI.
In this exercise, you will explore the data lake using both options.
When you query Parquet files using serverless SQL pools, you can explore the data with T-SQL syntax.
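The general pattern looks like the following. This is only a minimal sketch; the placeholder storage account, container, and file path must be replaced with the values used in the steps below:

```sql
-- Minimal OPENROWSET pattern for exploring a Parquet file with the Built-in serverless SQL pool.
-- The account, container, and path below are placeholders, not the lab's exact values.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://<data-lake-account>.dfs.core.windows.net/<container>/<path>/<file>.parquet',
        FORMAT = 'PARQUET'
    ) AS [result];
```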
- Open Synapse Studio (https://web.azuresynapse.net/), and then navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231` folder (4). Right-click the `sale-small-20191231-snappy.parquet` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Ensure Built-in is selected (1) in the `Connect to` dropdown list above the query window, then run the query (2). Data is loaded by the serverless SQL endpoint and processed as if it were coming from any regular relational database.

  The cell output shows the query results from the Parquet file.
- Modify the SQL query to perform aggregates and grouping operations to better understand the data. Replace the query with the following, replacing SUFFIX with the unique suffix for your Azure Data Lake store and making sure that the file path in `OPENROWSET` matches the current file path:

  ```sql
  SELECT
      TransactionDate, ProductId,
      CAST(SUM(ProfitAmount) AS decimal(18,2)) AS [(sum) Profit],
      CAST(AVG(ProfitAmount) AS decimal(18,2)) AS [(avg) Profit],
      SUM(Quantity) AS [(sum) Quantity]
  FROM
      OPENROWSET(
          BULK 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02/sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231/sale-small-20191231-snappy.parquet',
          FORMAT='PARQUET'
      ) AS [r]
  GROUP BY r.TransactionDate, r.ProductId;
  ```
- Let's move on from this single file and look at the full data set for 2019. We want to figure out how many records are contained within the Parquet files for all 2019 data. This information is important for planning how we optimize importing the data into Azure Synapse Analytics. To do this, replace the query with the following (be sure to update the suffix of your data lake in the BULK statement):

  ```sql
  SELECT
      COUNT(*)
  FROM
      OPENROWSET(
          BULK 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02/sale-small/Year=2019/*/*/*/*',
          FORMAT='PARQUET'
      ) AS [r];
  ```
Notice how we updated the path to include all Parquet files in all subfolders of `sale-small/Year=2019`. The output should be 4124857 records.
Rather than creating a script with `OPENROWSET` and a path to the root 2019 folder every time we want to query the Parquet files, we can create an external table.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231` folder (4). Right-click the `sale-small-20191231-snappy.parquet` file (5), select New SQL script (6), then Create external table (7).
- Make sure `Built-in` is selected for the SQL pool (1). Under Select a database, select + New and enter `demo` (2). For External table name, enter `All2019Sales` (3). Under Create external table, select Using SQL script (4), then select Create (5).

  Note: Make sure the script is connected to the serverless SQL pool (`Built-in`) (1) and the database is set to `demo` (2).

  The generated script contains the following components (an illustrative sketch of the full script appears after this list):
  - 1) The script begins by creating the `SynapseParquetFormat` external file format with a `FORMAT_TYPE` of `PARQUET`.
  - 2) Next, the external data source is created, pointing to the `wwi-02` container of the data lake storage account.
  - 3) The `CREATE EXTERNAL TABLE` `WITH` statement specifies the file location and refers to the new external file format and data source created above.
  - 4) Finally, we select the top 100 results from the `All2019Sales` external table.
  - 5) You may receive the message `Potential conversion error while reading VARCHAR column 'TransactionId' from UTF8 encoded text. Change database collation to a UTF8 collation or specify explicit column schema in WITH clause and assign UTF8 collation to VARCHAR columns.` If this is the case, add `COLLATE Latin1_General_100_BIN2_UTF8` after `[TransactionId] varchar(8000)` (before the trailing comma) to remove the error. The finished line will read `[TransactionId] varchar(8000) COLLATE Latin1_General_100_BIN2_UTF8,`.
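For reference, the generated script roughly follows the shape below. This is only an illustrative sketch: Synapse Studio derives the exact data source name from your storage account and generates the full column list and data types from the Parquet schema, so your script will differ in those details.

```sql
-- Illustrative sketch of the generated script (names abbreviated, column list truncated)
IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat')
    CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat]
    WITH (FORMAT_TYPE = PARQUET);
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'wwi-02_asadatalakeSUFFIX_dfs_core_windows_net')
    CREATE EXTERNAL DATA SOURCE [wwi-02_asadatalakeSUFFIX_dfs_core_windows_net]
    WITH (LOCATION = 'https://asadatalakeSUFFIX.dfs.core.windows.net/wwi-02');
GO

CREATE EXTERNAL TABLE All2019Sales (
    -- Only a few of the generated columns are shown; types are illustrative
    [TransactionId] varchar(8000) COLLATE Latin1_General_100_BIN2_UTF8,
    [ProductId] int,
    [Quantity] int,
    [ProfitAmount] float,
    [TransactionDate] varchar(8000)
)
WITH (
    LOCATION = 'sale-small/Year=2019/Quarter=Q4/Month=12/Day=20191231/sale-small-20191231-snappy.parquet',
    DATA_SOURCE = [wwi-02_asadatalakeSUFFIX_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
);
GO

SELECT TOP 100 * FROM dbo.All2019Sales;
GO
```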
- Replace the `LOCATION` value in the `CREATE EXTERNAL TABLE` statement with `sale-small/Year=2019/*/*/*/*.parquet`.
- Run the script.
After running the script, we can see the output of the SELECT query against the `All2019Sales` external table. This displays the first 100 records from the Parquet files located in the `Year=2019` folder.
Tailwind Traders found an open data source for country population data that they want to use. They do not want to merely copy the data since it is regularly updated with projected populations in future years.
You decide to create an external table that connects to the external data source.
- Replace the SQL script with the following:

  ```sql
  IF NOT EXISTS (SELECT * FROM sys.symmetric_keys) BEGIN
      declare @pasword nvarchar(400) = CAST(newid() as VARCHAR(400));
      EXEC('CREATE MASTER KEY ENCRYPTION BY PASSWORD = ''' + @pasword + '''')
  END

  CREATE DATABASE SCOPED CREDENTIAL [sqlondemand]
  WITH IDENTITY='SHARED ACCESS SIGNATURE',
  SECRET = 'sv=2018-03-28&ss=bf&srt=sco&sp=rl&st=2019-10-14T12%3A10%3A25Z&se=2061-12-31T12%3A10%3A00Z&sig=KlSU2ullCscyTS0An0nozEpo4tO5JAgGBvw%2FJX2lguw%3D'
  GO

  -- Create external data source secured using credential
  CREATE EXTERNAL DATA SOURCE SqlOnDemandDemo WITH (
      LOCATION = 'https://sqlondemandstorage.blob.core.windows.net',
      CREDENTIAL = sqlondemand
  );
  GO

  CREATE EXTERNAL FILE FORMAT QuotedCsvWithHeader
  WITH (
      FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (
          FIELD_TERMINATOR = ',',
          STRING_DELIMITER = '"',
          FIRST_ROW = 2
      )
  );
  GO

  CREATE EXTERNAL TABLE [population]
  (
      [country_code] VARCHAR (5) COLLATE Latin1_General_BIN2,
      [country_name] VARCHAR (100) COLLATE Latin1_General_BIN2,
      [year] smallint,
      [population] bigint
  )
  WITH (
      LOCATION = 'csv/population/population.csv',
      DATA_SOURCE = SqlOnDemandDemo,
      FILE_FORMAT = QuotedCsvWithHeader
  );
  GO
  ```
At the top of the script, we create a `MASTER KEY` with a random password (1). Next, we create a database-scoped credential for the containers in the external storage account (2), using a shared access signature (SAS) for delegated access. This credential is used when we create the `SqlOnDemandDemo` external data source (3) that points to the location of the external storage account that contains the population data.

Database-scoped credentials are used when any principal calls the OPENROWSET function with a DATA_SOURCE or selects data from an external table that doesn't access public files. The database-scoped credential doesn't need to match the name of the storage account because it is explicitly used in the DATA SOURCE that defines the storage location (an illustrative query using this pattern follows below).

In the next part of the script, we create an external file format called `QuotedCsvWithHeader`. Creating an external file format is a prerequisite for creating an external table. By creating an external file format, you specify the actual layout of the data referenced by an external table. Here we specify the CSV field terminator and string delimiter, and set the `FIRST_ROW` value to 2 since the file contains a header row.

Finally, at the bottom of the script, we create an external table named `population`. The `WITH` clause specifies the relative location of the CSV file, points to the data source created above, and references the `QuotedCsvWithHeader` file format.
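The same credential-backed data source can also be used directly with OPENROWSET. The following query is only an illustrative sketch of that pattern and is not part of the lab script: because it passes DATA_SOURCE, the relative path is resolved against SqlOnDemandDemo and the sqlondemand credential is applied automatically.

```sql
-- Illustrative only: query the same CSV through the credential-backed data source
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'csv/population/population.csv',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2
    )
    WITH (
        [country_code] VARCHAR (5),
        [country_name] VARCHAR (100),
        [year] smallint,
        [population] bigint
    ) AS [r];
```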
- Run the script.

  Please note that there are no data results for this query.
- Replace the SQL script with the following to select from the population external table, filtered to 2019 data where the population is greater than 100 million:

  ```sql
  SELECT [country_code]
      ,[country_name]
      ,[year]
      ,[population]
  FROM [dbo].[population]
  WHERE [year] = 2019 and population > 100000000
  ```
- Run the script.
- In the query results, select the Chart view, then configure it as follows:
  - Chart type: Select `Bar`.
  - Category column: Select `country_name`.
  - Legend (series) columns: Select `population`.
  - Legend position: Select `bottom - center`.
Let's create a view to wrap a SQL query. Views allow you to reuse queries and are needed if you want to use tools, such as Power BI, in conjunction with serverless SQL pools.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `customer-info` folder (4). Right-click the `customerinfo.csv` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Select Run to execute the script (1). Notice that the first row of the CSV file is the column header row (2).
- Update the script with the following, and make sure you replace YOUR_DATALAKE_NAME (1) (your primary data lake storage account) in the OPENROWSET BULK path with the value from the previous select statement. Set the Use database value to `demo` (2) (use the refresh button to the right if needed):

  ```sql
  CREATE VIEW CustomerInfo AS
      SELECT * FROM
      OPENROWSET(
          BULK 'https://YOUR_DATALAKE_NAME.dfs.core.windows.net/wwi-02/customer-info/customerinfo.csv',
          FORMAT = 'CSV',
          PARSER_VERSION='2.0',
          FIRSTROW=2
      )
      WITH (
          [UserName] NVARCHAR (50),
          [Gender] NVARCHAR (10),
          [Phone] NVARCHAR (50),
          [Email] NVARCHAR (100),
          [CreditCard] NVARCHAR (50)
      ) AS [r];
  GO

  SELECT * FROM CustomerInfo;
  GO
  ```
- Select Run to execute the script.

  We just created the view to wrap the SQL query that selects data from the CSV file, then selected rows from the view.

  Notice that the first row no longer contains the column headers. This is because we used the `FIRSTROW=2` setting in the `OPENROWSET` statement when we created the view.
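Because the view now lives in the `demo` database, any tool that connects to the workspace's serverless SQL endpoint (the Built-in pool), such as Power BI, can query it like a regular table. The query below is illustrative only and is not a lab step:

```sql
-- Run against the demo database on the Built-in serverless SQL pool
USE demo;
GO
SELECT TOP 10 [UserName], [Email] FROM dbo.CustomerInfo;
```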
- Within the Data hub, select the Workspace tab (1). Select the actions ellipses (...) to the right of the Databases group (2), then select Refresh (3).
- Expand the `demo` SQL database.

  The database contains the following objects that we created in our earlier steps:

  - 1) External tables: `All2019Sales` and `population`.
  - 2) External data sources: `SqlOnDemandDemo` and `wwi-02_asadatalakeinadayXXX_dfs_core_windows_net`.
  - 3) External file formats: `QuotedCsvWithHeader` and `SynapseParquetFormat`.
  - 4) Views: `CustomerInfo`.
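If you prefer to confirm these objects with T-SQL rather than the object explorer, you can query the catalog views while connected to the `demo` database. This is an illustrative check, not a lab step:

```sql
-- Each query should list the objects created in this exercise
SELECT name FROM sys.external_tables;        -- All2019Sales, population
SELECT name FROM sys.external_data_sources;  -- SqlOnDemandDemo, wwi-02_..._dfs_core_windows_net
SELECT name FROM sys.external_file_formats;  -- QuotedCsvWithHeader, SynapseParquetFormat
SELECT name FROM sys.views WHERE name = 'CustomerInfo';
```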
Tailwind Traders wants to ensure that modifications to sales data can happen only for the current year, while allowing all authorized users to query the entirety of the data. They have a small group of admins who can modify historic data if needed.
- Tailwind Traders should create a security group in AAD, for example called `tailwind-history-owners`, with the intent that all users who belong to this group will have permissions to modify data from previous years.
- The `tailwind-history-owners` security group needs to be assigned to the Azure Storage built-in RBAC role `Storage Blob Data Owner` for the Azure Storage account containing the data lake. This allows AAD users and service principals that are added to this role to modify all data from previous years.
- They need to add the user security principals who will have permissions to modify all historical data to the `tailwind-history-owners` security group.
- Tailwind Traders should create another security group in AAD, for example called `tailwind-readers`, with the intent that all users who belong to this group will have permissions to read all contents of the file system (`prod` in this case), including all historical data.
- The `tailwind-readers` security group needs to be assigned to the Azure Storage built-in RBAC role `Storage Blob Data Reader` for the Azure Storage account containing the data lake. This allows AAD users and service principals that are added to this security group to read all data in the file system, but not to modify it.
- Tailwind Traders should create another security group in AAD, for example called `tailwind-2020-writers`, with the intent that all users who belong to this group will have permissions to modify data only from the year 2020.
- They would create another security group, for example called `tailwind-current-writers`, with the intent that only security groups would be added to this group. This group will have permissions to modify data only from the current year, set using ACLs.
- They need to add the `tailwind-readers` security group to the `tailwind-current-writers` security group.
- At the start of the year 2020, Tailwind Traders would add `tailwind-current-writers` to the `tailwind-2020-writers` security group.
- At the start of the year 2020, on the `2020` folder, Tailwind Traders would set the read, write, and execute ACL permissions for the `tailwind-2020-writers` security group.
- At the start of the year 2021, to revoke write access to the 2020 data, they would remove the `tailwind-current-writers` security group from the `tailwind-2020-writers` group. Members of `tailwind-readers` would continue to be able to read the contents of the file system because they have been granted read and execute (list) permissions not by the ACLs, but by the RBAC built-in role at the level of the file system.
- This approach takes into account that changes to ACLs are not inherited by existing child items, so revoking write access through ACLs alone would require code that traverses all of the content, removing the permission at each folder and file object.
- This approach is relatively fast. RBAC role assignments may take up to five minutes to propagate, regardless of the volume of data being secured.
In this segment, we will create security groups as described above. However, our data set ends in 2019, so we will create a `tailwind-2019-writers` group instead of the 2020 group described above.
- Switch back to the Azure portal (https://portal.azure.com) in a different browser tab, leaving Synapse Studio open.
- Select the Azure menu (1), then select Azure Active Directory (2).
- Select Groups in the left-hand menu.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-history-owners-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-readers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-current-writers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
- Select + New group.
- Select `Security` from Group type. Enter `tailwind-2019-writers-<suffix>` (where `<suffix>` is a unique value, such as your initials followed by two or more numbers) for the Group name, then select Create.
To test out the permissions, we will add our own account to the `tailwind-readers-<suffix>` group.
- Open the newly created `tailwind-readers-<suffix>` group.
- Select Members (1) on the left, then select + Add members (2).
- Add your user account that you are signed into for the lab, then select Select.
- Open the `tailwind-2019-writers-<suffix>` group.
- Select Members (1) on the left, then select + Add members (2).
- Search for `tailwind`, select the `tailwind-current-writers-<suffix>` group, then select Select.
- Select Overview in the left-hand menu, then copy the Object Id.

  Note: Save the Object Id value to Notepad or a similar text editor. It will be used in a later step when you assign access control in the storage account.
- Open the Azure resource group for this lab, which contains the Synapse Analytics workspace.
- Open the default data lake storage account.
- Select Access Control (IAM) in the left-hand menu.
- Select the Role assignments tab.
- Select + Add, then Add role assignment.
- For Role, select `Storage Blob Data Reader`. Search for `tailwind-readers` and select `tailwind-readers-<suffix>` from the results, then select Save.

  Since our user account is added to this group, we will have read access to all files in the blob containers of this account. Tailwind Traders will need to add all users to the `tailwind-readers-<suffix>` security group.

- Select + Add, then Add role assignment.
- For Role, select `Storage Blob Data Owner`. Search for `tailwind` and select `tailwind-history-owners-<suffix>` from the results, then select Save.

  The `tailwind-history-owners-<suffix>` security group is now assigned to the Azure Storage built-in RBAC role `Storage Blob Data Owner` for the Azure Storage account containing the data lake. This allows Azure AD users and service principals that are added to this role to modify all data.

  Tailwind Traders needs to add the user security principals who will have permissions to modify all historical data to the `tailwind-history-owners-<suffix>` security group.

- In the Access Control (IAM) list for the storage account, select your Azure user account under the Storage Blob Data Owner role (1), then select Remove (2).

  Notice that the `tailwind-history-owners-<suffix>` group is assigned to the Storage Blob Data Owner role (3), and `tailwind-readers-<suffix>` is assigned to the Storage Blob Data Reader role (4).

  Note: You may need to navigate back to the resource group, then come back to this screen to see all of the new role assignments.
- Select Storage Explorer (preview) in the left-hand menu (1). Expand CONTAINERS and select the wwi-02 container (2). Open the sale-small folder (3), right-click the Year=2019 folder (4), then select Manage Access... (5).
- Paste the Object Id value you copied from the `tailwind-2019-writers-<suffix>` security group into the Add user, group, or service principal text box, then select Add.
- Now you should see that the `tailwind-2019-writers-<suffix>` group is selected in the Manage Access dialog (1). Check the Access and Default check boxes and the Read, Write, and Execute check boxes for each (2), then select Save.

  Now the security ACLs have been set to allow any users added to the `tailwind-current-writers-<suffix>` security group to write to the `Year=2019` folder, by way of the `tailwind-2019-writers-<suffix>` group. These users can only manage current (2019, in this case) sales files.

  At the start of the following year, to revoke write access to the 2019 data, they would remove the `tailwind-current-writers-<suffix>` security group from the `tailwind-2019-writers-<suffix>` group. Members of `tailwind-readers-<suffix>` would continue to be able to read the contents of the file system because they have been granted read and execute (list) permissions not by the ACLs, but by the RBAC built-in role at the level of the file system.

  Notice that we configured both the access ACLs and the default ACLs in this configuration.

  Access ACLs control access to an object. Files and directories both have access ACLs.

  Default ACLs are templates of ACLs associated with a directory that determine the access ACLs for any child items that are created under that directory. Files do not have default ACLs.

  Both access ACLs and default ACLs have the same structure.
- In Synapse Studio, navigate to the Data hub.
- Select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder (4). Right-click the `sale-small-20161231-snappy.parquet` file (5), select New SQL script (6), then Select TOP 100 rows (7).
- Ensure Built-in is selected (1) in the `Connect to` dropdown list above the query window, then run the query (2). Data is loaded by the serverless SQL pool endpoint and processed as if it were coming from any regular relational database.

  The cell output shows the query results from the Parquet file.
The read permission on the Parquet file, granted to us through the `tailwind-readers-<suffix>` security group, which in turn receives RBAC permissions on the storage account through the Storage Blob Data Reader role assignment, is what enabled us to view the file contents.

However, since we removed our account from the Storage Blob Data Owner role, and we did not add our account to the `tailwind-history-owners-<suffix>` security group, what happens if we try to write to this directory?

Let's give it a try.
- In the Data hub, once again select the Linked tab (1) and expand Azure Data Lake Storage Gen2. Expand the `asaworkspaceXX` primary ADLS Gen2 account (2) and select the `wwi-02` container (3). Navigate to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder (4). Right-click the `sale-small-20161231-snappy.parquet` file (5), select New Notebook (6), then select Load to DataFrame (7).
- Attach your Spark pool to the notebook.
- In the notebook, select +, then </> Code cell underneath Cell 1 to add a new code cell.
- Enter the following in the new cell, then copy the Parquet path from cell 1 and paste the value to replace `REPLACE_WITH_PATH` (1). Rename the Parquet file by adding `-test` to the end of the file name (2):

  ```python
  df.write.parquet('REPLACE_WITH_PATH')
  ```
- Select Run all in the toolbar to run both cells. After a few minutes, when the Spark pool starts and the cells run, you should see the file data in the output from cell 1 (1). However, you should see a 403 error in the output of cell 2 (2).

  As expected, we do not have write permissions. The error returned by cell 2 is `This request is not authorized to perform this operation using this permission.`, with a status code of 403.
- Leave the notebook open and switch back to the Azure portal (https://portal.azure.com) in another tab.
- Select the Azure menu (1), then select Azure Active Directory (2).
- Select Groups in the left-hand menu.
- Type `tailwind` in the search box (1), then select `tailwind-history-owners-<suffix>` in the results (2).
- Select Members (1) on the left, then select + Add members (2).
- Add your user account that you are signed into for the lab, then select Select.
- Switch back to the open Synapse Notebook in Synapse Studio, then Run cell 2 once more (1). You should see a status of Succeeded (2) after a few moments.

  The cell succeeded this time because we added our account to the `tailwind-history-owners-<suffix>` group, which is assigned the Storage Blob Data Owner role.

  Note: If you encounter the same error this time, stop the Spark session on the notebook, then select Publish all, then Publish. After publishing your changes, select your user profile in the top-right corner of the page and log out. Close the browser tab after logout is successful, then re-launch Synapse Studio (https://web.azuresynapse.net/), re-open the notebook, and re-run the cell. This may be needed because you must refresh the security token for the authorization changes to take effect.
Now let's verify that the file was written to the data lake.
- Navigate back to the `sale-small/Year=2016/Quarter=Q4/Month=12/Day=20161231` folder. You should now see a folder for the new `sale-small-20161231-snappy-test.parquet` file we wrote from the notebook (1). If you don't see it listed here, select ... More in the toolbar (2), then select Refresh (3).