As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL on-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources so you are able to use a reporting tool like Power BI to see the results.
The queries are also fast as SQL on-demand will push-down queries from the front-end to the back-end nodes (which contain a SQL Server engine) that sit next to the storage. Note there is no cache yet in SQL on-demand so the queries won’t run faster after the first run. On to my demo:
Working inside Azure Synapse Studio, first, I wrote a T-SQL view that queried all the parquet files in an ADLS Gen2 folder (the key is the OPENROWSET function):
CREATE VIEW [dbo].[vFQ_ADLSTable] AS SELECT ADLSTable.* FROM OPENROWSET( BULK 'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet', FORMAT='PARQUET' ) AS ADLSTable
Notice how I am using a wildcard in the file path to query all parquet files in the folder instead of just one.
Then I wrote a T-SQL view that queried a Spark table called bannedusers in the Spark database users. Even though it is a Spark table and the Spark cluster is not running, I am able to query this table using SQL on-demand because of a feature in Synapse that copies the metadata for the Spark table to SQL on-demand so you can use it outside of Spark (see Azure Synapse Analytics shared metadata):
CREATE VIEW [dbo].[vFQ_SparkTable] AS SELECT SparkTable.* FROM users.dbo.bannedusers AS SparkTable
Then I wrote a T-SQL view that queries a database in Cosmos DB called CosmosIOTDemo and the container bannedcities using Azure Synapse Link for Cosmos DB:
CREATE VIEW [dbo].[vFQ_CosmosTable] AS SELECT CosmosTable.* FROM OPENROWSET ( 'CosmosDB', N'account=synapselinkdemoserra;database=CosmosIOTDemo;region=eastus;key=xxx',bannedcities ) with ( id varchar(36), city varchar(1), cityname varchar(30), banned varchar(1) ) AS CosmosTable
Finally, I created a view using all three above views to join the data together. The parquet files in ADLS Gen2 are twitter feeds, and the Spark table and Cosmos DB table are reference tables that list the users and cities in the twitter feeds that should be excluded from the query (they have to be in the table with the banned field set to Y to not show in the query). While the twitter feed was only about 4000 rows, this query only took 2-3 seconds (you may have to wait an additional 2-3 seconds for the SQL on-demand pool to warm up the first time):
CREATE VIEW [dbo].[vFQ_Combined] AS SELECT t1.*, t2.Banned as UserBanned, t3.Banned as CityBanned FROM vFQ_ADLSTable t1 LEFT JOIN vFQ_SparkTable t2 ON (t1.UserName COLLATE SQL_Latin1_General_CP1_CI_AS = t2.username) LEFT JOIN vFQ_CosmosTable t3 ON (t1.City = t3.cityname) WHERE (t2.banned is NULL or t2.banned = 'N') AND (t3.banned is NULL or t3.banned = 'N')
Then I went to the Develop hub, and under Power BI chose “Power BI datasets” (first you have to link to a Power BI workspace), chose “New Power BI dataset”, chose my SQL on-demand pool, and downloaded the Power BI Desktop Source (.pbids) file that has the connection info to the SQL on-demand database. Once I clicked on that download file, it opened up Power BI Desktop and allowed me to choose the combined query vFQ_Combined. You can choose to import the data or use DirectQuery to get the results in real-time as any file that is dropped into the ADLS Gen2 folder along with the other parquet files will be read.
It appears to the end user that the data is stored in a relational database because the view is creating an abstract layer with the metadata (although because of automatic schema inference you don’t even need to specify the field layouts for a file in ADLS Gen2 or a Spark table).
Data in ADLS Gen2 can be moved to different access tiers to save costs, and eventually when SQL on-demand supports querying the relational SQL Provisioned pool you can archive out older relational data to ADLS Gen2 and use a query on that storage and combine it with a query on the relational database to also save costs and improve performance within the relational database.
As I said, very cool!
The post Synapse and federated queries first appeared on James Serra's Blog.