The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations of querying data. Not all of these combinations currently are supported and some have a few quirks of which I list below.
(NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/10 (session info), and Big Data Conference Europe on 11/25 (session info). I hope you can join!)
Relational Database | ADLS Gen2 | Spark Table | Cosmos DB | |
---|---|---|---|---|
SQL Provisioned pool | Y | Y (*1) | N (*2) | N (*6) |
SQL On-demand pool | N (*6) | Y | Y (*3) | N (*4) |
Apache Spark pool | Y | Y | Y | Y (*5) |
*1: This is available via an external table which uses the Polybase technology and does not use push-down queries so can be slow. A feature that will be available after Azure Synapse Analytics goes GA called fast parquet will speed up queries over external tables mapped to parquet files (the technology underneath is the same that is being used for SQL on-demand)
*2: This feature will be available soon, but will only support Spark tables created using Parquet as the input format
*3: This feature only supports Spark tables created using Parquet as the input format (explained here)
*4: This feature is available via the Azure Synapse Link and will soon be available
*5: This feature is available via the Azure Synapse Link
*6: This feature will be available after Azure Synapse Analytics goes GA
A huge benefit of using the public preview version of Azure Synapse Analytics is the ability to query a file in the data lake by simply right-clicking the file. When you do that a menu will pop-up giving you three choices on how to query the data. Currently this works on three different files types: parquet, csv, and JSON (note that you can always query additional file types using a Spark notebook). The three query choices are listed below with all but one currently supported:
Preview | New SQL script -> Select TOP 100 rows | New notebook | |
---|---|---|---|
parquet | N (*3) | Y | Y |
csv | Y | Y (*1) | Y |
JSON | Y | Y (*2) | Y |
“Preview” opens a pop-up window with the contents of the file, “Select TOP 100 rows” opens a tab with a T-SQL SELECT statement using SQL on-demand and then you can run the statement to show the results, and “New notebook” opens a Spark notebook that has PySpark code that you can run to load the data into a Spark in-memory DataFrame and display the rows.
*1: This feature was just added and has a limitation in that the column headers do not show correctly (a feature is being worked on to enable the query to infer the column names if the file has a header row)
*2: replace “varchar(8000)” with “varchar(MAX)” in the SELECT statement if you receive an error when running the query
*3: A work-around is to right-click the file, choose New SQL script -> Bulk load, click the Continue button and you will see a “Preview data” option
The post Query options in Azure Synapse Analytics first appeared on James Serra's Blog.