Many companies are seeing the value in collecting data to help them make better business decisions. When building a solution in Azure to collect the data, nearly everyone is using a data lake. A majority of those are also using delta lake, which is basically a software layer over a data lake that gives additional features. I have yet to see anyone using competing technologies to delta lake in Azure, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).
The reasons most are using delta lake is because of the following features that delta lake provides over just using a data lake (with supporting the MERGE statement the biggest one):
- ACID transactions
- Time travel (data versioning enables rollbacks, audit trail)
- Streaming and batch unification
- Schema enforcement
- Supports commands
DELETE
,UPDATE
, andMERGE
- Performance improvement
Fortunately most Azure products now support delta lake, such as:
- Mapping data flows in Azure Data Factory/ Azure Synapse (see Transform data in delta lake using mapping data flows)
- Azure Synapse serverless SQL pool via OPENROWSET (see Query Delta Lake files using serverless SQL pool in Azure Synapse Analytics and How to query your Delta Lake with Azure Synapse SQL pool). A serverless SQL pool can read delta Lake files that are created using Apache Spark, Azure Databricks, or any other producer of the Delta Lake format. However, be aware of the limitations and known issues that you might see in delta lake support in serverless SQL pools
- Azure Synapse Spark pool
However, some products or features do not support delta lake (at least not yet), so I wanted to make you aware of those:
- Copy activity in Azure Data Factory/Azure Synapse Pipelines, unless you use a Databricks cluster (see Copy data to and from Azure Databricks Delta Lake using Azure Data Factory or Azure Synapse Analytics)
- Azure Synapse dedicated SQL pool when using external tables and PolyBase (see Use external tables with Synapse SQL)
- Azure Synapse database templates
Serverless SQL pools do not support updating delta lake files. Use Azure Databricks or Apache Spark pools in Azure Synapse Analytics to update Delta Lake.
Within Power BI, there is a connector for Synapse (called “Azure Synapse Analytics SQL”) that can connect to an Azure Synapse serverless SQL pool, which can have a view that queries a delta table. However, you are limited to the compute offered by the serverless pool, and if that does not give you the performance you need or if you want direct control on the ability to scale up, you might want to look to instead use the “Azure Databricks” connector which will give you more compute (see Connecting Power BI to Azure Databricks). Note there is a new “Azure Synapse Analytics workspace (Beta)” connector in Power BI that can also query a delta table (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query and Azure Synapse Analytics workspace (Beta)), but that is also using serverless SQL pool compute and not Spark pool compute.
Note that an Azure Synapse serverless SQL pool can access data in a data lake, delta lake, and data in a spark table, called Lake database (but only if in Parquet or CSV format and NOT in delta lake format – see Azure Synapse Analytics shared metadata tables). An Azure Synapse Spark pool can access data in a data lake, delta lake, and a Lake database (any format, including delta lake). So if you are using a Lake database that is built on the delta lake format, you would not be able to use an Azure Synapse serverless SQL pool to query it, only a Azure Synapse Spark pool. Which also means if you are using the “Azure Synapse Analytics workspace (Beta)” connector in Power BI, you won’t see it display Lake database tables built on the delta lake format to connect to.
More info:
Exploring Delta Lake in Azure Synapse Analytics
The post Azure Synapse and Delta Lake first appeared on James Serra's Blog.