Blog Post

Data Lakehouse defined

,

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing about what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.

While I agree there may be some uses cases where technical designs may allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.

It’s funny how when Hadoop first came out, I heard many say that the end of relational data warehouses is here and just use a data lake for everything. This was clearly a mistake (see Is the traditional data warehouse dead?). Now we’re here once again!

For simplicity I’ll break down a data lakehouse into two types of architectures: one-tier that is data lake (in the form of using schema-on-read storage), which I’ll call NoEDW, and two-tier that is a data lake and a relational database (in the form of an enterprise data warehouse, or EDW), which I’ll call ProEDW. While Databricks touts NoEDW by using Delta Lake and SQL Analytics, Microsoft touts ProEDW with Azure Synapse Analytics.

For NoEDW, my thought process is, if you are trying to make a data lake work like a relational database, why not just use a relational database (RDBMS)? Then have the data lake do what it is good at, and the RDBMS do what it is good at?

The extra cost, complexity, and time to value in incorporating a relational database into a data lakehouse is worth it for many reasons, one of which is a relational database combines the metadata with the data to make it much easier for self-service BI compared to a data lake where the metadata is separated out from the data in many cases. This becomes even more apparent when you are dealing with data from many different sources.

I can certainly see some uses where you could be fine with a NoEDW: if you have a small amount of data, if the users are all data scientists (hence have advanced technical skills), if you just want to build a POC, or if you want to get a quick win with a report/dashboard. The additional prior reason which was to save costs, is much less now that the storage cost for Synapse has dropped around 80%, so it is about the same cost as data lake storage.

But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources. Having the metadata along with the data in a relational database allows everyone to be on the same page as to what the data actually means, versus more of a wild west with a data lake. And a ProEDW gives you the additional benefits of speed, security, and features that I mentioned at Data Lakehouse & Synapse.

In addition, the NoEDW option requires using Delta Lake, adding a layer of complexity and requiring all tools using the data lake to have to support Delta Lake. Also note that Delta Lake does not support cross-table transactions and that Databricks does not have a pay-per-query approach like Synapse serverless has. The paper also does not discuss how master data management (MDM) fits in, which are almost always relational database solutions.

You definitely do need a data lake (see reasons). The bottom line is you can try to get by with just a NoEDW, but it is very likely that you will run into issues and will need to have some of the data in the data lake copied to a relational database.

Federated queries

What I’m seeing customers do is adopting a lakehouse architecture that goes beyond the data lake and the data warehouse. They do this by using federated queries to integrate data across the data lake, data warehouses, and any purpose-built data services that are being used.

Part of this architecture is making it easy to query data in multiple sources by building out a semantic layer using a distributed query engine like Presto or OPENROWSET in a serverless pool in Synapse. With Synapse you can do a federated query over ADLS Gen2, Spark Tables, and Cosmos DB, and eventually others such as Synapse dedicated pools, SQL Database and SQL Managed Instance. It uses T-SQL and is pay-per-query, so simply create SQL views over the data and they can be called from any tool that supports T-SQL, such as Power BI (see Synapse and federated queries).

A federated query give you this ad-hoc view. So if you want to see what you’ve got in multiple places, you can write a query and can get a result right now. Then think of a SQL view as being able to materialize that result on a continuous basis. Part of thinking about data movement is not just getting data from A to B, but it’s also being able to query data wherever it might live and get a result back in one place.

While using federated queries removes the ETL complexity and staleness, there are many cons, so make sure to read Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement. Querying distributed data falls into the Data Mesh category, a distributed data architecture, the subject of my next blog.

Unified analytics platform

I see Synapse as a great solution for a ProEDW, a unified analytics platform approach, where it incorporates a data lake, a relational data warehouse, spark tables, and tools such as Azure Data Factory, Power BI, and soon Azure Purview all under one roof called Azure Synapse Studio. It also supports federated queries mentioned above via a serverless model.

And eventually Synapse will support multiple serverless pools that can query a relational database, and the ability to have multiple dedicated pools all access the same relational database, allowing for unlimited compute over a relational database just like you currently have with unlimited compute over a data lake.

Synapse also makes supporting machine learning (ML) easier via automated ML for data in a data lake – see Tutorial: Train a machine learning model without code Tutorial: Sentiment analysis with Cognitive Services (Preview) Tutorial: Anomaly detection with Cognitive Services (Preview) and even for relational data – see Tutorial: Machine learning model scoring wizard (preview) for dedicated SQL pools

All these options allow for many ways to transform and move data, so here is my list of those options in order of easiest to hardest to build, less features to more features:

Transforming data options:

  1. *Power BI Data flows
  2. Azure Data Factory Wrangling Data flows
  3. Azure Data Factory Mapping Data flows
  4. T-SQL in serverless or dedicated SQL pools
  5. Synapse Spark
  6. *Databricks

ETL options:

  1. *Power BI Data flows
  2. Azure Data Factory
  3. T-SQL in serverless or dedicated SQL pools (COPY INTO)
  4. T-SQL in serverless or dedicated SQL pools (CETAS/CTAS)
  5. Synapse Spark
  6. *Databricks

* = done outside Synapse

More info:

Data Alone Is Not Enough: The Evolution of Data Architectures

DATA LAKE ARCHITECTURE by Bill Inmon

Advancing Spark – In Response to Bill Inmon

Doing Power BI the Right Way: 4. Power Query in Dataflows or Power BI Desktop

Azure Mythbusters: I don’t need a Data Warehouse now that I have a Data Lake

Rise of the Lakehouse

The post Data Lakehouse defined first appeared on James Serra's Blog.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating