SQLServerCentral Editorial

The Load of Real-Time Data Warehouses


If you have a data warehouse, what do you think your ratio of reads to writes is on any given day? Do you think 1:1, as in one read for each write? Is it 10:1, with 10 reads for each write? 100:1? Do you track this in any way?

One would think that most of the databases we work on in the transactional world have many more reads than writes. I'd have assumed the ratios would be even higher for data warehouses, where loading data that is then queried (read) is the primary use case. After all, I'd expect lots of people querying the data loaded into a warehouse, with relatively few changes to it.

I recently saw a presentation of a paper from Amazon that analyzed Redshift workloads (Why TPC is Not Enough). Redshift is the AWS data warehousing platform, and it has proven very popular as a cloud warehouse in many companies. One interesting finding is that there are more reads than writes, but barely. There is a lot of insert/update/delete activity, which is different from the TPC benchmarks for warehouses (TPC-H and TPC-DS). Those benchmarks tend to be 75+% reads, hence the paper and its analysis.

However, real-world Redshift workloads don't look like those benchmarks. I had a chance to chat with one of the authors and with another ETL specialist, and they noted that the pipelines many companies run today aren't batch jobs; they move data almost constantly. Combine that with relatively few queries from smaller groups of people, and the read-to-write ratio ends up fairly close to 1:1. That was quite surprising to me.

Presumably, some of this is because people run relatively large queries for a report and then spend time analyzing the results while the ETL/ELT processes continue to move data into the warehouse. Much of the design for storing data in warehouses, especially columnar ones, assumes the data doesn't change much. Certainly, columnstore indexes perform best when they are being read, not when they are being updated.

I wonder how much this architecture of constant writes and updates has driven the world toward a Lakehouse architecture, where data is written out and then transformed into cleaner copies, some of them smaller than the original. Often this data is written to separate delta/parquet files, which means new writes usually go to discrete objects rather than a managed database structure. Logically, that makes more sense to me, even though I still think most of us could use a relational structure for warehousing data.

If you have a real-time warehouse being updated, perhaps you want to measure your ratio of reads to writes and possibly rethink how you manage the system. If you don't have a dedicated warehouse, which is true for most of us, you probably like querying the real-time updates of your transactional data in the same place it's stored. Either way, it's interesting to think about the impact of a real-time workload on the performance your users experience.
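
If you're on SQL Server, one rough way to get a sense of that ratio is the index usage DMV. Here's a minimal sketch, assuming the counters accumulated since the last instance restart are representative of a typical day (they reset on restart and on some index operations, so treat the result as approximate):

SELECT
  SUM(us.user_seeks + us.user_scans + us.user_lookups) AS total_reads,
  SUM(us.user_updates) AS total_writes,
  CAST(SUM(us.user_seeks + us.user_scans + us.user_lookups) AS decimal(18,2))
    / NULLIF(SUM(us.user_updates), 0) AS read_write_ratio
FROM sys.dm_db_index_usage_stats AS us
WHERE us.database_id = DB_ID();  -- limit to the current database

A ratio well above 1 suggests a classic query-heavy warehouse; a ratio near 1 suggests your pipelines are writing about as often as your users are reading.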
