With so many product options to choose from for building a big data solution in the cloud, such as SQL Data Warehouse (SQL DW), Azure Analysis Services (AAS), SQL Database (SQL DB), and Azure Data Lake (ADL), there are many ways to combine the products, each with its own pros and cons as well as differences in cost. With many customers looking at using SQL DW, I wanted to mention various reference architectures that I have seen, ordered from highest cost to lowest cost:
- Do staging, data refinement, and reporting all from SQL DW. You can scale compute power up when needed (e.g. during staging, data refinement, or when many users are reporting) or down to save costs (e.g. nights and weekends when user reporting is low); see the scaling sketch after this list. The pros of this option are that reducing the number of technologies gives you a simpler solution and fewer copies of the data. The cons are that since everything is done on SQL DW, you can have performance issues (e.g. doing data refinement while users are reporting), you can hit the SQL DW concurrent query limit, and you can have a higher cost since SQL DW is the highest-cost product, especially if you are not able to pause it. Pausing reduces your compute cost to zero, leaving only storage costs to pay for (see Azure SQL Data Warehouse pricing), but no one can use SQL DW while it is paused.
- Do staging and data refinement in a Hadoop data lake, and then copy all or part of the data to SQL DW for reporting. This saves costs in SQL DW by offloading data refinement, and gives you the benefits of a data lake (see What is a data lake? and Why use a data lake?). You save costs by not having to scale up SQL DW to do the data refinement (scaling up would minimize the impact on reporting performance and refine the data more quickly) and by not having to store as much data in SQL DW. You also save costs by archiving data in the data lake and using PolyBase to access it (be aware these queries could be slow, as PolyBase does not support query pushdown in SQL DW; see the PolyBase sketch after this list). A con of this architecture is having an extra copy of the data along with the extra ETL needed.
- Do staging and data refinement in SQL DW, and copy some or all of the data to one or more data marts (in SQL DB or SQL Server in a VM) and/or one or more cubes (in AAS or SSAS in a VM) for reporting, which is considered a “Hub-and-Spoke” model. Scale down SQL DW after data refinement and use it for a limited number of big queries. This overcomes the SQL DW concurrent query limit by having users query the data mart/cube, and saves costs by running user queries on less expensive products. You also get the benefits that come with a cube, such as a semantic layer and row-level security, which are not available in SQL DW (see Why use a SSAS cube?). This architecture can also be combined with the previous architecture to add in a data lake (one way to hand refined data off to the spokes is sketched after this list). A con of this architecture is having extra copies of the data along with the extra ETL needed.
- Do staging and data refinement in SQL DW, and copy all the data to a data mart (SQL DB or SQL Server in a VM) and/or a cube (AAS or SSAS in a VM) for reporting. Pause SQL DW after the staging and data refinement are done. This is used when giving users access to SQL DW would impact the ELT, when user queries wouldn’t be as responsive as needed, or when cost is a top priority (you only pay for storage costs while SQL DW is paused). The cons of this architecture are having extra copies of the data along with the extra ETL needed, and not having SQL DW available for big queries.
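To make the scale-up/scale-down idea in the first option concrete, here is a minimal T-SQL sketch; the database name MyDW and the specific service objectives are hypothetical, and the statements assume you are connected to the master database on the logical server. Note that pausing itself is done through the Azure portal, PowerShell, or the REST API rather than T-SQL.

```sql
-- Run from the master database on the logical server (MyDW is hypothetical).
-- Scale up before heavy staging/data refinement or peak reporting:
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW1000');

-- Scale back down on nights and weekends when user reporting is low:
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200');
```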
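For the data lake option, here is a hedged sketch of what querying archived data with PolyBase looks like from SQL DW. All names, paths, and columns are hypothetical, and it assumes the archive is a set of Parquet files in Azure Blob Storage (a database scoped credential would also be needed if the storage is not public).

```sql
-- Hypothetical names; assumes archived Parquet files in Azure Blob Storage.
CREATE EXTERNAL DATA SOURCE LakeSource
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://archive@mystorageacct.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

-- Queries on this table scan the files in the lake; since SQL DW does not
-- push the query down, filters are applied only after the data is read,
-- which is why these queries can be slow.
CREATE EXTERNAL TABLE dbo.FactSalesArchive (
    SaleID   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/archive/',
      DATA_SOURCE = LakeSource,
      FILE_FORMAT = ParquetFileFormat);

SELECT SaleDate, SUM(Amount) AS TotalAmount
FROM dbo.FactSalesArchive
GROUP BY SaleDate;
```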
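For the hub-and-spoke options, the copies to the data marts and cubes are usually done with a tool such as Azure Data Factory or SSIS, but one way to hand refined data off from SQL DW is CETAS (CREATE EXTERNAL TABLE AS SELECT), which writes a query result out to storage where a spoke can load it. This sketch reuses the hypothetical external data source and file format from the previous example, and the table and column names are made up.

```sql
-- Export refined data to the lake so a data mart (SQL DB) or cube (AAS)
-- can load it from there; LakeSource and ParquetFileFormat are the
-- hypothetical objects defined in the previous sketch.
CREATE EXTERNAL TABLE dbo.DimCustomerExport
WITH (LOCATION = '/exports/dimcustomer/',
      DATA_SOURCE = LakeSource,
      FILE_FORMAT = ParquetFileFormat)
AS
SELECT CustomerID, CustomerName, Region
FROM dbo.DimCustomer;
```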
More info:
Using SQL Server to Build a Hub-and-Spoke Enterprise Data Warehouse Architecture
Hub-And-Spoke: Building an EDW with SQL Server and Strategies of Implementation
Common ISV application patterns using Azure SQL Data Warehouse
Azure SQL Data Warehouse Workload Patterns and Anti-Patterns