I have heard people say that the Parallel Data Warehouse (PDW) is Microsoft’s best kept secret. So let me give a 10,000 foot overview on what PDW is and its benefits. Keep in mind the purpose for a PDW is for data warehouses, not OLTP systems.
As opposed to a symmetric multiprocessor system (SMP), which is one server where each CPU in the server shares the same memory and disk, PDW is a massively parallel processing (MPP) solution, which means data is distributed among many independent servers running in parallel and is a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk. A query sent by a user will, behind the scenes, be sent to each server, executed, and the results combined and sent to the user. PDW is designed for data warehouses only, not OLTP applications.
With most data warehouses on SMP’s, the bottleneck is disk IO and not the cpu. With PDW, the appliance is optimized so the cpu’s are fed data from the disks as fast as they can accept data, in large part thanks to DAS. Direct Attached Storage (DAS) is much faster for data warehouse applications (see Performance Tuning SQL Server: Hardware Architectures Throughput Analysis). While SANs can be great for OLTP applications, they are less optimal for data warehouses, in addition they are costly and hard to predict performance. CPU’s can consume 250MB/sec/core but SAN disks can be feeding each cpu only 16GB/s.
That is a quick explanation of just one benefit of a PDW. For more details on this benefit, read What MPP means to SQL Server Parallel Data Warehouse. Here is a list of the many other benefits provided by a PDW over a SMP solution (where the underlined benefits are the additional benefits not found in a SMP/SQL Server 2014 solution):
- Query performance: Expect a 10x-100x increase, which is so important because nowadays there is the expectation of fast queries from users. You can also expect a reasonable linear increase when adding more servers to your PDW. PDW is not just an appliance for “big data”. It can be very useful for small sets of data that need performance.
- Data loading performance: 10-40x faster due to parallel loading of data. Data loading speed is 250Gb/hr per compute node (a half rack of 4 compute nodes gives 1 TB/hour, with minimal query performance impact)
- Scalability (data growth): Start with only a quarter-rack (2 compute servers, 32 cores, 15TB of uncompressed capacity) and grow as needed, up to 7 racks (56 compute servers, 896 cores, 1.2PB of uncompressed capacity. Using a conservative 5:1 compression, data capacity is from 75TB to 6PB. And there is no “forklifting” when you upgrade (backing up and restoring from the old server to the new server). Instead, you add the new servers and the data is automatically redistributed
- Built-in high availability and failover: One fault-tolerant cluster across the whole appliance. Virtualized architecture and no dependency of SAN technologies. Automatic VM migration on host failure. All appliance components are fully redundant (disks, networking, etc).
- PolyBase: Combine relational with non-relational data (Hadoop) using SQL. Hides all the complexity of using Hadoop so most business users do not need to know anything about Hadoop. See PolyBase explained for more details. PolyBase also has the ability to push down portions of the query processing to the Hadoop cluster and allows you to move data faster between the Hadoop and SQL world because of parallel data transfers
- Integration with cloud-born data (Windows Azure HDInsight, Windows Azure blog storage). See What is HDInsight? for more info
- HDInsight integration into the PDW rack
- Improved concurrency because of how quickly queries execute
- Mixed workload support (i.e. no performance issues with queries when a data load is happening)
- Less DBA maintenance: Don’t need to create indexes besides a clustered columnstore index, don’t need to archive/delete data to save space, management simplicity (monitor hardware and software from System Center), don’t need to worry about many normal monitoring/maintenance that happens with a SMP system (blocking, logs, query hints, wait states, IO tuning, query optimization/tuning, index reorgs/rebuilds, managing filegroups, shrinking/expanding databases, managing physical servers, patching servers). DBAs can spend more of their time as architects and not baby sitters
- Limited training needed: If you are already a Microsoft shop, using a PDW is not much different from using a SMP solution
- Use familiar BI tools: If you are already a Microsoft shop, all your familiar tools (i.e. SSRS, PowerPivot, Excel, Power View) work fine against a PDW. The only thing you do differently is enter the IP address and port number of the PDW in the connection string. So you will not have to rewrite and re-implement the many SSRS reports you have created over the years. Plus you can expand your report filters because performance is not a problem anymore (i.e. increase the number of years).
- Improved data compression: 3x-15x more than a SMP system, with 5x being a conservative number. Unique compression because of data distribution across compute nodes
- Consolidation of all your data warehouses and the ability to integrate data sources that you could not before. A centralized data warehouse that is one source for the truth
- Ease of deployment in appliance vs build-your-own: You can deploy in hours, not weeks, thanks to PDW being a turnkey solution complete with hardware and software. It is pre-tested and tuned for your data warehouse
- Data warehouse consolidation: With all the disk space and performance you get with a PDW, you can make it a true enterprise data warehouse by bringing in all the sources, data marts, and other data warehouses into one place. A true “single version of the truth”
- Easy support model: With a PDW you get an integrated support plan with a single Microsoft contact. Whether it’s a problem with the hardware or the software, you just call Microsoft and they will work with the vendor if it’s a hardware issue
If you answer “Yes” to a few of the below questions, a PDW may be right for you:
- Is your data volume growth becoming unmanageable using currently implemented DW technologies? (>20-30% annually)
- Is there a specific Big Data business need (e.g. social media analysis, fraud detection) in a high-priority industry (Retail, Financial, Pub Sec)?
- Is your DW or storage spend consuming a disproportionate and increasing amount of your IT budget?
- Do your business users need to find, combine, and refine structured and unstructured data? Internal and external sources?
- In the near future do you expect to need both on-premise and cloud-based BI capabilities?
- Do you have a need to capture and analyze streaming data? At what scale and velocity?
- Do you currently (or plan to) collect, store, and analyze multiple forms of unstructured data (XML, JSON, CSV, etc.)?
- Are you able to serve your business users’ analytics provisioning and data requests in a timely manner?
- Are you experiencing data management issues such as security or compliance due to business owners (“shadow” IT) creating their own unmanaged data stores?
- Are you trying to build, grow, and manage your next-generation DW without adding new headcount or talent (data scientists, external consultants, etc.)?
There are three vendors that sell PDW: HP, Dell, and Quanta. It comes with a integrated support plan with a single Microsoft contact.
Interested in finding out more about PDW, maybe a demo? If so, shoot me an email!
More info:
Parallel Data Warehouse (PDW) Version 2
Microsoft SQL Server Parallel Data Warehouse (PDW) Explained
Appliance: Parallel Data Warehouse (PDW)
Parallel Data Warehouse Solution Brief
Introduction to PDW (Parallel Data Warehouse)