Introduction
Microsoft announced in September 2018 that SQL Server 2019, which is now in preview, will have a Big Data Cluster deployment option. This is a Big-Data-capable SQL Server with elastic scale and extended Artificial Intelligence (AI) capabilities, mostly as a result of deep, out-of-the-box integration of Hadoop and Spark. The new SQL Server Big Data Cluster is expected to yield a lot more than the ability to employ Hadoop and Spark directly from a SQL Server environment. In particular, this managed ecosystem presents new approaches to data virtualization that allow one to easily integrate data across multiple data sources without needing to move data through ETL processes. It also enables Modern and Logical Data Warehouses with Polyglot Persistence architectures and designs that employ multiple data storage technologies, for example via a data lake.
In a later article we will take a comprehensive and practical look at the pros and cons of the SQL Server Big Data Cluster architecture. In this article, however, we will try to understand the importance and benefits of Hadoop and Spark specifically, and why it was important for SQL Server to become a platform tightly knit with these two technologies.
If you are new to Hadoop and Spark, you are probably wondering what they are. If you are quite familiar with them, you could be wondering why there is a need to integrate them with SQL Server. Well, the major importance of this marriage, in my opinion, has to do with the importance of Structured Query Language (SQL) and the value it brings to Big Data related processing, analytics and application workflows. Fortunately, I have chronicled some of these Big Data technological trends and their evolution in various articles on this forum since 2013. These articles, some of which are listed below, explore the technological journey that culminated in what SQL Server Big Data Cluster is today.
- Big Data for SQL folks: The Technologies (Part I) (2013)
- Hadoop For SQL Folks: Architecture, MapReduce and Powering IoT (2015)
- Distributed Computing Principles and SQL-on-Hadoop Systems (2017)
- SQL-On-Hadoop: Hive (2017)
In these past articles, I touched on trends and also the importance and evolution of various technologies, including Big Data, Distributed Systems, Hadoop, SQL-on-Hadoop, NoSQL, PolyBase, Spark, Data Virtualization, Lambda architecture and Polyglot Persistence. These are trends, technologies and data architecture designs that helped shape, or form part of, the SQL Server unified data platform today.
The SQL and Big Data Saga
With the advent of the Big Data challenges, a lot of new technologies emerged to try to address the capacity to store, process and derive value from the huge datasets available. These datasets often consisted largely of unstructured and semi-structured data. This led to the emergence of NoSQL (not only SQL) non-relational databases, with a lot of proponents even suggesting that Relational Database Management Systems (RDBMSs) and Structured Query Language (SQL) would become obsolete. On the contrary, SQL has emerged stronger, and so have RDBMSs like SQL Server that kept pace with the challenges that Big Data presented.
The importance of SQL in deriving value in the form of intelligence from Big Data today cannot be overemphasized. There is a lot of evidence that SQL is very useful, if not dominant, in Big Data processing and advanced analytic applications. Consider this:
- In 2007 the data team at Facebook sought to build a special SQL framework (Hive, with its HiveQL query language) on top of Hadoop to enable their analysts to analyze their massive datasets with SQL.
- Today, Uber is running 20,000 Hive queries per day and over 10,000 Spark jobs per day on their 100-petabyte Big Data platform.
Q: What do these stories have in common?
A: Big Data and SQL; i.e. the important role SQL plays in Big Data processing.
Q: And what are the technologies and tools at play?
A: Hadoop/Hive and Spark; key technologies leading on this front.
And like these giants (Facebook and Uber), much smaller firms are using these same Big Data and SQL platforms and tools to address their Big Data analytics needs. The rest of this article will explain how SQL has maintained dominance in Big Data processing, how Hadoop and Spark emerged and evolved to dominate the Big Data storage and processing landscape, and why SQL Server strategically benefits from integrating them.
Hadoop, Spark & SQL Server 2019
The section that follows provides a summary of Big Data trends and technological evolution in chronological context, focusing on Hadoop, Spark, and SQL. It outlines Big Data trends, the challenges relational databases faced in handling huge datasets, and how Hadoop emerged as the de facto distributed system for storing and processing Big Data. It also touches on how SQL has stayed relevant and important in analyzing Big Data. Finally, it looks at how Spark has quite recently emerged as the new kid on the block for all things analytical, as far as processing speed and interactive queries are concerned. References are made to some of my previous articles for further reading.
1) Big Data & Relational Databases
Exponential growth in data suddenly meant that many Big Data projects had to deal with very large datasets, from multiple petabytes to exabytes of data. Traditional relational databases by themselves faced a lot of challenges scaling to process these often very large datasets. Mostly designed with single-node computation engines, RDBMSs were found wanting with the emergence of Big Data. Techniques such as horizontal partitioning or sharding, which these systems used to address horizontal scalability challenges, were often complex to set up and still fell short of addressing Big Data challenges.
Read more about some Big Data and relational database challenges and solutions here:
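To make the sharding difficulty a little more concrete, here is a minimal sketch of hash-based horizontal partitioning. It is an illustration only, not how any particular RDBMS implements sharding; the `shard_for` and `moved_keys` helpers and the `customer-N` keys are hypothetical names introduced for this example. It shows one reason such setups get complex: with a naive "hash modulo shard count" scheme, adding a single shard forces most rows to move.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard number using a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_keys(keys, old_shards: int, new_shards: int):
    """Return the keys whose shard assignment changes when the shard count changes."""
    return [
        k for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    ]


# Hypothetical customer keys spread over 4 shards, then re-sharded onto 5.
customer_ids = [f"customer-{i}" for i in range(1000)]
relocated = moved_keys(customer_ids, old_shards=4, new_shards=5)

print(f"{len(relocated)} of {len(customer_ids)} keys move when going from 4 to 5 shards")
```

Schemes such as consistent hashing reduce this data movement, but they add yet more machinery that the application or the DBA has to manage, which is part of why systems designed to be distributed from the start became attractive.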
- Big Data for SQL folks: The Technologies (Part II)
- Distributed Computing Principles and SQL-on-Hadoop Systems
2) Big Data & Distributed Systems
Many projects turned their attention to Distributed Systems as a means of storing and processing Big Data. Distributed Systems are more complex than the traditional one-node computations that relational engines are designed for. They refer to peer-to-peer distributed computing models in which stored data is dispersed across networked computers, such that components located on the various nodes in these clustered environments must communicate, coordinate and interact with each other in order to achieve a common data processing goal. They are able to address Big Data scalability and complexity issues effectively because they are built from the ground up to be aware of their distributed nature.
Read more on Big Data and distributed systems here: Distributed Computing Principles and SQL-on-Hadoop Systems
3) Hadoop Distributed File System (HDFS)
Google revolutionized the industry with its published work on the Google File System and MapReduce, which inspired the open-source Hadoop framework: the Hadoop Distributed File System (HDFS) for Big Data storage and a system, known as MapReduce, for computations on HDFS. HDFS is a distributed, fault-tolerant storage system that can scale to petabytes of data on commodity hardware. A typical file in HDFS could be gigabytes to terabytes in size. HDFS provides high aggregate data bandwidth, can scale to hundreds of nodes in a single cluster, and can support tens of millions of files in a single instance.
Despite initial challenges, Hadoop emerged very quickly to become the de facto Big Data storage system (Distributed System). Hadoop offers some of the lowest costs for massive storage and processing capabilities, due to its ability to scale on inexpensive servers as well as the strong support for various analytical and processing tools being developed on the framework.
Read more about Hadoop and the MapReduce framework here:
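As a rough illustration of how a distributed file system like HDFS achieves fault tolerance on commodity hardware, the sketch below splits a file into fixed-size blocks and stores several replicas of each block on different nodes. This is a simplified model and not HDFS code: the 128 MB block size and replication factor of 3 are common HDFS defaults, but the round-robin placement, the `place_blocks` and `readable_after_failure` helpers, and the node names are assumptions made for this example (real HDFS uses a rack-aware placement policy).

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default block size
REPLICATION = 3      # common HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Round-robin placement is a simplification made for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block still has at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical five-node cluster storing a 1 GB file.
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
placement = place_blocks(file_size_mb=1024, nodes=nodes)

print(len(placement), "blocks")
print("readable with node-2 down:", readable_after_failure(placement, "node-2"))
```

Because each block lives on several inexpensive machines, the loss of any one node does not make the file unreadable, and reads of different blocks can be served by different nodes in parallel.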
- Hadoop For SQL Folks: Architecture, MapReduce and Powering IoT
- Distributed Computing Principles and SQL-on-Hadoop Systems
4) Emergence of the SQL and Big Data Saga
Most Big Data and Hadoop projects also realized a few things:
- A substantial amount of the data they required for insights had structure.
- MapReduce, the primary programming model for the Hadoop framework, was tedious and limiting for analyzing structured data.
- Most of their data analysts and scientists had SQL skills rather than experience with the new MapReduce programming.
Facing this dilemma in 2007, the data team at Facebook sought to build a special SQL abstraction on top of Hadoop (which they called Hive) to enable their analysts, who had strong SQL skills but limited or no Java programming skills, to analyze this data on Hadoop. These realizations also led to the emergence of other SQL-on-Hadoop options that offered good opportunities to leverage SQL to query and perform some Big Data analysis on the Hadoop framework instead of using MapReduce.
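To see why analysts preferred SQL, compare the two styles on the same small aggregation. The sketch below is only an illustration: it runs in a single Python process rather than on a Hadoop cluster, uses SQLite from the standard library as a stand-in for a SQL-on-Hadoop engine, and the `page_views` data and function names are hypothetical. The first half spells out the map, shuffle and reduce phases by hand; the second half expresses the same question as one declarative query.

```python
import sqlite3
from collections import defaultdict

# Hypothetical click data: (user, page visited).
page_views = [
    ("alice", "home"),
    ("bob", "home"),
    ("alice", "search"),
    ("carol", "home"),
    ("bob", "checkout"),
]


# --- MapReduce style: the programmer writes every phase explicitly ---
def map_phase(records):
    """Emit a (page, 1) pair for every page view."""
    for _user, page in records:
        yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


mapreduce_counts = reduce_phase(shuffle(map_phase(page_views)))

# --- SQL style: one declarative statement describes the same result ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", page_views)
sql_counts = dict(conn.execute("SELECT page, COUNT(*) FROM page_views GROUP BY page"))
conn.close()

print(mapreduce_counts)
print(sql_counts == mapreduce_counts)
```

Both approaches return the same counts per page, but the SQL version leaves the "how" to the engine, which is the kind of abstraction that SQL-on-Hadoop systems set out to provide.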
You can read more about Hive here: SQL-On-Hadoop: Hive - Part I.
5) SQL-On-Hadoop
Although SQL queries from Hive and other SQL-on-Hadoop systems eliminated the need for writing verbose Java MapReduce code, under the hood the queries were mostly converted to MapReduce jobs anyway. This made them relatively slow: well suited to batch processing, but not to interactive queries and other low-latency Big Data applications. Most of the SQL-on-Hadoop options available today are still used primarily for batch processing.
You can read more about Hive here: SQL-On-Hadoop: Hive - Part I.
6) Emergence Of Spark
Apache Spark burst onto the scene with unique in-memory capabilities and an architecture able to offer performance up to 10X faster than Hadoop MapReduce and SQL-on-Hadoop systems for certain applications on distributed computing clusters, including Hadoop. Spark achieves this tremendous speed with the help of an advanced execution engine that supports acyclic data flow and in-memory computing.
Unlike other SQL-on-Hadoop options, Spark's SQL option also enables one to integrate complex SQL logic not only into batch processing but also into interactive, streaming and other complex Big Data processes (e.g. text processing, collective intelligence and machine learning) that might previously have required different engines.
In a separate article, we will take a critical look at the Spark framework and the architecture that makes it achieve so much. You can read an introduction to Spark and its architecture here: Distributed Computing Principles and SQL-on-Hadoop Systems.
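The two ideas behind that speed, an acyclic data flow that is only evaluated on demand and the ability to keep intermediate results in memory, can be illustrated with a toy model. The sketch below is plain single-process Python and deliberately not Spark's API; the `LazyDataset` class, its methods and the `read_source` function are hypothetical names made up for this example. It counts how often the (pretend expensive) source is read when two different queries share a cached intermediate result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute       # function that produces the rows on demand
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        """Describe a transformation; nothing is computed yet."""
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        """Describe a filter; nothing is computed yet."""
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        """Ask for this dataset's rows to be kept in memory after first use."""
        self._cache_enabled = True
        return self

    def collect(self):
        """Action: walk the chain of parent datasets and materialize the rows."""
        if self._cache_enabled and self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._cache_enabled:
            self._cached = rows
        return rows


reads = {"count": 0}


def read_source():
    """Pretend this is an expensive scan of files on a distributed file system."""
    reads["count"] += 1
    return list(range(10))


squares = LazyDataset(read_source).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
large = squares.filter(lambda x: x > 20)

reads_before_action = reads["count"]   # still 0: only a plan has been built
evens_result = evens.collect()
large_result = large.collect()

print(evens_result, large_result, "source reads:", reads["count"])
```

In this toy model the second query reuses the in-memory `squares` result instead of going back to the source, whereas a chain of independent batch jobs would rescan the input each time. That reuse is the intuition behind why iterative and interactive workloads benefit from in-memory computing.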
7) SQL Server 2019: Integration of Hadoop & Spark
Over the years, Hadoop and Spark rose to become indispensable tools for Big Data storage and computation. As it turned out, however, Big Data solutions were not one-size-fits-all. By adding support for XML, JSON, in-memory, graph data, and PolyBase, SQL Server kept pace and addressed some Big Data use cases. However, with its roots in a relational engine, a single instance of SQL Server was never designed or built to be a database engine for analytics on the scale of petabytes or exabytes. It also was not designed for scale-out compute for data processing or machine learning, nor for storing and analyzing data in unstructured formats such as media files.
As a result of these inherent limitations, SQL Server 2019 Big Data Cluster has been designed from the ground up to embrace big and unstructured data by integrating Spark and HDFS into a deployment option. By integrating Hadoop and Spark out-of-the-box, and by also allowing for NoSQL as part of this unified platform, SQL Server has positioned itself as a one-stop-shop tool.
Conclusion
In this piece we've tried to understand the importance and the role structured data and SQL play in Big Data architectures, workflows and solution designs. We further tried to establish that this is the major reason why SQL Server benefits from integrating two key technologies, namely Hadoop and Spark.
On one hand, Hadoop emerged as the most prevalent Big Data storage and processing platform. On the other hand, Spark has risen to dominate not only complex batch processing but also interactive, streaming and other complex Big Data processes. In between are relational environments like SQL Server with enhanced Big Data features. These are still the most suitable for managing and querying structured data from Big Data streams, and they offer the most effective capabilities for managing and querying structured entities such as Customers, Accounts, Products, Finance and Marketing campaigns. These entities bring the most value to high-volume Big Data.
Therefore, by deeply integrating Hadoop and Spark, SQL Server Big Data Cluster positions itself as an ecosystem capable of handling various Big Data solution architectures: a tool now capable of end-to-end solutions for various Big Data use cases, delivering a full range of intelligence from reporting to AI at scale.