Introduction
In the beginning there was a lot of data from social networks, web traffic, sensors and more. The data was without form and void; and darkness was upon the face of the deep. There were terabytes, petabytes, exabytes and zettabytes of data, and everything was chaos.
And Doug Cutting (an open-source advocate and software engineer) said, "Let there be Hadoop to handle this mess": and there was Hadoop, handling all the data across many servers in parallel. With the Hadoop Distributed File System (HDFS), inspired by Google's distributed file system, he could handle large amounts of data.
And Microsoft saw that it was good and contributed to the open-source Hadoop project, spending 6,000 engineering hours per year to improve it. And Microsoft created Azure HDInsight, and joy and happiness reigned ever after.
Requirements
None
Getting Started
Inspired by Google's servers, Doug Cutting decided to create an open-source framework with distributed storage and distributed processing. The idea was to store large amounts of data on cheap computers and so reduce the storage cost. The Google engineers had found a way to make many simple machines work together like one very large server, and that is what makes the lower storage cost possible.
Doug Cutting named the project after his son's yellow toy elephant, Hadoop.
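To make the idea of distributed storage and distributed processing more concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the function names are made up for this illustration, the "blocks" are just slices of a list, and a small thread pool stands in for the servers of a cluster. It only shows the general pattern of splitting the data, processing each block separately and merging the partial results.

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def split_into_blocks(lines, block_size):
    """Cut the input into fixed-size blocks (a stand-in for distributed storage)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: a worker counts the words of its own block only."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(lines, block_size=2, workers=3):
    """Count words by processing the blocks in parallel and merging the results."""
    blocks = split_into_blocks(lines, block_size)
    with Pool(workers) as pool:
        partials = pool.map(map_block, blocks)
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = [
        "Hadoop stores big data",
        "Hadoop processes big data in parallel",
        "cheap machines store the data",
    ]
    print(word_count(sample).most_common(3))
```

Each block could live on a different cheap machine; only the small partial counts have to travel over the network, which is the reason this style of processing scales to large data.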
Azure HDInsight
You can work with Big Data in Azure using Hadoop, and Microsoft simplifies the job for you. You do not have to worry about deploying servers to get a Hadoop environment. You simply pay for the service in Azure and start working.
The cheapest option is the A1 category, which costs approximately 70 USD/month with 70 GB of disk space. One of the more expensive options is the A11, with 362 GB of disk space and 112 GB of RAM, which costs 2,500 USD/month.
With Hadoop you can handle both SQL and non-SQL data. Hadoop provides tools to handle large amounts of structured and unstructured data.
Is it possible to analyze the data with Microsoft Data Mining?
Yes, there are several ways to achieve this. You can import the data from Hadoop into SQL Server, or a database of your preference, using SSIS with the Hive driver, and then work as we explained in chapter 1.
The other way is to create views on top of a Linked Server that points to Hive and work with those views.
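As a rough illustration of the Linked Server approach, the sketch below uses Python only to generate the T-SQL for such a view. It assumes a Linked Server to Hive already exists; the names HIVE, vw_web_logs and web_logs are hypothetical, and the helper function is invented for this example. OPENQUERY is the standard SQL Server function for sending a pass-through query to a linked server.

```python
def build_hive_view_sql(view_name, linked_server, hive_query):
    """Return T-SQL that wraps a Hive query in a SQL Server view via OPENQUERY."""
    # OPENQUERY receives the remote query as a string literal, so every
    # single quote inside the Hive query has to be doubled.
    escaped = hive_query.replace("'", "''")
    return (
        f"CREATE VIEW {view_name} AS\n"
        f"SELECT * FROM OPENQUERY([{linked_server}], '{escaped}');"
    )


if __name__ == "__main__":
    # Hypothetical names: adjust them to your own linked server and Hive table.
    print(build_hive_view_sql(
        "vw_web_logs",
        "HIVE",
        "SELECT page FROM web_logs WHERE country = 'US'",
    ))
```

The generated statement can be run in SQL Server Management Studio; after that, the view can be queried like any local table.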
Hive? What the @@#% is Hive?
Hive is a data warehouse infrastructure for summarizing and analyzing big data in Hadoop. Hive was created by Facebook and uses HiveQL (HQL), a query language similar to SQL, to handle big data.
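To give a feeling for how close HiveQL is to SQL, here is a small sketch. It does not connect to Hive: the query is run against an in-memory SQLite database from the Python standard library, and the table web_logs with its columns is invented for the example. The query itself uses only basic SELECT, GROUP BY and ORDER BY, which HiveQL is expected to accept in the same form, but treat that as an assumption to verify on your own cluster.

```python
import sqlite3

# A summary query written in the SQL subset that HiveQL shares with SQL.
QUERY = """
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
ORDER BY hits DESC, page
"""


def top_pages(rows):
    """Load (page, visitor) rows into a temporary table and summarize them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE web_logs (page TEXT, visitor TEXT)")
        conn.executemany("INSERT INTO web_logs VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("/home", "ana"), ("/about", "bob"), ("/home", "carl")]
    for page, hits in top_pages(sample):
        print(page, hits)
```

In Hive the same kind of statement would be executed in parallel over files stored in Hadoop instead of over a local table.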
What is Pig?
It is an animal of the genus Sus, but in the Hadoop world it is a platform for processing large, distributed data sets in parallel. Its scripts are written in the Pig Latin language. Pig was created by Yahoo.
Azure HDInsight supports Pig and Hive.
What is HBase?
It is a non-relational, distributed database that runs on top of HDFS.
Are Hadoop and the Big Data technologies going to replace traditional relational database and multidimensional technologies like Oracle and SQL Server, including Microsoft Data Mining?
Probably not in the next 10 years, but in the long run we can describe the future with Darwin's theory: individuals less suited to the environment are less likely to survive and less likely to reproduce; individuals more suited to the environment are more likely to survive, reproduce and leave their heritable traits to future generations. This is the process of natural selection.
We can apply Darwin's theory here. If SQL Server and Data Mining do not evolve over time, they may be replaced by new technologies. SQL Server 2016 is designed to be more compatible with these new open-source technologies. Time will tell whether SQL Server survives or not.
Conclusion
Today the volume of stored information is huge, and it is very hard to store and administer all of it. With Hadoop we can organize the information using Hive, Pig or other technologies. Microsoft offers Hadoop as a service in Azure through HDInsight, and the data in that service can be analyzed with Microsoft Data Mining.