With the rise of HDInsight and other Hadoop based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts on doing the same work.
In this article, I will focus on using Power BI Desktop to get data from the Hadoop file structure in HDInsight using a Hive query. I will also be using the restaurant data I loaded as noted in the previous posts. If you need to create a cluster and load data I encourage you to check the following blog posts:
- Setting Up an HDInsight Cluster (No Scripting Required)
- Uploading Files to an HDInsight Cluster (No Scripting Required)
These posts walk through the process of creating a cluster and loading up data.
Connecting to HDInsight Using the Hive ODBC Driver
Before you can connect using a Hive query you need to download the Hive ODBC Hive from Microsoft. You can find the driver here: http://www.microsoft.com/en-us/download/details.aspx?id=40886. Once you have the driver installed, the connection can be created.
Open Power BI Desktop and click Get Data on the splash screen. This will open the Get Data dialog. Scroll down until you see the ODBC option. (Do not select a Hadoop or HDInsight option. See my previous post on connecting using HDInsight.)
Click Connect to start the process.
Here is where the “fun” begins. You get no help creating a Hive connection string. It took some searching and trial and error to figure out what was needed to make this happen. Here are the properties you need:
- Driver: Driver={Microsoft Hive ODBC Driver}
- Host: Host=yourHDInsightservername.azurehdinsight.net (Your HDInsight server name)
- Port: Port=443
- Schema: Schema=default (default Hive database schema)
- RowsFetchedPerBlock: RowsFetchedPerBlock=10000 (This is the default)
- HiveServerType: HiveServerType=2 (This is the default)
- AuthMech: AuthMech=6
- This is the Authentication Mechanism which is Windows Azure HDInsight Service.
- DefaultStringColumnLength: DefaultStringColumnLength=200 (Default is 32767, this should always be set lower)
Each property is separated by a semicolon. My completed connection string looked like this (Note: I added spaces to fit better in the post.):
Driver={Microsoft Hive ODBC Driver}; Host=hugheshdinsight.azurehdinsight.net; Port=443;Schema=default; RowsFetchedPerBlock=10000; HiveServerType=2; AuthMech=6; DefaultStringColumnLength=200;
Enter the connection string into the dialog and then you will be prompted for credentials. Use the Database option and set the Username and Password. Then click Connect. In my case, I see three tables in the resultset including the sample table. We have connected to our HDInsight cluster using Hive.
Retrieving Data from HDInsight Using HiveQL
So, getting a list of tables is not really helpful. As you can see, this is the Power Query portion of the Power BI Desktop. Let’s add a HiveQL statement to return only our sales data.
In Applied Steps, click the gear next to Source. This will reopen the From ODBC dialog. Expand the SQL Statement portion and add a SELECT * FROM yourtable to get our desired result set. Click Ok and check the results again. You should see the tablename.fieldname format for column headers. At this point, you can proceed with more data shaping and prep data for other analytics. Click Close and Load when you are done and it will load the data into the Power Pivot designer in Power BI desktop.
I hope you enjoyed this series through HDInsight and Power BI. It was a great learning experience for me.