With the rise of HDInsight and other Hadoop based tools, it is valuable to understand how Power BI can help you take advantage of those big data investments. If you need to set up a cluster to work with, check out my previous posts on Setting Up an HDInsight Cluster and Loading Data Into Your New HDInsight Cluster. These posts show how to do this with no scripting required. If you prefer to script, there are a number of resources with sample scripts on doing the same work.
In this article, I will focus on using Power Query to get data from the Hadoop file structure in HDInsight. I will be using Excel 2013 with the Power Query Add-In. I will also be using the restaurant data I loaded as noted in the three previous posts. If you need to create a cluster and load data I encourage you to check the following blog posts:
- Setting Up an HDInsight Cluster (No Scripting Required)
- Uploading Files to an HDInsight Cluster (No Scripting Required)
These posts walk through the process of creating a cluster and loading up data.
Connecting to HDInsight
First, open a new Excel workbook and click the Power Query tab. Once there, you can find the Azure HDInsight source in the From Other Sources dropdown. Select that option to open the following dialog:
You will need your storage account in order to continue. Then you will need the storage account key. Once you have added the key you will see that the Navigator opened in Excel on the right.
It should show the name of your cluster and the default container name. Double click the container name and it will open the Power Query window. It will show all the files available in the container. Even though we have it organized in folders, the view shows all the files. If you have a large amount of files and you don’t want to scroll to find them, you can click the down arrow on the Folder Path column and use the text filter to find the folder you are looking for.
Now I have the files I want to use in Power Query. If you click the binary link it will open a copy of the file. However, this is not how we want to work with the data as we have multiple files. (If you did this, remove steps up to the Filtered Rows step in the Applied Steps section.) I now have the files I uploaded showing.
In order to work with them all together we need to Combine Binaries.
This now merges all the files into a single dataset. You can now do any data shaping you would like using standard Power Query methods such as updating the column names and the data types. You can even add columns such as a Total Amount column. Here is what my final query looks like including the steps I did. Be sure to give your query a meaningful name. (Note that Power Query realized my Transaction Date column was a date dataype and changed it for me.)
Now click Close & Load and your data will be loaded into Excel. You can change the query to load a model in Power Pivot as well to do additional work with the data.
The next post will walk through using Power BI Desktop to load data using a Hive query.