Below I have attempted to list the various options for reporting off of Hadoop (HDInsight, HDP, Cloudera) using Power BI Desktop and Excel. Some of the data sources prompt you to choose the Data Connectivity mode of either Import or DirectQuery. For those that don’t you are limited to Import mode:
- Power BI via Microsoft Azure HDInsight Spark (Import or DirectQuery) This is in Beta. This works if HDInsight is using Blob storage or Azure Data Lake Storage (ADLS)
- Power BI via Microsoft Azure HDInsight (Import). This works if HDInsight is connected to Blob storage. If HDInsight is connected to ADLS, use the “Power BI via HDFS” connector instead
- Power BI via Spark (Import or DirectQuery) This is in Beta. This is for HDP/Cloudera and uses a different authentication level then the HDInsight Spark connector. NOTE: Cloudera recommends against using SparkSQL with CDH. SparkSQL uses a component called thrift server which is not supported by Cloudera (see here). For interactive applications they recommend Impala. There strategy is described here.
- Power BI via Impala (Import or DirectQuery). This is in Beta
- Power BI via SQL DW using PolyBase (Import). Note that DirectQuery is an option, but not supported with SQL DW so will import the data instead, which could greatly slow query performance
- Power BI via SQL Server 2016 using PolyBase (Import for HDInsight, DirectQuery creates MapReduce job in cluster for Cloudera/HDP, other sources will import the data instead)
- Power BI via HDFS (Import). This uses WebHDFS.
- Power BI via ODBC (if Hive ODBC, creates Hive job in cluster, so can be slow. For very fast querying, use Hive LLAP ODBC)
- Excel via Data Connection Wizard -> ODBC DSN -> Hive ODBC (Import)
- Excel via SQL Server 2016 using PolyBase (Import)
- Excel via PowerQuery/Hadoop File (HDFS) (Import). This uses WebHDFS
- Excel via PowerQuery/ODBC (Import)
- Excel via PowerQuery/SQL Server 2016 using PolyBase (Import)
- Excel via PowerQuery/Microsoft Azure HDInsight (Import)
“Import” mode requires all data from Hadoop to be copied into Power BI/Excel and then filtered. So import will first load your data (“snapshot”) into a local Power BI model and then run reports against that data, which means that interactive visualizations are not live but rather using data from the last snapshot.
With “DirectQuery” mode, the query is passed down into the data source and only the results are returned back (see Use DirectQuery in Power BI Desktop). So when using DirectQuery you are using live data. Note that Power BI Q&A and Quick Insights is not available for DirectQuery datasets (see here for other limitations).
EXCEPTION: If in Power BI you use the “Edit Queries” option, it will actually try to push down the query if you choose “Import” mode. So the difference in this case from DirectQuery mode is that Import will fallback to running locally if it can’t execute part of the query on the server, while DirectQuery will tell you the query isn’t supported. So an Import can take hours instead of failing fast like DirectQuery will do.
By using PolyBase, you use a CREATE EXTERNAL TABLE to point to the data stored in a Hadoop cluster. Thereafter, the Hadoop data is seen as a regular table, hiding complexity from the end user.
I also wanted to mention that Power BI can connect to Azure Blob Storage and Azure Data Lake Store (beta), and Excel can connect to Azure Blob Storage.
There are many more data sources that Power BI Desktop can connect to.