I wanted to share a quick resolution I had to a problem with connecting Power Query to data stored in HDFS.
Seems easy enough. Launch Excel and through Power Query select From Other Sources –> From Hadoop Files (HDFS)
Next you’re prompted to provide your Hadoop server name then click OK. I’ll be using the Hortonworks Sandbox environment for this demo.
I then see all the files available on HDFS.
If I try to explore the data in these files by clicking Binary next to any of the files listed I’m presented with the following error: DataSource.Error: HDFS cannot connect to server ‘sandbox.hortonworks.com’. The remote name could not be resolved: ‘sandbox.hortonworks.com’.
The issue here is with the name resolution and is a configuration issue. To fix this you must put your Hadoop cluster details in the HOSTS file on your machine that’s running Power Query.
To edit the HOSTS file (I recommend you take a backup of it first) you’ll need to launch a text editor as administrator then open the HOSTS file found in the following folder:
C:\Windows\System32\drivers\etc
Provide the IP address and host name of the server as shown below then save.
Next time I try to connect to a file on HDFS everything now works properly.
Hope this helps!