Run an Apache Drill cluster on Azure HDInsight
Install Apache Drill on Azure HDInsight using a Script Action. A script to install Apache Drill (1.10) is available at:
https://raw.githubusercontent.com/yaron2/hdinsight-drill/master/setup.sh
The script will work on both existing and new HDInsight clusters. To install Apache Drill on a new cluster, perform the following steps:
-
From the Cluster summary blade, select Advanced settings, then Script actions. Click Submit new and choose Custom. Use the following to populate the form:
- NAME: Enter a friendly name for the script action.
- SCRIPT URI: https://raw.githubusercontent.com/yaron2/hdinsight-drill/master/setup.sh
- HEAD: Don't check this option
- WORKER: Check this option
- ZOOKEEPER: Don't check this option
- PARAMETERS: Leave this field blank
-
At the bottom of the Script actions blade, use the Select button to save the configuration. Finally, use the Next button to return to the Cluster summary
-
From the Cluster summary page, select Create to create the cluster.
Drill is installed on all HDInsight Worker Nodes. In order to connect with Drill, you need to obtain the IP address of any Worker node in the HDInsight cluster.
One way to do so is to open the Ambari Cluster Dashboard at HTTPS://CLUSTERNAME.azurehdidnsight.net where CLUSTERNAME is the name of your cluster. Once logged in, click on Hosts, and obtain the ip address of any worker node that begins with wn.
In order to connect to the internal Worker nodes, setup SSH Tunneling and configure your browser as outlined here: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-linux-ambari-ssh-tunnel
-
Connect to the HDInsight cluster using SSH:
-
Connect to one of the Worker nodes using SSH:
ssh USERNAME@WORKER-NODE-IP
-
Start the Drill Shell:
sudo ./var/lib/drill/apache-drill-1.10.0/bin/drill-conf
-
Verify everything's working with a simple SELECT to query the drillbits
SELECT * FROM sys.drillbits;
After establishig the SSH tunneling, obtain an IP address of any Worker node as described above and go to http://node-ip:8047
In order to query data from Azure Blob Storage, you first need to add it to the list of plugins. To do so, connect to the Drill UI and navigate to the Storage page.
Then do the following:
- Click on the Update button for the dfs plugin, and copy its contents.
- Enter a name for your Azure account at the bottom of the page and click the Create button.
- Delete the null value and paste the contents you copied earlier.
- Change "file:///" to "wasb://[email protected]/" and change the container and account name accordingly.
See here for more info on Drill and Azure Blob Storage.
The script supports only hadoop clusters. Other cluster types (Spark, Kafka, Storm, Secure Hadoop etc.) are not supported.