HDInsight is Microsoft’s Hadoop PaaS offering on Azure. Microsoft have partnered with Hortonworks to bring a hosted version of the Hortonworks Data Platform to the Azure service, which is great, but there’s more. The really clever thing they’ve done is to essentially replace (or at least sideline) the HDFS part of Hadoop and insert Azure Storage blobs in its place. This means your cluster doesn’t need to be persistent: all your data lives off the cluster, but you still get the benefits of distributed compute, and something pretty close to the data-localisation benefits of regular Hadoop. You can use the nice cheap redundant storage all the time, and only pay for huge compute nodes when you need them. But how do you manage turning clusters on and off all the time?
I do a lot of periodic testing with my clusters, but don’t have the load to keep them running all the time. After a while, going through the Azure portal every time I needed to build a cluster just got annoying. Fortunately, there are a number of ways to automate cluster creation. I tend to swap a lot between Mac, Linux and Windows machines, so I need something that works consistently everywhere. So my first port of call is the Node.js-based Azure Cross-Platform Command Line Tools (catchy, no?).
These are pretty easy to set up:
npm install -g azure-cli
Once you’ve got the tools, you’ll need to connect them to your Azure account.
azure account download
brings up the right part of the portal; log in, let it do its thing, and it will download a .publishsettings file.
azure account import that_file_it_downloaded.publishsettings
and you’re good to go. Once everything’s hooked up and it knows who you are, you can create a little script and put it somewhere sensible like ~/bin. I use this:
azure hdinsight cluster create \
  --storageAccountKey "STORAGE_ACCOUNT_KEY" \
  --storageContainer "cluster" \
  --storageAccountName "STORAGE_ACCOUNT_NAME.blob.core.windows.net" \
  --clusterName "CLUSTER_NAME" \
  --location "North Europe" \
  --username "CLUSTER_ADMIN_USER" \
  --clusterPassword "CLUSTER_PASSWORD" \
  --nodes 4 \
  --subscription "SUBSCRIPTION_ID"
Then I just run the script whenever I need a cluster. To take the cluster down I just go for
azure hdinsight cluster delete "CLUSTER_NAME"
I tend to keep that incantation in a script file as well, just to save on personal memory space.
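If it helps, both incantations can live together in one small wrapper. Here’s a minimal sketch of what that script might look like: the name hdcluster and the placeholder values (CLUSTER_NAME and friends) are my own assumptions, so substitute your own details before using it.

```shell
#!/bin/sh
# Hypothetical ~/bin/hdcluster wrapper (a sketch, not a prescribed tool):
#   hdcluster up    -> create the cluster
#   hdcluster down  -> delete it
# All the capitalised values below are placeholders.
CLUSTER_NAME="CLUSTER_NAME"

case "$1" in
  up)
    azure hdinsight cluster create \
      --storageAccountKey "STORAGE_ACCOUNT_KEY" \
      --storageContainer "cluster" \
      --storageAccountName "STORAGE_ACCOUNT_NAME.blob.core.windows.net" \
      --clusterName "$CLUSTER_NAME" \
      --location "North Europe" \
      --username "CLUSTER_ADMIN_USER" \
      --clusterPassword "CLUSTER_PASSWORD" \
      --nodes 4 \
      --subscription "SUBSCRIPTION_ID"
    ;;
  down)
    azure hdinsight cluster delete "$CLUSTER_NAME"
    ;;
  *)
    # No argument (or an unknown one): just remind me of the usage.
    echo "usage: hdcluster up|down"
    ;;
esac
```

Then `hdcluster up` before a test session and `hdcluster down` afterwards is all there is to remember.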
If you want to dig into this, it’s well worth looking at the built-in help for the Node Azure CLI tools. Just type something like
azure help hdinsight
and you’ll get the full set of options.
This works well for quickly spinning up a clean test cluster with some data in it (in the attached Azure storage account). Sadly, however, the Node CLI tools don’t yet support external metastore databases. HDInsight has the option to move the Hive metastore into an Azure SQL database, which means you can detach all your table schemas from the lifetime of the cluster, as well as the data. Awesome!
To use this feature, though, we have to automate the cluster with PowerShell, which loses the cross-platform bit; but I’ve usually got at least one copy of Windows kicking about somewhere on my screens, so let’s live with it. Explaining the PowerShell approach is a little more involved, so it can have its own post shortly.