HDInsight, Microsoft’s open-source Big Data platform on Microsoft Azure, has come a long way since its 2014 introduction. It was one of Azure’s first available platform offerings.  Many of our customers took an early look at the product and, frankly, were underwhelmed.

But multiple improvements since the initial release have made HDInsight not only a great product but also our Big Data processing platform of choice.  Below, we share some of the new features that have pushed HDInsight to the top of our analytic processing list.


HDInsight on Linux

One of the biggest complaints with the first edition of HDInsight was that it was delivered on Windows. This meant that we didn’t have access to all of the Hortonworks Data Platform (HDP) features – Ambari included.  Because Remote Desktop was the only way to connect to the cluster, development was challenging.  Now that HDInsight is delivered on Linux – as Hadoop should be – Big Data in the cloud is better than ever.  The cluster can be accessed via Ambari in the web browser, or directly via SSH.

Additionally, since HDInsight is based on the Hortonworks Data Platform, it follows HDP’s release schedule. Updates to HDP are delivered on Linux first, with Windows following later, so updates to HDInsight also reach Linux first.  Finally, premium versions of HDInsight that include advanced tools like Apache Spark or R Server are available only on the Linux platform.

Customized Platform Configuration with Script Actions

HDInsight is a platform-as-a-service offering. What does that really mean? It means that the installation, configuration, and administration of the Hadoop platform are not performed by the Azure customer. Microsoft makes sure that the cluster is operational upon deployment and that it stays that way while running.

The cluster is deployed with a predefined configuration, and making customizations directly on the cluster is not recommended. Why is this? Cluster nodes can be redeployed at any time while the cluster is running, and any configuration changes made directly on a node can be lost.

However, we often need to make customizations to the HDInsight cluster. To make those customizations, we use Script Actions. Script Actions are Bash scripts that make modifications to nodes within the cluster. These scripts can target head nodes, worker nodes, ZooKeeper nodes, or any combination of the three.  Common customizations include:

  • Installation of new software on the cluster
  • Modification of software configuration
  • Pre-loading data

Script Actions can be defined and applied when the cluster is originally created, or at any time while the cluster is up and running.
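
To make the targeting idea concrete, here is a minimal Python sketch of how a Script Action might be described before it is submitted to a cluster. It is an illustration under stated assumptions, not the HDInsight API: the helper name build_script_action, the field names (name, uri, roles, parameters), the role strings, and the example.com script URL are all hypothetical placeholders, so check the values your deployment tooling actually expects.

```python
import json

# Assumed role identifiers for the three node types named above.
# These strings are illustrative; confirm them against your tooling.
VALID_ROLES = ("headnode", "workernode", "zookeepernode")


def build_script_action(name, uri, roles, parameters=""):
    """Return a dict describing one Script Action (hypothetical shape).

    name       -- a label for the action
    uri        -- location of the Bash script to run
    roles      -- which node types the script should target
    parameters -- optional arguments passed to the script
    """
    if not roles:
        raise ValueError("at least one target role is required")
    unknown = sorted(set(roles) - set(VALID_ROLES))
    if unknown:
        raise ValueError(f"unknown roles: {unknown}")
    # Deduplicate and keep a stable order so the output is predictable.
    ordered = [role for role in VALID_ROLES if role in roles]
    return {"name": name, "uri": uri, "roles": ordered, "parameters": parameters}


if __name__ == "__main__":
    action = build_script_action(
        name="install-extra-package",
        uri="https://example.com/scripts/install.sh",  # placeholder URL
        roles=["workernode", "headnode"],
    )
    print(json.dumps(action, indent=2))
```

Validating the target roles up front is a small design choice that catches a typo before a script is pushed to the wrong set of nodes, or to none at all.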

Better Control Over Cluster Performance

HDInsight has always been an elastic platform for data processing.  Since its introduction, adding new nodes to HDInsight has been an easy process.  In today’s platform, it’s even more scalable. Not only can nodes be added to and removed from a running cluster, but individual node size can be controlled.  We can define both the head node and the worker node virtual machine sizes. This means that the cluster can be highly optimized to run the specific jobs that are scheduled.

In addition to CPU and memory specifications, new storage options also allow for more control over the data processing performance of the cluster. HDInsight has always supported Azure Blob Storage, an economical cloud storage platform. For data processing workloads that demand higher performance, HDInsight also supports Azure Data Lake Store (ADLS), which is optimized for parallel processing workloads.  ADLS is great for parallel processing tasks because data stored there is split into chunks, replicated, and distributed across storage clusters. While the cost for ADLS is higher, it allows for processing customization based on the required performance characteristics of the jobs.
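
The reason chunked storage suits parallel processing can be shown with a toy Python sketch. It does not reproduce how ADLS works internally; the function names, the chunk size, and the thread pool are all assumptions made for illustration. It only demonstrates the general idea that once data is split into independent chunks, each chunk can be processed by a separate worker and the partial results combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Split data into consecutive fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def count_lines_parallel(data: bytes, chunk_size: int, workers: int = 4) -> int:
    """Count newline bytes by processing each chunk in its own task.

    Counting newlines is safe across chunk boundaries because every
    newline byte lands in exactly one chunk.
    """
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda chunk: chunk.count(b"\n"), chunks)
        return sum(partial_counts)


if __name__ == "__main__":
    sample = b"row-1\nrow-2\nrow-3\n" * 1000
    print(count_lines_parallel(sample, chunk_size=64))
```

In a real cluster the workers would be separate nodes reading separate pieces of storage, which is where distributing the chunks pays off.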

Development Options Abound

In its initial form, HDInsight offered few options for developing processing jobs.  Today, however, there are really great options available that enable developers to build data processing applications in whatever environment they prefer.  For Windows developers, HDInsight has a rich plugin for Visual Studio that supports the creation of Hive, Pig, and Storm applications.  For Linux or Windows developers, HDInsight has plugins for both IntelliJ IDEA and Eclipse, two very popular open-source Java IDE platforms.  HDInsight also supports PowerShell, Bash, and Windows command-line input to allow for scripting of job workflows.

For data scientists, HDInsight includes Jupyter. Jupyter is a notebook-based development environment that allows for integration of code and content.  When code and content come together, they create a living document that updates with data.

Closing Thoughts

At 3Cloud, we’ve had our finger on the pulse of the Big Data industry for several years. Our customers are looking to implement Big Data solutions that don’t require a heavy-handed administration effort to see business value. Many of our customers are moving to the cloud to optimize their IT infrastructure. For these reasons, our customers are embracing HDInsight and Azure-based data platform tools.

We offer a 3-day Big Data Bootcamp that can help your team get up to speed with HDInsight, learn how to use the power of parallel processing to build advanced analytic models at Big Data scale, and learn how to deliver analytics to business users through powerful visualizations.  Contact us today to learn more.