With the recent preview release of Apache Spark on HDInsight, Microsoft has brought the next generation of distributed data processing front-and-center in Azure. While it had already been possible to create an HDInsight cluster with Spark, Microsoft has removed the hassle of a custom configuration using a script action or a manual installation. The convenient Spark setup that Azure now offers allows more responsive, interactive analysis for both static and streaming data in a cluster.

What is Apache Spark?

At its core, Spark is an in-memory processing framework that operates on distributed data. For anyone with an existing investment in Hadoop who wants to extend it to obtain more business value, Spark can access HDFS or other distributed storage systems such as Azure Blob storage. It can run as a standalone cluster or within a resource manager such as YARN. Unlike disk-based processing with MapReduce, or successors in the Hadoop ecosystem such as Apache Tez, Spark keeps working data in memory wherever it can and spills to disk only when it must, which typically yields a large performance benefit. As a result, Spark can serve either as a complement to an existing Hadoop investment or as a fresh start.
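To make the storage access and in-memory reuse described above more concrete, here is a minimal Python sketch. It assumes a SparkContext (`sc`) has already been created by the cluster or notebook and is passed in. The container, account, and file names are hypothetical placeholders, and `wasb_uri` and `count_errors` are illustrative helpers, not part of any Spark or HDInsight API.

```python
def wasb_uri(container, account, path):
    """Build a wasb:// URI for a file in Azure Blob storage.

    All three arguments are placeholders supplied by the caller.
    """
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/")
    )


def count_errors(sc, uri):
    """Read a text file through an existing SparkContext and scan it twice.

    cache() asks Spark to keep the RDD in memory after the first action,
    so the second pass can avoid re-reading the file from storage.
    """
    lines = sc.textFile(uri).cache()
    total = lines.count()  # first action: reads from storage, fills the cache
    errors = lines.filter(lambda line: "ERROR" in line).count()  # reuses cached data
    return total, errors


# Hypothetical usage inside a notebook where `sc` exists:
# uri = wasb_uri("logs", "mystorageaccount", "2015/07/app.log")
# total, errors = count_errors(sc, uri)
```

The same `textFile` call accepts an `hdfs://` path, which is one way an existing Hadoop dataset could be reused without moving it.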

In addition, an entire ecosystem is built into and building up around Apache Spark. Much as Hadoop spawned supporting projects such as Hive, Pig, and Mahout for working with and analyzing data, Spark has core capabilities for SQL analysis, streaming, graph computing, and machine learning. As a popular open source project, it also has dozens of add-on packages built by the Spark community to supplement its core offerings.

Why does Spark on HDInsight matter?

While much of the talk and writing surrounding “big data” amounts to recycled hype, in my opinion Apache Spark has earned its excitement. There are still use cases for traditional batch processing, but Spark’s faster computing capability allows for a more interactive experience. The ability to run queries in a fraction of the conventional time opens up distributed computing to a much wider audience of consumers and data analysts. With the same number of cluster nodes in HDInsight, jobs that take minutes on large datasets with MapReduce can often finish in seconds with Spark.

Additional conveniences of Spark on HDInsight involve the Apache Zeppelin and Jupyter notebooks, which are easily accessible from the Azure toolbar or directly via URL. These two notebooks allow users to explore data or write code using languages such as SQL, Python, or Scala. Many users may already have experience with Jupyter, and Spark on HDInsight provides a good introduction to the rich experience of the relatively new Zeppelin project. The faster response time of Spark coupled with ease of access to coding in the notebooks removes previous barriers to quick experimentation with queries, machine learning, and other applications.
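As a rough illustration of the kind of query a notebook cell might run, the Python sketch below registers a small dataset as a temporary table and aggregates it with Spark SQL. It assumes a `sqlContext` (a Spark `SQLContext`) already exists in the notebook session. The table name, column names, and the helpers `average_by_key_sql` and `explore` are hypothetical.

```python
def average_by_key_sql(table, key_col, value_col):
    """Build a Spark SQL statement that averages value_col for each key_col."""
    return (
        "SELECT {k}, AVG({v}) AS avg_{v} FROM {t} GROUP BY {k} ORDER BY {k}"
    ).format(k=key_col, v=value_col, t=table)


def explore(sqlContext, rows, columns, table="readings"):
    """Load rows into a DataFrame, expose it to SQL, and run the aggregate.

    `rows` is a list of tuples and `columns` a list of column names.
    registerTempTable is the Spark 1.x call; newer Spark releases use
    createOrReplaceTempView instead.
    """
    df = sqlContext.createDataFrame(rows, columns)
    df.registerTempTable(table)
    query = average_by_key_sql(table, columns[0], columns[1])
    return sqlContext.sql(query).collect()


# Hypothetical usage in a notebook where `sqlContext` exists:
# rows = [("sensor-a", 20.5), ("sensor-a", 21.5), ("sensor-b", 18.0)]
# explore(sqlContext, rows, ["device", "temp"])
```

Keeping the SQL text in its own function makes it easy to paste the same statement into a Zeppelin SQL paragraph once the table has been registered.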

Figure: Analyze data with Spark SQL in Apache Zeppelin

First Impressions

Soon after the Spark preview became available, I had the opportunity to create a test cluster and had some favorable first impressions:

  • Like the existing HDInsight cluster setup, the Spark setup simply requires you to select the desired number of nodes, choose an Azure storage account, provide credentials, and then hit the “Create” button. Within a few minutes, the cluster is available.
  • Microsoft makes it easy to work with data housed in Azure storage. To associate the Spark cluster with an existing storage container, use the Custom Create option.
  • Workloads related to data exploration and data science are responsive in Spark. Coupled with the convenience of the cluster setup and the ability to immediately get started using the notebooks, users can more easily gain insight from their data.
  • The breadth of language support largely allows users to utilize existing skillsets with Spark. For example, if a project team works primarily with Python, there is less need to bring in Scala resources or train personnel on an unfamiliar language.
  • While Zeppelin offers some charting capabilities, it is more of a tool to get a quick glimpse at a dataset and is not meant to be a full-featured data visualization tool. Visualization in Jupyter involves coding, but if you are already familiar with Jupyter, that is likely not an obstacle. Fortunately for large groups of consumers, it is easy to connect to a Spark cluster using mainstream tools such as Power BI or Tableau.
  • The Spark on Azure HDInsight preview currently installs a slightly older but more stable Spark version (1.3.1). Version 1.4, released in June 2015, adds R integration through SparkR along with many other new features, and should come to HDInsight soon.

Figure: Straightforward cluster setup

As the latest approach to distributed data processing, Apache Spark on HDInsight is a welcome addition to Azure. With multiple applications ranging from basic data exploration to machine learning, Spark removes a number of handicaps for processing large datasets or data streams. Through both convenience and performance, Spark on HDInsight fills a gap and makes “big data” more approachable to everyone.

If you have any questions about how Spark on HDInsight may fit into a new or existing solution, 3Cloud can help.