First Impressions of Apache Spark on Azure HDInsight

With the recent preview release of Apache Spark on HDInsight, Microsoft has brought the next generation of distributed data processing front and center in Azure. It was already possible to create an HDInsight cluster with Spark, but doing so required a custom configuration through a script action or a manual installation; Microsoft has now removed that hassle. The convenient Spark setup that Azure offers allows more responsive, interactive analysis of both static and streaming data in a cluster.

What is Apache Spark?

At its core, Spark is an in-memory processing framework that operates on distributed data. For anyone with an existing investment in Hadoop who wants to extend it to obtain more business value, Spark can access HDFS or other distributed storage systems such as Azure Blob storage. It can run jobs in its own standalone mode or under a cluster manager such as YARN. Unlike disk-based processing with MapReduce or its successors in the Hadoop ecosystem such as Apache Tez, however, Spark keeps working data in memory wherever it can, spilling to disk only when a dataset does not fit, and typically offers an enormous performance benefit. As a result, Spark can function as either a complement to an existing Hadoop investment or a fresh starting point if desired.

In addition, an entire ecosystem is built into and building up around Apache Spark. Much as Hadoop spawned a number of supporting projects such as Hive, Pig, Mahout, and others for working with and analyzing data, Spark has core capabilities for analysis with SQL, streaming, graph computing, and machine learning. As a popular open source project, it also has dozens of add-on packages built by the Spark community to supplement its core offerings.

Why does Spark on HDInsight matter?

While much of the talk and writing surrounding “big data” amounts to recycled hype, in my opinion Apache Spark has earned its excitement. There are still use cases for traditional batch processing, but Spark’s faster computing capability allows for a more interactive experience. The ability to run queries in a fraction of the conventional time opens up distributed computing to a much wider audience of consumers and data analysts. With the same number of cluster nodes in HDInsight, jobs on large datasets that might take minutes using MapReduce can often be done in seconds with Spark.

Additional conveniences of Spark on HDInsight include the Apache Zeppelin and Jupyter notebooks, which are easily accessible from the Azure toolbar or directly via URL. These two notebooks allow users to explore data or write code using languages such as SQL, Python, or Scala. Many users may already have experience with Jupyter, and Spark on HDInsight provides a good introduction to the rich experience of the relatively new Zeppelin project. The faster response time of Spark, coupled with easy access to coding in the notebooks, removes previous barriers to quick experimentation with queries, machine learning, and other applications.

Analyze data with Spark SQL in Apache Zeppelin