Some of the key tasks in data science involve basic exploration of new or existing data. You give raw data structure, join it to other datasets, select features for later analysis, and much more. Depending on the questions you want answered and on other requirements, the process repeats until you have data that is ideal for further, more advanced analytics.
With Apache Spark on Azure HDInsight, these core tasks are simpler because both Apache Zeppelin and Jupyter notebooks are included. In this Demo Day video, I walk through basic exploration of a city’s traffic crash history using Zeppelin with both Spark DataFrames and Spark SQL. I discuss some of the advantages of using Zeppelin and Spark for data of any volume. Working with a new text file, I take an initial look at what features are available, see what cleansing may be needed, and get a basic feel for the dataset through querying and visualization. At this stage, I compute summary statistics and develop a repeatable process that can be used later. While this is descriptive analysis, how can the data be prepared for other applications such as predictive analytics?
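To make those first-look steps concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. It is not the notebook code from the video. The sample rows, the column names (`crash_id`, `road_type`, `weather`, `injuries`), and the `first_look` helper are all hypothetical and stand in for the real text file.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample standing in for the raw text file.
# Column names and values are invented for illustration only.
RAW_TEXT = """crash_id,road_type,weather,injuries
1,freeway,clear,0
2,two_lane,snow,2
3,freeway,snow,
4,arterial,clear,1
"""


def first_look(text, category_column, numeric_column):
    """Repeatable first pass: list features, count gaps, summarize two columns."""
    rows = list(csv.DictReader(io.StringIO(text)))

    # Which features are available?
    columns = list(rows[0].keys())

    # What cleansing may be needed? Count empty values per column.
    missing = {c: sum(1 for r in rows if not r[c].strip()) for c in columns}

    # Basic feel for the data: category frequencies and numeric summary statistics.
    category_counts = Counter(r[category_column] for r in rows)
    values = [int(r[numeric_column]) for r in rows if r[numeric_column].strip()]
    summary = {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }
    return columns, missing, category_counts, summary


columns, missing, weather_counts, injury_summary = first_look(
    RAW_TEXT, category_column="weather", numeric_column="injuries"
)
print(columns)
print(missing)
print(weather_counts)
print(injury_summary)
```

Wrapping the steps in one function is what makes the pass repeatable: the same call can be rerun when a new file arrives. On a Spark cluster the same ideas would be expressed with DataFrame operations or Spark SQL instead.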
Overall, I can use the data to help bring me closer to answering my initial questions as well as prompt new questions. For example:
- Weather impacts road conditions. During a snowstorm, am I usually safer taking a two-lane road or a freeway? Freeways may have more accidents overall, but they also have a much higher traffic volume. Factoring in a road’s average daily traffic, do accidents during snow increase at similar rates for all road types, or do they increase at all?
- College football home games increase traffic congestion. Is there an increase in accidents that correlates with that congestion? Do accidents on game days take place along main corridors to the stadium, or are they dispersed throughout the city?
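The first question depends on comparing rates rather than raw counts. One way to sketch that normalization, assuming crash counts and average daily traffic are available per road type and weather condition, is shown below. Every number is made up for illustration and is not taken from the dataset in the video. The helper `crashes_per_million_vehicles` and the tuple layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations; all values are invented, not real crash data.
# (road_type, weather, crashes, days_observed, average_daily_traffic)
observations = [
    ("freeway", "clear", 900, 300, 60000),
    ("freeway", "snow", 120, 20, 45000),
    ("two_lane", "clear", 150, 300, 5000),
    ("two_lane", "snow", 15, 20, 4000),
]


def crashes_per_million_vehicles(crashes, days, average_daily_traffic):
    """Normalize a crash count by exposure (vehicles that used the road)."""
    exposure = days * average_daily_traffic
    return crashes / exposure * 1_000_000


# rates[road_type][weather] -> crashes per million vehicles
rates = defaultdict(dict)
for road, weather, crashes, days, adt in observations:
    rates[road][weather] = crashes_per_million_vehicles(crashes, days, adt)


def snow_ratio(road):
    """How many times higher the crash rate is in snow than in clear weather."""
    return rates[road]["snow"] / rates[road]["clear"]


for road in rates:
    print(road, round(rates[road]["clear"], 1), round(rates[road]["snow"], 1),
          round(snow_ratio(road), 2))
```

In this invented sample the freeway has far more crashes in total, yet its clear-weather rate per vehicle is lower than the two-lane road's. That is exactly the kind of difference raw counts would hide. A ratio above 1 means the rate rises in snow, and comparing ratios across road types shows whether it rises at similar rates.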
Watch the video below to see how a Zeppelin notebook on a Spark on Azure HDInsight cluster can help me get answers.