As more companies look to run advanced analytics on Big Data platforms, it can be daunting for a data scientist to keep up with the myriad data sources and formats. I learned R with smaller datasets – mostly Excel spreadsheets, .csv files, and SAS files (see the blog post here from my colleague Colby Ford). Formats like that are great for departmental solutions, development processes, and training – and they're not going away anytime soon.

As enterprises look to extract maximum value from ALL their data, advanced analytics professionals need to become familiar with other data formats, especially those found on modern platforms like Hadoop. This article provides a data science perspective on what some of these data sources are (namely the Hadoop Distributed File System (HDFS) and Hive) and why they're important, and points to some resources for getting started with sample data.

If you've been using R with large datasets, or with an enterprise-scale platform like Microsoft R Server, you might be getting to know your friendly neighborhood data architects a bit better. Let's all say it – the data science tribe is getting bigger, and that's a very good thing. Before I started my current role, I knew that the Hadoop thingy was shaped like an elephant, and that was the extent of my Big Data knowledge. These days, I'm talking with customers who want to train machine learning models in R on 100 TB of historical data, so I need to be able to discuss Big Data and the advantages it brings to analytics, like distributed storage and computing. Thankfully, I work with a talented data platform team that has really helped me learn – but all this is to say that data science discussions involve an increasingly diverse set of technologies and skill sets. Before we look at specific data formats, let's start with a brief and simple overview of why a platform like Hadoop is important in the context of data science.

Why Hadoop?

Hadoop offers the ability to distribute data files across the storage of many computers (HDFS), perform computations on each machine simultaneously across a large number of files (parallel/distributed processing) – AND get back a coherent result as if it were a single operation. Storing and processing data at very large volumes is usually cheaper and faster in a Hadoop environment than in a traditional data warehouse technology like Oracle or SQL Server.

Additionally, there is no requirement to structure data in HDFS the way a relational database enforces a schema. You can just throw your .csv, .txt, image files, etc. into folders with minimal organization and worry about making sense of it later. Hadoop is an open-source Apache project with commercial distributions from the likes of Cloudera and Hortonworks.
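
To get a feel for how loosely organized HDFS can be, here's a minimal sketch of dropping raw files into an HDFS folder straight from R using the Hadoop helper functions in RevoScaleR. The folder and file paths are hypothetical, and it assumes your R session runs on a node that can reach the cluster:

```
# RevoScaleR ships with Microsoft R Server (it's not on CRAN)
library(RevoScaleR)

# Make a landing folder in HDFS; no schema or table definition is required
rxHadoopMakeDir("/data/web_sessions")

# Copy local raw files into that folder exactly as they are
rxHadoopCopyFromLocal("C:/exports/sessions_2017_01.csv", "/data/web_sessions")
rxHadoopCopyFromLocal("C:/exports/sessions_2017_02.csv", "/data/web_sessions")

# Confirm the files landed; making sense of them can wait until query time
rxHadoopListFiles("/data/web_sessions")
```

These helpers are thin wrappers around the familiar hadoop fs commands, so nothing about the files changes on the way in.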

What’s MapReduce and Spark?

Think of these technologies as the software frameworks within Hadoop that perform all the complicated processing needed for distributed computing. Spark represents the next generation of computing on Hadoop compared to MapReduce, providing advanced capabilities like machine learning, stream processing, and in-memory computing. Writing MapReduce or Spark applications from scratch can be complicated, so a variety of APIs and interfaces to other tools, like R, are available for users.

How do I query and use data that’s distributed on Hadoop in R?

Great question! And also the purpose of this blog. With files all over the place and no required structure or schema, getting a dataset that's useful for modeling might seem difficult. Thankfully, there is a technology called Hive that provides a very user-friendly, SQL-like language called HiveQL for querying data in file systems that integrate with Hadoop, such as HDFS, without having to write Java MapReduce code directly. Since most people who want to use Hadoop already know SQL from working with relational databases, it's a very nice tool for creating familiar data representations like tables in the world of big, distributed, and unstructured data.

Hive employs a schema-on-read design – which means that structure is applied to the data when a query reads it, rather than when the data is written or stored. It's like using cookie cutters to create the exact shape and variety of cookies you want, rather than buying a whole bunch of the same cookies – YUCK! This provides tremendous flexibility in how the data can be used. It also provides storage and data management advantages, as a Hive query can be saved as a lightweight metadata object rather than having to write the complete results to a file. More simply, Hive doesn't actually store any data; it just helps us structure and use it much more efficiently.
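
To make schema-on-read concrete, here is a hedged sketch of laying a Hive "cookie cutter" over raw CSV files that already sit in HDFS. The table name, columns, and path are hypothetical, and the DDL is pushed through the Hive command-line client from R purely for illustration – you could just as easily run it in Beeline or a Hive view on HDInsight:

```
# CREATE EXTERNAL TABLE records only a schema and a location;
# no data is copied, moved, or validated until a query reads it
ddl <- "
CREATE EXTERNAL TABLE IF NOT EXISTS web_sessions (
  devicemake     STRING,
  querydwelltime DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/web_sessions';
"

# Hand the statement to the Hive CLI (assumes 'hive' is on the PATH
# of the node this R session is running on)
system2("hive", args = c("-e", shQuote(ddl)))
```

Dropping an external table later removes only the metadata – the files under /data/web_sessions stay exactly where they are, which is the "Hive doesn't actually store any data" point in action.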

There are a variety of ways to use Hive tables in R. One is SparkR from Apache. This R package ships only with the Spark distribution (it's not on CRAN), which makes getting started a pretty big investment. An easier way is through Microsoft's HDInsight on Azure – a fully managed Hadoop-as-a-service offering that's easy to deploy (even for just a few hours of playing around) and provides an option for R integration via Microsoft R Server.
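
For comparison, the SparkR route looks roughly like this once SparkR is available on the cluster – a sketch assuming Spark 2.x with Hive support and the same hivesampletable used later in this post:

```
# SparkR ships with the Spark distribution, not with CRAN
library(SparkR)

# Start a Spark session with Hive support so Hive tables are visible
sparkR.session(enableHiveSupport = TRUE)

# Run HiveQL and get a Spark DataFrame back
df <- sql("SELECT devicemake, querydwelltime FROM hivesampletable")

# Pull a small preview into local R memory
head(df)

# Shut the session down when finished
sparkR.session.stop()
```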

R Server allows you to import Hive data directly as Spark data frames that can take advantage of Microsoft's high-performance machine learning algorithms. In preparation for this post, I followed the tutorial for getting started with R Server on HDInsight to deploy a Hadoop cluster, and then followed the instructions in the section on accessing data in Hive. Within 15–20 minutes, I was up and running with a cluster and had experimented with the sample Hive data! For an even deeper tutorial, check out this post from Microsoft. Let's see how easy it is:

In the code snippet below, the line 'hiveData <- RxHiveData(…)' brings the results of a HiveQL query against an existing Hive table named 'hivesampletable' into an Rx data source object in R. Rx data sources – part of the RevoScaleR package in R Server – can be created from a variety of sources such as ODBC, .csv, text, Parquet, and others. The advantage of this format is that it holds just the metadata for the query – essentially a pointer to the data location and the query structure – so it has a very lightweight memory footprint in R, even for massive datasets.

```
#..create a Spark compute context
myHadoopCluster <- rxSparkConnect(reset = TRUE)
```

```
#..retrieve some sample data from Hive and run a model
hiveData <- RxHiveData("select * from hivesampletable",
            colInfo = list(devicemake = list(type = "factor")))
rxGetInfo(hiveData, getVarInfo = TRUE)

rxLinMod(querydwelltime ~ devicemake, data = hiveData)
```

The next line, 'rxGetInfo(…)', returns summary information such as variable names, data types, and the number of rows for the query.

Finally, the line including 'rxLinMod(…, data = hiveData)' trains a linear regression model using the Hive data. It's interesting to note that rather than having to fit all the data in memory, R Server intelligently streams the Hive data as needed from HDFS and spreads the work across the computing nodes in the Hadoop cluster for distributed processing – super cool!
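
To round out the example, the fitted model comes back to your local R session as an ordinary model object, so you can inspect it the usual way and then release the cluster when you're done. A short sketch (the 'model' name is just illustrative):

```
# Capture the fit so the coefficients can be inspected locally
model <- rxLinMod(querydwelltime ~ devicemake, data = hiveData)
summary(model)

# Close the Spark application and return to a local compute context
rxSparkDisconnect(myHadoopCluster)
```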

Hopefully this article has helped you understand the value of using Hadoop data in R. For more information about Microsoft R Server, please see our recent webinars here and here. For more information on Hadoop, please visit our resource center.