2015 was a landmark year for Big Data. Hortonworks had its first year as a publicly traded company, acquired Onyara, and released Hortonworks Data Flow. Cloudera expanded its worldwide footprint and was named to Deloitte's 2015 Technology Fast 500™ list of the fastest-growing companies in North America. Microsoft expanded its cloud offerings beyond HDP and HDInsight, and now offers elastic parallel processing engines like Azure Data Lake and Azure SQL Data Warehouse.

It's hard to imagine a year bigger than 2015 in the Big Data world, but I predict that 2016 will surpass it in many ways. Here are my top five Big Data trend predictions for 2016.


Spark will overtake MapReduce.

Without a doubt, if you've been paying attention to the Big Data industry, Apache Spark is on your radar. You are probably even using Spark for some of your interactive data processing. Apache Spark offers so many improvements over MapReduce that it's not a hard prediction to make: Spark will take over as the processing engine of choice in many Hadoop clusters.
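
To make the contrast concrete, here is a minimal word count in PySpark, a sketch that assumes a SparkContext named sc (as provided by the pyspark shell) and hypothetical HDFS paths. The equivalent MapReduce job requires separate Java mapper, reducer, and driver classes.

    # Count words in a few lines of PySpark; the classic MapReduce version
    # needs separate Java Mapper, Reducer, and Driver classes.
    counts = (sc.textFile("hdfs:///data/input.txt")    # read the input from HDFS
                .flatMap(lambda line: line.split())    # split each line into words
                .map(lambda word: (word, 1))           # pair each word with a 1
                .reduceByKey(lambda a, b: a + b))      # sum the counts per word

    counts.saveAsTextFile("hdfs:///data/wordcounts")   # write the results to HDFS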

It's such an easy prediction to make that IBM made that bet last year, announcing a $300 million investment in Apache Spark over the next three to four years. With that much money flowing into the Spark ecosystem, we can expect to see some great new features this year. In fact, we've already had a new major release: Apache Spark 1.6 was released on Jan. 4, 2016. The release included new performance enhancements as well as new data science and machine learning functionality.

From a deployment perspective, all three of the major Hadoop vendors (Hortonworks, Cloudera, and MapR) now include Spark in their distributions.

In addition to the improvements Databricks is adding to Apache Spark, community projects are also investing in the platform. Many common Hadoop applications already support Spark as an execution engine, or have plans to in 2016:

  • Apache Hive has supported Spark as an execution engine since version 1.1 (a related PySpark sketch follows this list).
  • Sigmoid has forked the Apache Pig source into a project called Spork, and is also working to enable Spark as an execution engine in the main source branch.
  • A community project to run Sqoop on Spark is currently in development.
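
Hive-on-Spark itself is enabled on the Hive side by setting hive.execution.engine to spark, but Spark 1.6 already gives you a taste of the same idea from Python: a HiveContext runs HiveQL on Spark directly. Here is a minimal sketch, assuming a cluster where Spark can reach the Hive metastore; the table and column names are hypothetical.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hiveql-on-spark")
    hc = HiveContext(sc)  # reads hive-site.xml to find the metastore

    # HiveQL executed by Spark rather than MapReduce.
    top_events = hc.sql("""
        SELECT event_type, COUNT(*) AS cnt
        FROM web_events          -- hypothetical Hive table
        GROUP BY event_type
        ORDER BY cnt DESC
        LIMIT 10
    """)
    top_events.show()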

With all of the development activity around Spark, the next 12 months should prove to be very exciting!

Ingesting data from any device will be easier than ever before.

Big Data and the Internet of Things (IoT) are inseparable now. You just can't have one without the other. That said, until recently it was actually pretty difficult to build a true streaming application with Hadoop. You had to be pretty good at Java to use Storm, and Flume only takes you so far. Those tools were a good start, but to be truly enterprise ready, Hadoop needed more.

2015 started the momentum, and we'll see more of it in 2016. For example, in 2015 Hortonworks acquired Onyara, the main contributor to Apache NiFi. Shortly thereafter, Hortonworks Data Flow (HDF) was released. HDF is a huge step in the right direction for IoT implementations. For the first time, we got to see a vendor-supported, open-source IoT GUI. In addition, the HDF team went to work immediately on new features, like log tailing, which means HDF will soon support many of the same functions that Flume does.

MapR is another great example of a major Hadoop vendor with its eyes on IoT. MapR took a close look at Hadoop and realized that the 50 million file limit (per NameNode) in HDFS isn't sufficient when dealing with the potentially trillions of files generated by sensor devices. In response, MapR developed MapR-FS, a distributed file system without the limits of HDFS.

Finally, cloud solution providers have been laser focused on IoT as well. In late 2015, Microsoft released the Azure IoT Hub: a scalable, fully parallelized, single-point deployable infrastructure designed for collecting, sorting, and analyzing millions of points of data from all over (and sometimes off) the globe. Cloud platforms like Microsoft Azure are disrupting the Big Data field in a good way. With this release, Microsoft made what should be (and is) a very complex infrastructure deployment as simple as completing a wizard and waiting a few minutes. This kind of innovation lets Big Data developers focus on what is really important: writing the code that makes things go, not worrying about the status of various services in a Hadoop cluster.
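
To give a sense of that simplicity, here is a minimal sketch of a device sending telemetry to an IoT hub, assuming Microsoft's azure-iot-device Python package; the connection string and payload are hypothetical.

    import json
    from azure.iot.device import IoTHubDeviceClient, Message

    # Hypothetical device connection string, copied from the IoT Hub portal.
    CONN_STR = "HostName=my-hub.azure-devices.net;DeviceId=sensor-01;SharedAccessKey=..."

    client = IoTHubDeviceClient.create_from_connection_string(CONN_STR)
    client.connect()

    # Send a single telemetry reading; a real device would loop over a sensor.
    reading = {"deviceId": "sensor-01", "temperature": 21.7}
    client.send_message(Message(json.dumps(reading)))

    client.disconnect()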

With as much effort and investment as Hadoop providers have put into IoT, it's definitely going to remain a strong Big Data field in 2016. With tools like Azure IoT Hub and Hortonworks Data Flow, we'll be able to implement these solutions more easily than ever before.

Data Governance and Security will come to the forefront in 2016.

There is no question that enterprises want to adopt Hadoop. I talk to customers every week who want to know what it takes to get started. There is often a hurdle to clear first, though, and that hurdle is generally data governance and security. Enterprises are nervous, and rightly so, that managing petabytes of data will be an unruly task. With so many applications living and breathing in a Hadoop cluster, is it possible to know that all of the data is correctly secured? It's a big concern, and frankly a valid one.

While data governance and security are two different topics, they often go hand in hand when designing a solution. This year, we are going to see more investment in both, along with simplified implementations for the enterprise.

Apache Falcon graduated to a top-level Apache project in 2015 and is now included with HDP (version 2.2+). Falcon is a data governance tool that allows developers, administrators, and data stewards to define rules for data ingestion, data access, and data lifecycles. It also includes auditing, so if any funny business happens, administrators will be able to track down the source. Hortonworks has a long list of planned features for 2016, and I think we'll see some huge improvements to Falcon. By the end of this year, adoption will increase, and organizations will be more confident in the data governance story for Hadoop.

Cloudera answered the call for data governance with Cloudera Navigator (CN). CN is a fully implemented data management tool that ships with Cloudera Enterprise. It offers a vast array of features, including data lineage, data auditing, search indexes, and data policy definition, and it will even help administrators understand how data is being used and suggest performance improvements to data models. In addition to the data management aspect, CN provides encryption at rest and in motion, to ensure that any data captured by third parties will not be usable. While CN isn't open source and is only available with the Cloudera distribution, enterprises that use Cloudera are very excited about the future of this product.

Security is another extremely important topic for enterprise customers. In the Hadoop landscape, security most often means configuring Kerberos delegation. I know, I know, no one likes configuring Kerberos. Good news! There are new projects to help remove that pain. Apache Knox is a gateway that wraps the cluster in a single authentication layer, handling Kerberos on the cluster's behalf and creating a centralized place to manage user access. When paired with Apache Ranger, which provides fine-grained authorization over data and processes, administrators need no longer wonder who has access to the Hadoop cluster. Throughout 2016, we will see improvements to both of these tools, making them easier to implement and more secure in their operation.
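
To show what that buys you, here is a minimal sketch of listing an HDFS directory through a Knox gateway using the standard WebHDFS REST API; the gateway host, topology name, and credentials are all hypothetical.

    import requests

    # Hypothetical Knox gateway URL; "default" is the topology name.
    KNOX = "https://knox.example.com:8443/gateway/default"

    # The client presents simple credentials over HTTPS; Knox performs the
    # Kerberos handshake with the cluster on the client's behalf.
    resp = requests.get(
        KNOX + "/webhdfs/v1/tmp",
        params={"op": "LISTSTATUS"},
        auth=("analyst", "password"),  # hypothetical credentials
        verify=False,                  # demo only; trust the gateway CA cert in practice
    )
    resp.raise_for_status()
    for entry in resp.json()["FileStatuses"]["FileStatus"]:
        print(entry["pathSuffix"], entry["type"])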

2016 will prove to be a big year for Enterprise Hadoop. I’m positive that customers will become confident in the security and governance of their Hadoop implementations during the next 12 months and beyond.

Cloud processing will replace or augment a record number of on-premises solutions.

They say “the cloud is just someone else’s computer.” Well, this ‘someone else’ has a way better computer than I do! It’s been great watching the cloud, and specifically Microsoft Azure, grow over the last year. There has been a lot of very fast growth, with new tools introduced like:

  • Azure HDInsight with Spark – See above for why Spark is so important to today’s Big Data landscape. Azure supports it in HDInsight, its Platform-as-a-Service Hadoop offering.
  • Azure Data Lake – Think of this as cloud-based HDFS. It offers scalable, redundant storage with a job processing framework (Hive/Pig/Spark) attached. It can also be used as the storage layer for HDInsight.
  • Azure SQL Data Warehouse – For your structured data querying needs, SQL Data Warehouse is a highly scalable, MPP relational database.
  • Azure ML – Machine learning in the cloud! Azure ML is a great tool. It’s a graphical approach to machine learning and includes a bunch of templates to get started quickly. It also lets a data scientist’s model be turned easily into an on-demand web service, ready to be included in any application (a sketch of calling such a service follows this list).
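
As referenced in the Azure ML bullet above, scoring a published Azure ML web service is just an authenticated REST call. Here is a minimal sketch; the scoring URL, API key, and feature columns are all hypothetical.

    import requests

    # Hypothetical values, copied from the Azure ML web service dashboard.
    SCORING_URL = ("https://ussouthcentral.services.azureml.net/"
                   "workspaces/<ws>/services/<svc>/execute?api-version=2.0")
    API_KEY = "<your-api-key>"

    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": ["age", "income"],  # hypothetical model features
                "Values": [["34", "52000"]],
            }
        },
        "GlobalParameters": {},
    }

    resp = requests.post(
        SCORING_URL,
        json=payload,
        headers={"Authorization": "Bearer " + API_KEY},
    )
    resp.raise_for_status()
    print(resp.json())  # the scored result from the published model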

Microsoft isn’t the only company innovating in the cloud; Amazon has been busy too:

  • Amazon Machine Learning – Introduced in April 2015, it’s similar to Azure ML above, and is based on the same tools Amazon’s data scientists use to analyze and predict our shopping habits.
  • Amazon supports Spark in EMR – EMR, or Elastic MapReduce, is Amazon’s PaaS Hadoop offering. In 2015, they added Spark support. I told you Spark was a big deal!
  • AWS IoT – Just in time for the holidays, Amazon released AWS IoT for general use. AWS IoT lets developers easily connect millions of devices to collect, store, and analyze data. Like most cloud offerings, it’s highly scalable and relatively easy to set up (see the publishing sketch after this list).
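
As noted in the AWS IoT bullet, publishing a sensor reading to the AWS IoT message broker from Python is a single boto3 call. A minimal sketch; the region, topic, and payload are hypothetical.

    import json
    import boto3

    # The "iot-data" client talks to the AWS IoT message broker.
    iot = boto3.client("iot-data", region_name="us-east-1")

    # Publish one reading to a hypothetical MQTT topic.
    iot.publish(
        topic="sensors/warehouse/temp",
        qos=1,
        payload=json.dumps({"deviceId": "sensor-01", "temperature": 21.7}),
    )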

2015 was the year we were introduced to dozens of new tools for analyzing Big Data. 2016 is the year we implement them. Nearly every customer I talk to considers the cloud an important part of their IT vision, and those who don’t yet are actively evaluating it. I predict this year we will see a HUGE uptick in cloud deployments. We’ll still have on-premises work too, but I’m putting my money on cloud and cloud-hybrid solutions.

Nimble BI tools will outpace monolithic platforms.

I’ve been deep into SSAS for many years, and for the last three I’ve remained a staunch defender of multidimensional analysis. I still think it’s an important tool, but I can’t ignore the fact that its implementation rate has dropped. Sure, we still have many customers using it, but more often than not, our customers are looking at nimble tools like Microsoft Power BI or Tableau.

Microsoft Power BI was re-released in 2015 as a cloud-based product that has revolutionized how customers think about Microsoft BI. It is based on PowerPivot and SSAS, and includes a rich designer, the ability to create custom visualizations using D3, and a powerful data modeling tool. In addition, it has hooks into on-premises data sources, so you don’t have to worry about storing data in the cloud.

In 2015 we saw fast and furious updates to the product, often at a weekly pace.  Some of these improvements enabled Power BI to connect directly to on-premises and cloud-based Big Data environments.  These new features, when paired with the data governance features above, create a powerful self-service BI platform that we’ll see implemented many times this year.
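
The platform is programmable, too. As one illustration (a sketch, not an official sample), Power BI’s REST API lets an application push rows into a dataset so that connected dashboards update in near real time; the dataset ID, table name, and Azure AD token below are hypothetical.

    import requests

    # Hypothetical identifiers; the token comes from an Azure AD sign-in.
    DATASET_ID = "<dataset-guid>"
    ACCESS_TOKEN = "<azure-ad-access-token>"
    URL = ("https://api.powerbi.com/v1.0/myorg/datasets/"
           + DATASET_ID + "/tables/SensorReadings/rows")

    rows = {"rows": [{"deviceId": "sensor-01", "temperature": 21.7}]}

    resp = requests.post(
        URL,
        json=rows,
        headers={"Authorization": "Bearer " + ACCESS_TOKEN},
    )
    resp.raise_for_status()  # connected dashboards pick up the new row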

Tableau, too, has had a very big year. They released Tableau 9.0 at the beginning of the year. New features in version 9.0 included new connectors to on-premises and cloud-based Big Data sources. They also added support for importing data directly from statistical tools like R, a HUGE feature for data scientists. New performance enhancements help analyze more data than ever before.

Additionally, Tableau released Vizable, a mobile, self-service tool that lets analysts take their Big Data with them, wherever they are.

Microsoft Power BI and Tableau enable business users to be agile, flexible, and empowered to create the right visualizations for their team.  In 2016, we’ll see a lot of growth in this space. Paired with the new tools we have with Hadoop, business decision makers will be more informed than ever before.

If you’re interested in discussing these or other predictions, please feel free to reach out to us today.