On October 1st, 2015, Hortonworks announced the release of DataFlow, its newest product in the field of stream processing. DataFlow is “powered by” Apache NiFi.

This is a very exciting announcement. DataFlow is purported to change the way we deal with IoT and data stream consumption. Let’s see if it lives up to the hype.

Since the announcement, I’ve been able to spend a couple of hours with the product.  I have spent a fair amount of time testing different stream processing strategies, so when this product was released to the public, I was ready to jump in and see how it can help.

One of the challenges I’ve faced when working with streaming data is simply trying to figure out which tool to use.  There are so many applications and frameworks to consume streams, it’s quite confusing to know which way to turn.  Some of the existing applications are easier to use than others. Some are just application frameworks and require a fair to vast knowledge of OO programming methodologies to fully implement.

With Hortonworks DataFlow, it appears that there is finally a tool directed to data analysts for stream consumption.  Here are my initial impressions of the new product.

Installation and Setup

Like many open source projects, it’s possible to download the source code for Hortonworks DataFlow, build the binaries manually, and deploy the solution. Hortonworks also provides a set of pre-built binaries in both Zip and tarball format, which is nice.

Installation was really quite easy. Simply uncompress the file to a user directory, and run the script provided. The script accepts a couple of different execution options ranging from background execution, to service installation.
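For anyone who would rather script those two steps, here is a minimal Python sketch. It assumes the bundled launcher is Apache NiFi's `bin/nifi.sh` and that it accepts the actions listed in `ACTIONS`; the helper names `unpack` and `launcher_command` are my own, purely for illustration, so check the script in your download before relying on any of this.

```python
import tarfile
from pathlib import Path

# Assumption: these are the actions the bundled bin/nifi.sh launcher accepts.
# "run" stays in the foreground, "start" runs in the background,
# and "install" registers the application as a service.
ACTIONS = {"run", "start", "stop", "status", "install"}


def unpack(archive: Path, target_dir: Path) -> Path:
    """Extract a tarball into target_dir and return its top-level directory."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        # The first path component of the first member is the install directory.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(target_dir)
    return target_dir / top_level


def launcher_command(install_dir: Path, action: str) -> list:
    """Build the command line for the launcher script without running it."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return [str(install_dir / "bin" / "nifi.sh"), action]
```

Passing the result of `launcher_command(home, "start")` to `subprocess.run` would then start the product in the background, where `home` is the directory returned by `unpack`.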

Total time spent downloading Hortonworks DataFlow and getting it up and running in my test environment? About 15 mins. Seriously! 15 mins. That’s fantastic in my book.

User Environment

I have a lot of discussions with customers, peers, friends, and community members about Hadoop. One of the most common questions is “Is there a GUI I can use, or do I have to use the command line?”

Often, my answer to that question is not well received (read: NO), but in this case, I think I’ll be able to surprise everyone. 

Hortonworks DataFlow is 100% based on a GUI. Finally! Again, yes, DataFlow is 100% GUI based and runs in a web browser. So far, I’ve had no issues with it running in Safari or Firefox. It’s responsive, provides great tooltips and information popups, and is quick to respond.

I think a lot of enterprises out there are going to be excited to see that they can finally build complex data flows without having to be XML experts.


At the time of this writing, I’ve only spent about an hour or so actually using DataFlow to do any tasks. I haven’t even read much of the documentation yet. That being said, I’ve already created a flow that connects to Twitter, downloads tweets related to Hortonworks, and Hadoop, and Pizza, and The Martian, and saves them in HDFS.

Yes, all of that work, without reading an ounce of documentation. Disclaimer: I’m not suggesting that you shouldn’t read the documentation. You *should* read the documentation. I’m reading the documentation as soon as I’m done writing this article.
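For readers who think in code, the logic of that flow boils down to something like the Python sketch below. To be clear, this is not how DataFlow does it: in the product each step is a box you drag onto the canvas. The function names, the `text` field on each tweet, and the JSON-lines output (a stand-in for the HDFS write) are all assumptions of mine for illustration.

```python
import json

# The four topics from my test flow, lowercased for case-insensitive matching.
KEYWORDS = ("hortonworks", "hadoop", "pizza", "the martian")


def matches(text, keywords=KEYWORDS):
    """Return True if the text mentions any keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_tweets(tweets):
    """Keep only the tweets whose (assumed) 'text' field mentions a keyword."""
    return [tweet for tweet in tweets if matches(tweet.get("text", ""))]


def to_json_lines(tweets):
    """Serialize tweets one JSON object per line, ready to be written out."""
    return "\n".join(json.dumps(tweet) for tweet in tweets)
```

Even a toy version like this leaves out the hard parts, such as authenticating against Twitter, handling back-pressure, and writing to the cluster, which is exactly the plumbing the GUI took care of for me.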

I think this is a pretty big leap forward for Hadoop/Big Data here — let me reiterate. In less than two hours, I’ve installed a product, built a data flow to collect thousands of bits of information from a major social media network, and have them stored in my very own cluster computing environment. WITHOUT READING ANY DOCUMENTATION. 

What about the other stream consumption tools I’ve used? I definitely had to read the documentation (I’m still reading the Storm documentation) — and it definitely took me longer than one hour to get a full Twitter consumption task working the first time.

Score some big points for GUI-based programs here. I’m glad to see an intuitive interface, and I think a lot of others will be too.

What does all of this mean?

At first glance, I think DataFlow can be a BIG game-changer in the world of IoT and stream consumption. The interface is great so far; it can run in clustered mode; it can interface with Ambari for metrics monitoring; and it was originally built by the NSA. Who knows more about collecting tons of data and doing massive amounts of processing than the NSA?

I’m really, REALLY excited about DataFlow now that I’ve gotten my hands on it.  Watch the BlueGranite resources page for more information as I and the BlueGranite team dig deeper. 

Where to learn more

If you want to learn more about Hortonworks DataFlow, here are some resources:

BlueGranite Resources page — we’ll be continuing to post resources related to Hortonworks DataFlow and other Big Data tools.

Hortonworks – The DataFlow home page has links to product documentation and will likely have hands-on lab content in the future. 

Apache NiFi – Product Documentation and up-to-date application development guides. This is the documentation that we should all be reading.

The Final Word

We at BlueGranite are ready for IoT with tools like Hortonworks DataFlow. We want you to be ready too. If you are trying to wrangle massive amounts of real-time data and are looking for a team of experts to help guide you down the path, contact us today.  

In the meantime you may enjoy other articles on our blog, including my recent post on the Lambda Architecture, or this article on Data Lakes and the Modern Data Platform by our CTO, Chris Campbell.