You’ve heard about streaming and even seen a few live demos of people doing Twitter sentiment analysis, then gone back to the office to keep meeting the needs of your user base with the batch system the company has had for years. I get it. Been there, done it, sent a postcard.
I’d like to shift that perception for those of you who might not be aware of either the value proposition of streaming data or its low barrier to entry.
Enabling New Business Capabilities
Putting Twitter sentiment analysis aside, there are core business processes that can be enhanced today using streaming data methodologies, and those enhancements can provide real competitive advantages. I don’t have to enumerate them here, because well-respected organizations that you should be paying attention to have already done so:
- Forbes – The Competitive Advantage of Streaming Analytics
- Gartner – 5 Trends Emerge in the Gartner Hype Cycle for Emerging Technologies, 2018 (For the purposes of this article, IoT Platform is treated as synonymous with streaming data, although it is a more specific use case.)
The competitive advantages that streaming enables in the enterprise are well known to technology companies and will soon be at the doorstep of almost every industry. Is your organization ready for this challenge/opportunity?
A True Common Data Service for Every Application
Streaming data implements governance, integration, and messaging at scale within a common, extensible platform. It enables orthogonal applications to alert, act, and analyze against the same source definition. In the form of Kappa Architecture, streaming data offers true implement-once, consume-many semantics: domain expertise can concentrate within the applications themselves (including self-service analytics), while governance and security live in the data tier common to all applications. Governance and data access need to be implemented only once, because all applications reading from the stream source are equal in the eyes of the platform, whether they are messaging, alerting, analyzing, or deep learning. This extends to security and authorization rules as well.
Figure 1 (from Confluent)
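To make the implement-once, consume-many idea a little more concrete, here is a minimal in-memory sketch. It is not how Kafka or Event Hubs is implemented; `EventLog`, `Consumer`, and the reader names are hypothetical stand-ins I made up for illustration. The point is only that the access rule lives in one place, and each application keeps its own position in the same append-only stream.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for a stream topic (append-only)."""
    events: list = field(default_factory=list)
    allowed_readers: set = field(default_factory=set)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event

    def read(self, reader: str, offset: int) -> list:
        # One access rule, enforced in the data tier for every consumer.
        if reader not in self.allowed_readers:
            raise PermissionError(f"{reader} may not read this stream")
        return self.events[offset:]


@dataclass
class Consumer:
    """Each application tracks its own offset into the shared log."""
    name: str
    offset: int = 0

    def poll(self, log: EventLog) -> list:
        batch = log.read(self.name, self.offset)
        self.offset += len(batch)
        return batch


log = EventLog(allowed_readers={"alerting", "analytics"})
log.append({"order_id": 1, "amount": 40})
log.append({"order_id": 2, "amount": 900})

alerting = Consumer("alerting")
analytics = Consumer("analytics")

# Two unrelated applications read the same events independently.
large_orders = [e for e in alerting.poll(log) if e["amount"] > 500]
total = sum(e["amount"] for e in analytics.poll(log))
```

Neither consumer knows the other exists, and adding a third application means adding a reader name, not building another pipeline.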
What About Data Lakes?
Some of the use cases for streaming at first appear redundant with the capabilities of a data lake: the ability to scale, enrich, and serve orthogonal applications from the same source. The differences, especially in the form of Kappa, become significant when we step through the limitations of the data lake:
- Security and governance are managed separately from producing and consuming applications—roles must be applied based on file and folder hierarchies, which may or may not have anything to do with the actual data contained in each file. Streaming platforms maintain the same security definitions for all access requests to data.
- Latency and throughput are limited by the need for disk writes. Even if a cache system is implemented, file operations remain a bottleneck, whereas a streaming data source consists of single, additive events that can be read immediately.
- Resiliency in writing output depends on either the source application or a separately triggered component. That application or component must be built to scale with throughput, handle checkpointing and failover on system error, and guarantee write-once semantics; otherwise, downstream applications will need additional data cleansing. Modern streaming platforms, like HDInsight Kafka and Azure Event Hubs, are resilient by default.
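The last bullet is easier to appreciate once you see what a home-grown writer has to do. Below is a rough sketch, under my own assumptions, of the checkpointing and write-once logic a file-writing component would need to carry itself. `IdempotentSink` is a hypothetical class, and a dictionary stands in for the output files.

```python
class IdempotentSink:
    """Hypothetical sink that tolerates replayed events after a failure."""

    def __init__(self):
        self.written = {}     # event_id -> payload; stands in for output files
        self.checkpoint = -1  # highest source offset safely written

    def write(self, offset, event_id, payload):
        if offset <= self.checkpoint:
            return False      # already handled before the restart, skip it
        if event_id not in self.written:
            self.written[event_id] = payload
        self.checkpoint = offset
        return True


sink = IdempotentSink()
events = [(0, "a", 1), (1, "b", 2), (2, "c", 3)]

# The first two events are written, then the writer crashes.
for offset, event_id, payload in events[:2]:
    sink.write(offset, event_id, payload)

# Simulated failover: the producer replays everything from the start.
replayed = [sink.write(o, i, p) for o, i, p in events]
```

Only the third event is written on replay, so downstream readers never see duplicates. This is the kind of bookkeeping you would otherwise build and maintain yourself.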
Data lakes are best suited to serve analytical and predictive applications, whereas a streaming platform can be the source system for every application: it meets security, latency, and resiliency requirements as well as any database system does, while also providing the scalable read throughput needed by any analytic application. Data lakes then become useful as a data munging environment, but don’t require the organizational gravitas associated with being a system of record.
If you reflect on why data warehouses and lakes were conceived in the first place—because the application’s data source couldn’t performantly serve both transactions and analytics simultaneously—then the removal of this limitation helps reset the expectation. Yes, you can have a single source of truth, and it can serve all your applications concurrently. This exists today and you can have it.
Barriers to Entry
Now that I have your interest in streaming capabilities, let’s step through some of the common blockers and misconceptions about what it takes to implement a streaming platform.
Streaming Data ≠ IoT
You don’t need IoT devices to have a use case for streaming. Your existing data already carries plenty of requirements where subject matter experts (SMEs) in your organization can make significant contributions once they have access to current state. The same no-code alerting and dashboards that are standard in self-service analytics for reporting become low-latency, domain-enhanced tools for core business efficiency when pointed at streaming data. See the Forbes article listed earlier for more specific examples.
I Need to Create a Streaming Application
Not that long ago, the only way to get streaming data was to engineer an entire end-to-end pipeline where sources and targets were required to be known entities in advance. This effort necessitated IT owning the entire process, as even deployments of reports reading from these definitions required significant technical expertise.
This is no longer the state of the industry. It is possible to refactor your existing enterprise data architecture into streaming with zero code changes to existing applications. Relational database management system (RDBMS) table deltas can be polled, NoSQL object stores can have log listeners, directories and application logs can be read—even Excel file updates can be mirrored as an event.
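As a rough illustration of polling table deltas, here is a minimal sketch using SQLite from the Python standard library. The `orders` table, its columns, and the `poll_deltas` helper are all hypothetical, and the high-water-mark approach shown only picks up inserted rows; updates and deletes would need a change-tracking column or the database's own change data capture. The existing application writing to the table is not touched.

```python
import sqlite3


def poll_deltas(conn, last_seen_id):
    """Return rows added since last_seen_id, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    events = [{"id": r[0], "customer": r[1], "amount": r[2]} for r in rows]
    new_mark = events[-1]["id"] if events else last_seen_id
    return events, new_mark


# Stand-in for an existing application database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.execute("INSERT INTO orders (customer, amount) VALUES ('acme', 120.0)")
conn.execute("INSERT INTO orders (customer, amount) VALUES ('globex', 75.5)")

# First poll emits both existing rows as events.
first_batch, mark = poll_deltas(conn, 0)

# The application keeps writing as usual; the next poll sees only the delta.
conn.execute("INSERT INTO orders (customer, amount) VALUES ('initech', 10.0)")
second_batch, mark = poll_deltas(conn, mark)
```

In practice a data flow tool would run this kind of poll on a schedule and publish each row to the stream, so you would configure it rather than write it.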
Sound like a lot of custom development? Actually, no. There are mature platforms, known as data flow management tools, that enable event streaming using visual layouts and components similar to those of batch-based orchestration engines like Azure Data Factory and SQL Server Integration Services (SSIS).
Data Flow Management
Data flow management is an entire topic on its own—we’ll only touch on it briefly here.
Two of the most popular frameworks are Apache NiFi and StreamSets. In Azure, StreamSets is supported as a published app in HDInsight, whereas Apache NiFi would require a custom deployment on HDI or a roll-your-own IaaS cluster. Apart from Azure considerations, this post gives a good overview for comparing the frameworks. There’s a lot of parity, so evaluating each platform against your specific use cases is key. For some use cases, a data flow framework might even eliminate the need for an underlying streaming platform.
Perhaps the most significant barrier to getting started is that many organizations aren’t familiar with these tools or what a streaming ecosystem even looks like. This applies equally to both business and IT. Having an experienced resource such as BlueGranite available when you consider the impact and enablement that streaming can provide is a critical factor in reaching a successful outcome and a competitive advantage for your enterprise.