We all know that the textbook definition of “Big Data” describes data sets that fit one or more of the three “V” attributes: High Volume, High Velocity, and High Variety.
This simple definition of Big Data is what most vendors teach us in their product literature (such as this definition of Big Data on the SAS Institute’s web site). It’s also the definition included by many authors who write books and articles on the topic.
What most don’t know is that this isn’t really the way Gartner originally defined the term. Not entirely, anyway. The three V’s cover only the first nine words of Gartner’s definition: the first idea in a set of three related ideas that were meant to be taken together.
What’s a “Big Data Platform”?
But most know only three Vs. And if the 3 Vs define Big Data, then a data platform is a “Big Data Platform” if it can, somehow, process data that has one or more of these “V attributes”. By this definition, many kinds of data platforms are “Big Data Platforms”. Of course platforms based on Hadoop are accepted as “Big Data Platforms”, because in the minds of most people, Hadoop and “Big Data” are almost synonymous (actually, they’re not). But some conventional data processing platforms have adopted “Big Data” in their brand message as well.
Marketing ideas follow hype more often than they create it. Many kinds of data processing platforms (and even front-end data visualization tools) have co-opted the “Big Data” label. And who could blame them? If a traditional data platform company has a platform that can process “High Volume”, or can handle data streaming at “High Velocity”, why not describe it with the term “Big Data” to generate more interest in it? Is that wrong?
Big Data wasn’t intended to describe a product
To be a relational database management system (RDBMS), a platform needs—at least—to comply with E.F. Codd’s 12 rules, which define the relational database management system, and are quite specific. Because the definition is precise at an engineering level, if a vendor calls a system an “RDBMS” when it’s not, it’s plain as day to see that the system is being over-sold.
Conversely, “Big Data” was a term coined (according to my reading of it) by Gartner to describe a concept, not to label a product category the way E.F. Codd’s rules define the RDBMS. As a result, it’s much harder to measure products and architectures against the “Big Data” definition, which is much less precise and not engineering-based.
So what is “Big Data” then?
Colleagues and clients are frustrated with what they perceive as the uselessness of the term “Big Data” today. But I suspect that frustration is partially rooted in a broad misunderstanding of the term (as well as its over-use to hype products and services).
If we really look at what the term “Big Data” means, it becomes a much more useful tool in identifying and planning next-generation data analytics solutions.
As restated in Gartner Research Director Svetlana Sicular’s article in Forbes:
“Big data is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Concise? Yes. Specific? Not really. Powerful idea? Definitely.
The “real” big-data definition isn’t merely “The Three V’s”, as many have been conditioned to believe. As Ms. Sicular reminds us, the definition has three parts:
- Three V’s – the physical characteristics of the data
- Cost-effective, innovative technology – attributes of the platforms needed to process the data
- Enhanced insight and decision-making – the business outcomes that result from processing the data with the appropriate platform.
Let’s dig into each of these parts of the true definition.
The Three V’s
So, yes, the Three V’s do exist! But they only serve to describe—at a very high level—the attributes of data sets. Nothing more. And they certainly don’t tell us anything about whether a particular data platform is a good choice to process data that has characteristics like one or more of the V’s.
Cost-effective, innovative
This is certainly the most subjective part of the definition. What is cost-effective? Well, it depends on what you spend now. Are you accustomed to shelling out $20 million per year for an exotic MPP platform, or is a $10,000 SMB SQL database your idea of cost-effective?
How about innovative? I might think a solution I just designed is innovative, but you might not. If I’m a vendor, then of course I think my product is innovative, and it’s my job to help you see my point of view!
I don’t think this part of the “Big Data” definition is intended to sort systems by their level of cost-effectiveness or innovation. What it really means (to me), is that while organizations might have streaming data or high volumes of web logs, they won’t blindly spend exorbitant sums of cash to buy systems to process these data. Do vendors who sell “Big Data Platforms” built on expensive, exotic hardware and software foundations ignore this part of the “Big Data” definition on purpose? You be the judge.
Enhanced Insight and Decision-making
This part of the definition is crucially important. No customer will buy or implement a system that analyzes data just for the sake of doing so. There has to be a reward. An ROI. A purpose.
If a Big Data project is doomed to fail, it’s often because this part of the Big Data definition was unknown (or worse, ignored). Technical staff may enjoy exploring new technologies because it brings them a sense of accomplishment and pleasure to learn new things. And arguably if you employ technical staff who aren’t interested in learning new technologies you probably should revisit your hiring strategy!
But for a Big Data initiative to get beyond the first, low-cost lab experiment, it has to excite the business as well. For example, it may need to bring insights not possible with existing technologies. It needs to pick up the use cases for which RDBMS systems are too expensive to operate (extremely large scale) or which they cannot process (unstructured data).
It’s too bad that the perfectly valid “Big Data” term has been misunderstood by so many who use it. We tend to learn about new technology paradigms from product vendors. But in the case of Big Data there are few (if any) commercial products that truly address all three parts of the actual definition, so we often get only part of the story.
But if we, as practitioners, really look at the conceptual idea in its entirety, we can begin to map out how to make our organizations (or our clients’ organizations) successful in their use of Big Data Technology.
Where to go from here?
I’ll leave you with some thoughts about how to examine your own Big Data opportunities, and select the right ideas, technologies, and approaches:
Make sure the data sources/sets you’re looking at aren’t already being fully addressed by existing technologies.
The fastest way for your Big Data idea to die on the vine is to solve a problem that’s already been solved. Look for opportunities to bring insights the business side wants but can’t get any other way. Talk to the mid-level analysts and managers at your company. They all have questions that aren’t being answered by your ERP or Data Warehouse system.
Conversely, don’t be drawn into proving Big Data technologies in your company by, for example, using Hadoop to implement an ERP, CRM or small-scale Data Warehouse workload. Refer to definition part #1 – the three V’s. A relational data warehouse having 6TB of structured, transactional data isn’t a big data problem. It’s an RDBMS problem, so leave it where it belongs. Find out whether there’s another 100TB of useful, historical data at Iron Mountain that nobody can query. That could be a meaningful problem to solve.
Don’t use exotic hardware
Your organization probably wants enhanced insight for data it isn’t able to process and query today. But that doesn’t mean it will spend a fortune to do so. When weighing the status quo against gigantic capital investments, the status quo is a powerful contender.
You need to show that you can address 3V workloads for 1/10th the cost of traditional systems. You might not be able to do that if you buy your Big Data system from a hardware vendor who prefers to sell exotic hardware at a high gross margin. Think outside the box on this, and be ready to leave your comfort zone.
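To make the 1/10th target concrete, it can help to put both platforms on the same footing, such as annual cost per terabyte served. The following is only a minimal sketch: the function names, the choice of cost-per-TB as the metric, and every dollar figure are hypothetical illustrations, not numbers from any vendor or from this article.

```python
def cost_per_tb(annual_cost_usd: float, usable_tb: float) -> float:
    """Annual platform cost divided by the data volume it serves."""
    if usable_tb <= 0:
        raise ValueError("usable_tb must be positive")
    return annual_cost_usd / usable_tb


def meets_cost_target(candidate: float, incumbent: float,
                      target_ratio: float = 0.1) -> bool:
    """True if the candidate costs at most target_ratio of the incumbent."""
    return candidate <= incumbent * target_ratio


# Hypothetical figures, for illustration only.
incumbent = cost_per_tb(annual_cost_usd=2_000_000, usable_tb=100)
candidate = cost_per_tb(annual_cost_usd=150_000, usable_tb=100)

print(f"incumbent: ${incumbent:,.0f} per TB per year")
print(f"candidate: ${candidate:,.0f} per TB per year")
print("meets 1/10th target:", meets_cost_target(candidate, incumbent))
```

A real comparison would also need to account for staffing, licensing, and migration costs, which this sketch deliberately leaves out.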
Business users don’t use the command line
If you’re an IT Professional experimenting with Big Data technologies like Hadoop, most likely you’re using very “techie” user interfaces. Technical people are used to the command line. We’re used to highly complex, difficult-to-learn, over-engineered software. Learning some new IT platforms can be like fighting a Balrog in the Mines of Moria, and many technical people love the challenge. I promise you that the business sponsor who will fund your “Phase 2” project doesn’t share this same feeling.
The business needs to see insights, and see them visually. If the first thing you plan to demonstrate is how to run a MapReduce job in a 10-node cluster, don’t bother. The business sponsors will leave the meeting (literally or mentally), and you’ll accomplish nothing. Instead, plan a dashboard over the resulting data. Plan a real-time visualization of Twitter keywords for your industry. Use Excel to demonstrate accessibility. Or Power BI. Or Tableau. Make the output of your Big Data project satisfying and visually engaging for those whose support you’ll need going forward.
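As a rough illustration of the kind of small, business-facing summary that could feed such a dashboard, here is a sketch that counts industry keywords across a handful of posts. The sample posts, the keyword list, and the `keyword_counts` helper are all made up for this example. A real project would pull posts from an actual feed and hand the counts to Excel, Power BI, or Tableau rather than print them.

```python
import re
from collections import Counter


def keyword_counts(posts, keywords):
    """Count how often each tracked keyword appears across the posts."""
    wanted = {k.lower() for k in keywords}
    counts = Counter()
    for post in posts:
        # Split each post into lowercase word tokens.
        for word in re.findall(r"[a-z0-9']+", post.lower()):
            if word in wanted:
                counts[word] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample_posts = [
    "Our new mortgage rates are live today",
    "Mortgage approvals are up, and refinancing is booming",
    "Thinking about refinancing? Compare rates first",
]
tracked = ["mortgage", "refinancing", "rates"]

# Print a simple text bar chart, most frequent keyword first.
for word, n in keyword_counts(sample_posts, tracked).most_common():
    print(f"{word:<12} {'#' * n}")
```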