Building a Data Lake is not a small task. It requires a large amount of storage, distributed across many servers, all working in sync to provide fast, reliable access to your data. Building a distributed computer system is much more complex than deploying the single-server solution that many of us are more familiar with.


In a recent post, Mike Cornell encouraged thinking big but starting small, and that encouragement applies to deploying a Data Lake as well. For many IT professionals, the first impulse might be to design a large, complex, and expensive hardware-based solution. When it comes to building a Data Lake, however, implementation, organization, and ease of data egress matter more than any specific technology requirement.

Cloud providers, like Microsoft, offer great solutions for deploying a Data Lake that is agile, scalable, and performs well with various distributed compute engines. Some of the most important reasons to think cloud first are deployment speed, elastic scale, and the pricing and cost structure. Let's look at each in turn.

Deployment Speed

In the introduction, I mentioned that deploying an on-premises Data Lake is complex. Let's take a look at what it can involve.

The first question you'll need to be able to answer is "How much data do you want to be able to store in your Data Lake?" Before you answer, think about the next 3 years: how much data do you think users will want to bring in to analyze in the Sandbox? What about Data Warehouse archival? Do you know all of the data sources that users might need to analyze in the future? It's common to estimate a number, say 100TB, and then decide to double it (200TB in our case) to account for the unknown.

Once you have an idea of how much storage you want to aim for, you'll need to triple that number. Why? Because modern distributed file systems get their resilience from replication: every file stored in the Data Lake is kept a minimum of 3 times (if you're lost, check out this post on data lake organization). This ensures that if any single node were to fail, two other copies of the data are available for immediate use. Our storage target is now up to 600TB.

Further, you'll want to increase that amount by about 20 percent to account for space used during data transformation and staging. Our final storage requirement for the fictitious solution is 720TB. Not a small amount…

To build an on-premises environment capable of 720TB of storage, we’ll need to do some calculations. Modern servers designed for Data Lake usage contain around 60TB of usable attached storage. This means we’ll need 12 servers to hit our target of 720TB of storage for our Data Lake.
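To make the arithmetic concrete, here is a minimal sketch of the sizing math above. The 100TB starting estimate, 2x growth buffer, 3x replication, 20 percent working overhead, and 60TB-per-node figure are simply the illustrative assumptions from this post, not a formal capacity-planning model.

```python
# Rough Data Lake storage sizing, following the estimates used in this post.
def size_data_lake_tb(estimated_tb, growth_factor=2, replicas=3, working_overhead=0.20):
    """Total TB an on-premises cluster must provide for a given raw estimate."""
    planned = estimated_tb * growth_factor        # pad the 3-year estimate for unknowns
    replicated = planned * replicas               # distributed file systems keep 3 copies
    return replicated * (1 + working_overhead)    # staging / transformation scratch space

USABLE_TB_PER_NODE = 60                           # usable attached storage per data node

total_tb = size_data_lake_tb(100)                 # start from the 100TB estimate
nodes = -(-total_tb // USABLE_TB_PER_NODE)        # ceiling division
print(f"{total_tb:.0f}TB total, {nodes:.0f} data nodes")   # -> 720TB total, 12 data nodes
```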

How long does it take to procure 12 servers in your environment? Once they are procured, how long does it take to install them in your data center, install operating systems, and test hardware to make sure it’s ready for use? Finally, how long should you plan for the software installation, configuration, and testing for the Data Lake platform? In my experience, this process, from beginning to end, can take anywhere from weeks to months.

Now, let’s compare this with a cloud deployment.

Microsoft Azure offers a product named Azure Data Lake Store (ADLS). ADLS is a cloud-based, HDFS-compatible file system. It is resilient to failure, integrates with existing Active Directory security, is POSIX compliant and WebHDFS compatible, and has no limits on file or account size. In short, working with ADLS is not much different than working with HDFS. Your Data Lake developers will appreciate that.
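As a quick illustration, here is a minimal sketch of working with ADLS from Python using the azure-datalake-store SDK. The tenant ID, client credentials, store name, and paths below are placeholders for whatever your environment uses.

```python
# Minimal sketch: browsing and uploading to ADLS with the azure-datalake-store SDK.
# The tenant/client credentials and store name are placeholders.
from azure.datalake.store import core, lib

token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<app-id>',
                 client_secret='<app-secret>')

adl = core.AzureDLFileSystem(token, store_name='mydatalake')

adl.mkdir('/raw/sales')                                    # folders behave like HDFS directories
adl.put('sales_2017.csv', '/raw/sales/sales_2017.csv')     # upload a local file
print(adl.ls('/raw/sales'))                                # familiar ls semantics
```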

How long does it take to deploy? About 15 minutes, once you log into your Microsoft Azure portal. Now, I don't want to pretend that it's that easy. Most enterprises will be deploying more than just ADLS: Active Directory needs to be configured to secure data locations, and networking infrastructure needs to be put in place. For example, Azure ExpressRoute may need to be provisioned to allow secure communication between your data center and Microsoft Azure. Those processes can sometimes take weeks to complete.

But, while the networking infrastructure is being deployed, you can be working with a cloud-based Data Lake. Simpler encrypted connectivity options, like a Site-to-Site or Point-to-Site VPN, can be quickly configured for temporary use.

So, yes, to fully operationalize the Data Lake in the cloud, you’ll need to plan for several weeks. But, to begin a proof-of-concept, you only need to plan for a day, or maybe two, to deploy a Data Lake infrastructure capable of housing all the data you need.

Elastic Scale

In our fictitious example, we estimated 100TB of data storage, then applied some multipliers to account for unknowns, data replication, and working space. Our final number came to 720TB of needed storage.

What if our initial estimates were wrong? What if our multipliers were way off?

With an on-premises solution, you're probably going to be locked into the hardware that you purchased. Getting budget approval for 12 servers doesn't mean you can go out and buy 6 more because you underestimated. And the reverse is true as well: no one is going to pat you on the back because you over-estimated your storage requirements and are only using 20 percent of the expensive infrastructure that you just deployed.

With cloud solutions, like ADLS mentioned above, you only pay for what you use. What exactly does that mean? Well, if you upload 1TB of data during month 1, you’ll pay for 1TB of storage use for month 1. That’s it.

If it takes 24 months to hit the target of 200TB, you only pay for the monthly storage that you use as you climb to that final figure. Another plus with ADLS? The replication factor is handled behind the scenes. With our on-premises example, we had to account for the replication factor ourselves; with ADLS, we don't. If we upload 100TB of data, we pay for 100TB of data, not 300TB.
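To see what pay-as-you-grow means over time, here is a small sketch of that billing model. The flat per-TB rate is a hypothetical placeholder, not actual ADLS pricing, which is tiered and changes over time.

```python
# Illustration of pay-for-what-you-use billing as storage grows toward a target.
# PRICE_PER_TB_MONTH is a hypothetical flat rate, not real ADLS pricing.
PRICE_PER_TB_MONTH = 38.0

def cumulative_storage_cost(months, target_tb=200):
    """Cost of growing linearly from 0 to target_tb over the given number of months."""
    total = 0.0
    for m in range(1, months + 1):
        tb_used = target_tb * m / months        # storage actually in use during month m
        total += tb_used * PRICE_PER_TB_MONTH   # billed only on what is stored
    return total

# Each month's bill reflects only the data stored that month, rather than
# paying up front for the full 200TB of capacity.
print(f"${cumulative_storage_cost(24):,.0f} cumulative over 24 months")
```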

With cloud solutions, we don’t have to make a special request to increase storage limits, or ask for forgiveness when we grossly over-estimate what we think we need. We only pay for what we use. In many ways, it’s a simpler model.

Pricing and Cost Structure

It's no secret that cost is one of the major reasons that many organizations are looking to move to cloud platforms. The days of 3-to-5-year hardware refreshes are numbered. New projects, like building a Data Lake, that require massive hardware investments are closely watched, and serious questions are being asked about the value of purchasing, maintaining, and depreciating that hardware.

As IT becomes more and more of a cost center, it is becoming more effective to treat the cost of "doing IT" as an operational expense rather than a capital expense. So, just how much does the cloud cost? Let's use our 100TB estimation as a benchmark and dig deeper.

When we start to estimate cost, one of the major differences between on-premises and cloud platforms is exposed. With an on-premises solution, the storage layer and the compute layer are directly coupled. In Hadoop clusters (one of the most common platforms for implementing a Data Lake), each compute node uses locally attached storage, and the nodes' storage aggregates to the total cluster capacity.

With cloud platforms, however, storage and compute are separated into different services. ADLS provides only the storage layer. When it comes time to apply compute to the data, there are multiple engines available, and each of them can access the data in the Data Lake. Because of this difference, it can be a bit difficult to align estimates for a common comparison. For the example below, we will assume that the solution must be able to store up to 720TB of data, with 150TB of raw data actually stored, and that we are using comparable compute platforms.
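As a sketch of what that separation looks like in practice, here is a Spark job (for example, on an HDInsight cluster) reading data straight out of ADLS. The store name, path, and column name are placeholders for illustration.

```python
# Sketch: a Spark job (e.g., on HDInsight) reading raw data directly from ADLS.
# The adl:// scheme is how Hadoop-based engines address Azure Data Lake Store;
# the store name, path, and "region" column are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-example").getOrCreate()

raw = spark.read.csv(
    "adl://mydatalake.azuredatalakestore.net/raw/sales/sales_2017.csv",
    header=True,
    inferSchema=True,
)

# Compute runs on the cluster; the data never has to be copied off the Data Lake.
raw.groupBy("region").count().show()
```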

For the on-premises Data Lake (single environment), we’ll need the following components:

Component | Estimated Price
2x Hadoop Management Nodes | $30,000 (approx. $15,000 each)
30x Hadoop Data Nodes – 720TB capacity, 450TB used (150TB x 3 replicas) | $300,000 (approx. $10,000 each)
Network Components (intra-cluster switches) | $7,000
Hardware Support Contract (yearly) | $35,000
Hadoop Support Plan (yearly) | $45,000

Assuming an even depreciation rate of hardware over 5 years, the approximate monthly cost for an on-premises Data Lake solution is $12,283.
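For reference, here is a quick sketch of how that monthly figure falls out of the component estimates above.

```python
# Deriving the on-premises monthly cost from the component estimates above.
hardware = 30_000 + 300_000 + 7_000       # management nodes + data nodes + network
yearly_support = 35_000 + 45_000          # hardware support + Hadoop support plan

# Depreciate the hardware evenly over 5 years (60 months) and add monthly support.
monthly = hardware / (5 * 12) + yearly_support / 12
print(f"${monthly:,.0f} per month")       # -> $12,283 per month
```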

Let’s compare that with the monthly cost for a cloud platform solution (single environment) hosted with Microsoft Azure:

Component | Estimated Price
Azure Data Lake Store (150TB used, unlimited capacity) | $5,700
HDInsight Cluster (10 compute nodes, used for an average of 75 hrs/week) | $3,450
ExpressRoute (direct fiber connection to the Azure data center) | $820
Enterprise Support | $1,000

For a comparable cloud solution, the estimated monthly cost is $10,944. Please note that this pricing doesn't include any volume discounts that might be available; it is based on public Pay-As-You-Go pricing, not enterprise pricing, which is often offered at a cheaper per-unit rate in exchange for meeting consumption targets.

You can see here that the cloud solution isn't "pennies on the dollar" when compared directly to the on-premises solution, but it is cheaper. Additionally, the on-premises estimate does not include hidden costs, such as those required to run a data center (environmental, utilities, staff, etc.). The on-premises version is also much slower to react to change: adding more storage to an on-premises cluster requires more capital expenditure (buying more cluster nodes), while adding more storage to an Azure Data Lake simply comes with a higher monthly bill based on the amount of storage being used.

Think Cloud First

Implementing a Data Lake is not a small task, but it can reap huge rewards: increased availability of data, new doors opened for analytics that drive company growth, and more efficient data analysts.

Going cloud first means that you can take advantage of implementation efficiencies, deploying the Data Lake infrastructure faster. You’ll be able to take advantage of elastic scale, only paying for the storage that you’re using without worrying about maximum storage limits. And you’ll be able to take advantage of simplified cost structures, focusing on operational spending versus capitalized costs, with the ability to control the operational spending over time.

Have questions about implementing a Data Lake or just want to learn more? Contact us! 3Cloud will be happy to help you make an informed decision on the best solution for your environment.