Last week, we featured Data Lakes and Data Warehouses in our monthly webinar series. Throughout the presentation, we received several questions ranging from implementation to training recommendations. Below, you’ll find some insightful audience questions and answers from our presenters, Josh Fennessy and Merrill Aldrich.
Do you have any suggestions for handling schema changes that happen in Data Lake files? Specifically, the kind that wreak havoc with ‘schema-on-read’ tactics such as a PolyBase external table?
Merrill: This definitely is a problem, and in my mind, it has two parts. To start, it seems like the governance over the warehouse has to manage which sorts of files can change safely and which must not, because the downstream processes are “non-production” or more forgiving versus being tied into a whole lot of important code. For example, analysts might be frustrated if queries against the lake files break, but if the public uses them through, say, some website feature, it could be much worse. Secondly, the mechanics of actually encoding the change in the reading application (perhaps PolyBase) seem like they would have to be similar to a change in a source query for SQL – that is, sadly, a bit of old fashioned dev work.
Josh: Managing file formats is a big challenge with Data Lakes. It’s one of the reasons that we choose to use a staging area to help move the data to the “Raw” layer for permanent storage. There are a couple of approaches to help mitigate this problem. First, I would recommend that you store data with differing formats in different folders. Many of the compute tools that are used in a Data Lake look at all files in a folder, so it’s important to have them organized appropriately.
Secondly, for savvy programmers, there are options to build flexible processing. With Azure Data Lake Analytics, for example, there is a flexible schema extractor than can be used to deal with files that have different column counts. This blog post from Microsoft does a nice job of highlighting how to handle this problem in more detail.
What advice do you have when moving data from a Data Lake to a Data Warehouse?
Merrill: In order to get familiar with the options, I would start with reading a bit about Sqoop, PolyBase, Azure Data Factory, and even SSIS. There are many different tools available today so it might take a short research effort to match the best one to your needs. Here are a few resources to help you get started:
- PolyBase Guide
- Apache Sqoop Software Download
- Azure Data Factory Guide
- SQL Server Integration Services
Josh: In addition to the typical batch-type movement using Sqoop, Azure Data Factory, or even SSIS, you can also consider building a Lambda Architecture for dealing with data in motion AND batch data. Technologies like Kafka, Spark Streaming, Azure Event Hubs, and Azure Stream Analytics are all pieces that can be used to build solutions to move data from the Data Lake to the Data Warehouse. Reactive platform systems like Microsoft Flow also offer interesting possibilities to manage communication between the two systems.
Our Data Warehouse model has a Staging area, a Raw area and a Structured Analytics area, however, we do not have a Sandbox. Are we still on the path to a Data Lake?
Merrill: Well, if your Data Warehouse is a database system, then perhaps not – though the features you describe are useful in any case. A Data Lake typically can manage raw files, and can use a variety of tools to query and mine those files. If the Data Warehouse you refer to is a file store, then you may be on the path to a Data Lake.
Josh: One of the biggest factors that differ between a Data Lake and a Data Warehouse is the approach to loading and consuming data. With a Data Warehouse, we follow a ‘schema-on-write’ approach, meaning, we apply schema to our data as it’s ingested into the system. On the other hand, with a Data Lake, we follow a ‘schema-on-read’ approach. We DO NOT apply any schema to the data when it is ingested in the Data Lake, but rather when we query or consume data, we have to define the schema of our data for the job we are executing.
So, if your current environment follows more of a schema-on-read pattern – meaning you ingest data in its raw format and apply schema later on when you are running queries – then you’re probably on the path to a Data Lake. If not, however, and you’re fitting data into pre-defined database tables, then it’s more of a Data Warehouse approach.
When organizing the Data Lake into areas for Staging, Raw, etc., would you classify these as top-level containers? Or do you actually build separate Lake Stores?
Josh: The answer to this question depends on what technology you are using for your Data Lake. If it’s an on-premises Hadoop cluster, then we will often have multiple HDFS environments, each managed by a set of HDFS NameNodes. If you’re dealing with a platform-based solution like Azure Data Lake, it can make more sense to use top-level folders with appropriately designed security to manage user access. If you are using Azure Blob Storage as your Data Lake storage platform, then you’re probably going to be best served using a unique Storage Account for each area, as there are account size limits for Blob Storage.
How would you recommend triggering movement between the Staging and Raw areas?
Josh: Typically, in a Data Lake most of the data movement between areas is managed with batch processing. There are times when we would choose to use a Lambda Architecture to collect and do some basic analysis on data in motion. When using a Lambda Architecture, we will often land data in the Staging Area in real time, but later process the data in to the Raw and/or Curated areas using a batch process.
As data moves through Staging>Raw>Analytics, when and where would you match and merge customers together when gathering and analyzing customer lifetime value from multiple disparate sources?
Merrill: By convention, the staging and raw areas are mainly for untransformed, raw copies of source data, so it seems like if you are correlating, merging, and so on, that it would be an analytics function. That said, if it benefits the business to provide that match and merge to a large audience in a repeatable form, it could certainly be automated on the way into the analytics area.
Josh: Additionally, customer matching and cleansing is something we will typically do as we bring data into the Curated Layer – this process would probably feed the Data Warehouse.
Can I store relational data (SQL Server, Oracle, etc.) in a Data Lake?
Merrill: Not the data directly (as in the database files), but you can definitely export data out to defined files and store those in the Data Lake. For example, if you have a system that does a poor job of retaining history as data changes, and that history has value, you could extract snapshots on a daily or monthly basis of important point-in-time data and put the results in a Data Lake.
If relational data is stored, then how can I query the data out of it?
Merrill: PolyBase is one example of a way to query or join the data in files in the Data Lake with data in the database. The retrieval of the file-based data will not have the same level of performance as local data to the SQL Server, but this can work for historical data or analytics.
Josh: PolyBase is a truly unique piece of technology in that it allows us to effectively merge Data Warehouse usability with the massive storage capacity and flexibility of a Data Lake. As Merrill stated above, it can be a great way to join together data that is stored in the Data Warehouse with additional data in the Data Lake.
What toolsets and training do you recommend for an organization getting started with Data Lakes?
Josh: If you’re just getting started, I highly recommend considering a cloud-based solution. As we said during the webinar, cloud platforms, like Azure Data Lake, allow you to get started VERY quickly. We can deploy a Data Lake Storage account in just a few minutes and start uploading data right away.
PolyBase also offers a lot of flexibility in the compute layer with three major platforms for working with Azure Data Lake Store:
- Azure Data Lake Analytics is a “cluster-less distributed computing” platform that allows you to write data processing jobs in U-SQL without all of the complexity of managing a cluster.
- HDInsight is a Hadoop-as-a-Platform offering that brings the power and maturity of the open-source Hortonworks HDP distribution to the cloud. HDInsight is a great choice when you need to use a variety of tools to process your data, but don’t want to rely on a full-time administration task to keep things running.
- Azure Data Warehouse is a distributed SQL Platform that works much like on-premises SQL Server, but also includes PolyBase connected to Azure Data Lake Store.
Microsoft has done a good job of creating learning content as well. Learn Analytics is a great site that offers multiple training opportunities for Data Lake and Analytics in general. The Azure Documentation is pretty well done too! Microsoft maintains a large number of projects on GitHub that can also provide opportunities to take advantage of project templates to get started quickly.
What do you think about an Azure Data Lake solution?
Josh: Azure Data Lake is a great solution! While young, it is maturing nicely and we are excited to see how much in grows in the next 12 – 24 months.
What are the pros and cons for Google, Amazon, and Microsoft Data Lakes?
Josh: Most of our work at BlueGranite is with the Microsoft cloud, so unfortunately, we are not fully up to speed on all of the features of Amazon and Google. All three vendors offer competitive pricing on hyper-scale storage and have flexible options for computing platforms. All vendors have Hadoop-as-a-platform solutions available as well.
I think Microsoft is in a unique position when it comes to Enterprise integration in that Azure Active Directory Sync allows for same-sign-on or, in some cases, even single-sign-on access to Cloud Resources. Azure Data Lake has Active Directory security built in, so it’s pretty easy to manage user access to Data Lake Resources. PolyBase is also another HUGE differentiating factor. It’s the ‘missing link’ between the Data Warehouse and the Data Lake, plus it provides opportunities for tight integration between on-premises infrastructure and cloud platforms.
Do you have some high-level business use cases for when you would use a Data Lake versus a Data Warehouse?
Josh: Yes! Here are some examples of Data Lake solutions we have implemented for our customers:
- Manufacturer Explores Ways Big Data Can Boost the Bottom Line
- Century-old Charity Looks to Map Future through Big Data
Another example you can check out to get your idea-engine going is this blog post on the Top 6 Use Cases to Help You Understand Big Data Analytics.
Thanks to everyone who joined us for the webinar! If you have more questions for Josh and Merrill, or just want to chat about your Data Lake and Data Warehouse environment, feel free to drop us a line.