Raw. Unfiltered. Data. The raw zone – it’s the dark underbelly of your data lake, where anything can happen. The CRM data just body-slammed the accounting data, while the HR data is taking a chair to the marketing data. It’s all a rumble for the championship belt, right? Oh, wait – we’re talking data lakes. Sorry. If the raw zone isn’t where data goes to duke it out, then what is the raw zone of a data lake? How should it be set up?
First, let’s take a time-out to give some context. A data lake is a central storage pool for enterprise data; we pour information into it from all kinds of sources. Those sources might include anything from databases to raw audio and video footage, in unstructured, semi-structured, and structured formats. A data warehouse, conversely, only houses structured data. The data lake is divided into one or more zones of data, with varying degrees of transformation and cleanliness (see this video for more: Data Lake Zones, Topology, and Security). The raw zone is the foundation upon which all other data lake zones are built.
The raw zone is the original conception of a data lake. It captures the raw data exactly as it appears in the source, but it's not just a current copy of the raw data. Instead, it keeps every version of the raw data indefinitely. For all your sources. Oh yeah! That is a lot of investment in data storage. What do you get in return? You get three advantages:
- Auditability – Because you have an exact copy of the original data, you can look back at your data to confirm that the derived data in other zones is accurate.
- Discovery – With the raw data, a data scientist may identify new measures or attributes that can improve the data models used in day-to-day analysis. It guards against the maxim, “You don’t know what you don’t know.”
- Recovery – If an attribute or calculation needs to be added or changed for successive zones, you can completely rebuild successive zones’ data with the new or changed data.
Now that we have some context, let’s talk about what this raw zone does NOT look like. Simply put, if your data resembles the chaos of a wrestling match, you are doing this raw zone thing wrong. Despite how it sounds, dumping data into the raw repository with no organization is a one-way ticket to a data swamp; you’ll be mired in the muck, pinned by the weight of trying to retrieve anything useful from the data. If it’s not a no-holds-barred free-for-all, what should a raw zone look like? What does the data look like?
First, let’s look at it as if we’re standing on the top turnbuckle looking down at our opponent. The data lake is file-based storage, and that means we’ve got a directory structure. The data needs to live somewhere, and we can leverage its location in the directory structure to give coherency to this raw data. At the top level, we use folders to demarcate each zone of our data lake. That’s only a start. Within the raw zone, there’s flexibility in how it’s organized, but there are a few bits of metadata that you should capture within the folder structure for your sources. These include the:
- Data Source – This is the logical name of the data source, uniquely identifying the source. This usually can be accomplished with naming conventions like “CRM” or “Finance Database”. Sometimes, though, when you are capturing from multiple similar or nearly identical systems, the data source might be represented by a folder hierarchy instead of a simple folder (e.g., region/server/database).
- Internal Structure – It’s always useful to capture the internal structure of your data source. For a traditional database, that includes capturing the schema and table name. For file-based sources, you’ll want to mirror the folder structure here. Include as much structure as the product provides, even if you’re not currently making use of it. A database may begin by only using the default schema but, as it matures, start migrating subsections to different schemas.
- Timestamp – Finally, because you’re capturing the same data as it changes over time, it needs to be marked with a timestamp. Embedding the timestamp in the folder structure allows you to keep the data untouched.
That gives us quite a bit of metadata that we’re encoding right into the folder structure. Usually, we recommend layering the folders in the order we’ve presented: source, structure, then timestamp. This often makes the most sense in terms of governance and ability to transfer to other zones.
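To make that layering concrete, here’s a minimal sketch of a helper that composes a raw zone path in the recommended order: source, then internal structure, then timestamp. The function name, the `raw` top-level folder, and the timestamp layout are illustrative assumptions, not a standard; adapt them to your own conventions.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def raw_zone_path(source: str, structure: list[str], ts: datetime) -> PurePosixPath:
    """Compose a raw zone folder path: zone, source, internal structure, timestamp.

    The timestamp is split into date folders so browsing and pruning by
    day stays easy; this particular layout is just one reasonable choice.
    """
    timestamp = ts.strftime("%Y/%m/%d/%H%M%S")
    return PurePosixPath("raw", source, *structure, timestamp)

# A CRM extract of the sales.customers table, captured at a known instant:
path = raw_zone_path(
    "CRM",
    ["sales", "customers"],
    datetime(2023, 5, 1, 8, 30, tzinfo=timezone.utc),
)
print(path)  # raw/CRM/sales/customers/2023/05/01/083000
```

For a fleet of nearly identical systems, the `source` argument could itself expand into a region/server/database hierarchy, exactly as described above.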
Now that we’ve covered the overall structure, let’s get down to the mat and get technical for a moment. We’ve been talking about the raw data and how it’s pristinely copied from the source with no changes. When your source is a filesystem or set of files, you simply do a binary copy of the files. With a database, we want to extract the logical table data into a file that we can store in our data lake. So, the data extract/conversion is something to carefully consider. That’s right – we’re going to talk file formats.
With data extracts, you’ve got several choices: CSV, JSON, Parquet, ORC, Excel, and more. Since you’re coming from a structured source, the ideal would be to choose a medium that can reliably transfer not only the data, but the structural metadata as well. Compression will be a benefit for network transfer and storage volume too. Those two considerations point you to Parquet or ORC, and they are substantially similar. Choose the one that best integrates into your workflow.
If you structure your raw zone correctly and choose sensible file formats, you will be building the foundation layer for a successful data lake. As you import your raw data from your sources, you’ll build up a treasure trove of information available to be cleansed and structured into new analytic insights. And maybe, just maybe, you’ll win that data championship belt.
If you’re interested in my next post covering the data lake curated zone, you can read it here!