At its core, data needs to be centralized and processed before it can be analyzed. Many data environments support this core function, and new formats keep appearing as technology evolves. The right data environment for your business depends heavily on your data sources and the type of analysis you perform. This blog will provide:

  • A high-level overview of the typical data environments
  • The new concept of a data lakehouse
  • An introduction to Azure Synapse Analytics and Databricks

Common Data Environments

Let’s start with one of the most common data environments used in enterprises: a data warehouse. To understand a data warehouse, first think back to an even more common tool, the database. A Database is a collection of data organized so that it can be retrieved and accessed. Something as simple as an Excel file can be considered a database, while other platforms, like a customer relationship management (CRM) database, are more complex. Databases, however, are designed to simplify data processing and storage and are not optimized for analytics.

A Data Warehouse is a centralized repository for data from more than one data source. It pulls data from multiple source systems and processes it for analytics.
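As a rough illustration of that idea, the Python sketch below loads records from two made-up source systems (a CRM export and an order system) into one store and runs a query that spans both. An in-memory SQLite database only stands in for a real warehouse here, and every table, column, and function name is an assumption made for the example.

```python
import sqlite3

# Hypothetical records from two separate source systems.
crm_customers = [
    {"customer_id": 1, "name": "Acme Corp", "region": "East"},
    {"customer_id": 2, "name": "Globex", "region": "West"},
]
order_system_rows = [
    {"order_id": 101, "customer_id": 1, "amount": 250.0},
    {"order_id": 102, "customer_id": 1, "amount": 100.0},
    {"order_id": 103, "customer_id": 2, "amount": 75.5},
]


def build_warehouse(customers, orders):
    """Load both sources into one central store (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:customer_id, :name, :region)", customers
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer_id, :amount)", orders
    )
    conn.commit()
    return conn


def revenue_by_region(conn):
    """An analytical query that only works once both sources are centralized."""
    rows = conn.execute(
        "SELECT c.region, SUM(o.amount) FROM orders o "
        "JOIN customers c ON c.customer_id = o.customer_id "
        "GROUP BY c.region ORDER BY c.region"
    ).fetchall()
    return dict(rows)


warehouse = build_warehouse(crm_customers, order_system_rows)
print(revenue_by_region(warehouse))  # {'East': 350.0, 'West': 75.5}
```

Neither source system could answer "revenue by region" on its own, because the region lives in the CRM data and the amounts live in the order data. The question only becomes easy once both are in one place.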

Before going into our next topic, it’s essential to understand the different types of data used within a data environment: structured and unstructured.

Structured Data follows a pre-defined data model, such as the rows and columns of a table. Unstructured Data does not have a pre-defined data model or is not stored in an organized manner.

According to Forbes, as much as 90 percent of data is unstructured.

A Data Lake is an environment where raw data (unstructured, semi-structured, or structured) is held with minimal modification. In a Data Warehouse, by contrast, data must first undergo a transformation process before it can be stored.
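The difference can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the three raw records, the two-field "schema", and the `to_warehouse_row` helper are all hypothetical, and plain Python lists stand in for real storage.

```python
import json

# Hypothetical raw records as they might arrive from different sources.
raw_events = [
    '{"user": "ana", "action": "login", "ts": "2021-03-01T09:00:00"}',
    '{"user": "ben", "action": "purchase", "amount": "19.99"}',
    "free-text support note: customer could not reset password",
]

# Data lake style: keep every record exactly as received.
data_lake = list(raw_events)


def to_warehouse_row(raw):
    """Warehouse style: parse and validate first; return None if the record does not fit."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # unstructured text does not fit the fixed schema
    if not isinstance(record, dict) or "user" not in record or "action" not in record:
        return None
    # Transform step: convert the amount from text to a number when present.
    amount = float(record["amount"]) if "amount" in record else None
    return (record["user"], record["action"], amount)


warehouse_rows = [row for row in map(to_warehouse_row, raw_events) if row is not None]

print(len(data_lake))   # 3
print(warehouse_rows)   # [('ana', 'login', None), ('ben', 'purchase', 19.99)]
```

The lake keeps all three records, including the free-text note, while the warehouse path keeps only the two that survive transformation into its fixed shape.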

Data Lakehouse

New to the scene is the idea of a Data Lakehouse. The concept addresses the limits of data lakes and is “enabled by a new open and standardized system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes.” (Databricks, 2020) Data lakehouses are also positioned as highly reliable storage and as a more modern alternative to a traditional data warehouse.
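To make the quoted idea a little more concrete, here is a toy Python sketch of warehouse-style structure applied directly to plain files. It is not how Databricks or any real lakehouse format works internally. The `ToyTable` class, its schema format, and the file naming are all invented for illustration, and real systems add transactions, versioning, and much more.

```python
import json
import tempfile
from pathlib import Path


class ToyTable:
    """A table kept as plain files in a directory, with a schema checked on write."""

    def __init__(self, root, schema):
        self.root = Path(root)
        self.schema = schema  # column name -> required Python type
        self.root.mkdir(parents=True, exist_ok=True)

    def append(self, rows):
        # Validate the whole batch first so a bad batch writes nothing.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns do not match schema: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"{column!r} must be {expected.__name__}")
        # Each accepted batch becomes one new file, like files landing in a lake.
        count = len(list(self.root.glob("part-*.jsonl")))
        part = self.root / f"part-{count:05d}.jsonl"
        part.write_text("".join(json.dumps(row) + "\n" for row in rows))

    def read(self):
        rows = []
        for part in sorted(self.root.glob("part-*.jsonl")):
            rows.extend(json.loads(line) for line in part.read_text().splitlines())
        return rows


with tempfile.TemporaryDirectory() as tmp:
    table = ToyTable(Path(tmp) / "sales", {"region": str, "amount": float})
    table.append([{"region": "East", "amount": 350.0}])
    try:
        # The amount is text, not a number, so the schema check rejects the batch.
        table.append([{"region": "West", "amount": "75.5"}])
        rejected = False
    except ValueError:
        rejected = True
    stored = table.read()

print(rejected)  # True
print(stored)    # [{'region': 'East', 'amount': 350.0}]
```

The storage is still just cheap files in a folder, as in a data lake, but writes are checked against a schema the way a warehouse would check them.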

Databricks and Azure Synapse Analytics

Microsoft Azure Synapse Analytics is a tool that combines data integration, data warehousing, and big data analytics. Azure Synapse Analytics, combined with Databricks, can amplify an enterprise’s analytics power. Kevin Clugage from Databricks reviews the two platforms in his blog post “The Analytics Evolution with Azure Databricks, Azure Synapse, and Power BI.”

“Azure Databricks provides the best environment for empowering data engineers and data scientists with a productive, collaborative platform and code-first data pipelines. Azure Synapse provides high-performance data warehousing for low-latency, high-concurrency BI, integrated with no-code / low-code development.” (Databricks, 2020)

Databricks is one of the companies that popularized the term Data Lakehouse. Its platform combines elements of a data warehouse and a data lake and provides data streaming, data science, and business intelligence capabilities to enterprises.