Brian Custer

Highly qualified, talented, tenacious, and accomplished professional with 23 years’ experience in data and software engineering, armed with a broad-based background and the skills to help companies large and small become digital, data-driven enterprises.

What is Delta Lake in Databricks?

If you’re not familiar with Delta Lake in Databricks, I’ll cover what you need to know here. Delta Lake is a technology developed by the original creators of Apache Spark. It’s designed to bring reliability to your data lakes: it provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing.

Let’s begin with some of the challenges of data lakes:

  • Data lakes are notoriously messy because everything gets dumped there. Sometimes there’s no rhyme or reason for storing a dataset; we just think we might need it at some later date.
  • Much of this mess comes from the many small files and mixed data types a data lake accumulates. Because those small files are never compacted, reading them efficiently is difficult, if not impossible.
  • Data lakes often contain bad or corrupted data files, so you can’t analyze them without going back and pretty much starting over.

This is where Delta Lake comes to the rescue! It delivers an open-source storage layer that brings ACID transactions to Apache Spark big data workloads. So, instead of the mess I described above, Delta Lake gives you a reliable layer over your data lake. It provides ACID transactions through a transaction log associated with each Delta table created in your data lake. This log records the history of everything ever done to that table or dataset, bringing a high level of reliability and stability to your data lake.
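
To make this concrete, here is a minimal PySpark sketch of creating a Delta table and inspecting its transaction log and metadata. It assumes a Databricks (or Delta-enabled Spark) session where `spark` is already defined; the path and column names are hypothetical.

```python
from delta.tables import DeltaTable

# Hypothetical sample data; `spark` is the session Databricks provides.
events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["event_id", "event_type"]
)

# Each write is recorded as an atomic transaction in the table's _delta_log.
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# The log holds the history of everything ever done to the table.
DeltaTable.forPath(spark, "/tmp/delta/events").history().show()

# DESCRIBE DETAIL displays the metadata associated with the table.
spark.sql("DESCRIBE DETAIL delta.`/tmp/delta/events`").show()
```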

Key Features of Delta Lake are:

  • ACID Transactions (Atomicity, Consistency, Isolation, Durability) – With Delta you don’t need to write any extra code; every transaction is automatically written to the log. This transaction log is the key, and it represents a single source of truth for the table.
  • Scalable Metadata Handling – Handles terabytes or even petabytes of data with ease. Metadata is stored just like data, and you can display it with the DESCRIBE DETAIL command (used in the sketch above), which shows all the metadata associated with the table. Delta puts the full force of Spark behind your metadata.
  • Unified Batch & Streaming – There’s no longer a need for separate architectures to read a stream of data versus a batch of data, which overcomes the limitations of split streaming and batch systems. A Delta Lake table is both a batch and streaming source and sink. You can run concurrent streaming or batch writes to your table, and it all gets logged, so it’s safe and sound in your Delta table.
  • Schema Enforcement – This is what makes Delta strong in this space. If you put a schema on a Delta table and try to write data that doesn’t conform to it, Delta raises an error and refuses the write, protecting you from bad writes (see the first sketch after this list). Enforcement reads the schema as part of the metadata, checks every column and data type, and ensures that what you’re writing matches what the schema says the table should contain – no need to worry about writing bad data to your table.
  • Time Travel (Data Versioning) – You can query an older snapshot of your data, giving you data versioning and the ability to roll back or audit changes (see the second sketch after this list).
  • Upserts and Deletes – These operations are typically hard to do without something like Delta. Delta lets you do upserts, or merges, very easily: much like a SQL MERGE, you can merge data from another DataFrame into your Delta table, performing updates, inserts, and deletes in a single statement. You can also run a regular update or delete with a predicate on the table – something that was almost unheard of before Delta (see the third sketch after this list).
  • 100% Compatible with Apache Spark
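
First, a minimal sketch of schema enforcement in action, continuing with the hypothetical events table from above: appending rows whose schema doesn’t match the table is rejected rather than silently corrupting your data.

```python
# A write whose schema doesn't match the table's schema is refused.
bad_rows = spark.createDataFrame(
    [(3, "scroll", "oops")], ["event_id", "event_type", "unexpected_col"]
)
try:
    bad_rows.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:  # Delta raises an AnalysisException on mismatch
    print(f"Write rejected: {err}")
```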
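Second, a sketch of time travel, reading older snapshots by version number or by timestamp (both values are hypothetical):

```python
# Read the table as of its first version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()

# Or read it as of a point in time.
old = (spark.read.format("delta")
            .option("timestampAsOf", "2020-06-01 00:00:00")
            .load("/tmp/delta/events"))
```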
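Third, a sketch of upserts and deletes via the DeltaTable API, again using the hypothetical table and columns from the earlier sketches:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

target = DeltaTable.forPath(spark, "/tmp/delta/events")
updates = spark.createDataFrame(
    [(1, "click-updated"), (99, "new-event")], ["event_id", "event_type"]
)

# Upsert (merge): update matching rows, insert the rest, in one atomic commit.
(target.alias("t")
       .merge(updates.alias("u"), "t.event_id = u.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Plain update and delete with a predicate, directly against the table.
target.update(condition="event_id = 2", set={"event_type": lit("archived")})
target.delete("event_id = 99")
```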

Delta Lake is really a game changer and I hope you educate yourself more and start using it in your organization. You’ll find a great training resource from the Databricks community at: https://academy.databricks.com/category/self-paced

Or reach out to us at 3Cloud. Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or [email protected].

 


A Look at Azure Synapse Studio (in Preview)

Azure Synapse, formerly Azure SQL Data Warehouse, is an analytics service that gives you the ability to query data on your own terms and at scale. It brings together enterprise data warehousing and big data analytics. Azure Synapse pairs serverless on-demand or provisioned resources with a unified experience, so you can ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.

In this post I’ll show you a new Azure Synapse feature in preview called Azure Synapse Studio that I’m excited about. The Studio is a one-stop shop for working with data of any size, but particularly big data. You can ingest data, explore and analyze data you already have in the workspace, and visualize data using Power BI.

In my video demo included in this post, I’ll walk you through Azure Synapse Studio and show you what you can do with it. On the Home page, you can click the New button to see everything you can create: a new SQL script, which can be executed against your SQL on-demand pool or another SQL database connected to your Studio workspace; a notebook, which you can run in either a SQL or Spark context; as well as dataflows, pipelines, and Power BI reports.

In my demo I’ll dig into the Studio and show you how to:

  • create a notebook, ingest data, and explore your data by working in either a notebook or a pipeline (a minimal notebook sketch follows this list).
  • connect to external data such as an Azure Cosmos DB or Azure Data Lake Storage Gen2 instance.
  • create a new database with the Manage feature, where you can create a new SQL pool or an Apache Spark pool.
  • create linked services to connect to your Data Factory pipelines and instances to work within your databases.
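
To give a feel for the notebook experience, here’s a minimal sketch of a cell you might run in a Spark context; the storage account, container, folder, and column name are hypothetical stand-ins for your own data.

```python
# Read parquet files from a hypothetical ADLS Gen2 location; `spark` is the
# session a Synapse Studio notebook provides in the Spark context.
df = spark.read.parquet(
    "abfss://raw@mystorageaccount.dfs.core.windows.net/sales/2020/"
)

df.printSchema()
df.groupBy("region").count().show()  # quick exploratory aggregate
```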

There are so many great things you can do with Azure Synapse Studio, and I think it will be very helpful to many people, from data engineers to data scientists and business analysts, allowing them to work together in a single workspace without flipping back and forth between various tools. I highly suggest you sign up for this preview feature and give it a try.


Need further help? Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or [email protected].


A Tutorial of Azure Data Studio

What do you know about Azure Data Studio? This application is a cross-platform database tool for data professionals who analyze data and do ETL work. Azure Data Studio is similar to SQL Server Management Studio but has much more functionality for data engineering tasks.


What is Databricks Community Edition?

By now, most of you have probably heard about Databricks. Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
