A common discussion we’ve had lately is about using Azure Databricks within Azure Data Factory for ETL.

Why would you consider using Databricks, particularly in Azure Data Factory, as part of your ETL processing? Let me walk you through three use cases:

1. Integrating machine learning into your processing. With Databricks, we can use scripts to integrate or execute machine learning models. This makes it simple to feed a dataset into a model and have Databricks render a prediction, for example. You can then output the results of that prediction into a table in SQL Server.

2. Using Databricks tooling and code for transformations. Azure Data Factory currently has Dataflows, a feature in preview that provides some great functionality. But if you want to write custom transformations in Python, Scala or R, Databricks is a great way to do that.

3. Using Data Lake or Blob storage as a source. If your source data is in either of these, Databricks is very strong at working with it. It is designed for querying and processing large volumes of data, particularly data stored in a system like Data Lake or Blob storage.
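
Whichever use case applies, Data Factory typically hands the work to Databricks through a notebook activity in a pipeline. Here's a minimal sketch of what that activity definition might look like, built as a Python dictionary and printed as JSON. The activity, pipeline, notebook path, linked service and parameter names are all hypothetical placeholders, not names from a real environment.

```python
import json


def databricks_notebook_activity(name, notebook_path, linked_service, parameters=None):
    """Build the JSON body of a Data Factory 'DatabricksNotebook' activity."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        # Points at the Azure Databricks linked service defined in the factory.
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            # Values passed into the notebook as parameters.
            "baseParameters": dict(parameters or {}),
        },
    }


# Hypothetical names, for illustration only.
activity = databricks_notebook_activity(
    name="TransformSales",
    notebook_path="/Shared/etl/transform_sales",
    linked_service="AzureDatabricksLinkedService",
    parameters={"run_date": "2019-01-31"},
)

pipeline = {
    "name": "DatabricksEtlPipeline",
    "properties": {"activities": [activity]},
}

print(json.dumps(pipeline, indent=2))
```

The notebook itself holds the machine learning or transformation logic, so the pipeline definition stays small and Data Factory only handles scheduling and orchestration.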

My diagram below shows a sample of what the second and third use cases above might look like.

[Diagram: DatabricksInADF_04]

The top portion shows a typical pattern we use, where I may have some source data in Azure Data Lake, and I would use a copy activity from Data Factory to load that data from the Lake into a stage table. Using either a SQL Server stored procedure or an SSIS package, I would then do some transformations there before loading my final data warehouse table.
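
As a rough sketch of that first pattern, the two steps could be expressed as a copy activity followed by a stored procedure activity that only runs if the copy succeeds. The dataset, linked service and stored procedure names below are hypothetical, and the source and sink types assume delimited text files in the Lake and a stage table in Azure SQL Database, so adjust them to your own datasets.

```python
import json


def reference(name, ref_type):
    """Build a Data Factory reference to a dataset or linked service."""
    return {"referenceName": name, "type": ref_type}


# Step 1: copy the files from the Lake into the stage table.
copy_to_stage = {
    "name": "CopyLakeToStage",
    "type": "Copy",
    "inputs": [reference("LakeSalesFiles", "DatasetReference")],
    "outputs": [reference("StageSalesTable", "DatasetReference")],
    "typeProperties": {
        # Assumed formats: delimited text in, Azure SQL Database out.
        "source": {"type": "DelimitedTextSource"},
        "sink": {"type": "AzureSqlSink"},
    },
}

# Step 2: transform the staged rows and load the warehouse table.
load_warehouse = {
    "name": "LoadWarehouseTable",
    "type": "SqlServerStoredProcedure",
    # Run only after the copy activity has succeeded.
    "dependsOn": [
        {"activity": "CopyLakeToStage", "dependencyConditions": ["Succeeded"]}
    ],
    "linkedServiceName": reference("WarehouseSqlDatabase", "LinkedServiceReference"),
    "typeProperties": {"storedProcedureName": "dbo.usp_LoadFactSales"},
}

pipeline = {
    "name": "StageThenTransformPipeline",
    "properties": {"activities": [copy_to_stage, load_warehouse]},
}

print(json.dumps(pipeline, indent=2))
```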

The bottom portion shows how I could use Databricks to query that data out of Data Lake and put it into the Databricks cluster. Then within my Databricks cluster, I can perform my transformations using Databricks code and logic.

I could then use Databricks to output that transformed data directly into my data warehouse table. Which pattern you use depends a lot on the data that you have and the transformations you want to use.
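
To give a feel for the Databricks pattern, here's a minimal PySpark sketch of a notebook that reads from the Lake, transforms on the cluster and writes to the warehouse table. It assumes Data Lake Storage Gen2 (hence the abfss:// path), CSV source files with customer_id and amount columns, and a warehouse reachable over JDBC. The storage account, container, server, database and column names are all hypothetical, and the cluster is assumed to already have credentials for the storage account.

```python
def lake_uri(container, account, path):
    """Build an abfss:// URI for a path in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def warehouse_jdbc_url(server, database):
    """Build a SQL Server JDBC URL for an Azure SQL logical server."""
    return f"jdbc:sqlserver://{server}.database.windows.net:1433;database={database}"


def run_etl(spark, source_uri, jdbc_url, target_table, user, password):
    """Read from the Lake, transform on the cluster, write to the warehouse."""
    # Imported here so the helpers above work without Spark installed.
    from pyspark.sql import functions as F

    # Query the raw files out of the Lake into the cluster.
    raw = spark.read.option("header", "true").csv(source_uri)

    # Example transformation: total amount per customer.
    transformed = (
        raw.withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("amount").isNotNull())
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # Output the transformed data directly into the warehouse table.
    (
        transformed.write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", target_table)
        .option("user", user)
        .option("password", password)
        .mode("append")
        .save()
    )


# Hypothetical names, for illustration only.
source_uri = lake_uri("raw", "mydatalake", "/sales/2019/")
jdbc_url = warehouse_jdbc_url("my-sql-server", "warehouse")
print(source_uri)
print(jdbc_url)
```

In a Databricks notebook the spark session already exists, so you would call run_etl(spark, source_uri, jdbc_url, ...) with your own table name and credentials, ideally pulling the password from a secret store rather than hard-coding it.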

I wanted to share these three real-world use cases for using Databricks in your ETL, and more particularly with Azure Data Factory.
