I recently delivered an innovative production implementation built on Azure Databricks. I worked with a large retail client to build a Data Platform and BI solution that measures retail sales, market share, and product share for partnering firms. The company's internal team and project sponsors were already familiar with the cutting-edge capabilities of Azure Data Services, which allowed us to collaboratively design and build a progressive solution for their Data Engineering and Artificial Intelligence use cases.
For one specific solution, we designed an architecture and coding framework that deploys client-agnostic solutions driven by metadata. The solution template allows the company to onboard new clients quickly: the team fills out some data source metadata, pushes a few buttons, and the new client is plugged into the platform. The organization can now get a new client up and running in a couple of days, delivering rich insights on that client's retail data.
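To make the metadata-driven idea concrete, here is a minimal sketch in Python. The metadata fields and function names (`SourceMetadata`, `build_ingestion_config`, and so on) are illustrative assumptions, not the actual production schema:

```python
# Hypothetical sketch of metadata-driven client onboarding.
# Field names and structure are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class SourceMetadata:
    client_id: str        # unique identifier for the partnering firm
    source_path: str      # landing location of the raw flat files
    file_format: str      # e.g. "csv"
    delimiter: str = ","


def build_ingestion_config(entries: list) -> dict:
    """Turn filled-out metadata rows into per-client pipeline configs."""
    return {
        e.client_id: {
            "input": e.source_path,
            "format": e.file_format,
            "options": {"delimiter": e.delimiter, "header": "true"},
        }
        for e in entries
    }


# Onboarding a new client is just adding one metadata entry:
config = build_ingestion_config([
    SourceMetadata("client_a", "/mnt/raw/client_a/sales", "csv"),
])
```

The pipeline code never references a specific client; it simply enumerates whatever metadata entries exist, which is what makes the template client-agnostic.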
This architecture includes the use of many of the data services in Azure.
- Azure Blob Storage: used to land raw, flat-file-based data.
- Azure Data Lake Store: stores the Data Lake and Data Mart databases. This is the primary storage layer for the implemented architecture.
- Azure Data Factory: event-based data orchestration pipeline execution.
- Azure Databricks: data engineering and artificial intelligence compute on top of the data using Apache Spark. This is the primary compute hub for the implemented architecture.
- Azure Analysis Services: rich semantic layer for enterprise-scale reporting.
- Azure Logic Apps: used for processing Azure Analysis Services.
- Power BI Premium: used to visualize and display retail data insights.
- Git Integration: all source code was stored in GitHub. Databricks and Azure Data Factory both have native Git integration, which made source control straightforward.
- Azure DevOps: used for Continuous Integration and Continuous Delivery (CI/CD) pipelines.
- NO relational database solution was used in this project.
This Modern Data Platform architecture (or Data Lake architecture) provides many benefits to the client.
- Highly Scalable & Elastic: compute can be turned on or off on demand. While off, the solution incurs only storage charges. The solution also has built-in cost controls and auto-shutdown after periods of inactivity. Compute can also be scaled up/down as needed within minutes and has auto-scale up or down capabilities based upon resource utilization.
- Event-Driven Processing: data orchestration pipelines are event driven and only begin when initiated by upstream processes. This eliminates running jobs before the data is ready.
- Flexibility of Multiple Languages: the application is programmed in a combination of SQL, Python, and Scala. SQL is primarily used for data transformations. Python provides the flexibility to enumerate metadata and construct dynamic code. Scala is used to build repeatable data ingestion. The platform also supports R and Java.
- Utilizing Fast and Cheap Storage: the platform is utilizing fast, cheap, and secure storage from Azure Data Lake Store, which is essentially the Hadoop Distributed File System (HDFS) as a platform service, built for analytics.
- Mixed-Use Case Support: the solution platform can handle Data Engineering, Artificial Intelligence (AI), and Streaming use cases in the same platform with zero additional tools or configuration. Our production solution is currently Data Engineering-focused, but the client is also prototyping AI use cases.
- No Relational Database Needed: while Azure SQL Database is a powerful service, it currently stays on in a persistent state and has no Pause capability (Azure SQL Data Warehouse does offer Pause, but it wasn't a good fit for this client).
- Azure DevOps: the client led the charge here, but Azure DevOps provided a crucial capability for the project. DevOps pipelines read code directly from GitHub and deploy it to the various environments automatically. It takes some setup, but Azure DevOps ended up saving a lot of time.
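The multi-language point above is easiest to see with the Python-generates-dynamic-SQL pattern. Here is a hedged sketch; the schema, table, and column names are invented for illustration and the real framework is more involved:

```python
# Illustrative sketch of Python enumerating metadata to construct
# dynamic SQL. Table and column names are hypothetical.

def build_load_statement(client_id: str, columns: list) -> str:
    """Construct a per-client INSERT ... SELECT from a staging table."""
    col_list = ", ".join(columns)
    return (
        f"INSERT INTO mart.sales_{client_id} "
        f"SELECT {col_list} FROM staging.sales_{client_id}"
    )


# One metadata dictionary drives the SQL for every client:
clients = {"client_a": ["sale_date", "store_id", "amount"]}
statements = [build_load_statement(c, cols) for c, cols in clients.items()]
# In Databricks, each statement could then be run with spark.sql(...).
```

Because the SQL is generated rather than hand-written per client, adding a client changes the metadata, not the code.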
Flexibility and Scalability were Key Benefits
Overall, Databricks provides a solid platform for creating analytic solutions in Azure, and, with Databricks Delta, even for housing the analytic databases used by the solution. The platform is highly flexible. We were able to create new Databricks workspaces instantly and launch Spark clusters in less than five minutes to begin coding immediately. Being able to bounce back and forth between SQL and Python, depending on the need, was immensely beneficial. This language flexibility also allowed us to ramp up new team members very quickly: our project team consisted of client and BlueGranite developers, each most comfortable in a different go-to language. The ability to scale up and down and turn the solution on or off automagically saved time and money. Finally, being able to pivot between Data Engineering and AI with zero additional environment configuration made it a no-brainer to begin testing Machine Learning and Artificial Intelligence capabilities.
Learning What Works Best
Like any new technology solution (Databricks and Spark aren't new, just new to Azure), there is a learning curve, and the product is still maturing. While we found Databricks' integration with other Azure data services to be good overall, each integration is at a different level of maturity, and we had to experiment to find what worked best for our architecture. One example is the integration between Azure Analysis Services (AAS) and Databricks: Power BI has a native connector to Databricks, but that connector hasn't yet made it to AAS. To compensate, we deployed a Virtual Machine running the Power BI Data Gateway and installed Spark drivers in order to connect AAS to Databricks. This wasn't a showstopper, but we'll be happy when AAS has a more native Databricks connection.
One of our biggest project challenges actually had nothing to do with Databricks or Azure. We suffered from the classic garbage-in, garbage-out scenario: at one point during development, a significant number of bugs were reported, but they turned out to be caused by issues in the source files our platform was accepting. We had to add data validations upstream to remediate the issue. It was a wake-up call that process sometimes needs to be addressed and finalized before building a technology solution.
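The kind of upstream validation we added can be sketched as a simple pre-ingestion check. The rules below (required, non-empty columns) and the column names are simplified assumptions for illustration:

```python
# Simplified sketch of upstream source-file validation.
# Rules and column names are assumptions for illustration only.

def validate_rows(rows: list, required: list) -> list:
    """Return human-readable errors; an empty list means the file is valid."""
    errors = []
    for i, row in enumerate(rows):
        for col in required:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                errors.append(f"row {i}: missing value for '{col}'")
    return errors


# Reject a bad file before it ever reaches the pipeline:
rows = [{"store_id": "42", "amount": ""}]
problems = validate_rows(rows, required=["store_id", "amount"])
```

Running checks like this at the point of file acceptance surfaces source-data problems as validation errors instead of downstream platform bugs.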
Overall, using Databricks as the backbone for our client’s solution was a huge success. If you’d like to learn more about how Databricks can work in your environment, contact BlueGranite today.