There are many options for data storage, how do you know which is right for your data? Today I’d like to discuss storage in relation to the architecture of the modern data warehouse and to shed some light on your options.
The modern data warehouse can come in different forms: a relational warehouse which is stored as a traditional star or snowflake schema, a hub and spoke model or a data vault model. In these different schemas we’re storing a lot of data within and we can then derive data marts from the larger data warehouse for reporting and analytics.
Or it could be a data lake and using cloud-based, more modern storage mechanisms that serve as a trusted data repository. But if we’re not careful, a data lake can be a data junkyard. Look at the image below about the Microsoft modern data warehouse platform:
On the left side we have the simplest form, what we call a low or no code business analytics solution. Here we use the Power BI platform where we store data using data flows and the common data model, which stores data in the Azure Data Lake Storage Gen2.
Moving to the right you see that we can use transformation and orchestration tools like Azure Data Factory and Azure Databricks. Then we can store that data in Azure SQL Database or Azure SQL Data Warehouse. Azure Machine Learning and services available in the cloud like cognitive services can then be used for compute and analytics. This is all in addition to raw data that we would ingest and stage and possibly archive.
So, why used cloud-based storage instead of traditional on prem physical storage?
- Cloud storage can be less expensive
- Much easier
- More reliable
Here are some cloud storage options:
- Azure Data Lake Storage – the least expensive option with locally redundant storage. Your data is stored in two different places, so in case of a failure, we could roll over to that redundant storage.
- There’s also the option for geo redundancy.
- Block blobs, managed disks for file – some faster, more optimized storage options.
How cost effective is cloud based storage? Here’s some quick math; if I have a terabyte of data that I need to archive and store offline, I could go to Costco or Amazon and buy an inexpensive consumer grade 1TB external hard disk for about $50. Now compare that $50 disk drive to the cost of the least expensive Azure storage with local redundancy, over two years that will cost me about half of what that external disk is going to cost.
And for redundancy for my external disk that will cost me $200-300 for a RAID device that has 1 TB of storage which is 5-6 times more that what it would cost me to store my data in Azure.
Here’s a breakdown of Azure storage history/concepts:
- Azure Data Lake Gen1 was based on the Hadoop file system. A few years ago, when Big Data was new, Microsoft chose to build their data lake technology on top of Hadoop, a distributed file system that allowed very efficient storage for text-based data.
- Azure Data Lake Gen2 was introduced which is based on blob storage, making it much more flexible and faster.
- Azure Data Lake Gen2 will eventually replace Gen1. Depending on what you’re storing, like binary data stored in binary large objects, files, texts or other data sources, there are different storage containers within Gen2 that can be used as the most optimal way to store that data; blob containers are the most flexible.
- Different scale and performance options:
- Premium – High throughput but the most expensive
- Hot Storage – for most workloads
- Cool Storage – the most economic when we’re not concerned about performance
- Archive – cheap option for backups and online; data that we only need to get to occasionally or on-demand
Data transformation options include, Azure Data Factory and Logic Apps. Azure Data Factory encompasses many different transformation tools, including SSIS. You can now lift and shift SSIS packages into the cloud incorporated with Azure Data Factory.
You also have Azure Databricks with is based on Apache Spark in big data technologies adopted by Microsoft. Recently announced and coming in a few months we will have the ability to use Power Query within Azure Data Factory.
Hopefully this post helped you better understand your cloud storage options. Whether you’re architecting a database solution, an application, data warehouse or data analytics solutions, you have a lot of great options in the Azure ecosystem to store, ingest and manage your data.
Need further help? Our expert team and solution offerings can help your business with any Azure product or service, including Managed Services offerings. Contact us at 888-8AZURE or [email protected].