Yesterday at the Microsoft Build conference, Microsoft announced Azure Data Lake. Azure Data Lake brings a true Hadoop Distributed File System (HDFS) to Azure. It’s not available yet, but you can sign up for the preview using the link above.
Microsoft has had HDInsight in their Azure cloud environment for a while. HDInsight opens up Hadoop big data capability for any customer with minimal time to insight. You can stand up a Hadoop cluster in minutes and leave the hardware management and administration to Microsoft.
However, one drawback of cloud-based managed Hadoop systems like HDInsight and Amazon’s Elastic MapReduce has been that, to enable file access, they had to abstract away from traditional HDFS. Doing this allowed easy management of files both in and out of the Hadoop ecosystem, but at the price of scalability, redundancy and performance. The key differentiator of HDFS for data analysis solutions was that it scaled very easily to very large data volumes and file sizes and supported built-in redundancy.
If you’ve used HDInsight you might be asking yourself, how is this different?
Here are six key factors you’ll want to know more about:
1. True HDFS Compatibility
Azure Data Lake is a true Hadoop file system, compatible with the major Hadoop distributions such as Hortonworks Data Platform and Cloudera Enterprise Data Hub, and with projects such as Spark, Storm, Flume, Sqoop and Kafka.
2. Unlimited Data Size
Other cloud storage options have file and account size limitations of only a few terabytes. This may sound like a lot but in big data environments this is quite small. Azure Data Lake removes this constraint: There is no fixed size limit on files or accounts. You can store single or multiple files at terabyte and petabyte scale and process them with Hadoop. This is an ideal situation because Hadoop is at its best when working with very large files.
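To make the "very large files" point concrete, here is a minimal sketch of how HDFS-style storage splits a file into fixed-size blocks that can then be processed in parallel. It assumes the 128 MiB default block size of Hadoop 2.x; the function name `block_count` is hypothetical, and nothing here describes how Azure Data Lake lays out data internally.

```python
# Assumption: 128 MiB blocks, the default HDFS block size in Hadoop 2.x.
# Azure Data Lake's internal layout is not described in this post.
BLOCK_SIZE_BYTES = 128 * 1024**2


def block_count(file_size_bytes: int, block_size_bytes: int = BLOCK_SIZE_BYTES) -> int:
    """Return how many blocks a file of the given size is split into."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    # Integer ceiling division avoids float rounding at petabyte scale.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    one_tib = 1024**4
    one_pib = 1024**5
    print("1 TiB file:", block_count(one_tib), "blocks")
    print("1 PiB file:", block_count(one_pib), "blocks")
    print("1 KiB file:", block_count(1024), "block")
```

A single terabyte-scale file breaks into thousands of blocks, each a natural unit of parallel work, whereas a tiny file still occupies a whole block entry. That is one reason Hadoop favors a few very large files over many small ones.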
3. Fault-Tolerant and Available
In a typical Hadoop implementation, data is replicated three times in the cluster. Azure Data Lake adopts this practice in order to ensure the data is available in the event of hardware failure or interruption.
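The effect of three-way replication can be sketched in a few lines. This is a simplified illustration, not Azure Data Lake's or HDFS's actual placement logic: real HDFS uses a rack-aware policy, while this sketch simply places replicas round-robin on distinct nodes. The function and node names are hypothetical.

```python
# Simplified sketch: round-robin placement on distinct nodes.
# Assumption: replication factor 3, the HDFS default. Real HDFS placement
# is rack-aware; Azure Data Lake's internals are not described in this post.

def raw_storage_bytes(logical_bytes: int, replication_factor: int = 3) -> int:
    """Raw capacity consumed when every block is stored replication_factor times."""
    return logical_bytes * replication_factor


def place_replicas(block_id: int, nodes: list, replication_factor: int = 3) -> list:
    """Choose replication_factor distinct nodes for one block."""
    if replication_factor > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def surviving_replicas(placement: list, failed_node: str) -> list:
    """Replicas still readable after a single node failure."""
    return [node for node in placement if node != failed_node]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]
    placement = place_replicas(0, nodes)
    print("replicas on:", placement)
    print("after node1 fails:", surviving_replicas(placement, "node1"))
    print("raw bytes for 10 logical bytes:", raw_storage_bytes(10))
```

Because each block lives on three distinct nodes, losing any one node still leaves two readable copies, at the cost of three times the raw storage.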
4. Designed for Parallel Processing
One of the key benefits of the Hadoop file system is that it keeps the data close to the compute. In an on-premises implementation, this is done by having dedicated disks in each physical node and moving data between nodes as little as possible. Azure Data Lake is designed to support throughput for parallel processing scenarios in order to provide performance similar to that of an on-premises Hadoop implementation.
5. Optimized for High-Speed Throughput
One of the hottest topics in data management is the Internet of Things (IoT). IoT boils down to data streaming off devices such as phones, sensors and equipment. This type of data consists of very small transactions at very high volume. Ingesting this data into a data management system has traditionally been very difficult, both for application development and for hardware and software support. Azure Data Lake supports extremely high-speed data ingestion at huge scale.
6. Enabling Hadoop for the Cloud
I’m excited to see what Azure Data Lake does for Hadoop in the cloud. Many enterprises implementing Hadoop choose an on-premises implementation because of the perceived limitations of cloud infrastructure. Azure Data Lake should bring Hadoop in the cloud closer to parity with traditional installations.
If you’re interested in discussing a big data solution, either on-premises or in the cloud, we can help! Get in touch to learn how we can help you get the most from your big data solutions.