As Melissa Coates showed us in her recent article, Top 3 Reasons to Upgrade Your Analytics Environment to SQL 2016, the upcoming release is huge! It includes a large number of new features, many of them enabling deep analytics and integration with Big Data solutions. One of the features it aligned with Big Data is Polybase. Polybase is included with the SQL 2016 Enterprise Edition.
Now, Polybase isn’t exactly new. It was actually developed to be used alongside SQL Server Parallel Data Warehouse (PDW) in the Analytics Platform System (APS) appliance. This is, however, the first time many enterprise customers will be introduced to Polybase, as APS is more of a niche product as opposed to the broad adoption of SQL Server.
What is Polybase?
Polybase is a feature of SQL Server that bridges the gap between SQL and Hadoop. Simply put, it allows a SQL engineer to write a standard T-SQL query that can reach into a Hadoop cluster and return data. It’s a very powerful tool for organizations that are currently building, or evaluating, a Data Lake environment. Here are 5 reasons why you should be excited about SQL Server 2016 with Polybase:
Do Big Data without Learning New Tools
If that isn’t the biggest selling point for Polybase, then I don’t know what is! Seriously, this is a very powerful feature of Polybase. It allows data analysts to use the very commonly known T-SQL, in a very commonly used development environment — SQL Server Management Studio — to query data stored in a Hadoop cluster. No Java or MapReduce required!
Flexible Storage Options
Data queried with Polybase doesn’t reside on your SQL Server. It is persisted on external storage. Polybase provides a couple of options here:
- HDFS on Hadoop – The most common distributed Hadoop file system, HDFS, is a great place to store all of your Big Data. HDFS is a distributed, resilient, redundant file system designed to store exabytes of information — a massive amount (5 exabytes could encompass all the words ever spoken on Earth in any language, according to the What’s A Byte website.) Polybase can reach directly into HDFS and return data from Hadoop alongside your SQL Server data.
- Windows Azure Blob storage – If you don’t have a Hadoop cluster, you can still take advantage of Polybase’s ability to query external data. Simply place your data in Windows Azure Blob storage and provide Polybase with the location information.
Scalable Performance Management
Polybase does a pretty good job overall of managing the performance of remotely executed queries. It automatically shifts between modes of transporting data to SQL Server for native processing, and remote execution on the Hadoop cluster (when run in Hadoop connectivity mode).
However, some situations arise where the developer would like more control over the performance management. Polybase allows this with full predicate push down. With this mode enabled, Polybase will generate a native MapReduce application that will be executed through YARN on the Hadoop cluster. This mode will allow long-running jobs to take full advantage of parallel processing, with minimal data movement across the wire to SQL Server.
Additionally, when extra horsepower is needed, Azure SQL Data Warehouse can easily be scaled to 2x, 4x, 8x, or even more processing power! The environment stays up the entire time it is being scaled, so no downtime is requried. Scaling works the other way also — when you don’t need that much oomph, scale it back to save money.
Aligned with Enterprise Security
Security is always an important concern with any major data management project. Polybase supports with Kerberos negotiation through the Hadoop cluster to ensure that queries will only touch data the logged in user is allowed to see.
Standard SQL Security is included as well, which means you’ll have object-level access to secure your data. Transparent Data Encryption (TDE) is also supported to ensure that in the rare case someone who shouldn’t places their hands on your data, they won’t be able to make sense of it.
Platform Support
Polybase includes support for both on-premises and cloud-based unstructured data storage platforms. Currently, Polybase supports the following Hadoop clusters:
- Cloudera CDH 4.3
- Cloudera CDH 5.1
- Hortonworks HDP 1.3 (Windows/Linux)
- Hortonworks HDP 2.1 (Linux)
- Hortonworks HDP 2.1 (Windows/Linux)
- Hortonworks HDP 2.2 (Windows/Linux)
In addition to supporting Hadoop clusters, Polybase also supports the following cloud-based solutions:
- Windows Azure HDInsight (Use the appropriate HDP version)
- Windows Azure Blob storage
- Windows Azure Data Lake (Future support)
When Polybase is connected to Windows Azure Blob storage, it cannot perform predicate push down, as there is no underlying processing framework attached.
Polybase is only one of the many new features in SQL Server 2016.