Data Warehouse Appliances

We all have appliances in our homes. In most cases, it just makes more sense from a cost and time perspective than building your own solutions for washing clothes, cooking and cleaning. In the data center, administrators often take a different approach. They choose to build out their own servers either because it gives them more control or because there isn’t a ready-made solution for their workload.

Appliances

When you think of an appliance, you think of a machine that is purpose built for a single or a set of tasks. It comes with everything required to do the job and it does that job very well. Installation and operation is usually straightforward and maintenance is minimal. With a washing machine for example, you unbox it, hook up water and power and you’re ready to wash clothes.

Appliances in the Data Center

Appliances in IT aren’t a new concept although you may not have always thought of them that way. Hubs, switches and routers are probably some of the first true appliances in the data center. You plug them into your network, turn them on and after some configuration, your computers and subnets are talking to one another. You don’t need to know all that much about the seven layers of the OSI model to make them work.

Next came storage appliances with Storage Area Network (SAN) and Network Attached Storage (NAS) appliances. These were far more complex on the inside, essentially servers in their own right that made the task of managing large volumes of disk space and making it available to the rest of the infrastructure easy. Then we started to see appliance architectures take on roles usually reserved for a combination of one or more servers, an operating system and a separately purchased software product.

Appliances for Data Warehouse

Data warehouse workloads are different from those performed by traditional database servers. Online Transaction Processing (OLTP) systems are the databases we all know and love. They are the heart of the systems that run our businesses like ERP, Accounting and Finance, POS and CRM to name a few. They were designed to be transactional in nature. They take in many thousands or even millions of single transactions throughout the day and they do this very fast. They are not however, great for reporting. Running reports against a traditional OLTP system can be painful because what we want from reports is generally the summary of those thousands or millions of transactions. That’s why the concept of the Online Analytic Processing (OLAP) database was developed and ultimately the concept of the Data Warehouse. The OLAP data warehouse is designed specifically answer the typically long-running, scan centric queries needed to produce enterprise reports.

For many years, hardware and software for data warehouse applications looked just like the environments deployed for OLTP databases. The data center team stood up a new server, installed an OS and relational database software and you were off and running. If you were lucky, you had several physical disks for data and logs on either the server or SAN to help spread out the query load across multiple physical spindles but no tuning was really done to enhance the performance of the scan-centric queries being done in the data warehouse (other than the few gurus out there using their arcane knowledge to make things faster).

In 2009, Microsoft along with Dell and other OEM’s, released the Fast Track Reference Architecture, a guide to building an environment that was designed expressly for data warehouse workloads. It was a recipe book that defined the hardware and software components and how to assemble them and tune them and if you followed it, you would get a predictable and repeatable result that far outperformed the traditional approach. The Fast Track architecture takes a component balanced approach. The CPU’s, RAM and disk subsystem are all tuned to work in concert to produce the best possible results using reasonably priced, off the shelf components.

The Fast Track architecture was a great step forward. It gave data warehouse architects something concrete to ask the data center manager to purchase and build and it helped with the ROI discussion with the CIO to justify the expenditure. However, it is still a toolbox, not an appliance. It takes a lot of time and resources to build and it is somewhat risky if you’ve never done it before. To this end hardware OEM’s have answered the need by producing pre-configured data warehouse appliances built on the Fast Track reference architecture. These appliances implement all of the best practices outlined in the Fast Track guide with pre-configured hardware and software. They mitigate some of the risks and cost because they require fewer resources to stand up and far less time. In fact, they can usually be deployed and configured in less than a couple hours.

So what’s in the box?

I can’t speak to all appliances but I do have hands-on experience with the Quickstart Data Warehouse Appliance from Dell. It is built on the latest generation of Dell rack servers and comes pre-loaded with Microsoft SQL Server 2012 Appliance Edition (don’t go looking on the web for the Appliance Edition…it was built specifically for hardware OEM’s). In our lab at BlueGranite we have the Model 1000 that comes with a single four-core processor, 96 GB of RAM and will support a warehouse up to 5 TB compressed disk storage. There is also a Model 2000 that is 8 cores, 128 GB and 12 TB. It came ready to rack with tool-less rails. We had it unboxed and in our racks in just a few minutes.

Did you say “compressed disk storage”?

Yes. The Fast Track reference guide specifies the use of Page Compression for data warehouse workloads. The reason is that if data is taken from the disk compressed and passed to the CPU, then the CPU can get more data across it more quickly. The CPU is faster at handling data than the disk system so this results in faster performance even with the overhead of uncompressing the data. The Model 1000 actually has 1.7 TB raw disk for the warehouse while the Model 2000 has 5. You can estimate about a 3.5:1 compression ratio so Dell was conservative with their 5 and 12 TB numbers.

What makes it so much better that the old approach?

As I mentioned earlier, the Fast Track guide is just that, a guide. You still have to implement your databases in the right way to get the best performance out of the hardware. This is not magic, the appliance makes building out the infrastructure easy and predictable. You still need someone who knows how to implement the database concepts on the hardware for best performance,

RAID

At the heart of the appliance is the way the disk subsystem is laid out. It makes heavy use of RAID (Redundant Array of Inexpensive Disks) technology to enhance performance. In our appliance, 8 LUNS (Logical Unit Numbers) or virtual disks are allocated to the data warehouse data and another is allocated for database logs. The number eight was chose purposefully to match (or be a multiple of) the number of cores in the server. This allows for the maximum use of parallelism during query processing by allowing each core of the CPU to read data from a separate LUN. In addition, RAID 1 (mirroring) pairs are used so that data is stored redundantly. This also improves performance as data can be read from both disks in a mirror pair at the same time. Another advantage of RAID is the ability to include “hot spare” disks. These are extra disks included in the chassis that will be swapped in to replace a failing disk. This allows the system to retain redundancy with no downtime.

Partitioning

Physical disk is only part of the picture. You must also build your database appropriately to work with the hardware. This is where Table Partitioning comes into play. Partitioning defines the way the data in a database table is aligned over the RAID arrays. You choose a key in the data (typically a date) to partition across. For example, you might store all of 2012 on one partition, 2011 on another, and so forth. By storing the data in this way, queries that summarize the data in these logical groups will run faster because they will have to scan less data from fewer disks to reach an answer. Less physical IO means faster queries.

Star Schemas

The star schema concept has been around since the early days of data warehousing and is well established as the best practice for building databases for reporting. It defines a method of architecting data to make it easier to understand by query and report developers and faster for the summary queries typical of data warehouse scenarios. The architecture of the data warehouse appliance is built to work best with data arranged in star schemas. The star schema generally consists of a very large table of detailed information called the Fact table plus smaller Dimension tables. The Fact table contains only of key values and “measures” (the numbers to be summarized). All descriptive values are contained in the Dimension tables. The benefit of this structure is that the queries are simple and the query plans are efficient. Generally the fact data is scanned and joined to the much smaller dimension tables. The combination of RAID disks and data partitioning allows these types of queries to plow through massive amounts of data very quickly.

Indexing

Data warehouses are usually very “index light”. Indexes, especially on fact tables, don’t make a lot of sense because fact tables are usually scanned in queries, not read directly. Indexes only make sense when you intend to retrieve only on or a few rows, not 10 million. Therefore, we would typically only index the dimension tables and sometimes not even those because they are usually so small that they can be read into memory. This has changed with SQL Server 2012 Columnstore Indexes. The Columnstore index is a new feature in SQL Server 2012. It uses a technology known as Vertipaq to compress and store data in a columnar fashion rather than row-by-row as with traditional indexes. A full explanation of the technologies behind Columnstore indexes is too large for this space but suffice it to say, they provide very fast performance for scan-based data warehouse queries. There is a cost in disk space to be paid but generally it is worth it.

Data Loading

The best structured data warehouse database can still perform poorly if the data is not laid out on the disk in the appropriate way. The disks are still spinning media and are subject to the same performance hits of seek time and rotational delay that they have always been subject to. If data is fragmented (scattered randomly) over the disks, the disk heads will need to move farther and more frequently and will have to wait for the disk platters to spin around to the correct location more often. These nanoseconds add up over millions of repetitions and will result in slower performance. To counteract this problem, proper data loading techniques must be employed to reduce fragmentation. Primarily, this consists of bulk loading data in partition key order to ensure that is packed as tightly on the disk as possible.

Wow, that sounds hard…

Let’s face it, if it was easy, we wouldn’t be in business. However, appliances makes it easier. The Dell appliance that we have installed includes a “Schema Wizard”. This is a tool that is used to calculate the best way to spread the database partitions over the hardware and then also create the DDL (Data Definition Language) to allocate the space and create the database so that it is ready to be loaded. This tool is really nothing more than a spreadsheet with macros to do the calculation and write the script. If you follow the best practices of pre-allocating all of the appliance disk space to a single database, you simply copy the code out of the spreadsheet and execute it on the appliance and you’re ready to load data. If you want to change the way disk is allocated, you can but you will be doing the calculations on your own.

Where does it fit in my organization?

Data warehouse appliances will generally sit at the center of a “hub and spoke” architecture. The data warehouse aggregates and makes available all of the organization’s data in a central place. It then serves queries to report developers and analysts either directly through SQL queries or through tools like OLAP cubes, reporting engines, dashboards or analytic cubes. In very large organizations, the data warehouse appliance may become a spoke in a larger solution that includes a Parallel Data Warehouse Appliance.

In many organizations database servers are filling multiple roles. They server databases for applications as well as reporting. They may also be running the OLAP cube engine and even be the report server and integration server. This is not best practice, it’s just happens when we are being cost conscious with regard to both licensing and hardware. When an organization is deploying a data warehouse for the first time, they may not want to risk buying hardware and software for it until it is proven that it will be successful.

A data warehouse appliance cannot fill multiple roles. It cannot have additional components loaded on it and the nature of the hardware makes running OLTP workloads on it less optimal. Technically, there is nothing stopping you from doing these tasks on the appliance, they are just not supported.

What are the drawbacks?

Appliances aren’t necessarily right for everyone. If you have already made a heavy investment in virtualization, then you will need to decide if it would make sense to add an appliance to your environment. That said, an appliance will perform better in general to a comparable VM so you’ll need to weigh performance against cost to make that decision. You’ll also need to look at your data needs and estimate your growth over about three years (the average lifespan of server hardware). If you’re going to grow beyond the scale of the available appliance capacity or not grow enough to justify the cost of an appliance, it may not be for you.

So should I buy one?

To decide if a data warehouse appliance would be right for you need to ask yourself a few questions:

Do you have a poorly performing data warehouse already?
Are you outgrowing your existing warehouse? If yes, do your data needs (including three years of growth) fit into what is available from appliance vendors?
Are you looking to stand up a data warehouse for the first time and need guidance?
Would a turnkey solution with predictable performance appeal to management?

If you answered “yes” to any of these then an appliance might be something to consider.

Talk with people that use them and who have decided against them to get their feedback. Consultants are also a great resource (OK, I’m biased) because we’ve “been there and done that” for many different customers on many different platforms. We can help you think through the decision points and give you feedback based on experience.

Data Warehouse Appliances…They’re not for Cooking Breakfast!