
Author: Rick F. van der Lans
Date: June 2018
Not every organization is happy with the reporting and analytical performance of their data warehouse environment. One customer indicated that some of their online reports take at least ten minutes to complete. Ten minutes is a long time if you have to wait for a report to show up on your screen. In some organizations it’s not the query performance that causes performance problems but loading new data into the data warehouse or data marts. Another customer mentioned that refreshing their data warehouse takes a full weekend plus the Monday morning.
But what can we do about this tenacious performance problem? That is the topic of this fourth article in a series on use cases of data virtualization [link to third article].
The Performance Struggle
Getting the right performance is a real struggle for many organizations. Their database administrators continuously strive to improve the query and load performance. Sometimes it’s hard to find a way to accelerate a query or a load process. And sometimes, if they find a way to improve something, the law of conservation of misery comes into play: if you speed up something, you’re very likely slowing down something else. It’s almost never a win-win situation.
The Need for a Faster Database Platform
So, why don’t organizations switch to another, faster database platform? Why not migrate to one of the many, really fast database platforms that have been introduced the last couple of years such as, SnowflakeDB in the cloud, MapD on a GPU-based platform, or Impala on a Hadoop cluster? All of them have been optimized and tuned specifically to support data warehouse workloads. Regrettably, life is not that easy. There are several aspects that complicates migration to another database server:
- Different SQL dialects: The SQL dialects implemented by the vendors in their SQL database servers are not the same. Some support the SQL window functions, some don’t; some support advanced analytical functions, others don’t; and some support recursive queries, and some don’t. Therefore, queries cannot always be migrated without rewriting them.
- Different internal architectures: Although SQL database servers look very similar on the outside, they can be very different on the inside. For example, the internal architecture of a GPU-based SQL database server is very different from a classic SQL product, which in turn is very different from a SQL-on-Hadoop engine. So, queries have to be reformulated to fully exploit these products.
- Different data structures: Some database servers are good at running queries on normalized table structures, while others prefer star schemas or denormalized data structures. This implies that even if we can run our queries unchanged on the new platform, we may have to rewrite them, because the data structures have been changed to make optimal use of this platform.
Data Virtualization and Database Migration
Data virtualization can come to the rescue when migrating to another database. Placing a data virtualization server between the reports and the database server eases switching to another database server. If data is migrated to another database server, the data virtualization server hides that this new product supports a (slightly) different SQL dialect. The reports work with the SQL dialect of the data virtualization server, and the latter tries to push down most of that SQL query to the underlying database server. What is pushed down differs for each database server.
Data virtualization servers also allow for a gradual and stepwise migration. There is no need to migrate all the tables in one go. Instead, tables can be migrated one by one or group by group. Data virtualization servers will use their data federation capabilities to hide that the data is distributed among the old and the new database.
As indicated in the first article of this series, data virtualization servers support caching; see Figure 1. With caching the virtual content of a virtual table is determined and stored in some database. This feature can be used to accelerate query processing. For example, if access to a certain set of physical tables in a data mart is constantly slow, the virtual tables pointing to those physical tables, can be cached to another, faster database platform. When the virtual tables are cached, query performance is determined by this new platform. Reports won’t have to be modified when caching is deployed.
Figure 1:
The Long-Term Perspective of Database Migration
If we look at data migration from a more long-term perspective, another advantage becomes clear. When a data virtualization server is used to decouple the data consumers from the data stores, it’s much easier to benefit from all the new database platforms that have been introduced lately and are probably going to be introduced the coming years. With a data virtualization layer, organizations are not stuck with the database technology they selected a long time ago. It’s worth noting that the last years we have seen an avalanche of new database technologies. Just think about all the Hadoop-related technologies, the GPU-based database products, the NoSQL products, the in-memory products, and the translytical database servers.
Summary
Data virtualization servers can be used to develop entire data warehouse projects, such as the logical data warehouse architecture (see Article 3), but also for more down-to-earth use cases, such as database migration. Data warehouses are becoming bigger and bigger and the reporting and analytical workloads are continuously expanding. Eventually for many organizations the performance of their existing database server will not be sufficient anymore. Data virtualization servers may be the solution to migrate smoothly, seamlessly, and through a risk-averse stepwise approach.
In the fifth article of this series [link to next article], we focus on a popular topic: data lake. Data virtualization supports the development of more practical data lakes architectures.