
Author: Rick F. van der Lans
Date: June 2018
Data lakes are set up as large physical storage repositories. Data from all kinds of sources is copied into this repository, which is commonly developed with Hadoop. But is this centralized storage approach really feasible and practical? Additionally, the users of data lakes are data scientists and other investigative users. In other words, data scientists form the only intended user group of the data lake, making data lakes single-purpose systems. Does that make sense? Isn’t the data contained in a data lake too valuable to restrict its use exclusively to data scientists, who form a relatively small group of users? The investment in a data lake would be more worthwhile if the target audience could be enlarged.
This fifth article in a series on use cases of data virtualization [link to fourth article] addresses both questions. Can we use data virtualization to turn physical data lakes into more practical logical data lakes, and single-purpose data lakes into multi-purpose data lakes?
The Complications of the Original Data Lake
The characteristics of the data lake are described in several definitions. One of the first popular definitions came from James Serra [ What is a Data Lake ]: “A data lake is a storage repository […] that holds a vast amount of raw data in its native format until it is needed.” Figure 1 contains a high-level architecture representing a data lake. In this figure the acronym EtL stands for Extract-transform-Load and ET for Extract-Transform. The letter “t” in EtL is deliberately written small, because data scientists prefer to see the data in its original form, so little data transformation is applied to the data.
Figure 1: High-level architecture of the original, physical data lake.
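To make the small “t” concrete, the following Python sketch shows what an EtL ingest step could look like: the source file is copied into the lake unchanged, and the only “transformation” is the addition of technical metadata such as the source name and load timestamp. The file paths and landing-zone layout are illustrative assumptions, not part of any specific product.

```python
# A minimal, hypothetical sketch of an "EtL" ingest step: the data lands in the
# lake in its raw form; only technical metadata (source name, load timestamp)
# is added via the directory layout. Paths are illustrative assumptions.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def ingest_raw(source_file: str, landing_zone: str, source_name: str) -> Path:
    """Copy a source file into the landing zone without changing its content."""
    load_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_dir = Path(landing_zone) / source_name / load_ts
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)  # the "E" and the "L"; the "t" stays small
    return target

# Example (hypothetical paths):
# ingest_raw("/exports/orders.csv", "/datalake/landing", "erp")
```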
Having all the required data stored together in one location makes it easy for data scientists to use. Unfortunately, practical complications exist that make the development of such a centralized data store containing copied data hard, impossible, or not permitted:
- Complex “T”: All ETL programmers agree that they spend most of their time developing the “T” and much less on the “E” or “L”. Because the data stored in the data lake is still in its raw form, data scientists still have to spend time developing the “T” themselves.
- Big data too big to move: In some environments the sheer amount of data coming from the data sources is simply too much to transmit and physically copy. Bandwidth and data ingestion limitations can make it impossible to copy a big data source to the data lake.
- Uncooperative departments: Not all business departments and divisions may be eager to share their data with a centralized environment. This can lead to them withholding their data.
- Restrictive data privacy and protection regulations: More and more laws, rules, and regulations prohibit the storage of specific types of data together. Sometimes data is not allowed to leave a country, or data privacy and protection regulations may forbid certain types of data from being stored centrally in a data lake.
- Data stored in highly secure systems: Some source systems are highly secured to protect against incorrect and fraudulent data usage. Their owners may not permit the data to be copied outside the original security realm into the less secure data lake.
- Missing metadata: Not all source data comes with descriptive metadata, making it hard for data scientists to understand what specific data elements mean. A misinterpretation may lead to incorrect business insights.
- Refreshing of the data lake: Sophisticated refresh procedures exist to keep data in data warehouses as up to date as the users require. Do data scientists need their data to be refreshed as well? Some data probably does not need to be refreshed, and for some data science exercises it’s not essential, but in some cases periodic refreshing must be organized.
- Management of the data lake: The data lake is a data delivery system and as such it must be managed. When data scientists need access to the data, it must be available.
The Logical Data Lake to the Rescue
These are all realistic complications. An alternative data lake architecture that overcomes these complications is called the logical data lake. It’s based on data virtualization technology. Figure 2 contains a high-level overview of this architecture.
Figure 2: High-level overview of the logical data lake architecture.
In a logical data lake, data is presented to data scientists as if it’s all still stored centrally in one data storage repository. In reality, nothing could be further from the truth. Some of the data is copied and stored centrally (the two data sources on the left), some data is accessed remotely (the two data sources in the middle), and some is cached locally (the two on the right). In the latter case, the sources themselves are not accessed by the data scientists, but only by the refresh mechanism of the data virtualization server’s cache engine; data scientists access the cached virtual tables. Depending on what’s required, possible, and feasible, one of these three approaches can be used to make data accessible to data scientists. Whereas copying data and storing it centrally is the only option in the original data lake architecture, it’s just one of the options in the logical data lake. Note that data scientists don’t see the difference between these three options.
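The following Python sketch illustrates, in a highly simplified and hypothetical way, how a data virtualization server could resolve a query on a virtual table using one of the three strategies while keeping that choice invisible to the data scientist. The class and method names are illustrative assumptions, not the API of any particular data virtualization product.

```python
# A simplified, hypothetical sketch: each virtual table is registered with one
# of three strategies (copied, remote, cached). Users only query the virtual
# table by name; the server decides how the rows are actually obtained.
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]

@dataclass
class VirtualTable:
    name: str
    strategy: str                    # "copied", "remote", or "cached"
    fetch: Callable[[], List[Row]]   # how the engine obtains the rows

class VirtualizationServer:
    def __init__(self) -> None:
        self._tables: Dict[str, VirtualTable] = {}
        self._cache: Dict[str, List[Row]] = {}

    def register(self, table: VirtualTable) -> None:
        self._tables[table.name] = table

    def query(self, table_name: str) -> List[Row]:
        """Data scientists call this; they never see which strategy is used."""
        table = self._tables[table_name]
        if table.strategy == "cached":
            if table_name not in self._cache:   # normally refreshed by the cache engine
                self._cache[table_name] = table.fetch()
            return self._cache[table_name]
        return table.fetch()                    # local copy or remote access

# Example (hypothetical source): server = VirtualizationServer()
# server.register(VirtualTable("customers", "remote", lambda: [{"customer_id": 1}]))
# rows = server.query("customers")
```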
The Logical, Multi-Purpose Data Lake
That brings us to the second issue raised at the start of this article. As indicated, when the data lake was introduced it was meant to be a single-purpose system: for data scientists only. Developing, maintaining, and managing a data lake is not free. Therefore, if exploitation is limited to one or two use cases, the data lake may be unnecessarily expensive, or the investment is not exploited fully. Practice has shown that exploitation of a data lake can be extended to other forms of data usage, such as operational reporting, self-service BI, and embedded BI.
A logical data lake, developed with a data virtualization server as the driving technology, can easily be adapted to support a wide range of business users, from traditional self-service BI users (working in, for example, finance, marketing, human resources, or transport) to sophisticated data scientists; see Figure 3.
Figure 3: The logical, multi-purpose data lake supporting a wide range of business users.
In this solution, different users access different layers or zones. Accessing virtual tables in the landing zone is comparable to accessing the physical files directly; data scientists typically use the virtual tables in this zone. Self-service BI users typically access the middle zone, in which data has been integrated and lightly processed. Users deploying standard reports access the virtual tables in the top zone, in which the data has been fully processed. This is the multi-purpose data lake: one environment that supports many business users.
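As a rough illustration of this layering, the Python sketch below models the three zones as views stacked on top of each other: each zone is a transformation of the one below it, so the underlying data exists once while each user group queries the zone that fits its needs. The zone names, columns, and transformations are assumptions made purely for illustration.

```python
# An illustrative sketch of the three zones as layered views over the same data.
# Data scientists query the landing zone, self-service BI users the integrated
# zone, and standard reports the fully processed top zone.
from typing import Dict, List

Row = Dict[str, object]

def landing_zone(raw_rows: List[Row]) -> List[Row]:
    """Raw data exposed as-is; typically used by data scientists."""
    return raw_rows

def integrated_zone(rows: List[Row]) -> List[Row]:
    """Lightly processed data for self-service BI: basic cleansing and standardization."""
    return [
        {**r, "country": str(r.get("country", "")).strip().upper()}
        for r in rows
        if r.get("customer_id") is not None
    ]

def reporting_zone(rows: List[Row]) -> List[Row]:
    """Fully processed data for standard reports: revenue aggregated per country."""
    totals: Dict[str, float] = {}
    for r in rows:
        totals[str(r["country"])] = totals.get(str(r["country"]), 0.0) + float(r.get("revenue", 0))
    return [{"country": c, "total_revenue": t} for c, t in sorted(totals.items())]

# Example: reporting_zone(integrated_zone(landing_zone(raw_rows)))
```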
Summary
There is no question about the value of a data lake to data scientists, but the solution of one physical, single-purpose data lake may not be feasible or practical. The data contained in the data lake is too valuable to restrict its use exclusively to data scientists, who form a relatively small group of users. The investment in a data lake would be more worthwhile if the target audience could be enlarged without hindering the original users. The logical, multi-purpose data lake is more flexible, does not have the problems of centralized data storage, and supports a wide range of business users.
Material for this blog comes from the whitepaper “Architecting the Multi-Purpose Data Lake With Data Virtualization,” which contains a more detailed description of the logical data lake: Read here. We strongly recommend reading this paper.
In the sixth article of this series [link to next article], we focus on cloud integration. Data virtualization can simplify migration to the cloud and can make the cloud itself transparent to most applications and reports.