Tuesday, 20 November 2018 / Published in Data Virtualization

Data Virtualization and the Logical Data Lake

Author: Rick F. van der Lans
Date: June 2018

Data lakes are set up as large physical storage repositories. Data from all kinds of sources is copied into this repository, which is commonly built on Hadoop. But is this centralized storage approach really feasible and practical? Additionally, the users of data lakes are data scientists and other investigative users. In other words, data scientists form the only planned user group of the data lake, making data lakes single-purpose systems. Does that make sense? Isn't the data contained in a data lake too valuable to restrict its use exclusively to data scientists, who form a relatively small group of users? The investment in a data lake would be more worthwhile if the target audience could be enlarged.

This fifth article in a series on use cases of data virtualization [link to fourth article] addresses both questions: can we use data virtualization to turn physical data lakes into more practical logical data lakes, and to turn single-purpose data lakes into multi-purpose data lakes?

The Complications of the Original Data Lake

The characteristics of the data lake are described in several definitions. One of the first popular definitions came from James Serra [What is a Data Lake]: “A data lake is a storage repository […] that holds a vast amount of raw data in its native format until it is needed.” Figure 1 contains a high-level architecture representing a data lake. In this figure, the acronym EtL stands for Extract-transform-Load and ET for Extract-Transform. The letter “t” in EtL is deliberately shown very small, because data scientists prefer to see data in its original form, so little transformation is applied to it.

Figure 1:
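The “EtL” idea can be made concrete with a small sketch. The following Python snippet is purely illustrative (the source data, the `_source` provenance tag, and the in-memory “lake” are all stand-ins, not part of any real platform); it shows extract and load doing most of the work while the transform stays deliberately minimal, so the data lands close to its raw form:

```python
# Hypothetical sketch of the "EtL" pattern: extract and load dominate,
# and the small "t" makes only light, non-destructive changes.

def extract(source_rows):
    """Pull raw records from a source system (a list stands in for a real source)."""
    return list(source_rows)

def light_transform(rows):
    """The small 't': tag each record with its origin -- no cleansing, no integration."""
    return [{**row, "_source": "crm"} for row in rows]

def load(rows, lake):
    """Append the near-raw records to the data lake (a list stands in for HDFS)."""
    lake.extend(rows)
    return lake

lake = []
raw = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
load(light_transform(extract(raw)), lake)
print(lake[0])  # the record keeps all its raw fields, plus a provenance tag
```

Because the transform is so light, the “T” work the data scientists actually need is still ahead of them, which is the first complication discussed below.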

Having all the required data stored together in one location makes it easy for data scientists to use. Unfortunately, practical complications can make developing such a centralized store of copied data hard, impossible, or simply not permitted:

  • Complex “T”: All ETL programmers agree that they spend most of their time on developing the “T” and not so much on the “E” or “L”. Because the data stored in the data lake is still in its raw form, data scientists still have to spend time on developing the “T”.
  • Big data too big to move: In some environments the sheer amount of data coming from the data sources can be too much to send and too much to physically copy. Bandwidth and data ingestion limitations can make it impossible to copy a big data source to the data lake.
  • Uncooperative departments: Not all business departments and divisions are eager to share their data with a centralized environment, which can lead to them withholding their data.
  • Restrictive data privacy and protection regulations: More and more laws, rules, and regulations prohibit storing specific types of data together. Sometimes data is not allowed to leave a country, or data privacy and protection regulations may forbid certain types of data from being stored centrally in a data lake.
  • Data stored in highly secured systems: Some source systems are highly secured to protect against incorrect and fraudulent data usage. Their owners may not permit the data to be copied outside the original security realm into the less secure data lake.
  • Missing metadata: Not all source data comes with descriptive metadata, making it hard for data scientists to understand what specific data elements mean. A misinterpretation may lead to incorrect business insights.
  • Refreshing of data lake: Sophisticated refresh procedures exist to keep data in data warehouses as up to date as the users require. Do data scientists need their data to be refreshed as well? Some data probably does not need to be refreshed and for some data science exercises it’s not essential, but in some cases periodic refreshing must be organized.
  • Management of data lake: The data lake is a data delivery system and as such it must be managed. When data scientists need access to the data, it must be available.

 

The Logical Data Lake to the Rescue

These are all realistic complications. An alternative data lake architecture that overcomes these complications is called the logical data lake. It’s based on data virtualization technology. Figure 2 contains a high-level overview of this architecture.

Figure 2:

In a logical data lake, data is presented to data scientists as if it’s all still stored centrally in one data storage repository. Nothing could be further from the truth. Some of the data is copied and stored centrally (the two data sources on the left), some is accessed remotely (the two in the middle), and some is cached locally (the two on the right). In the latter case, the sources themselves are not accessed by the data scientists, but only by the refresh mechanism of the data virtualization server’s cache engine; data scientists access the cached virtual tables. Depending on what’s required, possible, and feasible, any of these three approaches can be used to make data accessible to data scientists. Where copying data and storing it centrally is the only option in the original data lake architecture, it’s just one of three options in the logical data lake. Note that data scientists don’t see the difference between the three options.
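The key property is that the access strategy is invisible to the consumer. A minimal sketch (not any vendor’s API; the class, strategy names, and sample data are all hypothetical) of how a virtualization layer can hide whether a source is copied, remote, or cached:

```python
# Illustrative sketch: one virtual-table interface, three access strategies.

class VirtualTable:
    """A data scientist queries this one interface; how the data is actually
    reached ('copied', 'remote', or 'cached') is an internal detail."""

    def __init__(self, name, strategy, fetch):
        self.name = name
        self.strategy = strategy   # how the data is really accessed
        self._fetch = fetch        # callable that reaches the real source
        self._cache = None

    def rows(self):
        if self.strategy == "cached":
            if self._cache is None:          # the refresh mechanism fills the
                self._cache = self._fetch()  # cache, not the data scientist
            return self._cache
        return self._fetch()  # 'copied' and 'remote' both simply fetch

sales = VirtualTable("sales", "cached", lambda: [("2018-06", 120)])
print(sales.rows())  # the caller cannot tell which strategy was used
```

Switching a table from remote access to caching is then purely a configuration change in the virtualization layer; nothing changes for the users.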

The Logical, Multi-Purpose Data Lake

That brings us to the second issue raised at the start of this article. As indicated, when the data lake was introduced it was meant to be a single-purpose system: for data scientists only. Developing, maintaining, and managing a data lake is not free. Therefore, if exploitation is limited to one or two use cases, the data lake may be unnecessarily expensive, or the investment is not fully exploited. Practice has shown that exploitation of a data lake can be extended to other forms of data usage, such as operational reporting, self-service BI, and embedded BI.

A logical data lake developed with a data virtualization server as the driving technology can easily be adapted to support a wide range of business users, from traditional self-service BI users (working in, for example, finance, marketing, human resources, or transport) to sophisticated data scientists; see Figure 3.

Figure 3:

In this solution, different users access different layers, or zones. Accessing virtual tables in the landing zone is comparable to accessing the physical files directly; data scientists will probably use the virtual tables in this zone. Self-service BI users will probably access the middle zone, in which data has been integrated and lightly processed. Users deploying standard reports access the virtual tables in the top zone, in which the data has been fully processed. This is the multi-purpose data lake: one environment that supports many business users.
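The zones can be pictured as stacked virtual views, each defined over the one below it, so no data is stored again. A minimal sketch under assumed data (the field names and the two zone functions are invented for illustration):

```python
# Three zones as stacked virtual views: each zone is a function over the
# zone below, so moving between zones re-derives data rather than re-storing it.

landing = [  # raw records, as the source files deliver them (data scientists start here)
    {"cust": " ada ", "amount": "100"},
    {"cust": "grace", "amount": "250"},
]

def integration_zone(rows):
    """Middle zone: lightly processed -- trimmed names, typed amounts."""
    return [{"cust": r["cust"].strip(), "amount": int(r["amount"])} for r in rows]

def reporting_zone(rows):
    """Top zone: fully processed, ready for standard reports."""
    return {"customers": len(rows), "total_amount": sum(r["amount"] for r in rows)}

# Self-service BI users would query the middle zone; report users the top zone.
print(reporting_zone(integration_zone(landing)))
```

Because each zone is virtual, a fix in the integration logic is immediately visible to every zone above it, without reloading any data.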

Summary

There is no question about the value of a data lake to data scientists, but the solution of one physical, single-purpose data lake may not be feasible or practical. Data contained in the data lake is too valuable to restrict its use exclusively to data scientists, who form a relatively small group of users. The investment in a data lake would be more worthwhile if the target audience could be enlarged without hindering the original users. The logical, multi-purpose data lake is more flexible, does not have the problems of centralized data storage, and supports a wide range of business users.

Material for this blog comes from the whitepaper “Architecting the Multi-Purpose Data Lake With Data Virtualization,” which contains a more detailed description of the logical data lake. We strongly recommend reading this paper.

In the sixth article of this series [link to next article], we focus on cloud integration: data virtualization can simplify migration to the cloud and can make the cloud itself transparent to most applications and reports.
