
These materials are © 2024 John Wiley & Sons, Inc.

Any dissemination, distribution, or unauthorized use is strictly prohibited.


Modern
Data Lakes
Starburst Special Edition

by Tom Nats

Modern Data Lakes For Dummies®, Starburst Special Edition

Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
[Link]
Copyright © 2024 by John Wiley & Sons, Inc., Hoboken, New Jersey

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at [Link]
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, [Link],
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. All other trademarks are the property of their respective
owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this
book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHORS HAVE
USED THEIR BEST EFFORTS IN PREPARING THIS WORK, THEY MAKE NO REPRESENTATIONS
OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF
THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES, WRITTEN
SALES MATERIALS OR PROMOTIONAL STATEMENTS FOR THIS WORK. THE FACT THAT AN
ORGANIZATION, WEBSITE, OR PRODUCT IS REFERRED TO IN THIS WORK AS A CITATION AND/
OR POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE PUBLISHER
AND AUTHORS ENDORSE THE INFORMATION OR SERVICES THE ORGANIZATION, WEBSITE, OR
PRODUCT MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. THIS WORK IS SOLD WITH
THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING PROFESSIONAL
SERVICES. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR
YOUR SITUATION. YOU SHOULD CONSULT WITH A SPECIALIST WHERE APPROPRIATE. FURTHER,
READERS SHOULD BE AWARE THAT WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED
OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
NEITHER THE PUBLISHER NOR AUTHORS SHALL BE LIABLE FOR ANY LOSS OF PROFIT OR ANY
OTHER COMMERCIAL DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, INCIDENTAL,
CONSEQUENTIAL, OR OTHER DAMAGES.

For general information on our other products and services, or how to create a custom For
Dummies book for your business or organization, please contact our Business Development
Department in the U.S. at 877-409-4177, contact info@[Link], or visit [Link]/go/
custompub. For information about licensing the For Dummies brand for products or services,
contact BrandedRights&Licenses@[Link].
ISBN 978-1-394-22650-4 (pbk); ISBN 978-1-394-22651-1 (ebk)

Printed in the United States and Great Britain.

Publisher’s Acknowledgments
Development Editor: Rachael Chilvers
Project Editor: Pradesh Kumar
Acquisitions Editor: Traci Martin
Editorial Manager: Rev Mengle
Business Development Representative: Matt Cox

IN THIS CHAPTER
»» Knowing why to use a modern data lake
»» Enjoying the benefits of using a modern data lake

Chapter 1
Exploring a Modern
Data Lake

The term “data lake” — a repository that can store vast amounts of data — may elicit different reactions and definitions from different folks in the data analytics space.
Those who lived through the Hadoop era are often skeptical of the
value data lakes provide an organization. Luckily, technology has
advanced, and many issues that data lakes encountered in the
past have been resolved. Modifying data, high performance, and,
most importantly, data quality and security are now features
encompassed in data lakes.

Leveraging a single data store for all your analytical needs without being locked into one vendor is a wish come true for many organizations. After all, data is data — so the concept applies to companies of all sizes and all industries. Having a seemingly unlimited, fully managed storage area is great; however, you must define, transform, catalog, quality-check, and structure the data before it can easily be consumed by a variety of technical and non-technical end users.

Why Use a Modern Data Lake?
A modern data lake provides data warehouse functionality with-
out the constraints of legacy Hadoop-based data lakes. Addition-
ally, modern data lakes are open, which alleviates single vendor
and technology lock-in when architecting, building, and access-
ing data in your data lake. Companies leveraging modern data
lakes own their data, period.

Table 1-1 summarizes the requirements that a modern data lake fulfills.

TABLE 1-1 How a Modern Data Lake Can Meet Your Requirements

Single storage platform and multiple engines: Highly performant object stores with multiple engines reading, writing, and managing the data.

Can serve a majority of analytical use cases: Business intelligence (BI) reporting, ad-hoc queries, machine learning/AI model building and serving, and so on.

Can accommodate different types of data: From JSON to CSV to Parquet, and table formats such as Apache Iceberg and Delta Lake.

ANSI SQL support: Fully ANSI SQL compliant, supporting a variety of programming languages and most BI tools.

Ability to modify individual records: Full DML (Data Manipulation Language — update, delete, merge).

Data quality checks: Constraints such as “not null” and valid values for table columns.

High performance: Second-to-millisecond query times are possible using different engines’ indexing and caching mechanisms.

Efficient joins: With Hadoop, joins between tables were discouraged; joins are now a common pattern and encouraged in modern data lakes.

Schema evolution: Changing data structures was challenging in legacy data lakes; table formats such as Iceberg, Delta Lake, and Apache Hudi enable these changes just like a traditional database.

Affordable and maintainable: Cloud-based modern data lakes benefit from an economical, fully managed, scalable storage repository.

As you can see from Table 1-1, the modern data lake has come a
long way. It’s an exciting time for companies as they can finally
simplify their analytics architecture, avoid vendor lock-in, and
provide a single storage layer for all of their diverse data.

After the introduction of Hadoop, companies used the technology to land massive amounts of disparate data from various sources. From there, the data was analyzed in its raw form by data scientists. BI professionals using standard SQL and reporting tools struggled to get value from the unorganized data in the lake. Performance paled in comparison to traditional data warehouses. Once the data had been processed, it was copied to a data warehouse. This increased complexity and cost for companies, which now had to maintain two data systems and employ staff with multiple skill sets, as shown in Figure 1-1.

FIGURE 1-1: The bad ol’ days of data lakes.

With a modern data lake, also known as a data lakehouse, organizations can serve all of their analytical use cases from a single storage platform; see Figure 1-2. Additionally, you can use different “engines” on top of this data to satisfy diverse use cases. From standard reporting to ad-hoc querying to populating machine learning models, organizations can choose the right engine/vendor.

FIGURE 1-2: A modern data lake has a single storage platform.

BENEFITS OF A MODERN DATA LAKE

One common storage platform. Your data is always under your complete control.

Multiple engines. Open source or proprietary, choose whichever one solves your business’s use cases.

No vendor lock-in. Software vendors compete on features and innovation, as opposed to locking customers into their proprietary solutions.

IN THIS CHAPTER
»» Looking at modern data lake components
»» Securely managing and maintaining a modern data lake

Chapter 2
Building a Modern
Data Lake

In this chapter we dive into how to build and implement a modern data lake. Roll up your sleeves!

Following the Steps

You can build a modern data lake using the components shown in Figure 2-1 and explained in the next sections.

Modern data lakes separate storage and compute. This dynamic is advantageous because organizations can use a variety of tools to land data without overburdening the lake. The separation of storage and compute also enables much faster ingestion and processing, and greater flexibility. Lastly, and a primary differentiator from data warehouses, modern data lakes can use multiple engines to complete any of the following data processing steps.

FIGURE 2-1: The components of building a modern data lake.

Ingestion and landing of data

Extraction, the “E” of extract, transform, and load (ETL), is the starting point for building a modern data lake. You can choose from a variety of tools and methods to extract data from the multiple source systems where data is stored.

“Push” and “pull” are the two primary methods of extracting data from source systems:

»» Pull is the traditional method of extracting data from source systems such as databases or NoSQL systems. It involves running SQL queries against the source system to extract new or modified data and insert it into the data lake.
»» Push has become more popular as data is produced at a rate never seen before. Data from sensors or cable TV boxes (for example) is often “pushed” into a data lake using various tools and methods, such as traditional secure file transfers and real-time APIs. Pushed data arrives in the lake in raw form.
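The “pull” pattern can be sketched as an incremental watermark query: only rows modified since the last extraction are pulled into the lake. The table, columns, and the use of sqlite3 as a stand-in source system are all illustrative assumptions, not a specific product’s API.

```python
# Hypothetical sketch of "pull" extraction: incrementally read rows that
# changed since the last watermark. Table and column names are invented.
import sqlite3

def pull_changes(conn, last_watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM customers WHERE modified_at > ? "
        "ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Simulate a source system with three rows, two changed after the watermark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01"), (2, "Bo", "2024-02-10"), (3, "Cy", "2024-03-05")],
)
changed, watermark = pull_changes(conn, "2024-01-15")
print(len(changed), watermark)  # 2 2024-03-05
```

Persisting the watermark between runs is what keeps repeated pulls cheap: each extraction only touches new or modified data.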

With the explosion of real-time data sources, newer technologies such as Kafka, an open-source distributed event streaming platform, were designed to handle massive amounts of data while maintaining reliability and acting as a company’s “data queue” to populate a data lake.

The timing of ingesting data into a lake is often contentious because you need to balance the desire to ingest data as fast as possible to unlock timely insights against the cost of developing and supporting the chosen ingestion methods. You can employ strategies for both “batch” and “streaming” ingestion:

»» Batch. Data is pulled or pushed into a data lake at given intervals.
»» Streaming (or micro-batch). Data is pushed into the data lake in real-time or near real-time.
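The micro-batch variant can be sketched as a buffer that flushes to the lake whenever it fills; the batch size and the idea of yielding batches to a writer are illustrative assumptions.

```python
# Illustrative micro-batch ingestion: buffer streamed records and flush them
# in small fixed-size batches, with a final partial flush at end of stream.
def micro_batch(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = range(7)
batches = list(micro_batch(events, 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Tuning the batch size is exactly the cost-versus-latency trade-off described above: smaller batches mean fresher data but more write overhead.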

Data that arrives in the landing area of a data lake is either new
data or an alteration to data that already exists in the lake, which
is considered an update. Data lake (and data warehouse) data is
typically not deleted. Data can provide an audit trail from many
different source systems. For example, if a customer’s address
changes, the original address would remain in the lake to ensure
queries and reports could accurately capture historical data. This
is known as a slowly changing dimension (SCD). Table formats such
as Iceberg, Hudi, and Delta Lake enable you to merge and update
data within a modern data lake.
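The address-change example can be sketched as a Type 2 SCD update in plain Python: the old row is closed out rather than deleted, so historical queries still see it. The row fields and dates are invented for illustration; in a real lake this would be a MERGE against an Iceberg, Hudi, or Delta Lake table.

```python
# A minimal sketch of a Type 2 slowly changing dimension: an address change
# closes the current row and appends a new current one, preserving history.
def scd2_update(history, key, new_address, as_of):
    for row in history:
        if row["id"] == key and row["current"]:
            row["current"] = False
            row["end_date"] = as_of
    history.append(
        {"id": key, "address": new_address, "start_date": as_of,
         "end_date": None, "current": True}
    )

history = [{"id": 1, "address": "12 Oak St", "start_date": "2020-01-01",
            "end_date": None, "current": True}]
scd2_update(history, 1, "99 Elm Ave", "2024-06-01")
print(len(history), history[0]["current"], history[1]["address"])
# 2 False 99 Elm Ave
```

A report run “as of” 2023 would still resolve customer 1 to 12 Oak St, which is the audit-trail property the paragraph describes.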

Structuring and transforming data

Once data is landed in the data lake, transformation and enrichment are regularly performed before presenting the data to end users. For example, ingested data is often joined with existing tables in the lake to provide business context such as product type, current cost, and other business-relevant information.

In a modern data lake, you can update data using modern table
formats such as Apache Iceberg and Delta Lake. Using these for-
mats in a data lake enables processing engines to perform similar
functions as traditional databases.

Aggregation and rollups

Data captured from source systems is often too granular for ad hoc querying and reporting. Product sales are a good example. The questions asked of this data usually focus on sales during different time frames, such as daily, weekly, monthly, and yearly. To provide faster performance when these types of queries are executed, data is usually pre-aggregated into different levels of granularity. This is often called “rolling up” the data into different structures that aid reporting queries; these rollups are most commonly used by standard reporting tools such as Power BI and Tableau.
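Rolling up can be sketched in a few lines: granular sales rows are pre-aggregated into daily and monthly totals so reporting queries hit the small structures instead of the raw rows. The sales figures and date format are invented for illustration.

```python
# Hedged sketch of "rolling up" granular sales into daily and monthly totals.
from collections import defaultdict

sales = [("2024-05-01", 10.0), ("2024-05-01", 5.0),
         ("2024-05-02", 7.5), ("2024-06-01", 3.0)]

daily = defaultdict(float)
monthly = defaultdict(float)
for day, amount in sales:
    daily[day] += amount
    monthly[day[:7]] += amount  # "YYYY-MM" prefix of the date

print(daily["2024-05-01"], monthly["2024-05"])  # 15.0 22.5
```

In a lake this same aggregation would typically run as a scheduled SQL job writing a rollup table; the principle is identical.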

Governance and security
You need to implement proper security methods to ensure users
accessing the data are correctly authenticated and authorized.
This is essential to avoid all users gaining unfettered access and
turning your data lake into a data swamp. You can use two main
forms of security within a modern data lake: role-based access
control (RBAC) and attribute-based access control (ABAC):

»» RBAC grants privileges to roles that users are assigned. This is the standard method for the majority of relational and non-relational databases.
»» ABAC is a newer method of securing assets within a modern
data lake. It uses tags to provide or deny privileges. Objects
within the lake are tagged, and privileges are granted to the
tag. Any object with that tag applied to it adheres to those
privileges. This is a very powerful security mechanism
because tags can span many different resource types.
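One way the ABAC idea can be sketched: privileges attach to tags, objects carry tags, and access is resolved through whatever tags a user’s role shares with the object. The tag names, grant table, and access semantics here are all invented assumptions, not any vendor’s model.

```python
# Illustrative ABAC check: privileges are granted to tags, objects carry
# tags, and a user may read an object if a shared tag grants "read".
def can_read(user_tags, object_tags, grants):
    """grants maps tag -> set of privileges granted on objects with that tag."""
    shared = user_tags & object_tags
    return any("read" in grants.get(tag, set()) for tag in shared)

grants = {"marketing": {"read"}, "finance": {"read", "write"}}
table_tags = {"marketing", "pii"}

print(can_read({"marketing"}, table_tags, grants))    # True
print(can_read({"engineering"}, table_tags, grants))  # False
```

Because the grant is on the tag rather than the object, tagging a new table “marketing” is enough to put it under the existing policy — which is the spanning power the text describes.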

Additionally, modern data lakes should have detailed audit logging as a mandatory requirement for every service and operation within the lake.

Cataloging and data products

Data lakes are extremely scalable and can contain a massive amount of data. Being able to find and trust data is an essential attribute of a successful data lake. In the past, data was added to a lake with little concern for structure and documentation, and the ingested data would be cleansed and processed later. This practice created data swamps. Modern data lakes have adopted a stricter approach. Today, data ingested into the lake must have some sort of structure, and a centralized catalog is used to track and document these structures.

Cataloging
As data enters the lake, you need to catalog it. You can add addi-
tional metadata to enhance end users’ ability to search and find
relevant data sets. You can tag data assets (explained in the “Gov-
ernance and security” section) in order to apply security rules and
categorize these assets to simplify access.

Say you tag a set of data assets (tables, schemas, and so on) as
“marketing.” By applying this tag, you could put in place a secu-
rity policy to limit access to the data set to only those team mem-
bers in Marketing. Doing so also makes finding the data much
easier for someone in a marketing role because they can simply
search for the “marketing” tag.
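The “marketing” tag scenario can be sketched with a toy catalog: assets are registered with metadata, and a tag search returns only matching assets. The asset names and catalog shape are invented for illustration.

```python
# A toy catalog illustrating tag-based search: assets carry metadata,
# and searching by the "marketing" tag returns only assets with that tag.
catalog = [
    {"name": "campaign_results", "type": "table", "tags": {"marketing"}},
    {"name": "ledger_2024", "type": "table", "tags": {"finance"}},
    {"name": "web_clicks", "type": "schema", "tags": {"marketing", "raw"}},
]

def search_by_tag(catalog, tag):
    return sorted(asset["name"] for asset in catalog if tag in asset["tags"])

print(search_by_tag(catalog, "marketing"))  # ['campaign_results', 'web_clicks']
```

The same tag set can then drive the security policy, so discovery and access control stay consistent.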

Data products
Data is arguably your organization’s most valuable asset. But
finding trustworthy, curated data sets remains a challenge in
most companies. This challenge was the genesis of data products.

The concept of a data product was introduced as a building block of the increasingly popular data mesh. A data mesh architecture is a decentralized data management approach in which ownership and control reside with individual domain teams, promoting scalability, agility, and better alignment with business goals. Traditionally, business intelligence (BI) tools provided an area where users could find curated reports and structures. As more tools became available (both SQL and NoSQL based) to consume this data, users had to replicate the structures built in one BI tool in every other tool.

Data products are curated data sets designed to solve specific, tar-
geted business questions. Users are presented with a searchable
list of data products along with additional metadata such as tags,
owners, trustworthiness, and age. Users can access and query
data products using a multitude of tools, further protecting users
from vendor lock-in. Data products have been compared with a
semantic layer. Semantic layers reside between the data and busi-
ness users to add context to data. While similar, data products go
well beyond semantic layers by adding many different features
such as in-depth metadata, lineage, approval processes, con-
tracts, and even social elements such as shareability.

Accessibility
You may be familiar with the analogy that “data is the new oil” because of data’s ever-increasing value. Take analytical data as an example: initially used for simple reporting, it’s now embedded into applications and deployed in creative revenue-generating ways. Accessing trusted analytical data is the cornerstone of any modern data lake. Don’t lock into any specific tool, because the rate of innovation in this area has exploded. From natural language to
different visualization methods, users crave optionality and the
choice of a variety of tools to access and analyze data:

»» JDBC (Java Database Connectivity) has become the standard connectivity option for a majority of analytical tools.
»» ODBC (Open Database Connectivity) was created by Microsoft in 1992 and is still used by some of its reporting tools.
»» SQL IDEs (SQL integrated development environments) simplify working with data sources. Most modern data lake platforms include a web-based SQL development and query feature.
»» Notebooks, such as Jupyter, are popular tools for data scientists. Most notebooks are able to connect to modern data lake platforms.

Observability, monitoring, and orchestration

Building and administering a modern data lake requires additional processes and tools. This work can be completed in-house, or you can leverage one of many software vendors with domain expertise and tooling. The additional processes include:

»» Observability enables your organization to understand the quality, health, and performance of data in a system. It’s a critical part of data management because it means you can identify and troubleshoot data issues before they impact business operations.
»» Monitoring is similar to observability, in that it focuses on the quality of data in source systems. Monitoring within a modern data lake includes checking whether a job completed, auditing logs, and verifying the ability to read from and write to storage. These checks are crucial to maintaining high end-user service level agreements (SLAs) and compliance standards. Many software vendors offer tools to monitor all aspects of a data lake, from basic logging to full workflow and data quality suites.
»» Orchestration is required in a modern data architecture. Executing workflows and jobs within a modern data lake has expanded from a simple scheduler to a suite of data pipeline tools, such as dbt, that extract and transform data on a scheduled basis. Organizing and streamlining data pipelines accelerates your data-driven decision making.
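The monitoring checks above (job completion, rows written, data freshness) can be sketched as a simple SLA evaluation. The job record, thresholds, and the fixed “now” used for the demo are all invented assumptions.

```python
# Hypothetical monitoring sketch: did the nightly ingestion run meet its SLA?
from datetime import datetime, timedelta

def check_run(run, max_age_hours=24, min_rows=1):
    issues = []
    age = datetime(2024, 6, 2, 8, 0) - run["finished_at"]  # "now" fixed for demo
    if run["status"] != "success":
        issues.append("job failed")
    if run["rows_written"] < min_rows:
        issues.append("no rows written")
    if age > timedelta(hours=max_age_hours):
        issues.append("data is stale")
    return issues

run = {"status": "success", "rows_written": 0,
       "finished_at": datetime(2024, 6, 1, 2, 0)}
print(check_run(run))  # ['no rows written', 'data is stale']
```

A real monitoring suite would read these fields from job logs and alert on any non-empty result, but the shape of the check is the same.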

Chapter 3
Ten Tips for Optimizing
Modern Data Lakes

Follow these ten tips for success when optimizing your data lake.

Choose Open File and Table Formats

Data lakes have brought about many technological advances to data analytics, including open file formats and open table formats. Open-source file formats — the columnar Parquet and ORC and the row-based Avro — have been adopted by a majority of data lake platforms. Modern table formats (such as Apache Iceberg, Delta Lake, and Apache Hudi) include features that enable data lakes to perform operations formerly reserved for databases, including the updating and merging of data and storage optimizations.

Separate Storage and Compute

Another technological advancement is the separation of storage and compute. The ability to use multiple engines or products on the same storage is liberating and provides further insulation

from vendor lock-in because your organization’s data no longer
has to reside in a proprietary system.

Ensure that any data lake vendor can integrate with basic object stores and that its products don’t store data in a proprietary storage format.

Support Multiple Engines

Data lake engine and tooling functionality includes ingestion, transformation, machine learning, basic SQL, Python, and a myriad of other options. Choosing the right engines for the job provides the flexibility and freedom to fulfill organizational goals.

The impact this has on companies is extremely powerful. The separation of storage and compute gives you full ownership of the tools and processes you apply to your data storage and analytic requirements. Free from vendor lock-in, you now have the freedom to select the engines and products that maximize data insights and control spend.

Remember Data Modeling
and Semantic Layers

As data lakes grew in popularity, data modeling became a challenge for two primary reasons. First, structured data was no longer the only data being ingested into the data lake; unstructured and semi-structured data became more common. Data modeling tools didn’t support data such as JSON files, so data modelers could only document a portion of the ingested data. Second, the introduction of schema-on-read allowed data to be ingested into a data lake without a schema being created. This practice was one of the things that caused data lakes to be termed “data swamps.” Building a semantic layer that’s exposed to end users is a crucial component of a modern data lake. It removes the complexity and allows only trusted and verified data to be accessed.

Set Data Quality Standards
The term “data swamp” was a running joke during the initial data
lake craze, as data was ingested from a variety of sources without
any documentation or structure. Users quickly lost confidence in
the data and looked to other sources, such as data warehouses, for
their reporting and analytic needs.

Setting the same data quality standards in a modern data lake that exist in data warehouses is vital to winning the trust of the end users who access the data. Today, many tools are available to ensure the quality of data within the data lake. All data reaching the lake, both streaming and batch, needs to have the same quality standards applied.
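The “not null” and valid-values constraints mentioned in Chapter 1 can be sketched as a check applied uniformly to every row before it lands, whether it arrives in a batch or on a stream. The column names and valid statuses are invented for illustration.

```python
# A minimal sketch of "not null" and valid-values data quality checks.
VALID_STATUSES = {"active", "churned", "trial"}

def validate(row):
    errors = []
    if row.get("customer_id") is None:
        errors.append("customer_id is null")
    if row.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {row.get('status')}")
    return errors

print(validate({"customer_id": 7, "status": "active"}))   # []
print(validate({"customer_id": None, "status": "frozen"}))
# ['customer_id is null', 'invalid status: frozen']
```

Rows that fail validation are typically quarantined rather than silently dropped, so the failures themselves remain auditable.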

Support for Any Client Tool

With a combination of data products, a well-documented catalog, and tight security, you can use any business intelligence (BI) tool with confidence that the data you’re working with is trustworthy and correct. Avoid tools that copy or create a subset of data, because that data quickly becomes out of date and untrustworthy.

Have a Data Archive Strategy

The idea that a data lake can store data forever is enticing, but the lake can quickly become a “data tomb.” Even with seemingly unlimited storage in a cloud environment, a well-thought-out data archiving strategy is important to ensure data in the modern data lake remains accessible and performant.

Categorize data as either “hot” or “cold,” as defined by the business and set using a data availability SLA. Most cloud providers offer a low-cost archive object storage option for infrequently queried data, which is ideal for archiving old data.
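The hot/cold split can be sketched as an SLA-driven classification by age: partitions newer than the cutoff stay on hot storage, the rest are archive candidates. The 90-day threshold and dates are invented assumptions to be set by the business.

```python
# Illustrative hot/cold classification driven by a data-availability SLA.
from datetime import date

def tier(partition_date, today, hot_days=90):
    """Partitions older than hot_days are candidates for archive storage."""
    return "hot" if (today - partition_date).days <= hot_days else "cold"

today = date(2024, 6, 1)
print(tier(date(2024, 5, 20), today))  # hot
print(tier(date(2023, 1, 15), today))  # cold
```

A scheduled job applying this rule can then move cold partitions to an archive storage class without changing how they are cataloged.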

Support Multiple Data Types

Unstructured data from Internet of Things (IoT) devices and other sources that produce a massive volume of data is a challenge to analyze and report against because of its often hierarchical nature.

Luckily, many modern engines (such as Spark and Trino) enable you to query these types of data directly. Alternatively, the data can be flattened into tabular structures so it can be more easily consumed by reporting tools such as Tableau and Power BI. Both structured and unstructured data must be supported in order to give a variety of end users and tools the ability to access and work with the data.
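Flattening hierarchical data into a tabular row can be sketched recursively: nested keys are joined into column names. The device event and its field names are invented for illustration.

```python
# Hedged sketch of flattening hierarchical (JSON-like) device data into a
# single tabular row that reporting tools can consume.
def flatten(record, parent_key="", sep="_"):
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))  # recurse into nesting
        else:
            flat[new_key] = value
    return flat

event = {"device": "sensor-7", "reading": {"temp": 21.5, "unit": "C"}}
print(flatten(event))
# {'device': 'sensor-7', 'reading_temp': 21.5, 'reading_unit': 'C'}
```

Engines such as Spark and Trino offer built-in functions for this kind of unnesting; the sketch just shows what the transformation does.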

To Federate or Not?

The data sources that populate a modern data lake can vary widely: enterprise resource planning (ERP), customer relationship management (CRM), production systems, and third-party vendors. You can also query source systems directly without copying the data to the lake first; this is known as federation. With the technological advancements in CPU, RAM, storage, and networking, querying source systems directly is now practical within limits. For analytics users, this opens up an option they’ve never had before: querying data sources directly, on a source-by-source basis.
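As a toy illustration of federation, one query can join “live” rows from two separate systems without copying either into the lake first. Here sqlite3’s ATTACH stands in for connecting to a real ERP and CRM; every schema, table, and name is an invented assumption.

```python
# Toy federation sketch: join rows from two separate "source systems" in one
# query, without landing either data set in the lake first.
import sqlite3

conn = sqlite3.connect(":memory:")        # pretend this is the ERP
conn.execute("ATTACH ':memory:' AS crm")  # pretend this is the CRM
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 75.0)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Bo")])

rows = conn.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN crm.customers c ON o.customer_id = c.id ORDER BY c.name"
).fetchall()
print(rows)  # [('Ada', 50.0), ('Bo', 75.0)]
```

A federated engine such as Trino does the same thing across genuinely separate systems, with the limits on source-system load that the section describes.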

Don’t Get Squirreled!


New “shiny objects” have come and gone, such as Hadoop, data
science, and now AI. Although these shiny tools have helped (or
sometimes hindered) companies, they are just that: tools. The
main focus of any analytical system is to provide clean, trust-
worthy, and timely data to end users so they can analyze this data
through ad hoc querying and reporting. Keep your eyes on the
main objective and don’t get too distracted by shiny objects!

WILEY END USER LICENSE AGREEMENT
Go to [Link]/go/eula to access Wiley’s ebook EULA.
