Modern Data Lakes Explained
by Tom Nats
These materials are © 2024 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.
Modern Data Lakes For Dummies®, Starburst Special Edition
Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
[Link]
Copyright © 2024 by John Wiley & Sons, Inc., Hoboken, New Jersey
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at [Link]
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, [Link],
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. All other trademarks are the property of their respective
owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this
book.
For general information on our other products and services, or how to create a custom For
Dummies book for your business or organization, please contact our Business Development
Department in the U.S. at 877-409-4177, contact info@[Link], or visit [Link]/go/custompub.
For information about licensing the For Dummies brand for products or services,
contact BrandedRights&Licenses@[Link].
ISBN 978-1-394-22650-4 (pbk); ISBN 978-1-394-22651-1 (ebk)
Publisher’s Acknowledgments
Development Editor: Rachael Chilvers
Project Editor: Pradesh Kumar
Acquisitions Editor: Traci Martin
Editorial Manager: Rev Mengle
Business Development Representative: Matt Cox
IN THIS CHAPTER
»» Knowing why to use a modern data lake
Chapter 1
Exploring a Modern Data Lake
The term “data lake” (a repository that can store vast
amounts of data) may elicit different reactions and
definitions from different people in the data analytics space.
Those who lived through the Hadoop era are often skeptical of the
value data lakes provide an organization. Luckily, technology has
advanced, and many issues that data lakes encountered in the
past have been resolved. Data modification, high performance, and,
most importantly, data quality and security are now features
of modern data lakes.
Why Use a Modern Data Lake?
A modern data lake provides data warehouse functionality without
the constraints of legacy Hadoop-based data lakes. Additionally,
modern data lakes are open, which avoids single-vendor
and technology lock-in when architecting, building, and accessing
data in your data lake. Companies leveraging modern data
lakes own their data, period.
Requirement                                   Modern Data Lake
--------------------------------------------  ----------------------------------------------------
Single storage platform and multiple engines  Highly performant object stores with multiple engines reading, writing, and managing the data
Can accommodate different types of data       From JSON to CSV to Parquet and table formats such as Apache Iceberg and Delta Lake
Data quality checks                           Constraints such as “not null” and valid values for table columns
As you can see from Table 1-1, the modern data lake has come a
long way. It’s an exciting time for companies as they can finally
simplify their analytics architecture, avoid vendor lock-in, and
provide a single storage layer for all of their diverse data.
“engines” on top of this data to satisfy diverse use cases. From
standard reporting to ad-hoc querying to populating machine
learning models, organizations can choose the right engine/vendor.
BENEFITS OF A MODERN DATA LAKE
One common storage platform. Your data is always under your
complete control.
IN THIS CHAPTER
»» Looking at modern data lake components
Chapter 2
Building a Modern Data Lake
FIGURE 2-1: The components of building a modern data lake.
as possible to unlock timely insights with the cost of developing
and supporting the chosen ingestion methods. You can employ
strategies for both “batch” and “streaming” ingestion:
Data that arrives in the landing area of a data lake is either new
data or a change to data that already exists in the lake, which
is considered an update. Data in a data lake (or a data warehouse) is
typically not deleted, so it can provide an audit trail across many
different source systems. For example, if a customer’s address
changes, the original address remains in the lake so that
queries and reports can accurately capture historical data. This
is known as a slowly changing dimension (SCD). Table formats such
as Iceberg, Hudi, and Delta Lake enable you to merge and update
data within a modern data lake.
In a modern data lake, you can update data using modern table
formats such as Apache Iceberg and Delta Lake. Using these formats
in a data lake enables processing engines to perform functions
similar to those of traditional databases.
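To make the customer address example concrete, here is a minimal Python sketch of one common way to keep history when a value changes, often called a Type 2 slowly changing dimension. It is an illustration only, not the API of Iceberg, Hudi, or Delta Lake (engines typically express this kind of change as a merge statement). The names AddressVersion, apply_address_change, and address_as_of, along with the sample addresses and dates, are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AddressVersion:
    customer_id: int
    address: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_address_change(history, customer_id, new_address, changed_on):
    """Close the customer's current version and append the new one."""
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            if row.address == new_address:
                return history  # nothing changed, so keep history as-is
            row.valid_to = changed_on  # the old row is kept, not deleted
    history.append(AddressVersion(customer_id, new_address, changed_on))
    return history


def address_as_of(history, customer_id, as_of):
    """Return the address that was valid for the customer on a given day."""
    for row in history:
        if row.customer_id != customer_id:
            continue
        if row.valid_from <= as_of and (row.valid_to is None or as_of < row.valid_to):
            return row.address
    return None


# Hypothetical sample data: one customer who moves in mid-2023.
history = [AddressVersion(1, "12 Oak St", date(2020, 1, 1))]
apply_address_change(history, 1, "99 Pine Ave", date(2023, 6, 1))
```

Because the old row is closed rather than overwritten, a report can call address_as_of with a past date and still see the address that applied at that time.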
Governance and security
You need to implement proper security methods to ensure users
accessing the data are correctly authenticated and authorized.
This is essential to avoid all users gaining unfettered access and
turning your data lake into a data swamp. You can use two main
forms of security within a modern data lake: role-based access
control (RBAC) and attribute-based access control (ABAC):
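As a rough illustration of how the two models differ, the following Python sketch checks the same kind of request both ways. It is not tied to any product: the role names, table names, and the department, clearance, and sensitivity attributes are all hypothetical, chosen only to show that RBAC decides from the roles assigned to a user, while ABAC decides by comparing attributes of the user with attributes of the data.

```python
# Hypothetical role grants: role name -> tables that role may read (RBAC).
ROLE_GRANTS = {
    "analyst": {"sales.orders"},
    "marketer": {"marketing.campaigns"},
}


def rbac_allows(user_roles, table):
    """RBAC: access follows from the roles assigned to the user."""
    return any(table in ROLE_GRANTS.get(role, set()) for role in user_roles)


def abac_allows(user_attrs, table_attrs):
    """ABAC: access follows from comparing user and data attributes."""
    same_department = user_attrs.get("department") == table_attrs.get("department")
    cleared = user_attrs.get("clearance", 0) >= table_attrs.get("sensitivity", 0)
    return same_department and cleared
```

In practice the two are often combined, with roles providing the coarse grants and attributes refining them.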
Cataloging
As data enters the lake, you need to catalog it. You can add additional
metadata to enhance end users’ ability to search and find
relevant data sets. You can tag data assets (explained in the
“Governance and security” section) in order to apply security rules and
categorize these assets to simplify access.
Say you tag a set of data assets (tables, schemas, and so on) as
“marketing.” By applying this tag, you could put in place a security
policy to limit access to the data set to only those team members
in Marketing. Doing so also makes finding the data much
easier for someone in a marketing role because they can simply
search for the “marketing” tag.
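The “marketing” tag example can be sketched in a few lines of Python. This is a toy in-memory catalog, not a real catalog product: the Catalog and DataAsset classes, the method names, and the asset names such as crm.campaigns are all invented for illustration. It shows the two things the tag buys you: a policy attached to the tag, and a search by the tag.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    name: str
    tags: set = field(default_factory=set)


class Catalog:
    def __init__(self):
        self.assets = {}
        self.tag_policies = {}  # tag -> set of teams allowed to read

    def register(self, name, tags=()):
        self.assets[name] = DataAsset(name, set(tags))

    def restrict_tag(self, tag, allowed_teams):
        """Attach a policy to a tag; it applies to every asset carrying it."""
        self.tag_policies[tag] = set(allowed_teams)

    def search(self, tag):
        """Find assets by tag, returned in alphabetical order."""
        return sorted(a.name for a in self.assets.values() if tag in a.tags)

    def can_read(self, team, name):
        """Deny access if any restricted tag on the asset excludes the team."""
        for tag in self.assets[name].tags:
            allowed = self.tag_policies.get(tag)
            if allowed is not None and team not in allowed:
                return False
        return True


catalog = Catalog()
catalog.register("crm.campaigns", tags={"marketing"})
catalog.register("crm.leads", tags={"marketing"})
catalog.register("erp.invoices", tags={"finance"})
catalog.restrict_tag("marketing", {"Marketing"})
```

Note that in this sketch an asset whose tags carry no policy is readable by everyone; a real deployment would more likely deny by default.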
Data products
Data is arguably your organization’s most valuable asset. But
finding trustworthy, curated data sets remains a challenge in
most companies. This challenge was the genesis of data products.
Data products are curated data sets designed to solve specific,
targeted business questions. Users are presented with a searchable
list of data products along with additional metadata such as tags,
owners, trustworthiness, and age. Users can access and query
data products using a multitude of tools, further protecting users
from vendor lock-in. Data products have been compared with a
semantic layer. Semantic layers reside between the data and
business users to add context to data. While similar, data products go
well beyond semantic layers by adding many different features
such as in-depth metadata, lineage, approval processes,
contracts, and even social elements such as shareability.
Accessibility
You may be familiar with the analogy that “data is the new oil”
because of data’s ever-increasing value. Take analytical data as an
example. Initially used for simple reporting, it’s now embedded
into applications and deployed in creative, revenue-generating
ways. Access to trusted analytical data is the cornerstone of any
modern data lake. Don’t lock yourself into any specific tool, because
innovation in this area is accelerating rapidly. From natural language to
different visualization methods, users crave optionality and the
choice of a variety of tools to access and analyze data:
Chapter 3
Ten Tips for Optimizing Modern Data Lakes
Follow these ten tips for success when optimizing your data
lake.
from vendor lock-in because your organization’s data no longer
has to reside in a proprietary system.
Ensure that any data lake vendor can integrate with basic object
stores and that its products don’t store data in a proprietary storage
format.
Set Data Quality Standards
The term “data swamp” was a running joke during the initial data
lake craze, as data was ingested from a variety of sources without
any documentation or structure. Users quickly lost confidence in
the data and looked to other sources, such as data warehouses, for
their reporting and analytic needs.
Setting the same data quality standards in a modern data lake that
exist in data warehouses is vital to winning the trust of end users
who access the data. Today, many tools are available to ensure
the quality of data within the data lake. All data reaching the lake,
whether streaming or batch, needs to have the same
quality standards applied.
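One way to picture “the same quality standards” for batch and streaming data is a single set of rules that both paths pass through. The Python sketch below is a minimal illustration under that assumption; the field names order_id and status, the allowed status values, and the function names are hypothetical, and a real deployment would more likely rely on table constraints or a dedicated data quality tool.

```python
VALID_STATUSES = {"new", "shipped", "cancelled"}  # hypothetical allowed values


def check_record(record):
    """Return a list of rule violations for one record (empty means clean)."""
    problems = []
    if record.get("order_id") is None:
        problems.append("order_id must not be null")
    if record.get("status") not in VALID_STATUSES:
        problems.append("status has an invalid value")
    return problems


def split_by_quality(records):
    """Apply the same rules to any iterable: a batch list or a stream."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [{"order_id": 1, "status": "new"}, {"order_id": None, "status": "new"}]


def stream():
    # Stand-in for a streaming source that yields records one at a time.
    yield {"order_id": 2, "status": "shipped"}
    yield {"order_id": 3, "status": "lost"}


batch_ok, batch_bad = split_by_quality(batch)
stream_ok, stream_bad = split_by_quality(stream())
```

Rejected records are kept alongside the reasons they failed, so they can be reviewed and corrected rather than silently dropped.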
Support Multiple Data Types
Unstructured data from Internet of Things (IoT) devices and other
sources that produce massive volumes of data is challenging
to analyze and report on because of its often hierarchical
nature.
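To see why hierarchical data is awkward for reporting, and one common way to tame it, consider the short Python sketch below. It flattens a nested reading into a single row with dotted column names, which is easier to load into a table. The flatten function and the sample sensor reading are assumptions made up for this example, and the sketch deliberately leaves lists untouched.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))  # recurse into the child
        else:
            flat[column] = value  # scalars (and lists) are kept as-is
    return flat


# Hypothetical IoT reading with two levels of nesting.
reading = {
    "device_id": "sensor-7",
    "location": {"site": "plant-a", "floor": 2},
    "metrics": {"temperature": {"value": 21.5, "unit": "C"}},
}
row = flatten(reading)
```

After flattening, row has columns such as location.site and metrics.temperature.value that a query engine can filter and aggregate directly.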
To Federate or Not?
The data that populates a modern data lake can come from
a variety of sources such as enterprise resource planning (ERP),
customer relationship management (CRM), production systems,
and third-party vendors. You can query source systems directly
without copying the data to the lake first. This is known as
federation. With the technological advancements in CPU, RAM,
storage, and networking, querying source systems is often
permitted within limits. For analytics users, this opens up data
that they’ve never had access to before: they can query data sources
directly, on a source-by-source basis.
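The idea of federation can be imitated on a small scale with Python’s built-in sqlite3 module. In the sketch below, two separate in-memory SQLite databases stand in for a CRM system and an ERP system, and a single connection plays the role of the query engine that joins them in place. This is only an analogy: a real federated engine connects to remote systems over the network, and the table names, columns, and sample rows here are invented for illustration.

```python
import sqlite3

# One connection plays the role of the query engine.
engine = sqlite3.connect(":memory:")
# Each ATTACH creates a separate in-memory database, standing in for a source system.
engine.execute("ATTACH DATABASE ':memory:' AS crm")
engine.execute("ATTACH DATABASE ':memory:' AS erp")

engine.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
engine.execute("CREATE TABLE erp.invoices (customer_id INTEGER, amount REAL)")
engine.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
engine.executemany(
    "INSERT INTO erp.invoices VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)

# A single query joins both sources; nothing is copied into a third store first.
rows = engine.execute(
    """
    SELECT c.name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN erp.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

The point to notice is that the join happens where the data already lives, which is the trade-off to weigh for each source: convenience and freshness against the extra load on the source system.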
WILEY END USER LICENSE AGREEMENT
Go to [Link]/go/eula to access Wiley’s ebook EULA.