
These materials are © 2024 John Wiley & Sons, Inc.

Any dissemination, distribution, or unauthorized use is strictly prohibited.


Modern
Data Lakes
Starburst Special Edition

by Tom Nats

Modern Data Lakes For Dummies®, Starburst Special Edition

Published by
John Wiley & Sons, Inc.
111 River St.
Hoboken, NJ 07030-5774
[Link]
Copyright © 2024 by John Wiley & Sons, Inc., Hoboken, New Jersey

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise,
except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without
the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at [Link]
Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, [Link],
Making Everything Easier, and related trade dress are trademarks or registered trademarks of
John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not
be used without written permission. All other trademarks are the property of their respective
owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this
book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHORS HAVE
USED THEIR BEST EFFORTS IN PREPARING THIS WORK, THEY MAKE NO REPRESENTATIONS
OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF
THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT LIMITATION
ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES REPRESENTATIVES, WRITTEN
SALES MATERIALS OR PROMOTIONAL STATEMENTS FOR THIS WORK. THE FACT THAT AN
ORGANIZATION, WEBSITE, OR PRODUCT IS REFERRED TO IN THIS WORK AS A CITATION AND/
OR POTENTIAL SOURCE OF FURTHER INFORMATION DOES NOT MEAN THAT THE PUBLISHER
AND AUTHORS ENDORSE THE INFORMATION OR SERVICES THE ORGANIZATION, WEBSITE, OR
PRODUCT MAY PROVIDE OR RECOMMENDATIONS IT MAY MAKE. THIS WORK IS SOLD WITH
THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING PROFESSIONAL
SERVICES. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BE SUITABLE FOR
YOUR SITUATION. YOU SHOULD CONSULT WITH A SPECIALIST WHERE APPROPRIATE. FURTHER,
READERS SHOULD BE AWARE THAT WEBSITES LISTED IN THIS WORK MAY HAVE CHANGED
OR DISAPPEARED BETWEEN WHEN THIS WORK WAS WRITTEN AND WHEN IT IS READ.
NEITHER THE PUBLISHER NOR AUTHORS SHALL BE LIABLE FOR ANY LOSS OF PROFIT OR ANY
OTHER COMMERCIAL DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, INCIDENTAL,
CONSEQUENTIAL, OR OTHER DAMAGES.

For general information on our other products and services, or how to create a custom For
Dummies book for your business or organization, please contact our Business Development
Department in the U.S. at 877-409-4177, contact info@[Link], or visit [Link]/go/
custompub. For information about licensing the For Dummies brand for products or services,
contact BrandedRights&Licenses@[Link].
ISBN 978-1-394-22650-4 (pbk); ISBN 978-1-394-22651-1 (ebk)

Printed in the United States and Great Britain.

Publisher’s Acknowledgments
Development Editor: Rachael Chilvers
Project Editor: Pradesh Kumar
Acquisitions Editor: Traci Martin
Editorial Manager: Rev Mengle
Business Development Representative: Matt Cox

IN THIS CHAPTER
»» Knowing why to use a modern data lake
»» Enjoying the benefits of using a modern data lake

Chapter 1
Exploring a Modern
Data Lake

The term “data lake” — a repository that can store vast amounts of data — may elicit different reactions and definitions from different folks in the data analytics space.
Those who lived through the Hadoop era are often skeptical of the
value data lakes provide an organization. Luckily, technology has
advanced, and many issues that data lakes encountered in the
past have been resolved. Modifying data, high performance, and,
most importantly, data quality and security are now features
encompassed in data lakes.

Leveraging a single data store for all your analytical needs without being locked into one vendor is a wish come true for many organizations. After all, data is data — so the concept applies to companies of all sizes and all industries. Having a seemingly unlimited, fully managed storage area is great; however, you must define, transform, catalog, quality-check, and structure the data before it can easily be consumed by a variety of technical and non-technical end users.

Why Use a Modern Data Lake?
A modern data lake provides data warehouse functionality with-
out the constraints of legacy Hadoop-based data lakes. Addition-
ally, modern data lakes are open, which alleviates single vendor
and technology lock-in when architecting, building, and access-
ing data in your data lake. Companies leveraging modern data
lakes own their data, period.

Table 1-1 summarizes the requirements that a modern data lake fulfills.

TABLE 1-1 How a Modern Data Lake Can Meet Your Requirements

Single storage platform and multiple engines: Highly performant object stores with multiple engines reading, writing, and managing the data.

Can serve a majority of analytical use cases: Business intelligence (BI) reporting, ad-hoc queries, machine learning/AI model building and serving, and so on.

Can accommodate different types of data: From JSON to CSV to Parquet, and table formats such as Apache Iceberg and Delta Lake.

ANSI SQL support: Fully ANSI SQL compliant, supporting a variety of programming languages and most BI tools.

Ability to modify individual records: Full DML (Data Manipulation Language — update, delete, merge).

Data quality checks: Constraints such as “not null” and valid values for table columns.

High performance: Second-to-millisecond query times are possible using different engines’ indexing and caching mechanisms.

Efficient joins: With Hadoop, joins between tables were discouraged; joins are now a common pattern and encouraged in modern data lakes.

Schema evolution: Changing data structures was challenging in legacy data lakes; table formats such as Iceberg, Delta Lake, and Apache Hudi enable these changes just like a traditional database.

Affordable and maintainable: Cloud-based modern data lakes benefit from an economical, fully managed, scalable storage repository.

As you can see from Table 1-1, the modern data lake has come a
long way. It’s an exciting time for companies as they can finally
simplify their analytics architecture, avoid vendor lock-in, and
provide a single storage layer for all of their diverse data.

After the introduction of Hadoop, companies used the technology to land massive amounts of disparate data from various sources. From there, the data was analyzed in its raw form by data scientists. BI professionals using standard SQL and reporting tools struggled to get value from the unorganized data in the lake. Performance paled in comparison to traditional data warehouses. Once the data had been processed, it was copied to a data warehouse. This increased complexity and cost for companies, which now had to maintain two data systems and employ staff with multiple skill sets, as shown in Figure 1-1.

FIGURE 1-1: The bad ol’ days of data lakes.

With a modern data lake, also known as a data lakehouse, organizations can serve all of their analytical use cases from a single storage platform; see Figure 1-2. Additionally, you can use different “engines” on top of this data to satisfy diverse use cases. From standard reporting to ad-hoc querying to populating machine learning models, organizations can choose the right engine/vendor.

FIGURE 1-2: A modern data lake has a single storage platform.

BENEFITS OF A MODERN DATA LAKE

One common storage platform. Your data is always under your complete control.

Multiple engines. Open source or proprietary, choose whichever one solves your business’s use cases.

No vendor lock-in. Software vendors compete on features and innovation, as opposed to locking customers into their proprietary solutions.

IN THIS CHAPTER
»» Looking at modern data lake components
»» Securely managing and maintaining a modern data lake

Chapter 2
Building a Modern
Data Lake

In this chapter we dive into how to build and implement a modern data lake. Roll up your sleeves!

Following the Steps

You can build a modern data lake using the components shown in Figure 2-1 and explained in the next sections.

Modern data lakes separate storage and compute. This dynamic is advantageous because organizations can use a variety of tools to land data without overburdening the lake. The separation of storage and compute also enables much faster ingestion and processing, and greater flexibility. Lastly, and a primary differentiator from data warehouses, modern data lakes can use multiple engines to complete any of the following data processing steps.

FIGURE 2-1: The components of building a modern data lake.

Ingestion and landing of data

Extraction, the “E” of extract, transform, and load (ETL), is the starting point for building a modern data lake. You can choose from a variety of tools and methods to extract data from the multiple source systems where data is stored.

“Push” and “pull” are the two primary methods of extracting data from source systems:

»» Pull is the traditional method of extracting data from source systems such as databases or NoSQL systems. It involves running SQL queries against the source system to extract new or modified data and insert it into the data lake.
»» Push has become more popular as data is produced at a rate never seen before. Data from sensors or cable TV boxes (for example) is often “pushed” into a data lake using various tools and methods, such as traditional secure file transfers and real-time APIs. Pushed data arrives in the lake in raw form.
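The “pull” pattern can be sketched as an incremental watermark query: only rows modified since the last extraction are pulled into the lake. The table, columns, and the use of sqlite3 as a stand-in source system are all illustrative assumptions, not a specific product’s API.

```python
# Hypothetical sketch of "pull" extraction: incrementally read rows that
# changed since the last watermark. Table and column names are invented.
import sqlite3

def pull_changes(conn, last_watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM customers WHERE modified_at > ? "
        "ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Simulate a source system with three rows, two changed after the watermark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, modified_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01"), (2, "Bo", "2024-02-10"), (3, "Cy", "2024-03-05")],
)
changed, watermark = pull_changes(conn, "2024-01-15")
print(len(changed), watermark)  # 2 2024-03-05
```

Persisting the watermark between runs is what keeps repeated pulls cheap: each extraction only touches new or modified data.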

With the explosion of real-time data sources, newer technologies such as Kafka, an open-source distributed event streaming platform, were designed to handle massive amounts of data while maintaining reliability and acting as a company’s “data queue” to populate a data lake.

The timing of ingesting data into a lake is often contentious because you need to balance the desire to ingest data as fast as possible to unlock timely insights against the cost of developing and supporting the chosen ingestion methods. You can employ strategies for both “batch” and “streaming” ingestion:

»» Batch. Data is pulled or pushed into a data lake at given intervals.
»» Streaming (or micro-batch). Data is pushed into the data lake in real-time or near real-time.
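The micro-batch variant can be sketched as a buffer that flushes to the lake whenever it fills; the batch size and the idea of yielding batches to a writer are illustrative assumptions.

```python
# Illustrative micro-batch ingestion: buffer streamed records and flush them
# in small fixed-size batches, with a final partial flush at end of stream.
def micro_batch(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = range(7)
batches = list(micro_batch(events, 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Tuning the batch size is exactly the cost-versus-latency trade-off described above: smaller batches mean fresher data but more write overhead.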

Data that arrives in the landing area of a data lake is either new
data or an alteration to data that already exists in the lake, which
is considered an update. Data lake (and data warehouse) data is
typically not deleted. Data can provide an audit trail from many
different source systems. For example, if a customer’s address
changes, the original address would remain in the lake to ensure
queries and reports could accurately capture historical data. This
is known as a slowly changing dimension (SCD). Table formats such
as Iceberg, Hudi, and Delta Lake enable you to merge and update
data within a modern data lake.
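The address-change example can be sketched as a Type 2 SCD update in plain Python: the old row is closed out rather than deleted, so historical queries still see it. The row fields and dates are invented for illustration; in a real lake this would be a MERGE against an Iceberg, Hudi, or Delta Lake table.

```python
# A minimal sketch of a Type 2 slowly changing dimension: an address change
# closes the current row and appends a new current one, preserving history.
def scd2_update(history, key, new_address, as_of):
    for row in history:
        if row["id"] == key and row["current"]:
            row["current"] = False
            row["end_date"] = as_of
    history.append(
        {"id": key, "address": new_address, "start_date": as_of,
         "end_date": None, "current": True}
    )

history = [{"id": 1, "address": "12 Oak St", "start_date": "2020-01-01",
            "end_date": None, "current": True}]
scd2_update(history, 1, "99 Elm Ave", "2024-06-01")
print(len(history), history[0]["current"], history[1]["address"])
# 2 False 99 Elm Ave
```

A report run “as of” 2023 would still resolve customer 1 to 12 Oak St, which is the audit-trail property the paragraph describes.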

Structuring and transforming data

Once data is landed in the data lake, transformation and enrichment are regularly performed before presenting the data to end users. For example, ingested data is often joined with existing tables in the lake to provide business context such as product type, current cost, and other business-relevant information.

In a modern data lake, you can update data using modern table
formats such as Apache Iceberg and Delta Lake. Using these for-
mats in a data lake enables processing engines to perform similar
functions as traditional databases.

Aggregation and rollups

Data captured from source systems is often too granular for ad hoc querying and reporting. Product sales are a good example. The questions asked of this data usually focus on sales during different time frames, such as daily, weekly, monthly, and yearly. To provide faster performance when these types of queries are executed, data is usually pre-aggregated into different levels of granularity. This is often called “rolling up” the data into different structures that aid reporting queries; these rollups are most commonly used by standard reporting tools such as Power BI and Tableau.
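Rolling up can be sketched in a few lines: granular sales rows are pre-aggregated into daily and monthly totals so reporting queries hit the small structures instead of the raw rows. The sales figures and date format are invented for illustration.

```python
# Hedged sketch of "rolling up" granular sales into daily and monthly totals.
from collections import defaultdict

sales = [("2024-05-01", 10.0), ("2024-05-01", 5.0),
         ("2024-05-02", 7.5), ("2024-06-01", 3.0)]

daily = defaultdict(float)
monthly = defaultdict(float)
for day, amount in sales:
    daily[day] += amount
    monthly[day[:7]] += amount  # "YYYY-MM" prefix of the date

print(daily["2024-05-01"], monthly["2024-05"])  # 15.0 22.5
```

In a lake this same aggregation would typically run as a scheduled SQL job writing a rollup table; the principle is identical.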

Governance and security
You need to implement proper security methods to ensure users
accessing the data are correctly authenticated and authorized.
This is essential to avoid all users gaining unfettered access and
turning your data lake into a data swamp. You can use two main
forms of security within a modern data lake: role-based access
control (RBAC) and attribute-based access control (ABAC):

»» RBAC grants privileges to roles that users are assigned. This is the standard method for the majority of relational and non-relational databases.
»» ABAC is a newer method of securing assets within a modern
data lake. It uses tags to provide or deny privileges. Objects
within the lake are tagged, and privileges are granted to the
tag. Any object with that tag applied to it adheres to those
privileges. This is a very powerful security mechanism
because tags can span many different resource types.
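One way the ABAC idea can be sketched: privileges attach to tags, objects carry tags, and access is resolved through whatever tags a user’s role shares with the object. The tag names, grant table, and access semantics here are all invented assumptions, not any vendor’s model.

```python
# Illustrative ABAC check: privileges are granted to tags, objects carry
# tags, and a user may read an object if a shared tag grants "read".
def can_read(user_tags, object_tags, grants):
    """grants maps tag -> set of privileges granted on objects with that tag."""
    shared = user_tags & object_tags
    return any("read" in grants.get(tag, set()) for tag in shared)

grants = {"marketing": {"read"}, "finance": {"read", "write"}}
table_tags = {"marketing", "pii"}

print(can_read({"marketing"}, table_tags, grants))    # True
print(can_read({"engineering"}, table_tags, grants))  # False
```

Because the grant is on the tag rather than the object, tagging a new table “marketing” is enough to put it under the existing policy — which is the spanning power the text describes.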

Additionally, modern data lakes should have detailed audit logging as a mandatory requirement for every service and operation within the lake.

Cataloging and data products

Data lakes are extremely scalable and can contain a massive amount of data. Being able to find and trust data is an essential attribute of a successful data lake. In the past, data was added to a lake with little concern for structure and documentation, and the ingested data would be cleansed and processed later. This practice created data swamps. Modern data lakes have adopted a stricter approach. Today, data ingested into the lake must have some sort of structure, and a centralized catalog is used to track and document these structures.

Cataloging
As data enters the lake, you need to catalog it. You can add addi-
tional metadata to enhance end users’ ability to search and find
relevant data sets. You can tag data assets (explained in the “Gov-
ernance and security” section) in order to apply security rules and
categorize these assets to simplify access.

Say you tag a set of data assets (tables, schemas, and so on) as
“marketing.” By applying this tag, you could put in place a secu-
rity policy to limit access to the data set to only those team mem-
bers in Marketing. Doing so also makes finding the data much
easier for someone in a marketing role because they can simply
search for the “marketing” tag.
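The “marketing” tag scenario can be sketched with a toy catalog: assets are registered with metadata, and a tag search returns only matching assets. The asset names and catalog shape are invented for illustration.

```python
# A toy catalog illustrating tag-based search: assets carry metadata,
# and searching by the "marketing" tag returns only assets with that tag.
catalog = [
    {"name": "campaign_results", "type": "table", "tags": {"marketing"}},
    {"name": "ledger_2024", "type": "table", "tags": {"finance"}},
    {"name": "web_clicks", "type": "schema", "tags": {"marketing", "raw"}},
]

def search_by_tag(catalog, tag):
    return sorted(asset["name"] for asset in catalog if tag in asset["tags"])

print(search_by_tag(catalog, "marketing"))  # ['campaign_results', 'web_clicks']
```

The same tag set can then drive the security policy, so discovery and access control stay consistent.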

Data products
Data is arguably your organization’s most valuable asset. But
finding trustworthy, curated data sets remains a challenge in
most companies. This challenge was the genesis of data products.

The concept of a data product was introduced as a building block of the increasingly popular data mesh. A data mesh architecture is a decentralized data management approach in which ownership and control reside with individual domain teams, promoting scalability, agility, and better alignment with business goals. Traditionally, business intelligence (BI) tools provided an area where users could find curated reports and structures. As more tools became available (both SQL and NoSQL based) to consume this data, users had to replicate the structures built in one BI tool in every other tool.

Data products are curated data sets designed to solve specific, tar-
geted business questions. Users are presented with a searchable
list of data products along with additional metadata such as tags,
owners, trustworthiness, and age. Users can access and query
data products using a multitude of tools, further protecting users
from vendor lock-in. Data products have been compared with a
semantic layer. Semantic layers reside between the data and busi-
ness users to add context to data. While similar, data products go
well beyond semantic layers by adding many different features
such as in-depth metadata, lineage, approval processes, con-
tracts, and even social elements such as shareability.

Accessibility
You may be familiar with the analogy that “data is the new oil” because of data’s ever-increasing value. Take analytical data as an example: initially used for simple reporting, it’s now embedded into applications and deployed in creative revenue-generating ways. Accessing trusted analytical data is the cornerstone of any modern data lake. Don’t lock into any specific tool, because the rate of innovation in this area has exploded. From natural language to
different visualization methods, users crave optionality and the
choice of a variety of tools to access and analyze data:

»» JDBC (Java Database Connectivity) has become the standard connectivity option for a majority of analytical tools.
»» ODBC (Open Database Connectivity) was created by Microsoft in 1992 and is still used by some of its reporting tools.
»» SQL IDEs (SQL integrated development environments) simplify working with data sources. Most modern data lake platforms include a web-based SQL development and query feature.
»» Notebooks, such as Jupyter, are popular tools for data scientists. Most notebooks are able to connect to modern data lake platforms.

Observability, monitoring, and orchestration

Building and administering a modern data lake requires additional processes and tools. This work can be completed in-house, or you can leverage one of many software vendors with domain expertise and tooling. The additional processes include:

»» Observability enables your organization to understand the quality, health, and performance of data in a system. It’s a critical part of data management because it means you can identify and troubleshoot data issues before they impact business operations.
»» Monitoring is similar to observability, in that it focuses on the quality of data in source systems. Monitoring within a modern data lake includes checking whether a job completed, auditing logs, and verifying the ability to read from and write to storage. These checks are crucial to maintaining high end-user service level agreements (SLAs) and compliance standards. Many software vendors offer tools to monitor all aspects of a data lake, from basic logging to full workflow and data quality suites.
»» Orchestration is required in a modern data architecture. Executing workflows and jobs within a modern data lake has expanded from a simple scheduler to a suite of data pipeline tools, such as dbt, that extract and transform data on a scheduled basis. Organizing and streamlining data pipelines accelerates your data-driven decision making.
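The monitoring checks above (job completion, rows written, data freshness) can be sketched as a simple SLA evaluation. The job record, thresholds, and the fixed “now” used for the demo are all invented assumptions.

```python
# Hypothetical monitoring sketch: did the nightly ingestion run meet its SLA?
from datetime import datetime, timedelta

def check_run(run, max_age_hours=24, min_rows=1):
    issues = []
    age = datetime(2024, 6, 2, 8, 0) - run["finished_at"]  # "now" fixed for demo
    if run["status"] != "success":
        issues.append("job failed")
    if run["rows_written"] < min_rows:
        issues.append("no rows written")
    if age > timedelta(hours=max_age_hours):
        issues.append("data is stale")
    return issues

run = {"status": "success", "rows_written": 0,
       "finished_at": datetime(2024, 6, 1, 2, 0)}
print(check_run(run))  # ['no rows written', 'data is stale']
```

A real monitoring suite would read these fields from job logs and alert on any non-empty result, but the shape of the check is the same.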

Chapter 3
Ten Tips for Optimizing
Modern Data Lakes

Follow these ten tips for success when optimizing your data lake.

Choose Open File and Table Formats

Data lakes have brought about many technological advances to data analytics, including open file formats and open table formats. Open-source file formats — the columnar Parquet and ORC and the row-based Avro — have been adopted by a majority of data lake platforms. Modern table formats (such as Apache Iceberg, Delta Lake, and Apache Hudi) include features that enable data lakes to perform operations formerly reserved for databases, including the updating and merging of data and storage optimizations.

Separate Storage and Compute

Another technological advancement is the separation of storage and compute. The ability to use multiple engines or products on the same storage is liberating and provides further insulation

from vendor lock-in because your organization’s data no longer
has to reside in a proprietary system.

Ensure that any data lake vendor can integrate with basic object stores and that its products don’t store data in a proprietary storage format.

Support Multiple Engines

Data lake engine and tooling functionality includes ingestion, transformation, machine learning, basic SQL, Python, and a myriad of other options. Choosing the right engines for the job provides the flexibility and freedom to fulfill organizational goals.

The impact this has on companies is extremely powerful. The separation of storage and compute gives you full ownership of the tools and processes you apply to your data storage and analytic requirements. Free from vendor lock-in, you now have the freedom to select the engines and products that maximize data insights and control spend.

Remember Data Modeling
and Semantic Layers

As data lakes grew in popularity, data modeling became a challenge for two primary reasons. First, structured data was no longer the only data being ingested into the data lake; unstructured and semi-structured data became more common. Data modeling tools didn’t support data such as JSON files, so data modelers could only document a portion of the ingested data. Second, the introduction of schema-on-read allowed data to be ingested into a data lake without a schema being created. This practice was one of the things that caused data lakes to be termed “data swamps.” Building a semantic layer that’s exposed to end users is a crucial component of a modern data lake. It removes the complexity and allows only trusted and verified data to be accessed.

Set Data Quality Standards
The term “data swamp” was a running joke during the initial data
lake craze, as data was ingested from a variety of sources without
any documentation or structure. Users quickly lost confidence in
the data and looked to other sources, such as data warehouses, for
their reporting and analytic needs.

Setting the same data quality standards in a modern data lake that exist in data warehouses is vital to winning the trust of the end users who access the data. Today, many tools are available to ensure the quality of data within the data lake. All data reaching the lake, both streaming and batch, needs to have the same quality standards applied.
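The “not null” and valid-values constraints mentioned in Chapter 1 can be sketched as a check applied uniformly to every row before it lands, whether it arrives in a batch or on a stream. The column names and valid statuses are invented for illustration.

```python
# A minimal sketch of "not null" and valid-values data quality checks.
VALID_STATUSES = {"active", "churned", "trial"}

def validate(row):
    errors = []
    if row.get("customer_id") is None:
        errors.append("customer_id is null")
    if row.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {row.get('status')}")
    return errors

print(validate({"customer_id": 7, "status": "active"}))   # []
print(validate({"customer_id": None, "status": "frozen"}))
# ['customer_id is null', 'invalid status: frozen']
```

Rows that fail validation are typically quarantined rather than silently dropped, so the failures themselves remain auditable.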

Support for Any Client Tool

With a combination of data products, a well-documented catalog, and tight security, you can use any business intelligence (BI) tool with confidence that the data you’re working with is trustworthy and correct. Avoid tools that copy or create a subset of data, because that data quickly becomes out of date and untrustworthy.

Have a Data Archive Strategy

The idea that a data lake can store data forever is enticing, but the lake can quickly become a “data tomb.” Even with seemingly unlimited storage in a cloud environment, a well-thought-out data archiving strategy is important to ensure data in the modern data lake remains accessible and performant.

Categorize data as either “hot” or “cold,” as defined by the business and set using a data availability SLA. Most cloud providers offer a low-cost archive object storage option for infrequently queried data, which is ideal for archiving old data.
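The hot/cold split can be sketched as an SLA-driven classification by age: partitions newer than the cutoff stay on hot storage, the rest are archive candidates. The 90-day threshold and dates are invented assumptions to be set by the business.

```python
# Illustrative hot/cold classification driven by a data-availability SLA.
from datetime import date

def tier(partition_date, today, hot_days=90):
    """Partitions older than hot_days are candidates for archive storage."""
    return "hot" if (today - partition_date).days <= hot_days else "cold"

today = date(2024, 6, 1)
print(tier(date(2024, 5, 20), today))  # hot
print(tier(date(2023, 1, 15), today))  # cold
```

A scheduled job applying this rule can then move cold partitions to an archive storage class without changing how they are cataloged.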

Support Multiple Data Types

Unstructured data from Internet of Things (IoT) devices and other sources that produce a massive volume of data is a challenge to analyze and report against because of its often hierarchical nature.

Luckily, many modern engines (such as Spark and Trino) enable you to query these types of data directly. Alternatively, the data can be flattened into tabular structures so it can be more easily consumed by reporting tools such as Tableau and Power BI. Both structured and unstructured data must be supported in order to give a variety of end users and tools the ability to access and work with the data.
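Flattening hierarchical data into a tabular row can be sketched recursively: nested keys are joined into column names. The device event and its field names are invented for illustration.

```python
# Hedged sketch of flattening hierarchical (JSON-like) device data into a
# single tabular row that reporting tools can consume.
def flatten(record, parent_key="", sep="_"):
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))  # recurse into nesting
        else:
            flat[new_key] = value
    return flat

event = {"device": "sensor-7", "reading": {"temp": 21.5, "unit": "C"}}
print(flatten(event))
# {'device': 'sensor-7', 'reading_temp': 21.5, 'reading_unit': 'C'}
```

Engines such as Spark and Trino offer built-in functions for this kind of unnesting; the sketch just shows what the transformation does.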

To Federate or Not?

The data sources that populate a modern data lake can vary widely: enterprise resource planning (ERP), customer relationship management (CRM), production systems, and third-party vendors. You can also query source systems directly without copying the data to the lake first; this is known as federation. With the technological advancements in CPU, RAM, storage, and networking, querying source systems directly is now practical within limits. For analytics users, this opens up an option they’ve never had before: querying data sources directly, on a source-by-source basis.
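As a toy illustration of federation, one query can join “live” rows from two separate systems without copying either into the lake first. Here sqlite3’s ATTACH stands in for connecting to a real ERP and CRM; every schema, table, and name is an invented assumption.

```python
# Toy federation sketch: join rows from two separate "source systems" in one
# query, without landing either data set in the lake first.
import sqlite3

conn = sqlite3.connect(":memory:")        # pretend this is the ERP
conn.execute("ATTACH ':memory:' AS crm")  # pretend this is the CRM
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 75.0)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Bo")])

rows = conn.execute(
    "SELECT c.name, o.total FROM orders o "
    "JOIN crm.customers c ON o.customer_id = c.id ORDER BY c.name"
).fetchall()
print(rows)  # [('Ada', 50.0), ('Bo', 75.0)]
```

A federated engine such as Trino does the same thing across genuinely separate systems, with the limits on source-system load that the section describes.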

Don’t Get Squirreled!


New “shiny objects” have come and gone, such as Hadoop, data
science, and now AI. Although these shiny tools have helped (or
sometimes hindered) companies, they are just that: tools. The
main focus of any analytical system is to provide clean, trust-
worthy, and timely data to end users so they can analyze this data
through ad hoc querying and reporting. Keep your eyes on the
main objective and don’t get too distracted by shiny objects!

WILEY END USER LICENSE AGREEMENT
Go to [Link]/go/eula to access Wiley’s ebook EULA.
