PaperID 52140-JATIT
ABSTRACT
The data available to enterprises today comes from many different sources, including social networks, sensors, and IoT devices. In order to discover trends, draw conclusions, produce projections, and make informed decisions, this enormous amount of data needs to be stored across a variety of platforms for processing and analysis. The quantity and variety of the data being collected surpass the capacity of conventional data warehouses. Businesses with existing data warehouses must therefore pick a storage architecture with enough storage and processing power for this kind of data. They must choose one of the following options: the data warehouse can either (i) evolve into a big data warehouse, (ii) be replaced by a data lake, or (iii) be combined with a data lake to create a data LakeHouse. In this article, we aim to find the best choice for the storage of varied and voluminous data. To do this, we examine the big data warehousing literature. After comparing the various architectures put forth, we draw a conclusion outlining the optimal storage practice.
Keywords: data warehouse, big data, big data warehouse, data lake, data LakeHouse
1. INTRODUCTION
The widespread usage of new technologies has produced huge volumes of heterogeneous data [1]. As a result, organizations must deal with massive amounts of data from many sources and in various formats. In order to make predictions, draw conclusions, and take wise decisions, they must process and analyze all the data, which necessitates a platform with the required capabilities and features [2]. Data warehouses are utilized mostly with massive datasets produced in various legacy systems using relational data, and they constitute a traditional domain of relational databases [3]. They deliver analytical data via analysis and reporting tools and are fed from various data sources via ETL. Because of the limitations imposed by data warehouses, analytical tools fall short of what analysts demand in terms of high availability and quick responses to queries [4].
Due to these restrictions, organizations are forced to move to a big data platform that offers unlimited storage capacity and supports a variety of data formats.
Because of this obligation, we ask ourselves the following questions: What role will the data warehouse play in the age of Big Data? Should the company permanently stop using the data warehouse? What is the impact of investing in a data warehouse even if the organization already has a big data platform? An in-depth analysis of the different solutions offered by companies that currently have a data warehouse is necessary to find the answers to these questions.
Numerous architectures are found in the literature. The data warehouse has been replaced by the big data warehouse, abandoned in favor of the data lake, or combined with the data lake into a new tool called the LakeHouse [5].
In this paper, we answer these questions by presenting a comparative study of the new architectures that are replacing the traditional data warehouse.
The state of the art for the data lake, big data warehouse, and LakeHouse is presented in the following section. In the third section, we outline
the many designs that constitute good storage methods and offer a comparison of their individual traits. Section 4 provides a synthesis, and we discuss open research challenges and future prospects in the fifth section. The final section concludes the paper.

2. LITERATURE REVIEW
The many structures used to store, process, and analyze vast amounts of data are highlighted in this section. It reviews the Data Lake, Big Data Warehouse, and LakeHouse literature, as well as a study that details and contrasts the various platforms' features.

2.1 Data lake
The literature on data lakes is somewhat murky and lacking, and the numerous implementation strategies that have been proposed do not completely address the topic of data lakes or provide a detailed design and implementation strategy [6]. The available literature discusses certain details and tangible traits of data lakes, but it does not offer a consistent idea or overarching implementation plan.
Studies show that customers save all of their unprocessed, raw data—whether unstructured, semi-structured, or structured—in a single, central location called the data lake [5].
James Dixon, the Chief Technology Officer (CTO) of Pentaho, first used the phrase in 2010 to describe the idea of a single repository collecting practically infinite amounts of raw data for analysis or indefinite future usage [7], [8]. Because each data entity in the data lake is associated with a unique identifier and a substantial amount of metadata [9], consumers can use specifically created schemas to query the pertinent data, resulting in a smaller collection of data that can be studied to help answer their questions.
Data models and schemas are employed when producing or analyzing data, but not when storing it [10]. The data lake is described by Terrizzano I. et al. [7] as a collection of central repositories housing substantial amounts of raw data in various forms, described by metadata, arranged into recognizable datasets, and accessible on demand. Similarly, Hai et al. [11] define data lakes as big data repositories that hold raw data and offer on-demand integration features utilizing metadata descriptions.
Data lakes in the context of big data provide extensive and flexible data storage that can accept many data formats. In spite of the trade-offs made while storing data in conventional designs like a data warehouse, they store nearly accurate or even exact copies of the source format to give an unpolished view of the data [12]. No attempt is made to model or integrate the data before storage. The goal of the data lake is to make data available to other organizations for use in the future, like data analysis [13]. It can serve as a setting for the development of in-depth analyses with the goal of making quick, accurate decisions based on raw data. Additionally, it is the perfect response to issues with data integration and accessibility.
There are several benefits of using data lakes to store raw data. Four benefits are highlighted by Marilex R. L. [8]: enhanced data collecting, quick access to raw data, reduced initial effort through data storage, and data preservation. The main use cases of data lakes are as experimentation platforms for data scientists or analysts, staging areas or sources for data warehouses, and as a direct source for self-service BI. Figure 1 shows the data lake architecture.

Figure 1. Data lake architecture

2.2 Big data warehouse
The traditional Data Warehouse and Hadoop are combined in the hybrid design known as the Big Data Warehouse, which can be a substantial benefit in terms of data processing, multidimensional processing, and decision-making maturity [14]. Although data warehouse and big data are two distinct concepts, they work quite well together. Data warehouses are architectural descriptions of how data is organized, whereas big data is a technology connected to the storage and management of vast and varied amounts of data.
A big data warehouse is a key component of the organization and management of big data. It is a hybrid system that employs both currently available big data technologies and data warehouse design. A Big Data warehouse can be implemented as a first step in enhancing an organization's data analytics infrastructure and starting to apply Big Data technologies [15]. By fusing data warehouse analysis with big data analysis, the big data warehouse makes it possible to quickly analyze large amounts of data.
Many businesses across a variety of industries are currently working to modernize their data analytics infrastructures for this new era by switching from the traditional data warehouse (DW) concept to a new notion of the big data warehouse (BDW) based on a more dynamic data model [15], [16].
The big data warehouse, according to Forrester, is "a specialized and consistent set of data repositories and platforms used to support a wide variety of analytics run on-premises, in the cloud, or in a hybrid environment" [17]. The big data warehouse makes use of both established and emerging technologies, including Hadoop, columnar and row data warehouses, streaming, ETL, and elastic storage as well as in-memory frameworks [17].
Data types and formats are a significant issue right now since they contradict the core tenets of data warehouse operations. In fact, spatial data, photos, videos, and plain text cannot be stored in data warehouses. The design and implementation of big data warehouses is therefore developing into an important area of research in the contemporary conceptual, technological, and organizational setting. The literature on this subject is divided into three categories.
The first category includes works that address big data warehouses' physical design [18]–[23], the second category includes works that address big data warehouses' query processing and optimization [24]–[28], and the third category includes works that address both axes simultaneously [16], [29]. In order to demonstrate the significance of big data warehouses in information systems, we cite a few studies that deal with the topic.
A new data placement approach for Hadoop's distributed data warehouses, dubbed Smart Data Warehouse Placement (SDWP), is proposed in the work published in [30]. In [31], the authors suggest a useful tool for heterogeneous data warehouses' data administration and integration. They go over the technologies and architectural frameworks necessary for large-scale data processing, the back-end application that carries out the data migration from the RDBMS to the NoSQL data warehouse, the structure of the NoSQL database, and how it can be useful for upcoming data analysis.
In contrast, another study uses partitioning and compartmentalization techniques to construct a denormalized model-based Big Data warehouse [29]. For their part, the authors of [25] propose a novel strategy for data integration in the Big Data warehouse. This method, known as Mapping-ELT (M-ELT), is founded on the processing of fundamental ELT operations and takes semantic heterogeneity into consideration.
The work released in [32] makes a fresh suggestion for the conception and application of big data warehouses in the context of smart cities. The suggested method considers the gathering, preparation, and enrichment of data that arrives in batches and via streaming mechanisms, as well as the output of data mining algorithms and simulation models [32]. In a different study [28], a strategy for query optimization in massive data warehouses is adopted. The suggested method chooses a group of materialized views to target the physical structure of the big data warehouse.
Nuno Silva et al. [15], on the other hand, chose a strategy to implement a big data warehouse in the supply chain. They provided the technological and logical architectures required for its implementation.
The authors provide a novel framework for data warehouse queries that consists of a storage model and a tailored query processing model, given that a query in a data warehouse can be broken up into a huge number of separate subtasks and managed by a large-scale computing cluster. To optimize OLAP queries with star joins, Y. Ramdane et al. [30] suggested a data storage model in Hadoop. The selected model offers a fresh approach to data placement in the Apache Hadoop environment that enables a star join operation to be completed in a single Spark transaction.
Currently, the design of big data warehouses places equal emphasis on the logical and physical layers, which are represented by the data models and infrastructure, respectively [33], [34]. The concept is new, as evidenced by the state of the art. The literature introduces two modeling approaches for storage and processing in the context of big data warehouses [33], [35], [36]. The first approach, dubbed "lift and shift," entails expanding the capabilities of conventional data warehouses using big data technologies like Hadoop and NoSQL databases. The second tactic, known as "rip and replace," suggests a scenario in which big data technologies totally replace a conventional data warehouse. Figure 2 illustrates the general architecture of the big data warehouse proposed in [37].

Figure 2. Conceptual big data warehouse architecture [37]

3. BENCHMARKING STUDY
The continual generation of vast amounts of data by today's digital information systems necessitates the implementation of platforms with the ability to store and handle massive data, while also taking into account its volume, speed, diversity, and validity [38]. Data management technologies have
therefore progressed from structured databases to big data storage systems, massive data warehouses, and data lakes, but each solution has advantages and disadvantages.
We give a comparison of these various architectures in the sections that follow.

3.1 Data lake vs. data warehouse
Although data lakes and warehouses are both used to store an organization's data, each has benefits and drawbacks.
Data. Data lakes and data warehouses store different types of data and analyze it in different ways. Data lakes contain data in its unprocessed state, devoid of any schema or structure. This makes it possible to store a wide range of data types in a single location, including social network posts, log files, pictures, and videos. Additionally, a lot of enterprise data is unstructured, which data warehouses cannot handle.
Architecture. Data lakes employ a flat design that makes it easier to add and remove a data source, in contrast to data warehouses, which push data to the user in the form of data marts in accordance with a predefined format. There are metadata tags and a specific identifier for each data element. Although a precisely specified structure for handling various types and forms of data is not required for data lakes, the order of data arrival time must be maintained [39]. Meanwhile, the presence of a comprehensive collection of substantial metadata ensures the agile and effective management of the data stored in the lake. This enables the utilization of the data contained in the data lake in a very flexible and simply adjustable manner [40].
Processing. Schema-on-write refers to a behavior in which data that is destined for the data warehouse must be processed in order to assign it to a structure in accordance with a specified model. Schema-on-read refers to the practice of processing and modeling data at read time while it is still in its raw form and intended for the data lake [39].
Access. Although data lakes are open to all users, only data scientists are equipped to do in-depth analyses on the lake's data. Data warehouses, on the other hand, are utilized by specialized business users to create reports and extract analytical data, but they do not satisfy data scientists, who need to venture outside the data warehouse's limits to gather additional data for analysis.
Security. Since data warehouses have been around for close to 30 years, they are more secure thanks to their experience and maturity. They implement role-based access privileges and fine-grained security policies. This method ensures efficient user access management while enabling the construction of sophisticated user access models. Data lake security, in contrast, is still under development; these assurance gaps result from the fact that current data lakes concentrate on storing heterogeneous data without considering how or why data is utilized, managed, defined, or secured [41]. As a result, this subject has been the focus of various research works [42].
Agility. The data warehouse keeps structured data. This results in low agility, because any change that affects the data warehouse model necessitates a reconfiguration of the data warehouse. Data lakes, on the other hand, do not adhere to any structure and, as a result, do not impose a fixed configuration.
Cost. Since data lakes do not need as much organization and structure and do not need additional hardware or software, they are typically less expensive to set up and maintain than data warehouses.
Despite their tremendous workload, relational data warehouses have long dominated analytics and decision-making. However, the growth and variety of big data have outpaced their structured data integration approach. Due to their design and poor tolerance for human error, these systems are very dependent on IT. Data lakes are an addition to or a replacement for data warehouses, not the other way around. Data lakes should therefore be viewed as extensions of the BI infrastructure [8].
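The schema-on-write versus schema-on-read distinction described under "Processing" above can be illustrated with a minimal sketch (Python is used purely for illustration; the schema, field names, and helper functions are hypothetical, not drawn from any cited system):

```python
import json

# Hypothetical record schema used only for this example.
SCHEMA = {"user_id": int, "amount": float}

def write_to_warehouse(record, table):
    """Schema-on-write: validate and coerce each field before storing;
    malformed records are rejected at load time."""
    typed = {}
    for field, ftype in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        typed[field] = ftype(record[field])
    table.append(typed)

def write_to_lake(raw_line, lake):
    """Schema-on-read: the raw line is stored untouched."""
    lake.append(raw_line)

def read_from_lake(lake):
    """Structure is imposed only when the data is read."""
    out = []
    for line in lake:
        rec = json.loads(line)
        out.append({f: t(rec[f]) for f, t in SCHEMA.items()})
    return out

warehouse, lake = [], []
write_to_warehouse({"user_id": "7", "amount": "19.5"}, warehouse)
write_to_lake('{"user_id": 7, "amount": 19.5, "extra": "kept"}', lake)
assert warehouse == [{"user_id": 7, "amount": 19.5}]
assert read_from_lake(lake) == [{"user_id": 7, "amount": 19.5}]
```

Note that the raw line in the lake still carries the `extra` field that the current schema ignores, so a future schema could recover it, whereas the warehouse path discards anything outside its model at load time.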
3.2 Big data warehouse vs Data warehouse
The volume, diversity, and velocity of big data severely restrict the utilization of traditional data warehouses. Their rigid relational nature, expensive scalability, and occasionally ineffective performance make emerging methods and technologies necessary. The idea of big data warehousing is currently growing in acceptance because it provides fresh approaches to tackling big data problems [35]. It has also drawn considerable interest from the scientific community, highlighting the necessity of redesigning the conventional data warehouse in order to achieve new features applicable in big data environments [34].
Despite the fact that both the data warehouse and the big data warehouse are used to store data, they are distinct in the following ways:
Data. Only consistent data that is organized using a certain model is stored in traditional data warehouses. Big data warehouses, on the other hand, hold both structured data and heterogeneous raw data, including sensor data, audio, video, image, and JSON files.
Volume. The volume of data that each type of warehouse can hold is one of the key distinctions between them. The volume, diversity, and velocity of big data are too great for traditional data warehouses, which are made to manage vast amounts of structured data. Big data warehouses, in contrast, are made to manage massive data from many sources and have storage capacity that exceeds petabytes.
Analytics. Big data warehouse architecture takes advantage of cutting-edge AI-based analytics. By evaluating data from many sources, it gives organizations a thorough and in-depth perspective of their business, enabling them to make the necessary predictions and enhance system performance. Traditional data warehouses provide analytical data as well, but only on the basis of sparse data. As a result, they do not permit the use of sophisticated instruments that need a substantial amount of data, which means that the analytical data generated falls short of fully revealing the company's business process.
Flexibility. Both kinds of data warehouses give stakeholders access to analytical data, but big data warehouses are favored because they generate insights that transcend the enterprise and address many categories of decision-makers.
Cost. Compared to a big data warehouse, a standard data warehouse may be more expensive to install and operate. Traditional data warehouses need specialized hardware and software, and before the data can be used, it must be converted and structured. On the other hand, because it does not need the same amount of structure and organization, a massive data warehouse is typically less expensive to set up and operate.

3.3 Big data warehouse vs Data lake
Organizations utilize data lakes and big data warehouses as storage areas and as ways of handling enormous amounts of data to aid in data analytics and decision-making. While both approaches have advantages, organizations must weigh the pros and cons of each before deciding which is best for their information systems.
Data processing. A data lake is a centralized location created for the large-scale archival of organized, semi-structured, and unstructured data. Data in a data lake is often kept in its original format. Because of this, data lakes are perfect for businesses that need to store and analyse massive amounts of data from many sources while also keeping the raw data for future usage. A big data warehouse, on the other hand, is a particular kind of data warehouse created to manage big data. It is described as an ETL process that involves erasing, customizing, reformatting, integrating, and inserting data into a conventional data warehouse [43]. As a result, it is designed to store and handle huge amounts of organized, semi-structured, and unstructured data. In contrast to a data lake, a big data warehouse often transforms, purifies, and organizes data to support analysis.
Data control. The degree of control over the data is one of the key contrasts between a data lake and a big data warehouse. There is less control over the data in a data lake because it is stored in its raw form. Because of this, data lakes are perfect for businesses that need to store a lot of data, but they are less appropriate for mission-critical applications where data integrity and dependability are crucial. A big data warehouse, on the other hand, offers a high level of control over the data through access control, data governance, and clearly defined data architectures. Big data warehouses are therefore the best option for businesses that require structured and organized data for analysis and decision-making.
Flexibility and scalability. Their levels of flexibility and scalability are another contrast between the two. With less control over the data, data lakes give greater freedom and autonomy. In contrast, big data warehouses typically have stricter data management policies, including access control, data governance, and clearly specified data architecture. Big data warehouses are often more expensive to create and maintain than data lakes. This is because less specialized hardware and software are needed for data lakes, and the data is not processed before usage. Conversely, big data warehouses typically cost more to set up and maintain because they need specialized hardware and software and because the data must first be converted and sorted before it can be used.
Extraction velocity. The speed of data extraction is another important distinction between the two. Because the data must first be converted and organized before it can be evaluated, data extraction from a data lake can be slower than data extraction from a big data warehouse. On the other hand, since the data has already been processed, cleansed, and organized, extracting data from a big data warehouse is typically quicker.
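The "Data processing" contrast above — a big data warehouse transforms, purifies, and organizes data before loading, while a lake lands it raw — can be sketched with a toy pipeline (the row format, fields, and helper name are invented for the example):

```python
def etl_load(raw_rows, warehouse):
    """Sketch of the transform step applied before loading a warehouse:
    cleanse (drop malformed rows), normalize values, and deduplicate."""
    seen = set()
    for row in raw_rows:
        parts = row.strip().split(",")
        if len(parts) != 2:
            continue                      # cleanse: drop malformed rows
        product, qty = parts[0].lower(), parts[1]
        if not qty.isdigit():
            continue                      # cleanse: drop non-numeric quantities
        key = (product, int(qty))
        if key in seen:
            continue                      # deduplicate normalized rows
        seen.add(key)
        warehouse.append({"product": product, "qty": int(qty)})

raw = ["Widget,3", "widget,3", "broken-row", "Gadget,five", "gadget,2"]
warehouse = []
etl_load(raw, warehouse)
assert warehouse == [{"product": "widget", "qty": 3},
                     {"product": "gadget", "qty": 2}]
```

A data lake would simply append all five raw strings as-is; the warehouse path trades that raw fidelity for the data control and faster extraction discussed above.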
The decision between a data lake and a big data warehouse ultimately comes down to the particular requirements of the enterprise. Organizations that need to store significant amounts of data from numerous sources and want to keep corporate data are best served by data lakes. Big data warehouses are excellent choices for businesses that need to rely on a lot of data to make decisions.
workloads, simplicity of versioning, governance, and data security.
table schema, partition details, and the manifest list path. 2) There is an entry for each manifest file connected to the snapshot in the manifest list. 3) The manifest file includes a list of the locations of the linked data files. 4) The information is stored in a physical data file that is written in formats like Parquet, ORC, and others [49].

c. Apache Hudi
Similar to Apache Iceberg and Delta Lake, Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a framework made to speed up incremental processing on top of data file systems. In situations where only data collected over a period of time should be recovered, Apache Hudi focuses on stream data optimization and capturing data changes to speed up streaming data intake and analysis. By processing just fresh data and avoiding the reprocessing of old data, incremental processing helps improve query performance [50].
Hudi offers two methods for changing data tables: copy on write and merge on read [51]:
The Copy-On-Write (CoW) technique locates the records that need to be updated in the files and eagerly rewrites them to new files with the changed data, resulting in high write amplification but no read amplification.
The Merge-On-Read (MoR) technique does not require rewriting of any files. Instead, it delays the reconciliation until query time and writes out information about record-level changes in other files, resulting in little write amplification.
These three storage options address a number of issues that arise frequently while working with data lakes [52]: (i) atomic transactions, which make sure that the data is not left in an inconsistent state if an operation fails; (ii) consistent updates, which stop reads from inconsistent states; and (iii) scalability for the data and metadata. Furthermore, they all provide comparable functionality, including upserts, deletes, transaction support, time travel, SQL read/write, streaming ingestion, metadata scalability, and more. Since Apache Spark is the main requirement of each platform, all of these storage systems are essentially comparable in that they allow write and read operations from Spark.
Despite the many similarities between these three storage systems, Delta Lake has consistently come out on top in comparison studies, especially in terms of performance and integration [51], [53]–[55]. Table 2 lists the findings of a few earlier investigations.

Table 2: Comparison between Delta Lake, Apache Iceberg and Apache Hudi.

Feature          | Delta Lake | Apache Iceberg | Apache Hudi
Data model       | Log-based  | Table-based    | Log-based
Storage formats  | Parquet    | Parquet, ORC   | Parquet, AVRO
Upsert support   | Basic      | Basic          | Advanced
ACID compliance  | Yes        | Yes            | Yes
Time travel      | Yes        | Yes            | Yes
Integration      | Very good  | Limited        | Good
Compaction       | Yes        | No             | Yes
Object storage   | Yes        | Yes            | No
Caching          | Yes        | No             | No
Evolution        | Yes        | Yes            | Yes
Performance      | Best       | -              | -

3.4.2 Data LakeHouse Frameworks
Several frameworks can be used for Data LakeHouses. Although HDFS is the most widely used framework, other, more flexible systems exist, such as Amazon S3.
In [56], the authors carry out a comparative study of the two storage frameworks, comparing cost, elasticity, SLA (availability and durability), performance, and transactional writes. The authors conclude that S3 and cloud storage offer elasticity, with availability and durability an order of magnitude higher and performance twice as high, at a cost ten times lower than that of traditional HDFS data storage clusters. However, with S3, all reads must pass through the network, which precludes performance optimization and represents a serious drawback.
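The Copy-on-Write and Merge-on-Read strategies described in Section 3.4 can be sketched in a few lines (a toy model, not Hudi's actual implementation; files are modeled as dictionaries and the delta log as a list of record-level changes):

```python
def copy_on_write(base_files, updates):
    """CoW sketch: any file containing an updated key is eagerly
    rewritten in full (high write amplification, no read cost)."""
    new_files = []
    for f in base_files:
        if any(k in f for k in updates):
            new_files.append({k: updates.get(k, v) for k, v in f.items()})
        else:
            new_files.append(f)           # untouched files are kept as-is
    return new_files

def merge_on_read(base_files, delta_log, key):
    """MoR sketch: updates are appended to a delta log and reconciled
    only at query time (low write amplification, extra read cost)."""
    merged = {}
    for f in base_files:
        merged.update(f)
    for k, v in delta_log:                # replay record-level changes
        merged[k] = v
    return merged.get(key)

files = [{"r1": "a", "r2": "b"}, {"r3": "c"}]
# CoW: updating r1 rewrites the whole first file.
assert copy_on_write(files, {"r1": "A"})[0] == {"r1": "A", "r2": "b"}
# MoR: the write is just a log append; the merge happens on read.
log = [("r1", "A")]
assert merge_on_read(files, log, "r1") == "A"
assert merge_on_read(files, log, "r3") == "c"
```

CoW pays the rewrite cost at write time; MoR defers it to each read — exactly the write-amplification trade-off the text describes.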
3.5 Synthesis
We performed a literature review on various data storage, processing, and analysis architectures in the previous sections. The properties of the different storage architectures were then compared. The decision to use a given design ultimately depends on the specific needs and goals of the information systems, because each, despite its ability to store, process, and analyze data, has unique advantages and disadvantages.
According to the comparative study presented in Table 1, the lakehouse and the big data warehouse have the same characteristics, which means that the lakehouse can be considered a big data warehouse. On the other hand, the LakeHouse offers the scalability and flexibility of data lakes while maintaining the structure and control of data warehouses. Therefore, in light of the comparative data shown in Table 1, the LakeHouse remains the best choice for businesses that need to process, store, and analyze huge amounts of structured, semi-structured, and unstructured data.

4. DATA LAKEHOUSE OPTIMIZATION TECHNIQUES

Organizations can store vast amounts of unstructured, structured, and semi-structured data in a Lakehouse, which combines aspects of data lakes and data warehouses and enables quick, scalable analytics. Additionally, it inherits robust governance and auditability from data warehouses as well as streaming workloads from data lakes [57].
It is essential to optimize performance in a Lakehouse setting, which entails enhancing query performance, cutting expenses, and ensuring that data-driven insights are readily available [57]. Several methods are employed to this end; we list the most popular ones here:
Management of metadata [58]. In order to ensure affordable storage without sacrificing governance and management features, the LakeHouse uses a transactional metadata storage layer on top of the cloud object store [58]. Metadata includes details on the data stored in the Lakehouse, such as statistics, data graphing, and schema. Data discovery and accessibility are made easier by a well-structured data catalog with adequate metadata tagging and search capabilities. Users can find the information they require quickly, saving time spent looking for pertinent data.
Indexing. By skipping unnecessary data, indexing primarily aims to reduce query execution time. Global and local indexes are the two types of indexes employed. Because the two types are unrelated, a system may contain both or just one, and the index types may vary depending on the level [59]. Some systems solely use local indexes located in the slave nodes of a distributed environment, while others add a global index located in the master node to speed up local query processing and reduce the number of trips to the master node [53]. By enabling direct access to particular rows or columns without having to scan the entire dataset, indexes can dramatically improve query performance. Thoughtful analysis of the trade-offs between query performance and storage costs is necessary for index administration.
Data compaction and pruning [60]. Data lakes can amass a considerable volume of historical data over time. By eliminating or consolidating duplicate or obsolete data, data pruning and compaction procedures help save storage costs and enhance query performance. Data retention policies and temporal partitioning are two common techniques.
Caching. Keeping frequently accessed data or query results in memory helps speed up responses to repeated queries. Whether cached files are still readable can be simply determined by running transactions. A transcoded format for the cache is another option, which is more effective for query engine execution [57]. In-memory caching can be especially useful for read-heavy workloads because it reduces the need to contact the underlying storage.
Parallelism. Parallelism greatly facilitates the creation and administration of query workloads on clusters. To prevent resource conflicts and provide consistent performance, managing concurrent queries and resource allocation is essential.
Performance monitoring. Gathering metrics on how queries are executed, together with query profiling, can reveal performance bottlenecks and potential areas for improvement. It helps locate resource-intensive or lengthy queries that can benefit from optimization.
Partitioning. Partitioning is one of the most widely used optimization approaches. It provides easy accessibility, better scalability, and lower CPU resource usage. When partitioning data in a distributed environment, it is important to consider the types of frequent queries that will be applied to the data as well as the processing demands [53]. Since each data partition will be assigned to a compute node that will perform a specific portion of the query, this will also minimize inter-node exchanges and lower the quantity of data visited throughout the processing phase.
There are two different types of partitioning described in the literature: space partitioning and data partitioning [61]. The analysis performed on such data typically focuses more on geometric objects than on attributes; therefore, the first type entails combining spatial data that is geographically close together into the same partitions. The data distribution pattern in the cluster is carried by the disk as well as by the data partitioning. Three data partitioning techniques, STR, STR+, and K-d Tree, were mentioned by the authors in [61].
8
workers, including the supervisor (driver), and loaded an 800 GB CSV file for testing.
First, we executed a load of 100 different queries and measured performance before and after partitioning. Then we varied the query load to assess the impact of incremental partitioning on performance. Figure 4 shows the results obtained.

Figure 4. Load query performance per partitioning type

5. OPEN RESEARCH CHALLENGES AND FUTURE DIRECTIONS
Big Data has properties that are beyond the scope of conventional approaches, especially when data is kept in a distributed setting that necessitates parallel processing tools such as the MapReduce paradigm. Due to these restrictions, new technologies with specific features and enhanced capabilities have emerged, such as the Hadoop Distributed File System (HDFS), Cassandra, and MongoDB. These scalable systems also provide high availability and large-scale data processing. Processing is a crucial component of the Big Data universe at the storage stage: it entails processing the data so that it is ready for the following step. New processing technologies such as Hadoop and Spark have been created in response to the functional limitations of conventional systems. These solutions allow businesses to process enormous amounts of data swiftly, effectively, and concurrently.
The analysis phase is the last step, where data analysis is performed in order to draw informed conclusions. A variety of analysis tools are employed in this context, including capabilities that let analysts create interactive dashboards giving businesses a holistic view of the market.
Researchers face a challenge when trying to improve any of the three phases outlined above. Although big data analysis already uses machine learning and artificial intelligence (AI), we intend to propose a new architecture for the optimal storage and processing of big data. To do this, we intend to create an intelligent architecture that merges the LakeHouse's capabilities with machine learning and artificial intelligence. Our vision will enable AI-based incremental partitioning of LakeHouse data and metadata.

6. CONCLUSION
The data warehouse continues to play a key role in business intelligence (BI), even as big data technologies drive data processing. As a result, it is possible to create a variety of hybrid designs, such as the Data LakeHouse, by combining Big Data technologies with traditional data warehouses. This new technology integrates two key components: data processing and BI maturity.
We have discussed various big data storage and processing architectures and compared their main characteristics. Our comparative study allows us to conclude that the Data LakeHouse today represents the best choice for companies needing to store, process, and analyze enormous quantities of raw, structured, and semi-structured data.
In our experimental study, we demonstrated the remarkable impact of data partitioning on system performance. We also studied two types of partitioning techniques, namely static and incremental partitioning.
In our future work, we intend to include optimization techniques to improve the performance of the Data LakeHouse, which may degrade as a result of the exponential increase in the volume of data injected into it.

REFERENCES:
[1] M. Bala, O. Boussaid, and Z. Alimazighi, "A Fine-Grained Distribution Approach for ETL Processes in Big Data Environments", Data Knowl. Eng., vol. 111, pp. 114-136, Sept. 2017, doi: 10.1016/j.datak.2017.08.003.
[2] W. X. B. Granda, F. Molina-Granja, J. D. Altamirano, M. P. Lopez, S. Sureshkumar, and J. N. Swaminathan, "Data Analytics for Healthcare Institutions: A Data Warehouse Model Proposal", in Inventive Communication and Computational Technologies, G. Ranganathan, X. Fernando, and Á. Rocha, Eds., in Lecture Notes in Networks and Systems. Singapore: Springer Nature, 2023, pp. 155-163, doi: 10.1007/978-981-19-4960-9_13.
[3] Z. Bicevska and I. Oditis, "Towards NoSQL-based Data Warehouse Solutions", Procedia Comput. Sci., vol. 104, pp. 104-111, Jan. 2017, doi: 10.1016/j.procs.2017.01.080.
[4] L. Oukhouya, A. E. Haddadi, B. Er-raha, and H. Asri, "A generic metadata management model for heterogeneous sources in a data warehouse", E3S Web Conf., vol. 297, p. 01069, 2021, doi: 10.1051/e3sconf/202129701069.
[5] A. Nambiar and D. Mundra, "An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management", Big Data Cogn. Comput., vol. 6, no. 4, Dec. 2022, doi: 10.3390/bdcc6040132.
[6] C. Giebler, C. Gröger, E. Hoos, H. Schwarz, and B. Mitschang, "Leveraging the Data Lake: Current State and Challenges", in Big Data Analytics and Knowledge Discovery, C. Ordonez, I.-Y. Song, G. Anderst-Kotsis, A. M. Tjoa, and I. Khalil, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2019, pp. 179-188, doi: 10.1007/978-3-030-27520-4_13.
[7] I. G. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino, "Data Wrangling: The Challenging Journey from the Wild to the Lake", in CIDR, Asilomar, 2015.
[8] M. R. Llave, "Data lakes in business intelligence: reporting from the trenches", Procedia Comput. Sci., vol. 138, pp. 516-524, Jan. 2018, doi: 10.1016/j.procs.2018.10.071.
[9] C. Walker and H. Alrehamy, "Personal Data Lake with Data Gravity Pull", in 2015 IEEE Fifth International Conference on Big Data and Cloud Computing, Aug. 2015, pp. 160-167, doi: 10.1109/BDCloud.2015.62.
[10] R. L. Grossman, "Data Lakes, Clouds, and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data", Trends Genet., vol. 35, no. 3, pp. 223-234, Mar. 2019, doi: 10.1016/j.tig.2018.12.006.
[11] R. Hai, S. Geisler, and C. Quix, "Constance: An Intelligent Data Lake System", in Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA: ACM, June 2016, pp. 2097-2100, doi: 10.1145/2882903.2899389.
[12] G. Phillips-Wren, L. S. Iyer, U. Kulkarni, and T. Ariyachandra, "Business Analytics in the Context of Big Data: A Roadmap for Research", Commun. Assoc. Inf. Syst., vol. 37, 2015, doi: 10.17705/1CAIS.03723.
[13] J. Ziegler, P. Reimann, F. Keller, and B. Mitschang, "A Graph-based Approach to Manage CAE Data in a Data Lake", Procedia CIRP, vol. 93, pp. 496-501, Jan. 2020, doi: 10.1016/j.procir.2020.04.155.
[14] M. E. Houari, M. Rhanoui, and B. E. Asri, "Hybrid big data warehouse for on-demand decision needs", in 2017 International Conference on Electrical and Information Technologies (ICEIT), Nov. 2017, pp. 1-6, doi: 10.1109/EITech.2017.8255261.
[15] N. Silva et al., "Advancing Logistics 4.0 with the Implementation of a Big Data Warehouse: A Demonstration Case for the Automotive Industry", Electronics, vol. 10, no. 18, p. 2221, Sept. 2021, doi: 10.3390/electronics10182221.
[16] V. M. Ngo, N.-A. Le-Khac, and M.-T. Kechadi, "Designing and Implementing Data Warehouse for Agricultural Big Data", in Big Data – BigData 2019, K. Chen, S. Seshadri, and L.-J. Zhang, Eds., in Lecture Notes in Computer Science, vol. 11514. Cham: Springer International Publishing, 2019, pp. 1-17, doi: 10.1007/978-3-030-23551-2_1.
[17] "The Next-Generation EDW Is The Big Data Warehouse", Forrester. Accessed: Aug. 21, 2023. [Online]. Available: https://www.forrester.com/report/The-NextGeneration-EDW-Is-The-Big-Data-Warehouse/RES128005
[18] L. Sautot, S. Bimonte, and L. Journaux, "A Semi-Automatic Design Methodology for (Big) Data Warehouse Transforming Facts into Dimensions", IEEE Trans. Knowl. Data Eng., vol. 33, no. 1, pp. 28-42, Jan. 2021, doi: 10.1109/TKDE.2019.2925621.
[19] Y. Ramdane, O. Boussaid, D. Boukraà, N. Kabachi, and F. Bentayeb, "Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance", Parallel Comput., vol. 111, p. 102918, July 2022, doi: 10.1016/j.parco.2022.102918.
[20] S. Benkrid, L. Bellatreche, Y. Mestoui, and C. Ordonez, "PROADAPT: Proactive framework for adaptive partitioning for big data warehouses", Data Knowl. Eng., vol. 142, p. 102102, Nov. 2022, doi: 10.1016/j.datak.2022.102102.
[21] C.-H. Chang, F.-C. Jiang, C.-T. Yang, and S.-C. Chou, "On construction of a big data warehouse accessing platform for campus power usages", J. Parallel Distrib. Comput., vol. 133, pp. 40-50, Nov. 2019, doi: 10.1016/j.jpdc.2019.05.011.
[22] S. Benkrid, Y. Mestoui, L. Bellatreche, and C. Ordonez, "A Genetic Optimization Physical Planner for Big Data Warehouses", in 2020 IEEE International Conference on Big Data (Big Data), Dec. 2020, pp. 406-412, doi: 10.1109/BigData50022.2020.9378196.
[23] E. Costa, C. Costa, and M. Y. Santos, "Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems", J. Big Data, vol. 6, no. 1, p. 34, May 2019, doi: 10.1186/s40537-019-0196-1.
[24] K. Smelyakov, A. Chupryna, D. Sandrkin, and M. Kolisnyk, "Search by Image Engine for Big Data Warehouse", in 2020 IEEE Open Conference of Electrical, Electronic and
Information Sciences (eStream), Vilnius, Lithuania: IEEE, Apr. 2020, pp. 1-4, doi: 10.1109/eStream50540.2020.9108782.
[25] I. Hilali, N. Arfaoui, and R. Ejbali, "A new approach for integrating data into big data warehouse", in Fourteenth International Conference on Machine Vision (ICMV 2021), SPIE, Mar. 2022, pp. 475-480, doi: 10.1117/12.2623069.
[26] H. Wang et al., "Efficient query processing framework for big data warehouse: an almost join-free approach", Front. Comput. Sci., vol. 9, no. 2, pp. 224-236, Apr. 2015, doi: 10.1007/s11704-014-4025-6.
[27] B. Malysiak-Mrozek, J. Wieszok, W. Pedrycz, W. Ding, and D. Mrozek, "High-Efficient Fuzzy Querying With HiveQL for Big Data Warehousing", IEEE Trans. Fuzzy Syst., vol. 30, no. 6, pp. 1823-1837, June 2022, doi: 10.1109/TFUZZ.2021.3069332.
[28] M. Kechar, L. Bellatreche, and S. Nait-Bahloul, "ZigZag+: A global optimization algorithm to solve the view selection problem for large-scale workload optimization", Eng. Appl. Artif. Intell., vol. 115, p. 105251, Oct. 2022, doi: 10.1016/j.engappai.2022.105251.
[29] A. Shahid, T.-A. N. Nguyen, and M.-T. Kechadi, "Big Data Warehouse for Healthcare-Sensitive Data Applications", Sensors, vol. 21, no. 7, p. 2353, Mar. 2021, doi: 10.3390/s21072353.
[30] Y. Ramdane, N. Kabachi, O. Boussaid, and F. Bentayeb, "SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop", in Big Data Analytics and Knowledge Discovery, C. Ordonez, I.-Y. Song, G. Anderst-Kotsis, A. M. Tjoa, and I. Khalil, Eds., in Lecture Notes in Computer Science, vol. 11708. Cham: Springer International Publishing, 2019, pp. 189-205, doi: 10.1007/978-3-030-27520-4_14.
[31] A. A. Alekseev, V. V. Osipova, M. A. Ivanov, A. Klimentov, N. V. Grigorieva, and H. S. Nalamwar, "Efficient data management tools for the heterogeneous big data warehouse", Phys. Part. Nucl. Lett., vol. 13, no. 5, pp. 689-692, Sept. 2016, doi: 10.1134/S1547477116050022.
[32] C. Costa and M. Y. Santos, "The SusCity Big Data Warehousing Approach for Smart Cities", in Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS 2017), Bristol, United Kingdom: ACM Press, 2017, pp. 264-273, doi: 10.1145/3105831.3105841.
[33] P. Russom, "Evolving data warehouse architectures in the age of big data", Data Wareh. Inst., 2014.
[34] M. Y. Santos et al., "Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware", in Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS '17), New York, NY, USA: Association for Computing Machinery, July 2017, pp. 242-252, doi: 10.1145/3105831.3105842.
[35] C. Costa and M. Y. Santos, "Evaluating Several Design Patterns and Trends in Big Data Warehousing Systems", in Advanced Information Systems Engineering, J. Krogstie and H. A. Reijers, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 459-473, doi: 10.1007/978-3-319-91563-0_28.
[36] P. Russom, "Data warehouse modernization in the age of big data analytics", Data Wareh. Inst., 2016.
[37] R. K. Bathla and S. G., "Research Analysis of Big Data and Cloud Computing with Emerging Impact of Testing", Int. J. Eng. Technol., vol. 7, pp. 239-243, Aug. 2018.
[38] A. A. Harby and F. Zulkernine, "From Data Warehouse to Lakehouse: A Comparative Review", in 2022 IEEE International Conference on Big Data (Big Data), Dec. 2022, pp. 389-395, doi: 10.1109/BigData55660.2022.10020719.
[39] N. Miloslavskaya and A. Tolstoy, "Application of Big Data, Fast Data, and Data Lake Concepts to Information Security Issues", in 2016 IEEE 4th International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), Aug. 2016, pp. 148-153, doi: 10.1109/W-FiCloud.2016.41.
[40] P. Lo Giudice, L. Musarella, G. Sofo, and D. Ursino, "An approach to extracting complex knowledge patterns among concepts belonging to structured, semi-structured and unstructured sources in a data lake", Inf. Sci., vol. 478, pp. 606-626, Apr. 2019, doi: 10.1016/j.ins.2018.11.052.
[41] P. P. Khine and Z. S. Wang, "Data lake: a new ideology in big data era", ITM Web Conf., vol. 17, p. 03025, 2018, doi: 10.1051/itmconf/20181703025.
[42] "Data Lake Governance Best Practices", DZone. Accessed: Aug. 23, 2023. [Online]. Available: https://dzone.com/articles/data-lake-governance-best-practices
[43] S.-C. Chou, C.-T. Yang, F.-C. Jiang, and C.-H. Chang, "The Implementation of a Data-Accessing Platform Built from Big Data Warehouse of Electric Loads", in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), July
2018, pp. 87-92, doi: 10.1109/COMPSAC.2018.10208.
[44] O. Azeroual, J. Schöpfel, D. Ivanovic, and A. Nikiforova, "Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS", Procedia Comput. Sci., vol. 211, pp. 3-16, Jan. 2022, doi: 10.1016/j.procs.2022.10.171.
[45] J. Kutay, "Data Warehouse vs. Data Lake vs. Data Lakehouse: An Overview of Three Cloud Data Storage Patterns", Striim. Accessed: Aug. 23, 2023. [Online]. Available: https://www.striim.com/blog/data-warehouse-vs-data-lake-vs-data-lakehouse-an-overview/
[46] Z. Chen, H. Shao, Y. Li, H. Lu, and J. Jin, "Policy-based access control system for delta lake", in 2022 Tenth International Conference on Advanced Cloud and Big Data (CBD), IEEE, 2022, pp. 60-65.
[47] M. Armbrust et al., "Delta Lake: high-performance ACID table storage over cloud object stores", Proc. VLDB Endow., vol. 13, no. 12, pp. 3411-3424, Aug. 2020, doi: 10.14778/3415478.3415560.
[48] "Apache Iceberg". Accessed: Sept. 3, 2023. [Online]. Available: https://iceberg.apache.org/
[49] V. Belov and E. Nikulchev, "Analysis of Big Data Storage Tools for Data Lakes based on Apache Hadoop Platform", Int. J. Adv. Comput. Sci. Appl. (IJACSA), vol. 12, no. 8, 2021, doi: 10.14569/IJACSA.2021.0120864.
[50] "Hello from Apache Hudi", Apache Hudi. Accessed: Sept. 3, 2023. [Online]. Available: https://hudi.apache.org/
[51] P. Jain, P. Kraft, C. Power, T. Das, I. Stoica, and M. Zaharia, "Analyzing and Comparing Lakehouse Storage Systems", in CIDR, 2023.
[52] L. Gagliardelli et al., "A big data platform exploiting auditable tokenization to promote good practices inside local energy communities", Future Gener. Comput. Syst., vol. 141, pp. 595-610, Apr. 2023, doi: 10.1016/j.future.2022.12.007.
[53] S. Ait Errami, H. Hajji, K. Ait El Kadi, and H. Badir, "Spatial big data architecture: From Data Warehouses and Data Lakes to the LakeHouse", J. Parallel Distrib. Comput., vol. 176, pp. 70-79, June 2023, doi: 10.1016/j.jpdc.2023.02.007.
[54] DataBeans, "Delta vs Iceberg vs Hudi: Reassessing Performance", Medium. Accessed: Sept. 4, 2023. [Online]. Available: https://databeans-blogs.medium.com/delta-vs-iceberg-vs-hudi-reassessing-performance-cb8157005eb0
[55] F. Hellman, "Study and Comparison of Data Lakehouse Systems", 2023.
[56] "Top 5 Reasons for Choosing S3 over HDFS", Databricks. Accessed: Sept. 24, 2023. [Online]. Available: https://www.databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
[57] M. Armbrust, A. Ghodsi, R. Xin, and M. Zaharia, "Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics", in Proceedings of CIDR, 2021.
[58] "Data Lakehouse Architecture and AI Company", Databricks. Accessed: Sept. 20, 2023. [Online]. Available: https://www.databricks.com/
[59] A. Eldawy and M. F. Mokbel, "The era of big spatial data", in 2015 31st IEEE International Conference on Data Engineering Workshops, Apr. 2015, pp. 42-49, doi: 10.1109/ICDEW.2015.7129542.
[60] A. Behm et al., "Photon: A Fast Query Engine for Lakehouse Systems", in Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA: ACM, June 2022, pp. 2326-2339, doi: 10.1145/3514221.3526054.
[61] A. Eldawy, L. Alarabi, and M. F. Mokbel, "Spatial partitioning techniques in SpatialHadoop", Proc. VLDB Endow., vol. 8, no. 12, pp. 1602-1605, 2015.
Table 1: Comparison between DW, Big DW, Data Lake and Data LakeHouse

| Feature | Data Lake | Data Warehouse | Big Data Warehouse | Data LakeHouse |
|---|---|---|---|---|
| Data Storage | Raw data in original form | Structured data | Structured and unstructured data | Structured and unstructured data |
| Schema | On-read | On-write | On-read/On-write | On-read/On-write |
| Data Integration Tools | EL | ETL | EL/ETL | EL/ETL |
| Data Processing | Batch and real-time data processing | Batch data processing | Batch and real-time data processing | Batch and real-time data processing; data management and governance |
| Data Integration | Data silos | Eliminates data silos | Eliminates data silos; ensures data consistency | Eliminates data silos; data consistency and accuracy |
| Data Management | Less control | High control | High control | High control |
| Scalability | Highly scalable | Limited | Extremely scalable | Extremely scalable |
| Cost | Less expensive | More expensive | More expensive | |
| Queries | Ad hoc | Predefined | Ad hoc and predefined | Ad hoc and predefined |
| Use Cases | Storage of large amounts of raw data for later use | Structured data storage; data reporting and analysis | Structured and unstructured data storage; processing and analysis | Structured and unstructured data storage; processing and analysis; focus on data quality and consistency |
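As an aside to the spatial partitioning discussion above, the STR (Sort-Tile-Recursive) technique cited from [61] can be sketched in a few lines. The following standalone Python snippet is illustrative only (the experiments in this paper used Spark and Delta Lake, not this code); the function name `str_partition` and the toy point grid are our own choices, not artifacts of [61]. It shows the core idea: sort points by x, slice them into vertical strips, then sort each strip by y and slice it into tiles, so geographically close points land in the same partition.

```python
import math

def str_partition(points, partitions_per_axis):
    """STR-style spatial partitioning sketch (illustrative, not from [61]).

    1) Sort points by x and cut them into vertical strips.
    2) Sort each strip by y and cut it into tiles.
    Nearby points end up in the same tile (partition).
    """
    n = len(points)
    k = partitions_per_axis
    strip_size = math.ceil(n / k)            # points per vertical strip
    tiles = []
    xs = sorted(points, key=lambda p: p[0])  # step 1: global sort by x
    for i in range(0, n, strip_size):        # cut into vertical strips
        strip = sorted(xs[i:i + strip_size], key=lambda p: p[1])  # step 2: sort strip by y
        tile_size = math.ceil(len(strip) / k)
        for j in range(0, len(strip), tile_size):                 # cut strip into tiles
            tiles.append(strip[j:j + tile_size])
    return tiles

# Example: 16 points on a 4x4 grid split 2 ways per axis
# yields 4 partitions, one per 2x2 quadrant.
pts = [(x, y) for x in range(4) for y in range(4)]
parts = str_partition(pts, 2)
for tile in parts:
    print(tile)
```

In a real LakeHouse deployment the same grouping effect would be obtained through the table format's partitioning facilities rather than hand-rolled code; the sketch only makes the geometric intuition behind STR concrete.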