GPECOM2023 BigData
net/publication/372368521
All content following this page was uploaded by Hossein Shahinzadeh on 15 July 2023.
Abstract— In recent years, data generation has been increasing on a large scale and at a fast pace, and the development of Internet applications, mobile applications, and network-connected sensors has also increased widely. These applications and extensive internet connections continuously produce a large volume of data, with a wide diversity and different structures, which is called big data. At the same time, technologies related to big data are also developing. The rapid growth of cloud computing and the Internet of Things (IoT) is accelerating the dramatic growth of data generation. Sensors around the world are collecting and transmitting data that will be stored and processed in the cloud, and the era of big data is coming. In this article, first, an overview of big data and the definitions of its features are explained, and then the applications of big data in [...]

[...] Corporation (IDC) in 2011, the total volume of data that was produced and copied in the world was 1.8 zettabytes (1.8 × 1024 exabytes) [5]. This figure has since increased to 40 zettabytes and is projected to reach 175 zettabytes by the year 2025. The progression of big data's expansion from 2010 to its anticipated level of development in 2025 is shown in Figure 1.

[Fig. 1: data volume in zettabytes from 2010 to 2025; vertical axis in zettabytes up to 180, with 175 ZB marked as the final value]
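The data-volume figures quoted from IDC can be checked with a few lines of arithmetic. The sketch below converts 1.8 zettabytes to exabytes using the factor of 1024 given in the text, and derives the overall growth factor to the projected 175 zettabytes. The annual growth rate is an illustrative derivation that assumes smooth compound growth between 2011 and 2025; it is not a figure from the source, and all function names are hypothetical.

```python
# Worked arithmetic for the data volumes quoted in the text.
# Assumption: 1 ZB = 1024 EB, the conversion factor used in the text.
EB_PER_ZB = 1024


def zb_to_eb(zettabytes: float) -> float:
    """Convert zettabytes to exabytes using the text's factor of 1024."""
    return zettabytes * EB_PER_ZB


def growth_factor(start_zb: float, end_zb: float) -> float:
    """How many times larger the end volume is than the start volume."""
    return end_zb / start_zb


def annual_growth_rate(start_zb: float, end_zb: float, years: int) -> float:
    """Compound annual growth rate, ASSUMING smooth exponential growth."""
    return (end_zb / start_zb) ** (1 / years) - 1


volume_2011_zb = 1.8       # IDC figure for 2011, from the text
projected_2025_zb = 175.0  # projection for 2025, from the text

print(f"2011 volume: {zb_to_eb(volume_2011_zb):.1f} EB")
print(f"Growth 2011-2025: x{growth_factor(volume_2011_zb, projected_2025_zb):.1f}")
rate = annual_growth_rate(volume_2011_zb, projected_2025_zb, 2025 - 2011)
print(f"Implied compound growth per year: {rate:.1%}")
```

Under these assumptions the projection corresponds to roughly a hundredfold increase over fourteen years.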
[...] data visualization, data analysis, data privacy, performance and scalability will be discussed. Finally, the technologies related to big data in the fields of data analysis, data storage, and data virtualization, as well as the connection of big data with cloud computing, the Internet of Things, and data centers will be discussed [11-12].

II. OVERVIEW OF BIG DATA
The term big data refers to a rapidly growing collection of massive and heterogeneous data in structured, unstructured, and semi-structured formats. Due to their complex nature, big data require powerful technologies and advanced algorithms for management and analysis, and traditional business tools are not effective for dealing with big data [13]. The definition of big data is a topic on which different people disagree. Big data, in general, is a group of data that cannot be comprehended, gathered, managed, and processed simultaneously using conventional hardware/software tools and information technologies. Because of the importance of the topic, technology companies, researchers, and data analysts have different definitions of big data, which will be discussed further below [14].

A. Definition of Big Data Characteristics
Big data refers to data assets that are both enormous and complicated, and which need analysis in order to comprehend and get information from them [15]. In 2010, Apache Hadoop defined big data as "A dataset that has a high volume, velocity, or variety, and traditional methods are limited in their ability to efficiently analyze it." Based on this definition, in May 2011, McKinsey & Company (a global consulting organization) introduced big data as the "next frontier of innovation, competition, and productivity." The National Institute of Standards and Technology (NIST) defines big data as "data sets that have such high volume, velocity, or variety that traditional methods for efficient analysis are limited." This definition focuses on the technological aspect of big data. Most data scientists and big data experts define big data with three main characteristics (known as the "3Vs").
* Volume: The dataset that conforms to the big data standard is constantly changing and increasing over time. In big data, there is a large amount of data with sizes ranging from terabytes to zettabytes.
* Velocity: Big data is characterized by the rapid generation of data, which, in turn, necessitates the rapid processing of that data in order to derive useful insights. The term "velocity" alludes to the real-time aspect of big data, and in order to make the most of the potential benefits of big data for businesses, it is necessary to gather, analyze, and use the data in a prompt and efficient manner.
* Variety: Data comes in various types, including structured data such as database data, semi-structured data such as XML data, and unstructured data such as sound, images, videos, web pages, text, etc. [16].
However, others, including IDC, which is one of the most influential leaders in big data and its research fields, have different opinions. In 2011, IDC defined big data as follows: "Big data technologies introduce a new generation of technologies and architectures designed to extract value economically from very large volumes of data with a wide range of diversity, received, discovered, or analyzed at high speeds." With this definition, the characteristics of big data can be defined in the form of 4V, meaning volume (large volume), variety (different methods), velocity (fast production), and value (high value but low density), which is [...]

[Figure: "Big Data" at the center, surrounded by four characteristics. Volume: data at scale, processing at scale. Velocity: speed of data generation, processing, and requests. Variety: data source diversity, data structure heterogeneity. Veracity: data trustworthiness, data quality.]
Fig. 2. 4V characteristics of big data

B. Applications of Big Data
There are numerous applications for big data, some of which are illustrated in Figure 3 [18].

[Figure: "Big Data" at the center, surrounded by application areas: Building and Constructions, Political Decisions, Unemployment, Health, Smart Grid, Welfare, Tax Evaders, Agriculture, Natural Disaster, Insurance.]
Fig. 3. Applications of Big Data

a) Fraud Detection and Control
In business operations, various types of fraudulent claims or fake data exist, and identifying and controlling these data and fraud in transactions is one of the most important applications of big data. In most cases, fraud is discovered long after it has occurred, when the data is already lost, and in this case only its effects can be reduced or policies can be implemented to prevent its recurrence. Big data-based platforms can examine and analyze transactions and business operations in real time and detect inappropriate behavior from a user by examining large-scale patterns across all transactions and deals, thereby changing the way fraud and fake data are detected [19].

b) Call Center Data Analysis
Analyzing call center data is one of the useful applications of big data. In current processes, there are no solutions for processing customer data in the call center, and the information and knowledge that a call center can provide is ignored or presented with delay. Big data-based solutions in call centers can identify recurring problems and behavioral patterns of customers and employees by receiving and processing call content, and help improve organizational performance and increase customer satisfaction [20].

c) Social network analysis
One of the most important applications of big data directly related to users is the analysis of user activity on social networks. Users are widely active on social networks and
record a lot of information about their activities on a daily basis, from expressing interest in a company's products on Facebook to expressing opinions or complaints about other products in the form of a message on Twitter. Social network data can provide useful real-time information about market responses to products and campaigns, enabling companies to prepare and offer their products in line with market and customer opinions [21].

d) Financial data analysis
Big data analysis can also be used for financial analysis and forecasting. For example, big data is used in tools for predicting stock market trends to support decision-making in this area [22].

e) Agriculture
Biotechnology centers use sensor data in agriculture to increase crop productivity. They study and simulate plant reactions in different environmental conditions so that plants can adjust to the environment based on this information. In addition, big data can be used to select the type of crop to be cultivated [23].

III. BIG DATA CHALLENGES
Data analysis of big data provides attractive and valuable opportunities. However, researchers and experts in this field face multiple challenges when exploring big data and extracting knowledge and value from it. These problems exist at various levels of storage, data display, analysis, lifecycle management, reducing redundancy and compression, etc. In addition, issues related to privacy and confidentiality are particular obstacles that must be overcome in distributed applications of big data. Some key obstacles and challenges that must be overcome in developing big data applications are shown in Figure 4 and described below [24].

[Figure: ten numbered challenges arranged around "Big Data Challenges"; the surviving labels are Storage and Data Lifecycle Management.]
Fig. 4. Some of the challenges for Big Data

A. Storage
Today's hard drives have a capacity of terabytes, while the data generated in big data is far beyond that and is increasing exponentially, reaching exabytes. Traditional data management and analysis systems are based on Relational Database Management Systems (RDBMS); they are only suitable for managing structured data and are unable to store and process such large amounts of semi-structured and unstructured data [25]. The solution to this problem is to use distributed file systems and NoSQL databases, which are designed to manage unstructured data on a large scale.

B. Data Display
The types of datasets, their structures, the meanings of the datasets, their organizations, the granularity of the datasets, and their levels of accessibility can vary widely. The purpose of displaying data is to give it meaning so that it may be interpreted meaningfully by both users and computers. The value of the primary data, on the other hand, is diminished by an unsuitable display of the data, which may even impede an effective study of the data [26]. Displaying data effectively requires taking into account not only the structure, class, and data type, but also the requirements and preferences of the end user.

C. Redundancy reduction and data compression
In most cases, datasets contain a significant amount of duplicate information. If the data's potential value is not diminished in the process, reducing this duplication and compressing the data lowers the system's overall indirect costs. For instance, most of the information produced by sensor networks is highly redundant. This redundancy may be eliminated, and the quantity of the resulting data can be reduced [27].

D. Data Lifecycle Management
Sensors and ubiquitous computing systems are creating data at an unprecedented rate and scale, while storage systems have made only comparatively modest advances and are not capable of sustaining such enormous volumes of data. The worth of the data is therefore taken into consideration during data lifecycle management to determine which data should be kept and which should be discarded.

E. Analysis
The big data analysis process, which deals with a large volume of unstructured or semi-structured heterogeneous data, requires a lot of resources and time. To address this issue, distributed processing architectures are used, where data is divided into smaller sections and made available for processing by a number of computers in the network, and finally, the processed data are combined [28].

F. Confidentiality of Information
One of the important challenges of big data is the confidentiality and preservation of information. Most big data providers and owners cannot efficiently maintain and analyze their large datasets due to their limited capacity. They rely on data analysis experts and tools, which increases potential security risks. Therefore, maintaining the confidentiality of information is a major issue and challenge in big data [29].

G. Energy Management
With the increasing volume of data and demand for analysis, processing, storage, and transfer of big data, inevitably more electrical energy will be consumed for these purposes. Therefore, mechanisms for controlling and managing energy consumption levels for big data must be established.

H. Scalability
The big data analytics system must support current and future datasets. Therefore, the analytics algorithm should be capable of processing increasingly complex datasets that are expanding over time [30].

I. Collaboration
Big data analytics is interdisciplinary research that calls for the cooperation of specialists from diverse domains in order to fully utilize the potential of big data. To enable scientists and engineers from diverse professions to access various types of data and fully utilize their knowledge to
interact with one another in order to achieve the analytics objectives, a comprehensive big data network architecture must be developed [31].

IV. BIG DATA MANAGEMENT TOOLS AND TECHNOLOGIES
Big Data management involves organizing and utilizing a large amount of data. The aim of big data management is to assure data quality and accessibility for use in Business Intelligence (BI) projects and Big Data analytics. A variety of Big Data management solutions are employed for analytics, storage, and visualization, some of which are briefly covered in this section [32]. In addition, the relevant technologies related to Big Data will also be discussed in this section.

A. Data Analysis
* Hadoop: An open-source software framework that provides scalable solutions for solving problems with big data on a set of computers. Hadoop is made up of two key components: the MapReduce (MR) framework and the Hadoop Distributed File System (HDFS). HDFS, the data storage layer for MR, is a distributed file system modeled on the Google File System that runs on commodity hardware.
* Hive: An open-source data warehouse for querying and analyzing large sets of data stored in Hadoop files. It features a SQL-like user interface for querying data held in multiple Hadoop-integrated databases and storage systems. It was initially introduced and developed by Facebook and is now offered as an open-source tool.
* Pig: An advanced environment for developing MapReduce applications using Hadoop. The language utilized in this platform is Pig Latin, a high-level descriptive language that can express huge data gathering and analysis tasks in MR programming.
* Platform: A tool for analyzing and discovering big data. It is a platform that automatically takes user queries to the target and allows users to interact visually with vast amounts of data at a petabyte scale in the shortest possible time. In fact, it creates an abstraction layer that anyone can use to simplify and organize their datasets.
* RapidMiner: Software that offers an integrated platform for business analysis, predictive analytics, text mining, machine learning, and data mining. RapidMiner covers all data mining operations, including data preparation, validation, visualization, and result optimization. It is used for the development of commercial applications as well as for research and education [33].

B. Storage Technologies
For the administration of huge volumes of data, methods of data storage that are both efficient and effective are necessary. This is due to the fact that the size and volume of the data continue to rise at an alarming rate. Both the virtualization of storage and the compression of data have been major contributors to the total development that has [...]
* Non-Relational Databases: A non-relational database, also known as NoSQL, is a strategy for managing and constructing databases that are appropriate for use with vast amounts of data in distributed contexts. The most widely used of these databases is Apache Cassandra, which was initially developed at Facebook and made available under an open-source license in 2008. Additional examples of these databases are SimpleDB, Google Bigtable, MongoDB, and Voldemort. Large organizations like Netflix, LinkedIn, and Twitter employ one or more of these databases [34].

C. Data visualization tools
There are numerous open-source data visualization tools, some of which are mentioned below [35].
* R: A free and open-source programming language and development environment designed for visualizing and graphically representing data based on graphic and statistical computations. R is often utilized in statistical software development and data analysis.
* Tableau: A tool used for visualizing results in the form of charts, maps, graphs, and other graphics. Hadoop and Tableau can also be connected so that the two products interact.
* Infogram: This tool allows for the easy selection of a wide range of ready-made visual templates. The software also includes templates such as map charts and videos, and provides the ability to share created models.
* ChartBlocks: A free online tool that provides the ability to visualize databases and extensive pages without the need for any complex code.
* Tangle: This visual tool provides capabilities beyond data visualization and allows designers and developers to design programs interactively for a better understanding of data relationships.

D. Big data-related technologies
Some significant technologies that are closely connected to big data are covered in this section.

a) Cloud computing
Cloud computing has a close relationship with big data. Figure 5 depicts the main components of cloud computing. The term "cloud computing" refers to a type of technology that is capable of storing significant amounts of data. The main goal of cloud computing is to use centralized management of computational resources and capacities to provide various applications by sharing resources in a unified manner and making these applications accessible to users in a transparent and efficient manner [36].

[Fig. 5: main components of cloud computing; surviving labels: Cloud Computing Applications and Services, Traditional Applications and Services, Big Data Applications and Services, Virtual Resources Pool, Flexible Resource, Inquiry, Analysis and Excavate, Parallel Algorithm.]
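The divide, process, and combine pattern described for distributed analysis, and provided in Hadoop by the MapReduce framework, can be illustrated with a word count. The following is a minimal single-machine sketch, not Hadoop's API: a thread pool stands in for the computers in the network, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(records, n_chunks):
    """Divide the input into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_groups(grouped):
    """Reduce step: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, n_workers=3):
    """Split the data, map the chunks concurrently, then combine."""
    chunks = split(records, n_workers)
    # Assumption: threads stand in for separate machines in a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_groups(shuffle(mapped))


if __name__ == "__main__":
    lines = ["big data needs big storage", "data in the cloud"]
    print(word_count(lines, n_workers=2))
```

In a real cluster the chunks would be blocks of a distributed file and the map and reduce tasks would run on different nodes; the sketch keeps only the data flow.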
[Figure: surviving panel labels: Real-Time Smart Grid Software Suite; Integrated Mobility (Public Transit, Traveler Information); Stormwater Management and Urban Flooding; Street Lighting Management; Connection to the Smart Grid; Gas Distribution Management.]
Fig. 6. A visual representation of equipment and methods for collecting data in the smart grid and smart cities based on IoT
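Sensor data of the kind collected in smart grids and smart cities is often highly redundant, as noted in the discussion of redundancy reduction and data compression. The sketch below shows one possible way to reduce it: consecutive readings that repeat the previous value of the same sensor are dropped, and the remainder is compressed with zlib. The record layout (sensor id, timestamp, value) and all function names are assumptions for illustration, and dropping repeats is only appropriate when unchanged readings carry no additional value.

```python
import json
import zlib


def drop_repeated_readings(readings):
    """Keep a reading only if its value differs from the sensor's last kept value."""
    last_value = {}
    kept = []
    for sensor_id, timestamp, value in readings:
        if sensor_id not in last_value or last_value[sensor_id] != value:
            kept.append((sensor_id, timestamp, value))
            last_value[sensor_id] = value
    return kept


def compress(readings):
    """Serialize the readings as JSON and compress them losslessly with zlib."""
    raw = json.dumps(readings).encode("utf-8")
    return zlib.compress(raw, 9)


def decompress(blob):
    """Inverse of compress(): restore the list of reading tuples."""
    return [tuple(item) for item in json.loads(zlib.decompress(blob))]


if __name__ == "__main__":
    # Hypothetical data: one sensor reporting a constant value, then a change.
    readings = [("s1", t, 20.0) for t in range(1000)] + [("s1", 1000, 20.5)]
    kept = drop_repeated_readings(readings)
    raw_size = len(json.dumps(readings).encode("utf-8"))
    print(len(readings), "readings ->", len(kept), "after redundancy reduction")
    print(raw_size, "bytes ->", len(compress(kept)), "bytes after compression")
```

The compression step is lossless, whereas the redundancy-reduction step discards readings, so the two should be weighed separately against the value of the data.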