CH 2 Data Science
MIT
Department of Computer Science and Engineering
• The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries.
• It must be able to handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures (a minimal sketch of such structures follows below).
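To make "flexible and dynamic data structures" concrete, here is a minimal Python sketch (the event fields and file name are hypothetical, not from any particular system) of capturing schemaless, JSON-style records whose shapes differ from one another:

    import json
    import time

    # Hypothetical incoming events: each record may carry different fields,
    # so no fixed relational schema is assumed up front.
    events = [
        {"user": "u1", "action": "click", "page": "/home"},
        {"user": "u2", "action": "purchase", "amount": 19.99, "currency": "USD"},
        {"sensor": "s7", "temperature_c": 21.4},  # a completely different shape
    ]

    buffer = []
    for event in events:
        # Stamp each record at capture time; keep the payload schemaless.
        buffer.append({"ingested_at": time.time(), "payload": event})

    # Persist as newline-delimited JSON, a common format for flexible ingestion.
    with open("ingested.jsonl", "w") as f:
        for record in buffer:
            f.write(json.dumps(record) + "\n")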
Data Value Chain…
Data Analysis
• It is concerned with making the acquired raw data amenable to use in decision-making as well as domain-specific usage (a minimal cleaning sketch follows below).
• Data curators hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
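As a minimal sketch of curation in practice (assuming pandas is installed; the column names and defects are hypothetical), the raw records below are deduplicated, filtered, and typed so they become fit for purpose:

    import pandas as pd

    # Hypothetical raw records with the kinds of defects curation addresses:
    # duplicates, missing identifiers, and inconsistent types.
    raw = pd.DataFrame({
        "customer_id": ["1", "2", "2", None],
        "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-01"],
        "spend": ["100.5", "not recorded", "not recorded", "42"],
    })

    curated = (
        raw.drop_duplicates()                  # remove exact duplicate rows
           .dropna(subset=["customer_id"])     # require a usable identifier
           .assign(
               signup_date=lambda d: pd.to_datetime(d["signup_date"]),
               spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),
           )
    )
    print(curated.dtypes)  # verify the data is now typed and analysis-ready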
Data Value Chain…
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
• Relational DBMSs have been the main solution to the storage paradigm for nearly 40 years.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models (a minimal contrast of the two models is sketched below).
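The contrast between the two storage paradigms can be sketched with the Python standard library (SQLite standing in for a relational DBMS, a plain mapping standing in for the simplest key-value NoSQL model; the table and keys are hypothetical):

    import sqlite3

    # Relational model: a fixed schema declared up front, queried with SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Abebe')")
    row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
    print(row[0])

    # Key-value (NoSQL-style) model: no schema; values can take any shape,
    # which is part of what makes flexible data and horizontal scaling easier.
    kv_store = {}
    kv_store["user:1"] = {"name": "Abebe", "tags": ["admin"]}  # nested value
    kv_store["user:2"] = {"name": "Sara"}                      # different shape
    print(kv_store["user:1"]["name"])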
Data Value Chain…
Data Usage
• It covers the data-driven business activities that need access to data and its analysis, as well as the tools needed to integrate that analysis into the business activity.
Basic Concepts of Big Data…
• The figure shown here (omitted in this text version) illustrates the characteristics of big data: Volume, Velocity, Variety, and Veracity.
• Veracity: Trustworthiness, Accuracy, and Quality (can we trust the data? How accurate is it?)
The Role of Big Data in Data Science
• Enhanced Predictive Modeling
Big data enables more accurate and sophisticated predictive models, leading to better decision-making (see the sketch after this list).
• Improved Personalization
The large volume and variety of data allow for more personalized experiences and targeted
solutions.
• Real-Time Insights
The high velocity of big data enables real-time analysis and instant decision-making in dynamic
environments.
• Increased Efficiency
Big data can help optimize business processes, reduce costs, and improve overall operational
efficiency.
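As a toy illustration of predictive modeling at small scale (assuming scikit-learn is installed; the features and labels here are synthetic, not drawn from any real big-data source):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic example: predict churn (1/0) from two hypothetical features,
    # e.g. tenure in years and monthly activity count.
    X = [[5, 200], [1, 50], [8, 300], [2, 80], [7, 250], [1, 40]]
    y = [0, 1, 0, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0, stratify=y
    )
    model = LogisticRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out data

With real big data the same workflow holds, but the model is trained on far more rows and features, which is where the accuracy gains come from.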
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the sheer quantities of big data, individual computers are often inadequate for handling the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of big data.
• Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important (a single-machine sketch of the idea follows below).
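CPU pooling across a whole cluster needs software like YARN, but the idea can be sketched on one machine with Python's standard library, spreading (hypothetical) chunks of work across the available cores:

    from concurrent.futures import ProcessPoolExecutor

    def count_words(chunk):
        """Count words in one chunk of a (hypothetically huge) dataset."""
        return len(chunk.split())

    chunks = [
        "big data needs big clusters",
        "pooling cpu and memory across machines",
        "each worker handles one chunk",
    ]

    if __name__ == "__main__":
        # Pool the machine's CPU cores; a cluster manager like YARN extends
        # the same idea across many machines instead of many cores.
        with ProcessPoolExecutor() as pool:
            totals = list(pool.map(count_words, chunks))
        print(sum(totals))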
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing…
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as real-time analytics are emphasized more and more (a toy failover sketch follows below).
• Cluster membership and resource allocation can be handled by software like Hadoop’s YARN.
• The assembled computing cluster often acts as a foundation that other software interfaces with to
process the data.
• The machines involved in the computing cluster are also typically involved with the management
of a distributed storage system.
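As a toy illustration of the availability guarantee (this is not Hadoop's actual replication protocol; the node names and fetch function are invented for the sketch), a reader can fail over between replicas of the same block:

    import random

    REPLICAS = ["node-a", "node-b", "node-c"]  # hypothetical replica set

    def fetch_block(node, block_id):
        """Pretend to read a data block; fail randomly to simulate node loss."""
        if random.random() < 0.3:
            raise ConnectionError(f"{node} is unreachable")
        return f"block {block_id} from {node}"

    def read_with_failover(block_id):
        # Try each replica in turn; the read succeeds as long as any one
        # copy of the block is still reachable -- the essence of HA storage.
        for node in REPLICAS:
            try:
                return fetch_block(node, block_id)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas failed")

    print(read_with_failover(42))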
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It allows for the distributed processing of large datasets across clusters of computers (a word-count sketch follows the note below).
Note: Study@Home: Hadoop Ecosystem (https://www.geeksforgeeks.org/hadoop-ecosystem/)
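The classic first Hadoop job is a word count. One hedged sketch uses Hadoop Streaming, which lets ordinary Python scripts act as the mapper and reducer (exact job-submission commands depend on the installation):

    #!/usr/bin/env python3
    # mapper.py -- emits "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sums the counts per word; Hadoop Streaming delivers
    # the mapper output to the reducer sorted by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pair can be tested locally without a cluster: cat input.txt | python3 mapper.py | sort | python3 reducer.py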
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Big Data Life Cycle with Hadoop
1. Ingesting data into the system: this is the first stage; data is ingested or transferred into Hadoop from various sources such as relational databases and local files.
2. Processing the data in storage: the data is stored in the distributed file system (HDFS) and in NoSQL stores, where data processing is performed.
3. Computing and analyzing data: data is analyzed by processing frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: this is the access stage, performed by tools such as Hue and Cloudera Search; the analyzed data can then be accessed by users (a PySpark sketch of stages 1-3 follows below).
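A compressed, hedged sketch of stages 1-3 using PySpark (assuming pyspark is installed; the file name and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lifecycle-sketch").getOrCreate()

    # 1. Ingest: read raw records from a file into the cluster.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # 2. Process: the data now lives as a distributed dataset (backed by
    #    HDFS or local storage, depending on the deployment).
    clean = df.dropna(subset=["region", "amount"])

    # 3. Compute/analyze: an aggregation of the kind Hive or Pig would express.
    summary = clean.groupBy("region").agg(F.sum("amount").alias("total_sales"))
    summary.show()  # 4. In a full stack, tools like Hue would visualize this.

    spark.stop()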