CH 2 Data Science
MIT
Department of Computer Science and Engineering
• The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries.
• It must be able to handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures (a minimal sketch of such structures follows below).
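To make "flexible and dynamic data structures" concrete, here is a minimal Python sketch (the event fields and file name are hypothetical, not from any particular system) of capturing schemaless, JSON-style records whose shapes differ from one another:

    import json
    import time

    # Hypothetical incoming events: each record may carry different fields,
    # so no fixed relational schema is assumed up front.
    events = [
        {"user": "u1", "action": "click", "page": "/home"},
        {"user": "u2", "action": "purchase", "amount": 19.99, "currency": "USD"},
        {"sensor": "s7", "temperature_c": 21.4},  # a completely different shape
    ]

    buffer = []
    for event in events:
        # Stamp each record at capture time; keep the payload schemaless.
        buffer.append({"ingested_at": time.time(), "payload": event})

    # Persist as newline-delimited JSON, a common format for flexible ingestion.
    with open("ingested.jsonl", "w") as f:
        for record in buffer:
            f.write(json.dumps(record) + "\n")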
Data Value Chain…
Data Analysis
• It is concerned with making the acquired raw data amenable to use in decision-making as well as domain-specific usage (a minimal cleaning sketch follows below).
• Data curators hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
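As a minimal sketch of curation in practice (assuming pandas is installed; the column names and defects are hypothetical), the raw records below are deduplicated, filtered, and typed so they become fit for purpose:

    import pandas as pd

    # Hypothetical raw records with the kinds of defects curation addresses:
    # duplicates, missing identifiers, and inconsistent types.
    raw = pd.DataFrame({
        "customer_id": ["1", "2", "2", None],
        "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", "2021-03-01"],
        "spend": ["100.5", "not recorded", "not recorded", "42"],
    })

    curated = (
        raw.drop_duplicates()                  # remove exact duplicate rows
           .dropna(subset=["customer_id"])     # require a usable identifier
           .assign(
               signup_date=lambda d: pd.to_datetime(d["signup_date"]),
               spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),
           )
    )
    print(curated.dtypes)  # verify the data is now typed and analysis-ready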
Data Value Chain…
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
• Relational DBMSs have been the main solution to the storage paradigm for nearly 40 years.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models (a minimal contrast of the two models is sketched below).
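The contrast between the two storage paradigms can be sketched with the Python standard library (SQLite standing in for a relational DBMS, a plain mapping standing in for the simplest key-value NoSQL model; the table and keys are hypothetical):

    import sqlite3

    # Relational model: a fixed schema declared up front, queried with SQL.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Abebe')")
    row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
    print(row[0])

    # Key-value (NoSQL-style) model: no schema; values can take any shape,
    # which is part of what makes flexible data and horizontal scaling easier.
    kv_store = {}
    kv_store["user:1"] = {"name": "Abebe", "tags": ["admin"]}  # nested value
    kv_store["user:2"] = {"name": "Sara"}                      # different shape
    print(kv_store["user:1"]["name"])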
Data Value Chain…
Data Usage
• It covers the data-driven business activities that need access to data and its analysis, as well as the tools needed to integrate that analysis into the business activity.
Basic Concepts of Big Data…
• The figure shown here (omitted in this text version) illustrates the characteristics of big data: Volume, Velocity, Variety, and Veracity.
• Veracity: Trustworthiness, Accuracy, and Quality (can we trust the data? How accurate is it?)
The Role of Big Data in Data Science
• Enhanced Predictive Modeling
Big data enables more accurate and sophisticated predictive models, leading to better decision-making (see the sketch after this list).
• Improved Personalization
The large volume and variety of data allow for more personalized experiences and targeted
solutions.
• Real-Time Insights
The high velocity of big data enables real-time analysis and instant decision-making in dynamic
environments.
• Increased Efficiency
Big data can help optimize business processes, reduce costs, and improve overall operational
efficiency.
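As a toy illustration of predictive modeling at small scale (assuming scikit-learn is installed; the features and labels here are synthetic, not drawn from any real big-data source):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic example: predict churn (1/0) from two hypothetical features,
    # e.g. tenure in years and monthly activity count.
    X = [[5, 200], [1, 50], [8, 300], [2, 80], [7, 250], [1, 40]]
    y = [0, 1, 0, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=0, stratify=y
    )
    model = LogisticRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # accuracy on held-out data

With real big data the same workflow holds, but the model is trained on far more rows and features, which is where the accuracy gains come from.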
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the sheer quantities of big data, individual computers are often inadequate for handling the data at most stages.
• Computer clusters are a better fit for the high storage and computational needs of big data.
• Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important (a single-machine sketch of the idea follows below).
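CPU pooling across a whole cluster needs software like YARN, but the idea can be sketched on one machine with Python's standard library, spreading (hypothetical) chunks of work across the available cores:

    from concurrent.futures import ProcessPoolExecutor

    def count_words(chunk):
        """Count words in one chunk of a (hypothetically huge) dataset."""
        return len(chunk.split())

    chunks = [
        "big data needs big clusters",
        "pooling cpu and memory across machines",
        "each worker handles one chunk",
    ]

    if __name__ == "__main__":
        # Pool the machine's CPU cores; a cluster manager like YARN extends
        # the same idea across many machines instead of many cores.
        with ProcessPoolExecutor() as pool:
            totals = list(pool.map(count_words, chunks))
        print(sum(totals))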
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing…
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as real-time analytics are emphasized more and more (a toy failover sketch follows below).
• Cluster membership and resource allocation can be handled by software like Hadoop’s YARN.
• The assembled computing cluster often acts as a foundation that other software interfaces with to
process the data.
• The machines involved in the computing cluster are also typically involved with the management
of a distributed storage system.
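As a toy illustration of the availability guarantee (this is not Hadoop's actual replication protocol; the node names and fetch function are invented for the sketch), a reader can fail over between replicas of the same block:

    import random

    REPLICAS = ["node-a", "node-b", "node-c"]  # hypothetical replica set

    def fetch_block(node, block_id):
        """Pretend to read a data block; fail randomly to simulate node loss."""
        if random.random() < 0.3:
            raise ConnectionError(f"{node} is unreachable")
        return f"block {block_id} from {node}"

    def read_with_failover(block_id):
        # Try each replica in turn; the read succeeds as long as any one
        # copy of the block is still reachable -- the essence of HA storage.
        for node in REPLICAS:
            try:
                return fetch_block(node, block_id)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas failed")

    print(read_with_failover(42))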
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
• It allows for the distributed processing of large datasets across clusters of computers (a word-count sketch follows the note below).
Note: Study@Home: Hadoop Ecosystem (https://www.geeksforgeeks.org/hadoop-ecosystem/)
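The classic first Hadoop job is a word count. One hedged sketch uses Hadoop Streaming, which lets ordinary Python scripts act as the mapper and reducer (exact job-submission commands depend on the installation):

    #!/usr/bin/env python3
    # mapper.py -- emits "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sums the counts per word; Hadoop Streaming delivers
    # the mapper output to the reducer sorted by key.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pair can be tested locally without a cluster: cat input.txt | python3 mapper.py | sort | python3 reducer.py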
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Big Data Life Cycle with Hadoop
1. Ingesting data into the system: this is the first stage; data is ingested or transferred into Hadoop from various sources such as relational databases and local files.
2. Processing the data in storage: the data is stored in the distributed file system (HDFS) and in NoSQL stores, where data processing is performed.
3. Computing and analyzing data: data is analyzed by processing frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: this is the access stage, performed by tools such as Hue and Cloudera Search; the analyzed data can then be accessed by users (a PySpark sketch of stages 1-3 follows below).
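A compressed, hedged sketch of stages 1-3 using PySpark (assuming pyspark is installed; the file name and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lifecycle-sketch").getOrCreate()

    # 1. Ingest: read raw records from a file into the cluster.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # 2. Process: the data now lives as a distributed dataset (backed by
    #    HDFS or local storage, depending on the deployment).
    clean = df.dropna(subset=["region", "amount"])

    # 3. Compute/analyze: an aggregation of the kind Hive or Pig would express.
    summary = clean.groupBy("region").agg(F.sum("amount").alias("total_sales"))
    summary.show()  # 4. In a full stack, tools like Hue would visualize this.

    spark.stop()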