
INTRODUCTION TO DATA SCIENCE

MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 2 / 79


COURSE OBJECTIVES
CO1 Gain a basic understanding of the role of Data Science in real-world scenarios across business, industry, and government.

CO2 Understand various roles and stages in a Data Science Project and ethical issues to be
considered.

CO3 Explore the processes, tools and technologies for collection and analysis of
structured and unstructured data.

CO4 Appreciate the importance of techniques like data visualization and storytelling with data for effective presentation of outcomes to stakeholders.

CO5 Understand techniques of preparing real-world data for data analytics.


CO6 Implement data analytic techniques for discovering interesting patterns from data.
COURSE STRUCTURE
M1 Introduction to Data Science
M2 Data Analytics
M3 Data and Data Models
M4 Data Wrangling
M5 Feature Engineering
M6 Classification and Prediction
M7 Association Analysis
M8 Clustering
M9 Anomaly Detection
M10 Storytelling with Data
M11 Ethics for Data Science
MODULES OVERVIEW

Part I: Data Science — Module 2: Process & Analytics; Module 10: Storytelling; Module 11: Ethics
Part II: Data — Module 3: Data & Sources; Module 3: Data Pipelines
Part III: Preprocessing — Module 4: Data Wrangling; Module 5: Feature Engineering
Part IV: Modeling & Evaluation — Module 6: Classification; Module 7: Association Mining; Module 8: Clustering; Module 9: Anomaly Detection


INTRODUCTION TO DATA SCIENCE
MODULE # 2 : DATA ANALYTICS
IDS Course Team
BITS Pilani
EVALUATION SCHEDULE

No  | Name                | Weight | Date                                    | Remarks
EC1 | Quiz I              | 5%     | 22nd to 25th Feb 2025                   | Average of both quizzes
    | Quiz II             | 5%     | 4th to 7th May 2025                     |
    | Assignment Part I   | 10%    | 22nd Feb to 13th Mar 2025               | Sum of both assignments
    | Assignment Part II  | 15%    | 19th April to 9th May 2025              |
EC2 | Mid-sem             | 30%    | 21-23 Mar 2025 / 4-6 Apr 2025           | Closed Book
EC3 | Compre-sem          | 40%    | Regular: 23-25 May 2025 / 30-31 May, 1 Jun 2025 | Open Book
Recap
Session 1
• What is Data Science
• Why Data Science and why now, e.g., the Moneyball movie
• Real-world applications, e.g., Facebook, Amazon, Uber
• Data Science Challenges and Bias
• Roles in Data Science Team
• Organization of Data Science Team

TABLE OF CONTENTS

1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle

4 FURTHER READING
DEFINITION OF ANALYTICS – DICTIONARY

OXFORD: Analytics is the systematic computational analysis of data or statistics.
CAMBRIDGE: Analytics is a process in which a computer examines information using mathematical methods in order to find useful patterns.
DICTIONARY.COM: Analytics is the analysis of data, typically large sets of business data, by the use of mathematics, statistics, and computer software.

Analytics is treated as both a noun and a verb.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES

ORACLE: Analytics is the process of discovering, interpreting, and communicating significant patterns in data and using tools to empower your entire organization to ask any question of any data in any environment on any device.
EDUREKA: Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain.
INFORMATICA: Data analytics is the pursuit of extracting meaning from raw data using specialized computer systems.

Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
• whether a transaction is a fraud or not
• whether it will rain on a particular day
• whether a tumour is benign or malignant
To find patterns in the data
• finding the top 10 coldest days in the year
• which pages are visited the most on a particular website
• finding the most searched celebrity in a particular year
To find relationships in the data
• finding similar news articles
• finding similar patients in an electronic health record system
• finding related products on an e-commerce website
• finding similar images
• finding correlation between news items and stock prices
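The pattern-finding goals above map directly onto small stdlib computations. A minimal Python sketch, using made-up temperature readings and a made-up page-click log:

```python
import heapq
from collections import Counter

# Hypothetical daily temperature readings: (date, temperature in °C)
readings = [("2024-01-05", -3.1), ("2024-01-12", 1.4), ("2024-07-20", 31.0),
            ("2024-02-02", -5.6), ("2024-11-30", 2.2)]

# Pattern 1: the coldest days of the year (top-k selection)
coldest = heapq.nsmallest(2, readings, key=lambda r: r[1])

# Pattern 2: the most-visited pages from a click log (frequency counting)
clicks = ["/home", "/pricing", "/home", "/docs", "/home", "/pricing"]
top_pages = Counter(clicks).most_common(2)

print(coldest)    # the two lowest-temperature records
print(top_pages)  # the two most frequent pages with their counts
```

The same idioms scale to the "top 10 coldest days" and "most searched celebrity" questions by changing k and the key.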
TABLE OF CONTENTS

1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle

4 FURTHER READING
DATA ANALYTICS

Data analysis is defined as a process of cleaning, transforming, and modelling data to discover useful information for business decision-making.
4 different types of analytics:
1 Descriptive Analytics
2 Diagnostic Analytics
3 Predictive Analytics
4 Prescriptive Analytics
DESCRIPTIVE ANALYTICS

Answers the question of what happened.


Summarize past data usually in the form of dashboards.
Insights into the past.
Also known as statistical analysis.
Raw data from multiple data sources.

DESCRIPTIVE ANALYTICS

Techniques:
• Descriptive Statistics - histogram, correlation
• Data Visualization
• Exploratory Data Analysis

DIAGNOSTIC ANALYTICS

Answers the question of why something happened.
Gives in-depth insights into data.
Identifies relationships in the data and patterns of behaviour.
DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming on the Southwest monsoon?
DIAGNOSTIC ANALYTICS

Techniques:
• Pattern recognition to identify patterns
• Linear / logistic regression to identify relationships
• Deep learning techniques
PREDICTIVE ANALYTICS

Answers the question of what is likely to happen.
Predicts future trends.
Being able to predict allows one to make better decisions.
Analysis based on machine or deep learning.
Accuracy of the forecast or prediction depends highly on data quality and the stability of the situation.
PREDICTIVE ANALYTICS EXAMPLE
Covid patients LOS (Length-of-Stay in hospital) prediction

Data Details:
• First 1000 subjects affected by COVID
• Released by the Ministry of Health (MOH), Government of Singapore
• X-Variables (Predictors): age, gender, positive confirmation date, discharge date, etc.
• Y-Variable (Response): length of stay

Exploratory Analysis – univariate, bivariate, multivariate

Models Applied – Regression, SIR (susceptible-infected-recovered)

Reference: https://www.degruyter.com/document/doi/10.1515/cmb-2023-0104/html
PREDICTIVE ANALYTICS

Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like Linear regression, Logistic regression, SVM
• Deep Learning techniques
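As an illustration of the regression/classification idea, a one-feature logistic regression can be fit with plain gradient descent. The transaction data below is a toy, not a real fraud data set:

```python
import math

# Toy fraud-detection data: x = scaled transaction amount, y = 1 if fraud
xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
ys = [0,   0,   0,   1,   1,   1]

# Fit a one-feature logistic regression with stochastic gradient descent
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))   # predicted fraud probability
        w -= lr * (p - y) * x                  # gradient of the log-loss w.r.t. w
        b -= lr * (p - y)                      # gradient of the log-loss w.r.t. b

def predict(x):
    """Probability that a transaction of size x is fraudulent."""
    return 1 / (1 + math.exp(-(w * x + b)))

print(round(predict(0.2), 3), round(predict(1.4), 3))  # low-risk vs high-risk amount
```

Library implementations (e.g. scikit-learn) would normally replace this hand-rolled loop; the sketch only shows the mechanics.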

PRESCRIPTIVE ANALYTICS

Answers the question of what should be done.
Data-driven decision making and corrective actions, recommendations and suggestions.
Prescribes what action to take to eliminate a future problem or take full advantage of a promising trend.
Needs historical internal data and external information such as trends.
Analysis based on machine or deep learning and business rules.
Use of AI to improve decision making.
PRESCRIPTIVE ANALYTICS EXAMPLE
How can we improve crop production?
Case Study – Data Analytics

fMRI data of healthy subjects, studied by aggregating brain ROIs (regions of interest) into 90 nodes

• Modality – fMRI, resting state, BOLD signals
• Subjects – healthy, right-handed
• Age – 18 to 26 years
• Parcellation – AAL

Ref: https://ieeexplore.ieee.org/document/9826786
COGNITIVE ANALYTICS
Cognitive Analytics – What Don't I Know?

https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
COGNITIVE ANALYTICS

Next level of analytics.
Human cognition is based on context and reasoning.
Cognitive systems mimic how humans reason and process.
Cognitive systems analyse information and draw inferences using probability.
They continuously learn from data and reprogram themselves.
Involves semantics, AI, machine learning, deep learning, natural language processing, and neural networks.

https://interestingengineering.com/cognitive-computing-more-human-than-artificial-intelligence
COGNITIVE ANALYTICS
Uses all types of data: audio, video, text, and images in the analytics process.

Although this is the top tier of analytics maturity, Cognitive Analytics can be used in the prior levels.

According to one source:
"The essential distinction between cognitive platforms and artificial intelligence systems is that you want an AI to do something for you. A cognitive platform is something you turn to for collaboration or for advice."

According to Jean-Francois Puget:
"It extends the analytics journey to areas that were unreachable with more classical analytics techniques like business intelligence, statistics, and operations research."

https://www.ecapitaladvisors.com/blog/analytics-maturity/
https://www.xenonstack.com/insights/what-is-cognitive-analytics
TABLE OF CONTENTS

1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle

4 FURTHER READING
DATA ANALYTICS METHODOLOGIES

Use a standard methodology to ensure a good outcome.
1 CRISP-DM
2 SEMMA
3 SMAM
4 Big Data Life-cycle
NEED FOR A STANDARD PROCESS

Framework for recording experience.
• Allows projects to be replicated
Aid to project planning and management.
Encourages best practices and helps to obtain better results.
DATA SCIENCE METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the questions?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
CRISP-DM
CRISP-DM Phases

Cross Industry Standard Process for Data Mining
6 high-level phases
Used in IBM SPSS Modeler tool
Iterative approach to the development of analytical models.

CRISP-DM PHASES

Business Understanding
• Understand project objectives and requirements.
• Data mining problem definition.
Data Understanding
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
Data Preparation
• Record and attribute selection.
• Data cleansing.

CRISP-DM PHASES

Modeling
Run the data mining tools.
Evaluation
Determine if results meet business objectives.
Identify business issues that should have been addressed earlier.
Deployment
Put the resulting models into practice.
Set up for continuous mining of the data.
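The iterative character of CRISP-DM can be sketched as a loop in which modeling and evaluation repeat until results meet the business objective. The phase names come from the slides; the accuracy target and per-iteration scores below are invented for illustration:

```python
# Sketch of CRISP-DM's iterative loop. The up-front phases run once, then
# modeling/evaluation cycle until the evaluation meets the business objective.
PHASES = ["business understanding", "data understanding", "data preparation",
          "modeling", "evaluation", "deployment"]

def crisp_dm(accuracy_target=0.90, attempt_scores=(0.72, 0.85, 0.93)):
    """Count modeling/evaluation cycles needed before deployment."""
    executed = PHASES[:3]                       # run the up-front phases once
    for i, score in enumerate(attempt_scores, start=1):
        executed += ["modeling", "evaluation"]  # the iterative core
        if score >= accuracy_target:            # results meet business objectives
            executed.append("deployment")
            return {"iterations": i, "deployed": True, "phases_run": len(executed)}
    return {"iterations": len(attempt_scores), "deployed": False,
            "phases_run": len(executed)}

print(crisp_dm())
```

The loop also captures why CRISP-DM projects often revisit data preparation: a failed evaluation sends work back to earlier phases rather than forward to deployment.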

CRISP-DM PHASES AND TASKS
WHY CRISP-DM?

The data mining process must be reliable and repeatable by people with little data
mining skills.
CRISP-DM provides a uniform framework for
• guidelines.
• experience documentation.
CRISP-DM is flexible to account for differences.
• Different business/agency problems.
• Different data

Case Study – Evaluating Job Readiness: CRISP-DM

Ref: 2015, A Case Study of Evaluating Job Readiness with Data Mining Tools and CRISP-DM Methodology

Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Processing
Step 4: Data Modelling
Step 5: Model Evaluation and Conclusion
SEMMA

SAS Institute
Sample, Explore, Modify, Model, Assess
5 stages
SEMMA STAGES

1 Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
• Optional stage
2 Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
3 Modify
• Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.

SEMMA STAGES

4 Model
• Modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
5 Assess
• Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
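The Sample and Explore stages can be illustrated with Python's random module. The population of transaction amounts and the sample size are invented for the sketch:

```python
import random

# Hypothetical full data set: 100,000 transaction amounts
random.seed(42)
population = [round(random.uniform(5, 500), 2) for _ in range(100_000)]

# SEMMA "Sample": extract a portion big enough to hold the significant
# information, yet small enough to manipulate quickly
sample = random.sample(population, 1_000)

# SEMMA "Explore": a quick sanity check that the sample tracks the population
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(round(pop_mean, 1), round(sample_mean, 1))
```

A real SEMMA workflow would use stratified or time-aware sampling where simple random sampling would distort rare subgroups.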

SEMMA

“SEMMA is not a data mining methodology but rather a logical organization of the
functional tool set of SAS Enterprise Miner for carrying out the core tasks of data
mining.
Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
SEMMA is focused on the model development aspects of data mining.”

SMAM

Standard Methodology for Analytics Models
SMAM PHASES

Phase | Description
Use-case identification | Selection of the ideal approach from a list of candidates
Model requirements gathering | Understanding the conditions required for the model to function
Data preparation | Getting the data ready for the modeling
Modeling experiments | Scientific experimentation to solve the business question
Insight creation | Visualization and dash-boarding to provide insight
Proof of Value: ROI | Running the model in a small-scale setting to prove the value
Operationalization | Embedding the analytical model in operational systems
Model life-cycle | Governance around model lifetime and refresh
More Data Analytics Methodologies

• TDSP (Team Data Science Process)
• CPMAI (Cognitive Project Management for AI)
• ASUM-DM (Analytics Solutions Unified Method for Data Mining)
• Big Data-Specific Methodologies
• ODSC (Open Data Science Lifecycle)
• DELTA (Data-Enabled Leadership and Transformation Approach)
BIG DATA LIFE-CYCLE

Data Acquisition
• Acquiring information from a rich and varied data environment.
Data Awareness
• Connecting data from different sources into a coherent whole, including modeling content, establishing context, and ensuring searchability.
Data Analytics
• Using contextual data to answer questions about the state of your organization.
Data Governance
• Establishing a framework for providing for the provenance, infrastructure, and disposition of that data.
BIG DATA LIFE-CYCLE

Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction

PS: Some phases may overlap and can be done in parallel.
BIG DATA LIFE-CYCLE

Phase 1: Foundations
• Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
• Data Acquisition refers to collecting data.
• Data sets can be obtained from various sources, both internal and external to the
business organizations.
• Data sources can be in
• structured forms such as transferred from a data warehouse, a data mart, various
transaction systems.
• semi-structured sources such as Weblogs, system logs.
• unstructured sources such as media files consisting of videos, audios, and pictures.

BIG DATA LIFE-CYCLE

Phase 3: Data Preparation


• Collected data (Raw Data) is rigorously checked for inconsistencies, errors, and
duplicates.
• Redundant, duplicated, incomplete, and incorrect data are removed.
• The objective is to have clean and useable data sets.
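A minimal sketch of the Phase 3 checks in Python. The record layout and the "drop on missing age" rule are illustrative assumptions, not part of the life-cycle definition:

```python
# Raw records from the acquisition phase; some are duplicated, some incomplete.
raw = [
    {"id": 1, "name": "Asha",  "age": 34},
    {"id": 1, "name": "Asha",  "age": 34},   # exact duplicate
    {"id": 2, "name": "Ravi",  "age": None}, # incomplete record
    {"id": 3, "name": "Meena", "age": 28},
]

seen, clean = set(), []
for rec in raw:
    if rec["id"] in seen:        # remove duplicated records
        continue
    if rec["age"] is None:       # remove incomplete records
        continue
    seen.add(rec["id"])
    clean.append(rec)

print(clean)  # the clean, usable data set that Phase 3 aims for
```

At scale the same checks would run in a data-wrangling tool or framework rather than a hand-written loop.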
Phase 4: Data Input and Access
• Data input refers to sending data to planned target data repositories, systems, or
applications.
• Data can be stored in CRM (Customer Relationship Management) application, a data
lake or a data warehouse.
• Data access refers to accessing data using various methods.
• NoSQL is widely used to access big data.

BIG DATA LIFE-CYCLE

Phase 5: Data Processing


• Processing the raw form of data.
• Convert data into a readable format giving it the form and the context.
• Interpret the data using the selected data analytics tools such as Hadoop MapReduce,
Impala, Hive, Pig, and Spark SQL.
• Data processing also includes activities
• Data annotation – refers to labeling the data.
• Data integration – aims to combine data existing in different sources, and provide a unified
view of data to the data consumers.
• Data representation – refers to the way data is processed, transmitted, and stored.
• Data aggregation – aims to compile data from databases to combined data-sets to be used
for data processing.

BIG DATA LIFE-CYCLE

Phase 6: Data Output and Interpretation
• In the data output phase, the data is in a format ready for consumption by the business users.
• Transform data into usable formats such as plain text, graphs, processed images, or video files.
• This phase is also called data ingestion.
• Common Big Data ingestion tools are Sqoop, Flume, and Spark Streaming.
• Interpreting the ingested data requires analyzing it and extracting information or meaning from it to answer the questions related to the Big Data business solutions.
BIG DATA LIFE-CYCLE

Phase 7: Data Storage


• Store data in designed and designated storage units.
• Storage infrastructure can consist of storage area networks (SAN), network-attached
storage (NAS), or direct access storage (DAS) formats.
Phase 8: Data Integration
• Integration of stored data to different systems for various purposes.
• Integration of data lakes with a data warehouse or data marts.
Phase 9: Data Analytics and Visualization
• Integrated data can be useful and productive for data analytics and visualization.
• Business value is gained in this phase.

BIG DATA LIFE-CYCLE
Phase 10: Data Consumption
• Data is turned into information ready for consumption by internal or external users, including customers of the business organization.
• Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines.
Phase 11: Retention, Backup, and Archival
• Use established data backup strategies, techniques, methods, and tools.
• Identify, document, and obtain approval for the retention, backup, and archival decisions.
Phase 12: Data Destruction
• There may be regulatory requirements to destroy particular types of data after a certain amount of time.
• Confirm the destruction requirements with the data governance team in business organizations.
TABLE OF CONTENTS

1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle

4 FURTHER READING
DESCRIPTIVE ANALYTICS – EXAMPLE #1

Problem Statement:
"The market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a typical customer for a digital fitness band offered by Titanic Corp. The market research team decides to investigate whether there are differences across the usage patterns and product lines with respect to customer characteristics."

Data captured:
• Gender
• Age (in years)
• Education (in years)
• Relationship status (single or partnered)
• Annual household income
• Average number of times the customer tracks activity each week
• Number of miles the customer expects to walk each week
• Self-rated fitness on a scale of 1–5, where 1 is poor shape and 5 is excellent
• Model of the product purchased – IQ75, MZ65, DX87

https://medium.com/@ashishpahwa7/first-case-study-in-descriptive-analytics-a744140c39a4
DIAGNOSTIC ANALYTICS – EXAMPLE #1

Problem Statement:
"During the 1980s General Electric was selling different products to its customers, such as light bulbs, jet engines, windmills, and other related products. They also sold parts and services separately: GE would sell you a product, you would use it until it needed repair, either through normal wear and tear or because it broke, and you would come back to GE, which would sell you the parts and services to fix it. GE's model focused on how much it was selling, in sales of operational equipment and in sales of parts and services, and on what GE needed to do to drive up those sales."

https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
DIAGNOSTIC ANALYTICS – EXAMPLE #1

https://www.sganalytics.com/blog/change-management-analytics-adoption/
PREDICTIVE ANALYTICS – EXAMPLE #1

Google launched Google Flu Trends (GFT) to collect predictive analytics regarding outbreaks of flu. It is a great example of big data analytics in action.
So, did Google manage to predict influenza activity in real time by aggregating search engine queries and applying predictive analytics?
Even with a wealth of big data on search queries, GFT overestimated the prevalence of flu by over 50% in 2011-2012 and 2012-2013.
Google matched search engine terms entered by people in different regions of the world. When these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for certain phrases.
PREDICTIVE ANALYTICS – EXAMPLE #1

https://www.slideshare.net/VasileiosLampos/user-generated-content-collective-and-personalised-inference-tasks
PREDICTIVE ANALYTICS – EXAMPLE #2
Colleen Jones applied predictive analytics at FootSmart (a niche online catalog retailer) to a content marketing product, the FootSmart Health Resource Center (FHRC), which consisted of articles, diagrams, quizzes, and the like.
On analyzing the data around increased search engine visibility, the FHRC was found to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people who cared about foot health conditions and their treatments.
FootSmart decided to push more content at the FHRC and also improve its merchandising of the product.
The result of such informed, data-driven decision making? A 36% increase in weekly sales.

https://www.footsmart.com/pages/health-resource-center
PREDICTIVE ANALYTICS – EXAMPLE #3

Predictive Policing (self study)

https://www.brennancenter.org/our-work/research-reports/predictive-policing-explained
https://www.youtube.com/watch?v=YxvyeaL7NEM
PRESCRIPTIVE ANALYTICS – EXAMPLE #1
A health insurance company analyses its data and determines that many of its diabetic
patients also suffer from retinopathy.

With this information, the provider can now use predictive analytics to get an idea of how
many more ophthalmology claims it might receive during the next year.

Then, using prescriptive analytics, the company can look at scenarios where the
reimbursement costs for ophthalmology increases, decreases, or holds steady. These
scenarios then allow them to make an informed decision about how to proceed in a way that’s
both cost-effective and beneficial to their customers.

Analysing data on patients, treatments, appointments, surgeries, and even radiologic techniques can ensure hospitals are properly staffed, doctors devise tests and treatments based on probability rather than gut instinct, and the facility can save costs on everything from medical supplies to transport fees to food budgets.
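The reimbursement scenarios above can be sketched as a small what-if computation. Every figure here (claim count, per-claim costs, budget threshold) is a made-up assumption for illustration:

```python
# Predictive step output (assumed): expected ophthalmology claims next year.
expected_claims = 1_200

# Prescriptive step input: three per-claim reimbursement-cost scenarios.
scenarios = {"cost_decreases": 180.0, "cost_holds": 200.0, "cost_increases": 230.0}

# Compute total exposure under each scenario, then recommend an action based
# on the worst case crossing an (invented) budget threshold.
exposure = {name: expected_claims * cost for name, cost in scenarios.items()}
worst_case = max(exposure.values())
action = ("fund early-screening programme" if worst_case > 250_000
          else "maintain current reimbursement policy")

print(exposure)
print("recommended action:", action)
```

Real prescriptive systems would score many more scenarios, often with optimization or simulation, but the pattern is the same: enumerate futures, cost them, pick an action.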

PRESCRIPTIVE ANALYTICS – EXAMPLE #2

Whenever you go to Amazon, the site recommends dozens and dozens of products to you. These are based not only on your previous shopping history (reactive), but also on what you've searched for online, what other people who've shopped for the same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers take deductive, diagnostic, and predictive data and run it through a prescriptive analytics system to find products that you have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the company suggest products you may not have even known you wanted.

https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
HEALTHCARE ANALYTICS – CASE STUDY

Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
REFERENCES
Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560

T HANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 3 : DATA
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

INTRODUCTION TO DATA SCIENCE 2 / 79


Recap
Session 1
• What is Data Science
• Why Data Science and why now, ex: the Moneyball movie
• Real-world applications, ex: Facebook, Amazon, Uber
• Data Science Challenges and Bias
• Roles in Data Science Team
• Organization of Data Science Team

Session 2
• Data Analytics
• Case Studies – (COVID, Neuro Informatics)
• Data Analytics Methodologies (CRISP-DM)

INTRODUCTION TO DATA SCIENCE
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 4 / 79


DATA

Data is a collection of data objects and their attributes.
The type of data determines which tools and techniques can be used to analyze the data.

INTRODUCTION TO DATA SCIENCE 5 / 79


DATA
Data is a collection of data objects and their attributes.
An attribute is a data field representing a characteristic or feature of a data object.
Examples: eye color of a person, temperature, Customer_Id
An attribute is also known as a variable, field, characteristic, or feature.
An object is also known as a record, point, case, sample, entity, or instance.

INTRODUCTION TO DATA SCIENCE 6 / 79


ATTRIBUTE / FEATURE

An attribute is a property or characteristic of an object.
) eye color of a person, temperature
Attribute is also known as variable, field, characteristic, or feature.
The values used to represent an attribute may have properties that are not properties of the attribute itself.
) The average age of employees may have meaning, whereas it makes no sense to talk about the average employee ID.

INTRODUCTION TO DATA SCIENCE 7 / 79


TYPES OF ATTRIBUTES
Categorical ATTRIBUTES

Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Quantitative ATTRIBUTES
Interval
Values have a meaningful order and equal-sized intervals, but no true zero point.
E.g., calendar dates, temperature in Celsius or Fahrenheit
Ratio
Values have a true zero point, so ratios between values are meaningful.
E.g., age, height, weight, counts

INTRODUCTION TO DATA SCIENCE
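As a sketch of how these attribute types can be represented in code (hypothetical toy records; pandas assumed available), nominal and ordinal attributes can be declared explicitly so that analysis tools respect their properties:

```python
import pandas as pd

# Hypothetical toy records illustrating the attribute types above
df = pd.DataFrame({
    "hair_color": ["black", "blond", "brown"],   # nominal
    "size": ["small", "large", "medium"],        # ordinal
    "age": [24, 31, 27],                         # numeric (ratio)
})

# Nominal: unordered categories; only equality comparisons are meaningful
df["hair_color"] = pd.Categorical(df["hair_color"])

# Ordinal: a meaningful order, but unknown magnitude between successive values
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True)

print(df["size"].min())   # ordering is defined for ordinal attributes
print(df["age"].mean())   # averaging is meaningful for ratio attributes
```

Note that `min()` works on `size` only because the categorical was declared ordered; on the unordered `hair_color` column the same call would be meaningless.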


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES EXAMPLE

Identify the types of attributes in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good
Nominal Ratio Nominal Nominal Ratio Ordinal

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES
• Student Majors: CS, Biology, Economics
• Student Performance: Poor, Average, Good, Excellent
• Exam Scores
• Classroom Seating: Front, Middle, Back
• Letter Grades: A, B, C, D, F
• Number of Students in a Class
• Years of Study: 2020, 2021, 2022

INTRODUCTION TO DATA SCIENCE


ATTRIBUTES AND TRANSFORMATIONS

(Figure: attribute types and the transformations that preserve their meaning, from Introduction to Data Mining by Tan)

INTRODUCTION TO DATA SCIENCE
ATTRIBUTES BY THE NUMBER OF VALUES
Discrete Attribute
) only a finite or countably infinite set of values.
) zip codes, counts, or the set of words in a collection of documents
) Often represented as integer variables.

) Note: binary attributes are a special case of discrete attributes

Continuous Attribute
) Real numbers as attribute values.
) temperature, height, or weight
) Continuous attributes are typically represented as floating-point variables.

Asymmetric Attribute
) only the presence of a non-zero attribute value is considered important.
) For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise.
) Such 0/1 attributes are called asymmetric binary attributes.

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES EXAMPLE

Identify whether the attribute is discrete and continuous in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good

INTRODUCTION TO DATA SCIENCE


TYPES OF ATTRIBUTES EXAMPLE

Identify whether the attribute is discrete and continuous in the given data.

ID Age Gender Course ID Percentage Grade


19001 24 Female CS 104 74 Good
19002 23 Male CS 102 75 Good
19003 25 Female CS 103 67 Fair
19004 24 Female CS 104 79 Good
19005 23 Male CS 102 75 Good
19006 24 Female CS 103 87 Excellent
19007 26 Male CS 105 70 Good
Discrete Continuous Discrete Discrete Continuous Discrete

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS
1 Structured data
) Data containing a defined data type, format and structure.
) Example: transaction data, online analytical processing (OLAP) data cubes, traditional
RDBMS, CSV files and spreadsheets.
2 Semi structured data
) Textual data file with discernible pattern that enables parsing
) Example: XML data file, HTML of a web page
3 Quasi structured data
) Textual data with erratic data format that can be formatted with effort, tools and time
) Example: Web click-stream data
4 Unstructured data
) Data that has no inherent structure.
) Example: text document, PDF, images and video, email
INTRODUCTION TO DATA SCIENCE
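A minimal sketch of the difference between structured and semi-structured data, using hypothetical records and only the Python standard library:

```python
import csv
import io
import json

# Structured data: a CSV with a fixed schema (hypothetical records)
structured = "id,name\n1,Amy\n2,Ben\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured data: JSON with a discernible, parseable pattern
semi = '{"id": 3, "name": "Cathy", "tags": ["new"]}'
obj = json.loads(semi)

print(rows[0]["name"])   # -> Amy
print(obj["tags"])       # -> ['new']
```

The CSV parser relies on every row sharing the same columns, while the JSON object may nest lists and optional fields, which is exactly what makes it only semi-structured.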
STRUCTURED DATA
RDBMS Data

INTRODUCTION TO DATA SCIENCE


SEMI-STRUCTURED DATA
JSON Data

INTRODUCTION TO DATA SCIENCE


QUASI-STRUCTURED DATA

Web Click-Stream

INTRODUCTION TO DATA SCIENCE


DATA-SETS

1 Public data
) Data that has been collected and preprocessed for academic or research purposes and
made public.
) https://archive.ics.uci.edu/
2 Private data
) Data that is specific to an organization.
) Privacy rules like the IT Act 2000 and GDPR apply.

INTRODUCTION TO DATA SCIENCE


TYPES OF DATA-SETS

INTRODUCTION TO DATA SCIENCE


RECORD DATA
Record data – flat file (CSV), RDBMS
Transaction data – set of items – banking, retail, e-commerce
Data Matrix – record data with only numeric attributes. – SPSS data matrix
Sparse Data Matrix – binary asymmetric data. 0/1 entries.
Document term matrix – Frequency of terms that appears in documents

INTRODUCTION TO DATA SCIENCE


ORDERED DATA EXAMPLE
Sequential data or temporal data – Record data + time. Eg: Money transfer
transaction in Banking
Sequence data – positions instead of time stamps. Eg: DNA sequence bases (G, T, A, C)
Time series data – temporal autocorrelation

INTRODUCTION TO DATA SCIENCE


TEXT DATA

Text is considered as 1-D data.
Eg: email body, PDF document, Word document

INTRODUCTION TO DATA SCIENCE


AUDIO DATA

Audio is considered as 1-D time series data.
Eg: speech, music

INTRODUCTION TO DATA SCIENCE


IMAGE DATA

Images are considered as 2-D data in Euclidean space.
Digital images are stored in a matrix or grid form where the intensity or colour information is stored at the (x, y) position.
Black and white (binary) image – each pixel intensity is represented as 0 (black) or 1 (white)
Greyscale image – intensity is represented as an integer between 0 and 255: 0 is black, 255 is white, and mid-grey is around 128
Colour image – contains 3 bands or channels – Red, Green and Blue – each channel value is an integer between 0 and 255

INTRODUCTION TO DATA SCIENCE
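A small sketch (hypothetical tiny images, NumPy assumed available) of how greyscale and colour images are laid out as arrays:

```python
import numpy as np

# A tiny 4x4 greyscale image: one matrix of intensities I(x, y)
gray = np.zeros((4, 4), dtype=np.uint8)   # all pixels black (0)
gray[1, 2] = 255                          # one white pixel at row 1, column 2

# A tiny 4x4 colour image: three channels (R, G, B) per pixel
colour = np.zeros((4, 4, 3), dtype=np.uint8)
colour[0, 0] = [255, 0, 0]                # top-left pixel pure red

print(gray.shape)     # (4, 4)
print(colour.shape)   # (4, 4, 3)
```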


DIGITAL GRAYSCALE IMAGE

Pixel intensities = I(x, y)

https://mozanunal.com/2019/11/img2sh/
INTRODUCTION TO DATA SCIENCE
DIGITAL COLOUR IMAGE

https://www.analyticsvidhya.com/blog/2021/03/grayscale-and-rgb-format-for-storing-images/
INTRODUCTION TO DATA SCIENCE
DIGITAL COLOUR IMAGE

https://www.mathworks.com/help/matlab/creating_plots/image-types.html
INTRODUCTION TO DATA SCIENCE 32 / 79
GRAPH DATA EXAMPLE

Data with relationships among objects – Web pages


Data with objects as graphs – chemical compound

https://lod-cloud.net/
INTRODUCTION TO DATA SCIENCE
TABLE OF CONTENTS

1 DATA
2 DATA-SETS
3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


DATA QUALITY ISSUES

Missing data
) Data that is not filled / available intentionally or otherwise.
) Attributes of interest may not always be available, such as customer information for sales
transaction data.
) Some data were not considered important at the time of entry.

) Relevant data may not be recorded due to a misunderstanding or because of equipment

malfunctions.
Duplicate data
Orphaned data
Text encoding errors
Data that is biased

INTRODUCTION TO DATA SCIENCE


DATA QUALITY ISSUES

Noise and outliers


) Noise is a random error or variance in a measured data object.
) Data objects with behaviors that are very different from expectation are called outliers or
anomalies.
Inaccurate data
) Inaccurate data – data having incorrect attribute values
) Caused by, faulty data collection instruments, human or computer errors occurring at data
entry, users may purposely submit incorrect data values, errors in data transmission
Inconsistent data
) inconsistencies in naming conventions or data codes, or inconsistent formats for input
fields (e.g., date).

INTRODUCTION TO DATA SCIENCE


EXAMPLE: DATA QUALITY ISSUES

Find the issues in the given data.

Name Age Date of Birth Course ID Percentage


Amy 24 01-Jan-1995 CS 104 74
Ben 23 Dec-01-1996 CS 102 75
Cathy 25 01-Nov-1994 67
Diana 24 Oct-01-1995 CS 104 79
Ben 23 Dec-01-1996 CS 102 75
Eden 24 CS 103 175
Fischer 01-01-1959 CS 105 70

INTRODUCTION TO DATA SCIENCE


EXAMPLE: DATA QUALITY ISSUES

Missing data – age, date of birth, course ID


Inconsistent data – date of birth
Duplicate data – Ben is duplicated
Data Conformity – Percentage = 175

INTRODUCTION TO DATA SCIENCE
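The checks behind these findings can be sketched in plain Python on hypothetical records that mirror the example table:

```python
# Hypothetical records mirroring the example table; None marks missing data
records = [
    {"name": "Ben",   "age": 23, "course": "CS 102", "percentage": 75},
    {"name": "Cathy", "age": 25, "course": None,     "percentage": 67},
    {"name": "Ben",   "age": 23, "course": "CS 102", "percentage": 75},
    {"name": "Eden",  "age": 24, "course": "CS 103", "percentage": 175},
]

# Missing data: any attribute left empty
missing = [r["name"] for r in records if None in r.values()]

# Duplicate data: identical records seen more than once
seen, duplicates = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r["name"])
    else:
        seen.add(key)

# Conformity: a percentage must lie between 0 and 100
out_of_range = [r["name"] for r in records if not 0 <= r["percentage"] <= 100]

print(missing, duplicates, out_of_range)   # -> ['Cathy'] ['Ben'] ['Eden']
```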


TABLE OF CONTENTS

1 DATA
2 DATA-SETS

3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


FORMAL DATA MODELS

Model is something we construct to help us understand the real world.


One key goal of formal modelling is to develop a precise specification of your question
and how your data can be used to answer that question.
Formal models allow you to identify clearly what you are trying to infer from data and
what form the relationships between features of the population take.

INTRODUCTION TO DATA SCIENCE


GENERAL FRAMEWORK FOR MODELLING

Apply the basic epicycle of analysis to the formal modelling portion of data analysis.
1 Setting expectations.
) Develop a primary model that represents your best sense of what provides the answer

to your question. This model is chosen based on whatever information you have
currently available.
2 Collecting Information.
) Create a set of secondary models that challenge the primary model in some way.

3 Revising expectations.
) If our secondary models are successful in challenging our primary model and put the
primary model’s conclusions in some doubt, then we may need to adjust or modify
the primary model to better reflect what we have learned from the secondary
models.

INTRODUCTION TO DATA SCIENCE


DATA MODEL - CASE STUDY

Conduct a survey of 20 people to ask them how much they’d be willing to spend on a
product you’re developing.
The survey response

25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5

What do the data say?


Note: the example is hypothetical; generally we select a larger sample size for modelling.

INTRODUCTION TO DATA SCIENCE


STEP 1: SETTING EXPECTATIONS

The sample data represents the overall population likely to purchase the product.
Mean - $17.2 and Standard Deviation - $10.39
Under the Normal model, the expectation is that the distribution of prices people are willing to pay is bell-shaped around the mean.
According to the model, about 68% of the population would be willing to pay somewhere between $6.81 and $27.59 for this new product.

INTRODUCTION TO DATA SCIENCE
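The summary statistics quoted above can be reproduced with the standard library:

```python
import statistics

# Survey responses from the case study above
prices = [25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
          30, 30, 10, 25, 10, 20, 10, 10, 25, 5]

mean = statistics.mean(prices)   # 17.2
sd = statistics.stdev(prices)    # sample standard deviation, about 10.39

# Under a Normal model, about 68% of the population falls within mean +/- 1 sd
print(round(mean, 2), round(sd, 2))
print(round(mean - sd, 2), round(mean + sd, 2))   # -> 6.81 27.59
```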


STEP 2: COMPARING MODEL EXPECTATIONS WITH REALITY

Given the parameters, our expectation under the Normal model is that the distribution of prices that people are willing to pay looks like a bell-shaped curve.
E.g. the Normal curve on top of the histogram of the 20 data points of the amount people say they are willing to pay. The histogram has a large spike around 10.
The Normal distribution allows for negative values on the left-hand side of the plot, but there are no data points in that region of the plot.

INTRODUCTION TO DATA SCIENCE


STEP 3: REFINING OUR EXPECTATIONS

When the model and the data don’t match very well:
) Get a different model.
) Get different data.
) Do both.
E.g. choose a different statistical model to represent the population, such as the Gamma distribution, which has the feature that it only allows positive values.

INTRODUCTION TO DATA SCIENCE


STEP 3: REFINING OUR EXPECTATIONS

Normal vs Gamma Distribution – which model to choose?
Qn 1: What percentage of the population is willing to pay at least $30 for this product?
) Normal distribution – 11% would pay $30 or more
) Gamma distribution – 7% would pay $30 or more
Based on which model suits the problem at hand, choose the appropriate model.

INTRODUCTION TO DATA SCIENCE
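The Normal-model figure can be checked with the standard library (the Gamma tail would need a stats package such as SciPy, so only the Normal case is sketched here):

```python
import math

# Normal(mean, sd) fitted to the survey data above
mean, sd = 17.2, 10.39

def normal_tail(x, mean, sd):
    # P(X >= x) for a Normal variable, via the complementary error function
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

p = normal_tail(30, mean, sd)
print(round(p, 3))   # about 0.109, i.e. roughly 11% would pay $30 or more
```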


DEVELOPING A BENCHMARK MODEL

The goal is to develop a benchmark model that serves as a baseline, upon which we’ll
measure the performance of a better and more attuned algorithm.
Benchmarking requires experiments to be comparable, measurable, and reproducible.

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS

3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


CLASS OR CONCEPT DESCRIPTIONS

Class or Concept Descriptions describe individual classes and concepts in summarized, concise, and yet precise terms.
Concept descriptions can be derived using
1 data characterization, by summarizing the data of the class under study
2 data discrimination, by comparison of the target class with one or a set of comparative
classes
3 both data characterization and discrimination.

INTRODUCTION TO DATA SCIENCE


DATA CHARACTERIZATION

Data characterization is a summarization of the general characteristics or features of a target class of data.
Methods for data characterization
) data summaries based on statistical measures and plots
) data cube-based OLAP roll-up operation
) attribute-oriented induction technique

Output of data characterization


) bar charts, curves, multidimensional data cubes, and multidimensional tables.
) generalized relations or in rule form called characteristic rules.

INTRODUCTION TO DATA SCIENCE


DATA DISCRIMINATION

Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.
The target and contrasting classes can be specified by a user, and the corresponding
data objects can be retrieved through database queries.
The methods used for data discrimination are similar to those used for data
characterization.
Output presentation
) Discrimination descriptions expressed in the form of rules are referred to as
discriminant rules.

INTRODUCTION TO DATA SCIENCE


ASSOCIATION ANALYSES

Frequent patterns are patterns that occur frequently in data.
Many kinds of frequent patterns:
) A frequent itemset refers to a set of items that often appear together in a transactional data set. E.g.: milk and bread
) A frequently occurring subsequence is a (frequent) sequential pattern. Eg: customers tend to purchase first a laptop, followed by a digital camera, and then a memory card.
) A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.
) Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

INTRODUCTION TO DATA SCIENCE
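A minimal frequent-itemset sketch, counting co-occurring pairs in hypothetical market-basket transactions:

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

min_support = 3   # a pair must occur in at least 3 transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)   # -> {('bread', 'milk'): 3}
```

Real miners such as Apriori prune the search space instead of enumerating every pair, but the support-counting idea is the same.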


PREDICTION ANALYSES

The term prediction refers to both numeric prediction and class label
prediction.
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.

INTRODUCTION TO DATA SCIENCE


CLASSIFICATION ANALYSES

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
The model is used to predict the class label of objects for which the class label is unknown.
The derived model may be represented as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks, naive Bayesian classification, support vector machines, and k-nearest-neighbor classification.
Classification predicts categorical (discrete, unordered) labels.

INTRODUCTION TO DATA SCIENCE
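A minimal sketch of one of the classifier families named above, k-nearest-neighbor, on hypothetical labelled 2-D points:

```python
import math
from collections import Counter

# Hypothetical training data: 2-D features with known class labels
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]

def knn_predict(x, train, k=3):
    # Sort training objects by distance to x, then take a majority vote
    nearest = sorted(train, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train))   # -> A
```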


REGRESSION ANALYSES

Regression models continuous-valued functions.


Regression is used to predict missing or unavailable numerical data values rather
than (discrete) class labels.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Regression also encompasses the identification of distribution trends based on the
available data.

INTRODUCTION TO DATA SCIENCE
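A least-squares sketch of numeric prediction on hypothetical points:

```python
# Hypothetical training points; the relationship is roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates of slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))   # -> 1.99 0.05
print(round(intercept + slope * 5.0, 2))      # predicted value at x = 5
```

Unlike classification, the output here is a continuous value rather than a discrete label.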


CLUSTER ANALYSIS

Clustering analyzes data objects without consulting class labels.


Clustering can be used to generate class labels for a group of data. The objects are
clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
Each cluster so formed can be viewed as a class of objects, from which rules can be
derived.
Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.

INTRODUCTION TO DATA SCIENCE
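A minimal k-means sketch, grouping hypothetical points by similarity without any class labels:

```python
import math
import random

# Hypothetical unlabelled points: two visually obvious groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]

def kmeans(points, k, iters=10, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:   # assign each point to its nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):   # recompute the centers
            if members:
                centers[i] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return clusters

clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))   # -> [3, 3]
```

The loop maximizes intraclass similarity by moving each center to the mean of its members, exactly the principle stated above.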


TABLE OF CONTENTS

1 DATA
2 DATA-SETS

3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


DATA PIPELINE STAGES

Data pipelines are sets of processes that move and transform data from various
sources to a destination where new value can be derived.
In their simplest form, pipelines may extract only data from one source such as a
REST API and load to a destination such as a SQL table in a data warehouse.
In practice, data pipelines consist of multiple steps including data extraction, data
preprocessing, data validation, and at times training or running a machine learning
model before delivering data to its final destination.
Data engineers specialize in building and maintaining the data pipelines.

INTRODUCTION TO DATA SCIENCE


WHY BUILD DATA PIPELINES?

For every dashboard and insight that a data analyst generates and for each predictive
model developed by a data scientist, there are data pipelines working behind the
scenes.
A single dashboard, or a single metric may be derived from data originating in
multiple source systems.
Data pipelines extract data from sources and load them into simple database tables
or flat files for analysts to use. Raw data is refined along the way to clean, structure,
normalize, combine, aggregate, and anonymize or secure it.

INTRODUCTION TO DATA SCIENCE


DIVERSITY OF DATA SOURCES

INTRODUCTION TO DATA SCIENCE


DATA INGESTION
The term data ingestion refers to extracting data from one source and loading it into
another.
Ingestion Interface and data structure
) A database behind an application, such as a Postgres or MySQL database or NoSQL
database
) JSON from a REST API
) A stream processing platform such as Apache Kafka

) A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files
) Semi-structured log data
) A data warehouse or data lake
) Data in HDFS or HBase database

Data ingestion is traditionally both the extract and load steps of an ETL or ELT
process.
INTRODUCTION TO DATA SCIENCE
SIMPLE PIPELINE

INTRODUCTION TO DATA SCIENCE


ETL AND ELT

E– extract step
) gathers data from various sources in preparation for loading and transforming.
L – load step
) brings either the raw data (in the case of ELT) or the fully transformed data (in the case of
ETL) into the final destination.
) load data into the data warehouse, data lake, or other destination.

T – transform step
) raw data from each source system is combined and formatted in such a way that it’s useful to analysts and visualization tools

INTRODUCTION TO DATA SCIENCE
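A minimal extract-and-load sketch (the E and L steps), using a hypothetical in-memory CSV source and an in-memory SQLite destination:

```python
import csv
import io
import sqlite3

# Hypothetical source system: a CSV export of an orders table
source = io.StringIO("id,amount\n1,25\n2,40\n")

# Destination: an in-memory SQLite "warehouse"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")

# Extract: read records out of the source
rows = [(int(r["id"]), float(r["amount"])) for r in csv.DictReader(source)]

# Load: bring the raw data into the destination; in ELT, the transform step
# would then run inside the warehouse, e.g. as SQL
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

total = conn.execute("SELECT SUM(amount) FROM raw_orders").fetchone()[0]
print(total)   # -> 65.0
```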


ELT PIPELINE

INTRODUCTION TO DATA SCIENCE


ORCHESTRATING PIPELINES

Orchestration ensures that the steps in a pipeline are run in the correct order and that
dependencies between steps are managed properly.
Pipeline steps (tasks) are always directed, meaning they start with a task or multiple
tasks and end with a specific task or tasks. This is required to guarantee a path of
execution.
Pipeline graphs must also be acyclic, meaning that a task cannot point back to a
previously completed task.
Pipelines are implemented as DAGs (Directed Acyclic Graphs).
Orchestration tool – Apache Airflow

INTRODUCTION TO DATA SCIENCE
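The DAG idea can be sketched with the standard library's graphlib (Python 3.9+; the task names are hypothetical, and an orchestrator such as Apache Airflow manages this at production scale):

```python
from graphlib import TopologicalSorter

# A pipeline DAG: each task maps to the set of tasks it depends on
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_warehouse": {"extract_orders", "extract_customers"},
    "transform_report": {"load_warehouse"},
}

# A valid execution order: every task runs after all of its dependencies
order = list(TopologicalSorter(dag).static_order())
print(order)
```

If a task pointed back to a completed task, the graph would be cyclic and `static_order()` would raise a `CycleError`, which is why pipelines must be acyclic.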


ORCHESTRATION DAG

INTRODUCTION TO DATA SCIENCE


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE


VARIOUS DATA SOURCE

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS

1 DATA
2 DATA-SETS

3 DATA QUALITY

4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE


DATABASE DATA

A database management system (DBMS) consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.
The software programs provide mechanisms for defining database structures and
data storage; for specifying and managing concurrent, shared, or distributed data
access; and for ensuring consistency and security of the information stored despite
system crashes or attempts at unauthorized access.

INTRODUCTION TO DATA SCIENCE


RDBMS DATA

A relational database (RDBMS) is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large
set of tuples (records or rows).
Each tuple in a relational table represents an object identified by a unique key and
described by a set of attribute values.

INTRODUCTION TO DATA SCIENCE


DATA WAREHOUSE

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
Data in a data warehouse is structured and optimized for reporting and analysis
queries.

INTRODUCTION TO DATA SCIENCE


TRANSACTIONAL DATA

Each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page.
A transaction includes a unique transaction identity number (trans ID) and a list of the
items making up the transaction.
A transactional database may have additional tables, which contain other information
related to the transactions, such as item description, information about the
salesperson or the branch, and so on.

INTRODUCTION TO DATA SCIENCE


DATA LAKES

A data lake is where data is stored, but without the structure or query
optimization of a data warehouse.
It will contain a high volume of data as well as a variety of data types.
It is not optimized for querying such data in the interest of reporting and analysis.
Eg: a single data lake might contain a collection of blog posts stored as text files, flat
file extracts from a relational database, and JSON objects containing events
generated by sensors in an industrial system.

INTRODUCTION TO DATA SCIENCE


Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and Micheline
Kamber Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic by Cathy O’Neil, O’Reilly Media, Inc. ISBN: 9781449374310

THANK YOU

INTRODUCTION TO DATA SCIENCE


INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA WRANGLING(CONTD…)
IDS Course Team
BITS Pilani
The instructor is gratefully acknowledging
the authors who made their course
materials freely available online.

INTRODUCTION TO DATA SCIENCE


Recap
Session 1
• What is Data Science
• Why Data Science and why now, ex: the Moneyball movie
• Real-world applications, ex: Facebook, Amazon, Uber
• Data Science Challenges and Bias
• Roles in a Data Science Team
• Organization of a Data Science Team

Session 2
• Data Analytics
• Case Studies – (COVID, Neuro Informatics)
• Data Analytics Methodologies (CRISP-DM)

Session 3
• Data Analytics Methodologies (SEMMA, SMAM, DataOps, MLOps)
• Data (Features and Attributes)
• Types of attributes
• Data sets

Session 4
• Statistical Description of data
• Data Preparation
• Data aggregation and sampling

INTRODUCTION TO DATA SCIENCE
Recap
Session 5
• Dissimilarity and Similarity Measures
• Visualization for EDA
• Handling Numeric data

Session 6
Session 7
Session 8

INTRODUCTION TO DATA SCIENCE
TABLE OF CONTENTS

1 DATA SIMILARITY & DISSIMILARITY MEASURE
2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


MEASURES OF PROXIMITY

Similarity and dissimilarity measures are measures of proximity.
A similarity measure for two objects, i and j, will typically return the value 1 if they are identical and 0 if the objects are unalike.
The higher the similarity value, the greater the similarity between objects.
A dissimilarity measure returns a value of 0 if the objects are the same.
The higher the dissimilarity value, the more dissimilar the two objects are.

T4:Chapter 2.4
MEASURING DATA SIMILARITY AND DISSIMILARITY
Various proximity measures
• Data Matrix versus Dissimilarity Matrix
• Proximity Measures for Nominal Attributes
• Proximity Measures for Binary Attributes
• Symmetric Binary Attributes
• Asymmetric Binary Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Data
• Proximity Measures for Mixed Types
• Cosine Similarity

BITS Pilani, Pilani Campus


DATA MATRIX AND DISSIMILARITY MATRIX
Data matrix
– n data points with p dimensions
– Two modes

Dissimilarity matrix
– n data points, but registers only the distance
– A triangular matrix
– Single mode

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR NOMINAL ATTRIBUTES
Categorical Attribute
Attribute ‘Color‘ can take 2 or more states, e.g., red, yellow,
blue, green (generalization of a binary attribute)
• Simple matching
– m: # of matches, p: total # of variables
– Dissimilarity: d(i, j) = (p − m) / p
– Similarity: sim(i, j) = m / p = 1 − d(i, j)

BITS Pilani, Pilani Campus
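A sketch of the simple-matching measure on hypothetical objects:

```python
# Simple matching for nominal attributes: d(i, j) = (p - m) / p,
# where p = number of attributes and m = number of matching values

def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by two nominal attributes (colour, shape)
x = ("red", "circle")
y = ("red", "square")

d = nominal_dissimilarity(x, y)   # one of the two attributes matches
print(d, 1 - d)                   # dissimilarity 0.5, similarity 0.5
```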




EXERCISE- DISCUSSION

Calculate the dissimilarity matrix for the ordinal attributes

BITS Pilani, Pilani Campus


EXERCISE-CANVAS DISCUSSION

Calculate the dissimilarity matrix and similarity matrix for the ordinal
attributes

BITS Pilani, Pilani Campus


PROXIMITY MEASURE FOR BINARY ATTRIBUTES

• A contingency table for binary data

– where q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but equal 0 for object j,
– s is the number of attributes that equal 0 for object i but equal 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
– The total number of attributes is p, where p = q+r+s+t .

BITS Pilani, Pilani Campus




PROXIMITY MEASURE FOR BINARY ATTRIBUTES-
EXERCISE

BITS Pilani, Pilani Campus




PROXIMITY MEASURE FOR BINARY ATTRIBUTES-
SUMMARY
• Distance measure for symmetric binary variables (dissimilarity): d(i, j) = (r + s) / (q + r + s + t)
• Distance measure for asymmetric binary variables (dissimilarity): d(i, j) = (r + s) / (q + r + s)
• Similarity between asymmetric binary values is given by the Jaccard coefficient: sim(i, j) = q / (q + r + s) = 1 − d(i, j)

BITS Pilani, Pilani Campus
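A sketch of these binary measures on hypothetical records (1 = finding present, 0 = absent):

```python
# Contingency counts q, r, s, t for two binary objects, as defined above

def contingency(i, j):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def d_symmetric(i, j):
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):   # the 0/0 matches (t) carry no information
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s)

def jaccard(i, j):        # similarity for asymmetric binary attributes
    q, r, s, t = contingency(i, j)
    return q / (q + r + s)

# Hypothetical patient records over six asymmetric binary attributes
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
print(round(d_asymmetric(jack, mary), 3))   # -> 0.333
print(round(jaccard(jack, mary), 3))        # -> 0.667
```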


EXERCISE
Suppose that a patient record table contains the attributes
name, gender, fever, cough, test-1, test-2, test-3, and test-4,
where name is an object identifier, gender is a symmetric
attribute, and the remaining attributes are asymmetric binary.
Compute the dissimilarity matrix for asymmetric binary
attributes

BITS Pilani, Pilani Campus


EXERCISE

BITS Pilani, Pilani Campus


PROXIMITY MEASURES FOR NUMERIC ATTRIBUTES
• Euclidean distance: d(i, j) = sqrt( Σf (xif − xjf)² )
• Manhattan (city-block) distance: d(i, j) = Σf |xif − xjf|
• Minkowski distance: d(i, j) = ( Σf |xif − xjf|^h )^(1/h), which generalizes both (h = 2 gives Euclidean, h = 1 gives Manhattan)
• Supremum (Lmax, Chebyshev) distance: d(i, j) = maxf |xif − xjf|, the limit of the Minkowski distance as h → ∞
• Attributes are often normalized (e.g., z-score) before computing distances so that attributes with large ranges do not dominate.

BITS Pilani, Pilani Campus
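A sketch of the Minkowski family on hypothetical points (NumPy assumed available):

```python
import numpy as np

# Minkowski distance of order h between two numeric vectors
def minkowski(x, y, h):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float((np.abs(x - y) ** h).sum() ** (1.0 / h))

x = [1.0, 2.0]
y = [4.0, 6.0]
print(minkowski(x, y, 1))                       # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(x, y, 2))                       # Euclidean: sqrt(9 + 16) = 5.0
print(float(np.abs(np.subtract(x, y)).max()))   # supremum distance: 4.0
```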


COSINE SIMILARITY
Cosine similarity is a measure of similarity that can be used to compare documents or, say, give a ranking of documents with respect to a given vector of query words:

sim(x, y) = (x · y) / (||x|| ||y||)

where x · y is the inner product of the two vectors and ||x|| is the Euclidean norm of vector x.

BITS Pilani, Pilani Campus


EXERCISE
Suppose that x and y are the first two term-frequency vectors of a document data set. That is,
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1). How similar
are x and y? Compute the cosine similarity between the two vectors.

BITS Pilani, Pilani Campus
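A sketch solving this exercise in plain Python:

```python
import math

# Term-frequency vectors from the exercise
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))       # x . y = 25
norm_x = math.sqrt(sum(a * a for a in x))    # ||x|| = sqrt(42)
norm_y = math.sqrt(sum(b * b for b in y))    # ||y|| = sqrt(17)

cos_sim = dot / (norm_x * norm_y)
print(round(cos_sim, 4))   # -> 0.9356, so the two documents are quite similar
```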


PROXIMITY MEASURES FOR MIXED TYPE ATTRIBUTES
• A data set may contain nominal, binary, ordinal and numeric attributes together.
• Combine the per-attribute dissimilarities into a single measure:
d(i, j) = Σf δij(f) · dij(f) / Σf δij(f)
• dij(f) is the dissimilarity contributed by attribute f, and the indicator δij(f) is 0 if xif or xjf is missing (or if both are 0 for an asymmetric binary attribute) and 1 otherwise.

BITS Pilani, Pilani Campus


TABLE OF CONTENTS

1 Data Similarity & Dissimilarity Measure


2 Visualization Techniques for Data Exploratory Analysis
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA

INTRODUCTION TO DATA SCIENCE


BOXPLOT
A boxplot incorporates the
five-number summary.
The ends of the box are at the
quartiles.
The box length is the interquartile
range.
The median is marked by a line within
the box.
The whiskers outside the box extend
to the Minimum and Maximum
observations.
Computed in O(n log n) time.
HISTOGRAM
Graphical method for summarizing the distribution of an attribute, X .
If X is nominal
) Bar chart
) A vertical bar is drawn for each
known value of X .
) The height of the bar indicates the

frequency of that X value.


If X is numeric
) Histogram
) The range of values for X is partitioned into disjoint consecutive subranges or buckets or
bins.
) The range of a bucket is known as the width.
) The buckets are of equal width.
SCATTERPLOT
Determine if there appears to be a relationship, pattern, or trend between two
numeric attributes.
Provide a visualization of bi-variate data to see clusters of points and outliers, or
correlation relationships.
Correlations can be positive, negative, or null (uncorrelated).



DEMO CODE

Visualization.ipynb



TABLE OF CONTENTS

1 Data Similarity & Dissimilarity Measure


2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 Handling Numeric Data
4 MANAGING CATEGORICAL ATTRIBUTES
5 DEALING WITH TEXTUAL DATA



HANDLING NUMERIC DATA

Techniques are
Discretization – Convert numeric data into discrete categories
Binarization – Convert numeric data into binary categories
Normalization – Scale numeric data to a specific range
Smoothing
• which works to remove noise from the data. Techniques include binning, regression, and
clustering.
• random method, simple moving average, random walk, simple exponential, and
exponential moving average (Will learn in ISM)

T4:Chapter 3.5

DISCRETIZATION

Convert continuous attribute into a discrete attribute.


Discretization involves converting the raw values of a numeric attribute (e.g., age) into
) interval labels (e.g., 0–10, 11–20, etc.)
) conceptual labels (e.g., youth, adult, senior)
Discretization Process
) The raw data are replaced by a smaller number of interval or concept labels.
) This simplifies the original data and makes the mining more efficient.
) Concept hierarchies are also useful for mining at multiple abstraction levels.



CONCEPT HIERARCHY
Divide the range of a continuous attribute into intervals.
Interval labels can then be used to replace actual data values.
The labels, in turn, can be recursively organized into higher-level concepts.
This results in a concept hierarchy for the numeric attribute.



DISCRETIZATION TECHNIQUES
Discretization techniques can be categorized based on how the discretization is performed.

Supervised vs. Unsupervised discretization


) If the discretization process uses class information, then we say it is supervised
discretization. Otherwise, it is unsupervised.
Top-down discretization or Splitting
) The process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range. Then the process repeats recursively on the resulting
intervals.
Bottom-up discretization or Merging
) The process starts by considering all of the continuous values as potential split-points.
Removes some by merging neighborhood values to form intervals. Then recursively
applies this process to the resulting intervals.
DISCRETIZATION TECHNIQUES

Unsupervised discretization
) Binning [ Equal-interval, Equal-frequency] (Top-down split)
) Histogram analysis (Top-down split)
) Clustering analysis (Top-down split or Bottom-up merge)

) Correlation analysis (Bottom-up merge)

Supervised discretization
) Entropy-based discretization (Top-down split)

T1: Chapter 2.3.6

UNSUPERVISED DISCRETIZATION

Class labels are ignored.


The best number of bins k is determined experimentally.
User specifies the number of intervals and/or how many data points to be included in
any given interval.
Use Binning methods.



DISCRETIZATION BY BINNING METHODS

1 Equal Width (distance) binning


) Each bin has equal width.

width = interval = (max − min) / #bins
) Highly sensitive to outliers.
) If outliers are present, the width of each bin is large, resulting in skewed data.
2 Equal Depth (frequency) binning
) Specify the number of values that have to be stored in each bin.
) Number of entries in each bin are equal.
) Some values can be stored in different bins.

T4: Chapter 3.4.6

BINNING EXAMPLE

Discretize the following data into 3 discrete categories using binning technique.
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70.



BINNING EXAMPLE

Original 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
Data 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81
Method Bin1 Bin 2 Bin 3
Equal width= [53, 62) = [62, 72) = [72, 81] =
Width 81-53 = 28 53, 56, 57 63, 66, 67, 67, 72, 73, 75, 75,
28/3 = 9.33 67, 68, 69, 70, 76, 76, 78,
70, 70, 70 79, 80, 81
Equal depth = 53, 56, 57, 63, 68, 69, 70, 70, 75, 75, 76, 76,
Depth 24 /3 = 8 66, 67, 67, 67 70, 70, 72, 73 78, 79, 80, 81
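The two schemes in the table can be reproduced with a short pure-Python sketch (membership at the bin edges depends on rounding of the 28/3 width):

```python
data = sorted([70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81,
               53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70])
k = 3

# Equal-width binning: each bin spans (max - min) / k = 28/3 units
width = (data[-1] - data[0]) / k
ew_bins = [[] for _ in range(k)]
for v in data:
    idx = min(int((v - data[0]) // width), k - 1)  # clamp the max into the last bin
    ew_bins[idx].append(v)

# Equal-depth binning: each bin holds len(data) / k = 8 values
depth = len(data) // k
ed_bins = [data[i * depth:(i + 1) * depth] for i in range(k)]

print([len(b) for b in ew_bins])  # [3, 11, 10]
print([len(b) for b in ed_bins])  # [8, 8, 8]
```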



DEMO CODE

Binning.ipynb



DISCRETIZATION BY HISTOGRAM ANALYSIS

Histogram analysis is an unsupervised discretization technique because it does not


use class information.
Histograms use binning to approximate data distributions and are a popular form of
data reduction.
A histogram for an attribute, X, partitions the data distribution of X into disjoint
subsets, referred to as buckets or bins.
If each bucket represents only a single attribute–value/frequency pair, the buckets are
called singleton buckets.
Often, buckets represent continuous ranges for the given attribute.
The histogram analysis algorithm can be applied recursively to each partition in order
to automatically generate a multilevel concept hierarchy.
DISCRETIZATION BY HISTOGRAM ANALYSIS
1 Equal Width Histogram
) The values are partitioned into equal size partitions or ranges.
2 Equal Frequency Histogram
) The values are partitioned such that each partition contains the same number of data
objects.



VARIABLE TRANSFORMATION

Variable transformation involves changing the values of an attribute.


For each object (tuple), a transformation is applied to the value of the variable for that
object.
1 Simple functional transformations
2 Normalization

T1:Chapter 2.3.7


SIMPLE FUNCTIONAL TRANSFORMATION

Variable transformations should be applied with caution since they change the nature
of the data.
For instance, the transformation 1/x reduces the magnitude of values that are 1 or
larger, but increases the magnitude of values between 0 and 1.
To understand the effect of a transformation, it is important to ask questions such as:
) Does the order need to be maintained?
) Does the transformation apply to all values, especially negative values and 0?
) What is the effect of the transformation on the values between 0 and 1?



NORMALIZATION

Normalizing the data attempts to give all attributes an equal weight.


The goal of standardization or normalization is to make an entire set of values have a
particular property.
Normalization is particularly useful for:
) classification algorithms involving neural networks.
• normalizing the input values for each attribute in the training tuples will help speed up
the learning phase.
) distance measurements such as nearest-neighbor classification and clustering.
• normalization helps prevent attributes with initially large ranges (e.g., income)
from outweighing attributes with initially smaller ranges (e.g., binary attributes).



WHY FEATURE SCALING?

Features with bigger magnitude dominate over the features with smaller magnitudes.
Good practice to have all variables within a similar scale.
Euclidean distances are sensitive to feature magnitude.
Feature scaling helps decrease the time of finding support vectors.



WHY FEATURE SCALING?
For distance-based methods, normalization helps prevent attributes with initially large
ranges (e.g., income) from out-weighing attributes with initially smaller ranges (e.g.,
binary attributes).



ALGORITHMS SENSITIVE TO FEATURE MAGNITUDE

Linear and Logistic Regression


Neural Networks
Support Vector Machines
KNN
K-Means Clustering
Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)



NORMALIZATION

Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any
other.
Techniques
) Min-Max normalization
) z-score normalization
) Decimal normalization

T4:Chapter 3.5.2

MIN-MAX SCALING

Min-max scaling squeezes (or stretches) all feature values to be within the range of
[0, 1]:

x̂ = (x − min(x)) / (max(x) − min(x))

Min-max normalization preserves the relationships among the original data values.



MIN-MAX NORMALIZATION

Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to value of
$73,600.
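Worked directly from the min-max formula (mapping onto [0, 1]):

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # rescale v from [old_min, old_max] onto [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # (73600-12000)/86000 -> 0.716
```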



Z-SCORE NORMALIZATION

In z-score normalization (or zero-mean normalization), the values for an attribute, x ,


are normalized based on the mean µ(x ) and standard deviation σ(x ) of x .
The resulting scaled feature has a mean of 0 and a variance of 1.
New range is [−3σ, +3σ].

x̂ = (x − µ(x)) / σ(x)

z-score normalization is useful when the actual minimum and maximum of attribute X
are unknown, or when there are outliers that dominate the min-max normalization.



Z-SCORE NORMALIZATION

Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. Apply z-score normalization to value of $73,600.
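Worked directly from the z-score formula:

```python
def z_score(v, mean, std):
    # standardize v using the attribute's mean and standard deviation
    return (v - mean) / std

print(z_score(73600, 54000, 16000))  # (73600-54000)/16000 = 1.225
```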



DECIMAL NORMALIZATION

Normalizes by moving the decimal point of values of attribute x.


The number of decimal points moved depends on the maximum absolute value of x.
x̂ = x / 10^j, where j is the smallest integer such that max(|x̂|) < 1.
New range is [−1, +1].



DECIMAL NORMALIZATION

Example 1
CGPA Formula Normalized CGPA
2 2/10 0.2
3 3/10 0.3
Example 2
Bonus Formula Normalized Bonus
450 450/1000 0.45
310 310/1000 0.31
Example 3
Salary Formula Normalized Salary
48000 48000/100000 0.48
67000 67000/100000 0.67
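All three tables follow one rule: divide by the smallest power of ten that pushes every absolute value below 1. A minimal sketch:

```python
import math

def decimal_scale(values):
    # j is the smallest integer such that max(|v|) / 10**j < 1
    largest = max(abs(v) for v in values)
    j = math.ceil(math.log10(largest))
    if largest / 10 ** j >= 1:  # exact powers of ten need one more shift
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scale([48000, 67000]))  # [0.48, 0.67]
```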



DEMO CODE

Normalization.ipynb



TABLE OF CONTENTS

1 Data Similarity & Dissimilarity Measure


2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 Managing Categorical Attributes
5 DEALING WITH TEXTUAL DATA



CATEGORICAL ENCODING

We need to convert categorical columns to numeric columns so that a machine


learning algorithm understands it.

Categorical encoding is a process of converting categories to numbers.

• Binarization maps a continuous or categorical


attribute into one or more binary attributes.
• Must maintain ordinal relationship.
• Algorithms that find association patterns require that the
data be in the form of binary attributes.
• E.g., Apriori algorithm, Frequent Pattern (FP) Growth
algorithm

CATEGORICAL ENCODING TECHNIQUES

One-hot encoding
Label Encoding





ONE-HOT ENCODING
Encode each categorical variable with a set of Boolean variables which take values 0
or 1, indicating if a category is present for each observation.
One binary attribute for each categorical value.
Advantages
) Makes no assumption about the distribution or categories of the categorical variable.
) Keeps all the information of the categorical variable.
) Suitable for linear models.

Disadvantages
) Expands the feature space.
) Does not add extra information while encoding.
) Many dummy variables may be identical, introducing redundant information.

) Number of resulting attributes may become too large.

In multi-class classification, the class label is converted using one-hot encoding.


ONE-HOT ENCODING EXAMPLE

Assume an ordinal attribute for representing service of a restaurant:


(Awful < Poor < OK < Good < Great ) requires 5 bits to maintain the ordinal
relationship.
Service Quality X1 X2 X3 X4 X5
Awful 0 0 0 0 1
Poor 0 0 0 1 0
OK 0 0 1 0 0
Good 0 1 0 0 0
Great 1 0 0 0 0
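A minimal sketch of the same encoding (the bit ordering is arbitrary; the table above happens to reverse it):

```python
categories = ["Awful", "Poor", "OK", "Good", "Great"]

def one_hot(value):
    # one binary attribute per category; exactly one position is set
    return [1 if c == value else 0 for c in categories]

print(one_hot("OK"))  # [0, 0, 1, 0, 0]
```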



LABEL ENCODING
Replace the categories by digits from 1 to n (or 0 to n − 1, depending the
implementation), where n is the number of distinct categories of the variable.
The categories are arranged in ascending order and the numbers are assigned.
Advantages
) Straightforward to implement.
) Does not expand the feature space.
) Works well enough with tree-based algorithms.

Disadvantages
) Does not add extra information while encoding.
) Not suitable for linear models.
) Does not handle new categories in test set automatically.

Used for features that take multiple values in their domain, e.g., colour, protocol type.
LABEL ENCODING EXAMPLE

Assume an ordinal attribute for representing service of a restaurant: (Awful, Poor, OK,
Good, Great)

Service Quality Integer Value


Awful 0
Poor 1
OK 2
Good 3
Great 4
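The same mapping as a dictionary built from the ordered category list:

```python
categories = ["Awful", "Poor", "OK", "Good", "Great"]  # ascending order
label_of = {c: i for i, c in enumerate(categories)}

print([label_of[s] for s in ["OK", "Great", "Awful"]])  # [2, 4, 0]
```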



BINARY ENCODING

If there are m categorical values, then uniquely assign each original value to an integer
in the interval [0,m - 1].

If the attribute is ordinal, then order must be maintained by the assignment

T1: Chapter 2.3.6


DEMO CODE

Encoding.ipynb



TABLE OF CONTENTS

1 Data Similarity & Dissimilarity Measure


2 VISUALIZATION TECHNIQUES FOR DATA EXPLORATORY ANALYSIS
3 HANDLING NUMERIC DATA
4 MANAGING CATEGORICAL ATTRIBUTES
5 Dealing with Textual Data



STEPS INVOLVED IN THE TEXTUAL DATA PROCESSING

Removing special characters, changing the case (up-casing and down-casing).


Tokenization – process of discretizing words within a document.
Creating Document Vector or Term Document Matrix.
Filtering Stop Words
Lexical Substitution
Stemming / Lemmatization
POS (Part of Speech) tagging



TOKENIZATION

Document – In the text mining context, each sentence is considered a distinct


document.
Token – Each character or word or sentence is called a token.
Tokenization – The process of discretizing tokens within a document is called
tokenization.



DOCUMENT VECTOR OR TERM DOCUMENT MATRIX

Create a matrix where each column consists of a token and the cells show the counts
of the number of times a token appears.
Each token is now an attribute in standard data science parlance and each document
is an example (record).
Unstructured raw data is now transformed into a format that is recognized by machine
learning algorithms for training.
The matrix / table is referred to as Document Vector or Term Document Matrix (TDM)
As more new statements are added that have little in common, we end up with a very
sparse matrix.
We could also choose to use the term frequencies (TF) for each token instead of
simply counting the number of occurrences.
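A minimal term-document matrix built with the standard library (the two documents here are made up for illustration):

```python
from collections import Counter

docs = ["this is a rainy day", "this is a sunny day"]
vocab = sorted({t for d in docs for t in d.split()})  # one column per token
tdm = [[Counter(d.split())[t] for t in vocab] for d in docs]

print(vocab)  # ['a', 'day', 'is', 'rainy', 'sunny', 'this']
print(tdm)    # [[1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1]]
```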


STOP WORDS

There are common words such as ”a,” ”this,” ”and,” and other similar
terms. They do not really convey specific meaning.
Most parts of speech such as articles, conjunctions, prepositions, and pronouns need
to be filtered before additional analysis is performed.
Such terms are called stop words.
Stop word filtering is usually the second step that follows immediately after
tokenization.
The document vector gets reduced significantly after applying standard English stop
word filtering.



STOP WORDS

Domain specific terms might also need to be filtered out.


) For example, if we are analyzing text related to the automotive industry, we may want to
filter out terms common to this industry such as ”car,” ”automobile,” ”vehicle,” and so
on.
This is generally achieved by creating a separate dictionary where these context
specific terms can be defined and then term filtering can be applied to remove them
from the data.



LEXICAL SUBSTITUTION

Lexical substitution is the process of finding an alternative for a word in the context
of a clause.
It is used to align all the terms to the same term based on the field or subject which is
being analyzed.
This is especially important in areas with specific jargon, e.g., in clinical settings.
Example: common salt, NaCl, sodium chloride can be replaced by NaCl.
Domain specific

Paper - Lexical Substitution for the Medical Domain


URL - https://aclanthology.org/D14-1066.pdf

STEMMING

Stemming is usually the next process step following term filtering.


Words such as ”recognized,” ”recognizable,” or ”recognition” may be encountered
in different usages, but contextually they may all imply the same meaning.
The root of all these highlighted words is ”recognize.”
The conversion of unstructured text to structured data can be simplified by reducing
terms in a document to their basic stems, because only the occurrence of the root
terms has to be taken into account.
This process is called stemming.



PORTER STEMMING

The most common stemming technique for text mining in English is the Porter
Stemming method.
Porter stemming works on a set of rules where the basic idea is to remove and/or
replace the suffix of words.
) Replace all terms which end in ’ies’ by ’y,’ such as replacing the term ”anomalies” with
”anomaly.”
) Stem all terms ending in ”s” by removing the ”s,” as in ”algorithms” to ”algorithm.”
While the Porter stemmer is extremely efficient, it can make mistakes that could prove
costly.
) ”arms” and ”army” would both be stemmed to ”arm,” which would result in somewhat
different contextual meanings.
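A toy illustration of just the two suffix rules quoted above (not the full Porter algorithm):

```python
def toy_stem(word):
    # two Porter-style rules only: 'ies' -> 'y', then strip a trailing 's'
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s"):
        return word[:-1]
    return word

print(toy_stem("anomalies"))   # anomaly
print(toy_stem("algorithms"))  # algorithm
print(toy_stem("arms"))        # arm
```

The real Porter stemmer applies many more rules in sequence and, as the slide notes, can conflate words such as "arms" and "army" whose stems coincide but whose meanings differ.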



LEMMATIZATION

Lemmatization reduces a word to its dictionary base form (lemma) using
vocabulary and morphological analysis, rather than the crude suffix
stripping used by stemmers.


POS Tagging

Lemmatization uses POS Tagging (Part of Speech Tagging) heavily.


POS Tagging is the process of attributing a grammatical label to every part of a
sentence.
) Eg: ”Game of Thrones is a television series.”
) POS Tagging:
({”game”:”NN”},{”of”:”IN”},{”thrones”:”NNS”},{”is”:”VBZ”},{”a”:”DT”},
{”television”:”NN”},{”series”:”NN”})
where: NN = noun, IN = preposition, NNS = noun in its plural form, VBZ = third-person
singular verb, and DT = determiner.



DEMO CODE

NLP.ipynb



Example

For the given documents, remove stop words and lemmatize.


● D1: An apple is a fruit, which is red in colour.
● D2: An orange fruit is orange in colour.
● D3. A kiwi fruit is green coloured fruit.



Example

● Remove stop words according to standard English.


○ D1: apple fruit red colour
○ D2: orange fruit orange colour
○ D3: kiwi fruit green coloured fruit
● Lemmatize
○ D1: apple fruit red colour
○ D2: orange fruit orange colour
○ D3. kiwi fruit green colour fruit
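The stop-word step above can be sketched with a tiny illustrative stop list (a real pipeline would use a standard list, e.g. NLTK's):

```python
stop_words = {"an", "a", "is", "which", "in"}  # tiny illustrative list

def filter_stops(sentence):
    # lowercase, strip trailing punctuation, then drop stop words
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    return [t for t in tokens if t not in stop_words]

print(filter_stops("An apple is a fruit, which is red in colour."))
# ['apple', 'fruit', 'red', 'colour']
```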



Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar (T1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and
Micheline Kamber, Morgan Kaufmann Publishers, 2006 (T4)
Data Science – Concepts and Practice by Vijay Kotu and Bala Deshpande (CH -
9.1)
THANK YOU



INTRODUCTION TO DATA SCIENCE
MODULE # 4 : DATA WRANGLING
IDS Course Team
BITS Pilani
The instructor gratefully acknowledges the authors who made their
course materials freely available online.



Recap
Session 1
• What is Data Science
• Why Data Science and why now, ex: Moneyball movie
• Real-world applications, ex: Facebook, Amazon, Uber
• Data Science Challenges and Bias
• Roles in Data Science Team
• Organization of Data Science Team

Session 2
• Data Analytics
• Case Studies – (COVID, Neuro Informatics)
• Data Analytics Methodologies (CRISP-DM)

Session 3
• Data Analytics Methodologies (SEMMA, SMAM, DataOps, MLOps)
• Data (Features and Attributes)
• Types of attributes
• Data sets

TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE



Statistical Description of Data

• Measuring the central tendency


• Measuring the dispersion
• Boxplot Analysis
MEASURES OF CENTRAL TENDENCY

Gives an idea of the central tendency of the data.


Measures of central tendency include the mean, median, mode, and midrange.
Let x1, x2, . . . , xN be the set of N observed values or observations for
numeric attribute X . Assume X is sorted in increasing order.



MEAN

The (arithmetic) mean of the N observations is

mean = (x1 + x2 + · · · + xN) / N


MEDIAN

If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any
value in between.
If X is a numeric attribute, the median is taken as the average of the two middlemost
values.
Issue: Median is expensive to compute when we have a large number of
observations.



MODE

Mode for a set of data is the value that occurs most frequently in the set.
Mode can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal. In general, a data set with two or more modes is multimodal.



SYMMETRIC DATA AND SKEWED DATA
In a unimodal frequency curve with perfect symmetric data distribution, the
mean, median, and mode are all at the same center value.

mean − mode ≈ 3(mean − median)


Data in most real applications are not symmetric.
) Positively skewed – the mode occurs at a value that is smaller than the median.
) Negatively skewed – the mode occurs at a value greater than the median.

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Kurtosis

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
MIDRANGE

Average of minimum and maximum values.

midrange = (min + max) / 2



EXAMPLE

X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 58
median = (52 + 56) / 2 = 54
mode = 52, 70
midrange = (30 + 110) / 2 = 70



DATA DISPERSION MEASURES

Range
Quartiles, and interquartile range
Five-number summary and boxplots
Variance and standard deviation



RANGE

The range of the set is the difference between the largest and smallest values.

range = max − min



QUANTILES

Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k )/q of the data values are more
than x , where k is an integer such that 0 < k < q.
There are q − 1 q-quantiles.



QUANTILES
QUARTILES OR PERCENTILES

Three data points that split the data distribution into four equal parts
Each part represents one-fourth of the data distribution.
Q1 is the 25th percentile and Q3 is the 75th percentile
Quartiles give an indication of a distribution’s center, spread, and shape



INTERQUARTILE RANGE (IQR)

Distance between the first and third quartiles


Measure of spread that gives the range covered by the middle half of the data.

IQR = Q3 − Q1

Identifying outliers as values falling at least 1.5 × IQR above the third quartile or
below the first quartile.



FIVE-NUMBER SUMMARY

The five-number summary of a distribution consists of the median (Q2), the quartiles
Q1 and Q3 , and the smallest and largest individual observations.
Written in the order

Five number Summary = [Minimum, Q1, Median, Q3, Maximum]
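A sketch using the standard library, on the example data from the central-tendency slide. Quartile conventions differ between tools; `statistics.quantiles` with `method="inclusive"` is one common textbook convention:

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
summary = [min(data), q1, q2, q3, max(data)]
iqr = q3 - q1

print(summary)         # [30, 49.25, 54.0, 64.75, 110]
print(iqr)             # 15.5
print(q3 + 1.5 * iqr)  # 88.0, so 110 is flagged as an outlier
```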



Example
Dispersion of the Data - Boxplot

Five-number summary of a distribution


Minimum, Q1, Median, Q3, Maximum

Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to Minimum and Maximum
Outliers: points beyond a specified outlier threshold, plotted individually
VARIANCE

Variance and standard deviation indicate how spread out a data distribution is.

σ^2 = (1/N) · Σ (xi − mean)^2


Example

Heights of 5 different dogs are - 600 mm, 470 mm, 170 mm, 430 mm and 300 mm

Mean – 394 mm
Example

Variance = 21704
Standard Deviation = √21704 ≈ 147
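The dog-height example, checked with the standard library (population variance, i.e. dividing by N as in the slide):

```python
import math
import statistics

heights = [600, 470, 170, 430, 300]   # dog heights in mm
mean = statistics.fmean(heights)      # 394.0
var = statistics.pvariance(heights)   # sum of squared deviations / N = 21704
std = math.sqrt(var)

print(round(std))  # 147
```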
STANDARD DEVIATION

Standard deviation σ of the observations is the square root of the variance σ2.
A low standard deviation means that the data observations tend to be very close to
the mean.
A high standard deviation indicates that the data are spread out over a large range of
values.
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise, σ > 0.



TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE



DATA PREPARATION



DATA CLEANSING

Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
) Interpretation error
Age < 150
Height of a person < 7 feet.
Price is positive.
) Inconsistencies between data sources or against your company’s standardized values.
Female and F
Feet and meter
Dollars and Pounds



DATA CLEANSING
Errors from data entry
) Cause
Typos
Errors due to lack of concentration
Machine or hardware failure
) Detection
Frequency table
) Correction
Simple assignment statements
If-then-else rules
White-spaces and typos
) Remove leading and trailing white-spaces.
) Change case of the alphabets from upper to lower.
DATA CLEANSING

Physically impossible values


) Examples
Age < 100
Height of a person is less than 7 feet.
Price is positive.
) If-then-else rules
Outliers
) Use visualization techniques like box plots or scatter plots.
) Use statistical summary with minimum and maximum values.
) Identifying outliers as values falling at least 1.5 × IQR above the third quartile or below
the first quartile.



DATA CLEANSING
Missing values



MISSING VALUES

Ignore the tuple.


) Used when the class label is missing in a classification task.
) Not very effective, unless the tuple contains several attributes with missing values.
) Poor technique when the percentage of missing values per attribute varies considerably.

) By ignoring the tuple, we do not make use of the remaining attributes’ values in the

tuple. Such data could have been useful to the task at hand.
Fill in the missing value manually.
) Time consuming.
) May not be feasible given a large data set with many missing values.



MISSING VALUES

Use a global constant to fill in the missing value.


) Replace all missing attribute values by the same constant such as a label like
”Unknown” or -1.
) If missing values are replaced by, say, ”Unknown,” then the mining program may

mistakenly think that they form an interesting concept, since they all have a value in
common—that of ”Unknown.” Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute.
) Central tendency indicates the ”middle” value of a data distribution. E.g., mean or
median
) For normal (symmetric) data distributions, the mean can be used.

) Skewed data distribution should employ the median.
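A minimal sketch of central-tendency imputation on a hypothetical income column (the values are made up):

```python
import statistics

incomes = [30000, 45000, None, 52000, None, 61000]  # None marks a missing value
observed = [v for v in incomes if v is not None]
fill = statistics.median(observed)  # median is the safer choice for skewed data
cleaned = [v if v is not None else fill for v in incomes]

print(cleaned)  # [30000, 45000, 48500.0, 52000, 48500.0, 61000]
```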



MISSING VALUES

Use the attribute mean or median for all samples belonging to the same class as the
given tuple.
) For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that
of the given tuple.
) If the data distribution for a given class is skewed, the median value is a better choice.
Use the most probable value to fill in the missing value.
) This may be determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction.
) For example, using the other customer attributes in the data set, we may construct a

decision tree to predict the missing values for income.


) Most popular strategy.



DATA CLEANSING

Deviations from code-book


) A code book is a description of your data. It contains things such as the number of
variables per observation, the number of observations, and what each encoding within a
variable means.
) Discrepancies between the code-book and the data should be corrected.

Different units of measurement


) Pay attention to the respective units of measurement.
) Simple conversion can rectify.
Different levels of aggregation
) Data set containing data per week versus one containing data per work week.
) Data summarization will fix it.



NOISY DATA

Noise is a random error or variance in a measured variable. Outliers may represent


noise.
Noisy data can be removed by using smoothing techniques.
) Binning
Smoothing by bin means
Smoothing by bin medians
Smoothing by bin boundaries
) Regression
) Outlier Analysis
) Concept hierarchies are a form of data discretization that can also be used for data

smoothing.
For example: A concept hierarchy for price may map real price values into three
categories: inexpensive, moderately priced, and expensive.
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE



COMBINING DATA

Two operations to combine information from different data sets.


) Joining
Enriching an observation from one table with information from another table.
Requires primary keys or candidate keys.
Use views to virtually combine data.
) Appending or stacking
Adding the observations of one table to those of another table.



DATA AGGREGATION

Aggregation is combining two or more objects into a single object.


) Consider a data set consisting of transactions (data objects) recording the daily sales of
products in various store locations (Minneapolis, Chicago, Paris, ...) for different days
over the course of a year.
) One way to aggregate transactions for this data set is to replace all the transactions of a
single store with a single store-wide transaction.
) This reduces
The hundreds or thousands of transactions that occur daily at a specific store to a single
daily transaction.
The number of data objects is reduced to the number of stores.

INTRODUCTION TO DATA SCIENCE


DATA AGGREGATION

To create the aggregate transaction that represents the sales of a single store or date.
Quantitative attributes, such as price, are typically aggregated by taking a sum or an
average.
Qualitative attribute, such as item description, can either be omitted or summarized
as the set of all the items that were sold at that location.
The data in the table can also be viewed as a multidimensional array, where each
attribute is a dimension.
) Aggregation is the process of eliminating attributes (such as the type of item) or reducing
the number of values for a particular attribute (e.g., reducing the possible values for date
from 365 days to 12 months).
) Commonly used in Online Analytical Processing (OLAP).
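A store-wide aggregation like the one described can be sketched as follows (the transactions are illustrative): the quantitative attribute is summed, the qualitative one collected as a set.

```python
# Aggregate per-item transactions into one summary per store: sum the
# quantitative attribute (amount), collect the qualitative one (items sold).
from collections import defaultdict

transactions = [
    ("Minneapolis", "shoes", 120.0),
    ("Minneapolis", "socks", 15.0),
    ("Chicago", "shoes", 99.0),
]

totals, items = defaultdict(float), defaultdict(set)
for store, item, amount in transactions:
    totals[store] += amount   # quantitative attribute: sum
    items[store].add(item)    # qualitative attribute: set of items sold

print(dict(totals))  # {'Minneapolis': 135.0, 'Chicago': 99.0}
```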

INTRODUCTION TO DATA SCIENCE


DATA AGGREGATION

Advantages
) Require less memory and processing time.
) Provides a high-level view of the data instead of a low-level view.
) The behavior of groups of objects or attributes is often more stable than that of
individual objects or attributes.
Disadvantage
) Potential loss of interesting details.

INTRODUCTION TO DATA SCIENCE


DATA SAMPLING

A process by which representative samples are selected from a well defined


population is known as sampling.
Sampling is a technique used for selecting a subset of the data objects to be
analyzed.
The motivations for sampling in statistics and data mining are often different.
) Statisticians use sampling because obtaining the entire set of data of interest is too
expensive or time consuming.
) Data miners sample because it is too expensive or time consuming to process all the
data.
In some cases, using a sampling algorithm can reduce the data size to the point
where a better, but more expensive algorithm can be used.

INTRODUCTION TO DATA SCIENCE


S AMPLING TECHNIQUES

Sampling Techniques

) Probabilistic Sampling
Simple Random Sampling
Systematic Sampling
Stratified Random Sampling
Cluster Sampling
) Non-Probabilistic Sampling
INTRODUCTION TO DATA SCIENCE
PROBABILISTIC SAMPLING TECHNIQUES
PROBABILISTIC SAMPLING means that every item in the population has an equal
chance of being included in the sample.
SIMPLE RANDOM SAMPLING means that every case of the population has an equal
probability of inclusion in the sample.
Eg: Randomly picking mango from a basket of fruits.
Sampling without replacement
Sampling with replacement
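Both variants can be sketched with the Python standard library (population and sample sizes are illustrative):

```python
# Simple random sampling with and without replacement.
import random

population = list(range(100))
random.seed(42)  # fixed seed only so the sketch is reproducible

without = random.sample(population, k=10)     # without replacement: no repeats
with_repl = random.choices(population, k=10)  # with replacement: repeats possible

print(sorted(without))
```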

INTRODUCTION TO DATA SCIENCE


PROBABILISTIC SAMPLING TECHNIQUES

SYSTEMATIC SAMPLING is where every nth case after a random start is selected.
Eg: Picking every 5th fruit from a basket of fruits.
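The "every nth case after a random start" rule can be sketched as (the population here is an illustrative stand-in for the basket of fruits):

```python
# Systematic sampling: pick a random start, then every nth case after it.
import random

def systematic_sample(population, n):
    start = random.randrange(n)   # random start within the first interval
    return population[start::n]   # then every nth case

fruits = list(range(20))
sample = systematic_sample(fruits, 5)
print(len(sample))  # 4 — one case per interval of 5
```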

INTRODUCTION TO DATA SCIENCE


PROBABILISTIC SAMPLING TECHNIQUES
STRATIFIED SAMPLING is where the population is divided into strata and a
random sample is taken from each stratum.
Eg: One mango, one orange, one banana from a basket of fruits.
Two versions of stratified sampling:
Equal numbers of objects are drawn from each group even though the
groups are of different sizes.
The number of objects drawn from each group is proportional to the size
of that group.
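The proportional version can be sketched as follows (the strata here are illustrative fruit types; each stratum contributes in proportion to its size):

```python
# Proportional stratified sampling: split the population into strata, then
# draw from each stratum in proportion to its size.
import random
from collections import defaultdict

def stratified_sample(items, key, fraction):
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(random.sample(group, k))
    return sample

random.seed(1)  # fixed seed only so the sketch is reproducible
basket = [("mango", i) for i in range(6)] + [("orange", i) for i in range(3)]
picked = stratified_sample(basket, key=lambda f: f[0], fraction=0.34)
print(len(picked))  # 2 mangoes + 1 orange = 3
```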

INTRODUCTION TO DATA SCIENCE


TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURES

INTRODUCTION TO DATA SCIENCE


MEASURES OF P ROXIMITY

Similarity and dissimilarity measures are measures of proximity.


A similarity measure for two objects, i and j, will typically return the value 1 if they
are identical and 0 if the objects are unalike.
The higher the similarity value, the greater the similarity between objects.
A dissimilarity measure returns a value of 0 if the objects are the same.
The higher the dissimilarity value, the more dissimilar the two objects are.

INTRODUCTION TO DATA SCIENCE
MEASURES OF P ROXIMITY

Proximity Measures for Nominal Data


Proximity Measures for Numerical Attributes
Proximity Measures for Binary Attributes
Symmetric Binary Attributes
Asymmetric Binary Attributes
Proximity Measures for Ordinal Attributes
Proximity Measures for Mixed Types
Cosine Similarity

INTRODUCTION TO DATA SCIENCE
Data Matrix
PROXIMITY MEASURES FOR CATEGORICAL ATTRIBUTES

m is the number of matches – the number of attributes for which i and j are in the
same state.
p is the total number of attributes describing the objects.

d(i, j) = (p − m) / p

sim(i, j) = m / p
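A minimal implementation of these two measures (the example records and their attribute values are illustrative):

```python
# Matching-based proximity for nominal attributes:
#   d(i, j) = (p - m) / p   and   sim(i, j) = m / p

def nominal_dissimilarity(i, j):
    p = len(i)                             # total number of attributes
    m = sum(a == b for a, b in zip(i, j))  # number of matching states
    return (p - m) / p

obj1 = ("red", "circle", "small")
obj2 = ("red", "square", "small")
print(nominal_dissimilarity(obj1, obj2))  # (3 - 2) / 3 ≈ 0.33
```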

INTRODUCTION TO DATA SCIENCE
EXAMPLE

INTRODUCTION TO DATA SCIENCE
EXAMPLE

INTRODUCTION TO DATA SCIENCE
Introduction to Data Mining by Tan, Steinbach and Vipin Kumar (T1)
The Art of Data Science by Roger D Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline
Kamber, Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic, O'Reilly Media, Inc., ISBN: 9781449374310

THANK YOU

INTRODUCTION TO DATA SCIENCE


BOXPLOT
A boxplot incorporates the
five-number summary.
The ends of the box are at the
quartiles.
The box length is the interquartile
range.
The median is marked by a line within
the box.
The whiskers outside the box extend
to the Minimum and Maximum
observations.
Computed in O(n log n) time.
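The five-number summary a boxplot draws can be computed with the standard library; note, as an assumption of this sketch, that quartile conventions differ slightly between tools.

```python
# Five-number summary behind a boxplot: Minimum, Q1, Median, Q3, Maximum.
import statistics

def five_number_summary(values):
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # default "exclusive" method
    return min(data), q1, median, q3, max(data)

obs = [1, 3, 5, 7, 9, 11, 13]
print(five_number_summary(obs))  # (1, 3.0, 7.0, 11.0, 13)
```

The interquartile range (box length) is then simply q3 − q1.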
INTRODUCTION TO DATA SCIENCE
HISTOGRAM
Graphical method for summarizing the distribution of an attribute, X .
If X is nominal
) Bar chart
) A vertical bar is drawn for each
known value of X .
) The height of the bar indicates the

frequency of that X value.


If X is numeric
) Histogram
) The range of values for X is partitioned into disjoint consecutive subranges or buckets or
bins.
) The range of a bucket is known as the width.
) The buckets are of equal width.
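Equal-width bucketing can be sketched as follows (the ages and bin count are illustrative):

```python
# Equal-width histogram binning: partition [min, max] into equal-width
# buckets and count the frequency per bucket.

def histogram_counts(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp max into last bin
        counts[idx] += 1
    return counts

ages = [21, 22, 25, 30, 31, 35, 40, 58]
print(histogram_counts(ages, 4))  # bucket width = (58 - 21) / 4 = 9.25
```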
INTRODUCTION TO DATA SCIENCE
SCATTERPLOT
Determine if there appears to be a relationship, pattern, or trend between two
numeric attributes.
Provide a visualization of bi-variate data to see clusters of points and outliers, or
correlation relationships.
Correlations can be positive, negative, or null (uncorrelated).
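The correlation a scatterplot suggests can be quantified with the Pearson coefficient (a stdlib sketch; the data points are illustrative):

```python
# Pearson correlation coefficient: +1 for perfectly positive linear
# relationships, -1 for perfectly negative, near 0 for uncorrelated data.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))  # ≈ 1.0  (positive correlation)
print(pearson(x, [10, 8, 6, 4, 2]))  # ≈ -1.0 (negative correlation)
```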

INTRODUCTION TO DATA SCIENCE


TEXT BOOKS

T1 Introduction to Data Mining, by Tan, Steinbach and Vipin Kumar


T2 Introducing Data Science by Cielen, Meysman and Ali
T3 Storytelling with Data, A data visualization guide for business professionals, by
Cole, Nussbaumer Knaflic; Wiley
T4 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and
Micheline Kamber Morgan Kaufmann Publishers, 2006

INTRODUCTION TO DATA SCIENCE 6 / 79


REFERENCE BOOKS

R1 The Art of Data Science by Roger D Peng and Elizabeth Matsui


R2 Ethics and Data Science by DJ Patil, Hilary Mason, Mike Loukides
R3 Python Data Science Handbook: Essential tools for working with data by Jake
VanderPlas
R4 KDD, SEMMA and CRISP-DM: A Parallel Overview , Ana Azevedo and M.F.
Santos, IADS-DM, 2008

INTRODUCTION TO DATA SCIENCE 7 / 79


EVALUATION SCHEDULE

No    Name                  Type      Duration       Weight   Remarks

EC1   Quiz I                Online    1 hr           5%       Average of both quizzes
      Quiz II               Online    1 hr           5%
      Assignment Part I     Online    4 weeks        10%      Sum of both
      Assignment Part II    Online    4 weeks        15%      assignments

EC2   Mid-sem               Online    As announced   30%

EC3   Compre-sem Regular    Online    As announced   40%

INTRODUCTION TO DATA SCIENCE 8 / 79


LEARNING PLATFORM

Most relevant and up to date info on eLearn


Handout
Schedule for Quiz, and Assignments
Session Slide Deck
Demo Lab Sheets
Quiz-I, Quiz-II
Assignment

The video recording will be available in Lecture delivery platform.

INTRODUCTION TO DATA SCIENCE 9 / 79


TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 11 / 79


DATA SCIENCE?

• “Data Science is the sexiest job of the 21st century” – IBM.

• Data Science is one of the fastest growing fields in the world.

• According to the U.S. Bureau of Labor Statistics, 11.5 million new


jobs will be created by the year 2026.

• Even with COVID-19 situation, and the amount of shortage in talent,


there was no dip in data science as a career option.

INTRODUCTION TO DATA SCIENCE 12 / 79


WHAT IS DATA SCIENCE

Data Science is a ‘concept to unify


statistics, data analysis, machine learning
and their related methods’ in order to
‘understand and analyze actual
phenomena with data’ - Wikipedia

https://medium.com/analytics-vidhya/introduction-to-data-science-28deb32878e7
DATA SCIENCE

Data Science is a study of data.


Data Science is an art of uncovering insights and trends that are hiding behind the
data.
Data Science helps to translate data into a story. Storytelling helps in uncovering
insights, and the insights help in making decisions or strategic choices.
Data Science is the process of using data to understand different things.
• Requires a major effort of preparing, cleaning, scrubbing, or standardizing the data.
• Algorithms are then applied to crunch pre-processed data.
• This process is iterative and requires analysts’ awareness of the best practices.
• The most important aspect of data science is interpreting the results of the analysis in
order to make decisions.

INTRODUCTION TO DATA SCIENCE 14 / 79


WHY DATA SCIENCE?

• Discovering what we don’t know from the data

• Obtaining Predictive insights

• Helps to create data products

• Helps to make actionable decisions

• Communicating stories from data


• Increases confidence in making valuable decisions that increases
business values
Case Study - Autonomous driving car

• Finding the car's path and detecting potential hazards.

• Analyzing sensor data to understand the vehicle's environment.

• Investigating why a car made an incorrect decision or encountered an

issue.

• Optimizing the route for fuel efficiency and time.


DATA SCIENCE

In India, the average salary of a data scientist as of January 2024 is Rs. 9 to


19L/yr. – Glassdoor, 2025.
The increase in data science as a career choice in 2025 will also see the rise in its
various job roles.
• Data Engineer
• Data Administrator
• Machine Learning Engineer
• Statistician
• Data and Analytics Manager

INTRODUCTION TO DATA SCIENCE 17 / 79


Why Data Science? – Why now?

Tons of data.
Powerful algorithms.
Open software and tools.
Computational speed, accuracy and cost.
Data storage in terms of capacity and cost.

INTRODUCTION TO DATA SCIENCE 18 / 79


DATA SCIENCE, AI AND ML

Artificial Intelligence
• AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
• Considered a sub-field of or one of the tools of AI.
• Involves providing machines with the capability of learning from experience.
• Experience for machines comes in the form of data.
Data Science
• Data science is the application of machine learning, artificial intelligence, and other
quantitative fields like statistics, visualization, and mathematics to uncover insights from
data to enable better decision marking.

INTRODUCTION TO DATA SCIENCE 19 / 79


DATA SCIENCE, AI AND ML

https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
INTRODUCTION TO DATA SCIENCE 20 / 79
Why Data Science ?
Case Study – Moneyball: The Art of Winning an Unfair Game
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 23 / 79


USE CASES OF DATA SCIENCE

DataFlair
INTRODUCTION TO DATA SCIENCE 24 / 79
DATA SCIENCE IN FACEBOOK

Social Analytics
Quantitative research
Makes use of deep learning
Makes use of “DeepText”
It uses targeted advertising.

INTRODUCTION TO DATA SCIENCE 25 / 79


DATA SCIENCE IN AMAZON

Improving E-Commerce Experience


▪ Predictive Analysis
▪ Anticipatory shipping model
▪ Price Discounts
▪ Fraud Detection
▪ Improving Packaging Efficiency

INTRODUCTION TO DATA SCIENCE 26 / 79


DATA SCIENCE IN UBER
Improving Rider Experience
Uber maintains large database of drivers, customers, and several other records.
Makes extensive use of Big Data and crowdsourcing to derive insights and provide
best services to its customers.
Dynamic pricing
• Use of big Data and data science to calculate fares based on specific parameters.
• Uber matches customer profile with the most suitable driver and charges them based on
the time it takes to cover the distance rather than the distance itself.
• The time of travel is calculated using algorithms that make use of data related to traffic
density and weather conditions.
• When the demand is higher (more riders) than supply (less drivers), the price of the ride
goes up.

INTRODUCTION TO DATA SCIENCE 27 / 79


DATA SCIENCE IN BANK OF AMERICA
Improving Customer Experience
Erica – a virtual financial assistant (BoA)
• Erica serves as a customer advisor to over 45 million users around the world.
• Erica makes use of Speech Recognition to take customer inputs.
Fraud detection
• Uses data science and predictive analytics to detect frauds in payments, insurance,
credit cards, and customer information.
Risk modeling
• Use data science for risk modeling to regulate financial activities.
Customer segmentation
• Segment their customers in the high-value and low-value segments.
• Data scientists make use of clustering, logistic regression, and decision trees to help
the banks understand the Customer Lifetime Value (CLV) and group customers into
the appropriate segments.
INTRODUCTION TO DATA SCIENCE 28 / 79
DATA SCIENCE IN AIRBNB

Improving Customer Experience


Providing better search results
• Uses big data of customer and host information, homestays and lodge records, and
website traffic.
• Uses data science to provide better search results to its customers and find compatible
hosts.
Detecting bounce rates
• Use of demographic analytics to analyze bounce rates from their websites.
Providing ideal lodgings and localities
• Uses knowledge graphs where the user’s preferences are matched with the various
parameters to provide ideal lodgings and localities.

INTRODUCTION TO DATA SCIENCE 29 / 79


DATA SCIENCE IN SPOTIFY

Improving Customer Experience and recommendation


Providing better music streaming experience
• Provide personalized music recommendations.
• Uses over 600 GBs of daily data generated by the users to build its algorithms to boost
user experience.
Improving experience for artists and managers
• Spotify for Artists application allows the artists and managers to analyze their streams,
fan approval and the hits they are generating through Spotify’s playlists.

INTRODUCTION TO DATA SCIENCE 30 / 79


DATA SCIENCE IN SPOTIFY... CONTD..

Spotify uses data science to gain insights about which universities had the highest
percentage of party playlists and which ones spent the most time on it.
“Spotify Insights” publishes information about the ongoing trends in music.
Spotify’s Niland, an API based product, uses machine learning to provide better
searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.

INTRODUCTION TO DATA SCIENCE 31 / 79


APPLICATIONS OF DATA SCIENCE

DataFlair
INTRODUCTION TO DATA SCIENCE 32 / 79
TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 34 / 79


DATA SCIENCE CHALLENGES

Data science challenges can be categorized as:


Data related
Organization related
Technology related
People related
Skill related

INTRODUCTION TO DATA SCIENCE 35 / 79


COGNITIVE BIAS

Cognitive Biases are the distortions of reality because of the lens through which we
view the world.
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.

INTRODUCTION TO DATA SCIENCE 37 / 79


TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 38 / 79


ROLES IN DATA SCIENCE TEAM [1/7]

1. Chief Analytics Officer / Chief Data Officer


2. Data Analyst
3. Business analyst
4. Data Scientist (ML Engineer, Data Journalist)
5. Data Architect
6. Data Engineer
7. Application/Data Visualization Engineer

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 39 / 79
ROLES IN DATA SCIENCE TEAM [1/7]

[1] Chief Analytics Officer / Chief Data


Officer
• CAO, a “business translator,”
bridges the gap between data
science and domain expertise
acting both as a visionary and a
technical lead.
• Preferred skills: data science and
analytics, programming skills,
domain expertise, leadership and
visionary abilities.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 40 / 79
ROLES IN DATA SCIENCE TEAM [2/7]

[2] Data analyst


) The data analyst role implies proper data collection and interpretation activities.
) An analyst ensures that collected data is relevant and exhaustive while also interpreting
the analytics results.
) May require data analysts to have visualization skills to convert alienating numbers into

tangible insights through graphics. (eg: IBM or HP)


) Preferred skills: R, Python, JavaScript, C/C++, SQL

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 41 / 79
ROLES IN DATA SCIENCE TEAM [3/7]
3 Business analyst
) A business analyst basically realizes a CAO’s functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gulf.

) Preferred skills: data visualization, business intelligence, SQL.

4 Data scientist
) A data scientist is a person who solves business tasks using machine learning and data
mining techniques.
) The role can be narrowed down to data preparation and cleaning with further model

training and evaluation.


) Preferred skills: R, SAS, Python, Matlab, SQL, noSQL, Hive, Pig, Hadoop, Spark

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 42 / 79
ROLES IN DATA SCIENCE TEAM [4/7]
Job of a data scientist is often divided into two roles
[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills by
determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
) Training, monitoring, and maintaining a model.

) Preferred skills: R, Python, Scala, Julia, Java

[4B] Data Journalist


) Data journalists help make sense of data output by putting it in the right context.
) Articulating business problems and shaping analytics results into compelling stories.
) Present the idea to stakeholders and represent the data team with those unfamiliar with

statistics.
) Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 43 / 79
ROLES IN DATA SCIENCE TEAM [5,6/7]
5 Data architect
) Working with Big Data.
) This role is critical to warehouse the data, define database architecture, centralize data,
and ensure integrity across different sources.
) Preferred skills: SQL, noSQL, XML, Hive, Pig, Hadoop, Spark
6 Data engineer
) Data engineers implement, test, and maintain infrastructural components that data
architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in one

person.
) Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 44 / 79
ROLES IN DATA SCIENCE TEAM [7/7]

[7] Application/data visualization engineer


) This role is only necessary for a specialized data science model.
) An application engineer or other developers from front-end units will oversee end-user
data visualization.
) Preferred skills: programming, JavaScript (for visualization), SQL, noSQL.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 45 / 79
DATA SCIENTIST

Two types of data scientists:


Type A stands for Analysis
• This person is a statistician that makes sense of data without necessarily having strong
programming knowledge.
• Type A data scientists perform data cleaning, forecasting, modeling, visualization, etc.
Type B stands for Building
• These folks use data in production.
• They’re excellent software engineers with some statistics background who build
recommendation systems, personalization use cases, etc.

https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
INTRODUCTION TO DATA SCIENCE 46 / 79
SKILLSET FOR A DATA SCIENTIST

• Programming
• Quantitative analysis
• Product intuition
• Communication
• Teamwork

INTRODUCTION TO DATA SCIENCE 47 / 79


SKILLS REQUIRED FOR A DATA SCIENTIST

Qualities of a Data Scientist:
Communicative
Qualitative
Curious
Technical
Creative
Skeptical

INTRODUCTION TO DATA SCIENCE 48 / 79


TOOLS AVAILABLE TO A DATA SCIENTIST

R
SQL
Python
Scala
SAS
Hadoop
Julia
Tableau
Weka

INTRODUCTION TO DATA SCIENCE 50 / 79


ALGORITHMS FOR A DATA SCIENTIST

Linear Regression
Logistic Regression
K-means clustering
Apriori
PCA
Decision Tree
SVM
ANN
INTRODUCTION TO DATA SCIENCE 51 / 79


DATA SCIENCE TEAM BUILDING

Get to know each other for better communication


Foster team cohesion and teamwork
Encourage collaboration to boost team productivity and performance.

https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
INTRODUCTION TO DATA SCIENCE 52 / 79
ORGANIZATION OF DATA SCIENCE TEAM
1. Decentralized
2. Functional
3. Consulting
4. Centralized
5. Centre of Excellence
6. Federated

INTRODUCTION TO DATA SCIENCE 53 / 79




ORGANISATION OF DATA SCIENCE TEAM
[1] Decentralized
• Data scientists report into specific
business units (ex: Marketing) or functional
units (ex: Product Recommendations)
within a company.
• Resources allocated only to projects within
their silos with no view of analytics activities
or priorities outside their function or
business unit.
• Analytics are scattered across the
organization in different functions and
business units.
• Little to no coordination
• Drawback – lead to isolated teams
INTRODUCTION TO DATA SCIENCE 55 / 79
ORGANISATION OF DATA SCIENCE TEAM

[2] Functional
• Resource allocation driven by a functional
agenda rather than an enterprise agenda.
• Analysts are located in the functions where
the most analytical activity takes place, but
may also provide services to rest of the
corporation.
• Little coordination

INTRODUCTION TO DATA SCIENCE 56 / 79


ORGANISATION OF DATA SCIENCE TEAM

[3] Consulting
• Resources allocated based on availability
on a first-come first-served basis without
necessarily aligning to enterprise objectives
• Analysts work together in a central group
but act as internal consultants who charge
“clients” (business units) for their services
• No centralized coordination

INTRODUCTION TO DATA SCIENCE 57 / 79


ORGANISATION OF DATA SCIENCE TEAM
[4] Centralized
• Data scientists are members of a core
group, reporting to a head of data science
or analytics.
• Stronger ownership and management of
resource allocation and project prioritization
within a central pool.
• Analysts reside in central group, where they
serve a variety of functions and business
units and work on diverse projects.
• Coordination by central analytic unit
• Challenge – Hard to assess and meet
demands for incoming data science
projects. (esp in smaller teams)
INTRODUCTION TO DATA SCIENCE 58 / 79
ORGANISATION OF DATA SCIENCE TEAM

[5] Center of Excellence


• Better alignment of analytics initiatives and
resource allocation to enterprise priorities
without operational involvement.
• Analysts are allocated to units throughout
the organization and their activities are
coordinated by a central entity.
• Flexible model with right balance of
centralized and distributed coordination.

INTRODUCTION TO DATA SCIENCE 59 / 79


ORGANISATION OF DATA SCIENCE TEAM

[6] Federated
• Same as “Center of Excellence” model with
need-based operational involvement to
provide SME support.
• A centralized group of advanced analysts is
strategically deployed to enterprise-wide
initiatives.
• Flexible model with right balance of
centralized and distributed coordination.

INTRODUCTION TO DATA SCIENCE 60 / 79


TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING

INTRODUCTION TO DATA SCIENCE 61 / 79


SOFTWARE ENGINEERING
In general,
Software engineering is an engineering discipline that is concerned with all aspects of
software production.
Software includes computer programs, all associated documentation, and
configuration data that are needed for software to work correctly.
Waterfall model, Iterative models, Agile models

INTRODUCTION TO DATA SCIENCE 62 / 79


DATA SCIENCE PROCESS

INTRODUCTION TO DATA SCIENCE 63 / 79


DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science Software Engineering


Data science involves analyzing Software engineering focuses on creat-
huge amounts of data, with some ing software that serves a specific pur-
aspects of programming and devel- pose.
opment.
Uses a methodology involving vari- Uses a methodology involving various
ous phases beginning from require- phases beginning from requirements
ments specification through model specification through software deploy-
deployment to better decision mak- ment into production.
ing.

INTRODUCTION TO DATA SCIENCE 64 / 79


DATA SCIENCE VS. SOFTWARE ENGINEERING

Data Science Software Engineering


Involves collecting and analyzing Concerned with creating useful appli-
data cations
Data scientists utilize the ETL (Ex- Software engineers use the SDLC pro-
tract, Tranform, Load) process cess
More process-oriented Uses frameworks like Waterfall, Agile,
and Spiral
Data scientists use tools like Ama- Software engineers use tools like Rails,
zon S3, MongoDB, Hadoop, and Django, Flask, and Vue.js
MySQL
Skills include machine learning, Skills are focused on coding languages
statistics, and data visualization
INTRODUCTION TO DATA SCIENCE 65 / 79
DATAOPS

DATAOPS AS DEFINED BY GARTNER


DataOps is a collaborative data management practice, focused on improving
communication, integration, and automation of data flow between managers and
consumers of data within an organization.

INTRODUCTION TO DATA SCIENCE 66 / 79


DATAOPS

DataOps applies Agile development, DevOps


and lean manufacturing to data analytics
development and operations.
Agile governs analytics development.
DevOps optimizes code verification, builds and delivery
of new analytics.
Lean manufacturing focuses on minimization of waste
within a system without sacrificing productivity.

In short: Agile governs analytics development, DevOps optimizes code verification, builds
and delivery of new analytics, and statistical process control (SPC) orchestrates, monitors
and validates the data factory.

INTRODUCTION TO DATA SCIENCE 67 / 79


DATAOPS

INTRODUCTION TO DATA SCIENCE 68 / 79


DATAOPS

Data analytics pipeline


1 Data ingestion – Data, extracted from various sources, is explored, validated, and
loaded into a downstream system.
2 Data transformation – Data is cleansed and enriched. Initial data models are
designed to meet business needs.
3 Data analysis – produce insights using different data analysis techniques.
4 Data visualization/reporting – Data insights are represented in the form of reports or
interactive dashboards.
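The four stages above can be sketched as composable functions (an illustrative toy under the assumption of made-up weather records; real DataOps pipelines would be orchestrated by a tool such as Airflow):

```python
# A toy data analytics pipeline: ingest -> transform -> analyze -> report.

def ingest():
    """Extract and validate raw records from a source."""
    raw = [{"city": "Pune", "temp_f": 86}, {"city": "Oslo", "temp_f": 41}]
    return [r for r in raw if "temp_f" in r]  # drop records failing validation

def transform(records):
    """Cleanse/enrich: convert Fahrenheit to Celsius."""
    return [{**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in records]

def analyze(records):
    """Produce an insight: the warmest city."""
    return max(records, key=lambda r: r["temp_c"])["city"]

def report(insight):
    """Visualization/reporting stage (here, plain text)."""
    return f"Warmest city: {insight}"

print(report(analyze(transform(ingest()))))  # Warmest city: Pune
```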

https://www.altexsoft.com/blog/dataops-essentials/
INTRODUCTION TO DATA SCIENCE 69 / 79
DATAOPS
DataOps puts data pipelines into a CI/CD paradigm.
Development – involve building a new pipeline, changing a data model or redesigning
a dashboard.
Testing – checking the most minor update for data accuracy, potential deviation, and
errors.
Deployment – moving data jobs between environments, pushing them to the next
stage, or deploying the entire pipeline in production.
Monitoring – allows data professionals to identify bottlenecks, catch abnormal
patterns, and measure adoption of changes.
Orchestration – automates moving data between different stages, monitoring
progress, triggering autoscaling, and operations related to data flow management.

https://www.altexsoft.com/blog/dataops-essentials/
INTRODUCTION TO DATA SCIENCE 70 / 79
TECHNOLOGIES TO RUN DATAOPS

Git for version control


Jenkins for CI/CD practices
Docker for containerization and Kubernetes for managing containers
Tableau for data visualizations
Apache Airflow for data pipeline tools
Automated testing and monitoring tools
DataOps Platforms
• DataKitchen
• Saagie
• StreamSets

INTRODUCTION TO DATA SCIENCE 72 / 79


MLOPS

MLOps is an ML engineering culture and practice that aims at unifying ML system


development (Dev) and ML system operation (Ops).

MLOps sits at the intersection of Machine Learning, Data Engineering, and DevOps.

INTRODUCTION TO DATA SCIENCE 73 / 79


MLOPS

Real challenge isn’t building an ML model, but building an integrated ML system and
to continuously operate it in production.
To deploy and maintain ML systems in production reliably and efficiently.
Automating continuous integration (CI), continuous delivery (CD), and continuous
training (CT) for machine learning (ML) systems.
Frameworks
• Kubeflow and Cloud Build
• Amazon AWS MLOps
• Microsoft Azure MLOps

https://ml-ops.org/content/mlops-principles
INTRODUCTION TO DATA SCIENCE 74 / 79
MLOPS

https://builtin.com/machine-learning/mlops
INTRODUCTION TO DATA SCIENCE 75 / 79
MLOPS

Same data transformations, but different implementations.


e.g. the training pipeline usually runs over batch files that contain all features, while
the serving pipeline often runs online and receives only part of the features in the
requests.
The two pipelines must be kept consistent, enabling code reuse and data reuse.
Each trained model needs to be tied to the exact versions of code, data and
hyperparameters that were used.

https://builtin.com/machine-learning/mlops
INTRODUCTION TO DATA SCIENCE 76 / 79
DATAOPS AND MLOPS



TABLE OF CONTENTS

1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING



DATA SCIENCE VS. BUSINESS INTELLIGENCE



DATA SCIENCE VS. BUSINESS INTELLIGENCE

                Data Science                        Business Intelligence
Perspective     Looking forward                     Looking backward
Analysis        Predictive, explorative             Descriptive, comparative
Data            Same data, new analysis;            New data, same analysis;
                listens to data; distributed        speaks for data; warehoused
Scope           Specific to business question       Unlimited
Expertise       Data scientist                      Business analyst
Deliverable     Insight or story                    Table or report
Applicability   Future, correction for influences   Historic, confounding factors
DATA SCIENTIST VS. BUSINESS ANALYST



DATA SCIENCE VS. STATISTICS
                    Data Science                      Statistics
Type of problem     Semi-structured or unstructured   Well structured
Inference model     Explicit inference                No inference
Analysis objective  Need not be well formed           Well-formed objective
Type of analysis    Explorative                       Confirmative
Data collection     Not linked to the objective       Collected based on the objective
Size of dataset     Large, heterogeneous              Small, homogeneous
Paradigm            Theory and heuristic              Theory based
                    (deductive & inductive)           (deductive)
REFERENCES
 Introducing Data Science by Cielen, Meysman and Ali
 The Art of Data Science by Roger D. Peng and Elizabeth Matsui
 https://data-flair.training/blogs/data-science-use-cases/
 https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
 https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
 Building an Analytics-Driven Organization, Accenture


REFERENCES

 https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
 https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
 https://atlan.com/what-is-dataops/

THANK YOU
