MODULE # 1 : INTRODUCTION
IDS Course Team
BITS Pilani
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
CO2 Understand various roles and stages in a Data Science project and the ethical issues to be considered.
CO3 Explore the processes, tools, and technologies for the collection and analysis of structured and unstructured data.
CO4 Appreciate the importance of techniques like data visualization and storytelling with data for the effective presentation of outcomes to stakeholders.
Part IV: Modeling & Evaluation
Module 6: Classification
Module 7: Association Mining
Module 8: Clustering
Module 9: Anomaly Detection
EC1 Assignment Part I: 10%, 22 Feb to 13 Mar 2025; Assignment Part II: 15%, 19 Apr to 9 May 2025 (marks are the sum of both assignments)
EC2 Mid-sem: 30%, Closed Book; 21, 22, 23 Mar 2025 or 04, 05, 06 Apr 2025
EC3 Compre-sem: 40%, Open Book; Regular: 23, 24, 25 May 2025 or 30, 31 May, 1 Jun 2025
Recap
Session 1
• What is Data Science
• Why Data Science and why now, e.g., the Moneyball movie
• Real-world applications, e.g., Facebook, Amazon, Uber
• Data Science Challenges and Bias
• Roles in Data Science Team
• Organization of Data Science Team
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle
4 FURTHER READING
DEFINITION OF ANALYTICS – DICTIONARY
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
DEFINITION OF ANALYTICS – WEBSITES
Source: Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
GOALS OF DATA ANALYTICS
To predict something
• whether a transaction is a fraud or not
• whether it will rain on a particular day
• whether a tumour is benign or malignant
To find patterns in the data
• finding the top 10 coldest days in the year
• which pages are visited the most on a particular website
• finding the most searched celebrity in a particular year
To find relationships in the data
• finding similar news articles
• finding similar patients in an electronic health record system
• finding related products on an e-commerce website
• finding similar images
• finding correlation between news items and stock prices
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle
4 FURTHER READING
DATA ANALYTICS
DESCRIPTIVE ANALYTICS
DESCRIPTIVE ANALYTICS EXAMPLE
DESCRIPTIVE ANALYTICS
Techniques:
• Descriptive Statistics - histogram, correlation
• Data Visualization
• Exploratory Data Analysis
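A minimal sketch of these techniques in Python, assuming pandas (and matplotlib for the histograms); the columns are hypothetical stand-ins for any numeric attributes:

```python
# Descriptive analytics in a few lines: summary statistics, a correlation
# matrix, and per-column histograms for exploratory data analysis.
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52, 37],
    "income": [28000, 54000, 61000, 33000, 72000, 58000],
})

print(df.describe())  # count, mean, std, min, quartiles, max per attribute
print(df.corr())      # pairwise correlations between numeric attributes
df.hist(bins=5)       # one histogram per numeric column (needs matplotlib)
```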
DIAGNOSTIC ANALYTICS
DIAGNOSTIC ANALYTICS EXAMPLE
What is the effect of global warming on the Southwest monsoon?
PREDICTIVE ANALYTICS
PREDICTIVE ANALYTICS EXAMPLE
Covid patients' LOS (Length-of-Stay in hospital) prediction
Data Details:
▪ First 1000 subjects affected by COVID
▪ Released by the Ministry of Health (MOH), Government of Singapore
▪ X-Variables (Predictors): age, gender, positive confirmation date, discharge date, etc.
▪ Y-Variable (Dependent): length of stay
Reference: https://www.degruyter.com/document/doi/10.1515/cmb-2023-0104/html
PREDICTIVE ANALYTICS
Techniques / Algorithms:
• Regression
• Classification
• ML algorithms like Linear regression, Logistic regression, SVM
• Deep Learning techniques
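As an illustrative (hypothetical) sketch of these techniques, the snippet below fits a logistic regression classifier on synthetic data with scikit-learn; the generated features stand in for real predictors:

```python
# Predictive analytics sketch: learn a binary classifier from labeled history,
# then score held-out cases (e.g., fraud / not fraud).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```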
PRESCRIPTIVE ANALYTICS
PRESCRIPTIVE ANALYTICS EXAMPLE
How can we improve crop production?
Case Study – Data Analytics
fMRI data of healthy subjects, studied by aggregating regions of interest (ROIs) of the brain as 90 nodes
Ref: https://ieeexplore.ieee.org/document/9826786
COGNITIVE ANALYTICS
Cognitive Analytics – What Don’t I Know?
https://www.10xds.com/blog/cognitive-analytics-to-reinvent-business/
COGNITIVE ANALYTICS
Although this is the top tier of analytics maturity, Cognitive Analytics can be used at the prior levels as well.
According to one source:
“The essential distinction between cognitive platforms and artificial intelligence systems is that you want an AI to do something for you. A cognitive platform is something you turn to for collaboration or for advice.”
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle
4 FURTHER READING
DATA ANALYTICS METHODOLOGIES
NEED FOR A STANDARD PROCESS
DATA SCIENCE METHODOLOGY
10 Questions the process aims to answer
Problem to Approach
1 What is the problem that you are trying to solve?
2 How can you use data to answer the question?
Working with Data
3 What data do you need to answer the question?
4 Where is the data coming from? Identify all Sources. How will you acquire it?
5 Is the data that you collected representative of the problem to be solved?
6 What additional work is required to manipulate and work with the data?
Delivering the Answer
7 In what way can the data be visualized to get to the answer that is required?
8 Does the model used really answer the initial question or does it need to be adjusted?
9 Can you put the model into practice?
10 Can you get constructive feedback into answering the question?
CRISP-DM
CRISP-DM Phases
CRISP-DM PHASES
Business Understanding
• Understand project objectives and requirements.
• Data mining problem definition.
Data Understanding
• Initial data collection and familiarization.
• Identify data quality issues.
• Identify initial obvious results.
Data Preparation
• Record and attribute selection.
• Data cleansing.
CRISP-DM PHASES
Modeling
Run the data mining tools.
Evaluation
Determine if results meet business objectives.
Identify business issues that should have been addressed earlier.
Deployment
Put the resulting models into practice.
Set up for continuous mining of the data.
CRISP-DM PHASES AND TASKS
WHY CRISP-DM?
The data mining process must be reliable and repeatable, even by people with limited data mining skills.
CRISP-DM provides a uniform framework for
• guidelines.
• experience documentation.
CRISP-DM is flexible enough to account for differences:
• Different business/agency problems.
• Different data.
Case Study – Evaluating Job Readiness with CRISP-DM
Ref: A Case Study of Evaluating Job Readiness with Data Mining Tools and CRISP-DM Methodology, 2015
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Processing
Step 4: Data Modelling
Step 5: Model Evaluation and Conclusion
SEMMA
SAS Institute
Sample, Explore, Modify, Model,
Assess
5 stages
SEMMA STAGES
1 Sample
• Sampling the data by extracting a portion of a large data set big enough to contain the
significant information, yet small enough to manipulate quickly.
• Optional stage
2 Explore
• Exploration of the data by searching for unanticipated trends and anomalies in order to
gain understanding and ideas.
3 Modify
• Modification of the data by creating, selecting, and transforming the variables to focus
the model selection process.
SEMMA STAGES
4 Model
• Modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
5 Assess
• Assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
SEMMA
“SEMMA is not a data mining methodology but rather a logical organization of the
functional tool set of SAS Enterprise Miner for carrying out the core tasks of data
mining.
Enterprise Miner can be used as part of any iterative data mining methodology
adopted by the client. Naturally steps such as formulating a well defined business or
research problem and assembling quality representative data sources are critical to
the overall success of any data mining project.
SEMMA is focused on the model development aspects of data mining.”
SMAM
Standard Methodology for Analytics Models
SMAM PHASES
Phase: Description
Use-case identification: Selection of the ideal approach from a list of candidates
Model requirements gathering: Understanding the conditions required for the model to function
Data preparation: Getting the data ready for the modeling
Modeling experiments: Scientific experimentation to solve the business question
Insight creation: Visualization and dash-boarding to provide insight
Proof of Value (ROI): Running the model in a small-scale setting to prove the value
Operationalization: Embedding the analytical model in operational systems
Model life-cycle: Governance around model lifetime and refresh
More Data Analytics Methodologies
Data Acquisition
• Acquiring information from a rich and varied data environment.
Data Awareness
• Connecting data from different sources into a coherent whole, including modeling content, establishing context, and ensuring searchability.
Data Analytics
• Using contextual data to answer questions about the state of your organization.
Data Governance
• Establishing a framework for providing for the provenance, infrastructure and disposition
of that data.
BIG DATA LIFE-CYCLE
Phase 1: Foundations
Phase 2: Acquisition
Phase 3: Preparation
Phase 4: Input and Access
Phase 5: Processing
Phase 6: Output and Interpretation
Phase 7: Storage
Phase 8: Integration
Phase 9: Analytics and Visualization
Phase 10: Consumption
Phase 11: Retention, Backup, and Archival
Phase 12: Destruction
BIG DATA LIFE-CYCLE
Phase 1: Foundations
• Understanding and validating data requirements, solution scope, roles and
responsibilities, data infrastructure preparation, technical and non-technical
considerations, and understanding data rules in an organization.
Phase 2: Data Acquisition
• Data Acquisition refers to collecting data.
• Data sets can be obtained from various sources, both internal and external to the
business organizations.
• Data sources can be in
• structured forms such as transferred from a data warehouse, a data mart, various
transaction systems.
• semi-structured sources such as Weblogs, system logs.
• unstructured sources such as media files consisting of videos, audios, and pictures.
BIG DATA LIFE-CYCLE
Phase 10: Data Consumption
• Data is turned into information ready for consumption by internal or external users, including customers of the business organization.
• Data consumption requires architectural input for policies, rules, regulations, principles, and guidelines.
Phase 11: Retention, Backup, and Archival
• Use established data backup strategies, techniques, methods, and tools.
• Identify, document, and obtain approval for the retention, backup, and archival decisions.
Phase 12: Data Destruction
• There may be regulatory requirements to destroy particular types of data after a certain period of time.
• Confirm the destruction requirements with the data governance team in the business organization.
TABLE OF CONTENTS
1 ANALYTICS
2 DATA ANALYTICS
3 DATA ANALYTICS METHODOLOGIES
CRISP-DM
SEMMA
SMAM
Big Data Life-cycle
4 FURTHER READING
DESCRIPTIVE ANALYTICS – EXAMPLE #1
Problem Statement:
“The market research team at Aqua Analytics Pvt. Ltd is assigned a task to identify the profile of a typical customer for a digital fitness band offered by Titanic Corp. The market research team decides to investigate whether there are differences across the usage patterns and product lines with respect to customer characteristics.”
Data captured:
• Gender
• Age (in years)
• Education (in years)
• Relationship Status (Single or Partnered)
• Annual household income
• Average number of times the customer tracks activity each week
• Number of miles the customer expects to walk each week
• Self-rated fitness on a scale of 1 – 5, where 1 is poor shape and 5 is excellent
• Model of the product purchased: IQ75, MZ65, DX87
DIAGNOSTIC ANALYTICS – EXAMPLE #1
Problem Statement:
“During the 1980s, General Electric was selling different products to its customers, such as light bulbs, jet engines, windmills, and other related products. They also sold parts and services separately: GE would sell you a product, you would use it until it needed repair, either because of normal wear and tear or because it broke, and you would come back to GE, which would then sell you the parts and services to fix it. GE’s model focused on how much GE was selling, in sales of operational equipment and in sales of parts and services, and on what GE needed to do to drive up those sales.”
https://medium.com/parrotai/understand-data-analytics-framework-with-a-case-study-in-the-business-world-15bfb421028d
DIAGNOSTIC ANALYTICS – EXAMPLE #1
https://www.sganalytics.com/blog/change-management-analytics-adoption/
PREDICTIVE ANALYTICS – EXAMPLE #1
Google launched Google Flu Trends (GFT) to collect predictive analytics regarding outbreaks of flu. It is a great example of big data analytics in action.
So, did Google manage to predict influenza activity in real time by aggregating search engine queries and adopting predictive analytics?
Even with a wealth of big data analytics on search queries, GFT overestimated the prevalence of flu by over 50% in 2011-2012 and 2012-2013.
Google matched the search engine terms entered by people in different regions of the world. When these queries were compared with traditional flu surveillance systems, Google found that the predictive analytics of the flu season pointed towards a correlation with higher search engine traffic for certain phrases.
PREDICTIVE ANALYTICS – EXAMPLE #1
https://www.slideshare.net/VasileiosLampos/usergenerated-content-collective-and-personalised-inference-tasks
PREDICTIVE ANALYTICS – EXAMPLE #2
Colleen Jones applied predictive analytics to FootSmart (a niche online catalog retailer) on a content marketing product called the FootSmart Health Resource Center (FHRC), which consisted of articles, diagrams, quizzes, and the like.
On analyzing the data around increased search engine visibility, the FHRC was found to help FootSmart reach more of the right kind of target customers.
They were receiving more traffic, primarily consisting of people who cared about foot health conditions and their treatments.
FootSmart decided to push more content to the FHRC and also improve its merchandising of the product.
The result of such informed, data-driven decision making? A 36% increase in weekly sales.
https://www.footsmart.com/pages/health-resource-center
PREDICTIVE ANALYTICS – EXAMPLE #2
PRESCRIPTIVE ANALYTICS – EXAMPLE #1
A health insurance company analyses its data and determines that many of its diabetic
patients also suffer from retinopathy.
With this information, the provider can now use predictive analytics to get an idea of how
many more ophthalmology claims it might receive during the next year.
Then, using prescriptive analytics, the company can look at scenarios where the reimbursement costs for ophthalmology increase, decrease, or hold steady. These scenarios then allow them to make an informed decision about how to proceed in a way that is both cost-effective and beneficial to their customers.
PRESCRIPTIVE ANALYTICS – EXAMPLE #2
Whenever you go to Amazon, the site recommends dozens and dozens of products to
you. These are based not only on your previous shopping history (reactive), but also
based on what you’ve searched for online, what other people who’ve shopped for the
same things have purchased, and about a million other factors (proactive).
Amazon and other large retailers are taking deductive, diagnostic, and predictive data
and then running it through a prescriptive analytics system to find products that you
have a higher chance of buying.
Every bit of data is broken down and examined with the end goal of helping the
company suggest products you may not have even known you wanted.
https://accent-technologies.com/2020/06/18/examples-of-prescriptive-analytics/
HEALTHCARE ANALYTICS – CASE STUDY
Self study
https://integratedmp.com/4-key-healthcare-analytics-sources-is-your-practice-using-them/
https://www.youtube.com/watch?v=olpuyn6kemg
REFERENCES
Big Data Analytics – A Hands-on Approach by Arshdeep Bahga & Vijay Madisetti
https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome
https://documentation.sas.com/?docsetId=emref&docsetTarget=n061bzurmej4j3n1jnj8bbjjm1a2.htm&docsetVersion=14.3&locale=en
http://jesshampton.com/2011/02/16/semma-and-crisp-dm-data-mining-methodologies/
https://www.kdnuggets.com/2015/08/new-standard-methodology-analytical-models.html
https://medium.com/illumination-curated/big-data-lifecycle-management-629dfe16b78d
https://www.esadeknowledge.com/view/7-challenges-and-opportunities-in-data-based-decision-making-193560
T HANK YOU
INTRODUCTION TO DATA SCIENCE
MODULE # 3 : DATA
IDS Course Team
BITS Pilani
The instructors gratefully acknowledge the authors who made their course materials freely available online.
Session 2
• Data Analytics
• Case Studies – (COVID, Neuro Informatics)
• Data Analytics Methodologies (CRISP-DM)
TABLE OF CONTENTS
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Continuous Attribute
) Real numbers as attribute values.
) temperature, height, or weight
) Continuous attributes are typically represented as floating-point variables.
Asymmetric Attribute
) only the presence of a non-zero attribute value is considered.
) For a specific student, an attribute has a value of 1 if the student took the course
associated with that attribute and a value of 0 otherwise
) Asymmetric binary attributes.
Identify whether the attribute is discrete and continuous in the given data.
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Web Click-Stream
1 Public data
) Data that has been collected and preprocessed for academic or research purposes and
made public.
) https://archive.ics.uci.edu/
2 Private data
) Data that is specific to an organization.
) Privacy rules like the IT Act 2000 and the GDPR apply.
https://mozanunal.com/2019/11/img2sh/
DIGITAL COLOUR IMAGE
https://www.analyticsvidhya.com/blog/2021/03/grayscale-and-rgb-format-for-storing-images/
DIGITAL COLOUR IMAGE
https://www.mathworks.com/help/matlab/creating_plots/image-types.html
GRAPH DATA EXAMPLE
https://lod-cloud.net/
TABLE OF CONTENTS
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Missing data
) Data that is not filled / available intentionally or otherwise.
) Attributes of interest may not always be available, such as customer information for sales
transaction data.
) Some data were not considered important at the time of entry.
) Some data may not have been recorded because of equipment malfunctions.
Duplicate data
Orphaned data
Text encoding errors
Data that is biased
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Apply the basic epicycle of analysis to the formal modelling portion of data analysis.
1 Setting expectations.
) Develop a primary model that represents your best sense of what provides the answer
to your question. This model is chosen based on whatever information you have
currently available.
2 Collecting Information.
) Create a set of secondary models that challenge the primary model in some way.
3 Revising expectations.
) If our secondary models are successful in challenging our primary model and put the
primary model’s conclusions in some doubt, then we may need to adjust or modify
the primary model to better reflect what we have learned from the secondary
models.
Conduct a survey of 20 people to ask them how much they’d be willing to spend on a
product you’re developing.
The survey response
25, 20, 15, 5, 30, 7, 5, 10, 12, 40, 30, 30, 10, 25, 10, 20, 10, 10, 25, 5
The goal is to develop a benchmark model that serves us as a baseline, upon we’ll
measure the performance of a better and more attuned algorithm.
Benchmarking requires experiments to be comparable, measurable, and reproducible.
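A minimal sketch of such a baseline in NumPy: predict the sample mean for every respondent and record the error that any better-attuned model must beat:

```python
# Benchmark model: a constant prediction (the mean) and its mean absolute error.
import numpy as np

responses = np.array([25, 20, 15, 5, 30, 7, 5, 10, 12, 40,
                      30, 30, 10, 25, 10, 20, 10, 10, 25, 5])

baseline = responses.mean()                # constant prediction: 17.2
mae = np.abs(responses - baseline).mean()  # baseline error to beat
print(f"baseline prediction: {baseline:.1f}, MAE: {mae:.2f}")
```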
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
For example, customers tend to purchase first a laptop, followed by a digital camera, and then a memory card (a sequential pattern).
) A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences.
The term prediction refers to both numeric prediction and class label
prediction.
Classification and regression may need to be preceded by relevance analysis, which
attempts to identify attributes that are significantly relevant to the classification and
regression process.
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known).
The model is used to predict the class label of objects for which the class label is unknown.
The derived model may be represented as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks; other methods include naive Bayesian classification, support vector machines, and k-nearest-neighbor classification.
Classification predicts categorical (discrete, unordered) labels.
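A minimal sketch with scikit-learn, using the bundled iris data purely for illustration: a decision tree is derived from training data whose labels are known, printed as IF-THEN style rules, and then used to predict labels for new objects:

```python
# Classification sketch: derive a model from labeled training data, inspect it,
# and predict class labels for unseen objects.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)  # labels known at training

print(export_text(clf))    # the derived model as IF-THEN style rules
print(clf.predict(X[:3]))  # predicted class labels for three objects
```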
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
Data pipelines are sets of processes that move and transform data from various
sources to a destination where new value can be derived.
In their simplest form, pipelines may extract only data from one source such as a
REST API and load to a destination such as a SQL table in a data warehouse.
In practice, data pipelines consist of multiple steps including data extraction, data
preprocessing, data validation, and at times training or running a machine learning
model before delivering data to its final destination.
Data engineers specialize in building and maintaining the data pipelines.
For every dashboard and insight that a data analyst generates and for each predictive
model developed by a data scientist, there are data pipelines working behind the
scenes.
A single dashboard, or a single metric may be derived from data originating in
multiple source systems.
Data pipelines extract data from sources and load them into simple database tables
or flat files for analysts to use. Raw data is refined along the way to clean, structure,
normalize, combine, aggregate, and anonymize or secure it.
) e.g., a shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files.
Data ingestion is traditionally both the extract and load steps of an ETL or ELT
process.
SIMPLE PIPELINE
E– extract step
) gathers data from various sources in preparation for loading and transforming.
L – load step
) brings either the raw data (in the case of ELT) or the fully transformed data (in the case of
ETL) into the final destination.
) load data into the data warehouse, data lake, or other destination.
T – transform step
) raw data from each source system is combined and formatted in such a way that it’s useful to analysts and visualization tools.
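A hypothetical end-to-end sketch of these steps in Python. The endpoint URL, table, and field names are illustrative, not a real API: extract gathers rows from a REST source, load writes them raw into SQLite (ELT style), and transform reshapes them for analysts:

```python
# Simple ELT pipeline sketch: extract from an API, load raw, transform in SQL.
import sqlite3
import requests

def extract(url):
    # E: gather source data (assumes the endpoint returns a JSON list of rows)
    return requests.get(url, timeout=10).json()

def load(rows, conn):
    # L: land the raw data unchanged in the destination (ELT)
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (user_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO raw_events VALUES (:user_id, :amount)", rows)

def transform(conn):
    # T: combine/format so it is useful to analysts, e.g. totals per user
    return conn.execute(
        "SELECT user_id, SUM(amount) FROM raw_events GROUP BY user_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load(extract("https://example.com/api/events"), conn)  # hypothetical endpoint
print(transform(conn))
```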
Orchestration ensures that the steps in a pipeline are run in the correct order and that
dependencies between steps are managed properly.
Pipeline steps (tasks) are always directed, meaning they start with a task or multiple
tasks and end with a specific task or tasks. This is required to guarantee a path of
execution.
Pipeline graphs must also be acyclic, meaning that a task cannot point back to a
previously completed task.
Pipelines are implemented as DAGs (Directed Acyclic Graphs).
Orchestration tool – Apache Airflow
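A minimal sketch of such a DAG, assuming Apache Airflow 2.4+ (the dag_id, schedule, and task callables are placeholders):

```python
# The >> operator declares directed edges; Airflow rejects cycles, so the
# resulting graph is guaranteed to be a DAG with a clear path of execution.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(dag_id="simple_elt", start_date=datetime(2025, 1, 1),
         schedule=None) as dag:
    extract = PythonOperator(task_id="extract",
                             python_callable=lambda: print("extract"))
    load = PythonOperator(task_id="load",
                          python_callable=lambda: print("load"))
    transform = PythonOperator(task_id="transform",
                               python_callable=lambda: print("transform"))

    extract >> load >> transform  # run order managed by the orchestrator
```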
1 DATA
2 DATA-SETS
3 DATA QUALITY
4 DATA MODELS
5 ANALYSIS IN DATA SCIENCE
6 DATA PIPELINES AND PATTERNS
7 FURTHER READING
A data lake is where data is stored, but without the structure or query
optimization of a data warehouse.
It will contain a high volume of data as well as a variety of data types.
It is not optimized for querying such data in the interest of reporting and analysis.
Eg: a single data lake might contain a collection of blog posts stored as text files, flat
file extracts from a relational database, and JSON objects containing events
generated by sensors in an industrial system.
THANK YOU
Session 2
• Data Analytics
• Case Studies – (COVID, Neuro Informatics)
• Data Analytics Methodologies (CRISP-DM)
Session 4
• Statistical Description of data
• Data Preparation
• Data aggregation and sampling
Recap
Session 5
• Dissimilarity and Similarity Measures
• Handling Numeric data
Session 6
Session 7
• Visualization for EDA
Session 8
TABLE OF CONTENTS
T4:Chapter 2.4
MEASURING DATA SIMILARITY AND DISSIMILARITY
Various proximity measures
• Data Matrix versus Dissimilarity Matrix
• Proximity Measures for Nominal Attributes
• Proximity Measures for Binary Attributes
• Symmetric Binary Attributes
• Asymmetric Binary Attributes
• Proximity Measures for Ordinal Attributes
• Proximity Measures for Numeric Data
• Proximity Measures for Mixed Types
• Cosine Similarity
Dissimilarity matrix
– n data points, but registers only the distances between pairs of objects
– A triangular matrix
– Single mode
Calculate the dissimilarity matrix and similarity matrix for the ordinal
attributes
– where q is the number of attributes that equal 1 for both objects i and j,
– r is the number of attributes that equal 1 for object i but equal 0 for object j,
– s is the number of attributes that equal 0 for object i but equal 1 for object j,
– t is the number of attributes that equal 0 for both objects i and j.
– The total number of attributes is p, where p = q+r+s+t .
For symmetric binary attributes, d(i, j) = (r + s) / (q + r + s + t).
For asymmetric binary attributes, d(i, j) = (r + s) / (q + r + s), and the corresponding similarity (the Jaccard coefficient) is sim(i, j) = q / (q + r + s) = 1 − d(i, j).
Visualization.ipynb
Techniques are
Discretization – Convert numeric data into discrete categories
Binarization – Convert numeric data into binary categories
Normalization – Scale numeric data to a specific range
Smoothing
• which works to remove noise from the data. Techniques include binning, regression, and
clustering.
• random method, simple moving average, random walk, simple exponential, and
exponential moving average (Will learn in ISM)
T4:Chapter 3.5
DISCRETIZATION
Unsupervised discretization
) Binning [ Equal-interval, Equal-frequency] (Top-down split)
) Histogram analysis (Top-down split)
) Clustering analysis (Top-down split or Bottom-up merge)
Supervised discretization
) Entropy-based discretization (Top-down split)
T1: Chapter 2.3.6
UNSUPERVISED DISCRETIZATION
1 Equal Width (distance) binning
) width = interval = (max − min) / #bins
) Highly sensitive to outliers.
) If outliers are present, the width of each bin is large, resulting in skewed data.
2 Equal Depth (frequency) binning
) Specify the number of values that have to be stored in each bin.
) Number of entries in each bin are equal.
) Some values can be stored in different bins.
T4: Chapter 3.4.6
BINNING EXAMPLE
Discretize the following data into 3 discrete categories using the binning technique.
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70
Sorted data: 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal Width (width = (81 − 53)/3 = 28/3 ≈ 9.33):
Bin 1 [53, 62): 53, 56, 57
Bin 2 [62, 72): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
Bin 3 [72, 81]: 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal Depth (depth = 24/3 = 8):
Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
Binning.ipynb
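A sketch of both strategies on the same data with pandas: cut gives equal-width bins and qcut gives equal-depth (frequency) bins. The exact interval edges differ slightly from the hand-computed table because pandas uses the raw min/max:

```python
# Equal-width vs. equal-depth binning of the 24 values from the example.
import pandas as pd

values = pd.Series([70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81,
                    53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70])

equal_width = pd.cut(values, bins=3)   # 3 intervals of width (81-53)/3 ≈ 9.33
equal_depth = pd.qcut(values, q=3)     # 3 bins holding ~24/3 = 8 values each

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```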
T1:Chapter 2.3.7
SIMPLE FUNCTIONAL TRANSFORMATION
Variable transformations should be applied with caution since they change the nature
of the data.
For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1.
To understand the effect of a transformation, it is important to ask questions such as:
) Does the order need to be maintained?
) Does the transformation apply to all values, especially negative values and 0?
) What is the effect of the transformation on the values between 0 and 1?
Features with bigger magnitude dominate over the features with smaller magnitudes.
Good practice to have all variables within a similar scale.
Euclidean distances are sensitive to feature magnitude.
Feature scaling helps decrease the time of finding support vectors.
Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any
other.
Techniques
) Min-Max normalization
) z-score normalization
) Decimal normalization
T4:Chapter 3.5.2
MIN-MAX SCALING
Min-max scaling squeezes (or stretches) all feature values to be within the range of
[0, 1].
Min-Max normalization preserves the relationships among the original data values.
Suppose that the minimum and maximum values for the attribute income are $12,000 and
$98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to value of
$73,600.
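Working it through with the min-max formula, x̂ = (x − min) / (max − min) mapped onto [0.0, 1.0]:
x̂ = (73,600 − 12,000) / (98,000 − 12,000) = 61,600 / 86,000 ≈ 0.716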
x̂ = (x − μ(x)) / σ(x)
z-score normalization is useful when the actual minimum and maximum of attribute X
are unknown, or when there are outliers that dominate the min-max normalization.
Suppose that the mean and standard deviation of the values for the attribute income are
$54,000 and $16,000, respectively. Apply z-score normalization to value of $73,600.
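Working it through: x̂ = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225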
Example 1
CGPA Formula Normalized CGPA
2 2/10 0.2
3 3/10 0.3
Example 2
Bonus Formula Normalized Bonus
450 450/1000 0.45
310 310/1000 0.31
Example 3
Salary Formula Normalized Salary
48000 48000/100000 0.48
67000 67000/100000 0.67
Normalization.ipynb
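A minimal sketch of the three techniques; note that StandardScaler estimates the mean and standard deviation from the sample itself, so its output differs from the worked example above, which used given values:

```python
# Min-max scaling, z-score normalization, and decimal scaling side by side.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

income = np.array([[12000.0], [54000.0], [73600.0], [98000.0]])

print(MinMaxScaler().fit_transform(income))    # squeezed into [0, 1]
print(StandardScaler().fit_transform(income))  # zero mean, unit variance
print(income / 10 ** 5)                        # decimal scaling: 10^j with j = 5
```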
CATEGORICAL ENCODING TECHNIQUES
One-hot encoding
Label Encoding
Disadvantages of one-hot encoding
) Expands the feature space.
) Does not add extra information while encoding.
) Many dummy variables may be identical, introducing redundant information.
Disadvantages of label encoding
) Does not add extra information while encoding.
) Not suitable for linear models.
) Does not handle new categories in the test set automatically.
Used for features which have multiple values into domain. eg: colour, protocol types
LABEL ENCODING EXAMPLE
Assume an ordinal attribute for representing service of a restaurant: (Awful, Poor, OK,
Good, Great)
If there are m categorical values, then uniquely assign each original value to an integer
in the interval [0,m - 1].
Encoding.ipynb
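A sketch of both encoders in pandas; the colour and service columns are hypothetical, and the integer mapping is chosen to preserve the ordinal service order:

```python
# One-hot encoding for a nominal attribute, label encoding for an ordinal one.
import pandas as pd

df = pd.DataFrame({"colour":  ["red", "green", "blue", "green"],
                   "service": ["Awful", "OK", "Great", "Good"]})

one_hot = pd.get_dummies(df["colour"], prefix="colour")  # expands feature space

order = {"Awful": 0, "Poor": 1, "OK": 2, "Good": 3, "Great": 4}
df["service_label"] = df["service"].map(order)  # integers in [0, m-1]

print(pd.concat([df, one_hot], axis=1))
```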
Create a matrix where each column consists of a token and the cells show the counts
of the number of times a token appears.
Each token is now an attribute in standard data science parlance and each document
is an example (record).
Unstructured raw data is now transformed into a format that is recognized by machine
learning algorithms for training.
The matrix / table is referred to as Document Vector or Term Document Matrix (TDM)
As more new statements are added that have little in common, we end up with a very
sparse matrix.
We could also choose to use the term frequencies (TF) for each token instead of
simply counting the number of occurrences.
TERM DOCUMENT MATRIX – EXAMPLE
There are common words such as ”a,” ”this,” ”and,” and other similar
terms. They do not really convey specific meaning.
Most parts of speech such as articles, conjunctions, prepositions, and pronouns need
to be filtered before additional analysis is performed.
Such terms are called stop words.
Stop word filtering is usually the second step that follows immediately after
tokenization.
The document vector gets reduced significantly after applying standard English stop
word filtering.
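A sketch of both steps with scikit-learn's CountVectorizer, which tokenizes and filters English stop words before building the (sparse) term-document counts; the three documents are made up:

```python
# Term-document matrix with built-in tokenization and stop word filtering.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a book on data science",
        "Data science is fun",
        "A sparse matrix stores the counts"]

vec = CountVectorizer(stop_words="english")  # drops "this", "is", "a", ...
tdm = vec.fit_transform(docs)                # sparse counts per document

print(pd.DataFrame(tdm.toarray(), columns=vec.get_feature_names_out()))
```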
Lexical substitution is the process of finding an alternative for a word in the context
of a clause.
It is used to align all the terms to the same term based on the field or subject which is
being analyzed.
This is especially important in areas with specific jargon, e.g., in clinical settings.
Example: common salt, NaCl, sodium chloride can be replaced by NaCl.
Domain specific
STEMMING
The most common stemming technique for text mining in English is the Porter
Stemming method.
Porter stemming works on a set of rules where the basic idea is to remove and/or
replace the suffix of words.
) Replace all terms which end in ’ies’ by ’y,’ such as replacing the term ”anomalies” with
”anomaly.”
) Stem all terms ending in ”s” by removing the ”s,” as in ”algorithms” to ”algorithm.”
While the Porter stemmer is extremely efficient, it can make mistakes that could prove
costly.
) ”arms” and ”army” would both be stemmed to ”arm,” which would result in somewhat
different contextual meanings.
NLP.ipynb
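A sketch using NLTK's PorterStemmer. Real implementations follow the Porter rules more literally than the simplified description above; for instance, "ies" is replaced by "i", so "anomalies" stems to "anomali" rather than "anomaly":

```python
# Porter stemming: print each word next to its stem to see the rules in action.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["anomalies", "algorithms", "arms", "army"]:
    print(word, "->", stemmer.stem(word))
```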
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE
If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any
value in between.
If X is a numeric attribute, the median is taken as the average of the two middlemost
values.
Issue: Median is expensive to compute when we have a large number of
observations.
Mode for a set of data is the value that occurs most frequently in the set.
Mode can be determined for qualitative and quantitative attributes.
Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal. In general, a data set with two or more modes is multimodal.
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Kurtosis
https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
MIDRANGE
midrange = (min + max) / 2

Example: X = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
mean = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 58
median = (52 + 56) / 2 = 54
mode = 52, 70
midrange = (30 + 110) / 2 = 70
Range
Quartiles, and interquartile range
Five-number summary and boxplots
Variance and standard deviation
The range of the set is the difference between the largest and smallest values.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q − k )/q of the data values are more
than x , where k is an integer such that 0 < k < q.
There are q − 1 q-quantiles.
Three data points that split the data distribution into four equal parts
Each part represents one-fourth of the data distribution.
Q1 is the 25th percentile and Q3 is the 75th percentile
Quartiles give an indication of a distribution’s center, spread, and shape
IQR = Q3 − Q1
Identifying outliers as values falling at least 1.5 × IQR above the third quartile or
below the first quartile.
The five-number summary of a distribution consists of the median (Q2), the quartiles
Q1 and Q3 , and the smallest and largest individual observations.
Written in the order
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to Minimum and Maximum
Outliers: points beyond a specified outlier threshold, plotted individually
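A sketch computing the five-number summary and the 1.5 × IQR outlier fences with NumPy, reusing the example data set X from earlier in this section:

```python
# Five-number summary plus the standard boxplot outlier rule.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker fences

print("min/Q1/median/Q3/max:", x.min(), q1, median, q3, x.max())
print("outliers:", x[(x < low) | (x > high)])  # flags the value 110
```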
VARIANCE
Variance and standard deviation indicate how spread out a data distribution is.
Heights of 5 different dogs are - 600 mm, 470 mm, 170 mm, 430 mm and 300 mm
Mean – 394 mm
Example
Variance = 21704
Standard Deviation = √21704 ≈ 147
STANDARD DEVIATION
Standard deviation σ of the observations is the square root of the variance σ2.
A low standard deviation means that the data observations tend to be very close to
the mean.
A high standard deviation indicates that the data are spread out over a large range of
values.
σ measures spread about the mean and should be considered only when the mean is
chosen as the measure of center.
σ = 0 only when there is no spread, that is, when all observations have the same
value. Otherwise, σ > 0.
Focuses on removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
Two types of errors
) Interpretation error
Age < 150
Height of a person < 7 feet.
Price is positive.
) Inconsistencies between data sources or against your company’s standardized values.
Female and F
Feet and meter
Dollars and Pounds
Ignore the tuple.
) By ignoring the tuple, we do not make use of the remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.
Fill in the missing value manually.
) Time consuming.
) May not be feasible given a large data set with many missing values.
Use a global constant to fill in the missing value.
) Replace all missing attribute values by the same constant, such as the label ”Unknown.” The mining program may then mistakenly think that they form an interesting concept, since they all have a value in common, that of ”Unknown.” Hence, although this method is simple, it is not foolproof.
Use a measure of central tendency for the attribute.
) Central tendency indicates the ”middle” value of a data distribution. E.g., mean or
median
) For normal (symmetric) data distributions, the mean can be used.
Use the attribute mean or median for all samples belonging to the same class as the
given tuple.
) For example, if classifying customers according to credit risk, we may replace the missing
value with the mean income value for customers in the same credit risk category as that
of the given tuple.
) If the data distribution for a given class is skewed, the median value is a better choice.
Use the most probable value to fill in the missing value.
) This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
) For example, using the other customer attributes in the data set, we may construct a decision tree to predict the missing values for income.
Concept hierarchies can also be used to generalize (smooth) the data.
For example: A concept hierarchy for price may map real price values into three categories: inexpensive, moderately priced, and expensive.
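A sketch of three of these fill-in strategies with pandas; the risk and income columns are hypothetical, echoing the credit-risk example:

```python
# Missing-value imputation: global constant, overall mean, and same-class mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({"risk":   ["low", "low", "high", "high", "high"],
                   "income": [52000.0, np.nan, 30000.0, np.nan, 28000.0]})

df["income_const"] = df["income"].fillna(-1)                  # global constant
df["income_mean"] = df["income"].fillna(df["income"].mean())  # central tendency
df["income_class"] = df.groupby("risk")["income"].transform(
    lambda s: s.fillna(s.mean()))                             # same-class mean
print(df)
```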
TABLE OF CONTENTS
1 STATISTICAL DESCRIPTIONS OF DATA (SELF STUDY)
2 DATA PREPARATION
3 DATA AGGREGATION, SAMPLING
4 DATA SIMILARITY & DISSIMILARITY MEASURE
To create the aggregate transaction that represents the sales of a single store or date.
Quantitative attributes, such as price, are typically aggregated by taking a sum or an
average.
Qualitative attribute, such as item description, can either be omitted or summarized
as the set of all the items that were sold at that location.
The data in the table can also be viewed as a multidimensional array, where each
attribute is a dimension.
) Aggregation is the process of eliminating attributes (such as the type of item) or reducing
the number of values for a particular attribute (e.g., reducing the possible values for date
from 365 days to 12 months).
) Commonly used in Online Analytical Processing (OLAP).
Advantages
) Require less memory and processing time.
) Provides a high-level view of the data instead of a low-level view.
) the behavior of groups of objects or attributes is often more stable than that of individual
objects or attributes.
Disadvantage
) potential loss of interesting details.
Sampling Techniques
Simple Random
Systematic Sampling
Stratified Random
Cluster Sampling
PROBABILISTIC SAMPLING TECHNIQUES
PROBABILISTIC SAMPLING means that every item in the population has an equal
chance of being included in sample.
SIMPLE RANDOM SAMPLING means that every case of the population has an equal
probability of inclusion in sample.
Eg: Randomly picking mango from a basket of fruits.
Sampling without replacement
Sampling with replacement
SYSTEMATIC SAMPLING is where every nth case after a random start is selected.
Eg: Picking every 5th fruit from a basket of fruits.
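A sketch of both techniques with pandas on a toy population of 100 cases:

```python
# Simple random sampling (with and without replacement) and systematic sampling.
import pandas as pd

df = pd.DataFrame({"id": range(1, 101)})

simple = df.sample(n=10, replace=False, random_state=7)    # without replacement
with_repl = df.sample(n=10, replace=True, random_state=7)  # with replacement

start = 3                        # random start
systematic = df.iloc[start::10]  # then every 10th case
print(len(simple), len(with_repl), len(systematic))
```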
MEASURES OF PROXIMITY
Data Matrix
PROXIMITY MEASURES FOR CATEGORICAL ATTRIBUTES
m is the number of matches – the number of attributes for which i and j are in the
same state
p is the total number of attributes describing the objects
d(i, j) = (p − m) / p
sim(i, j) = m / p
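A minimal sketch of these two measures for a pair of objects described by nominal attributes (the attribute names and values are made up):

```python
# Simple matching: d(i, j) = (p - m) / p and sim(i, j) = m / p.
def nominal_dissimilarity(obj_i, obj_j):
    p = len(obj_i)                                # total number of attributes
    m = sum(obj_i[a] == obj_j[a] for a in obj_i)  # number of matching states
    return (p - m) / p

i = {"colour": "red",  "shape": "round", "size": "small"}
j = {"colour": "blue", "shape": "round", "size": "small"}
print(nominal_dissimilarity(i, j))  # (3 - 2) / 3 ≈ 0.333
```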
EXAMPLE
Introduction to Data Mining by Tan, Steinbach, and Vipin Kumar (T1)
The Art of Data Science by Roger D. Peng and Elizabeth Matsui (R1)
Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2006 (T4)
On Being a Data Skeptic, O’Reilly Media, Inc., ISBN: 9781449374310
THANK YOU
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
https://medium.com/analytics-vidhya/introduction-to-data-science-28deb32878e7
DATA SCIENCE
Tons of data.
Powerful algorithms.
Open software and tools.
Computational speed, accuracy and cost.
Data storage in terms of capacity and cost.
Artificial Intelligence
• AI involves making machines capable of mimicking human behavior, particularly
cognitive functions like facial recognition, automated driving, sorting mail based on
postal code.
Machine Learning
• Considered a sub-field of or one of the tools of AI.
• Involves providing machines with the capability of learning from experience.
• Experience for machines comes in the form of data.
Data Science
• Data science is the application of machine learning, artificial intelligence, and other quantitative fields like statistics, visualization, and mathematics to uncover insights from data to enable better decision making.
https://www.sciencedirect.com/topics/physics-and-astronomy/artificial-intelligence
Why Data Science ?
Case Study – Moneyball: The Art of Winning an Unfair Game
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
DataFlair
DATA SCIENCE IN FACEBOOK
Social Analytics
Quantitative research
Makes use of deep learning
Makes use of “DeepText”
It uses targeted advertising.
Spotify uses data science to gain insights about which universities had the highest
percentage of party playlists and which ones spent the most time on it.
”Spotify Insights” publishes information about the ongoing trends in music.
Spotify’s Niland, an API based product, uses machine learning to provide better
searches and recommendations to its users.
Spotify analyzes listening habits of its users to predict the Grammy Award Winners.
DataFlair
TABLE OF CONTENTS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
Cognitive Biases are the distortions of reality because of the lens through which we
view the world.
Each of us sees things differently based on our preconceptions, past experiences,
cultural, environmental, and social factors. This doesn’t necessarily mean that the
way we think or feel about something is truly representative of reality.
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [1/7]
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [3/7]
3 Business analyst
) A business analyst basically realizes a CAO’s functions but on the operational level.
) This implies converting business expectations into data analysis.
) If your core data scientist lacks domain expertise, a business analyst bridges this gulf.
4 Data scientist
) A data scientist is a person who solves business tasks using machine learning and data
mining techniques.
) The role can be narrowed down to data preparation and cleaning, with further model training and evaluation.
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [4/7]
Job of a data scientist is often divided into two roles
[4A] Machine Learning Engineer
) A machine learning engineer combines software engineering and modeling skills by
determining which model to use and what data should be used for each model.
) Probability and statistics are also their forte.
) Training, monitoring, and maintaining a model.
) Preferred skills: SQL, Python, R, Scala, Carto, D3, QGIS, Tableau
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [5,6/7]
5 Data architect
) Working with Big Data.
) This role is critical to warehouse the data, define database architecture, centralize data,
and ensure integrity across different sources.
) Preferred skills: SQL, noSQL, XML, Hive, Pig, Hadoop, Spark
6 Data engineer
) Data engineers implement, test, and maintain infrastructural components that data
architects design.
) Realistically, the role of an engineer and the role of an architect can be combined in one
person.
) Preferred skills: SQL, noSQL, Hive, Pig, Matlab, SAS, Python, Java, Ruby, C++, Perl
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
ROLES IN DATA SCIENCE TEAM [7/7]
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
DATA SCIENTIST
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
SKILLSET FOR A DATA SCIENTIST
• Programming
• Quantitative analysis
• Product intuition
• Communication
• Teamwork
Traits of a Data Scientist: Communicative, Curious, Creative, Qualitative, Technical, Skeptical
Tools: R, SQL, Python, Scala, SAS, Hadoop, Julia, Tableau, Weka
Algorithms: Logistic Regression, Linear Regression, K-means clustering, Decision Tree, SVM, ANN
https://towardsdatascience.com/why-team-building-is-important-to-data-scientists-a8fa74dbc09b
ORGANIZATION OF DATA SCIENCE TEAM
1. Decentralized
2. Functional
3. Consulting
4. Centralized
5. Centre of Excellence
6. Federated
[2] Functional
• Resource allocation driven by a functional
agenda rather than an enterprise agenda.
• Analysts are located in the functions where
the most analytical activity takes place, but
may also provide services to rest of the
corporation.
• Little coordination
[3] Consulting
• Resources allocated based on availability
on a first-come first-served basis without
necessarily aligning to enterprise objectives
• Analysts work together in a central group
but act as internal consultants who charge
“clients” (business units) for their services
• No centralized coordination
[6] Federated
• Same as “Center of Excellence” model with
need-based operational involvement to
provide SME support.
• A centralized group of advanced analysts is
strategically deployed to enterprise-wide
initiatives.
• Flexible model with right balance of
centralized and distributed coordination.
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
Agile governs analytics development, DevOps optimizes code verification, builds, and delivery of new analytics, and SPC orchestrates, monitors, and validates the data factory.
https://www.altexsoft.com/blog/dataops-essentials/
DATAOPS
DataOps puts data pipelines into a CI/CD paradigm.
Development – involve building a new pipeline, changing a data model or redesigning
a dashboard.
Testing – checking the most minor update for data accuracy, potential deviation, and
errors.
Deployment – moving data jobs between environments, pushing them to the next
stage, or deploying the entire pipeline in production.
Monitoring – allows data professionals to identify bottlenecks, catch abnormal
patterns, and measure adoption of changes.
Orchestration – automates moving data between different stages, monitoring
progress, triggering autoscaling, and operations related to data flow management.
https://www.altexsoft.com/blog/dataops-essentials/
TECHNOLOGIES TO RUN DATAOPS
MLOps sits at the intersection of Machine Learning, DevOps, and Data Engineering.
Real challenge isn’t building an ML model, but building an integrated ML system and
to continuously operate it in production.
To deploy and maintain ML systems in production reliably and efficiently.
Automating continuous integration (CI), continuous delivery (CD), and continuous
training (CT) for machine learning (ML) systems.
Frameworks
• Kubeflow and Cloud Build
• Amazon AWS MLOps
• Microsoft Azure MLOps
https://ml-ops.org/content/mlops-principles
MLOPS
https://builtin.com/machine-learning/mlops
MLOPS
https://builtin.com/machine-learning/mlops
DATAOPS AND MLOPS
1 COURSE LOGISTICS
2 FUNDAMENTALS OF DATA SCIENCE
3 DATA SCIENCE REAL WORLD APPLICATIONS
4 DATA SCIENCE CHALLENGES
5 DATA SCIENCE TEAMS
6 SOFTWARE ENGINEERING FOR DATA SCIENCE
7 FURTHER READING
https://data-flair.training/blogs/data-science-use-cases/
https://www.northeastern.edu/graduate/blog/what-does-a-data-scientist-do/
https://www.visual-paradigm.com/guide/software-development-process/what-is-a-software-process-model/
https://www.altexsoft.com/blog/datascience/how-to-structure-data-science-team-key-models-and-roles/
https://www.cio.com/article/3217026/what-is-a-data-scientist-a-key-data-analytics-role-and-a-lucrative-career.html
https://atlan.com/what-is-dataops/
THANK YOU