Ms. Mehroz Sdaiq
9/23/2020 Bahria University Islamabad 1
Introduction
• Data Science - why all the excitement and hype?
• What is data science?
• Course Information – syllabus, grading, etc.
About the Course
• A mixture of theory and practice
• Introductory, broad overview of subjects
• Focus on practical aspects, but not on ever-changing technology and tools
• Seminar style - I am here to learn as well as to teach
• Language choice: Python
• Relatively easy to learn (for computer scientists) compared to R (more popular among statisticians)
• Open source means easy access (as opposed to SAS or MATLAB)
• Which one is more frequently used in data science?
Textbook
Required
• B1: Cathy O’Neil and Rachel Schutt. Doing Data Science: Straight Talk from the Frontline. O’Reilly. 2014.
• B2: Joel Grus. Data Science from Scratch: First Principles with Python. O'Reilly Media, 1st edition. 2015.
Optional
• Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. (free online).
Grading Policy
• Quizzes - 10
• Homework Assignments and In-class Exercises - 20
• Mid-term Exam - 20
• Final-term Exam - 50
Assessments
• The assessment includes three components:
• Quizzes
• Class Homework/Projects, assigned after each lecture
• Final Project
• In addition to this, student attendance, originality of work and contributions to
the class will be taken into account
Tentative Course Plan
• Week – 01 : Introduction: What is Data Science?
• Week – 02 : Exploratory Data Analysis, and the Data Science Process
• Week – 03 : Statistical Analysis
• Week – 04 : Introduction to Machine Learning
• Week – 05 : Spam Filters, Naive Bayes, and Wrangling
• Week – 06 : Feature Generation and Feature Selection (Extracting Meaning from
Data)
• Week – 07 : Business Intelligence and Data Warehousing
• Week – 08 : Revision
Tentative Course Plan
• Week – 09 : Mid-term Exam
• Week – 10 : Recommendation Systems: Building a User-Facing Data Product
• Week – 11 : Mining Social-Network Graphs
• Week – 12 : Data Visualization
• Week – 13 : Data Leakage and Model Evaluations
• Week – 14 : Data Science and Ethical Issues
• Week – 15,16 : Project Presentations
• Week – 17 : Revision
• Week – 18 : Final-term Exam
Data Scientist Job Trend
Data Science – A Definition
Definition:
Data science is the discipline that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.
Types of Data
What to do with Data
• Aggregation and Statistics: data warehousing and OLAP
• Indexing, Searching, and Querying: keyword-based search; pattern matching (XML/RDF)
• Knowledge discovery: data mining; statistical modeling
Big Data vs. Data Science
Skills Required to be a Data Scientist
Data Science: Why all the Hype?
Johns Hopkins University created the world’s first publicly accessible novel coronavirus (2019-nCoV) tracking dashboard.
Data Science: Why all the Hype?
A second site named CoronaTracker has emerged from Malaysia, offering live global updates, charts, and maps of the virus’ spread, with plans for an app in the works.
Where does the data come from?
Data Science – Goals
Goals:
Turn data into data products.
How to Use Data?
• Predictive Models -> Evaluate/Interpret -> Product/Decision Making
• Exploratory Analysis -> Knowledge Models -> Product/Decision Making
Data Science Applications
• Marketing: predict the characteristics of high lifetime value (LTV) customers, which can be used to support customer segmentation, identify upsell opportunities, and support other marketing initiatives
• Logistics: forecast how many of which items will be needed and where, which enables lean inventory and prevents out-of-stock situations
• Healthcare: analyze survival statistics for different patient attributes (age, blood type, gender, etc.) and treatments; predict risk of readmission based on patient attributes, medical history, etc.
Data Science Pipeline
Seaborn
Advantages vs. Disadvantages
Data Science - Process
1 Determine
2 Understand
3 Map
Data Science - Process
1 Determine
2 Understand
3 Map
What does the client want to achieve?
Primary Objective
• Reduce attrition
• Customized targeting
• Plan future media spend
• Prevent fraud
• Recommend products
Data Science - Process
1 Determine
2 Understand
3 Map
• Understand success criteria.
- Specific, measurable, time-bound
• List assumptions, constraints, and important factors.
• Identify secondary or competing objectives.
• Study existing solutions (if any).
Data Science - Process
1 Determine
2 Understand
3 Map
Business Objective -> Technical Objective
• State the project objective(s) in technical terms.
• Describe how the data science project will help solve the business problem.
• Explore successful scenarios.
Data Science - Process
Project Plan
• Duration
• Inventory of resources
• Tools and techniques
• Risks and contingencies
• Costs and benefits
• Milestones
Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize
Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize
• Data sources, formats.
• Databases, streaming APIs, logs, Excel files, websites, etc.
• Entity Relationship Diagram (ERD).
• Identify additional data sources.
• Demographic data appends,
• Geographical data,
• Census data, etc.
• Identify relevant data.
• Record unavailable data.
• How long a history is available, and how much of it should one use?
Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize
• Access or acquire all relevant data in a central location.
• Quality control checks and tests
• File formats, delimiters
• Number of records, columns
• Primary keys
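The quality control checks listed on this slide (delimiter, number of records, columns, primary keys) can be sketched with the Python standard library. The extract `RAW`, its column names, and the choice of `customer_id` as primary key are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical pipe-delimited extract; a real one would be read from disk.
RAW = (
    "customer_id|signup_date|plan\n"
    "1|2020-01-05|prepaid\n"
    "2|2020-02-11|postpaid\n"
    "2|2020-02-11|postpaid\n"
)

def profile(text, delimiter, key):
    """Count records and columns and list primary-key values that repeat."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    columns = list(rows[0].keys()) if rows else []
    seen, duplicates = set(), set()
    for row in rows:
        if row[key] in seen:
            duplicates.add(row[key])
        seen.add(row[key])
    return {"records": len(rows), "columns": columns,
            "duplicate_keys": sorted(duplicates)}

report = profile(RAW, delimiter="|", key="customer_id")
print(report)
```

A non-empty `duplicate_keys` list signals that the assumed primary key is not unique and that the extract needs a closer look before it is used.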
Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize
First Look at the Data
• Get familiar with the data.
• Study seasonality.
• Monthly/weekly/daily patterns
• Unexplained gaps or spikes
• Detect mistakes.
• Extreme or outlier values
• Unusual values
• Special missing values
• Check assumptions.
• Review distributions.
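Two of the first-look checks, extreme values and unexplained gaps, can be sketched as follows. Tukey's interquartile-range rule is one possible outlier test, chosen here for illustration and not prescribed by the slides; the sales figures and dates are made up.

```python
from datetime import date, timedelta
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def missing_days(days):
    """Return calendar days absent between the first and last observed day."""
    seen = set(days)
    first, last = min(days), max(days)
    candidates = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return [d for d in candidates if d not in seen]

daily_sales = [102, 98, 105, 99, 101, 100, 970, 103]     # made-up values
observed = [date(2020, 9, d) for d in (1, 2, 3, 6, 7)]   # made-up dates

outliers = iqr_outliers(daily_sales)
gaps = missing_days(observed)
print(outliers, gaps)
```

Whether a flagged value is a mistake or a genuine spike is a judgment to make with someone who knows the data.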
Data Science - Process
Goal: Create the Analysis Dataset
1 Identify
2 Collect
3 Assess
4 Vectorize
Target Definition
Churn = 90 days of consecutive inactivity (for a pre-paid telecom customer)
• What’s inactivity?
• Incoming and outgoing calls, data usage, incoming text, promotional texts,
voicemail usage, call forwarding etc.
• Customers may change their device or phone number.
• Churn at the individual (person) level, or at the device (phone) level?
• Customers may return (become active again) after 90 days of inactivity?
• Prediction window
• Predict 90 days of consecutive inactivity?
• Would 10 days of consecutive inactivity suffice?
• How many customers return after x days of inactivity?
• Fraud, Involuntary churn etc.
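The churn definition above can be turned into a labeling rule. This is a minimal sketch that assumes the question of what counts as activity has already been settled, so each customer is reduced to a list of activity dates; the function names and the sample customers are hypothetical.

```python
from datetime import date, timedelta

def longest_inactive_run(activity_days, observation_end):
    """Longest run of consecutive days without activity, through observation_end.

    Assumes at least one activity day, none later than observation_end.
    """
    # A sentinel one day past the window makes the tail gap count like any other.
    days = sorted(set(activity_days)) + [observation_end + timedelta(days=1)]
    return max((later - earlier).days - 1 for earlier, later in zip(days, days[1:]))

def is_churned(activity_days, observation_end, threshold_days=90):
    """Label a customer as churned once the inactivity threshold is reached."""
    return longest_inactive_run(activity_days, observation_end) >= threshold_days

end = date(2020, 6, 30)
customer_a = [date(2020, 1, 1), date(2020, 1, 20), date(2020, 5, 1)]
customer_b = [date(2020, 1, 1), date(2020, 3, 15), date(2020, 6, 1)]

print(is_churned(customer_a, end), is_churned(customer_b, end))
```

Customer A is labeled churned at the 90-day threshold even though activity resumed in May, which is exactly the "customers may return" question raised above. Lowering `threshold_days` to 10 would label customer B as churned too.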
Accurate but not Precise
Chart to help determine the risk of bear attack
Modeling Sample
• Historical trends and seasonality
• Are there certain timeframes that should be discarded?
• The model should be generalizable.
• Eligible, relevant population
• Must align with the business goals
• Eligible, relevant markets
• Must align with the business goals
• E.g., within a certain drive-
Selection Bias
Information Leakage
The leading indicators must be calculated from the timeframe leading up to the
event – it must not overlap with the prediction window.
Beware of proxy events, e.g., future bookings.
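One way to enforce the non-overlap rule is to derive both windows from a single cutoff date. The sketch below assumes events are plain dates; the function name, the window lengths, and the sample events are hypothetical.

```python
from datetime import date, timedelta

def split_windows(events, cutoff, feature_days, prediction_days):
    """Split event dates into a feature window ending at cutoff and a
    prediction window starting the day after, so the two never overlap."""
    feature_start = cutoff - timedelta(days=feature_days - 1)
    prediction_end = cutoff + timedelta(days=prediction_days)
    features = [d for d in events if feature_start <= d <= cutoff]
    target = [d for d in events if cutoff < d <= prediction_end]
    return features, target

events = [date(2020, 6, 1), date(2020, 6, 15), date(2020, 6, 30),
          date(2020, 7, 1), date(2020, 8, 10), date(2020, 12, 1)]
features, target = split_windows(events, cutoff=date(2020, 6, 30),
                                 feature_days=30, prediction_days=90)
print(features, target)
```

Leading indicators are then computed from `features` only, and the outcome from `target` only. Proxy events such as future bookings still need a manual review, because a date filter cannot tell that a record created before the cutoff describes something after it.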
Data Aggregation
Attribute creation
• Derived attributes: Household income / Number of adults = Income per adult
• Brainstorm with team members (both technical and non-technical)
Data Aggregation
• Number of transactions (Frequency)
• Days since the last transaction (Recency)
• Days since the earliest transaction (Tenure)
• Avg. days between transactions
• # of transactions during weekends
• % of transactions during weekends
• # of transactions by day-part (breakfast, lunch, etc.)
• % of transactions by day-part
• Days since last transaction / Avg. days between transactions
• …
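Several of the aggregates listed above can be computed from one customer's transaction dates. This is a minimal sketch; the function name, the output keys, and the sample dates are hypothetical, and the day-part attributes are left out because they would need timestamps.

```python
from datetime import date

def transaction_features(dates, as_of):
    """Aggregate one customer's transaction dates into model attributes."""
    dates = sorted(dates)
    frequency = len(dates)
    recency = (as_of - dates[-1]).days
    tenure = (as_of - dates[0]).days
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else None
    weekend = sum(d.weekday() >= 5 for d in dates)   # Saturday=5, Sunday=6
    return {
        "frequency": frequency,
        "recency_days": recency,
        "tenure_days": tenure,
        "avg_days_between": avg_gap,
        "weekend_count": weekend,
        "weekend_share": weekend / frequency,
        "recency_to_avg_gap": recency / avg_gap if avg_gap else None,
    }

purchases = [date(2020, 9, 1), date(2020, 9, 5), date(2020, 9, 11)]
row = transaction_features(purchases, as_of=date(2020, 9, 21))
print(row)
```

The last ratio compares the current silence with the customer's usual rhythm: a value well above 1 means the customer is overdue relative to their own history.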
Data Science - Process
• Descriptive statistics
• Review with the client
• Correlation analysis
• Review with the client
• Watch out for data leakage
• Impute missing values
• Trim extreme values
• Process categorical attributes
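The last three steps on this slide can be sketched as small functions. Median imputation, percentile clamping (winsorizing), and one-hot encoding are common choices picked here for illustration; the slides do not prescribe a specific method, and the sample values are made up.

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def trim(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given percentiles (winsorizing)."""
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

def one_hot(values):
    """Encode a categorical attribute as one 0/1 indicator per level."""
    levels = sorted(set(values))
    return [{level: int(v == level) for level in levels} for v in values]

incomes = [30, 32, None, 35, 31, 33, 34, 500]   # made-up values
imputed = impute_median(incomes)
trimmed = trim(imputed)
plans = one_hot(["prepaid", "postpaid", "prepaid"])
print(imputed, trimmed, plans)
```

To avoid leakage, the fill value and the clamping bounds should be estimated on the training sample only and then reused unchanged when scoring new data.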
Data Science - Process
• Transformations (square, log, etc.)
• Binning / variable smoothing
• Create additional features
• Interactions
• Normalization (scaling)
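The feature-engineering steps above can be illustrated with a few one-line functions. Equal-width binning and z-score standardization are just one option each among several; all names and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev

def log_transform(values):
    """log(1 + x) compresses a long right tail; requires x >= 0."""
    return [math.log1p(v) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 (needs max > min)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def interaction(xs, ys):
    """Element-wise product of two attributes."""
    return [x * y for x, y in zip(xs, ys)]

spend = [0, 10, 20, 30, 40]     # made-up values
visits = [1, 2, 2, 3, 5]        # made-up values

logged = log_transform(spend)
bins = equal_width_bins(spend, n_bins=4)
z = standardize(spend)
spend_x_visits = interaction(spend, visits)
print(bins, spend_x_visits)
```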
Data Science - Process
• Feature Reduction:
• The process of selecting a subset of features for use in model construction
• Useful for both supervised and unsupervised learning problems
Data Science - Process
• True dimensionality <<< Observed dimensionality
• The abundance of redundant and irrelevant features
• Curse of dimensionality
• With a fixed number of training samples, the predictive power reduces as the dimensionality increases. [Hughes phenomenon]
• With d binary variables, the number of possible combinations is O(2^d).
• Goal of the Analysis
• Descriptive, Diagnostic, Predictive, Prescriptive
• Law of Parsimony [Occam’s Razor]
• Other things being equal, simpler explanations are generally better than complex ones.
• Overfitting
• Execution time (Algorithm and data)
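The curse-of-dimensionality point can be made concrete with a little arithmetic: d binary variables give 2^d possible combinations, and each training sample can cover at most one of them. The sample size of 1,000 and the function name are arbitrary choices for illustration.

```python
def max_coverage(n_samples, d):
    """Upper bound on the share of the 2**d binary combinations that
    n_samples observations can cover (each sample hits at most one)."""
    return min(1.0, n_samples / 2 ** d)

for d in (5, 10, 20, 30):
    print(f"d={d:2d}  combinations={2 ** d:>13,}  "
          f"max coverage={max_coverage(1000, d):.6f}")
```

With the sample size held fixed, the covered share collapses as d grows, which is one way to see why adding features without adding data can hurt predictive power.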
Data Science - Process
• Percent missing values
• Amount of variation
• Pairwise correlation
• Multicollinearity
• Principal Component Analysis (PCA)
• Cluster analysis
• Correlation (with the target)
• Forward selection
• Backward elimination
• Stepwise selection
• Tree-based selection
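The first three criteria in this list (percent missing, amount of variation, pairwise correlation) can be combined into a simple screening pass. This is a sketch only: the thresholds are illustrative, the feature table is made up, and the function names are hypothetical. The model-based methods in the list (PCA, stepwise and tree-based selection) are not covered.

```python
from math import sqrt
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def screen(features, max_missing=0.5, min_variance=1e-9, max_corr=0.95):
    """Drop features that are mostly missing, nearly constant, or nearly
    duplicate a feature that was already kept."""
    kept = {}
    for name, column in features.items():
        observed = [v for v in column if v is not None]
        if 1 - len(observed) / len(column) > max_missing:
            continue                       # too many missing values
        if pvariance(observed) < min_variance:
            continue                       # almost no variation
        redundant = False
        for other in kept.values():
            pairs = [(a, b) for a, b in zip(column, other)
                     if a is not None and b is not None]
            if abs(pearson(*zip(*pairs))) > max_corr:
                redundant = True           # nearly collinear with a kept feature
                break
        if not redundant:
            kept[name] = column
    return list(kept)

table = {                                   # made-up data
    "age": [25, 32, 47, 51, 38],
    "age_months": [300, 384, 564, 612, 456],
    "country_code": [1, 1, 1, 1, 1],
    "fax_count": [None, None, None, 7, None],
    "income": [30, 45, 41, 80, 52],
}
selected = screen(table)
print(selected)
```

Because features are visited in order, the first of two collinear features survives. A real project would decide which one to keep based on data quality or interpretability.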
Data Science - Process
• Try more than one machine learning technique.
• Fine-tune parameters.
• Assess model performance.
• Avoid Over-fitting.
Assess Model Performance
• Types of Errors
• False Positive
• False Negative
• New Age: Area Under the ROC Curve (AUC), Confusion Matrix, Precision,
Recall, Log-loss, etc.
• Old School: Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
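The error types and three of the metrics named above can be computed directly from labels and scores. The sketch computes AUC through its rank interpretation (the probability that a random positive is scored above a random negative), which is practical only for small samples. The labels, scores, and the 0.5 cut-off are made up.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels coded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    return tp / (tp + fp), tp / (tp + fn)

def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

matrix = confusion(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
area = auc(y_true, scores)
print(matrix, precision, recall, area)
```

Precision and recall depend on the chosen cut-off, while AUC summarizes the ranking across all cut-offs.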
Data Science - Process
1 Model Selection
2 Assessment
3 Presentation
Data Science - Process
1 Model Selection
2 Assessment
3 Presentation
• Law of Parsimony (Occam’s Razor)
• Model execution time
• Deployment complexity
Build the simplest solution that can adequately answer the question.
Data Science - Process
1 Model Selection
2 Assessment
3 Presentation
AUC, etc.
Cumulative Gains Chart / Lift Chart
Compare against existing business rules/model
Predictor Importance
Each predictor’s relationship with the target
Reason-coding
Model usage recommendations
Decile reports
Personify
Model peer-review (Quality Control)
Interpret results as they relate to the business application.
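A decile report and the numbers behind a cumulative gains or lift chart come from the same table: rank customers by score, split them into equal groups, and compare each group's response rate with the overall rate. The sketch below uses five groups on ten made-up observations to stay short; a decile report would use `n_groups=10` on a much larger sample. The function name and output keys are hypothetical.

```python
def gains_table(y_true, scores, n_groups=10):
    """Rank by score (highest first), split into equal groups, and report each
    group's response rate, lift over the overall rate, and cumulative gain.

    Assumes len(y_true) is a multiple of n_groups and at least one positive.
    """
    ranked = [t for _, t in sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    size = len(ranked) // n_groups
    total = sum(ranked)
    overall = total / len(ranked)
    rows, captured = [], 0
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size]
        captured += sum(group)
        rate = sum(group) / size
        rows.append({"group": g + 1, "rate": rate, "lift": rate / overall,
                     "cumulative_gain": captured / total})
    return rows

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
report = gains_table(y_true, scores, n_groups=5)
for row in report:
    print(row)
```

Read against the business application, the top group here responds at 2.5 times the overall rate and contains half of all responders.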
Data Science - Process
• Model production cycle
• Scoring code, or publish model as a web service
• Hand-off
• Model Documentation (Technical Specifications)
• Data preparation, transformations, imputations, parameter settings, etc.
• Reproducibility
• Docker containers
• Model Persistence vs. Model Transience
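For hand-off and reproducibility, the data preparation settings need to travel with the model. One lightweight way to sketch this is to store imputation values, scaling constants, and coefficients together in one JSON document. The model structure, the field names, and all numbers below are hypothetical; real projects often use a library-specific format instead.

```python
import json

# Hypothetical fitted artifacts: imputation value, scaling constants, and
# the coefficients of a linear scoring model.
model = {
    "version": "0.1",
    "impute": {"income": 41.0},
    "scale": {"income": {"mean": 49.6, "std": 16.8}},
    "coefficients": {"intercept": -1.2, "income": 0.8},
}

def score(model, record):
    """Apply the stored preparation steps, then the stored linear model."""
    total = model["coefficients"]["intercept"]
    for name, weight in model["coefficients"].items():
        if name == "intercept":
            continue
        value = record.get(name)
        if value is None:
            value = model["impute"][name]          # same fill value as in training
        scale = model["scale"][name]
        total += weight * (value - scale["mean"]) / scale["std"]
    return total

# Writing `serialized` to a file or object store is what makes the model persistent.
serialized = json.dumps(model, indent=2)
restored = json.loads(serialized)

print(score(restored, {"income": 60}))
```

Because the scoring code reads every constant from the stored document, a restored model reproduces the original scores without access to the training data.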
Data Science – Process
1 Monitor
2 Maintain
3 Test
Data Science – Process
1 Monitor
2 Maintain
3 Test
Model decay tracking (monitoring) plan
• Model performance over time
• Predictor distribution
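Tracking a predictor's distribution over time needs some measure of how far the current data has drifted from the training data. The Population Stability Index (PSI) is one commonly used measure; it is not named in the slides and is used here only as an example. The bin edges and both samples are made up.

```python
from math import log

def psi(expected, actual, edges):
    """Population Stability Index between two samples of one predictor.

    Values are assigned to len(edges) + 1 bins defined by the sorted edges.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * log(ai / ei) for ei, ai in zip(e, a))

training = list(range(100))                 # distribution at model build time
current = [v + 30 for v in training]        # same predictor, shifted upward
edges = [25, 50, 75]

stable = psi(training, training, edges)
drifted = psi(training, current, edges)
print(stable, drifted)
```

A PSI of zero means identical bin shares; larger values indicate a stronger shift and a reason to re-examine the model.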
Data Science – Process
1 Monitor
2 Maintain
3 Test
• Model maintenance plan
• Adding new data sources
• Version control
Data Science – Process
1 Monitor
2 Maintain
3 Test
Campaign Set-up and Execution
• Experimental Design (A/B tests, Fractional Factorial)
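The outcome of a simple A/B test on conversion rates can be evaluated with a two-proportion z-test. This is a sketch under the usual large-sample assumptions; the campaign counts are made up, and the slides do not say which test the course uses.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up campaign: control A converts 200 of 2000, variant B 260 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(round(z, 2), round(p_value, 4))
```

The test only answers whether the observed difference is plausibly due to chance. Group sizes and the stopping rule should be fixed when the experiment is designed, before any results are seen.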
Experimental Design
Data Science - Recap