MS Data Science Course Syllabus Overview

CS Lecture

Uploaded by

sumera sajid

Ms. Mehroz Sdaiq

9/23/2020 Bahria University Islamabad 1


Introduction
• Data Science - why all the excitement and hype?
• What is data science?
• Course Information – syllabus, grading, etc.



About the Course
• A mixture of theory and practice
• Introductory, broad overview of subjects
• Focus on practical aspects, but not on ever-changing technology and tools
• Seminar style - I am here to learn as well as to teach
• Language choice: Python
• Relatively easy to learn (for computer scientists) compared to R (more popular among statisticians)
• Open source means easy access (as opposed to SAS or MATLAB)
• Which one is more frequently used in data science?



Textbook
Required

• B1: Cathy O’Neil and Rachel Schutt. Doing Data Science, Straight Talk From The
Frontline. O’Reilly. 2014.
• B2: Data Science from Scratch: First Principles with Python by
Joel Grus, Publisher: O'Reilly Media; 1st edition (2015).

Optional

• Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1, Cambridge University Press. 2014. (free online).



Grading Policy
• Quizzes - 10
• Homework Assignments and In-class Exercises - 20
• Mid-term Exam - 20
• Final-term Exam - 50



Assessments
• The assessment includes three components:
• Quizzes
• Class Homework/Projects, assigned after each lecture
• Final Project
• In addition to this, student attendance, originality of work and contributions to
the class will be taken into account



Tentative Course Plan
• Week – 01 : Introduction: What is Data Science?
• Week – 02 : Exploratory Data Analysis, and the Data Science Process
• Week – 03 : Statistical Analysis
• Week – 04 : Introduction to Machine Learning
• Week – 05 : Spam Filters, Naive Bayes, and Wrangling
• Week – 06 : Feature Generation and Feature Selection (Extracting Meaning from
Data)
• Week – 07 : Business Intelligence and Data Warehousing
• Week – 08 : Revision



Tentative Course Plan
• Week – 09 : Mid-term Exam
• Week – 10 : Recommendation Systems: Building a User-Facing Data Product
• Week – 11 : Mining Social-Network Graphs
• Week – 12 : Data Visualization
• Week – 13 : Data Leakage and Model Evaluations
• Week – 14 : Data Science and Ethical Issues
• Week – 15,16 : Project Presentations
• Week – 17 : Revision
• Week – 18 : Final-term Exam



Data Scientist Job Trend



Data Science – A Definition
Definition:
Data science is the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.



Types of Data


What to do with Data
• Aggregation and Statistics
• Data warehousing and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling


Big Data Vs. Data Science



Skills Required to be a Data Scientist



Data Science: Why all the Hype?

Johns Hopkins University created the world’s first publicly accessible novel coronavirus (2019-nCoV) tracking dashboard.



Data Science: Why all the Hype?

A second site named CoronaTracker has emerged from Malaysia, offering live global updates, charts, and maps of the virus’ spread, with plans for an app in the works.



Where does the DATA come from?



Data Science – A Definition
Definition:
Data science is the science that uses computer science, statistics, machine learning, visualization, and human-computer interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products.



Data Science – Goals
Goals:
Turn data into data products.



Data Science – A Definition



Data Science – A Definition



How to Use Data?

[Diagram: two rows of boxes]
Predictive Models, Evaluate/Interpret, Product/Decision Making
Exploratory Analysis, Knowledge Models, Product/Decision Making



Data Science Applications
• Marketing: predict the characteristics of high lifetime value (LTV) customers, which can be used to support customer segmentation, identify upsell opportunities, and support other marketing initiatives
• Logistics: forecast how many of which items will be needed and where, which enables lean inventory and prevents out-of-stock situations
• Healthcare: analyze survival statistics for different patient attributes (age, blood type, gender, etc.) and treatments; predict risk of readmission based on patient attributes, medical history, etc.



Data Science Pipeline


Seaborn


Advantages vs. Disadvantages


Data Science - Process



Data Science - Process



Data Science - Process

1 Determine

2 Understand

3 Map



Data Science - Process
1 Determine
2 Understand
3 Map

What does the client want to achieve?
Primary Objective
• Reduce attrition
• Customized targeting
• Plan future media spend
• Prevent fraud
• Recommend Products



Data Science - Process
1 Determine
2 Understand
3 Map

• Understand success criteria.
- Specific, measurable, time-bound
• List assumptions, constraints, and important factors.
• Identify secondary or competing objectives.
• Study existing solutions (if any).



Data Science - Process
1 Determine
2 Understand
3 Map

Business Objective -> Technical Objective
• State the project objective(s) in technical terms.
• Describe how the data science project will help solve the business problem.
• Explore successful scenarios.



Data Science - Process

1 Determine

2 Understand

3 Map



Data Science - Process
Project Plan

• Duration
• Inventory of resources
• Tools and techniques
• Risks and contingencies
• Costs and benefits
• Milestones



Data Science - Process

1 Identify

2 Collect

3 Assess

4 Vectorize



Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize

• Data sources, formats.
• Databases, streaming APIs, logs, Excel files, websites, etc.
• Entity Relationship Diagram (ERD).
• Identify additional data sources.
• Demographics data appends,
• Geographical data,
• Census data, etc.
• Identify relevant data.
• Record unavailable data.
• How long a history is available, and how much of it should one use?



Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize

• Access or acquire all relevant data in a central location.
• Quality control checks and tests
• File formats, delimiters
• Number of records, columns
• Primary keys



Data Science - Process
1 Identify
2 Collect
3 Assess
4 Vectorize

First Look at the Data
• Get familiar with the data.
• Study seasonality.
• Monthly/weekly/daily patterns
• Unexplained gaps or spikes
• Detect mistakes.
• Extreme or outlier values
• Unusual values
• Special missing values
• Check assumptions.
• Review distributions.
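A first look of this kind can be sketched in plain Python. Everything specific here is an assumption made for illustration: the daily counts are made up, -999 stands in for a "special missing" code, and the median/MAD rule with k = 3 is just one possible outlier check.

```python
from statistics import median

# Hypothetical daily transaction counts; -999 is an assumed "special missing" code.
daily_counts = [102, 98, 105, -999, 101, 97, 480, 99, 103, 100]
SPECIAL_MISSING = {-999}

def first_look(values, k=3.0):
    """Split values into special-missing codes, outliers, and typical values.

    Outliers are flagged with a median/MAD rule: |x - median| > k * MAD.
    """
    missing = [v for v in values if v in SPECIAL_MISSING]
    valid = [v for v in values if v not in SPECIAL_MISSING]
    med = median(valid)
    mad = median(abs(v - med) for v in valid)  # median absolute deviation
    outliers = [v for v in valid if abs(v - med) > k * mad]
    typical = [v for v in valid if abs(v - med) <= k * mad]
    return missing, outliers, typical

missing, outliers, typical = first_look(daily_counts)
print("special missing:", missing)
print("outliers:", outliers)
print("typical range:", min(typical), "to", max(typical))
```

Flagged values are candidates to review with the data owner, not automatic deletions.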



Data Science - Process



Data Science - Process
Goal: Create the Analysis Dataset
1 Identify

2 Collect

3 Assess

4 Vectorize



Target Definition
Churn = 90 days of consecutive inactivity (for a pre-paid telecom customer)
• What’s inactivity?
• Incoming and outgoing calls, data usage, incoming text, promotional texts,
voicemail usage, call forwarding etc.
• Customers may change their device or phone number.
• Churn at the individual (person) level, or at the device (phone) level?
• Customers may return (become active again) after 90 days of inactivity?
• Prediction window
• Predict 90 days of consecutive inactivity?
• Would 10 days of consecutive inactivity suffice?
• How many customers return after x days of inactivity?
• Fraud, Involuntary churn etc.
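The churn definition above can be made concrete with a small sketch. It assumes each customer is represented by a list of dates on which any qualifying activity occurred; which events qualify, and whether the unit is a person or a device, are exactly the open questions listed above. The function names and the observation window are hypothetical.

```python
from datetime import date, timedelta

CHURN_DAYS = 90  # threshold taken from the definition above

def longest_inactive_gap(activity_dates, window_start, window_end):
    """Longest run of consecutive days with no activity inside the window."""
    days = sorted(d for d in set(activity_dates) if window_start <= d <= window_end)
    longest = 0
    previous = window_start - timedelta(days=1)  # the day before the window opens
    for d in days:
        longest = max(longest, (d - previous).days - 1)
        previous = d
    # Inactive days between the last activity and the end of the window.
    return max(longest, (window_end - previous).days)

def is_churned(activity_dates, window_start, window_end, threshold=CHURN_DAYS):
    return longest_inactive_gap(activity_dates, window_start, window_end) >= threshold

start, end = date(2020, 1, 1), date(2020, 6, 30)
returning = [date(2020, 1, 10), date(2020, 6, 1)]              # comes back after a long gap
regular = [start + timedelta(days=30 * k) for k in range(7)]   # active every 30 days
print(longest_inactive_gap(returning, start, end), is_churned(returning, start, end))
print(longest_inactive_gap(regular, start, end), is_churned(regular, start, end))
```

The first customer is labeled as churned even though they return in June, which is why the question of returning customers matters for the target definition.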



Accurate but not Precise
Chart to help determine the risk of bear attack



Modeling Sample
• Historical trends and seasonality
• Are there certain timeframes that should be discarded?
• The model should be generalizable.
• Eligible, relevant population
• Must align with the business goals
• Eligible, relevant markets
• Must align with the business goals
• E.g., within a certain drive-

Selection Bias



Information Leakage

The leading indicators must be calculated from the timeframe leading up to the
event – it must not overlap with the prediction window.
Beware of proxy events, e.g., future bookings.
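One way to enforce this separation is to split every customer's events at a cutoff date before computing anything. This is a minimal sketch: the function name, the cutoff date, and the 90-day prediction window are assumptions for illustration.

```python
from datetime import date, timedelta

def split_events(event_dates, cutoff, prediction_days=90):
    """Separate events into a feature window and a prediction window.

    Features may use only events strictly before `cutoff`; the target uses
    only events in [cutoff, cutoff + prediction_days). The two never overlap.
    """
    window_end = cutoff + timedelta(days=prediction_days)
    feature_events = [d for d in event_dates if d < cutoff]
    label_events = [d for d in event_dates if cutoff <= d < window_end]
    return feature_events, label_events

events = [date(2020, 1, 5), date(2020, 2, 20), date(2020, 4, 2), date(2020, 5, 15)]
cutoff = date(2020, 3, 1)
feature_events, label_events = split_events(events, cutoff)
print("used for predictors:", feature_events)
print("used for the target:", label_events)
```

A proxy event such as a future booking would have to be filtered by the date it was recorded, not the date it refers to, to stay on the correct side of the cutoff.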



Data Aggregation
Attribute creation
• Derived attributes: Household income / Number of adults = Income per adult
• Brainstorm with team members (both technical and non-technical)



Data Aggregation
• Number of transactions (Frequency)
• Days since the last transaction (Recency)
• Days since the earliest transaction (Tenure)
• Avg. days between transactions
• # of transactions during weekends
• % of transactions during weekends
• # of transactions by day-part (breakfast, lunch, etc.)
• % of transactions by day-part
• Days since last transaction / Avg. days between transactions
• …
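Several of the aggregates above can be computed from nothing more than a customer's transaction dates. The sketch below assumes at least one transaction per customer and a hypothetical as-of date; the dictionary keys are illustrative names, not a standard.

```python
from datetime import date

def transaction_features(dates, as_of):
    """Per-customer aggregates from a non-empty list of transaction dates."""
    dates = sorted(dates)
    frequency = len(dates)
    recency = (as_of - dates[-1]).days
    tenure = (as_of - dates[0]).days
    gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else None
    weekend = sum(1 for d in dates if d.weekday() >= 5)  # Saturday=5, Sunday=6
    return {
        "frequency": frequency,
        "recency_days": recency,
        "tenure_days": tenure,
        "avg_days_between": avg_gap,
        "n_weekend": weekend,
        "pct_weekend": 100.0 * weekend / frequency,
        # Recency relative to the customer's own rhythm; None if undefined.
        "recency_over_avg_gap": recency / avg_gap if avg_gap else None,
    }

txns = [date(2020, 9, 1), date(2020, 9, 5), date(2020, 9, 11), date(2020, 9, 13)]
features = transaction_features(txns, as_of=date(2020, 9, 23))
for name, value in features.items():
    print(name, value)
```

The last ratio is the derived attribute from the list: a value well above 1 means the customer has been silent for longer than their usual interval.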

Data Aggregation

1 Identify

2 Collect

3 Assess

4 Vectorize



Data Science - Process



Data Science - Process
• Descriptive statistics
• Review with the client
• Correlation analysis
• Review with the client
• Watch out for data leakage
• Impute missing values
• Trim extreme values
• Process categorical attributes
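The last three steps above (imputing, trimming, and handling categorical attributes) can be sketched as follows. Median imputation, clamping to fixed bounds, and one-hot encoding are each just one common option, chosen here for illustration; the data and the bounds are made up.

```python
from statistics import median

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def trim(values, lower, upper):
    """Clamp values to [lower, upper] (a simple form of winsorizing)."""
    return [min(max(v, lower), upper) for v in values]

def one_hot(labels):
    """Encode a categorical attribute as one 0/1 column per category."""
    categories = sorted(set(labels))
    return categories, [[int(label == c) for c in categories] for label in labels]

incomes = [42, None, 55, 48, 900, None, 51]   # hypothetical, in thousands
imputed = impute_median(incomes)
trimmed = trim(imputed, lower=0, upper=100)   # assumed plausible range
categories, encoded = one_hot(["prepaid", "postpaid", "prepaid"])
print(imputed)
print(trimmed)
print(categories, encoded)
```

To avoid the leakage warned about above, the fill value and the trimming bounds would be derived from the training data only and then reused on new data.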



Data Science - Process
• Transformations (square, log, etc.)
• Binning / variable smoothing
• Create additional features
• Interactions
• Normalization (scaling)
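Each item above maps to a few lines of Python. This sketch assumes non-negative numeric inputs with at least two distinct values; equal-width bins and min-max scaling are one choice among several, and the sample values are made up.

```python
import math

def log_transform(values):
    """log(1 + x), which tolerates zeros; assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so it is folded into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def min_max_scale(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def interaction(xs, ys):
    """Element-wise product of two features."""
    return [x * y for x, y in zip(xs, ys)]

spend = [0, 10, 20, 30, 40]
print(log_transform(spend))
print(equal_width_bins(spend, n_bins=4))
print(min_max_scale(spend))
print(interaction([1, 2], [3, 4]))
```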



Data Science - Process



Data Science - Process
• Feature Reduction:
• The process of selecting a subset of features for use in model construction
• Useful for both supervised and unsupervised learning problems



Data Science - Process
• True dimensionality <<< Observed dimensionality
• The abundance of redundant and irrelevant features
• Curse of dimensionality
• With a fixed number of training samples, the predictive power reduces as the dimensionality increases. [Hughes phenomenon]
• With $d$ binary variables, the number of possible combinations is $O(2^d)$.
• Goal of the Analysis
• Descriptive, Diagnostic, Predictive, Prescriptive
• Law of Parsimony [Occam’s Razor]
• Other things being equal, simpler explanations are generally better than complex ones.
• Overfitting
• Execution time (Algorithm and data)
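The growth in combinations for binary variables is easy to check by enumeration. The sketch below also divides a fixed, hypothetical sample of 1,000 rows across the combinations to show how thin the data becomes as the number of variables grows.

```python
from itertools import product

def binary_combinations(d):
    """All possible value combinations of d binary variables."""
    return list(product([0, 1], repeat=d))

n_samples = 1000  # hypothetical fixed training-set size
for d in (2, 3, 10, 15):
    n_cells = len(binary_combinations(d))
    print(f"d={d:2d}  combinations={n_cells:6d}  samples per combination={n_samples / n_cells:.3f}")
```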

Data Science - Process
• Percent missing values
• Amount of variation
• Pairwise correlation
• Multicollinearity
• Principal Component Analysis (PCA)
• Cluster analysis
• Correlation (with the target)
• Forward selection
• Backward elimination
• Stepwise selection
• Tree-based selection
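The first three criteria in this list (percent missing, amount of variation, pairwise correlation) can be combined into a simple filter. This is a sketch under assumed thresholds, with a made-up table stored as a dict of columns; it does not cover PCA, clustering, or the selection procedures named afterwards.

```python
from statistics import mean, pvariance

def percent_missing(column):
    return 100.0 * sum(v is None for v in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_features(table, max_missing=50.0, min_variance=1e-9, max_corr=0.95):
    """Drop columns that are mostly missing, near-constant, or redundant."""
    kept = {}
    for name, column in table.items():
        if percent_missing(column) > max_missing:
            continue
        if pvariance([v for v in column if v is not None]) < min_variance:
            continue
        kept[name] = column
    selected = []
    for name in kept:
        # Keep a column only if it is not highly correlated with one already kept.
        if all(abs(pearson(kept[name], kept[other])) <= max_corr for other in selected):
            selected.append(name)
    return selected

table = {
    "age": [25, 32, 47, 51, 38],
    "age_months": [300, 384, 564, 612, 456],   # redundant: 12 * age
    "country_code": [1, 1, 1, 1, 1],           # no variation
    "fax_count": [None, None, None, 3, None],  # 80% missing
    "visits": [3, 9, 1, 7, 4],
}
selected = filter_features(table)
print(selected)
```

Which of two correlated columns survives depends on column order here; in practice that choice would be made with the client or by correlation with the target.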



Data Science - Process
• Try more than one machine learning technique.
• Fine-tune parameters.
• Assess model performance.
• Avoid Over-fitting.



Assess Model Performance
• Types of Errors
• False Positive
• False Negative
• New Age: Area Under the ROC Curve (AUC), Confusion Matrix, Precision,
Recall, Log-loss, etc.
• Old School: Model Lift, Model Gains, Kolmogorov-Smirnov (KS), etc.
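Several of the measures named above can be computed directly from labels and predicted probabilities. The sketch assumes binary labels coded 0/1, a 0.5 decision threshold, and made-up scores; the AUC is computed by pairwise comparison, which is fine for small samples but slow for large ones.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Assumes at least one predicted positive and one actual positive."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp), tp / (tp + fn)

def log_loss(y_true, probs, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def auc(y_true, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, probs) if t == 1]
    neg = [p for t, p in zip(y_true, probs) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [int(p >= 0.5) for p in probs]
print("TP, FP, FN, TN:", confusion_counts(y_true, y_pred))
print("precision, recall:", precision_recall(y_true, y_pred))
print("log-loss:", round(log_loss(y_true, probs), 4))
print("AUC:", round(auc(y_true, probs), 4))
```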



Data Science - Process

1 Model Selection

2 Assessment

3 Presentation



Data Science - Process
1 Model Selection
2 Assessment
3 Presentation

• Law of Parsimony (Occam’s Razor)
• Model execution time
• Deployment complexity
Build the simplest solution that can adequately answer the question.



Data Science - Process

1 Model Selection

2 Assessment

3 Presentation



Data Science - Process
1 Model Selection
2 Assessment
3 Presentation

AUC, etc.
Cumulative Gains Chart / Lift Chart
Compare against existing business rules/model
Predictor Importance
Each predictor’s relationship with the target
Reason-coding
Model usage recommendations
Decile reports
Personify
Model peer-review (Quality Control)
Interpret results as they relate to the business application.
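A decile report and the numbers behind a cumulative gains or lift chart can be produced with a short function. This is a sketch with made-up scores and outcomes; it assumes at least as many records as groups and at least one positive outcome.

```python
def decile_report(y_true, scores, n_groups=10):
    """Sort by score (highest first), cut into equal groups, and report each
    group's response rate, lift over the overall rate, and cumulative gain."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_pos = sum(y_true)
    overall_rate = total_pos / n
    rows, cum_pos = [], 0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        pos = sum(t for _, t in group)
        cum_pos += pos
        rate = pos / len(group)
        rows.append({
            "group": g + 1,
            "response_rate": rate,
            "lift": rate / overall_rate,
            "cumulative_gain": cum_pos / total_pos,
        })
    return rows

# Hypothetical scores for 20 customers, with responders concentrated at the top.
scores = [i / 20 for i in range(20, 0, -1)]
y_true = [1, 1, 1, 0, 1, 0, 0, 1] + [0] * 12
rows = decile_report(y_true, scores)
for row in rows[:4]:
    print(row)
```

In this example the top decile responds at four times the overall rate, and the top four deciles capture every responder.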



Data Science - Process
• Model production cycle
• Scoring code, or publish model as a web service
• Hand-off
• Model Documentation (Technical Specifications)
• Data preparation, transformations, imputations, parameter settings, etc.
• Reproducibility
• Docker containers
• Model Persistence vs. Model Transience



Data Science – Process

1 Monitor

2 Maintain

3 Test



Data Science – Process
1 Monitor
2 Maintain
3 Test

Model decay tracking (monitoring) plan
• Model performance over time
• Predictor distribution
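Tracking a predictor's distribution needs some measure of how far current data has drifted from the training data. The slides do not name one; the sketch below uses the Population Stability Index (PSI) as an assumed choice, with made-up ages and bin edges.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins.

    `edges` are interior cut points, so values fall into len(edges) + 1 bins.
    Empty bins are floored at `eps` to keep the logarithm finite.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [30, 45, 60]                                      # assumed age bands
train_age = [22, 25, 31, 35, 38, 41, 44, 52, 58, 63]      # at model build time
current_age = [33, 41, 47, 49, 52, 55, 58, 61, 64, 70]    # in production
print("PSI vs. itself:", psi(train_age, train_age, edges))
print("PSI vs. current:", round(psi(train_age, current_age, edges), 3))
```

A value of zero means identical binned distributions; larger values indicate more drift and are a prompt to re-examine the model.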



Data Science – Process
1 Monitor
2 Maintain
3 Test

• Model maintenance plan
• Adding new data sources
• Version control



Data Science – Process
1 Monitor
2 Maintain
3 Test

Campaign Set-up and Execution
• Experimental Design (A/B tests, Fractional Factorial)



Data Science – Process
1 Monitor
2 Maintain
3 Test

Campaign Set-up and Execution
• Experimental Design (A/B tests, Fractional Factorial)
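For a simple A/B test with a yes/no outcome, the two groups can be compared with a two-proportion z-test. This is one standard way to analyze such a test, not necessarily the one intended in the course; the campaign counts are made up, and the sketch does not cover fractional factorial designs.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal, via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical campaign: control (A) vs. model-targeted offer (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The test assumes customers were assigned to the two groups at random and that the sample size was fixed before looking at the results.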



Experimental Design



Experimental Design



Data Science - Recap
