ZG 536 Foundations of Data Science, BITS Pilani

The document outlines the Foundations of Data Science course at BITS Pilani, led by Dr. Arindam Roy, detailing the evaluation structure, course content, and the importance of data science across various industries. It emphasizes the interdisciplinary nature of data science, the skills required for various roles, and the CRISP-DM framework for data mining processes. The document also highlights the challenges in data science and the necessity of a standard process for reliable and repeatable results.


ZG 536

Foundations of Data Science


BITS Pilani, Pilani Campus
Dr. Arindam Roy

Data Science Foundations


Introduction
Evaluation

EC1  Experiential Learning Assignments 1 & 2: Take Home (Online), weight 30%, date to be announced
EC2  Mid-Semester Exam: Closed Book, 2 hours, weight 30%, Sunday, 22/09/2024
EC3  Comprehensive Exam: Open Book, 2½ hours, weight 40%, Sunday, 01/12/2024

BITS Pilani, Pilani Campus


Instructor

Dr. Arindam Roy

Qualification:
• Bachelor of Engineering (Computer Science), WBUT
• Master of Science (Information Technology), IIT Kharagpur
• PhD (Workforce Optimization), IIT Kharagpur

Teaching and Research Interests: Data Science, Machine Learning, Education Research



What exactly is Data Science?

• An interdisciplinary field that uses algorithms, procedures, and processes to examine large amounts of data
• The study of data to extract meaningful insights for business
• Using data to solve problems and make decisions!
• Applied Statistics and ML!

Breaking it down:
• Data: Everything is data, structured and unstructured.
• Scientific method: Ask questions, collect data, analyze, interpret, conclude
• Statistics: Patterns, trends, insights
• Domain expertise: SME knowledge keeps insights actionable and relevant
• Programming: Process and manipulate data


Applications
• Every domain!
• Healthcare: Better operations, early detection, prevention
• Retail: Customer behavior, STP, customer experience
• Banking and Finance: Financial advice and planning, predictions, fraud detection
• Transportation: Optimization, better planning
• Manufacturing: Fault detection, IoT, operations and process improvement
• Meteorology: Weather, seismic, geospatial data
• Social media/TC: Sentiment analysis, demand
• Energy and utilities: Consumption, control
• Public services: Planning, development
• Sports, Entertainment: Strategy, content creation, demand analysis
• Politics?


Some Examples
• Recommender systems: Amazon, Netflix, YouTube

• Personalization: Learning, ads, promotions and discounts

• Decision making: Google maps

• Fraud detection: transactions

• Dynamic pricing: Surge pricing

• Smart homes, voice assistants

• Social media trends

• Spam mail filters

• Traffic lights

• Online dating



Why learn Data Science?

• Career opportunities

• Rapid digital evolution

• Data is growing

• Flexibility – all industries, freelancing

• Demand-Supply gap

• Analytical, scientific approach

• Being logical and sensible

• Life skill - Solving real life problems



BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
DS, AI, ML, DL, Analytics?
• Data Science: Processing and analyzing data for insights
• Business Analytics: Solving problems, making decisions
• Artificial Intelligence: Machines simulate human behavior
• Machine Learning: Computers learn from data by themselves
• Deep Learning: Artificial neural networks


DS/ML project flow


Popular Roles and Skills

Data Engineer
• SQL, Python, Hive, Pig, Java, Hadoop, Spark, Kafka, Azkaban, Airflow, AWS, GCP, Azure
• Data warehousing; ability to write, analyze, and debug SQL queries
• Big Data platforms like Hadoop, Spark, Kafka, Flume, Pig, Hive, etc.
• Experience in handling data pipeline and workflow management tools like Azkaban, Luigi, Airflow, etc.
• Strong communication skills

Data/BI Analyst
• SQL, Excel, Python/R, Tableau/PowerBI/QlikView, basics of Big Data, basics of Cloud
• Programming skills in Python/R; solid understanding of database management systems
• Proficient SQL/HQL skills
• Good data visualization skills; proficient with Tableau/PowerBI/QlikView, etc.
• Basic understanding of predictive modelling

ML Engineer
• Python, machine learning algorithms, DL/NLP, Java, DBMS, cloud architecture, Big Data architectures, AWS/GCP/Azure
• Understanding of data structures, data modeling and software architecture
• Deep knowledge of math, probability, statistics and algorithms
• Ability to write robust code in Python, Java and R
• Familiarity with machine learning frameworks (like Keras or PyTorch) and libraries (like scikit-learn)

Business Analyst
• Excel, Visio, SQL, Tableau
• Domain understanding; requirement gathering and requirement elicitation
• Process excellence; user acceptance testing
• Documentation prowess; basic data analysis skills


Data Scientist

Wears many hats!

1. Data Acquisition and Preparation: Data sources, cleaning, preprocessing, integration, wrangling
2. Data Analysis: EDA, insights, patterns
3. Modeling: Statistical/hypothesis testing; building, testing, tuning and deploying ML models
4. Communication: Storytelling, visualization, knowing the audience
5. Collaboration: Working with stakeholders
6. Solutions: Practical and relevant


Data Science vs other domains

1. Interdisciplinary: Statistics, mathematics, computer science, programming, and domain-specific expertise
2. Focus: Data
3. Problem solving: Real-world challenges. Always new.
4. Evolution: Tools, techniques, algorithms
5. Lifelong learning: No crash course!
6. All industries
7. No defined scope
8. No single correct solution
9. The answer to many questions is "it depends!"


Challenges

1. Data: Acquisition, access, quality, volume
2. Technical: Tools, algorithms
3. Explainable AI: Interpretability and explainability
4. Communication: With stakeholders
5. Privacy and security
6. Continuous learning


CRISP-DM

CRoss-Industry Standard Process for Data Mining

Why Should There be a Standard Process?
The data mining process must be reliable
and repeatable by people with little data
mining background.

Why Should There be a Standard Process?
Framework for recording experience
– Allows projects to be replicated

Aid to project planning and management


“Comfort factor” for new adopters
– Demonstrates maturity of Data Mining
– Reduces dependency on “stars”

CRISP-DM
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues
– As well as technical analysis
Framework for guidance
Experience base
– Templates for Analysis

CRISP-DM: Overview

Process model, for anyone
Provides a complete blueprint
Lifecycle: 6 phases
CRISP-DM: Phases
Business Understanding
Project objectives and requirements understanding, Data mining problem
definition
Data Understanding
Initial data collection and familiarization, Data quality problems identification
Data Preparation
Table, record and attribute selection, Data transformation and cleaning
Modeling
Modeling techniques selection and application, Parameters calibration
Evaluation
Business objectives & issues achievement evaluation
Deployment
Result model deployment, Repeatable data mining process implementation

Phases and Tasks

Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
Evaluation: Evaluate Results; Review Process; Determine Next Steps
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project
Phase 1. Business Understanding

Statement of Business Objective
Statement of Data Mining Objective
Statement of Success Criteria

Focuses on understanding the project objectives and requirements from a
business perspective, then converting this knowledge into a data mining
problem definition and a preliminary plan designed to achieve the objectives.
Phase 1. Business Understanding
Determine business objectives
- thoroughly understand, from a business perspective, what the client
really wants to accomplish
- uncover important factors, at the beginning, that can influence the
outcome of the project
- neglecting this step means expending a great deal of effort producing the
right answers to the wrong questions

Assess situation
- more detailed fact-finding about all of the resources, constraints,
assumptions and other factors that should be considered
- flesh out the details

Phase 1. Business Understanding
Determine data mining goals
- a business goal states objectives in business terminology
- a data mining goal states project objectives in technical terms
ex) the business goal: “Increase catalog sales to existing customers.”
a data mining goal: “Predict how many widgets a customer will buy,
given their purchases over the past three years,
demographic information (age, salary, city) and
the price of the item.”
Produce project plan
- describe the intended plan for achieving the data mining goals and the
business goals
- the plan should specify the anticipated set of steps to be performed
during the rest of the project including an initial selection of tools and
techniques

Phase 2. Data Understanding

• Explore the Data
• Verify the Quality
• Find Outliers

Starts with an initial data collection and proceeds with activities in order to
get familiar with the data, to identify data quality problems, to discover first
insights into the data or to detect interesting subsets to form hypotheses
for hidden information.
Phase 2. Data Understanding
Collect initial data
- acquire within the project the data listed in the project resources
- includes data loading if necessary for data understanding
- possibly leads to initial data preparation steps
- if acquiring multiple data sources, integration is an additional issue,
either here or in the later data preparation phase

Describe data
- examine the “gross” or “surface” properties of the acquired data
- report on the results

Phase 2. Data Understanding
Explore data
- tackles the data mining questions, which can be addressed using
querying, visualization and reporting including:
distribution of key attributes, results of simple aggregations
relations between pairs or small numbers of attributes
properties of significant sub-populations, simple statistical analyses
- may address directly the data mining goals
- may contribute to or refine the data description and quality reports
- may feed into the transformation and other data preparation needed
Verify data quality
- examine the quality of the data, addressing questions such as:
"Is the data complete?", "Are there missing values in the data?"
Phase 3. Data Preparation
Usually takes over 90% of the project time
- Collection
- Assessment
- Consolidation and Cleaning
- Data selection
- Transformations

Covers all activities to construct the final dataset from the initial raw
data. Data preparation tasks are likely to be performed multiple times
and not in any prescribed order. Tasks include table, record and
attribute selection as well as transformation and cleaning of data for
modeling tools.

Phase 3. Data Preparation
Select data
- decide on the data to be used for analysis
- criteria include relevance to the data mining goals, quality and technical
constraints such as limits on data volume or data types
- covers selection of attributes as well as selection of records in a table

Clean data
- raise the data quality to the level required by the selected analysis
techniques
- may involve selection of clean subsets of the data, the insertion of
suitable defaults or more ambitious techniques such as the estimation
of missing data by modeling

Phase 3. Data Preparation
Construct data
- constructive data preparation operations such as the production of
derived attributes, entire new records or transformed values for
existing attributes
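The derived-attribute idea can be sketched in a few lines of Python; the order records and the `line_total`/`unit_margin` attribute names are invented purely for illustration:

```python
# A minimal sketch of "construct data": deriving new attributes
# from existing ones. Records and field names are hypothetical.
orders = [
    {"qty": 3, "unit_price": 20.0, "unit_cost": 12.0},
    {"qty": 1, "unit_price": 50.0, "unit_cost": 35.0},
]

for row in orders:
    # Derived attributes: values computed from existing fields
    row["line_total"] = row["qty"] * row["unit_price"]
    row["unit_margin"] = row["unit_price"] - row["unit_cost"]

print(orders[0]["line_total"])   # 60.0
print(orders[1]["unit_margin"])  # 15.0
```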

Integrate data
- methods whereby information is combined from multiple tables or
records to create new records or values

Format data
- formatting transformations refer to primarily syntactic modifications
made to the data that do not change its meaning, but might be required
by the modeling tool

Phase 4. Modeling
Select the modeling technique
(based upon the data mining objective)
Build model
(Parameter settings)
Assess model (rank the models)
Various modeling techniques are selected and applied and
their parameters are calibrated to optimal values. Some
techniques have specific requirements on the form of data.
Therefore, stepping back to the data preparation phase is
often necessary.

Phase 4. Modeling
Select modeling technique
- select the actual modeling technique that is to be used
ex) decision tree, neural network
- if multiple techniques are applied, perform this task separately for each
technique

Generate test design


- before actually building a model, generate a procedure or mechanism to
test the model’s quality and validity
ex) In classification, it is common to use error rates as quality measures
for data mining models. Therefore, typically separate the dataset into
train and test set, build the model on the train set and estimate its
quality on the separate test set
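The train/test separation described in the example can be sketched with the standard library alone; the 80/20 split ratio and the toy 100-record dataset are assumptions for illustration:

```python
import random

# A minimal sketch of a hold-out test design: shuffle the data,
# then split it into a training portion and a separate test portion.
random.seed(42)  # fixed seed so the split is reproducible

dataset = list(range(100))       # stand-in for 100 labelled records
random.shuffle(dataset)

split = int(0.8 * len(dataset))  # 80% train, 20% test (assumed ratio)
train_set, test_set = dataset[:split], dataset[split:]

print(len(train_set), len(test_set))  # 80 20
# The model is built on train_set only; its error rate is then
# estimated on test_set, which the model has never seen.
```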

Phase 4. Modeling
Build model
- run the modeling tool on the prepared dataset to create one or more
models

Assess model
- the analyst interprets the models according to their domain knowledge, the
data mining success criteria and the desired test design
- judges the success of the application of modeling and discovery
techniques more technically
- contacts business analysts and domain experts later in order to
discuss the data mining results in the business context
- considers only the models, whereas the evaluation phase also takes into
account all other results that were produced in the course of the
project
Phase 5. Evaluation
Evaluation of model
- how well it performed on test data
Methods and criteria
- depend on model type
Interpretation of model
- important or not, easy or hard depends on algorithm
Thoroughly evaluate the model and review the steps executed to
construct the model to be certain it properly achieves the business
objectives. A key objective is to determine if there is some important
business issue that has not been sufficiently considered. At the end
of this phase, a decision on the use of the data mining results should
be reached

Phase 5. Evaluation
Evaluate results
- assesses the degree to which the model meets the business
objectives
- seeks to determine if there is some business reason why this
model is deficient
- test the model(s) on test applications in the real application if
time and budget constraints permit
- also assesses other data mining results generated
- unveil additional challenges, information or hints for future
directions

Phase 5. Evaluation
Review process
- do a more thorough review of the data mining engagement in order to
determine if there is any important factor or task that has somehow
been overlooked
- review the quality assurance issues
ex) “Did we correctly build the model?”

Determine next steps


- decides how to proceed at this stage
- decides whether to finish the project and move on to deployment if
appropriate or whether to initiate further iterations or set up new data
mining projects
- include analyses of remaining resources and budget that influences the
decisions

Phase 6. Deployment
Determine how the results need to be utilized
Who needs to use them?
How often do they need to be used?
Deploy data mining results by
scoring a database, utilizing results as business rules,
interactive on-line scoring

The knowledge gained will need to be organized and presented in a way that
the customer can use it. However, depending on the requirements, the
deployment phase can be as simple as generating a report or as complex as
implementing a repeatable data mining process across the enterprise.
Phase 6. Deployment
Plan deployment
- in order to deploy the data mining result(s) into the business, takes the
evaluation results and concludes a strategy for deployment
- document the procedure for later deployment

Plan monitoring and maintenance
- important if the data mining results become part of the day-to-day
business and its environment
- helps to avoid unnecessarily long periods of incorrect usage of data
mining results
- needs a detailed monitoring plan
- takes into account the specific type of deployment
Phase 6. Deployment
Produce final report
- the project leader and the team write up a final report
- may be only a summary of the project and its experiences
- may be a final and comprehensive presentation of the data mining
result(s)

Review project
- assess what went right and what went wrong, what was done well and
what needs to be improved

26
BITS Pilani, Pilani Campus
Summary
Why CRISP-DM?
The data mining process must be reliable and repeatable
by people with little data mining background

CRISP-DM provides a uniform framework for
- guidelines
- experience documentation

CRISP-DM is flexible to account for differences
- Different business/agency problems
- Different data
Data Scientist’s Toolbox

1. Data Collection
• Hadoop Ecosystem (HDFS, Hive, Pig)
2. Data Preparation
• SQL
• Python and Python libraries - pandas
3. EDA
• Excel
• RStudio
• Power BI
• Tableau
• Python libraries – matplotlib, pandas, seaborn



Data Scientist’s Toolbox

4. Statistical Analysis
• RStudio
• Matlab
• SAS
• SPSS
5. Model building
• Jupyter Notebook
• Python libraries – NumPy, SciPy, scikit-learn
• TensorFlow
• PyTorch
• AWS/Azure/GCP


Types of Data

• Qualitative (Categorical): Nominal, Ordinal
• Quantitative (Numeric): Discrete, Continuous


Categorical Data
• Characteristics or attributes
• Non-numeric. Cannot be computed

Nominal
• No specific order
• All categories are equal
• Cannot be measured
• Gender, colors, divisions

Ordinal
• Natural order
• Categories can be compared
• High-Medium-Low, First-Second-Third, etc.


Numeric Data
• Numbers
• Measurable or countable
• Calculations can be performed

Discrete
• Only certain values
• Typically, whole numbers
• Countable
• Runs, goals, marks

Continuous
• All possible values within a range
• Typically, with fractions and decimals
• Measurable
• Height, weight, temperature
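As a sketch of how these four types map onto code, assuming pandas (which the course toolbox lists) is available; the column names and values here are made up for illustration:

```python
import pandas as pd

# Toy data illustrating the four data types; all values are invented.
df = pd.DataFrame({
    "color":  ["red", "green", "red"],     # nominal: no order
    "rating": ["low", "high", "medium"],   # ordinal: natural order
    "goals":  [2, 0, 3],                   # discrete: countable
    "height": [1.72, 1.65, 1.80],          # continuous: measurable
})

# Nominal: unordered categories (cannot be ranked)
df["color"] = pd.Categorical(df["color"])

# Ordinal: ordered categories, so comparisons are meaningful
df["rating"] = pd.Categorical(
    df["rating"], categories=["low", "medium", "high"], ordered=True
)

print(df["rating"].min())            # low (only valid because ordered)
print(round(df["height"].mean(), 2)) # numeric columns support arithmetic
```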



Types of Datasets

• Set of data as a collection


• Structured, unstructured, semi-structured
• Used for a meaningful activity, say, analysis

Formats
• Tabular – rows and columns (xls, csv)
• Web data – JSON, xml
• Time series dataset
• Image dataset
• Bivariate
• Multivariate
Why Data Quality?

1. Better decisions
2. Correct analysis and insights
3. Better problem-solving
4. Reliable results
5. Less ambiguity
6. Customer experience
7. Compliance
8. Cost



Data Quality



Data Preprocessing

[Slide figure: preprocessing overview; data reduction via modeling, histograms, clustering and sampling]


Handling missing values

1. Delete the row
2. Drop the column
3. Impute by mean/median (numeric)
4. Impute by mode (categorical)
5. Use an algorithm
6. Forward and backward fill
7. Build a model and predict the appropriate value
8. Create a new value (missingness as a feature)
9. Use libraries
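Options 3, 4 and 6 above can be sketched with pandas; the DataFrame and its columns are hypothetical:

```python
import pandas as pd

# Toy data with missing values in numeric and categorical columns.
df = pd.DataFrame({
    "age":  [25.0, None, 31.0, 22.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
    "temp": [30.1, None, None, 29.5],
})

# 3. Impute a numeric column by its mean
df["age"] = df["age"].fillna(df["age"].mean())

# 4. Impute a categorical column by its mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 6. Forward fill (carry the last seen value down), then backward fill
df["temp"] = df["temp"].ffill().bfill()

print(int(df.isna().sum().sum()))  # 0 -> no missing values remain
```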


Impact

• You’d be hard pressed to find a business today that doesn’t use analytics in some shape or
form to inform business decisions and measure performance.
• The global big data market is projected to reach $401.2 billion in 2028, up from $220.2 billion
in 2023.
• It is not just large corporations investing - Research shows that nearly 70% of small
businesses spend more than $10,000 a year on analytics to help them better understand
their customers, markets and business processes.
• The overwhelming majority of executives say that their organisation has achieved successful
outcomes from Big Data and AI.
• Data can also have a big impact on your bottom line, with businesses that utilise big data
increasing their profits by an average of 8-10%.
• Netflix reportedly saves $1 billion every year by using data analytics to improve its customer
retention strategies.
• So, what methods of data analysis are businesses using to generate these impressive
results?

Descriptive, Predictive and Prescriptive analytics
• Business Analytics is the process by which businesses use statistical
methods and technologies to analyse data in order to gain insights and
improve their strategic decision-making.

• There are three types of analytics that businesses use to drive their decision
making: descriptive analytics, which tells us what has already happened;
predictive analytics, which shows us what could happen; and finally,
prescriptive analytics, which informs us what should happen in the future.

• Whilst each of these methods is useful when used individually, they
become especially powerful when used together.
Descriptive analytics

• Descriptive analytics is the analysis of historical data using two key methods,
data aggregation and data mining, which are used to uncover trends and
patterns.
• Descriptive analytics is not used to draw inferences or make predictions
about the future from its findings; rather, it is concerned with representing
what has happened in the past.


Descriptive analytics

• Descriptive analytics is often displayed using visual data representations like line, bar and
pie charts and, although it gives useful insights on its own, often acts as a foundation for
future analysis.
• Because descriptive analytics uses fairly simple analysis techniques, any findings should be
easy for the wider business audience to understand.
• For this reason, descriptive analytics forms the core of everyday reporting in many
businesses.
• Annual revenue reports are a classic example of descriptive analytics, along with other
reporting such as inventory, warehousing and sales data, which can be aggregated easily
and provide a clear snapshot of a company's operations.
• Another widely used example is social media and Google Analytics tools, which summarise
certain groupings based on simple counts of events like clicks and likes.
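Descriptive analytics as data aggregation can be sketched with the standard library; the regional sales records are invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Toy historical sales records: (region, amount). All values invented.
sales = [
    ("North", 120.0), ("South", 80.0), ("North", 200.0),
    ("South", 100.0), ("North", 40.0),
]

# Aggregate past data by region; no prediction, only summary.
by_region = defaultdict(list)
for region, amount in sales:
    by_region[region].append(amount)

# Simple aggregates: totals and averages per region
report = {r: {"total": sum(v), "avg": mean(v)} for r, v in by_region.items()}
print(report["North"]["total"])  # 360.0
print(report["South"]["avg"])    # 90.0
```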



Descriptive analytics

• Whilst descriptive data can be useful to quickly spot trends and patterns, the
analysis has its limitations.
• Viewed in isolation, descriptive analytics may not give the full picture. For
more insight, you need to delve deeper.



Predictive analytics

• Predictive analytics is a more advanced method of data analysis that uses
probabilities to make assessments of what could happen in the future.
• Like descriptive analytics, predictive analytics uses data mining; however, it
also uses statistical modelling and machine learning techniques to identify
the likelihood of future outcomes based on historical data.
• To make predictions, machine learning algorithms take existing data and
attempt to fill in the missing data with the best possible guesses.
• These predictions can then be used to solve problems and identify
opportunities for growth.
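The "best possible guess" idea can be sketched as a toy one-nearest-neighbour predictor; the customer features (income, outstanding debt) and labels are invented, and a real system would use a trained model from a library such as scikit-learn:

```python
from math import dist

# Hypothetical historical records: ((income, debt), outcome label).
history = [
    ((20.0, 15.0), "default"),
    ((60.0,  5.0), "repaid"),
    ((25.0, 12.0), "default"),
    ((70.0,  8.0), "repaid"),
]

def predict(features):
    # Guess the missing label from the most similar past customer
    _, label = min(history, key=lambda rec: dist(rec[0], features))
    return label

print(predict((65.0, 6.0)))   # repaid
print(predict((22.0, 14.0)))  # default
```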



Predictive analytics

Organisations are using predictive analytics to:
• Prevent fraud by looking for patterns in criminal behaviour
• Optimise their marketing campaigns by spotting opportunities for
cross-selling
• Reduce risk by using past behaviours to predict which customers are most
likely to default on payments


Predictive analytics

• Another branch of predictive analytics is deep learning, which mimics human
decision-making processes to make even more sophisticated predictions.
• For example, through multiple levels of social and environmental
analysis, deep learning is being used to more accurately predict credit scores.
• In the medical field, it is being used to sort digital medical images such as
MRI scans and X-rays to provide an automated prediction for doctors to use
in diagnosing patients.


Prescriptive analytics

• Whilst predictive analytics shows companies the raw results of their potential
actions, prescriptive analytics shows companies which option is the best.
• The field of prescriptive analytics borrows heavily from mathematics and
computer science, using a variety of statistical methods.
• Although closely related to both descriptive and predictive analytics,
prescriptive analytics emphasises actionable insights instead of data
monitoring.
• This is achieved through gathering data from a range of descriptive and
predictive sources and applying them to the decision-making process.
• Algorithms then create and re-create possible decision patterns that could
affect an organisation in different ways.



Prescriptive analytics

• What makes prescriptive analytics especially valuable is its ability to measure the
repercussions of a decision based on different future scenarios and then
recommend the best course of action to take to achieve a company's goals.
• The business benefit of using prescriptive analytics is huge.
• It enables teams to view the best course of action before making decisions, saving
time and money whilst achieving optimal results.
• Businesses that can harness the power of prescriptive analytics are using them in a
variety of ways –
• For example, prescriptive analytics allow healthcare decision-makers to optimise business
outcomes by recommending the best course of action for patients and providers.
• They also enable financial companies to know how much to reduce the cost of a product to attract
new customers whilst keeping profits high.




Supervised and Unsupervised Learning


Supervised learning

• Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
• Supervised learning is when we teach or train the machine using data that is well
labelled.
• That is, some data is already tagged with the correct answer.
• The machine is provided with a new set of examples(data) so that the supervised
learning algorithm analyses the training data(set of training examples) and produces
a correct outcome from labelled data.



Supervised learning

• For instance, suppose you are given a basket filled with different kinds of
fruits. Now the first step is to train the machine with all the different fruits one
by one like this:

• If the shape of the object is rounded and has a depression at the top, is red in
color, then it will be labeled as –Apple.
• If the shape of the object is a long curving cylinder having Green-Yellow
color, then it will be labeled as –Banana.



Supervised learning

• Now suppose that after training, you are given a new separate fruit, say a Banana from the basket, and asked to identify it.

• Since the machine has already learned from previous data, this time it has to use that knowledge wisely. It will first classify the fruit by its shape and color, then confirm the fruit name as BANANA and put it in the Banana category.
• Thus the machine learns the things from training data(basket containing fruits) and then
applies the knowledge to test data(new fruit).
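The fruit workflow above can be sketched as a tiny 1-nearest-neighbour classifier. The numeric features (roundness, redness) and the training values are invented purely for illustration; the slides do not specify a concrete encoding.

```python
# Labelled training data (the "basket"): (roundness 0-1, redness 0-1) -> label.
# Feature values here are invented for illustration.
training = [
    ((0.90, 0.90), "Apple"),   # round, red
    ((0.20, 0.10), "Banana"),  # long, green-yellow
    ((0.95, 0.85), "Apple"),
    ((0.15, 0.20), "Banana"),
]

def classify(features):
    """Return the label of the closest labelled training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda ex: dist(ex[0], features))[1]

print(classify((0.1, 0.15)))   # a long, greenish test fruit -> Banana
```

The "training" here is simply memorising the labelled examples; prediction applies that knowledge to the new, unseen fruit.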



Supervised learning

Supervised learning is classified into two categories of algorithms:

• Classification: A classification problem is when the output variable is a category, such as “Red” or “Blue”, “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.

• Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.



Supervised learning

Types:-

• Regression
• Logistic Regression
• Naive Bayes Classifiers
• K-NN (k nearest neighbors)
• Decision Trees
• Support Vector Machine



Supervised learning

Advantages:-

• Explicit Feedback: Supervised learning relies on labeled data, which provides explicit feedback on the
model’s predictions. This feedback is valuable for model training and improvement.
• Predictive Accuracy: Supervised learning models can achieve high predictive accuracy when trained
on high-quality, representative data. They are effective in tasks like classification and regression.
• Generalization: Well-trained supervised models can generalize their knowledge to make accurate
predictions on new, unseen data points, making them suitable for real-world applications.
• Interpretability: Some supervised learning algorithms, like linear regression and decision trees, provide
interpretable models that allow users to understand the relationships between input features and
predictions.
• Wide Range of Applications: Supervised learning can be applied to a wide range of domains,
including healthcare, finance, natural language processing, computer vision, and more.
• Availability of Tools and Libraries: There are numerous tools, libraries (e.g., scikit-learn, TensorFlow,
PyTorch), and resources available for implementing and experimenting with supervised learning
algorithms.
Supervised learning

Disadvantages:-
• Data Labeling Requirement: Supervised learning relies on labeled data, which can be expensive and time-consuming to obtain, especially for
large datasets.
• Limited to Labeled Data: The model can only make predictions on data similar to what it was trained on, limiting its ability to handle novel or
unexpected situations.
• Bias and Noise in Labels: If labeled data contains biases or errors, the model may learn and perpetuate those biases, leading to unfair or
inaccurate predictions.
• Overfitting: There’s a risk of overfitting, where the model learns the training data too well, capturing noise rather than the underlying patterns.
Regularization techniques are often required to mitigate this.
• Feature Engineering: Selecting and engineering relevant features is a crucial step in building effective supervised learning models. Poor
feature selection can lead to suboptimal performance.
• Scalability: Training complex models with large datasets can be computationally expensive and time-consuming, requiring substantial
computing resources.
• Limited to Labeled Data Distribution: Supervised models are constrained by the distribution of labeled data and may not perform well when
faced with data from a different distribution.
• Privacy Concerns: In some applications, the use of labeled data may raise privacy concerns, as it can reveal sensitive information about
individuals.
• Imbalanced Data: When dealing with imbalanced datasets (e.g., rare disease detection), supervised models may struggle to predict minority
classes accurately.
• Concept Drift: Over time, the relationship between input features and the target variable may change (concept drift). Supervised models may
require constant retraining to adapt to these changes.



Unsupervised learning

• Unsupervised learning is the training of a machine using information that is neither classified nor labelled; the algorithm is left to act on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities,
patterns, and differences without any prior training of data.
• Unlike supervised learning, no teacher is provided that means no training will be given to the
machine.
• Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.




Unsupervised learning
• For instance, suppose it is given an image having both dogs and cats which it has never seen.

• Thus the machine has no idea about the features of dogs and cats, so we can’t categorize them as ‘dogs’ and ‘cats’. But by grouping the images according to their similarities, patterns, and differences, we can easily divide the picture into two parts.
• The first may contain all pics having dogs in them and the second part may contain all pics
having cats in them. Here we didn’t have any prior information about the categories. Which means no
training data or examples or labels.
• It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabelled data.
Unsupervised learning

Types of Unsupervised Learning:-

• Hierarchical clustering
• K-means clustering
• Principal Component Analysis
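As a sketch of the clustering idea behind methods like k-means (not code from the slides), a bare-bones 1-D two-cluster k-means can be written without any libraries. The function name `kmeans_1d`, the starting centroids, and the toy data are all invented for illustration.

```python
def kmeans_1d(points, c1, c2, iters=10):
    """Cluster 1-D points around two centroids, starting from guesses c1 and c2.
    Each iteration assigns points to the nearer centroid, then recomputes
    each centroid as the mean of its group."""
    for _ in range(iters):
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return sorted([c1, c2])

data = [1.0, 1.2, 0.8, 10.0, 10.4, 9.6]   # two obvious unlabelled groups
print(kmeans_1d(data, 0.0, 5.0))           # centroids settle near 1.0 and 10.0
```

Note that no labels are used anywhere: the algorithm discovers the two groups purely from the similarities among the values.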



Unsupervised learning
Advantages

• Discovery of Hidden Patterns: Unsupervised learning algorithms can uncover hidden patterns and structures
within data that may not be apparent through manual inspection. This can lead to valuable insights and a deeper
understanding of the underlying data.
• Data Exploration: Unsupervised learning is a valuable tool for exploratory data analysis. It allows you to visualize
and summarize complex datasets, helping you identify trends, outliers, and potential areas of interest.
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-SNE can reduce the
dimensionality of high-dimensional data while preserving important information. This can lead to more efficient
modeling and visualization.
• Clustering: Unsupervised learning can be used for clustering, which is the process of grouping similar data points
together. Clustering can aid in tasks like customer segmentation, image segmentation, and anomaly detection.
• Feature Engineering: Unsupervised learning can generate new features or representations of data that can be
used in subsequent supervised learning tasks. For example, word embeddings learned through unsupervised
techniques can improve natural language processing models.
• Anomaly Detection: Unsupervised learning can be used to detect unusual patterns or outliers in data, which is
important for applications like fraud detection and quality control.
• Reduced Labeling Costs: In some cases, labeling data for supervised learning can be expensive or time-
consuming. Unsupervised learning can help reduce the need for labeled data by generating insights or features that
enhance the performance of supervised models.



Unsupervised learning
Disadvantages

• Lack of Ground Truth: One of the main disadvantages of unsupervised learning is the absence of a ground truth or
labeled data to evaluate the quality of results. This makes it challenging to assess the accuracy of clustering or
pattern discovery.
• Subjectivity: Unsupervised learning results can be subjective and depend on the choice of algorithm,
hyperparameters, and preprocessing steps. Different approaches may lead to different outcomes, and there is no
one “correct” solution.
• Interpretability: Some unsupervised learning models, especially deep learning models, can be difficult to interpret.
Understanding why a model made a particular decision or identified certain patterns can be challenging.
• Computationally Intensive: Some unsupervised learning algorithms can be computationally intensive, especially
when dealing with large datasets or high-dimensional data. This can require significant computational resources.
• Curse of Dimensionality: High-dimensional data can pose challenges for unsupervised learning, as the distance
metrics used in many algorithms become less effective in high-dimensional spaces. Dimensionality reduction
techniques are often needed to address this issue.
• Difficulty in Evaluation: Evaluating the quality of clustering or pattern discovery can be challenging, as there may
be no clear ground truth to compare against. Evaluation metrics for unsupervised learning are often application-
specific.



Programming With Python
Introduction

• Python is a general purpose, high-level interpreted programming language with easy syntax and dynamic semantics

• Open source and community driven

• Created by Guido Van Rossum in 1989


2019 Annual Survey by Analytics India
What Language Should An Aspiring Data Analyst Learn First?

https://analyticsindiamag.com/data-science-recruitment-india-2019-survey/
Companies using Python
Python Distribution
Variables
• temporary storage space
Variables
• But, if the data is of a different size and type?
Data Types
• Every variable is associated with a data type
Assigning values to a variable
• The assignment (=) operator
Class and Object

• Basic building block of Python

• Collection of data (variables) and methods (functions)

• Class is the prototype and object is the actual thing
Class and Object
www.w3schools.com/python/
Introduction to Python
Suppose we want to print “Welcome to the world of programming” on our screen.
print("Welcome to the world of programming")

Python is an object oriented programming language. Unlike procedure oriented programming, where the main emphasis is on functions, object oriented programming stresses objects.

Object is simply a collection of data (variables) and methods (functions) that act on those data.
And, class is a blueprint for the object.

We can think of class as a sketch (prototype) of a house. It contains all the details about the
floors, doors, windows etc. Based on these descriptions we build the house. House is the object.

As, many houses can be made from a description, we can create many objects from a class. An
object is also called an instance of a class and the process of creating this object is called
instantiation.

Variables in Python:
You can consider a variable to be a temporary storage space where you can keep changing
values. Let’s take this example to understand variables:

So, let’s say, we have this cart and initially we store an apple in it.
After a while, we take out this apple and replace it with a banana.

Again, after some time, we replace this banana with a mango.

So, here this cart acts like a variable, where the values stored in it keep on changing.

Now, that we have understood what a variable is, let’s go ahead and see how can we assign
values to a variable in python.
Assigning values to a variable:

To assign values to a variable in Python, we will use the assignment (=) operator.
Here, initially, we have stored a numeric value -> 10 in the variable ‘a’. After a while, we
have stored a string value -> “sparta” in the same variable. And then, we have stored the
logical value True.

Now, let’s implement the same thing in Spyder and look at the result:

Assigning a value 10 to a:

Allocating “sparta” to a:

Assigning True to a:
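The Spyder screenshots from the original slides are not reproduced here; the same three assignments in plain Python, with `type()` showing how the variable's type changes, are:

```python
a = 10
print(a, type(a))        # 10 <class 'int'>

a = "sparta"
print(a, type(a))        # sparta <class 'str'>

a = True
print(a, type(a))        # True <class 'bool'>
```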

Going ahead in this Python tutorial, we will learn about data types in Python.

Data Types in Python


Every variable is associated with a data type and these are the different types of data types
available in python:
Now, let’s understand these individual data types by their implementation.

Numbers in Python
Numbers in python could be integers, floating point numbers or complex numbers.

Let’s start off with an example on integer:

Here, we have assigned the value 100 to num1 and when we check the type of the variable,
we see that it is an integer.

Next, we have an example on floating-point number:

This time, we have assigned the value 13.4 to num2 and checking the type of the variable,
tells us that it is float.

Finally, let’s look at an example of a complex number:

Here, we have assigned the value 10-10j to num3. Now 10-10j comprises two parts-> the
real part and the imaginary part and combining these two gives us the complex number.
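The three examples described above (the values 100, 13.4 and 10-10j come from the text) can be run directly:

```python
num1 = 100
print(type(num1))            # <class 'int'>

num2 = 13.4
print(type(num2))            # <class 'float'>

num3 = 10 - 10j
print(type(num3))            # <class 'complex'>
print(num3.real, num3.imag)  # real part 10.0, imaginary part -10.0
```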

Now, let’s start with Python Strings.

Python Strings
Anything written in single or double quotes is treated as a string in Python.
Now, let’s see how can we extract individual characters from a string.

So, I’d want to extract the first two characters from ‘str1’ which I have created above:

Now, similarly, let’s extract the last two characters from str1:
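The actual value of `str1` from the slides is not shown, so "Welcome" below is an assumed example; slicing extracts the first and last two characters:

```python
str1 = "Welcome"
print(str1[0:2])    # first two characters -> We
print(str1[-2:])    # last two characters  -> me
```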

Now, let’s head onto tuples in Python:

Python Tuples
A Python tuple is a collection of immutable Python objects enclosed within parentheses ().
Elements in a tuple could be of the same data type or of the different data types.

Let’s create a tuple where elements are of the same data type:

Now, let’s access the first element from this tuple:

Extracting the last element from this tuple:
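The tuple contents below are invented examples (the original screenshots are not shown), illustrating both same-type and mixed-type tuples and element access:

```python
tup1 = (1, 2, 3, 4, 5)      # elements of the same data type
tup2 = (1, "a", True)       # elements of different data types

print(tup1[0])              # first element -> 1
print(tup1[-1])             # last element  -> 5
```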

Now, we will go ahead and learn about Python lists.

Python Lists
A Python list is an ordered collection of elements.

It can contain elements of different data types, unlike arrays.


Now, let’s create a list with different data types:

Now, let’s do some operation on the list we created:


Fetching the first element from the list:

Adding an element while removing the other:

The below line of code will return the length of the list:

This will return the list in reversed order.
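The list operations listed above, with invented sample values (the original screenshots are not shown):

```python
lst = [10, "sparta", 22.5]   # elements of different data types

print(lst[0])            # fetch the first element -> 10
lst.append(99)           # add an element at the end
lst.remove("sparta")     # remove an element by value
print(len(lst))          # length of the list -> 3
lst.reverse()            # reverse the list in place
print(lst)               # [99, 22.5, 10]
```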

Now, we will further look at Python Sets.

Python Sets
Python sets are a collection of unordered and unindexed items.

Every element in a set is unique, and a set does not contain duplicate values.


Sets can be used to perform mathematical calculations such as union, intersection, and
differences.

Creating a set:

Here, in the set ‘Age’, the value 22 appears twice. Since every element in a set is unique, the duplicate value will be removed.
Operations on Sets:

1.add: This method adds an element to the set if it is not present in it.

2.union: It returns the union of two sets.

3.intersection: This method returns the intersection of two sets.


4.difference: The difference of two sets(set1, set2) will return the elements which are present
only in set1.
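The four operations described above, using invented sample sets (including an ‘Age’ set with a duplicate, mirroring the example in the text):

```python
Age = {21, 22, 23, 22}          # the duplicate 22 is dropped automatically
Age.add(24)                      # add: inserts 24 since it is not present
print(Age)                       # {21, 22, 23, 24} (display order may vary)

s1, s2 = {1, 2, 3}, {3, 4, 5}
print(s1.union(s2))              # {1, 2, 3, 4, 5}
print(s1.intersection(s2))       # {3}
print(s1.difference(s2))         # elements only in s1 -> {1, 2}
```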

Now, we will look at Python Dictionary.

Python Dictionary
A Python dictionary is an unordered collection of data. The data in the dictionary is stored as key:value pairs, where the key must be immutable and the value can be of any type.

Creating a Dictionary:

Accessing elements from a dictionary:

Removing elements from a dictionary:

Replacing elements in a dictionary:
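The create/access/remove/replace steps listed above, with invented keys and values (the original screenshots are not shown):

```python
dic = {"name": "Arjun", "age": 25}   # creating a dictionary

print(dic["name"])          # accessing a value by key -> Arjun
del dic["age"]              # removing a key:value pair
dic["name"] = "Rahul"       # replacing a value for an existing key
print(dic)                  # {'name': 'Rahul'}
```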


Get() method

Conventional method to access a value for a key:


dic = {"A":1, "B":2}
print(dic["A"])
print(dic["C"])

The problem that arises here is that the 3rd line of the code returns a key error :

Traceback (most recent call last):


File ".\dic.py", line 3, in
print (dic["C"])
KeyError: 'C'

The get() method is used to avoid such situations. This method returns the value for the given
key, if present in the dictionary. If not, then it will return None (if get() is used with only one
argument).

Syntax :

Dict.get(key, default=None)

Example:

dic = {"A":1, "B":2}


print(dic.get("A"))
print(dic.get("C"))
print(dic.get("C","Not Found ! "))

Output:

1
None
Not Found !

Conditional Statements
We use a conditional statement to run a single statement or a block of statements when certain conditions are satisfied. If a condition is true, the code executes; otherwise, control passes to the next conditional statement.
There are three types of conditional statements as illustrated in the above example:

1. If statement: Firstly, “if” condition is checked and if it is true the statements under “if”
statements will be executed. If it is false, then the control will be passed on to the next
conditional statements.
2. Elif statement: If the previous condition is false, either it could be “if” condition or “elif”
after “if”, then the control is passed on to the “elif” statements. If it is true then the
statements after the “elif” condition will execute. There can be more than one “elif”
statement.
3. Else statement: When “if” and “elif” conditions are false, then the control is passed on to the
“else” statement and it will execute.
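The three statement types can be combined as follows; `marks` and the grade thresholds are invented example values:

```python
marks = 72   # an invented example value

if marks >= 75:            # checked first
    grade = "Distinction"
elif marks >= 40:          # checked only if the "if" condition was false
    grade = "Pass"
else:                      # runs when both conditions above are false
    grade = "Fail"

print(grade)   # Pass
```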

Now, let’s go ahead and learn about loops in this python tutorial.

Loops
If we have a block of code then statements in it will be executed sequentially. But, when we
want a statement or a set of statements to be executed multiple times then we use loops.

Types of loops:

1. While loop: We use this loop when we want a statement or a set of statements to execute as long as the Boolean condition associated with it holds.

In the while loop, the number of iterations depends on the condition which is applied to the
while loop.
2. For loop: Here, unlike the while loop, the number of iterations is known in advance. The for loop is also used to execute a statement or a set of statements multiple times.

3.nested loop: This type of loop consists of a loop inside a loop. It can be for loop or can be
a combination of for and while loop.
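One small example of each loop type described above (all values invented for illustration):

```python
# while loop: repeats as long as the condition holds
i, total = 1, 0
while i <= 5:
    total += i
    i += 1
print(total)          # 1+2+3+4+5 = 15

# for loop: a known number of iterations
squares = []
for n in range(1, 4):
    squares.append(n * n)
print(squares)        # [1, 4, 9]

# nested loop: a for loop inside a for loop
pairs = []
for a in range(2):
    for b in range(2):
        pairs.append((a, b))
print(pairs)          # [(0, 0), (0, 1), (1, 0), (1, 1)]
```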

Now, we will learn about user-defined functions in this python tutorial.

User-Defined Function
In any programming language, functions are a systematic way of writing reusable code. Functions give us the liberty to use the code inside them whenever needed, just by calling the function by its name.
Syntax: def function():
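A minimal user-defined function and a call to it (the function name and argument are invented examples):

```python
def greet(name):
    """A simple user-defined function that builds a greeting."""
    return "Hello, " + name

print(greet("BITS"))   # Hello, BITS
```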

Problem#1 Write a Python program to print out all even numbers from a given numbers list in the same order, and stop printing once the number 237 is encountered.

Use this numbers list:

numbers = [

386, 462, 47, 418, 907, 344, 236, 375, 823, 566, 597, 978, 328, 615, 953, 345,

399, 162, 758, 219, 918, 237, 412, 566, 826, 248, 866, 950, 626, 949, 687, 217,

815, 67, 104, 58, 512, 24, 892, 894, 767, 553, 81, 379, 843, 831, 445, 742, 717,

958,743, 527

]
Solution:-

Python Code:

numbers = [ 386, 462, 47, 418, 907, 344, 236, 375, 823, 566, 597, 978, 328,
615, 953, 345, 399, 162, 758, 219, 918, 237, 412, 566, 826, 248, 866, 950,
626, 949, 687, 217, 815, 67, 104, 58, 512, 24, 892, 894, 767, 553, 81, 379,
843, 831, 445, 742, 717, 958,743, 527 ]

for x in numbers:
    if x == 237:
        print(x)
        break
    elif x % 2 == 0:
        print(x)

Problem#2 Now, for the same list of numbers and for the same conditions save the chosen
numbers in a separate list (instead of printing). Print the mean, median and mode of the
numbers in the list without importing any package.
Solution: mmm.py
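The referenced solution file mmm.py is not included here. A minimal sketch, assuming that 237 itself is also saved (mirroring Problem#1, which prints it before stopping), and using no imported packages as the problem requires:

```python
numbers = [386, 462, 47, 418, 907, 344, 236, 375, 823, 566, 597, 978, 328,
           615, 953, 345, 399, 162, 758, 219, 918, 237, 412, 566, 826, 248,
           866, 950, 626, 949, 687, 217, 815, 67, 104, 58, 512, 24, 892, 894,
           767, 553, 81, 379, 843, 831, 445, 742, 717, 958, 743, 527]

chosen = []
for x in numbers:
    if x == 237:
        chosen.append(x)   # interpretation: 237 is saved, then we stop
        break
    elif x % 2 == 0:
        chosen.append(x)

mean = sum(chosen) / len(chosen)

s = sorted(chosen)
n = len(s)
median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# with no repeated values every element is a mode; max() returns the first seen
counts = {v: chosen.count(v) for v in chosen}
mode = max(counts, key=counts.get)

print(mean, median, mode)
```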

Import Package
Write a Python program to compute the distance between the points (x1, y1) and (x2, y2).

Pictorial Presentation:

Sample Solution:-

Python Code:

import math
p1 = [4, 0]
p2 = [6, 6]
distance = math.sqrt( ((p1[0]-p2[0])**2)+((p1[1]-p2[1])**2) )

print(distance)
Output:

6.324555320336759

Problem#3 Mean, Median and Mode using Numpy and Scipy

import numpy as np
from scipy import stats

dataset= [1,1,2,3,4,6,18]

#mean value
mean= np.mean(dataset)
#median value
median = np.median(dataset)
#mode value
mode= stats.mode(dataset)

print("Mean: ", mean)


print("Median: ", median)
print("Mode: ", mode)

Output:
Mean: 5.0
Median: 3.0
Mode: ModeResult(mode=array([1]), count=array([2]))

Problem#4 Use Numpy and Scipy in Problem#2


Solution: mmm.py
Problem#5 Following is the input NumPy array of marks of three students for Eng, Maths and Accounts. Save the data in a NumPy array and print it. Now, suppose there is a retest for the subject Maths. Delete the corresponding column and insert the given new retest marks in its place.

English Maths Accounts

[75,25,73]

[82,22,86]

[53,31,66]

Retest Marks: [54, 67, 61]

Solution: marksP5.py
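The referenced marksP5.py is not included here; one possible sketch using np.delete and np.insert is:

```python
import numpy as np

# Marks for English, Maths, Accounts (one row per student), from the problem
marks = np.array([[75, 25, 73],
                  [82, 22, 86],
                  [53, 31, 66]])
print(marks)

retest = [54, 67, 61]                         # new Maths marks
marks = np.delete(marks, 1, axis=1)           # drop the Maths column (index 1)
marks = np.insert(marks, 1, retest, axis=1)   # insert retest marks in its place
print(marks)
```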
ZG 536
Foundations of Data Science
Linear Regression
What is Machine Learning?

According to Tom M. Mitchell, Chair of Machine Learning at Carnegie Mellon University and
author of the book Machine Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with the experience
E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to performance increases (to
satisfy the definition)

Many data mining tasks are executed successfully with the help of machine learning



Types of Machine Learning



History

This all started in the 1800s with a guy named Francis Galton. Galton was studying the relationship between parents and their children. In particular, he investigated the relationship between the heights of fathers and their sons.

[Figure: Data Science Venn diagram spanning Machine Learning, Math & Statistics, Software, Research and Domain Knowledge]
History

What he discovered was that a man's son tended to be roughly as tall as his father. However, Galton's breakthrough was that the son's height tended to be closer to the overall average height of all people.


Example

Let's take Shaquille O'Neal as an example. Shaq is really tall: 7 ft 1 in (2.2 meters).

If Shaq has a son, chances are he'll be pretty tall too. However, Shaq is such an anomaly that there is also a very good chance that his son will not be as tall as Shaq.
Example

Turns out this is the case: Shaq's son is pretty tall (6 ft 7 in), but not nearly as tall as his dad.

Galton called this phenomenon regression, as in "A father's son's height tends to regress (or drift towards) the mean (average) height."


Example

Let's take the simplest possible example: calculating a regression with only 2 data points.


Example

All we're trying to do when we calculate our regression line is draw a line that's as close to every dot as possible.

For classic linear regression, or the "Least Squares Method", you only measure the closeness in the "up and down" direction.


Example

Now wouldn't it be great if we could apply this same concept to a graph with more than just two data points?

By doing this, we could take multiple men and their sons' heights and do things like tell a man how tall we expect his son to be... before he even has a son!
Example

Our goal with linear regression is to minimize the vertical distance between all the data points and our line.

So in determining the best line, we are attempting to minimize the distance between all the points and our line.
Example

There are lots of different ways to minimize this (sum of squared errors, sum of absolute errors, etc.), but all these methods share the general goal of minimizing this distance.


Example

For example, one of the most popular methods is the least squares method.

Here we have blue data points along an x and y axis.


Example

Now we want to fit a linear regression line.

The question is, how do we decide which line is the best fitting one?


Example

We’ll use the Least Squares Method, which is fitted by minimizing the sum of squares of the residuals.

The residual for an observation is the difference between the observation (the y-value) and the fitted line.


A Linear Model

The linear model is an example of a parametric model:

f(X) = β0 + β1X1 + β2X2 + . . . + βpXp

• A linear model is specified in terms of p + 1 parameters: β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).

• Simple Linear Regression: Only one x variable
• Multiple Linear Regression: Many x variables


Types of Regression Models
Simple Regression

(Education) x y (Income)

Multiple Regression

(Education) x1

(Soft Skills) x2 y (Income)


(Experience) x3

(Age) x4



Direct Solution Method
Least Squares Method (Ordinary Least Squares or OLS)

• Slope for the Estimated Regression Equation:

      b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

• y-Intercept for the Estimated Regression Equation:

      b0 = ȳ − b1 x̄

where:
  xi = value of independent variable for ith observation
  yi = value of dependent variable for ith observation
  x̄ = mean value for independent variable
  ȳ = mean value for dependent variable



Exercise
Kumar’s Electronics periodically has a special week-long sale. As part of the advertising
campaign Kumar runs one or more TV commercials during the weekend preceding the sale.
Data from a sample of 5 previous sales are shown below.

# of TV Ads # of Cars Sold


(x) (y)
1 14
3 24
2 18
1 17
3 27



Solution
# of TV Ads (x)   # of Cars Sold (y)   xi − x̄   yi − ȳ   (xi − x̄)(yi − ȳ)   (xi − x̄)²
       1                  14             -1        -6              6              1
       3                  24              1         4              4              1
       2                  18              0        -2              0              0
       1                  17             -1        -3              3              1
       3                  27              1         7              7              1
Sum   10                 100              0         0             20              4
Mean   2                  20

• Slope for the Estimated Regression Equation: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20/4 = 5

• y-Intercept for the Estimated Regression Equation: b0 = ȳ − b1 x̄ = 20 − (5)(2) = 10

• Estimated Regression Equation: ŷ = b0 + b1x = 10 + 5x
• Predict Sales if Ads run = 5? 15?
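The hand computation above can be checked with NumPy (a sketch, not part of the original slides):

```python
import numpy as np

# The Kumar's Electronics data from the exercise
x = np.array([1, 3, 2, 1, 3])    # number of TV ads
y = np.array([14, 24, 18, 17, 27])  # number of cars sold

# OLS slope and intercept, matching the formulas on the previous slide
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)          # 10.0 5.0
print(b0 + b1 * 5)     # predicted sales if 5 ads run -> 35.0
```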
Regression Assumptions

1. E(ε) = 0
2. The model adequately captures the relationship
3. Var(ε) = σ2 for all values of the independent variables (Homoscedasticity)
4. ε is normally distributed
5. The values of ε are independent (No Serial Correlation or Autocorrelation)
6. There is no (or little) multicollinearity among the independent variables



Multicollinearity and VIF
• X1 and X2 are significant when included separately, but together the effect of both variables shrinks. Multicollinearity exists when there is correlation between multiple independent variables in a multiple regression model. This can adversely affect the regression results.
• Multicollinearity does not reduce the explanatory power of the model; it does reduce the statistical
significance of the independent variables.
• Test for Multicollinearity: Variance Inflation Factor

• VIF equal to 1 = variables are not correlated
• VIF between 1 and 5 = variables are moderately correlated
• VIF greater than 5 = variables are highly correlated

Solutions to multicollinearity
1. Drop unnecessary variables
2. Advanced techniques: Ridge / Lasso / Stepwise / Principal Components Regression
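A plain-NumPy sketch of the VIF check: for each predictor, regress it on the others and compute VIF = 1/(1 − R²). The `vif` helper and the simulated collinear data below are illustrative, not from the slides.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of an n x p matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                          # independent predictor
print(vif(np.column_stack([x1, x2, x3])))          # x1, x2 high; x3 near 1
```

Libraries such as statsmodels also ship a VIF utility; this sketch just makes the computation explicit.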



Homoscedasticity Vs Heteroscedasticity

• Are the residuals spread equally along the ranges of predictors?
• The plot should have a horizontal line with equally spread points.

In the second plot, this is not the case.


• The variability (variances) of the residual
points increases with the value of the
fitted outcome variable, suggesting non-
constant variances in the residuals errors
(or heteroscedasticity)



Variables

Sales is the Dependent Variable
• Also known as the Response or Target
• Generically referred to as Y

TV, Radio and Paper are the independent variables
• Also known as features, or inputs, or predictors
• Generically referred to as X (or X1, X2, X3)

#      TV     Radio   Paper   Sales
1     230.1   37.8    69.2    22.1
2      44.5   39.3    45.1    10.4
3      17.2   45.9    69.3     9.3
4     151.5   41.3    58.5    18.5
5     180.8   10.8    58.4    12.9
6       8.7   48.9    75.0     7.2


Matrix X and Vector y

The Advertising data set (shown on the previous slide) has 4 variables and 6 observations.

The variable names are “TV”, “Radio”, “Paper” and “Sales”

p = 3 (the number of independent variables)
n = 6 (the number of observations)

X represents the input data set; X is a 6 * 3 matrix

y represents the output variable; y is a 6 * 1 vector


Matrix X and Vector y

X is a 6 * 3 matrix or X6*3 & y is a 6 * 1 vector or y6*1

xi represents the ith observation. xi is a vector represented as (xi1 xi2 ... xip)

xj represents the jth variable. xj is a vector represented as (x1j x2j ... xnj)

yi represents the ith observation of the output variable. y is the vector (y1 y2 ... yn)


A Linear Model

Y = β0 + β1X1 + β2X2 + · · · + βpXp + ε

• β’s: Unknown constants, known as coefficients or parameters


• βj: The average effect on Y of a unit increase in Xj , holding all other predictors fixed.

• ε is the error term – captures measurement errors and missing variables


• ε is a random variable independent of X
• E(ε) = 0

• In the advertising example, the model becomes


sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

f is said to represent the systematic information that X provides about Y
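To make the model concrete, here is a minimal least-squares fit of the advertising model on the six observations above. With so few points the fitted coefficients are purely illustrative, not meaningful estimates:

```python
import numpy as np

X = np.array([[230.1, 37.8, 69.2],
              [ 44.5, 39.3, 45.1],
              [ 17.2, 45.9, 69.3],
              [151.5, 41.3, 58.5],
              [180.8, 10.8, 58.4],
              [  8.7, 48.9, 75.0]])
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2])

# Prepend an intercept column of ones, then solve for beta by least squares:
# beta = (beta0, beta1, beta2, beta3) in sales = beta0 + beta1*TV + ...
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ beta          # fitted values
residuals = y - y_hat     # estimates of the error term epsilon
```

With an intercept in the model, the least-squares residuals sum to zero, mirroring the assumption E(ε) = 0.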



Evaluation of Regression Model



Goodness of Fit



Coefficient of Determination (R-squared)

• Proportion of variance in a dependent variable that can be explained by an independent variable



Adjusted R2



Adjusted R2

How Does Adjusted R-Squared Differ from R-Squared?

• Although both r-squared and adjusted r-squared evaluate regression model performance, a
key difference exists between the two metrics. The r-squared value always increases or
remains the same when more predictors are added to the model, even if those predictors do
not significantly improve the model's explanatory power. This issue can create a misleading
impression of the model's effectiveness.

• Adjusted r-squared adjusts the r-squared value to account for the number of independent
variables in the model. The adjusted r-squared value can decrease if a new predictor does
not improve the model's fit, making it a more reliable measure of model accuracy. For this
reason, the adjusted r-squared can be used as a tool by data analysts to help them decide
which predictors to include.
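A small sketch of both formulas, using R² = 1 − RSS/TSS and the standard adjustment 1 − (1 − R²)(n − 1)/(n − p − 1) for a model with p predictors (the toy numbers below are made up for illustration):

```python
import numpy as np

def r2_scores(y, y_hat, p):
    """Return (R-squared, adjusted R-squared) for a model with p predictors."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rss = np.sum((y - y_hat) ** 2)              # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = 1 - rss / tss
    n = len(y)
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors
    return r2, adj

r2, adj = r2_scores([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], p=1)
# r2 = 0.98, adj = 0.97 -- adjusted R-squared is never above R-squared
```

Because the (n − 1)/(n − p − 1) factor grows with p, adding a useless predictor can lower the adjusted value even though plain R² rises.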



ZG 512
Classification: Logistic Regression
BITS Pilani Dr. Arindam Roy
Pilani Campus
Classification

Here the response variable Y is Qualitative/Categorical


• Email Spam: the email is one of Y = {spam, not spam}
• Handwritten Digit Recognition: the digit class is one of Y = {0, 1, . . . , 9}.

Our goals are:


1. Prediction
• Build a classifier f( X ) that assigns a class label to a future unlabeled observation X
• Estimate the probability that X belongs to each category in Y
Example: We may be more interested in an estimate of the probability that a transaction is
fraudulent than in simply classifying the transaction as fraudulent or not
2. Inference
• Understand the roles of the different predictors among X = (X 1 , X 2 , . . . , X p )



Regression Vs Classification

Variables can either be Quantitative or Qualitative (Categorical)


• Quantitative variables take on numerical values – Income, Bill amount
• Qualitative (categorical) variables take values in one of K different classes – Gender, Digit

Regression Problem: The response variable is quantitative


Classification Problem: The response variable is categorical



Classification Algorithms

• Naïve Bayes
• K-nearest Neighbour
• Logistic Regression
• Discriminant Analysis
• Decision Trees
• Support Vector Machine



Estimator and Error Rate
We have seen y = f(X)
f is a function that best maps an input x to output y. We wish to estimate this f.

The accuracy of the estimate f̂ is usually defined by the Error Rate


• The proportion of mis-classifications

Error Rate = Ave( I(yi ≠ ŷi) )

Where I is the Indicator function

There are two error rates


• Training Error Rate
• Test Error Rate

A good classifier is one for which the test error rate is the smallest
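The error rate above is straightforward to compute directly (a minimal sketch):

```python
def error_rate(y_true, y_pred):
    """Proportion of misclassifications: Ave( I(y_i != yhat_i) )."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

error_rate([1, 0, 1, 1], [1, 1, 1, 0])   # 2 of 4 wrong -> 0.5
```

The same function computes the training error rate when applied to training labels and the test error rate when applied to held-out labels.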



Regression Revision
Relationships between a numerical response and numerical / categorical predictors
• Hypothesis tests for all regression parameters together – Testing the Model
• Model coefficient interpretation
• Hypothesis tests for each regression parameters
• Confidence intervals for regression parameters
• Confidence and prediction intervals for predicted means and values
• Model diagnostics, residuals plots, outliers
• RSS, MSE, R2
• Interpreting computer outputs



Classification – why and how?

• Regression gives a number. What if I want to identify a class or category and not a
number?
• Let’s say, I want to identify genuine emails vs spam emails, genuine transactions vs
fraud transactions. Here the outcomes are text values, but models can understand
only numbers.
• How do I handle this? I will replace the 2 classes by numbers. Say, one class as 1
and another as 0 and train a model which can predict the outcome value that is 0
or 1.
• But models can’t give discrete values 0 and 1.
• We can rather make it give a continuous value between 0 and 1.
• If the value is closer to 1 (i.e. >= 0.5), I consider it as 1, otherwise 0.
Classification – why and how?
• Can I use the concepts of linear regression here? How?
• Linear regression can throw out any value between - ∞ and + ∞.
• However, I want to map or convert that range (- ∞,+ ∞) to (0,1).
• We need a link function to do this.
• The most appropriate one is a sigmoid or logistic function.



● Imagine we plotted out some categorical data
against one feature.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


● The X axis represents a feature value and the Y axis
represents the probability of belonging to class 1.


● We can’t use a normal linear regression model on
binary groups. It won’t lead to a good fit:


● We need a function that will fit binary categorical
data!


● It would be great if we could find a function with this
sort of behavior:


Sigmoid Function

● The Sigmoid (aka Logistic) Function takes in any
value and outputs it to be between 0 and 1.


Sigmoid Function

● This means we can take our Linear Regression
solution and place it into the Sigmoid Function.




Sigmoid Function

● This results in a probability from 0 to 1 of belonging
to the 1 class.


Sigmoid Function

● We can set a cutoff point at 0.5: anything below it
results in class 0, anything above in class 1.


Review

● We use the logistic function to output a value
ranging from 0 to 1. Based on this probability we
assign a class.


Odds and Logit Function

Odds is commonly used in gambling (and logistic regression)

For an event E,
• If we know P(E), then
  Odds(E) = P(E) / P(~E) = P(E) / (1 − P(E))
• If the odds of E are “x to y”, then P(E) = x / (x + y)

Logit function:
• logit(p) = ln( p / (1 − p) ), 0 < p < 1
• log Odds(E) = ln( P(E) / (1 − P(E)) )

logit can be interpreted as the log odds of a success



Logit Function and Logistic (Sigmoid) Function

• Logistic regression is a Generalized Linear Model (GLM)


• Uses the logistic or sigmoid function.

The logit function


• logit(p) = ln( p / (1 − p) ), 0 < p < 1

• Converts (0, 1) to the range (−∞, +∞)

The inverse function is known as the Sigmoid function (or Logistic function)


• S(x) = e^x / (1 + e^x) = 1 / (1 + e^−x), −∞ < x < +∞

• Converts (−∞, +∞) to the range (0, 1)
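A quick numeric check that the two functions are inverses of each other:

```python
import math

def logit(p):
    """Log odds: maps (0, 1) to (-inf, +inf)."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Logistic function: maps (-inf, +inf) back to (0, 1)."""
    return 1 / (1 + math.exp(-x))

sigmoid(0)            # -> 0.5, the cutoff point
sigmoid(logit(0.8))   # -> 0.8, since sigmoid inverts logit
```

In logistic regression the linear part β0 + β1x is interpreted as the log odds, and applying the sigmoid to it recovers the probability.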



Evaluating the Model

Confusion Matrix


The performance of f̂ can also be
described by a confusion matrix

A confusion matrix is a table that is used


to describe the performance of a
classification model (or "classifier") on a
set of data for which the true values are
known.

The confusion matrix gives strong clues


as to where f̂ is going wrong.



Confusion Matrix

Example

Consider a classical problem of predicting spam and non-spam email.


The objective is to identify Spams.
The training set consists of 15 emails that are Spam, and 85 emails that are Not Spam
The model correctly classified 95 emails
• All 85 Non-Spams were correctly classified
• 10 Spams were correctly classified
• 5 Spams were classified as Non-Spams (False Negative if Target is Spam).
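With Spam as the positive class, the counts above give the confusion-matrix cells and the headline metrics directly:

```python
# Spam = positive class; counts from the example above
tp, fn = 10, 5    # spam predicted as spam / as non-spam
tn, fp = 85, 0    # non-spam predicted as non-spam / as spam

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 95/100 = 0.95
precision = tp / (tp + fp)                    # 10/10  = 1.0
recall    = tp / (tp + fn)                    # 10/15  ~ 0.67 -- 5 spams missed
```

Accuracy looks high, yet recall shows that a third of the spams slip through, which is the point the next slides develop.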



Consider the following scenario: There are 90 people who
are healthy (negative) and 10 people who have COVID
disease (positive). Now let’s say our machine learning
model perfectly classified the 90 people as healthy but it
also classified the 10 COVID+ve patients as healthy. What
will happen in this scenario? Let us build the confusion
matrix and find out the accuracy.
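Working it out: the classifier that calls everyone healthy still scores 90% accuracy while detecting no patients at all:

```python
# Positive = COVID+, negative = healthy; the model predicts "healthy" for all
tn, fp = 90, 0    # all 90 healthy people predicted healthy
tp, fn = 0, 10    # all 10 patients also predicted healthy

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.90 -- looks good...
recall   = tp / (tp + fn)                    # 0.0  -- ...but no patient is found
```

This is the accuracy paradox on imbalanced data, which motivates the metrics on the next slides.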
What’s the right metric?

1. Many classifiers are designed to optimize error/accuracy


2. This tends to bias the performance towards the majority class
3. Anytime there is an imbalance in the data this can happen
4. The effect is particularly pronounced when the imbalance is more severe
5. Accuracy is not the right measure of classifier performance in such cases
6. What are other metrics?
1. Precision
2. Recall (Sensitivity or TPR or True Positive Rate = TP/P)
3. F1-score?

Also check*
1. Specificity (TNR or True Negative Rate = TN/N)
2. False Positive Rate (FPR) = FP/N = 1 – TNR
3. And others…

Refer https://en.wikipedia.org/wiki/Confusion_matrix



The Metrics

1 is same as Positive
0 is same as Negative



Metrics to evaluate ML models for
classification
Accuracy
Precision
Recall
F1 Score



Code
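The code shown in class is not reproduced here; as a stand-in, this is a minimal from-scratch sketch of the four metrics (scikit-learn offers the same via accuracy_score, precision_score, recall_score and f1_score in sklearn.metrics):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy  = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

F1 is the harmonic mean of precision and recall, so it is only high when both are.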



Model Assessment

For Regression, the most commonly used measure is MSE

MSE = Ave( (y − ŷ)² )

For Classification, the most commonly used measure is the Error Rate:

Error Rate = Ave( I(yi ≠ ŷi) )

Where I is the Indicator function



Training and Test Errors
• The model is developed on the training data.
• The Statistical Method estimates f (y = f(X)) by minimizing MSETr
• A procedure that minimizes MSETr will tend to “overfit” the data
• The training error shows the performance of the model on the training
data

What about the accuracy of the prediction on an unseen test


data?
• The usefulness of the model depends on the performance on unseen
test data
• We need a model that minimizes the test error
• We want a method that gives the lowest MSETe as opposed to
the lowest MSETr
• There are ways of estimating MSETe: Test Data, Cross Validation
k-fold Cross-Validation
• Widely used approach for estimating test error.
• Estimates can be used to
• Select best model
• Estimate the test error of the final chosen model.
• Process
• Randomly divide the data into k equal-sized parts
• Leave out part k, fit the model to the other k−1 parts (combined), and then obtain
predictions and the error for the left-out kth part.
• This is done in turn for each of the k parts
• The average of the k errors is the estimate of the test error

With a single train-test split, performance metrics can vary a lot depending on how the data
was split. K-fold cross validation reduces this variance by using all data points for both training
and testing, just in different iterations.
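The k-fold procedure can be sketched with the index bookkeeping alone; the model-fitting call would go inside the loop (fold sizes and the seed below are illustrative):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(20, 5)
sizes = []
for i, test_idx in enumerate(folds):
    # Fold i is held out; the other k-1 folds (combined) form the training set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    sizes.append((len(train_idx), len(test_idx)))
    # fit on train_idx, predict on test_idx, record the fold's error;
    # the test-error estimate is the average of the k fold errors
```

Every observation appears in exactly one test fold and in k − 1 training sets, which is what makes the estimate stable.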
5-fold CV
• Widely used approach for estimating test error.



Overfitting and underfitting

Data is fitted to a linear function and


a polynomial function.
• The polynomial function is a perfect fit –
overfitting since it is adapting to the training
set
• The linear function is more rigid but may
generalize better
• If the two functions were used to extrapolate
beyond the fitted data, the linear model
• May generalize better
• May make better predictions.



Overfitting and underfitting
Overfitting
• Occurs when the model captures the noise of the training data – fits the data too well
• A method is said to be overfitting the data when it generates a small MSETr and a large MSETe
• It is often a result of an excessively complicated model
• Can be prevented by fitting multiple models and using validation or cross-validation to
compare their predictive accuracies on test data.
• The model may have low bias but high variance

Underfitting
• Occurs when the model cannot capture the underlying trend of the training data
• Occurs when the model does not fit the data well enough
• It is often a result of an excessively simple model.
• The model may have low variance but high bias

Both overfitting and underfitting lead to poor predictions on new data sets.



Bias Vs Variance

The goal of any supervised statistical learning


algorithm is to achieve
• low bias and low variance
• Thereby achieving good prediction
performance.

In reality, we cannot calculate the real bias and


variance error terms because we do not know
the actual underlying target function.

Nevertheless, as a framework, bias and variance


provide the tools to understand the behaviour
of machine learning algorithms in the pursuit
of predictive performance.



Bias Vs Variance

The algorithm learns a model from training data


The prediction error can be broken down into three parts: Bias Error, Variance Error &
Irreducible Error (noise)
• The irreducible error cannot be reduced: it is the error introduced by modelling a real-
life scenario
• Bias error arises from the simplifying assumptions made by a model to make the
target function easier to learn
• Variance is the amount that the predictions will vary with different training data sets

✓ We want a good predictor – low bias and low variance



Bias Vs Variance

Low Bias, High Variance – Overly Flexible
High Bias, Low Variance – Less Flexible



Training and Test Errors

Underfitting
1. Rigidity or under-complexity
2. High Bias, Low variance
Overfitting
1. Flexibility or over-complexity
2. Low Bias, High variance

1. When do we know that we are


underfitting?
2. When are we overfitting?
3. What is the optimal flexibility?
4. Does obtaining more data help in a
case of underfitting? Overfitting?



Training and Test Errors

At any iteration, we would like to know whether
we have High Bias or High Variance.

High Bias:
• Large MSETr & Large MSETe
• MSETr ~ MSETe

High Variance:
• Small MSETr & Large MSETe
• MSETr << MSETe

[Figure: Mean Squared Error vs Flexibility]



K Nearest Neighbors (KNN)
“Birds of a feather flock together"
KNN – Different names

K-Nearest Neighbors
Memory-Based Reasoning
Example-Based Reasoning
Instance-Based Learning
Lazy Learning
Instance-Based Classifiers
• Store the training records
• Use training records to predict the class label of unseen cases

[Figure: a Set of Stored Cases (Atr1 … AtrN, Class) and an Unseen Case (Atr1 … AtrN)]
Basic Idea
If it walks like a duck, quacks like a duck, then it’s probably a duck

[Figure: compute the distance from the test record to the training records, then choose k of the “nearest” records]
What is nearest?

• A distance measure such as Euclidean distance – only valid for continuous variables

What is nearest?

• For categorical variables, a measure such as Hamming distance is used

Nearest-Neighbor Classifiers
Unknown record – Requires three things
– The set of stored records
– Distance Metric to compute
distance between records
– The value of k, the number of
nearest neighbors to retrieve

To classify an unknown record:


– Compute distance to other
training records
– Identify k nearest neighbors
– Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor of a record x]

K-nearest neighbors of a record x are the data points


that have the k smallest distances to x
1 nearest-neighbor
Voronoi Diagram defines the classification boundary

The area takes the


class of the green
point
K-Nearest Neighbor Algorithm
1. All the instances correspond to points in an n-dimensional
feature space.

2. Each instance is represented with a set of numerical attributes.

3. Each of the training data consists of a set of vectors and a


class label associated with each vector.

4. Classification is done by comparing feature vectors of the


K nearest points.

5. Select the K-nearest examples to E in the training set.

6. Assign E to the most common class among its K-nearest


neighbors.
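The six steps above fit in a short from-scratch sketch (the toy data and k = 3 are made up for illustration):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = [math.dist(x, x_new) for x in X_train]              # Euclidean
    nearest = sorted(range(len(X_train)), key=dists.__getitem__)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["A", "A", "A", "B", "B", "B"]
knn_predict(X, y, (2, 2), k=3)   # -> "A": all 3 nearest neighbours are A
```

Note the "lazy" nature of the method: there is no training step, all work happens at prediction time.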
3NN: Example
How to choose K?
• If K is too small it is sensitive to noise points.
• Larger K works well. But too large K may include majority points from other
classes.

• Rule of thumb is K < sqrt(n), n is number of examples.


Noisy sample
Feature Scaling

• Distance between neighbors could be dominated by some


attributes with relatively large numbers.
e.g., income of customers in our previous example.

• Arises when two features are in different scales.

• Important to normalize those features.


Mapping values to numbers between 0 and 1.
Types of Scaling
• Normalization/ Min-Max Scalar :

x’ = (x-min)/(max-min)

Scales data to a range (0 to 1) based on minimum and maximum


values.

• Standardization:

Centers data around the mean (0) and scales it by the standard
deviation (1).
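Both scalings can be sketched in a few lines (the income values below are hypothetical):

```python
import numpy as np

incomes = np.array([18_000.0, 42_000.0, 54_000.0, 90_000.0])

def min_max(x):
    """Normalization (Min-Max scaling): rescale to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Standardization: center on mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

scaled = min_max(incomes)    # 18K -> 0.0, 90K -> 1.0
z = standardize(incomes)     # mean 0, std 1
```

After either transform, income no longer dominates a Euclidean distance computed against a small-scale feature such as age.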
Pros and Cons
Strengths:
• No learning time (lazy learner)
• Highly expressive since can learn complex
decision boundaries
• Via use of k can avoid noise
• Easy to explain/justify decision

Weaknesses:
• Relatively long prediction time
• No model to provide insight
• Very sensitive to irrelevant and redundant features
• Normalization required
Random Forest
ZG 536
Foundations of Data Science
BITS Pilani
Pilani Campus Dr. Arindam Roy

M4 Predictive Modeling
Lecture 12 Model Ensembles
Why Ensemble?
• Decision trees are simple and interpretable models for regression and classification
• However they are often not competitive with other methods in terms of prediction accuracy
• Bagging, random forests and boosting are good methods for improving the prediction accuracy of trees. They work by
growing many trees on the training data and then combining the predictions of the resulting ensemble of trees.
• The latter two methods— random forests and boosting— are among the state-of-the-art methods for supervised
learning. However their results can be difficult to interpret.
• It appears that overfitting may not happen in Boosting



Definition
Ensemble methods in machine learning combine the predictions of several
base estimators built with a given learning algorithm in order to improve
generalizability / robustness over a single estimator.



Bootstrap aggregating

• A general-purpose procedure for reducing the variance of a statistical learning method

• It is particularly useful and frequently used in the context of decision trees.

• Ideally we would average models fit on many separate training sets; this is not practical because we generally do not have access to multiple training sets.



Bootstrap

• The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning method.

• Randomly sampling subsets of the original training data.

• Distinct data sets by repeatedly sampling observations from the original data set with replacement.

• New dataset obtained is of same size as our original dataset.

• This step ensures that the base models are trained on diverse subsets of the data, as some samples
may appear multiple times in the new subset, while others may be omitted. It reduces the risks of
overfitting and improves the accuracy of the model.



Bagging



Bagging - Regression

Since we do not have access to multiple training sets, we bootstrap

• Generate B different bootstrapped training data sets.

  Sample with replacement B times from the (single) training data set.

• Train the method on the bth bootstrapped training set to get f^b(x), b = 1, …, B

• Average all the predictions to obtain

  f_avg(x) = (1/B) ∑_{b=1}^{B} f^b(x)
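The procedure can be sketched as follows. For brevity the base learner here is a straight-line fit rather than a tree, and the data are simulated, so the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(10, dtype=float)
y = 2 * x + rng.normal(0, 1, size=10)   # simulated training set, true slope 2

B = 50
preds = []
for b in range(B):
    # Bootstrap: draw n observations with replacement from the training set
    idx = rng.integers(0, len(x), size=len(x))
    # Base learner f^b: a least-squares line fit on the bth bootstrap sample
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    preds.append(slope * 5.0 + intercept)        # f^b(x) evaluated at x = 5

f_bag = np.mean(preds)   # f_avg(5): the bagged prediction, near 2 * 5 = 10
```

Averaging the B bootstrapped fits smooths out the variability of any single fit, which is exactly the variance reduction bagging is after.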



Bagging - Classification

• Generate B different bootstrapped training data sets.


Sample with replacement B times from the (single) training dataset.

• Train our method on the bth bootstrapped training set to get f^b(x), b =
1, …, B

• Take the majority vote:


The overall prediction is the most commonly occurring class



Random Forest



Random Forest - Regression

• Generate B different bootstrapped training data sets.


Sample with replacement B times from the (single) training data set.

• When building a decision tree

  • At each split, a random sample of m predictors is chosen as split candidates from the full set of p predictors.
  • A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
  • We get f^b(x), b = 1, …, B

• Average all the predictions to obtain f_avg(x) = (1/B) ∑_{b=1}^{B} f^b(x)



Random Forest - Classification

• Generate B different bootstrapped training data sets.


Sample with replacement B times from the (single) training data set.

• When building a decision tree


• At each split, a random sample of m predictors is chosen as split candidates from the full
set of p predictors.
• A fresh selection of m predictors is taken at each split, and typically we choose m ≈ √p
• We get f^b(x), b = 1, …, B

• Take the majority vote:


The overall prediction is the most commonly occurring class



Boosting

• Boosting can be applied to regression and classification methods

• Bagging is a parallel process


• Creating multiple datasets using the bootstrap
• Fitting a separate decision tree to each copy
• Combining all of the trees in order to create a single predictive model.
• Each tree is built independently of the other trees.

• Boosting is a Serial process


• The trees are grown sequentially
• Each tree is grown using information from previously grown trees
• Each tree works on a modified version of the training set



Boosting - Procedure
• h1 is applied on the original dataset &
misclassifies 3 points. The weights of these points
are increased and others are decreased.
• h2 is applied on the new dataset & misclassifies 3
points. The weights of these points are increased
and others are decreased.
• h3 is applied on the new dataset & misclassifies 3
points.
• The final classifier is the weighted sum of the
three classifiers: H = α1h1 + α2h2 + α3h3
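One round of the reweighting described above can be sketched as in AdaBoost; the misclassified indices here are hypothetical, chosen only to illustrate the update:

```python
import math

n = 10
w = [1 / n] * n                 # start with uniform weights
miss = {2, 5, 7}                # indices misclassified by h1 (hypothetical)

err = sum(w[i] for i in miss)                   # weighted error = 0.3
alpha = 0.5 * math.log((1 - err) / err)         # h1's weight in the final vote

# Increase weights of misclassified points, decrease the rest, renormalize
w = [wi * math.exp(alpha if i in miss else -alpha) for i, wi in enumerate(w)]
total = sum(w)
w = [wi / total for wi in w]
```

After the update, the misclassified points carry exactly half the total weight, so the next learner h2 is forced to pay attention to them.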



Bagging Vs Boosting



Types of Boosting
AdaBoost (Adaptive Boosting)
• Combines multiple weak learners into a single strong learner.
• The weak learners in AdaBoost are decision trees with a single split, called decision stumps

Gradient Boosting
• Sequentially adding predictors to an ensemble, each one correcting its predecessor.
• Tries to fit the new predictor to the residual errors made by the previous predictor.

XGBoost (Extreme Gradient Boosting)


• Implementation of gradient boosted decision trees designed for speed and performance.
• Plain gradient boosting is generally slow because of sequential model training; XGBoost optimizes this considerably.



K-Means Clustering
“Birds of a feather flock together”
Definition and Applications
K-Means clustering is an unsupervised machine
learning technique for grouping data points into groups
(clusters) such that points in a group are more ‘similar’
to each other than to points outside the group.

Astronomical Data Analysis

Social Networks Analysis

Image Compression
K-Means Algorithm

Group unlabeled data


into two coherent
clusters (take K = 2)
K-Means Algorithm

STEP 1:

Randomly allocate two


points (K=2) as the
cluster centroids
K-Means Algorithm

STEP 2:
Cluster assignment:

Go through each data


point, depending on its
closeness with the red
or blue centroid assign
each point to one of the
two clusters
K-Means Algorithm

STEP 3:
Move centroid :

Take each centroid and


move to the average of
the correspondingly
assigned data-points
K-Means Algorithm

Repeat STEP 2:
Cluster assignment

• Repeat Step 2 and 3


until convergence:
Iteration 2
K-Means Algorithm

Repeat STEP 3:
Move Centroid

• Repeat Step 2 and 3


until convergence:
Iteration 2
K-Means Algorithm

Repeat STEP 2:
Cluster assignment

• Repeat Step 2 and 3


until convergence:
Iteration 3
K-Means Algorithm

Repeat STEP 3:
Move Centroid

• Repeat Step 2 and 3


until convergence:
Iteration 3: Converged!
K-Means Algorithm

STEP 1:

STEP 2:
Cluster assignment

STEP 3:
Move Centroid
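The three steps above fit in a compact NumPy sketch (random initial centroids drawn from the data; no handling of empty clusters, which the well-separated toy points below avoid):

```python
import numpy as np

def kmeans(X, k=2, iters=10, seed=0):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2 -- cluster assignment: nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3 -- move centroid: mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 8.2], [7.9, 9.0]])
labels, centroids = kmeans(X, k=2)   # the two obvious blobs separate
```

Iterating the two steps monotonically decreases the within-cluster sum of squares until the assignments stop changing.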
Customer Segmentation Caselet
We will use a dataset of 300 customers containing
their Age and Income information to analyze and
understand the customer segments and identify the
key attributes of each segment

… … …
Customer Segmentation Caselet
• Clusters are segmented mostly based on income
• Salary is on a larger scale (Y axis)
K=3

Preprocessing: Standardization
Customer Segmentation Caselet

K=3

Cluster 0: Customers with mean age of 31


and income of 54K. Low age and high
income
Cluster 1: Customers with mean age of 39
and income of 18K. Mid age and low
income
Cluster 2: Customers with mean age of 46
and income of 43K. High age and mid
income
Customer Segmentation Caselet
How do we choose the number of clusters (value of K)?

Objective function: Within Cluster Sum of Squares (WCSS)

WCSS = ∑_{j=1}^{k} ∑_{i} (xij − cj)²

• k = number of clusters
• n = number of data points
• xij = ith data point of the jth cluster
• cj = centroid of cluster j
• (xij − cj) = distance between a data point and the centroid to which it is assigned
Customer Segmentation Caselet
How do we choose the number of clusters (value of K)?

The Elbow Method: plot WCSS against K; the “elbow” of the curve indicates the optimum number of clusters
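The WCSS objective can be computed directly; the toy clusters below are hand-assigned so the expected value is known:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: squared distances to the own centroid."""
    return float(sum(np.sum((X[labels == j] - c) ** 2)
                     for j, c in enumerate(centroids)))

# Toy check: each of the 4 points is exactly 1 away from its centroid
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
wcss(X, labels, centroids)   # -> 4.0

# Elbow method: run k-means for K = 1, 2, 3, ... and pick the K where the
# decrease in WCSS levels off (the "elbow" of the curve)
```

WCSS always decreases as K grows, so the elbow, not the minimum, is what identifies a good K.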
Thank You!
