Savitribai Phule Pune University
Information Technology - Third Year

Machine Learning - UNIT I

Overview of Machine Learning

Introduction: What is Machine Learning, Definitions and Real-life Applications. Data and Types: Scales of Measurement. Data, Features and Patterns: Learning Tasks - Descriptive and Predictive Tasks. Learning Paradigms: Supervised, Unsupervised and Reinforcement Learning. Learning Models. Data and Dimensionality: Feature Sets, Feature Extraction and Subset Selection, Feature Transformation. Dimensionality Reduction Techniques - PCA and LDA.

Q.1 Explain with example Predictive and Descriptive tasks of Machine Learning. Also state Predictive and Descriptive Models (CO1). [6]

Definition:
Machine Learning tasks are broadly classified into two categories based on
their purpose - Predictive tasks that forecast future outcomes and
Descriptive tasks that analyze existing data patterns.

Predictive Tasks

Purpose:
To predict future or unknown outcomes based on historical data

Key Characteristics:
Supervised Learning: Uses labeled training data
Output Focus: Generates predictions for new, unseen data
Goal: Minimize prediction error and maximize accuracy

Example: Email Spam Detection


Input: Email features (sender, subject, content, attachments)
Output: Classification as "Spam" or "Not Spam"
Process: Model learns from labeled emails to predict spam probability

Descriptive Tasks

Purpose: To understand and describe patterns, relationships, and structures in existing data

Key Characteristics:
Unsupervised Learning: No labeled target variable
Pattern Discovery: Identifies hidden structures in data
Goal: Extract meaningful insights and relationships

Example: Customer Segmentation


Input: Customer data (age, income, purchase history)
Output: Customer groups with similar characteristics
Process: Clustering customers into segments for targeted marketing

Models Used

Predictive Models:
Linear Regression: For continuous value prediction
Decision Trees: For classification and regression
Neural Networks: For complex pattern recognition
Support Vector Machines: For classification tasks

Descriptive Models:
K-Means Clustering: For grouping similar data points
Association Rules: For finding item relationships
Principal Component Analysis: For dimensionality reduction
DBSCAN: For density-based clustering
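
The contrast can be made concrete with a short Python sketch (scikit-learn, with synthetic data standing in for real emails and customers): a decision tree handles a predictive task from labeled data, while k-means handles a descriptive task on unlabeled data.

```python
# Minimal sketch: predictive (classification) vs descriptive (clustering) tasks.
from sklearn.datasets import make_classification, make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Predictive: labeled data, learn a mapping from features to a known target.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier().fit(X, y)
print("Predicted class for one new sample:", clf.predict(X[:1]))

# Descriptive: unlabeled data, discover groups (segments) in the data itself.
X_cust, _ = make_blobs(n_samples=200, centers=3, random_state=0)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_cust)
print("First ten customer segments:", segments[:10])
```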

Advantages:
Predictive: Enables future planning and decision making
Descriptive: Provides insights into data structure and patterns

Disadvantages:
Predictive: Requires labeled training data, may overfit
Descriptive: Results can be subjective to interpretation

Applications:
Predictive: Weather forecasting, stock price prediction, medical diagnosis
Descriptive: Market research, social network analysis, gene analysis

Q.2 Explain k-fold Cross Validation technique with example (CO1). [5]

Definition:
k-fold Cross Validation is a statistical technique used to evaluate machine
learning models by dividing the dataset into k equal parts and
training/testing the model k times.

Working Process

Step 1: Data Partitioning


Division: Split dataset into k equal-sized subsets (folds)
Random Selection: Ensure each fold is representative of the whole
dataset

Step 2: Training and Testing Iterations


Iteration 1: Use k-1 folds for training, 1 fold for testing
Iteration 2: Use different fold for testing, rest for training
Continue: Repeat until each fold has been used as test set once

Step 3: Performance Evaluation


Calculate: Model performance for each iteration
Average: Final performance = mean of all k results

Example: 5-fold Cross Validation

Dataset: 1000 samples of house price data

Process:
Fold 1: Samples 1-200 (Test), Samples 201-1000 (Train)
Fold 2: Samples 201-400 (Test), Samples 1-200 + 401-1000 (Train)
Fold 3: Samples 401-600 (Test), Samples 1-400 + 601-1000 (Train)
Fold 4: Samples 601-800 (Test), Samples 1-600 + 801-1000 (Train)
Fold 5: Samples 801-1000 (Test), Samples 1-800 (Train)

Results:
Fold 1 Accuracy: 85%
Fold 2 Accuracy: 82%
Fold 3 Accuracy: 88%
Fold 4 Accuracy: 86%
Fold 5 Accuracy: 84%

Final Model Performance: (85 + 82 + 88 + 86 + 84) / 5 = 85.0%
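
A minimal Python sketch of the same procedure using scikit-learn's cross_val_score, with synthetic regression data as a stand-in for the 1000-sample house-price dataset above:

```python
# Minimal sketch of 5-fold cross validation with scikit-learn (synthetic data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=42)
model = LinearRegression()

# cv=5 splits the data into 5 folds; each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Per-fold R^2:", scores)
print("Mean R^2:", scores.mean())
```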

Advantages:
Reliable Evaluation: Uses entire dataset for both training and testing
Reduced Bias: Less dependent on specific train-test split
Better Generalization: Provides robust performance estimate

Disadvantages:
Computational Cost: Requires k times more training time
Data Dependency: Performance varies with dataset size and distribution

Applications:
Model selection and hyperparameter tuning
Comparing different algorithms
Estimating model generalization performance

Q.3 Write a note on Principal Component Analysis (PCA) (CO1). [4]

Definition:
Principal Component Analysis (PCA) is a dimensionality reduction technique
that transforms high-dimensional data into a lower-dimensional space while
preserving maximum variance.

Working Process

Step 1: Data Standardization


Normalize: Scale features to have mean = 0 and standard deviation = 1
Purpose: Prevent features with large scales from dominating

Step 2: Covariance Matrix Calculation


Formula: Cov(X, Y) = Σ(xᵢ - x̄)(yᵢ - ȳ) / (n - 1)
Purpose: Measure relationships between variables

Step 3: Eigenvalue Decomposition


Eigenvalues: Represent variance captured by each component
Eigenvectors: Represent directions of maximum variance

Step 4: Component Selection


Ranking: Sort components by eigenvalue magnitude
Selection: Choose top k components that capture desired variance

Step 5: Data Transformation


Projection: Transform original data onto selected components
Result: Reduced dimensional representation

Mathematical Foundation

Variance Explained:
PC1: Captures maximum variance in data
PC2: Captures maximum remaining variance (orthogonal to PC1)
PC3: Captures maximum remaining variance (orthogonal to PC1 and
PC2)

Cumulative Variance:
Goal: Retain 90-95% of original variance
Selection: Choose minimum components achieving this threshold
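
A minimal scikit-learn sketch of Steps 1-5, using the Iris data as an assumed example and keeping the smallest number of components that reach roughly 95% cumulative variance:

```python
# Minimal PCA sketch with scikit-learn, assuming a feature matrix X.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                       # input: 150 x 4 feature matrix

X_std = StandardScaler().fit_transform(X)  # Step 1: mean = 0, std = 1 per feature

pca = PCA()                                # Steps 2-3: covariance + eigendecomposition
X_proj = pca.fit_transform(X_std)          # Step 5: project onto principal components

# Step 4: keep the smallest number of components reaching ~95% variance.
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.argmax(cum_var >= 0.95)) + 1
print("Components kept:", k, "Cumulative variance:", cum_var[:k])
```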

Advantages:
Dimensionality Reduction: Reduces computational complexity
Noise Reduction: Eliminates less important features
Visualization: Enables plotting of high-dimensional data

Disadvantages:
Interpretability Loss: Components may not have clear meaning
Linear Assumption: Only captures linear relationships
Variance Focus: May discard discriminative information

Applications:
Image compression and processing
Data visualization and exploration
Feature extraction for machine learning
Genomics and bioinformatics analysis

Q.4 Justify which type of learning could be the most appropriate, considering any one real world application of Machine Learning. Also explain your reasoning (CO1). [6]

Application Selected: Netflix Movie Recommendation System

Learning Type Analysis

Chosen Learning Type: Hybrid Approach (Supervised + Unsupervised + Reinforcement Learning)

Justification and Reasoning

Primary Component: Supervised Learning


Purpose: Predict user ratings for movies
Data: Historical user ratings, movie features, user demographics
Algorithm: Collaborative Filtering with Matrix Factorization
Advantage: Leverages explicit feedback (ratings) for accurate
predictions

Secondary Component: Unsupervised Learning


Purpose: Discover hidden patterns in user behavior
Data: User viewing history, movie genres, viewing time patterns
Algorithm: K-means clustering for user segmentation
Advantage: Groups similar users without labeled categories

Tertiary Component: Reinforcement Learning


Purpose: Optimize recommendation strategy in real-time
Data: User interactions (clicks, views, skips, completion rates)
Algorithm: Multi-armed bandit algorithms
Advantage: Learns from immediate user feedback to improve
recommendations

Implementation Strategy

Phase 1: Supervised Learning Foundation


Training Data: 5-star rating system with user-movie pairs
Model: Deep Neural Networks for collaborative filtering
Output: Predicted rating for unwatched movies

Phase 2: Unsupervised Pattern Discovery


Clustering: Group users with similar preferences
Association Rules: Find movies frequently watched together
Dimensionality Reduction: Compress movie feature space

Phase 3: Reinforcement Learning Optimization


Reward System: Click-through rate, watch completion, user satisfaction
Exploration vs Exploitation: Balance popular vs diverse recommendations
Real-time Adaptation: Adjust recommendations based on immediate
feedback
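
A minimal sketch of the epsilon-greedy multi-armed bandit idea behind Phase 3; the three recommendation "arms" and their click probabilities are purely illustrative assumptions, not Netflix's actual system:

```python
# Minimal epsilon-greedy bandit sketch (hypothetical click-through rates).
import numpy as np

rng = np.random.default_rng(0)
true_click_prob = [0.05, 0.12, 0.08]   # assumed CTR of three recommendation slates
n_arms = len(true_click_prob)
counts, values = np.zeros(n_arms), np.zeros(n_arms)
epsilon = 0.1

for t in range(10_000):
    # Exploration vs exploitation: random arm with prob epsilon, else best-so-far.
    arm = rng.integers(n_arms) if rng.random() < epsilon else int(values.argmax())
    reward = float(rng.random() < true_click_prob[arm])   # 1 if the user "clicked"
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running-mean update

print("Estimated CTRs:", values.round(3))
```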

Why This Hybrid Approach?

Supervised Learning Alone - Insufficient Because:


Cold Start Problem: New users/movies lack rating history
Limited Feedback: Users rate only small fraction of content
Static Recommendations: Doesn't adapt to changing preferences

Unsupervised Learning Enhancement:


Content-Based Filtering: Recommends based on movie features
User Clustering: Addresses cold start for new users
Pattern Discovery: Finds unexpected user segments

Reinforcement Learning Addition:


Dynamic Optimization: Continuously improves recommendation quality
Personalization: Adapts to individual user behavior patterns
Business Metrics: Optimizes for engagement and retention

Advantages:
Comprehensive Coverage: Handles various recommendation scenarios
Continuous Improvement: Learns and adapts over time
Business Impact: Directly optimizes user engagement metrics

Disadvantages:
Complexity: Requires multiple model coordination
Computational Cost: High resource requirements
Data Requirements: Needs diverse data types

Applications:
E-commerce product recommendations
Social media content curation
Music streaming platforms
Online advertising targeting

Q.5 Explain Reinforcement Learning with diagram (CO1). [5]


Definition:
Reinforcement Learning is a type of machine learning where an agent
learns to make decisions by interacting with an environment, receiving
rewards or penalties for actions, and optimizing long-term cumulative
reward.

Key Components

Agent:
Definition: The learner or decision maker
Function: Observes environment and takes actions
Goal: Maximize cumulative reward over time

Environment:
Definition: External system that agent interacts with
Function: Provides states and rewards based on actions
Characteristics: Can be deterministic or stochastic

State (S):
Definition: Current situation or configuration of environment
Examples: Position in game, stock market conditions
Representation: Vector of relevant features

Action (A):
Definition: Choices available to agent in given state
Types: Discrete (finite options) or continuous (infinite range)
Selection: Based on policy function

Reward (R):
Definition: Immediate feedback for action taken
Purpose: Guides learning process
Design: Positive for good actions, negative for bad actions

Policy (π):
Definition: Strategy that maps states to actions
Goal: Find optimal policy that maximizes expected reward
Types: Deterministic or stochastic

Learning Process

Step 1: Exploration vs Exploitation


Exploration: Try new actions to discover better strategies
Exploitation: Use known best actions to maximize reward
Balance: ε-greedy, softmax, or upper confidence bound methods

Step 2: Value Function Learning


State Value V(s): Expected return from state s
Action Value Q(s,a): Expected return from taking action a in state s
Update: Based on temporal difference learning

Step 3: Policy Improvement


Evaluation: Assess current policy performance
Update: Modify policy based on learned values
Convergence: Iterate until optimal policy found

Popular Algorithms

Q-Learning:
Type: Model-free, off-policy
Update Rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Advantage: Learns optimal policy without environment model
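
A minimal Python sketch of the tabular Q-learning update above; the state/action sizes, the learning rate, and the single transition are illustrative assumptions:

```python
# Minimal tabular Q-learning update sketch (environment details assumed).
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9            # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example: one hypothetical transition (state 0, action 1, reward +1, next state 2).
q_update(s=0, a=1, r=1.0, s_next=2)
print(Q)
```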

Deep Q-Networks (DQN):


Enhancement: Neural networks for complex state spaces
Innovation: Experience replay and target networks
Application: Game playing, robotics

Policy Gradient Methods:


Approach: Directly optimize policy parameters
Advantage: Can handle continuous action spaces
Examples: REINFORCE, Actor-Critic methods

Advantages:
No Supervision Required: Learns from trial and error
Adaptive Learning: Continuously improves with experience
Optimal Solutions: Can find globally optimal policies

Disadvantages:
Sample Inefficiency: Requires many interactions for learning
Exploration Challenge: Balancing exploration and exploitation
Convergence Issues: No guarantee of finding optimal solution

Applications:
Game playing (Chess, Go, video games)
Robotics and autonomous systems
Resource allocation and scheduling
Financial trading algorithms

Real-life Examples:
AlphaGo defeating world champion Go players
Autonomous vehicle navigation
Personalized treatment recommendations
Dynamic pricing in e-commerce

Q.6 Discuss various scales of measurement of features in machine learning (CO1). [4]

Definition:
Scales of measurement classify different types of data variables based on
their mathematical properties and the operations that can be performed
on them.

Types of Measurement Scales

1. Nominal Scale

Definition: Categories without inherent order or ranking

Characteristics:
Qualitative: Names or labels only
No Mathematical Operations: Cannot perform arithmetic
Equality: Can check if two values are same or different

Examples:
Gender (Male, Female, Other)
Colors (Red, Blue, Green)
Marital Status (Single, Married, Divorced)
Product Categories (Electronics, Clothing, Books)

Operations Allowed:
Counting: Frequency of each category
Mode: Most frequent category
Chi-square Test: Association between categories

2. Ordinal Scale

Definition: Categories with meaningful order but no consistent intervals

Characteristics:
Ranking: Clear ordering relationship
No Arithmetic: Cannot add, subtract, multiply, or divide
Relative Position: Can compare which is higher/lower

Examples:
Education Level (High School < Bachelor's < Master's < PhD)
Customer Satisfaction (Poor < Fair < Good < Excellent)
Movie Ratings (1 star < 2 stars < 3 stars < 4 stars < 5 stars)
Economic Status (Low < Middle < High)

Operations Allowed:
Median: Middle value when ordered
Percentiles: Position-based statistics
Non-parametric Tests: Spearman correlation, Mann-Whitney U test

3. Interval Scale

Definition: Ordered data with consistent intervals but no true zero point

Characteristics:
Equal Intervals: Distance between consecutive values is constant
No True Zero: Zero point is arbitrary
Addition/Subtraction: Can calculate differences

Examples:
Temperature (Celsius, Fahrenheit)
Calendar Years (1990, 2000, 2010)
IQ Scores (100, 110, 120)
Standardized Test Scores (SAT, GRE)

Operations Allowed:
Mean: Average of all values
Standard Deviation: Measure of variability
Correlation: Pearson correlation coefficient
Linear Regression: Fitting straight lines

4. Ratio Scale

Definition: Ordered data with consistent intervals and a meaningful zero point

Characteristics:
True Zero: Zero means complete absence
All Operations: Can perform all arithmetic operations
Proportional Relationships: Can say one value is twice another

Examples:
Height (0 cm means no height)
Weight (0 kg means no weight)
Income (0 dollars means no income)

Age (0 years means just born)


Distance (0 km means no distance)

Operations Allowed:
All Statistical Measures: Mean, median, mode, standard deviation
Ratios: Can calculate proportions and percentages
Geometric Mean: Appropriate for ratio data
All Regression Types: Linear, polynomial, exponential

Machine Learning Implications

Feature Encoding:
Nominal: One-hot encoding, label encoding
Ordinal: Ordinal encoding preserving order
Interval/Ratio: Use values directly or normalize

Algorithm Selection:
Nominal: Decision trees, naive Bayes
Ordinal: Algorithms respecting order relationships
Interval/Ratio: All algorithms, especially distance-based

Preprocessing Requirements:
Nominal: No scaling needed
Ordinal: Careful scaling to preserve order
Interval/Ratio: Standardization or normalization
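
A minimal pandas/scikit-learn sketch of these encoding choices, using a small made-up DataFrame with one column per scale:

```python
# Minimal sketch of encoding features according to their measurement scale.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green"],                # nominal
    "education": ["High School", "Bachelor", "PhD"],  # ordinal
    "temp_c": [21.0, 25.5, 30.0],                     # interval
    "income": [30000, 52000, 75000],                  # ratio
})

# Nominal: one-hot encoding (no order implied between categories).
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Ordinal: integer codes that preserve the category order.
ordinal = OrdinalEncoder(
    categories=[["High School", "Bachelor", "PhD"]]
).fit_transform(df[["education"]])

# Interval/ratio: use values directly, here standardized to mean 0 / std 1.
numeric = StandardScaler().fit_transform(df[["temp_c", "income"]])

print(onehot, ordinal, numeric, sep="\n")
```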

Advantages:
Appropriate Analysis: Ensures correct statistical methods
Data Quality: Helps identify measurement errors
Algorithm Selection: Guides choice of ML algorithms

Disadvantages:
Complexity: Requires careful consideration of data types
Mixed Data: Real datasets often contain multiple scales
Preprocessing Overhead: Different scales need different treatments

Applications:
Survey data analysis and market research
Medical diagnosis and treatment planning
Quality control in manufacturing
Social science research and analysis

Real-life Examples:
Nominal: Customer segmentation by demographic categories
Ordinal: Product review ratings and rankings
Interval: Temperature monitoring in industrial processes
Ratio: Financial analysis and performance metrics

Q.7 Compare supervised, unsupervised, and semi-supervised learning with examples (CO1). [6]

Definition:
Machine learning algorithms are classified into three main categories
based on the type of feedback available during training: Supervised,
Unsupervised, and Semi-supervised learning.

Supervised Learning

Definition:
Learning with labeled training data where correct answers are provided

Characteristics:
Input-Output Pairs: Training data includes both features and target
values
Goal: Learn mapping function from input to output
Feedback: Immediate correction through labeled examples
Evaluation: Performance measured on test set with known answers

Types:
Classification: Predicts discrete categories or classes
Regression: Predicts continuous numerical values

Examples:
Email Spam Detection: Input: email content, Output: spam/not spam

Medical Diagnosis: Input: symptoms, Output: disease type


Stock Price Prediction: Input: market data, Output: future price
Image Recognition: Input: image pixels, Output: object class

Algorithms:
Linear Regression, Logistic Regression
Decision Trees, Random Forest
Support Vector Machines (SVM)
Neural Networks

Unsupervised Learning

Definition:
Learning from data without labeled examples or target outputs

Characteristics:
No Target Variable: Only input features available
Pattern Discovery: Finds hidden structures in data
Exploratory: Understands data distribution and relationships
Subjective Evaluation: No clear right or wrong answers

Types:
Clustering: Groups similar data points together
Association Rules: Finds relationships between variables
Dimensionality Reduction: Reduces feature space complexity

Examples:
Customer Segmentation: Group customers by buying behavior
Market Basket Analysis: Items frequently bought together
Gene Sequencing: Identify genetic patterns
Anomaly Detection: Detect unusual patterns in network traffic

Algorithms:
K-Means Clustering, Hierarchical Clustering
DBSCAN, Gaussian Mixture Models
Principal Component Analysis (PCA)
Association Rule Mining (Apriori)

Semi-supervised Learning

Definition:
Learning with both labeled and unlabeled data, typically with small labeled
and large unlabeled datasets

Characteristics:
Mixed Data: Combines labeled and unlabeled examples
Cost-Effective: Reduces labeling costs
Improved Performance: Often better than supervised learning with
limited labels
Realistic Scenario: Mirrors real-world data availability

Approaches:
Self-Training: Use model predictions on unlabeled data
Co-Training: Multiple models trained on different feature sets
Graph-Based: Propagate labels through data similarity graph

Examples:
Web Page Classification: Few labeled pages, millions unlabeled
Speech Recognition: Limited transcribed audio, vast unlabeled audio
Medical Image Analysis: Few expert-labeled images, many unlabeled
Social Network Analysis: Some user profiles labeled, most unlabeled

Algorithms:
Self-Training with Confidence Thresholding
Co-Training with Multiple Views
Graph-Based Label Propagation
Generative Adversarial Networks (GANs)
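
A minimal sketch of the self-training approach using scikit-learn's SelfTrainingClassifier on synthetic data, where the label -1 marks unlabeled samples (the 90% masking rate is an arbitrary choice):

```python
# Minimal self-training (semi-supervised) sketch with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9   # hide 90% of the labels
y_partial[unlabeled] = -1              # -1 = unlabeled by convention

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
model.fit(X, y_partial)                # learns from labeled + unlabeled data
print("Accuracy on true labels:", model.score(X, y))
```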

Detailed Comparison

Aspect            | Supervised              | Unsupervised                 | Semi-supervised
Training Data     | Labeled (X, y)          | Unlabeled (X only)           | Both labeled and unlabeled
Goal              | Predict targets         | Discover patterns            | Improve prediction with limited labels
Feedback          | Immediate               | None                         | Partial
Evaluation        | Accuracy metrics        | Subjective/domain knowledge  | Accuracy on labeled test set
Complexity        | Medium                  | High                         | High
Data Requirements | Large labeled dataset   | Large unlabeled dataset      | Small labeled + large unlabeled
Cost              | High (labeling)         | Low                          | Medium

Advantages:
Supervised: High accuracy with sufficient labeled data
Unsupervised: Discovers hidden patterns without labels
Semi-supervised: Balances performance and labeling costs

Disadvantages:
Supervised: Expensive labeling, overfitting risk
Unsupervised: Difficult evaluation, subjective results
Semi-supervised: Complex algorithms, assumption-dependent

Applications:
Supervised: Medical diagnosis, fraud detection, recommendation systems

Unsupervised: Market research, data compression, anomaly detection


Semi-supervised: Text classification, image recognition, bioinformatics

Real-life Examples:
Supervised: Netflix movie ratings prediction
Unsupervised: Amazon customer segmentation
Semi-supervised: Google's image search with limited labeled images

Q.8 Explain k-fold Cross Validation technique with example (CO1). [5]

Definition:
k-fold Cross Validation is a statistical technique used to evaluate machine
learning models by dividing the dataset into k equal parts and
training/testing the model k times.

Step-by-Step Process

Step 1: Data Preparation


Shuffle: Randomly rearrange dataset to ensure representative
distribution
Split: Divide data into k equal-sized subsets (folds)
Stratification: Maintain class distribution in each fold (for classification)

Step 2: Training and Testing Loop


Iteration 1: Use (k-1) folds for training, 1 fold for testing
Iteration 2: Use different fold for testing, remaining for training
Continue: Repeat until each fold has been used as test set once

Step 3: Performance Calculation


Individual Scores: Calculate performance metric for each iteration
Final Score: Average all k performance scores
Variance: Calculate standard deviation to assess consistency

Practical Example: Student Grade Prediction

Dataset: 1000 student records with features (study hours, attendance, previous grades)
Task: Predict final exam scores
k-value: 5 (5-fold cross validation)

Data Split:
Total Samples: 1000 student records
Fold Size: 200 samples each
Fold 1: Students 1-200
Fold 2: Students 201-400
Fold 3: Students 401-600
Fold 4: Students 601-800
Fold 5: Students 801-1000

Training Process:

Iteration 1:
Training Set: 800 samples (Folds 2,3,4,5)
Test Set: 200 samples (Fold 1)
Model: Linear Regression
Performance: R² = 0.87

Iteration 2:
Training Set: 800 samples (Folds 1,3,4,5)
Test Set: 200 samples (Fold 2)
Performance: R² = 0.85

Iteration 3:
Training Set: 800 samples (Folds 1,2,4,5)
Test Set: 200 samples (Fold 3)
Performance: R² = 0.89

Iteration 4:
Training Set: 800 samples (Folds 1,2,3,5)
Test Set: 200 samples (Fold 4)
Performance: R² = 0.86

Iteration 5:
Training Set: 800 samples (Folds 1,2,3,4)
Test Set: 200 samples (Fold 5)
Performance: R² = 0.88

Final Results:
Individual Scores: [0.87, 0.85, 0.89, 0.86, 0.88]
Mean Performance: (0.87 + 0.85 + 0.89 + 0.86 + 0.88) / 5 = 0.87
Standard Deviation: 0.015 (low variance indicates consistent
performance)

Variations of Cross-Validation

Stratified k-fold:
Purpose: Maintains class distribution in each fold
Usage: Classification problems with imbalanced datasets
Benefit: Ensures each fold is representative of overall distribution

Leave-One-Out (LOO):
Special Case: k = n (number of samples)
Process: Each sample used once as test set

Advantage: Maximum use of data for training


Disadvantage: Computationally expensive

Time Series Cross-Validation:


Purpose: Respects temporal order in time series data
Process: Always train on past, test on future
Prevents: Data leakage from future to past

Advantages:
Robust Evaluation: Uses entire dataset for both training and testing
Reduced Overfitting: Less likely to be biased by specific train-test split
Confidence Intervals: Provides variance estimate of model performance
Model Selection: Helps choose best algorithm and hyperparameters

Disadvantages:
Computational Cost: Requires k times more training time
Memory Usage: May need to store multiple model instances
Data Dependency: Results depend on random data splitting

Applications:
Model Selection: Comparing different algorithms
Hyperparameter Tuning: Finding optimal parameter values
Feature Selection: Evaluating feature importance
Performance Estimation: Getting reliable accuracy estimates

Real-life Examples:
Medical Research: Validating diagnostic models across patient groups
Financial Analysis: Testing trading strategies on different time periods
Image Recognition: Evaluating CNN performance across image datasets
Natural Language Processing: Validating sentiment analysis models

Q.9 Why is dataset splitting required? State the importance of each split in a machine learning model. [4]

Definition:
Dataset splitting is the process of dividing the available data into separate
subsets to train, validate, and test machine learning models effectively.

Why Dataset Splitting is Required:

Prevents Overfitting: Reveals when a model has only learned patterns specific to the training data
Enables Performance Evaluation: Provides an unbiased assessment of model performance
Model Selection: Helps choose the best performing model among alternatives
Generalization Testing: Ensures the model works well on unseen data

Importance of Each Split:


Training Set (60-70%):
Used to train the model parameters

Model learns patterns and relationships from this data


Larger size ensures better pattern recognition
Validation Set (15-20%):
Used for hyperparameter tuning
Helps in model selection and optimization
Prevents overfitting during training phase
Test Set (15-20%):
Provides final unbiased evaluation
Simulates real-world performance
Never used during training or validation

Advantages: Unbiased evaluation, prevents overfitting, better generalization
Disadvantages: Reduces available training data, requires careful splitting
Applications: All supervised learning tasks, model comparison
Real-life Examples: Email spam detection, image recognition systems

Q.10 Why is the size of the training dataset larger than that of the testing dataset? What should the ratio of training to testing data be? Explain any one dataset validation technique. [6]

Definition:
Training dataset size is larger than testing dataset because the model
needs sufficient data to learn patterns effectively while keeping enough
data for unbiased evaluation.

Why Training Dataset Size is More:


Pattern Learning: Model needs extensive examples to identify complex
patterns
Parameter Optimization: More data points help in better parameter
estimation
Generalization: Larger training set improves model's ability to
generalize
Statistical Significance: More samples provide statistically reliable
results

Recommended Ratios:
80:20 Rule: 80% training, 20% testing (most common)
70:30 Rule: 70% training, 30% testing (for smaller datasets)
60:20:20 Rule: 60% training, 20% validation, 20% testing (with
validation set)
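
A minimal scikit-learn sketch of the 60:20:20 split using two calls to train_test_split (synthetic data is assumed in place of a real dataset):

```python
# Minimal sketch of a 60:20:20 train/validation/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 20% as the test set, then split the rest 75:25
# (0.25 of 80% = 20% of the total) to obtain the validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600, 200, 200
```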

K-Fold Cross Validation Technique:

Steps:
1. Divide Dataset: Split data into k equal folds (typically k=5 or k=10)
2. Iterative Training: Use k-1 folds for training, 1 fold for validation
3. Rotation: Repeat process k times, each time using different fold for
validation
4. Average Results: Calculate mean performance across all k iterations

Example:
For k=5, dataset divided into 5 parts. In iteration 1, parts 1-4 train, part 5
validates. In iteration 2, parts 1,2,3,5 train, part 4 validates, and so on.

Advantages: Better utilization of data, more robust evaluation, reduces variance
Disadvantages: Computationally expensive, time-consuming for large datasets
Applications: Model selection, hyperparameter tuning, performance estimation
Real-life Examples: Medical diagnosis systems, financial prediction models

Q.11 What is the need for dimensionality reduction? Explain the concept of
the Curse of Dimensionality [5]

Definition:
Dimensionality reduction is the process of reducing the number of features
or variables in a dataset while retaining important information for
machine learning tasks.

Need for Dimensionality Reduction:


Computational Efficiency: Reduces processing time and memory
requirements
Storage Optimization: Decreases storage space needed for data
Visualization: Enables plotting of high-dimensional data in 2D/3D
Noise Reduction: Removes irrelevant features that may confuse the
model
Overfitting Prevention: Reduces model complexity and improves
generalization

Curse of Dimensionality:

Concept: As the number of dimensions increases, the volume of space


increases exponentially, making data points sparse and algorithms less
effective.

Effects:
Data Sparsity: In high dimensions, data points become far apart from
each other
Distance Concentration: All distances between points become nearly
equal
Increased Complexity: More features require more computational
resources
Sample Size Requirements: Need exponentially more samples for same
density

Mathematical Example:
In 1D: 10 points cover space well
In 2D: Need 100 points for same coverage
In 3D: Need 1000 points for same coverage
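
The distance-concentration effect can be demonstrated with a small NumPy/SciPy sketch (random points in a unit hypercube; the point count and dimensions are arbitrary choices): as dimensionality grows, the spread of pairwise distances shrinks relative to their mean.

```python
# Minimal sketch of distance concentration in high dimensions.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((200, d))          # 200 random points in the unit hypercube
    dist = pdist(X)                   # all pairwise Euclidean distances
    spread = (dist.max() - dist.min()) / dist.mean()
    print(f"d={d:5d}  relative spread of distances = {spread:.3f}")
```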

Advantages: Improved performance, reduced overfitting, faster computation
Disadvantages: Potential information loss, increased preprocessing complexity
Applications: Image processing, text mining, gene expression analysis
Real-life Examples: Face recognition, document classification, recommendation systems

Q.12 State and justify real-life applications of supervised and unsupervised learning (CO1). [4]

Definition:
Supervised learning uses labeled data for prediction, while unsupervised
learning finds hidden patterns in unlabeled data.

Supervised Learning Applications:

Email Spam Detection:


Justification: Uses labeled emails (spam/not spam) to train classifier
Input: Email features (keywords, sender, subject)
Output: Spam or legitimate email classification

Medical Diagnosis:
Justification: Trained on patient data with known diagnoses
Input: Symptoms, test results, patient history
Output: Disease prediction or health status

Credit Score Prediction:


Justification: Uses historical data with known credit outcomes
Input: Income, credit history, employment status
Output: Credit approval or rejection decision

Unsupervised Learning Applications:

Customer Segmentation:
Justification: Groups customers without predefined categories
Input: Purchase history, demographics, behavior patterns
Output: Customer clusters for targeted marketing

Market Basket Analysis:


Justification: Finds buying patterns without prior knowledge
Input: Transaction data, product purchases
Output: Association rules (if bread, then butter)

Anomaly Detection:
Justification: Identifies unusual patterns without labeled anomalies
Input: Network traffic, financial transactions
Output: Fraud detection, security threats

Supervised Advantages: Accurate predictions, measurable performance, clear objectives
Unsupervised Advantages: Discovers hidden patterns, no labeling required, exploratory analysis
Applications: Healthcare, finance, marketing, security, e-commerce
Real-life Examples: Netflix recommendations, fraud detection, autonomous vehicles

Q.13 Show how machine learning differs from traditional programming. Elaborate with a suitable diagram. [6]

Definition:
Traditional programming uses explicit instructions to solve problems, while
machine learning learns patterns from data to make predictions or
decisions.

Traditional Programming Approach:

Input: Data + Program (explicit rules)


Process: Execute predefined instructions
Output: Results based on coded logic
Developer Role: Writes specific rules and conditions

Flexibility: Limited to programmed scenarios

Machine Learning Approach:

Input: Data + Expected Output (training examples)


Process: Algorithm learns patterns automatically
Output: Trained model that can make predictions
Developer Role: Provides data and selects algorithms
Flexibility: Adapts to new patterns and scenarios

Key Differences:

Problem Solving:
Traditional: Programmer defines solution steps
ML: Algorithm discovers solution patterns

Rule Creation:
Traditional: Manually coded rules
ML: Automatically learned rules

Adaptability:
Traditional: Fixed behavior, needs code updates
ML: Adapts to new data patterns

Complexity Handling:
Traditional: Difficult for complex pattern recognition
ML: Excels at complex pattern identification

Examples:
Traditional Programming: Calculator, simple database queries, basic
sorting
Machine Learning: Image recognition, speech processing,
recommendation systems
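
A toy Python sketch of the contrast on a spam-filter task; the keyword rule and the four example emails are made up purely for illustration:

```python
# Minimal sketch: hand-coded rule (traditional) vs learned model (ML).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "cheap money offer", "project update"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam (toy labels)

# Traditional programming: the developer writes the rule explicitly.
def rule_based_filter(text):
    return 1 if "money" in text else 0

# Machine learning: the mapping from text to label is learned from examples.
vec = CountVectorizer()
X = vec.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

print("Rule-based:", [rule_based_filter(e) for e in emails])
print("Learned:   ", model.predict(X).tolist())
```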

Traditional Advantages: Predictable, transparent, precise control
ML Advantages: Handles complexity, learns from data, adapts to changes
Applications: ML for pattern recognition, traditional for rule-based systems
Real-life Examples: Email filters (ML) vs. payroll systems (traditional)

Q.14 Explain K-fold Cross Validation technique with suitable example [5]

Definition:
K-fold Cross Validation is a statistical method used to evaluate machine
learning models by dividing the dataset into k subsets and iteratively
training and testing the model.

K-fold Cross Validation Process:

Step 1: Divide dataset into k equal-sized folds (subsets)


Step 2: Select one fold as validation set, remaining k-1 folds as training
set
Step 3: Train model on training set, evaluate on validation set
Step 4: Repeat process k times, each time using different fold for
validation
Step 5: Calculate average performance across all k iterations

Detailed Example:

Dataset: 1000 student records for grade prediction

K-value: 5 (5-fold cross validation)

Iteration Details:

Fold 1: Records 1-200 (Validation), Records 201-1000 (Training)


Fold 2: Records 201-400 (Validation), Records 1-200,401-1000 (Training)
Fold 3: Records 401-600 (Validation), Records 1-400,601-1000 (Training)
Fold 4: Records 601-800 (Validation), Records 1-600,801-1000 (Training)
Fold 5: Records 801-1000 (Validation), Records 1-800 (Training)

Performance Calculation:

Iteration 1 Accuracy: 85%


Iteration 2 Accuracy: 87%
Iteration 3 Accuracy: 83%
Iteration 4 Accuracy: 86%
Iteration 5 Accuracy: 84%

Average Accuracy: (85+87+83+86+84)/5 = 85%

Common K Values:
k=5: Good balance between bias and variance
k=10: More thorough evaluation, higher computational cost
k=n (Leave-One-Out): Maximum data utilization, very expensive
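
A minimal sketch of the fold-by-fold loop using scikit-learn's KFold splitter, with synthetic data standing in for the student records described above:

```python
# Minimal sketch of the explicit 5-fold loop with sklearn's KFold.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, n_features=3, noise=5.0, random_state=1)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    r2 = model.score(X[test_idx], y[test_idx])   # R^2 on the held-out fold
    scores.append(r2)
    print(f"Fold {fold}: R^2 = {r2:.3f}")

print("Average R^2:", sum(scores) / len(scores))
```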

Advantages: Better data utilization, robust evaluation, reduces overfitting
Disadvantages: Computationally expensive, time-consuming for large datasets
Applications: Model selection, hyperparameter tuning, performance estimation
Real-life Examples: Medical diagnosis validation, financial model testing

Q.15 What is a dataset? Differentiate between a training dataset and a testing dataset. [4]

Definition:
A dataset is a structured collection of data organized in rows and
columns, where each row represents an observation and each column
represents a feature or attribute.

Dataset Components:
Features (Attributes): Independent variables used for prediction
Target (Label): Dependent variable to be predicted
Instances (Records): Individual data points or observations
Data Types: Numerical, categorical, text, images, etc.

Training Dataset vs Testing Dataset:

Aspect            | Training Dataset                             | Testing Dataset
Purpose           | Train and build the model                    | Evaluate final model performance
Size              | 70-80% of total data                         | 20-30% of total data
Usage             | Model learns patterns and relationships      | Unbiased performance assessment
Accessibility     | Used multiple times during training          | Used only once for final evaluation
Labels            | Contains both features and target values     | Contains features and target for evaluation
Model Interaction | Model adjusts parameters based on this data  | Model makes predictions without learning

Detailed Differences:

Training Dataset:
Function: Parameter learning and pattern recognition
Frequency: Used repeatedly during training iterations
Impact: Directly influences model weights and parameters
Overfitting Risk: Model may memorize training data patterns

Testing Dataset:
Function: Performance validation and generalization testing
Frequency: Used only once after training completion
Impact: Provides unbiased performance metrics
Generalization: Tests model's ability on unseen data

Example: Email Spam Detection:


Training Dataset: 8000 emails with spam/not spam labels for model
training
Testing Dataset: 2000 emails with known labels for accuracy
evaluation

Data Flow: Original Dataset → Training Set (Model Learning) → Validation Set (Hyperparameter Tuning) → Testing Set (Final Evaluation)

Advantages: Proper evaluation, prevents overfitting, ensures generalization
Disadvantages: Reduces available training data, requires careful splitting
Applications: All supervised learning tasks, model comparison
Real-life Examples: Medical diagnosis, image recognition, financial prediction

Q.16 Compare Supervised, Unsupervised and Semi-supervised Learning with examples. [6]

Definition:
Machine learning algorithms are categorized based on the type of data
and learning approach used during training.

Supervised Learning:

Characteristics:
Data Type: Labeled data (input-output pairs)
Learning Goal: Predict outcomes for new data
Feedback: Immediate feedback through known correct answers
Performance Measure: Accuracy, precision, recall

Types:
Classification: Predicts categories or classes
Regression: Predicts continuous numerical values

Examples:
Email Spam Detection: Input: email text, Output: spam/not spam
House Price Prediction: Input: size, location, Output: price value
Medical Diagnosis: Input: symptoms, Output: disease type

Unsupervised Learning:

Characteristics:
Data Type: Unlabeled data (only input features)
Learning Goal: Find hidden patterns or structures
Feedback: No direct feedback or correct answers
Performance Measure: Clustering quality, pattern discovery

Types:
Clustering: Groups similar data points
Association: Finds relationships between variables
Dimensionality Reduction: Reduces feature space

Examples:
Customer Segmentation: Group customers by buying behavior
Market Basket Analysis: "People who buy bread also buy butter"
Anomaly Detection: Identify unusual network activity

Semi-supervised Learning:

Characteristics:
Data Type: Mix of labeled and unlabeled data
Learning Goal: Improve performance using both data types
Feedback: Limited feedback from small labeled portion
Performance Measure: Better than unsupervised, close to supervised

Approach:
Small Labeled Set: Provides initial guidance
Large Unlabeled Set: Provides additional patterns
Combined Learning: Uses both for better performance

Examples:
Web Page Classification: Few labeled pages, many unlabeled pages
Speech Recognition: Limited transcribed audio, lots of raw audio
Image Recognition: Some labeled images, many unlabeled images

Comparison Summary

Aspect              | Supervised       | Unsupervised      | Semi-supervised
Data Requirement    | Labeled data     | Unlabeled data    | Mixed data
Learning Complexity | Medium           | High              | High
Accuracy            | High             | Variable          | Medium-High
Cost                | High (labeling)  | Low               | Medium
Applications        | Prediction tasks | Pattern discovery | Limited labeled data scenarios

Supervised Advantages: High accuracy, clear objectives, measurable performance
Unsupervised Advantages: No labeling cost, discovers hidden patterns, exploratory analysis
Semi-supervised Advantages: Better than unsupervised, lower cost than supervised
Applications: Healthcare, finance, marketing, security, recommendation systems
Real-life Examples: Netflix recommendations, fraud detection, autonomous vehicles

Q.17 What is the need for dimensionality reduction? Explain the subset selection method. [5]

Definition:
Dimensionality reduction is the process of reducing the number of features
in a dataset while preserving the most important information for machine
learning tasks.

Need for Dimensionality Reduction:


Computational Efficiency: Reduces processing time and memory usage
Storage Optimization: Decreases storage requirements for large
datasets
Visualization: Enables plotting high-dimensional data in 2D/3D space
Curse of Dimensionality: Overcomes problems of sparse data in high
dimensions
Noise Reduction: Removes irrelevant features that may confuse the
model
Overfitting Prevention: Reduces model complexity and improves
generalization

Subset Selection Method:

Definition:
Subset selection is a feature selection technique that chooses the most
relevant subset of original features without transforming them.

Types of Subset Selection:

Forward Selection:
Process: Start with empty set, add features one by one
Criterion: Add feature that improves performance most
Stopping: When no improvement or desired number reached

Backward Elimination:
Process: Start with all features, remove features one by one
Criterion: Remove feature that least affects performance
Stopping: When performance drops significantly

Bidirectional Selection:
Process: Combines forward and backward approaches
Criterion: Add or remove features based on performance
Stopping: When no beneficial additions or removals possible

Example - Student Performance Prediction:

Original Features: Age, Gender, Study Hours, Sleep Hours, Exercise Hours,
Social Media Hours, Income, Location

Forward Selection Process:


1. Start: Empty set
2. Step 1: Add "Study Hours" (highest correlation with grades)
3. Step 2: Add "Sleep Hours" (best improvement with Study Hours)
4. Step 3: Add "Exercise Hours" (further improvement)
5. Step 4: Try adding "Age" (no significant improvement - stop)
6. Final Subset: {Study Hours, Sleep Hours, Exercise Hours}
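
A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector; the feature names mirror the example above, but the data itself is synthetic and the assumed relationship to the target is made up for illustration:

```python
# Minimal forward subset-selection sketch (synthetic student-style data).
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
cols = ["study_hours", "sleep_hours", "exercise_hours", "social_media_hours", "age"]
X = pd.DataFrame(rng.random((200, len(cols))), columns=cols)
y = 5 * X["study_hours"] + 2 * X["sleep_hours"] + rng.normal(0, 0.1, 200)

# Start from an empty set and greedily add the feature that helps most (CV score).
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))
```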

Evaluation Metrics:
Accuracy: Model performance on validation set
Cross-validation Score: Average performance across folds
Information Criteria: AIC, BIC for model comparison
Statistical Tests: F-test, t-test for feature significance

Advantages of Subset Selection:


Interpretability: Original features retained, easy to understand
Computational Speed: Fewer features mean faster processing
Reduced Overfitting: Fewer parameters to learn
Storage Efficiency: Less memory required for data storage

Advantages: Maintains feature interpretability, reduces computational cost, prevents overfitting
Disadvantages: May miss feature interactions, can be computationally expensive for large feature sets
Applications: Text classification, gene selection, image processing
Real-life Examples: Medical diagnosis, financial analysis, recommendation systems

Q.18 What is a feature? Explain the types of feature selection techniques. [4]

Definition:
A feature is an individual measurable property or characteristic of an
object or phenomenon being observed in a dataset.

Feature Characteristics:
Measurable: Can be quantified or categorized
Relevant: Should contribute to the prediction task
Independent: Ideally not highly correlated with other features
Informative: Provides useful information about the target variable

Types of Features:
Numerical Features: Age, height, salary, temperature

Categorical Features: Gender, color, city, education level


Boolean Features: True/false, yes/no, present/absent
Text Features: Words, phrases, sentiment scores

Feature Selection Techniques:

1. Filter Methods:

Characteristics:
Independence: Evaluate features independently of machine learning
algorithm
Speed: Fast computation, suitable for large datasets
Approach: Use statistical measures to rank features

Examples:
Correlation Coefficient: Measures linear relationship with target
Chi-square Test: Tests independence between categorical variables
Information Gain: Measures reduction in entropy
Variance Threshold: Removes features with low variance

2. Wrapper Methods:

Characteristics:
Algorithm-dependent: Use specific machine learning algorithm for
evaluation
Accuracy: Better performance as they consider feature interactions

Approach: Search through feature subsets using model performance

Examples:
Forward Selection: Start empty, add features iteratively
Backward Elimination: Start with all, remove features iteratively
Recursive Feature Elimination: Recursively eliminate features

3. Embedded Methods:

Characteristics:
Integrated: Feature selection integrated into model training
Efficiency: Combines benefits of filter and wrapper methods
Approach: Algorithm automatically selects relevant features

Examples:
LASSO Regression: Uses L1 regularization to eliminate features
Random Forest: Uses feature importance scores
Ridge Regression: Uses L2 regularization to reduce feature weights
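
A minimal scikit-learn sketch showing a filter method (SelectKBest) and an embedded method (LASSO) side by side on synthetic regression data:

```python
# Minimal sketch of filter vs embedded feature selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=3, random_state=0)

# Filter: rank features by univariate F-score, keep the best 3.
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print("Filter keeps features:", np.where(filter_sel.get_support())[0])

# Embedded: L1 regularization drives irrelevant coefficients to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("LASSO keeps features:", np.where(lasso.coef_ != 0)[0])
```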

Comparison:

Method   | Speed  | Accuracy  | Complexity | Use Case
Filter   | Fast   | Good      | Low        | Large datasets, preprocessing
Wrapper  | Slow   | Excellent | High       | Small datasets, high accuracy needed
Embedded | Medium | Very Good | Medium     | Balanced approach, most algorithms

Advantages: Improved performance, reduced overfitting, faster computation, better interpretability
Disadvantages: Risk of removing important features, computational overhead for wrapper methods
Applications: Text mining, image processing, bioinformatics, financial analysis
Real-life Examples: Email spam detection, medical diagnosis, recommendation systems
