Unit 1 PYQ
🔹 About ReferMe
ReferMe, by Pixen, offers curated academic resources to help students study
smarter and succeed faster.
✅ Class Notes
✅ Previous Year Question Papers (PYQs)
✅ Updated Syllabus
✅ Quick Revision Material
🔹 About Pixen
Pixen is a tech company helping students and startups turn ideas into reality.
Alongside ReferMe, we also offer:
✅ Custom Websites
✅ Machine learning
✅ Web Applications
✅ E‑Commerce Stores
✅ Landing Pages
https://referme.tech/ https://www.pixen.live
Savitribai Phule Pune University
Information Technology - Third Year
Definition:
Machine Learning tasks are broadly classified into two categories based on
their purpose: Predictive tasks, which forecast future or unknown outcomes, and
Descriptive tasks, which analyze patterns in existing data.
Predictive Tasks
Purpose:
To predict future or unknown outcomes based on historical data
Key Characteristics:
Supervised Learning: Uses labeled training data
Output Focus: Generates predictions for new, unseen data
Goal: Minimize prediction error and maximize accuracy
Descriptive Tasks
Purpose:
To discover and summarize patterns and relationships in existing data
Key Characteristics:
Unsupervised Learning: No labeled target variable
Pattern Discovery: Identifies hidden structures in data
Goal: Extract meaningful insights and relationships
Models Used
Predictive Models:
Linear Regression: For continuous value prediction
Decision Trees: For classification and regression
Neural Networks: For complex pattern recognition
Support Vector Machines: For classification tasks
Descriptive Models:
K-Means Clustering: For grouping similar data points
Association Rules: For finding item relationships
Principal Component Analysis: For dimensionality reduction
DBSCAN: For density-based clustering
Advantages:
Predictive: Enables future planning and decision making
Descriptive: Provides insights into data structure and patterns
Disadvantages:
Predictive: Requires labeled training data, may overfit
Descriptive: Results can be subjective and open to interpretation
Applications:
Predictive: Weather forecasting, stock price prediction, medical diagnosis
Descriptive: Market research, social network analysis, gene analysis
Q.2 Explain k-fold Cross Validation technique with example (CO1). [5]
Definition:
k-fold Cross Validation is a statistical technique used to evaluate machine
learning models by dividing the dataset into k equal parts and
training/testing the model k times.
Working Process
Process:
Fold 1: Samples 1-200 (Test), Samples 201-1000 (Train)
Fold 2: Samples 201-400 (Test), Samples 1-200 + 401-1000 (Train)
Fold 3: Samples 401-600 (Test), Samples 1-400 + 601-1000 (Train)
Fold 4: Samples 601-800 (Test), Samples 1-600 + 801-1000 (Train)
Fold 5: Samples 801-1000 (Test), Samples 1-800 (Train)
Results:
Fold 1 Accuracy: 85%
Fold 2 Accuracy: 82%
Fold 3 Accuracy: 88%
Fold 4 Accuracy: 86%
Fold 5 Accuracy: 84%
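A minimal sketch of this evaluation using scikit-learn's cross_val_score; the iris dataset and logistic regression model are illustrative assumptions, not part of the question:

```python
# Minimal sketch of 5-fold cross-validation with scikit-learn.
# The dataset and classifier are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into 5 folds; each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```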
Advantages:
Reliable Evaluation: Uses entire dataset for both training and testing
Reduced Bias: Less dependent on specific train-test split
Better Generalization: Provides robust performance estimate
Disadvantages:
Computational Cost: Requires k times more training time
Data Dependency: Performance varies with dataset size and distribution
Applications:
Model selection and hyperparameter tuning
Comparing different algorithms
Estimating model generalization performance
Definition:
Principal Component Analysis (PCA) is a dimensionality reduction technique
that transforms high-dimensional data into a lower-dimensional space while
preserving maximum variance.
Working Process
Steps:
1. Standardize the data so each feature has zero mean and unit variance
2. Compute the covariance matrix of the standardized features
3. Compute its eigenvectors (principal components) and eigenvalues (variance captured)
4. Sort components by eigenvalue and select the top k
5. Project the data onto the selected components
Mathematical Foundation
Variance Explained:
PC1: Captures maximum variance in data
PC2: Captures maximum remaining variance (orthogonal to PC1)
PC3: Captures maximum remaining variance (orthogonal to PC1 and
PC2)
Cumulative Variance:
Goal: Retain 90-95% of original variance
Selection: Choose minimum components achieving this threshold
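A short sketch of selecting components by a cumulative-variance threshold with scikit-learn's PCA; the random 10-feature data is a placeholder assumption:

```python
# Sketch: choosing the number of principal components that retain ~95% variance.
# The random data below is a placeholder assumption.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 samples, 10 features

# Passing a float in (0, 1) asks PCA to keep enough components for that variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print("Components kept:", pca.n_components_)
print("Cumulative variance:", pca.explained_variance_ratio_.cumsum())
```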
Advantages:
Dimensionality Reduction: Reduces computational complexity
Noise Reduction: Eliminates less important features
Visualization: Enables plotting of high-dimensional data
Disadvantages:
Interpretability Loss: Components may not have clear meaning
Linear Assumption: Only captures linear relationships
Variance Focus: May discard discriminative information
Applications:
Image compression and processing
Data visualization and exploration
Feature extraction for machine learning
Genomics and bioinformatics analysis
Implementation Strategy
Advantages:
Comprehensive Coverage: Handles various recommendation scenarios
Continuous Improvement: Learns and adapts over time
Business Impact: Directly optimizes user engagement metrics
Disadvantages:
Complexity: Requires multiple model coordination
Computational Cost: High resource requirements
Data Requirements: Needs diverse data types
Applications:
E-commerce product recommendations
Social media content curation
Music streaming platforms
Online advertising targeting
Key Components
Agent:
Definition: The learner or decision maker
Function: Observes environment and takes actions
Goal: Maximize cumulative reward over time
Environment:
Definition: External system that agent interacts with
Function: Provides states and rewards based on actions
Characteristics: Can be deterministic or stochastic
State (S):
Definition: Current situation or configuration of environment
Examples: Position in game, stock market conditions
Representation: Vector of relevant features
Action (A):
Definition: Choices available to agent in given state
Types: Discrete (finite options) or continuous (infinite range)
Selection: Based on policy function
Reward (R):
Definition: Immediate feedback for action taken
Purpose: Guides learning process
Design: Positive for good actions, negative for bad actions
Policy (π):
Definition: Strategy that maps states to actions
Goal: Find optimal policy that maximizes expected reward
Types: Deterministic or stochastic
Learning Process
Steps:
1. The agent observes the current state of the environment
2. It selects an action according to its policy
3. The environment returns a reward and the next state
4. The agent updates its value estimates or policy using this feedback
5. The cycle repeats until the policy converges
Popular Algorithms
Q-Learning:
Type: Model-free, off-policy
Update Rule: Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
Advantage: Learns optimal policy without environment model
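A minimal sketch of the tabular Q-learning update rule above; the tiny 2-state, 2-action setup and the single transition are made-up illustrations:

```python
# Sketch of the tabular Q-learning update rule.
# The 2-state, 2-action environment is a made-up illustration.
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9           # learning rate and discount factor

def q_update(s, a, r, s_next):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# One hypothetical transition: from state 0, action 1 gives reward 1 and leads to state 1.
q_update(s=0, a=1, r=1.0, s_next=1)
print(Q)
```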
Advantages:
No Supervision Required: Learns from trial and error
Adaptive Learning: Continuously improves with experience
Optimal Solutions: Can find globally optimal policies
Disadvantages:
Sample Inefficiency: Requires many interactions for learning
Exploration Challenge: Balancing exploration and exploitation
Convergence Issues: No guarantee of finding optimal solution
Applications:
Game playing (Chess, Go, video games)
Robotics and autonomous systems
Resource allocation and scheduling
Financial trading algorithms
Real-life Examples:
AlphaGo defeating world champion Go players
Autonomous vehicle navigation
Personalized treatment recommendations
Dynamic pricing in e-commerce
Definition:
Scales of measurement classify different types of data variables based on
their mathematical properties and the operations that can be performed
on them.
1. Nominal Scale
Characteristics:
Qualitative: Names or labels only
No Mathematical Operations: Cannot perform arithmetic
Equality: Can check if two values are same or different
Examples:
Gender (Male, Female, Other)
Colors (Red, Blue, Green)
Marital Status (Single, Married, Divorced)
Product Categories (Electronics, Clothing, Books)
Operations Allowed:
Counting: Frequency of each category
Mode: Most frequent category
Chi-square Test: Association between categories
2. Ordinal Scale
Characteristics:
Ranking: Clear ordering relationship
No Arithmetic: Cannot add, subtract, multiply, or divide
Relative Position: Can compare which is higher/lower
Examples:
Education Level (High School < Bachelor's < Master's < PhD)
Customer Satisfaction (Poor < Fair < Good < Excellent)
Movie Ratings (1 star < 2 stars < 3 stars < 4 stars < 5 stars)
Economic Status (Low < Middle < High)
Operations Allowed:
Median: Middle value when ordered
Percentiles: Position-based statistics
Non-parametric Tests: Spearman correlation, Mann-Whitney U test
3. Interval Scale
Definition: Ordered data with consistent intervals but no true zero point
Characteristics:
Equal Intervals: Distance between consecutive values is constant
No True Zero: Zero point is arbitrary
Addition/Subtraction: Can calculate differences
Examples:
Temperature (Celsius, Fahrenheit)
Calendar Years (1990, 2000, 2010)
IQ Scores (100, 110, 120)
Standardized Test Scores (SAT, GRE)
Operations Allowed:
Mean: Average of all values
Standard Deviation: Measure of variability
Correlation: Pearson correlation coefficient
Linear Regression: Fitting straight lines
4. Ratio Scale
Characteristics:
True Zero: Zero means complete absence
All Operations: Can perform all arithmetic operations
Proportional Relationships: Can say one value is twice another
Examples:
Height (0 cm means no height)
Weight (0 kg means no weight)
Income (0 dollars means no income)
Operations Allowed:
All Statistical Measures: Mean, median, mode, standard deviation
Ratios: Can calculate proportions and percentages
Geometric Mean: Appropriate for ratio data
All Regression Types: Linear, polynomial, exponential
Feature Encoding:
Nominal: One-hot encoding, label encoding
Ordinal: Ordinal encoding preserving order
Interval/Ratio: Use values directly or normalize
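A brief sketch of these encoding choices using pandas and scikit-learn; the three-row DataFrame is an assumed toy example:

```python
# Sketch of encoding choices per scale of measurement.
# The toy DataFrame is an assumed example.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({
    "color": ["Red", "Blue", "Green"],             # nominal
    "satisfaction": ["Poor", "Good", "Excellent"], # ordinal
    "income": [30000, 52000, 78000],               # ratio
})

# Nominal: one-hot encoding (no implied order).
nominal = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integer codes that preserve the order Poor < Good < Excellent.
ordinal = OrdinalEncoder(categories=[["Poor", "Good", "Excellent"]]) \
    .fit_transform(df[["satisfaction"]])

# Interval/ratio: standardize (zero mean, unit variance).
scaled = StandardScaler().fit_transform(df[["income"]])

print(nominal, ordinal, scaled, sep="\n")
```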
Algorithm Selection:
Nominal: Decision trees, naive Bayes
Ordinal: Algorithms respecting order relationships
Interval/Ratio: All algorithms, especially distance-based
Preprocessing Requirements:
Nominal: No scaling needed
Ordinal: Careful scaling to preserve order
Interval/Ratio: Standardization or normalization
Advantages:
Appropriate Analysis: Ensures correct statistical methods
Data Quality: Helps identify measurement errors
Algorithm Selection: Guides choice of ML algorithms
Disadvantages:
Complexity: Requires careful consideration of data types
Mixed Data: Real datasets often contain multiple scales
Preprocessing Overhead: Different scales need different treatments
Applications:
Survey data analysis and market research
Medical diagnosis and treatment planning
Quality control in manufacturing
Social science research and analysis
Real-life Examples:
Nominal: Customer segmentation by demographic categories
Ordinal: Product review ratings and rankings
Interval: Temperature monitoring in industrial processes
Ratio: Financial analysis and performance metrics
Definition:
Machine learning algorithms are classified into three main categories
based on the type of feedback available during training: Supervised,
Unsupervised, and Semi-supervised learning.
Supervised Learning
Definition:
Learning with labeled training data where correct answers are provided
Characteristics:
Input-Output Pairs: Training data includes both features and target
values
Goal: Learn mapping function from input to output
Feedback: Immediate correction through labeled examples
Evaluation: Performance measured on test set with known answers
Types:
Classification: Predicts discrete categories or classes
Regression: Predicts continuous numerical values
Examples:
Email Spam Detection: Input: email content, Output: spam/not spam
Algorithms:
Linear Regression, Logistic Regression
Decision Trees, Random Forest
Support Vector Machines (SVM)
Neural Networks
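A minimal supervised-learning sketch, assuming the scikit-learn digits dataset and a random forest classifier as stand-ins for any labeled problem:

```python
# Minimal supervised-learning sketch: labeled data in, predictions out.
# The digits dataset and split ratio are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                      # features + labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)  # learn the input-to-output mapping
print("Test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```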
Unsupervised Learning
Definition:
Learning from data without labeled examples or target outputs
Characteristics:
No Target Variable: Only input features available
Pattern Discovery: Finds hidden structures in data
Exploratory: Understands data distribution and relationships
Subjective Evaluation: No clear right or wrong answers
Types:
Clustering: Groups similar data points together
Association Rules: Finds relationships between variables
Dimensionality Reduction: Reduces feature space complexity
Examples:
Customer Segmentation: Group customers by buying behavior
Market Basket Analysis: Items frequently bought together
Gene Sequencing: Identify genetic patterns
Anomaly Detection: Detect unusual patterns in network traffic
Algorithms:
K-Means Clustering, Hierarchical Clustering
DBSCAN, Gaussian Mixture Models
Principal Component Analysis (PCA)
Association Rule Mining (Apriori)
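A minimal unsupervised-learning sketch, assuming synthetic blobs as a stand-in for unlabeled customer data, grouped with K-Means:

```python
# Minimal unsupervised-learning sketch: no labels, the algorithm groups similar points.
# The synthetic blobs are an assumed stand-in for customer data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments (first 10):", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)
```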
Semi-supervised Learning
Definition:
Learning with both labeled and unlabeled data, typically with small labeled
and large unlabeled datasets
Characteristics:
Mixed Data: Combines labeled and unlabeled examples
Cost-Effective: Reduces labeling costs
Improved Performance: Often better than supervised learning with
limited labels
Realistic Scenario: Mirrors real-world data availability
Approaches:
Self-Training: Use model predictions on unlabeled data
Co-Training: Multiple models trained on different feature sets
Graph-Based: Propagate labels through data similarity graph
Examples:
Web Page Classification: Few labeled pages, millions unlabeled
Speech Recognition: Limited transcribed audio, vast unlabeled audio
Medical Image Analysis: Few expert-labeled images, many unlabeled
Social Network Analysis: Some user profiles labeled, most unlabeled
Algorithms:
Self-Training with Confidence Thresholding
Co-Training with Multiple Views
Graph-Based Label Propagation
Generative Adversarial Networks (GANs)
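A short self-training sketch using scikit-learn's SelfTrainingClassifier; the iris data and the choice to hide roughly 80% of its labels are illustrative assumptions:

```python
# Sketch of self-training: a few labeled points, the rest unlabeled (marked -1).
# Dataset and threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.8           # hide ~80% of the labels
y_partial[unlabeled] = -1                      # -1 marks "unlabeled" for scikit-learn

base = SVC(probability=True)                   # base learner must output probabilities
model = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)
new_labels = (model.transduction_ != -1).sum() - (~unlabeled).sum()
print("Pseudo-labels added during self-training:", new_labels)
```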
Detailed Comparison
Data Requirements:
Supervised: Large labeled dataset
Unsupervised: Large unlabeled dataset
Semi-supervised: Small labeled + large unlabeled dataset
Advantages:
Supervised: High accuracy with sufficient labeled data
Unsupervised: Discovers hidden patterns without labels
Semi-supervised: Balances performance and labeling costs
Disadvantages:
Supervised: Expensive labeling, overfitting risk
Unsupervised: Difficult evaluation, subjective results
Semi-supervised: Complex algorithms, assumption-dependent
Applications:
Supervised: Medical diagnosis, fraud detection, recommendation systems
Real-life Examples:
Supervised: Netflix movie ratings prediction
Unsupervised: Amazon customer segmentation
Semi-supervised: Google's image search with limited labeled images
Q.8 Explain k-fold Cross Validation technique with example (CO1). [5]
Definition:
k-fold Cross Validation is a statistical technique used to evaluate machine
learning models by dividing the dataset into k equal parts and
training/testing the model k times.
Step-by-Step Process
Data Split:
Total Samples: 1000 student records
Fold Size: 200 samples each
Fold 1: Students 1-200
Fold 2: Students 201-400
Fold 3: Students 401-600
Fold 4: Students 601-800
Fold 5: Students 801-1000
Training Process:
Iteration 1:
Training Set: 800 samples (Folds 2,3,4,5)
Test Set: 200 samples (Fold 1)
Model: Linear Regression
Performance: R² = 0.87
Iteration 2:
Training Set: 800 samples (Folds 1,3,4,5)
Test Set: 200 samples (Fold 2)
Performance: R² = 0.85
Iteration 3:
Training Set: 800 samples (Folds 1,2,4,5)
Test Set: 200 samples (Fold 3)
Performance: R² = 0.89
Iteration 4:
Training Set: 800 samples (Folds 1,2,3,5)
Test Set: 200 samples (Fold 4)
Performance: R² = 0.86
Iteration 5:
Training Set: 800 samples (Folds 1,2,3,4)
Test Set: 200 samples (Fold 5)
Performance: R² = 0.88
Final Results:
Individual Scores: [0.87, 0.85, 0.89, 0.86, 0.88]
Mean Performance: (0.87 + 0.85 + 0.89 + 0.86 + 0.88) / 5 = 0.87
Standard Deviation: 0.015 (low variance indicates consistent
performance)
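A sketch of the five iterations above written out explicitly with KFold; the synthetic regression data standing in for the student records is an assumption:

```python
# Sketch of 5-fold cross-validation done explicitly with KFold and linear regression.
# The synthetic "student records" regression data is an assumption.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])            # train on 4 folds (800 samples)
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))      # test on the held-out fold

print("Per-fold R^2:", np.round(scores, 3))
print("Mean R^2: %.3f, Std: %.3f" % (np.mean(scores), np.std(scores)))
```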
Variations of Cross-Validation
Stratified k-fold:
Purpose: Maintains class distribution in each fold
Usage: Classification problems with imbalanced datasets
Benefit: Ensures each fold is representative of overall distribution
Leave-One-Out (LOO):
Special Case: k = n (number of samples)
Process: Each sample used once as test set
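A brief sketch of both variations with scikit-learn; the breast cancer dataset and logistic regression pipeline are illustrative assumptions (leave-one-out is run on only 100 samples to keep it cheap):

```python
# Sketch of stratified k-fold and leave-one-out cross-validation.
# Dataset and estimator choices are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified k-fold keeps the class ratio (nearly) identical in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("Stratified 5-fold mean accuracy:", cross_val_score(model, X, y, cv=skf).mean())

# Leave-one-out: k equals the number of samples, so it is expensive;
# only the first 100 samples are used here to keep the run short.
print("LOO mean accuracy:", cross_val_score(model, X[:100], y[:100], cv=LeaveOneOut()).mean())
```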
Advantages:
Robust Evaluation: Uses entire dataset for both training and testing
Reduced Overfitting: Less likely to be biased by specific train-test split
Confidence Intervals: Provides variance estimate of model performance
Model Selection: Helps choose best algorithm and hyperparameters
Disadvantages:
Computational Cost: Requires k times more training time
Memory Usage: May need to store multiple model instances
Data Dependency: Results depend on random data splitting
Applications:
Model Selection: Comparing different algorithms
Hyperparameter Tuning: Finding optimal parameter values
Feature Selection: Evaluating feature importance
Performance Estimation: Getting reliable accuracy estimates
Real-life Examples:
Medical Research: Validating diagnostic models across patient groups
Financial Analysis: Testing trading strategies on different time periods
Image Recognition: Evaluating CNN performance across image datasets
Natural Language Processing: Validating sentiment analysis models
Definition:
Dataset splitting is the process of dividing the available data into separate
subsets to train, validate, and test machine learning models effectively.
Definition:
The training dataset is larger than the testing dataset because the model
needs sufficient data to learn patterns effectively, while enough data must
still be held back for an unbiased evaluation.
Recommended Ratios:
80:20 Rule: 80% training, 20% testing (most common)
70:30 Rule: 70% training, 30% testing (for smaller datasets)
60:20:20 Rule: 60% training, 20% validation, 20% testing (with
validation set)
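A small sketch of these ratios with train_test_split; the synthetic 1000-sample dataset is an assumption:

```python
# Sketch of the common split ratios with train_test_split.
# The synthetic data and the 60:20:20 split are assumed for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 80:20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 60:20:20 split: first hold out 20% for test, then take 25% of the rest for validation.
X_tmp, X_test2, y_tmp, y_test2 = train_test_split(X, y, test_size=0.2, random_state=0)
X_train2, X_val, y_train2, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))                 # 800 200
print(len(X_train2), len(X_val), len(X_test2))   # 600 200 200
```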
Steps:
1. Divide Dataset: Split data into k equal folds (typically k=5 or k=10)
2. Iterative Training: Use k-1 folds for training, 1 fold for validation
3. Rotation: Repeat process k times, each time using different fold for
validation
4. Average Results: Calculate mean performance across all k iterations
Example:
For k=5, dataset divided into 5 parts. In iteration 1, parts 1-4 train, part 5
validates. In iteration 2, parts 1,2,3,5 train, part 4 validates, and so on.
Q.11 What is the need for dimensionality reduction? Explain the concept of
the Curse of Dimensionality [5]
Definition:
Dimensionality reduction is the process of reducing the number of features
or variables in a dataset while retaining important information for
machine learning tasks.
Curse of Dimensionality:
Effects:
Data Sparsity: In high dimensions, data points become far apart from
each other
Distance Concentration: All distances between points become nearly
equal
Increased Complexity: More features require more computational
resources
Sample Size Requirements: Need exponentially more samples for same
density
Mathematical Example:
In 1D: 10 points cover space well
In 2D: Need 100 points for same coverage
In 3D: Need 1000 points for same coverage
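In general, roughly 10^d points are needed in d dimensions for the same coverage. The sketch below illustrates the related distance-concentration effect on random uniform data (an assumed example):

```python
# Sketch of distance concentration: as dimensions grow, the nearest and farthest
# neighbours become almost equally far apart. Random uniform data is assumed.
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points in d dimensions
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative spread of distances = {spread:.3f}")
```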
Q.12 State and justify Real life applications of supervised and unsupervised
learning. (CO1). [4]
Definition:
Supervised learning uses labeled data for prediction, while unsupervised
learning finds hidden patterns in unlabeled data.
Medical Diagnosis:
Justification: Trained on patient data with known diagnoses
Input: Symptoms, test results, patient history
Output: Disease prediction or health status
Customer Segmentation:
Justification: Groups customers without predefined categories
Input: Purchase history, demographics, behavior patterns
Output: Customer clusters for targeted marketing
Anomaly Detection:
Justification: Identifies unusual patterns without labeled anomalies
Input: Network traffic, financial transactions
Output: Fraud detection, security threats
Definition:
Traditional programming uses explicit instructions to solve problems, while
machine learning learns patterns from data to make predictions or
decisions.
Key Differences:
Problem Solving:
Traditional: Programmer defines solution steps
ML: Algorithm discovers solution patterns
Rule Creation:
Traditional: Manually coded rules
ML: Automatically learned rules
Adaptability:
Traditional: Fixed behavior, needs code updates
ML: Adapts to new data patterns
Complexity Handling:
Traditional: Difficult for complex pattern recognition
ML: Excels at complex pattern identification
Examples:
Traditional Programming: Calculator, simple database queries, basic
sorting
Machine Learning: Image recognition, speech processing,
recommendation systems
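A tiny sketch contrasting the two approaches on spam filtering; the hand-written keyword rule, the four example messages, and the naive Bayes model are all made-up illustrations:

```python
# Sketch contrasting traditional programming and machine learning on spam filtering.
# The keyword rule and the tiny training set are made-up assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Traditional programming: the rule is written by hand.
def rule_based_spam(text):
    return any(word in text.lower() for word in ("free", "winner", "prize"))

# Machine learning: the rule is learned from labeled examples.
texts = ["free prize inside", "meeting at noon", "you are a winner", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                                   # 1 = spam, 0 = not spam
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

msg = "claim your free prize"
print("Rule-based:", rule_based_spam(msg))
print("Learned   :", clf.predict(vec.transform([msg]))[0])
```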
Q.14 Explain K-fold Cross Validation technique with suitable example [5]
Definition:
K-fold Cross Validation is a statistical method used to evaluate machine
learning models by dividing the dataset into k subsets and iteratively
training and testing the model.
Detailed Example:
See the worked 5-fold example under Q.8 above: the data is split into 5 equal folds, each fold serves once as the test set while the remaining folds train the model, and the final score is the average of the 5 per-fold scores.
Common K Values:
k=5: Good balance between bias and variance
k=10: More thorough evaluation, higher computational cost
k=n (Leave-One-Out): Maximum data utilization, very expensive
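A short sketch comparing k=5 and k=10 on the same model; the wine dataset and decision tree are illustrative assumptions:

```python
# Sketch comparing common k values on the same model and data.
# The wine dataset and decision tree are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"k={k:2d}: mean accuracy = {scores.mean():.3f} (std {scores.std():.3f})")
```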
Definition:
A dataset is a structured collection of data organized in rows and
columns, where each row represents an observation and each column
represents a feature or attribute.
Dataset Components:
Features (Attributes): Independent variables used for prediction
Target (Label): Dependent variable to be predicted
Instances (Records): Individual data points or observations
Data Types: Numerical, categorical, text, images, etc.
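A small sketch of these components as a pandas DataFrame; the column names and values are made up for illustration:

```python
# Sketch of a small dataset with features, a target, and instances as rows.
# The values are made up for illustration.
import pandas as pd

data = pd.DataFrame({
    "age": [21, 35, 47],                   # numerical feature
    "city": ["Pune", "Mumbai", "Delhi"],   # categorical feature
    "income": [30000, 52000, 78000],       # numerical feature
    "defaulted": [0, 1, 0],                # target (label) to be predicted
})

X = data.drop(columns="defaulted")    # features (attributes)
y = data["defaulted"]                 # target (label)
print(data.shape)                     # (3 instances, 4 columns)
```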
Detailed Differences:
Training Dataset:
Function: Parameter learning and pattern recognition
Frequency: Used repeatedly during training iterations
Impact: Directly influences model weights and parameters
Overfitting Risk: Model may memorize training data patterns
Testing Dataset:
Function: Performance validation and generalization testing
Frequency: Used only once after training completion
Impact: Provides unbiased performance metrics
Generalization: Tests model's ability on unseen data
Definition:
Machine learning algorithms are categorized based on the type of data
and learning approach used during training.
Supervised Learning:
Characteristics:
Data Type: Labeled data (input-output pairs)
Learning Goal: Predict outcomes for new data
Feedback: Immediate feedback through known correct answers
Performance Measure: Accuracy, precision, recall
Types:
Classification: Predicts categories or classes
Regression: Predicts continuous numerical values
Examples:
Email Spam Detection: Input: email text, Output: spam/not spam
House Price Prediction: Input: size, location, Output: price value
Medical Diagnosis: Input: symptoms, Output: disease type
Unsupervised Learning:
Characteristics:
Data Type: Unlabeled data (only input features)
Learning Goal: Find hidden patterns or structures
Feedback: No direct feedback or correct answers
Performance Measure: Clustering quality, pattern discovery
Types:
Clustering: Groups similar data points
Association: Finds relationships between variables
Dimensionality Reduction: Reduces feature space
Examples:
Customer Segmentation: Group customers by buying behavior
Market Basket Analysis: "People who buy bread also buy butter"
Anomaly Detection: Identify unusual network activity
Semi-supervised Learning:
Characteristics:
Data Type: Mix of labeled and unlabeled data
Learning Goal: Improve performance using both data types
Feedback: Limited feedback from small labeled portion
Performance Measure: Better than unsupervised, close to supervised
Approach:
Small Labeled Set: Provides initial guidance
Large Unlabeled Set: Provides additional patterns
Combined Learning: Uses both for better performance
Examples:
Web Page Classification: Few labeled pages, many unlabeled pages
Speech Recognition: Limited transcribed audio, lots of raw audio
Image Recognition: Some labeled images, many unlabeled images
Comparison Summary
Learning Complexity:
Supervised: Medium
Unsupervised: High
Semi-supervised: High
Definition:
Dimensionality reduction is the process of reducing the number of features
in a dataset while preserving the most important information for machine
learning tasks.
Definition:
Subset selection is a feature selection technique that chooses the most
relevant subset of original features without transforming them.
Forward Selection:
Process: Start with empty set, add features one by one
Criterion: Add feature that improves performance most
Stopping: When no improvement or desired number reached
Backward Elimination:
Process: Start with all features, remove features one by one
Criterion: Remove feature that least affects performance
Stopping: When performance drops significantly
Bidirectional Selection:
Process: Combines forward and backward approaches
Criterion: Add or remove features based on performance
Stopping: When no beneficial additions or removals possible
Example:
Original Features: Age, Gender, Study Hours, Sleep Hours, Exercise Hours, Social Media Hours, Income, Location
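A sketch of forward selection and backward elimination using scikit-learn's SequentialFeatureSelector; the synthetic regression data stands in for the student features listed above and is an assumption:

```python
# Sketch of forward selection and backward elimination with SequentialFeatureSelector.
# The synthetic regression data is an assumed stand-in for the student features.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)

forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward").fit(X, y)

print("Forward selection kept features  :", forward.get_support(indices=True))
print("Backward elimination kept features:", backward.get_support(indices=True))
```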
Evaluation Metrics:
Accuracy: Model performance on validation set
Cross-validation Score: Average performance across folds
Information Criteria: AIC, BIC for model comparison
Statistical Tests: F-test, t-test for feature significance
Definition:
A feature is an individual measurable property or characteristic of an
object or phenomenon being observed in a dataset.
Feature Characteristics:
Measurable: Can be quantified or categorized
Relevant: Should contribute to the prediction task
Independent: Ideally not highly correlated with other features
Informative: Provides useful information about the target variable
Types of Features:
Numerical Features: Age, height, salary, temperature
1. Filter Methods:
Characteristics:
Independence: Evaluate features independently of machine learning
algorithm
Speed: Fast computation, suitable for large datasets
Approach: Use statistical measures to rank features
Examples:
Correlation Coefficient: Measures linear relationship with target
Chi-square Test: Tests independence between categorical variables
Information Gain: Measures reduction in entropy
Variance Threshold: Removes features with low variance
2. Wrapper Methods:
Characteristics:
Algorithm-dependent: Use specific machine learning algorithm for
evaluation
Accuracy: Better performance as they consider feature interactions
Examples:
Forward Selection: Start empty, add features iteratively
Backward Elimination: Start with all, remove features iteratively
Recursive Feature Elimination: Recursively eliminate features
3. Embedded Methods:
Characteristics:
Integrated: Feature selection integrated into model training
Efficiency: Combines benefits of filter and wrapper methods
Approach: Algorithm automatically selects relevant features
Examples:
LASSO Regression: Uses L1 regularization to eliminate features
Random Forest: Uses feature importance scores
Ridge Regression: Uses L2 regularization to reduce feature weights
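A compact sketch with one representative from each family; the synthetic classification data and the specific estimators are illustrative assumptions:

```python
# Sketch of filter, wrapper, and embedded feature selection on assumed synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import Lasso, LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Filter: rank features by a statistical score (ANOVA F-test), independent of any model.
filt = SelectKBest(f_classif, k=4).fit(X, y)

# Wrapper: recursive feature elimination driven by a specific estimator.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: L1 regularization zeroes out weights of irrelevant features during training.
emb = Lasso(alpha=0.05).fit(X, y)

print("Filter keeps  :", filt.get_support(indices=True))
print("Wrapper keeps :", wrap.get_support(indices=True))
print("Embedded keeps:", (emb.coef_ != 0).nonzero()[0])
```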
Comparison:
Filter methods: fast speed, good accuracy, low computational cost; best for large datasets and preprocessing