Lecture 1: "Introduction & EDA"
Machine Learning Overview
1. Definition:
o Machine learning allows computers to learn patterns from data
without being explicitly programmed (Arthur Samuel, 1959).
2. Types of Machine Learning:
o Supervised Learning:
Uses labeled data.
Examples: Classification (e.g., spam detection), Regression
(e.g., predicting house prices).
o Unsupervised Learning:
Uses unlabeled data.
Examples: Clustering (e.g., market segmentation),
Dimensionality Reduction, Anomaly Detection.
3. Applications:
o Fraud detection, recommendation systems, image recognition,
medical diagnostics.
CRISP-DM Framework
1. Phases:
o Business Understanding: Define business objectives and success
criteria.
o Data Understanding: Explore and verify data.
o Data Preparation: Clean and preprocess data.
o Modeling: Apply machine learning algorithms.
o Evaluation: Assess model performance using predefined metrics.
o Deployment: Use the model in a real-world application.
2. Key Insights:
o CRISP-DM is iterative, not sequential.
o Business and data mining goals must align.
Data Preparation
1. Steps:
o Data Cleaning:
Handle missing values (remove rows, impute with
mean/median).
o Feature Scaling:
Normalize (rescale to [0, 1]) or standardize (mean 0, variance 1)
numerical features so they are comparable in magnitude.
o Feature Selection:
Remove irrelevant or redundant features using:
Low variance threshold.
Statistical tests (e.g., ANOVA, χ2).
Recursive Feature Elimination (RFE).
o Data Balancing:
Handle imbalanced datasets using:
Oversampling: Duplicate minority class samples.
Undersampling: Remove majority class samples.
SMOTE: Generate synthetic data for the minority
class.
2. Why Important:
o Ensures data quality, improves model accuracy, and prevents bias.
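To make the data-balancing step above concrete, here is a minimal Python sketch using the imbalanced-learn package (an assumption: it must be installed separately); the dataset is a synthetic placeholder.

```python
# Minimal sketch: balancing an imbalanced dataset (assumes `pip install imbalanced-learn`).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Placeholder data: 90% / 10% class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Original class counts:", Counter(y))

# SMOTE: synthesize new minority-class samples by interpolating between neighbors.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_sm))

# Alternatives: duplicate minority samples or drop majority samples.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
```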
Exploratory Data Analysis (EDA)
1. Definition:
o The process of analyzing datasets to summarize their main
characteristics using visualizations and statistical techniques.
2. Goals:
o Discover patterns and relationships.
o Spot anomalies.
o Test hypotheses.
o Verify assumptions.
3. Steps:
o Identify data types: Numerical (discrete, continuous) vs. Categorical
(nominal, ordinal).
o Check for data quality: Missing values, duplicates, and
inconsistencies.
4. Types of EDA:
o Univariate Analysis:
Analyze individual variables (e.g., histograms, box plots).
o Bivariate Analysis:
Explore relationships between two variables (e.g., scatter
plots, correlation matrix).
o Multivariate Analysis:
Investigate interactions among multiple variables (e.g., 3D
scatter plots, contour plots).
Summary Statistics
1. Measures of Central Tendency:
o Mean: Average value (sensitive to outliers).
o Median: Middle value (more robust to outliers).
o Mode: Most frequent value (useful for categorical data).
2. Measures of Dispersion:
o Variance (σ2): Spread of data points around the mean.
o Standard Deviation (σ): Square root of variance.
o Percentiles: Value below which a given percentage of observations
fall (e.g., 75th percentile).
3. Data Frequency:
o Understand the frequency distribution of categorical or numerical
data.
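The measures above map directly onto pandas/NumPy calls; a minimal sketch on a made-up series with one outlier:

```python
# Minimal sketch: central tendency and dispersion with pandas / NumPy.
import numpy as np
import pandas as pd

values = pd.Series([2, 3, 3, 5, 7, 9, 50])  # 50 is an outlier

print("mean  :", values.mean())           # sensitive to the outlier
print("median:", values.median())         # robust to the outlier
print("mode  :", values.mode().tolist())  # most frequent value(s)
print("var   :", values.var(ddof=0))      # population variance (sigma^2)
print("std   :", values.std(ddof=0))      # sigma = sqrt(variance)
print("75th percentile:", np.percentile(values, 75))
print(values.describe())                  # count, mean, std, min, quartiles, max
```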
Data Visualizations
1. Univariate Visualizations:
o Histograms: Distribution of numerical data.
o Bar Plots: Frequency of categorical data.
o Box Plots: Summary of data distribution (min, Q1, median, Q3,
max) and outliers.
o Violin Plots: Combines box plots and kernel density estimation.
2. Bivariate Visualizations:
o Scatter Plots: Relationships between two variables.
o Heatmaps: Correlation matrix (Pearson’s R).
3. Multivariate Visualizations:
o Pair Plots: Matrix of scatter plots for all variable pairs.
o 3D Scatter Plots: Visualize relationships among three variables.
o Contour Plots: Density of data points in a multivariate setting.
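As an illustration, most of the plots above are one-liners in seaborn/matplotlib; a minimal sketch on a tiny made-up DataFrame (the column names are placeholders):

```python
# Minimal sketch: common univariate, bivariate and multivariate EDA plots.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "area": [50, 80, 120, 65, 90],
    "price": [150, 230, 340, 180, 260],
    "type": ["flat", "flat", "house", "flat", "house"],
})

sns.histplot(df["price"]); plt.show()                       # univariate: distribution
sns.boxplot(x="type", y="price", data=df); plt.show()       # distribution per category
sns.scatterplot(x="area", y="price", data=df); plt.show()   # bivariate relationship
sns.heatmap(df.corr(numeric_only=True), annot=True); plt.show()  # correlation matrix
sns.pairplot(df); plt.show()                                # multivariate: all pairs
```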
Key Exam Tips
1. Understand Definitions:
o Be clear on supervised vs. unsupervised learning.
o Know the steps of CRISP-DM and EDA.
2. Know Statistical Measures:
o Be able to calculate and interpret mean, median, variance, and
percentiles.
o Understand when to use normalization vs. standardization.
3. Be Familiar with Visualizations:
o Identify appropriate plots for univariate, bivariate, and multivariate
data.
4. Data Preparation:
o Understand the importance of cleaning, scaling, and balancing data.
5. Applications:
o Connect concepts to real-world examples (e.g., anomaly detection,
clustering for market segmentation).
Lecture 2: ANOVA, PCA, and Clustering
1. ANOVA (Analysis of Variance)
Purpose: Assess the impact of a feature on a target variable by comparing
group means.
Key Concepts:
1. F-Statistic Formula: F = MSB / MSW (ratio of between-group to within-group mean squares).
2. Steps to Perform ANOVA:
o Define Hypotheses:
H0: No significant difference between group means.
H1: At least one group mean differs.
o Compute Sum of Squares:
SSB: Variability between groups.
SSW: Variability within groups.
o Degrees of Freedom (df):
dfB = k − 1, dfW = N − k (where k = number of groups, N = total observations).
o Calculate Mean Squares:
MSB = SSB / dfB, MSW = SSW / dfW.
o Compute F-Statistic (F = MSB / MSW) and compare with the critical F-value.
o Decision: Reject H0 if p<0.05.
Applications:
Feature selection in supervised learning to identify relevant predictors.
Important Notes:
Features with low F-statistics contribute little and can be removed.
Assumes data normality and equal variance.
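A minimal sketch of a one-way ANOVA with SciPy; the three group arrays are made-up placeholders, and the commented lines show the same idea used for feature selection via scikit-learn's f_classif:

```python
# Minimal sketch: one-way ANOVA, and the same test used for feature selection.
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import SelectKBest, f_classif

group_a = np.array([23, 25, 27, 22, 26])   # placeholder group measurements
group_b = np.array([30, 31, 29, 33, 32])
group_c = np.array([24, 26, 25, 27, 23])

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# Reject H0 (all group means equal) if p < 0.05.

# Feature selection: keep the k features with the highest F-statistics.
# X (n_samples x n_features) and y (class labels) are placeholders:
# selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
# X_reduced = selector.transform(X)
```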
2. Principal Component Analysis (PCA)
Purpose: Reduce high-dimensional data to a lower-dimensional space while
retaining as much variance as possible.
Key Concepts:
1. Steps in PCA:
o Standardize Data:
Ensure mean = 0 and variance = 1 for all features.
o Covariance Matrix:
Captures relationships between variables.
o Eigen Decomposition:
Extract eigenvectors (principal components) and eigenvalues
(explained variance).
o Sort by Eigenvalues:
Largest eigenvalue corresponds to the principal component
explaining the most variance.
o Select Principal Components:
Use the "Elbow Method" to decide how many components to
retain.
2. Key Properties:
o Linear relationships between variables are essential.
o PCA is sensitive to outliers.
o Results are best with numeric, continuous variables.
3. Elbow Method:
o Plot variance explained by each principal component.
o Select the number of components before the curve levels off.
Applications:
Visualizing high-dimensional data.
Preprocessing for machine learning models to improve training efficiency.
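A minimal sketch of the PCA steps above with scikit-learn (standardize, fit, inspect explained variance for the elbow); the Iris data is just a placeholder:

```python
# Minimal sketch: PCA after standardization, with explained variance for the elbow plot.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # placeholder numeric dataset
X_std = StandardScaler().fit_transform(X)     # mean 0, variance 1 per feature

pca = PCA()                                   # keep all components for inspection
X_pca = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
# Elbow method: plot (cumulative) explained variance vs. component index and keep
# the components before the curve levels off, e.g. refit with PCA(n_components=2).
```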
3. Clustering
Purpose: Group data points into clusters based on similarity.
Key Concepts:
1. Types of Clustering:
o Hierarchical Clustering:
Builds a hierarchy of clusters, visualized with a dendrogram.
Methods:
Agglomerative (bottom-up): Start with single-point
clusters and merge.
Divisive (top-down): Start with one cluster and split.
Linkages:
Single: Closest point in each cluster.
Complete: Farthest point in each cluster.
Average: Average distance between points in
clusters.
Weakness: Computationally expensive for large datasets.
o K-Means Clustering:
Divides data into k non-overlapping clusters.
Steps:
1. Choose k (number of clusters).
2. Randomly initialize k centroids.
3. Assign points to nearest centroid.
4. Recalculate centroids and repeat until convergence.
Selecting k:
Elbow Method: Analyze Within-Cluster Sum of
Squares (WCSS) to find the optimal k.
Silhouette Analysis: Measures how similar a point is
to its cluster compared to other clusters.
2. Distance Metrics:
o Euclidean distance (most common).
o Manhattan, cosine similarity for specific use cases.
3. Key Differences:
o Hierarchical is more interpretable (via dendrogram) but
computationally intensive.
o K-Means is faster and better for larger datasets but requires
predefined k.
Applications:
Customer segmentation.
Anomaly detection.
Document clustering.
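A minimal sketch of both clustering families on placeholder data: SciPy's linkage/dendrogram for hierarchical clustering and scikit-learn's KMeans for partitioning.

```python
# Minimal sketch: hierarchical clustering (dendrogram) and K-Means on synthetic data.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=42)   # placeholder data

# Agglomerative (bottom-up) clustering with average linkage, shown as a dendrogram.
Z = linkage(X, method="average")   # other linkages: "single", "complete"
dendrogram(Z)
plt.show()

# K-Means: choose k, assign points to the nearest centroid, update centroids until convergence.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster labels:", kmeans.labels_[:10])
print("WCSS (inertia):", kmeans.inertia_)
```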
Exam Tips
1. Understand and Calculate:
o F-statistic for ANOVA.
o Covariance matrix and eigen decomposition in PCA.
o Distance metrics and linkage methods in clustering.
2. Visual Interpretation:
o Elbow and silhouette plots.
o Dendrograms for hierarchical clustering.
3. Compare and Contrast:
o Hierarchical vs. K-Means.
o PCA vs. Feature Selection.
4. Applications:
o Be ready to explain when and why to use ANOVA, PCA, or clustering
in real-world scenarios.
Lecture 3: Classification
1. Introduction to Classification
Definition:
o Classification is a supervised learning task that predicts a
categorical label based on input features.
o Example: Predicting whether a tax return is fraudulent (Yes/No).
General Steps:
1. Split the dataset into training and test sets.
2. Use the training set (with labels) to build a classification model.
3. Evaluate the model using the test set.
4. Apply the model to new, unlabeled data.
2. Simple Classifiers
K-Nearest Neighbors (KNN):
How it works:
o Classify a new instance based on the majority vote of its k-nearest
neighbors.
Key Features:
o Requires a distance function (e.g., Euclidean or Manhattan).
o No explicit training phase.
o Weakness: Not suitable for imbalanced datasets.
Linear Classifier (Logistic Regression):
How it works:
o Models the conditional probability P(Y∣X) to predict labels.
Key Features:
o Suitable for linearly separable data.
o Simple and interpretable.
3. Model Evaluation
Confusion Matrix:
Summarizes prediction results:
o True Positive (TP): Correctly predicted positive instances.
o True Negative (TN): Correctly predicted negative instances.
o False Positive (FP): Negative instances incorrectly classified as
positive.
o False Negative (FN): Positive instances incorrectly classified as
negative.
Evaluation Metrics
1. Accuracy: (TP + TN) / (TP + TN + FP + FN); can be misleading on imbalanced data.
2. Precision: TP / (TP + FP); fraction of predicted positives that are truly positive.
3. Recall (TPR): TP / (TP + FN); fraction of actual positives that are found.
4. F1-Score: Harmonic mean of precision and recall, 2 · (Precision · Recall) / (Precision + Recall).
5. ROC Curve:
Plots True Positive Rate (TPR) vs. False Positive Rate (FPR) at various
thresholds.
AUC-ROC:
1. Area under the ROC curve.
2. AUC = 0.5: Random guessing.
3. AUC= 1.0: Perfect classifier.
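A minimal sketch computing the confusion-matrix metrics and AUC above with scikit-learn; the label and score arrays are placeholder values:

```python
# Minimal sketch: confusion matrix, precision/recall/F1 and AUC-ROC with scikit-learn.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                         # placeholder ground truth
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]                         # placeholder hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]   # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))   # 0.5 = random, 1.0 = perfect
```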
4. Overfitting and Underfitting
Underfitting:
Model is too simple and fails to capture patterns in the data.
Symptoms:
o High bias error.
o Poor performance on both training and test data.
Overfitting:
Model is too complex and captures noise in the training data.
Symptoms:
o High variance error.
o Excellent performance on training data but poor on test data.
5. Cross-Validation Techniques
Holdout Validation:
Reserve a portion of the data for testing (e.g., 2/3 training, 1/3 testing).
k-Fold Cross-Validation:
Split data into k folds.
Train on k-1 folds and test on the remaining fold, repeating k times.
Average results.
Stratified Cross-Validation:
Ensures that each fold has the same class distribution as the original
dataset.
Useful for imbalanced datasets.
Leave-One-Out Cross-Validation (LOOCV):
Use one sample as the test set and the rest for training.
Repeat for every sample.
Monte Carlo Cross-Validation:
Randomly split data into training and testing sets multiple times.
Time Series Cross-Validation:
Use a sliding window or growing dataset for sequential time-based splits.
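A minimal sketch of k-fold and stratified k-fold cross-validation with scikit-learn; the classifier and dataset are placeholders:

```python
# Minimal sketch: k-fold vs. stratified k-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)     # placeholder dataset
model = LogisticRegression(max_iter=5000)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # preserves class ratios

print("k-fold accuracy     :", cross_val_score(model, X, y, cv=kf).mean())
print("stratified accuracy :", cross_val_score(model, X, y, cv=skf).mean())
# LOOCV: cv=LeaveOneOut(); time series: cv=TimeSeriesSplit(n_splits=5).
```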
6. Validation Framework
Training Set:
o Used to train the model.
Validation Set:
o Used for hyperparameter tuning and model selection.
Test Set:
o Used for final evaluation; must remain untouched during training.
7. Hyperparameter Tuning
Grid Search:
o Tests all combinations of hyperparameters exhaustively.
Randomized Search:
o Tests a random subset of hyperparameter combinations.
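A minimal sketch contrasting grid search and randomized search; the KNN parameter grid is an arbitrary example:

```python
# Minimal sketch: exhaustive grid search vs. randomized search over a KNN parameter grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # placeholder dataset
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}

# Grid search: tries all 10 combinations.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
print("Grid search best params  :", grid.best_params_)

# Randomized search: samples only n_iter combinations.
rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid,
                          n_iter=5, cv=5, random_state=42).fit(X, y)
print("Randomized search params :", rand.best_params_)
```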
8. Complex Classifications
Multiclass Classification:
Predicts among n classes.
Strategies:
o One-vs-Rest (OvR):
Train one classifier per class.
o One-vs-One (OvO):
Train classifiers for each pair of classes.
Multilabel Classification:
Each instance can belong to multiple classes.
Example: Movie genres like horror, action, and comedy.
Multioutput Classification:
Predicts multiple outputs simultaneously.
Example: Predicting the type and color of a fruit.
Exam Tips
1. Know the Metrics:
o Be able to calculate precision, recall, F1-score, and interpret AUC-
ROC.
2. Understand Model Behavior:
o Differentiate between overfitting and underfitting and know how to
address them.
3. Cross-Validation:
o Understand when to use k-fold, stratified, or LOOCV.
4. Classifier Selection:
o Recognize scenarios for using KNN, Logistic Regression, or more
complex classifiers.
5. Applications:
o Be prepared to explain trade-offs like false positives vs. false
negatives in real-world examples.
Lecture 4: Regression
1. What is Regression?
Definition:
o Regression estimates relationships between a numerical dependent
variable (outcome) and one or more independent variables
(predictors/features).
o It is a supervised learning task.
Examples:
o Predicting house prices.
o Estimating salary based on years of experience.
2. Types of Regression
1. Simple Linear Regression:
o Models the relationship between one independent variable (x) and
one dependent variable (y).
2. Multiple Linear Regression:
o Models relationships between one dependent variable and multiple
independent variables.
3. Polynomial Regression:
o Extends linear regression to model non-linear relationships by
including powers of the independent variable.
3. Key Concepts in Regression
Metrics for Model Evaluation:
1. Mean Absolute Error (MAE):
o Average absolute difference between predicted and actual values.
2. Mean Squared Error (MSE):
o Penalizes larger errors by squaring differences.
3. Root Mean Squared Error (RMSE):
o Square root of MSE, interpretable in original units.
4. R-squared (R^2):
o Proportion of variability in the dependent variable explained by the
model.
o RSS: Residual Sum of Squares.
o TSS: Total Sum of Squares.
o R^2 ranges from 0 to 1 (closer to 1 is better).
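A minimal sketch computing MAE, MSE, RMSE, and R² with scikit-learn; the value arrays are placeholders:

```python
# Minimal sketch: regression error metrics with scikit-learn / NumPy.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([200, 250, 300, 350, 400])   # placeholder actual values
y_pred = np.array([210, 240, 320, 330, 390])   # placeholder predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                             # back in the original units
r2 = r2_score(y_true, y_pred)                   # 1 - RSS/TSS

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
```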
Assumptions in Linear Regression:
1. Linearity: Relationship between independent and dependent variables is
linear.
2. Normality: Residuals are normally distributed.
3. Homoscedasticity: Variance of residuals is constant.
4. Independence: Residuals are independent.
4. Bias-Variance Trade-off
1. Bias:
o Error due to overly simplistic assumptions.
o High bias leads to underfitting.
2. Variance:
o Error due to excessive sensitivity to training data.
o High variance leads to overfitting.
3. Irreducible Error:
o Noise inherent in the data that cannot be eliminated.
4. Trade-off:
o Increasing model complexity reduces bias but increases variance,
and vice versa.
5. Feature Selection
1. Forward Stepwise Selection:
o Starts with no predictors and sequentially adds the most significant
predictor.
o Stop when adding predictors no longer improves the model
significantly.
2. Backward Stepwise Selection:
o Starts with all predictors and removes the least significant one
iteratively.
3. Considerations:
o Can lead to overfitting with many features.
o Use domain knowledge or cross-validation for robust feature
selection.
6. Regularization in Regression
Purpose: Reduces overfitting by adding a penalty term to the regression
model.
Techniques:
1. Lasso Regression (L1 Regularization):
o Adds a penalty proportional to the absolute values of coefficients.
o Encourages sparsity by setting some coefficients to zero (automatic
feature selection).
2. Ridge Regression (L2 Regularization):
o Adds a penalty proportional to the square of the coefficients.
o Shrinks coefficients but does not set them to zero.
3. Elastic Net:
o Combines L1 and L2 penalties.
o Controlled by a hyperparameter α to weight the contributions
of L1 and L2.
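A minimal sketch of the three regularized regressions in scikit-learn; the alpha values are arbitrary starting points that would normally be tuned by cross-validation, and the data is synthetic:

```python
# Minimal sketch: Lasso (L1), Ridge (L2) and Elastic Net on a synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)        # L1: can zero out coefficients (feature selection)
ridge = Ridge(alpha=1.0).fit(X, y)        # L2: shrinks coefficients, never exactly zero
enet  = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

print("Lasso zero coefficients      :", (lasso.coef_ == 0).sum())
print("Ridge zero coefficients      :", (ridge.coef_ == 0).sum())
print("Elastic Net zero coefficients:", (enet.coef_ == 0).sum())
```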
7. Practical Insights
Correlation Analysis:
o Correlation (r) measures linear relationships, ranging from -1
(negative) to +1 (positive).
o High correlation between predictors may cause multicollinearity,
reducing model interpretability.
Multicollinearity:
o High correlation among independent variables can distort
regression estimates.
o Solutions:
Remove one of the correlated variables.
Use regularization (e.g., Ridge, Lasso).
Regularization vs. Stepwise Selection:
o Regularization is better for large datasets and avoids overfitting.
o Stepwise methods require more computation and are prone to
overfitting in complex datasets.
Exam Tips
1. Know the Formulas:
o Be ready to compute MAE, MSE, RMSE, and R^2.
2. Understand Assumptions:
o Identify violations of linear regression assumptions.
3. Trade-offs:
o Explain the bias-variance trade-off and its implications for model
complexity.
4. Feature Selection:
o Differentiate between forward, backward selection, and
regularization.
5. Regularization:
o Recognize when to use Lasso, Ridge, or Elastic Net.
Lecture 5: Decision Trees and Ensemble Learning
1. Decision Trees
Definition:
o Non-parametric supervised learning algorithm for classification and
regression.
o Uses hierarchical, rule-based splitting of data based on features.
Key Terminology:
Root Node: The top node representing the entire dataset.
Decision Node: Splits data into sub-nodes.
Leaf Node: Terminal node with a predicted label.
Splitting: Dividing data based on a feature.
Pruning: Removing subtrees to prevent overfitting.
Procedure:
1. Place the most important attribute at the root.
2. Split the dataset into subsets based on the attribute.
3. Repeat for each subset until pure subsets (leaf nodes) are reached.
Key Measures:
Entropy: Measures impurity or randomness; H(S) = −Σ pᵢ log₂ pᵢ over the class proportions pᵢ.
Information Gain (IG): IG(S, A) = H(S) − Σᵥ (|Sᵥ| / |S|) H(Sᵥ), the reduction in entropy achieved by splitting S on attribute A.
Overcoming Overfitting:
Pruning:
o Pre-Pruning: Stop early based on criteria (e.g., max depth, min
gain).
o Post-Pruning: Replace subtrees with leaves based on validation
set performance.
Reduced Error Pruning:
o Iteratively remove nodes, ensuring no accuracy loss on validation
data.
Advantages:
1. Simple and interpretable.
2. No need for feature scaling or normalization.
3. Handles both classification and regression tasks.
Disadvantages:
1. Prone to overfitting.
2. Greedy algorithm might miss optimal solutions.
3. Less suitable for complex datasets.
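To support the "calculate entropy and information gain" exam tip, here is a minimal sketch of both measures on a toy binary split; the labels and the split itself are made up:

```python
# Minimal sketch: entropy and information gain for a candidate split.
import numpy as np

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = H(parent) - sum_v (|S_v| / |S|) * H(S_v) over the child subsets."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])   # toy class labels
left, right = parent[:6], parent[6:]                 # toy split on some feature
print("H(parent) =", round(entropy(parent), 3))
print("IG(split) =", round(information_gain(parent, [left, right]), 3))
```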
2. Ensemble Learning
Definition:
o Combines predictions of multiple models (weak learners) to form a
stronger model.
Goal:
o Reduce bias, variance, or improve prediction accuracy.
Techniques:
1. Voting:
o Combines predictions of different models.
o Hard Voting: Majority class wins.
o Soft Voting: Averages predicted probabilities.
2. Bagging (Bootstrap Aggregating):
o Trains multiple models on bootstrap samples (sampling with
replacement).
o Predictions are aggregated (e.g., majority vote for classification).
o Examples:
Random Forest:
Uses multiple decision trees.
Features are randomly selected at each split,
improving diversity and reducing overfitting.
3. Boosting:
o Sequentially trains models, focusing on misclassified instances.
o Examples:
AdaBoost:
Adjusts weights for misclassified data in subsequent
iterations.
Gradient Boosting:
Builds models sequentially, minimizing errors using
gradient descent.
XGBoost:
Optimized version of Gradient Boosting with
regularization, parallelism, and handling of missing
values.
4. Stacking:
o Combines predictions from multiple models using a meta-model.
o Meta-model learns the best way to combine base model predictions.
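A minimal sketch of the four ensemble styles using scikit-learn only (XGBoost lives in a separate package); the base models and dataset are placeholders:

```python
# Minimal sketch: voting, bagging (Random Forest), boosting and stacking in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset

models = {
    "voting": VotingClassifier([("lr", LogisticRegression(max_iter=5000)),
                                ("dt", DecisionTreeClassifier())], voting="hard"),
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=42),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=42),
    "stacking": StackingClassifier([("rf", RandomForestClassifier(random_state=42)),
                                    ("dt", DecisionTreeClassifier())],
                                   final_estimator=LogisticRegression(max_iter=5000)),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```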
3. Comparison of Ensemble Methods
| Criteria         | Bagging                    | Boosting                             | Stacking                      |
|------------------|----------------------------|--------------------------------------|-------------------------------|
| Goal             | Decrease variance          | Decrease bias                        | Improve predictions           |
| Model Training   | Parallel                   | Sequential                           | Parallel                      |
| Model Dependency | Independent                | Dependent                            | Independent                   |
| Complexity       | Moderate                   | High                                 | High                          |
| Examples         | Random Forest, Extra Trees | AdaBoost, Gradient Boosting, XGBoost | Diverse models and meta-model |
4. Random Forest vs. XGBoost
Random Forest:
o Pros:
Handles high-dimensional data.
Robust to noise and overfitting.
o Cons:
Less effective with imbalanced data.
May overfit if trees are too deep.
XGBoost:
o Pros:
Regularization reduces overfitting.
Faster due to parallel processing.
o Cons:
Complex to tune.
Requires careful handling of hyperparameters.
Exam Tips
1. Decision Trees:
o Know how to calculate entropy and information gain.
o Understand pruning techniques and overfitting prevention.
2. Ensemble Learning:
o Compare bagging, boosting, and stacking.
o Be familiar with Random Forest and XGBoost features and
differences.
3. Metrics:
o Understand the bias-variance tradeoff and how ensemble methods
address it.
Lecture 6: Support Vector Machines (SVMs)
1. What Are SVMs?
SVMs are supervised learning models used for classification and
regression.
Aim: Find the optimal hyperplane that separates classes with the
maximum margin.
2. Key Terminology
Hyperplane: A decision boundary separating data points of different
classes.
Support Vectors: Data points closest to the hyperplane; they define the
margin.
Margin: The perpendicular distance between the hyperplane and the
nearest support vectors. SVM seeks to maximize this.
3. Linear vs. Non-Linear SVM
1. Linear SVM:
o Works when data is linearly separable.
o Equation of the hyperplane: w · x + b = 0 (the predicted class is the sign of w · x + b).
2. Non-Linear SVM:
o For non-linear datasets, uses the kernel trick to map data to a
higher-dimensional space where it becomes linearly separable.
o Kernel Functions: Linear, Polynomial, RBF (Gaussian), Sigmoid.
4. Hard Margin vs. Soft Margin
1. Hard Margin:
o No data points are allowed inside the margin.
o Requires perfectly separable data, prone to overfitting if there’s
noise.
2. Soft Margin:
o Allows some misclassifications using slack variables (ξi).
C: Trade-off parameter controlling margin size and misclassification
penalty.
Large C: Smaller margin, less misclassification (risk of
overfitting).
Small C: Larger margin, more misclassification (risk of
underfitting).
5. Optimization Objective
Maximizes the margin while ensuring correct classification: minimize ½‖w‖² (plus C Σ ξᵢ with a soft margin) subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0.
6. The Kernel Trick
Maps data to a higher-dimensional space without explicitly computing the
transformation.
Key idea: a kernel computes K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ), the inner product in the higher-dimensional space, directly from the original features.
Popular kernels:
o Linear, Polynomial, RBF, Sigmoid.
7. Multiclass Classification with SVM
SVM is inherently binary. Common approaches for multiclass problems:
o One-vs-One: Pairwise classifiers, majority voting.
o One-vs-All: Separate classifier for each class vs. the rest.
o Modifying the objective function.
8. Advantages and Disadvantages
Advantages:
Effective in high-dimensional spaces.
Robust to overfitting in small, clean datasets.
Customizable using kernel functions.
Disadvantages:
Computationally expensive for large datasets.
Difficult to interpret results (non-probabilistic).
Requires careful tuning of parameters (e.g., C, kernel).
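A minimal sketch of a soft-margin SVM with an RBF kernel; the C and gamma values are arbitrary starting points that would normally be tuned by grid search, and the dataset is a placeholder:

```python
# Minimal sketch: soft-margin SVM with an RBF kernel; feature scaling matters for SVMs.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Large C -> narrower margin, fewer misclassifications (overfitting risk);
# small C -> wider margin, more slack (underfitting risk).
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print("Test accuracy:", svm.score(X_te, y_te))
# Other kernels: kernel="linear", "poly" (with degree=...), "sigmoid".
```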
9. Practical Tip for Exams
1. Understand the optimization objective and constraints.
2. Be ready to explain kernel functions and their use cases.
3. Know the difference between hard and soft margins.
4. Be prepared to calculate or interpret C's effect on the margin.
5. Understand the multiclass strategies (e.g., one-vs-one).
Lecture 7: Artificial Neural Networks (ANNs)
1. Definition and Structure of Artificial Neural Networks (ANNs)
What are ANNs?
o A supervised learning model used for classification and
regression.
o Structure inspired by the human brain, with artificial "neurons"
mimicking biological ones.
o Key characteristic: Layers of interconnected neurons.
Structure:
o Input Layer: Receives raw data (e.g., features or variables).
o Hidden Layers:
Process data using weights, biases, and activation functions.
Nodes in these layers transform inputs into nonlinear outputs.
o Output Layer:
Produces final predictions (e.g., classes or regression values).
Number of nodes corresponds to the number of output
variables.
Layer Depth:
o Single hidden layer: Suitable for simpler tasks.
o Multiple hidden layers: Handle complex problems but risk
overfitting.
o Tradeoff: Adding layers increases computational cost without
always improving accuracy.
2. Biological vs. Artificial Neurons
Biological Neuron Components:
o Dendrites: Input channels.
o Axon: Output channel.
o Synapses: Connection points that vary in strength
(excitatory/inhibitory).
Artificial Neuron Analogy:
o Inputs correspond to dendrites.
o A weighted sum of inputs is calculated, with bias (similar to synapse
strength).
o Activation Function: Determines if the neuron “fires” (produces
output).
3. Perceptron Basics
What is a Perceptron?
o The simplest ANN model for binary classification.
o Models a decision boundary using weights and a threshold.
Key Components:
o Weights (w): Importance of each input.
o Threshold (θ): Determines the boundary for activation.
o Output: Binary result (1 or 0).
Activation Function in Perceptron:
o Threshold acts as a decision boundary.
Perceptron Training Process
1. Initialization:
o Randomly initialize weights (w) and thresholds (θ).
2. For Each Training Instance:
o Compute the actual output (y) using y = 1 if Σ wᵢxᵢ ≥ θ, otherwise y = 0.
o Compare predicted output (y) with desired output (d).
o Adjust weights using wᵢ ← wᵢ + η (d − y) xᵢ.
o η: Learning rate, controls the step size for weight updates.
3. Repeat until convergence or acceptable performance.
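A minimal sketch of the training loop above, applied to the logical AND function; the learning rate, initialization range, epoch cap, and the separate threshold update are arbitrary implementation choices:

```python
# Minimal sketch: perceptron learning rule on the logical AND function.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
d = np.array([0, 0, 0, 1])                       # desired outputs (AND)

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=2)               # random initial weights
theta = rng.uniform(-0.5, 0.5)                   # threshold
eta = 0.1                                        # learning rate

for epoch in range(100):
    errors = 0
    for x_i, d_i in zip(X, d):
        y_i = 1 if np.dot(w, x_i) >= theta else 0   # step activation
        if y_i != d_i:
            w += eta * (d_i - y_i) * x_i            # w_i <- w_i + eta * (d - y) * x_i
            theta -= eta * (d_i - y_i)              # equivalent bias/threshold update
            errors += 1
    if errors == 0:                                 # converged (AND is linearly separable)
        break

print("weights:", w, "threshold:", theta)
print("predictions:", [1 if np.dot(w, x) >= theta else 0 for x in X])
```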
4. Activation Functions
Purpose of Activation Functions:
o Introduce non-linearity in the model, enabling it to learn complex
relationships.
o Without activation, a neural network reduces to linear regression.
Types of Activation Functions:
1. Step Function:
o Binary output: 1 if f(z)>0, otherwise 0.
o Suitable for perceptron-like models.
2. ReLU (Rectified Linear Unit):
o Outputs 0 for negative inputs, otherwise outputs the input.
o Pros: Faster convergence, avoids saturation.
o Commonly used in modern deep networks.
3. Sigmoid:
o S-shaped curve that outputs values between 0 and 1.
o Suitable for probabilistic outputs or multilabel classification.
o Cons: Computationally expensive, prone to vanishing gradients.
4. Tanh (Hyperbolic Tangent):
o Outputs values between -1 and 1.
o Advantage: Zero-centered, simplifies optimization compared to
sigmoid.
Choosing the Right Activation Function:
Depends on the task:
o Sigmoid/Tanh for smooth decision boundaries.
o ReLU for faster training and sparse activations.
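A minimal sketch of the four activation functions as NumPy one-liners, evaluated on a few sample inputs:

```python
# Minimal sketch: common activation functions in NumPy.
import numpy as np

def step(z):    return np.where(z > 0, 1, 0)    # binary threshold (perceptron-style)
def relu(z):    return np.maximum(0, z)         # 0 for negatives, identity otherwise
def sigmoid(z): return 1 / (1 + np.exp(-z))     # squashes to (0, 1)
def tanh(z):    return np.tanh(z)               # squashes to (-1, 1), zero-centered

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (step, relu, sigmoid, tanh):
    print(fn.__name__, np.round(fn(z), 3))
```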
5. Feedforward and Recurrent Networks
Feedforward Neural Networks:
o Information flows one way (input → hidden → output).
o Suitable for static, non-sequential data (e.g., image classification).
Recurrent Neural Networks (RNNs):
o Incorporate feedback connections to handle sequential data.
o Temporal Memory: Use previous time steps to predict future
steps.
o Applications: Speech recognition, video processing, time-series
forecasting.
Special Case: Elman Networks (RNN Type):
o Include context layers for short-term memory.
o Improve temporal dependency handling in time-series problems.
6. Perceptron Convergence Theorem
Statement:
o If training data is linearly separable, and the learning rate is
sufficiently small, the perceptron algorithm will converge.
o Implication: Weights will stop updating after finite iterations.
Limitations:
o Fails for non-linearly separable data.
o Requires careful tuning of the learning rate (η).
7. ANN Advantages and Disadvantages
Advantages:
Human Brain Emulation: Mimics neuron interconnections for learning.
Flexible Tasks: Can perform both classification and regression.
Adaptability: Handles a wide range of problems.
Disadvantages:
Overfitting Risk: Complex networks memorize training data instead of
generalizing.
Tuning Complexity: Requires careful adjustment of hyperparameters
(e.g., learning rate).
Data Dependency: Performance is highly dependent on the quality and
quantity of data.
8. Practical Notes for Exam
1. Understand pseudocode for perceptron training and its application in
logical functions (e.g., AND/OR operations).
2. Familiarize yourself with decision boundaries and linear separability
concepts.
3. Know when to use specific activation functions (ReLU vs. Sigmoid).
4. Review examples of feedforward and recurrent neural networks.
5. Be prepared for overfitting-related questions: explain monitoring and
regularization techniques (e.g., validation sets, dropout).
9. Applications of Neural Networks
Image recognition (e.g., classifying objects in pictures).
Text generation (e.g., language modeling).
Time-series predictions (e.g., stock forecasting, weather).
Notes from quiz:
Feature ENGINEERING is the process of using domain knowledge to extract
features from raw data via data mining techniques.
Feature SELECTION is the process of selecting a subset of relevant features for
use in model construction.
Applying an ML model to an imbalanced dataset can introduce serious issues with
regard to ethics, bias, etc.
OVERSAMPLING technique randomly duplicates examples in the minority class.
UNDERSAMPLING removes the majority class samples.
Univariate feature selection DOES NOT TRY every possible combination of
features and chooses the best performing model. It evaluates each feature
independently to determine its relationship with the target variable.
Main steps of data preparation:
1. Data cleaning
2. Feature scaling
3. Feature selection
4. Data balancing
Main differences between Structured and Unstructured data:
1. Structured Data
Definition: Data that is organized in a defined format, usually in rows and
columns, making it easy to search and analyze using traditional tools like
SQL.
Storage: Stored in relational databases (RDBMS) or spreadsheets.
Analysis: Easily analyzed using structured query languages and
traditional analytics software.
Examples:
o Customer database: A table containing customer IDs, names,
ages, and email addresses.
o Sales records: An Excel file with columns for transaction dates,
product names, quantities, and prices.
2. Unstructured Data
Definition: Data that does not have a pre-defined format or organization,
making it more difficult to store and analyze using traditional methods.
Storage: Often stored in data lakes, NoSQL databases, or specialized file
systems.
Analysis: Requires advanced tools, such as Natural Language Processing
(NLP) or machine learning, to derive insights.
Examples:
o Social media posts: Texts, images, and videos from platforms like
Twitter or Instagram.
o Emails: Free-form text with attachments or embedded links.
This Python code defines a function build_pipeline that constructs a machine
learning pipeline using scikit-learn. The pipeline handles preprocessing for both
categorical and numerical features and integrates a given regressor for model
training and prediction.
Categorical Processing:
Uses OneHotEncoder to encode categorical variables into numerical
values.
The sparse_output=False ensures that the output is a dense matrix.
The handle_unknown='ignore' avoids errors during transformation if new
categories appear in the test data.
Numerical Processing:
Uses SimpleImputer with a strategy of mean to handle missing values in
numerical data
Input Data
├─ Categorical columns → OneHotEncoder
└─ Numerical columns → SimpleImputer (mean) → optional StandardScaler; remaining columns passed through
↓
ColumnTransformer (preprocess step)
↓
Combined with the regressor ('reg') in a Pipeline
↓
Final machine learning pipeline
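The code itself is not reproduced in these notes, so below is a minimal sketch of what such a build_pipeline function might look like, using only the components named above; the parameter names (regressor, cat_cols, num_cols, scale_numeric) and the remainder='passthrough' choice are assumptions, not the original implementation.

```python
# Minimal sketch of a build_pipeline-style helper (parameter names are assumptions).
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(regressor, cat_cols, num_cols, scale_numeric=False):
    # Numerical branch: impute missing values with the mean, optionally standardize.
    num_steps = [("impute", SimpleImputer(strategy="mean"))]
    if scale_numeric:
        num_steps.append(("scale", StandardScaler()))

    preprocess = ColumnTransformer(
        transformers=[
            # Categorical branch: dense one-hot encoding, unseen categories ignored.
            ("cat", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), cat_cols),
            ("num", Pipeline(num_steps), num_cols),
        ],
        remainder="passthrough",   # leave any remaining columns untouched
    )
    # Preprocessing followed by the supplied regressor under the step name 'reg'.
    return Pipeline(steps=[("preprocess", preprocess), ("reg", regressor)])

# Example use (assumed column names):
# pipe = build_pipeline(LinearRegression(), cat_cols=["type"], num_cols=["area"])
# pipe.fit(X_train, y_train); pipe.predict(X_test)
```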
SVMs are effective in cases where the number of dimensions is greater than the
number of samples due to their ability to find the optimal hyperplane in the
feature space, even with limited data.
They can be customized by changing kernels and tuning parameters since they
support the use of different kernel functions making them highly flexible for
handling various types of data distributions. Additionally, parameters like the
regularization term (C) and kernel-specific parameters (e.g., gamma for RBF)
can be tuned to improve performance.
Types of complex classification problems: multiclass, multilabel, and multioutput classification (see "8. Complex Classifications" under Lecture 3 above).
Ridge regression helps reduce overfitting, but it does not help with feature
selection, since it only shrinks the regression coefficients and never sets them to 0.
Lasso shrinks the coefficients and can even reduce them to 0, so it can also
be considered a helper for feature selection.
F-statistic steps:
K is the number of groups; n is the number of observations per group (N = K·n observations in total).
The degrees of freedom are of two types: between groups, dfB = K − 1, and within groups, dfW = N − K.
The F-statistic is F = MSB / MSW, with MSB = SSB / dfB and MSW = SSW / dfW.
---
### *Expanded Practice Exam*
---
### *Section 1: Multiple Choice Questions (MCQs)*
#### *Machine Learning Overview*
1. What type of machine learning is used when the data lacks labels?
a. Supervised learning
b. Unsupervised learning
c. Reinforcement learning
d. Transfer learning
*(Answer: b)*
2. Which step in the CRISP-DM framework ensures that business goals and
machine learning objectives align?
a. Data Understanding
b. Business Understanding
c. Data Preparation
d. Modeling
*(Answer: b)*
3. In supervised learning, which of the following metrics is most suitable for an
imbalanced dataset?
a. Accuracy
b. Precision
c. Recall
d. F1-Score
*(Answer: d)*
---
#### *Exploratory Data Analysis (EDA)*
4. What does a heatmap typically represent in EDA?
a. Distribution of a single variable
b. Frequency of categorical variables
c. Correlations between numerical variables
d. Density of multivariate data
*(Answer: c)*
5. Which measure of central tendency is most robust to outliers?
a. Mean
b. Median
c. Mode
d. Standard Deviation
*(Answer: b)*
---
#### *ANOVA and PCA*
6. What is the purpose of the elbow method in PCA?
a. Determine the optimal number of clusters
b. Select the number of principal components
c. Visualize variance explained by each feature
d. Normalize data
*(Answer: b)*
7. ANOVA assumes which of the following?
a. Data normality and homogeneity of variance
b. Linearity and independence of residuals
c. High variance and feature correlation
d. Data sparsity
*(Answer: a)*
---
#### *Clustering*
8. Which clustering method starts with all data points as individual clusters and
iteratively merges them?
a. K-Means
b. Divisive clustering
c. Agglomerative clustering
d. Spectral clustering
*(Answer: c)*
9. In K-Means clustering, what is the purpose of the centroid?
a. To maximize distances between clusters
b. To act as the center of a cluster
c. To define cluster boundaries
d. To calculate hierarchical distances
*(Answer: b)*
---
#### *Classification*
10. Which classifier is most suitable for linearly separable data?
a. K-Nearest Neighbors
b. Support Vector Machine
c. Decision Tree
d. Logistic Regression
*(Answer: d)*
---
### *Section 2: Open-Ended Questions*
#### *Exploratory Data Analysis (EDA)*
11. *Explain why both statistical and graphical EDA are necessary. Provide
examples of techniques for each.*
- *Sample Answer*: Statistical EDA (e.g., summary statistics) provides
numerical insights, while graphical EDA (e.g., histograms) visualizes patterns.
Together, they identify relationships, anomalies, and distributions.
---
#### *Dimensionality Reduction*
12. *Compare feature selection and feature extraction. Provide an example of a
technique used for each.*
- *Sample Answer*: Feature selection removes irrelevant features (e.g.,
ANOVA), while feature extraction creates new features (e.g., PCA).
---
#### *Clustering*
13. *Describe how the silhouette coefficient works. Why is it useful in clustering?*
- *Sample Answer*: The silhouette coefficient measures how well a data point
fits into its cluster versus others. A value close to 1 indicates a good fit.
---
#### *Classification*
14. *Explain the trade-off between precision and recall. In what scenario would
you prioritize one over the other?*
- *Sample Answer*: Precision prioritizes minimizing false positives, and recall
minimizes false negatives. For medical diagnostics, recall is critical to avoid
missed cases.
---
#### *Bias-Variance Tradeoff*
15. *How does model complexity influence bias and variance? Provide an
example of a high-bias and high-variance model.*
- *Sample Answer*: Higher complexity reduces bias but increases variance. A
linear model often has high bias, while a deep neural network may have high
variance.
---
### *Section 3: Practical Problems*
#### *Hierarchical Clustering*
16. Perform hierarchical clustering using *average linkage* for the following
distance matrix:
|     | A   | B   | C   | D   |
|-----|-----|-----|-----|-----|
| A   | -   | 2   | 6   | 10  |
| B   | 2   | -   | 4   | 8   |
| C   | 6   | 4   | -   | 6   |
| D   | 10  | 8   | 6   | -   |
*Show all merging steps and updated distance matrices.*
---
#### *Principal Component Analysis (PCA)*
17. You are given a dataset with three features: \(X_1\), \(X_2\), and \(X_3\). After
standardizing, you calculate the covariance matrix:
\[
\text{Covariance Matrix} =
\begin{bmatrix}
1.0 & 0.8 & 0.6 \\
0.8 & 1.0 & 0.7 \\
0.6 & 0.7 & 1.0
\end{bmatrix}
\]
Perform the following:
a. Compute the first principal component (eigenvector with the largest
eigenvalue).
b. Explain its significance.
---
#### *Classification*
18. A confusion matrix for a binary classifier is as follows:
| | Predicted Positive | Predicted Negative |
|--------|---------------------|---------------------|
| Actual Positive | 50 | 10 |
| Actual Negative | 20 | 120 |
a. Calculate precision, recall, and F1-score.
b. Interpret the results.
---
#### *Regression*
19. A linear regression model predicts house prices based on square footage (\
(X\)). The model equation is:
\[
Y = 500 + 200X
\]
a. What does the coefficient \(200\) represent?
b. If a house has 1500 square feet, what is the predicted price?
---
#### *Ensemble Learning*
20. Compare Random Forest and XGBoost in terms of handling overfitting and
scalability. Which would you choose for a dataset with 1 million records? Why?
---
### *Section 4: Real-World Applications*
21. *Fraud Detection*
A company uses anomaly detection for fraud detection in transactions.
a. Would you use supervised or unsupervised learning? Why?
b. Propose a clustering method suitable for this task and justify your choice.
---
22. *Medical Diagnostics*
You are tasked with building a model to detect cancer in medical images.
Discuss the following:
a. Would you prioritize precision or recall? Why?
b. Propose two machine learning methods and explain their strengths for this
task.
---
23. *Customer Segmentation*
Your company wants to segment customers based on purchase behavior.
a. Would you use K-Means or hierarchical clustering? Why?
b. How would you determine the optimal number of clusters?
Answer: Use K-Means rather than hierarchical clustering, since K-Means handles large datasets well.
Determine the optimal number of clusters using the elbow method (or silhouette analysis).
---
---
### *Advanced Mock Exam*
---
### *Section 1: Multiple Choice Questions (MCQs)*
#### *Machine Learning Frameworks*
1. Which step in CRISP-DM involves defining business success criteria?
a. Data Understanding
b. Business Understanding
c. Modeling
d. Evaluation
*(Answer: b)*
2. Which of the following is NOT a goal of Exploratory Data Analysis (EDA)?
a. Discovering patterns and relationships
b. Improving model accuracy
c. Spotting anomalies
d. Testing hypotheses
*(Answer: b)*
3. In data preparation, which method is used to handle imbalanced datasets by
generating synthetic samples?
a. Undersampling
b. SMOTE
c. Oversampling
d. PCA
*(Answer: b)*
---
#### *Dimensionality Reduction*
4. Which of the following is TRUE about Principal Component Analysis (PCA)?
a. It requires categorical data.
b. It maximizes the variance in high-dimensional data.
c. It reduces model complexity by feature selection.
d. It is insensitive to outliers.
*(Answer: b)*
5. In PCA, what does an eigenvalue represent?
a. The direction of variance in data
b. The amount of variance explained by a principal component
c. The correlation between variables
d. The dimensionality of the dataset
*(Answer: b)*
---
#### *Clustering*
6. What is the primary limitation of hierarchical clustering?
a. Requires pre-specifying the number of clusters
b. Cannot handle missing data
c. Computational expense for large datasets
d. Inability to use distance metrics
*(Answer: c)*
7. Which distance metric is most sensitive to outliers?
a. Euclidean distance
b. Manhattan distance
c. Cosine similarity
d. Jaccard distance
*(Answer: a)*
---
#### *Classification*
8. A logistic regression classifier outputs a probability of \(0.9\) for class 1. What
is the predicted label if the threshold is \(0.7\)?
a. Class 0
b. Class 1
*(Answer: b)*
9. In which situation is a high recall more critical than high precision?
a. Fraud detection
b. Email spam classification
c. Cancer detection
d. Recommender systems
*(Answer: c)*
---
#### *Evaluation Metrics*
10. The AUC-ROC value for a model is \(0.5\). What does this indicate?
a. Perfect classification
b. Random guessing
c. Overfitting
d. High recall and low precision
*(Answer: b)*
---
### *Section 2: Open-Ended Questions*
#### *Exploratory Data Analysis*
11. *Describe three types of EDA (univariate, bivariate, multivariate) and provide
an example visualization for each.*
Univariate – focuses on the behaviour of a single variable, e.g. histograms, box plots.
Bivariate – relationship between two variables, e.g. scatter plots, correlation heatmaps.
Multivariate – interactions among multiple variables, e.g. pair plots, 3D scatter plots.
#### *Clustering*
12. *Explain the difference between agglomerative and divisive hierarchical
clustering. Provide an example where each might be more appropriate.*
In agglomerative clustering we start with n clusters, one per observation, and
repeatedly merge the closest clusters from the bottom up until a single cluster
contains all the instances.
Appropriate when: the dataset is small or has a clear bottom-up structure, e.g.
grouping species based on genetic similarity, where clusters naturally emerge as
species are combined into genera and families.
Divisive clustering starts with one cluster containing all the observations and
repeatedly splits it from the top down until there are n clusters, one for each
observation. Appropriate when: the dataset is large and the clusters have a
natural top-down hierarchical structure, e.g. segmenting a market based on
demographics, where the initial split might divide the population by gender and
subsequent splits refine segments by age or income.
13. *Describe how the elbow method and silhouette score are used to select the
optimal number of clusters in K-Means.*
Elbow method helps select the optimal number of clusters by analysing the
within cluster sum of squares WCSS.
Process:
1. Run K-Means for different numbers of clusters (k).
2. Calculate WCSS for each value of k. WCSS measures the total variance
within each cluster.
3. Plot k (x-axis) vs. WCSS (y-axis).
4. Identify the "elbow point" in the plot, where the rate of decrease in WCSS
slows significantly. This point indicates the optimal k, as adding more
clusters beyond this doesn't provide substantial improvement.
Limitation: The elbow point can sometimes be ambiguous or subjective to
identify.
The silhouette score measures how similar an observation is to its own cluster
compared with the other clusters, assigning each observation a value between -1
and 1. The closer the value is to 1, the better the observation fits its cluster, and
the better the clustering.
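A minimal sketch of both selection methods on placeholder data: compute WCSS (KMeans' inertia_) and the silhouette score for a range of k, then look for the elbow in WCSS and the peak in silhouette.

```python
# Minimal sketch: choosing k via the elbow method (WCSS) and silhouette analysis.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # placeholder data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                       # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)    # in [-1, 1], closer to 1 is better
    print(f"k={k}  WCSS={wcss:8.1f}  silhouette={sil:.3f}")
# Elbow: pick k where the WCSS curve bends; silhouette: pick k with the highest score.
```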
---
#### *Dimensionality Reduction*
14. *Given a dataset with highly correlated features, explain why PCA would be a
better choice than feature selection.*
PCA creates uncorrelated (orthogonal) principal components by combining
correlated features. This eliminates redundancy and provides a more compact
representation of the data.
---
#### *Bias-Variance Tradeoff*
15. *Illustrate the bias-variance tradeoff with an example. Explain how this
tradeoff is managed in ensemble methods such as Random Forest and Gradient
Boosting.*
---
#### *Classification*
16. *What are the trade-offs between one-vs-rest (OvR) and one-vs-one (OvO)
strategies in multiclass classification? Provide an example for each.*
17. *Define and calculate the F1-score given the following confusion matrix:*
| | Predicted Positive | Predicted Negative |
|--------|---------------------|---------------------|
| Actual Positive | 60 | 15 |
| Actual Negative | 10 | 100 |
---
#### *Cross-Validation*
18. *Explain stratified cross-validation and why it is essential for imbalanced
datasets.*
---
#### *Regression*
19. *Explain the assumptions of linear regression and how violations of these
assumptions affect the model.*
20. *Compare and contrast Lasso, Ridge, and Elastic Net regularization methods.
Provide a scenario where each would be appropriate.*
---
### *Section 3: Practical Problems*
#### *Clustering*
21. Perform hierarchical clustering using *complete linkage* for the following
distance matrix:
|     | A   | B   | C   | D   |
|-----|-----|-----|-----|-----|
| A | - | 4 | 8 | 10 |
| B | 4 | - | 6 | 12 |
| C | 8 | 6 | - | 14 |
| D | 10 | 12 | 14 | - |
*Show all merging steps and updated distance matrices.*
---
#### *Principal Component Analysis (PCA)*
22. You have the following covariance matrix for a dataset:
\[
\text{Covariance Matrix} =
\begin{bmatrix}
4 & 2 \\
2 & 3
\end{bmatrix}
\]
a. Compute the eigenvalues and eigenvectors.
b. Identify the principal component and explain its significance.
---
#### *Classification*
23. A classifier has the following probabilities and true labels:
| Predicted Probability | True Label |
|-----------------------|------------|
| 0.95                  | 1          |
| 0.85                  | 1          |
| 0.65                  | 0          |
| 0.45                  | 1          |
| 0.25                  | 0          |
a. Plot the ROC curve.
b. Calculate the AUC.
---
### *Section 4: Real-World Applications*
24. *Fraud Detection*
A bank wants to develop a fraud detection system using machine learning.
a. Should the model prioritize recall or precision? Justify your answer.
b. Suggest two appropriate models for this task and explain your choice.
---
25. *Recommender Systems*
An e-commerce company uses collaborative filtering for product
recommendations.
a. What is the primary limitation of this approach?
b. How can content-based filtering complement collaborative filtering?
---
26. *Medical Diagnostics*
You are developing a model to classify chest X-rays as normal or abnormal.
a. What pre-processing techniques would you apply to the image data?
b. Which evaluation metrics would you prioritize and why?
---
27. *Customer Segmentation*
A retailer wants to group customers based on purchasing behavior to design
targeted campaigns.
a. Would you use K-Means or hierarchical clustering? Justify your choice.
b. How would you determine the optimal number of clusters?
---