​Machine Learning (SPPU 2019 Pattern)​

​Solutions to Insem Questions (Oct 2022, Sep 2023, Sep 2024)​

​Unit 1: Foundational Concepts​


​Comparison of Artificial Intelligence and Machine Learning​
​(Ref: Q1a Oct 2022, Q2a Sep 2024)​

​Definition:​
● Artificial Intelligence (AI): A broad branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence. The scope of AI is vast and includes areas like problem-solving, reasoning, perception, and language understanding.
● Machine Learning (ML): A specific subset of AI that focuses on the development of algorithms that allow a computer to learn from and make predictions or decisions based on data, without being explicitly programmed for the task.
Relationship:
​Machine Learning is a method to achieve Artificial Intelligence. Not all AI systems use machine​
​learning; some rely on hard-coded rules and logic. However, ML is currently the most​
​successful and dominant approach to AI.​
​Key Differences:​

| Feature | Artificial Intelligence (AI) | Machine Learning (ML) |
|:--|:--|:--|
| Scope | Broad. Aims to create intelligent systems to simulate human intelligence. | Narrow. Aims to learn from data to perform a specific task accurately. |
| Approach | Can use logic, rule-based systems, optimization, and machine learning. | Primarily uses statistical methods and algorithms to learn from data. |
| Goal | To build a system that can perform complex, human-like tasks. | To build a model that can make accurate predictions or decisions on new data. |
| Example | A sophisticated humanoid robot or a strategic game-playing AI like Deep Blue. | An email spam filter that learns to identify junk mail, or a recommendation engine. |
​Parametric and Non-Parametric Machine Learning Models​
​(Ref: Q1b Oct 2022)​

​1. Parametric Models​


​●​ ​Definition:​​A parametric model is one that makes assumptions​​about the​
functional form of the relationship between input and output variables. It
​summarizes the data with a set of parameters of a fixed size, regardless of the​
​amount of training data. The learning process involves finding the optimal values​
​for these fixed parameters.​
​ ​ ​Characteristics:​

​○​ ​Assumptions:​​Makes strong assumptions about the data​​(e.g., assumes a​
​linear relationship).​
​○​ ​Speed:​​Faster to train and requires less data.​
​○​ ​Complexity:​​Simpler and easier to interpret.​
​○​ ​Limitation:​​Prone to high bias if the assumptions​​are incorrect, leading to​
​lower accuracy.​
​●​ ​Examples:​
​○​ ​Linear Regression:​​Assumes a linear relationship between​​features and​
​output. The parameters are the coefficients (β) in the equation Y=β0​+β1​X1​+....​
​○​ ​Logistic Regression:​​Assumes a linear decision boundary.​
​○​ ​Naive Bayes:​​Assumes features are conditionally independent.​

​2. Non-Parametric Models​


​●​ ​Definition:​​A non-parametric model does not make strong​​assumptions about​
the form of the target function. The number of parameters is not fixed and can
​grow as it learns from more data. These models are more flexible and can fit a​
​wide range of functional forms.​
​ ​ ​Characteristics:​

​○​ ​Assumptions:​​Makes few or no assumptions about the​​data's underlying​
​structure.​
​○​ ​Flexibility:​​Can model complex relationships, leading​​to potentially higher​
​accuracy.​
​○​ ​Complexity:​​Requires more data and is computationally​​more expensive.​
​○​ ​Limitation:​​Prone to high variance and overfitting​​if not handled carefully.​
​●​ ​Examples:​
​○​ ​k-Nearest Neighbors (k-NN):​​The model is the entire​​training dataset.​
​○​ ​Decision Trees:​​Can create complex, non-linear decision​​boundaries.​
​○​ ​Support Vector Machines (SVM) with kernels:​​Can model​​highly non-linear​
​boundaries.​
​Data Formats in Machine Learning​
​(Ref: Q1c Oct 2022, Q1c Sep 2023)​

Machine learning algorithms require data to be in a structured, machine-readable format. The choice of format depends on the data type, size, and performance requirements.

​1. Tabular Formats:​​For structured data organized​​in rows and columns.​


​●​ ​CSV (Comma-Separated Values):​​A plain text format​​where values in a row are​
separated by commas. It is simple, human-readable, and universally supported by
​data analysis tools like Pandas.​
​ ​ ​Excel (XLS, XLSX):​​A spreadsheet format common for​​business data. While easy​

​to use, it is less efficient for very large datasets compared to binary formats.​
2. Hierarchical/Semi-Structured Formats: For data with nested structures, common in web applications.
​●​ ​JSON (JavaScript Object Notation):​​A lightweight,​​text-based format using​
key-value pairs. It is ideal for data from web APIs and for storing configuration
​files.​
​ ​ ​XML (eXtensible Markup Language):​​A tag-based format​​that is more verbose​

​than JSON but provides strong schema enforcement, often used in enterprise​
​systems.​
​3. High-Performance Binary Formats:​​For large-scale​​and big data applications.​
​●​ ​Apache Parquet:​​A columnar storage format. It stores​​data column by column,​
which allows for highly efficient compression and fast queries, as only the
​required columns need to be read. It is the standard for large datasets in​
​ecosystems like Apache Spark.​
​ ​ ​HDF5 (Hierarchical Data Format):​​A format designed​​to store large,​

​multi-dimensional numerical arrays (tensors). It is widely used in scientific​
​computing and deep learning for storing model weights and large feature sets.​
4. Unstructured Data Formats: These are containers for data like text and images,
​which must be converted into a numerical representation before use.​
​●​ ​Image Formats (JPEG, PNG):​​Converted into 3D numerical​​arrays of pixel values​
(height x width x color channels).
​ ​ ​Text Formats (.txt):​​Converted into numerical vectors​​using techniques like​

​TF-IDF or word embeddings.​
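For quick reference, a minimal pandas sketch of loading some of these formats; the file names are placeholders, and Parquet support assumes pyarrow or fastparquet is installed.

```python
# Sketch: loading common data formats with pandas; file names are placeholders.
import pandas as pd

df_csv = pd.read_csv("sales.csv")             # tabular, plain-text format
df_json = pd.read_json("records.json")        # semi-structured key-value data
df_parquet = pd.read_parquet("big.parquet")   # columnar binary format (needs pyarrow/fastparquet)

print(df_csv.head())
```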
​Supervised, Unsupervised, and Semi-supervised Learning​
​(Ref: Q2a Oct 2022)​

​1. Supervised Learning​


​●​ ​Definition:​​A type of machine learning where the algorithm​​learns from a dataset​
that is fully labeled, meaning each data point is tagged with a correct output. The
​goal is to learn a mapping function that can predict the output for new, unseen​
​data.​
​ ​ ​Analogy:​​Learning with a teacher or an answer key.​

​●​ ​Types:​
​○​ ​Classification:​​The output is a discrete category​​(e.g., "Spam" or "Not​
​Spam").​
​○​ ​Regression:​​The output is a continuous value (e.g.,​​predicting the price of a​
​house).​
​●​ ​Example:​​A credit card fraud detection system trained​​on a historical dataset of​
​transactions labeled as "fraudulent" or "legitimate."​
​2. Unsupervised Learning​
​●​ ​Definition:​​A type of machine learning where the algorithm​​learns from a dataset​
that has no labels. The system tries to learn the patterns and structure from the
​data on its own.​
​ ​ ​Analogy:​​Finding patterns without a teacher.​

​●​ ​Types:​
​○​ ​Clustering:​​Grouping similar data points together​​(e.g., customer​
​segmentation).​
​○​ ​Association:​​Discovering rules that describe relationships​​in data (e.g.,​
​market basket analysis).​
​○​ ​Dimensionality Reduction:​​Reducing the number of variables​​(e.g., PCA).​
​●​ ​Example:​​A marketing firm using clustering to identify​​different segments of its​
​customer base from purchase history data.​
​3. Semi-Supervised Learning​
​●​ ​Definition:​​A hybrid approach that uses a training​​dataset containing​​both​
l​abeled and unlabeled data​​. This is useful in scenarios​​where labeling data is​
​expensive or time-consuming.​
​ ​ ​Analogy:​​Learning with a teacher who only answers​​a few questions.​

​●​ ​Process:​​The model uses the small set of labeled data​​to guide the learning​
​process and infer labels for the large set of unlabeled data.​
​●​ ​Example:​​A photo-tagging service that uses a few user-tagged​​faces (labeled) to​
help automatically identify and tag the same faces in a much larger collection of
​untagged photos (unlabeled).​
​Statistical Learning Approaches​
​(Ref: Q2b Oct 2022, Q1b Sep 2023)​

Statistical learning is a framework for understanding data that formalizes the learning
​problem as finding a function f that best models the relationship between input​
​variables (predictors, X) and an output variable (response, Y), represented as​
​Y=f(X)+ϵ, where ϵ is random error.​

​There are two main goals or approaches within this framework:​

​1. Prediction​
​●​ ​Objective:​​To accurately predict the output Y for​​new, unseen inputs X.​
​●​ ​Focus:​​The accuracy of the prediction is the primary​​concern. The exact form of​
the function f is often treated as a "black box" and is not important as long as it
​yields good predictions.​
​ ​ ​Example:​​Predicting stock prices or identifying spam​​emails.​

​2. Inference​
​●​ ​Objective:​​To understand the relationship between​​the inputs X and the output Y.​
​●​ ​Focus:​​The interpretability of the model is key. We​​want to answer questions like:​
​○​ ​Which predictors are most strongly associated with the output?​
​○​ ​What is the nature of the relationship (linear, non-linear)?​
​○​ ​Can we quantify the effect of each predictor on the output?​
​●​ ​Example:​​Understanding how factors like advertising​​spend, price, and​
​competitor pricing affect product sales.​
These goals are pursued using different methods, which are categorized based on the
​availability of a response variable (​​Supervised vs.​​Unsupervised Learning​​) and the​
​assumptions made about the function f (​​Parametric​​vs. Non-parametric Methods​​).​

​Machine Learning vs. Traditional Programming​


​(Ref: Q1a Sep 2023)​

The fundamental difference between machine learning and traditional programming lies in how a system generates an output.
​●​ ​Traditional Programming:​​A developer writes explicit,​​step-by-step rules (an​
​algorithm) that the computer follows to process input data and produce an​
​output. The logic is entirely defined by the human programmer.​
​○​ ​Workflow:​​Data + Program (Rules) -> Computer -> Output​
​ ​ ​Machine Learning:​​A developer provides the computer​​with input data and the​

​corresponding correct outputs (labels). The learning algorithm then discovers the​
​rules and patterns connecting the inputs and outputs on its own, creating a​
​"model." This model can then make predictions on new data.​
​○​ ​Workflow:​​Data + Outputs -> Computer (Learning Algorithm)​​-> Program​
​(Model)​
​Comparison Table:​

| Aspect | Traditional Programming | Machine Learning |
|:--|:--|:--|
| Logic | Explicitly coded by a programmer. | Learned automatically from data. |
| Process | Deterministic; follows predefined rules. | Probabilistic; makes predictions based on learned patterns. |
| Scalability | Difficult to scale for complex problems with many rules. | Scales well for complex problems by learning from more data. |
| Example: Spam Filter | if email contains "viagra" then mark as spam. | The model is trained on thousands of spam/non-spam emails and learns what features (words, senders) are indicative of spam. |

​Applications of Machine Learning in Data Science​


​(Ref: Q2a Sep 2023)​

Machine learning is a core component of data science, providing the tools to build
​predictive and descriptive models from data. Key applications include:​
​1.​ ​Predictive Analytics:​​Using historical data to forecast​​future outcomes.​
​○​ ​Example:​​A retail company using ML to predict sales​​for the next quarter​
based on past sales data, seasonality, and economic indicators.
2. Recommendation Engines: Personalizing user experiences by suggesting
​relevant items.​
​○​ ​Example:​​Netflix analyzing your viewing history to​​recommend movies and TV​
shows you are likely to enjoy.
​3.​ ​Customer Churn Prediction:​​Identifying customers who​​are at high risk of​
​leaving a service.​
​○​ ​Example:​​A telecom company using customer usage patterns​​and support​
​call history to predict which customers might switch to a competitor, allowing​
​them to offer retention incentives.​
​4.​ ​Fraud Detection:​​Identifying and preventing fraudulent​​activities in real-time.​
​○​ ​Example:​​Banks using ML models to analyze transaction​​patterns and flag​
​unusual activities that may indicate a stolen credit card.​
​5.​ ​Sentiment Analysis:​​Automatically determining the​​emotional tone of text data.​
​○​ ​Example:​​A company analyzing social media mentions​​of its brand to gauge​
​public opinion and customer satisfaction.​
​6.​ ​Image Recognition:​​Identifying and classifying objects​​within images.​
​○​ ​Example:​​A self-driving car using computer vision​​to identify pedestrians,​
​traffic signs, and other vehicles.​
​Geometric and Probabilistic Models​
​(Ref: Q2b Sep 2023, Q1c Sep 2024)​

Machine learning models can be conceptualized in different ways. Two major categories are geometric and probabilistic models.

​1. Geometric Models​


​●​ ​Concept:​​These models represent data instances as​​points in a high-dimensional​
space (feature space). The learning process involves defining a geometric shape
​or boundary to separate these points or find proximity between them.​
​ ​ ​Core Idea:​​Using concepts of distance, planes, and​​margins to make predictions.​

​●​ ​Types & Examples:​
​○​ ​Models based on Distance:​​Predictions are made based​​on the proximity of​
​data points.​
​■​ ​k-Nearest Neighbors (k-NN):​​A new data point is classified​​based on the​
​majority class of its 'k' nearest neighbors.​
​○​ ​Models based on Separating Hyperplanes:​​A linear boundary​​(a line in 2D, a​
​plane in 3D, a hyperplane in higher dimensions) is learned to separate classes.​
​■​ ​Support Vector Machines (SVM):​​Finds the optimal hyperplane​​that best​
​separates data points with the maximum possible margin.​
​■​ ​Linear Regression:​​Fits a line (or hyperplane) that​​is closest to all the​
​data points.​
​2. Probabilistic Models​
​●​ ​Concept:​​These models use the principles of probability theory to make​
predictions. They aim to model the probability distribution of the data or the
​probability of an outcome given the input.​
​ ​ ​Core Idea:​​Using probability to handle uncertainty​​and make predictions based​

​on the most likely outcome. The output is often a probability score.​
​●​ ​Examples:​
​○​ ​Naive Bayes:​​A classification algorithm based on Bayes'​​Theorem. It​
​calculates the probability of a data point belonging to a certain class, given its​
​features, e.g., P(Class∣Features).​
​○​ ​Logistic Regression:​​Although it has a geometric interpretation,​​it is​
​fundamentally a probabilistic model. It models the probability that a given​
​input belongs to a certain class using the logistic (sigmoid) function.​
​○​ ​Gaussian Mixture Models (GMM):​​A clustering algorithm​​that assumes data​
​points are generated from a mixture of several Gaussian (normal)​
​distributions.​
​Steps in a Machine Learning Application​
​(Ref: Q2c Sep 2023, Q2b Sep 2024)​

Developing a machine learning application is an iterative, cyclical process involving several key steps:
​1.​ ​Define the Objective & Frame the Problem:​​Clearly​​articulate the business​
problem and define the success metric. Determine if the problem is a classification, regression, or clustering task.
2. Data Collection: Gather all necessary data from various sources like databases,
​APIs, or files.​
​3.​ ​Data Preprocessing and Preparation:​​This is the most​​critical and​
​time-consuming phase.​
​○​ ​Data Cleaning:​​Handle missing values (imputation/deletion),​​correct errors,​
​and remove duplicates.​
​○​ ​Feature Engineering:​​Create new, more informative​​features from existing​
​ones.​
​○​ ​Feature Scaling:​​Normalize or standardize numerical​​features to bring them​
​to a common scale (e.g., Min-Max Scaling, Z-score Normalization).​
​○​ ​Encoding:​​Convert categorical features into a numerical​​format (e.g.,​
​One-Hot Encoding).​
​4.​ ​Data Splitting:​​Divide the dataset into three parts:​
​○​ ​Training Set (70-80%):​​Used to train the machine learning​​model.​
​○​ ​Validation Set (10-15%):​​Used to tune the model's​​hyperparameters.​
​○​ ​Test Set (10-15%):​​Used for the final, unbiased evaluation of the model's​
performance on unseen data.
​5.​ ​Model Selection & Training:​​Choose a suitable algorithm​​and train it on the​
​training dataset. During training, the model learns the underlying patterns by​
​minimizing a loss function.​
​6.​ ​Model Evaluation:​​Assess the trained model's performance​​on the test set using​
​appropriate metrics (e.g., accuracy, precision, recall for classification; Mean​
​Squared Error for regression).​
​7.​ ​Hyperparameter Tuning:​​Systematically adjust the model's​​hyperparameters​
​(e.g., using Grid Search or Random Search) to find the combination that yields​
​the best performance on the validation set.​
​8.​ ​Deployment & Monitoring:​​Deploy the final model into​​a production​
​environment where it can make predictions on real-world data. Continuously​
​monitor its performance and retrain it periodically with new data to maintain its​
​accuracy.​
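A minimal scikit-learn sketch of steps 4-6 (splitting, training, evaluation) on synthetic data; the 70/15/15 split and the logistic regression model are illustrative choices, not prescribed by the syllabus.

```python
# Sketch of data splitting, training, and evaluation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% train, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```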
​Grouping and Grading Models​
​(Ref: Q2c Sep 2024)​

This terminology refers to two fundamental tasks in machine learning, corresponding to unsupervised and supervised learning, respectively.

​1. Grouping Models (Unsupervised Learning)​


​●​ ​Concept:​​"Grouping" models perform​​unsupervised learning​​,​​specifically​
clustering. Their goal is to automatically discover natural groupings or clusters in
​unlabeled data. The model groups data points such that points within the same​
​group are more similar to each other than to those in other groups.​
​●​ ​Purpose:​​To find hidden structure and patterns in​​data without prior knowledge​
​of the categories.​
​●​ ​Key Idea:​​Similarity or distance between data points.​
​●​ ​Examples:​
​○​ ​K-Means Clustering:​​Partitions data into a pre-specified​​number ('K') of​
​clusters.​
​○​ ​DBSCAN:​​Groups together points that are closely packed​​in high-density​
​regions.​
​●​ ​Application:​​Customer segmentation, social network​​analysis, anomaly detection.​
​2. Grading Models (Supervised Learning)​
​●​ ​Concept:​​"Grading" models perform​​supervised learning​​.​​Their goal is to assign​
​a "grade"—which can be a categorical label or a continuous score—to a new data​
point based on what it has learned from labeled training data.
​ ​ ​Purpose:​​To predict an output for new, unseen data.​

​●​ ​Key Idea:​​Learning a mapping function from inputs​​to labeled outputs.​
​●​ ​Types and Examples:​
​○​ ​Classification (Categorical Grade):​​Assigns a discrete​​class label.​
​■​ ​Example:​​An email is "graded" as either 'Spam' or​​'Not Spam'. A tumor is​
​"graded" as 'Benign' or 'Malignant'.​
​○​ ​Regression (Continuous Grade/Score):​​Assigns a continuous​​numerical​
​value.​
​■​ ​Example:​​A house is "graded" with a predicted price.​​A student is "graded"​
​with a predicted exam score.​

​Unit 2: Data Preprocessing and Feature Engineering​


​Feature Scaling Calculations​
​(Ref: Q3a Oct 2022, Q4a Sep 2023, Q4c Sep 2023, Q3a Sep 2024)​

​1. Min-Max Scaling (Normalization)​


This technique scales data to a fixed range, usually [0, 1].
Formula: X_scaled = (X − X_min) / (X_max − X_min)
Problem (Oct 2022): Consider a vector x = (23, 29, 52, 31, 45).
● Step 1: Find min and max values.
○ X_min = 23
○ X_max = 52
● Step 2: Apply the formula (Denominator = 52 − 23 = 29).
○ For 23: (23 − 23) / 29 = 0.0
○ For 29: (29 − 23) / 29 = 6/29 ≈ 0.207
○ For 52: (52 − 23) / 29 = 1.0
○ For 31: (31 − 23) / 29 = 8/29 ≈ 0.276
○ For 45: (45 − 23) / 29 = 22/29 ≈ 0.759
● Answer: The min-max scaled vector is (0.0, 0.207, 1.0, 0.276, 0.759).

​Problem (Sep 2024):​​Convert D = {23, 29, 52, 31, 45,​​19, 18, 27}.​
​●​ ​Step 1:​​Find min and max values.​
​○​ ​Xmin​=18​
​○​ ​Xmax​=52​
​●​ ​Step 2:​​Apply the formula (Denominator = 52 - 18 =​​34).​
○ For 23: (23 − 18) / 34 ≈ 0.147
○ For 29: (29 − 18) / 34 ≈ 0.324
○ For 52: (52 − 18) / 34 = 1.0
○ For 31: (31 − 18) / 34 ≈ 0.382
○ For 45: (45 − 18) / 34 ≈ 0.794
○ For 19: (19 − 18) / 34 ≈ 0.029
○ For 18: (18 − 18) / 34 = 0.0
○ For 27: (27 − 18) / 34 ≈ 0.265
​●​ ​Answer:​​The normalized data set is {0.147, 0.324,​​1.0, 0.382, 0.794, 0.029, 0.0,​
​0.265}.​
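A short scikit-learn sketch that reproduces this min-max calculation; MinMaxScaler is used here purely to verify the hand computation.

```python
# Sketch: reproducing the Sep 2024 min-max calculation with scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

D = np.array([23, 29, 52, 31, 45, 19, 18, 27], dtype=float).reshape(-1, 1)
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(D)
print(np.round(scaled.ravel(), 3))
# Expected (to 3 decimals): [0.147 0.324 1.    0.382 0.794 0.029 0.    0.265]
```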
​2. Z-Score Normalization (Standardization)​
This technique rescales data to have a mean (μ) of 0 and a standard deviation (σ) of 1.
Formula: Z = (X − μ) / σ
Problem (Oct 2022): For x = (23, 29, 52, 31, 45).
● Step 1: Calculate the mean (μ).
○ μ = (23 + 29 + 52 + 31 + 45) / 5 = 36
● Step 2: Calculate the standard deviation (σ).
○ Variance σ² = [(23−36)² + (29−36)² + (52−36)² + (31−36)² + (45−36)²] / 5 = (169 + 49 + 256 + 25 + 81) / 5 = 116
○ σ = √116 ≈ 10.77
● Step 3: Apply the z-score formula.
○ For 23: (23 − 36) / 10.77 ≈ −1.207
○ For 29: (29 − 36) / 10.77 ≈ −0.650
○ For 52: (52 − 36) / 10.77 ≈ 1.486
○ For 31: (31 − 36) / 10.77 ≈ −0.464
○ For 45: (45 − 36) / 10.77 ≈ 0.836
​●​ ​Answer:​​The z-score normalized vector is (-1.207,​​-0.650, 1.486, -0.464, 0.836).​

​Problem (Sep 2023):​​For AGE = {18, 22, 25, 42, 28,​​43, 33, 35, 56, 28}.​
​●​ ​Step 1:​​Calculate the mean (μ).​
○ μ = (18 + 22 + 25 + 42 + 28 + 43 + 33 + 35 + 56 + 28) / 10 = 33
● Step 2: Calculate the standard deviation (σ).
○ Σ(X − μ)² = (18−33)² + ... + (28−33)² = 225 + 121 + 64 + 81 + 25 + 100 + 0 + 4 + 529 + 25 = 1174
○ Variance σ² = 1174 / 10 = 117.4
○ σ = √117.4 ≈ 10.835
● Step 3: Apply the z-score formula.
○ Z(18) = (18 − 33) / 10.835 ≈ −1.384
○ Z(22) = (22 − 33) / 10.835 ≈ −1.015
○ Z(25) = (25 − 33) / 10.835 ≈ −0.738
○ Z(42) = (42 − 33) / 10.835 ≈ 0.831
○ Z(28) = (28 − 33) / 10.835 ≈ −0.461
○ Z(43) = (43 − 33) / 10.835 ≈ 0.923
○ Z(33) = (33 − 33) / 10.835 = 0.0
○ Z(35) = (35 − 33) / 10.835 ≈ 0.185
○ Z(56) = (56 − 33) / 10.835 ≈ 2.122
​●​ ​Answer:​​The Z-score normalized data is {-1.384, -1.015,​​-0.738, 0.831, -0.461,​
​0.923, 0.0, 0.185, 2.122, -0.461}.​
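A short NumPy sketch reproducing this z-score calculation; note it uses the population standard deviation (divide by N), matching the worked example.

```python
# Sketch: reproducing the Sep 2023 z-score calculation with NumPy
# (population standard deviation, i.e. dividing by N, as in the worked example).
import numpy as np

age = np.array([18, 22, 25, 42, 28, 43, 33, 35, 56, 28], dtype=float)
z = (age - age.mean()) / age.std()   # np.std defaults to the population form (ddof=0)
print(np.round(z, 3))
# Expected: [-1.384 -1.015 -0.738  0.831 -0.461  0.923  0.     0.185  2.122 -0.461]
```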
​Principal Component Analysis (PCA)​
​(Ref: Q3b Oct 2022, Q4c Sep 2024)​

Definition:
​Principal Component Analysis (PCA) is an unsupervised, linear dimensionality reduction​
​technique. Its main goal is to transform a dataset with a large number of potentially correlated​
​variables into a smaller set of new, uncorrelated variables called principal components, while​
​retaining as much of the original data's variance (information) as possible.​
​Process of PCA:​
​1.​ ​Standardize the Data:​​PCA is sensitive to the scale​​of the features. Therefore, all​
features must be scaled to have a mean of 0 and a standard deviation of 1
​(Z-score normalization).​
​2.​ ​Compute the Covariance Matrix:​​A covariance matrix​​is calculated for the​
​standardized data. This square matrix shows the correlation between all pairs of​
​variables and indicates how they move together.​
​3.​ ​Calculate Eigenvectors and Eigenvalues:​​The eigenvectors​​and eigenvalues of​
​the covariance matrix are computed.​
​○​ ​Eigenvectors:​​These represent the directions of the​​new feature space (the​
​principal components). They are orthogonal to each other.​
​○​ ​Eigenvalues:​​These represent the magnitude or importance​​of the​
​corresponding eigenvector. A high eigenvalue means that the principal​
​component explains a large amount of variance in the data.​
​4.​ ​Select Principal Components:​​The eigenvectors are​​sorted in descending order​
​based on their corresponding eigenvalues. The top 'k' eigenvectors are chosen to​
​be the new feature dimensions. The value of 'k' is selected based on the desired​
​amount of cumulative variance to be retained (e.g., 95%).​
​5.​ ​Transform the Data:​​The original standardized data​​is projected onto the new​
​feature space defined by the selected 'k' eigenvectors. This is done by taking the​
​dot product of the standardized data and the matrix of chosen eigenvectors. The​
​result is a new dataset with 'k' dimensions.​
​Use of PCA in Preprocessing:​
​●​ ​Dimensionality Reduction:​​It reduces the number of​​features, which helps to​
combat the "curse of dimensionality," reduce model training time, and lower
​computational complexity.​
​ ​ ​Noise Reduction:​​By discarding components with low​​variance (low eigenvalues),​

​PCA can effectively filter out statistical noise from the data.​
​●​ ​Multicollinearity Removal:​​It transforms correlated​​features into a set of​
​uncorrelated principal components. This is highly beneficial for algorithms that​
​are sensitive to multicollinearity, such as Linear Regression.​
​●​ ​Data Visualization:​​By reducing a high-dimensional​​dataset to 2 or 3 principal​
​components, PCA allows for the data to be plotted and visually inspected, which​
​can help in understanding its structure and identifying clusters or outliers.​
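An illustrative scikit-learn sketch of PCA as a preprocessing step; the synthetic data and the 95% cumulative-variance threshold are assumptions for demonstration.

```python
# Sketch: PCA as a preprocessing step with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # placeholder feature matrix

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=0.95)                # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X_std)        # Steps 2-5 handled internally

print("Components kept:", pca.n_components_)
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```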
​Handling Missing Values​
​(Ref: Q4a Oct 2022)​

Handling missing values is a critical data preprocessing step, as most ML algorithms cannot work with them. The choice of method depends on the nature and amount of missing data.

​1. Deletion Methods:​


​●​ ​Listwise Deletion:​​The entire row (observation) containing​​one or more missing​
values is removed. This is the simplest method but can lead to significant data
​loss if missing values are widespread, potentially introducing bias.​
​ ​ ​Pairwise Deletion:​​When calculating statistics like​​covariance, the algorithm only​

​uses pairs of data points that are complete, ignoring missing values for specific​
​calculations. This retains more data but can be complex to implement.​
2. Imputation Methods (Filling the Values):
​This involves filling the missing values with a plausible substitute.​
​●​ ​Mean/Median/Mode Imputation:​​This is the most common​​approach.​
​○​ ​Mean:​​Replace missing numerical values with the mean​​of the entire column.​
Best for normally distributed data without outliers.
​○​ ​Median:​​Replace missing numerical values with the​​median of the column.​
​This is more robust to outliers than the mean.​
​○​ ​Mode:​​Replace missing categorical values with the​​mode (most frequent​
​value) of the column.​
​ ​ ​End of Tail Imputation:​​Missing values are replaced​​by a value at the far end of​

​the distribution (e.g., mean + 3 * std dev). This can help the model learn that the​
​value was originally missing.​
​●​ ​Model-Based Imputation:​​Use other features to predict the missing value. This​
​is more accurate but more complex.​
​○​ ​Regression Imputation:​​A regression model is built​​to predict the missing​
​numerical value based on other features.​
​○​ ​k-NN Imputation:​​The missing value is imputed using​​the mean/median of​
​the 'k' most similar neighbors, where similarity is based on other features.​
​3. Using Algorithms that Support Missing Values:​
​●​ ​Some modern tree-based algorithms, such as​​XGBoost​​,​​LightGBM​​, and​
CatBoost, can handle missing values internally without requiring explicit
​imputation, often by learning the best direction to send missing values down the​
​tree.​
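An illustrative scikit-learn sketch of mean and k-NN imputation on a small made-up array.

```python
# Sketch: mean and k-NN imputation with scikit-learn on a small example array.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[25.0, 50000.0],
              [30.0, np.nan],
              [np.nan, 62000.0],
              [40.0, 71000.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled)
print(knn_filled)
```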
​Wrapper Methods for Feature Selection​
​(Ref: Q4b Oct 2022)​

Definition:
​Wrapper methods are a class of feature selection techniques that use a specific machine​
​learning model to evaluate the usefulness of a subset of features. They "wrap" the model​
​training process inside the feature selection loop. They treat feature selection as a search​
​problem, where different feature combinations are prepared, evaluated, and compared. The​
​performance of the model (e.g., accuracy) on a validation set is the objective function used to​
​score each feature subset.​
​Characteristics:​
​●​ ​Model-Specific:​​The selected features are optimized​​for the specific machine​
l​earning algorithm used.​
​ ​ ​Computationally Expensive:​​They require training a​​new model for each feature​

​subset considered, making them much slower than filter methods.​
​●​ ​High Performance:​​They tend to find feature subsets​​that yield better model​
​performance because they consider feature interactions.​
​Types of Wrapper Methods:​
​1.​ ​Forward Selection:​
​○​ ​Process:​​Starts with an empty set of features. In​​each iteration, it adds the​
single feature from the remaining set that results in the best model
​performance. This process is repeated until adding new features no longer​
​improves the model significantly.​
​○​ ​Limitation:​​It cannot remove features once they are​​added, so it might miss​
​the optimal combination if a feature becomes redundant later.​
2. Backward Elimination:
○ Process: Starts with the full set of all features. In each iteration, it removes the single feature whose removal leads to the best model performance (or the
​least performance degradation). This process is repeated until no further​
​improvement is gained by removing features.​
​○​ ​Limitation:​​Extremely computationally expensive, especially​​with a large​
​number of initial features, as it starts with the most complex model.​
3. Recursive Feature Elimination (RFE):
​○​ ​Process:​​This is a greedy optimization algorithm that​​is more efficient than​
​backward elimination.​
​1.​ ​Train a model on the entire set of features.​
​2.​ ​Compute an importance score for each feature (e.g., coefficients in a​
​linear model or feature importances from a tree-based model).​
​3.​ ​Remove the least important feature(s).​
​4.​ ​Repeat the process with the remaining features until the desired number​
​of features is reached.​
​○​ ​Advantage:​​It is often more robust and faster than​​simple backward​
​elimination.​
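An illustrative scikit-learn sketch of RFE; the logistic regression estimator and the choice of keeping 3 features are assumptions.

```python
# Sketch: Recursive Feature Elimination with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)
```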
​Local Binary Pattern (LBP)​
​(Ref: Q4c Oct 2022, Q3c Sep 2023)​

Definition:
​Local Binary Pattern (LBP) is a simple yet very efficient feature extraction technique used for​
​texture classification in computer vision. It works by describing the local texture pattern of an​
​image by comparing each pixel with its surrounding neighbors.​
​Process of LBP:​
​1.​ ​For each pixel in an image, a neighborhood is selected (typically a 3x3 grid with​
the pixel of interest at the center).
​2.​ ​The intensity value of the center pixel is used as a​​threshold​​.​
​3.​ ​This threshold is compared with the intensity value of its 8 neighbors.​
​4.​ ​For each neighbor, if its value is greater than or equal to the center pixel's value, it​
​is assigned a binary '1'. Otherwise, it is assigned a '0'.​
​5.​ ​This creates an 8-bit binary number. The bits are collected in a sequence (e.g.,​
​clockwise starting from the top-left neighbor).​
​6.​ ​This binary number is converted to its decimal equivalent. This decimal value is​
​the LBP code for the center pixel and represents the local texture.​
​7.​ ​After computing the LBP code for every pixel, a​​histogram​​of these LBP codes is​
​created for the entire image (or for regions of it). This histogram serves as the​
​feature vector that describes the overall texture of the image and can be used to​
​train a classifier.​
Example Calculation (from Sep 2023 paper):
Calculate the LBP code for the central point (9) in the neighborhood:
| 10 | 12 | 18 |
| 7 | 9 | 6 |
| 9 | 2 | 4 |
​1.​ ​Center Pixel Value (Threshold):​​9​
​2.​ ​Compare neighbors to the center (9):​
​○​ ​Top-left (10) >= 9 ->​​1​
​○​ ​Top (12) >= 9 ->​​1​
​○​ ​Top-right (18) >= 9 ->​​1​
​○​ ​Right (6) < 9 ->​​0​
​○​ ​Bottom-right (4) < 9 ->​​0​
​○​ ​Bottom (2) < 9 ->​​0​
​○​ ​Bottom-left (9) >= 9 ->​​1​
​○​ ​Left (7) < 9 ->​​0​
​3.​ ​Form the binary string​​(reading clockwise from top-left):​​11100010​
​4.​ ​Convert the binary number to decimal:​
1·2⁷ + 1·2⁶ + 1·2⁵ + 0·2⁴ + 0·2³ + 0·2² + 1·2¹ + 0·2⁰
= 128 + 64 + 32 + 0 + 0 + 0 + 2 + 0
= 226
● Answer: The LBP code for the central pixel is 226.
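An illustrative Python sketch that reproduces this LBP calculation for the same 3x3 neighborhood.

```python
# Sketch: computing the LBP code for a single 3x3 neighborhood, reading the
# neighbors clockwise from the top-left, as in the worked example above.
import numpy as np

patch = np.array([[10, 12, 18],
                  [ 7,  9,  6],
                  [ 9,  2,  4]])

center = patch[1, 1]
# Clockwise order: top-left, top, top-right, right, bottom-right, bottom, bottom-left, left
neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
             patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]

bits = [1 if n >= center else 0 for n in neighbors]
lbp_code = sum(bit << (7 - i) for i, bit in enumerate(bits))  # first bit is the MSB

print("Bits:", "".join(map(str, bits)))   # 11100010
print("LBP code:", lbp_code)              # 226
```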

​Feature Selection and Filtering Technique​


​(Ref: Q3a Sep 2023)​

Definition of Feature Selection:
​Feature Selection is the process of automatically or manually selecting a subset of the most​
​relevant features from a dataset to be used in model construction. The primary goals are to:​
​●​ ​Simplify models to make them easier to interpret.​
​●​ ​Reduce the time required to train a model.​
​●​ ​Reduce overfitting by removing irrelevant or redundant features (combating the​
Curse of Dimensionality).
● Improve model performance and generalization to new data.

Filtering Technique (Filter Methods):
​Filter methods are a class of feature selection techniques where features are selected before​
​the model training process begins. They evaluate and rank features based only on their​
​intrinsic statistical properties and their relationship with the target variable, independent of​
​any machine learning algorithm.​
​●​ ​How they work:​
​1.​ ​A statistical measure (or scoring function) is used to score each feature's​
​relevance.​
​2.​ ​The features are ranked based on their scores.​
​3.​ ​A threshold is applied to select the highest-scoring features (e.g., select the​
top 'k' features or all features above a certain score).
​ ​ ​Characteristics:​

​○​ ​Fast and Efficient:​​They are computationally much​​cheaper than other​
​methods like wrapper methods.​
​○​ ​Model-Agnostic:​​The selected feature set is not tied​​to a specific model and​
​can be used with any algorithm.​
​○​ ​Limitation:​​They ignore the interaction between features.​​A feature might be​
​useless by itself but highly valuable when combined with another. They also​
​ignore the impact of the selected features on the performance of a specific​
​model.​
​Common Filter Techniques:​
​●​ ​Correlation Coefficient (e.g., Pearson's r):​​Measures​​the linear relationship​
between a numerical feature and the numerical target variable. Features with high
​correlation to the target are selected.​
​ ​ ​Chi-Squared Test:​​Used for categorical features. It​​tests the independence​

​between a categorical feature and the categorical target variable. A higher​
​Chi-Squared value indicates a stronger, more relevant relationship.​
​●​ ​Information Gain / Mutual Information:​​Measures the​​reduction in uncertainty​
​(entropy) of the target variable given the knowledge of a feature. A higher​
​information gain means the feature is more useful for predicting the target.​
​●​ ​ANOVA F-test:​​Used when the input features are numerical​​and the target​
​variable is categorical. It checks if the mean of a numerical feature is significantly​
​different across the different classes.​
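An illustrative scikit-learn sketch of a filter method using SelectKBest; the ANOVA F-test scorer and k=5 are assumptions.

```python
# Sketch: filter-method feature selection with SelectKBest and the ANOVA F-test.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Scores:", selector.scores_.round(1))
print("Kept feature indices:", selector.get_support(indices=True))
```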
​Kernel PCA​
​(Ref: Q3b Sep 2023)​

Definition:
​Kernel Principal Component Analysis (Kernel PCA) is a non-linear dimensionality reduction​
​technique. It is an extension of standard PCA that is used when the data is not linearly​
​separable, meaning its structure cannot be captured by linear components.​
​The Problem with Standard PCA:​
​Standard PCA works by finding linear projections of the data. If the data has a complex,​
​non-linear structure (e.g., data points arranged in two concentric circles), a straight line (a​
​linear principal component) cannot effectively separate the classes or capture the variance.​
​How Kernel PCA Works:​
​Kernel PCA overcomes this limitation by using the "kernel trick."​
​1.​ ​Implicit Mapping to a Higher-Dimensional Space:​​Kernel​​PCA uses a​​kernel​
function (e.g., Polynomial, Radial Basis Function - RBF, Sigmoid) to implicitly map
​the original data from its input space into a much higher-dimensional feature​
​space. The core idea is that in this higher-dimensional space, the complex​
​non-linear structure of the data becomes simpler and can be captured by linear​
​methods.​
2. The Kernel Trick: The key to Kernel PCA is that it never actually computes the
​coordinates​​of the data points in this high-dimensional​​space, which would be​
​computationally infeasible. Instead, the kernel function computes the dot​
​products between the images of all pairs of data points in the high-dimensional​
​space directly from the original data points. This matrix of dot products is called​
​the​​kernel matrix​​or​​Gram matrix​​.​
​3.​ ​Performing PCA in the High-Dimensional Space:​​Kernel​​PCA then performs​
​the standard PCA algorithm (i.e., centering the data, finding​
​eigenvectors/eigenvalues) in this new, high-dimensional space. However, it does​
​so using the kernel matrix instead of the explicit coordinates and the covariance​
​matrix.​
​4.​ ​Result:​​The result is a set of​​non-linear principal​​components​​in the original​
​space. These components are projections of the data that capture its non-linear​
​structure, allowing for effective dimensionality reduction and feature extraction​
​for complex datasets.​
In summary, Kernel PCA = Kernel Trick + Standard PCA. It allows PCA to find
​non-linear patterns in data by implicitly projecting it into a space where those patterns​
​become linear.​
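An illustrative scikit-learn sketch of Kernel PCA with an RBF kernel on concentric-circles data, a classic case where linear PCA fails; the gamma value is an assumption.

```python
# Sketch: Kernel PCA with an RBF kernel on data that linear PCA cannot untangle.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca[:5])  # first few points in the new non-linear component space
```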

​Matrix Factorization and Content-Based Filtering​


​(Ref: Q4b Sep 2023)​

​1. Matrix Factorization​


​●​ ​Definition:​​Matrix Factorization is a class of collaborative​​filtering algorithms​
used in recommendation systems, and more generally, a dimensionality reduction
​technique. The core idea is to​​decompose a large matrix​​into the product of​
​two or more smaller matrices​​(called factors).​
​ ​ ​Application in Recommendation Systems:​

​○​ ​The large matrix is typically a​​user-item interaction​​matrix​​(e.g., a matrix of​
​users' ratings for movies). This matrix is usually very sparse, as users only rate​
​a small fraction of the available items.​
​○​ ​Matrix Factorization decomposes this user-item matrix (R) into two​
​lower-dimensional matrices: a​​user-feature matrix​​(P) and an​​item-feature​
​matrix​​(Q).​
■ R (m×n) ≈ P (m×k) × Qᵀ (k×n)
​ ​ ​The "features" (the 'k' dimension) are​​latent (hidden)​​factors​​learned by the​

​algorithm. These factors might represent abstract concepts like genres,​
​actors, or user tastes.​
​○​ ​The predicted rating for a user for an item is simply the dot product of that​
​user's vector in P and that item's vector in Q. This allows the system to predict​
​ratings for empty cells in the original matrix, thus generating​
​recommendations.​
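An illustrative NumPy sketch of low-rank factorization of a small ratings matrix via truncated SVD; the ratings values and the use of zeros for missing entries are made up for demonstration.

```python
# Sketch: low-rank matrix factorization of a small user-item ratings matrix via
# truncated SVD; zeros stand in for missing ratings purely for illustration.
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

k = 2                                  # number of latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U[:, :k] * s[:k]                   # user-feature matrix (m x k)
Q_t = Vt[:k, :]                        # item-feature matrix, transposed (k x n)

R_hat = P @ Q_t                        # predicted ratings, including the empty cells
print(np.round(R_hat, 2))
```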
​2. Content-Based Filtering​
​●​ ​Definition:​​Content-Based Filtering is a type of recommendation​​system that​
recommends items to a user based on the attributes of the items (their
​"content") that the user has liked in the past. It operates on the principle: "Show​
​me more of what I like." It does not use information about other users.​
​ ​ ​How it works:​

​1.​ ​Item Profile Creation:​​For each item, a profile is​​created containing its​
​attributes. For a movie, this could include genre, director, actors, plot​
​keywords, etc. This is often represented as a feature vector.​
​2.​ ​User Profile Creation:​​A profile is created for each​​user that summarizes​
​their preferences based on the content of the items they have rated highly. If​
​a user likes many sci-fi movies, their user profile will indicate a strong​
​preference for the "sci-fi" attribute.​
​3.​ ​Recommendation Generation:​​The system compares the​​user's profile with​
​the profiles of unrated items. It then recommends items whose content​
​profiles are a close match to the user's profile (e.g., by calculating the cosine​
​similarity between the user profile vector and item profile vectors).​
​●​ ​Example:​
​○​ ​A user watches and rates "The Matrix" and "Blade Runner" highly.​
​○​ ​The system analyzes the content of these movies and builds a profile for the​
​user that shows a high preference for attributes like genre: sci-fi, theme:​
​dystopian, theme: artificial intelligence.​
​○​ ​The system then searches its catalog for other movies with similar attributes,​
​such as "Ex Machina," and recommends it to the user.​
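An illustrative sketch of content-based scoring by cosine similarity; the movie titles and attribute vectors are made up for demonstration.

```python
# Sketch: content-based recommendation by cosine similarity between a user
# profile and item attribute vectors; the movies and attributes are made up.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Attribute columns: [sci-fi, dystopian, AI, comedy]
items = {
    "The Matrix":   np.array([1, 1, 1, 0]),
    "Blade Runner": np.array([1, 1, 1, 0]),
    "Ex Machina":   np.array([1, 0, 1, 0]),
    "Mean Girls":   np.array([0, 0, 0, 1]),
}

# User profile: average of the attribute vectors of liked items
liked = ["The Matrix", "Blade Runner"]
profile = np.mean([items[m] for m in liked], axis=0).reshape(1, -1)

for title, vec in items.items():
    if title not in liked:
        score = cosine_similarity(profile, vec.reshape(1, -1))[0, 0]
        print(f"{title}: {score:.2f}")
```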
​Categorical Variable Encoding and One-Hot Encoding​
​(Ref: Q3b Sep 2024)​

1. Need for Categorical Variable Encoding

Machine learning algorithms are based on mathematical equations and can only operate on numerical data. They cannot directly process text-based categorical variables (e.g., 'Red',
​'Green', 'USA', 'India'). Therefore, to use categorical data in a model, we must first convert​
​these non-numeric categories into a numerical format that the algorithm can understand. This​
​conversion process is called categorical variable encoding. Without it, the model cannot be​
​trained.​
​2. One-Hot Encoding​
​●​ ​Definition:​​One-Hot Encoding is one of the most common​​and effective​
techniques for encoding nominal categorical variables (where there is no intrinsic
​order among categories). It works by creating​​new​​binary (0 or 1) columns​​for​
​each unique category present in the original feature.​
​ ​ ​Process:​

​1.​ ​Identify all unique categories in the feature column.​
​2.​ ​Create a new binary column for each unique category.​
​3.​ ​For each data row, place a '1' in the column corresponding to its original​
​category, and place a '0' in all other new columns.​
​●​ ​Advantage:​​This method avoids imposing an artificial​​order on the categories. If​
​we were to encode 'Red' as 1, 'Green' as 2, and 'Blue' as 3 (Label Encoding), the​
​model might incorrectly assume that Green is "greater than" Red, which is not​
​true. One-Hot Encoding prevents this.​
​●​ ​Example:​
​Imagine a feature called 'City' in a dataset.​
​Original Data:​
​| ID | City |​
​|:--:|:---:|​
​| 1 | Pune |​
​| 2 | Mumbai |​
​| 3 | Delhi |​
​| 4 | Pune |​
​The feature 'City' has three unique categories: 'Pune', 'Mumbai', and 'Delhi'.​
​After One-Hot Encoding:​
​Three new binary columns are created: City_Pune, City_Mumbai, and City_Delhi.​

| ID | City_Pune | City_Mumbai | City_Delhi |
|:--:|:--:|:--:|:--:|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |

This new numerical representation can now be used by machine learning algorithms.
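An illustrative pandas sketch of one-hot encoding the same 'City' feature.

```python
# Sketch: one-hot encoding the 'City' feature from the example above with pandas.
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "City": ["Pune", "Mumbai", "Delhi", "Pune"]})

encoded = pd.get_dummies(df, columns=["City"], dtype=int)
print(encoded)
#    ID  City_Delhi  City_Mumbai  City_Pune
# 0   1           0            0          1
# ...
```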
​Statistical Methods to Describe Data​
​(Ref: Q4a Sep 2024)​

The statistical methods used to summarize and describe the main features and nature
​of a dataset are known as​​Descriptive Statistics​​.​​They provide simple summaries​
​about the sample and the measures, forming the basis of virtually every quantitative​
​analysis of data. They do not allow us to make conclusions or inferences about a​
​larger population beyond the data we have analyzed.​

​The key descriptive statistical methods are categorized as follows:​

1. Measures of Central Tendency:


​These measures describe the center or typical value of a dataset.​
​●​ ​Mean:​​The arithmetic average of all data points. It​​is sensitive to outliers.​
○ Formula: μ = ΣX / N
​●​ ​Median:​​The middle value of a dataset when it is sorted​​in ascending or​
descending order. It is robust to outliers.
● Mode: The most frequently occurring value in the dataset. A dataset can have

​one mode (unimodal), two modes (bimodal), or more (multimodal).​
2. Measures of Dispersion (or Variability):
​These measures describe the spread or how much the data points differ from each other and​
​from the central tendency.​
​●​ ​Range:​​The difference between the maximum and minimum​​values in the dataset.​
I​t is simple but highly affected by outliers.​
● Variance (σ²): The average of the squared differences of each data point from

​the Mean. It measures how far the data is spread out from its average value.​
○ Formula: σ² = Σ(X − μ)² / N
​●​ ​Standard Deviation (​​σ​):​​The square root of the variance.​​It is expressed in the​
​same units as the data, making it more interpretable than variance as a measure​
​of spread.​
​●​ ​Interquartile Range (IQR):​​The range between the first​​quartile (Q1, the 25th​
​percentile) and the third quartile (Q3, the 75th percentile). It represents the​
​spread of the middle 50% of the data and is robust to outliers.​
​3. Measures of Shape:​
​These describe the shape of the data distribution.​
​●​ ​Skewness:​​Measures the asymmetry of the probability​​distribution of a variable.​
A distribution can be right-skewed (positive skew), left-skewed (negative skew), or symmetric (zero skew).
● Kurtosis: Measures the "tailedness" of the distribution, or how heavy its tails are compared to a normal distribution. It indicates the presence of outliers.
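An illustrative pandas sketch computing these descriptive statistics for the AGE data used earlier; population variance (divide by N) is used to match the earlier worked example.

```python
# Sketch: descriptive statistics for the AGE data from the Sep 2023 problem.
import pandas as pd

age = pd.Series([18, 22, 25, 42, 28, 43, 33, 35, 56, 28])

print("Mean:", age.mean())                        # 33.0
print("Median:", age.median())                    # 30.5
print("Mode:", age.mode().tolist())               # [28]
print("Range:", age.max() - age.min())            # 38
print("Variance (population):", age.var(ddof=0))  # 117.4
print("Std dev (population):", age.std(ddof=0))   # ~10.835
print("IQR:", age.quantile(0.75) - age.quantile(0.25))
print("Skewness:", round(age.skew(), 3))
print("Kurtosis:", round(age.kurt(), 3))
```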
​Features of Multidimensional Scaling (MDS)​
​(Ref: Q4b Sep 2024)​

Definition:
​Multidimensional Scaling (MDS) is a dimensionality reduction and data visualization technique.​
​Its primary objective is to represent the relationships between a set of items as distances in a​
​low-dimensional space (typically 2D or 3D), such that the geometric distances between points​
​in the low-dimensional "map" correspond as closely as possible to the known dissimilarities​
​between the items.​
​Key Features of MDS:​
​1.​ ​Input is a Dissimilarity Matrix:​​Unlike PCA, which​​takes a feature matrix as​
i​nput, the primary input for MDS is a​​distance or​​dissimilarity matrix​​. This is an​
​N×N matrix where N is the number of items, and each entry (i,j) represents the​
​measured dissimilarity between item i and item j. This dissimilarity can be derived​
​from Euclidean distance, correlation distance, or even subjective human ratings​
​of dissimilarity.​
2. Preservation of Distances: The core goal of MDS is to find a low-dimensional
​configuration of points whose pairwise distances are as close as possible to the​
​original dissimilarities. It aims to create a faithful geometric representation of the​
​dissimilarity data.​
​3.​ ​Primary Use is Visualization:​​The most common application​​of MDS is to​
​visualize the underlying structure of data. By plotting the items as points in a 2D​
​or 3D space, one can visually inspect their relationships, identify natural clusters,​
​discover patterns, and understand the dimensions along which the items seem to​
​vary.​
​4.​ ​Types of MDS:​
​○​ ​Classical MDS (or Metric MDS):​​This type assumes the​​input dissimilarities​
​are actual distances in a high-dimensional Euclidean space. It aims to​
​preserve these distances as accurately as possible. It is mathematically​
​equivalent to PCA when the input is a Euclidean distance matrix.​
​○​ ​Non-metric MDS:​​This is a more flexible version that​​assumes only the​​rank​
​order​​of the dissimilarities is important. It tries​​to arrange the points in the​
​low-dimensional map such that the order of distances is preserved (i.e., if​
i​tem A is more dissimilar to B than to C, the distance between points A and B​
​on the map should be greater than the distance between A and C). This is​
​useful for psychological or survey data where dissimilarities are subjective.​
5. Axes are Not Directly Interpretable: A key difference from PCA is that the axes
​in an MDS plot do not have a direct, interpretable meaning. They are arbitrary​
​dimensions chosen to best represent the pairwise distances. The focus of​
​interpretation is on the​​relative positions of the​​points and the clusters they​
​form​​, not their specific coordinates on the axes.​
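An illustrative scikit-learn sketch of metric MDS on a small precomputed dissimilarity matrix; the matrix values are made up for demonstration.

```python
# Sketch: metric MDS on a precomputed dissimilarity matrix.
import numpy as np
from sklearn.manifold import MDS

# Pairwise dissimilarities between four items (symmetric, zero diagonal)
D = np.array([[0.0, 2.0, 6.0, 7.0],
              [2.0, 0.0, 5.0, 6.5],
              [6.0, 5.0, 0.0, 1.5],
              [7.0, 6.5, 1.5, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

print(np.round(coords, 2))   # 2D "map" whose distances approximate D
print("Stress:", round(mds.stress_, 3))
```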
