
Fundamental Concepts of Machine Learning

and Statistics

September 14, 2025

Contents

1 The Core Idea Behind Machine Learning
  1.1 Difference Between Hard-coding and Machine Learning
2 Classification and Regression
  2.1 Classification
  2.2 Regression
3 ROC and AUC
  3.1 ROC (Receiver Operating Characteristic)
  3.2 AUC (Area Under the Curve)
4 SSR and R²
  4.1 SSR (Sum of Squared Residuals)
  4.2 R Squared — Coefficient of Determination
5 Strengths and Weaknesses of Histograms for Probability Estimation
  5.1 Strengths of Histograms
  5.2 Weaknesses of Histograms
6 Calculating Probability Using Histograms
  6.1 How to Calculate the Probability of a Range of Values
  6.2 Assumptions for Histogram-based Probability Estimation
7 Consequences of Modeling Variables with Inappropriate Distributions
  7.1 Modeling a Discrete Variable with a Continuous Distribution
  7.2 Modeling a Continuous Variable with a Discrete Distribution
8 Potential Pitfalls of R² When Evaluating Regression
9 Distinguishing Valid and Invalid Probability Distributions
  9.1 Valid Probability Distribution
  9.2 Invalid Probability Distribution
10 Importance of Total Area Under Probability Distribution Curve
11 Modeling Variables in Both Discrete and Continuous Distributions
  11.1 Scenarios
12 How SSR Reflects Model Performance and Why Squaring Residuals is Useful
  12.1 How SSR Reflects Model Performance
  12.2 Why Squaring the Residual is Useful
1 The Core Idea Behind Machine Learning
Added by Sanjoy:

The core idea behind machine learning is to enable computer systems to learn from data and improve their performance on tasks without being explicitly programmed for every scenario. Instead of providing a rigid set of instructions, machine learning models are trained on large datasets, allowing them to identify patterns, relationships, and trends within the data. This learning process then enables them to make predictions, classify information, or make decisions on new, unseen data.

Added by Souvik:

They both involve writing code. They both require data. But they operate on fundamentally different principles. One is about giving machines exact instructions. The other is about teaching them to recognize patterns on their own.

As AI and machine learning become more prominent in the tech landscape, there's a growing anxiety that these technologies might replace traditional programming — and the programmers behind it. But the reality is far more collaborative than competitive. Rather than replacing classical programming, AI and ML are here to augment it, offering new ways to solve problems that are too complex or dynamic for hard-coded logic alone.

1.1 Difference Between Hard-coding and Machine Learning

• Traditional Coding (Hard-coding): Involves writing explicit instructions that the computer follows step by step. The programmer defines all the rules, conditions, and logic to solve a specific problem.
  – Same input always yields the same output.
  – You can trace every outcome back to the code.
  – Easy to debug, test, and verify.
  For example, a hard-coded spam filter might have rules like "if email contains 'free money', mark as spam."
• Machine Learning: Involves providing the system with data and letting it learn the patterns.
  – Outputs are based on likelihood, not certainty.
  – The system learns from examples, not from rules.
  – Same input might produce different outputs over time (especially if the model is updated or retrained).

For the spam filter example, a machine learning approach would involve feeding the system thousands of examples of spam and non-spam emails, and letting it learn the characteristics that distinguish them.
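A minimal sketch of the contrast, assuming scikit-learn is available; the rule, the tiny training set, and the helper name rule_based_is_spam are hypothetical illustrations, not a production filter:

# Hard-coded approach: the programmer writes the rule explicitly.
def rule_based_is_spam(email: str) -> bool:
    return "free money" in email.lower()

# Machine learning approach: a model learns the distinguishing patterns
# from labeled examples instead of from hand-written rules.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["free money now", "meeting at 10am", "win free cash", "project update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(emails), labels)

print(rule_based_is_spam("Claim your FREE MONEY"))               # True
print(model.predict(vectorizer.transform(["free cash offer"])))  # likely [1]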

2 Classification and Regression


2.1 Classification
Classification is a supervised learning task where the goal is to predict a discrete class label.

Example: Predicting whether an email is spam or not spam is a binary classification problem. Another example would be classifying images of animals into categories like "cat," "dog," or "bird."

2.2 Regression
Regression is a supervised learning task where the goal is to predict a continuous value.

Example: Predicting the price of a house based on features like its size, number of bedrooms, and location is a regression problem. Another example would be predicting the temperature tomorrow based on historical weather data.
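A small sketch of the two task types, assuming scikit-learn; the toy features, labels, and prices below are hypothetical:

from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (1 = spam, 0 = not spam).
X_cls, y_cls = [[0.1], [0.9], [0.8], [0.2]], [0, 1, 1, 0]
print(LogisticRegression().fit(X_cls, y_cls).predict([[0.7]]))  # a class label

# Regression: predict a continuous value (price from size in square meters).
X_reg, y_reg = [[50.0], [80.0], [120.0]], [150000.0, 240000.0, 360000.0]
print(LinearRegression().fit(X_reg, y_reg).predict([[100.0]]))  # a number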

Figure 1: Classification vs regression

3 ROC and AUC

The AUC-ROC curve is a graph used to check how well a binary classification model works. It helps us understand how well the model separates positive cases (e.g., people with a disease) from negative cases (e.g., people without the disease) at different threshold levels. It shows how good the model is at telling the difference between the two classes by plotting:

• True Positive Rate (TPR): how often the model correctly predicts the positive cases; also known as Sensitivity or Recall.

• False Positive Rate (FPR): how often the model incorrectly predicts a negative case as positive.

• Specificity: measures the proportion of actual negatives that the model correctly identifies. It is calculated as 1 − FPR.

3.1 ROC (Receiver Operating Characteristic)
ROC is a graphical plot that illustrates the diagnostic ability of a binary
classifier system as its discrimination threshold is varied. The ROC curve is
created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings.

• True Positive Rate: $\mathrm{TPR} = \frac{TP}{TP + FN}$

• False Positive Rate: $\mathrm{FPR} = \frac{FP}{FP + TN}$

where TP is true positive, FN is false negative, FP is false positive, and TN is true negative.

3.2 AUC (Area Under the Curve)


AUC represents the area under the ROC curve. It provides a single measure
of a classifier’s performance across all possible thresholds. An AUC of 1
indicates a perfect classifier, while an AUC of 0.5 suggests no discriminative
ability (equivalent to random guessing). The AUC can be interpreted as the
probability that a randomly chosen positive example is ranked higher than
a randomly chosen negative example.
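A minimal sketch of computing both quantities with scikit-learn; the labels and scores below are hypothetical:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])                # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # model's positive-class scores

# TPR and FPR at every threshold, i.e. the points of the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Single-number summary across all thresholds.
print(roc_auc_score(y_true, y_score))  # 1.0 = perfect, 0.5 = random guessing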

4 SSR and R²
4.1 SSR (Sum of Squared Residuals)
SSR (Sum of Squared Residuals) is the sum of the squared differences between
the observed values and the predicted values. It measures the amount of
variation in the dependent variable that is not explained by the regression
model.
$$\mathrm{SSR} = \sum_{i=1}^{n} (\mathrm{Observed}_i - \mathrm{Predicted}_i)^2$$

4.2 R Squared — Coefficient of Determination


R-squared is a statistical measure used in regression analysis. In regression, we generally deal with dependent and independent variables: a change in an independent variable is likely to cause a change in the dependent variable.

The R-squared coefficient represents the proportion of variation in the dependent variable (y) that is accounted for by the regression line, relative to the variation around the mean of y. Essentially, it measures how much more accurately the regression line predicts each point's value compared to simply using the average value of y.
R² is calculated as:

$$R^2 = 1 - \frac{\mathrm{SSR}(\text{fitted line})}{\mathrm{SSR}(\text{mean})}$$

where $\mathrm{SSR}(\text{fitted line})$ is the sum of squared residuals around the regression line (as defined above), and $\mathrm{SSR}(\text{mean})$ is the total sum of squares around the mean (the sum of squared differences between the observed values and their mean):

$$\mathrm{SSR}(\text{mean}) = \sum_{i=1}^{n} (\mathrm{Observed}_i - \bar{y})^2$$
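Both quantities are straightforward to compute; a minimal sketch with hypothetical observations and predictions:

import numpy as np

y = np.array([2.0, 4.0, 5.0, 7.0])       # observed values
y_hat = np.array([2.5, 3.5, 5.5, 6.5])   # predictions from some fitted line

ssr_fit = ((y - y_hat) ** 2).sum()       # SSR around the fitted line
ssr_mean = ((y - y.mean()) ** 2).sum()   # SSR around the mean of y
print(1 - ssr_fit / ssr_mean)            # R^2, about 0.92 here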

5 Strengths and Weaknesses of Histograms for Probability Estimation
5.1 Strengths of Histograms
• Simple and intuitive: Histograms are easy to understand and interpret.

• Non-parametric: They don't assume any specific distribution for the data.

• Flexible: Can represent various shapes of distributions.

• Visual representation: Provides a clear visual picture of the data distribution.

5.2 Weaknesses of Histograms

• Bin dependency: The shape of the histogram depends on the choice of bin width and starting point.

• Discontinuity: The resulting estimate is discontinuous, which may not reflect the true continuous nature of the underlying distribution.

• Sensitivity to sample size: With small sample sizes, histograms can be noisy and unreliable.

• Inefficiency: They are not statistically efficient compared to other density estimation methods like kernel density estimation.

6 Calculating Probability Using Histograms

6.1 How to Calculate the Probability of a Range of Values
To calculate the probability of a range of values using a histogram:
1. Divide the data into bins (intervals) of equal width.
2. Count the number of observations in each bin.
3. Divide the count in each bin by the total number of observations to get
the probability for each bin.
4. To find the probability of a range of values, sum the probabilities of all
bins that fall within that range.
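The four steps above in code, as a minimal sketch with NumPy; the sample data and bin count are hypothetical, and for simplicity only bins lying entirely inside the range are summed:

import numpy as np

data = np.random.default_rng(0).normal(size=1000)  # hypothetical sample

counts, edges = np.histogram(data, bins=20)  # steps 1-2: bin and count
probs = counts / counts.sum()                # step 3: normalize to probabilities

# Step 4: P(a <= x <= b), summing the bins that fall within the range.
a, b = -1.0, 1.0
in_range = (edges[:-1] >= a) & (edges[1:] <= b)
print(probs[in_range].sum())  # near 0.68, slightly less since partial bins are skipped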

6.2 Assumptions for Histogram-based Probability Estimation
Assumptions that must hold true:

1. The data is independent and identically distributed (i.i.d.).
2. The bin width is appropriate for the data (not too wide to lose important details, not too narrow to result in noisy estimates).
3. The sample size is large enough to provide a reliable estimate of the underlying distribution.
4. The underlying distribution is relatively smooth within each bin.

7 Consequences of Modeling Variables with Inappropriate Distributions

7.1 Modeling a Discrete Variable with a Continuous Distribution
• May result in probabilities for values that the variable cannot actually take.

• Can lead to biased estimates and incorrect inferences.

• May produce unrealistic predictions (e.g., predicting 2.5 children when the actual variable must be an integer), as illustrated in the sketch below.

• Can result in inefficient estimates, as the model is trying to estimate parameters for a distribution that doesn't match the nature of the data.
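A hypothetical illustration of the first points: fitting a normal (continuous) model to count data yields "counts" that are non-integer and can even be negative:

import numpy as np

rng = np.random.default_rng(1)
children = rng.poisson(lam=2.0, size=500)  # discrete counts per family

# Fit a normal distribution by matching mean and standard deviation.
mu, sigma = children.mean(), children.std()

# Draws from the fitted model: values like 2.5 or -0.3 that no family can have.
print(rng.normal(mu, sigma, size=5))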

7.2 Modeling a Continuous Variable with a Discrete Distribution

• Loss of information due to discretization.

• Inability to capture the true variability of the continuous variable.

• May lead to biased estimates, especially if the discretization is coarse.

• Can result in less powerful statistical tests, as information is lost in the discretization process.

8 Potential Pitfalls of R² When Evaluating Regression
• R² always increases (or stays the same) when more predictors are added to the model, even if they are irrelevant. This can lead to overfitting (see the sketch after this list).

• R² does not indicate whether the regression model is adequate. A high R² does not necessarily mean the model is good.

• R² does not indicate whether the predictors are causally related to the response.

• R² does not indicate the correctness of the regression model's functional form.

• R² can be artificially inflated by outliers or high-leverage points.

• R² does not provide information about the prediction error of the model.
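A minimal sketch of the first pitfall on synthetic data, assuming scikit-learn; the single real predictor and the appended noise columns are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=(n, 1))            # one genuinely relevant predictor
y = 3.0 * x[:, 0] + rng.normal(size=n)

for k in (0, 5, 20):  # append k columns of pure noise
    X = np.hstack([x, rng.normal(size=(n, k))]) if k else x
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(f"{1 + k} predictors: R^2 = {r2:.3f}")  # R^2 only creeps upward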

9 Distinguishing Valid and Invalid Probability Distributions
9.1 Valid Probability Distribution

• All probabilities are non-negative: P(x) ≥ 0 for all x.

• The sum (for discrete distributions) or integral (for continuous distributions) of all probabilities equals 1.

• For discrete distributions, each probability is between 0 and 1: 0 ≤ P(x) ≤ 1 for all x.

• For continuous distributions, the probability density function is non-negative.

9.2 Invalid Probability Distribution

• Contains negative probabilities.

• The sum or integral of probabilities is not equal to 1.

• For discrete distributions, probabilities greater than 1.

• For continuous distributions, negative values of the probability density function.
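These rules translate directly into a validity check for the discrete case; a minimal sketch, where is_valid_discrete is a hypothetical helper:

def is_valid_discrete(probs, tol=1e-9):
    """Check non-negativity, the 0-1 bound, and that the total is 1."""
    return all(0.0 <= p <= 1.0 for p in probs) and abs(sum(probs) - 1.0) <= tol

print(is_valid_discrete([0.2, 0.3, 0.5]))   # True: valid distribution
print(is_valid_discrete([0.6, 0.6, -0.2]))  # False: negative probability
print(is_valid_discrete([0.4, 0.4]))        # False: total is 0.8, not 1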

10 Importance of Total Area Under Probability Distribution Curve

It's important for the total area under a probability distribution curve to be 1 because:

1. It represents the total probability of all possible outcomes, which must be 1 (or 100%).

2. It ensures that the distribution is properly normalized and can be used to calculate valid probabilities.

3. It allows for meaningful interpretation of the area under the curve as probabilities.

If the total area is not 1, it could indicate:

1. The distribution is not properly normalized.

2. There might be an error in the calculation or specification of the distribution.

3. The function might not be a valid probability distribution.

4. For empirical distributions, it might indicate that the sample is not representative of the population or that there are missing data points.
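Normalization can be checked numerically; a minimal sketch assuming SciPy, using the standard normal density and a deliberately unnormalized variant:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

area, _ = quad(norm.pdf, -np.inf, np.inf)
print(area)  # about 1.0: a properly normalized density

bad_density = lambda x: 2.0 * norm.pdf(x)     # not a valid density
print(quad(bad_density, -np.inf, np.inf)[0])  # about 2.0: total area is not 1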

11 Modeling Variables in Both Discrete and Continuous Distributions

Yes, a variable can sometimes be modeled in both discrete and continuous distributions, depending on the context and the level of precision required.

11.1 Scenarios

• Time: Time can be modeled as discrete (e.g., number of hours) or continuous (e.g., exact time in seconds).

• Age: Age can be modeled as discrete (number of years) or continuous (exact age including fractions of years).

• Money: Monetary values can be modeled as discrete (number of cents) or continuous (allowing for any real number).

• Measurements: Physical measurements like height or weight can be discretized (e.g., to the nearest centimeter) or treated as continuous.

• Counts: For large counts, a continuous approximation (like using a normal distribution to approximate a binomial distribution) can be more convenient (see the sketch at the end of this section).

The choice between discrete and continuous modeling often depends on the nature of the data, the precision required, and the mathematical convenience for analysis.
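A minimal sketch of the normal approximation mentioned in the Counts scenario, assuming SciPy; the values of n, p, and the query point 320 are hypothetical:

from scipy.stats import binom, norm

n, p = 1000, 0.3
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5

# P(X <= 320) under the exact discrete model and the continuous approximation.
print(binom.cdf(320, n, p))                  # exact binomial probability
print(norm.cdf(320.5, loc=mu, scale=sigma))  # normal approximation (continuity-corrected)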

12 How SSR Reflects Model Performance and Why Squaring Residuals is Useful
The sum of squared residuals (SSR) is a measure of the discrepancy between
the data and an estimation model. It is calculated by summing the squares
of the differences between the observed values and the predicted values:
$$\mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.

12.1 How SSR Reflects Model Performance

SSR reflects model performance by measuring the total amount of variation in the dependent variable that is not explained by the model. A lower SSR indicates a better fit, as it means the model's predictions are closer to the actual values. SSR is often used as a criterion for model selection and parameter estimation (e.g., in least squares regression).

12.2 Why Squaring the Residual is Useful


Squaring the residuals is useful for several reasons:

1. It makes every term non-negative, preventing positive and negative residuals from canceling each other out (illustrated in the sketch below).

2. It gives more weight to larger errors, making the model more sensitive to outliers and large discrepancies.

3. The squared function is differentiable, which makes it easier to find the minimum using calculus-based optimization methods.

4. Minimizing the sum of squared residuals is equivalent to maximizing the likelihood under the assumption of normally distributed errors.

5. The resulting estimator has desirable statistical properties, such as being the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions.
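A minimal sketch of point 1 with hypothetical values: a model whose large errors cancel when residuals are merely summed, but are exposed by SSR:

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])      # observed
y_hat = np.array([3.0, 0.0, 5.0, 2.0])  # badly off, but errors cancel

residuals = y - y_hat                   # [-2, 2, -2, 2]
print(residuals.sum())                  # 0.0  -> raw sum hides the poor fit
print((residuals ** 2).sum())           # 16.0 -> SSR exposes it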
