Fundamental Concepts of Machine Learning and Statistics
September 14, 2025
Contents

1 The Core Idea Behind Machine Learning
1.1 Difference Between Hard-coding and Machine Learning
2 Classification and Regression
2.1 Classification
2.2 Regression
3 ROC and AUC
3.1 ROC (Receiver Operating Characteristic)
3.2 AUC (Area Under the Curve)
4 SSR and R²
4.1 SSR (Sum of Squared Residuals)
4.2 R Squared — Coefficient of Determination
5 Strengths and Weaknesses of Histograms for Probability Estimation
5.1 Strengths of Histograms
5.2 Weaknesses of Histograms
6 Calculating Probability Using Histograms
6.1 How to Calculate the Probability of a Range of Values
6.2 Assumptions for Histogram-based Probability Estimation
7 Consequences of Modeling Variables with Inappropriate Distributions
7.1 Modeling a Discrete Variable with a Continuous Distribution
7.2 Modeling a Continuous Variable with a Discrete Distribution
8 Potential Pitfalls of R² When Evaluating Regression
9 Distinguishing Valid and Invalid Probability Distributions
9.1 Valid Probability Distribution
9.2 Invalid Probability Distribution
10 Importance of Total Area Under Probability Distribution Curve
11 Modeling Variables in Both Discrete and Continuous Distributions
11.1 Scenarios
12 How SSR Reflects Model Performance and Why Squaring Residuals is Useful
12.1 How SSR Reflects Model Performance
12.2 Why Squaring the Residual is Useful
1 The Core Idea Behind Machine Learning
Added by Sanjoy:
The core idea behind machine learning is to enable computer systems
to learn from data and improve their performance on tasks without being
explicitly programmed for every scenario. Instead of providing a rigid set of
instructions, machine learning models are trained on large datasets, allowing
them to identify patterns, relationships, and trends within the data. This
learning process then enables them to make predictions, classify information,
or make decisions on new, unseen data.
Added by Souvik:
Traditional programming and machine learning both involve writing code.
They both require data. But they operate on fundamentally different
principles. One is about giving machines exact instructions. The other is
about teaching them to recognize patterns on their own.
As AI and machine learning become more prominent in the tech landscape,
there's a growing anxiety that these technologies might replace traditional
programming, and the programmers behind it. But the reality is far more
collaborative than competitive. Rather than replacing classical programming,
AI and ML are here to augment it, offering new ways to solve problems that
are too complex or dynamic for hard-coded logic alone.
1.1 Difference Between Hard-coding and Machine Learning
• Traditional Coding (Hard-coding): Involves writing explicit instructions
that the computer follows step by step. The programmer defines all the
rules, conditions, and logic to solve a specific problem.
– Same input always yields the same output.
– You can trace every outcome back to the code.
– Easy to debug, test, and verify.
For example, a hard-coded spam filter might have rules like "if email
contains 'free money', mark as spam."
• Machine Learning: Involves providing the system with data and
letting it learn the patterns.
– Outputs are based on likelihood, not certainty.
– The system learns from examples, not from rules.
– Same input might produce different outputs over time (especially
if the model is updated or retrained).
For the spam filter example, a machine learning approach would involve
feeding the system thousands of examples of spam and non-spam emails, and
letting it learn the characteristics that distinguish them.
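To make the contrast concrete, here is a minimal Python sketch of both
approaches. The rule list, the tiny training sets, and the test emails are
all hypothetical, chosen only to illustrate the two styles; a real learned
filter would use probabilities (e.g., naive Bayes) rather than raw counts.

from collections import Counter

# Hard-coded filter: the programmer writes every rule explicitly.
RULES = ["free money", "act now", "winner"]  # hypothetical rule list

def is_spam_hardcoded(email: str) -> bool:
    # Deterministic: the same input always triggers the same rules.
    text = email.lower()
    return any(rule in text for rule in RULES)

# Learned filter: derive word/spam associations from labeled examples.
spam_examples = ["free money now", "claim your free prize"]  # hypothetical
ham_examples = ["meeting at noon", "project status update"]  # hypothetical

spam_counts = Counter(w for e in spam_examples for w in e.split())
ham_counts = Counter(w for e in ham_examples for w in e.split())

def is_spam_learned(email: str) -> bool:
    # Score each word by how often it appeared in spam minus ham;
    # the behavior comes from the data, not from hand-written rules.
    words = email.lower().split()
    return sum(spam_counts[w] - ham_counts[w] for w in words) > 0

print(is_spam_hardcoded("You won FREE MONEY"))  # True: matches a rule
print(is_spam_learned("claim your prize"))      # True: learned association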
2 Classification and Regression
2.1 Classification
Classification is a supervised learning task where the goal is to predict a
discrete class label.
Example: Predicting whether an email is spam or not spam is a binary
classification problem. Another example would be classifying images of
animals into categories like "cat," "dog," or "bird."
2.2 Regression
Regression is a supervised learning task where the goal is to predict a
continuous value.
Example: Predicting the price of a house based on features like its size,
number of bedrooms, and location is a regression problem. Another example
would be predicting the temperature tomorrow based on historical weather
data.
Figure 1: Classification vs regression
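As a minimal, runnable illustration of the two task types, the sketch below
uses made-up data and deliberately simple methods (a nearest-neighbour
classifier and a closed-form least-squares line); a real project would reach
for a proper library instead.

# Classification predicts a discrete label; regression predicts a number.

# --- Classification: 1-nearest-neighbour on (size_kb, num_links) ---
emails = [((1.0, 0.0), "ham"), ((5.0, 9.0), "spam"), ((1.2, 1.0), "ham")]

def classify(x):
    # Return the label of the closest training point (discrete output).
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(emails, key=lambda e: dist(e[0], x))[1]

# --- Regression: least-squares line for house price vs size ---
sizes = [50.0, 80.0, 120.0]      # square metres (hypothetical)
prices = [150.0, 230.0, 350.0]   # thousands (hypothetical)

n = len(sizes)
mean_x, mean_y = sum(sizes) / n, sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict_price(size):
    # Continuous output: any real number on the fitted line.
    return intercept + slope * size

print(classify((4.5, 8.0)))    # a class label: "spam"
print(predict_price(100.0))    # a number: roughly 291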
3 ROC and AUC
The ROC curve, together with the area under it (AUC), is a graph used to
check how well a binary classification model works. It helps us understand
how well the model separates the positive cases (such as people with a
disease) from the negative cases (people without the disease) at different
threshold levels. It shows how good the model is at telling the difference
between the two classes by plotting:
• True Positive Rate (TPR): how often the model correctly predicts the
positive cases; also known as Sensitivity or Recall.
• False Positive Rate (FPR): how often the model incorrectly predicts a
negative case as positive.
• Specificity: the proportion of actual negatives that the model correctly
identifies; it equals 1 − FPR.
3.1 ROC (Receiver Operating Characteristic)
ROC is a graphical plot that illustrates the diagnostic ability of a binary
classifier system as its discrimination threshold is varied. The ROC curve is
created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings.
• True Positive Rate: $\mathrm{TPR} = \frac{TP}{TP + FN}$
• False Positive Rate: $\mathrm{FPR} = \frac{FP}{FP + TN}$
where TP is true positive, FN is false negative, FP is false positive, and
TN is true negative.
3.2 AUC (Area Under the Curve)
AUC represents the area under the ROC curve. It provides a single measure
of a classifier’s performance across all possible thresholds. An AUC of 1
indicates a perfect classifier, while an AUC of 0.5 suggests no discriminative
ability (equivalent to random guessing). The AUC can be interpreted as the
probability that a randomly chosen positive example is ranked higher than
a randomly chosen negative example.
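The definitions above translate directly into code. The sketch below traces
ROC points over all thresholds and computes AUC via its ranking
interpretation; the labels and scores are made up for the example.

# One (FPR, TPR) point per threshold: predict positive when score >= t.
labels = [1, 1, 1, 0, 1, 0, 0, 0]                    # 1 = positive
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.3, 0.1]   # model scores

P = sum(labels)
N = len(labels) - P

for t in sorted(set(scores), reverse=True):
    tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
    fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t)
    print(f"threshold {t:.2f}: TPR = {tp / P:.2f}, FPR = {fp / N:.2f}")

# AUC as the probability that a random positive outranks a random
# negative (ties count half).
pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]
auc = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg) \
      / (len(pos) * len(neg))
print(f"AUC = {auc:.3f}")    # 0.938 for this toy data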
4 SSR and R²
4.1 SSR (Sum of Squared Residuals)
SSR (Sum of Squared Residuals) is the sum of the squared differences between
the observed values and the predicted values. It measures the amount of
variation in the dependent variable that is not explained by the regression
model.
$$\mathrm{SSR} = \sum_{i=1}^{n} (\mathrm{Observed}_i - \mathrm{Predicted}_i)^2$$
4.2 R Squared — Coefficient of Determination
R-squared is a statistical measure used in regression analysis. In
regression, we generally deal with dependent and independent variables: a
change in an independent variable is likely to cause a change in the
dependent variable.
The R-squared coefficient represents the proportion of variation in the
dependent variable (y) that is accounted for by the regression line,
relative to the variation around the mean of y. Essentially, it measures how
much better the regression line predicts each point's value than simply
using the average value of y.
R² is calculated as:

$$R^2 = 1 - \frac{\mathrm{SSR}(\mathrm{fitted\ line})}{\mathrm{SSR}(\mathrm{mean})}$$

where SSR(fitted line) is the sum of squared residuals around the regression
line (as defined in Section 4.1), and SSR(mean) is the sum of squared
differences between the observed values and their mean:

$$\mathrm{SSR}(\mathrm{mean}) = \sum_{i=1}^{n} (\mathrm{Observed}_i - \bar{y})^2$$
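A short numeric sketch of these two quantities; the observed and predicted
values are made up:

# Compute SSR around the fit, SSR around the mean, and R^2.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.3, 8.9]   # e.g. from some fitted line

mean_y = sum(observed) / len(observed)
ssr_fit = sum((o - p) ** 2 for o, p in zip(observed, predicted))
ssr_mean = sum((o - mean_y) ** 2 for o in observed)

r_squared = 1 - ssr_fit / ssr_mean
print(f"SSR(fit)  = {ssr_fit:.2f}")    # 0.15
print(f"SSR(mean) = {ssr_mean:.2f}")   # 20.00
print(f"R^2 = {r_squared:.4f}")        # 0.9925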
5 Strengths and Weaknesses of Histograms for Probability Estimation
5.1 Strengths of Histograms
• Simple and intuitive: Histograms are easy to understand and interpret.
• Non-parametric: They don’t assume any specific distribution for the
data.
• Flexible: Can represent various shapes of distributions.
• Visual representation: Provides a clear visual picture of the data
distribution.
5.2 Weaknesses of Histograms
• Bin dependency: The shape of the histogram depends on the choice
of bin width and starting point.
• Discontinuity: The resulting estimate is discontinuous, which may
not reflect the true continuous nature of the underlying distribution.
• Sensitivity to sample size: With small sample sizes, histograms can
be noisy and unreliable.
• Inefficiency: They are not statistically efficient compared to other
density estimation methods like kernel density estimation.
6 Calculating Probability Using Histograms
6.1 How to Calculate the Probability of a Range of Values
To calculate the probability of a range of values using a histogram:
1. Divide the data into bins (intervals) of equal width.
2. Count the number of observations in each bin.
3. Divide the count in each bin by the total number of observations to get
the probability for each bin.
4. To find the probability of a range of values, sum the probabilities of all
bins that fall within that range (see the sketch after these steps).
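A minimal sketch of the four steps; the sample, bin width, and query range
are made up:

from collections import Counter

# Step 1-2: count observations per bin of equal width.
data = [1.2, 1.9, 2.3, 2.7, 3.1, 3.4, 3.8, 4.5, 4.9, 5.2]
bin_width = 1.0
lo = 1.0   # left edge of the first bin
counts = Counter(int((x - lo) // bin_width) for x in data)

# Step 3: convert counts to per-bin probabilities.
probs = {b: c / len(data) for b, c in counts.items()}

# Step 4: probability of a range = sum over the bins it covers.
def prob_range(a, b):
    first = int((a - lo) // bin_width)
    last = int((b - lo) // bin_width)   # a and b assumed to be bin edges
    return sum(probs.get(k, 0.0) for k in range(first, last))

print(prob_range(2.0, 4.0))   # 0.5: five of ten points lie in [2, 4)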
6.2 Assumptions for Histogram-based Probability Estimation
Assumptions that must hold true:
1. The data is independent and identically distributed (i.i.d.).
2. The bin width is appropriate for the data (not so wide that important
details are lost, not so narrow that the estimates become noisy).
3. The sample size is large enough to provide a reliable estimate of the
underlying distribution.
4. The underlying distribution is relatively smooth within each bin.
7 Consequences of Modeling Variables with Inappropriate Distributions
7.1 Modeling a Discrete Variable with a Continuous Distribution
• May result in probabilities for values that the variable cannot actually
take.
• Can lead to biased estimates and incorrect inferences.
• May produce unrealistic predictions (e.g., predicting 2.5 children when
the actual variable must be an integer).
• Can result in inefficient estimates, as the model is trying to estimate
parameters for a distribution that doesn’t match the nature of the data.
7.2 Modeling a Continuous Variable with a Discrete Distribution
• Loss of information due to discretization.
• Inability to capture the true variability of the continuous variable.
• May lead to biased estimates, especially if the discretization is coarse.
• Can result in less powerful statistical tests, as information is lost in the
discretization process.
8 Potential Pitfalls of R² When Evaluating Regression
• R² always increases (or stays the same) when more predictors are added
to the model, even if they are irrelevant. This can lead to overfitting
(see the sketch after this list).
• R² does not indicate whether the regression model is adequate. A high
R² does not necessarily mean the model is good.
• R² does not indicate whether the predictors are causally related to the
response.
• R² does not indicate the correctness of the regression model’s functional
form.
• R² can be artificially inflated by outliers or high-leverage points.
• R² does not provide information about the prediction error of the
model.
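The first pitfall can be demonstrated directly: adding pure-noise predictors
to a least-squares fit never lowers R². A small numpy sketch, with all data
randomly generated for illustration:

import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)   # y truly depends on x only

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

X = x.reshape(-1, 1)
print(f"1 predictor : R^2 = {r_squared(X, y):.4f}")
for k in range(5):
    X = np.column_stack([X, rng.normal(size=n)])   # pure noise column
    print(f"{k + 2} predictors: R^2 = {r_squared(X, y):.4f}")

Each noise column gives the fit extra freedom, so R² creeps upward even
though the model has not genuinely improved; adjusted R² or held-out
evaluation guards against this.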
9 Distinguishing Valid and Invalid Probability Distributions
9.1 Valid Probability Distribution
• All probabilities are non-negative: P(x) ≥ 0 for all x.
• The sum (for discrete distributions) or integral (for continuous
distributions) of all probabilities equals 1.
• For discrete distributions, each probability is between 0 and 1:
0 ≤ P(x) ≤ 1 for all x.
• For continuous distributions, the probability density function is
non-negative.
9.2 Invalid Probability Distribution
• Contains negative probabilities.
• The sum or integral of probabilities is not equal to 1.
• For discrete distributions, probabilities greater than 1.
• For continuous distributions, negative values of the probability density
function.
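These criteria are mechanical enough to check in code. A minimal sketch for
the discrete case; the example distributions are made up:

# Valid iff every probability is in [0, 1] and they sum to 1.
def is_valid_discrete(probs, tol=1e-9):
    return (all(0.0 <= p <= 1.0 for p in probs)
            and abs(sum(probs) - 1.0) <= tol)

print(is_valid_discrete([0.2, 0.3, 0.5]))    # True: valid
print(is_valid_discrete([0.6, 0.6, -0.2]))   # False: negative probability
print(is_valid_discrete([0.4, 0.4]))         # False: sums to 0.8, not 1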
10 Importance of Total Area Under Probability Distribution Curve
It’s important for the total area under a probability distribution curve to be
1 because:
1. It represents the total probability of all possible outcomes, which must
be 1 (or 100%).
2. It ensures that the distribution is properly normalized and can be used
to calculate valid probabilities.
3. It allows for meaningful interpretation of the area under the curve as
probabilities.
If the total area is not 1, it could indicate:
1. The distribution is not properly normalized.
2. There might be an error in the calculation or specification of the
distribution.
3. The function might not be a valid probability distribution.
4. For empirical distributions, it might indicate that the sample is not
representative of the population or that there are missing data points.
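As an illustration of normalization: a non-negative function whose total
area is not 1 is not a valid density, but dividing by that area fixes it.
A numeric sketch, with the function and interval made up:

# Normalize f(x) = x(2 - x) into a valid density on [0, 2].
f = lambda x: x * (2 - x)   # non-negative on [0, 2], but area != 1

dx = 0.001
xs = [i * dx for i in range(int(2 / dx) + 1)]
area = sum(f(x) * dx for x in xs)        # numeric integral over [0, 2]
print(f"raw area = {area:.4f}")          # ~1.3333: not a valid density

pdf = lambda x: f(x) / area              # rescale so the area is 1
check = sum(pdf(x) * dx for x in xs)
print(f"normalized area = {check:.4f}")  # ~1.0000: now valid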
11 Modeling Variables in Both Discrete and Continuous Distributions
Yes, a variable can sometimes be modeled in both discrete and continuous
distributions, depending on the context and the level of precision required.
11.1 Scenarios
• Time: Time can be modeled as discrete (e.g., number of hours) or
continuous (e.g., exact time in seconds).
• Age: Age can be modeled as discrete (number of years) or continuous
(exact age including fractions of years).
• Money: Monetary values can be modeled as discrete (number of cents)
or continuous (allowing for any real number).
• Measurements: Physical measurements like height or weight can be
discretized (e.g., to the nearest centimeter) or treated as continuous.
• Counts: For large counts, a continuous approximation (like using a
normal distribution to approximate a binomial distribution) can be
more convenient.
The choice between discrete and continuous modeling often depends on
the nature of the data, the precision required, and the mathematical
convenience for analysis.
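The counts scenario can be checked numerically: for a large number of
trials, the normal density closely tracks the binomial probabilities. A
short sketch, with parameters chosen just for illustration:

import math

# Normal approximation to Binomial(n = 100, p = 0.5) near its mean.
n, p = 100, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

for k in [45, 50, 55]:
    # Discrete pmf vs continuous density at the same point.
    print(f"k = {k}: binomial {binom_pmf(k):.4f} vs normal {normal_pdf(k):.4f}")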
12 How SSR Reflects Model Performance and Why Squaring Residuals is Useful
The sum of squared residuals (SSR) is a measure of the discrepancy between
the data and an estimation model. It is calculated by summing the squares
of the differences between the observed values and the predicted values:
$$\mathrm{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value.
12.1 How SSR Reflects Model Performance
SSR reflects model performance by measuring the total amount of variation
in the dependent variable that is not explained by the model. A lower
SSR indicates a better fit, as it means the model’s predictions are closer to
the actual values. SSR is often used as a criterion for model selection and
parameter estimation (e.g., in least squares regression).
12.2 Why Squaring the Residual is Useful
Squaring the residuals is useful for several reasons:
1. It ensures all residuals are positive, preventing positive and negative
residuals from canceling each other out.
2. It gives more weight to larger errors, making the model more sensitive
to outliers and large discrepancies.
3. The squared function is differentiable, which makes it easier to find the
minimum using calculus-based optimization methods.
4. Minimizing the sum of squared residuals is equivalent to maximizing
the likelihood under the assumption of normally distributed errors.
5. The resulting estimator has desirable statistical properties, such as
being the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov
assumptions.
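A small sketch tying these points together: the least-squares line minimizes
SSR, the raw residuals of that line cancel to (near) zero, which is exactly
why they are squared, and any perturbed slope yields a larger SSR. The data
is made up:

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def ssr(a, b):
    # Sum of squared residuals for the line y = a + b*x.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

raw = sum(y - (intercept + slope * x) for x, y in zip(xs, ys))
print(f"sum of raw residuals = {raw:.6f}")   # ~0: they cancel out
print(f"SSR at the fit    = {ssr(intercept, slope):.4f}")
print(f"SSR, slope + 0.2  = {ssr(intercept, slope + 0.2):.4f}")  # larger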