Parametric
# Explanation:
# - Linear Regression (Parametric):
#   - Definition: Assumes a linear relationship between the input (X) and output (y).
#     The model learns the parameters of a straight line (slope and intercept).
#   - fit(): The model learns the optimal parameters by minimizing the sum of squared
#     errors between the predicted and actual values.
#   - predict(): Uses the learned parameters to make predictions on the input data.
# - Decision Tree (Non-parametric):
#   - Definition: A tree-like model that makes decisions based on the values of the input
#     features. It partitions the feature space into regions and assigns a prediction to
#     each region.
#   - max_depth: A hyperparameter that limits the depth of the tree to prevent overfitting.
#   - fit(): The model learns the tree structure by recursively partitioning the data based
#     on the features that best separate the target variable.
#   - predict(): Traverses the tree based on the input features to reach a leaf node, which
#     provides the prediction.
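The fit()/predict() workflow described in these comments can be sketched as follows. This is a minimal illustrative example on synthetic data, assuming scikit-learn's LinearRegression and DecisionTreeRegressor (the original code the comments refer to is not shown in this excerpt):
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature with a roughly linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.5 * X[:, 0] + 1.0 + rng.normal(0, 1, size=200)

# Parametric: learns a slope and an intercept
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print("Slope:", lin_reg.coef_[0], "Intercept:", lin_reg.intercept_)

# Non-parametric: learns a tree structure; max_depth limits overfitting
tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X, y)
print("Tree predictions:", tree.predict(X[:5]))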
2. Assumptions of Parametric Machine Learning Methods
Deeper Content:
o Definition: Assumptions are conditions that must be met for a parametric model to be valid and provide reliable results.
o Impact of Violations (Expanded):
   - Linearity:
      - Violation: The relationship between the independent and dependent variables is not linear.
      - Impact: The model will not accurately capture the relationship, leading to poor predictions.
      - Example: Trying to fit a straight line to a curved relationship (see the Python sketch after this list).
   - Independence of Errors (Residuals):
      - Violation: The errors (residuals) are correlated with each other. This often occurs in time series data.
      - Impact: Standard errors of the coefficients will be underestimated, leading to unreliable hypothesis tests and confidence intervals.
      - Example: In a time series, if one day's error is positive, the next day's error is also likely to be positive.
   - Homoscedasticity (Equal Variance of Errors):
      - Violation: The variance of the errors is not constant across all levels of the independent variables.
      - Impact: Standard errors will be unreliable, affecting hypothesis tests and confidence intervals. Predictions will be more precise in some ranges of the independent variable than in others.
      - Example: The variance of errors might be higher for higher values of income.
   - Normality of Errors (Residuals):
      - Violation: The errors are not normally distributed.
      - Impact: Hypothesis tests and confidence intervals may be unreliable, especially with small sample sizes.
      - Example: Errors might be skewed or have heavy tails.
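A small sketch of the linearity violation described above: fitting a straight line to data generated from a quadratic relationship leaves a systematic curved pattern in the residuals. This is an illustrative example on synthetic data, assuming numpy and statsmodels:
Python
import numpy as np
import statsmodels.api as sm

# Synthetic data with a curved (quadratic) relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x**2 + rng.normal(0, 1, size=200)

# Fit a straight line anyway (linearity assumption violated)
X = sm.add_constant(x)
linear_model = sm.OLS(y, X).fit()

# The residuals show a curved pattern instead of random scatter around zero,
# signaling that the linear specification misses the true relationship
residuals = linear_model.resid
print(linear_model.rsquared)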
o Assumption Testing (Expanded) (see the Python sketch after this list):
   - Scatter Plots:
      - Purpose: Visual inspection of the relationship between variables to assess linearity.
      - Interpretation: A linear pattern suggests linearity. A curved pattern suggests non-linearity.
   - Residual Plots:
      - Purpose: To check for homoscedasticity and linearity.
      - Interpretation:
         - Homoscedasticity: Residuals should be randomly scattered around zero with no discernible pattern. A funnel shape or other pattern suggests heteroscedasticity.
         - Linearity: Residuals should be randomly scattered around zero. A curved pattern suggests non-linearity.
   - Durbin-Watson Test:
      - Purpose: To test for autocorrelation (correlation between errors) in time series data.
      - Interpretation: Values range from 0 to 4. A value of 2 indicates no autocorrelation. Values close to 0 indicate positive autocorrelation, and values close to 4 indicate negative autocorrelation.
   - Q-Q Plots (Quantile-Quantile Plots):
      - Purpose: To assess the normality of residuals.
      - Interpretation: If the residuals are normally distributed, the points on the Q-Q plot will fall approximately along a straight line.
   - Shapiro-Wilk Test:
      - Purpose: A statistical test for normality.
      - Interpretation: The test returns a test statistic and a p-value. A small p-value (typically < 0.05) indicates that the data is significantly different from a normal distribution.
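The checks listed above can be run with standard library functions. The following is a minimal sketch on synthetic fitted values and residuals (stand-ins for real model output), assuming statsmodels, scipy, and matplotlib:
Python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

# Synthetic fitted values and residuals, standing in for model output
rng = np.random.default_rng(0)
fitted = rng.uniform(0, 10, size=200)
residuals = rng.normal(0, 1, size=200)

# Residual plot: look for random scatter around zero (no funnel, no curve)
plt.scatter(fitted, residuals)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Durbin-Watson: ~2 means no autocorrelation, near 0 positive, near 4 negative
print("Durbin-Watson:", durbin_watson(residuals))

# Q-Q plot: points close to the reference line suggest normal residuals
sm.qqplot(residuals, line="s")
plt.show()

# Shapiro-Wilk: p < 0.05 suggests the residuals deviate from normality
stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", stat, "p-value:", p_value)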
o Remedial Measures (Expanded) (see the Python sketch after this list):
   - Non-linearity:
      - Variable Transformations:
         - Log Transformation: Useful for reducing skewness and linearizing exponential relationships.
         - Square Root Transformation: Useful for count data and stabilizing variance.
      - Polynomial Regression: Adding polynomial terms of the independent variables to the model.
   - Heteroscedasticity:
      - Variable Transformations: Can sometimes stabilize variance.
      - Weighted Least Squares (WLS): Assigns weights to observations based on the variance of their errors.
   - Non-normality:
      - Variable Transformations: Can sometimes make the distribution of errors more normal.
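A minimal sketch of two of the remedies above on synthetic data, assuming numpy and statsmodels: a log transformation to linearize an exponential relationship, and weighted least squares when the error variance grows with the predictor. The weights used (1/x^2) are an illustrative choice for this simulated setup, not a general rule:
Python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=200)

# Non-linearity remedy: log-transform an exponential relationship so OLS fits a line
y_exp = np.exp(0.5 * x) * rng.lognormal(0, 0.1, size=200)
log_model = sm.OLS(np.log(y_exp), sm.add_constant(x)).fit()
print("Slope on the log scale:", log_model.params[1])

# Heteroscedasticity remedy: weighted least squares when error variance grows with x
y_het = 2 * x + rng.normal(0, x)  # noise standard deviation proportional to x
wls_model = sm.WLS(y_het, sm.add_constant(x), weights=1.0 / x**2).fit()
print(wls_model.params)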
Python Example (with more explanation):
Python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Fit an OLS model on illustrative data so that residuals are available
X = sm.add_constant(np.random.rand(100))
y = 2 * X[:, 1] + np.random.randn(100)
model = sm.OLS(y, X).fit()

# Get residuals
residuals = model.resid