MACHINE LEARNING
Models
1. CART (Classification and Regression Trees):
o A decision tree algorithm used for classification and regression
tasks.
o Splits data into subsets based on certain conditions, resulting in
a tree-like model.
2. Regression:
o Linear Regression: Predicts a continuous outcome by fitting a
linear relationship between the dependent and independent
variables.
o Logistic Regression: A classification algorithm used for
predicting binary outcomes (e.g., yes/no, true/false) by
estimating probabilities using a logistic function.
o Risk of loss: The expected prediction error (loss) incurred when a
regression model's predictions differ from the observed values.
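As a minimal illustration of the two regression models above, the following sketch fits a linear and a logistic regression with scikit-learn; the synthetic data, coefficient values, and random seed are assumptions made purely for demonstration.

# Minimal sketch: fitting the two regression models above with scikit-learn.
# The synthetic data and parameter choices here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Linear regression: continuous target with a linear relationship plus noise.
X = rng.normal(size=(200, 3))
y_cont = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.1, size=200)
lin = LinearRegression().fit(X, y_cont)
print("linear coefficients:", lin.coef_)

# Logistic regression: binary target, predictions are class probabilities.
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print("predicted probabilities (first 3 rows):", log.predict_proba(X[:3]))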
Regularization
Regularization is used to prevent overfitting by adding a penalty to the model's complexity.
1. Lambda (λ):
o Represents the regularization parameter that controls the
amount of penalty applied.
o As λ increases, the model's complexity decreases, leading to a
reduction in overfitting but also potentially reducing accuracy.
2. Accuracy and Complexity Relationship:
o If training accuracy rises only because model complexity keeps
rising, the model may be overfitting.
o If accuracy on new data holds up (or improves) while complexity
decreases, the model is generalizing well, which is the desired effect
of regularization.
3. Lasso (Least Absolute Shrinkage and Selection Operator):
o Involves L1 penalty, which adds the absolute value of the
coefficients as a penalty.
o Helps in feature selection by shrinking some coefficients to
zero, effectively eliminating less important features.
o Forward and backward elimination: stepwise procedures that add or
remove one feature at a time to search for an optimal feature subset;
Lasso achieves a similar selection effect automatically through its penalty.
o Under multicollinearity (high correlation between independent
variables), Lasso tends to keep one feature from a correlated group and
shrink the others toward zero (a code sketch comparing the L1, L2, and
Elastic Net penalties follows this list).
4. Shrinkage of Parameters:
o Refers to reducing the magnitude of the coefficients, which helps
to control the model's complexity and prevent overfitting.
5. Elastic Net Regularization (E-Net):
o A combination of Lasso (L1) and Ridge (L2) regularization.
o Provides a balance between Lasso's feature selection and Ridge's
ability to handle multicollinearity.
6. Advantages of Regularization:
o Improves model generalization: Regularization reduces the
variance of the model, helping it perform better on new data.
o Sparsity: In Lasso, some coefficients become zero, resulting in a
simpler model with fewer predictors.
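The following sketch illustrates these penalties in practice with scikit-learn, where the regularization strength λ is exposed as the alpha parameter; the synthetic data and the alpha values tried are assumptions for demonstration. As alpha grows, the coefficients shrink, and the L1-based models set some of them exactly to zero (sparsity).

# Minimal sketch of L1 (Lasso), L2 (Ridge), and Elastic Net penalties with
# scikit-learn; note that scikit-learn calls the regularization parameter
# "alpha" rather than lambda. Data and alpha values are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
# Only the first two features actually drive the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

for alpha in (0.01, 0.1, 1.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X, y)
    # As alpha (lambda) grows, coefficients shrink; the L1 penalty drives
    # some of them exactly to zero.
    print(f"alpha={alpha}: lasso zeros={np.sum(lasso.coef_ == 0)}, "
          f"ridge max |coef|={np.max(np.abs(ridge.coef_)):.2f}, "
          f"enet zeros={np.sum(enet.coef_ == 0)}")

In practice the penalty strength (and, for Elastic Net, the L1/L2 mix) is usually chosen by cross-validation, for example with scikit-learn's LassoCV or ElasticNetCV, rather than fixed by hand as in this sketch.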
Problems and Considerations
1. Bias and Variance Trade-off:
o Regularization helps find a balance between bias (error due to
overly simple models) and variance (error due to overly complex
models).
2. Sample Size:
o The choice of regularization technique may depend on the size of
the data. For small sample sizes, regularization can be more
beneficial to prevent overfitting.
The notes emphasize the importance of choosing appropriate regularization techniques (Lasso,
Ridge, Elastic Net) based on the data and problem characteristics, focusing on improving the
model's generalization ability.
Classification and Regression Trees (CART) is a decision tree algorithm used in machine
learning for both classification and regression tasks. It creates a tree-like structure to make
decisions, where each internal node represents a "test" or "decision" based on an attribute
(feature), each branch represents the outcome of the test, and each leaf node represents a final
prediction (classification or regression value).
How CART Works
1. Splitting Criteria:
o The CART algorithm starts at the root node (top of the tree) and
splits the data based on the feature that results in the best
partition.
o For classification, CART uses metrics like the Gini index or
entropy (related to information gain) to decide the best split,
aiming to create pure nodes (where most samples belong to one
class).
o For regression, CART uses the mean squared error (MSE) or
variance reduction to choose the best split that minimizes the
error in predicting a continuous outcome.
2. Decision Rules:
o At each node, a decision rule is applied to determine how to split
the data. For example, if the feature is "age," a rule could be
"age < 30," where all data points meeting this condition go to
one branch, and those that do not go to the other branch.
3. Recursive Splitting:
o This process of splitting continues recursively, creating sub-
nodes, and making deeper splits, aiming to optimize the
partitioning of data based on the chosen metric.
o It stops when it reaches a specified condition, such as:
 - The node is "pure" (contains data points of only one class).
 - The maximum depth of the tree is reached.
 - There are too few samples to further split.
 - There is no significant improvement in the splitting metric.
4. Pruning:
o After the tree is built, it may be too complex and overfit the
training data. Pruning is applied to simplify the tree by
removing branches that contribute little to the model’s predictive
power.
o Pre-pruning (early stopping): Stops the tree from growing once a
preset limit is reached (e.g., maximum depth or minimum samples per
node), before it fully fits the training data.
o Post-pruning: Grows the full tree and then removes branches
that do not significantly improve model performance, based on a
certain cost-complexity metric.
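To ground the steps above, the following sketch fits a CART-style tree with scikit-learn's DecisionTreeClassifier, using the Gini criterion, depth and leaf-size limits as pre-pruning, and cost-complexity post-pruning; the iris dataset and the hyperparameter values are arbitrary choices for illustration.

# Minimal sketch of the CART workflow described above, using scikit-learn's
# DecisionTreeClassifier (which implements a CART-style algorithm). The dataset
# and hyperparameter values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): limit depth and minimum samples per leaf.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print(export_text(tree))                      # decision rules at each node
print("test accuracy:", tree.score(X_test, y_test))

# Post-pruning: grow a full tree, then prune using cost-complexity (ccp_alpha).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
pruned.fit(X_train, y_train)
print("pruned tree depth:", pruned.get_depth())

In practice the ccp_alpha value is typically chosen by cross-validating over the candidate alphas returned by the pruning path, rather than picked directly as in this sketch.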
Key Metrics in CART
1. Gini Index (for classification):
o Measures the degree of impurity of a node. A lower Gini index
indicates purer nodes.
o Formula: Gini = 1 - Σ p_i^2 (summed over the n classes), where p_i is
the probability of a data point belonging to class i.
o The goal is to minimize the Gini index when making splits.
2. Entropy and Information Gain (alternative for classification):
o Entropy measures the uncertainty or randomness in a node.
o Information Gain is the reduction in entropy when a node is
split. Higher information gain indicates a more informative split.
3. Mean Squared Error (MSE) (for regression):
o Measures the average squared difference between predicted and
actual values. Lower MSE indicates a better fit.
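The following short sketch computes these three metrics by hand with NumPy; the example label and value arrays are arbitrary assumptions chosen so the arithmetic is easy to verify.

# Minimal sketch of the three splitting metrics defined above, computed by hand
# with NumPy. The example label and target arrays are illustrative assumptions.
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum of p_i * log2(p_i) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(gini([0, 0, 1, 1]))           # 0.5  (maximally impure two-class node)
print(gini([0, 0, 0, 0]))           # 0.0  (pure node)
print(entropy([0, 0, 1, 1]))        # 1.0
print(mse([3.0, 5.0], [2.5, 5.5]))  # 0.25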
Strengths of CART
Easy to interpret and visualize: The decision tree structure is
intuitive and resembles human decision-making.
Handles both numerical and categorical data: It can be applied to
diverse data types without much preprocessing.
Feature selection: Automatically selects the most important features
during the splitting process.
Weaknesses of CART
Prone to overfitting: If the tree is too deep, it can memorize the
training data, leading to poor generalization.
Instability: Small changes in the data can result in different splits and
a different tree structure.
Non-smooth predictions in regression: Predictions are piecewise
constant rather than continuous, because each leaf predicts the average
of the training values it contains.
Improving CART
Ensemble methods like Random Forests and Gradient Boosting
Machines (GBMs) use multiple decision trees to improve predictive
performance and reduce overfitting.
Pruning techniques and setting hyperparameters (e.g., max
depth, min samples per leaf) can also help control overfitting.
CART forms the foundation for many advanced machine learning algorithms, making it a
versatile tool for both simple and complex predictive tasks.
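As a rough illustration of the ensemble idea mentioned above, the following sketch compares a single tree against a random forest and a gradient boosting model using cross-validation; the dataset, estimator count, and other settings are assumptions for demonstration.

# Minimal sketch of the ensemble idea: a random forest averages many decision
# trees to reduce the variance of a single CART model, and a GBM builds trees
# sequentially. Dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)
gbm = GradientBoostingClassifier(random_state=0)

for name, model in [("single tree", single_tree),
                    ("random forest", forest),
                    ("GBM", gbm)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")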
Below is a detailed explanation of each section of the chapter outline on decision trees.
5.1 Key Terminology
Before diving into decision trees, understanding key terms is essential:
Node: Each point in the tree where a decision is made. Internal nodes
split based on a feature, while leaf nodes represent the final decision or
outcome.
Root Node: The topmost node of the tree, representing the initial
feature used for the first split.
Branch/Sub-tree: Represents the segment of the tree that extends
from an internal node.
Leaf Node (Terminal Node): The end point of a branch, where a final
decision or predicted value is made.
Splitting: The process of dividing a node into two or more sub-nodes
based on a certain feature and criterion (e.g., Gini index or variance
reduction).
Pruning: The process of reducing the size of the decision tree by
removing less significant branches to avoid overfitting.
5.2 Introduction
This section introduces decision trees, which are models that use a tree-like structure to make
predictions. They are used for both classification and regression tasks, and their operation
mimics human decision-making by splitting data at various decision points.
5.2.1 Example 1: Likely provides an initial illustration to demonstrate
how a simple decision tree works, helping to build a foundational
understanding.
5.3 Describing the Tree
Understanding the components and structure of a decision tree is crucial for interpreting its
predictions and improving model accuracy.
5.3.1 Example 2: Likely presents another example that builds upon
the first, providing more complex cases or variations.
5.4 Decision Tree Algorithms
This section discusses various algorithms for creating decision trees.
5.4.1 CART (Classification and Regression Trees):
o A fundamental algorithm used for both classification and
regression tasks.
o It uses Gini index for classification and mean squared error for
regression to determine the best splits.
5.4.2 Pruning:
o Helps in simplifying the decision tree by cutting back branches
that do not provide significant predictive power, which reduces
overfitting.
o Two common methods are pre-pruning (early stopping) and
post-pruning.
5.4.3 Conditional Inference Trees:
o These use statistical tests to determine splits, ensuring that the
splits are statistically significant.
o Can be advantageous in avoiding biases introduced by traditional
splitting methods.
5.5 Miscellaneous Topics
Additional considerations when using decision trees:
5.5.1 Interactions:
o Decision trees can capture interactions between variables, where
the effect of one variable on the target depends on the value of
another variable.
5.5.2 Pathways:
o Refers to the specific sequence of splits (decisions) leading from
the root node to a leaf node, defining the rules for prediction.
5.5.3 Stability:
o Decision trees are sensitive to the data used to train them. Small
changes in the data can result in different trees being formed.
o Techniques like ensemble methods (e.g., random forests) can
help improve stability.
5.5.4 Missing Data:
o Decision trees can handle missing data in various ways, such as
using surrogate splits (alternative features) or predicting missing
values.
5.5.5 Variable Importance:
o Decision trees can provide insight into the importance of each
feature by analyzing the reduction in impurity (e.g., Gini index or
variance) provided by splits on that feature.
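For variable importance specifically, the following sketch shows one common way to read it from a fitted tree, via scikit-learn's impurity-based feature_importances_; the dataset and depth limit are illustrative assumptions.

# Minimal sketch of variable importance from a fitted tree: scikit-learn exposes
# the total impurity reduction contributed by each feature as feature_importances_.
# Dataset and depth limit are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

for name, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")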
5.6 Summary
This section likely provides an overview of the key points covered in the chapter, summarizing
the main concepts and best practices when working with decision trees.
5.6.1 Further Reading: Suggests additional resources for a deeper
understanding.
5.6.2 Computational Time and Resources: Discusses
considerations related to the computational cost of building and using
decision trees.
Key Takeaways:
Decision trees offer an intuitive way of making predictions based on
a series of decisions or conditions.
CART is a foundational algorithm for creating decision trees used in
both classification and regression tasks.
Pruning and stability are important for improving decision tree
models, ensuring they generalize well to unseen data.
Decision trees can handle interactions, missing data, and provide
insights into variable importance.
Overall, decision trees are versatile and widely used, but they require careful handling to prevent
overfitting and to improve stability.