Linear Regression: predict numerical scores.
Logistic Regression: binary outcomes, e.g. pass or fail.
Decision Trees: a chain of yes/no questions >> good for both classification and regression tasks.
Random Forests: like gathering a bunch of friends (trees), each making their own decision; the majority vote wins.
SVM: find the boundary (hyperplane) that separates the classes with the widest possible margin.
k-Nearest Neighbors: look around for the closest data points; whichever category most of them belong to wins.
Naïve Bayes: apply Bayes' theorem, naïvely assuming the features are independent of each other.
Neural Networks: each neuron processes a small bit of info; collectively they make a decision.
Linear Regression
Advantages:
o Simple and easy to understand.
o Good for predicting numerical outcomes.
Disadvantages:
o Assumes a linear relationship between variables.
o Can be easily affected by outliers.
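A minimal sketch of the idea, assuming scikit-learn and NumPy are installed; the data is synthetic and the true slope/intercept (3 and 5) are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic data: score roughly follows 3*x + 5 plus noise
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=100)

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)  # recovered slope and intercept, near 3 and 5
    print(model.predict([[4.0]]))         # predicted numerical outcome for x = 4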
Logistic Regression
Advantages:
o Provides probabilities for outcomes.
o Good for binary classification.
Disadvantages:
o Assumes a linear relationship between the log odds of the
outcome and predictor variables.
o Not suitable for complex relationships.
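A pass/fail sketch on made-up hours-studied data, assuming scikit-learn; predict_proba is where the "provides probabilities" advantage shows up:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hours studied vs. pass (1) / fail (0), made-up data for illustration
    X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[4.5]]))        # predicted class: pass or fail
    print(clf.predict_proba([[4.5]]))  # probabilities for [fail, pass]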
Decision Trees
Advantages:
o Easy to interpret and understand.
o Can handle both numerical and categorical data.
Disadvantages:
o Prone to overfitting, especially with complex trees.
o Can be unstable, as small changes might result in a completely
different tree.
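One common guard against the overfitting noted above is capping the tree's depth; a sketch on synthetic data, assuming scikit-learn (max_depth=3 is an illustrative choice, not a recommendation):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # grows until leaves are pure
    shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    print("unrestricted:", deep.score(X_te, y_te))
    print("max_depth=3: ", shallow.score(X_te, y_te))

The shallow tree often generalizes about as well as (or better than) the unrestricted one, despite fitting the training data less perfectly.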
Random Forests
Advantages:
o More accurate than a single decision tree.
o Handles overfitting well.
Disadvantages:
o More complex and computationally intensive.
o Less interpretable than a single decision tree.
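A sketch of the "many trees voting" idea, assuming scikit-learn and synthetic data; n_estimators=100 is an arbitrary illustrative size:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_informative=5, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    # 100 trees, each trained on a bootstrap sample; prediction = majority vote
    forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
    print("single tree:", tree.score(X_te, y_te))
    print("forest:     ", forest.score(X_te, y_te))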
Support Vector Machines (SVM)
Advantages:
o Effective in high dimensional spaces.
o Works well when there is a clear margin of separation.
Disadvantages:
o Requires careful tuning of parameters.
o Slow to train on large datasets.
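A sketch assuming scikit-learn; the pipeline scales the features first (SVMs are sensitive to feature scale), and C and gamma are the parameters the notes say need careful tuning (the values here are just defaults):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Scale, then fit an RBF-kernel SVM; C and gamma are the tuning knobs
    svm = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma="scale"))
    svm.fit(X_tr, y_tr)
    print(svm.score(X_te, y_te))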
Neural Networks
Advantages:
o Extremely powerful, can model complex nonlinear
relationships.
o Good for a wide range of applications (image recognition, NLP,
etc.).
Disadvantages:
o Requires a lot of data to train.
o Complex and hard to interpret.
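A small sketch with scikit-learn's MLPClassifier on synthetic data; the layer sizes are arbitrary, and real applications like image recognition or NLP would use a dedicated deep-learning framework and far more data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Two hidden layers of 32 and 16 neurons; each neuron combines its inputs
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
    mlp.fit(X_tr, y_tr)
    print(mlp.score(X_te, y_te))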
K-Nearest Neighbors (KNN)
Advantages:
o Simple and easy to implement.
o No assumption about data distribution.
Disadvantages:
o Computationally expensive as the dataset grows.
o Sensitive to irrelevant or redundant features.
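A sketch assuming scikit-learn; k=5 is an illustrative choice:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=300, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # "Look around": each test point gets the majority vote of its 5 nearest training points
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print(knn.score(X_te, y_te))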
CLUSTERING vs CLASSIFICATION
Clustering: Unsupervised Learning >> finding groups; labels NOT required.
Models: K-Means, DBSCAN, Agglomerative Hierarchical
Classification: Supervised Learning >> predicting categories; needs prelabeled data.
Models: Logistic Regression, Decision Trees, Random Forest, SVM, Naïve Bayes, KNN, Neural Networks
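A minimal clustering sketch, assuming scikit-learn; make_blobs generates synthetic data, and its true labels are deliberately thrown away, since clustering gets no labels:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Unsupervised: ignore the labels make_blobs would give us
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:10])      # discovered group for each of the first 10 points
    print(km.cluster_centers_)  # the 3 group centers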
EVALUATION FOR CLASSIFICATION:
Accuracy:
ALL the correct results (TP & TN) over EVERYTHING.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision:
True Positives over ALL the predicted positives (TP & FP), even the false positives!!
Precision = TP / (TP + FP)
Recall:
True Positives over ALL the actual positives (TP & FN).
Recall = TP / (TP + FN)
F1:
The harmonic mean of precision and recall = useful to balance the two.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Improving precision or recall can have a negative effect on accuracy.
Improving Precision: reducing False Positives (FP) >> might increase FN (be absolutely sure it's true before predicting a positive).
Improving Recall: reducing False Negatives (FN) >> might increase FP (flag anything that could be positive).
THEREFORE, the numerator (TP & TN) shrinks while the denominator (everything) stays the same, so Accuracy gets smaller!!
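The four metrics on a small made-up example, assuming scikit-learn (the labels are chosen so that TP=2, TN=5, FP=1, FN=2):

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # TP=2, FN=2, FP=1, TN=5

    print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 7/10 = 0.7
    print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
    print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/4 = 0.5
    print(f1_score(y_true, y_pred))         # harmonic mean of the two = 4/7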
Choose evaluation metrics that align with the goals of the specific application and the characteristics of the data being modeled.
Precision-Recall Trade-off: There's often a trade-off between
precision and recall. Improving precision typically reduces recall and
vice versa. This is because increasing one generally involves making
the model more conservative or more liberal in predicting positives,
which can decrease or increase the other metric, respectively.
Accuracy's Role: Accuracy might not always reflect changes in
precision and recall, especially in imbalanced datasets. For instance, a
model that always predicts the majority class can have high accuracy
but low recall and precision for the minority class.
Detecting spam emails >> usually improve precision, so legitimate mail isn't flagged (see the precision examples below).
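A sketch of the trade-off in action, assuming scikit-learn: sweep the decision threshold of one probabilistic classifier and watch precision and recall move in opposite directions (data is synthetic and imbalanced, roughly 80/20):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    # Low threshold = liberal (recall up, precision down); high = conservative (the reverse)
    for t in (0.3, 0.5, 0.7):
        pred = (proba >= t).astype(int)
        print(t, precision_score(y_te, pred), recall_score(y_te, pred))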
EVALUATION FOR REGRESSION:
Mean Absolute Error (MAE): The average of the absolute errors between predicted and actual values. It gives an idea of how wrong the predictions are, in the original units.
MAE = (1/n) * Σ |y_i - ŷ_i|
Mean Squared Error (MSE): The average of the squares of the errors. It penalizes larger errors more than MAE.
MSE = (1/n) * Σ (y_i - ŷ_i)²
R-squared (Coefficient of Determination): Represents the proportion of the variance in the dependent variable that's explained by the independent variables in the model.
R² = 1 - SS_res / SS_tot
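The three regression metrics on made-up numbers, assuming scikit-learn:

    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = [3.0, 5.0, 7.0, 9.0]
    y_pred = [2.5, 5.0, 8.0, 8.0]  # errors: 0.5, 0, -1, 1

    print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 1 + 1) / 4 = 0.625
    print(mean_squared_error(y_true, y_pred))   # (0.25 + 0 + 1 + 1) / 4 = 0.5625
    print(r2_score(y_true, y_pred))             # 1 - SS_res / SS_tot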
Increasing Precision
Email Spam Detection: Higher precision minimizes the risk of
important emails being incorrectly marked as spam, ensuring
important messages reach the inbox.
Financial Fraud Detection: In banking, high precision helps in
accurately identifying fraudulent transactions while minimizing false
positives that could inconvenience customers through false alerts or
blocked transactions.
Product Recommendation Systems: High precision ensures that
the recommendations are relevant to the user, enhancing user
satisfaction and engagement.
Increasing Recall
Disease Screening: High recall is crucial to ensure that as many true cases of a disease as possible are identified, minimizing the number of cases that go undetected.
Disaster Response: In disaster response scenarios, high recall in
identifying areas needing assistance ensures that help is dispatched to
as many affected areas as possible, even if it means some areas might
receive unnecessary aid.
Content Moderation: For social media platforms, higher recall in
identifying and removing harmful content is vital to ensure a safe
environment, even if some non-harmful content is mistakenly removed.