Chapter 2 Machine Learning Overview
Page 1
Objectives
Upon completion of this course, you will be able to:
Master the definition of learning algorithms and the machine learning process.
Know common machine learning algorithms.
Understand concepts such as hyperparameters, gradient descent, and cross validation.
Page 2
Contents
1. Machine Learning Algorithms
6. Case Study
Page 3
Machine Learning Algorithms (1)
Machine learning (including deep learning) is a study of learning algorithms. A computer program
is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure
𝑃 if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.
Data (Experience E) → Learning algorithms (Task T) → Basic understanding (Measure P)
Page 4
Machine Learning Algorithms (2)
Training data → Machine learning (induction/training) → Model
Page 5
Page 6 .
Application Scenarios of Machine Learning (1)
Machine learning can be used in the following scenarios:
The solution to a problem is complex, or the problem involves a large amount of data without a clear data distribution function.
Rules are complex or cannot be described, such as facial recognition and voice recognition.
Task rules change over time. For example, in the part-of-speech tagging task, new words or meanings are generated at any time.
Data distribution changes over time, requiring constant readaptation of programs, such as predicting the trend of commodity sales.
Page 7
Application Scenarios of Machine Learning (2)
[Chart: rule-based algorithms handle simple problems with small-scale, simple rules; machine learning handles problems with complex rules at large scale.]
Page 8
Rational Understanding of Machine Learning Algorithms
Target function (ideal): f: X → Y
Training data: D = {(x1, y1), …, (xn, yn)}
Learning algorithms produce the hypothesis function (actual): g ≈ f
The goal of learning is to find a hypothesis g that approximates the unknown target function f from the training data D.
Page 9
Main Problems Solved by Machine Learning
Machine learning can deal with many types of tasks. The following describes the most typical and common types of
tasks.
Classification: A computer program must specify which of k categories an input belongs to. To accomplish this task, learning algorithms usually output a function f: R^n → {1, 2, …, k}. For example, the image classification algorithm in computer vision handles classification tasks.
Regression: For this type of task, a computer program predicts the output for the given input. Learning algorithms typically
output a function 𝑓: 𝑅𝑛 → 𝑅. An example of this task type is to predict the claim amount of an insured person (to set
the insurance premium) or predict the security price.
Clustering: A large amount of data from an unlabeled dataset is divided into multiple categories according to internal similarity of
the data. Data in the same category is more similar than that in different categories. This feature can be used in scenarios such as
image retrieval and user profile management.
Classification and regression are the two main types of prediction tasks, accounting for 80% to 90% of machine learning problems. The output of classification is discrete category values, and the output of regression is continuous numbers.
Page 10
Contents
1. Machine Learning Algorithms
6. Case study
Page 11
Machine Learning Classification
Supervised learning: trains an optimal model with the required performance on samples of known labels (categories), then uses the model to map any input to an output and judges the output so that unknown data can be classified.
Unsupervised learning: for unlabeled samples, the learning algorithm models the input dataset directly. Clustering is a common form of unsupervised learning: highly similar samples are grouped together, and new samples are classified by computing their similarity to existing ones.
Semi-supervised learning: for one task, the machine learning model automatically uses a large amount of unlabeled data to assist the learning of a small amount of labeled data.
Reinforcement learning: an area of machine learning concerned with how agents ought to take actions in an environment to maximize some notion of cumulative reward. The difference between reinforcement learning and supervised learning lies in the teacher signal: the reinforcement signal provided by the environment evaluates the action (a scalar signal) rather than telling the learning system how to perform the correct actions.
Page 12
Supervised Learning
[Diagram: each training sample consists of data features (Feature 1 … Feature n) and a label (the goal); the supervised learning algorithm learns the mapping from features to label.]
Page 13
Supervised Learning — Regression Questions
Regression: models the relationships among attribute values of samples in a dataset; the dependency between attribute values is expressed as a function that maps inputs to a continuous output.
How much will I benefit from the stock next week?
What's the temperature on Tuesday?
Page 14
Supervised Learning — Classification
Questions
Classification: maps samples in a sample dataset to a specified category by using a classification model.
Will there be a traffic jam on XX road during
the morning rush hour tomorrow?
Which method is more attractive to customers:
5 yuan voucher or 25% off?
Page 15
Unsupervised Learning
Data features (no labels):
Monthly Consumption | Commodity | Consumption Time
1000–2000 | Badminton racket | 6:00–12:00
500–1000 | Basketball | 18:00–24:00
1000–2000 | Game console | 00:00–6:00
The unsupervised learning algorithm groups the samples into clusters (Cluster 1, Cluster 2, …) by feature similarity.
Page 16
Unsupervised Learning — Clustering
Questions
Clustering: classifies samples in a sample dataset into several categories based on the clustering model. The
similarity of samples belonging to the same category is high.
Which audiences like to watch movies
of the same subject?
Which of these components are
damaged in a similar way?
Page 17 .
Semi-Supervised Learning
[Diagram: samples have data features (Feature 1 … Feature n), but the label is unknown for most of them; the semi-supervised learning algorithm uses both the labeled and the unlabeled samples.]
Page 18 .
Reinforcement Learning
The model perceives the environment, takes actions, and adjusts its choices based on the state and the reward or penalty.
[Diagram: the model observes state s_t and reward r_t from the environment, takes action a_t, and the environment returns the next state s_{t+1} and reward r_{t+1}.]
Page 19
Reinforcement Learning — Best Behavior
Reinforcement learning always looks for the best behaviors and is typically applied to machines or robots. Examples:
Autopilot: Should it brake or accelerate when the
yellow light starts to flash?
Cleaning robot: Should it keep working or go
back for charging?
Page 20 .
Contents
1. Machine learning algorithm
6. Case study
Page 21
Machine Learning Process
Data collection → Data cleansing → Feature extraction and selection → Model training → Model evaluation → Model deployment and integration
Page 22 .
Basic Machine Learning Concept — Dataset
Dataset: a collection of data used in machine learning tasks. Each data record is called a sample.
Events or attributes that reflect the performance or nature of a sample in a particular aspect are
called features.
Training set: a dataset used in the training process, where each sample is referred to as a training
sample. The process of creating a model from data is called learning (training).
Test set: Testing refers to the process of using the model obtained after learning for prediction.
The dataset used is called a test set, and each sample is called a test sample.
Page 23 .
Checking Data Overview
Typical dataset form: each row is a sample and each column is a feature (for example, a row such as 4, 80, 9, Southeast, 1100).
Page 24
Importance of Data Processing
Data is crucial to models: it determines the upper limit of model capabilities. Without good data, there is no good model.
Data preprocessing includes:
Data cleansing
Data normalization
Data dimension reduction: simplify data attributes to avoid dimension explosion.
Page 25
Workload of Data Cleansing
[Chart: statistics on data scientists' work in machine learning, highlighting the workload of data cleansing.]
Page 26
Data Cleansing
Most machine learning models process features, which are usually numeric representations of
input variables that can be used in the model.
In most cases, the collected data can be used by algorithms only after being preprocessed. The
preprocessing operations include the following:
Data filtering
Handling of missing data
Handling of possible outliers, errors, or abnormal values
Combination of data from multiple data sources
Data consolidation
Page 27
Dirty Data
Generally, real data may have some quality problems.
Incompleteness: contains missing values or lacks certain attributes.
Noise: contains incorrect records or outliers.
Inconsistency: contains inconsistent records.
[Example: a data table with a missing value and an invalid value highlighted.]
Page 28
Data Conversion
After being preprocessed, the data needs to be converted into a representation form suitable for the
machine learning model. Common data conversion forms include the following:
For classification, categorical data is encoded into a corresponding numerical representation.
Numeric data is converted into categorical data to reduce the number of variable values (for example, age segmentation).
Other data:
In text, words are converted into word vectors through word embedding (generally using the word2vec or BERT model).
Image data is processed (color space, grayscale, geometric transformation, Haar features, and image enhancement).
Feature engineering:
Normalize features to ensure the same value range for input variables of the same model.
Feature expansion: combine or convert existing variables to generate new features, such as averages.
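A minimal sketch of such conversions, assuming Python with pandas and scikit-learn; the column names and values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# A tiny made-up dataset: one categorical feature and one numeric feature.
df = pd.DataFrame({
    "city": ["Shenzhen", "Beijing", "Shenzhen"],
    "age": [23, 41, 35],
})

# Categorical data -> numerical representation (one-hot encoding).
encoded = pd.get_dummies(df, columns=["city"])

# Numeric data -> categorical data (age segmentation into bins).
encoded["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 100], labels=["young", "middle", "senior"])

# Feature normalization so that input variables share the same value range.
encoded["age_scaled"] = StandardScaler().fit_transform(df[["age"]])

print(encoded)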
Page 29
Necessity of Feature Selection
Generally, a dataset has many features, some of which may be redundant or irrelevant to the value to be predicted.
Feature selection is necessary in the following aspects:
Simplify models to make them easier for users to interpret.
Reduce the training time.
Avoid dimension explosion.
Improve model generalization and avoid overfitting.
Page 30
Feature Selection Methods — Filter
Filter methods are independent of the model during feature selection.
By evaluating the correlation between each feature and
the target attribute, these methods use a statistical
measure to assign a value to each feature. Features are
then sorted by score, which is helpful for preserving or
eliminating specific features.
Procedure: Traverse all features → Select the optimal feature subset → Train models → Evaluate the performance.
Common methods:
• Pearson correlation coefficient
• Chi-square coefficient
• Mutual information
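A minimal sketch of a filter method, assuming Python with NumPy; the synthetic data is for illustration only, and the Pearson correlation coefficient is used as the scoring measure:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                                     # 100 samples, 4 candidate features
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.1, size=100)   # target depends on features 0 and 2

# Score each feature by the absolute Pearson correlation with the target.
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]

# Sort features by score and keep the top-k (here k = 2).
ranking = np.argsort(scores)[::-1]
print("feature ranking:", ranking, "selected:", ranking[:2])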
Page 31
Feature Selection Methods — Wrapper
Wrapper methods use a prediction model to score feature subsets.
Page 32
Feature Selection Methods — Embedded
Embedded methods consider feature selection as a part of model construction.
Common methods:
• Lasso regression
• Ridge regression
[Figure: procedure of an embedded method.]
Page 33 .
Overall Procedure of Building a Model
Model Building Procedure
Page 34
Examples of Supervised Learning — Learning Phase
Page 35 .
Examples of Supervised Learning — Prediction Phase
• Generalization capability
Can it accurately predict the actual service data?
• Interpretability
Is the prediction result easy to interpret?
• Prediction speed
How long does it take to predict each piece of data?
• Plasticity
Is the prediction rate still acceptable when the service volume
increases with a huge data volume?
Page 37
Model Validity (1)
Generalization capability: The goal of machine learning is that the model obtained after learning should
perform well on new samples, not just on samples used for training. The capability of applying a model to
new samples is called generalization or robustness.
Error: difference between the sample result predicted by the model obtained after learning and the actual
sample result.
Training error: error that you get when you run the model on the training data.
Generalization error: error that you get when you run the model on new samples. Obviously, we prefer a model with
a smaller generalization error.
Underfitting: occurs when the model or the algorithm does not fit the data well enough.
Overfitting: occurs when the training error of the model obtained after learning is small but the
generalization error is large (poor generalization capability).
Page 38
Model Validity (2)
Model capacity: model's capability of fitting functions, which is also called model complexity.
When the capacity suits the task complexity and the amount of training data provided, the algorithm effect is usually optimal.
Models with insufficient capacity cannot solve complex tasks and underfitting may occur.
A high-capacity model can solve complex tasks, but overfitting may occur if the capacity is higher than that required by a task.
Bias:
Difference between the expected (or average) prediction value and the correct value we are trying to predict.
Page 40
Machine Learning Performance Evaluation —
Regression
The closer the Mean Absolute Error (MAE) is to 0, the better the model can fit the training data.
MAE = (1/m) Σ_{i=1}^{m} |y_i − ŷ_i|
The value range of R2 is (–∞, 1]. A larger value indicates that the model can better fit the training data. TSS indicates the
difference between samples. RSS indicates the difference between the predicted value and sample value.
R² = 1 − RSS/TSS = 1 − Σ_{i=1}^{m} (y_i − ŷ_i)² / Σ_{i=1}^{m} (y_i − ȳ)²
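A minimal sketch of computing both metrics, assuming Python with scikit-learn; the values below are illustrative:

from sklearn.metrics import mean_absolute_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]   # actual sample values
y_pred = [2.8, 5.4, 2.0, 6.5]   # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # closer to 0 is better
print("R^2:", r2_score(y_true, y_pred))             # closer to 1 is better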
Page 43 .
Machine Learning Performance Evaluation —
Classification (1)
Terms and definitions:
P: positive, the number of real positive cases in the data.
N: negative, the number of real negative cases in the data.
TP: true positive, the number of positive cases correctly classified as positive by the classifier.
TN: true negative, the number of negative cases correctly classified as negative by the classifier.
FP: false positive, the number of negative cases incorrectly classified as positive by the classifier.
FN: false negative, the number of positive cases incorrectly classified as negative by the classifier.
Confusion matrix:
Actual \ Estimated | yes | no | Total
yes | TP | FN | P
no | FP | TN | N
Total | P' | N' | P+N
Confusion matrix: an m × m table (at least 2 × 2). The entry CM(i, j) in the first m rows and m columns indicates the number of cases that actually belong to class i but are classified into class j by the classifier.
Ideally, for a high-accuracy classifier, most predictions should lie on the diagonal from CM(1, 1) to CM(m, m), while values off the diagonal are 0 or close to 0; that is, FP and FN are close to 0.
Page 44 .
Machine Learning Performance Evaluation —
Classification (2)
Measurement | Ratio
Accuracy, recognition rate | (TP + TN) / (P + N)
Error rate, misclassification rate | (FP + FN) / (P + N)
Sensitivity, true positive rate, recall | TP / P
Specificity, true negative rate | TN / N
Precision | TP / (TP + FP)
F1, harmonic mean of recall and precision | (2 × precision × recall) / (precision + recall)
Fβ, where β is a non-negative real number | ((1 + β²) × precision × recall) / (β² × precision + recall)
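A minimal sketch that evaluates these ratios from predictions, assuming Python with scikit-learn and illustrative labels:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]   # actual classes (1 = positive)
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]   # predicted classes

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))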
Page 45 .
Example of Machine Learning Performance Evaluation
We have trained a machine learning model to identify whether the object in an image is a cat. Now we use
200 pictures to verify the model performance. Among the 200 images, objects in 170 images are cats, while
others are not. The identification result of the model is that objects in 160 images are cats, while others are
not.
Precision: P = TP / (TP + FP) = 140 / (140 + 20) = 87.5%
Recall: R = TP / P = 140 / 170 ≈ 82.4%
Accuracy: ACC = (TP + TN) / (P + N) = (140 + 10) / 200 = 75%
Confusion matrix:
Actual \ Estimated | yes | no | Total
yes | 140 | 30 | 170
no | 20 | 10 | 30
Total | 160 | 40 | 200
Page 46 .
Parameters and Hyperparameters in Models
A model contains not only parameters but also hyperparameters. Hyperparameters are set so that the model can learn the optimal parameters.
Parameters are automatically learned by the model; they are "distilled" from data.
Hyperparameters are manually set and are used to control the training process.
Page 51 .
Hyperparameters of a Model
Characteristics of hyperparameters:
• Often used in model parameter estimation processes.
• Often specified by the practitioner.
• Can often be set using heuristics.
• Often tuned for a given predictive modeling problem.
Common hyperparameters:
• λ in Lasso/Ridge regression
• Learning rate, number of iterations, batch size, activation function, and number of neurons for training a neural network
• C and σ in support vector machines (SVM)
• K in k-nearest neighbors (KNN)
• Number of trees in a random forest
Page 52 .
Contents
1. Machine learning algorithm
6. Case study
Page 58 .
Machine Learning Algorithm Overview
[Diagram: overview of common machine learning algorithms, including GBDT, KNN, Naive Bayes, and others.]
Page 59 .
Linear Regression (1)
Linear regression: a statistical analysis method to determine the quantitative relationships between two or
more variables through regression analysis in mathematical statistics.
Linear regression is a type of supervised learning.
Page 60
Linear Regression (2)
The model function of linear regression is as follows, where w indicates the weight parameter, b indicates the bias, and x indicates the sample attribute:
h_w(x) = wᵀx + b
The relationship between the value predicted by the model and the actual value is as follows, where y indicates the actual value and ε indicates the error:
y = wᵀx + b + ε
The error ε is influenced by many independent factors. According to the central limit theorem, the error ε follows a normal distribution. Based on the normal distribution function and maximum likelihood estimation, the loss function of linear regression is:
J(w) = (1/2m) Σ (h_w(x) − y)²
To make the predicted value close to the actual value, we need to minimize the loss. We can use the gradient descent method to calculate the weight parameter w that minimizes the loss function and thereby complete model building.
Page 61 .
Linear Regression Extension — Polynomial
Regression
Polynomial regression is an extension of linear regression. Generally, the complexity of a dataset exceeds what can be fitted by a straight line; that is, obvious underfitting occurs if the original linear regression model is used. The solution is to use polynomial regression:
h_w(x) = w₁x + w₂x² + … + wₙxⁿ + b
where n is the polynomial degree (dimension).
Polynomial regression still belongs to linear regression because the relationship between its weight parameters w remains linear, while its nonlinearity is reflected in the feature dimension.
[Figure: comparison between linear regression and polynomial regression.]
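A minimal sketch of polynomial regression, assuming Python with scikit-learn and synthetic one-dimensional data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.normal(scale=0.5, size=50)  # cubic relationship

# degree controls the polynomial dimension; linear regression is fit on the expanded features.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", model.score(x, y))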
Page 62 .
Linear Regression and Overfitting Prevention
Regularization terms can be used to reduce overfitting: the values of w should be neither too large nor too small in the sample space. A squared-sum (L2) penalty can be added to the objective function:
J(w) = (1/2m) Σ (h_w(x) − y)² + λ‖w‖₂²
Regularization term (norm): the regularization term here is called the L2-norm. Linear regression that uses this loss function is also called Ridge regression.
J(w) = (1/2m) Σ (h_w(x) − y)² + λ‖w‖₁
Linear regression with the absolute-value (L1) regularization term is called Lasso regression.
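A minimal sketch contrasting the two regularized variants, assuming Python with scikit-learn; the parameter alpha plays the role of λ, and the data is synthetic:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=80)  # only two informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can drive some weights exactly to zero
print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)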
Page 63 .
Logistic Regression (1)
Logistic regression: The logistic regression model is used to solve classification problems. The model is
defined as follows:
P(Y = 1 | x) = e^(wx + b) / (1 + e^(wx + b))
P(Y = 0 | x) = 1 / (1 + e^(wx + b))
where w indicates the weight, b indicates the bias, and wx + b is regarded as a linear function of x. Compare the two probability values: the class with the higher probability is the class of x.
Page 64 .
Logistic Regression (2)
Both the logistic regression model and linear regression model are generalized linear models. Logistic
regression introduces nonlinear factors (the sigmoid function) based on linear regression and sets
thresholds, so it can deal with binary classification problems.
According to the model function of logistic regression, the loss function of logistic regression can be
estimated as follows by using the maximum likelihood estimation:
J(w) = −(1/m) Σ [ y ln h_w(x) + (1 − y) ln(1 − h_w(x)) ]
where 𝑤 indicates the weight parameter, 𝑚 indicates the number of samples, 𝑥 indicates the sample, and 𝑦
indicates the real value. The values of all the weight parameters 𝑤 can also be obtained through the gradient
descent algorithm.
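A minimal sketch, assuming Python with NumPy, of the sigmoid probability and the cross-entropy loss described above; the weights, bias, samples, and labels are illustrative, and in practice a library such as scikit-learn's LogisticRegression fits w and b by gradient-based optimization:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, bias, samples, and labels.
w, b = np.array([1.5, -2.0]), 0.3
X = np.array([[1.0, 0.5], [0.2, 1.2], [2.0, 0.1]])
y = np.array([1, 0, 1])

p = sigmoid(X @ w + b)                                      # P(Y = 1 | x)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))    # cross-entropy loss J(w)
print("probabilities:", p, "loss:", loss)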
Page 65 .
Logistic Regression Extension — Softmax
Function (1)
Logistic regression applies only to binary classification problems. For multi-class classification problems, use
the Softmax function.
[Diagram: logistic regression answers binary questions (e.g., male? female?), while the Softmax function handles multi-class questions (e.g., grape? orange? apple? banana?).]
Page 66 .
Logistic Regression Extension — Softmax
Function (2)
Softmax regression is a generalization of logistic regression that we can use for K-class classification.
The Softmax function is used to map a K-dimensional vector of arbitrary real values to another K-dimensional vector of
real values, where each vector element is in the interval (0, 1).
The regression probability function of Softmax is as follows:
p(y = k | x; w) = e^(w_kᵀ x) / Σ_{l=1}^{K} e^(w_lᵀ x),   k = 1, 2, …, K
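A minimal NumPy sketch of the Softmax mapping; the input scores are illustrative, and subtracting the maximum is a common numerical-stability trick:

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)        # improves numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])     # w_k^T x for K = 4 classes
probs = softmax(scores)
print(probs, probs.sum())                    # each value lies in (0, 1); the sum is 1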
Page 67 .
Logistic Regression Extension — Softmax
Function (3)
Softmax assigns a probability to each class in a multi-class problem, and these probabilities must add up to 1. For an example, Softmax gives the probability of it belonging to each particular class. Example:
Category | Probability
Grape | 0.09
Banana | 0.01
Page 68
Decision Tree
A decision tree is a tree structure (a binary tree or a non-binary tree). Each non-leaf node represents a test on a feature attribute. Each
branch represents the output of a feature attribute in a certain value range, and each leaf node stores a category. To use the decision
tree, start from the root node, test the feature attributes of the items to be classified, select the output branches, and use the category
stored on the leaf node as the final result.
[Example tree: starting from the root, animals are split by attributes such as height (short/tall) and nose length (short/long), with leaves such as "might be a squirrel", "might be a rat", "might be a giraffe", "might be a rhinoceros", and "might be a hippo".]
Page 69 .
Decision Tree Structure
[Diagram: a decision tree consists of a root node and subnodes (internal nodes and leaf nodes).]
Page 70 .
Key Points of Decision Tree Construction
To create a decision tree, we need to select attributes and determine the tree structure between feature
attributes. The key step of constructing a decision tree is to divide data of all feature attributes, compare the
result sets in terms of 'purity', and select the attribute with the highest 'purity' as the data point for dataset
division.
The metrics to quantify the 'purity' include the information entropy and Gini coefficient. The formulas are as follows:
H(X) = − Σ_{k=1}^{K} p_k log₂(p_k)          Gini = 1 − Σ_{k=1}^{K} p_k²
where 𝑝𝑘 indicates the probability that the sample belongs to class k (there are K classes in total). A greater
difference between purity before segmentation and that after segmentation indicates a better decision tree.
Common decision tree algorithms include ID3, C4.5, and CART.
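A minimal sketch of both purity measures, assuming Python with NumPy and an illustrative class distribution:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore empty classes, since 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

p = [0.5, 0.3, 0.2]                   # class probabilities p_k (K = 3)
print("entropy:", entropy(p), "gini:", gini(p))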
Page 71 .
Decision Tree Construction Process
Feature selection: Select a feature from the features of the training data as the split standard of
the current node. (Different standards generate different decision tree algorithms.)
Decision tree generation: Generate subnodes from top down based on the selected features, and stop when the dataset can no longer be split.
Pruning: The decision tree may easily become overfitting unless necessary pruning (including pre-
pruning and post-pruning) is performed to reduce the tree size and optimize its node structure.
Page 72 .
Decision Tree Example
The following figure shows a classification when a decision tree is used. The classification result is impacted
by three attributes: Refund, Marital Status, and Taxable Income.
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125,000 | No
2 | No | Married | 100,000 | No
3 | No | Single | 70,000 | No
4 | Yes | Married | 120,000 | No
5 | No | Divorced | 95,000 | Yes
6 | No | Married | 60,000 | No
7 | Yes | Divorced | 220,000 | No
8 | No | Single | 85,000 | Yes
9 | No | Married | 75,000 | No
10 | No | Single | 90,000 | Yes
[Decision tree: the root node tests Refund (Yes → No cheat); otherwise Marital Status is tested, and for some values Taxable Income is tested before reaching the leaf (No or Yes).]
Page 73 .
SVM
SVMs are binary classification models. Their basic form is a linear classifier defined in the feature space with the largest margin. Kernel tricks extend them to nonlinear classifiers. SVM learning amounts to solving a convex quadratic programming problem.
[Figure: projection of the data.]
Page 74 .
Linear SVM (1)
How do I split the red and blue datasets by a straight line?
[Figure: a two-dimensional dataset for binary classification; both the left and right straight lines can divide the datasets. Which of them is correct?]
Page 75 .
Linear SVM (2)
Straight lines are used to divide data into different classes. Actually, many straight lines can divide the data. The core idea of the SVM is to find a straight line such that the points closest to it are as far away from it as possible; this gives the model a strong generalization capability. These closest points are called support vectors.
In two-dimensional space we separate data with straight lines; in high-dimensional space we separate it with hyperplanes.
[Figure: the margin between the support vectors on either side of the separating line is as large as possible.]
Page 76 .
Nonlinear SVM (1)
How do I classify a dataset that is not linearly separable?
Linear SVMs work well on linearly separable datasets, but nonlinear datasets cannot be split with straight lines.
Page 77 .
Nonlinear SVM (2)
Kernel functions are used to construct nonlinear SVMs.
Kernel functions allow algorithms to fit the largest hyperplane in a transformed high-dimensional feature space.
Common kernel functions: linear kernel function, polynomial kernel function, Gaussian (RBF) kernel function, and Sigmoid kernel function.
[Figure: data mapped from the input space to a high-dimensional feature space.]
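A minimal sketch of a nonlinear SVM with a Gaussian (RBF) kernel, assuming Python with scikit-learn and a synthetic dataset that is not linearly separable:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)  # concentric classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # C and gamma are the SVM hyperparameters
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))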
Page 78 .
KNN Algorithm (1)
The KNN classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. According to this method, if the majority of the k samples most similar to a given sample (its nearest neighbors in the feature space) belong to a specific category, the sample also belongs to that category.
Page 79 .
KNN Algorithm (2)
As the prediction result is determined based on the number and
weights of neighbors in the training set, the KNN algorithm has
a simple logic.
KNN is a non-parametric method which is usually used in
datasets with irregular decision boundaries.
The KNN algorithm generally adopts the majority voting method
for classification prediction and the average value method for
regression prediction.
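A minimal sketch of KNN classification with majority voting, assuming Python with scikit-learn and its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k is the main hyperparameter
knn.fit(X_train, y_train)                   # "training" only stores the samples
print("test accuracy:", knn.score(X_test, y_test))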
Page 80 .
KNN Algorithm (3)
Generally, a larger k value reduces the impact of noise on classification, but obfuscates the boundary
between classes.
A larger k value means a higher probability of underfitting because the segmentation is too rough. A smaller k value
means a higher probability of overfitting because the segmentation is too refined.
Page 81 .
Naive Bayes (1)
Naive Bayes algorithm: a simple multi-class classification algorithm based on Bayes' theorem. It assumes that features are independent of one another. For given sample features X_1, …, X_n, the probability that a sample belongs to category C_k is:
P(C_k | X_1, …, X_n) = P(X_1, …, X_n | C_k) · P(C_k) / P(X_1, …, X_n)
X_1, …, X_n are data features, which are usually described by the measurement values of m attribute sets.
For example, the color feature may have three attributes: red, yellow, and blue.
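A minimal sketch, assuming Python with scikit-learn; GaussianNB is one Naive Bayes variant, suitable for continuous features, and the iris dataset is used only for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()                           # assumes features are conditionally independent
nb.fit(X_train, y_train)
print("class probabilities for one sample:", nb.predict_proba(X_test[:1]))
print("test accuracy:", nb.score(X_test, y_test))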
Page 82 .
Naive Bayes (2)
Feature independence assumption:
For example, if a fruit is red, round, and about 10 cm (3.94 in.) in diameter, it can be considered an apple.
A Naive Bayes classifier considers that each of these features independently contributes to the
probability that the fruit is an apple, regardless of any possible correlation between the color, roundness,
and diameter.
Page 83 .
Ensemble Learning
Ensemble learning is a machine learning paradigm in which multiple learners are trained and combined to solve the same problem.
When multiple learners are used, the integrated generalization capability can be much stronger than that of a single learner.
If you ask a complex question to thousands of people at random and then summarize their answers, the summarized answer is better
than an expert's answer in most cases. This is the wisdom of the masses.
[Diagram: a training set is used to train multiple individual models, which are then combined (model synthesis) into a large model.]
Page 84 .
Classification of Ensemble Learning
Page 85 .
Ensemble Methods in Machine Learning(1)
Random forest = Bagging + CART decision tree
Random forests build multiple decision trees and merge them together to make predictions more accurate
and stable.
Random forests can be used for classification and regression problems.
[Diagram: bootstrap sampling produces data subsets (Data subset 1, …); a decision tree is built on each subset; the individual predictions (Prediction 1, …) are aggregated into the final prediction result.]
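A minimal sketch of bagging CART-style trees into a random forest, assuming Python with scikit-learn and its breast-cancer dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees; each tree sees a bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))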
Page 86
Ensemble Methods in Machine Learning(2)
GBDT is a type of boosting algorithm.
In the aggregation, the predicted value is the sum of the results of all base learners. In essence, each new base learner fits the residual of the error function with respect to the current predicted value (the residual is the difference between the predicted value and the actual value).
During model training, GBDT requires that the sample loss of the model prediction be as small as possible.
[Example: to predict an age of 30, the first tree predicts 20 (residual 10); the second tree fits the residual 10 and predicts 9 (residual 1); the third tree fits the residual 1 and predicts 1. The final prediction is 20 + 9 + 1 = 30.]
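A minimal sketch of gradient boosting for regression, assuming Python with scikit-learn and synthetic data; each new tree fits the residual of the current prediction:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # noisy nonlinear target

gbdt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt.fit(X, y)
print("training R^2:", gbdt.score(X, y))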
Page 87 .
Unsupervised Learning — K-means
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with
the nearest mean, serving as a prototype of the cluster.
For the k-means algorithm, specify the final number of clusters (k). Then, divide n data objects into k clusters. The
clusters obtained meet the following conditions: (1) Objects in the same cluster are highly similar. (2) The similarity of
objects in different clusters is small.
[Figure: scatter plots of the data before and after k-means clustering.]
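A minimal sketch of k-means, assuming Python with scikit-learn and synthetic blob data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data with 3 natural groups

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)      # k must be specified in advance
labels = kmeans.fit_predict(X)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
print("cluster centers:", kmeans.cluster_centers_)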
Page 88 .
Unsupervised Learning — Hierarchical
Clustering
Hierarchical clustering divides a dataset at different layers and forms a tree-like clustering structure. The dataset
division may use a "bottom-up" aggregation policy, or a "top-down" splitting policy. The hierarchy of clustering is
represented in a tree graph. The root is the unique cluster containing all samples, and each leaf is a cluster containing a single sample.
Page 89 .
Contents
1. Machine Learning Algorithm
6. Case study
Page 90 .
Comprehensive Case
Assume that there is a dataset containing the house areas and prices of 21,613 housing units sold in a city.
Based on this data, we can predict the prices of other houses in the city.
Page 91 .
Problem Analysis
This case contains a large amount of data, including input x (house area), and output y (price), which is a continuous value. We can use
regression of supervised learning. Draw a scatter chart based on the data and use linear regression.
Our goal is to build a model function h(x) that infinitely approximates the function that expresses true distribution of the dataset.
Then, use the model to predict unknown price data.
[Figure: scatter plot of house area (input x) versus price (output y, the label); the regression algorithm learns the mapping from area to price.]
Page 92 .
Goal of Linear Regression
Linear regression aims to find a straight line that best fits the dataset.
Linear regression is a parameter-based model. Here, we need to learn the parameters w₀ and w₁; when these two parameters are found, the best model is obtained.
h(x) = w₀ + w₁x
[Figure: scatter plots of price versus house area with candidate regression lines.]
Page 93
Loss Function of Linear Regression
To find the optimal parameter, construct a loss function and find the parameter values when the loss
function becomes the minimum.
Loss function of linear regression:
J(w) = (1/2m) Σ (h(x) − y)²
[Figure: scatter plot of price versus house area; the vertical gaps between the data points and the regression line are the errors.]
Goal:
arg min_w J(w) = arg min_w (1/2m) Σ (h(x) − y)²
where m indicates the number of samples, h(x) indicates the predicted value, and y indicates the actual value.
Page 94 .
Gradient Descent Method
The gradient descent algorithm finds the minimum value of a function through iteration.
The idea is to start from a randomly initialized point on the loss function and repeatedly move in the direction of the negative gradient until the minimum of the loss function is reached. The parameter values at that point are the optimal parameter values.
[Figure: loss surface over w₀ and w₁; Point A marks the position of w₀ and w₁ after random initialization, and w₀ and w₁ are the parameters to be learned.]
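A minimal gradient-descent sketch for this one-variable model h(x) = w₀ + w₁x, assuming Python with NumPy; the house-price data is synthetic, the learning rate and iteration count are illustrative hyperparameters, and the feature is normalized first so that gradient descent converges quickly:

import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=100)                            # house area (synthetic)
price = 300 * area + 40000 + rng.normal(scale=5000, size=100)    # price (synthetic)

# Normalize the feature so that gradient descent converges quickly.
x = (area - area.mean()) / area.std()
y = price

w0, w1, lr = 0.0, 0.0, 0.1                                       # initial point and learning rate
for _ in range(1000):
    error = (w0 + w1 * x) - y                                    # h(x) - y for every sample
    w0 -= lr * error.mean()                                      # partial derivative of J(w) w.r.t. w0
    w1 -= lr * (error * x).mean()                                # partial derivative of J(w) w.r.t. w1

# Map the parameters back to the original (unnormalized) area scale.
slope = w1 / area.std()
intercept = w0 - w1 * area.mean() / area.std()
print("h(x) = %.2f * x + %.2f" % (slope, intercept))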
Page 95 .
Iteration Example
The following is an example of a gradient descent iteration. We can see that as red points on the loss
function surface gradually approach a lowest point, fitting of the linear regression red line with data
becomes better and better. At this time, we can get the best parameters.
Page 96
Model Debugging and Application
After the model is trained, test it with the test set to ensure the generalization capability. If overfitting occurs, use Lasso or Ridge regression with regularization terms and tune the hyperparameters.
The final model result is as follows:
h(x) = 280.62x + 43581
[Figure: the fitted regression line plotted over the price versus house area scatter plot.]
Page 97 .
Summary
First, this course describes the definition and classification of machine learning, as well as
problems machine learning solves. Then, it introduces key knowledge points of machine learning,
including the overall procedure (data collection, data cleansing, feature extraction and selection, model training, model evaluation, and model deployment and integration), common algorithms (linear regression,
logistic regression, decision tree, SVM, naive Bayes, KNN, ensemble learning, K-means, etc.),
gradient descent algorithm, and hyperparameters.
Finally, a complete machine learning process is presented by a case of using linear regression to
predict house prices.
Page 98 .
Quiz
1. (Single-answer question) Which of the following algorithms is not supervised learning? ( )
A. Linear regression
B. Decision tree
C. KNN
D. K-means
2. (True or false) Gradient descent iteration is the only method of machine learning algorithms. ( )
A. True
B. False
Page 99
Recommendations
Online learning website
https://e.huawei.com/en/talent/#/
Page 100
Thank You
Page 101