Matrix partial derivative
Matrix transpose properties
Linear Algebraic Equations
Under-determined systems
$Ax = b$,
where $A$ is an $m \times n$ matrix, $x$ is an $n \times 1$ vector, and $b$ is an $m \times 1$ vector, with $m < n$ (fewer equations than unknowns).
Minimum Norm Solution
Minimize $J = x_1^2 + x_2^2 + \cdots + x_n^2 = x^T x$
subject to the constraint $f = Ax - b = 0$.
Adjoining the constraint with Lagrange multipliers gives the augmented cost
$J_a = J + (\lambda_1 f_1 + \lambda_2 f_2 + \cdots + \lambda_{m-1} f_{m-1} + \lambda_m f_m) = J + \lambda^T f.$
Setting the partial derivatives to zero:
$\frac{\partial J_a}{\partial x} = 0 = 2x + A^T \lambda,$
$\frac{\partial J_a}{\partial \lambda} = 0 = Ax - b.$
Solving these for $x$ gives
$x = A^{\#} b$, where $A^{\#} = A^T (A A^T)^{-1}$.
Example: $2x_1 + 3x_2 = 8$ (one equation, two unknowns).

A = [2 3]; b = 8;
xa = A\b                 % basic solution: xa = [0; 2.6667]
xb = lsqminnorm(A,b)     % minimum norm solution: xb = [1.2308; 1.8462]
Least Squares Solutions (Minimum error solution)
Over-determined system
The least squares solution is the solution which minimizes the squared norm (size) of the error:
$J = e^T e = (Ax - b)^T (Ax - b).$
Setting $\partial J / \partial x = 0$ gives the normal equations $A^T A x = A^T b$; premultiplying by $(A^T A)^{-1}$ gives
$x = (A^T A)^{-1} A^T b.$
In MATLAB: x = lsqr(A,b)
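A minimal worked example, assuming an over-determined system of three illustrative equations in two unknowns (not from the slides); backslash, the normal equations, and lsqr should all return the same least squares solution:

A = [2 3; 1 -1; 4 1];          % 3 equations, 2 unknowns
b = [8; 1; 7];
x1 = A\b                       % least squares via QR factorization
x2 = (A'*A)\(A'*b)             % normal equations: x = (A'A)^{-1} A'b
x3 = lsqr(A, b)                % iterative least squares solver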
Radial Basis Function (RBF) Networks
1. They are two-layer feed-forward networks.
2. The hidden nodes implement a set of radial basis
functions (e.g. Gaussian functions).
3. The output nodes implement linear summation
functions as in an MLP.
4. The network training is divided into two stages: first the
weights from the input to hidden layer are determined,
and then the weights from the hidden to output layer.
5. The training/learning is very fast.
6. The networks are very good at interpolation.
There is considerable evidence that neurons in the visual cortex are tuned to local regions of the retina. They are maximally sensitive to some specific stimulus, and their output falls off as the presented stimulus moves away from this “best” stimulus.
Gaussian basis functions:
$\phi_k(x) = \exp\!\left(-\frac{\|x - \mu_k\|^2}{2\sigma_k^2}\right),$
with centre $\mu_k$ and width $\sigma_k$.
Implementing XOR
When mapped into the feature space $(z_1, z_2)$, the two classes become linearly separable.
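A small sketch of this mapping, assuming the two Gaussian hidden units are centred on (0,0) and (1,1) (the usual textbook choice; the slides do not state the centres):

% XOR inputs and targets
X = [0 0; 0 1; 1 0; 1 1];          % four input patterns (rows)
t = [0; 1; 1; 0];                  % XOR targets
mu1 = [0 0]; mu2 = [1 1];          % assumed Gaussian centres
z1 = exp(-sum((X - mu1).^2, 2));   % hidden unit 1 output
z2 = exp(-sum((X - mu2).^2, 2));   % hidden unit 2 output
[z1 z2 t]                          % inspect the feature-space coordinates

In the (z1, z2) plane both class-0 patterns land near the axes (z1 + z2 ≈ 1.14), while both class-1 patterns land at the same interior point (z1 + z2 ≈ 0.74), so the single line z1 + z2 = 1 separates the classes.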
Training RBF nets
Typically, the weights of the two layers are determined separately: first find the RBF (hidden layer) parameters, then find the output layer weights (a sketch of both stages follows this list).
Hidden layer
– estimate parameters for each hidden unit k (whose output depends on the distance between the input and a stored prototype);
e.g. for a Gaussian activation function, estimate the parameters $\mu_k$ and $\sigma_k^2$
– This stage involves an Unsupervised training process (no
targets available)
Output layer
– set the weights (including bias weights)
– the same as training a single-layer perceptron: each unit’s output depends on a weighted sum of its inputs,
– using, for example, the gradient descent rule
– This stage involves a Supervised training process
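A compact sketch of the two stages on a toy 1-D regression task. Assumptions: Gaussian hidden units, centres from MATLAB's kmeans (Statistics and Machine Learning Toolbox), one shared width set by a common heuristic, and output weights from the pseudo-inverse; the data and names are illustrative.

% Toy data
x = linspace(0, 2*pi, 50)';  d = sin(x);

% Stage 1 (unsupervised): centres via k-means, shared width sigma
K = 8;
[~, mu] = kmeans(x, K);                    % mu: K x 1 centres
sigma = (max(mu) - min(mu)) / sqrt(2*K);   % width heuristic

% Hidden layer outputs (one column per RBF, plus a bias column)
Phi = exp(-(x - mu').^2 / (2*sigma^2));
Phi = [Phi ones(size(x))];

% Stage 2 (supervised): linear output weights, minimum-error solution
w = pinv(Phi) * d;

% Prediction; the training error should be small
y = Phi * w;
max(abs(y - d))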
Clustering
K-Means Approach
1. Select k multidimensional points to be the “seeds” or initial centroids for the k clusters to be formed. Seeds are usually selected at random.
2. Assign each observation to the cluster with the
nearest seed.
3. Update cluster centroids once all observations
have been assigned.
4. Repeat steps 2 and 3 until the changes in the cluster centroids are small.
5. Repeat steps 1-4 with new starting seeds; do this 3 to 5 times and keep the best result (a from-scratch sketch of steps 1-4 follows).
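A from-scratch sketch of steps 1-4, assuming Euclidean distance and seeds drawn at random from the data (function and variable names are illustrative; empty clusters are not handled):

function [idx, C] = kmeans_basic(X, k)
% X: n x p data matrix; k: number of clusters
n = size(X, 1);
C = X(randperm(n, k), :);                % step 1: random seeds from the data
for iter = 1:100
    % step 2: assign each observation to the cluster with the nearest centroid
    D = zeros(n, k);
    for j = 1:k
        D(:, j) = sum((X - C(j, :)).^2, 2);
    end
    [~, idx] = min(D, [], 2);
    % step 3: update the cluster centroids
    Cold = C;
    for j = 1:k
        C(j, :) = mean(X(idx == j, :), 1);
    end
    % step 4: stop when the centroid changes are small
    if max(abs(C(:) - Cold(:))) < 1e-6, break; end
end
end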
K-Means Illustration – two dimensions
Fine Tuning
Computing the Output Weights
We want W (a weight matrix) such that
Target: $T = WX$
Thus $W = T X^{-1}$.
If an inverse exists, then the error can be minimized.
If no inverse exists, then use the pseudo-inverse to get the minimum error
(‘minimum-norm solution to a linear system’).
The pseudo-inverse solution is
$W = T X^{+}$, where $X^{+} = (X^T X)^{-1} X^T$.
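A quick numeric check of $W = T X^{+}$ in MATLAB, reusing the XOR feature-space values from the sketch above (rounded; the bias row and the data layout are assumptions for illustration):

% Hidden outputs: one column per pattern; rows are z1, z2, and a bias of 1
X = [1.00 0.37 0.37 0.14;
     0.14 0.37 0.37 1.00;
     1    1    1    1   ];
T = [0 1 1 0];            % XOR targets, one per pattern
W = T * pinv(X);          % W = T X^+, the minimum-error weights
Y = W * X                 % reproduces T up to rounding error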
XOR Problem
The relationship between the input and the output of the network can be given by
$\sum_{i=1}^{N} w_i \,\varphi(\|x_j - x_i\|) = d_j, \qquad j = 1, 2, \ldots, N,$
where $x_j$ is an input vector and $d_j$ is the associated value of the desired output.
Classifier Evaluation Metrics
Test data:

Sl. No   x1    x2     t (Actual)   y (Predicted)
1        0.7   0.7    −            +
2        0.8   0.9    +            +
3        0.8   0.25   −            −
4        1.2   0.8    +            −
5        0.6   0.4    +            +
6        1.3   0.5    +            −

True Positives (TP): number of actual positive examples, predicted as positive: 2
False Positives (FP): number of actual negative examples, predicted as positive (false alarms): 1
True Negatives (TN): number of actual negative examples, predicted as negative: 1
False Negatives (FN): number of actual positive examples, predicted as negative: 2

Actual class \ Predicted class   Positive                   Negative
Positive                         True Positives (TP) = 2    False Negatives (FN) = 2
Negative                         False Positives (FP) = 1   True Negatives (TN) = 1
Be careful of “Accuracy”
The simplest measure of performance would be the fraction of items that are correctly classified, or the “accuracy”:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

But this measure is dominated by the larger set (of positives or negatives) and favors trivial classifiers.
E.g. if 5% of instances are actually positive, then a classifier that always says “negative” is 95% accurate.
Confusion Matrix:

Actual class \ Predicted class   Positive                   Negative
Positive                         True Positives (TP) = 2    False Negatives (FN) = 2
Negative                         False Positives (FP) = 1   True Negatives (TN) = 1
Precision: of all the instances predicted as a given class X, how many were predicted correctly?
Recall: of all the instances whose actual class is X, how many were predicted correctly?
Recall is also known as hit rate, sensitivity, or true positive rate.
False positive rate = FP/(FP + TN)
Precision measures what fraction of our detections are actually positive: Pp = TP/(TP + FP)
Recall measures what fraction of the positives are detected: Rp = TP/(TP + FN)
For multi-class classification:

Actual class \ Predicted class   A                B                C
A                                True A (30)      A false B (50)   A false C (20)
B                                B false A (20)   True B           B false C
C                                C false A (10)   C false B        True C

$R_A = \frac{30}{30 + 50 + 20} = \frac{30}{100} \qquad P_A = \frac{30}{30 + 20 + 10} = \frac{30}{60}$
F measure (F1 or F-score): the harmonic mean of precision and recall,
$F_1 = \frac{2PR}{P + R}$

With N+ actual positives and N− actual negatives:
TPR = TP/N+ (sensitivity, recall)              FPR = FP/N− (false alarm rate, Type I error rate)
FNR = FN/N+ (miss rate, Type II error rate)    TNR = TN/N− (specificity)

Sensitivity: probability of predicting disease given the true state is disease.
Specificity: probability of predicting non-disease given the true state is non-disease.
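A short computation of these metrics for the six-example test set above (TP = 2, FP = 1, TN = 1, FN = 2):

TP = 2; FP = 1; TN = 1; FN = 2;                        % from the confusion matrix above
accuracy  = (TP + TN) / (TP + TN + FP + FN)            % 0.5000
precision = TP / (TP + FP)                             % 0.6667
recall    = TP / (TP + FN)                             % 0.5000 (TPR, sensitivity)
F1        = 2*precision*recall / (precision + recall)  % 0.5714
FPR       = FP / (FP + TN)                             % 0.5000 (false alarm rate)
specificity = TN / (TN + FP)                           % 0.5000 (TNR)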
ROC (Receiver Operating Characteristics) curves: for
visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate
and the false positive rate
The area under the ROC curve is a measure of the
accuracy of the model
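A sketch of drawing an ROC curve and computing its AUC in MATLAB, assuming perfcurve from the Statistics and Machine Learning Toolbox and illustrative labels and scores:

labels = [1 1 1 1 0 0];                 % 1 = positive, 0 = negative
scores = [0.9 0.8 0.45 0.3 0.6 0.2];    % classifier scores; higher = more positive
[fpr, tpr, ~, AUC] = perfcurve(labels, scores, 1);
plot(fpr, tpr)
xlabel('False Positive Rate'); ylabel('True Positive Rate')
title(sprintf('ROC curve, AUC = %.2f', AUC))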
Specific Example
[Figure: two overlapping distributions of a test result, one for people without the disease and one for people with the disease, with a threshold on the test result. Patients below the threshold are called “negative”; patients above it are called “positive”.]
Some definitions ...
[Figure: the same plot; the part of the “with disease” distribution above the threshold gives the True Positives.]
Moving the Threshold: left
[Figure: the same distributions with the threshold moved left, shifting the boundary between the ‘−’ and ‘+’ calls. Which line has the higher recall of −? Which line has the higher precision of −?]
[Figure: the part of the “without disease” distribution above the threshold gives the False Positives; a lower threshold catches more true positives but also more false positives.]
ROC curve
[Figure: ROC curve plotting the True Positive Rate (recall), 0-100%, against the False Positive Rate (1 − specificity), 0-100%.]
Area under ROC curve (AUC)
[Figures: four ROC plots of True Positive Rate vs. False Positive Rate. A perfect classifier hugs the top-left corner (AUC = 100%); a random classifier follows the diagonal (AUC = 50%); two intermediate classifiers give AUC = 90% and AUC = 65%.]
Performance Metrics for Regression
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R-squared (Coefficient of Determination)
Mean Absolute Percentage Error (MAPE)
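A sketch computing each of these in MATLAB for a vector of targets y and predictions yhat (the data are illustrative; MAPE assumes no zero targets):

y    = [3.0; 5.0; 2.5; 7.0];            % targets
yhat = [2.8; 5.3; 2.4; 6.5];            % predictions

err  = y - yhat;
MSE  = mean(err.^2);                            % Mean Squared Error
RMSE = sqrt(MSE);                               % Root Mean Squared Error
MAE  = mean(abs(err));                          % Mean Absolute Error
R2   = 1 - sum(err.^2)/sum((y - mean(y)).^2);   % Coefficient of Determination
MAPE = mean(abs(err ./ y)) * 100;               % Mean Absolute Percentage Error (%)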