ML Lectures 2022 Part 1
Machine Learning
in
Communication Networks
Machine Learning (ML)
• ML is a branch of artificial intelligence (AI):
– Uses computing based systems to make sense out
of data
• Extracting patterns, fitting data to functions, classifying
data, etc
– ML systems can learn and improve
• With historical data, time and experience
– Bridges theoretical computer science and real, noisy data
ML in real-life
OBJECTIVES:
To enable the student to understand the concept of machine learning and its
applications in wireless communication and biomedicine.
To familiarize the student with a set of well-known supervised, semi-supervised
and unsupervised learning algorithms.
COURSE OUTCOMES:
UNIT V ML IN BIO-MEDICAL 10
Machine Learning in Medical Imaging. Deep Learning for Health Informatics. Deep Learning Automated ECG
Noise Detection and Classification System for Unsupervised Healthcare Monitoring, Techniques for Electronic
Health Record (EHR) Analysis.
Machine Learning as a Process
[Process diagram: Define Objectives → Data Preparation → Model Building → Deployment]
• Define Objectives
– Define measurable and quantifiable goals
– Use this stage to learn about the problem
• Data Preparation
– Normalization
– Transformation
– Missing values
– Outliers
ML as a Process: Data Preparation
• Needed for several reasons
• Some models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data may dramatically impact model performance
• Time spent on data preparation should not be underestimated (a short sketch follows)
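A minimal data-preparation sketch covering normalization, missing values and outliers (hypothetical column names and values; pandas and scikit-learn assumed available):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw measurements with a missing value and an outlier
df = pd.DataFrame({
    "snr_db":     [12.0, 15.5, np.nan, 14.2, 95.0],   # 95.0 looks like an outlier
    "throughput": [34.1, 40.2, 37.8, np.nan, 39.0],
})

# Missing values: fill with the column median
df = df.fillna(df.median(numeric_only=True))

# Outliers: clip to the 1st-99th percentile range
low, high = df["snr_db"].quantile([0.01, 0.99])
df["snr_db"] = df["snr_db"].clip(low, high)

# Normalization: rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(df)
print(scaled)
```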
ML as a Process: Feature engineering
• Determining the predictors (features) to be used is one of the most critical questions
• Sometimes we need to add predictors
• Reduce the number of predictors:
• Fewer predictors give a more interpretable and less costly model
• Most models are affected by high dimensionality, especially by non-informative predictors
• Binning predictors
• Filters: evaluate the relevance of each predictor, normally based on correlations (a small example follows)
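A small sketch of a correlation-based filter for predictor relevance (hypothetical feature names and values; pandas assumed):

```python
import pandas as pd

# Hypothetical training data: three candidate predictors and a numeric target
data = pd.DataFrame({
    "rssi":      [-60, -65, -70, -72, -80, -85],
    "latency":   [10, 12, 15, 18, 25, 30],
    "device_id": [1, 9, 2, 8, 3, 7],   # likely non-informative
    "target":    [50, 45, 40, 35, 25, 20],
})

# Filter: keep predictors whose absolute correlation with the target
# exceeds a chosen threshold
corr = data.corr()["target"].drop("target").abs()
selected = corr[corr > 0.5].index.tolist()

print(corr)
print("Selected predictors:", selected)
```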
ML as a Process: Model Building
• Data Splitting
– Allocate data to different tasks
• model training
• performance evaluation
– Define training, validation and test sets (a splitting sketch follows at the end of this slide)
• Feature Selection (Review the decision made previously)
• Estimating Performance
– Visualization of results – discovery of interesting areas of the
problem space
– Statistics and performance measures
• Evaluation and Model selection
– The ‘no free lunch’ theorem: no a priori assumptions can be made
– Avoid relying on favourite models unless they are actually needed
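A minimal data-splitting sketch for the training/validation/test sets mentioned above (scikit-learn assumed; the 60/20/20 split is an illustrative choice, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 5 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# First hold out the test set, then carve a validation set out of the rest
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```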
• Collective classification: combined classification of inter-linked objects using label-attribute correlations and label-label neighbor correlations.
• Transfer learning: a machine learning method where we reuse a pre-trained model as the starting point for a model on a new task.
• Accuracy: the fraction of correct predictions out of all predictions made, (TP+TN)/(TP+FP+FN+TN)
• Precision: the fraction of items returned by the model (e.g. retrieved documents) that are correct, TP/(TP+FP)
• Recall or Sensitivity: the fraction of actual positives returned by the model, TP/(TP+FN)
• Specificity: the fraction of actual negatives returned by the model, TN/(TN+FP)
• Margin: the distance between the two hyperplanes that separate linearly-separable classes of data points
• Squared Error: the average squared difference between the estimated values and the actual value
• Likelihood: roughly, the probability or chance of something happening
• Posterior Probability: the updated probability of an event occurring after taking new information into account
• Entropy: the randomness, or a measure of the disorder, of the information being processed in machine learning (high entropy corresponds to low purity of the data)
• K-L Divergence: the Kullback-Leibler divergence (KL divergence) score quantifies how much one probability distribution differs from another probability distribution (lower means more similar)
• Cost / Utility: cost-utility analysis (CUA) is one type of economic evaluation that can help compare the costs and effects of alternative interventions
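The first four measures follow directly from confusion-matrix counts; a tiny sketch with made-up counts:

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)        # also called sensitivity
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f}")
```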
ML in Communication Network design
•Machine/deep learning for signal detection, channel modeling, estimation,
interference mitigation, and decoding.
•Resource and network optimization using machine learning techniques.
•Distributed learning algorithms and implementations over realistic
communication networks.
•Machine learning techniques for application/user behavior prediction and
user experience modeling and optimization.
•Machine learning techniques for anomaly detection in communication
networks.
•Machine learning for emerging communication systems and applications,
such as drone systems, IoT, edge computing, caching, smart cities, and
vehicular networks.
•Machine learning for transport-layer congestion control.
•Machine learning for integrated radio frequency/non-radio frequency
communication systems.
ML in Communication Network design
•Machine learning techniques for information-centric networks and data
mining.
•Machine learning for network slicing, network virtualization, and software
defined networking.
•Performance analysis and evaluation of machine learning techniques in
wired/wireless communication systems.
•Scalability and complexity of machine learning in networks.
•Techniques for efficient hardware implementation of neural networks in
communications.
•Synergies between distributed/federated learning and communications.
•Secure machine learning over communication networks.
Example
• A vector in R^n is an ordered set of n real numbers
– e.g. v = (1, 6, 3, 4) is in R^4
– As a column vector: [1; 6; 3; 4] (written vertically)
– As a row vector: [1 6 3 4]
• An m-by-n matrix is an object in R^(m×n) with m rows and n columns, each entry filled with a (typically) real number, e.g.
[1 2 8; 4 78 6; 9 3 2]
• Tensor: an array with more than two axes; the element of A at coordinates (i, j, k) is written A_(i,j,k)
Norms
Vector norms: A norm of a vector ||x|| is informally a
measure of the “length” of the vector.
The squared L2 norm is more convenient to work with mathematically and computationally.
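For reference, the standard norm definitions behind this remark (not spelled out on the slide):

```latex
\|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}, \qquad
\|x\|_2 = \sqrt{\sum_i x_i^2}, \qquad
\|x\|_2^2 = x^\top x, \qquad
\|x\|_\infty = \max_i |x_i|
```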
Transpose examples:
[a; b]^T = [a b],   [a b; c d]^T = [a c; b d]
(rows of the original matrix become columns of the transpose)
We will use upper-case letters for matrices. The elements are referred to as A_(i,j).
• Matrix product: (AB)_(i,j) = Σ_k A_(i,k) B_(k,j)
• Examples of structured matrices:
tri-diagonal: [a b 0 0; c d e 0; 0 f g h; 0 0 i j]
lower-triangular: [a 0 0; b c 0; d e f]
• Symmetric matrix: A = A^T
• Identity matrix I, e.g. [1 0 0; 0 1 0; 0 0 1]
• Unit vector: a vector with unit norm
• Linear dependence, e.g. x3 = −2·x1 + x2
• Standard basis of R^3: (1,0,0), (0,1,0), (0,0,1)
By Thomas Minka. Old and New Matrix Algebra Useful for Statistics
Prove the following using example vectors/matrices: http://matrixcookbook.com/
Matrix Inverse
• A ∈ R^(m×n), b ∈ R^m a known vector, and x ∈ R^n a vector of unknown variables
– Decompose X = U D V^T
– Then X^T X = V D U^T U D V^T = V D^2 V^T
– Since V(D^2)V^T · V(D^2)^(-1) V^T = I
– We know that (X^T X)^(-1) = V(D^2)^(-1) V^T
– Inverting a diagonal matrix D^2 is trivial (a numerical check follows)
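A quick NumPy check of this identity on a random matrix (the matrix and its size are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))          # tall matrix with full column rank

# SVD: X = U D V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# (X^T X)^{-1} = V D^{-2} V^T  -- inverting the diagonal matrix D^2 is trivial
inv_via_svd = Vt.T @ np.diag(1.0 / d**2) @ Vt

print(np.allclose(inv_via_svd, np.linalg.inv(X.T @ X)))   # True
```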
Parallel Decomposition of MIMO Channels
Singular Value Decomposition of the channel: H = U Σ V^H
System Model
[y_1(k), …, y_(M_R)(k)]^T = H [s_1(k), …, s_(M_T)(k)]^T + [n_1(k), …, n_(M_R)(k)]^T
where H is the M_R × M_T matrix of channel gains h_(ij), i.e. y(k) = H s(k) + n(k) with H = U Σ V^H
Precoding in MIMO
Ỹ = U^H Y,  where Y = H X + n
  = U^H [ U Σ V^H X + n ]
  = (U^H U) Σ V^H X + U^H n
  = Σ X̃ + ñ
where X̃ = V^H X (equivalently X = V X̃) and ñ = U^H n
[Note: var(ñ_i) = var(n_i), since U is unitary]
Hence Ỹ = Σ X̃ + ñ, a set of parallel channels; a numerical sketch follows.
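A minimal NumPy sketch of this parallel decomposition (random 4×4 complex channel and unit-variance noise; the dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
Mr, Mt = 4, 4

# Random MIMO channel and its SVD: H = U @ diag(s) @ Vh
H = (rng.normal(size=(Mr, Mt)) + 1j * rng.normal(size=(Mr, Mt))) / np.sqrt(2)
U, s, Vh = np.linalg.svd(H)

# Precoding: transmit X = V @ X_tilde over the channel
X_tilde = rng.normal(size=(Mt, 1)) + 1j * rng.normal(size=(Mt, 1))
X = Vh.conj().T @ X_tilde
n = (rng.normal(size=(Mr, 1)) + 1j * rng.normal(size=(Mr, 1))) / np.sqrt(2)
Y = H @ X + n

# Receiver shaping: Y_tilde = U^H Y = diag(s) @ X_tilde + U^H n  (parallel channels)
Y_tilde = U.conj().T @ Y
print(np.allclose(Y_tilde, np.diag(s) @ X_tilde + U.conj().T @ n))   # True
```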
Principal Component Analysis (PCA)
Factor or Component Analysis
• Discover a new set of factors/dimensions/axes against which to represent, describe
or evaluate the data
– For more effective reasoning, insights, or better visualization
– Reduce noise in the data
– Typically a smaller set of factors: dimension reduction
– Better representation of data without losing much information
– Can build more effective data analyses on the reduced-dimensional
space: classification, clustering, pattern recognition
• Multidimensional Scaling:
– Find projection that best preserves inter-point distances
Andrew Ng
Ex. Data Compression
Reduce data from 3D to 2D
Andrew Ng
What are the new axes?
[Figure: data in the plane of Original Variable A and Original Variable B, with the new axes PC 1, PC 2, and so on, overlaid]
PCA Summary
• PCA is “an orthogonal linear transformation that transforms the data
to a new coordinate system such that the greatest variance by any
projection of the data comes to lie on the first coordinate (first
principal component), the second greatest variance lies on the
second coordinate (second principal component), and so on.”
• Most common form of factor analysis
• The new variables/dimensions
– Are linear combinations of the original ones
– Are uncorrelated with one another
» Orthogonal in original dimension space
– Capture as much of the original variance in the data as
possible
– Are called Principal Components
PCA Summary
• Principle
– Linear projection method to reduce the number of parameters
– Transform a set of correlated variables into a new set of uncorrelated
variables
– Map the data into a space of lower dimensionality
– Form of unsupervised learning
• Properties
– It can be viewed as a rotation of the existing axes to new positions in
the space defined by original variables
– New axes are orthogonal and represent the directions with maximum
variability
Principal Component Analysis (PCA) problem formulation
• Subsequent roots are ordered such that λ1 > λ2 > … > λM, with rank(D) non-zero values
[Scree plot: Variance (%) explained by each principal component, PC1–PC10]
• You do lose some information, but if the eigenvalues are small, you
don’t lose much
– n dimensions in original data
– calculate n eigenvectors and eigenvalues
– choose only the first p eigenvectors, based on their eigenvalues
– final data set has only p dimensions
PCA – mathematical background
• Suppose attributes are A1 and A2, and we have n training
examples. x’s denote values of A1 and y’s denote values of A2
over the training examples.
• Variance of attribute A1: var(A1) = Σ_{i=1..n} (x_i − x̄)² / (n − 1)
• Covariance of two attributes: cov(A1, A2) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / (n − 1)
A: square matrix
λ: eigenvalue or characteristic value
x: eigenvector or characteristic vector
Example
Eigenvector and Eigenvalue
Ax = λx  ⇒  Ax − λx = 0  ⇒  (A − λI)x = 0
Let B = A − λI. If B has an inverse, then x = B⁻¹·0 = 0. But an eigenvector cannot be zero!
So x will be an eigenvector of A if and only if B does not have an inverse, or equivalently det(B) = 0:
det(A − λI) = 0
Eigenvector and Eigenvalue
Example 1: Find the eigenvalues of A = [2 −12; 1 −5]
det(λI − A) = (λ − 2)(λ + 5) + 12 = λ² + 3λ + 2 = (λ + 1)(λ + 2) = 0
Two eigenvalues: λ = −1, −2
Note: The roots of the characteristic equation can be repeated. That is, λ1 = λ2 = … = λk. If that happens, the eigenvalue is said to be of multiplicity k.
Example 2: Find the eigenvalues of A = [2 1 0; 0 2 0; 0 0 2]
det(λI − A) = (λ − 2)³ = 0, so λ = 2 is an eigenvalue of multiplicity 3
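Both examples can be checked numerically; a small NumPy sketch using the matrices from the examples above:

```python
import numpy as np

A1 = np.array([[2, -12],
               [1,  -5]])
print(np.linalg.eigvals(A1))    # approximately [-1., -2.]

A2 = np.array([[2, 1, 0],
               [0, 2, 0],
               [0, 0, 2]])
print(np.linalg.eigvals(A2))    # [2., 2., 2.] -> eigenvalue 2 with multiplicity 3
```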
Principal Component Analysis
Input: the covariance matrix
[ var(H)  104.5 ]   [ 47.7   104.5 ]
[ 104.5  var(M) ] = [ 104.5  370.0 ]
PCA - Example
1. Given original data set S = {x1, ..., xk}, produce new set by
subtracting the mean of attribute Ai from each xi.
2. Compute the covariance matrix of the mean-adjusted data (attributes x and y).
3. Compute its eigenvalues and eigenvectors:
eigenvalue 1.28402771, eigenvector v1 = (.677873399, .735178956)
eigenvalue .0490833989, eigenvector v2 = (−.735178956, .677873399)
4. Choose components and form a feature vector:
FeatureVector1 = [.677873399 −.735178956; .735178956 .677873399] (both eigenvectors)
FeatureVector2 = [.677873399; .735178956] (principal eigenvector only)
5. Derive the new data set.
RowFeatureVector1 = [.677873399 .735178956; −.735178956 .677873399]
RowDataAdjust = [ .69 −1.31 .39 .09 1.29 .49 .19 −.81 −.31 −.71 ;
                  .49 −1.21 .99 .29 1.09 .79 −.31 −.81 −.31 −1.01 ]
so we can do
RowDataAdjust = RowFeatureVector⁻¹ × TransformedData = RowFeatureVectorᵀ × TransformedData
and
RowDataOriginal = RowDataAdjust + OriginalMean
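A NumPy sketch of the same pipeline (eigen-decompose the covariance of the mean-adjusted data, project, then reconstruct), using the mean-adjusted values listed above:

```python
import numpy as np

# Mean-adjusted data from the example (row 0: x values, row 1: y values)
row_data_adjust = np.array([
    [ .69, -1.31, .39, .09, 1.29, .49,  .19, -.81, -.31,  -.71],
    [ .49, -1.21, .99, .29, 1.09, .79, -.31, -.81, -.31, -1.01],
])

# Eigen-decomposition of the covariance matrix
cov = np.cov(row_data_adjust)
eig_vals, eig_vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
order = np.argsort(eig_vals)[::-1]
row_feature_vector = eig_vecs[:, order].T         # rows = principal components

# Project onto the principal components, then reconstruct
transformed = row_feature_vector @ row_data_adjust
reconstructed = row_feature_vector.T @ transformed   # RowFeatureVector^T = RowFeatureVector^-1
print(np.allclose(reconstructed, row_data_adjust))   # True
```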
Probability Theory - A Tool For Artificial
Intelligence Applications
Frequentist probability: related directly to the rates at which events occur.
E.g., when we say that an outcome has a probability p of occurring, it means that if
we repeated the experiment (e.g., drawing a hand of cards) infinitely many times,
then a proportion p of the repetitions would result in that outcome.
Bayesian probability: e.g., if a doctor analyzes a patient and says that the patient has
a 40 percent chance of having the flu, this means something very different—we cannot
make infinitely many replicas of the patient, nor is there any reason to believe that
different replicas of the patient would present with the same symptoms yet have varying
underlying conditions. In the case of the doctor diagnosing the patient, we use
probability to represent a degree of belief, with 1 indicating absolute certainty that
the patient has the flu and 0 indicating absolute certainty that the patient does not
have the flu.
Mixtures of Distributions
• Defining probability distributions by combining other simpler
probability distributions
• Mixture Distribution:
• A mixture distribution is made up of several
component distributions
• On each trial, the choice of which component distribution should
generate the sample is determined by sampling a component
identity from a multinoulli distribution
P(x) = Σ_i P(c = i) P(x | c = i)
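A small sketch of ancestral sampling from such a mixture (Gaussian components; the weights and parameters are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)

weights = np.array([0.2, 0.5, 0.3])    # P(c = i), a multinoulli distribution
means   = np.array([-2.0, 0.0, 3.0])   # parameters of the components P(x | c = i)
stds    = np.array([0.5, 1.0, 0.8])

# On each trial, first sample the component identity, then sample x from it
c = rng.choice(len(weights), size=10_000, p=weights)
x = rng.normal(means[c], stds[c])

print(x.mean(), weights @ means)       # empirical mean vs. exact mixture mean
```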
Mixture Model
• Mixture model is one simple strategy for combining
probability distributions to create a richer distribution
• The softplus function ζ(x) = log(1 + exp(x)) is a smoothed, or “softened,” version of x+ = max(0, x)
Some Useful Properties
Information theory
• A branch of applied mathematics that revolves around
quantifying how much information is present in a signal
• Learning that an unlikely event has occurred is more informative
than learning that a likely event has occurred
• The self-information of an event x = x is I(x) = −log P(x), measured in nats
• One nat is the amount of information gained by observing an
event of probability 1/e
• Self-information deals only with a single outcome.
• Uncertainty in an entire probability distribution can be quantified
using the Shannon entropy
The Kullback-Leibler divergence DKL(P || Q):
• is non-negative
• is 0 if and only if P and Q are the same distribution in the
case of discrete variables, or equal “almost everywhere” in
the case of continuous variables
• is often conceptualized as measuring some sort of distance
between these distributions
• is not a true distance measure because it is not symmetric:
DKL(P || Q) not equal to DKL(Q || P ) for some P and Q.
• Cross-entropy: a quantity closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + DKL(P || Q) = −E_(x∼P) log Q(x); a short numerical illustration follows.
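A short numerical illustration of entropy, KL divergence and cross-entropy for two discrete distributions (the distributions are arbitrary):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

entropy_P = -np.sum(P * np.log(P))          # Shannon entropy of P, in nats
kl_PQ     =  np.sum(P * np.log(P / Q))      # D_KL(P || Q)
kl_QP     =  np.sum(Q * np.log(Q / P))      # D_KL(Q || P): not equal in general
cross_ent = -np.sum(P * np.log(Q))          # H(P, Q) = H(P) + D_KL(P || Q)

print(kl_PQ, kl_QP)
print(np.isclose(cross_ent, entropy_P + kl_PQ))   # True
```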
• Unsupervised Learning
– There is no predefined and known set of outcomes
– Look for hidden patterns and relations in the data
– A typical example: clustering (see the sketch below)
[Figure: iris measurements clustered, Petal.Width vs Petal.Length, points coloured by irisCluster$cluster]
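A minimal clustering sketch along the lines of that figure, using k-means on the iris data (scikit-learn assumed; 3 clusters is an illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Unsupervised: the class labels are never shown to the algorithm
X = load_iris().data                       # sepal/petal lengths and widths
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])                 # cluster assignment per sample
print(kmeans.cluster_centers_)
```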
Supervised and Unsupervised Learning
• Supervised Learning
– For every example in the data there is always a predefined outcome
– Models the relations between a set of descriptive features and a
target (Fits data to a function)
– 2 groups of problems:
• Classification
• Regression
Supervised Learning
• Classification
– Predicts which class a given sample of data (sample of descriptive features) is part of (discrete value).
[Confusion matrix (percent), Predicted vs. Actual, iris example:
Predicted \ Actual    setosa   versicolor   virginica
setosa                 100.0        0.0         0.0
versicolor               0.0       96.0         4.0
virginica                0.0        4.0        96.0 ]
• Regression
– Predicts continuous values.