Machine Learning (1)
Department of CSE
(Emerging Technologies)
(Datascience,Cybersecurity,Internet of Things)
B.TECH(R-20 Regulation)
(III YEAR – II SEM)
(2022-23)
MACHINE LEARNING
(R20A0525)
LECTURE NOTES
Prepared by
V.Suneetha,Associate Professor
MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY
(Autonomous Institution – UGC, Govt. of India)
Recognized under 2(f) and 12(B) of UGC ACT 1956
(Affiliated to JNTUH, Hyderabad, Approved by AICTE-Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad–500100, Telangana State, India
Department of Computer Science and Engineering
EMERGING TECHNOLOGIES
Vision
Mission
The department of CSE (Emerging Technologies) is committed to:
To offer highest Professional and Academic Standards in terms of Personal growth and
satisfaction.
Make the society as the hub of emerging technologies and thereby capture
opportunities in new age technologies.
To create a benchmark in the areas of Research, Education and Public Outreach.
To provide students a platform where independent learning and scientific study are
encouraged with emphasis on latest engineering techniques.
QUALITY POLICY
To pursue continual improvement of the teaching-learning process of Undergraduate and
Post Graduate programs in Engineering & Management vigorously.
To provide state-of-the-art infrastructure and expertise to impart quality education and a
research environment to students for a complete learning experience.
To offer quality, relevant and cost-effective programmes to produce engineers as per the
requirements of industry.
For more information: www.mrcet.ac.in
B.Tech – CSE (Emerging Technologies)
3-/-/-3
Course Outcome:
1. Explain the concepts and prepare the dataset for different Machine
Learning models.
2. Identify and Apply appropriate Supervised Learning models.
3. Design Neural Network models for the given data.
4. Perform Evaluation of Machine Learning algorithms and Model Selection.
5. Devise un-supervised and Reinforcement learning models
UNIT – I
Introduction: Introduction to Machine learning, Supervised learning, Unsupervised
learning, Reinforcement learning. Deep learning.
Feature Selection: Filter, Wrapper, Embedded methods.
Feature Normalization: min-max normalization, z-score normalization, and constant
factor normalization.
Introduction to Dimensionality Reduction : Principal Component Analysis(PCA),
Linear Discriminant Analysis(LDA)
UNIT-II
Supervised Learning – I (Regression/Classification)
Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function,
Gradient Descent, Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE),
R-Squared error, Adjusted R Square.
Classification models: Decision Trees (ID3, CART), Naive Bayes, K-Nearest-Neighbours (KNN),
Logistic Regression, Multinomial Logistic Regression, Support Vector Machines (SVM) -
Nonlinearity and Kernel Methods
UNIT – III
Supervised Learning – II (Neural Networks)
Neural Network Representation – Problems – Perceptrons, Activation Functions,
Artificial Neural Networks (ANN), Back Propagation Algorithm.
Convolutional Neural Networks - Convolution and Pooling layers, Recurrent Neural
Networks (RNN).
Classification Metrics: Confusion matrix, Precision, Recall, Accuracy, F-Score, ROC
curves
UNIT - IV
Model Validation in Classification : Cross Validation - Holdout Method, K-Fold,
Stratified K-Fold, Leave-One-Out Cross Validation. Bias-Variance tradeoff,
Regularization , Overfitting, Underfitting. Ensemble Methods: Boosting, Bagging,
Random Forest.
UNIT – V
Unsupervised Learning : Clustering-K-means, K-Modes, K-Prototypes, Gaussian
Mixture Models, Expectation-Maximization.
Reinforcement Learning: Exploration and exploitation trade-offs, non-associative
learning, Markov decision processes, Q-learning.
Text Book(s)
1. Tom M. Mitchell, Machine Learning, MGH.
2. Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
3. R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
Reference Books
1. Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical
Learning, Springer, 2009.
2. Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
3. Andrew Ng, Machine Learning Yearning.
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann.
INDEX
Overfitting, Underfitting
Expectation-Maximization
UNIT V
Reinforcement Learning: Exploration and exploitation trade-offs
Non-associative learning
UNIT-I
Machine Learning is a concept which allows the machine to learn from examples and experience
without being explicitly programmed. Instead of writing the logic yourself, you
feed data to a generic algorithm, and the algorithm builds the logic based on the given
data.
A Machine Learning algorithm is trained using a training data set to create a model. When new input
data is introduced, the model makes a prediction. The prediction
is evaluated for accuracy, and if the accuracy is acceptable, the model is
deployed. If the accuracy is not acceptable, the algorithm is trained again,
repeatedly if needed, with an augmented training data set.
Supervised Learning is learning guided by a teacher. We have a
dataset which acts as the teacher, and its role is to train the model or the machine. Once the model is trained,
it can start making predictions or decisions when new data is given to it.
In unsupervised learning, the model learns through observation and finds structures in the data. Once the model is given a dataset, it
automatically finds patterns and relationships in the dataset by creating clusters in it. What it cannot do is
add labels to the clusters: it cannot say that this is a group of apples or mangoes, but it will separate all the
apples from the mangoes.
Suppose we present images of apples, bananas and mangoes to the model. Based on
some patterns and relationships, it creates clusters and divides the dataset into those clusters. If new
data is then fed to the model, it adds it to one of the created clusters.
Classification of Machine Learning Algorithms
Machine Learning algorithms can be classified into:
1. Supervised Algorithms – Linear Regression, Logistic Regression, Support Vector Machine (SVM),
Decision Trees, Random Forest
2. Unsupervised Algorithms – K-Means Clustering
3. Reinforcement Algorithms
given labels based on certain parameters through which the machine will learn these features and
patterns and classify some new input data based on the learning from this training data.
Supervised Learning Algorithms can be broadly divided into two types of algorithms, Classification and
Regression.
Classification Algorithms
Just as the name suggests, these algorithms are used to classify data into predefined classes or labels.
Regression Algorithms
These algorithms are used to determine the mathematical relationship between two or more variables and
the level of dependency between variables. These can be used for predicting an output based on the
interdependency of two or more variables. For example, an increase in the price of a product will decrease
its consumption, which means, in this case, the amount of consumption will depend on the price of the
product. Here, the amount of consumption is called the dependent variable and the price of the product
is called the independent variable. The level of dependency of the amount of consumption on the price
of a product will help us predict the future value of the amount of consumption based on the change in
prices of the product.
We have two types of regression algorithms: Linear Regression and Logistic Regression. Despite its name, Logistic Regression is used to predict categorical outcomes.
blue etc. The graph of logistic regression consists of a non-linear sigmoid function which gives the
probability of the outcome.
Another machine learning concept which is extensively used in the field is Neural Networks.
Normalization is a scaling technique in Machine Learning applied during data preparation to change
the values of numeric columns in the dataset to a common scale. It is not necessary for every dataset;
it is required only when the features have different ranges, in which case it helps to enhance the performance and
reliability of a machine learning model.
Xn = (X - Xminimum) / (Xmaximum - Xminimum)
where
o Xn = Normalized value
o Xmaximum = Maximum value of a feature
o Xminimum = Minimum value of a feature
Example: Let's assume we have a dataset having maximum and minimum values of a feature as
mentioned above. To normalize the data, values are shifted and rescaled so their
range can vary between 0 and 1. This technique is also known as Min-Max scaling. In this scaling:
Case 1: If the value of X is the minimum, the numerator is 0; hence the normalized value is also 0.
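The min-max formula can be sketched in a few lines of Python. This is only an illustration: the function name `min_max_normalize` and the sample column `ages` are assumptions made for the example and do not come from the notes.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] with min-max scaling."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # All values are identical; mapping them to 0.0 is a convention
        # chosen for this sketch (the formula itself is undefined here).
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


ages = [18, 25, 40, 60]  # hypothetical feature column
scaled = min_max_normalize(ages)
print(scaled)
```

The minimum of the column maps to 0 and the maximum maps to 1, which matches Case 1 of the formula.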
Principal Component Analysis (PCA) was introduced by Karl Pearson. It works on the condition that while the data in a higher
dimensional space is mapped to data in a lower dimensional space, the variance of the data in the lower
dimensional space should be maximum.
It involves the following steps:
Construct the covariance matrix of the data.
Compute the eigenvectors of this matrix.
Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of the
variance of the original data.
Hence, we are left with a smaller number of eigenvectors, and there might have been some data loss in the
process. But the most important variances should be retained by the remaining eigenvectors.
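The steps above can be sketched in plain Python for a tiny dataset. This is a minimal illustration, not a production implementation: it finds only the first principal component, and it uses power iteration as a stand-in for a full eigendecomposition. The data points and all function names are assumptions made for the example.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def top_eigenvector(matrix, iterations=200):
    """Power iteration: approximate the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the matching eigenvalue estimate.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# Hypothetical 2-D data with strongly correlated features.
points = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]]
cov, centered = covariance_matrix(points)
pc1, variance = top_eigenvector(cov)

# Project each centered point onto the first principal component (2-D -> 1-D).
projected = [sum(c[j] * pc1[j] for j in range(2)) for c in centered]
print(pc1, variance)
```

The variance of the projected values equals the largest eigenvalue, which is the sense in which the retained direction keeps the most variance.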
There are a lot of machine learning problems which are nonlinear, and the use of nonlinear feature mappings
can help to produce new features which make prediction problems linear. In this section we will discuss the
following idea: transformation of the dataset to a new higher-dimensional (in some cases infinite-dimensional)
feature space and the use of PCA in that space in order to produce uncorrelated features. Such a
method is called Kernel Principal Component Analysis or KPCA.
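Let $\phi$ be a nonlinear mapping of the data items $x_1, \dots, x_n$ into the feature space, and let $\Phi = [\phi(x_1), \dots, \phi(x_n)]$ be the matrix whose columns are the mapped items, where $\phi(x_i) \in \mathbb{R}^M$. We will consider that the dimensionality of the feature space equals $M$.
Eigendecomposition of the Gram matrix $K = \Phi^T \Phi$ is given by
$$K = U \Lambda U^T.$$
By the definition of $K$,
$$\Phi^T \Phi U = U \Lambda.$$
And therefore, multiplying both sides by $\Phi$ on the left,
$$(\Phi \Phi^T)(\Phi U) = (\Phi U) \Lambda,$$
so the columns of $V = \Phi U \Lambda^{-1/2}$ are unit-length eigenvectors of $\Phi \Phi^T$, which is proportional to the covariance matrix of the mapped data.
So far, we have assumed that the mapping $\phi$ is known. From the equations above, we can see that the only
thing we need for the data transformation is the eigendecomposition of the Gram matrix $K$. Dot products,
which are its elements, can be defined without any explicit definition of $\phi$. The function defining such dot
products in some Hilbert space is called a kernel. Valid kernels satisfy the conditions of Mercer's theorem. There are
many different types of kernels; several popular ones are:
1. Linear: $k(x, y) = x^T y$;
2. Gaussian: $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$;
3. Polynomial: $k(x, y) = (x^T y + c)^d$.
Using a kernel function we can write a new equation for the projection of some data item $x$ onto the $j$-th
eigenvector:
$$v_j^T \phi(x) = \lambda_j^{-1/2} \sum_{i=1}^{n} u_{ij} \, k(x_i, x).$$
So far, we have assumed that the columns of $\Phi$ have zero mean. Using the $n \times n$ matrix $\mathbf{1}_n$ whose elements all equal $1/n$, the Gram matrix of the centered data is
$$\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n.$$
Summary: Now we are ready to write the whole sequence of steps to perform KPCA:
1. Calculate $K$.
2. Calculate $\tilde{K}$.
3. Find the eigenvectors of $\tilde{K}$ corresponding to nonzero eigenvalues and normalize them:
$u_j \leftarrow u_j / \sqrt{\lambda_j}$.
4. Sort the found eigenvectors in descending order of the corresponding eigenvalues.
5. Perform projections onto the given subset of eigenvectors.
The method described above requires defining the number of components, the kernel and its parameters. It
should be noted that the number of nonlinear principal components in the general case is infinite, but since
we are computing the eigenvectors of an $n \times n$ matrix $\tilde{K}$, at maximum we can calculate $n$ nonlinear principal
components.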
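As a partial illustration of the KPCA procedure, the sketch below covers only its first two steps, assuming that step 1 is the computation of the Gram (kernel) matrix and step 2 is its centering in feature space. The Gaussian kernel with parameter `gamma`, the sample data and the function names are assumptions made for the example; the eigendecomposition and projection steps are left out.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(data, kernel):
    """Matrix of pairwise kernel values for all data items."""
    return [[kernel(a, b) for b in data] for a in data]


def center_gram(K):
    """Center K in feature space: K - 1n*K - K*1n + 1n*K*1n, with 1n = ones/n."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


# Hypothetical 2-D data items.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
K = gram_matrix(data, gaussian_kernel)
K_centered = center_gram(K)
```

After centering, every row and column of the matrix sums to zero, which corresponds to the mapped data having zero mean in the feature space.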
UNIT-II
Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function, Gradient Descent,
Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared error, Adjusted R
Square.
Supervised and unsupervised learning are the approaches most used by machine learning engineers and data scientists.
Reinforcement learning is powerful but complex to apply to problems.
Supervised learning
As we know, machine learning takes data as input. Let's call this data the training data.
What are Inputs and Labels (Targets)? For example, for the addition of two numbers a=5, b=6 with result 11, the Inputs are
5 and 6 and the Target is 11.
We first train the model with lots of training data (inputs and targets); then, with new data and the logic
learned before, we predict the output.
(Note: We don't get exactly 6 as the answer; we may get a value which is close to 6, depending on the training data and
algorithm.)
This process is called Supervised Learning; it is typically fast and accurate when enough labeled training data is available.
Regression: This is a type of problem where we need to predict a continuous response value (e.g.,
above we predict a number which can vary from -infinity to +infinity).
How many total runs can be on the board in a cricket game? There are many such things we can predict if we
wish.
Classification: This is a type of problem where we predict a categorical response value, where the data
can be separated into specific "classes" (e.g., we predict one of the values in a set of values).
Unsupervised learning
The training data does not include Targets here, so we don't tell the system where to go; the system has
to understand the structure itself from the data we give.
Here the training data is not structured (it contains noisy data, unknown data, etc.).
Unsupervised process
There are also different types of unsupervised learning, such as clustering and anomaly detection (clustering is
the best known).
Clustering is a bit similar to multi-class classification, but here we don't provide the labels; the system understands
from the data itself and clusters the data.
Unsupervised learning is a bit more difficult to implement and is not used as widely as supervised learning.
Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence.
It allows machines and software agents to automatically determine the ideal behavior within a specific
context, in order to maximize their performance. Simple reward feedback is required for the agent to learn its
behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is
defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning
algorithms. In the problem, an agent is supposed to decide the best action to select based on its current state.
When this step is repeated, the problem is known as a Markov Decision Process.
In order to produce intelligent programs (also called agents), reinforcement learning goes through the
following steps:
3. After the action is performed, the agent receives reward or reinforcement from the environment.
Q-Learning
Use cases:
Some applications of the reinforcement learning algorithms are computer played board games (Chess, Go),
robotic hands, and self-driving cars.
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and
one or more independent (predictor) variables. More specifically, regression
analysis helps us to understand how the value of the dependent variable changes with respect to an
independent variable when the other independent variables are held fixed. It predicts continuous/real values such
as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, which runs various advertisements every year and gets
sales from them. The below list shows the advertisement spend of the company in the last 5 years and the
corresponding sales:
Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the
predicted sales for this year. To solve such prediction problems in machine learning,
we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between variables and
enables us to predict a continuous output variable based on one or more predictor variables. It is
mainly used for prediction, forecasting, time series modeling, and determining the cause-effect
relationship between variables.
In Regression, we plot a line or curve which best fits the given datapoints; using this plot, the
machine learning model can make predictions about the data. In simple words, "Regression shows a line or
curve fitted to the datapoints on the target-predictor graph in such a way that the vertical
distance between the datapoints and the regression line is minimum." The distance between the datapoints and the
line tells whether a model has captured a strong relationship or not.
Types of Regression
There are various types of regression which are used in data science and machine learning. Each type has its
own importance in different scenarios, but at the core, all the regression methods analyze the effect of the
independent variables on the dependent variable. Here we are discussing some important types of regression
which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the simplest algorithms which works on regression and shows the
relationship between continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis) and the
dependent variable (Y-axis), hence it is called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is called multiple linear
regression.
o The relationship between variables in the linear regression model can be explained using the below
image. Here we are predicting the salary of an employee on the basis of the years of experience.
o Salary forecasting
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern
recognition as a non-parametric technique since the beginning of the 1970s.
ALGORITHM
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most
common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is simply
assigned to the class of its nearest neighbor.
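The majority-vote rule can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the distance function; the training points, labels and the function name `knn_classify` are made up for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training cases forming two well-separated groups.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B")]

print(knn_classify(train, (1.1, 1.0), k=3))
print(knn_classify(train, (4.0, 4.0), k=1))
```

With `k=1` the function reduces to assigning the class of the single nearest neighbor, as described above.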
What is a classifier?
A classifier is a machine learning model that is used to discriminate between different objects based on certain
features.
Bayes Theorem:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B
is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are
independent. That is, the presence of one particular feature does not affect the other. Hence it is called
naive.
Example:
Let us take an example to get some better intuition. Consider the problem of playing golf. The dataset is
represented as below.
We classify whether the day is suitable for playing golf, given the features of the day. The columns represent
these features and the rows represent individual entries. If we take the first row of the dataset, we can observe
that it is not suitable for playing golf if the outlook is rainy, temperature is hot, humidity is high and it is not
windy. We make two assumptions here. First, as stated above, we consider that these predictors are independent.
That is, if the temperature is hot, it does not necessarily mean that the humidity is high. The other assumption
made here is that all the predictors have an equal effect on the outcome. That is, the day being windy does not
have more importance in deciding whether to play golf or not.
The variable y is the class variable (play golf), which represents whether it is suitable to play golf or not given the
conditions. Variable X represents the parameters/features.
X is given as,
$$X = (x_1, x_2, \dots, x_n)$$
Here $x_1, x_2, \dots, x_n$ represent the features, i.e. they can be mapped to outlook, temperature, humidity and
windy. By substituting for X and expanding using the chain rule we get,
$$P(y \mid x_1, \dots, x_n) = \frac{P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y)\, P(y)}{P(x_1)\, P(x_2) \cdots P(x_n)}$$
Now, you can obtain the values for each term by looking at the dataset and substitute them into the equation. For all
entries in the dataset, the denominator does not change; it remains static. Therefore, the denominator can be
removed and a proportionality can be introduced.
$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
In our case, the class variable (y) has only two outcomes, yes or no. There could be cases where the
classification could be multivariate. Therefore, we need to find the class y with maximum probability.
$$y = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
Using the above function, we can obtain the class, given the predictors.
Types of Naive Bayes Classifier:
Naive Bayes Classifier technique is based on Bayes' theorem and is particularly suited when
the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more
sophisticated classification methods.
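The rule of choosing the class with the maximum value of prior times per-feature likelihoods can be sketched for categorical features as below. This is an illustration only: the tiny (outlook, windy) dataset is hypothetical and is not the golf dataset referred to in the notes, the function names are assumptions, and no smoothing is applied, so an unseen feature value gives a class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, y)][value] += 1
    return class_counts, feature_counts


def predict(row, class_counts, feature_counts):
    """Return the class maximizing P(y) * product of P(x_i | y), plus all scores."""
    total = sum(class_counts.values())
    scores = {}
    for y, count in class_counts.items():
        score = count / total  # prior P(y)
        for i, value in enumerate(row):
            score *= feature_counts[(i, y)][value] / count  # likelihood P(x_i | y)
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical data: each row is (outlook, windy); the label says whether to play.
rows = [("sunny", "no"), ("sunny", "yes"), ("overcast", "no"), ("rainy", "no"),
        ("rainy", "yes"), ("overcast", "yes"), ("sunny", "no"), ("rainy", "no")]
labels = ["no", "no", "yes", "yes", "no", "yes", "no", "yes"]

class_counts, feature_counts = train_naive_bayes(rows, labels)
prediction, scores = predict(("sunny", "no"), class_counts, feature_counts)
print(prediction, scores)
```

The scores are not normalized probabilities because the constant denominator is dropped, which does not change which class attains the maximum.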
To demonstrate the concept of Naïve Bayes Classification, consider the example displayed in the illustration
above. As indicated, the objects can be classified as either GREEN or RED. Our task is to classify new cases
as they arrive, i.e., decide to which class label they belong, based on the currently existing objects.
Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which
hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In the Bayesian
analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in
this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually
happen.
Thus, we can write:
Prior probability of GREEN ∝ Number of GREEN objects / Total number of objects
Prior probability of RED ∝ Number of RED objects / Total number of objects
Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many
GREEN compared to RED) the likelihood indicates otherwise; that the class membership of X is RED (given
that there are more RED objects in the vicinity of X than GREEN). In the Bayesian analysis, the final
classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form
a posterior probability using the so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761).
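The combination of prior and likelihood can be illustrated numerically. The counts below are hypothetical, chosen only so that GREEN objects are twice as numerous as RED ones and more RED objects lie in the vicinity of X. The likelihood is taken here, as an assumption, to be the fraction of a class's objects that fall in that vicinity.

```python
def posterior_scores(class_totals, neighbours):
    """Score each class as prior * likelihood (unnormalized posterior)."""
    total = sum(class_totals.values())
    scores = {}
    for label, count in class_totals.items():
        prior = count / total                            # share of all objects
        likelihood = neighbours.get(label, 0) / count    # share of class near X
        scores[label] = prior * likelihood
    return scores


class_totals = {"GREEN": 40, "RED": 20}  # hypothetical: twice as many GREEN
neighbours = {"GREEN": 1, "RED": 3}      # hypothetical objects in the vicinity of X

scores = posterior_scores(class_totals, neighbours)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With these assumed counts the likelihood outweighs the prior, so X is assigned to RED even though GREEN is the more common class overall.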
Assume that you are given characteristic information about 10,000 people living in your town. You are asked to
study them and come up with an algorithm which should be able to tell whether a new person coming to the
town is male or female.
The tree shown above divides the data in such a way that we gain the maximum information. To understand
the tree:
– If a person’s hair length is less than 5 inches and weight is greater than 55 kg, then there is an 80% chance
of that person being male.
If you are familiar with predictive modelling, e.g., Logistic Regression, Random Forest etc., you might be
wondering what the difference is between a Logistic model and a Decision Tree,
because in both algorithms we are trying to predict a categorical variable.
There are a few fundamental differences between the two, but ideally both approaches should give you the
same results. The best use of Decision Trees is when your solution requires a visual representation. For example,
you are working for a Telecom Operator and building a solution using which a call center agent can take a
It is unlikely that a call center executive will understand Logistic Regression or its
equations, but using a more visually appealing solution you might gain better adoption from your call center
team.
How does Decision Tree work?
There are multiple algorithms written to build a decision tree, which can be used according to the characteristics of the problem
you are trying to solve. A few of the commonly used algorithms are listed below:
Though the methods differ between decision tree building algorithms, all of them work on the
principle of greediness. The algorithms search for the variable which gives the maximum information gain or
divides the data in the most homogeneous way.
For an example, consider the following hypothetical dataset which contains the Lead Actor and Genre of a movie
along with its success at the box office:
Lead Actor Genre Hit(Y/N)
Let's say you want to identify the success of the movie but you can use only one variable. There are the
following two ways in which this can be done:
You can clearly observe that Method 1 (based on lead actor) splits the data best, while the second method
(based on genre) has produced mixed results. Decision Tree algorithms do similar things when it comes to
selecting variables.
There are various metrics which decision trees use in order to find the best split variables. We’ll go
through them one by one and try to understand what they mean.
Entropy & Information Gain
The word Entropy is borrowed from Thermodynamics, where it is a measure of variability or chaos or
randomness. Shannon extended the thermodynamic entropy concept in 1948 and introduced it into statistical
studies, suggesting the following formula for statistical entropy:
$$H = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the probability of class $i$.
The graph shown above shows the variation of Entropy with the probability of a class. We can clearly see
that Entropy is maximum when the two classes are equally probable. Now, you can understand
that when a decision algorithm tries to split the data, it selects the variable which will give us
the maximum reduction in system Entropy.
The impurity captured (the information gain) after splitting the data using Method 1 can be calculated using the
following formula: “Entropy (Parent) – Weighted Average of Children Entropy”
Which is,
Now using the method used above, we can calculate the Information Gain as:
Hence, we can clearly see that Method 1 gives us more than 4 times information gain compared to Method 2
and hence Method 1 is the best split variable.
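The entropy and information gain calculation can be sketched as below. The movie rows are hypothetical stand-ins (the table from the notes is not reproduced here), so the resulting numbers illustrate the method only and are not the figures quoted in the text. The function names are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute_index):
    """Entropy(parent) minus the weighted average entropy of the children."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attribute_index], []).append(y)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical data: (lead actor, genre) with the box office outcome Hit(Y/N).
rows = [("Actor A", "Action"), ("Actor A", "Comedy"),
        ("Actor A", "Action"), ("Actor A", "Comedy"),
        ("Actor B", "Action"), ("Actor B", "Comedy"),
        ("Actor B", "Action"), ("Actor B", "Comedy")]
labels = ["Y", "Y", "Y", "N", "N", "N", "N", "Y"]

gain_actor = information_gain(rows, labels, 0)  # split on lead actor
gain_genre = information_gain(rows, labels, 1)  # split on genre
print(gain_actor, gain_genre)
```

In this made-up data the split on lead actor yields a positive gain while the split on genre yields none, so a greedy tree builder would pick lead actor first.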
Gain Ratio
Soon after the development of entropy, mathematicians realized that Information Gain is biased toward multi-valued
attributes. To overcome this issue, “Gain Ratio” came into the picture, which is more reliable than
Information Gain. The gain ratio can be defined as:
$$\text{Gain Ratio} = \frac{\text{Information Gain}}{\text{Split Info}}, \qquad \text{Split Info} = -\sum_{i=1}^{n} \frac{D_i}{D} \log_2 \frac{D_i}{D}$$
Here we assume that we are dividing our variable into ‘n’ child nodes, Di represents the number of records going
into the i-th child node, and D is the total number of records. Hence gain ratio takes care of distribution bias while building a decision tree.
And Hence,
Gini Index
One more metric which can be used while building a decision tree is the Gini Index (the Gini Index is
mostly used in CART). The Gini index measures the impurity of a data partition K. The formula for the Gini Index can be
written down as:
$$Gini(K) = 1 - \sum_{i=1}^{m} P_i^2$$
where m is the number of classes, and Pi is the probability that an observation in K belongs to class i. The Gini
Index assumes a binary split for each of the attributes in S, let's say T1 & T2. The Gini index of K given this
partitioning is given by:
$$Gini_{split}(K) = \frac{|T_1|}{|K|}\, Gini(T_1) + \frac{|T_2|}{|K|}\, Gini(T_2)$$
which is nothing but a weighted sum of the impurities in the split nodes. The reduction in impurity is
given by:
$$\Delta Gini = Gini(K) - Gini_{split}(K)$$
Similar to Information Gain & Gain Ratio, the split which gives us the maximum reduction in impurity is
considered for dividing our data.
= 0.49
= 0.24 + 0.19
= 0.43
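The Gini computation for a binary split can be sketched as follows. The label lists are hypothetical and are not the data behind the numbers above; the function names are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Weighted sum of the Gini impurities of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical parent node with 5 "Y" and 5 "N" records, split into two children.
parent = ["Y"] * 5 + ["N"] * 5
left = ["Y", "Y", "Y", "Y", "N"]
right = ["Y", "N", "N", "N", "N"]

reduction = gini(parent) - gini_split(left, right)
print(gini(parent), gini_split(left, right), reduction)
```

Among candidate splits, the one with the largest `reduction` would be chosen.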
LINEAR REGRESSION
Linear regression is a statistical approach for modelling the relationship between a dependent variable
and a given set of independent variables.
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value (y) as accurately as possible as a function of the feature or independent
variable (x).
Let us consider a dataset where we have a value of response y for every feature x:
Now, the task is to find a line which best fits the above scatter plot so that we can predict the
response for any new feature values (i.e. a value of x not present in the dataset).
This line is called the regression line.
The equation of the regression line is represented as:
$$h(x_i) = b_0 + b_1 x_i$$
Here,
h(x_i) represents the predicted response value for the ith observation.
b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the regression line
respectively.
To create our model, we must “learn” or estimate the values of the regression coefficients b_0 and b_1.
And once we’ve estimated these coefficients, we can use the model to predict responses!
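One common way to estimate b_0 and b_1 is ordinary least squares. The sketch below assumes that method (the notes do not state at this point which estimator is intended) and uses made-up data that lies exactly on a line, so the recovered coefficients are easy to check. The function name is an assumption.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate intercept b_0 and slope b_1 by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of cross-deviations of y and x, and sum of squared deviations of x.
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    b_1 = ss_xy / ss_xx
    b_0 = mean_y - b_1 * mean_x
    return b_0, b_1


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # hypothetical data generated from y = 1 + 2x

b_0, b_1 = fit_simple_linear_regression(xs, ys)
prediction = b_0 + b_1 * 10  # predicted response h(x) for a new x = 10
print(b_0, b_1, prediction)
```

Once the coefficients are estimated, a prediction for an unseen x is just the value of the regression line at that point.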
LOGISTIC REGRESSION
Consider an example dataset which maps the number of hours of study to the result of an exam.
The result can take only two values, namely passed (1) or failed (0):
HOURS(X)  0.50  0.75  1.00  1.25  1.50  1.75  2.00  2.25  2.50  2.75  3.00
PASS(Y)   0     0     0     0     0     0     1     0     1     0     1
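A logistic model for this table can be sketched as below. The fitting procedure (plain gradient descent on the average log-loss), the learning rate and the number of epochs are assumptions chosen for illustration; the notes do not prescribe them at this point.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p(pass) = sigmoid(b_0 + b_1 * x) by gradient descent on log-loss."""
    b_0, b_1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b_0 + b_1 * x) - y for x, y in zip(xs, ys)]
        grad_0 = sum(errors) / n
        grad_1 = sum(e * x for e, x in zip(errors, xs)) / n
        b_0 -= learning_rate * grad_0
        b_1 -= learning_rate * grad_1
    return b_0, b_1


hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00]
passed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

b_0, b_1 = fit_logistic(hours, passed)
p_low = sigmoid(b_0 + b_1 * 0.5)   # estimated pass probability after 0.5 hours
p_high = sigmoid(b_0 + b_1 * 3.0)  # estimated pass probability after 3 hours
print(b_0, b_1, p_low, p_high)
```

The fitted slope is positive, so the estimated probability of passing rises with the number of hours studied.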
The Generalized Linear Model (GLZ) is a generalization of the general linear model. In its simplest form, a
linear model specifies the (linear) relationship between a dependent (or response) variable Y, and a set of
predictor variables, the X's, so that
Y = b0 + b1X1 + b2X2 + ... + bkXk
In this equation b0 is the regression coefficient for the intercept and the bi values are the regression
coefficients (for variables 1 through k) computed from the data.
So for example, we could estimate (i.e., predict) a person's weight as a function of the person's height and
gender. You could use linear regression to estimate the respective regression coefficients from a sample of
data, measuring height, weight, and observing the subjects' gender. For many data analysis problems,
estimates of the linear relationships between variables are adequate to describe the observed data, and to
make reasonable predictions for new observations.
However, there are many relationships that cannot adequately be summarized by a simple linear equation, for
two major reasons:
Distribution of dependent variable. First, the dependent variable of interest may have a non-continuous
distribution, and thus, the predicted values should also follow the respective distribution; any other predicted
values are not logically possible. For example, a researcher may be interested in predicting one of three
possible discrete outcomes (e.g., a consumer's choice of one of three alternative products). In that case, the
dependent variable can only take on 3 distinct values, and the distribution of the dependent variable is said to
be multinomial. Or suppose you are trying to predict people's family planning choices, specifically, how
many children families will have, as a function of income and various other socioeconomic indicators. The
dependent variable - number of children - is discrete (i.e., a family may have 1, 2, or 3 children and so on, but
cannot have 2.4 children), and most likely the distribution of that variable is highly skewed (i.e., most
families have 1, 2, or 3 children, fewer will have 4 or 5, very few will have 6
Support Vector Machines (SVMs) are supervised learning models with associated learning algorithms that
analyze data for classification (classification means knowing what belongs to what, e.g. ‘apple’ belongs to
class ‘fruit’ while ‘dog’ belongs to class ‘animals’; see fig. 1).
In support vector machines, the model looks somewhat like a line which separates the blue balls from the red.
SVM is a classifier formally defined by a separating hyperplane. A hyperplane is a subspace of one
dimension less than its ambient space. The dimension of a mathematical space (or object) is informally
defined as the minimum
number of coordinates (x, y, z axes) needed to specify any point (like each blue and red point) within it,
while an ambient space is the space surrounding a mathematical object.
Therefore the hyperplane of a two dimensional space below (fig. 2) is a one dimensional line dividing the red
and blue dots.
Can you try to solve the above problem linearly like we did with Fig. 2? No!
The red and blue balls cannot be separated by a straight line as they are randomly distributed, and this, in
reality, is how most real-life problem data are: randomly distributed.
In machine learning, a “kernel” usually refers to the kernel trick, a method of using a linear
classifier to solve a non-linear problem. It entails transforming linearly inseparable data (as in Fig. 3) into
linearly separable data (as in Fig. 2). The kernel function is applied to each data instance to map the
original non-linear observations into a higher-dimensional space in which they become separable.
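The mapping described above can be illustrated with a small sketch. It assumes a toy 1-D data set and a hypothetical feature map (x, x^2); neither comes from the notes. The red points sit between two groups of blue points, so no single threshold on x separates them, but a linear rule works after the mapping. The `kernel` function shows that the inner product in the mapped space can be computed directly from the original inputs, which is the idea behind the kernel trick.

```python
# Toy 1-D data (illustrative values): red points lie between two blue groups.
points = [-3.0, -2.5, -0.5, 0.0, 0.4, 2.2, 3.1]
labels = ["blue", "blue", "red", "red", "red", "blue", "blue"]


def feature_map(x):
    """Hypothetical mapping from the 1-D input to the 2-D space (x, x^2)."""
    return (x, x * x)


def linear_classifier(z, threshold=2.0):
    """Linear rule in the mapped space: threshold the second coordinate."""
    return "blue" if z[1] > threshold else "red"


def kernel(x, y):
    """Inner product of feature_map(x) and feature_map(y), computed
    without building the mapped vectors explicitly."""
    return x * y + (x * y) ** 2


predictions = [linear_classifier(feature_map(x)) for x in points]
print(predictions == labels)

# The kernel value matches the explicit dot product in the mapped space.
a, b = feature_map(2.0), feature_map(3.0)
explicit_dot = a[0] * b[0] + a[1] * b[1]
print(kernel(2.0, 3.0), explicit_dot)
```

In practice the mapped space can be very high-dimensional, which is why computing the kernel value directly is preferable to building the mapped vectors.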
Using the dog breed prediction example again, kernels offer a better alternative. Instead of defining a slew of
features, you define a single kernel function to compute similarity between breeds of dog. You provide this
kernel, together with the data and labels to the learning algorithm, and out comes a classifier.
So this is with two features, and we see we have a 2D graph. If we had three features, we could have a 3D
graph. The 3D graph would be a little more challenging for us to visually group and divide, but still do-able.
The problem occurs when we have four features, or four-thousand features. Now you can start to understand
the power of machine learning, seeing and analyzing a number of dimensions imperceptible to us.
Common examples include image classification (is it a cat, dog, human, etc.) or
handwritten digit recognition (classifying an image of a handwritten number into a digit
from 0 to 9).
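Performance Metrics
• Accuracy can be calculated by summing the values lying on the “main diagonal” of the confusion matrix and
dividing by the total number of samples, i.e.
Accuracy = (True Positives + True Negatives) / Total Number of Samples
• Precision: the number of correct positive results divided by the number of positive results predicted by the
classifier, i.e. Precision = True Positives / (True Positives + False Positives)
• Recall: the number of correct positive results divided by the number of all relevant samples (all samples
that are actually positive), i.e. Recall = True Positives / (True Positives + False Negatives)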
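As a quick check of these definitions, the sketch below computes the three metrics from the four cells of a binary confusion matrix. The counts are made-up illustrative values, and the function name `classification_metrics` is an assumption, not something defined in these notes.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Accuracy: correct predictions (main diagonal) over all samples.
    accuracy = (tp + tn) / total
    # Precision: correct positives over everything predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: correct positives over everything that is actually positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative counts for 100 samples.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(accuracy, precision, recall)
```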
Structured prediction is an umbrella term for supervised machine learning techniques that involve predicting
structured objects, rather than scalar discrete or real values.
Similar to commonly used supervised learning techniques, structured prediction models are typically trained
by means of observed data in which the true prediction value is used to adjust model parameters. Due to the
complexity of the model and the interrelations of predicted variables, the process of prediction using a trained
model and of training itself is often computationally infeasible, so approximate inference and learning
methods are used.
For example, the problem of translating a natural language sentence into a syntactic representation such as a
parse tree can be seen as a structured prediction problem in which the structured output domain is the set of
all possible parse trees. Structured prediction is also used in a wide variety of application
domains including bioinformatics, natural language processing, speech recognition, and computer vision.
Sequence tagging is a class of problems prevalent in natural language processing, where input data are often
sequences (e.g. sentences of text). The sequence tagging problem appears in several guises, e.g. part-of-
speech tagging and named entity recognition. In POS tagging, for example, each word in a sequence must
receive a "tag" (class label) that expresses its "type" of word:
DT - Determiner
VB - Verb
JJ - Adjective
NN - Noun
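To make the input and output of sequence tagging concrete, here is a minimal sketch using the four tags listed above. It is only a dictionary-lookup baseline with a hypothetical lexicon (`TAG_LEXICON`) and an assumed default tag; a real structured predictor would also use the neighbouring words and tags.

```python
# Hypothetical word-to-tag lexicon (illustrative only).
TAG_LEXICON = {"the": "DT", "quick": "JJ", "fox": "NN", "jumps": "VB"}


def tag_sequence(words, default_tag="NN"):
    """Assign one tag per word; unknown words fall back to default_tag."""
    return [(word, TAG_LEXICON.get(word.lower(), default_tag)) for word in words]


tagged = tag_sequence(["The", "quick", "fox", "jumps"])
print(tagged)
```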
Ranking :-
Learning to Rank (LTR) is a class of techniques that apply supervised machine learning (ML) to solve
ranking problems. The main difference between LTR and traditional supervised ML is this:
The most common application of LTR is search engine ranking, but it's useful anywhere you need to produce
a ranked list of items.
The training data for an LTR model consists of a list of items and a "ground truth" score for each of those
items. For search engine ranking, this translates to a list of results for a query and a relevance rating for each
of those results with respect to the query. The most common way used by major search engines to generate
these relevance ratings is to ask human raters to rate results for a set of queries.
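The shape of such training data, and one simple way to judge a ranking against it, can be sketched as follows. The document names, relevance ratings and model scores are all invented for illustration, and pairwise accuracy is just one possible evaluation choice, not a method prescribed by these notes.

```python
from itertools import combinations

# Results for a single query: (document, human relevance rating).
rated_results = [("doc_a", 3), ("doc_b", 1), ("doc_c", 2)]

# Scores produced by some hypothetical ranking model.
model_scores = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.2}


def ranked(rated, scores):
    """Order the documents by descending model score."""
    return sorted((doc for doc, _ in rated), key=lambda d: scores[d], reverse=True)


def pairwise_accuracy(rated, scores):
    """Fraction of document pairs that the model orders the same way as the raters."""
    correct = total = 0
    for (d1, r1), (d2, r2) in combinations(rated, 2):
        if r1 == r2:
            continue  # ties in the ground truth carry no ordering information
        total += 1
        if (r1 > r2) == (scores[d1] > scores[d2]):
            correct += 1
    return correct / total


ranking = ranked(rated_results, model_scores)
pair_acc = pairwise_accuracy(rated_results, model_scores)
print(ranking, pair_acc)
```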
Learning to rank algorithms have been applied in areas other than information retrieval:
This algorithm applies the same trick as k-means, with one difference: in the calculation of
distance, a kernel method is used instead of the Euclidean distance.
Let X = {a1, a2, a3, ..., an} be the set of data points and 'c' be the number of clusters.
2) Compute the distance of each data point and the cluster center in the transformed space using:
where,
Matrix Factorization:
Matrix factorization means, as the name suggests, factorizing a matrix, i.e. finding two (or more) matrices such
that when you multiply them you get back the original matrix.
Matrix factorization can be used to discover latent features underlying the interactions between two
different kinds of entities. (Of course, you can consider more than two kinds of entities and you will be
dealing with tensor factorization, which would be more complicated.) And one obvious application is to
predict ratings in collaborative filtering.
In a recommendation system such as Netflix or MovieLens, there is a group of users and a set of items
(movies for the above two systems). Given that each user has rated some items in the system, we would
like to predict how the users would rate the items that they have not yet rated, so that we can make
recommendations to the users. In this case, all the information we have about the existing ratings can be
represented in a matrix. Assume now we have 5 users and 10 items, and ratings are integers ranging from 1
to 5; the matrix may look something like this (a hyphen means that the user has not yet rated the movie):
      D1   D2   D3
U1    5    3    -
U2    4    -    -
U3    1    1    -
U4    1    -    -
U5    -    1    5
Suppose the rating matrix R (users × items) is approximated by the product of a user-feature matrix P and
the transpose of an item-feature matrix Q, using K latent features:
$$R \approx P \times Q^T = \hat{R}$$
In this way, each row of P would represent the strength of the associations between a user and the features.
Similarly, each row of Q would represent the strength of the associations between an item and the features.
To get the prediction of a rating of an item d_j by user u_i, we can calculate the dot product of the two vectors
corresponding to u_i and d_j:
$$\hat{r}_{ij} = p_i^T q_j = \sum_{k=1}^{K} p_{ik} q_{kj}$$
where p_{ik} is the weight of feature k for user u_i and q_{kj} is the weight of feature k for item d_j.
Now, we have to find a way to obtain P and Q. One way to approach this problem is to first initialize the
two matrices with some values, calculate how "different" their product is from R, and then try to minimize this
difference iteratively. Such a method is called gradient descent, aiming at finding a local minimum of the
difference.
The difference here, usually called the error between the estimated rating and the real rating, can be
calculated by the following equation for each user-item pair:
$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \left(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^2$$
Here we consider the squared error because the estimated rating can be either higher or lower than the real
rating.
To minimize the error, we have to know in which direction we have to modify the values of p_{ik} and q_{kj}. In other
words, we need to know the gradient at the current values, and therefore we differentiate the above equation
with respect to these two variables separately:
$$\frac{\partial}{\partial p_{ik}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) q_{kj} = -2 e_{ij} q_{kj}$$
$$\frac{\partial}{\partial q_{kj}} e_{ij}^2 = -2 (r_{ij} - \hat{r}_{ij}) p_{ik} = -2 e_{ij} p_{ik}$$
Having obtained the gradient, we can now formulate the update rules for both p_{ik} and q_{kj}:
$$p'_{ik} = p_{ik} - \alpha \frac{\partial}{\partial p_{ik}} e_{ij}^2 = p_{ik} + 2 \alpha e_{ij} q_{kj}$$
$$q'_{kj} = q_{kj} - \alpha \frac{\partial}{\partial q_{kj}} e_{ij}^2 = q_{kj} + 2 \alpha e_{ij} p_{ik}$$
Here, α is a constant whose value determines the rate of approaching the minimum. Usually we will choose a
small value for α, say 0.0002. This is because if we make too large a step towards the minimum we run
the risk of missing the minimum and ending up oscillating around it.
A question might have come to your mind by now: if we find two matrices P and Q
such that P × Q^T approximates R, won't our predictions of all the unseen ratings be zeros? In fact,
we are not really trying to come up with P and Q such that we can reproduce R
exactly. Instead, we will only try to minimise the errors of the observed user-item pairs.
In other words, if we let T be a set of tuples, each of which is in the form (u_i, d_j, r_{ij}), such that
T contains all the observed user-item pairs together with the associated ratings, we are
only trying to minimise every e_{ij} for (u_i, d_j, r_{ij}) ∈ T. (In other words, T is our set of training data.)
As for the rest of the unknowns, we will be able to determine their values once the associations between
the users, items and features have been learnt.
Using the above update rules, we can then iteratively perform the operation until the error converges to its
minimum. We can check the overall error as calculated using the following equation and determine when we
should stop the process:
$$E = \sum_{(u_i, d_j, r_{ij}) \in T} e_{ij}^2 = \sum_{(u_i, d_j, r_{ij}) \in T} \left(r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}\right)^2$$
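The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tuned implementation: it assumes a user-feature matrix P and an item-feature matrix Q with K latent features, random initial values, a fixed number of passes instead of a convergence test, and no regularization. The function names are hypothetical. The ratings reuse the three item columns visible in the example table, with `None` marking an unrated item.

```python
import random


def matrix_factorization(R, K=2, steps=5000, alpha=0.0002, seed=0):
    """Factorize R into P (users x K) and Q (items x K) by gradient descent,
    using only the observed (non-None) ratings."""
    rng = random.Random(seed)
    n_users, n_items = len(R), len(R[0])
    P = [[rng.random() for _ in range(K)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(K)] for _ in range(n_items)]
    observed = [(i, j) for i in range(n_users) for j in range(n_items)
                if R[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            # Error between the real rating and the current estimate.
            e = R[i][j] - sum(P[i][k] * Q[j][k] for k in range(K))
            for k in range(K):
                p_ik, q_jk = P[i][k], Q[j][k]
                # Step against the gradient of the squared error.
                P[i][k] = p_ik + 2 * alpha * e * q_jk
                Q[j][k] = q_jk + 2 * alpha * e * p_ik
    return P, Q


def total_squared_error(R, P, Q):
    """Sum of squared errors over the observed ratings only."""
    K = len(P[0])
    return sum((R[i][j] - sum(P[i][k] * Q[j][k] for k in range(K))) ** 2
               for i in range(len(R)) for j in range(len(R[0]))
               if R[i][j] is not None)


R = [
    [5, 3, None],
    [4, None, None],
    [1, 1, None],
    [1, None, None],
    [None, 1, 5],
]

P0, Q0 = matrix_factorization(R, steps=0)   # initial random matrices
P, Q = matrix_factorization(R)              # same start, after training
initial_error = total_squared_error(R, P0, Q0)
final_error = total_squared_error(R, P, Q)
print(initial_error, final_error)

# Predicted rating of item D3 by user U1 (unrated in R).
print(sum(P[0][k] * Q[2][k] for k in range(2)))
```

Comparing the error before and after training is one simple way to confirm that the updates move in the right direction; a practical stopping rule would monitor this error between passes.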
UNIT-III
Supervised Learning – II (Neural Networks)
Neural Network Representation – Problems – Perceptrons, Activation Functions, Artificial Neural
Networks (ANN), Back Propagation Algorithm.
Convolutional Neural Networks - Convolution and Pooling layers, Recurrent Neural Networks (RNN).
Classification Metrics: Confusion matrix, Precision, Recall, Accuracy, F-Score, ROC curves
Online Learning:-
Online machine learning is a method of machine learning in which data becomes available in a sequential
order and is used to update our best predictor for future data at each step, as opposed to batch learning
techniques which generate the best predictor by learning on the entire training data set at once. Online
learning is a common technique used in areas of machine learning where it is computationally infeasible to
train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it
is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is
generated as a function of time, e.g., stock price prediction. Online learning algorithms may be prone to
catastrophic interference, a problem that can be addressed by incremental learning approaches.
Online or incremental machine learning is a subfield of machine learning wherein we focus on incrementally
updating a model as soon as we see a new training example. This differs from the more traditional batch
learning approach where we train on all of our data in one go.
Online learning has several upsides, the two most important of which relate to space complexity and speed.
Training a model online means you don’t need to keep your entire training data in memory, which is great
news if you are dealing with massive datasets. The incremental nature also means that your model can
quickly react to changes in the distribution of the data coming in, provided the algorithm you use is tuned
properly.
The second point makes online learning especially attractive: implemented correctly, it can do near
real-time learning as well as inference. It’s a good choice in situations where you want to react to unforeseen
changes as fast as possible, making it a viable option for implementing dynamic pricing systems,
recommendation engines, decision engines and much more besides. It’s also an excellent choice for
immediate-reward reinforcement learning (e.g. contextual bandits).
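The idea of updating a model one example at a time can be sketched as follows. This is a minimal illustration that assumes a one-feature linear model trained with a stochastic gradient step on the squared error; the data stream is synthetic (y = 2x + 1) and the function names are hypothetical. Each example is used once and then discarded, so the whole data set never needs to be held in memory.

```python
def online_linear_regression(stream, learning_rate=0.1):
    """Update weight w and bias b after every (x, y) example in the stream."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y          # prediction error on this example
        w -= learning_rate * error * x   # gradient step for the weight
        b -= learning_rate * error       # gradient step for the bias
    return w, b


def example_stream(n=5000):
    """Synthetic stream: y = 2x + 1 with x cycling through 0.0, 0.1, ..., 0.9."""
    for t in range(n):
        x = (t % 10) / 10.0
        yield x, 2.0 * x + 1.0


w, b = online_linear_regression(example_stream())
print(round(w, 3), round(b, 3))
```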
When you don’t have enough labeled data to produce an accurate model and you don’t have the ability or
resources to get more, you can use semi-supervised techniques to increase the size of your training data. For
example, imagine you are developing a model for a large bank intended to detect fraud. Some fraud you know
about, but other instances of fraud slipped by without your knowledge. You can label the dataset with the
fraud instances you’re aware of, but the rest of your data will remain unlabelled.
You can use a semi-supervised learning algorithm to label the data, and retrain the model with the newly
labeled dataset.
Then, you apply the re-trained model to new data, more accurately identifying fraud using supervised
machine learning techniques. However, there is no way to verify that the algorithm produced labels that are
100% accurate, resulting in less trustworthy outcomes than traditional supervised techniques.
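One common way to realise this label-then-retrain loop is self-training, sketched below. Everything here is an assumption made for illustration: a single numeric feature (a transaction amount), a nearest-centroid classifier, a confidence rule based on the gap between the two nearest centroids, and invented data. It is not the specific algorithm used by any particular bank or library.

```python
def fit_centroids(labeled):
    """Fit one centroid (mean value) per class from (value, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labeled, unlabeled, margin=10.0, rounds=5):
    """Repeatedly pseudo-label confident points and refit the model."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        confident, rest = [], []
        for x in unlabeled:
            dists = sorted(abs(x - c) for c in centroids.values())
            # Confident only if the nearest centroid is clearly closer
            # than the runner-up (assumes at least two classes).
            if dists[1] - dists[0] >= margin:
                confident.append((x, predict(centroids, x)))
            else:
                rest.append(x)
        if not confident:
            break
        labeled += confident
        unlabeled = rest
    return fit_centroids(labeled), labeled, unlabeled


known = [(10.0, "legit"), (12.0, "legit"), (95.0, "fraud")]
unknown = [9.0, 11.5, 14.0, 90.0, 99.0, 52.0]

model, labeled, remaining = self_train(known, unknown)
print(len(labeled), remaining, predict(model, 97.0))
```

The ambiguous point (52.0) stays unlabeled, which reflects the caveat above: pseudo-labels are only as trustworthy as the model that produced them.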
Considering the variety of IoT systems, it can be difficult for business players to decide what they really
need. The number of IoT solutions is increasing at quite an impressive pace, and these systems are designed
to perform various functions.
Based on our experience in IoT development, we want to bring more clarity to this variety and
introduce our approach to the classification of IoT systems.
With the help of sensor data, a user of a smart, connected thing can monitor its real-time state and
environment. In a longer-term perspective, the results of monitoring can be gathered and applied for
advanced insights.
Accumulated sensor data helps to get detailed and meaningful statistics, assess the performance of equipment
from different perspectives, uncover new patterns and tendencies and more. Monitoring is important
assistance in proactive maintenance as IoT system users get an opportunity to identify the problems before
the damage is done and take necessary measures.
Connected devices can give an expanded picture of patients’ health, environmental conditions, equipment
state in factories and power plants, and help users monitor their pets, cars, homes and more. Remote monitoring
of facilities, processes and events brings better operational insights: sensors can gather the data that helps to
see and assess the real-time state of smart connected things.
Let’s take a smart railway as an example. Trains can be equipped with sensors that capture data about the
real-time status of the parts of a train and enable the IoT system users to monitor the state of brakes, wheels
and engines. The results can be viewed by a train driver, traffic superintendent or any other responsible
person as well as collected for further analysis.
In subway trains, big data can be used to measure passenger flow and identify the hours and days when
additional trains are needed. Another case (when the solutions are designed not only for monitoring) is
adding a new train to the route if many passengers are waiting (for example, trains come every 20 minutes,
but when the platform is getting overcrowded, additional trains are provided).
An IoT system can not only take and display sensor data (for example, temperature or humidity), but also
make conclusions about what certain values of data could mean.
Processing the data coming from sensors, an IoT system can not only detect anomalies, but also predict
operational malfunctioning and point at the root causes of problems. Thus, comparing current data coming
from sensors and the data stored in the cloud as acceptable values, an IoT solution monitoring trains and
railways can show that a certain part is about to be out of order. The historical data about the conditions of
using trains and the breakdowns that happen can help identify (and then – predict) the conditions leading to
failures.
In many cases, the potential of the Internet of things can bring far more value than just improved monitoring.
Thus, a user of an IoT system can give commands to the connected things and enable them to perform certain
operations.
Let's take a case when a user receives the results of monitoring and some response is needed. Theoretically, a
smart system can tackle certain issues even without human participation, but not always. In some cases,
machines are incapable of performing the required operations; in others, the issues cannot be entrusted to
machines (for example, replacing a broken part).
Manual control is a good way to stay on the safe side in situations an IoT system has not faced before.
With unfamiliar incoming data, it can be better to let humans make final decisions. An IoT system can, for
example, give recommendations, and people decide whether to follow them or not. In the longer term, the
data about user actions in response to certain sensor data can be used by a machine learning module to build
models for control applications and contribute to further automation of a system.
In IoT systems with automated control, control applications send commands to actuators. The choice of the
commands to be sent depends on the data coming from sensors and/or previously defined schedules.
Rule-based control
An IoT system with rule-based control is designed to act in accordance with algorithms that specify what
should be done in response to certain data coming from sensors. Rules are stated before the system is put into
action.
In freight trains, sensors can measure temperature, vibrations and other parameters critical for cargo. This big
data goes to the cloud, and, when the smart system identifies that some parameters differ from acceptable
values, control applications send commands to adjust these parameters (for example, increase or decrease
refrigerating).
In such freight trains, it’s also possible to set acceptable values in different carriages before the trip begins
(since various goods can be transported in different cars and each type of transported goods requires
corresponding conditions).
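The freight-train example can be expressed as a small rule table. The sketch below is purely illustrative: the carriage names, temperature ranges and command strings are hypothetical, and a real system would send the commands to actuators rather than return strings.

```python
# Acceptable temperature range (degrees Celsius) per carriage, set before the trip.
ACCEPTABLE_RANGES = {
    "carriage_1": (2.0, 8.0),      # e.g. chilled goods
    "carriage_2": (-20.0, -15.0),  # e.g. frozen goods
}


def control_command(carriage, temperature):
    """Rule-based control: map a sensor reading to a command."""
    low, high = ACCEPTABLE_RANGES[carriage]
    if temperature > high:
        return "increase_refrigeration"
    if temperature < low:
        return "decrease_refrigeration"
    return "no_action"


readings = [("carriage_1", 9.5), ("carriage_1", 5.0), ("carriage_2", -22.0)]
commands = [control_command(carriage, temp) for carriage, temp in readings]
print(commands)
```

Because the rules are fixed before the system is put into action, the behaviour is fully predictable, but it cannot adapt to situations the rule table does not cover.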
The next stage in the evolution of IoT systems is machine-learning-based control, where IoT
potential is used to its fullest extent. Here, sensor data is continuously collected and regularly
fed into standard machine learning algorithms. New models are generated, and their applicability is then
tested by analysts and / or data scientists. When the models are approved, they can be used by an IoT system.
In a smart railway, such learning can be performed by analyzing human commands. The responses of
humans to certain sensor data are accumulated in a big data warehouse, and then the models of how to act are
built accordingly (considering the actions of humans in certain situations).
Cameras can take photos of potential problems (where a problem is suspected) and send them for
further analysis (either manual or computer-assisted). Once various images are collected in the
cloud (and the problems are identified), smart systems “learn” to determine the types of problems without
human participation and send corresponding notifications.
Machine learning can contribute to optimizing subway train schedules. A smart system
accumulates data about the passenger flow on different days and at different times of the day. Then, it
identifies the days and the time slots when additional trains should be put on the line, and thus offers schedule
optimizations.
It is worth noting that, even if an IoT solution can, in most cases, successfully operate without human
participation, there should be an option of manual control.
Endnote
Solutions for monitoring: sensor data helps monitor the state and environment of smart connected things. In
this case, IoT solutions can perform storing data and showing it to users. Also, the data gathered with
sensors can be analyzed and used for detecting specific situations.
Monitoring + manual control: with user apps, users are empowered to give the commands to connected
things’ actuators and control the processes in an IoT system.
Monitoring + automated control: control apps automatically send the commands to actuators, and human
participation in controlling an IoT system is significantly reduced. Automated control can be performed on
the basis of the previously defined rules (rule-based control). With machine learning, IoT systems can
adapt to user behavior and changing environment and “learn” how to perform operations in more productive
ways. However, it’s reasonable to enable the shift from automated to manual control over IoT solution’s
operation as no IoT system is immune to breakdowns and unpredicted situations.
An organization needs to clearly understand what it expects from the Internet of Things and which types of IoT
solution will help to cover current and future business needs. The exploration of the IoT path should begin
with a great deal of business and IT strategy planning.
1. Smart home
Smart Home clearly stands out, ranking as highest Internet of Things application on all measured channels.
More than 60,000 people currently search for the term “Smart Home” each month. This is not a surprise. The
IoT Analytics company database for Smart Home includes 256 companies and startups. More companies are
active in smart home than any other application in the field of IoT. The total amount of funding for Smart
Home startups currently exceeds $2.5bn. This list includes prominent startup names such as Nest or AlertMe
as well as a number of multinational corporations like Philips, Haier, or Belkin.
2. Wearables
Wearables remain a hot topic too. As consumers await the release of Apple’s new smart watch in April
2015, there are plenty of other wearable innovations to be excited about, like the Sony Smart B
Trainer, the Myo gesture control, or the LookSee bracelet. Of all the IoT startups, wearables maker Jawbone is
probably the one with the biggest funding to date. It stands at more than half a billion dollars!
3. Smart City
Smart city spans a wide variety of use cases, from traffic management to water distribution, to waste
management, urban security and environmental monitoring. Its popularity is fueled by the fact that many
Smart City solutions promise to alleviate real pains of people living in cities these days. IoT solutions in the
area of Smart City solve traffic congestion problems, reduce noise and pollution and help make cities safer.
4. Smart grids
Smart grids are a special case. A future smart grid promises to use information about the behaviors of
electricity suppliers and consumers in an automated fashion to improve the efficiency, reliability, and
economics of electricity. 41,000 monthly Google searches highlight the concept’s popularity. However, the
lack of tweets (just 100 per month) shows that people don’t have much to say about it.
5. Industrial internet
The industrial internet is also one of the special Internet of Things applications. While many market
researchers such as Gartner or Cisco see the industrial internet as the IoT concept with the highest overall
potential, its popularity currently doesn’t reach the masses like smart home or wearables do. The industrial
internet, however, has a lot going for it. It gets the biggest push from people on Twitter
(~1,700 tweets per month) compared to other non-consumer-oriented IoT concepts.
6. Connected car
The connected car is coming up slowly. Owing to the fact that the development cycles in the automotive
industry typically take 2-4 years, we haven’t seen much buzz around the connected car yet. But it seems we
are getting there. Most large auto makers as well as some brave startups are working on connected car
solutions. And if the BMWs and Fords of this world don’t present the next generation internet connected car
soon, other well-known giants will: Google, Microsoft, and Apple have all announced connected car
platforms.
7. Connected health
Connected health remains the sleeping giant of the Internet of Things applications. The concept of a
connected health care system and smart medical devices bears enormous potential (see our analysis of market
segments), not just for companies but also for the well-being of people in general. Yet, Connected Health has not
reached the masses. Prominent use cases and large-scale startup successes are still to be seen. Might 2015
bring the breakthrough?
8. Smart retail
Proximity-based advertising as a subset of smart retail is starting to take off. But the popularity ranking
shows that it is still a niche segment. One LinkedIn post per month is nothing compared to 430 for smart
home.
9. Smart supply chain
Supply chains have been getting smarter for some years already. Solutions for tracking goods while they are
on the road, or getting suppliers to exchange inventory information, have been on the market for years. So
while it is perfectly logical that the topic will get a new push with the Internet of Things, it seems that so far its
popularity remains limited.
10. Smart farming
Smart farming is an often overlooked business case for the Internet of Things because it does not really fit
into the well-known categories such as health, mobility, or industrial. However, due to the remoteness of
farming operations and the large number of livestock that could be monitored, the Internet of Things could
revolutionize the way farmers work. But this idea has not yet reached large-scale attention. Nevertheless, it is one
of the Internet of Things applications that should not be underestimated. Smart farming will become an
important application field in the predominantly agricultural-product exporting countries.
Airline – An equipment tracking app provides an airline’s engineers with a live view of the locations of each
piece of maintenance equipment. By increasing the efficiency of engineers, this IoT application is not only
generating significant cost savings and process improvements, but also impacting the customer experience in
the end through more reliable, on-time flights.
Pharmaceutical – A medication temperature monitoring app uses sensors to detect if the medication’s
temperature has gone outside of the acceptable range and ensures medical supplies still meet quality
standards upon delivery. The handling temperature of medications, vaccines for example, is critical to
their effectiveness. IoT-based smart applications can be used not only to monitor that medications are kept within
the proper handling temperature range, but also to remind patients when it is time to take their medication.
Manufacturing – A lighting manufacturer for the horticultural industry built a Smart App that leverages
IoT sensors and predictive analytics to perform predictive maintenance and optimize lighting, power
consumption and plant photosynthesis. The IoT application transformed their business from a lighting
systems manufacturer to a greenhouse optimization as-a-service business.
Insurance – An insurance company offers policyholders discounts for wearing Internet-connected Fitbit
wristbands. The fitness tracking service is part of the insurer’s Vitality program aimed at integrating wellness
benefits with life insurance. Through this IoT application, this insurer is creating smart life insurance
products and rewarding customers for their positive actions.
Business Services – A facility services company uses their multi-device IoT software to enable support
personnel to receive alerts about service issues and take immediate action. By aggregating data from
thousands of sensors in devices like coffee machines, soap dispensers, paper towel dispensers and mouse
traps rather than doing manual checks, the application has significantly cut costs and improved service
levels.
Media & Entertainment – An entertainment design and production firm uses sensors in turnstiles of venues
to understand the foot traffic of people at events. Their IoT application visualizes the attendee traffic patterns
in real time to help sponsors understand the best places to advertise, and to ensure the attendee count stays
within the fire code compliance of the venue.
A common problem in machine learning is sparse data, which affects the performance of machine
learning algorithms and their ability to calculate accurate predictions. Data is considered sparse when
certain expected values in a dataset are missing, which is a common phenomenon in large-scale data
analysis in general.
Sparse modeling is a rapidly developing area at the intersection of statistical learning and signal processing,
motivated by the age-old statistical problem of selecting a small number of predictive variables in high-
dimensional datasets.
Sparse models contain fewer features and hence are easier to train on limited data. Fewer features also mean
less chance of overfitting, and a model that is easier to explain to users, as only the most meaningful
features remain. In face recognition, sparse models provide a unique way to recognize a face from a database
of profiles taken under different orientations. In MRI, sparse models promise faster image acquisition.
Sparse models – models where only a small fraction of parameters are non-zero – arise frequently in machine
learning. Sparsity is beneficial in several ways: sparse models are more easily interpretable by humans, and
sparsity can yield statistical benefits – such as reducing the number of examples that have to be observed to
learn the model. In a sense, we can think of sparsity as an antidote to the oft-maligned curse of
dimensionality.
More formally, suppose we’re given data X on which we want to run an optimization algorithm F to
obtain a model W = F(X). We might think of F as stochastic gradient descent for learning a logistic
regressor, or alternating least-squares for non-negative matrix factorization (NMF). If X is high-dimensional,
this procedure can be expensive both in time and memory. But if our target model W is sparse, we can
instead run our optimization routine F on a compressed, lower-dimensional version of the data X’ = PX,
where P is a random projection matrix. This yields a compressed solution W’ = F(X’). We can then leverage
classic methods from compressive sensing to recover an approximate solution W_approx in the original,
high-dimensional space such that W_approx ≈ W.
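The compression step alone can be sketched as follows. This is a hedged illustration of a random projection only, not of the full recover-the-model pipeline: it assumes a Gaussian projection matrix P with entries of variance 1/m, a single sparse vector x standing in for one column of the data, and arbitrary sizes n = 1000 and m = 200. The recovery of W_approx by compressive sensing is not shown.

```python
import math
import random

rng = random.Random(0)
n, m = 1000, 200  # original and compressed dimensions (illustrative choices)

# A sparse vector: only 5 of its 1000 entries are non-zero.
x = [0.0] * n
for idx in rng.sample(range(n), 5):
    x[idx] = rng.gauss(0.0, 1.0)
nonzero = [(i, v) for i, v in enumerate(x) if v != 0.0]

# Random projection matrix P (m x n), scaled so that lengths are
# preserved on average.
P = [[rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(n)] for _ in range(m)]

# x' = P x, computed using only the non-zero entries of x.
x_compressed = [sum(row[i] * v for i, v in nonzero) for row in P]


def norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in vec))


ratio = norm(x_compressed) / norm(x)
print(len(x_compressed), round(ratio, 3))
```

The length of the compressed vector stays close to that of the original even though it has five times fewer coordinates, which is the property that makes working in the compressed space feasible.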
Deep learning is software that mimics the network of neurons in a brain. It is a subset of
machine learning and is called deep learning because it makes use of deep neural networks.
Each hidden layer is composed of neurons. The neurons are connected to each other. A neuron
processes the input signal it receives and then propagates it to the layer above it. The strength of the signal
given to the neuron in the next layer depends on the weight, bias and activation function.
The network consumes large amounts of input data and processes them through multiple layers; the network
can learn increasingly complex features of the data at each layer.
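How weight, bias and activation function determine the signal passed to the next layer can be shown with a tiny forward pass. The sketch assumes a sigmoid activation and hand-picked weights for a network with two inputs, two hidden neurons and one output; none of these values come from the notes.

```python
import math


def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron outputs activation(weighted sum of inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


inputs = [1.0, 0.0]

# Hidden layer: two neurons, each with two weights and one bias.
hidden = layer_forward(inputs, weights=[[0.5, -0.5], [0.25, 0.75]], biases=[0.0, -0.25])

# Output layer: one neuron that reads the two hidden activations.
output = layer_forward(hidden, weights=[[1.0, -1.0]], biases=[0.0])

print(hidden, output)
```

Training a network means adjusting these weights and biases; the forward pass itself is just this repeated weighted sum followed by an activation.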
Deep learning is a powerful tool for turning predictions into actionable results. Deep learning excels in pattern
discovery (unsupervised learning) and knowledge-based prediction. Big data is the fuel for deep learning.
When both are combined, an organization can reap unprecedented results in terms of productivity, sales,
management, and innovation.
Deep learning can outperform traditional methods. For instance, deep learning algorithms are 41% more
accurate than traditional machine learning algorithms in image classification, 27% more accurate in facial
recognition and 25% more accurate in voice recognition.
It has been shown that simple deep learning techniques like CNN can, in some cases, imitate the knowledge
of experts in medicine and other fields. The current wave of machine learning, however, requires training
data sets that are not only labeled but also sufficiently broad and universal.
Deep-learning methods require thousands of observations for models to become relatively good at
classification tasks and, in some cases, millions for them to perform at the level of humans. Unsurprisingly,
deep learning is popular in giant tech companies; they use big data to accumulate petabytes of data, which
allows them to create impressive and highly accurate deep learning models.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time
series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time
data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of
the Dow Jones Industrial Average.
Time series are very frequently plotted via line charts. Time series are used in statistics, signal processing,
pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, and
largely in any domain of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful
statistics and other characteristics of the data
Methods for time series analysis may be divided into two classes: frequency-domain methods and
time-domain methods. The former include spectral analysis and wavelet analysis; the latter include
auto-correlation and cross-correlation analysis. In the time domain, correlation and analysis can be made in
a filter-like manner using scaled correlation, thereby mitigating the need to operate in the frequency domain.
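As an illustration of a time-domain method, the following is a minimal sketch of the sample autocorrelation at a given lag. The function name and the repeating series are hypothetical and chosen only for this example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) * (x - mean) for x in series)
    # Numerator: co-deviation between the series and its lagged copy.
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

# Hypothetical series that repeats every 4 steps.
tide = [0.0, 1.0, 0.0, -1.0] * 6

r0 = autocorrelation(tide, 0)   # a series is perfectly correlated with itself
r2 = autocorrelation(tide, 2)   # half a period apart: negative correlation
r4 = autocorrelation(tide, 4)   # one full period apart: strongly positive
```

A peak in the autocorrelation at lag 4 is the time-domain signature of the period that a spectral method would show as a peak in frequency.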
Additionally, time series analysis techniques may be divided into parametric and non-parametric
methods. The parametric approaches assume that the underlying stationary stochastic process has a certain
structure which can be described using a small number of parameters (for example, using an autoregressive
or moving average model). In these approaches, the task is to estimate the parameters of the model that
describes the stochastic process. By
network. HMM models are widely used in speech recognition, for translating a time series of spoken words
into text.
Feature Learning:-
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic
of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial
step for effective algorithms in pattern recognition, classification and regression.
In machine learning, feature learning or representation learning[1] is a set of techniques that allows a system
to automatically discover the representations needed for feature detection or classification from raw data.
This replaces manual feature engineering and allows a machine to both learn the features and use them to
perform a specific task.
Feature learning is motivated by the fact that machine learning tasks such as classification often require input
that is mathematically and computationally convenient to process. However, real-world data such as images,
video, and sensor data has not yielded to attempts to algorithmically define specific features. An alternative
is to discover such features or representations through examination, without relying on explicit algorithms.
Supervised feature learning is learning features from labeled data. The data label allows the system to
compute an error term, the degree to which the system fails to produce the label, which can then be used as
feedback to correct the learning process (reduce/minimize the error). Approaches include:
Dictionary learning develops a set (dictionary) of representative elements from the input data such that each
data point can be represented as a weighted sum of the representative elements.
In supervised feature learning, features are learned using labeled input data. Examples include supervised
neural networks, multilayer perceptron and (supervised) dictionary learning.
In unsupervised feature learning, features are learned with unlabeled input data. Examples include dictionary
learning, independent component analysis, autoencoders, matrix factorization and various forms of clustering.
Neural networks
Neural networks are a family of learning algorithms that use a "network" consisting of multiple layers of
inter-connected nodes. They are inspired by the animal nervous system, where the nodes are viewed as neurons
and edges are viewed as synapses. Each edge has an associated weight, and the network defines
computational rules for passing input data from the network's input layer to the output layer. A network
function associated with a neural network characterizes the relationship between input and output layers,
which is parameterized by the weights. With appropriately defined network functions, various learning tasks
can be performed by minimizing a cost function over the network function (weights).
Multilayer neural networks can be used to perform feature learning, since they learn a representation of their
input at the hidden layer(s) which is subsequently used for classification or regression at the output layer. The
most popular network architecture of this type is the Siamese network.
Unsupervised
Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature
learning is often to discover low-dimensional features that capture some structure underlying the
high-dimensional input data. When the feature learning is performed in an unsupervised way, it enables a form of
semi-supervised learning where features learned from an unlabeled dataset are then employed to improve
performance in a supervised setting with labeled data. Several approaches are introduced in the following.
K-means clustering
K-means clustering is an approach for vector quantization. In particular, given a set of n vectors, k-means
clustering groups them into k clusters (i.e., subsets) in such a way that each vector belongs to the cluster with
the closest mean. The problem is computationally NP-hard, although suboptimal greedy algorithms have been
developed.
K-means clustering can be used to group an unlabeled set of inputs into k clusters, and then use the
centroids of these clusters to produce features. These features can be produced in several ways. The simplest
is to add k binary features to each sample, where each feature j has value one if and only if the jth centroid
learned by k-means is the closest to the sample under consideration. It is also possible to use the distances to
the clusters as features, perhaps after transforming them through a radial basis function (a technique that has
been used to train RBF networks).
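The two ways of producing features described above can be sketched as follows. This is only an illustration: the centroids are hypothetical values standing in for the output of k-means, and the Gaussian form of the radial basis function and its `gamma` parameter are assumptions.

```python
import math

def kmeans_features(sample, centroids, gamma=1.0):
    """Return (one_hot, rbf) features of `sample` relative to the given centroids."""
    dists = [math.dist(sample, c) for c in centroids]
    closest = dists.index(min(dists))
    # k binary features: 1 for the closest centroid, 0 elsewhere.
    one_hot = [1 if j == closest else 0 for j in range(len(centroids))]
    # Distances passed through a Gaussian radial basis function (assumed form).
    rbf = [math.exp(-gamma * d ** 2) for d in dists]
    return one_hot, rbf

# Hypothetical centroids, as if they had been learned by k-means.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
one_hot, rbf = kmeans_features((4.0, 4.5), centroids)
```

The binary encoding keeps only the identity of the nearest cluster, while the distance-based features also record how close the sample is to every cluster.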
Principal component analysis (PCA) is often used for dimension reduction. Given an unlabeled set of n input
data vectors, PCA generates p (which is much smaller than the dimension of the input data) right
singular vectors corresponding to the p largest singular values of the data matrix, where the kth row of the
data matrix is the kth input data vector shifted by the sample mean of the input (i.e., subtracting the sample
mean from the data vector). Equivalently, these singular vectors are the eigenvectors corresponding to
the p largest eigenvalues of the sample covariance matrix of the input vectors. These p singular vectors are
the feature vectors learned from the input data, and they represent directions along which the data has the
largest variations.
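The description above translates almost directly into a few lines of NumPy. The sketch below assumes NumPy is available; the function name `pca` and the synthetic data lying near the line y = 2x are made up for illustration.

```python
import numpy as np

def pca(X, p):
    """Project the rows of X onto the top-p principal directions."""
    Xc = X - X.mean(axis=0)                  # shift every row by the sample mean
    # Rows of Vt are the right singular vectors, ordered by singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:p]
    return Xc @ components.T, components, s

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Hypothetical 2-D data lying close to the line y = 2x.
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])
Z, components, s = pca(X, p=1)
```

Here `components[0]` should point (up to sign) along the direction (1, 2), and the squared singular values divided by n - 1 match the eigenvalues of the sample covariance matrix, which is the equivalence stated in the text.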
Unsupervised dictionary learning does not utilize data labels and exploits the structure underlying the data
for optimizing dictionary elements. An example of unsupervised dictionary learning is sparse coding, which
aims to learn basis functions (dictionary elements) for data representation from unlabeled input data.
Multilayer/deep architectures
The hierarchical architecture of the biological neural system inspires deep learning architectures for feature
learning by stacking multiple layers of learning nodes. These architectures are often designed based on
the assumption of distributed representation: observed data is generated by the interactions of many
different factors on multiple levels. In a deep learning architecture, the output of each intermediate layer can
be viewed as a representation of the original input data. Each level uses the representation produced by
the previous level as input, and produces new representations as output, which are then fed to higher levels. The
input at the bottom layer is raw data, and the output of the final layer is the final low-dimensional feature or
representation.
Restricted Boltzmann machines (RBMs) are often used as a building block for multilayer learning
architectures. An RBM can be represented by an undirected bipartite graph consisting of a group of binary
hidden variables, a group of visible variables, and edges connecting the hidden and visible nodes.
Autoencoder
An autoencoder consisting of an encoder and a decoder is a paradigm for deep learning architectures. An
example is provided by Hinton and Salakhutdinov[18] where the encoder uses raw data (e.g., image) as input
and produces feature or representation as output and the decoder uses the extracted feature from the encoder
as input and reconstructs the original input raw data as output.
Applications of sequence modeling are plentiful in day-to-day business practice. Some of them emerged to
meet today’s challenges in terms of quality of service and customer engagement. Here are some examples:
In the auto industry, self-parking is also a sequence modeling task. In fact, parking could be seen as a
sequence of movements where the next movement depends on the previous ones.
Other applications cover text classification, translating videos to natural language, image caption generation,
handwriting recognition/generation, anomaly detection, and many more in the future that none of us can
think of (or are aware of) at the moment.
However, before we go any further in the applications of Sequence Modeling, let us understand what we are
dealing with when we talk about sequences.
Sequences are a data structure where each example could be seen as a series of data points. This sentence: “I
am currently reading an article about sequence modeling with Neural Networks” is an example that consists
of multiple words that depend on each other. The same applies to medical records. One single medical
record consists of many measurements across time. It is the same for speech waveforms.
So why do we need a different learning framework to model sequences, and what are the special features that
we are looking for in this framework?
For illustration purposes and with no loss of generality, let us focus on text as a sequence of words to
motivate this need for a different learning framework.
In fact, machine learning algorithms typically require the text input to be represented as a fixed-length
vector. Many operations needed to train the model (network) can be expressed through algebraic operations
on the matrix of input feature values and the matrix of weights (think about an n-by-p design matrix, where n
is the number of samples observed, and p is the number of variables measured in all samples).
Perhaps the most common fixed-length vector representation for texts is the bag-of-words or bag-of-n-grams
due to its simplicity, efficiency and often surprising accuracy. However, the bag-of-words (BOW)
representation has many disadvantages:
First, the word order is lost, and thus different sentences can have exactly the same representation, as long as
the same words are used. Example: “The food was good, not bad at all.” vs “The food was bad, not good at
all.”. Even though bag-of-n-grams considers the word order in short context, it suffers from data sparsity and
high dimensionality.
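The loss of word order can be checked directly on the two example sentences. The following minimal sketch uses a hypothetical helper `bag_of_words` that lower-cases the text and counts words with the standard library.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count lower-cased word occurrences, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

a = bag_of_words("The food was good, not bad at all.")
b = bag_of_words("The food was bad, not good at all.")
same = (a == b)   # both sentences contain exactly the same words
```

Because the two counters are equal, any model that sees only this representation cannot tell the positive review from the negative one.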
In addition, Bag-of-words and bag-of-n-grams have very little knowledge about the semantics of the words
or more formally the distances between the words. This means that words “powerful”, “strong” and “Paris”
are equally distant despite the fact that semantically, “powerful” should be closer to “strong” than “Paris”.
Humans don’t start their thinking from scratch every second. As you read this article, you understand each
word based on your understanding of previous words. Traditional neural networks can’t do this, and it
seems like a major shortcoming. Bag-of-words and bag-of-n-grams as text representations do not allow us to
keep track of long-term dependencies inside the same sentence or paragraph.
Another disadvantage of modeling sequences with traditional Neural Networks (e.g. Feedforward Neural
Networks) is that parameters are not shared across time. Take for example these two sentences:
“On Monday, it was snowing” and “It was snowing on Monday”. These sentences mean the same thing,
though the details are in different parts of the sequence. When we feed these two sentences into a
Feedforward Neural Network for a prediction task, the model will assign different weights to “On Monday”
depending on its position. Things we learn about the sequence won’t transfer if they appear at different
points in the sequence. Sharing parameters gives the network the ability to look for a given feature
everywhere in the sequence, rather than in just a certain area.
Thus, to model sequences, we need a specific learning framework able to:
So, let us find out more about RNNs! How does a Recurrent Neural Network work?
A Recurrent Neural Network is architected in the same way as a “traditional” Neural Network. We
have some inputs, we have some hidden layers and we have some outputs.
The only difference is that each hidden unit is doing a slightly different function. So, let’s explore how this
hidden unit works.
A recurrent hidden unit computes a function of an input and its own previous output, also known as the cell
state. For textual data, an input could be a vector representing a word x(i) in a sentence of n words (also
known as word embedding).
W and U are weight matrices and tanh is the hyperbolic tangent function.
Similarly, at the next step, it computes a function of the new input and its previous cell state:
$s_2 = \tanh(W x_1 + U s_1)$. This behavior is similar to a hidden unit in a feed-forward network. The difference,
specific to sequences, is that we are adding an additional term to incorporate its own previous state.
A common way of viewing recurrent neural networks is by unfolding them across time. We can notice that
we are using the same weight matrices W and U throughout the sequence. This solves our problem of
parameter sharing. We don’t have new parameters for every point of the sequence. Thus, once we learn
something, it can apply at any point in the sequence.
Not having new parameters for every point of the sequence also helps us deal with variable-length
sequences. In case of a sequence that has a length of 4, we could unroll this RNN to four timesteps.
In other cases, we can unroll it to ten timesteps since the length of the sequence is not prespecified in the
algorithm. By unrolling we simply mean that we write out the network for the complete sequence. For
example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a
5-layer neural network, one layer for each word.
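Unrolling with shared weights can be sketched in a few lines, assuming NumPy is available. The recurrence is the tanh update given in the text; the dimensions, the random weights and the zero initial state are hypothetical choices made for the example.

```python
import numpy as np

def rnn_forward(xs, W, U, s0):
    """Unroll s_next = tanh(W x + U s_prev) over every element of the sequence xs."""
    states = [s0]
    for x in xs:                       # one step per sequence element
        states.append(np.tanh(W @ x + U @ states[-1]))
    return states[1:]

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4                  # hypothetical embedding and state sizes
W = rng.normal(size=(d_hidden, d_in))
U = rng.normal(size=(d_hidden, d_hidden))
s0 = np.zeros(d_hidden)                # assumed initial cell state

# The same W and U handle a sequence of length 4 and one of length 10.
short = rnn_forward(rng.normal(size=(4, d_in)), W, U, s0)
longer = rnn_forward(rng.normal(size=(10, d_in)), W, U, s0)
```

No parameter depends on the position in the sequence, which is why the same function handles both lengths without modification.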
1. Input Layer: This is the layer in which we give input to our model. The number of neurons in this
layer is equal to the total number of features in our data (number of pixels in the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer. There can be
many hidden layers depending upon our model and data size. Each hidden layer can have a different
number of neurons, which is generally greater than the number of features. The output from each layer is
computed by matrix multiplication of the output of the previous layer with the learnable weights of that layer,
then by the addition of learnable biases, followed by an activation function which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function like sigmoid or
softmax, which converts the output of each class into the probability score of each class.
The data is then fed into the model and the output from each layer is obtained; this step is called feedforward. We
then calculate the error using an error function; some common error functions are cross-entropy, square loss
error, etc. After that, we backpropagate into the model by calculating the derivatives. This step is called
backpropagation, which is used to minimize the loss. Here’s the
basic Python code for a neural network with random inputs and two hidden layers.
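The original listing is not reproduced in these notes. What follows is a minimal sketch written for illustration, assuming NumPy is available: the layer sizes, the tanh activation, the learning rate and the random data are all assumptions, not the author's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random inputs: 8 samples with 4 features, random labels over 3 classes.
X = rng.normal(size=(8, 4))
y = rng.integers(0, 3, size=8)
Y = np.eye(3)[y]                              # one-hot targets

# Weights and biases: two hidden layers of 5 units each, then the output layer.
sizes = [4, 5, 5, 3]
W = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X):
    """Feedforward pass; returns the activations of every layer."""
    acts = [X]
    for i in range(2):                        # two hidden layers with tanh
        acts.append(np.tanh(acts[-1] @ W[i] + b[i]))
    acts.append(softmax(acts[-1] @ W[2] + b[2]))
    return acts

def cross_entropy(P, Y):
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

losses = []
lr = 0.1
for step in range(300):
    acts = forward(X)
    losses.append(cross_entropy(acts[-1], Y))
    # Backpropagation: gradient of the loss w.r.t. each layer's pre-activation.
    delta = (acts[-1] - Y) / len(X)           # softmax combined with cross-entropy
    for i in reversed(range(3)):
        grad_W = acts[i].T @ delta
        grad_b = delta.sum(axis=0)
        if i > 0:
            # Propagate through W[i] and the tanh derivative (1 - a^2).
            delta = (delta @ W[i].T) * (1.0 - acts[i] ** 2)
        W[i] -= lr * grad_W                   # gradient descent update
        b[i] -= lr * grad_b
```

The list `losses` records the cross-entropy before each update, so comparing its first and last entries shows whether training reduced the error on this toy data.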
Now imagine taking a small patch of this image and running a small neural network on it, with say, k outputs
and represent them vertically. Now slide that neural network across the whole image, as a result, we will get
another image with different width, height, and depth. Instead of just R, G, and B channels now we have
more channels but smaller width and height. This operation is called Convolution. If the patch size is the
same as that of the image it will be a regular neural network. Because of this small patch, we have fewer
weights.
Now let’s talk about a bit of mathematics that is involved in the whole convolution process.
Convolution layers consist of a set of learnable filters (a patch in the above image). Every filter has
small width and height and the same depth as that of input volume (3 if the input layer is image input).
For example, suppose we have to run convolution on an image with dimension 34x34x3. The possible size of
filters can be a x a x 3, where ‘a’ can be 3, 5, 7, etc., but small compared to the image dimension.
During forward pass, we slide each filter across the whole input volume step by step where each step is
called stride (which can have value 2 or 3 or even 4 for high dimensional images) and compute the dot
product between the weights of filters and patch from input volume.
As we slide our filters we’ll get a 2-D output for each filter and we’ll stack them together and as a result,
we’ll get output volume having a depth equal to the number of filters. The network will learn all the filters.
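The forward pass described above can be sketched as follows, assuming NumPy is available and no padding is used. The choice of 12 filters of size 3x3x3 and the random values are illustrative assumptions; without padding, the output width for input width W, filter width a and stride S is (W - a) // S + 1.

```python
import numpy as np

def conv_forward(image, filters, stride=1):
    """Slide each filter over `image` (H x W x C) with no padding."""
    H, W, C = image.shape
    n_filters, a, _, _ = filters.shape             # filters: (n, a, a, C)
    out_h = (H - a) // stride + 1
    out_w = (W - a) // stride + 1
    out = np.zeros((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + a, c:c + a, :]
            # Dot product between every filter and the current patch.
            out[i, j, :] = np.tensordot(filters, patch, axes=3)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(34, 34, 3))               # the 34x34x3 example image
filters = rng.normal(size=(12, 3, 3, 3))           # 12 hypothetical 3x3x3 filters
out1 = conv_forward(image, filters, stride=1)
out2 = conv_forward(image, filters, stride=2)
```

The depth of each output equals the number of filters, and a larger stride shrinks the output width and height.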
1. Input Layer: This layer holds the raw input of the image with width 32, height 32, and depth 3.
2. Convolution Layer: This layer computes the output volume by computing the dot product between all
filters and image patches. Suppose we use a total of 12 filters for this layer we’ll get output volume of
UNIT - IV
Model Validation in Classification : Cross Validation - Holdout Method, K-Fold, Stratified K-Fold, Leave-
One-Out Cross Validation. Bias-Variance tradeoff, Regularization , Overfitting, Underfitting. Ensemble
Model validation is the process of evaluating a trained model on a test data set. This gives an estimate of the
generalization ability of the trained model. Here I provide a step-by-step approach to complete a first iteration of
model validation in minutes.
The basic recipe for applying a supervised machine learning model is:
Jake VanderPlas gives the process of model validation in four simple and clear steps. There is also a whole
process needed before we even get to his first step, such as extracting the information we need from the data to
make a good judgement when choosing a class of model, and adding finishing touches to confirm the results
afterwards. I will go into depth about these steps and break them down further.
Feature engineering to optimize the metrics. (Skip this during first pass).
Data pre-processing.
Feature selection.
Model selection.
Model validation.
Get the best model and check it against test data set.
I will be using a data set from the UCI Machine Learning Repository. The data set is from the Blood Transfusion
Service Center in Hsin-Chu City in Taiwan. This is a classification problem. The idea behind this
extends to regression problems as well.
Blood Transfusion Service Center Data Set is a clean data set. This will not be the case for most
other data sets. So this is the step to inspect and clean up the data for example handling missing
values…
Split the data into training and test data sets.
There are many ways to get the training and test data sets for model validation like:
3-way holdout method of getting training, validation and test data sets.
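Besides the holdout method, the unit outline lists k-fold and leave-one-out cross validation. The splitting logic behind both can be sketched with the standard library alone; the function name and the sample count are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

Every sample appears in exactly one test fold. Setting k equal to the number of samples gives leave-one-out cross validation; a stratified variant would additionally keep the class proportions equal across folds, which this sketch does not do.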
The main idea behind this step is to get a baseline estimate of the metric which is being optimized. This
baseline will work as a reference in further steps of model validation. There are several ways to get the
baseline estimate for a classification problem. I am using the majority class for prediction. The baseline
accuracy score is approximately 77%.
# Get quick initial metrics estimate.
# Using simple pandas value counts method
print(y_train.value_counts(normalize=True))
# Using sklearn accuracy_score
import numpy as np
from sklearn.metrics import accuracy_score
majority_class = y_train.mode()[0]
prediction = np.full(shape=y_train.shape, fill_value=majority_class)
accuracy_score(y_train, prediction)
Feature engineering is the process of using domain knowledge of the data to create features that make
machine learning algorithms work. Feature engineering is fundamental to the application of machine
learning, and is both difficult and expensive.
From Wikipedia
This means identifying the relationships between independent and dependent features, with the help
of graphs like pair plots or a correlation matrix. The identified relationships can then be added as polynomial or
interaction features.
The feature engineering step is the point of entry for successive iterations. This is a critical step and plays a
greater role in predictions than model validation.
As a quick solution we can throw in some polynomial features using PolynomialFeatures in scikit-learn.
Domain knowledge of the problem at hand will be of great use for feature engineering. This is a bigger topic
in itself and requires an extensive investment of time and resources.
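To make concrete what polynomial and interaction features are, the sketch below expands a single row by hand using only the standard library. It is an illustration of the idea rather than the scikit-learn implementation; for two inputs a and b at degree 2 it produces the terms 1, a, b, a^2, ab, b^2.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree=2):
    """All monomials of the input features up to `degree`, including the bias term 1."""
    features = []
    for d in range(degree + 1):
        # Each combination of d feature indices defines one monomial.
        for combo in combinations_with_replacement(range(len(row)), d):
            features.append(prod(row[i] for i in combo))
    return features

expanded = polynomial_features([2, 3])   # a = 2, b = 3
```

The term ab is the interaction feature; it lets a linear model capture an effect that depends on both inputs together.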
Data pre-processing.
Data pre-processing converts features into a format that is more suitable for the estimators. In general,
machine learning models prefer standardization of the data set. I will make use of RobustScaler for our
example.
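RobustScaler, with its default settings, centres each feature on its median and divides by the interquartile range, which makes it less sensitive to outliers than mean and variance scaling. The idea can be sketched for a single feature with the standard library; the helper name and the sample values are hypothetical.

```python
import statistics

def robust_scale(values):
    """Centre on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

scaled = robust_scale([1, 2, 3, 4, 100])   # 100 is a deliberate outlier
```

The outlier stays large after scaling, but it does not distort the scale of the other four values, as it would if the mean and standard deviation were used.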
Feature selection.
I will use SelectKBest, a univariate feature selection method. The scoring function used for classification
and regression problems will vary.
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount of available data. Hence,
there are many hypotheses with the same accuracy on the data and the learning algorithm chooses only one of
them! There is a risk that the accuracy of the chosen hypothesis is low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best
hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not contain any good approximation of
the target class(es).
Main Challenge for Developing Ensemble Models?
The main challenge is not to obtain highly accurate base models, but rather to obtain base models which
make different kinds of errors. For example, if ensembles are used for classification, high accuracies can be
accomplished if different base models misclassify different training examples, even if the base classifier
accuracy is low.
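The benefit of base models that make different errors can be seen in a small majority-vote sketch. The three prediction lists are made up for illustration: each model is wrong on two examples, but never on the same ones.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per example."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

truth = [1, 1, 1, 0, 0, 0]
# Three hypothetical base models, each wrong on two different examples.
model_a = [0, 0, 1, 0, 0, 0]
model_b = [1, 1, 0, 1, 0, 0]
model_c = [1, 1, 1, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

Each base model is right on only 4 of the 6 examples, yet the vote is right on all 6, because on every example at most one model is wrong.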
Randomness Injection
Feature-Selection Ensembles
Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles
UNIT – V
Unsupervised Learning: Clustering - K-means, K-Modes, K-Prototypes, Gaussian
Mixture Models, Expectation-Maximization.
Reinforcement Learning: Exploration and exploitation trade-offs, non-associative learning, Markov decision
processes, Q-learning.
Introduction to clustering
As the name suggests, unsupervised learning is a machine learning technique in which models are
not supervised using a labeled training dataset. Instead, the models themselves find hidden patterns and insights
in the given data. It can be compared to the learning which takes place in the human brain while
learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.”
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
Here, we have taken unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to the machine
learning model in order to train it. First, the model interprets the raw data to find the hidden patterns
in the data and then applies a suitable algorithm such as k-means clustering or hierarchical clustering.
Once applied, the algorithm divides the data objects into groups according to
the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
Clustering: Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in one group and have few or no similarities with the objects of another group.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Apriori algorithm
One of the most used clustering algorithms is k-means. It groups the data according
to the existing similarities among them into k clusters, where k is given as input to the algorithm. I'll start
with a simple example.
Let's imagine we have 5 objects (say 5 people) and for each of them we know two features
(height and weight). We want to group them into k=2 clusters.
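The alternation between assigning points and updating means can be sketched in plain Python for this five-person example. The height and weight values and the choice of the first two people as initial centroids are hypothetical.

```python
import math

def k_means(points, centroids, n_iter=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical (height in cm, weight in kg) values for five people.
people = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72)]
labels, centroids = k_means(people, centroids=[people[0], people[1]])
```

With these values the three taller, heavier people end up in one cluster and the other two in the second. The result of k-means in general depends on the initial centroids, so different starting points can give different clusters.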
As you probably already know, I'm using Python libraries to analyze my data. The k-means
algorithm is implemented in the scikit-learn package. To use it, you will just need the following line
in your script:
from sklearn.cluster import KMeans
At this point, you may have noticed something. The basic concept of k-means rests on
mathematical calculations (means, Euclidean distances). But what if our data is non-numerical or, in
other words, categorical? Imagine, for instance, having the ID code and date of birth of the five
people of the previous example, instead of their heights and weights.
We could think of transforming our categorical values into numerical values and then applying
k-means. But beware: k-means uses numerical distances, so it could consider close two really distant
objects that merely have been assigned two close numbers.
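K-modes, listed in the unit outline, avoids this problem by comparing categorical records attribute by attribute instead of using numerical distances. A minimal sketch of that simple matching dissimilarity follows; the three records and their attributes are invented for illustration.

```python
def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical categorical records: (blood group, city, occupation).
p1 = ("A", "Hyderabad", "engineer")
p2 = ("A", "Hyderabad", "teacher")
p3 = ("O", "Chennai", "doctor")

d12 = matching_dissimilarity(p1, p2)   # differ only in occupation
d13 = matching_dissimilarity(p1, p3)   # differ in every attribute
```

No arbitrary numeric code is assigned to the categories, so two records are close only when they actually share attribute values.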
Reinforcement learning
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its
environment can learn to choose optimal actions to achieve its goals.
Introduction
Consider building a learning robot. The robot, or agent, has a set of sensors
to observe the state of its environment, and a set of actions it can perform to alter this state.
Its task is to learn a control strategy, or policy, for choosing actions that achieve its goals.
The goals of the agent can be defined by a reward function that assigns a
numerical value to each distinct action the agent may take from each distinct state.
This reward function may be built into the robot, or known only to an external
teacher who provides the reward value for each action performed by the robot.
The task of the robot is to perform sequences of actions, observe their
consequences, and learn a control policy.
The control policy is one that, from any initial state, chooses actions that
maximize the reward accumulated over time by the agent.
Example:
A mobile robot may have sensors such as a camera and sonars, and actions such as "move
forward" and "turn."
The robot may have a goal of docking onto its battery charger whenever its battery level is
low.
The goal of docking to the battery charger can be captured by assigning a positive
reward (e.g., +100) to state-action transitions that immediately result in a connection to the charger
and a reward of zero to every other state-action transition.
1. Delayed reward: The task of the agent is to learn a target function 𝜋 that maps from
the current state s to the optimal action a = 𝜋(s). In reinforcement learning, training information is
not available in the form (s, 𝜋(s)). Instead, the trainer provides only a sequence of immediate reward values
as the agent executes its sequence of actions. The agent, therefore, faces the problem of temporal
credit assignment: determining which of the actions in its sequence are to be credited with
producing the eventual rewards.
3. Partially observable states: Although it is convenient to assume that the agent's sensors can perceive
the entire state of the environment at each time step, in many practical situations sensors provide only
partial information. In such cases, the agent needs to consider its previous observations together with its
current sensor data when choosing actions, and the best policy may be one that chooses actions specifically to
improve the observability of the environment.
4. Life-long learning: A robot is often required to learn several related tasks within the same
environment, using the same sensors. For example, a mobile robot may need to learn how to dock
on its battery charger, how to navigate through narrow corridors, and how to pick up output from
laser printers. This setting raises the possibility of using previously obtained experience or
knowledge to reduce sample complexity when learning new tasks.
Learning Task
Consider a Markov decision process (MDP) where the agent can perceive a set S of distinct states of
its environment and has a set A of actions that it can perform.
At each discrete time step t, the agent senses the current state $s_t$, chooses a current action
$a_t$, and performs it.
The environment responds by giving the agent a reward $r_t = r(s_t, a_t)$ and by producing the
succeeding state $s_{t+1} = \delta(s_t, a_t)$. Here the functions $\delta(s_t, a_t)$ and $r(s_t, a_t)$ depend only on
the current state and action, and not on earlier states or actions.
The task of the agent is to learn a policy, $\pi : S \to A$, for selecting its next action $a_t$ based on
the current observed state $s_t$; that is, $\pi(s_t) = a_t$.
How shall we specify precisely which policy π we would like the agent to learn?
One alternative criterion, the average reward, considers the average reward per time step over the entire
lifetime of the agent.
We require that the agent learn a policy $\pi$ that maximizes $V^{\pi}(s)$ for all states s. Such a
policy is called an optimal policy and is denoted by $\pi^*$.
We refer to the value function $V^{\pi^*}(s)$ of such an optimal policy as $V^*(s)$. $V^*(s)$ gives the maximum
discounted cumulative reward that the agent can obtain starting from state s.
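The phrase "discounted cumulative reward" refers to a sum in which the reward received i steps in the future is weighted by the discount factor raised to the power i. The sketch below assumes this standard definition and a finite, made-up reward sequence.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a finite sequence of rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Hypothetical episode: two zero-reward moves, then +100 on reaching the goal.
value = discounted_return([0, 0, 100])
```

With a discount factor of 0.9 the reward of 100 received two steps later is worth about 81 now, so a policy that reaches the goal sooner has a higher value.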
Example:
The six grid squares in this diagram represent six possible states, or locations, for the agent.
Each arrow in the diagram represents a possible action the agent can take to move from one state
to another.
The number associated with each arrow represents the immediate reward r(s, a) the
agent receives if it executes the corresponding state-action transition.
The immediate reward in this environment is defined to be zero for all state-action
transitions except for those leading into the state labelled G. We treat the state G as the goal
state; the agent can receive reward by entering this state.
Once the states, actions, and immediate rewards are defined, we choose a value for the
discount factor γ and determine the optimal policy π* and its value function V*(s).
Let’s choose γ = 0.9. The diagram at the bottom of the figure shows one optimal policy for this setting.
Values of V*(s) and Q(s, a) follow from r(s, a) and the discount factor γ =
0.9. An optimal policy, corresponding to actions with maximal Q values, is also shown.
Q LEARNING
The training information available to the learner is the sequence of immediate rewards $r(s_i, a_i)$
for i = 0, 1, 2, . . . .
Given this kind of training information it is easier to learn a numerical evaluation
function defined over states and actions, and then implement the optimal policy in terms of
this evaluation function.
What evaluation function should the agent attempt to learn?
One obvious choice is V*. The agent should prefer state $s_1$ over state $s_2$ whenever
$V^*(s_1) > V^*(s_2)$, because the cumulative future reward will be greater from $s_1$.
The optimal action in state s is the action a that maximizes the sum of the immediate
reward r(s, a) plus the value V* of the immediate successor state, discounted by γ.
The Q Function
The value of the evaluation function Q(s, a) is the reward received immediately
upon executing action a from state s, plus the value (discounted by γ) of
following the optimal policy thereafter.
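Written as formulas, the verbal definition above and the resulting choice of action read as follows. The symbols r, δ, γ and V* are those already used in this unit; the typesetting is a reconstruction from the text.

```latex
Q(s, a) \equiv r(s, a) + \gamma \, V^{*}(\delta(s, a))
\qquad
\pi^{*}(s) = \arg\max_{a} Q(s, a)
```

The second formula shows why Q is convenient: the agent can pick the optimal action by comparing Q values for the current state, without knowing r or δ explicitly.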
The key problem is finding a reliable way to estimate training values for Q, given only
a sequence of immediate rewards r spread out over time.
Rewriting Equation
Q learning algorithm:
An Illustrative Example
To illustrate the operation of the Q learning algorithm, consider a single action taken
by an agent, and the corresponding refinement to the 𝑄̂ estimate.
The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
According to the training rule, the new 𝑄̂ estimate for this transition is the sum of the
received reward (zero) and the highest 𝑄̂ value associated with the resulting state (100),
discounted by γ (0.9).
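Worked out with the numbers above, the update is as follows. The labels s_1 (the state before the move), s_2 (the resulting state) and a_right (the move to the right) are names assumed for this illustration.

```latex
\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')
= 0 + 0.9 \times 100 = 90
```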
Convergence
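Will the Q learning algorithm converge toward a 𝑄̂ equal to the true Q function?
The answer is yes, under certain conditions: the system is a deterministic MDP, the
immediate rewards are bounded, and the agent visits every state-action pair infinitely often.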
Here are four machine learning trends that could become a reality in the near future:
1) Intelligence on the Cloud
Algorithms can help companies unearth insights about their business, but this proposition can be
expensive, with no guarantee of a bottom-line increase. Companies often have to collect
data, hire data scientists and train them to deal with changing databases. As more data metrics
become available, the cost of storing them is dropping thanks to the cloud. There will no longer be
a need to manage infrastructure, as cloud systems can generate new models as the scale of an
operation increases, while also delivering more accurate results. More open-source ML frameworks
are coming into the fold, providing pre-trained platforms that can tag images, recommend products and
perform natural language processing tasks.
2) Quantum Computing Capabilities
One of the tasks that ML can help companies deal with is the manipulation and classification of
large quantities of vectors in high-dimensional spaces. Current algorithms take a large amount of time
to solve these problems, which raises the cost of completing business processes. Quantum
computers are expected to manipulate high-dimensional vectors in a fraction of the time.
Compared with traditional algorithms, they should be able to process more vectors and
dimensions in a shorter period of time.
3) Improved Personalization
Retailers are already making waves in developing recommendation engines that reach their target
audience more accurately. Taking this a step further, ML will be able to make the personalization
techniques of these engines more precise. The technology will offer more specific data that
retailers can then use in ads to improve the shopping experience for consumers.
4) Data on Data
As the amount of data available increases, the cost of storing this data decreases at roughly the same
rate. ML has great potential for generating data of the highest quality, which leads to better models,
an improved user experience and more data that helps repeat and improve upon this cycle. Companies
such as Tesla add a million miles of driving data every hour to enhance their self-driving capabilities.
Tesla’s Autopilot feature learns from this data and improves the software that propels these self-driving
vehicles forward as the company gathers more data on the possible pitfalls of autonomous driving
technology.