Module 9 The Science of Predictive Analytics - Elounge
PREDICTIVE ANALYTICS
Module Workbook
The Science of Predictive Analytics
Contents
Introduction
Module performance objectives
Lesson 1: Time series in predictive analytics
  Time series predictive models
  Seasonality and trend in time series
  Forecasting time series
  Lesson Summary
  Notes
Lesson 2: Machine learning for predictive modeling
  Understanding machine learning
  Supervised machine learning
  Decision trees, k-Nearest Neighbors
  Unsupervised machine learning
  Lesson Summary
  Notes
Lesson 3: The wisdom of the crowds
  Majority Vote
  Bagging and Boosting
  Random forest
  Lesson Summary
  Notes
References
Introduction

Module performance objectives
After completing the lessons, you should be able to:
• Define time series, identify their components, and learn how to use them in
predictions;
• Understand the science of machine learning, its different approaches and
application scenarios;
• Describe different ensembling models such as: Majority vote, Bagging and
Boosting, Random Forest.
Lesson 1: Time series in predictive analytics
Many governments and public organizations publish their data so that it is publicly
accessible to any interested citizen, following open-data guidelines, policies and
directives. Let's explore an example from the Canadian Vital Statistics birth
database, from which we have obtained the number of births per month in Canada from 2016
to 2020.
1. Pattern: The time series seems to follow a repeating pattern every year, with
similar maximum and minimum values each year.
2. Year 2020: However, the year 2020 is different, possibly attributed to the impact of
the COVID-19 pandemic.
3. To be more specific: the figure shows a pronounced low point, with a decrease in
births, in January, while seven months later we find the opposite situation: in
August, there is a significant increase in the number of births. After this positive
peak, the curve decreases once again.
4. Year 2021: With this information, we can predict that in 2021 the curve will tend
to increase from January until August, and that birth numbers will decrease
during the second half of the year.
Modeling means extracting information from the existing series and then using this
information to forecast future behavior. For example, if the past series has some clear
patterns, those will be repeated in the future.
By resampling, the line is smoothed to obtain visualizations that enrich our understanding
of the series. For example, the effects of the pandemic in 2020 are more evident in the
semester resample.
Sometimes by applying resampling, we observe new features that may help model
predictive systems.
However, resampling needs to be performed carefully, as it discards some of the
original information. When we resample data, we need to inform the audience of our
analysis that the raw data has been modified.
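To illustrate the idea, here is a minimal, pure-Python sketch of downsampling by averaging consecutive blocks of points. The monthly figures are hypothetical, not the actual Canadian numbers:

```python
def resample_mean(values, period):
    """Downsample a series by averaging consecutive blocks of `period` points."""
    return [sum(values[i:i + period]) / period
            for i in range(0, len(values) - period + 1, period)]

# Hypothetical monthly birth counts (thousands), resampled to quarterly means:
monthly = [30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52]
quarterly = resample_mean(monthly, 3)   # 12 monthly points -> 4 quarterly points
```

The resampled line has four points instead of twelve, which is exactly the smoothing effect described above: detail is lost, but broader patterns become easier to see.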
All time series consist of three systematic components, level, trend, and
seasonality, plus one non-systematic component, noise.
Among these four components, trend and seasonality are the most important for
predictive models, as they determine how we will forecast the variable.
A well-known method for forecasting time series is the Moving Averages method. It
estimates the trend cycle, capturing the average change in a data series over time.
This table and its visualization show the values of a 3-year time series with the number of
births per quarter. We use the Moving Averages method to predict future values of births.
The Moving Averages method calculates averages over several periods. It must be
tested with different period lengths in order to identify which one most accurately
reflects the values of the actual time series. We recalculate the time series by using
the Moving Averages formula.
Here is a sample calculation. Although we could draw the moving averages line
already, we recommend using the Centered Moving Averages method. With this additional
step, we center the resulting averaged periods.
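These two steps can be sketched in plain Python. The quarterly figures below are made up for illustration, not taken from the workbook's table:

```python
def moving_average(series, window):
    """Simple moving average: the mean of each sliding window of values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def centered_moving_average(series, window):
    """For an even window, centre the result by averaging adjacent averages."""
    ma = moving_average(series, window)
    if window % 2 == 1:
        return ma            # odd windows are already centred
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]

# Hypothetical births per quarter over three years, with a 4-quarter window:
births = [20, 24, 28, 24, 22, 26, 30, 26, 24, 28, 32, 28]
trend = centered_moving_average(births, 4)
```

Because the window covers a full year, the seasonal ups and downs cancel out and `trend` traces the underlying trend cycle of the series.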
Drawing the Moving Average line - The moving averages line offers a smoother view of the
original series. By extrapolating the moving averages line, we can predict future values for
the series.
This method can be applied to longer and more complex time series, predicting a step
ahead using previous data. Here we applied our method to the example of the Canadian
Vital Statistics birth dataset mentioned in previous sections.
Prediction: The new line (red line) is our prediction and has smoothed properties, with
fewer extreme edges.
When should we use Moving Averages? This method is favorable for time series
with relatively low volatility and strong seasonality; it is also widely applied in stock
exchange trading.
Lesson Summary
IN SUMMARY:
Time series predictive models
• Time series is a relevant data structure in any scientific field. Numerous
economic, biological, or medical scenarios can be represented by time
series.
• Time series is modelled using statistical approaches and sometimes using
machine learning.
Seasonality and trend in time series
• The components of a time series are: level, trend, seasonality, and
noise/residual.
• The concept of seasonality and trend is essential in time series modeling.
Forecasting time series
• There is extensive research on time series modeling, and there are many
predictive models applicable.
• Moving Averages is a simple model that is easy to understand and can be
applied to many situations.
Notes:
Lesson 2: Machine learning for predictive modeling

Understanding machine learning
Structured data
Machine learning can work with data in a standardized format. The data is considered
structured when it conforms to a tabular format where rows and columns are connected.
Common examples of structured data are Excel files or SQL databases. Each of these has
structured rows and columns that can be sorted.
Unstructured data
Machine learning also works with unstructured data. It is information that either does not
have a predefined data model or is not organised in a pre-defined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts
as well. This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases. Common
examples of unstructured data include audio, video files or NoSQL databases.
Neural networks
Furthermore, there is a family of algorithms called neural networks, designed to cope
with the complexity of vast amounts of unstructured data. These algorithms are much
more powerful than traditional machine learning techniques, and they form the
backbone of deep learning, a subfield of machine learning.
Deep learning obtains impressive results in fields like game playing (defeating the world
champion of Go), biology (discovering how proteins fold), genetics, medical diagnosis,
speech recognition, and more.
As shown in the graph below, neural networks are powerful but challenging to train.
Their design is inspired by how the human brain works: artificial neurons are arranged
in layers and connected to one another to transmit information.
We apply machine learning techniques to build systems with structured data. But with
unstructured data or highly complex problems, we look into deep learning to obtain better
results.
Supervised learning
In this approach, the algorithm learns from labeled data, and then, the algorithm infers or
predicts unlabeled data. It is mainly used for classification and regression applications.
Unsupervised learning
Here, the program analyzes a large amount of data and identifies similarities between the
elements, obtaining clusters or groups of similar elements.
In most unsupervised learning approaches, we must tell the algorithm some information
beforehand, such as the number of classes. However, some algorithms are also able to
identify the best number of classes for the input data.
Supervised machine learning
Phase A: Training
The training phase defines the machine learning model as follows:
1. First, we perform a feature selection on the dataset.
2. Then, the program is able to learn from these features by looking at each labelled
item.
3. Finally, we calculate the model's accuracy on the test data.
Machine learning extracts information about the features in our data set and finds
correlations and meaningful values to identify the items. The human brain works in the
same way; however, our ability to look into hundreds or thousands of items is limited.
Some machine learning algorithms can prioritise some features over others. For instance,
a suitable algorithm may determine that the shape is possibly the most crucial feature,
whereas the smell or color is irrelevant.
Phase B: Prediction
When the model is trained, it predicts the class from items that have not been classified
before:
• When a machine learning algorithm has high accuracy in training and also in
prediction, we say that the algorithm has good 'generalization.'
• However, if the accuracy decreases significantly at prediction time, the
program might have poor 'generalization,' possibly caused by issues in the
training phase, parameter selection, or input data.
Non-linear boundaries can be drawn by non-linear algorithms, and not all Machine
Learning algorithms can generate non-linear separation lines.
Overfitting and Underfitting are two machine learning training problems. Both are related to
generalization, the property that defines how well the classifier works with unclassified
data.
• Underfitting: underfitting happens when the model is too simple and generates a
solution with low accuracy.
• Overfitting: overfitting happens when the system has been trained for too long,
causing an overly complex line as a boundary between classes.
If a classifier has high accuracy on data that was not used in training, it has good
generalization.
There is not a single machine learning algorithm, but many. Each one has strengths and
weaknesses. No algorithm can be considered 'better' than the others.
For this reason, we must test several approaches for a specific application and see
which performs best. It is an exploratory task that can consume quite a lot of
effort.
Two machine learning algorithms for supervised learning:
• Decision Tree
• k-Nearest Neighbours
Decision Tree
The Decision Tree is a simple but highly effective machine learning algorithm.
Below you can see a demo graph of the algorithm.
- Trees: The algorithm builds trees with a series of questions designed to assign a
classification.
- Decision node: A decision node shows a decision to be made. It contains a
question that splits into subnodes.
- Leaf node/terminal node: Leaf nodes represent the attribute where the item is
classified. They don't split into further nodes.
- Branches: The branch/edge represents the answers to the questions, like yes/no or
above/below. The attribute value determines branching, and the tree can be
large, with many chained decision nodes if necessary.
- After the tree: After the tree has been built with a training dataset, items of a new
dataset can be classified.
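To make the node/branch/leaf vocabulary concrete, here is a minimal sketch of walking a hand-built tree. The features ('elongation', 'redness') and thresholds are invented for illustration, not learned from data:

```python
def classify(item, tree):
    """Walk the tree: tuples are decision nodes, plain strings are leaf nodes."""
    if not isinstance(tree, tuple):
        return tree                              # leaf node: the class label
    feature, threshold, below, above = tree      # decision node: one question
    branch = below if item[feature] < threshold else above
    return classify(item, branch)                # follow the chosen branch

# Hypothetical two-level tree: elongation separates pears, then redness splits apples.
tree = ('elongation', 1.5,
        ('redness', 0.5, 'green apple', 'red apple'),  # below-threshold branch
        'pear')                                        # above-threshold branch

label = classify({'elongation': 2.0, 'redness': 0.1}, tree)
```

A real decision-tree learner builds the tuple structure automatically from training data; here it is written by hand so that only the classification walk is shown.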
Decision trees, like other machine learning algorithms, require large amounts of
data; otherwise, the solution is incomplete.
There are many possible decision trees suitable for solving the same problem. However,
many refinements allow the trees to be more compact. One of these measures is called
Gini impurity.
For instance, if a decision node produces two subnodes that split the data 80%-20%,
the split is 80% 'pure.' A bag full of balls of the same color has the lowest possible
Gini impurity, while a bag full of balls of different colors has a very high Gini impurity.
A variable is significant if it is situated at the top of the decision tree. If we are classifying
bulls and horses, horns will be a significant variable, because most bulls have horns and
horses do not. In the small sample below, 100% of the elongated fruits are pears, and 0% are
apples. Therefore, this property will be at the top of the classification tree, as it splits the
tree with a very low Gini impurity.
Decision trees can produce non-linear boundaries and can handle both numerical and
categorical data. In addition, their simplicity generates results that are easy to understand
and interpret, which is an advantage over other more powerful algorithms.
k-Nearest Neighbours

This approach is based on the fact that items with high similarity are close to one
another when represented in a diagram. With this property in mind, by defining a distance
metric between the items, we can identify groups of items belonging to the same class.
For example, the picture below has a group of fruits in a two-dimensional figure. We know
that the different fruits are grouped together. This is the representation of the training
data.
If we want to find the label of a new item, we place this item in the map of the figure.
Then, the label of this item is the one defined by the area where the item was finally
placed.
The algorithm is based on this 'closeness' property, where each new point is assigned to
a class by comparing it with the training data areas.
In the training phase, we define the areas (the coloured areas in the figure):
• First, we start with the training dataset (labeled data), and we find the areas.
• Next, we classify the non-labeled data by looking at the class of the closest
neighbours.
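A minimal sketch of this closeness-based classification, using hypothetical two-dimensional 'fruit' features (width, height):

```python
import math
from collections import Counter

def knn_predict(training, new_point, k=3):
    """Classify a point by the majority label among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], new_point))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2-D fruit features (width, height) with class labels:
training = [((1.0, 1.0), 'apple'), ((1.2, 0.9), 'apple'),
            ((3.0, 4.0), 'pear'), ((3.2, 4.1), 'pear')]
label = knn_predict(training, (1.1, 1.0))   # lands in the apple area
```

The parameter `k` controls how many neighbours vote; small values follow the local data closely, larger values smooth the class boundaries.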
There are many other algorithms, like the Bayesian family of approaches, which use the
probabilities of the elements for classification, or highly complex and sophisticated
methods like Support Vector Machines, which work by dimensional transformations of
the data. And then we have the king of machine learning algorithms: Neural Networks.
• Like how human neurons work: Neural Networks are extremely powerful and
replicate, to some extent, how human neurons work. They are responsible for some
important advances in machine learning, like image recognition, language
understanding, and speech recognition.
• Effective with complex data: By using Neural Networks, some scientists are
obtaining spectacular advances towards fully intelligent algorithms. Neural
Networks are complex and difficult to manage but highly effective with complex
data.
• Data 'greedy': Neural Networks have some problems. One is that they are data
'greedy', meaning that they require large amounts of data, which impacts the
computing resources needed for training. Some neural networks are so large that
they are criticized for not being sustainable.
• Lack of explainability: Another problem is that Neural Networks are unable to
explain their decisions. For instance, they will classify an apple, but they will not
be able to say why.
Unsupervised machine learning
In the previous sections, we analyzed the supervised learning approach and explored
two of its most common machine learning algorithms, Decision Trees and k-Nearest
Neighbours. Let's now look at the unsupervised machine learning approach.
K-Means algorithm
As in supervised learning, there are several unsupervised learning algorithms. One of the
most used is the K-Means algorithm.
This algorithm segments a dataset into a pre-established number of classes. We must tell
the algorithm how many classes we want, and the algorithm will segment the data into that
specific number.
K-Means is relatively easy to implement and can be used on very large
datasets.
However, K-Means Algorithm might experience problems where the clusters have varying
sizes and densities. Another problem is that centroids can be dragged by outliers, or
outliers might get their own cluster instead of being ignored.
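The algorithm's core loop, assign every point to its nearest centroid and then recompute each centroid as its cluster's mean, can be sketched in plain Python on a toy dataset with two obvious groups:

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal K-Means: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initialise from random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [                            # update step (keep empty clusters)
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(points, k=2)
```

With well-separated groups like these, the loop settles quickly; the sensitivity to outliers mentioned above comes from the mean in the update step, which a single distant point can drag.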
Lesson Summary
IN SUMMARY:
Understanding machine learning
• Machine learning is a set of algorithms that self-learn from data.
• Machine learning algorithms use large amounts of data to train.
Supervised machine learning
• The main difference between supervised and unsupervised learning is the
use of labeled data.
• Supervised learning uses labeled input and output data, while an
unsupervised learning algorithm does not.
Decision trees, k-Nearest Neighbors
• There are many machine learning algorithms. For supervised learning, we
analyzed Decision Trees and k-Nearest Neighbors. For unsupervised
learning, the K-Means algorithm.
Unsupervised machine learning
• Unsupervised machine learning does not require labeled data, and it is
used to create groups or clusters of data that are not obvious to the human
eye (i.e. customer behavior).
Notes:
Lesson 3: The wisdom of the crowds

Majority Vote
• Bias: Bias is when the result of the classification algorithm is skewed. For example,
if the result is consistently biased to one side, this may point to underfitting or to a
classifier that is not well suited to the problem: it fails to capture the data trends in
the dataset properly.
Algorithms can suffer from both high bias and high variance simultaneously. In the
figure below, we can see the difference between variance and bias simulating a target.
Bagging and Boosting
In bagging, we generate several random subsets of the data and apply a classifier to
each subset in parallel. The result is obtained through a majority vote that
aggregates all these weak classifiers.
NOTE: Bagging reduces the variance of the result and is very effective.
• Step 1: Use sampling with replacement to create several subsets of the original data.
• Step 2: Create a base model for each subset.
• Step 3: Combine all the base models to form a final model.
• Step 4: Make the final predictions by combining the predictions from all the models.
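The four steps can be sketched in plain Python. Here the 'base model' is a deliberately weak one-dimensional threshold classifier, a hypothetical stand-in for any real learner, so that the ensemble logic stays in focus:

```python
import random
from collections import Counter

def train_stump(sample):
    """Weak base model: threshold at the midpoint between the two class means."""
    means = {}
    for x, label in sample:
        means.setdefault(label, []).append(x)
    if len(means) == 1:                       # degenerate bootstrap sample
        only = next(iter(means))
        return lambda x: only
    (la, ma), (lb, mb) = [(l, sum(v) / len(v)) for l, v in means.items()]
    threshold = (ma + mb) / 2
    low, high = (la, lb) if ma < mb else (lb, la)
    return lambda x: low if x < threshold else high

def bagging_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]   # Step 1: bootstrap (with replacement)
        models.append(train_stump(sample))          # Step 2: one base model per subset
    votes = Counter(m(x) for m in models)           # Steps 3-4: combine by majority vote
    return votes.most_common(1)[0][0]

data = [(i / 10, 'low') for i in range(8)] + [(1 + i / 10, 'high') for i in range(8)]
```

Each bootstrap sample produces a slightly different threshold; averaging them through the vote is what reduces the variance of the final prediction.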
In boosting, we apply a classification algorithm sequentially, each time focusing on the
data that was wrongly classified. The first classifier works with the original data, and
each subsequent classifier gives more weight to the data that was misclassified in the
previous step. We can use a pre-determined number of classifiers, or repeat the process
until a threshold is reached or the accuracy no longer improves.
• Step 1: Run the original data with an equal weight for all data points to generate the
first model.
• Step 2: Identify the wrongly classified data points and assign them a higher weight.
Decrease the weights of correctly classified data points.
• Step 3: Run the weighted data to generate the second model, which tries to correct
the errors from the previous model.
• Step 4: Repeat steps 2 and 3 until the required results are reached.
• Step 5: Build the final model as the weighted combination of all the models.
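Steps 2 and 5 hinge on the reweighting rule. Here is a sketch of an AdaBoost-style update; the 0.5·ln((1−e)/e) formula is the classic AdaBoost choice, used as one concrete example rather than the only possibility:

```python
import math

def update_weights(weights, correct):
    """Raise the weight of misclassified points, lower correct ones, renormalize."""
    error = sum(w for w, c in zip(weights, correct) if not c)
    alpha = 0.5 * math.log((1 - error) / error)   # the model's say in the final vote
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

# Five points with equal weight; the fourth one was misclassified (error = 0.2):
weights, alpha = update_weights([0.2] * 5, [True, True, True, False, True])
```

After the update, the misclassified point carries half of the total weight, so the next model in the sequence is forced to concentrate on it, which is exactly the bias-reducing behaviour described in the steps above.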
Random forest
Step 1: We create different random subset samples from the complete training set. Each
subset is created with replacement, which means one element can appear in several samples.
Step 2: We apply a different decision tree classifier to each one of the subset samples.
To predict, we combine the result of all the trees using a majority vote, obtaining a
classification for each item.
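A sketch of both steps, assuming a toy setup: each 'tree' is reduced to a one-feature threshold rule trained on a bootstrap sample, with its feature chosen at random, the random-feature idea that distinguishes Random Forest from plain bagging. The data and feature values are hypothetical:

```python
import random
from collections import Counter

def train_tree(sample, feature):
    """Stand-in for a decision tree: threshold one feature between class means."""
    means = {}
    for x, label in sample:
        means.setdefault(label, []).append(x[feature])
    if len(means) == 1:                    # degenerate bootstrap sample
        only = next(iter(means))
        return lambda x: only
    (la, ma), (lb, mb) = [(l, sum(v) / len(v)) for l, v in means.items()]
    threshold = (ma + mb) / 2
    low, high = (la, lb) if ma < mb else (lb, la)
    return lambda x: low if x[feature] < threshold else high

def forest_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]   # Step 1: bootstrap with replacement
        feature = rng.randrange(len(x))             # random feature for this tree
        trees.append(train_tree(sample, feature))   # Step 2: one tree per sample
    return Counter(t(x) for t in trees).most_common(1)[0][0]  # majority vote

data = ([((i / 10, i / 10), 'apple') for i in range(8)] +
        [((2 + i / 10, 2 + i / 10), 'pear') for i in range(8)])
```

Real Random Forest implementations grow full decision trees and pick a random feature subset at every split; the single-feature stump above only illustrates why randomizing both the sample and the features decorrelates the trees before the vote.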
Lesson Summary
IN SUMMARY:
Majority Vote
• Statistics have confirmed that by combining several algorithms, we can
obtain higher accuracy in our results than by using a single algorithm. This is
the principle of the ‘wisdom of the crowds’.
• Ensembling involves the use of different algorithms, where the results are
aggregated.
• Majority vote is an algorithm for finding the majority of a sequence of
elements.
• The algorithm finds a majority element that occurs repeatedly for more than
half of the elements of the input.
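The classic linear-time algorithm with this behaviour is the Boyer-Moore majority vote; a minimal sketch:

```python
def majority_vote(items):
    """Boyer-Moore majority vote: one pass to find a candidate, one to verify it."""
    candidate, count = None, 0
    for item in items:
        if count == 0:
            candidate = item        # adopt a new candidate
        count += 1 if item == candidate else -1
    # the candidate is the true majority only if it exceeds half the elements
    return candidate if items.count(candidate) * 2 > len(items) else None

majority_vote(['a', 'b', 'a', 'a', 'c', 'a'])   # 'a' occurs 4 of 6 times
majority_vote(['a', 'b', 'c'])                  # no element exceeds half
```

The verification pass matters: the first loop always produces a candidate, but only the recount confirms whether it really occurs in more than half of the input.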
Bagging and boosting
• Bagging and Boosting help us improve combinations of algorithms and foster
learning by enhancing the characteristics of the available data sets.
• Bagging reduces the variance of the result.
• Boosting reduces bias.
Random forest
• A very effective ensembling model is the Random Forest, which is used
extensively in many predictive applications.
• Random forest is composed of many decision trees.
Notes:
References
• Hassani, H., Huang, X., MacFeely, S., & Entezarian, M. R. (2021). Big Data and the United
Nations Sustainable Development Goals (UNSDGs) at a glance. Big Data and Cognitive
Computing, 5, 28. https://doi.org/10.3390/bdcc5030028
• Kumari, S., & Kumar Singh, S. (n.d.). Machine learning-based time series models for
effective CO2 emission prediction in India. Environmental Science and Pollution
Research International. Retrieved December 15, 2022, from
https://pubmed.ncbi.nlm.nih.gov/35780266/
• Shastri, S., Singh, K., Deswal, M., Kumar, S., & Mansotra, V. (2022). COBID-Net:
A tailored deep learning ensemble model for time series forecasting of COVID-19.
Spatial Information Research. Retrieved December 15, 2022, from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8196282/
• What is machine learning? (n.d.). United Nations. Retrieved December 15, 2022, from
https://unite.un.org/sites/unite.un.org/files/emerging-tech-series-machine-learning.pdf
@UNSSC
facebook.com/UNSSC
linkedin.com/company/unssc
www.unssc.org