
THE SCIENCE OF PREDICTIVE ANALYTICS
Module Workbook

Contents

Introduction
Module performance objectives
Lesson 1: Time series in predictive analytics
    Time series predictive models
    Seasonality and trend in time series
    Forecasting time series
    Lesson Summary
Lesson 2: Machine learning for predictive modeling
    Understanding machine learning
    Supervised machine learning
    Decision trees, k-Nearest Neighbors
    Unsupervised machine learning
    Lesson Summary
Lesson 3: The wisdom of the crowds
    Majority Vote
    Bagging and Boosting
    Random forest
    Lesson Summary
Introduction

Module performance objectives

After completing the lessons, you should be able to:

• Define time series, identify their components, and use them in predictions;
• Understand the science of machine learning, its different approaches, and its application scenarios;
• Describe ensembling models such as majority vote, bagging and boosting, and random forest.

Lesson 1: Time series in predictive analytics

Time series predictive models

Time series and Resampling


Time series are sequences of observations at regular time intervals. Forecasting time
series through mathematical modelling is an essential technique in predictive analytics to
understand what will happen in the future.
Resampling
Time series data contains observations sampled at a certain time interval. However, we can change the frequency of the observations; this transformation is called resampling.
The two main reasons to resample (a code sketch follows this list):
• Data: To reduce the amount of data used in the analysis.
• Frequencies: To evaluate how the data behaves at different frequencies.
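As a rough illustration, resampling is a one-line operation in pandas. This is a minimal sketch using made-up monthly values, not the actual Canadian birth data discussed below:

```python
import pandas as pd

# Hypothetical monthly series standing in for births per month.
idx = pd.date_range("2016-01-01", periods=60, freq="MS")  # 2016-2020, month start
births = pd.Series(range(30000, 30060), index=idx)

# Change the frequency: monthly -> quarterly and monthly -> half-yearly,
# summing the months that fall inside each new period.
quarterly = births.resample("QS").sum()
half_yearly = births.resample("6MS").sum()
```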

Many governments and public organizations publish their data following open-data guidelines, policies and directives, so it is publicly accessible to any interested citizen. Let's explore an example from the Canadian Vital Statistics birth database, from which we have obtained the number of births per month in Canada from 2016 to 2020.

We then visualize the data as a line graph.

1. Pattern: The time series seems to follow a repeating pattern, with similar maximum and minimum values each year.
2. Year 2020: However, the year 2020 is different, possibly because of the impact of the COVID-19 pandemic.
3. To be more specific: the figure shows a pronounced low (a trough) in births in January, while seven months later we find the opposite situation: in August there is a significant peak in the number of births. After this positive peak, the curve decreases once again.
4. Year 2021: With this information, we can predict that in 2021 the curve will tend to increase from January until August, and then the birth numbers will decrease during the second half of the year.

Modeling means extracting information from the existing series and then using this information to forecast future behavior. For example, if the past series shows clear patterns, we expect those patterns to repeat in the future.

Let’s see the graphs resampled per quarter and per semester.



By resampling, the line is smoothed to obtain visualizations that enrich our understanding
of the series. For example, the effects of the pandemic in 2020 are more evident in the
semester resample.
Sometimes, by applying resampling, we observe new features that may help model predictive systems.
However, resampling needs to be performed carefully, as it discards some of the initial information. When we resample data, we need to inform the audience of our analysis that the raw data has been modified.

Seasonality and trend in time series

Components of a time series


Decomposition
Previously we have visualized the patterns followed by the values of a variable over time.
We can decompose a time series mathematically by separating its distinct components.
A decomposition of the time series allows us to build an abstract model that facilitates the
analysis of the series and its forecasting.

Four components of a time series


The components of a time series are: level, trend, seasonality, and noise/residual. Not
every time series combines all these four components, but level and noise are universal.

1. Level: The average value for a specific time period


As an example, we are using the same dataset of births per month in Canada from 2016 to 2020. With the observed data, we can easily calculate the average value of this time series for the year 2018.

2. Trend: The increasing or decreasing value


A trend can grow, decrease, or show alternating intervals of growth and decrease. In this case, the time series slightly decreases over time, and in the future we might expect further decreases.

3. Seasonality: The repeating short-term cycle


We can observe the pattern of a cyclic event over a fixed period. There can be multiple cycles in the same time series, such as a daily (24-hour) cycle and a yearly (12-month) cycle.

4. Noise/residual: Random variations


We can also observe those random variations after extracting trends and
seasonality.

A time series combines three systematic components (level, trend, and seasonality) with one non-systematic component (noise).

• Systematic: Components of the time series that have consistency or recurrence and can be described and modeled.
• Non-Systematic: Components of the time series that cannot be directly modeled.

Among these four components, trend and seasonality are the most important ones for predictive models, as they determine how we will forecast the variable.
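As a hedged illustration of decomposition, the sketch below builds a synthetic monthly series with a known level, trend, seasonality and noise, then separates the components with statsmodels (the data is made up; in recent statsmodels versions the cycle-length argument is called period):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: level 30000, slight downward trend,
# a 12-month seasonal cycle, and random noise (hypothetical values).
rng = np.random.default_rng(0)
t = np.arange(60)
values = 30000 - 10 * t + 500 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 100, 60)
series = pd.Series(values, index=pd.date_range("2016-01-01", periods=60, freq="MS"))

# Additive decomposition: observed = trend + seasonal + residual.
parts = seasonal_decompose(series, model="additive", period=12)
print(parts.seasonal.head(12))   # the repeating 12-month pattern
print(parts.trend.dropna().head())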

Forecasting time series

Aspects to consider when forecasting time series


There are different aspects to consider when forecasting time series.
1. Types of prediction
• One-step ahead prediction: We predict the next step from the last observation
or a specific step in the future (10h ahead, 2 years ahead, …)
• Multiple-step ahead prediction: We predict multiple steps or a subset of the
future time series

2. Types of time series


• Univariate time series: The time series of a single variable.
• Multivariate time series: When the time series includes multiple variables (they can be considered as parallel univariate time series).

3. Types of forecasting methods


• Linear forecasting methods: They are based on a linear regression of variables.
• Non-linear forecasting methods: These models capture non-linear relationships, often by applying transformations such as logarithms.

A well-known method for forecasting time series is the Moving Averages method. It
estimates the trend cycle, capturing the average change in a data series over time.

This table and its visualization show the values of a 3-year time series with the number of
births per quarter. We use the Moving Averages method to predict future values of births.

The Moving Averages method calculates averages over a window of several periods. It should be tested with different window lengths in order to identify the one that most accurately reflects the values of the actual time series. We then recalculate the time series by using the Moving Averages formula.

This is the calculation sample. Although we could draw the moving averages line already, we recommend using the Centered Moving Averages method. With this additional step, we center the resulting averaged periods.
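A rough pandas sketch of both steps (the quarterly values are made up, not the workbook's table):

```python
import pandas as pd

# Hypothetical quarterly births over three years.
births = pd.Series([24000, 26500, 27800, 25100,
                    23800, 26200, 27500, 24900,
                    23500, 25900, 27200, 24600])

# 4-quarter moving average: the mean of each window of four values.
ma4 = births.rolling(window=4).mean()

# Centered Moving Average: average each pair of adjacent 4-quarter
# means so the result lines up with an actual quarter (a 2x4-MA).
cma = ma4.rolling(window=2).mean()
print(cma.dropna())
```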

Drawing the Moving Average line - The moving averages line offers a smoother view of the
original series. By extrapolating the moving averages line, we can predict future values for
the series.

This method can be applied to longer and more complex time series, predicting a step
ahead using previous data. Here we applied our method to the example of the Canadian
Vital Statistics birth dataset mentioned in previous sections.

Prediction: The new line (red line) is our prediction and has smoothed properties, with
fewer extreme edges.
When should we use Moving Averages? This method works best for time series with relatively low volatility and strong seasonality; it is also widely used to smooth noisier series, such as stock prices.
Lesson Summary

IN SUMMARY:
Time series predictive models
• Time series are a relevant data structure in many scientific fields. Numerous economic, biological, or medical scenarios can be represented by time series.
• Time series are modelled using statistical approaches and sometimes using machine learning.
Seasonality and trend in time series
• The components of a time series are: level, trend, seasonality, and
noise/residual.
• The concepts of seasonality and trend are essential in time series modeling.
Forecasting time series
• There is extensive research on time series modeling, and there are many
predictive models applicable.
• Moving Averages is a simple model that is easy to understand and can be
applied to many situations.


Lesson 2: Machine learning for predictive modeling

Understanding machine learning

Definition – Machine learning


Machine Learning (ML) techniques are tools used in predictive analytics to find underlying
patterns within complex data automatically. ML uses programmed algorithms that analyze
input data to predict output values.
• Traditional Programming: In traditional programming, a computer engineer writes code that details the program's behavior for all the possible data cases that can be used as input. In other words, the code determines all the program outputs. Traditional programming is a deterministic process defined from the start. Machine learning, by contrast, is a type of Artificial Intelligence with a different approach: the program's behaviour is not fixed in advance, but learned from data. Machine learning allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values.


• Machine Learning: A machine learning application learns the internal patterns of the data, generating a capability to identify those patterns in new data elements. The program understands the processing objective by training with data. It is not the computer engineer but the data itself that decides what the program must do.
In machine learning, data comes first, as the value lies in the data. As many
researchers say, data is the new oil.

Machine learning with structured and unstructured data

Structured data
Machine learning can work with data in a standardized format. Data is considered structured when it conforms to a tabular format in which rows and columns are connected. Common examples of structured data are Excel files or SQL databases. Each of these has structured rows and columns that can be sorted.

Unstructured data
Machine learning also works with unstructured data. It is information that either does not
have a predefined data model or is not organised in a pre-defined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts
as well. This results in irregularities and ambiguities that make it difficult to understand
using traditional programs as compared to data stored in structured databases. Common
examples of unstructured data include audio, video files or NoSQL databases.

Neural networks
Furthermore, there is a family of algorithms called neural networks designed to cope with the complexity of vast amounts of unstructured data. These algorithms are much more powerful than traditional machine learning techniques, which is why they form the backbone of deep learning, a subfield of machine learning.

Let’s roll back for a second to understand the comparison better.


- Traditional Machine Learning algorithms perform pattern recognition in structured databases composed of numbers and perhaps text.
- Deep Learning algorithms work on much more challenging tasks, leveraging complex and unstructured data, such as X-ray images.

Deep learning obtains impressive results in fields like game playing (defeating the world champion of Go), biology (discovering how proteins fold), genetics, medical diagnosis, speech recognition, etc.
Neural networks are powerful but challenging to train. Their design is inspired by how the human brain works: neurons are arranged in layers and connect with one another to transmit information.


We apply machine learning techniques to build systems with structured data. But with
unstructured data or highly complex problems, we look into deep learning to obtain better
results.

Two approaches to machine learning


Machine learning is often categorized by how an algorithm learns to become more accurate in its predictions. There are two basic approaches: supervised learning and unsupervised learning. The choice of algorithm depends largely on the type of data to be predicted.

Supervised learning
In this approach, the algorithm learns from labeled data and then infers or predicts labels for unlabeled data. It is mainly used for classification and regression applications.

Unsupervised learning
Here, the program analyzes a large amount of data and identifies similarities between the elements, obtaining clusters or groups of similar elements.

In most unsupervised learning approaches, we must tell the algorithm some information
beforehand, such as the number of classes. However, some algorithms are also able to
identify the best number of classes for the input data.

Supervised machine learning

Two phases: training and prediction


In supervised learning, algorithms work in two phases:
• Phase A: Training
• Phase B: Prediction


Phase A: Training
The training phase defines the machine learning model as follows:
1. First, we perform a feature selection on the dataset.
2. Then, the program learns from these features by looking at each labelled item.
3. Finally, we calculate the accuracy on the test data.
Machine learning extracts information about the features in our dataset and finds correlations and meaningful values to identify the items. The human brain works in the same way; however, our ability to look at hundreds or thousands of items is limited.
Some machine learning algorithms can prioritise some features over others. For instance, a suitable algorithm may determine that shape is the most crucial feature, whereas smell or color is irrelevant.

Phase B: Prediction

When the model is trained, it predicts the class of items that have not been classified before:

• When a Machine Learning algorithm has high accuracy in training and also in prediction, we say that the algorithm has good 'generalization.'
• However, if the accuracy decreases significantly in prediction, the program might have poor 'generalization', possibly because of issues in the training phase, parameter selection, or data input.
Note that only non-linear algorithms can draw non-linear boundaries; not all Machine Learning algorithms can generate non-linear separation lines.

Overfitting and underfitting

Overfitting and Underfitting are two machine learning training problems. Both are related to
generalization, the property that defines how well the classifier works with unclassified
data.


We can see the two extremes:

• Underfitting: underfitting happens when the model is too simple and generates a
solution with low accuracy.
• Overfitting: overfitting happens when the system has been trained for too long,
causing an overly complex line as a boundary between classes.

If a classifier has high accuracy on data that has not been used in training, it has good generalization.
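A hedged illustration with scikit-learn's built-in breast-cancer dataset (the exact scores will vary), comparing a very shallow tree with a fully grown one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, None):  # very shallow tree vs. fully grown tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={model.score(X_tr, y_tr):.2f} "
          f"test={model.score(X_te, y_te):.2f}")

# A very shallow tree may underfit (similar but lower accuracy on both sets);
# a fully grown tree often overfits (near-perfect training accuracy but
# noticeably lower test accuracy, i.e. poor generalization).
```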

Decision trees, k-Nearest Neighbors

Two machine learning algorithms: Decision Trees and k-Nearest Neighbours

There is not a single machine learning algorithm, but many. Each one has strengths and
weaknesses. No algorithm can be considered 'better' than the others.
For this reason, we must test several approaches for a specific application and see which one performs best. This is an exploratory task that can require considerable work.
Two machine learning algorithms for supervised learning:
• Decision Tree
• k-Nearest Neighbours

Decision Tree

The Decision Tree is a simple but highly effective machine learning algorithm.
Below you can see a demo graph of the algorithm.


- Trees: The algorithm builds trees with a series of questions designed to assign a classification.
- Decision node: A decision node shows a decision to be made. It contains a question that splits into subnodes.
- Leaf node/terminal node: The leaf nodes represent the class to which the item is assigned. They don't split into more nodes.
- Chains: The branch/edge represents the answers to the questions, like yes/no, above/below, etc. The attribute value determines the branching, and the tree can be large, with many chained decision nodes if necessary.
- After the tree: After the tree has been built with a training dataset, the items of a new dataset can be classified.

Decision trees, like other machine learning algorithms, require large amounts of data. Otherwise, the resulting model is incomplete.

There are many possible decision trees suitable for solving the same problem. However, several refinements allow the trees to be more compact; one of the measures used is called Gini impurity.
For instance, if a decision node separates the data so that one subnode contains 80% of one class and 20% of the other, that subnode is fairly pure. A bag full of balls of the same color has the lowest Gini impurity index, while a bag full of balls of different colors has a very high Gini impurity index.
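As a worked check, using the standard definition of Gini impurity (the workbook does not state the formula explicitly): for class proportions $p_1, \dots, p_k$ within a node,

$$ G = 1 - \sum_{i=1}^{k} p_i^2 $$

For the 80%-20% node above, $G = 1 - (0.8^2 + 0.2^2) = 1 - 0.68 = 0.32$. A single-color bag gives $G = 1 - 1^2 = 0$ (the minimum), while a 50%-50% two-color bag gives $G = 1 - (0.5^2 + 0.5^2) = 0.5$.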

A variable is significant if it is situated at the top of the decision tree. If we are classifying bulls and horses, horns will be a significant variable, because most bulls have horns and horses do not. In the small sample below, 100% of the elongated fruits are pears and 0% are apples; therefore, this property will sit at the top of the classification tree, as it splits the data with a very low Gini impurity.

Decision trees can form non-linear boundaries and handle both numerical and categorical data. In addition, their simplicity produces results that are easy to understand and interpret, which is an advantage over other, more powerful algorithms.
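A minimal scikit-learn sketch, using the classic built-in iris dataset as a stand-in for the fruit example; the printed rules place the most significant feature at the top of the tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
names = ["sepal length", "sepal width", "petal length", "petal width"]

# Splits are chosen by minimising Gini impurity (scikit-learn's default).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=names))  # the question at each node
print(tree.feature_importances_)               # the most significant variables
```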

K-Nearest Neighbours Algorithm

This approach is based on the fact that items with high similarity are close when
represented in a diagram. With this property in mind, by defining a distance metric
between the items, we can identify groups of items belonging to the same class.

For example, the picture below shows a group of fruits in a two-dimensional figure. We know that the different fruits are grouped together. This is the representation of the training data.
If we want to find the label of a new item, we place this item in the same two-dimensional space. The label of this item is then the one defined for the area where the item lands.
The algorithm is based on this 'closeness' property, where each new point is assigned to a class by comparing it with the training data areas.
In the training phase we define the areas (the coloured areas in the figure):
• First, we start with the training dataset (labeled data), and we find the areas.
• Next, we classify the non-labeled data by determining their class, which is done by looking at the class of the closest neighbours, as sketched below.
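A minimal scikit-learn sketch of the two steps, again using the built-in iris dataset rather than the fruit picture:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Training: memorise the labeled points that define the class areas.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

# Prediction: each new item gets the majority class of its 5 closest
# training points (Euclidean distance by default).
print(knn.predict(X_te[:3]))
print(knn.score(X_te, y_te))   # accuracy on items unseen in training
```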


Some other algorithms

There are many other algorithms, like the Bayesian family of approaches, which use the probabilities of the elements for classification, or highly complex and sophisticated methods like Support Vector Machines, which work by dimensional transformations of the data. We also have the king of machine learning algorithms: Neural Networks.

• Like how human neurons work: Neural Networks are extremely powerful and
replicate, to some extent, how human neurons work. They are responsible for some
important advances in machine learning, like image recognition, language
understanding, and speech recognition.
• Effective with complex data: By using Neural Networks, some scientists are obtaining spectacular advances towards fully intelligent algorithms. Neural Networks are complex and difficult to manage but highly effective with complex data.
• Data ‘greedy’: Neural Networks have some problems. One is that they are data ‘greedy’, meaning that they require large amounts of data, and this impacts the computing resources needed for training. Some neural networks are so large that they are criticized for not being sustainable.
• Lack of explainability: Another problem is that they are unable to explain themselves. For instance, they will classify an apple, but they will not be able to say why.


Unsupervised machine learning

In the previous sections, we analyzed the supervised learning approach and explored two of its most common machine learning algorithms, Decision Trees and k-Nearest Neighbours. Let's now look at the unsupervised machine learning approach.

K-Means algorithm

As in supervised learning, there are several unsupervised learning algorithms. One of the most widely used is the K-Means algorithm.

This algorithm segments a dataset into a pre-established number of classes. We must tell
the algorithm how many classes we want, and the algorithm will segment the data into that
specific number.

In real applications, we compare classifications with different numbers of classes to observe which classification is more accurate.

The algorithm works in the following steps:

1. Number of classes – K: We set an initial number of classes, K.
2. K centers: We randomly place K centers in the dataset.
3. Calculate the class: We calculate the class of each item by proximity to the centers.
4. Reposition the centers: We then reposition the K centers at the centroid of each cluster and establish a new classification. A centroid is the imaginary or real location representing the center of the cluster.
5. Classification is stable: The process stops when our classification is stable.

K-Means is relatively easy to implement and can be used on very large datasets.

However, K-Means might experience problems when the clusters have varying sizes and densities. Another problem is that centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored.
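A minimal scikit-learn sketch with made-up two-dimensional points (the three group centres are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming roughly three groups.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])

# We must tell the algorithm the number of classes we want: K = 3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.labels_[:10])       # cluster assigned to the first ten points
```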

Lesson Summary

IN SUMMARY:
Understanding machine learning
• Machine learning is a set of algorithms that self-learn from data.
• Machine learning algorithms use large amounts of data to train.
Supervised machine learning
• The main difference between supervised and unsupervised learning is the
use of labeled data.
• Supervised learning uses labeled input and output data, while an
unsupervised learning algorithm does not.
Decision trees, k-Nearest Neighbors
• There are many machine learning algorithms. For supervised learning, we
analyzed Decision Trees and k-Nearest Neighbors. For unsupervised
learning, the K-Means algorithm.
Unsupervised machine learning
• Unsupervised machine learning does not require labeled data, and it is used to create groups or clusters of data that are not obvious to the human eye (e.g. customer behavior).

Lesson 3: The wisdom of the crowds

Majority Vote

Wisdom of the crowds


In 1906, at a country fair in Plymouth, England, 800 people participated in a contest to estimate the weight of an ox. Francis Galton (who also created the regression approach) observed that the median of the responses was within 1% of the true weight. By aggregating the inaccurate wild guesses of the people attending the fair, he obtained a very accurate estimate of the animal's weight.
James Surowiecki's book The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations, published in 2004, relates several examples in which decisions taken by a large group are better than decisions made by a small number of individuals.

Majority vote algorithm


It is an algorithm for finding the majority of a sequence of elements using linear time and constant space. In its simplest form, the algorithm finds the majority element, if there is one: that is, an element that occurs in more than half of the input.
There are variants of this approach that weight the votes. The weights can be assigned manually or derived from past experience (if one algorithm has historically been more precise, we weight it more heavily, or we can revise the weights after every application of the program).
We can combine any machine learning algorithms in the voting mechanism; there are no limitations, and it can even be applied to weak learners (not very good classifiers).
In summary, the voting algorithm works with the same training dataset and combines the results of the different classifiers.
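A minimal scikit-learn sketch of hard (majority) voting over three different classifiers, using a built-in dataset; the commented-out weights parameter is where manual weighting would go:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three classifiers trained on the same data; the ensemble predicts the
# class that receives the most votes ("hard" voting).
vote = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("knn", KNeighborsClassifier()),
                ("logreg", LogisticRegression(max_iter=1000))],
    voting="hard",
    # weights=[2, 1, 1],  # e.g. trust the historically more precise model more
)
vote.fit(X, y)
print(vote.predict(X[:3]))
```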

Bagging and Boosting

Variance and Bias


We have reviewed the majority vote, which combines several classifiers and obtains the result via voting. However, we can combine the elements of a classification in other ways.
• Bootstrapping: Divide the original data into different subsets.
• Bagging: Use the subsets of data with the same classifier in parallel.
• Boosting: Use the subsets of data with the same classifier sequentially.
The concepts of bagging and boosting are essential in machine learning because they
help us improve combinations of algorithms and foster learning by enhancing the
characteristics of the available data sets.
To better understand the bagging and boosting techniques, we need to review the concepts of variance and bias.
• Variance: Variance measures how the classifier's results change when it is applied to different datasets. If the results are consistent, the classifier has low variance. If the variance is high, the classifier is too sensitive to the training data (perhaps it is overfitting, or there is noise in the data).

• Bias: Bias is when the result of the classification algorithm is systematically skewed. If a classifier is consistently biased to one side, this may point to underfitting or a poorly suited classifier. It is a failure to properly capture the trends in the dataset.
The Science of Predictive Analytics

Algorithms can suffer from both high bias and high variance simultaneously. In the
figure below, we can see the difference between variance and bias simulating a target.


Bagging and boosting


Bagging and Boosting are two types of Ensemble Learning. By using different ways to
combine several algorithms together, they both form better estimation models compared
with a single model.

In bagging, we generate several random subsets of the data and apply a classifier to each subset in parallel. The result is obtained through a majority voting algorithm that aggregates all these weak classifiers.
NOTE: Bagging reduces the variance of the result and is very effective.
• Step 1: Sample with replacement to create several subsets of the original data.
• Step 2: Create a base model for each subset.
• Step 3: Combine all the base models to form a final model.
• Step 4: Make the final predictions by combining the predictions from all the models.

In boosting, we apply a classification algorithm sequentially, each time focusing on the data that was wrongly classified. The first classifier works with the original data, and each subsequent classifier concentrates on the data points misclassified in the previous step. We can use a pre-determined number of classifiers, or continue until a threshold is reached or the accuracy stops improving. A sketch comparing bagging and boosting follows the steps below.
• Step 1: Run the original data with an equal weight on all data points to generate the first model.
• Step 2: Identify the wrongly classified data points and assign them a higher weight. Decrease the weights of the correctly classified data points.
• Step 3: Run the weighted data to generate the second model, which tries to correct the errors of the previous model.
• Step 4: Repeat steps 2 and 3 until the required results are reached.
• Step 5: Make the final model as a weighted mean of all the models.
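A minimal scikit-learn sketch contrasting the two ensembles around the same weak learner (a one-level decision tree); note that scikit-learn versions before 1.2 call the estimator argument base_estimator:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
weak = DecisionTreeClassifier(max_depth=1, random_state=0)  # a weak learner

# Bagging: the same classifier on bootstrap subsets, combined by voting.
bagging = BaggingClassifier(estimator=weak, n_estimators=50, random_state=0)

# Boosting (AdaBoost): the same classifier applied sequentially,
# re-weighting the points the previous model misclassified.
boosting = AdaBoostClassifier(estimator=weak, n_estimators=50, random_state=0)

for name, model in (("bagging", bagging), ("boosting", boosting)):
    print(name, cross_val_score(model, X, y, cv=5).mean())
```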

Random forest

Random forest is a classification algorithm composed of many decision trees, which it merges together to get a more accurate and stable prediction. It is one of the most widely used machine learning algorithms because it is simple to understand and highly effective. The algorithm has two steps in the training cycle.

Step 1: We create different random subset samples from the complete training set. The subsets are created with replacement, which means one element can appear in several samples.
Step 2: We apply a different decision tree classifier to each one of the subset samples.

To predict, we combine the results of all the trees using a majority vote, obtaining a classification for each item.

1. Classification: The classification cycle combines the different trained trees.
2. Final result: The final result is obtained by combining the trees' results with a majority vote algorithm, as in the sketch below.
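A minimal scikit-learn sketch of both training steps and the voting prediction, on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 100 trees, each trained on a bootstrap sample drawn with replacement;
# predictions are combined across trees by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(forest.predict(X_te[:3]))   # classes for three unseen items
print(forest.score(X_te, y_te))   # overall accuracy
```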
Lesson Summary

IN SUMMARY:
Majority Vote
• It has been shown statistically that by combining several algorithms, we can obtain higher accuracy than by using a single algorithm. This is the principle of the ‘wisdom of the crowds’.
• Ensembling involves the use of different algorithms, where the results are aggregated.
• Majority vote is an algorithm for finding the majority of a sequence of elements.
• The algorithm finds a majority element: one that occurs in more than half of the input.
Bagging and boosting
• Bagging and Boosting help us improve combinations of algorithms and foster
learning by enhancing the characteristics of the available data sets.
• Bagging reduces the variance of the result.
• Boosting reduces bias.
Random forest
• A very effective ensembling model is the Random Forest, which is used extensively in many predictive applications.
• Random forest is composed of many decision trees.

@UNSSC
facebook.com/UNSSC
linkedin.com/company/unssc
www.unssc.org