
International Journal of Computer Applications (0975 – 8887)

Volume 154 – No.3, November 2016

Football Match Winner Prediction


Saurabh Vaidya, Harshal Sanghavi, Kushal Gevaria
Department of Computer Engineering
Dwarkadas J. Sanghvi College of Engineering
Mumbai, India

ABSTRACT
Prediction of football match outcomes should follow approaches that are more generalized. Hence, in this project we predict the outcomes of English Premier League matches based on historical match data, using machine learning algorithms. We gathered data from the past 10 seasons and extracted features such as form, goals scored and conceded, and shots ratio. Our computation of the form feature differs from what has been prevalent till now: more focus is given to gaining insight and associating a deeper and better meaning with the form of a team. Basic features such as shots ratio and goals scored are combined to create an attacking quotient feature. We use Logistic Regression and implement a voting algorithm between Random Forest and Naive Bayes classifiers to achieve an accuracy between 47% and 50%, with a mean absolute error of 0.37.

Keywords
Machine learning; Data mining; Prediction system; Football; Classifiers; Knowledge discovery database system

1. INTRODUCTION
The 2010 FIFA World Cup saw a display of sheer brilliance by Paul the Octopus, who predicted the winner correctly an astonishing 8 times when he was tested. There are other prediction techniques: some predict the outcome after half-time, while others predict outcomes on an ongoing basis; however, their accuracy is not good. So, for the love of the game and the eagerness to learn new prediction techniques, we have attempted to devise our own method to predict the outcome of a football match.

The problem of predicting a football match winner is a multi-class classification problem with three classes: win, loss and draw. Of these, win and loss are comparatively easy to classify. However, the draw class is very difficult to predict, even in real-world scenarios. A draw is not a favored outcome for pundits or betting enthusiasts.

The English Premier League (EPL) is the most watched football league in the world, with almost 4.7 billion viewers. In this paper, we have chosen the English Premier League for its competitiveness as well as the random nature of its outcomes. For example, in the 2010-11 season, the distribution of wins, losses and draws was 35.5%, 35.5% and 29% respectively. So if we calculate the measure of randomness:

Entropy = −(0.29 ∗ log3(0.29) + 2 ∗ (0.355 ∗ log3(0.355)))
        = 0.72 [3].

This is very close to 1 (state of complete randomness). Thus testing our results on the EPL would only help to justify the generality of our approach.

The major challenge in the task of predicting match outcomes is the extraction and availability of the required data. The data source used in this project is www.football-co.uk. The data has to be scraped and stored in order to extract the features. We collect data over 10 seasons, from 2004-05 to 2014-15, and extract a set of 4 features per team. All the data are scraped with the help of crawlers.

The features generally used are taken in their direct form, such as shots, cards and goals. However, we have attempted to perform some computations to build more complex features. Various machine learning techniques have been used to predict match outcomes, such as clustering, SVM and Bayesian classifiers. We try different techniques to find the one which suits our data sets.

2. LITERATURE REVIEW
The term “Data Mining” was first used around 1990 in the database community. The terms data mining and knowledge discovery are used interchangeably. Data mining is the process of extracting information from a data set and converting it into an understandable structured form [4]. Data mining has many applications, and it is therefore useful in predicting the match winner in football by analyzing previous match data. Data mining with machine learning can make such predictions work efficiently. Arthur Samuel, in 1959, defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed". Machine learning combined with data mining helps us focus more on exploratory data analysis. Based on training data, machine learning makes predictions that depend on the properties learnt from that data [5].

Betting is widely popular across sporting events ranging from cricket and football to tennis and snooker. Douwe Buursma emphasizes effective betting on football matches [1]. Betting is especially popular in football, as it is one of the most famous and most widely watched sports in the world. The betting system works in the following way: the bettor wins money if the bets placed turn out to be correct and loses money otherwise. The money earned or lost is based on the odds determined by the bookmakers. When the probability of an outcome is, say, 0.2, the fair bookmaker's odds would be 5 (the reciprocal of the probability). However, to earn a profit, the bookmakers place the odds at, say, 4.5. Thus, to eliminate this “unfairness”, it is necessary to find accurate probabilities of wins or draws to beat the bookmakers’ odds. Douwe Buursma uses different machine learning classifiers and obtains an accuracy of 55.08% using regression and a multi-class classifier [1].

Nivard van Wijk uses the betting concept to predict a match winner and proposes two models to explain the prediction: the toto-model and the score-model. The paper explains the prediction system mathematically, with all the methods and formulas specified in the article itself. An accuracy of about 53.03% is obtained after comparing all the models proposed in that paper [2].

Ben Ulmer and Matthew Fernandez predicted soccer match results in the English Premier League. They used machine learning techniques including a linear classifier trained with stochastic gradient descent, Naïve Bayes, a hidden Markov model, a Support Vector Machine and Random Forest. The accuracy of each model was calculated to find the better approach. They noted that the results of the first few matches could not be predicted due to the lack of data regarding the form of the team. Of all the methods compared, SVM showed the best result, with 40% - 52% accuracy [3].

3. WORKING OF THE SYSTEM
As seen in the literature survey, different systems had their own sets of parameters and classifiers. The accuracy of a system thus depends on feature selection and computation as well as on the type of classifier used. In order to achieve better accuracy than previous systems, we focus on selecting proper features, applying accurate computations to those features and selecting the best classifier. The prediction system proposed by us has three main parameter components, viz. current form, attacking quotient and defensive quotient.

The current form is calculated keeping in mind two factors: the home/away outcome and the relative position of the two teams. A form matrix is constructed which implements these factors and gives detailed information about the magnitude of a team’s loss or win.

Table 1. FORM MATRIX

Teams   Points   Multiplying Factor   Home loss   Away win
A       0.75     0.15                 -20%        20%
B       0.6      0.25                 -16%        16%
C       0.4      0.4                  -12%        12%
D       0.15     0.6                  -10%        10%

The above table is used to calculate a team’s form (recent 5 matches). The 20 teams are divided equally into 4 groups (A to D) based on their table position. When a team wins, +1 and some extra points are awarded, which depict the magnitude of that win. That magnitude is calculated using the above table. For example, if a team from group A wins against a team of group C (at the home of group C), the points of team A will be

Points = ((+1) + (0.15 * 0.4)) * 1.2

and those of team C will be

Points = ((-1) - (0.15 * 0.4)) * 0.88

Finally, the points of the 5 most recent matches are added to generate a collective form.

Two main aspects of a football game are attack and defense. Comparing these two quotients for the two teams thus gives us an intuition about which team is better, both attack-wise and defense-wise. The attacking quotient is computed using the following features: shots on target and the shots on target/goals ratio. These two features signify how good the team is in terms of attack. The defense quotient is computed using the features successful tackles and intercepted passes. These signify the strength of the defense.

After feature selection and computation, the next task is selecting the classifier to be used. Initially we used Logistic Regression to classify the data set; however, it classified only 2 classes and not the 3rd one.

On plotting the dataset on a graph, we got the following result:

Fig. 1: Form v/s Form Graph

As we can observe, the dataset is very sparse, and hence using decision trees and Naïve Bayes classification would yield better results. Hence, the next algorithm that we implemented is the Vote algorithm. This algorithm uses the best outcomes of all the listed algorithms and generates a cumulative outcome. We used the Random Forest and Naïve Bayes classification algorithms. This algorithm was able to classify the 3rd class, which was not possible using any other algorithm.

The following is our system architecture:

Fig. 2: System Architecture

As seen in the architecture, we extract all the required features from a data source and compute the above-mentioned parameters such as form and the attack and defense quotients. The classifier system gives us a value that determines the class to which the output belongs. This output is then approximated and mapped to defined outputs (1 for a win, 0 for a loss, and 0.5 for a draw). The final output is a list of predicted outcomes for a set of matches.

4. EXPECTED OUTCOME
We collected data from various websites and data sources using different scraping tools. We generated a mathematical model to represent the data in the format required by the algorithms. The dataset was then divided in the ratio 80:20 (training: testing). We achieved 49.37% accuracy using the Logistic Regression algorithm; its confusion matrix is given in Table 2.
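To make the form computation of Section 3 concrete, the following is a minimal Python sketch of the worked example from Table 1 (a group A team winning away at a group C team). The function names `away_win_points` and `collective_form` are hypothetical, and the generalisation beyond the single A-versus-C example (winner's multiplying factor times loser's points, each team scaled by its own away-win or home-loss percentage) is an assumption inferred from the two formulas given in the text, not a rule stated by the paper.

```python
# Table 1 (form matrix): per-group points, multiplying factor, and the
# home-loss / away-win adjustments written as fractions (-20% -> -0.20).
FORM_MATRIX = {
    "A": {"points": 0.75, "factor": 0.15, "home_loss": -0.20, "away_win": 0.20},
    "B": {"points": 0.60, "factor": 0.25, "home_loss": -0.16, "away_win": 0.16},
    "C": {"points": 0.40, "factor": 0.40, "home_loss": -0.12, "away_win": 0.12},
    "D": {"points": 0.15, "factor": 0.60, "home_loss": -0.10, "away_win": 0.10},
}


def away_win_points(winner_group, loser_group):
    """Return (winner_points, loser_points) for an away win.

    Assumption: generalised from the paper's single A-beats-C example.
    """
    winner = FORM_MATRIX[winner_group]
    loser = FORM_MATRIX[loser_group]
    # Magnitude of the result: 0.15 * 0.4 in the paper's example.
    magnitude = winner["factor"] * loser["points"]
    winner_points = (1 + magnitude) * (1 + winner["away_win"])  # * 1.2 for A
    loser_points = (-1 - magnitude) * (1 + loser["home_loss"])  # * 0.88 for C
    return winner_points, loser_points


def collective_form(match_points):
    """Add up the points of the 5 most recent matches (the last 5 items)."""
    return sum(match_points[-5:])


a_points, c_points = away_win_points("A", "C")
print(round(a_points, 4), round(c_points, 4))
```

For the paper's example this evaluates ((+1) + (0.15 * 0.4)) * 1.2 for team A and ((-1) - (0.15 * 0.4)) * 0.88 for team C. Home wins and draws are not covered, because the text gives no formula for them.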


Table 2. Confusion Matrix of Logistic Regression

               Predicted Win   Predicted Loss   Predicted Draw
Actual Win     268             32               1
Actual Loss    135             57               0
Actual Draw    138             27               0

As we can see from the confusion matrix, Logistic Regression classifies only 2 classes and just 1 instance of class 3. Hence, we used a different algorithm, Vote, which selects the best results of multiple algorithms. Here, we have used the Random Forest and Naïve Bayes classification algorithms for voting. The accuracy achieved is 47.11%, and below is the confusion matrix:

Table 3. Confusion Matrix of Vote Algorithm

               Predicted Win   Predicted Loss   Predicted Draw
Actual Win     235             52               14
Actual Loss    114             66               12
Actual Draw    112             44               9

Although this algorithm is not as accurate as the previous one, it still classifies the 3rd class, and hence there is a compromise between accuracy and classification of all classes.

5. CONCLUSION AND FUTURE SCOPE
Thus, it is seen that the draw case reduces the accuracy of predicting the remaining two classes. It is observed that by removing the draw instances, accuracy can be increased up to 65%. Logistic Regression fails to classify the draw class, so in order to achieve generality, the voting algorithm is preferred. Availability of more features that help in predicting the draw class would improve the accuracy. Also, algorithms suited to sparse data, such as decision trees and boosting algorithms, may increase the accuracy.

6. REFERENCES
[1] Douwe Buursma; Predicting sports events from past results, University of Twente, 2011.
[2] Nivard, W. & Mei, R. D. Soccer analytics: Predicting the outcome of soccer matches. (Master thesis: UV University of Amsterdam), 2012.
[3] Ben Ulmer and Matthew Fernandez; Predicting Soccer Match results in the English Premier League, cs229, 2014.
[4] Data mining [Online]. Available: https://en.wikipedia.org/wiki/Data_mining
[5] Machine Learning [Online]. Available: https://en.wikipedia.org/wiki/Machine_learning
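As a cross-check of the accuracies quoted in Section 4, the short Python sketch below recomputes them from the counts in Tables 2 and 3. The helper names `accuracy` and `recall` are illustrative and not taken from the paper; rows are assumed to be actual classes and columns predicted classes, in the order Win, Loss, Draw, as the table headers indicate.

```python
# Rows: actual Win, Loss, Draw. Columns: predicted Win, Loss, Draw.
LOGISTIC = [[268, 32, 1], [135, 57, 0], [138, 27, 0]]   # Table 2
VOTE = [[235, 52, 14], [114, 66, 12], [112, 44, 9]]     # Table 3


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, cls):
    """Fraction of actual instances of class `cls` that were predicted as `cls`."""
    return matrix[cls][cls] / sum(matrix[cls])


DRAW = 2
for name, matrix in (("Logistic Regression", LOGISTIC), ("Vote", VOTE)):
    print(f"{name}: accuracy {accuracy(matrix):.2%}, "
          f"draw recall {recall(matrix, DRAW):.2%}")
```

Both tables contain 658 test instances. The Table 3 counts reproduce the reported 47.11%, while the Table 2 counts give 325/658, about 49.39%, marginally different from the 49.37% quoted in the text. The draw recall (0 of 165 for Logistic Regression, 9 of 165 for Vote) quantifies the trade-off between overall accuracy and coverage of the draw class.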

IJCATM : www.ijcaonline.org
