International Journal of Forecasting 40 (2024) 302–312


Forecasting football match results using a player rating based model

Benjamin Holmes a,b, Ian G. McHale a,∗
a Centre for Sports Business, University of Liverpool Management School, UK
b Department of Mathematics, University of Liverpool, UK

Article info

Keywords: Sports forecasting; Football; Betting; Rating; Ranking

Abstract

The paper presents a model for forecasting the results of football matches, which takes into account the abilities of the players on each team. The advantage of this approach is that the dynamic nature of team strengths is incorporated into the model directly. We test our model against the bookmaker’s predictions and in a Kelly-type betting strategy applied to the pre-match win/draw/loss market. The new model results in significant positive returns to betting.

© 2023 The Authors. Published by Elsevier B.V. on behalf of International Institute of Forecasters. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

∗ Corresponding author. E-mail address: [email protected] (I.G. McHale).
https://doi.org/10.1016/j.ijforecast.2023.03.002

1. Introduction

Models for predicting the outcomes of football matches use historical information on the teams competing to obtain team ratings. These estimated team ratings are then used to generate estimated probabilities of the result (win, draw, loss) or scoreline (0-0, 1-0, 0-1, etc.). For example, arguably the most famous of all models for football match scorelines, the Dixon and Coles (1997) model (which is itself based on Maher, 1982), uses historical information on goals scored and conceded by teams to estimate team attack and defence strengths.

Much of the literature on forecasting models in football has focused on allowing the estimated strengths to vary with time. For example, Dixon and Coles (1997) apply a down-weighting in the likelihood function so that matches played further in the past influence a team’s estimated strength less than matches played more recently; Baker and McHale (2015) assume team strengths vary deterministically over a long time period; and Crowder et al. (2002), Owen (2011), and Koopman and Lit (2015) adopt models that allow the strengths to vary stochastically from match to match.

Although work has been done on allowing team strengths to vary, little attention has been paid to why team strengths vary and the physical mechanism that drives these variations in strengths from match to match and season to season. It seems likely that the leading cause of the dynamic nature of team strengths is that the identity of the team members varies, and the changing quality of these players means the team strengths themselves change.

In this paper, we focus not on estimating teams’ ratings to forecast future results but on using player ratings to forecast team results. In doing so, we deal directly with the mechanism that causes team strengths to vary. Consequently, our modelling framework results in a model that, despite its simplicity, performs well both in terms of forecasting accuracy and when compared with the betting market.

The paper is structured as follows. In the following section, we review the recent literature on forecasting models in football. Section 3 describes the data we use. Section 4 then introduces our player ratings system. Models estimating the level of interaction between two opposing players are introduced in Section 5. Section 6 presents several models for forecasting results using the Skellam distribution for modelling goal difference. The results of out-of-sample predictions and a betting simulation are detailed in Section 7. Finally, Section 8 concludes the paper with a summary of our findings and thoughts for future work.

2. Recent literature

Since Maher (1982) and Dixon and Coles (1997), many published models for forecasting the scorelines (and/or results) have continued to use the same basic specification. Team attack and defence strengths are estimated, and a team’s attack strength interacts with the opposing team’s defence strength and vice versa. The beauty of this specification cannot be overstated. It represents the reality of football: the attackers on one team interact with the opposition’s defending players. Boshnakov et al. (2016) follow the lead of Maher (1982) in their model specification and use a bivariate Weibull count distribution as the underlying probability distribution for the counts of goals.

The Elo rating system, which has been used to model football (see, for example, Hvattum and Arntzen (2010)), estimates team strengths based on previous results and includes a method for updating the team strengths as new results are recorded. The pi-ratings of Constantinou et al. (2012) and the GAP ratings of Wheatcroft (2020) work similarly, with team ratings updated as new information is recorded.

Following the ‘Soccer Prediction Challenge’ (Dubitzky et al., 2019), a flurry of papers adopting machine learning techniques were published. Berrar et al. (2019) ‘won’ the competition with an ensemble of gradient-boosted trees. But perhaps most noteworthy is their conclusion that incorporating domain knowledge in forecasting models for football is a key driver of forecasting success. Hubacek et al. (2019) came a close second using a combination of rating models for teams, including pi-ratings, Elo and Google PageRank. Constantinou (2019) performed well using the pi-ratings, as did Tsokos et al. (2019), who used a Poisson model with scoring intensities allowed to vary with time according to an INLA process. In our round-up of machine learning models, we mention da Costa et al. (2021), who estimated the probability of both teams scoring using machine learning classifiers but notably used team-level variables.

Despite the efforts of the machine learning community, the marginal gains in terms of predictive accuracy are limited. For example, the best-performing model in the ‘Soccer Prediction Challenge’ achieved an accuracy of 53.88%, whereas the worst (of the serious entries) had an accuracy of 50.49%. As Berrar et al. (2019) stated, domain knowledge is a key driver of success, and machine learning algorithms have the unattractive property of not representing reality. Like the Maher (1982) and Dixon and Coles (1997) models, our model has the attractive property of representing the reality of how football is played.

Despite the clear benefits, there have been few attempts to utilise a player-based model for forecasting football match results. Kharrat (2016) and Lasek (2019) use player ratings from the popular FIFA video game franchise in forecasting models. Arntzen and Hvattum (2021) utilised the difference in the simple average of the regularised plus-minus player ratings on the two teams as the basis of a forecasting model. Peeters (2018) does not use player ratings to forecast match results. Instead, the model uses a simple average of the crowd-sourced player transfer valuations from Transfermarkt.com for the two teams, and its predictions are found to outperform the team-based rating model used for comparison. A major contribution of the model we propose here is that we do not use simple averages of player ratings on the two teams as the basis for generating the forecasts. Instead, we build a model to mimic how each player on one team interacts with each player on the opposition team.

3. Data

The data requirements of our model are non-trivial, which would be the case for any player-based model. Three data sources are required, covering the player ratings, match event, and odds data. Each data set was obtained for all seasons from 2013/14 to 2020/21. All processing of data and subsequent modelling was performed using the R programming language (R Core Team, 2022).

3.1. WhoScored player matchday ratings

Our modelling framework requires individual player ratings of the players on the pitch (the line-ups and identity of players on each team are announced a minimum of 30 minutes before a match and often known well in advance) as inputs into a model for the results of matches. We collected match performance ratings published by WhoScored.com. In total, we used 1,505,177 individual ratings attributed to 24,167 unique players. These ratings spanned 14/07/2013 to 31/05/2021.

3.2. Match event data

Match event data describes all of the actions (shots, passes, tackles, interceptions, etc.) within a match and is becoming more commonplace in football literature. We use such event data as part of a series of multinomial models to estimate the level of interaction between two opposing players (which will be introduced in Section 5). InStat provided the match event data.

For the multinomial models, we required all defensive actions (aerial duels, blocks, ground duels, and interceptions) and all shots. In addition, we required the playing positions of players and the formations of both teams. We ensured that position or formation changes during a match were accounted for.

The final forecasting model uses matches within the top five European leagues from seasons 15/16 to 20/21 (this will be discussed further in Section 6.1). The multinomial models are based on actions within the top five leagues from 13/14 to 14/15 to ensure our predictions are fully out-of-sample. The final dataset included 764,712 defensive actions and 108,286 shots.

Match information such as the results, scoreline, identities of teams and players, and formations of the teams was also provided within the InStat data. Further, details of any changes in team formation occurring during a match, and the timing of the change, were recorded.
3.3. Odds data

Finally, historical betting odds were used to test the predictive capabilities of our forecasting model. Home-win, draw, and away-win odds for Bet365 (a bookmaker) were obtained from football-data.co.uk.

4. Player ratings: League-adjusted WhoScored ratings

Before describing the modelling framework, we present the player ratings system we use as the forecasting model’s input.

The basis of our player-ratings model (which will be described in more detail in Section 5 and Section 6) is the set of matchday ratings published by WhoScored.com. WhoScored publishes performance ratings for every player within a match based on their in-match actions. Although the methodology for calculating the ratings is not fully in the public domain, the general, top-level concept is described on the WhoScored website.¹ To summarise, a player in a match starts with a rating score of 6. As the game progresses, a player receives points for actions deemed to impact team performance positively and is penalised for actions that have been judged to have a negative impact on the team. Players can earn a maximum match performance score of 10. There have been 1084 instances where a player has received a 10. The famed ‘‘MSN’’ trio, Lionel Messi, Luis Suárez and Neymar, hold the records with 52, 20, and 22 perfect ratings, respectively. The unfortunate recipient of the lowest recorded rating is Oier Olazábal, who was the goalkeeper for Granada when they lost 9-1 to Real Madrid in 2015, resulting in a score of just 1.89. The resultant ratings are popular amongst fans and the media. Fig. 1 displays the histogram of ratings achieved by players. We separate players by whether they started in the match or came on as a substitute.

¹ See https://www.whoscored.com/Explanations

Fig. 1. Histogram of the raw WhoScored ratings achieved by all players. Players are separated by whether they started the match or not.

By taking an average of the WhoScored match performance ratings for a player, one obtains an idea of how good the player is and how they might be expected to perform in future matches. Let a player’s raw WhoScored rating (RWS) be the average rating they achieved over two years. Table 1 shows the highest RWS ratings of individual players throughout the data.

Table 1
Top ten WhoScored average ratings achieved by players over a two-year rolling window (RWS). Players must have made at least ten appearances and can only appear once in the table.

Player              Date        Team            League                      RWS
Neymar              09/04/2019  PSG             France Ligue 1              8.821
Lionel Messi        19/12/2018  Barcelona       Spain LaLiga                8.698
Hakim Ziyech        12/02/2020  Ajax            Netherlands Eredivisie      8.429
Carlos Vela         20/08/2020  Los Angeles FC  USA Major League Soccer     8.394
Cristiano Ronaldo   02/09/2016  Real Madrid     Spain LaLiga                8.240
James Tavernier     07/12/2020  Rangers         Scotland Premiership        8.174
Kylian Mbappé       05/12/2020  PSG             France Ligue 1              8.149
Robert Lewandowski  23/05/2021  Bayern          Germany Bundesliga          8.098
Luuk de Jong        19/09/2019  Sevilla         Spain LaLiga                8.073
Zlatan Ibrahimovic  21/02/2017  Man Utd         England Premier League      8.026

Although some high-profile and widely accepted top players offer reassurance that the WhoScored ratings are meaningful, there are some unexpected names in Table 1 (namely Carlos Vela, James Tavernier, and Luuk de Jong). This highlights three potential problems with the WhoScored ratings and with taking a simple average of the match performance ratings. First, it appears that the methodology does not adjust the match performance ratings for the league’s quality (the quality of the players within a league), so that performances in different leagues are not directly comparable. As such, an adjustment to the WhoScored player ratings is needed to account for the strength of the league and the players within that league.

Second, some players appear in a small number of games. Taking a simple average of their performance ratings to estimate how they can be expected to perform in the future is likely to result in volatile, unrealistically high or low average ratings. Third, it is likely that more recent performances by a player are more relevant to how that player is expected to play than performances further in the past.

These problems can be addressed in a regression model, which we use to generate ‘adjusted WhoScored ratings’ (AWS). The dependent variable equals the ‘raw’ WhoScored match performance rating. The covariates include dummy variables for the player and league and a home indicator to allow for home advantage. To be explicit, suppose we observe $y_1, \ldots, y_N$ WhoScored ratings. For observation $i$, let $p(i)$ denote the player who achieved that rating, let $l(i)$ denote the league it was achieved in and let $h(i)$ indicate whether the player was competing at their home ground. Then our model can be written as

$$y_i = \alpha_0 + h(i)\,\alpha_1 + \beta_{p(i)} + \gamma_{l(i)} + e_i, \qquad (1)$$

where $e_i \sim N(0, \sigma^2)$ and $\beta$ and $\gamma$ are the estimated ratings of each player and league, respectively, whilst $\alpha_0$ is the intercept, and $\alpha_1$ represents a home advantage parameter.

To account for players with small numbers of games, we shrink the ratings towards the average rating by adding ‘fake’ games. In these fake games, we assume a player competed in a match within their current league. They receive a rating equal to the average within their current league, and we set the home indicator to 0.5 (which means these games are effectively played at a neutral venue). By increasing the weight of these pseudo-observations, we can adjust the level of shrinkage. Letting $\omega$ denote the weight, when $\omega = 0$, the model has no shrinkage. As $\omega$ increases, the level of shrinkage increases. The weight $\omega$ is itself a hyper-parameter that must be tuned.

To account for changing player ability and form, we weight the observations so that match performance ratings further in the past have a smaller effect on the coefficient estimates of the player dummies than more recent match performances. We apply an exponential weighting scheme to observations, as was used by Dixon and Coles (1997) and others since. We include only games played within $\psi$ years before the rating date, where the weight is $\exp(-\phi \cdot t / 3.5)$ and $t$ is the number of days between the performance and the calculation day.

To tune the hyper-parameters $\omega$, $\psi$ and $\phi$, we aimed to minimise the RMSE when predicting a player’s future performance throughout the validation data. We found the values that minimised the RMSE were $\psi = 2.00$, $\phi = 0.0062$ and $\omega = 7.00$. The estimated value of $\phi$ is such that matches one year ago have a weight approximately one-half that of the most recent matches when estimating the player’s adjusted WhoScored rating.

Table 2 shows the resulting top ten players according to the adjusted WhoScored ratings, which we denote by AWS. Due to the shrinkage, there is no need to set a threshold of ten games for the minimum number of matches a player must have played to appear in the table. The list of players is a who’s who of the top footballers, offering strong reassurance that the new adjusted ratings are meaningful. The surprise inclusions in Table 1 have now disappeared from the top 10. Vela and Tavernier were competing in the MLS and Scottish Premiership at the time of their top ratings. De Jong transferred from the Eredivisie on 01/07/2019, shortly before his maximum RWS rating. Consequently, the high RWS ratings achieved by these three players consisted of good performances in relatively easier leagues. The AWS ratings account for the weaker leagues; hence, the adjusted ratings are lower.

Table 2
Top ten Adjusted WhoScored (AWS) ratings achieved by players. There is no minimum number of games required for a player to appear in the table.

Player              Date        Team         League                  AWS
Lionel Messi        19/05/2021  Barcelona    Spain LaLiga            1.799
Neymar              20/02/2018  PSG          France Ligue 1          1.603
Cristiano Ronaldo   07/08/2015  Real Madrid  Spain LaLiga            1.397
Robert Lewandowski  23/05/2021  Bayern       Germany Bundesliga      1.225
Kylian Mbappé       01/11/2020  PSG          France Ligue 1          1.166
Kevin De Bruyne     20/01/2021  Man City     England Premier League  1.106
Eden Hazard         10/05/2019  Chelsea      England Premier League  1.100
Zlatan Ibrahimovic  20/08/2016  Man Utd      England Premier League  1.091
Hakim Ziyech        11/03/2019  Ajax         Netherlands Eredivisie  1.075
Harry Kane          21/01/2018  Tottenham    England Premier League  1.053

A potential problem with the AWS ratings presented in Table 2 is that forward players dominate the list. Indeed, the majority of the top 50 players are forwards. Of course, it may be that the best players in the world are forwards; after all, they attract the highest wages and transfer fees. But it is also possible that the AWS ratings are biased towards forward players. One could use different rating systems in our modelling framework, but as demonstrated by the performance of the forecasting model (see later), the AWS ratings perform well.

As a sense check of the newly adjusted WhoScored ratings, we calculate the average rating of the eleven starting players for each team. Table 3 shows the results. As for the individual players, the identity of the teams making up this table raises confidence that the ratings are meaningful.

Table 3
Highest average AWS and RWS ratings for individual clubs’ starting 11 players.

Team         Date        AWS    RWS
Man City     29/01/2019  0.589  7.212
Barcelona    24/02/2018  0.562  7.410
Liverpool    29/02/2020  0.519  7.174
Tottenham    20/08/2017  0.517  7.144
Man Utd      31/01/2018  0.514  7.154
Arsenal      15/10/2016  0.511  7.228
Real Madrid  21/09/2016  0.477  7.380
Chelsea      03/01/2018  0.468  7.129
PSG          19/01/2019  0.468  7.295
Juventus     29/09/2018  0.459  7.276

An interesting aside to the main topic here is the estimated league strengths. This is an important area of research in itself: when clubs recruit players from leagues other than their own, it is essential to gauge whether a player will be able to play as well in the new league as they have done in the current league. Fig. 2 shows the estimated league adjustments to player match performance ratings (where the second tier of football in England, the English Championship, is the reference league). It is probably no surprise that the English Premier League is the most difficult league throughout the data. Each match performance rating there is worth around 0.25 more than the same score in the English Championship. Another interesting finding is the rise of the French Ligue 1. During the 2016/17 and 2017/18 seasons, ratings in the Championship were worth more than in Ligue 1 (the estimated coefficient was negative). Still, as of the 2021/22 season, a rating in Ligue 1 is worth more than one in the Championship or the German Bundesliga.

Fig. 2. Plot of the league strengths over time. Note that we plot the negative value of the actual estimate, given a more negative value implies the league is harder. The reference league is the English Championship.

5. Including player level ratings in a team results forecasting model

In this section, we present our general methodology for including player-level ratings/metrics in models for forecasting the results of matches. The specification we present has two benefits. First, it replicates the way players on opposing teams interact in matches. Second, the specification can be used in many different models as it produces a single covariate (or, in machine learning terminology, a single ‘feature’).

In any given match, depending on the teams’ formations, an individual player will spend most of the game competing with one opposing player and will compete against other opposing players less frequently.

For example, a team’s left-winger plays in an advanced position on the left. Left-wingers thus compete with the opposition’s right-back more frequently than they will compete with, for example, the opposition’s left-back (on the other side of the pitch). Indeed, in match strategy decisions, a right-back’s primary responsibility is to stop the attacking threat of a left-winger.

A complicating matter is that the level of interaction between players on opposing teams depends on the formations the two teams are playing. A common formation for a team is to play with four defenders, four midfielders, and two strikers (all teams must play with a single goalkeeper). This formation is known as 4-4-2. Another common formation is 4-3-3 (four defenders, three midfielders and three attackers). The level of interaction between the left-winger on the team playing a 4-4-2 formation and the right-back on the opposing team depends on whether the opposing team is playing in a 4-4-2 or 4-3-3 formation.

In a model for match outcomes based on individual player ratings, the level of interaction between players in different positions and on teams with various formations must be correctly represented. To accomplish this, we estimate the proportion of events in which a player in a given position in a given formation (e.g., left striker in a 4-4-2 formation) interacts with each player in given positions on the opposing team, given their formation. We estimate these proportions using two multinomial models explained in the following two subsections.

5.1. Estimating players’ interaction with opposing outfield players

The first multinomial model estimates the level of interaction between opposing outfield players using data on events in which two players are involved (and recorded in the data): tackles (won or lost), blocks, interceptions, and aerial duels. Suppose there are $M$ events indexed by $j = 1, \ldots, M$. Let $pos^a_j$ and $pos^d_j$ denote the positions of the attacking and defending players who were involved in the $j$th action, respectively. Further, let $form^a_j$ and $form^d_j$ denote the formations the attacking and defending teams were using for the $j$th event. The dependent variable is the position of the defending player. The independent variables comprise the attacking player’s position and the two teams’ formations.
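The data layout just described can be illustrated with a short sketch. It is written in Python purely for illustration (the analysis in the paper was carried out in R), and it tabulates raw conditional frequencies rather than fitting the multinomial logit model itself. The event records, the tuple layout, and the function name `interaction_proportions` are hypothetical and are not taken from the InStat data.

```python
from collections import Counter, defaultdict

# Hypothetical event records (illustrative values only), one per defensive action:
# (attacker position, attacking formation, defending formation, defender position)
events = [
    ("LW", "4-3-3", "4-4-2", "RB"),
    ("LW", "4-3-3", "4-4-2", "RB"),
    ("LW", "4-3-3", "4-4-2", "RM"),
    ("LW", "4-3-3", "4-4-2", "RB"),
    ("LST", "4-4-2", "4-4-2", "RCB"),
    ("LST", "4-4-2", "4-4-2", "LCB"),
]


def interaction_proportions(events):
    """Empirical share of defensive actions made by each defender position,
    conditional on (attacker position, attacking formation, defending formation)."""
    counts = defaultdict(Counter)
    for pos_a, form_a, form_d, pos_d in events:
        counts[(pos_a, form_a, form_d)][pos_d] += 1
    return {
        key: {pos_d: n / sum(c.values()) for pos_d, n in c.items()}
        for key, c in counts.items()
    }


props = interaction_proportions(events)
print(props[("LW", "4-3-3", "4-4-2")])  # {'RB': 0.75, 'RM': 0.25}
```

Raw frequencies of this kind are undefined for any combination of position and formations that never occurs in the data, which is one reason a regression model with separate position and formation effects may be preferable to a simple tabulation.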
There are 25 unique outfield positions; thus, 24 logit midfielder (RM). We note that the LST in a 4-4-2 will
models associated with the opponent’s outfield position interact with the opponent’s RST 0.022 of the time. For
are estimated. Each model is estimated relative to a refer- an LW in a 4-3-3, this increases to 0.031, suggesting a
ence category; in our case, the central attacking midfield winger will, on average, play more defensively than a
(CAM) position and separate coefficients are estimated for striker, which accords with intuition. Further, there is a
each logit model. 20.8% chance that the left-winger will attempt a shot.
For instance, the logit model, which estimates the We propose the following metric to measure the dif-
probability the opponent’s right-back (RB) attempts a ference in two teams’ strengths in a match.
defensive action, is ∑∑
( ) ∆= pij (AWSi − AWSj ) (3)
P(posdj = k) i j
log = int RB
+β RB
posdj

RB
formaj
+β RB
formdj
, (2)
P(posd = CAM) where AWSi is the AWS rating of the ith player on the
where k = 1 . . . 25 is an index for the 25 playing positions. home team, and AWSj is the AWS rating of the jth player
When generating predictions from this model, 25 non- on the away team. pij is the weight estimated from the
zero probabilities are calculated. However, only ten out- multinomial models described above. The first summation
field players are possible.2 Consequently, we normalise provides a weighted difference between a player’s rating
these ten values to sum to one. The value pi,j (j ̸ = GK) thus and each of the opposition team’s player’s ratings. The
represents the probability that a defensive action against second summation calculates the first sum for each of the
attacking player i will be attempted by defending position players.
j, which measures the overall level of interaction between
the two positions. 6. Forecasting models

6.1. Data
5.2. Estimating players’ interaction with the opposing goal-
keeper
Having trained our multinomial positional models on
the 13/14 and 14/15 data, we use the remaining data
Similarly, we use data on shot events to determine
(15/16–20/21) for modelling results. This ensures the
the level of interaction between outfield players and the
probabilities generated by the multinomial models are
opposing goalkeepers. This model is needed because goal-
themselves out-of-sample.
keepers do not tend to interact with opponent players
As is common practice, we use the first 80% of the
in the duel-type events used as the basis for our first
15/16–20/21 data for training and the remaining 20% for
model above. This time, the dependent variable is the
testing. The order of the data is maintained so that no
player’s position who has taken a shot. The position of
leakage occurs.
the defending player is always the goalkeeper, so only the
There are several parameters to tune: the hyper-
two teams’ formations are used as independent variables.
parameters of the player ratings model, time-weightings
As in the previous model, we normalise the ten values
in the team-based models we use for comparison (see
associated with players actually on the pitch.
Section 7.1), and optimal thresholds for betting strate-
The value pi,GK represents the probability that a given
gies (see Section 7.2). We split the training set again to
shot will be attempted by position i. This measures the
optimise these parameters and ensure results are fully
level of interaction between i and the opponent’s goal-
out-of-sample, keeping the last 20% as a validation set.
keeper.
Fig. 4 displays these splits graphically.
Consequently, we present fully out-of-sample results
5.3. Examples in all the experiments reported herein. During the Covid
pandemic, football matches were played behind closed
Fig. 3 shows the results of these multinomial models doors. We removed these matches from our analysis as
for two example cases. The first plot shows how a left the home advantage is known to have been distorted
striker (LST), playing on a team in a 4-4-2 formation, during these games where no fans were present (see, for
interacts with each player on the opposition team, also example, McCarrick et al. (2021)). This leaves us with a
playing a 4-4-2 formation. 22.8% of their interactions are sample of 6824 matches between 12th August 2016 and
with the right-centreback (RCB), whilst just 2.3% of their 23rd May 2021. Of this sample, we use the final 20% as
interactions are with the left-striker (LST) on the opposing testing data.
team. We see a 21.1% chance that the left-striker will
attempt a given shot, indicating their interaction with the 6.2. Skellam model
goalkeeper.
The second plot shows that a left-winger (LW) in a Throughout the literature on forecasting in football,
4-3-3 formation has 75.4% of his interactions with the there has been considerable focus on estimating the scor-
opposing right-back (who is playing in a 4-4-2 formation) ing rates of teams. The pioneering idea of Maher (1982)
and 12.2% of his interactions against the opposition’s right was allowing teams to have separate attack and defence
abilities. The scoring rate of the home team is estimated
2 We note that the predicted probabilities of players not on the using the home team’s attack strength and the away
pitch are extremely small. team’s defence strength, and vice-versa for the away
307
B. Holmes and I.G. McHale International Journal of Forecasting 40 (2024) 302–312

Fig. 3. Examples of the player weights for two different scenarios. The defending team is coloured in red.

Fig. 4. Plot detailing how data was split during the three main stages of this work. Black indicates the portion of data used for training models,
whilst grey represents data used for testing.

308
B. Holmes and I.G. McHale International Journal of Forecasting 40 (2024) 302–312

team’s scoring rate. The original model used independent Poisson regressions, and since then, researchers have developed more complex frameworks, for instance, a bivariate-copula Weibull regression (Boshnakov et al., 2016).

Our variables, which represent the difference between teams’ overall strengths, do not fit as naturally in this framework. Instead of modelling the goals scored by each team, we can estimate the goal difference using a Skellam regression. The probabilities of each match outcome are then obtained by summing the relevant goal-difference probabilities.

For game $i$, suppose the goal difference is $\mathrm{GD}_i$, $\Delta^{\mathrm{Start}}_i$ is the weighted difference in the adjusted ratings of the players who start the match, as defined in Eq. (3), and $\Delta^{\mathrm{Sub}}_i$ is the unweighted difference in the substitutes’ adjusted ratings.3 Since the Skellam distribution represents the difference between two independent Poisson random variables, we estimate two coefficients for each independent variable. These coefficients naturally correspond to the scoring rates of the home and away teams, which we have included in the following notation for the model. Consequently, we estimate

$$\mathrm{GD}_i \sim \mathrm{Skellam}(\lambda_{ih}, \lambda_{ia}) \qquad (4)$$

$$\log(\lambda_{ih}) = \beta_{0h} + \beta_{1h} \Delta^{\mathrm{Start}}_i + \beta_{2h} \Delta^{\mathrm{Sub}}_i \qquad (5)$$

$$\log(\lambda_{ia}) = \beta_{0a} + \beta_{1a} \Delta^{\mathrm{Start}}_i + \beta_{2a} \Delta^{\mathrm{Sub}}_i. \qquad (6)$$

3 We use the unweighted difference in ratings for the substitutes as it isn’t known before the game if the substitutes will be used, and for how long they will play. If they are used, it is unknown what position they will play (though they are likely to play in their specialised roles).

The estimated coefficients of this model are displayed in Table 4. Note that the variables were scaled and centred when fitting the model. We see a significant home advantage, and the effect size aligns with past literature. The weighted sum of differences in player ratings is highly statistically significant. The difference in the ratings of the substitutes is also highly statistically significant, with a smaller estimated coefficient. This is to be expected as it will matter less to the outcome of the match than the strength of the starting players on each team.

Table 4
Model results for the Skellam regression model. Variables are standardised to have a mean of 0 and a variance of 1. The associated p-values are given in the right column.

Dependent variable: Goal-difference
Parameter        Coefficient   p-value
β_{0h}           0.3670        0.0000
β_{1h}           0.2493        0.0000
β_{2h}           0.0583        0.0088
β_{0a}           0.0500        0.0686
β_{1a}           −0.3403       0.0000
β_{2a}           −0.0635       0.0269
Observations     5459
Log-likelihood   −10374.17
AIC              20760.34

7. Out-of-sample testing

7.1. Scoring rules

To compare and benchmark models for forecasting football match results against other models, we advocate using scoring rules (see, for example, Johnstone et al. (2013)) and examining returns based on betting.

Since both the adjusted WhoScored ratings and positional weights are novel features of our model, we fit several variations to investigate whether both additions add predictive power.

Let $\Delta^{\mathrm{Start}}_{\mathrm{full}}$ be the sum of differences in the AWS ratings of opposing players, weighted by the level of interaction between them, that is, the variable defined in Eq. (3). Let $\Delta^{\mathrm{Start}}_{\mathrm{adj}}$ be the difference in the average AWS ratings of the opposing teams’ starting players. Similarly, let $\Delta^{\mathrm{Start}}_{\mathrm{raw}}$ be the difference in the average RWS ratings of the opposing teams’ starting players. Finally, let $\Delta^{\mathrm{Sub}}_{\mathrm{adj}}$ or $\Delta^{\mathrm{Sub}}_{\mathrm{raw}}$ be the difference in the average AWS or RWS ratings of the opposing teams’ substitutes, respectively. We consider four models defined as follows:

• Model skellam_full is the Skellam model displayed in Table 4. This uses $\Delta^{\mathrm{Start}}_{\mathrm{full}}$ and $\Delta^{\mathrm{Sub}}_{\mathrm{adj}}$ as covariates.
• Model skellam_adj removes the player interaction weights, thus using $\Delta^{\mathrm{Start}}_{\mathrm{adj}}$ and $\Delta^{\mathrm{Sub}}_{\mathrm{adj}}$ as covariates.
• Model skellam_raw removes the WhoScored rating adjustments, thus using $\Delta^{\mathrm{Start}}_{\mathrm{raw}}$ and $\Delta^{\mathrm{Sub}}_{\mathrm{raw}}$ as covariates.
• Model skellam_team is a team-based model, where we include dummy variables for each team which can take values 1 or −1 if they are the home or away team, respectively. A value of 0 indicates the team is not involved in a match. As with the player ratings, team strengths are updated using all results available before a match.

We can compare the results from these four models to assess whether the adjusted WhoScored ratings improve on the raw average ratings. We can also assess whether the interaction between opposing players improves results and whether having player-based information is better than team-based information.

We include time-weighting in the team ratings-based models and test whether including shrinkage through ‘fake’ games improves the fit. We optimise these parameters through cross-validation on the validation data to minimise the average Brier score.

We report the accuracy and Brier score achieved by the models. The Brier score is the most commonly used scoring rule in forecasting literature. Whilst accuracy is not a proper scoring rule, it is the most intuitive to a reader and is interpretable in and of itself.

Table 5 shows the Brier scores for the different models and the bookmaker implied probabilities. We use the odds available from Bet365 through www.football-data.co.uk and remove the bookmaker’s margin by scaling the implied probabilities to sum to 1.

In terms of accuracy (the proportion of matches in which the outcome with the highest implied probability occurs), the bookmaker performs the best. However, we

note the tiny differences in these figures. A baseline model of predicting a home win in every match, regardless of the teams, results in an accuracy of 45.11%. The marginal gain from going from the simplest model of all available to the best-performing model (the bookmakers) is perplexingly small. Even the ‘best’ performing machine learning model from the Soccer Prediction Challenge (Dubitzky et al., 2019) achieved an accuracy only slightly higher at 53.88% (though, of course, it was achieved on a different dataset so direct comparisons are not possible).

Table 5
Scoring rules for several models used to predict the results of football matches. Also shown are the corresponding results derived from the bookmaker’s (Bet365) implied probabilities.

Model          n      Accuracy   Brier
Bet365         1350   52.74%     0.5877
skellam_full   1350   52.00%     0.5955
skellam_adj    1350   51.85%     0.5957
skellam_team   1350   52.89%     0.5962
skellam_raw    1350   51.33%     0.6029

The bookmaker also achieves the best Brier score. Of our models, skellam_full performs best; the small gap behind the bookmakers provides encouragement. We see that skellam_full improves on skellam_adj, assuring that the opposing player interaction weightings improve performance (albeit only to a small degree). There is a notable improvement when moving from RWS to AWS ratings, justifying the use of our player-rating framework. The most interesting result is that all models utilising our AWS player ratings beat the team-based model.

We note that these results are robust to other scoring rules, for instance: the rank-probability score used throughout the Soccer Prediction Challenge (Dubitzky et al., 2019); and the ignorance score used in Wheatcroft (2021).

7.2. Betting

In the previous section, we showed that our model achieved results similar to the bookmakers according to the Brier score and the accuracy. However, unlike the bookmakers, a bettor does not have to ‘invest’ in every match. We now use the models to bet with and investigate what returns on investment are obtained on the 1X2 (home win, draw, away win) market. In betting against the bookmakers, despite the bookmakers having an advantage built into their odds (the margin or vig), the bettor has an advantage in that they can choose not to bet.

Our investment strategy is based on the Kelly Criterion (Kelly, 1956) and is the same as the one used in Boshnakov et al. (2016). The Kelly criterion is born from a desire to maximise long-run log-utility, and it results in an investment strategy where the bettor invests a fraction $f$ of their overall wealth

$$f = \frac{(b + 1)p - 1}{b},$$

where $p$ is the bettor’s estimate of the probability of an event (e.g. the home team winning the game), and $b$ is the (fractional) odds offered by the bookmaker (where $1/(b + 1)$ can be interpreted loosely as the bookmaker’s implied probability of the event occurring).

Whereas the usual Kelly strategy follows a rolling bankroll updated after each bet, we stake the equivalent of one “unit” multiplied by $f$ for each bet. Effectively we reset our bankroll to one after each bet.

An additional ‘protection’ was also introduced: we restrict ourselves to ‘quality bets’ when the expected value of any bet is above a threshold. For each game, there are three possible events to bet on: home win, draw, and away win.4 For event type $A$, we only bet if

$$\mathrm{EV}(A) = P(A) \times \mathrm{Odds}(A) - 1 > t,$$

where $t$ is a threshold parameter and effectively protects the investment strategy when the bookmaker knows more than the model. As mentioned in Section 6.1, we split our data into training, validation, and testing sets. To determine the optimal threshold, we fit an initial model to the training data (without the validation data). The optimal betting threshold is then determined by finding the maximal return on investment (ROI) on the validation data (which, at this point, is out-of-sample to the training data). Finally, the full model is fit using the training and validation data, and the out-of-sample betting results are calculated using the pre-determined threshold on the test data.

4 To be explicit, this means we may bet on a maximum of three outcomes (in the unlikely situation that all have positive expected value). This also means we may bet on outcomes which are not necessarily the most likely result predicted by the model.

In addition to looking at the returns to investment, we believe it is important to consider the Sharpe ratio as we see in finance. The Sharpe ratio is a measure for calculating risk-adjusted returns and is defined as the rate of return per unit of volatility. Just as in finance, we calculate the Sharpe ratio as the ROI over all bets divided by the standard deviation of the ROI of each individual bet. The result is then annualised by multiplying by $\sqrt{n}$, where $n$ is the total number of bets. A general rule of thumb is that a Sharpe ratio of 1 or higher is considered good (and the higher the better, as the investment achieves higher returns at lower risk).

We compare the results of the modified Kelly strategy with those of using simpler flat staking strategies. In these schemes, we place one unit on the most likely outcome according to the model. As with the Kelly strategies, we find the optimal expected value threshold and report the corresponding results.

It should be noted that the bet set may differ when using flat or Kelly staking strategies. When applying flat stakes, the user bets on the outcome they believe is the most likely. However, in Kelly scenarios, the user places bets based on the expected value of the bet. Consequently, this choice may not be the most probable outcome according to the user.

The results of betting with skellam_full are given in Table 6. Given the literature on forecasting in football, these returns are very promising, especially given the high


Table 6
Results for several betting strategies using the skellam_full model.

Strategy   t        N      Accuracy (%)   Stakes    Profits   ROI (%)   Sharpe
Kelly      0.1866   556    24.10          65.36     7.81      11.96     1.07
Kelly      0.0000   1457   29.44          105.42    6.03      5.72      1.02
Flat       0.1760   199    37.19          199.00    9.04      4.54      0.45
Flat       0.0000   568    45.25          568.00    16.93     2.98      0.59
Flat       none     1350   52.00          1350.00   −32.56    −2.41     −0.87
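The chain behind a Kelly row of Table 6 (Skellam rates from Eqs. (4)–(6), then 1X2 probabilities, then the expected-value filter and stake of Section 7.2) can be sketched in a few lines. This is a minimal illustration rather than the authors’ implementation: the coefficients are those printed in Table 4, while the function names, the truncation at 20 goals, the covariate values, the odds and the threshold are all assumptions made for the example. It also assumes that Odds(A) in the expected-value rule denotes decimal odds, so that b = Odds − 1.

```python
import math

# Coefficients as printed in Table 4 (covariates are standardised).
BETA_HOME = (0.3670, 0.2493, 0.0583)    # beta_0h, beta_1h, beta_2h
BETA_AWAY = (0.0500, -0.3403, -0.0635)  # beta_0a, beta_1a, beta_2a


def scoring_rates(delta_start, delta_sub):
    """Eqs. (5)-(6): log-linear home and away scoring rates."""
    lam_home = math.exp(BETA_HOME[0] + BETA_HOME[1] * delta_start + BETA_HOME[2] * delta_sub)
    lam_away = math.exp(BETA_AWAY[0] + BETA_AWAY[1] * delta_start + BETA_AWAY[2] * delta_sub)
    return lam_home, lam_away


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(lam_home, lam_away, max_goals=20):
    """P(home win), P(draw), P(away win) for GD ~ Skellam(lam_home, lam_away).

    The Skellam variable is the difference of two independent Poisson counts,
    so we sum the joint Poisson probabilities by the sign of the difference.
    The sum is truncated at max_goals (an assumption) and renormalised.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        p_h = poisson_pmf(h, lam_home)
        for a in range(max_goals + 1):
            p = p_h * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total


def modified_kelly_stake(p, decimal_odds, threshold):
    """Stake as a fraction of a one-unit bankroll; 0.0 if EV <= threshold."""
    expected_value = p * decimal_odds - 1.0
    if expected_value <= threshold:
        return 0.0
    b = decimal_odds - 1.0  # fractional odds
    return ((b + 1.0) * p - 1.0) / b


if __name__ == "__main__":
    # Hypothetical match: home side half a standard deviation stronger.
    probs = outcome_probs(*scoring_rates(delta_start=0.5, delta_sub=0.0))
    hypothetical_odds = (2.10, 3.60, 4.20)  # invented decimal odds: home, draw, away
    for label, p, odds in zip(("home", "draw", "away"), probs, hypothetical_odds):
        stake = modified_kelly_stake(p, odds, threshold=0.05)
        print(f"{label}: p={p:.3f} odds={odds:.2f} stake={stake:.3f}")
```

Because the bankroll is reset to one unit after every bet, each stake depends only on that bet’s probability and odds, which is why the function needs no running wealth argument.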

Fig. 5. Plot displaying the ROI that would be achieved using the skellam_full model for betting under the modified Kelly strategy for different minimum expected value thresholds.
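A curve like the one in Fig. 5, and the threshold selection described in Section 7.2, amount to re-settling the same bets under different values of t. The sketch below shows one way this could be done, assuming each candidate bet is a tuple of model probability, decimal odds and whether it won. All names and the toy data are hypothetical, and using the sample standard deviation for the Sharpe ratio is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics


def place_bets(bets, threshold):
    """Return (stake, profit) pairs for bets whose expected value exceeds threshold.

    bets: iterable of (model_prob, decimal_odds, won).
    Stakes follow the modified Kelly rule with the bankroll reset to one unit.
    """
    placed = []
    for p, odds, won in bets:
        expected_value = p * odds - 1.0
        if expected_value > threshold:
            stake = expected_value / (odds - 1.0)  # equals ((b + 1)p - 1) / b
            profit = stake * (odds - 1.0) if won else -stake
            placed.append((stake, profit))
    return placed


def roi(placed):
    """Total profit divided by total stakes."""
    return sum(profit for _, profit in placed) / sum(stake for stake, _ in placed)


def sharpe(placed):
    """Overall ROI over the std. dev. of per-bet ROI, scaled by sqrt(n).

    Needs at least two bets; statistics.stdev raises an error otherwise.
    """
    per_bet_roi = [profit / stake for stake, profit in placed]
    return roi(placed) / statistics.stdev(per_bet_roi) * math.sqrt(len(placed))


def best_threshold(validation_bets, candidates):
    """Candidate threshold with the highest ROI on the validation bets."""
    scored = []
    for t in candidates:
        placed = place_bets(validation_bets, t)
        if placed:  # skip thresholds that filter out every bet
            scored.append((roi(placed), t))
    return max(scored)[1]


if __name__ == "__main__":
    # Invented validation bets: (model probability, decimal odds, won?)
    toy_bets = [(0.5, 2.5, True), (0.4, 2.6, False), (0.3, 4.0, True), (0.6, 1.5, False)]
    for t in (0.0, 0.1, 0.22):
        placed = place_bets(toy_bets, t)
        print(f"t={t:.2f} bets={len(placed)} ROI={roi(placed):.3f}")
    print("chosen threshold:", best_threshold(toy_bets, (0.0, 0.1, 0.22)))
```

In the paper’s procedure the threshold chosen this way on the validation data is then frozen and applied unchanged to the test data; the sketch covers only the selection step.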

number of bets being placed. For example, Koopman and Lit (2015) placed just 50 bets over two seasons.

We find that only the most basic strategy, flat staking with no value threshold, results in losses. Both Kelly strategies performed well, achieving very promising ROIs and Sharpe ratios. We highlight that these results have been obtained on a large number of bets. Whilst flat stakes with t = 0 and t = 0.1760 both achieve positive returns (Table 6), the Sharpe ratio is less than 1 in both cases, indicating more risk than reward.

Fig. 5 shows the relationship between the minimum expected value threshold and the ROI achieved by the skellam_full model under the modified Kelly staking strategy. Also shown, along the top of the chart, is the number of bets placed. The number of bets decreases as the threshold increases, but the ROI increases to very high levels.

8. Closing remarks

In this paper, we have presented a new model for forecasting the results of football matches. The model is a ‘player-based’ model as opposed to the previously published ‘team-based’ models of Maher (1982) and Dixon and Coles (1997). We developed a novel rating framework which adjusts publicly available player matchday ratings to ensure comparability across leagues. Further, we introduced multinomial models to account for the level of interaction between two opposing players, knowing that different formations dictate how often a player will compete against a particular opponent.

Player-based models rely heavily on data but solve the major issue with team-based models. There is no need to worry about time-varying team strengths: the mechanism which causes the dynamics is modelled directly, that is, the changing line-ups of the teams and the changing short-term form of the players. Admittedly, the model is data-hungry, but databases of player ratings now exist, and access to these should become increasingly easy in the future.

We have demonstrated the goodness-of-fit of the model. Scoring rules suggest the model performs very well compared to bookmakers. Even when we perform the sternest test of all forecasting models, examining the returns to betting, the results are positive to the extent that we achieve positive returns to betting on the 1X2 market.

Our results have implications in economics studies of market efficiency and the practice of trading in football. For example, the player-based model may reduce, at least to some extent, the reliance of bookmakers on expert traders to adjust predictions from a team-based model in light of information about the actual line-up of players, say when a star player is injured. Currently, traders are typically required to adjust model probabilities subjectively. Our player-based model does this automatically.

Future work on this type of model is promising. One could, for example, model the interactions of players on the same team. Football fans often believe some players play well together and are greater together than the sum of their abilities. A model including some interaction between players on the same team would be able to identify whether this was true. Another area for potential improvement of the model is to use ‘better’ player ratings. Here we use WhoScored ratings, but these ratings may be weak. For example, there may be a bias towards forward players in the WhoScored ratings (given that the top ten are exclusively forwards). One could even use this model to rate the player ratings themselves. For example, one rating of players is their pass completion percentage. This could be used as the metric feeding the forecasting model (instead of the WhoScored rating), with the model’s performance then measuring the usefulness of players’ pass completion percentage as a predictor of future team performance. Many player-level metrics could be tested, compared, and rated in this framework for their usefulness.
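Such a comparison of candidate metrics could reuse the scoring rule from Section 7.1. The sketch below assumes that each candidate metric has already been fed through the forecasting model to give 1X2 probabilities, and ranks the candidates by Brier score alongside a bookmaker baseline with the margin removed by rescaling. The Brier score is taken here as the squared error summed over the three outcomes and averaged over matches, an assumption that is consistent with the magnitudes in Table 5 (a uniform forecast scores 2/3). All names, forecasts, odds and results are invented for illustration.

```python
def brier_score(forecasts, results):
    """Mean over matches of the squared error summed over home/draw/away.

    forecasts: list of (p_home, p_draw, p_away); results: list of 'H', 'D' or 'A'.
    """
    position = {"H": 0, "D": 1, "A": 2}
    total = 0.0
    for probs, result in zip(forecasts, results):
        observed = [0.0, 0.0, 0.0]
        observed[position[result]] = 1.0
        total += sum((p - o) ** 2 for p, o in zip(probs, observed))
    return total / len(results)


def remove_margin(decimal_odds):
    """Scale the inverse odds so the implied probabilities sum to 1."""
    inverse = [1.0 / odds for odds in decimal_odds]
    overround = sum(inverse)
    return tuple(value / overround for value in inverse)


if __name__ == "__main__":
    results = ["H", "D", "A", "H"]  # invented match results
    candidates = {
        # Invented forecasts from models fed with two hypothetical metrics.
        "pass_completion": [(0.50, 0.27, 0.23), (0.40, 0.30, 0.30),
                            (0.35, 0.30, 0.35), (0.55, 0.25, 0.20)],
        "whoscored": [(0.58, 0.24, 0.18), (0.36, 0.33, 0.31),
                      (0.28, 0.29, 0.43), (0.60, 0.23, 0.17)],
    }
    invented_odds = [(1.80, 3.60, 4.50), (2.60, 3.20, 2.80),
                     (3.40, 3.30, 2.15), (1.70, 3.80, 5.00)]
    candidates["bookmaker"] = [remove_margin(odds) for odds in invented_odds]

    # Lower Brier score means better probabilistic forecasts.
    for name, forecasts in sorted(candidates.items(),
                                  key=lambda item: brier_score(item[1], results)):
        print(f"{name}: {brier_score(forecasts, results):.4f}")
```

With only four invented matches the ranking itself means nothing; the point is that any metric yielding forecast probabilities can be scored on the same footing.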

311
B. Holmes and I.G. McHale International Journal of Forecasting 40 (2024) 302–312

Lastly, we note the model could be used to develop recruitment tools for football clubs and predict the potential impact a new player might have on a club’s results.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Arntzen, H., & Hvattum, L. M. (2021). Predicting match outcomes in association football using team ratings and player ratings. Statistical Modelling, 21(5), 449–470.
Baker, R. D., & McHale, I. G. (2015). Time varying ratings in association football: the all-time greatest team is... Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(2), 481–492.
Berrar, D., Lopes, P., & Dubitzky, W. (2019). Incorporating domain knowledge in machine learning for soccer outcome prediction. Machine Learning, 108(1), 97–126.
Boshnakov, G., Kharrat, T., & McHale, I. (2016). A bivariate Weibull count model for association football scores. International Journal of Forecasting, 33(2), 458–466.
Constantinou, A. C. (2019). Dolores: a model that predicts football match outcomes from all over the world. Machine Learning, 108(1), 49–75.
Constantinou, A. C., Fenton, N. E., & Neil, M. (2012). pi-football: A Bayesian network model for forecasting association football match outcomes. Knowledge-Based Systems, 36, 322–339.
Crowder, M., Dixon, M., Ledford, A., & Robinson, M. (2002). Dynamic modelling and prediction of English football league matches for betting. Journal of the Royal Statistical Society: Series D (the Statistician), 51(2), 157–168.
da Costa, I. B., Marinho, L. B., & Pires, C. E. S. (2021). Forecasting football results and exploiting betting markets: The case of “both teams to score”. International Journal of Forecasting.
Dixon, M. J., & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society. Series C. Applied Statistics, 46(2), 265–280.
Dubitzky, W., Lopes, P., Davis, J., & Berrar, D. (2019). The Open International Soccer Database for machine learning. Machine Learning, 108(1), 9–28.
Hubacek, O., Sourek, G., & Zelezny, F. (2019). Learning to predict soccer results from relational data with gradient boosted trees. Machine Learning, 108(1), 29–47.
Hvattum, L. M., & Arntzen, H. (2010). Using Elo ratings for match result prediction in association football. International Journal of Forecasting, 26(3), 460–470, Sports Forecasting.
Johnstone, D. J., Jones, S., Jose, V. R. R., & Peat, M. (2013). Measures of the economic value of probabilities of bankruptcy. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(3), 635–653.
Kelly, J. L. (1956). A new interpretation of information rate. Bell System Technical Journal, 35(4), 917–926.
Kharrat, T. (2016). A journey across football modelling with application to algorithmic trading (Ph.D. thesis).
Koopman, S. J., & Lit, R. (2015). A dynamic bivariate Poisson model for analysing and forecasting match results in the English Premier League. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(1), 167–186.
Lasek, J. (2019). New data-driven rating systems for association football (Ph.D. thesis).
Maher, M. J. (1982). Modelling association football scores. Statistica Neerlandica, 36(3), 109–118.
McCarrick, D., Bilalic, M., Neave, N., & Wolfson, S. (2021). Home advantage during the COVID-19 pandemic: Analyses of European football leagues. Psychology of Sport and Exercise, 56, Article 102013.
Owen, A. (2011). Dynamic Bayesian forecasting models of football match outcomes with estimation of the evolution variance parameter. IMA Journal of Management Mathematics, 22, 99–113.
Peeters, T. (2018). Testing the wisdom of crowds in the field: Transfermarkt valuations and international soccer results. International Journal of Forecasting, 34(1), 17–29.
R Core Team (2022). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Tsokos, A., Narayanan, S., Kosmidis, I., Baio, G., Cucuringu, M., Whitaker, G., & Király, F. (2019). Modeling outcomes of soccer matches. Machine Learning, 108(1), 77–95.
Wheatcroft, E. (2020). A profitable model for predicting the over/under market in football. International Journal of Forecasting, 36(3), 916–932.
Wheatcroft, E. (2021). Evaluating probabilistic forecasts of football matches: the case against the ranked probability score. Journal of Quantitative Analysis in Sports, 17(4), 273–287.
