Cricket Match Outcome Prediction
https://doi.org/10.1007/s10479-022-04541-6
ORIGINAL RESEARCH
Abstract
One of the significant challenges in the sports industry is identifying the factors influencing
match results and their respective weightage. To make appropriate recommendations to team
management and players, prediction models are needed that both predict the match outcome
and quantify the important factors. A second requirement
is identifying talented and emerging players and performing an associative analysis of the
important factors against the match-winning outcome. This paper formulates a hybrid machine
learning-clustering-association rules model and implements the framework for
cricket, one of the most popular sports globally, watched by billions around the
world. We predict the match outcome for One Day Internationals (ODIs) and Twenty20s
(T20s) (two formats of cricket representing fifty-over and twenty-over versions respectively)
adopting state-of-the-art machine learning algorithms: Random Forest, Gradient Boosting,
and deep neural networks. The variable importance is computed using machine-learning
techniques and further statistically validated through a regression model. The emerging
talented players are identified by clustering. Association rules are generated for determining
the best possible winning outcome. The results show that environmental conditions are
equally crucial for determining a match result, as are internal quantitative factors. The model
is thus helpful for both team management and for players to improve their winning strategy
and also for discovering emerging players to form an unbeatable team.
Ajay Kumar
Praveen Ranjan Srivastava
Ashish Kumar Jha
Lalitha Dhamotharan
1 Indian Institute of Management (IIM) Rohtak, Rohtak, India
2 Emlyon Business School, Écully, France
3 Trinity Business School, Trinity College Dublin, Dublin, Ireland
4 University of Exeter Business School, Exeter, UK
Annals of Operations Research
1 Introduction
Cricket is one of the most-watched and most-followed sports with a presence worldwide
(Saha, 2020; Thomson et al., 2021). Statistical analysis
is one of the critical aspects of the game, as demonstrated by using mathematical techniques
to resolve weather-affected games (Stern, 2016). It is a game that has penetrated all strata
of society irrespective of age and demographics. The interest in the game has drawn many
enthusiasts to watch and analyze the decisions of the game and express opinions about
players, team composition, and match-winning strategies. However, opinions backed by data
and statistical analysis would add weight to the observation and increase the validity.
For this purpose, data in sports are being collected and analyzed, made possible by the
availability and integration of physical and digital sources. Data is available on websites like
ESPN Cricinfo, CricBuzz, Cricingif, and Howstat, catering to the special interests of
sports experts, analysts, and enthusiasts across the world. The availability of sports records in
different formats (player-wise, team-wise, and matches-wise) can enhance decision-making
capabilities related to the players and team’s performance, mental health, and safety. Further,
this encourages fan engagement and helps in formulating marketing strategies.
Players and team management are confronted with the pressing problem of the
team’s performance and country rankings. Sports managers strive to groom an ideal
group of cricket players who form a formidable winning combination. Players, on the
other hand, strive to be a part of this winning combination. They are interested in understanding
the rationale or strategies required to accomplish a win. There is a need, therefore, to first
identify the different factors impacting favorable match results. Existing studies conducted
in this regard have focused primarily on internal factors influencing match decisions.
These include runs, wickets, bowling and batting averages, strike rates, and catches for
match decisions. However, the impact of player-specific factors and environmental conditions
like match time, match venue (home or away), batting position (whether the team bats first
or second), and toss decisions are understated in current studies. There is a need to consider
these factors as they are pertinent in the long run for determining match results. There is also
a need to identify the emerging talented cricket players by predicting the result of the match
based on their parameters. This gains prominence in the study context of India as the sports
leagues, with billions of dollars of investments, rely on the identification of talent to make it
successful (Kamath et al., 2020). Such data-backed identification would help understand how
their performance impacts the winning outcome and identify associative rules that determine
which combination of factors leads to a favorable match outcome.
Further, while different statistical learning and operations research methods are adopted
in current literature, a more customized player-wise and captain-wise analysis for selection
is not considered. Cricket is not a game of any individual but of a team. The match result is driven not
only by the captain’s performance but by all eleven players, and by various categorical factors
like the toss and whether the team is batting first or second. However,
the impact of these factors is not statistically analyzed. It is required to identify if there are
patterns in deciding the match result.
While current research primarily consists of statistical and operations research-based
techniques, there is a proliferation of machine learning algorithms (Deval et al., 2021) that
can predict the match result and quantify and identify the most critical factors influencing the
results. Given the nature of Cricket as a sport with multiple formats, this becomes even more
pertinent as these results will vary according to the match format, i.e., the set of important factors
for One Day International (ODI) match results may differ from that for T20 match results.
For example, a team leading in ODIs may be placed 6th in T20s.
Hence, a more comprehensive model backed by machine learning algorithms would help
in making appropriate team management and toss decisions. Player retention strategies can be
formulated by identifying the top contributing cricket players leading the team to victory.
To successfully implement the model, the study is divided modularly into the following
research objectives (research questions):
RQ1 What is the impact of player-specific factors and environmental conditions like match
time and toss on match outcome?
RQ2 How do categorical factors like the toss, match time, and batting position impact match
result?
RQ3 Which ML approach predicts the most accurate match outcome?
RQ4 What are the most significant factors (both individual and interaction) impacting ODI
match and T20 match result?
RQ5 Who are the emerging talented cricket players for India?
2 Literature review
Prior studies conducted in the domain of sports analytics have analyzed the factors impacting
match outcome and showed that the factors considered could be broadly categorized into
(1) Quantitative factors and (2) Categorical factors. Quantitative metrics include the Runs
scored, Wickets taken, Bowling average, and number of catches (Bose et al., 2021; Deval
et al., 2021; Kamble, 2021). Other quantitative factors like player ratings and partnerships
data were not consistently available for all matches and teams on the cricket statistics websites
and hence were excluded. There are other decision variables, categorical in nature, that are
important in the state of the game but have been rarely included in such analysis. These are
Toss decision (won/lost), Match location (home/away), Match time (day or day and night),
and Batting position (batting first or batting second). These have been incorporated separately
as categorical factors, as depicted in Fig. 1 below.
The factors that emerge from the existing literature and the rationale for their inclusion in
the study are described below:
• Runs scored The number of runs scored by a batsman is one of the most critical factors
deciding the match irrespective of the pitch and playing conditions. It sets a formidable
benchmark and target for the batsman and is vital for match victory (Kamble, 2021).
• Wickets taken From a bowler’s perspective, the number of wickets taken is an equally
vital factor for deciding the match outcome. The factor is antithetical to the number of
runs scored since the fall of an important batsman can mold the outcome of the match
accordingly in the direction of the bowling team (Thorley, 2021).
• Bowling average The bowling average (runs conceded per wicket taken) is also important
since, other than taking wickets, constraining the flow of runs is critical
for match outcome (Deval et al., 2021).
• The number of catches The number of catches, particularly, the catches taken of important
players can swing the match outcome in favor of the fielding team (Bose et al., 2021).
• Batting position The extent to which a team is comfortable batting first or fielding first is
also a deciding factor since some teams have a demonstrated history of being successful
more as chasers or in batting first. This depends on the team composition and potential of
the team and is hence included as a factor in categorical form for the study (1 for batting
first and 0 for fielding first) (Weeraddana & Premaratne, 2021).
[Fig. 1: Factors impacting the outcome of cricket match. Quantitative factors: runs scored, wickets taken, bowling average, number of catches. Categorical factors: toss decision, match time, batting position, match location.]
• Match location The venue of the match is also an important determinant of the match
outcome (classified in terms of home or away i.e., whether the match is played on home
ground of the team or in a different country). Different teams have different records that
vary based on the venue since some teams may be more successful at their local venue
than a different country due to familiarity with the pitch conditions (Bliss et al., 2021).
• Match time The duration of play and the time at which the match starts and
ends (whether a day match or a day and night match) is a critical external factor determining
the outcome. The same pitch can have different conditions at different times, such as the
presence of dew on the ground, which may indirectly influence the match outcome (Mondal
et al., 2021).
• Toss decision One of the main external factors is the toss. It is believed that some matches are
won just by winning the toss, given the demonstrated history of some teams being successful
after winning the toss. This also depends on the pitch conditions: if a team is able
to get a favorable toss outcome, this may influence the match outcome probabilistically,
and hence the toss is included as a factor in the study (Sahu, 2021).
Prior works adopting machine learning (ML) techniques in match result prediction are tabulated
in Table 1.
Table 1 summarizes the existing research in the domain of match result prediction.
The studies have so far adopted baseline techniques such as statistical analysis and
optimization for team allocation and for analyzing team performance. However, a more granular
player-wise analysis identifying the key factors influencing match result needs to be explored.
There is a need to develop match result prediction models from more state-of-the-art machine
learning techniques for higher accuracy and to assign a weightage (importance score) to each
factor influencing match outcome. We discuss the limitations of the existing studies reviewed
in Sect. 2.2.
The limitations of the existing studies of match result prediction are illustrated below in
Fig. 2:
First, while existing studies identify internal factors for match outcome, the impact of
player-specific factors and environmental conditions like match time and toss is not factored
in for predicting match outcome.
Second, none of the existing studies stated above has considered a player-wise and captain-wise
analysis for selection. Teams are driven by the performance of the captains
and individual players, and various categorical factors like the toss, match time,
and whether the team is batting first or second have an impact on match outcome. However,
the impact of these factors is not statistically analyzed. This is needed to identify if there are
patterns in deciding the match outcome.
Third, a comparative analysis of ML approaches for match result prediction is not done,
which is needed to identify and predict the match outcome. A reliable data source like ESPN
CricInfo can be utilized by sourcing player-wise statistics and categorical variables like the
toss, match time, and batting position to perform the prediction.
Fourth, there is a need to identify the most significant factors (both individual and interaction)
impacting ODI and T20 match outcomes. This would help make appropriate
team management, toss decisions, and player retention strategies by identifying top contributing
cricket players for victory of the team.
[Fig. 2: Limitations of research on match result prediction]
Fifth, there is a need to identify the emerging talented cricket players by predicting the
outcome of the match based on their parameters. This would help in understanding how their
performance impacts the win ratio of the team. This would enable clustering the players into
different grades and forming effective match-winning team combinations. Further, no study
associates the impact of various factors on match outcome in a rule-based format. This would
help in analyzing which parameters can be tuned to achieve the desired outcome.
For overcoming the above limitations, a hybrid machine learning-clustering-association
rule model is adopted in the paper for predicting match outcomes, identifying important factors,
clustering and identifying emerging players, and formulating association rule patterns.
The data collection procedure and research methodology adopted in this paper are discussed
next, in Sect. 3.
The dataset for this research is sourced from the ESPN CricInfo Statsguru dataset (footnote 1), a compendium
of cricket statistics worldwide drawn across all three formats of the game,
namely Test Cricket, One Day Internationals (ODI), and T20s. The aggregated statistics
of four current and renowned international cricket team captains of Australia, India, New
Zealand, and England (namely, Aaron Finch, Virat Kohli, Kane Williamson, and Joe Root)
for all three formats were collected in terms of the number of matches played,
runs scored, and the trending batting average. The data is aggregated according
to the following quantitative and categorical attributes. The data stored in table format on
the web page of ESPN Cricinfo is scraped using the R package ‘rvest’. The predictors
incorporated in the model are discussed below:
The following appropriate factors, namely, Runs scored, wickets taken, Bowling Average,
number of catches taken, Toss decision, Match venue/location, Match time, and Batting
Position are considered for the predictive model.
The following variables are extracted from ESPN CricInfo Statsguru:
[1] Runs scored operationalized as ‘Runs’.
[2] Wickets taken operationalized as ‘Wkts’.
[3] Bowling Average operationalized as ‘Bowl Av’.
[4] Number of catches taken The number of catches taken in a particular match is operationalized
as ‘Catches’.
[5] Toss decision The result of the toss (winning the toss gives the team
the opportunity to decide whether to bat first or field first) is denoted by
‘Toss decision’. In the dataset, 1 denotes that the toss is won while 0 represents the lost
toss.
[6] Match venue/location The location of the match with respect to the playing teams (either
the match venue is in one of the host countries of the playing teams or in a different
country) is denoted by ‘Match location’. In the dataset, 1 denotes home country while
0 implies the match is “away”, i.e., played in a different country.
[7] Match time The time of the day when the match is played is also extracted and denoted
by ‘Match time’, representing whether the match is played as a ‘Day’ match or a ‘Day
and night’ match. ‘Day’ match is denoted by 1 and ‘Day and night’ match by 0.
[8] Batting position The variable denotes whether the team under consideration has batted
first and set a target for the opposition or batted second and chased a target of runs.
‘Batting first’ is denoted by 1 and ‘Fielding first’ (‘Batting second’) is denoted by 0.
1 Retrieved from: https://stats.espncricinfo.com/ci/engine/stats/index.html.
The interactive web scraping is performed using the ‘rvest’ package in R. The data is
present on the website in “HTML Table” format. The web page is first read into R,
and the appropriate HTML table is scraped by inspecting the HTML code and tags used
for designing the tables in the webpage. The scraping of the parameters is illustrated in
Fig. 3.
For instance, to retrieve the statistics of the Indian captain Virat Kohli, there is a name-wise
search where we can enter the name of the player or team and retrieve the year-wise records
aggregated according to the attributes entailed above.
The webpage showing all the links to test match, ODI, and twenty-twenty (T20) records
is illustrated as in Fig. 4.
From this page, we need to retrieve the all-round records of Virat Kohli for one day
internationals, test matches and Twenty-20 matches.
The page redirects to the webpage containing the data for all the formats to be web scraped
in HTML tables format as illustrated in Fig. 5.
The R package ’rvest’ is now deployed to scrape the third HTML table (as inspected in
HTML webpage) into a data frame which is exported in Comma Delimited File (.csv) format
for further analysis.
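The scraping step described above can be sketched in R as follows. This is a minimal illustration rather than the authors' script: the inline HTML string is a hypothetical stand-in for the Statsguru page (so that the example runs offline), the column subset and the output file name are assumptions, and for a live page the URL would be passed to read_html() instead.

```r
library(rvest)

# Hypothetical stand-in for a Statsguru results page: three HTML tables,
# the third of which holds the player records (as described in the text).
page_source <- '
<html><body>
  <table><tr><th>Filter</th></tr><tr><td>Format</td></tr></table>
  <table><tr><th>View</th></tr><tr><td>Overall</td></tr></table>
  <table>
    <tr><th>Player</th><th>Mat</th><th>Runs</th><th>Wkts</th></tr>
    <tr><td>HK Bennett</td><td>1</td><td>20</td><td>1</td></tr>
    <tr><td>TG Southee</td><td>2</td><td>1</td><td>5</td></tr>
  </table>
</body></html>'

# Parse the page; for a live page, pass the URL string here instead.
page <- read_html(page_source)

# Extract every <table> element and convert each one to a data frame.
tables <- html_table(html_elements(page, "table"))

# Keep the third table, as identified by inspecting the page's HTML.
records <- as.data.frame(tables[[3]])

# Export to a comma-delimited file for further analysis
# (the file name and location are assumptions).
out_file <- file.path(tempdir(), "player_records.csv")
write.csv(records, out_file, row.names = FALSE)
print(records)
```

Selecting the table by position is fragile if the page layout changes; selecting by a CSS class or caption, where one exists, would be a more robust choice.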
Further, for the machine learning models, the other parameters defined, namely
Toss decision (won/lost), Match time (day/day and night), and Batting Position (first/second),
need to be extracted. However, these parameters are not displayed directly
in the HTML table as features; they are available as advanced search filters, as
illustrated below in Fig. 6.
Fig. 6 Features for machine learning model as filter parameters. Source: https://stats.espncricinfo.com/
For the machine learning model, the overall aggregate records of all player matches from
2016–2020 for each match format (Test, ODI, and T20) are web scraped to form the training
set illustrated below in Table 2. Since the additional features illustrated above are filters,
the filters are applied to retrieve different datasets sorted by runs scored, which are subsequently
merged into a single dataset, creating dummy variables for the above filters.
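The merging of filter-specific extracts into one dataset with dummy variables can be sketched in base R as below. The two data frames, their player labels, and their values are hypothetical placeholders (not records from the paper's dataset), and only the toss filter is shown; the other filters would be handled the same way.

```r
# Hypothetical extracts, one per setting of the toss filter.
toss_won  <- data.frame(Player = c("Player A", "Player B"), Runs = c(20, 68))
toss_lost <- data.frame(Player = c("Player C", "Player D"), Runs = c(6, 15))

# Record the filter that produced each extract as a 0/1 dummy column
# (1 = toss won, 0 = toss lost, matching the coding used in the paper).
toss_won$Toss_decision  <- 1
toss_lost$Toss_decision <- 0

# Stack the extracts into a single dataset and sort by runs scored.
merged <- rbind(toss_won, toss_lost)
merged <- merged[order(merged$Runs, decreasing = TRUE), ]
rownames(merged) <- NULL
print(merged)
```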
The Outcome is considered as the dependent variable in line with the objective of the
paper which is to predict the match outcome from the above parameters, hence data for the
above variables are collected interactively during this period. Outcomes pertaining to draw
or tie are excluded from analysis for lack of relevance. The statistics are restricted to the
ESPN Cricinfo website due to data availability in a structured format (both player-wise and
match-wise extraction of data possible through this data source, satisfying the objectives of
the paper).
Thus, the cricket dataset is constructed with 3000 ESPN match records; a snippet of the
dataset is illustrated in Table 2.
Table 2 presents a snippet of the scraped player dataset. For instance, the player Tim
Southee (TG Southee) [Row number 7] played two matches in 2020, scored
an aggregate of 1 run from both matches, took a total of 5 wickets, maintained a Bowling
Average of 62, and took 1 catch. New Zealand won the toss (Southee is an NZ player),
played at the home venue (Match location 1) during the day (Match time 1), batted first
(Batting position 1), and won the match (Outcome 1).
The outcome variable in this study is ‘Outcome’ used to quantify the team performance.
The outcome of the match result is predicted for a completely new validation set of cho-
sen players and based on respective parameters. The methodology adopted in the paper is
illustrated below:
This paper examines the role of factors determining the match outcome from different match-wise
parameters scraped from ESPN Cricinfo Statsguru. The factor variables are selected
based on their decision-making significance in the game of Cricket. This objective is accomplished
in two steps. The first step is to perform predictive modeling using ML on the dataset,
considering Outcome (match outcome) as the dependent variable (to be forecasted) and
the above-mentioned quantitative and categorical factors, as illustrated in Table 3, as predictors. The models
are compared in terms of accuracy and the importance of each variable, also known as
relative importance. The relative importance generated from the predictive
models quantifies the extent to which each predictor variable is significant
and thus defines a method to compute ‘Outcome’ from the above
predictors. Predictive modeling techniques can be broadly categorized into supervised and
unsupervised techniques (Loureiro et al., 2018). In this paper, the predictive techniques, i.e.,
Random Forest model, Gradient Boosting, and more complex ML-based models, like deep
neural networks, have been adopted to derive the relative significance of factors and improve
the extent to which prediction can be made in terms of accuracy.
Further, the outcome variable predicted from the above ML techniques is tuned to different
user-generated scenarios, and match result is predicted based on the parameters defined in
Table 3.
Table 2 A snippet of the match ODI dataset scraped from ESPN CricInfo
Player      | Mat | Runs | Wkts | Bowl Av | Catches | Toss decision | Match venue | Match time | Batting Position | Outcome
HK Bennett  | 1   | 20   | 1    | 20      | 0       | 1             | 0           | 1          | 1                | 1
MS Chapman  | 1   | 27   | 0    | 0       | 0       | 1             | 1           | 1          | 1                | 1
LH Ferguson | 2   | 53   | 4    | 12.5    | 0       | 1             | 1           | 1          | 1                | 1
DJ Mitchell | 1   | 55   | 0    | 0       | 0       | 1             | 0           | 1          | 1                | 1
IS Sodhi    | 2   | 68   | 5    | 11.6    | 2       | 1             | 1           | 1          | 1                | 1
BM Tickner  | 1   | 92   | 2    | 12.5    | 0       | 1             | 1           | 1          | 1                | 1
TG Southee  | 2   | 1    | 5    | 62      | 1       | 1             | 1           | 1          | 1                | 1
C Munro     | 1   | 6    | 0    | 0       | 2       | 1             | 1           | 1          | 1                | 1
MJ Santner  | 1   | 15   | 1    | 41      | 0       | 1             | 1           | 1          | 1                | 1
Considering existing studies (Bose et al., 2021; Deval et al., 2021; Kamble, 2021) that build
match outcome prediction models and provide a consistent output that can handle multiple
features with high predictive accuracy, ML techniques are adopted in this paper.
The model building phase starts by simulating three ML models, namely
a random forest, an ensemble gradient boost algorithm, and a multi-layered deep neural network,
to compare the prediction accuracy of match outcome across them.
The open-source data analytics tool R (Jiang & Chen, 2019) was adopted to build the
above three models. The methodology undertaken in the paper is illustrated in Fig. 7
and outlined by the steps given below:
Step 1 Exploratory data analysis is performed on the dataset initially on Aaron Finch for
performance in Tests and further, by comparing the performance of four captains Aaron
Finch, Virat Kohli, Kane Williamson, and Joe Root in terms of runs and batting average
under different criteria like the toss, match time conditions, batting position and nature of
the tournament. The comparison is performed for ODIs and T20.
Step 2 Perform tenfold cross-validation (Schneider and Gupta, 2016) for implementing ML
using the ‘trainControl’ function of the R package ‘caret’ (its result is passed to ‘train’ through the ‘trControl’ argument).
Step 2.1 Build ML models (Random Forest, Gradient Boosting and Deep Neural networks)
with the eight input variables for prediction. The prediction is performed separately for
ODI matches and T20 matches data. The models are implemented in R using the
‘caret’ package.
Step 2.2 The prediction accuracy and performance on 5% of the real data-points (150 data-points)
from the ESPN Cricinfo website for the Indian players Hardik Pandya, Shreyas Iyer,
Ravindra Jadeja, Rishabh Pant, and Virat Kohli are then compared to determine which of the
three models (Random Forest, Gradient Boosting and Deep Neural networks) outperforms
the others in match outcome prediction. The objective of this exercise is to identify the
emerging Indian players based on different parameters. The relative importance of the
variables is also compared.
Step 2.3 Further, to validate the relative importance of variables and their interactions, a
multiple linear regression model is formulated from the four quantitative predictors and four
external condition-based variables, and the interplay between variables is further factored in
to test for interaction effects (27 new pair-wise interaction and 17 triplet interaction variables).
The total number of predictors is, therefore, 52 (27 + 17 + 8); however, only the most significant
predictors are retained in the final regression model (model 3).
Step 3 The players’ performance is quantified in terms of the win ratio (number of matches
won to number of matches in total), and the players are clustered into four grades (A+, A, B, and C,
recoded as 1, 2, 3, and 4). The grades are compared with the actual grades of the players
in real time to validate the efficacy of the model.
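The win-ratio grading in Step 3 can be sketched in base R as below. This is an illustration under stated assumptions: the player tallies are hypothetical, and k-means on the one-dimensional win ratio is used as a stand-in because the clustering algorithm is not specified at this point in the text.

```r
# Hypothetical per-player tallies (not taken from the paper's dataset).
players <- data.frame(
  player  = paste("Player", LETTERS[1:8]),
  won     = c(18, 17, 13, 12, 8, 7, 3, 2),
  matches = rep(20, 8)
)

# Win ratio = matches won / matches played.
players$win_ratio <- players$won / players$matches

# Assumption: k-means with four centres stands in for the paper's clustering.
set.seed(1)
km <- kmeans(players$win_ratio, centers = 4, nstart = 25)

# k-means labels are arbitrary, so rank the clusters by their centre:
# the highest mean win ratio becomes grade 1 (A+), the lowest grade 4 (C).
grade_of_cluster <- rank(-km$centers[, 1])
players$grade <- as.integer(grade_of_cluster[km$cluster])
players$grade_label <- c("A+", "A", "B", "C")[players$grade]
print(players)
```

The resulting grades could then be compared against the players' actual grades, as the step describes.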
[Fig. 7: Methodology flowchart. Recoverable steps: perform 10-fold cross-validation; compute variable importance; compare predictive accuracy on unseen players' data.]
Step 4 Association rule patterns are generated from the dataset for match result prediction.
Further, a network analysis of player performance in different countries for T20s and ODIs is
also illustrated. For measuring the connectedness, the mutual spillover effects of the major
cricket-playing countries Australia, New Zealand, England, and India were estimated using
a Vector Autoregressive (VAR) model for the different periods.
Based on the spillover values, the connectedness graph or adjacency graph (minimum
spanning tree) was plotted between the different countries for ODIs and T20s using the
‘frequencyConnectedness’ package in R. The results of the spillover matrix and the minimum
spanning tree graphs are thus illustrated below in subsection 4.4.
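For the association rules mentioned in Step 4, the standard rule-quality measures (support, confidence, and lift) can be computed by hand for a single candidate rule, as sketched below in base R. The 0/1 match records and the rule {toss won, home venue} => {win} are hypothetical illustrations; the paper's actual rules and the package used to mine them are not shown in this section.

```r
# Hypothetical 0/1 match records (illustrative only).
m <- data.frame(
  toss_won = c(1, 1, 1, 1, 0, 0, 1, 0),
  home     = c(1, 1, 1, 0, 1, 0, 1, 0),
  win      = c(1, 1, 0, 1, 0, 0, 1, 0)
)

# Candidate rule: {toss_won = 1, home = 1} => {win = 1}
lhs  <- m$toss_won == 1 & m$home == 1   # matches satisfying the antecedent
both <- lhs & m$win == 1                # antecedent and consequent together

support    <- mean(both)                    # share of all matches covered by the rule
confidence <- sum(both) / sum(lhs)          # P(win | toss won and home)
lift       <- confidence / mean(m$win == 1) # confidence relative to the base win rate

print(c(support = support, confidence = confidence, lift = lift))
```

A lift above 1 indicates that the antecedent is associated with a higher win rate than the overall base rate in the data.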
Figure 8 illustrates the working procedure of the ML predictive models.
Random Forest-based predictive model (Bendazzoli et al., 2019): A random forest is a
supervised machine learning classifier that combines the output of several decision trees
using a voting algorithm and predicts the outcome from the aggregated votes.
Random forests are easy to implement, produce clean output, and fit well on large datasets. For this model,
all predictors are converted to numeric, and the target variable, i.e., Outcome, is predicted.
Random Forest can be implemented with the randomForest() function of the R package of the same name, with the
following syntax:
modelRF = randomForest(out ~ x1 + x2 + ... + xn, data = trset, ntree)    (1)
where out is the outcome variable; x1, x2, ..., xn represent the ‘n’
inputs/predictors considered for the model, trset is the training set input to the model, and
‘ntree’ sets the number of decision trees grown in the forest.
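A runnable version of syntax (1) is sketched below on synthetic data. The data frame, its column names, the rule used to generate the outcome, and the choice of ntree = 500 are all assumptions made for illustration, and the outcome is coded as a factor so that randomForest() fits a classifier.

```r
library(randomForest)

set.seed(42)
n <- 200

# Synthetic stand-in for the training set; column names loosely mirror Table 2.
trset <- data.frame(
  Runs      = rpois(n, 30),
  Wkts      = rpois(n, 2),
  BowlAv    = runif(n, 0, 60),
  Catches   = rpois(n, 1),
  Toss      = rbinom(n, 1, 0.5),
  Venue     = rbinom(n, 1, 0.5),
  MatchTime = rbinom(n, 1, 0.5),
  BatFirst  = rbinom(n, 1, 0.5)
)

# Outcome generated by an arbitrary rule plus noise, purely for illustration.
score <- 0.03 * trset$Runs + 0.4 * trset$Wkts + 0.5 * trset$Venue + rnorm(n)
trset$out <- factor(ifelse(score > median(score), 1, 0))

# Fit the forest; ntree is the number of trees, and importance = TRUE stores
# the per-variable importance measures.
modelRF <- randomForest(out ~ ., data = trset, ntree = 500, importance = TRUE)

print(modelRF)              # out-of-bag error estimate and confusion matrix
print(importance(modelRF))  # variable importance table
```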
Gradient Boosting predictive model (Hubáček et al., 2019) The gradient boost ensemble
model is also run to predict ‘Outcome’ to boost the predictive accuracy and interpretability.
The boosting technique can be modeled as an optimization problem where the objective is
to iteratively minimize the ensemble model’s error rate using a
gradient descent-like procedure. Weak learners such as decision trees are combined sequentially
into a robust ensemble that boosts the predictive accuracy. Boosting helps
to deal with an unbalanced and large set of data to provide accurate results.
All predictors are converted to numeric for this model, and the
output variable, i.e., ‘Outcome’, is predicted.
The caret package in R implements gradient boost on the dataset with syntax:
train(out ~ x1 + x2 + ... + xn, data = trset, method = "gbm", trControl = n(folds))    (2)
where out is the outcome variable; x1, x2, ..., xn represent the ‘n’
inputs/predictors considered for the model, trset is the training set input to the model, ‘gbm’
stands for Gradient Boosting Machines methodology in this case, and trControl is the control
parameter used to set the number of cross-validation folds [n(folds)] (mostly 10).
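Syntax (2) can be made concrete as sketched below, with the fold count supplied through caret's trainControl(). The synthetic data, its column names, and the outcome-generating rule are assumptions for illustration, and the example requires the 'gbm' package to be installed alongside 'caret'.

```r
library(caret)

set.seed(42)
n <- 200

# Synthetic stand-in for the training set (illustrative columns only).
trset <- data.frame(
  Runs  = rpois(n, 30),
  Wkts  = rpois(n, 2),
  Toss  = rbinom(n, 1, 0.5),
  Venue = rbinom(n, 1, 0.5)
)
score <- 0.03 * trset$Runs + 0.4 * trset$Wkts + 0.5 * trset$Venue + rnorm(n)
trset$out <- factor(ifelse(score > median(score), "win", "loss"))

# Ten-fold cross-validation, passed to train() via its trControl argument.
ctrl <- trainControl(method = "cv", number = 10)

# verbose = FALSE is forwarded to gbm to silence its per-iteration log.
model_gbm <- train(out ~ ., data = trset, method = "gbm",
                   trControl = ctrl, verbose = FALSE)

# Cross-validated accuracy for each tuning combination caret tried.
print(model_gbm$results[, c("n.trees", "interaction.depth", "Accuracy")])
```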
Deep neural network-based model A Deep Neural Network (DNN) model is loosely
modeled on the working of the human brain. A typical architecture is layer-wise: the input layer takes
the normalized input data, and the last layer provides the output. In between, the processing
of inputs is performed in one or more hidden layers, which compute weighted combinations of the input values and
apply an activation function (preferably sigmoid). A
DNN is trained to learn the input weights and consequently generate the output incrementally.
DNNs effectively handle a large number of input variables, for example, in big data
scenarios.
The caret package in R implements deep neural networks on the dataset with syntax:
train(out ~ x1 + x2 + ... + xn, data = trset, method = "nnet", trControl = n(folds))    (3)
where out is the outcome variable; x1, x2, ..., xn represent the ‘n’
inputs/predictors considered for the model, trset is the training set input to the model, ‘nnet’
stands for Artificial Neural Networks in this case, and trControl is the control parameter used
to set the number of cross-validation folds [n(folds)] (mostly 10).
In this paper, the DNN has been constructed with five hidden layers (as illustrated in Fig. 11),
since this configuration gave the minimum root mean square error (RMSE) of 0.25, which also minimizes
the probability of over-fit of the DNN model, i.e., the model fitting only on some data points
and under-performing on other data points in the dataset (Loureiro et al., 2018). The number
of input nodes is 8, considering eight individual predictors.
Further, the variable importance of the predictors for the above machine learning models is
computed in R using the varImp() function of the ‘caret’ package, with the
syntax varImp(model_trained), where model_trained
represents the trained machine learning model.
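The neural-network syntax (3) and the varImp() call can be sketched together as below. The synthetic data and outcome rule are assumptions for illustration. Note also that caret's "nnet" method fits a feed-forward network with a single hidden layer, so it serves here only as a simple stand-in for the five-hidden-layer DNN described in the paper.

```r
library(caret)

set.seed(42)
n <- 200

# Synthetic stand-in for the training set (illustrative columns only).
trset <- data.frame(
  Runs  = rpois(n, 30),
  Wkts  = rpois(n, 2),
  Toss  = rbinom(n, 1, 0.5),
  Venue = rbinom(n, 1, 0.5)
)
score <- 0.03 * trset$Runs + 0.4 * trset$Wkts + 0.5 * trset$Venue + rnorm(n)
trset$out <- factor(ifelse(score > median(score), "win", "loss"))

ctrl <- trainControl(method = "cv", number = 10)

# Inputs are centred and scaled, since the network expects normalized data;
# trace = FALSE is forwarded to nnet to silence its optimisation log.
model_nn <- train(out ~ ., data = trset, method = "nnet",
                  preProcess = c("center", "scale"),
                  trControl = ctrl, trace = FALSE)

# Relative importance of each predictor for the trained model.
imp <- varImp(model_nn)
print(imp)
```

The same varImp() call applies to the other trained caret models, which allows importance rankings to be compared across algorithms.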
Multiple linear regression-based predictive model (Kong et al., 2019; Cappelli et al.,
2019): A multiple regression model that factors in all the eight predictors (four quantitative
and four external) and twenty-six interactions is formulated below.
Three regression models were implemented: the first is a baseline model with the main-effect
variables, the second adds pair-wise interactions, and the third also incorporates triplet
interactions.
The baseline multiple linear regression is formulated as follows:
$$\mathrm{out}_1 = \beta_1 z_1 + \beta_2 z_2 + \dots + \beta_m z_m + \varepsilon_1 \qquad (4)$$
where $\mathrm{out}_1$ is the outcome variable [match outcome]; $z_1, z_2, \dots, z_m$ represent the
'm' inputs/predictors considered for the model, i.e., the main (effect) variables Runs
scored, Wickets taken, Bowling Average, Number of catches taken, Toss decision, Match
venue/location, Match time, and Batting Position; $\beta_1, \beta_2, \dots, \beta_m$ represent the
regression coefficients, which signify the sensitivity of the overall output to each variable;
and $\varepsilon_1$ is the residual of the regression, i.e., the component of the match result
outcome variable that is unexplained by the predictors.
Further, the second regression model, implementing the pair-wise interactions, is constructed
as follows:
$$\mathrm{out}_2 = \beta_1 z_1 + \beta_2 z_2 + \dots + \beta_m z_m + \beta_1 z_1 z_2 + \beta_2 z_2 z_3 + \dots + \beta_{m-1} z_{m-1} z_m + \dots + \varepsilon_1 \qquad (5)$$
The interaction terms $z_1 z_2 z_3, \dots, z_{m-2} z_{m-1} z_m$ represent the triplet interactions,
such as Number of runs × Bowling Average × Toss decision, derived from grouping the most
significant pair-wise interactions.
Clustering (D’Urso et al., 2019, 2020, 2021; de Zepeda et al., 2021): Clustering is used
to validate the prediction results of the machine learning algorithms by clustering the
players into different grades based on their contribution to the win ratio of the teams. The
unseen validation data points for each player, based on different parameter combinations,
are input to the machine learning algorithms to predict the match outcome in each case.
The match outcome is analyzed and aggregated for each player in terms of the win ratio
(number of matches won divided by the total number of matches). This win ratio is used to
cluster the players into different grades. The predicted grades are then compared with the
real grades allocated to the players by the respective cricket board. The closer the predicted
grades are to the actual ones, the more accurate and valid is the machine learning prediction
model.
Association mining (Huang et al., 2021) Association rules were formulated just to derive
some patterns from the data and are not incorporated into the final model.
The measures adopted in association rules are:
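The rules reported later in Sect. 4.4 are characterised by support, confidence and lift. As a reference sketch using the standard textbook definitions (the symbols $A$, $B$, $T$ and $N$ are notation introduced here, not taken from the paper), for a rule $A \Rightarrow B$ over a set $T$ of $N$ match records:

```latex
\mathrm{support}(A \Rightarrow B) = \frac{|\{\, t \in T : A \cup B \subseteq t \,\}|}{N}, \qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
```

A lift greater than 1 indicates that the antecedent and the consequent occur together more often than would be expected if they were independent.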
The results of the initial exploratory data analysis performed on the four captains Aaron
Finch, Kane Williamson, Joe Root and Virat Kohli are first illustrated in Sect. 4.1. Further, the
machine learning model results (Random Forest, Gradient Boost and Deep Neural Network)
comparing the predictive accuracy and variable importance are illustrated in Sect. 4.2. The
clustering of the emerging Indian players into different grades and comparison with real-time
allocated grades for model efficacy is illustrated in 4.3. The association rules and network
analysis results are illustrated in Sect. 4.4.
The data were collected by scraping the ESPN Cricinfo Statsguru website, as illustrated
above in 3.1.1. Initially, Australian captain Aaron Finch is compared with three other
prominent international captains (Kohli, Williamson and Root) in an exploratory data
analysis, and their performance is compared for ODIs and T20s.
4.1.1 Exploratory comparative analysis of top four captains in ODIs and T20s
The top four captains are compared in terms of performance first in one-day internationals
(ODIs) and then in twenty-twenty match formats (T20s). The four captains Aaron Finch,
Kane Williamson, Joe Root, and Virat Kohli are denoted in the below graphs as AF, KW,
JR, and VK for readability of labels, and their batting average and runs scored are compared
under different conditions as follows:
In Fig. 9, Virat Kohli has the highest batting average of the four batsmen, in the range
80–140.
Virat Kohli is most successful in matches where India lost the toss and was put in
to bat first, illustrating the chasing capability of the Indian team. India winning the toss
and fielding first was the least effective. Similar results are found for Aaron Finch. For Kane
Williamson, however, winning the toss and batting first was found to be the second
most effective strategy after the “lost toss and batted first” decision.
In Fig. 10, Aaron Finch is found to be effective when fielding first, while Virat Kohli is
effective when batting first.
Virat Kohli performs the best in the third ODI match of the ODI series while Aaron Finch
is most effective in the first match and Williamson and Root in the second match in Fig. 11.
Similarly, for T20 matches, the following insights were found:
Kohli again is found to be successful in scoring runs as illustrated in Fig. 12 when India
loses the toss and bats first in T20s similar to ODI matches while Finch is effective when
Australia wins the toss and fields first. Williamson succeeds when New Zealand wins the
toss and fields first.
For Kohli, Finch, and Root, batting average is directly proportional to match victory, whereas
Williamson recorded his highest average (> 80) in a drawn (tied) match, as shown in
Fig. 13.
The machine learning prediction and variable importance results are illustrated below in
Sect. 4.2:
[Figure: bar chart; x-axis categories: Won match, Lost match, Drawn match]
4.2 Results of machine learning prediction models for ODI and T20
Outcomes of the random forest-based model generated using the ‘randomForest’ package
in R are displayed in Tables 3 and 4. The training set comprises 2400 data points,
while the test set has 600 data points (3000 instances in total). The parameters, i.e., the
number of predictors ‘mtry’ and the number of trees ‘ntree’, are tuned to choose the best
model. A sample of the dataset is illustrated in Table 4:
The optimal number of predictors is taken to be n/3, where n is the number of
variables considered in the model (Adam et al., 2014), while mtry values vary from 1 to
n − 1, i.e., up to 6 in this case for seven predictors. To test this, the RMSE values
are plotted against the number of predictors ‘mtry’ and tuned by changing the number of
trees ‘ntree’.
Table 4 shows that the optimal accuracy of this model’s prediction for ODI matches is
86.5% at mtry = 2 and ntree = 200, while for T20 matches the accuracy is 89.41%, which
implies that 89.41% of the test set’s data points were accurately classified.
Further, the weightage assigned to the predictors for the Random Forest model is shown
in Fig. 14:
It is observed from Fig. 14 that in ODI matches, the reputation of the player is the most
significant predictor, followed by runs scored, toss decision, match location(home/away), the
number of catches, and bowling average. Further, match time (day or day/night) and wickets
are less important.
Fig. 15 Variable Importance in the Random Forest model for T20 Match prediction
It is observed from Fig. 15 that in T20 matches, the Batting order (first or second) by
a team is the most significant predictor, followed by toss decision, runs scored, match
location(home/away), number of catches and bowling average. Further, match time (day
or day/night) and wickets are less important.
From Table 4, the optimal accuracy of the prediction of this model for ODI matches is 89.41%,
which implies that 89.41% of the data points of the test set were accurately classified. The
ideal size of the classification tree is n_tree = 50. For T20 matches, the optimal predictive
accuracy is 89.7% for an optimal tree size of n_tree = 150. This is an improvement over the
Random Forest model due to an ensemble of techniques and aggregation of output from
multiple decision trees.
Fig. 16 Relative Importance of the variables in the Gradient Boosting Model for ODI Match Result
Further, the weightage assigned to the predictors for the Gradient Boost model is illustrated
in Fig. 16.
It is observed from Fig. 16 that Wickets is the most significant predictor, followed by
the player reputation, match location (home/away) and match time (day or day/night). Toss
decision and number of catches are the next most important factors in determining match
outcome.
It is observed from Fig. 17 that Batting order (first or second) is the most significant
predictor, followed by the match location (home/away), toss decision and number of runs
scored. Number of catches is the next most important factor in determining match outcome.
4.2.3 Analysis of deep neural network-based model for ODI and T20
The DNN-based model performance is illustrated in Fig. 18 below. A single simple model showed
70.9% accuracy, but an ensemble of 100 such simple DNN models boosted the accuracy
to 96% under tenfold cross-validation. Cross-validation, a statistical procedure for estimating
the classification ability of learning models, is performed to validate the performance of the
designed model.
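As a rough illustration of the tenfold cross-validation mentioned above, the following minimal sketch splits sample indices into ten train/test folds. It is not the authors' code: the function name `k_fold_indices` is hypothetical, no shuffling is performed, and the sample count of 2400 is simply the training-set size reported in Sect. 4.2.

```python
def k_fold_indices(n_samples, k=10):
    """Return k (train_indices, test_indices) pairs covering range(n_samples)."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# Assumption: 2400 training instances, as reported for the training set.
folds = k_fold_indices(2400, k=10)
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Each instance appears in exactly one test fold, so every data point is used once for validation and nine times for training.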
Figure 18 plots the variation of the deep neural network model’s performance with the
number of hidden layers adopted. Based on Fig. 18, the optimal weight decay is 0.1, and the
minimum cross-validation RMSE is attained with three hidden layers. Three hidden
layers have therefore been adopted in the deep neural network model. Further, the weightage
assigned to the predictors is shown in Fig. 19.
Fig. 17 Relative Importance of the variables in the Gradient Boosting Model for T20 Match Result
Fig. 18 Neural Network RMSE v/s Weight Decay Curve for ODI Results
It is observed from Fig. 19 that Player reputation, runs scored, match location (home/away)
and Match time (day and day/night) are the most significant predictors, followed by number
of catches and runs scored. Similarly, for T20 matches, the results are as follows in Fig. 20:
Figure 21 plots the variation of the deep neural network model’s performance with the
number of hidden layers adopted for T20 results. Based on Fig. 21, the optimal weight decay
is 0.1, and the minimum cross-validation RMSE is attained with five hidden layers.
Five hidden layers have therefore been adopted in the deep neural network model.
It is observed from Fig. 22 that Batting order (first/second), Match location (home/away),
toss decision, and runs scored are the most significant predictors, followed by player reputa-
tion.
Therefore, across all the machine learning models, the most significant factors impacting
ODI match outcome are Player reputation and Match time (day and day/night), followed
by the number of catches and runs scored. For T20 match outcome, Batting order
(first/second), toss decision, and runs scored are the most significant predictors,
followed by player reputation.
Fig. 19 Relative importance plot of Deep Neural Networks model for ODI Match
Fig. 20 Neural Network model output for T20 Match Result Prediction
Then, 5% of the dataset (i.e., 150 instances), consisting of unseen real-life data points from
the ESPN Cricinfo Statsguru website, is used to compare the models’ performance for
validation, and the result is illustrated in Fig. 23.
Deep Neural Networks are the most accurate predictor of ODI match result (95%) [green
color], followed by the Gradient Boosting algorithm (78%) [yellow color] and Random Forest
(60%) [red color].
Fig. 22 Relative importance plot of Deep Neural Networks model for T20 Match
From Fig. 24, it can be inferred that the Deep Neural Network model (90%) [yellow]
predicts the rank closest to the actual Outcome, followed by Gradient Boost (70%) [blue]
and then the Random Forest model (50%) [green].
We examined the features’ effects through the random forest, gradient boosting, and
Artificial Neural Network (ANN) models, summarized in Figs. 14, 15, 16, 17, 19 and 22. Next,
we attempt to understand the impact of interactions among the features, such as external
match factors. Hence, a multiple regression model including single, pairwise, triplet and
quadruplet interactions [Regression Model 3] has been fitted to predict the match outcome
based on the features and their interactions under study.
The regression models reported in Tables 5 and 6 are checked against the assumptions of
linear regression, namely absence of multi-collinearity, linearity, absence of
auto-correlation, and homoskedasticity (Abadie et al., 2020). For the match instances, the
robustness tests, namely Durbin-Watson, Lagrange Multiplier (LM), and Variance Inflation
Factor (VIF), are run to ensure the reliability of the model variable significance.
According to the Durbin-Watson test (Lumbantobing et al., 2020), the Durbin-Watson
statistic (DW) lies between 0 and 4, with a value close to 2 implying that
auto-correlation is not present in the dataset. Moreover, the significance value rho must be
close to 0. Similarly, for the Lagrange Multiplier (LM) test (Chauhan et al., 2020), if the
p-value is greater than the level of significance ‘alpha’, the null hypothesis of
homoscedasticity is not rejected. The VIF (Variance Inflation Factor) (Vörösmarty & Dobos,
2020) for all the predictors is expected to be < 10 to indicate that there is no
multi-collinearity in the data.
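To make two of these diagnostics concrete, the sketch below computes the Durbin-Watson statistic from a residual series and the VIF from an auxiliary R-squared, using the standard textbook formulas. The residual values are hypothetical and are not taken from the paper's regressions; the function names are chosen here for illustration only.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); the value lies between 0 and 4."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def vif(r_squared):
    """VIF of a predictor, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical residual series (assumptions for illustration only).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # negative autocorrelation, DW above 2
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # positive autocorrelation, DW below 2
dw_alternating = durbin_watson(alternating)
dw_persistent = durbin_watson(persistent)
print(dw_alternating, dw_persistent, vif(0.5))
```

Under these definitions, the VIF threshold of 10 corresponds to an auxiliary R-squared of 0.9.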
Further, the tests are summarized across the three models implemented for each of the ODI
and T20 match instances, namely Model 1 (which contains only individual variables), Model
2 (which includes pair-wise interactions), and Model 3 (which also includes triplet
interactions and significant quadruplet interactions).
The unstandardized (original) coefficients, the p-values (in parentheses) and the above
robustness statistics (DW, LM and VIF) for ODI and T20 instances are reported. *** indicates
a 1% statistical significance level.
In Table 5, three regression models are implemented. Model 1 is implemented by regressing
the outcome variable (Match outcome) on the individual predictors (direct effects)
considered in the machine learning model. At the 95% confidence level, Runs, Wickets, Toss
decision, Match location and Batting position are the most significant factors for
winning a match. The value of adjusted R-squared is 0.65, i.e., 65% of the variation in match
outcome is explained. Further, in Model 1, the Durbin-Watson statistic (DW) is reported to be
2.42 with a p-value of 0.15. The significance value rho is 0.002. Both these statistics imply
that there is no auto-correlation in the dataset. The Lagrange Multiplier (LM) is reported to
be 3.5 with a p-value of 0.2, which is greater than the level of significance alpha = 0.05
(5% significance). This implies that the dataset is homoscedastic.
The above model relies on the assumption that only one factor at a time impacts match
outcome. However, in real matches, multiple factors simultaneously impact the result. In
light of this, the regression model can be augmented with variable interactions. This implies
that individual variables like ‘Runs’ and ‘Bowling Average’ can be multiplied pair-wise to
form a new interaction term ‘Runs*Bowling Average’. Similarly, all the eight predictors are
taken two at a time, with no pair of predictor variables multiplied twice, i.e., 27 pair-wise
interaction terms are initially factored into the regression models. Thus, a new regression
Model 2 is implemented, augmenting Model 1 with pair-wise interaction effects.
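The construction of pair-wise interaction terms can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the predictor labels follow the main-effect variables named in the text, the helper `add_interactions` and the sample row are hypothetical, and the categorical predictors are assumed to be numerically coded. Enumerating all unordered pairs of eight predictors gives 28 candidate terms, one more than the 27 reported in the text; the source does not state which pair was left out.

```python
from itertools import combinations

# Main-effect predictors as named in the text (labels used here for illustration).
predictors = ["Runs", "Wickets", "Bowling Average", "Catches",
              "Toss decision", "Match location", "Match time", "Batting position"]

# Every unordered pair appears exactly once, so no pair is multiplied twice.
pairwise_terms = ["*".join(pair) for pair in combinations(predictors, 2)]


def add_interactions(row):
    """Return a copy of row extended with the product of every predictor pair."""
    extended = dict(row)
    for a, b in combinations(predictors, 2):
        extended[a + "*" + b] = row[a] * row[b]
    return extended


# Hypothetical match record with numerically coded categorical predictors.
sample = {"Runs": 60, "Wickets": 2, "Bowling Average": 25.0, "Catches": 1,
          "Toss decision": 1, "Match location": 0, "Match time": 1,
          "Batting position": 3}
extended_sample = add_interactions(sample)
print(len(pairwise_terms), extended_sample["Runs*Bowling Average"])
```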
Fig. 25 Feature contribution and importance chart from the Multiple Linear Regression Model for ODI Dataset (y-axis: Standardized Coefficients (Beta))
Average, Toss decision, Match location, Match time and Batting position are the most
significant factors for determining match outcome. The value of adjusted R-squared is
0.67, i.e., 67% of the variation is explained. Further, in Model 1, the Durbin-Watson statistic
(DW) is reported to be 1.97 with a p-value of 0.4. The significance value rho is 0.002. Both
these statistics imply that there is no autocorrelation in the dataset. The Lagrange Multiplier
(LM) is reported to be 2.15 with a p-value of 0.5, which is greater than the level of
significance alpha = 0.05 (5% significance). This implies that the dataset is homoscedastic
(Table 6).
From Model 2, it is found that the interaction variables ‘Runs*Wickets’, ‘Runs*Match
location’, ‘Wickets*Batting position’, ‘Catches*Match location’ and ‘Toss decision*Batting
position’ are significant, corroborating the result of Model 1. The value of
adjusted R-squared is 0.80, i.e., 80% of the variation is explained. For Model 2, the
Durbin-Watson statistic (DW) is reported to be 1.85 with a p-value of 0.2, showing no
autocorrelation in the dataset. Similarly, the Lagrange Multiplier (LM) is reported to be 2.67
with a p-value of 0.3 (> alpha = 0.05), which implies that the dataset is homoscedastic.
The triplet interaction terms ‘Runs*Wickets*Catches’, ‘Runs*Bowling Average*Catches’ and
‘Runs*Catches*Match location’ and the quadruplet interaction
‘Runs*Wickets*Catches*Toss decision’ are the most significant.
The variables included in Models 1 and 2 are still significant, along with the additional
variables. Overall model fit, as evident from the adjusted R-squared, shows an increase from
Model 2, with a value of 0.85. This shows that the included triplet interaction variables
improve the fit, as anticipated. For Model 3, the Durbin-Watson statistic (DW) is reported to
be 2.25 with a p-value of 0.03, showing no autocorrelation presence in the dataset. Similarly,
the Lagrange Multiplier (LM) is reported to be 2.53, with a p-value of 0.5 (> alpha = 0.05),
which implies that the dataset is homoscedastic. The Variance Inflation Factor (VIF) is found
to be < 10, implying no multicollinearity.
Fig. 26 Feature contribution and importance chart from the Multiple Linear Regression Model for T20 Dataset (axes: Predictors vs. Coefficients (Beta))
Figure 26 depicts the contribution of the features (feature importance) and their interaction
to predict the T20 match outcome.
For the feature contribution chart, the standardized coefficients of the significant variables
(both direct effects and interaction effects) are obtained by normalizing the original
coefficients. This normalization is performed by dividing the original coefficient by the sum
of all the significant variable coefficients (denoted by *** in Table 6) and multiplying by 100.
For instance, the original coefficient of ’Batting position’ is −0.4 in Table 6. The sum of all
significant variable coefficients (all direct effect variables + all the interaction effect
variables in Table 6) is
0.22324898 + 0.484620369 + 0.004078276 − 0.004297746 +
0.427463407 − 0.045603454 + 0.146893442 − 0.389278136 + 0.812 + 0.358546556 +
0.294270929 + 0.162731298 + 0.069206258 + 0.005201239 + 0.000618909 + 0.048063411
+ 0.003083849 + 0.007749245 + 0.313611866 + 0.518307433 + 0.042724935 + 0.037069386
+ 0.062479593 + 0.095869497 + 0.003183311 + 0.052044257 + 0.000259921 = 3.73.
Thus, the standardized feature coefficient for the variable ’Batting position’ is normalized
and plotted in Fig. 26 as (−0.4/3.73) ≈ −0.105, i.e., about −10.5% when expressed as a
percentage. Similarly, the other coefficients are standardized in Fig. 26.
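The normalization arithmetic above can be checked with a short sketch. The coefficient list is copied from the sum printed in the text, with the signs read as printed; the helper name `normalized_share` is hypothetical. With the rounded coefficient of −0.4 the ratio comes out at roughly −0.107, close to the −0.105 quoted above; the small gap may come from rounding in the source.

```python
# Significant coefficients as listed in the text (signs as printed there).
coefficients = [
    0.22324898, 0.484620369, 0.004078276, -0.004297746,
    0.427463407, -0.045603454, 0.146893442, -0.389278136, 0.812,
    0.358546556, 0.294270929, 0.162731298, 0.069206258, 0.005201239,
    0.000618909, 0.048063411, 0.003083849, 0.007749245, 0.313611866,
    0.518307433, 0.042724935, 0.037069386, 0.062479593, 0.095869497,
    0.003183311, 0.052044257, 0.000259921,
]


def normalized_share(coefficient, all_coefficients):
    """Divide one coefficient by the signed sum of all significant coefficients."""
    return coefficient / sum(all_coefficients)


total = sum(coefficients)
# -0.4 is the rounded 'Batting position' coefficient quoted in the text.
batting_position_share = normalized_share(-0.4, coefficients)
print(round(total, 2), round(batting_position_share, 3))
```

Multiplying the share by 100 gives the percentage form described in the text.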
The above results are corroborated in the feature importance graph where Wickets,
‘Wickets*Runs’ and the triplet interaction variables ‘Runs*Wickets*Catches’ are the largest
positive drivers of T20 match outcome. Further, individual variable ‘Batting position’ and
pair-wise interaction variable ‘Bowling Average*Match location’ are negative drivers.
Table 7 Summary of Variable Importance across all Models and match formats
Hardik Pandya 1 0 1 1
Hardik Pandya 0.66 0 0 0 0
Hardik Pandya 1 1 1 1
Shreyas Iyer 0 0 0 0
Shreyas Iyer 0.25 1 1 1 1
Shreyas Iyer 0 1 0 1
Shreyas Iyer 0 0 0 0
Ravindra Jadeja 0.75 1 0 1 1
Ravindra Jadeja 0 0 0 0
Ravindra Jadeja 1 1 0 1
Ravindra Jadeja 1 0 1 1
Ravindra Jadeja 0 1 0 0
Rishabh Pant 0.6 0 1 0 0
Rishabh Pant 1 0 1 1
Rishabh Pant 1 1 0 1
Rishabh Pant 0 1 1 0
Rishabh Pant 1 1 0 1
Virat Kohli 0.8 1 1 1 1
Virat Kohli 1 0 1 1
Virat Kohli 1 1 0 1
Virat Kohli 0 0 0 0
Virat Kohli 1 1 1 1
number of matches (implying a win ratio of 2/3 ≈ 66%). Similarly, the win ratio is computed
for the other four players. The players are then clustered into four grades based on win
ratio: players with a win ratio greater than 70% are placed in grade A+ (1), those
with a win ratio of 50–70% in grade A (2), 30–50% in grade B (3), and players under 30%
in grade C (4). This is performed to identify emerging players for formulating teams
with an optimal winning combination. This helps in better talent acquisition and will boost
the chances of Team India winning the game. The clustering results of the Indian players are
compared with the actual player grades stated in the Board of Control for Cricket in
India (BCCI) list.
The predicted clusters/grades of the players are compared with the clusters actually allocated
by BCCI in Fig. 27.
It can be inferred that for Virat Kohli, Rishabh Pant, Ravindra Jadeja, and Shreyas Iyer,
the predicted cluster and the cluster actually assigned by BCCI are the
same (1 for Virat Kohli, 2 for Pant and Jadeja, 4 for Shreyas Iyer). On the other hand, in the
case of Hardik Pandya, while the predicted cluster according to the machine learning model
based on winning ratio is 3, the actual grade allocated by BCCI is 2. Pandya was promoted
to grade 2 (grade A) based on his performance in the latest match season
(April 2021), which was not yet reflected on the ESPN Cricinfo Statsguru website; hence the
model predicted his previous grade 3 (B). The overall accuracy rate is found to be 95%
(Figs. 28 and 29).
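The win-ratio grading rule described above can be sketched as a simple threshold function. This is a minimal sketch, not the authors' clustering code: the function names are hypothetical, the match outcomes are made-up examples rather than values from the paper's tables, and the treatment of the exact boundary values (70%, 50%, 30%) is an assumption, since the text does not specify it.

```python
def win_ratio(outcomes):
    """Share of matches won, given a list of 1 (win) / 0 (loss) outcomes."""
    return sum(outcomes) / len(outcomes)


def grade(ratio):
    """Map a win ratio to grade 1 (A+), 2 (A), 3 (B) or 4 (C).

    Assumption: boundary values fall into the better grade band.
    """
    if ratio > 0.70:
        return 1
    if ratio >= 0.50:
        return 2
    if ratio >= 0.30:
        return 3
    return 4


# Hypothetical predicted outcomes for three unnamed players.
player_x = [1, 1, 1, 0, 1]   # win ratio 0.8
player_y = [1, 0, 1, 0, 0]   # win ratio 0.4
player_z = [1, 0, 0, 0]      # win ratio 0.25
grades = [grade(win_ratio(p)) for p in (player_x, player_y, player_z)]
print(grades)
```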
From the grades allocated to each of the players, it can be implied that the top five emerging
players for India are: Kohli, Pant, Jadeja, Pandya, and then Iyer (lower the grade, higher is
the propensity of the player to be considered emerging and chosen in the team for future
matches).
Further, having identified the important factors influencing a match-winning outcome,
there is also a need to determine a suitable combination of the different factors (optimal
values) at which a match-winning result can be achieved. A pattern between the different
factor parameters and the ultimate match outcome needs to be derived by a rule-based method,
which would give an insight into how parameters can be tuned to achieve the desired outcome
separately for ODIs and T20s.
The association rule mining and network analysis results are illustrated.
From the final dataset used for the modeling, variables such as runs, catches, bowling average,
wickets, toss decision (win/loss), and match time (day/day and night) are considered. Using
these variables, association rule mining is used to draw useful rules for predicting the match
result (win or loss). Rules are generated such that all the above-listed features except the
outcome participate in the antecedent, and the outcome of the match is the consequent. The
minimum support was set to 0.001, with a filter condition requiring a lift value greater than
1, in order to obtain meaningful rules and gain insight into the data. With the above
structure in place, rules with a confidence value above 70% are retained to draw the top ten
association rules.
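The three rule measures and the filter described above can be illustrated on a toy dataset. This is a sketch under stated assumptions: the eight match records and the item labels are invented for illustration, and the function `rule_metrics` is hypothetical; it computes the standard support, confidence and lift of a single rule rather than running the Apriori search itself.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0 or n_consequent == 0:
        return 0.0, 0.0, 0.0  # rule cannot be evaluated on this data
    support = n_both / n
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n)
    return support, confidence, lift


# Hypothetical match records; each set lists the conditions that held.
matches = [
    {"runs>50", "catches>5", "win"},
    {"runs>50", "catches>5", "win"},
    {"runs>50", "catches>5", "win"},
    {"runs>50", "catches>5", "loss"},
    {"runs>50", "loss"},
    {"catches>5", "loss"},
    {"win"},
    {"loss"},
]
support, confidence, lift = rule_metrics(matches, {"runs>50", "catches>5"}, {"win"})
# Filter used in the text: confidence of at least 0.7 and lift greater than 1.
keep_rule = confidence >= 0.7 and lift > 1
print(support, confidence, lift, keep_rule)
```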
The association rules are generated using the Apriori algorithm with support = 0.001 and
confidence = 0.7 for ODI match result prediction, the top 10 of which are:
For instance, in Rule 9, a won match in which the top player scores more than 50 runs, more
than 5 catches are taken, the best bowler has a bowling average > 10, the match is a day and
night match (match time = 0), and the toss is won (toss decision = 1) has support of 0.0003,
confidence of 0.86 and lift of 1.23.
Similarly, for T20 match wins, the following association rules are observed:
For instance, a won T20 match in which the top player scores more than 50 runs and more
than 5 catches are taken has a support of 0.01, confidence of 75% and lift of 2.0, depicting
the most probable winning combination.
Thus, from a team point of view, it can be concluded that for winning an ODI match, the
top performing batsman needs to score a half century (or more runs), more than five catches
need to be taken and the best bowler must maintain an average of 10. Further, a day and night
match with toss won will boost the match winning propensity.
Similarly, for T20 matches, a top-notch performance by the best batsman and more than
five catches by the best fielder in a single innings will lead to a most probable favorable
outcome for the team.
Therefore, the Association rules are found to corroborate the important factors and inter-
actions derived from the machine learning and regression models.
Further, to analyze the influence of a player’s performance with a country on how he
performs with another country, network analysis is performed for both T20 and ODIs for the
top four captains considered in exploratory data analysis.
The ‘frequencyConnectedness’ package in R computes the influence of one country over
the other by constructing a mutual influence table termed as “Spillover matrix” for all the
countries considered in the analysis. The “Spillover matrix” in Table 9 is obtained from the
predefined function ‘spilloverDY09’. This provides the variance error spillover matrix for
each captain with other four countries (Australia, New Zealand, England, and India) during
the pre-lockdown phase:
The respective country captains are not spilled over to their own country and indicated
by zero in the matrix. For instance, Virat Kohli for India, Finch for Australia, Williamson
for New Zealand, and Root for England are marked by zero since it is their respective host
country. The captains are compared with the other three countries for performance indicated
by the numeric spillover values.
For instance, in the first row, out of a total of 100 units (percent) of Virat Kohli’s overall
performance, he is found to be 71% successful with Australia, 28% with New Zealand and
1% with England. The spillover table is prepared similarly for the other captains.
The positive spillover values indicate a positive influence to another country in terms of
performance.
Further, from the net spillover computed above to determine the connectedness of one
country performance with the other, two minimum spanning trees are plotted for both ODI
and T20 matches. The purpose of the minimum spanning trees is to represent the mutual
influence of the countries graphically.
In the minimum spanning tree, each country is represented by a small node with country
name.
The topology connects the countries with lines. Countries connected by shorter
lines have a stronger mutual influence, while those connected by longer lines have relatively
less mutual influence. This implies that if the captains of two countries (other than those
connected in the graph) perform well with one of the countries connected in the graph, they
have a higher probability of performing well with the other country connected to it by a
shorter line, and a lower probability of performing well with a country connected by a longer
line.
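A minimum spanning tree over the four countries can be sketched with Kruskal's algorithm. The distances below are hypothetical placeholders: the text does not state how net spillover values are converted into line lengths, so the edge weights here are assumptions chosen only so that England and India are joined by the shortest line. The function name is likewise illustrative.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; edges are (distance, node_a, node_b) tuples."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for distance, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b, distance))
    return tree


countries = ["Australia", "England", "India", "New Zealand"]
# Hypothetical distances (shorter = stronger mutual influence).
distances = [
    (0.2, "England", "India"),
    (0.5, "Australia", "New Zealand"),
    (0.6, "England", "New Zealand"),
    (0.7, "India", "New Zealand"),
    (0.8, "Australia", "England"),
    (0.9, "Australia", "India"),
]
tree = minimum_spanning_tree(countries, distances)
print(tree)
```

A tree over four nodes always has three edges, so only the strongest links that keep all countries connected are drawn.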
Following this notation, the minimum spanning tree (Zhang et al. 2020) is plotted for the
ODI matches in Fig. 30.
From the minimum spanning tree above, it is observed that England and India have a
larger mutual influence while India and Australia have the least influence. This implies that
if Aaron Finch and Kane Williamson (the two other country captains, i.e., of Australia and New
Zealand) perform well with England, they have a higher probability of performing well with
India. If Williamson and Root (captains of New Zealand and England) perform well with
India, however, they need not perform well with Australia, owing to the long line between
Australia and India.
Similarly, for the T20 results, the net spillover matrix is illustrated in Table 10 as follows:
From the minimum spanning tree in Fig. 31, it is found that for T20 matches, Australia and
India have a larger mutual influence while India and New Zealand have the least influence.
This implies that if Joe Root and Kane Williamson (the two other country captains, i.e., of
England and New Zealand) perform well with Australia, they have a higher probability of
performing well with India in T20 matches. If Finch and Root (captains of Australia and
England) perform well with India, however, they need not perform well with New Zealand,
owing to the longer line between New Zealand and India.
5 Discussion
This study makes the following four contributions to the literature. First, this study extends
the literature in the domain of match outcome prediction by factoring in categorical factors
along with the traditionally used quantitative factors. Categorical environmental factors like
toss result, match venue and batting order (whether the team is batting first or second) prove
to be equally important as player specific statistics (quantitative factors) in determining match
outcome.
Second, a comparative analysis of multiple predictive algorithms has been performed to
identify the technique most suitable for prediction in context of sports outcomes, specifically
cricket match outcomes. It was found that DNN outperforms the other two algorithms (random
forest and gradient boosting). Thus, DNN can be recalibrated for different datasets to predict
the match results in real-time. The predictors’ feature importance is also computed and
compared to identify the significant drivers of outcome prediction.
Third, this study validated the feature importance of predictors. It is an important step
towards establishing the explainability of blackbox ML methods. We found that the most
significant factors impacting ODI match outcome as predicted by the DNN model are Player
reputation and Match time (day and day/night), followed by the number of catches and runs
scored. For T20 match outcome, Batting order (first/second), toss decision,
and runs scored are the most significant predictors, followed by player reputation. The results
of the deep learning model are validated for efficacy by clustering and association rule results.
Fourth, we introduce the nuance of the complex interaction of different parameters in
the predictive models. We find that the pair-wise interaction ‘Wickets*Batting position’
and the triplet interaction variables ‘Runs*Wickets*Catches’ and ‘Wickets*Catches*Match
time’ are the largest positive drivers of match outcome for one day international matches
123
Annals of Operations Research
(ODI). For T20s, the pair-wise interaction ‘Wickets*Runs’ and the triplet interaction variables
‘Runs*Wickets*Catches’ are the most significant drivers of match outcome.
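Interaction variables of this kind can be built as products of the base predictors before fitting a model. The following is a minimal sketch under that assumption; the helper `add_interactions` and the sample values are hypothetical, and the study may have constructed its interaction terms differently.

```python
from itertools import combinations
from math import prod
from typing import Dict


def add_interactions(record: Dict[str, float], max_order: int = 3) -> Dict[str, float]:
    """Return a copy of `record` extended with products of 2..max_order features."""
    out = dict(record)
    names = list(record)  # keep insertion order so names read like 'Runs*Wickets'
    for order in range(2, max_order + 1):
        for combo in combinations(names, order):
            out["*".join(combo)] = prod(record[name] for name in combo)
    return out


# Hypothetical per-match statistics for one player
player = {"Runs": 50.0, "Wickets": 2.0, "Catches": 3.0}
features = add_interactions(player)

for name, value in features.items():
    print(f"{name}: {value}")
```

With three base predictors this yields three pair-wise terms and one triplet term, each of which can then be entered as an additional column in a regression or other predictive model.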
The match result prediction model enables the management to identify the various environ-
mental factors and player parameters that influence the winning rate of a sports team. The
management can utilize these findings to formulate strategies to improve the performance.
The cluster analysis model, which clusters and identifies emerging players, enables the man-
agement to assign a grade to the players and optimally utilize the talented players in the team.
Such analysis can enable combinatorial optimization by choosing a winning combination of
players to optimize team performance. It will also have the knock-on effect of reducing the
incidence of idle talent on the bench.
Data-driven and optimized team management, in turn, would improve the reputation
of the team and the national board council by uplifting the team rankings. Therefore, board
councils and team management of respective countries can recalibrate the derived match result
prediction model and tune the sensitive parameters to generate customized recommendations.
Thus, from a team's point of view, it can be concluded that to win an ODI match, the top
performing batsman needs to score a half century (or more runs), more than five catches need
to be taken and the best bowler must maintain an average of 10. Similarly, for T20 matches,
a top-notch performance by the best batsman and more than five catches by the best fielder
in a single innings are most likely to lead to a favorable outcome for the team.
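Thresholds of this kind are typically read off association rules, whose strength is judged by support, confidence, and lift. The sketch below computes these three measures for one rule on a toy set of matches. The item names and the eight match records are hypothetical and are not taken from the study's data.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent) -> float:
    """Estimated P(consequent | antecedent)."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent) -> float:
    """Confidence relative to the base rate of the consequent (1.0 means no association)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical match records; a match without the "win" item is a loss.
matches = [
    frozenset({"fifty_plus", "catches_over_5", "win"}),
    frozenset({"fifty_plus", "catches_over_5", "win"}),
    frozenset({"fifty_plus", "catches_over_5", "win"}),
    frozenset({"fifty_plus", "catches_over_5"}),
    frozenset({"fifty_plus"}),
    frozenset({"catches_over_5", "win"}),
    frozenset(),
    frozenset(),
]

rule_if = frozenset({"fifty_plus", "catches_over_5"})
rule_then = frozenset({"win"})

rule_support = support(matches, rule_if)
rule_confidence = confidence(matches, rule_if, rule_then)
rule_lift = lift(matches, rule_if, rule_then)
print(rule_support, rule_confidence, rule_lift)
```

On this toy data the rule holds in three of the four matches where both conditions are met, and a lift above 1 indicates that the conditions occur together with a win more often than the base win rate alone would suggest.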
The match outcome prediction model enables the captains of the team to make appropriate
decisions at the toss, based on the condition of the pitch, to maximize the winning rate.
Further, captains can identify talented players in their teams, train them in a customized man-
ner, and position them based on their strengths to win games. New teams can be groomed
accordingly to know in real-time the order of priority of choosing players based on personal-
ized preferences. Individual players can benefit in analyzing their shortcomings and making
an individual game plan to maximize the winning rate of the team and their individual grades
in the process. For instance, a player can gauge, based on the match conditions and toss, how
many runs to score, wickets to take, and catches to hold in order to win the match.
6 Conclusion
The paper has attempted to compare the predictive performance of random forest, gradient
boosting, and deep neural network-based models and present the significance of the factors in
predicting the outcome of match instances scraped from the ESPN CricInfo website. The deep
neural network model is observed to outperform the other two machine learning models in
terms of predictive accuracy and performance on unseen new players’ data. The clustering
results of the emerging players into different grades and comparison with the grades actually
allocated by the BCCI are useful to recommend new talent and validate model efficacy. The
association rules generated for both ODIs and T20s present an insight into how parameters can be
tuned to achieve the desired outcome. The rules generated corroborate the most significant
variables and their interactions identified by machine learning and regression models.
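Grouping emerging players into grades can be illustrated with a small k-means sketch. K-means is assumed here for illustration, since this section does not name the clustering algorithm. The two features (taken to be already scaled to the 0 to 1 range), the nine players, and the grade labels "A", "B", "C" are all hypothetical.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]


def assign(points: List[Point], centroids: List[Point]) -> List[int]:
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c])) for p in points]


def kmeans(points: List[Point], k: int, n_iter: int = 50):
    """Plain k-means with a deterministic start: k points spread evenly through the sorted data."""
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid in place
        if updated == centroids:  # converged
            break
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical players: (scaled batting score, scaled fielding score)
players = [
    (0.90, 0.80), (0.85, 0.90), (0.95, 0.85),
    (0.50, 0.50), (0.55, 0.45), (0.45, 0.55),
    (0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
]

labels, centroids = kmeans(players, k=3)

# Rank clusters by centroid strength so that the strongest cluster maps to grade "A".
ranking = sorted(range(3), key=lambda c: sum(centroids[c]), reverse=True)
grade_of_cluster = {cluster: "ABC"[rank] for rank, cluster in enumerate(ranking)}
grades = [grade_of_cluster[label] for label in labels]
print(grades)
```

Grades derived this way could then be compared against the grades actually allocated by a board to see how closely the clusters match.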
However, the study is not without its limitations that can be worked upon in future research
to generate additional insights in the area of sports outcome prediction. First, we utilize
data from the ESPN CricInfo website only. Future studies could compare and aggregate the
statistics with other cricket statistics websites like CricBuzz, Cricket World, HowSTAT!,
and so on, to generate richer predictions. Second, the data sample considered for model
building is limited to 3000 instances. This sample size can be varied, and other parameters
such as the significance of partnerships and match-winning batting combinations can be considered
as predictors in the models. The opinions about the players (player rating) (Xia et al., 2019)
can also be considered as a potential predictor of match outcome. Third, the analysis can
be performed during different seasons and even outside the home country to understand
the efficacy of the model and the validity of the findings. The data about coaches for each
team could also be scraped and included as a predictor variable of match outcome, since the
team composition and game strategies are defined by the coach of the team, who is a
major driving factor of the match outcome. The results can be extended to examining the
performance of bowlers and non-captain players for more varied insights.
In summary, the paper compares and contrasts techniques to address research questions
in sports and brings out insights from sports analytics in an international cricket context.
References
Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. M. (2020). Sampling-based versus design-based
uncertainty in regression analysis. Econometrica, 88(1), 265–296.
Adam, E., Mutanga, O., Abdel-Rahman, E. M., & Ismail, R. (2014). Estimating standing biomass in papyrus
(Cyperus papyrus L) swamp: Exploratory of in situ hyper-spectral indices and random forest regression.
International Journal of Remote Sensing, 35(2), 693–714.
Bendazzoli, S., Brusini, I., Damberg, P., Smedby, Ö., Andersson, L., & Wang, C. (2019). Automatic rat brain
segmentation from MRI using statistical shape models and random forest. In Medical Imaging 2019:
Image Processing (Vol. 10949, p. 109492O). International Society for Optics and Photonics.
Bose, A., Mitra, S., Ghosh, S., Ghosh, R., Patra, T., & Chakrabarti, S. (2021). Unsupervised learning based
evaluation of player performances. Innovations in Systems and Software Engineering, 17(2), 121–130.
Bliss, A., Ahmun, R., Jowitt, H., Scott, P., Jones, T. W., & Tallent, J. (2021). Variability and physical demands
of international seam bowlers in one-day and Twenty20 international matches across five years. Journal
of Science and Medicine in Sport, 24(5), 505–510.
Cappelli, C., Di Iorio, F., Maddaloni, A., & D’Urso, P. (2019). Atheoretical regression trees for classifying
risky financial institutions. Annals of Operations Research, 1–21.
Cea, S., Durán, G., Guajardo, M., Sauré, D., Siebert, J., & Zamorano, G. (2020). An analytics approach to the
FIFA ranking procedure and the World Cup final draw. Annals of Operations Research, 286(1), 119–146.
Chauhan, S., Pande, R., & Sharma, S. (2020). The causal relationship between Indian energy consumption and
the GDP: A shift from conservation to feedback hypothesis post economic liberalisation. Theoretical &
Applied Economics, 27(3), 203–212.
D’Urso, P., De Giovanni, L., & Massari, R. (2019). Trimmed fuzzy clustering of financial time series based
on dynamic time warping. Annals of Operations Research, 1–17.
D’Urso, P., De Giovanni, L., Massari, R., D’Ecclesia, R. L., & Maharaj, E. A. (2020). Cepstral-based clustering
of financial time series. Expert Systems with Applications, 161, 113705.
D’Urso, P., De Giovanni, L., & Vitale, V. (2021). Spatial robust fuzzy clustering of COVID 19 time series
based on B-splines. Spatial Statistics, 100518.
Deval, G., Hamid, F., & Goel, M. (2021). When to declare the third innings of a test cricket match?. Annals
of Operations Research, 1–19.
de Zepeda, M. V. N., Meng, F., Su, J., Zeng, X. J., & Wang, Q. (2021). Dynamic clustering analysis for driving
styles identification. Engineering Applications of Artificial Intelligence, 97, 104096.
Goossens, D. R., Beliën, J., & Spieksma, F. C. (2012). Comparing league formats with respect to match
importance in Belgian football. Annals of Operations Research, 194(1), 223–240.
Hubáček, O., Šourek, G., & Železný, F. (2019). Learning to predict soccer results from relational data with
gradient boosted trees. Machine Learning, 108(1), 29–47.
Huang, J., Tan, J., & Hua, D. (2021). Data mining of association between hyperuricemia and common chronic
diseases based on evolutionary apriori algorithm (EAA). In 2021 IEEE 6th International Conference on
Cloud Computing and Big Data Analytics (ICCCBDA) (pp. 73–77). IEEE.
Jain, P. K., Quamer, W., & Pamula, R. (2021). Sports result prediction using data mining techniques in
comparison with base line model. Opsearch, 58(1), 54–70.
Jiang, Y., & Chen, N. C. (2019). Event attendance motives, host city evaluation, and behavioral intentions.
International Journal of Contemporary Hospitality Management.
Kamath, G. B., Ganguli, S., & George, S. (2020). Attachment points, team identification and sponsorship
outcomes: evidence from the Indian Premier League. International Journal of Sports Marketing and
Sponsorship.
Kamble, R. R. (2021). Cricket score prediction using machine learning. Turkish Journal of Computer and
Mathematics Education (TURCOMAT), 12(1S), 23–28.
Kong, Y. S., Abdullah, S., Schramm, D., Omar, M. Z., & Haris, S. M. (2019). Development of multiple linear
regression-based models for fatigue life evaluation of automotive coil springs. Mechanical Systems and
Signal Processing, 118, 675–695.
Lumbantobing, I. P., Sulivyo, L., Sukmayuda, D. N., & Riski, A. D. (2020). The effect of debt to asset ratio and
debt to equity ratio on return on assets in hotel, restaurant, and tourism sub sectors listed on Indonesia
stock exchange for the 2014–2018 period. International Journal of Multicultural and Multireligious
Understanding, 7(9), 176–186.
Loureiro, A. L., Miguéis, V. L., & da Silva, L. F. (2018). Exploring the use of deep neural networks for sales
forecasting in fashion retail. Decision Support Systems, 114, 81–93.
Mondal, S., Plumley, D., & Wilson, R. (2021). The evolution of competitive balance in men’s international
Cricket. Managing Sport and Leisure, 1–20.
Nikolaidis, Y. (2015). Building a basketball game strategy through statistical analysis of data. Annals of
Operations Research, 227(1), 137–159.
Reyers, M., & Swartz, T. B. (2021). Quarterback evaluation in the national football league using tracking data.
AStA Advances in Statistical Analysis, 1–16.
Saha, D. (2020). 10 Reasons why cricket is the most famous sport In India. Retrieved from:
https://sportzwiki.com/cricket/why-cricket-most-famous-sport-india
Sahu, A. (2021). Predictive analysis of cricket. Turkish Journal of Computer and Mathematics Education
(TURCOMAT), 12(6), 5111–5124.
Schneider, M. J., & Sachin, G. (2016). Forecasting sales of new and existing products using consumer reviews:
A random projections approach. International Journal of Forecasting, 32(2), 243–256.
Stern, S. E. (2016). The Duckworth-Lewis-Stern method: Extending the Duckworth-Lewis methodology to
deal with modern scoring rates. Journal of the Operational Research Society, 67(12), 1469–1480.
Thomson, J., Perera, H., & Swartz, T. B. (2021). Contextual batting and bowling in limited overs Cricket.
South African Statistical Journal, 55(1), 73–86.
Thorley, J. (2021). Age-related changes in the performance of bowlers in Test match cricket. International
Journal of Sports Science & Coaching, 17479541211001726.
Vörösmarty, G., & Dobos, I. (2020). Green purchasing frameworks considering firm size: A multicollinearity
analysis using variance inflation factor. Supply Chain Forum: an International Journal, 21(4), 290–301.
Weeraddana, N., & Premaratne, S. (2021). Unique approach for cricket match outcome prediction using
Xgboost algorithms. Journal of Theoretical and Applied Information Technology, 99(9), 2162–2173.
Xia, H., Yang, Y., Pan, X., Zhang, Z., & An, W. (2019). Sentiment analysis for online reviews using conditional
random fields and support vector machines. Electronic Commerce Research, 1–18.
Zhang, B., Guan, X., & Zhang, Q. (2020). Inverse optimal value problem on minimum spanning tree under
unit l∞ norm. Optimization Letters, 14(8), 2301–2322.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.