Probability Formulas and Statistical Analysis in Tennis
Journal of Quantitative Analysis in Sports
Volume 4, Issue 2 2008 Article 15
Recommended Citation:
O'Malley, A. James (2008) "Probability Formulas and Statistical Analysis in Tennis," Journal of
Quantitative Analysis in Sports: Vol. 4: Iss. 2, Article 15.
Available at: http://www.bepress.com/jqas/vol4/iss2/15
DOI: 10.2202/1559-0410.1100
2008 American Statistical Association. All rights reserved.
A. James O'Malley
Abstract
In this paper an expression for the probability of winning a game in a tennis match is derived
under the assumption that the outcome of each point is identically and independently distributed.
Important properties of the formula are evaluated and presented pictorially. The accuracy of this
formula is tested by comparing observed proportions against predicted values using data from the
2007 Wimbledon Tennis Championships. We also derive expressions for the probability of several
other milestones in a tennis match including winning a tiebreaker, winning a set, winning a match,
and recovering from a break of serve down to win a set. The resulting "tennis formulas" are used
to evaluate the implications of possible rule changes, to demonstrate how broadcasts of tennis
matches could be made more interesting and informative, and to potentially improve a player's
chance of winning a match.
Author Notes: Appreciation is extended to Mike Beaser of the Sloan School at the Massachusetts
Institute of Technology for helpful comments on an early draft of the manuscript.
O'Malley: Probability Formulas and Statistical Analysis in Tennis
1. Introduction
Tennis lends itself to the construction of probability formulas because the score
has a fixed number of hierarchically structured states; points are nested within
service games, which are nested within sets, which are nested within a match (see
Section 2). Thus, a natural approach to modeling a tennis match is to first develop
methods for predicting the outcome of a game given probabilities for individual
points, then predict the outcome of a set given probabilities for the outcome of
each game, and finally the outcome of the match given probabilities for the
outcome of each set. We derive expressions for these events and also evaluate
formulas for the probabilities of other events such as a player making a comeback
to win the match from various deficits.
The assumption at the heart of this paper is that the event that a server wins a
point is independent and identically distributed (iid) over the entire match. This
assumption requires that the result of any one point has no bearing on the result of
subsequent points. It also requires that the probability a player wins a point on
serve is fixed throughout the match; the probability can differ between the players
but must be constant throughout the match for any given player. Although iid is a
severe assumption, past work suggests that it is adequate for analyses that
aggregate over many matches (Newton and Keller, 2005).
Historically, work on the derivation of tennis formulas has been sporadic. Of
particular note, Newton and Keller (2005) unified several earlier works on tennis
probabilities using hierarchical recurrence relations to derive the probability of
winning a game, set, and match (also see references therein). Liu (2001) used
finite Markov chains and results for random walks to derive equivalent formulas.
Notable earlier works include Riddle (1988) on probability models for various
alternative tiebreaker scoring systems, and Morris (1977) on determining which
point is the most important in a tennis match.
The potential usefulness of tennis formulas is evident in the various
applications to which they might be applied. These include: making coverage of
tennis matches more informative and interesting to viewers, evaluating the effect
of potential rule changes, optimizing a player's training methods, and studying the
performance of players at key stages of a match. The potential users of tennis
formulas include broadcasters, administrators, coaches, players, and of course
bookies or gamblers; online betting agencies such as Betfair allow bets to be
placed on a professional tennis match before and during the contest (Betfair,
2008). Thus, there are a growing number of potential users and opportunity for
financial gain from successful application of tennis formulas.
The original work presented in this paper includes studying and interpreting
mathematical properties of tennis formulas, developing new tennis formulas such
as the probabilities of a player recovering from various deficits to win a set, and
2 points to 0 is read as thirty-love. If both players have won three points the
score is deuce and from this point the game continues indefinitely until one player
has a two-point lead; a player who is one point ahead in the game after the game
has reached deuce has the advantage. If the serving player is one point away
from winning a game, the point is referred to as a game-point; if the non-
serving player is one point away from winning a game, the point is referred to as a
break-point.
3.1. Properties of G ( p )
Although G ( p ) has been derived previously, its properties have not been
thoroughly examined. To explore its properties we plot G ( p ) , its derivative
function, and its integral function against p (Figure 1). After differentiating (1)
and simplifying the resulting expression, the derivative function is obtained as
\[
G'(p) = 20p^3\left(3 - p + \frac{5p^3 - 3p^2 - 4p^4}{\bigl(1 - 2p(1 - p)\bigr)^2}\right) \tag{2}
\]
and after integrating (1) and simplifying the resulting expression, the integral
function is obtained as
\[
\bar{G}(p) = \int_0^p G(x)\,dx = -\frac{2}{3}p^6 + 2p^5 - \frac{5}{4}p^4 - \frac{5}{6}p^3 + \frac{5}{4}p + \frac{5}{8}\log\{1 - 2p(1 - p)\}. \tag{3}
\]
The plots of $G(p)$, $G'(p)$, and the integral function $\bar{G}(p) = \int_0^p G(x)\,dx$
against $p$ give, respectively, the probability of a player winning a game as a
function of the probability of winning a point, the rate of change in $G(p)$ as $p$
increases, and the area under $G$ over the interval $[0, p]$ (that is, $p$ times the
average of $G$ over that interval).
It is clear from Figure 1 that $G(p)$ is a monotone increasing function that is
symmetric about the point of inflection at $p = 0.5$, in the sense that
$G(p) + G(1 - p) = 1$. The maximal value of $G'(p)$ occurs at 0.5, implying that the
greatest change in the probability of winning the game occurs at $p = 0.5$; e.g.,
the benefit of a 0.01 increase in the probability of winning a point is greater if
the baseline probability is 0.5 than if it is 0.7. A player with 0, 1, and 0.5
chance of winning a point has $G(0) = 0$, $G(1) = 1$, and $G(0.5) = 0.5$ probability
of winning the game; 0, 1, and 0.5 are the only fixed points of $G$, i.e., the only
values of $p$ with $G(p) = p$. The fact that a game offers more insurance against
random bad luck than a single point is illustrated by the inequalities $G(p) < p$
for $0 < p < 0.5$ and $G(p) > p$ for $0.5 < p < 1$. For example, $G(0.6) = 0.7357$
and $G(0.7) = 0.9008$, indicating that players with point-winning probabilities of
0.6 and 0.7 have 73.57% and 90.08% chances of winning a game, respectively.
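The properties listed above can be checked numerically. The following is a minimal sketch, assuming the standard closed form for the probability that the server wins a game under iid points (Equation (1) is not reproduced in this excerpt) and a derivative obtained by differentiating that form directly. The function names `game_win` and `game_win_slope` are illustrative and do not come from the paper.

```python
def game_win(p: float) -> float:
    """P(server wins a game) when each point is won independently with probability p.

    Assumed closed form: win before deuce, or reach deuce and win from there.
    """
    q = 1.0 - p
    before_deuce = p**4 * (1 + 4 * q + 10 * q**2)  # game won to love, 15, or 30
    reach_deuce = 20 * p**3 * q**3                 # 20 orderings of three points each
    win_from_deuce = p**2 / (1 - 2 * p * q)        # geometric series over returns to deuce
    return before_deuce + reach_deuce * win_from_deuce


def game_win_slope(p: float) -> float:
    """Derivative of game_win with respect to p (differentiated by hand)."""
    d = 1 - 2 * p * (1 - p)
    return 20 * p**3 * (3 - p + (5 * p**3 - 3 * p**2 - 4 * p**4) / d**2)


if __name__ == "__main__":
    h = 1e-6
    for p in (0.3, 0.5, 0.6, 0.7):
        # Central difference as an independent check on the closed-form slope.
        numeric = (game_win(p + h) - game_win(p - h)) / (2 * h)
        print(f"p={p:.1f}  G={game_win(p):.4f}  "
              f"G'={game_win_slope(p):.4f}  numeric={numeric:.4f}")
```

Comparing the closed-form slope with a finite difference is a cheap guard against sign errors in the derivative, which are easy to introduce when simplifying by hand.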
The quantity $\bar{G}(1)$, the integral function in (3) evaluated at $p = 1$, is
the probability a player wins a game when their point-winning probability is
chosen at random, i.e., when $p$ is uniformly distributed on $[0, 1]$. The fact
that $\bar{G}(1) = 0.5$ confirms that tennis is a fair game. Under the iid
assumption, $\bar{G}(0.5) = 0.0616$, implying that if a player's probability of
winning a point is uniformly distributed on $[0.5, 1]$ they are expected to win
$(\bar{G}(1) - \bar{G}(0.5))/0.5 = 87.68\%$ of the games they play.
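The values quoted for the integral function can be reproduced with a short script. This is a sketch under the same assumption about the closed form of $G(p)$ as elsewhere (Equation (1) is not shown in this excerpt); `game_win_integral`, `midpoint_rule`, and `mean_game_win` are hypothetical helper names, and the midpoint rule is included only as an independent check on the closed-form integral.

```python
import math


def game_win(p: float) -> float:
    """Assumed closed form for P(server wins a game) under iid points."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**5 * q**3 / (1 - 2 * p * q)


def game_win_integral(p: float) -> float:
    """Closed-form integral of game_win over [0, p]."""
    return (-2 / 3 * p**6 + 2 * p**5 - 5 / 4 * p**4 - 5 / 6 * p**3
            + 5 / 4 * p + 5 / 8 * math.log(1 - 2 * p * (1 - p)))


def midpoint_rule(f, a: float, b: float, n: int = 20_000) -> float:
    """Plain midpoint quadrature, used only to cross-check the closed form."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def mean_game_win(lo: float, hi: float) -> float:
    """Average of game_win when p is uniformly distributed on [lo, hi]."""
    return (game_win_integral(hi) - game_win_integral(lo)) / (hi - lo)


if __name__ == "__main__":
    print(game_win_integral(1.0), game_win_integral(0.5))
    print(midpoint_rule(game_win, 0.0, 0.5))
    print(mean_game_win(0.5, 1.0))
```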
We now consider the probabilities of winning a tiebreaker, a set, a match, and the
probability of recovering from a deficit to win a set. These differ from the
probability of winning a game in that they involve the probability of winning a
point when receiving serve. We use $q$ to denote the probability of winning a
point when receiving serve (recall that $p$ denotes the probability of winning a
point when serving) and denote the probability of winning a tiebreaker, a set, a
three-set match, and a five-set match by $TB(p, q)$, $S(p, q)$, $M_3(p, q)$, and
$M_5(p, q)$, respectively. Due to their length, detailed formulas for $TB(p, q)$ and
$S(p, q)$ appear in the Appendix.
The fact that the server changes after every odd-numbered point is the major
difference between a tiebreaker and a regular (service) game. However, a
tiebreaker emulates a game in that play continues past the target number of points
(seven for a tiebreaker versus four for a regular game) if the score does not
differ by at least two points when the target is reached. The algebraic expression
for $TB(p, q)$ in Equation (A1) has more terms than Equation (1) due to the
increased number of distinct scores that are possible in a tiebreaker compared to
a regular game. However, the procedure for deriving the formula is similar. One
approach is to sum the probabilities for the events that a player wins the
tiebreaker with the loss of 0, 1, 2, 3, 4, 5, and 6 or more points. The probability
of winning once the score reaches six points all equals the sum of a geometric
series whose common ratio, $p(1 - q) + (1 - p)q$, is the probability the score
moves from $n$ points all to $n + 1$ points all, $n \geq 6$; hence the appearance
of $p(1 - q) + (1 - p)q$ in the denominator of (A1). An alternative way of deriving
$TB(p, q)$ and other tennis formulas is to use recurrence relations or transition
matrices.
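The recurrence-relation route mentioned above can be sketched in a few lines. This is an illustrative implementation, not the paper's derivation: it assumes the player of interest serves the first point, that service then changes after every odd-numbered point, and that play from six points all is summarized by the geometric-series term $pq / \{1 - p(1 - q) - (1 - p)q\}$. The name `tiebreak_win` is hypothetical.

```python
from functools import lru_cache


def tiebreak_win(p: float, q: float) -> float:
    """P(player wins a tiebreaker); p = P(win point serving), q = P(win point receiving)."""
    split = p * (1 - q) + (1 - p) * q  # one point each over the next pair of serves
    if split >= 1.0:
        raise ValueError("tiebreaker never ends when every service point is held")
    from_six_all = p * q / (1 - split)

    @lru_cache(maxsize=None)
    def win_from(a: int, b: int) -> float:
        # a, b = points won so far by the player and by the opponent.
        if a == 7:
            return 1.0
        if b == 7:
            return 0.0
        if a == 6 and b == 6:
            return from_six_all
        # Assumed serve order: player serves point 1, then two points each in turn.
        player_serves = ((a + b + 1) // 2) % 2 == 0
        w = p if player_serves else q
        return w * win_from(a + 1, b) + (1 - w) * win_from(a, b + 1)

    return win_from(0, 0)


if __name__ == "__main__":
    for p, q in ((0.5, 0.5), (0.7, 0.5), (0.65, 0.37)):
        print(p, q, tiebreak_win(p, q))
```

Absorbing the recursion at six points all keeps the state space finite; every score reachable before that point is settled by the first player to seven.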
Figure 2 displays plots of $TB(p, 0.5)$, $TB(p, p)$, and $TB(p, 1 - p + 0.02)$
versus $p$. The top plot is the probability of winning the tiebreaker as a function of
the probability of winning a point as the server when the probability of winning a
point as the receiver is fixed at 0.5. As expected, $TB(p, 0.5) < 0.5$ when $p < 0.5$,
$TB(p, 0.5) > 0.5$ when $p > 0.5$, and $TB(p, 0.5) = p$ when $p \in \{0, 0.5, 1\}$.
The second plot depicts the case when a player has the same probability of
winning a point returning serve as they do when serving; in this case the
tiebreaker is a sequence of iid events. Because a tiebreaker is a longer contest than
a regular game, $TB(p, p) < G(p)$ when $p < 0.5$ and $TB(p, p) > G(p)$ when
$p > 0.5$.
The bottom plot depicts the case where a player has a 0.02 higher point-
winning probability than their opponent. The plot has a bathtub appearance
because the relative advantage of the better player is greatest when $p = 0.02$ or
$p = 1$ (it is impossible for the better player to lose in these cases) and least when
$p = 0.51$.
The third plot shows that a player with a 0.02 point-winning advantage has at
least a 60% chance of winning a best of three-set match. An interesting feature of
this plot is the presence of local minima near 0.15 and 0.85 and a local maximum
at 0.51. The local optimum at 0.51 occurs because the player has equal chance of
winning both service and receiving points (and thus games). As the difference in
the probabilities increases, the likelihood that an event such as an isolated poorly
played service game decides the match increases, which causes the probability of
winning the match to drop. The trend continues until the probabilities get so close
to 0 or 1 that the relative difference in the point winning probabilities skyrockets
along with the probability of winning the match.
An important property of a tennis match is that it is a contest that is not over until
the final point has been won. In fact, the only truly important point in a tennis
match is the final point since knowing which player won that point determines the
winner.
The probability a player wins a set conditional on the current score can be
evaluated using the same approach as for S ( p, q ) . To illustrate, we consider the
probability that a player recovers from a break-down (i.e., the player has lost their
serve one time more than their opponent) in a set to win. Table 1 displays the
probabilities of a player winning the set when down a break of serve.
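One way to evaluate the probability of winning a set from an arbitrary games score is a recursion over the remaining games. The sketch below is an assumption-laden illustration rather than the paper's formula: it takes the game-level and tiebreaker probabilities as inputs (`hold`, `brk`, and `tb` are hypothetical parameter names for the probabilities of winning a service game, a return game, and a tiebreaker) and assumes a tiebreaker is played at six games all.

```python
from functools import lru_cache


def set_win_from(a: int, b: int, serving_next: bool,
                 hold: float, brk: float, tb: float) -> float:
    """P(player wins the set) from games score a-b.

    hold: P(player wins a game on own serve)
    brk:  P(player wins a game on opponent's serve)
    tb:   P(player wins the tiebreaker at 6-6)
    serving_next: True if the player serves the next game
    """

    @lru_cache(maxsize=None)
    def win(x: int, y: int, serving: bool) -> float:
        if x == 7 or (x == 6 and y <= 4):
            return 1.0
        if y == 7 or (y == 6 and x <= 4):
            return 0.0
        if x == 6 and y == 6:
            return tb
        w = hold if serving else brk
        # Service alternates after every game.
        return w * win(x + 1, y, not serving) + (1 - w) * win(x, y + 1, not serving)

    return win(a, b, serving_next)


if __name__ == "__main__":
    # Down a break at 2-3 with the player to serve next, for illustrative inputs.
    print(set_win_from(2, 3, True, hold=0.80, brk=0.20, tb=0.50))
    # Coin-toss points: every game and the tiebreaker are 50-50.
    print(set_win_from(2, 5, True, hold=0.5, brk=0.5, tb=0.5))
```

Passing the game and tiebreaker probabilities in as arguments keeps the recursion independent of how those quantities are computed, so the same function can be fed either formula-based or empirical estimates.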
Previous work has indicated that although the iid assumption may not hold
exactly, analyses that aggregate data over multiple matches may still be fairly
accurate (Newton and Keller, 2005). To illustrate how the formulas may be
applied to data from tennis matches, we test this assumption on new data. Data
were obtained from the final 14 Men's singles matches played at the 2007
Wimbledon tennis tournament. This included 7 fourth round (or last sixteen)
matches (the eighth match was defaulted so no points were played), 4
quarter-finals, 2 semi-finals, and the final. Summary data from these matches are
displayed in Table 2.
 9  Mattieu       59    90    12    16     0     1
10  Gasquet       56    75    13    15     1     6
10  Tsonga        51    92     9    14     2     4
11  Baghdatis     62   100    12    16     4    10
11  Davydenko     65   113    11    17     4     8
12  Djokovic     114   174    19    23     5     8
12  Hewitt       100   157    20    23     8    12
13  Berdych       73   101    17    17    11    18
13  Bjorkman      68   119    10    17     3     3
14  Nadal         83   115    19    21     4    10
14  Youzhny       83   139    15    21     2     4
Total           2219  3364   442   532   118   208
The above result suggests that a server's performance suffers when under the
pressure of having to win a point in order to avoid losing their service game. To
accommodate such an effect, the formulas could be extended to allow different
probabilities on break points. Game, set, and match points might be other
occasions when the probability of winning a point deviates from baseline.
Another modification would be to allow the probability of winning a point to
depend on the side of the court from which the serve is delivered.
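As a rough illustration of how such comparisons can be made, the totals row of the table above can be turned into proportions. The column headings are not reproduced in this excerpt, so the sketch assumes the six totals are, in order: service points won, service points played, service games won, service games played, break points won by the server, and break points played. The pooled two-proportion z statistic is one simple check chosen for this sketch and is not necessarily the test used in the paper; `game_win` is the assumed closed form for $G(p)$.

```python
import math


def game_win(p: float) -> float:
    """Assumed closed form for P(server wins a game) under iid points."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**5 * q**3 / (1 - 2 * p * q)


def wimbledon_summary() -> dict:
    # Totals row; the meaning of each column is an assumption (see lead-in).
    pts_won, pts = 2219, 3364
    games_won, games = 442, 532
    bp_won, bp = 118, 208

    p_hat = pts_won / pts                         # overall service-point proportion
    p_bp = bp_won / bp                            # server's proportion on break points
    p_other = (pts_won - bp_won) / (pts - bp)     # server's proportion on other points
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / (pts - bp) + 1 / bp))
    return {
        "p_hat": p_hat,
        "predicted_hold": game_win(p_hat),        # iid prediction for service games won
        "observed_hold": games_won / games,
        "p_break_points": p_bp,
        "p_other_points": p_other,
        "z": (p_other - p_bp) / se,
    }


if __name__ == "__main__":
    for key, value in wimbledon_summary().items():
        print(f"{key:>16}: {value:.4f}")
```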
The independence assumption may also be violated. For instance, the
outcomes of successive points could be serially correlated, especially during
crucial stages of a match. If the complete sequence of points within a sample of
tennis matches was available, the independence assumption could be tested
against alternative models (see Jackson and Mosurski, 1997, for an examination
of the assumption that successive tennis points are independent). One family of
alternate models is the class of Markovian models. These posit that the outcome
of a point depends only on the current state of the match. A special case of the
class of Markovian models is the AR(1) model.
In this Section, applications of the tennis formulas that have the potential to
benefit commentators, administrators or rule makers, and coaches or players are
discussed.
Tennis commentators typically make many predictions about the future course
of a match when significant events happen or are about to happen. Statements such
as "if the player breaks serve now the match is effectively over" or "this
break-point is essentially a match-point" are often made. Such statements give the
sense that it is impossible for a player to come back and win the match. However,
as mentioned earlier, one of the gripping features of tennis is that the match is
never over until the final point is played and so a player always has the chance
to come back.
To illustrate the point, we evaluated the formulas for the probability that a
player recovers from a deficit of one service break to win a set. Probabilities were
evaluated under four different specifications of p and q for each of the nine
situations in which a player can be behind a service break (Table 3).
The first two columns of Table 3 correspond to the case where a player has a
better/worse point-winning differential than their opponent, while the final two
columns assume that players are evenly matched (i.e., have equal point-winning
probabilities). The fourth column predicts what would happen if the players
stopped playing tennis and instead tossed a coin to decide the outcome of each
subsequent point (i.e., as if the probability of winning a point is 0.5 for all
subsequent points).
Comparing the first two columns, a player with a point-winning advantage is
more likely to make a successful comeback than a player with a point-winning
disadvantage. Comparing the third and fourth columns, it is better to have a high
service winning probability when a player must win more games as the server
than the receiver in order to complete a comeback (e.g. when the score is 2-5 with
the player to serve next), whereas it is better to have point-winning probabilities
close to 0.5 if an equal number of serving and receiving games must be won (e.g.,
from 3-5).
Figure 5 reveals that a player expected to win around 51-53% of serving and
receiving points has a 0.04-0.05 increase in the probability of winning the match
when they play best of five-sets versus best of three-sets. If the point-winning
probability is only 47-49% a disadvantage of the same magnitude is incurred.
When the point winning probability is outside of [0.4, 0.6], the probability of
winning the match under both formats is so close to 0 or 1 that the difference is
essentially 0.
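The effect of match length can be explored directly once a set-winning probability is available. The expressions for $M_3$ and $M_5$ are not reproduced in this excerpt, so the following sketch assumes that sets are independent and that the player wins each set with the same probability `s` (a hypothetical input that would come from $S(p, q)$); the function names are illustrative.

```python
def match_win_best_of_3(s: float) -> float:
    """P(win a best-of-three match) when each set is won independently with probability s."""
    # Win 2-0, or win 2-1 (two orderings of the lost set): s^2 + 2 s^2 (1 - s).
    return s**2 * (3 - 2 * s)


def match_win_best_of_5(s: float) -> float:
    """P(win a best-of-five match) when each set is won independently with probability s."""
    # Win 3-0, 3-1 (3 orderings), or 3-2 (6 orderings).
    return s**3 * (10 - 15 * s + 6 * s**2)


if __name__ == "__main__":
    for s in (0.40, 0.45, 0.50, 0.55, 0.60, 0.70, 0.90):
        m3, m5 = match_win_best_of_3(s), match_win_best_of_5(s)
        print(f"s={s:.2f}  M3={m3:.4f}  M5={m5:.4f}  M5-M3={m5 - m3:+.4f}")
```

Under these assumptions the longer format favors whichever player is more likely to win a set, and the gap closes again as `s` approaches 0 or 1.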
In 1999, a proposal to replace the deuce-advantage system in a game with the
sudden death rule that the winner of the first point after deuce wins the game was
made public. The proposal drew comment from several quarters, including
players. Opinions varied. For instance, Pete Sampras and Andre Agassi were
reported as being against and for the change, respectively. Although it was
recognized that sudden death would shorten the average length of a game and thus
of a match itself, there was no analysis of how the change might affect the results
from tennis matches. For example, would more upsets be expected under the
sudden-death system?
The probability of the server winning a game under the sudden-death scoring
system is:
\[
\tilde{G}(p) = p^4 + 4p^4(1 - p) + 10p^4(1 - p)^2 + 20p^4(1 - p)^3 .
\]
This is simpler in appearance than $G(p)$ because the geometric series at
deuce is replaced with a single probability. The plot of $G(p) - \tilde{G}(p)$ against $p$
(Figure 6) shows that $G(p) < \tilde{G}(p)$ when $0 < p < 0.5$ and $G(p) > \tilde{G}(p)$ when
$0.5 < p < 1$, with equality at 0, 0.5, and 1. For example, a player who won 70.9%
of service points and 37.1% of receiving points would see their best of three-set
match winning probability fall from 83.31% to 82.10% (a drop of 0.0121 on the
probability scale). Similarly, a player with 65.7% and 41.8% point-winning
probabilities for serving and receiving would drop from 82.96% to 80.92%, a
drop of 0.0205 on the probability scale. This indicates that a better player is more
likely to lose a match under sudden death scoring. Although sudden death will
place a lot of importance on the point played at deuce, viewers might miss the
crescendo of excitement that builds up when a game continues well past deuce.
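The game-level comparison between the two scoring systems is easy to reproduce. The sketch below assumes the standard closed form for the deuce-advantage game (Equation (1) is not shown in this excerpt) alongside the sudden-death expression given above; the function names are illustrative only.

```python
def game_win(p: float) -> float:
    """Assumed closed form for P(server wins a game) with deuce-advantage scoring."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**5 * q**3 / (1 - 2 * p * q)


def game_win_sudden_death(p: float) -> float:
    """P(server wins a game) when the first point after deuce decides the game."""
    q = 1.0 - p
    # Same pre-deuce terms; at 3-3 a single point (won with probability p) decides.
    return p**4 * (1 + 4 * q + 10 * q**2 + 20 * q**3)


if __name__ == "__main__":
    for p in (0.3, 0.4, 0.5, 0.6, 0.7):
        diff = game_win(p) - game_win_sudden_death(p)
        print(f"p={p:.1f}  deuce={game_win(p):.4f}  "
              f"sudden={game_win_sudden_death(p):.4f}  diff={diff:+.4f}")
```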
To use the tennis formulas a player must acquire results on a large sample of
points they've played. This would enable a player (or their coach) to estimate
their probability of winning a match against a group of opponents and also
specific opponents. More importantly, a player could evaluate the amount that
their odds of winning a match would improve if they were able to increase their
point-winning probabilities by certain amounts. Such analyses could be used to
help decide how to prioritize training time and resources.
For example, a player that currently wins 65% of service points and 37% of
receiving points has a 59.85% chance of winning a best-of-three set match. If
focused training on either the serve or the return of serve would enable a player to
improve their point-winning probabilities by 0.01 then, since
$M_3(0.66, 0.37) = 0.645$ and $M_3(0.65, 0.38) = 0.647$, they would be (slightly)
better off focusing on their return game (because the baseline receiving
probability is closer to 0.5, increasing the receiving probability results in a bigger
increase in the probability of winning the match than does increasing the serving
probability). However, because serving performance is less reliant on opponents'
performances, it might be the case that the player could make a bigger
improvement on their serve than on their return of serve. If the respective
improvements were 0.011 and 0.01, the new match-win probabilities would be
0.650 and 0.647 respectively, indicating that in this case the better strategy is to
focus on their serving game. To assist with preparations to play a certain player,
similar calculations could be made using data from previous matches against that
player.
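A calculation of this kind can be sketched end to end. The code below is not the paper's implementation: it assumes the standard closed form for $G(p)$, a tiebreaker at six games all, and independent sets, and it uses a shortcut chosen for brevity. Under the iid assumption, a race to seven points (or six games) gives the same winner as playing out the first twelve points (or ten games), half on each player's serve, and resolving a tie separately. All function names are hypothetical.

```python
from math import comb


def game_win(p: float) -> float:
    """Assumed closed form for P(server wins a game) under iid points."""
    q = 1.0 - p
    return p**4 * (1 + 4 * q + 10 * q**2) + 20 * p**5 * q**3 / (1 - 2 * p * q)


def binom_pmf(n: int, w: float) -> list:
    return [comb(n, k) * w**k * (1 - w) ** (n - k) for k in range(n + 1)]


def race(n_each: int, w_serve: float, w_return: float, from_tie: float) -> float:
    """Play n_each units on each serve; a majority wins, a tie is worth from_tie."""
    total = 0.0
    for i, ps in enumerate(binom_pmf(n_each, w_serve)):
        for j, pr in enumerate(binom_pmf(n_each, w_return)):
            if i + j > n_each:
                total += ps * pr
            elif i + j == n_each:
                total += ps * pr * from_tie
    return total


def tiebreak_win(p: float, q: float) -> float:
    # Requires p(1 - q) + (1 - p)q < 1, otherwise play from six all never ends.
    from_six_all = p * q / (1 - p * (1 - q) - (1 - p) * q)
    return race(6, p, q, from_six_all)


def set_win(p: float, q: float) -> float:
    hold, brk = game_win(p), game_win(q)
    split = hold * (1 - brk) + (1 - hold) * brk
    from_five_all = hold * brk + split * tiebreak_win(p, q)
    return race(5, hold, brk, from_five_all)


def match_win_best_of_3(p: float, q: float) -> float:
    s = set_win(p, q)
    return s * s * (3 - 2 * s)


if __name__ == "__main__":
    for p, q in ((0.65, 0.37), (0.66, 0.37), (0.65, 0.38)):
        print(p, q, round(match_win_best_of_3(p, q), 4))
```

Running the script for the three (serve, return) pairs gives figures that can be compared with the probabilities quoted in the text.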
7. Conclusion
This paper has highlighted the wide range of opportunities for using probability
models and statistical analysis in tennis. Like baseball, the scoring system in
tennis consists of repeated contests involving a fixed number of states. This
makes tennis a prime candidate for detailed modeling and statistical analysis.
The methods and analyses in this paper can be extended in several ways. Our
analysis of the server's performance on break points versus non-break points
suggests that a more appropriate model would allow for different probabilities at
different stages of a match or even sides of the court. In subsequent work,
analyses could be undertaken to determine if there are other stages of a match
where point-winning probabilities are liable to differ, and to evaluate whether the
outcomes of successive points are correlated, especially during crucial stages of a
match.
The use of tennis formulas has the potential to yield several benefits (besides
making profits for savvy gamblers). As long as appropriate explanations are
provided, probabilistically based predictions of the outcome of a tennis match
may enhance television coverage of tennis the way baseball statistics and player
ratings have extended interest in that sport. Administrators and other concerned
parties could use tennis formulas to evaluate the implications of any proposed
scoring changes on the results of tennis matches and, in particular, on the
prevalence of upsets. Finally, tennis formulas provide players and coaches with
another tool to use in training and in developing match strategies.
The development of tennis formulas has the potential to increase the public's
interest in tennis. It is hoped that this paper raises the profile of probabilistic
modeling of tennis and leads to further development and application of models of
tennis and associated formulas.
Appendix
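\[
A = \begin{pmatrix}
1 & 3 & 0 & 4 & 0 & 0 \\
3 & 3 & 1 & 4 & 0 & 0 \\
4 & 4 & 0 & 3 & 1 & 0 \\
6 & 3 & 2 & 4 & 0 & 0 \\
16 & 4 & 1 & 3 & 1 & 0 \\
6 & 5 & 0 & 2 & 2 & 0 \\
10 & 2 & 3 & 5 & 0 & 0 \\
40 & 3 & 2 & 4 & 1 & 0 \\
30 & 4 & 1 & 3 & 2 & 0 \\
4 & 5 & 0 & 2 & 3 & 0 \\
5 & 1 & 4 & 6 & 0 & 0 \\
50 & 2 & 3 & 5 & 1 & 0 \\
100 & 3 & 2 & 4 & 2 & 0 \\
50 & 4 & 1 & 3 & 3 & 0 \\
5 & 5 & 0 & 2 & 4 & 0 \\
1 & 1 & 5 & 6 & 0 & 0 \\
30 & 2 & 4 & 5 & 1 & 0 \\
150 & 3 & 3 & 4 & 2 & 0 \\
200 & 4 & 2 & 3 & 3 & 0 \\
75 & 5 & 1 & 2 & 4 & 0 \\
6 & 6 & 0 & 1 & 5 & 0 \\
1 & 0 & 6 & 6 & 0 & 1 \\
36 & 1 & 5 & 5 & 1 & 1 \\
225 & 2 & 4 & 4 & 2 & 1 \\
400 & 3 & 3 & 3 & 3 & 1 \\
225 & 4 & 2 & 2 & 4 & 1 \\
36 & 5 & 1 & 1 & 5 & 1 \\
1 & 6 & 0 & 0 & 6 & 1
\end{pmatrix}.
\]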
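The matrix can be evaluated mechanically. Equation (A1) is not reproduced in this excerpt, so the sketch below assumes it has the same sum-of-products form as (A2): each row of $A$ contributes its first entry times $p$, $1 - p$, $q$, $1 - q$, and a six-points-all term raised to the powers in columns two to six, where the six-points-all term is taken to be $d(p, q) = pq / \{1 - p(1 - q) - (1 - p)q\}$. The name `tiebreak_win_from_matrix` is hypothetical.

```python
# Rows: coefficient, then exponents of p, (1 - p), q, (1 - q), and d(p, q).
A = [
    (1, 3, 0, 4, 0, 0),
    (3, 3, 1, 4, 0, 0), (4, 4, 0, 3, 1, 0),
    (6, 3, 2, 4, 0, 0), (16, 4, 1, 3, 1, 0), (6, 5, 0, 2, 2, 0),
    (10, 2, 3, 5, 0, 0), (40, 3, 2, 4, 1, 0), (30, 4, 1, 3, 2, 0),
    (4, 5, 0, 2, 3, 0),
    (5, 1, 4, 6, 0, 0), (50, 2, 3, 5, 1, 0), (100, 3, 2, 4, 2, 0),
    (50, 4, 1, 3, 3, 0), (5, 5, 0, 2, 4, 0),
    (1, 1, 5, 6, 0, 0), (30, 2, 4, 5, 1, 0), (150, 3, 3, 4, 2, 0),
    (200, 4, 2, 3, 3, 0), (75, 5, 1, 2, 4, 0), (6, 6, 0, 1, 5, 0),
    (1, 0, 6, 6, 0, 1), (36, 1, 5, 5, 1, 1), (225, 2, 4, 4, 2, 1),
    (400, 3, 3, 3, 3, 1), (225, 4, 2, 2, 4, 1), (36, 5, 1, 1, 5, 1),
    (1, 6, 0, 0, 6, 1),
]


def tiebreak_win_from_matrix(p: float, q: float) -> float:
    """Evaluate the assumed sum-of-products form of TB(p, q) using the rows of A."""
    # d is undefined when p(1 - q) + (1 - p)q = 1 (neither player ever loses serve).
    d = p * q / (1 - p * (1 - q) - (1 - p) * q)
    return sum(
        c * p**a * (1 - p) ** b * q**e * (1 - q) ** f * d**g
        for c, a, b, e, f, g in A
    )


if __name__ == "__main__":
    for p, q in ((0.5, 0.5), (0.7, 0.5), (0.65, 0.37)):
        print(p, q, tiebreak_win_from_matrix(p, q))
```

Each coefficient is a product of binomial counts for the ways the lost points can fall on the player's serving and receiving points, which gives a quick way to spot a mistyped row.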
The formula for computing S ( p, q ) is given by
\[
\begin{aligned}
S(p, q) = \sum_{i=1}^{21} & B(i,1)\, G(p)^{B(i,2)} (1 - G(p))^{B(i,3)} G(q)^{B(i,4)} (1 - G(q))^{B(i,5)} \\
& \times \bigl( G(p)G(q) + \{G(p)(1 - G(q)) + (1 - G(p))G(q)\}\, TB(p, q) \bigr)^{B(i,6)},
\end{aligned}
\tag{A2}
\]
References
Morris, C. N. (1977). The most important points in tennis. In: Optimal Strategies
in Sport (S. P. Ladany and R. E. Nichol, Eds.), pp. 131-140. Amsterdam: North
Holland.
Betfair Tennis. (2008). http://form.tennis.betfair.com/tennis
Jackson, D. and K. Mosurski. (1997). Heavy defeats in tennis: Psychological
momentum or random effects. Chance, 10, 27-34.