
Selected for a Viewpoint in Physics

week ending
PRL 103, 218701 (2009) PHYSICAL REVIEW LETTERS 20 NOVEMBER 2009

Zipf’s Law in the Popularity Distribution of Chess Openings


Bernd Blasius¹ and Ralf Tönjes²
¹ICBM, University Oldenburg, 26111 Oldenburg, Germany
²Institut für Physik, Universität Potsdam, 14415 Potsdam, Germany and Ochadai Academic Production, Ochanomizu University, Tokyo 112-8610, Japan
(Received 28 February 2008; revised manuscript received 4 August 2009; published 16 November 2009)
We perform a quantitative analysis of extensive chess databases and show that the frequencies of
opening moves are distributed according to a power law with an exponent that increases linearly with the
game depth, whereas the pooled distribution of all opening weights follows Zipf’s law with universal
exponent. We propose a simple stochastic process that is able to capture the observed playing statistics and
show that the Zipf law arises from the self-similar nature of the game tree of chess. Thus, in the case of
hierarchical fragmentation the scaling is truly universal and independent of a particular generating
mechanism. Our findings are of relevance in general processes with composite decisions.

DOI: 10.1103/PhysRevLett.103.218701 PACS numbers: 89.20.-a, 05.40.-a, 89.75.Da

Decision making refers to situations where individuals have to select a course of action among multiple alternatives [1]. Such processes are ubiquitous, ranging from one’s personal life to business, management, and politics, and have a large part in shaping our life and society. Decision making is an immensely complex process and, given the number of factors that influence each choice, a quantitative understanding in terms of statistical laws remains a difficult and often elusive goal. Investigations are complicated by the shortage of reliable data sets, since information about human behavior is often difficult to quantify and not easily available in large numbers, whereas decision processes typically involve a huge space of possible courses of action. Board games, such as chess, provide a well-documented case where the players in turn select their next move among a set of possible game continuations that are determined by the rules of the game.

Human fascination with the game of chess is long-standing and pervasive [2], not least due to the sheer infinite richness of the game. The total number of different games that can be played, i.e., the game-tree complexity of chess, has roughly been estimated as the average number of legal moves in a chess position to the power of the length of a typical game, yielding the Shannon number $30^{80} \approx 10^{120}$ [3]. Obviously only a small fraction of all possible games can be realized in actual play. But even during the first moves of a game, when the game complexity is still manageable, not all possibilities are explored equally often. While the history of successful initial moves has been classified in opening theory [4], not much is known about the mechanisms underlying the formation of fashionable openings [5]. With the recent appearance of extensive databases, playing habits have become accessible to quantitative analysis, making chess an ideal platform for analyzing human decision processes.

The set of all possible games can be represented by a directed graph whose nodes are game situations and whose edges correspond to legal moves from each position (Fig. 1). Every opening is represented by its move sequence as a directed path starting from the initial node. We will differentiate between two game situations if they are reached by different move sequences. This way the graph becomes a game tree, and each node $\sigma$ is uniquely associated with an opening sequence.

Using a chess database [6] we can measure the popularity $n_\sigma$ or weight of every opening sequence as the number of occurrences in the database. We find that the weighted game tree of chess is self-similar and the frequencies $S(n)$ of weights follow a Zipf law [7]

$$S(n) \sim n^{-\alpha} \tag{1}$$

with universal exponent $\alpha = 2$. Note the precise scaling in the histogram of weight frequencies $S(n)$ and in the cumulative distribution $C(n)$ over the entire observable range [Fig. 2(a)]. Similar power law distributions with universal exponent have been identified in a large number of natural, economic, and social systems [7–15], a fact which has come to be known as the Zipf or Pareto law [7,8]. If we count only the frequencies $S_d(n)$ of opening weights $n$ after the first $d$ moves we still find broad distributions consistent with power law behavior $S_d(n) \sim n^{-\alpha_d}$ [Fig. 2(b)]. The exponents $\alpha_d$ are not universal, however, but increase linearly with $d$ [Fig. 2(b), inset]. The results are robust: similar power laws could be observed in different databases and other board games, regardless of the considered game depth, constraints on player levels, or the decade when the games were played. Stretching over 6 orders of magnitude, the distributions reported here are among the most precise examples of power laws known today in social data sets.

As seen in Fig. 1, for each node $\sigma$ the weights of its subtrees define a partition of the integers $(1 \ldots n_\sigma)$. The assumption of self-similarity implies a statistical equivalence of the branching in the nodes of the tree. We can thus define the branching ratio distribution over the real interval $r \in [0, 1]$ by the probability $Q(r|n)$ that a random pick

0031-9007/09/103(21)/218701(4) 218701-1 © 2009 The American Physical Society


[Figure 2: log-log plots. (a) Frequency S(n) versus popularity n, with series "histogram", "logbin", and "Zipf-distribution"; inset: C(n) versus n. (b) Frequency S_d(n) versus popularity n, with series "d=4 logbin", "d=16 histogram", "d=16 logbin", and "d=22 logbin"; inset: exponent versus d.]
from the numbers $1 \ldots n$ is in a subset of size smaller than or equal to $rn$. Taking $n$ to infinity, $Q(r|n)$ may have a continuous limit $Q(r)$ for which we find the probability density function (PDF) $q(r) = Q'(r)$. If the limit distribution $q(r)$ of branching ratios exists it carries the fingerprint of the generating process. For instance, the continuum limit of the branching ratio distribution for a Yule-Simon preferential growth process [13] in each node of the tree would be $q(r) \sim r^{\beta}$, where $\beta < 0$ is a model specific parameter. On the other hand, in a $k$-ary tree where each game continuation has a uniformly distributed random a priori probability the continuum limit corresponds to a random stick breaking process in each node, yielding $q(r) \sim (1 - r)^{k-2}$. For the weighted game tree of chess $q(r)$ can directly be measured from the database [Fig. 3(a)]. We find that $q(r)$ is remarkably constant over most of the interval but diverges with exponent 0.5 as $r \to 1$, and is very well fitted by the parameterless arcsine distribution

$$q(r) = \frac{2}{\pi \sqrt{1 - r^2}}. \tag{2}$$

The form of the branching ratio distribution suggests that in the case of chess there is no preferential growth process involved, but something entirely different which must be rooted in the decision process during the opening moves of a chess game [5].

In the following we show that the asymptotic Zipf law in the weight frequencies arises independently from the specific form of the distribution $q(r)$, and hence, the microscopic rules of the underlying branching process. Consider $N$ realizations of a general self-similar random segmentation process of $N$ integers, with paths $(\sigma_0, \sigma_1, \ldots)$ in the corresponding weighted tree. In the context of chess each realization of this process corresponds to a random game from the database of $N$ games (e.g., dark shading in Fig. 1). The weights $n_d = n_{\sigma_d}$ describe a multiplicative random process

$$n_d = N \prod_{i=1}^{d} r_i, \qquad n_0 = N, \tag{3}$$

where the branching ratios $r_d = n_d / n_{d-1}$ for sufficiently large $n_d$ are distributed according to $q(r)$ independent of $d$. For lower values of $n_d$ the continuous branching ratio distribution is no longer a valid approximation and a

FIG. 1 (color online). (a) Schematic representation of the weighted game tree of chess based on the SCIDBASE [6] for the first three half moves. Each node indicates a state of the game. Possible game continuations are shown as solid lines together with the branching ratios $r_d$. Dotted lines symbolize other game continuations, which are not shown. (b) Alternative representation emphasizing the successive segmentation of the set of games, here indicated for games following a 1.d4 opening until the fourth half move $d = 4$. Each node $\sigma$ is represented by a box of a size proportional to its frequency $n_\sigma$. In the subsequent half move these games split into subsets (indicated vertically below) according to the possible game continuations. Highlighted in (a) and (b) is a popular opening sequence 1.d4 Nf6 2.c4 e6 (Indian defense).

FIG. 2 (color online). (a) Histogram of weight frequencies $S(n)$ of openings up to $d = 40$ in the Scid database and with logarithmic binning. A straight line fit (not shown) yields an exponent of $\alpha = 2.05$ with a goodness of fit $R^2 > 0.9992$. For comparison, the Zipf distribution Eq. (8) with $\bar{\mu} = 1$ is indicated as a solid line. Inset: number $C(n) = \sum_{m=n+1}^{N} S(m)$ of openings with a popularity $m > n$. $C(n)$ follows a power law with exponent 1.04 ($R^2 = 0.994$). (b) Number $S_d(n)$ of openings of depth $d$ with a given popularity $n$ for $d = 16$ and histograms with logarithmic binning for $d = 4$, $d = 16$, and $d = 22$. Solid lines are regression lines to the logarithmically binned data ($R^2 > 0.99$ for $d < 35$). Inset: slope $\alpha_d$ of the regression line as a function of $d$ and the analytical estimation Eq. (6) using $N = 1.4 \times 10^6$ and $\beta = 0$ (solid line).
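The multiplicative process of Eq. (3), with branching ratios drawn from the arcsine distribution of Eq. (2), can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' simulation code: the function names are hypothetical, the weights are kept continuous, and the rule that any weight falling to 1 or below is clamped to the absorbing value 1 is an assumption about the discretization, which the text does not spell out. The arcsine sampler uses inverse-transform sampling, since the cumulative distribution of Eq. (2) is $Q(r) = (2/\pi) \arcsin r$.

```python
import math
import random
from collections import Counter


def sample_arcsine(rng):
    """Draw r in [0, 1] from q(r) = 2 / (pi * sqrt(1 - r^2)) by inverse transform."""
    # CDF: Q(r) = (2 / pi) * arcsin(r), hence r = sin(pi * u / 2) for uniform u.
    return math.sin(0.5 * math.pi * rng.random())


def sample_uniform(rng):
    """Draw r from the uniform approximation q(r) = 1."""
    return rng.random()


def simulate_weights(num_games, depth, sampler, rng):
    """Return the weight n_d reached by each of num_games realizations of Eq. (3)."""
    weights = []
    for _ in range(num_games):
        n = float(num_games)  # n_0 = N
        for _ in range(depth):
            if n <= 1.0:  # assumption: n = 1 is treated as absorbing
                break
            n = max(1.0, n * sampler(rng))
        weights.append(n)
    return weights


def opening_counts(weights):
    """Estimate S_d(n): a node of weight n is visited by n games, so divide by n."""
    visits = Counter(int(round(w)) for w in weights)
    return {n: count / n for n, count in sorted(visits.items())}


if __name__ == "__main__":
    rng = random.Random(42)
    weights = simulate_weights(num_games=20000, depth=10, sampler=sample_arcsine, rng=rng)
    for n, s in list(opening_counts(weights).items())[:5]:
        print(n, s)
```

Passing `sample_uniform` instead of `sample_arcsine` gives the uniform approximation $q(r) = 1$ for comparison. The division by $n$ in `opening_counts` mirrors the relation $S_d(n) = N p_d(n)/n$ between the per-game weight distribution and the number of distinct openings.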

node of weight one has at most one subtree; i.e., the state $n_d = 1$ is absorbing.

To calculate the PDF $p_d(n)$ of the random variable $n_d$ after $d$ steps it is convenient to consider the log-transformed variables $\nu = \log(N/n)$ and $\mu = -\log r$. The corresponding process $\{\nu_d\}$ is a random walk $\nu_d = \sum_{i=1}^{d} \mu_i$ with non-negative increments $\mu_i$, and its PDF $\pi_d(\nu)$ transforms as $n\, p_d(n) = \pi_d(\nu)$. An analytic solution can be obtained for the class

$$q(r) = (1 + \beta)\, r^{\beta}, \qquad 0 \le r \le 1, \tag{4}$$

of power law distributions, which typically arise in preferential attachment schemes. In this case the jump process $\nu_d$ is Poissonian and distributed according to a gamma distribution $\pi_d(\nu) = \frac{(1+\beta)^d}{(d-1)!}\, \nu^{d-1} e^{-(1+\beta)\nu}$. After retransformation to the original variables, and noting that the probability $p_d(n)$ for a single node at distance $d$ from the root to have weight $n$ gives the expected number $S_d(n)$ of such nodes in $N$ realizations of the random process as $S_d(n) = N p_d(n)/n$, one obtains in particular

$$S_d(n) = \frac{(1+\beta)^d}{N (d-1)!} \left( \log\frac{N}{n} \right)^{d-1} \left( \frac{N}{n} \right)^{1-\beta}. \tag{5}$$

The functions $S_d(n)$ are strongly skewed and can exhibit power-law-like scaling over several decades. A logarithmic expansion for $1 < n \ll N$ shows that they approximately follow a scaling law $S_d(n) \sim n^{-\alpha_d}$ with exponent

$$\alpha_d = (1 - \beta) + \frac{1}{\log N}\,(d - 1). \tag{6}$$

The exponent $\alpha_d$ increases linearly with the game depth $d$, with a logarithmic finite size correction, which is in excellent agreement with the chess database [Fig. 2(b), inset]. Power laws in the stationary distribution of random segmentation and multiplicative processes have been reported before [9] and can be obtained by introducing slight modifications, such as reflecting boundaries, frozen segments, merging, or reset events [16–18]. In contrast, the approximate scaling of $S_d(n)$ in Eq. (5) is fundamentally different, as our process does not admit a stationary distribution. The exponents $\alpha_d$ increase due to the finite size of the database.

As shown in Fig. 3(b) we find excellent agreement between the weight frequencies $S_d(n)$ in the chess database and direct simulations of the multiplicative process, Eq. (3), using the arcsine distribution Eq. (2). If the branching ratios are approximated by a uniform distribution $q(r) = 1$ the predicted values of $S_d(n)$ are systematically too small, since a uniform distribution yields a larger flow into the absorbing state $n = 1$ than observed in the database. Still, due to the asymptotic behavior of $q(r)$ for $r \to 0$, this approximation yields the correct slope in the log-log plot so that the exponent $\alpha_d$ can be estimated quite well based on Eq. (6) with $\beta = 0$.

By observing that $S_d(n)$ in Eq. (5) is the $d$th term in a series expansion of an exponential function, we find the weight distribution in the whole game tree, $S(n) = \sum_d S_d(n)$, to be an exact Zipf law. For branching ratio distributions $q(r)$ different from Eq. (4) the weight frequencies are difficult to obtain analytically. But using renewal theory [19] the scaling can be shown to hold asymptotically for $n \ll N$ and a large class of distributions $q(r)$. For this, note that the random variable $\delta(\nu) = \max(d : \nu_d < \nu)$ is a renewal process in $\nu$. The expectation $E[\delta(\nu)]$ is the corresponding renewal function, related to the distributions of the $\nu_d$ as $\sum_{d=1}^{\infty} \mathrm{Prob}(\nu_d < \nu) = E[\delta(\nu)]$. If the expected value $\bar{\mu} = E[\mu] = E[-\log r]$ is finite and positive [e.g., for the distribution (4) $\bar{\mu} = 1/(1+\beta)$], the renewal theorem provides

$$\lim_{\nu \to \infty} \frac{d}{d\nu} E[\delta(\nu)] = \frac{1}{\bar{\mu}}. \tag{7}$$

Thus, we obtain $\lim_{\nu \to \infty} \sum_{d=1}^{\infty} \pi_d(\nu) = 1/\bar{\mu}$ and finally

$$\lim_{(n/N) \to 0} S(n) = \frac{N}{\bar{\mu}\, n^2}. \tag{8}$$

Thus, the multiplicative random process [Eq. (3)] with any well-behaved branching ratio distribution $q(r)$ on the interval [0, 1] always leads to an asymptotically universal scaling for $n \ll N$ [compare also the excellent fit of Eq. (8)

[Figure 3: (a) q(r) versus r on the interval from 0 to 1. (b) Log-log plot with vertical axis labeled "frequency Pd(n)" versus n, with series "database d=22", "simulation arcsin", "simulation uniform", and "theory uniform".]

FIG. 3 (color online). (a) Probability density $q(r)$ of branching ratios $r$ sampled from all games in the Scid database with a bin size of $\Delta r = 0.01$ and arcsine distribution Eq. (2) (black solid line). Every edge of the weighted game tree, from nodes of size $n_{d-1}$ to $n_d$, contributes to the bin corresponding to $r = n_d / n_{d-1}$ with weight $r$. We disregarded clusters with $n_d < 100$ so that, in principle, a cluster could contribute to any of the bins. We found $q(r)$ to be depth independent. (b) Distribution of opening popularities $S_d(n)$ for $d = 22$ obtained from the Scid database (black) and from a direct simulation of the multiplicative process Eq. (3), with branching ratios $q(r)$ taken from a uniform or arcsine distribution. Further indicated is the theoretical result Eq. (5) (dashed line). Similar results are obtained for other values of $d$.
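As a numerical consistency check on Eqs. (5), (6), and (8), the short Python sketch below evaluates $S_d(n)$ for the power law class of Eq. (4), sums it over the depth $d$, and compares the result with the Zipf form $(1+\beta) N / n^2$, which is Eq. (8) with mean increment $1/(1+\beta)$. The function names, the truncation depth `max_depth`, and the finite-difference step are illustrative assumptions and are not taken from the paper. The local logarithmic slope of Eq. (5) is $(1-\beta) + (d-1)/\log(N/n)$, so at $n = 1$ it coincides with the exponent of Eq. (6).

```python
import math


def s_d(n, d, N, beta=0.0):
    """Eq. (5): expected number of depth-d openings with popularity n (1 <= n < N)."""
    log_ratio = math.log(N / n)
    # Evaluate in log space; lgamma(d) = log((d - 1)!) avoids factorial overflow.
    log_value = (
        d * math.log(1.0 + beta)
        - math.log(N)
        - math.lgamma(d)
        + (d - 1) * math.log(log_ratio)
        + (1.0 - beta) * log_ratio
    )
    return math.exp(log_value)


def s_total(n, N, beta=0.0, max_depth=200):
    """Sum of Eq. (5) over d = 1 .. max_depth (truncated exponential series)."""
    return sum(s_d(n, d, N, beta) for d in range(1, max_depth + 1))


def zipf(n, N, beta=0.0):
    """Eq. (8) with mean increment 1 / (1 + beta): S(n) = (1 + beta) * N / n^2."""
    return (1.0 + beta) * N / n**2


def alpha_d(d, N, beta=0.0):
    """Eq. (6): depth-dependent exponent with logarithmic finite size correction."""
    return (1.0 - beta) + (d - 1) / math.log(N)


def local_exponent(n, d, N, beta=0.0, eps=1e-4):
    """Numerical slope -d log S_d / d log n around n (central difference)."""
    upper = math.log(s_d(n * math.exp(eps), d, N, beta))
    lower = math.log(s_d(n * math.exp(-eps), d, N, beta))
    return -(upper - lower) / (2.0 * eps)


if __name__ == "__main__":
    N = 1.4e6
    for n in (1, 10, 1000):
        print(n, s_total(n, N), zipf(n, N))
    for d in (4, 16, 22):
        print(d, alpha_d(d, N), local_exponent(1.0, d, N))
```

Setting `alpha_d` equal to 2 and solving for `d` gives $d = 1 + (1+\beta) \log N$, the depth at which the exponent of Eq. (6) crosses the Zipf value.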
to the chess data in Fig. 2(a)]. In [15] the same Zipf-law scaling was found for the sizes of the directory trees in a computer cluster. The authors propose a growth mechanism based on linear preferential attachment. Here we have shown that the exponent $\alpha = 2$ for the weight distribution of subtrees in a self-similar tree is truly universal in the sense that it is the same for a much larger class of generating processes and not restricted to preferential attachment or growth.

There are direct implications of our theory for general composite decision processes, where each action is assembled from a sequence of $d$ mutually exclusive choices. What in chess corresponds to an opening sequence may be a multivariate strategy or a customized ordering in other situations. The question of how such strategies are distributed is important for management and marketing [20]. One consequence of our theory is that in a process of $d$ composite decisions the distribution $S_d(n) \sim n^{-\alpha_d}$ of decision sequences, or strategies, which occur $n$ times shows a transition from low exponents $\alpha_d \le 2$, where a few strategies are very common, to higher exponents $\alpha_d > 2$, where individual strategies are dominating. This is due to the divergence of the first moment in power laws with exponents smaller than 2 [11]. From Eq. (6) the critical number $d_{cr}$ of decisions at which this transition occurs depends logarithmically on the sample size $N$ and on the leading order $\beta$ of $q(r)$ near zero as

$$d_{cr} = 1 + (1 + \beta) \log N. \tag{9}$$

Applied to the chess database with $N = 1.4 \times 10^6$ we obtain $d_{cr} \approx 15$ [see also Figs. 2(b), inset, and 4]. This separates the database into two very different regimes: in their initial phase ($d < d_{cr}$) the majority of chess games are distributed among a small number of fashionable openings (for $d = 12$, for example, 80% of all games in the database are concentrated in about 23% of the most popular openings), whereas beyond the critical game depth rarely used move sequences are dominating such that in aggregate they comprise the majority of all games (Fig. 4). Note that this result arises from the statistics of iterated decisions and does not indicate a crossover of playing behavior with increasing game depth.

[Figure 4: (a) Fraction of openings Q versus fraction of games W, with curves for d = 5, 10, 15, 20, and 25. (b) Fraction of openings Q, for W = 0.5, 0.8, and 0.9, and Gini coefficient G versus game depth d.]

FIG. 4. Inequality [11,21] of the distribution $S_d(n)$. (a) Proportion $W$ of games that is concentrating in the fraction $Q$ of the most popular openings, for several levels of the game depth $d$. (b) $Q$ as a function of $d$ for three different values of $W$ (solid lines) and Gini coefficient $G = 1 - 2 \int_0^1 Q(W)\, dW$ as a function of game depth (dotted line).

Our study suggests the analysis of board games as a promising new perspective for statistical physics. The enormous amount of information contained in game databases, with its evolution resolved in time and in relation to an evolving network of players, provides a rich environment to study the formation of fashions and collective behavior in social systems.

We are indebted to Andriy Bandrivskyy for invaluable help with the data analysis.

[1] H. Simon, Administrative Behaviour (Macmillan, New York, 1947); I. L. Janis and L. Mann, Decision Making: A Psychological Analysis of Conflict, Choice, and Commitment (Free Press, New York, 1977).
[2] H. J. R. Murray, A History of Chess (Oxford University Press, London, 2002), reprint.
[3] C. E. Shannon, Philos. Mag. 41, 256 (1950).
[4] The Encyclopedia of Chess Openings A–E, edited by A. Matanovic (Chess Informant, Beograd, Serbia, 2001), 4th ed.
[5] M. Levene and J. Bar-Ilan, Computer Journal (UK) 50, 567 (2007).
[6] Here we present results based on ScidBase (http://scid.sourceforge.net) with $N = 1.4 \times 10^6$ recorded games. Each game was uniquely coded by a string (up to $d = 40$ half moves) and the game strings were sorted alphabetically, so that all games following the same move sequence up to a given game depth $d$ were grouped together in clusters. Popularities were obtained by counting the cluster sizes $n_d$.
[7] G. K. Zipf, Human Behaviour and the Principle of Least-Effort (Addison-Wesley, Cambridge, 1949).
[8] V. Pareto, Cours d’Economie Politique (Droz, Geneva, 1896).
[9] D. Sornette, Critical Phenomena in Natural Sciences (Springer, Heidelberg, 2003), 2nd ed.
[10] M. Mitzenmacher, Internet Math. 1, 226 (2004).
[11] M. E. J. Newman, Contemp. Phys. 46, 323 (2005).
[12] J. Willis and G. Yule, Nature (London) 109, 177 (1922).
[13] H. A. Simon, Biometrika 42, 425 (1955).
[14] R. A. K. Cox et al., J. Cult. Econ. 19, 333 (1995); S. Redner, Eur. Phys. J. B 4, 131 (1998); A. L. Barabasi and R. Albert, Science 286, 509 (1999); R. L. Axtell, Science 293, 1818 (2001); X. Gabaix et al., Nature (London) 423, 267 (2003).
[15] K. Klemm et al., Phys. Rev. Lett. 95, 128701 (2005).
[16] H. Kesten, Acta Math. 131, 207 (1973); D. Sornette, Phys. Rev. E 57, 4811 (1998); D. Sornette and R. Cont, J. Phys. I 7, 431 (1997); S. C. Manrubia and D. H. Zanette, Phys. Rev. E 59, 4945 (1999).
[17] P. L. Krapivsky et al., Phys. Rev. E 61, R993 (2000).
[18] J. R. Banavar et al., Phys. Rev. E 69, 036123 (2004).
[19] W. Feller, An Introduction to Probability Theory and Its Applications (John Wiley & Sons, New York, 1971), Vol. 2.
[20] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More (Hyperion, New York, 2006).
[21] J. L. Gastwirth, Rev. Econ. Stat. 54, 306 (1972).
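The clustering procedure described in Ref. [6] can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, games are assumed to be given as lists of half moves in algebraic notation, and prefixes are counted with a hash map rather than by sorting the game strings alphabetically, which yields the same cluster sizes. Games shorter than a given depth are simply not counted at that depth, which is an assumption the reference does not address.

```python
from collections import Counter


def opening_popularities(games, max_depth=40):
    """Count how often each opening (move-sequence prefix) occurs.

    games: iterable of move lists (half moves), e.g. ["d4", "Nf6", "c4", "e6"].
    Returns {depth d: Counter mapping prefix tuple -> popularity n_d}.
    """
    counts = {d: Counter() for d in range(1, max_depth + 1)}
    for moves in games:
        # Assumption: a game only contributes to depths it actually reaches.
        for d in range(1, min(len(moves), max_depth) + 1):
            counts[d][tuple(moves[:d])] += 1
    return counts


def weight_frequencies(popularity_counter):
    """S_d(n): number of openings at one depth that have popularity n."""
    return Counter(popularity_counter.values())


if __name__ == "__main__":
    toy_games = [
        ["d4", "Nf6", "c4", "e6"],
        ["d4", "Nf6", "c4", "g6"],
        ["d4", "d5"],
        ["e4", "e5"],
    ]
    popularities = opening_popularities(toy_games, max_depth=4)
    for d in range(1, 5):
        print(d, dict(weight_frequencies(popularities[d])))
```

Summing `weight_frequencies` over all depths gives the pooled histogram $S(n)$, while a single depth gives $S_d(n)$.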

