Brief Communication

Estimation of Zipf Parameters


Paul Travis Nicholls
School of Library and Information Science, University of Western Ontario, London, Ontario, Canada N6G 1H1
The two-parameter size-frequency form of the Zipf distribution is defined as

g(x) = kx^{-b};  k, b > 0;  x = 1, 2, ..., x_max.   (1)
When g(x) represents the number of researchers expected to make x published contributions to a subject field, we have the generalized Lotka model of author productivity [1]. See Chen and Leimkuhler [2] for an alternative formulation which explicitly takes account of the fact that, for empirical datasets, it is unrealistic to assume that the values of x will run continuously from 1 to x_max.
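As a point of reference, the expected frequencies under Eq. (1) are simple to compute. The following minimal Python sketch is not part of the original paper; it assumes the probability version of the model with k = ζ(b)^{-1}, and the function name is illustrative only.

import numpy as np
from scipy.special import zeta  # Riemann zeta function, defined for b > 1

def expected_authors(total_authors, b, x_max=10):
    # Expected number of authors making x = 1..x_max contributions under
    # g(x) = k * x**(-b), with k = 1/zeta(b) so the proportions sum to one.
    x = np.arange(1, x_max + 1, dtype=float)
    proportions = x ** (-b) / zeta(b)
    return total_authors * proportions

# With b = 2 (Lotka's value), about 61% of authors are expected to make a
# single contribution, since 1/zeta(2) = 6/pi**2 is approximately 0.608.
print(expected_authors(1000, 2.0)[:3])  # approximately [608., 152., 68.]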
Much of the mystery surrounding the empirical validity
of the Lotka hypothesis is attributable to methodological
inconsistencies in the validity studies that have appeared [3].
Recently, standard procedures have been proposed for such
validity tests [4,5]. Parameter estimation is a central feature
of any such procedure, and is of crucial importance to the
resulting goodness-of-fit.
Parameter Estimation
A number of criteria are available to estimate the parame-
ter b of Eq. (1). An a priori estimate consists of the assump-
tion that b is a constant, perhaps 2 (following Lotka) or 3
(following Price). A graphical estimate is obtained by plot-
ting the data on logarithmic graph paper and visually esti-
mating the slope of a regression line. A more objective
approach is linear least squares regression. The LLS esti-
mate is given by
[N Σ ln x ln g(x) − (Σ ln x)(Σ ln g(x))] ÷ [N Σ (ln x)² − (Σ ln x)²];  x = 1, 2, ..., x_max.   (2)
When the LLS estimate is calculated using only a subset of the full N categories, as when the more prolific authors are excluded, we have truncated linear least squares.
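For concreteness, a sketch of the LLS and truncated LLS estimates in Python follows. It is not taken from the original paper: the data are assumed to arrive as paired arrays of x values and observed frequencies g(x), the slope is reported with its sign flipped so that a positive b is returned, and x_cut is a placeholder for whatever truncation criterion is adopted.

import numpy as np

def lls_b(x, g):
    # Eq. (2): least squares slope of ln g(x) on ln x over the N categories.
    # With g(x) = k * x**(-b), the fitted slope estimates -b, so flip its sign.
    lx = np.log(np.asarray(x, dtype=float))
    lg = np.log(np.asarray(g, dtype=float))
    N = lx.size
    slope = ((N * np.sum(lx * lg) - np.sum(lx) * np.sum(lg))
             / (N * np.sum(lx ** 2) - np.sum(lx) ** 2))
    return -slope

def truncated_lls_b(x, g, x_cut):
    # Truncated LLS: apply Eq. (2) only to the categories with x <= x_cut
    # (x_cut stands in for the truncation rule prescribed in [4] or [5]).
    x = np.asarray(x, dtype=float)
    g = np.asarray(g, dtype=float)
    keep = x <= x_cut
    return lls_b(x[keep], g[keep])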
A maximum likelihood estimator of b will satisfy the equation [6]

Σ g(x) ln x / Σ g(x) = −ζ′(b)/ζ(b);  x = 1, 2, ..., x_max,   (3)

where ζ(·) denotes the Riemann zeta function. This may be solved by iterative numerical methods or using tables of −ζ′(b)/ζ(b).
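A minimal numerical solution of Eq. (3) might look as follows. This is an illustrative sketch, not the tabulated procedure of [6]: the ratio −ζ′(b)/ζ(b) is approximated by truncating the Riemann series, which is adequate for b well above 1 but crude near b = 1, and the search bracket is an assumption suited to typical bibliometric data.

import numpy as np
from scipy.optimize import brentq

def zeta_log_ratio(b, n_terms=200000):
    # Approximate -zeta'(b)/zeta(b) by truncating the Riemann series:
    # zeta(b) = sum n**(-b) and -zeta'(b) = sum ln(n) * n**(-b).
    n = np.arange(1, n_terms + 1, dtype=float)
    w = n ** (-b)
    return np.sum(np.log(n) * w) / np.sum(w)

def mle_b(x, g, lo=1.2, hi=4.0):
    # Eq. (3): find b so that the frequency-weighted mean of ln x equals
    # -zeta'(b)/zeta(b). Widen [lo, hi] if brentq reports no sign change.
    x = np.asarray(x, dtype=float)
    g = np.asarray(g, dtype=float)
    mean_log_x = np.sum(g * np.log(x)) / np.sum(g)
    return brentq(lambda b: zeta_log_ratio(b) - mean_log_x, lo, hi)

For example, with x = [1, 2, 3, 4, 5] and g = [100, 30, 10, 5, 3], mle_b returns a value of roughly 2.5.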
The minimum chi-square criterion involves finding the value of b that minimizes the chi-square statistic,

χ² = Σ (f_o − f_e)²/f_e,   (4)

where f_o is the observed number of authors with x publications, and f_e is the expected value, g(x) = kx^{-b}. A method of moments estimator based on equating the empirical and theoretical means satisfies the equation [6]
x̄ = ζ(b − 1)/ζ(b),   (5)

which may be solved by iterative search or using tables of ζ(b − 1)/ζ(b). An empirical estimate based on a ratio of the observed frequencies for x = 1 and x = 2 is provided by

ln[g(1)/g(2)]/ln 2.   (6)
These estimators are described more fully by Tague and
Nicholls [7] and by Johnson and Kotz [6], who also provide
tables for the solution of the method of moments and maxi-
mum likelihood equations.
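Sketches of the ratio, method of moments, and minimum chi-square criteria, under the same data conventions as above, might look as follows; the search brackets and the use of scipy are assumptions, not part of the original procedures.

import numpy as np
from scipy.optimize import brentq, minimize_scalar
from scipy.special import zeta  # Riemann zeta for real arguments > 1

def ratio_b(g1, g2):
    # Eq. (6): b estimated from the observed frequencies at x = 1 and x = 2.
    return np.log(g1 / g2) / np.log(2.0)

def moments_b(x, g, lo=2.05, hi=5.0):
    # Eq. (5): equate the sample mean to zeta(b-1)/zeta(b); valid only for b > 2.
    x, g = np.asarray(x, dtype=float), np.asarray(g, dtype=float)
    xbar = np.sum(x * g) / np.sum(g)
    return brentq(lambda b: zeta(b - 1) / zeta(b) - xbar, lo, hi)

def min_chisq_b(x, g, lo=1.2, hi=5.0):
    # Eq. (4): choose b minimizing chi-square, with f_e = N * x**(-b) / zeta(b),
    # where N is the total number of authors.
    x, g = np.asarray(x, dtype=float), np.asarray(g, dtype=float)
    N = g.sum()
    def chisq(b):
        fe = N * x ** (-b) / zeta(b)
        return np.sum((g - fe) ** 2 / fe)
    return minimize_scalar(chisq, bounds=(lo, hi), method="bounded").x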
At least five methods of estimating the parameter k
may be identified. These include a priori assumptions
such as Lotka's k = 6/π²; the empirical estimate, or actual
proportion of authors making only a single contribution,
g(1)/Σ g(x); and the linear least squares estimate, i.e., the
y-intercept associated with the LLS estimate of b:

k = exp[(Σ ln g(x) − b Σ ln x)/N].   (7)
Better estimates than these can be obtained from the inverse of the infinite Riemann series, k = ζ(b)^{-1}. Except for b = 2 and b = 4, the limit of this series can only be approximated. Table 1 gives values of ζ(b)^{-1} for b = 1.50 to 3.49, computed with the Pao/Singer approximation formula [4]. The finite series estimate takes the sum only to x_max: k = [Σ x^{-b}]^{-1}, x = 1, 2, ..., x_max. In the finite series approach, x_max becomes an additional parameter and must itself be estimated by either the sample value, some reasonable assumption concerning the population extreme value, or by a statistical inference concerning the extreme value [7].
TABLE 1. Values of ζ(b)^{-1}.
b     ζ(b)^{-1}   b     ζ(b)^{-1}   b     ζ(b)^{-1}   b     ζ(b)^{-1}   b     ζ(b)^{-1}
1.50 0.3828 1.90 0.5715 2.30 0.6981 2.70 0.7848 3.10 0.8450
1.51 0.3885 1.91 0.5753 2.31 0.7007 2.71 0.7866 3.11 0.8463
1.52 0.3942 1.92 0.5791 2.32 0.7033 2.72 0.7883 3.12 0.8475
1.53 0.3998 1.93 0.5828 2.33 0.7058 2.73 0.7901 3.13 0.8488
1.54 0.4054 1.94 0.5865 2.34 0.7083 2.74 0.7918 3.14 0.8500
1.55 0.4109 1.95 0.5902 2.35 0.7108 2.75 0.7935 3.15 0.8512
1.56 0.4163 1.96 0.5938 2.36 0.7133 2.76 0.7952 3.16 0.8524
1.57 0.4217 1.97 0.5974 2.37 0.7157 2.77 0.7969 3.17 0.8536
1.58 0.4270 1.98 0.6009 2.38 0.7181 2.78 0.7986 3.18 0.8547
1.59 0.4323 1.99 0.6044 2.39 0.7205 2.79 0.8003 3.19 0.8559
1.60 0.4375 2.00 0.6079 2.40 0.7229 2.80 0.8019 3.20 0.8571
1.61 0.4427 2.01 0.6114 2.41 0.7252 2.81 0.8035 3.21 0.8582
1.62 0.4478 2.02 0.6148 2.42 0.7276 2.82 0.8052 3.22 0.8593
1.63 0.4528 2.03 0.6182 2.43 0.7299 2.83 0.8068 3.23 0.8605
1.64 0.4578 2.04 0.6215 2.44 0.7322 2.84 0.8083 3.24 0.8616
1.65 0.4628 2.05 0.6249 2.45 0.7344 2.85 0.8099 3.25 0.8627
1.66 0.4677 2.06 0.6281 2.46 0.7367 2.86 0.8115 3.26 0.8638
1.67 0.4725 2.07 0.6314 2.47 0.7389 2.87 0.8130 3.27 0.8649
1.68 0.4773 2.08 0.6346 2.48 0.7411 2.88 0.8145 3.28 0.8660
1.69 0.4821 2.09 0.6378 2.49 0.7433 2.89 0.8161 3.29 0.8670
1.70 0.4868 2.10 0.6409 2.50 0.7454 2.90 0.8176 3.30 0.8681
1.71 0.4914 2.11 0.6441 2.51 0.7476 2.91 0.8191 3.31 0.8691
1.72 0.4961 2.12 0.6472 2.52 0.7497 2.92 0.8205 3.32 0.8702
1.73 0.5006 2.13 0.6502 2.53 0.7518 2.93 0.8220 3.33 0.8712
1.74 0.5051 2.14 0.6533 2.54 0.7539 2.94 0.8235 3.34 0.8723
1.75 0.5096 2.15 0.6563 2.55 0.7560 2.95 0.8249 3.35 0.8733
1.76 0.5140 2.16 0.6593 2.56 0.7580 2.96 0.8263 3.36 0.8743
1.77 0.5184 2.17 0.6622 2.57 0.7600 2.97 0.8277 3.37 0.8753
1.78 0.5227 2.18 0.6651 2.58 0.7620 2.98 0.8299 3.38 0.8763
1.79 0.5270 2.19 0.6680 2.59 0.7640 2.99 0.8305 3.39 0.8772
1.80 0.5313 2.20 0.6709 2.60 0.7660 3.00 0.8319 3.40 0.8782
1.81 0.5355 2.21 0.6737 2.61 0.7680 3.01 0.8333 3.41 0.8792
1.82 0.5397 2.22 0.6766 2.62 0.7699 3.02 0.8346 3.42 0.8801
1.83 0.5438 2.23 0.6793 2.63 0.7718 3.03 0.8360 3.43 0.8811
1.84 0.5479 2.24 0.6821 2.64 0.7737 3.04 0.8373 3.44 0.8820
1.85 0.5519 2.25 0.6848 2.65 0.7756 3.05 0.8386 3.45 0.8830
1.86 0.5559 2.26 0.6875 2.66 0.7775 3.06 0.8399 3.46 0.8839
1.87 0.5599 2.27 0.6902 2.67 0.7793 3.07 0.8412 3.47 0.8848
1.88 0.5638 2.28 0.6929 2.68 0.7811 3.08 0.8425 3.48 0.8857
1.89 0.5677 2.29 0.6955 2.69 0.7830 3.09 0.8438 3.49 0.8866
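The Table 1 values and the finite series alternative can also be reproduced with standard numerical libraries rather than an approximation formula; a brief sketch (function names illustrative, not from the original paper):

import numpy as np
from scipy.special import zeta  # Riemann zeta, defined for b > 1

def k_infinite(b):
    # k = 1/zeta(b): the inverse of the infinite Riemann series (Table 1 values).
    return 1.0 / zeta(b)

def k_finite(b, x_max):
    # Finite series estimate: k = 1 / sum_{x=1}^{x_max} x**(-b).
    x = np.arange(1, x_max + 1, dtype=float)
    return 1.0 / np.sum(x ** (-b))

# k_infinite(2.0) is about 0.6079, matching the entry for b = 2.00 in Table 1;
# k_finite(2.0, 50) is slightly larger because the tail of the series is dropped.

Here scipy.special.zeta supplies the zeta values directly; the original table was generated with the Pao/Singer approximation formula [4].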
Comparison of Estimators
Eight methods of estimating b were used to fit Eq. 1 to
34 empirical author productivity distributions. The origin of
these datasets has been described previously by Pao [4]; they
are samples 1-10, 20-41, 43, 45, and 46 in her Table 2. The criterion for estimating k was held constant as ζ(b)^{-1}. The methods of estimating b included linear least squares (LLS),
maximum likelihood (MLE), ratio of frequencies (RAT),
minimum chi-square (MIN), method of moments (MOM),
and a truncated least squares method [5] (TLS N). The data
for two further estimators were incorporated from Pao's study: the a priori assumption b = 2 (AJL), and another
truncated least squares approach [4] (TLS P). Theoretical
frequencies were calculated using these eight estimators and
the statistics associated with the Kolmogorov-Smirnov test computed. Table 2 summarizes the resulting values of D_max, the maximum absolute deviation between the observed and theoretical distribution functions, for the 34 datasets.
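For reference, D_max might be computed as in the following sketch. It assumes contiguous categories x = 1, 2, ..., x_max and a fitted model with k = ζ(b)^{-1}, and is not necessarily identical in its tail handling to the procedure used in the original comparison.

import numpy as np
from scipy.special import zeta

def d_max(x, g, b):
    # D_max: maximum absolute deviation between the observed and fitted
    # (k = 1/zeta(b)) cumulative distribution functions over the categories x.
    x, g = np.asarray(x, dtype=float), np.asarray(g, dtype=float)
    obs_cdf = np.cumsum(g) / g.sum()        # empirical proportion of authors up to x
    theo_pmf = x ** (-b) / zeta(b)          # g(x)/N under the fitted model
    theo_cdf = np.cumsum(theo_pmf)
    return np.max(np.abs(obs_cdf - theo_cdf))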
The estimators rank consistently on several characteristics. Two groups can be identified: estimators with a low central tendency and variance in the resulting D_max (MLE, RAT, TLS), and estimators with a higher central tendency and variance (MIN, MOM, LLS, AJL). The former provide many close fits to the data, while the latter provide poorer fits. The Kolmogorov-Smirnov goodness-of-fit test is not actually applied in this instance, since only two of the datasets are random samples. D_max is used descriptively only, as an indicator of fit; D_KS, the critical value of the test statistic at p = 0.99, serves only as a benchmark or basis for comparison, not as part of an inferential procedure. The last two columns of Table 2 give the number of fits judged to be adequate (D_max < D_KS) and the approximate rate of failure to provide an adequate fit.
TABLE 2. D_max for eight estimators of b.
Estimator   Mean     SD       Range    Median   P0.25    P0.75    Fits   Fail
MLE 0.0205 0.0119 0.0423 0.0181 0.0106 0.0285 34 0.00
RAT 0.0313 0.0203 0.1014 0.0292 0.0161 0.0388 31 0.09
TLS N 0.0381 0.0285 0.1241 0.0261 0.0185 0.0558 29 0.15
TLS P 0.0363 0.0218 0.0992 0.0324 0.0228 0.0403 28 0.18
MIN 0.0502 0.0384 0.2095 0.0400 0.0252 0.0641 23 0.32
MOM 0.0550 0.0405 0.1505 0.0506 0.0173 0.0751 20 0.41
LLS 0.0918 0.1050 0.3696 0.0418 0.0235 0.0779 19 0.44
AJL 0.1674 0.0930 0.3054 0.1500 0.0854 0.2418 6 0.82
Of the better estimators, maximum likelihood is by far the best; it is also the only method in this group that has the quality of sufficiency, i.e., RAT and TLS
use only part of the available data. The maximum likelihood
estimator is also more efficient, having the least variance. It
also provides an adequate fit for all 34 samples.
The most commonly used estimation procedure in va-
lidity studies of Lotka's law has been an a priori assumption
such as b = 2. Most of the remaining studies used least
squares or truncated least squares. Without truncation, LLS
provides very poor parameter estimates. Here, truncation
raises r² by about 0.04 on average, resulting in dramatically
better estimates. There are, however, at least four serious
problems with TLS as an estimator. The basic function of
truncation is to optimize linearity, i.e., to increase r². Since goodness-of-fit cannot be based on the quality of the regression alone [8], it stands to reason that parameter estimates should not; r² and D_max are only weakly correlated
here. The TLS estimators are not sufficient in the statistical
sense; valuable data are excluded for estimation purposes,
but the resulting estimates are applied to the complete distri-
butions. Criteria for truncation frequently rely on subjective
judgements which lead, inevitably, to irreproducible results.
Lastly, several Soviet investigators [9] have argued persua-
sively that least squares, and other constructions based on
the moment apparatus, are in principle unfit for the analysis
of stationary scientometric distributions, since results will
depend significantly upon sample sizes.
Conclusions
Of the available estimators for b, maximum likelihood
provides the best and most consistent fits. It is hardly sur-
prising that previous validity studies of Lotka's law have
yielded conflicting results, since the most commonly
applied methods of parameter estimation (LLS and AJL)
are also the poorest and least efficient. Although truncation
improves fits based on least squares estimators somewhat,
the inefficiency and dependency on sample sizes remains,
while the problem of insufficiency arises. When used
with MLE, ζ(b)^{-1} provides a good estimate of k. The RAT
method seems to provide adequate estimates in most cases,
and is very straightforward to calculate. However, the con-
sistent use of MLE is recommended for future empirical
studies of Lotka's model. The identification and application
of efficient parameter estimation procedures within a consis-
tent testing methodology show here that Lotka's model is
surprisingly well-fitting, much more so than previously sup-
posed. The same general approach to estimation and validity
testing should be applicable to the many other applications
of the size-frequency Zipf distribution in bibliometrics.
References
1. Lotka, A. J. The Frequency Distribution of Scientific Productivity. Journal of the Washington Academy of Sciences. 16(12):317-323; 1926.
2. Chen, Y. S.; Leimkuhler, F. F. A Relationship Between Lotka's Law, Bradford's Law, and Zipf's Law. Journal of the American Society for Information Science. 37(5):307-314; 1986.
3. Potter, W. G. Lotka's Law Revisited. Library Trends. 31(2):21-39; 1981.
4. Pao, M. L. An Empirical Examination of Lotka's Law. Journal of the American Society for Information Science. 37(1):26-33; 1986.
5. Nicholls, P. T. Empirical Validation of Lotka's Law. Information Processing and Management. 22(5):417-419; 1986.
6. Johnson, N. L.; Kotz, S. Discrete Distributions. Boston, MA: Houghton Mifflin; 1969.
7. Tague, J. M.; Nicholls, P. T. The Maximal Value of a Zipf Size Variable: Sampling Properties and Relationship to Other Parameters. Information Processing and Management; in press.
8. Phillips, W. J.; Shepherd, M. A. Statistical Analysis of the Rank-Frequency Distribution of Elements in a Large Database. In: Canadian Association for Information Science: Proceedings of the 13th Annual Conference, Montreal, 1985. Montreal, Canada: ACSI/CAIS; 1985.
9. Yablonsky, A. I. Stable Non-Gaussian Distributions in Scientometrics. Scientometrics. 7(3-6):459-470; 1985.