The two-parameter size-frequency form of the Zipf distribution is defined as \(g(x) = kx^{-b}\); \(k, b > 0\), \(x = 1, 2, \ldots, x_{\max}\). A number of criteria are available to estimate the parameter b of Eq. (1). An a priori estimate consists of the assumption that b is a constant, perhaps 2 (following Lotka) or 3 (following Price). A more objective approach is linear least squares regression.
Paul Travis Nicholls
School of Library and Information Science, University of Western Ontario, London, Ontario, Canada N6G 1H1

Received October 6, 1986; revised November 5, 1986; accepted November 11, 1986.
© 1987 by John Wiley & Sons, Inc.

The two-parameter size-frequency form of the Zipf distribution is defined as

\[ g(x) = kx^{-b}; \quad k, b > 0, \quad x = 1, 2, \ldots, x_{\max}. \tag{1} \]

When g(x) represents the number of researchers expected to make x published contributions to a subject field, we have the generalized Lotka model of author productivity [1]. See Chen and Leimkuhler [2] for an alternative formulation that explicitly takes account of the fact that, for empirical datasets, it is unrealistic to assume that the values of x will run continuously from 1 to \(x_{\max}\).

Much of the mystery surrounding the empirical validity of the Lotka hypothesis is attributable to methodological inconsistencies in the validity studies that have appeared [3]. Recently, standard procedures have been proposed for such validity tests [4,5]. Parameter estimation is a central feature of any such procedure, and is of crucial importance to the resulting goodness-of-fit.

Parameter Estimation

A number of criteria are available to estimate the parameter b of Eq. (1). An a priori estimate consists of the assumption that b is a constant, perhaps 2 (following Lotka) or 3 (following Price). A graphical estimate is obtained by plotting the data on logarithmic graph paper and visually estimating the slope of a regression line. A more objective approach is linear least squares (LLS) regression. The LLS estimate is given by

\[ \frac{N \sum \ln x \, \ln g(x) - \sum \ln g(x) \sum \ln x}{N \sum (\ln x)^2 - \left( \sum \ln x \right)^2}; \quad x = 1, 2, \ldots, x_{\max}. \tag{2} \]

When the LLS estimate is calculated using only a subset of the full N categories, as when the more prolific authors are excluded, we have truncated linear least squares. A maximum likelihood estimator of b will satisfy the equation [6]

\[ \frac{\sum g(x) \ln x}{\sum g(x)} = -\frac{\zeta'(b)}{\zeta(b)}; \quad x = 1, 2, \ldots, x_{\max}, \tag{3} \]

where \(\zeta(\cdot)\)
denotes the Riemann zeta function. This may be solved by iterative numerical methods or by using tables of \(-\zeta'(b)/\zeta(b)\). The minimum chi-square criterion involves finding the value of b that minimizes the chi-square statistic,

\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}, \tag{4} \]

where \(f_o\) is the observed number of authors with x publications, and \(f_e\) is the expected value, \(g(x) = kx^{-b}\). A method of moments estimator based on equating the empirical and theoretical means satisfies the equation [6]

\[ \bar{x} = \frac{\zeta(b - 1)}{\zeta(b)}, \tag{5} \]

which may be solved by iterative search or by using tables of \(\zeta(b - 1)/\zeta(b)\). An empirical estimate based on the ratio of the observed frequencies for x = 1 and x = 2 is provided by

\[ \frac{\ln [g(1)/g(2)]}{\ln 2}. \tag{6} \]

These estimators are described more fully by Tague and Nicholls [7] and by Johnson and Kotz [6], who also provide tables for the solution of the method of moments and maximum likelihood equations.

At least five methods of estimating the parameter k may be identified. These include a priori assumptions such as Lotka's \(k = 6/\pi^2\); the empirical estimate, or actual proportion of authors making only a single contribution, \(g(1)/\sum g(x)\); and the linear least squares estimate, i.e., the antilogarithm of the y-intercept associated with the LLS estimate of b:

\[ k = \exp\left[ \frac{\sum \ln g(x) - b \sum \ln x}{N} \right]. \tag{7} \]

Better estimates than these can be obtained from the inverse of the infinite Riemann series, \(k = \zeta(b)^{-1}\). Except for b = 2 and b = 4, the limit of this series can only be approximated. Table 1 gives values of \(\zeta(b)^{-1}\) for b = 1.5 to 3.49, computed with the Pao/Singer approximation formula [4]. The finite series estimate takes the sum only to \(x_{\max}\): \(k = \left[ \sum x^{-b} \right]^{-1}\), \(x = 1, 2, \ldots, x_{\max}\). In the finite series approach, \(x_{\max}\) becomes an additional parameter and must itself be estimated by either the sample value, some reasonable assumption concerning the population extreme value, or by a statistical inference concerning the extreme value [7].

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE 38(6): 443-445, 1987 CCC 0002-8231/87/060443-03$04.00

TABLE 1. Values of \(\zeta(b)^{-1}\).

b     1/ζ(b)    b     1/ζ(b)    b     1/ζ(b)    b     1/ζ(b)    b     1/ζ(b)
1.50  0.3828    1.90  0.5715    2.30  0.6981    2.70  0.7848    3.10  0.8450
1.51  0.3885    1.91  0.5753    2.31  0.7007    2.71  0.7866    3.11  0.8463
1.52  0.3942    1.92  0.5791    2.32  0.7033    2.72  0.7883    3.12  0.8475
1.53  0.3998    1.93  0.5828    2.33  0.7058    2.73  0.7901    3.13  0.8488
1.54  0.4054    1.94  0.5865    2.34  0.7083    2.74  0.7918    3.14  0.8500
1.55  0.4109    1.95  0.5902    2.35  0.7108    2.75  0.7935    3.15  0.8512
1.56  0.4163    1.96  0.5938    2.36  0.7133    2.76  0.7952    3.16  0.8524
1.57  0.4217    1.97  0.5974    2.37  0.7157    2.77  0.7969    3.17  0.8536
1.58  0.4270    1.98  0.6009    2.38  0.7181    2.78  0.7986    3.18  0.8547
1.59  0.4323    1.99  0.6044    2.39  0.7205    2.79  0.8003    3.19  0.8559
1.60  0.4375    2.00  0.6079    2.40  0.7229    2.80  0.8019    3.20  0.8571
1.61  0.4427    2.01  0.6114    2.41  0.7252    2.81  0.8035    3.21  0.8582
1.62  0.4478    2.02  0.6148    2.42  0.7276    2.82  0.8052    3.22  0.8593
1.63  0.4528    2.03  0.6182    2.43  0.7299    2.83  0.8068    3.23  0.8605
1.64  0.4578    2.04  0.6215    2.44  0.7322    2.84  0.8083    3.24  0.8616
1.65  0.4628    2.05  0.6249    2.45  0.7344    2.85  0.8099    3.25  0.8627
1.66  0.4677    2.06  0.6281    2.46  0.7367    2.86  0.8115    3.26  0.8638
1.67  0.4725    2.07  0.6314    2.47  0.7389    2.87  0.8130    3.27  0.8649
1.68  0.4773    2.08  0.6346    2.48  0.7411    2.88  0.8145    3.28  0.8660
1.69  0.4821    2.09  0.6378    2.49  0.7433    2.89  0.8161    3.29  0.8670
1.70  0.4868    2.10  0.6409    2.50  0.7454    2.90  0.8176    3.30  0.8681
1.71  0.4914    2.11  0.6441    2.51  0.7476    2.91  0.8191    3.31  0.8691
1.72  0.4961    2.12  0.6472    2.52  0.7497    2.92  0.8205    3.32  0.8702
1.73  0.5006    2.13  0.6502    2.53  0.7518    2.93  0.8220    3.33  0.8712
1.74  0.5051    2.14  0.6533    2.54  0.7539    2.94  0.8235    3.34  0.8723
1.75  0.5096    2.15  0.6563    2.55  0.7560    2.95  0.8249    3.35  0.8733
1.76  0.5140    2.16  0.6593    2.56  0.7580    2.96  0.8263    3.36  0.8743
1.77  0.5184    2.17  0.6622    2.57  0.7600    2.97  0.8277    3.37  0.8753
1.78  0.5227    2.18  0.6651    2.58  0.7620    2.98  0.8299    3.38  0.8763
1.79  0.5270    2.19  0.6680    2.59  0.7640    2.99  0.8305    3.39  0.8772
1.80  0.5313    2.20  0.6709    2.60  0.7660    3.00  0.8319    3.40  0.8782
1.81  0.5355    2.21  0.6737    2.61  0.7680    3.01  0.8333    3.41  0.8792
1.82  0.5397    2.22  0.6766    2.62  0.7699    3.02  0.8346    3.42  0.8801
1.83  0.5438    2.23  0.6793    2.63  0.7718    3.03  0.8360    3.43  0.8811
1.84  0.5479    2.24  0.6821    2.64  0.7737    3.04  0.8373    3.44  0.8820
1.85  0.5519    2.25  0.6848    2.65  0.7756    3.05  0.8386    3.45  0.8830
1.86  0.5559    2.26  0.6875    2.66  0.7775    3.06  0.8399    3.46  0.8839
1.87  0.5599    2.27  0.6902    2.67  0.7793    3.07  0.8412    3.47  0.8848
1.88  0.5638    2.28  0.6929    2.68  0.7811    3.08  0.8425    3.48  0.8857
1.89  0.5677    2.29  0.6955    2.69  0.7830    3.09  0.8438    3.49  0.8866

Comparison of Estimators

Eight methods of estimating b were used to fit Eq. (1) to 34 empirical author productivity distributions. The origin of these datasets has been described previously by Pao [4]; they are samples 1-10, 20-41, 43, 45 and 46 in her Table 2. The criterion for estimating k was held constant as \(\zeta(b)^{-1}\). The methods of estimating b included linear least squares (LLS), maximum likelihood (MLE), ratio of frequencies (RAT), minimum chi-square (MIN), method of moments (MOM), and a truncated least squares method [5] (TLS N). The data for two further estimators were incorporated from Pao's study: the a priori assumption b = 2 (AJL), and another truncated least squares approach [4] (TLS P). Theoretical frequencies were calculated using these eight estimators, and the statistics associated with the Kolmogorov-Smirnov test were computed. Table 2 summarizes the resulting values of \(D_{\max}\), the maximum absolute deviation between the observed and theoretical distribution functions, for the 34 datasets.

The estimators rank consistently on several characteristics. Two groups can be identified: estimators with a low central tendency and variance in the resulting \(D_{\max}\) (MLE, RAT, TLS), and estimators with a higher central tendency and variance (MIN, MOM, LLS, AJL). The former provide many close fits to the data, while the latter provide poorer fits.
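As an illustration of the fitting procedure described above, the following Python sketch solves the maximum likelihood equation (3) for b by bisection, sets k to the inverse of the zeta function, and computes the maximum absolute deviation between the observed and theoretical distribution functions. It is a minimal sketch under stated assumptions, not the computation used in the study: the function names, the bisection bounds, and the sample dataset are hypothetical, and the zeta function is approximated by a direct sum with an Euler-Maclaurin tail correction rather than by the Pao/Singer formula [4].

```python
import math


def zeta(b, terms=1000):
    """Approximate the Riemann zeta function for b > 1.

    Direct sum of the first terms - 1 values plus an Euler-Maclaurin
    tail correction (an assumption of this sketch, not the Pao/Singer formula).
    """
    n = terms
    head = sum(j ** -b for j in range(1, n))
    tail = n ** (1 - b) / (b - 1) + 0.5 * n ** -b + b * n ** (-b - 1) / 12
    return head + tail


def neg_zeta_prime(b, terms=1000):
    """Approximate -zeta'(b) = sum of ln(j) * j^-b, with the same tail correction."""
    n = terms
    ln_n = math.log(n)
    head = sum(math.log(j) * j ** -b for j in range(2, n))
    tail = (n ** (1 - b) * (ln_n / (b - 1) + 1 / (b - 1) ** 2)
            + 0.5 * ln_n * n ** -b
            + n ** (-b - 1) * (b * ln_n - 1) / 12)
    return head + tail


def mle_b(freq, lo=1.05, hi=6.0, tol=1e-8):
    """Solve Eq. (3) for b by bisection.

    freq maps x (number of publications) to the observed number of authors.
    The bounds lo and hi are hypothetical defaults.
    """
    total = sum(freq.values())
    mean_log = sum(g * math.log(x) for x, g in freq.items()) / total

    def gap(b):
        # -zeta'(b)/zeta(b) decreases as b grows, so the root is bracketed
        return neg_zeta_prime(b) / zeta(b) - mean_log

    if gap(lo) < 0 or gap(hi) > 0:
        raise ValueError("no solution for b between lo and hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def d_max(freq, b):
    """Maximum absolute deviation between the observed and theoretical
    cumulative proportions, with k = 1 / zeta(b)."""
    total = sum(freq.values())
    k = 1 / zeta(b)
    observed = theoretical = worst = 0.0
    for x in range(1, max(freq) + 1):
        observed += freq.get(x, 0) / total
        theoretical += k * x ** -b
        worst = max(worst, abs(observed - theoretical))
    return worst


# Hypothetical author productivity data: {publications: number of authors}
sample = {1: 60, 2: 15, 3: 7, 4: 4, 5: 2, 6: 1}
b_hat = mle_b(sample)
print("b =", round(b_hat, 4), " k =", round(1 / zeta(b_hat), 4),
      " D_max =", round(d_max(sample, b_hat), 4))
```

With this approximation, `1 / zeta(2.0)` agrees with the Table 1 entry of 0.6079 for b = 2.00. Bisection is used here only because the left side of Eq. (3) is fixed by the data and the right side is monotone in b; any other root-finding method, or the published tables, would serve equally well.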
The Kolmogorov-Smirnov goodness-of-fit test is not actually applied in this instance, since only two of the datasets are random samples. \(D_{\max}\) is used descriptively only, as an indicator of fit; \(D_{KS}\), the critical value of the test statistic at p = 0.99, serves only as a benchmark or basis for comparison, not as part of an inferential procedure. The last two columns of Table 2 give the number of fits judged to be adequate (\(D_{\max} < D_{KS}\)) and the approximate rate of failure to provide an adequate fit.

TABLE 2. \(D_{\max}\) for eight estimators of b.

Estimator  Mean    SD      Range   Median  P0.25   P0.75   Fits  Fail
MLE        0.0205  0.0119  0.0423  0.0181  0.0106  0.0285  34    0.00
RAT        0.0313  0.0203  0.1014  0.0292  0.0161  0.0388  31    0.09
TLS N      0.0381  0.0285  0.1241  0.0261  0.0185  0.0558  29    0.15
TLS P      0.0363  0.0218  0.0992  0.0324  0.0228  0.0403  28    0.18
MIN        0.0502  0.0384  0.2095  0.0400  0.0252  0.0641  23    0.32
MOM        0.0550  0.0405  0.1505  0.0506  0.0173  0.0751  20    0.41
LLS        0.0918  0.1050  0.3696  0.0418  0.0235  0.0779  19    0.44
AJL        0.1674  0.0930  0.3054  0.1500  0.0854  0.2418  6     0.82

Of the better estimators, maximum likelihood is by far the best; it is also the only method in this group that has the quality of sufficiency, i.e., RAT and TLS use only part of the available data. The maximum likelihood estimator is also more efficient, having the least variance. It also provides an adequate fit for all 34 samples.

The most commonly used estimation procedure in validity studies of Lotka's law has been an a priori assumption such as b = 2. Most of the remaining studies used least squares or truncated least squares. Without truncation, LLS provides very poor parameter estimates. Here, truncation raises \(r^2\) by about 0.04 on average, resulting in dramatically better estimates. There are, however, at least four serious problems with TLS as an estimator. The basic function of truncation is to optimize linearity, i.e., to increase r.
Since goodness-of-fit cannot be based on the quality of the regression alone [8], it stands to reason that parameter estimates should not be either; \(r^2\) and \(D_{\max}\) are only weakly correlated here. The TLS estimators are not sufficient in the statistical sense; valuable data are excluded for estimation purposes, but the resulting estimates are applied to the complete distributions. Criteria for truncation frequently rely on subjective judgements which lead, inevitably, to non-reproducible results. Lastly, several Soviet investigators [9] have argued persuasively that least squares, and other constructions based on the moment apparatus, are in principle unfit for the analysis of stationary scientometric distributions, since results will depend significantly upon sample sizes.

Conclusions

Of the available estimators for b, maximum likelihood provides the best and most consistent fits. It is hardly surprising that previous validity studies of Lotka's law have yielded conflicting results, since the most commonly applied methods of parameter estimation (LLS and AJL) are also the poorest and least efficient. Although truncation improves fits based on least squares estimators somewhat, the inefficiency and dependency on sample sizes remain, while the problem of insufficiency arises. When used with MLE, \(\zeta(b)^{-1}\) provides a good estimate of k. The RAT method seems to provide adequate estimates in most cases, and is very straightforward to calculate. However, the consistent use of MLE is recommended for future empirical studies of Lotka's model. The identification and application of efficient parameter estimation procedures within a consistent testing methodology show here that Lotka's model is surprisingly well-fitting, much more so than previously supposed. The same general approach to estimation and validity testing should be applicable to the many other applications of the size-frequency Zipf distribution in bibliometrics.

References

1. Lotka, A. J. The Frequency Distribution of Scientific Productivity. Journal of the Washington Academy of Sciences. 16(12):317-323; 1926.
2. Chen, Y. S.; Leimkuhler, F. F. A Relationship Between Lotka's Law, Bradford's Law, and Zipf's Law. Journal of the American Society for Information Science. 37(5):307-314; 1986.
3. Potter, W. G. Lotka's Law Revisited. Library Trends. 31(2):21-39; 1981.
4. Pao, M. L. An Empirical Examination of Lotka's Law. Journal of the American Society for Information Science. 37(1):26-33; 1986.
5. Nicholls, P. T. Empirical Validation of Lotka's Law. Information Processing and Management. 22(5):417-419; 1986.
6. Johnson, N. L.; Kotz, S. Discrete Distributions. Boston, MA: Houghton Mifflin; 1969.
7. Tague, J. M.; Nicholls, P. T. The Maximal Value of a Zipf Size Variable: Sampling Properties and Relationship to Other Parameters. Information Processing and Management; in press.
8. Phillips, W. J.; Shepherd, M. A. Statistical Analysis of the Rank-Frequency Distribution of Elements in a Large Database. In: Canadian Association for Information Science: Proceedings of the 13th Annual Conference, Montreal 1985. Montreal, Canada: ACSI/CAIS; 1985.
9. Yablonsky, A. I. Stable Non-Gaussian Distributions in Scientometrics. Scientometrics. 7(3-6):459-470; 1985.