
Remote Sens. Environ. 37:35-46 (1991)

A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data


Russell G. Congalton
Department of Forestry and Resource Management, University of California, Berkeley

This paper reviews the necessary considerations and available techniques for assessing the accuracy of remotely sensed data. Included in this review are the classification system, the sampling scheme, the sample size, spatial autocorrelation, and the assessment techniques. All analysis is based on the use of an error matrix or contingency table. Example matrices and the results of their analysis are presented. Future trends, including the need for assessment of other spatial data, are also discussed.
INTRODUCTION

With the advent of more advanced digital satellite remote sensing techniques, the necessity of performing an accuracy assessment has received renewed interest. This is not to say that accuracy assessment is unimportant for the more traditional remote sensing techniques. However, given the complexity of digital classification, there is a greater need to assess the reliability of the results. Traditionally, the accuracy of photointerpretation has been accepted as correct without any confirmation. In fact, digital classifications are often assessed with reference to photointerpretation, the obvious assumption being that the photointerpretation is 100% correct. This assumption is rarely valid and can lead to a rather poor and unfair assessment of the digital classification (Biging and Congalton, 1989). Therefore, it is essential that researchers and users of remotely sensed data have a strong knowledge of both the factors that need to be considered and the techniques used in performing any accuracy assessment. Failure to know these techniques and considerations can severely limit one's ability to effectively use remotely sensed data.

The objective of this paper is to provide a review of the appropriate analysis techniques and a discussion of the factors that must be considered when performing any accuracy assessment. Many analysis techniques have been published in the literature; however, I believe that it will be helpful to many novice and established users of remotely sensed data to have all the standard techniques summarized in a single paper. In addition, it is important to understand the analysis techniques in order to fully realize the importance of the various other considerations for accuracy assessment discussed in this paper.

TECHNIQUES

Until recently, the idea of assessing the classification accuracy of remotely sensed data was treated more as an afterthought than as an integral part of any project. In fact, as recently as the early 1980s, many studies would simply report a single number to express the accuracy of a classification. In many of these cases the accuracy reported was what is called a non-site-specific accuracy. In a non-site-specific accuracy assessment, locational accuracy is completely ignored: only the total amount of each category is considered, without regard for its location. If the errors balance out, a non-site-specific accuracy assessment will yield very high but misleading results. In addition, most assessments were conducted using the same data set that was used to train the classifier; training and testing on the same data set also results in overestimates of classification accuracy. Once these problems were recognized, many more site-specific accuracy assessments were performed using an independent data set. For these assessments, the most common way to represent the classification accuracy of remotely sensed data is in the form of an error matrix. Using an error matrix to represent accuracy has been recommended by many researchers and should be adopted as the standard reporting convention; the reasons for choosing the error matrix as the standard are clearly demonstrated in this paper.

An error matrix is a square array of numbers set out in rows and columns which expresses the number of sample units (i.e., pixels, clusters of pixels, or polygons) assigned to a particular category relative to the actual category as verified on the ground (Table 1). The columns usually represent the reference data, while the rows indicate the classification generated from the remotely sensed data. An error matrix is a very effective way to represent accuracy in that the accuracies of each category are plainly described, along with both the errors of inclusion (commission errors) and the errors of exclusion (omission errors) present in the classification.

Descriptive Techniques
The error matrix can then be used as a starting point for a series of descriptive and analytical statistical techniques. Perhaps the simplest descriptive statistic is overall accuracy, which is computed by dividing the total correct (i.e., the sum of the major diagonal) by the total number of pixels in the error matrix. In addition, the accuracies of individual categories can be computed in a similar manner. However, this case is a little more complex, in that one has a choice of dividing the number of correct pixels in a category by either the total number of pixels in the corresponding row or the total in the corresponding column. Traditionally, the total number of correct pixels in a category is divided by the total number of pixels of that category as derived from the reference data (i.e., the column total).

Table 1. An Example Error Matrix

                          Reference Data
                      D      C     BA     SB    Row Total
Classified   D       65      4     22     24      115
Data         C        6     81      5      8      100
             BA       0     11     85     19      115
             SB       4      7      3     90      104
Column Total         75    103    115    141      434

Land cover categories: D = deciduous, C = conifer, BA = barren, SB = shrub

OVERALL ACCURACY = 321/434 = 74%

PRODUCER'S ACCURACY              USER'S ACCURACY
D  = 65/75  = 87%                D  = 65/115 = 57%
C  = 81/103 = 79%                C  = 81/100 = 81%
BA = 85/115 = 74%                BA = 85/115 = 74%
SB = 90/141 = 64%                SB = 90/104 = 87%


This accuracy measure indicates the probability of a reference pixel being correctly classified and is really a measure of omission error. It is often called "producer's accuracy," because the producer of the classification is interested in how well a certain area can be classified. On the other hand, if the total number of correct pixels in a category is divided by the total number of pixels that were classified in that category, the result is a measure of commission error. This measure, called "user's accuracy" or reliability, is indicative of the probability that a pixel classified on the map/image actually represents that category on the ground (Story and Congalton, 1986).

A very simple example quickly shows the advantages of considering overall accuracy, producer's accuracy, and user's accuracy together. The error matrix shown in Table 1 indicates an overall map accuracy of 74%. However, suppose we are most interested in the ability to classify deciduous forests. We can calculate a producer's accuracy for this category by dividing the total number of correct pixels in the deciduous category (65) by the total number of deciduous pixels as indicated by the reference data (75). This division results in a producer's accuracy of 87%, which is quite good. If we stopped here, one might conclude that, although this classification has an overall accuracy that is only fair (74%), it is adequate for the deciduous category. Making such a conclusion could be a very serious mistake. A quick calculation of the user's accuracy, computed by dividing the total number of correct pixels in the deciduous category (65) by the total number of pixels classified as deciduous (115), reveals a value of 57%. In other words, although 87% of the deciduous areas have been correctly identified as deciduous, only 57% of the areas called deciduous on the map are actually deciduous. A more careful look at the error matrix reveals significant confusion in discriminating deciduous from barren and shrub. Therefore, although the producer of this map can claim that 87% of the time an area that was deciduous was identified as such, a user of the map will find that only 57% of the areas the map labels as deciduous are actually deciduous on the ground. These computations are sketched below.
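The computations just described can be expressed compactly. The following minimal sketch (in Python with NumPy, assuming that environment is available) reproduces Table 1's overall, producer's, and user's accuracies directly from the error matrix:

```python
import numpy as np

# Error matrix from Table 1: rows = classified data, columns = reference data,
# in category order D, C, BA, SB.
matrix = np.array([[65,  4, 22, 24],
                   [ 6, 81,  5,  8],
                   [ 0, 11, 85, 19],
                   [ 4,  7,  3, 90]])
categories = ["D", "C", "BA", "SB"]

diagonal = np.diag(matrix)                 # correctly classified sample units
overall = diagonal.sum() / matrix.sum()    # 321/434 = 74%
producers = diagonal / matrix.sum(axis=0)  # correct / column (reference) totals
users = diagonal / matrix.sum(axis=1)      # correct / row (classified) totals

print(f"Overall accuracy: {overall:.0%}")
for cat, p, u in zip(categories, producers, users):
    print(f"{cat:>2}: producer's = {p:.0%}, user's = {u:.0%}")
```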

Analytical Techniques
In addition to these descriptive techniques, an error matrix is an appropriate beginning for many analytical statistical techniques. This is especially true of the discrete multivariate techniques. Starting with Congalton et al. (1983), discrete multivariate techniques have been used for performing statistical tests on the classification accuracy of digital remotely sensed data. Since that time many others have adopted these techniques as the standard accuracy assessment tools (e.g., Rosenfield and Fitzpatrick-Lins, 1986; Hudson and Ramm, 1987; Campbell, 1987). Discrete multivariate techniques are appropriate because remotely sensed data are discrete rather than continuous. The data are also binomially or multinomially distributed rather than normally distributed. Therefore, many common normal theory statistical techniques do not apply.

The following example, presented in Tables 2-9, demonstrates the power of these discrete multivariate techniques. The example begins with three error matrices and presents the results of the analysis techniques. Table 2 presents the error matrices generated from using three different classification algorithms to map a small area of Berkeley and Oakland, California, surrounding the University of California campus from SPOT satellite data. The three classification algorithms used included a traditional supervised approach, a traditional unsupervised approach, and a modified approach that combines the supervised and unsupervised classifications to maximize the advantages of each (Chuvieco and Congalton, 1988). The classification was a simple one using only four categories: forest (F), industrial (I), urban (U), and water (W). All three classifications were performed by a single analyst. In addition, Table 3 presents the error matrix generated for the same area using only the modified classification approach by a second analyst. Each analyst was responsible for performing an accuracy assessment; therefore, different numbers of samples and different sample locations were selected by each.

The next analytical step is to "normalize" or standardize the error matrices. This technique uses an iterative proportional fitting procedure which forces each row and column in the matrix to sum to one. In this way, differences in the sample sizes used to generate the matrices are eliminated and, therefore, individual cell values within the matrix are directly comparable. In addition, because the rows and columns are totaled (i.e., the marginals) as part of the iterative process, the resulting normalized matrix is more indicative of the off-diagonal cell values (i.e., the errors of omission and commission).

Table 2. Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach

                          Reference Data
                       F      I      U      W
Classified    F       68      7      3      0
Data          I       12    112     15     10
              U        3      9     89      0
              W        0      2      5     56

Overall Accuracy = 325/391 = 83%

Unsupervised Approach

                          Reference Data
                       F      I      U      W
Classified    F       60     15     14      8
Data          I       11    102     13      4
              U        3      6     90      2
              W        4      2      5     52

Overall Accuracy = 304/391 = 78%

Modified Approach

                          Reference Data
                       F      I      U      W
Classified    F       75      6      1      0
Data          I        4    116     11      3
              U        3      7     96      2
              W        1      1      4     61

Overall Accuracy = 348/391 = 89%

Table 3. Error Matrix for the Modified Classification Approach from Analyst #2

Modified Approach (Analyst #2)

                          Reference Data
                       F     AG      U      W
Classified    F       35      6      1      0
Data          AG       3     82      5     10
              U        4      2     54      0
              W        0      5      2     37

Overall Accuracy = 208/246 = 85%

In other words, all the values in the matrix are iteratively balanced by row and column, thereby incorporating information from that row and column into each individual cell value. This process changes the cell values along the major diagonal of the matrix (the correct classifications), and therefore a normalized overall accuracy can be computed for each matrix by summing the major diagonal and dividing by the total of the entire matrix. Consequently, one could argue that the normalized accuracy is a better representation of accuracy than the overall accuracy computed from the original matrix, because it contains information about the off-diagonal cell values. Table 4 presents the normalized matrices from the same three classification algorithms for analyst #1, generated using a computer program called MARGFIT (marginal fitting). Table 5 presents the normalized matrix for the modified approach performed by analyst #2.



Table 4. Normalized Error Matrices for the Three Classification Approaches from Analyst #1

Supervised Approach

                            Reference Data
                       F         I         U         W
Classified    F     0.8652    0.0940    0.0331    0.0073
Data          I     0.0845    0.7547    0.0784    0.0824
              U     0.0435    0.1171    0.8319    0.0072
              W     0.0069    0.0342    0.0567    0.9031

Normalized Accuracy = 3.3549/4 = 84%

Unsupervised Approach

                            Reference Data
                       F         I         U         W
Classified    F     0.7734    0.1256    0.0387    0.0622
Data          I     0.1242    0.7014    0.1006    0.0824
              U     0.0656    0.1163    0.7904    0.0273
              W     0.0369    0.0567    0.0702    0.8370

Normalized Accuracy = 3.1022/4 = 78%

Modified Approach

                            Reference Data
                       F         I         U         W
Classified    F     0.9080    0.0687    0.0152    0.0076
Data          I     0.0372    0.8460    0.0801    0.0366
              U     0.0370    0.0697    0.8598    0.0334
              W     0.0178    0.0156    0.0450    0.9224

Normalized Accuracy = 3.5362/4 = 88%

Table 5. Normalized Error Matrix for the Modified Approach from Analyst #2

Modified Approach (Analyst #2)

                            Reference Data
                       F        AG         U         W
Classified    F     0.8519    0.1090    0.0287    0.0113
Data          AG    0.0464    0.7641    0.0581    0.1313
              U     0.0897    0.0348    0.8655    0.0094
              W     0.0120    0.0921    0.0477    0.8480

Normalized Accuracy = 3.3295/4 = 83%

In addition to computing a normalized accuracy, the normalized matrix can also be used to directly compare cell values between matrices. For example, we may be interested in comparing the accuracy each analyst obtained for the forest category using the modified classification approach. From the original matrices we can see that analyst #1 classified 75 sample units correctly while analyst #2 classified 35 correctly. Neither of these numbers means much by itself, because they are not directly comparable: the two analysts used different numbers of samples to generate their error matrices. Instead, these numbers would need to be converted into percentages before a comparison could be made. Here another problem arises: Do we divide the total correct by the row total (user's accuracy) or by the column total (producer's accuracy)?

Table 6. A Comparison of the Three Accuracy Measures for the Three Classification Approaches

Classification Algorithm     Overall Accuracy    KHAT Accuracy    Normalized Accuracy
Supervised approach                83%                77%                84%
Unsupervised approach              78%                70%                78%
Modified approach                  89%                85%                88%

We could calculate both and compare the results, or we could use the cell value in the normalized matrix. Because of the iterative proportional fitting routine, each cell value in the matrix has been balanced by the other values in its corresponding row and column. This balancing has the effect of incorporating the producer's and user's accuracies together. Also, since each row and column adds to 1, an individual cell value can quickly be converted to a percentage by multiplying by 100. Therefore, the normalization process provides a convenient way of comparing individual cell values between error matrices regardless of the number of samples used to derive each matrix. A minimal sketch of this normalization appears below.
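The sketch below implements the iterative proportional fitting just described. It is not the MARGFIT program itself; in particular, how MARGFIT seeds zero cells is not specified in the paper, so the small constant eps here is an assumption, chosen so that zeros can take on positive values as described in the text.

```python
import numpy as np

def normalize(matrix, tol=1e-6, max_iter=1000, eps=1e-4):
    """Iteratively proportionally fit an error matrix so that every row
    and column sums to 1 (a MARGFIT-style normalization)."""
    m = matrix.astype(float)
    m[m == 0] = eps  # assumed zero-handling: perturb so zeros become positive
    for _ in range(max_iter):
        m /= m.sum(axis=1, keepdims=True)   # scale each row to sum to 1
        m /= m.sum(axis=0, keepdims=True)   # scale each column to sum to 1
        if np.allclose(m.sum(axis=1), 1.0, atol=tol):
            break                            # rows still sum to 1: converged
    return m

# e.g., the supervised matrix from Table 2
supervised = np.array([[68,   7,  3,  0],
                       [12, 112, 15, 10],
                       [ 3,   9, 89,  0],
                       [ 0,   2,  5, 56]])
norm = normalize(supervised)
print(norm.round(4))
# normalized accuracy: diagonal sum over the number of categories, about 0.84
print("Normalized accuracy:", np.diag(norm).sum() / norm.shape[0])
```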


Another discrete multivariate technique of use in accuracy assessment is called KAPPA (Cohen, 1960). The result of performing a KAPPA analysis is a KHAT statistic (an estimate of KAPPA), which is another measure of agreement or accuracy. The KHAT statistic is computed as

$$\hat{K} \;=\; \frac{N \sum_{i=1}^{r} x_{ii} \;-\; \sum_{i=1}^{r} \left( x_{i+} \cdot x_{+i} \right)}{N^{2} \;-\; \sum_{i=1}^{r} \left( x_{i+} \cdot x_{+i} \right)},$$
where r is the number of rows in the matrix, x_{ii} is the number of observations in row i and column i, x_{i+} and x_{+i} are the marginal totals of row i and column i, respectively, and N is the total number of observations (Bishop et al., 1975). The KHAT equation is published in this paper to clear up some confusion caused by a typographical error in Congalton et al. (1983), who originally proposed the use of this statistic for remotely sensed data. Since that time, numerous papers have been published recommending this technique. The equations for computing the variance of the KHAT statistic and the standard normal deviate can be found in Congalton et al. (1983), Rosenfield and Fitzpatrick-Lins (1986), and Hudson and Ramm (1987), to list just a few. It should be noted that the KHAT equation assumes a multinomial sampling model and that the variance is derived using the Delta method.
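For readers who prefer code to notation, a minimal sketch of the KHAT computation follows; it reproduces the supervised value given later in Table 7 (0.7687) from the Table 2 matrix.

```python
import numpy as np

def khat(matrix):
    """KHAT (estimated KAPPA) from an r x r error matrix, per the equation above."""
    m = matrix.astype(float)
    n = m.sum()                                          # N, total observations
    diag = np.diag(m).sum()                              # sum of x_ii
    marginals = (m.sum(axis=1) * m.sum(axis=0)).sum()    # sum of x_i+ * x_+i
    return (n * diag - marginals) / (n ** 2 - marginals)

supervised = np.array([[68,   7,  3,  0],
                       [12, 112, 15, 10],
                       [ 3,   9, 89,  0],
                       [ 0,   2,  5, 56]])
print(round(khat(supervised), 4))   # 0.7687, matching Table 7
```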

Table 6 provides a comparison of the overall accuracy, the normalized accuracy, and the KHAT statistic for the three classification algorithms used by analyst #1. In this particular example, all three measures of accuracy agree about the relative ranking of the results. However, it is possible for these rankings to disagree, simply because each measure incorporates various levels of information from the error matrix into its computations. Overall accuracy only incorporates the major diagonal and excludes the omission and commission errors. As already described, normalized accuracy directly includes the off-diagonal elements (omission and commission errors) because of the iterative proportional fitting procedure. As shown in the KHAT equation, KHAT accuracy indirectly incorporates the off-diagonal elements as a product of the row and column marginals. Therefore, depending on the amount of error included in the matrix, these three measures may not agree.

It is not possible to give clear-cut rules as to when each measure should be used. Each accuracy measure incorporates different information about the error matrix and therefore must be examined as a different computation attempting to explain the error. My experience has shown that if the error matrix tends to have a great many off-diagonal cells with zeros in them, then the normalized results tend to disagree with the overall and KAPPA results. Many zeros occur in a matrix when an insufficient sample has been taken or when the classification is exceptionally good. Because of the iterative proportional fitting routine, these zeros tend to take on positive values in the normalization process, showing that some error could be expected. The normalization process then tends to reduce the accuracy because of these positive values in the off-diagonal cells. If a large number of off-diagonal cells do not contain zeros, then the results of the three measures tend to agree. There are also times when the KAPPA measure will disagree with the other two measures. Because of the ease of computing all three measures (software is available from the author), and because each measure reflects different information contained within the error matrix, I recommend an analysis such as the one performed here to glean as much information from the error matrix as possible.


Table 7. Results of the KAPPA Analysis Test of Significance for Individual Error Matrices

Test of Significance of Each Error Matrix
Classification Algorithm     KHAT Statistic    Z Statistic    Result*
Supervised approach              0.7687            29.41         S
Unsupervised approach            0.6956            24.04         S
Modified approach                0.8501            39.23         S
* At the 95% confidence level. S = significant.

Table 8. Results of KAPPA Analysis for Comparison between Error Matrices for Analyst #1

Test of Significant Differences between Error Matrices
Comparison                       Z Statistic    Result*
Supervised vs. unsupervised         1.8753         NS
Supervised vs. modified             2.3968         S
Unsupervised vs. modified           4.2741         S
* At the 95% confidence level. S = significant, NS = not significant.

Table 9. Results of KAPPA Analysis for Comparison between Modified Approach for Analyst #1 vs. Analyst #2

Test of Significant Differences between Error Matrices
Comparison                       Z Statistic    Result*
Modified #1 vs. modified #2         1.6774         NS
* At the 95% confidence level. NS = not significant.

In addition to being a third measure of accuracy, KAPPA is a powerful technique in its ability to provide information about a single matrix as well as to statistically compare matrices. Table 7 presents the results of the KAPPA analysis used to test the significance of each matrix alone. In other words, this test determines whether the results presented in the error matrix are significantly better than a random result (i.e., the null hypothesis KHAT = 0). Table 8 presents the results of the KAPPA analysis that compares the error matrices two at a time to determine whether they are significantly different. This test is based on the standard normal deviate and the fact that, although remotely sensed data are discrete, the KHAT statistic is asymptotically normally distributed. A quick look at Table 8 shows why this test is so important. Despite the normalized accuracy of the supervised approach being 6% higher than that of the unsupervised approach (84% - 78% = 6%), the results of the KAPPA analysis show that these two approaches are not significantly different. Therefore, given the choice of only these two approaches, one should use the easier, quicker, or more efficient approach, because accuracy will not be the deciding factor. Similar results are presented in Table 9, comparing the modified classification approach for analyst #1 with that for analyst #2.
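The tests in Tables 7-9 can be sketched as follows. The variance used here is only the common first-order approximation, not the full Delta-method expression from Congalton et al. (1983), so the Z values only approximate the published ones (roughly 29.6 versus the published 29.41 for the supervised matrix, and roughly 2.41 versus 2.3968 for the supervised-modified comparison).

```python
import numpy as np

def khat_and_var(matrix):
    """KHAT plus a first-order approximation of its variance. The full
    Delta-method variance carries additional terms, so Z values computed
    from this only approximate those published in Tables 7-9."""
    m = matrix.astype(float)
    n = m.sum()
    theta1 = np.diag(m).sum() / n                            # observed agreement
    theta2 = (m.sum(axis=1) * m.sum(axis=0)).sum() / n ** 2  # chance agreement
    k = (theta1 - theta2) / (1 - theta2)                     # same KHAT as above
    var = theta1 * (1 - theta1) / (n * (1 - theta2) ** 2)
    return k, var

sup = np.array([[68, 7, 3, 0], [12, 112, 15, 10], [3, 9, 89, 0], [0, 2, 5, 56]])
mod = np.array([[75, 6, 1, 0], [4, 116, 11, 3], [3, 7, 96, 2], [1, 1, 4, 61]])

k1, v1 = khat_and_var(sup)
k2, v2 = khat_and_var(mod)
print("single-matrix Z:", k1 / v1 ** 0.5)              # H0: KAPPA = 0
print("pairwise Z:", abs(k1 - k2) / (v1 + v2) ** 0.5)  # H0: KAPPA_1 = KAPPA_2
# |Z| > 1.96 rejects the null hypothesis at the 95% confidence level
```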

In addition to the discrete multivariate techniques just presented, other techniques for assessing the accuracy of remotely sensed data have also been suggested. Rosenfield (1981) proposed the use of analysis of variance techniques for accuracy assessment. However, violation of the normal theory and independence assumptions when applying this technique to remotely sensed data has severely limited its application. Aronoff (1985) suggested the use of a minimum accuracy value as an index of classification accuracy. This approach is based on the binomial distribution of the data and is therefore very appropriate for remotely sensed data. The major disadvantage of the Aronoff approach is that it is limited to a single overall accuracy value rather than using the entire error matrix. However, it is useful in that this index expresses statistically the uncertainty involved in any accuracy assessment. Finally, Skidmore and Turner (1989) have begun work on techniques for assessing error as it accumulates through many spatial layers of information in a GIS, including remotely sensed data. These techniques have included using a line sampling method for accuracy assessment as well as probability theory to accumulate error from layer to layer. It is in this area of error analysis that much new work needs to be performed.

CONSIDERATIONS

Along with the actual analysis techniques, there are many other considerations to note when performing an accuracy assessment. In reality, the techniques are of little value if these other factors are not considered, because a critical assumption of all the analysis described above is that the error matrix is truly representative of the entire classification. If the matrix is improperly generated, then all the analysis is meaningless.


Therefore, the following factors must be considered: ground data collection, the classification scheme, spatial autocorrelation, sample size, and sampling scheme. Each of these factors provides essential information for the assessment, and failure to consider even one of them could lead to serious shortcomings in the assessment process.

Ground Data Collection

It is obvious that in order to adequately assess the accuracy of the remotely sensed classification, accurate ground, or reference, data must be collected. However, the accuracy of the ground data is rarely known, nor is the level of effort needed to collect the appropriate data clearly understood. Depending on the level of detail in the classification (i.e., the classification scheme), collecting reference data can be a very difficult task. For example, in a simple classification scheme the required level of detail may be only to distinguish residential from commercial areas, and collecting reference data may be as simple as obtaining a county zoning map. However, a more complex forest classification scheme may involve collecting reference data not only for tree species, but for size class and crown closure as well. Size class involves measuring the diameters of trees, and therefore a great many trees may have to be measured to estimate the size class for each pixel. Crown closure is even more difficult to measure. Therefore, in such a case, collecting accurate reference data can be difficult. A traditional solution to this problem has been for the producer and user of the classification to assume that some reference data set is correct. For example, the results of some photointerpretation or aerial reconnaissance may be used as the reference data. However, errors in the interpretation would then be blamed on the digital classification, thereby wrongly lowering the digital classification accuracy. It is exactly this problem that has caused the lack of acceptance of digital satellite data for many applications. Although no reference data set may be completely accurate, it is important that the reference data have high accuracy or else the assessment is not a fair one. Therefore, it is critical that ground or reference data collection be carefully considered in any accuracy assessment. Much work is yet to be done to determine the proper level of effort and the collection techniques necessary to provide this vital information.

Classification Scheme

When planning a project involving remotely sensed data, it is very important that sufficient effort be given to the classification scheme to be used. In many instances, this scheme is an existing one such as the Anderson classification system (Anderson et al., 1976). In other cases, the classification scheme is dictated by the objectives of the project or by the specifications of the contract. In all situations a few simple guidelines should be followed. First of all, any classification scheme should be mutually exclusive and totally exhaustive. In other words, any area to be classified should fall into one and only one category or class, and every area should be included in the classification. Finally, if possible, it is very advantageous to use a classification scheme that is hierarchical in nature. If such a scheme is used, certain categories within the classification scheme can be collapsed to form more general categories, as sketched below. This ability is especially important when trying to meet predetermined accuracy standards. Two or more detailed categories of lower than the minimum required accuracy may need to be grouped together (collapsed) to form a more general category that exceeds the minimum accuracy requirement. For example, it may be impossible to separate interior live oak from canyon live oak; these two categories may then have to be collapsed to form a live oak category to meet the required accuracy standard. Because the classification scheme is so important, no work should begin on the remotely sensed data until the scheme has been thoroughly reviewed and as many problems as possible identified. It is especially helpful if the categories in the scheme can be logically explained. The difference between Douglas fir and Ponderosa pine is easy to understand; the difference between Density Class 3 (50-70% crown closure) and Density Class 4 (>70% crown closure) may not be. In fact, many times these classes are rather artificial, and one can expect to find confusion between a forest stand with a crown closure of 67% that belongs in Class 3 and a stand of 73% that belongs in Class 4. Sometimes there is little that can be done about the artificial delineations in the classification scheme; other times the scheme can be modified to better represent natural breaks. However, failure to try to understand the classification scheme from the very beginning will certainly result in a great loss of time and much frustration in the end.



Spatial Autocorrelation
Spatial autocorrelation is said to occur when the presence, absence, or degree of a certain characteristic affects the presence, absence, or degree of the same characteristic in neighboring units (Cliff and Ord, 1973). This condition is particularly important in accuracy assessment if an error in a certain location can be found to positively or negatively influence errors in surrounding locations (Campbell, 1981). Work by Congalton (1988a) on Landsat MSS data from three areas of varying spatial diversity (i.e., an agriculture, a range, and a forest site) showed a positive influence as much as 30 pixels (over 1 mile) away. These results are explainable in an agricultural environment, where field sizes are large and a typical misclassification is an error in labeling an entire field. However, the results are more surprising for the rangeland and forested sites. Surely these results should affect the sample size and especially the sampling scheme used in accuracy assessment, particularly in the way this autocorrelation violates the assumption of sample independence. This autocorrelation may be responsible for periodicity in the data that could affect the results of any systematic sample. In addition, the size of the cluster used in cluster sampling would also be affected, because each new pixel would not contribute independent information.
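To illustrate what such an analysis looks for, here is a rough sketch that correlates a binary misclassification map with shifted copies of itself at several pixel lags. The error map here is synthetic, so the correlations hover near zero; Congalton (1988a) used join-count-based statistics rather than this simple correlation, so treat this only as a conceptual sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# True where a pixel is misclassified; synthetic stand-in for a real error map.
error_map = rng.random((200, 200)) < 0.2

for lag in (1, 5, 10, 30):
    a = error_map[:, :-lag].ravel().astype(float)   # original pixels
    b = error_map[:, lag:].ravel().astype(float)    # pixels shifted by `lag` columns
    r = np.corrcoef(a, b)[0, 1]
    print(f"lag {lag:>2} pixels: correlation = {r:+.3f}")   # near 0 for random errors
```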

Sample Size
Sample size is another important consideration when assessing the accuracy of remotely sensed data. Each sample point collected is expensive, so sample size must be kept to a minimum; yet it is critical to maintain a sample size large enough that any analysis performed is statistically valid. Of all the considerations discussed in this paper, the most has probably been written about sample size. Many researchers, notably van Genderen and Lock (1977), Hay (1979), Hord and Brooner (1976), Rosenfield et al. (1982), and Congalton (1988b), have published equations and guidelines for choosing the appropriate sample size. The majority of researchers have used an equation based on the binomial distribution, or the normal approximation to the binomial distribution, to compute the required sample size. These techniques are statistically sound for computing the sample size needed to estimate the overall accuracy of a classification, or even the overall accuracy of a single category. The equations are based on the proportion of correctly classified samples (pixels, clusters, or polygons) and on some allowable error. However, these techniques were not designed to choose a sample size for filling in an error matrix. In the case of an error matrix, it is not simply a matter of correct or incorrect; it is a matter of which error, or, in other words, which categories are being confused. Sufficient samples must be acquired to adequately represent this confusion. Therefore, the use of these techniques for determining the sample size for an error matrix is not appropriate. Fitzpatrick-Lins (1981) used the normal approximation equation to compute the sample size for assessing a land use/land cover map of Tampa, Florida. The computation showed that 319 samples needed to be taken for a classification with an expected accuracy of 85% and an allowable error of 4%. She ended up taking 354 samples and filling in an error matrix that had 30 categories in it (i.e., a matrix of 30 rows by 30 columns, or 900 possible cells). Although this sample size is sufficient for computing overall accuracy, it is obviously much too small to be represented in a matrix: only 35 of the 900 cells had a value greater than zero. Other researchers have used the equation to compute the sample size for each category. Although this results in a larger sample, the equation still does not account for the confusion between categories.


Because of the large number of pixels in a remotely sensed image, traditional thinking about sampling does not often apply. Even a one-half percent sample of a single Thematic Mapper scene can be over 300,000 pixels. Not all assessments are performed on a per-pixel basis, but the same relative argument holds true if the sample unit is a cluster of pixels or a polygon. Therefore, practical considerations more often dictate the sample size selection: a balance must be found between what is statistically sound and what is practically attainable. It has been my experience that a good rule of thumb is to collect a minimum of 50 samples for each vegetation or land use category in the error matrix. If the area is especially large (i.e., more than a million acres), or the classification has a large number of vegetation or land use categories (i.e., more than 12 categories), the minimum number of samples should be increased to 75 or 100 samples per category. The number of samples for each category can also be adjusted based on the relative importance of that category within the objectives of the mapping, or based on the inherent variability within each of the categories. Sometimes it is better to concentrate the sampling on the categories of interest, increasing their number of samples while reducing the number of samples taken in the less important categories. It may also be useful to take fewer samples in categories that show little variability, such as water or forest plantations, and to increase the sampling in categories that are more variable, such as uneven-aged forests or riparian areas. Again, the objective is to balance the statistical recommendations, in order to obtain an adequate sample to generate an appropriate error matrix, against the time, cost, and practical limitations associated with any viable remote sensing project.

Sampling Scheme
In addition to the considerations already discussed, the sampling scheme is an important part of any accuracy assessment. Selection of the proper scheme is absolutely critical to generating an error matrix that is representative of the entire classified image. A poor choice of sampling scheme can introduce significant biases into the error matrix, which may over- or underestimate the true accuracy. In addition, use of the proper sampling scheme may be essential depending on the analysis techniques to be applied to the error matrix. Many researchers have expressed opinions about the proper sampling scheme to use (e.g., Hord and Brooner, 1976; Ginevan, 1979; Rhode, 1978; Fitzpatrick-Lins, 1981). These opinions vary greatly among researchers and include everything from simple random sampling to stratified systematic unaligned sampling. Despite all these opinions, very little work has actually been performed in this area. Congalton (1988b) performed sampling simulations on three spatially diverse areas and concluded that in all cases simple random sampling without replacement and stratified random sampling provided satisfactory results.

Despite the nice statistical properties of simple random sampling, this sampling scheme is not always practical to apply. Simple random sampling tends to undersample small but possibly very important areas unless the sample size is significantly increased. For this reason, stratified random sampling, where a minimum number of samples is selected from each stratum (i.e., category), is recommended; a minimal sketch of this scheme appears below. Even stratified random sampling can be somewhat impractical, because ground information for the accuracy assessment must be collected at random locations on the ground. The problems with random locations are that access to them can be very difficult and that they can be selected only after the classification has been performed. This limits the accuracy assessment data to being collected late in the project, instead of in conjunction with the training data collection, thereby increasing the costs of the project. In addition, in some projects the time between the beginning of the project and the accuracy assessment may be so long as to cause temporal problems in collecting ground reference data. In other words, the ground may change (e.g., the forest may be harvested) between the time the project is started and the accuracy assessment is begun. Therefore, some systematic approach would certainly help make this ground collection effort more efficient by making it easier to locate the points on the ground and by allowing data to be collected simultaneously for training and assessment. However, the results of Congalton (1988a) showed that periodicity in the errors, as measured by the autocorrelation analysis, could make the use of systematic sampling risky for accuracy assessment. Therefore, perhaps some combination of random and systematic sampling would provide the best balance between statistical validity and practical application. Such a system might employ systematic sampling to collect some assessment data early in a project, while random sampling within strata would be used after the classification is completed, to assure that enough samples were collected for each category and to minimize any periodicity in the data.


In addition to the sampling schemes already discussed, cluster sampling has also been frequently used in assessing the accuracy of remotely sensed data, especially to collect information on many pixels very quickly. However, cluster sampling must be used intelligently: simply using very large clusters is not a valid method of collecting data, because the pixels within a cluster are not independent of one another and each adds very little information. Congalton (1988b) recommended that no clusters larger than 10 pixels, and certainly none larger than 25 pixels, be used, because of the lack of information added by each pixel beyond these cluster sizes.

Finally, some analytic techniques assume that certain sampling schemes were used to obtain the data. For example, the KAPPA analysis assumes a multinomial sampling model, and only simple random sampling completely satisfies this assumption. The effect of using the other sampling schemes discussed here is unknown. An interesting project would be to test the effect on the KAPPA analysis of using a sampling scheme other than simple random sampling. If the effect is found to be small, then the scheme may be appropriate to use within the conditions discussed above. If the effect is found to be large, then that sampling scheme should not be used to perform KAPPA analysis. To conclude that some sampling schemes can be used for descriptive techniques and others for analytical techniques seems impractical. Accuracy assessment is expensive, and no one is going to collect data for only descriptive use; eventually, someone will use that matrix for some analytical technique.

CONCLUSIONS

This paper has reviewed the factors and techniques to be considered when assessing the accuracy of classifications of remotely sensed data. The work has really just begun. The factors discussed here are certainly not fully understood. The basic issues of sample size and sampling scheme have not been resolved. Spatial autocorrelation analysis has rarely been applied to any study. Exactly what constitutes ground or reference data, and the level of effort needed to collect it, must be studied. Research needs to continue in order to balance what is statistically valid with what is practically applicable. This need becomes increasingly important as techniques are developed to use remotely sensed data over large regional and global domains. What is valid and practical over a small area may not apply to regional or global projects. Up to now, the little experience we have has been on relatively small remote sensing projects. However, there is a need to use remote sensing for much larger projects, such as monitoring global warming, deforestation, and environmental degradation. We do not know all the problems that will arise when dealing with such large areas. Certainly, the techniques described here must be extended and refined to better meet these assessment needs.

It is critical that this work, and the use of quantitative analysis of remotely sensed data, continue. We have suffered too long from the overselling of the technology and the underutilization of quantitative analysis early in the digital remote sensing era. Papers such as Meyer and Werth (1990), which state that digital remote sensing is not a viable tool for most resource applications, continue to demonstrate the problems we have created by not quantitatively documenting our work. We must put aside the days of casual assessment of our classifications: "It looks good" is not a valid accuracy statement. A classification is not complete until it has been assessed; then and only then can the decisions made based on that information have any validity. In addition, we must not forget that remotely sensed data are just a small subset of the spatial data currently being used in geographic information systems (GIS). The techniques and considerations discussed here need to be applied to all spatial data, and techniques developed for other spatial data need to be tested for use with remotely sensed data. The work has just begun, and if we are going to use spatial data to help us make decisions, and we should, then we must know about the accuracy of this information.
The author would like to thank Greg Biging and Craig Olson for their helpful reviews of this paper. Thanks also to the two anonymous reviewers whose comments significantly improved this manuscript.

REFERENCES

Anderson, J. R., Hardy, E. E., Roach, J. T., and Witmer, R. E. (1976), A land use and land cover classification system for use with remote sensor data, U.S. Geol. Survey Prof. Paper 964, 28 pp.
Aronoff, S. (1985), The minimum accuracy value as an index of classification accuracy, Photogramm. Eng. Remote Sens. 51(1):99-111.
Biging, G., and Congalton, R. (1989), Advances in forest inventory using advanced digital imagery, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 3, pp. 1241-1249.
Bishop, Y., Fienberg, S., and Holland, P. (1975), Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA, 575 pp.
Campbell, J. (1981), Spatial autocorrelation effects upon the accuracy of supervised classification of land cover, Photogramm. Eng. Remote Sens. 47(3):355-363.
Campbell, J. (1987), Introduction to Remote Sensing, Guilford Press, New York, 551 pp.
Chuvieco, E., and Congalton, R. (1988), Using cluster analysis to improve the selection of training statistics in classifying remotely sensed data, Photogramm. Eng. Remote Sens. 54(9):1275-1281.
Cliff, A. D., and Ord, J. K. (1973), Spatial Autocorrelation, Pion, London, 178 pp.
Cohen, J. (1960), A coefficient of agreement for nominal scales, Educ. Psychol. Measurement 20(1):37-46.
Congalton, R. G. (1988a), Using spatial autocorrelation analysis to explore errors in maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):587-592.
Congalton, R. G. (1988b), A comparison of sampling schemes used in generating error matrices for assessing the accuracy of maps generated from remotely sensed data, Photogramm. Eng. Remote Sens. 54(5):593-600.
Congalton, R. G., Oderwald, R. G., and Mead, R. A. (1983), Assessing Landsat classification accuracy using discrete multivariate statistical techniques, Photogramm. Eng. Remote Sens. 49(12):1671-1678.
Fitzpatrick-Lins, K. (1981), Comparison of sampling procedures and data analysis for a land-use and land-cover map, Photogramm. Eng. Remote Sens. 47(3):343-351.
Ginevan, M. E. (1979), Testing land-use map accuracy: another look, Photogramm. Eng. Remote Sens. 45(10):1371-1377.
Hay, A. M. (1979), Sampling designs to test land-use map accuracy, Photogramm. Eng. Remote Sens. 45(4):529-533.
Hord, R. M., and Brooner, W. (1976), Land use map accuracy criteria, Photogramm. Eng. Remote Sens. 42(5):671-677.
Hudson, W., and Ramm, C. (1987), Correct formulation of the kappa coefficient of agreement, Photogramm. Eng. Remote Sens. 53(4):421-422.
Meyer, M., and Werth, L. (1990), Satellite data: management panacea or potential problem?, J. Forestry 88(9):10-13.
Rhode, W. G. (1978), Digital image analysis techniques for natural resource inventory, in National Computer Conference Proceedings, pp. 43-106.
Rosenfield, G. (1981), Analysis of variance of thematic mapping experiment data, Photogramm. Eng. Remote Sens. 47(12):1685-1692.
Rosenfield, G., and Fitzpatrick-Lins, K. (1986), A coefficient of agreement as a measure of thematic classification accuracy, Photogramm. Eng. Remote Sens. 52(2):223-227.
Rosenfield, G. H., Fitzpatrick-Lins, K., and Ling, H. (1982), Sampling for thematic map accuracy testing, Photogramm. Eng. Remote Sens. 48(1):131-137.
Skidmore, A., and Turner, B. (1989), Assessing the accuracy of resource inventory maps, in Proceedings of Global Natural Resource Monitoring and Assessments: Preparing for the 21st Century, Venice, Italy, September, Vol. 2, pp. 524-535.
Story, M., and Congalton, R. (1986), Accuracy assessment: a user's perspective, Photogramm. Eng. Remote Sens. 52(3):397-399.
van Genderen, J. L., and Lock, B. F. (1977), Testing land use map accuracy, Photogramm. Eng. Remote Sens. 43(9):1135-1137.
