Peer Assessment in the Digital Age: A Meta-Analysis Comparing Peer and Teacher Ratings
Yao Xiong, Xiaojiao Zang, Mindy Kornhaber, Youngsun Lyu, Kyung Sun Chung, Hoi Suen, and Hongli Li
Paper presented at the 2014 Annual Meeting of the American Educational Research Association
Given the wide use of peer assessment, especially in higher education, the relative
accuracy of peer ratings compared to teacher ratings is a major concern for both educators and
researchers. This concern has grown with the increase of peer assessment in digital platforms. In the present study, we synthesize findings from studies on peer assessment published since 1999, when computer-assisted peer assessment
started to proliferate. The estimated average Pearson correlation between peer and teacher ratings
is found to be .63, which is moderately strong. This correlation is significantly higher when (a)
the peer assessment is paper-based rather than computer-assisted; (b) the subject area is not
medical/clinical; (c) the course is graduate level rather than undergraduate or K-12; (d)
individual work instead of group work is assessed; (e) the assessors and assessees are matched at
random; (f) the peer assessment is voluntary instead of compulsory; (g) the peer assessment is
non-anonymous; (h) peer raters provide both scores and qualitative comments instead of only
scores; and (i) peer raters are involved in developing the rating criteria. The findings are
expected to inform practitioners regarding peer assessment practices that are more likely to yield accurate peer ratings.
Based on a meta-analysis of 48 studies published between 1959 and 1999,
Falchikov and Goldfinch (2000) found the weighted correlation coefficient between peer ratings
and teacher ratings to be .69, which is moderately strong. Given the rise of digital and blended
courses, the use of computer-assisted peer assessment has grown remarkably in recent years
(Topping, 1998). However, among the 48 studies included in Falchikov and Goldfinch's meta-
analysis, very few involved computer-assisted peer assessment. Thus, they did not draw any
conclusion regarding the agreement between peer and teacher ratings in computer-assisted peer
assessment.
It is, therefore, necessary to synthesize more recent studies on peer assessment in order to
understand how well peer ratings correlate with teacher ratings in the current digital age. Also,
given the diversity of peer assessment practices (Gielen, Dochy, & Onghena, 2011), it is
important to understand which factors are likely to influence the agreement between peer and
teacher ratings. The purpose of the present study is to conduct a meta-analysis on the agreement
between peer and teacher ratings based on studies published since 1999. Specifically, two
research questions are addressed: (1) What is the agreement between peer and teacher ratings? (2) Which factors influence the agreement between peer and teacher ratings?
Literature Review
Peer assessment is defined as "an arrangement in which individuals consider the amount, level, value, worth, quality or success of the products or outcomes of learning of peers of similar status" (Topping, 1998, p. 250). In addition to increasing teachers' efficiency in grading, peer assessment is reported to benefit students' learning, promote students' critical thinking (Sims, 1989), and increase students' motivation to
learn (Topping, 2005). Additionally, peer assessment can function as a formative pedagogical
tool (Topping, 2009) or a summative assessment tool (Tsai, 2013). However, despite the great
potential of peer assessment, researchers and practitioners remain concerned about whether
students have the ability to assign reliable and valid ratings to their peers' work (Liu & Carless,
2006). Specifically, reliability refers to inter-rater consistency across peer raters, whereas validity
refers to the consistency between peer ratings and teacher ratings, assuming teacher ratings to be
the gold standard (Falchikov & Goldfinch, 2000). Validity is a major concern for teachers who
are interested in using peer assessment and is also the primary focus of the present meta-analysis.
In this meta-analysis we use the term "teacher ratings" to be consistent with Falchikov and Goldfinch (2000), whereas an alternative term used in the literature is "expert ratings." A review of prior studies shows considerable variation in the reported agreement between peer and teacher ratings. For instance, some studies found low correlations between peer and
teacher ratings, such as .29 in Kovach, Resch, and Verhulst (2009). Others, such as Cho, Schunn,
and Wilson (2006), reported a moderate agreement of .60 between peer and teacher ratings. Yet
others found high agreement between peer and teacher ratings, for example, .97 and .98 in Harris
(2011). In this study, a group of undergraduate students graded their peers' scientific laboratory
reports. Students had to sign the reports that they assessed, and they would be penalized if there
was a serious discrepancy between their ratings and the teacher ratings. Harris attributed the high
agreement to a few factors, such as the well-structured rating task, the fairly constrained rating
schedule, and the sufficient guidance provided to peer raters. Falchikov and Goldfinch's meta-
analysis provided a synthesis on the agreement between peer and teacher ratings for studies
conducted from 1959 to 1999. However, it is not clear whether studies published since 1999 will
generate a similar result, particularly given that peer assessment is increasingly mediated by
computer technology.
In their meta-analysis, Falchikov and Goldfinch (2000) further investigated the influence
of various factors on the agreement between peer and teacher ratings. Among the major factors
they explored were subject area, quality of study, number of peer raters, level of course, nature of
assessment task, and dimensional versus global judgment. For instance, they found that
global judgments with clearly stated criteria were superior to judgments on separate dimensions.
However, many of their findings are not conclusive depending on whether they included or
excluded a study by Burnett and Cavaye (1980). Burnett and Cavaye reported a correlation of .99
between peer ratings and grades given by teachers. When this study was omitted in the meta-
analysis, the number of peer raters, whether the courses were at advanced level, and whether the
courses were about social sciences became statistically non-significant, but the quality of studies
became statistically significant. Given the large influence of this study, Falchikov and Goldfinch
provided different findings depending on whether the Burnett and Cavaye study was included in
the meta-analysis or not. This inconclusiveness, alongside the growing use of computer-assisted
peer assessment, necessitates the present meta-analysis, in which we examine the influence of
different factors on the agreement between peer and teacher ratings in the digital age.
Four reviews shaped the factors included in the present meta-analysis. Topping (1998)
proposed a typology of 17 variables to help describe the diversity of peer assessment activities.
van den Berg, Admiraal, and Pilot (2006) and van Gennip, Segers, and Tillema (2009)
reorganized these 17 variables. Gielen et al. (2011) further added three more variables to
Topping's list and then classified the 20 variables into five categories: decisions concerning the
use of peer assessment, the link between peer assessment and other elements in the learning
environment, interactions between peers, the composition of assessment groups, and the
management of the assessment procedure. Based on these four reviews, as shown in Table 1, we
list 17 variables that might influence the agreement between peer and teacher ratings. These
variables were classified into two categories: those related to peer assessment settings and those related to peer assessment procedures.
Regarding peer assessment settings, a primary factor to be considered is the mode of peer assessment, that is, whether it is paper-based or computer-assisted. Computer-assisted peer assessment is advocated as having some advantages over traditional paper-based peer
assessment: (a) online environments ensure anonymity and promote fair assessment without
being influenced by friendship bias (Lin, Liu, & Yuan, 2001; Wen & Tsai, 2008); (b)
computer assistance enhances efficiency especially for teachers in large classes; and (c)
assessment can be performed freely without time and location restrictions (Wen & Tsai, 2008).
In addition to peer assessment mode, other important variables include the subject area
and the task being assessed (Falchikov & Boud, 1989; Falchikov & Goldfinch, 2000).
Furthermore, the level of the courses in which the peer assessment occurs is worth considering.
Falchikov and Goldfinch's meta-analysis included only peer assessment in higher education
because few studies had been conducted at the K-12 level then. In the present meta-analysis, we
examine whether the agreement between peer and teacher ratings varies across courses at the K-12, undergraduate, and graduate levels.
Regarding peer assessment procedures, one factor is the constellation of assessors and the
constellation of assessees (i.e., those who are assessed). For instance, assessors and assessees can
be individuals, pairs, or groups (van Gennip et al., 2009). Other factors are the number of peer
raters per assignment and the number of teachers per assignment (Falchikov & Goldfinch, 2000).
In addition, because the social context of peer assessment may introduce pressure, risk, or
competition among peers (Falchikov, 1995), it is important to examine whether assessors and assessees are matched at random, as well as whether participation is compulsory or voluntary. Some studies entailed compulsory participation in peer assessment (e.g., Bouzidi &
Jaillet, 2009), whereas in others, participants were self-selected (e.g., Hafner & Hafner, 2003).
Additionally, friendships among peers may result in scoring bias (Magin, 2001). Thus, it is
necessary to examine whether the peer assessment is anonymous or not. Furthermore, feedback
format may be an influential factor (Falchikov, 1995). Sometimes, peer raters provided only
scores (e.g., Sealey, 2013), whereas at other times they provided both scores and qualitative comments. A final set of variables concerns guiding and monitoring peer raters. Rating quality is said to improve when peer assessments are supported by
training, checklists, exemplification, teacher assistance, and monitoring (Berg, 1999; Miller,
2003; Pond, Ul-Haq, & Wade, 1995). As discussed by Falchikov and Goldfinch (2000), peer raters' familiarity with and ownership of assessment criteria tend to improve the accuracy of peer
assessment. Therefore, the present meta-analysis also includes three variables to reflect whether
peer raters receive training, whether explicit rating criteria are used, and whether peer raters are involved in developing the rating criteria.
Methods
Selecting Studies and Coding Procedures
The following criteria were used to select studies for inclusion in the present meta-
analysis. First, we included only studies in which one or more numerical measures of the
agreement between peer and teacher ratings were presented or could be directly obtained from
information provided in the study. Typically, the agreement was measured by Pearson
correlation. Second, to address the file drawer problem, i.e., studies producing significant effects
are more likely to be published (Glass, 1977; Rosenthal, 1979), both published and unpublished
studies were considered. Third, only studies written in English and published (or released) since
1999 were considered. Finally, studies conducted in both educational and medical/clinical settings were included.
The following procedures were used to search for eligible studies. First, using various
key words such as peer assessment, peer evaluation, peer rating, peer grading, peer scoring, and
peer feedback, we searched several well-known online databases (ERIC, PsycINFO, JSTOR, and
ProQuest). We also used the same key words to search Google Scholar for the most recent
published and unpublished studies not yet included in the databases. Second, we searched review
articles on peer assessment (e.g., Gielen et al., 2011; Speyer, Pilz, Kruis, & Brunings, 2011; van
Gennip et al., 2009) for relevant studies. Third, we reviewed references cited in the studies that
we had already determined to be eligible and added those that we had not already found through
other sources. Finally, we contacted scholars in the field to ask for recent studies that they may
have encountered but that had not yet been published. In total, 292 articles were initially located through these searches. After carefully examining each article and applying the inclusion criteria, we found 70
of them were eligible. In most cases, an article was excluded because it did not provide
numerical measures of the agreement between peer and teacher ratings. For example, a paper
authored by Senger and Kanthan (2012) showed the comparison between peer and teacher
ratings in graphs instead of numerical values and thus could not be included.
Many of the eligible studies involved more than one comparison (or effect size). For
instance, Liu, Lin, and Yuan (2002) reported the agreement between peer and teacher ratings for
six assignments separately, such that this article generated six comparisons (or effect sizes). The
multiple effect sizes within each study can be averaged, or one effect size can be selected from
each study. As a result, the number of effect sizes is drastically reduced, so that it is very
challenging to study the effects of the comparison characteristics. Alternatively, the dependence
can be ignored when appropriate (Borenstein, Hedges, Higgins, & Rothstein, 2009). Because the
samples used to calculate the six effect sizes were mutually exclusive in Liu et al. (2002), each
single comparison from the Liu et al. study was treated as a unit of analysis. Likewise, after a
thorough screening, we determined that 269 comparisons from 70 studies were eligible for the present meta-analysis.
Based on the literature discussed previously, a coding sheet was built after several rounds
of discussion among the authors, as shown in Table 1. The authors coded all the studies
collaboratively, and each article was coded by at least two coders in order to establish reliability.
A measure of inter-coder reliability, percentage agreement, was calculated for each coded
variable. The percentage agreement for variables related to peer assessment settings ranged from
95% to 100%, and the percentage agreement for variables related to peer assessment procedures
ranged from 80% to 100%. Disagreements in regard to the codes assigned were resolved through discussion among the coders.
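To make the percentage-agreement measure concrete, the following is a minimal Python sketch (not the authors' actual procedure; the variable and codes are hypothetical) of how it can be computed for one dummy-coded variable:

```python
def percentage_agreement(codes_a, codes_b):
    """Proportion of articles for which two coders assigned the same code."""
    assert len(codes_a) == len(codes_b), "coders must code the same set of articles"
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)

# Hypothetical codes for one dummy variable (1 = computer-assisted, 0 = paper-based)
coder_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
coder_2 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(f"Percentage agreement: {percentage_agreement(coder_1, coder_2):.0%}")  # 90%
```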
Table 1 presents details pertaining to the coded variables. Here are two examples.
Regarding subject area, given the small number of studies involving the arts, we collapsed social
science and the arts into one category. The subject area of science or engineering was coded as a
second category, and medical or clinical constituted a third category. Regarding task being rated,
two categories were established. Essays, reports, proposals, portfolios, design tasks, and exams
were coded as performance. As shown in Table 1, categorical variables were dummy coded for
subsequent analyses. These variables were subsequently used as potential predictors to account for variation in the effect sizes.
We adopted a hierarchical linear modeling approach to meta-analysis. This approach focuses on discovering and explaining
variations in effect sizes, and groups of subjects are regarded as nested within the primary studies
included in the meta-analysis. The level-1 model investigates how effect sizes vary across
studies, whereas the level-2 model focuses on explaining the potential sources of this variation.
In particular, the level-2 model examines multiple predictors of effect sizes simultaneously. With
this approach, analysts are able to estimate the average effect size across a set of studies and test
hypotheses about the effects of study characteristics on study outcomes. This approach was implemented with the HLM 6.08 software (Raudenbush, Bryk, & Congdon, 2004). As
demonstrated in Raudenbush and Bryk (2002), when a Pearson correlation $r_j$ is reported in a study, the correlation coefficient is transformed to the standardized effect size measure $d_j$ using Fisher's $Z$ transformation,
$$d_j = \tfrac{1}{2}\,\ln\!\left(\frac{1 + r_j}{1 - r_j}\right), \qquad [1]$$
with known sampling variance
$$V_j = \frac{1}{n_j - 3}, \qquad [2]$$
where $n_j$ is the sample size of comparison $j$.
With the hierarchical linear modeling approach, the level-1 outcome variable in the meta-analysis is $d_j$, the Fisher's $Z$ transformation of the $r_j$ reported for each comparison. When the variation in effect sizes is statistically significant, a level-2 analysis is used to determine the sources of this variation. Following Raudenbush and Bryk (2002), the level-1 model (often referred to as the unconditional model) is
$$d_j = \delta_j + e_j, \qquad [3]$$
where $\delta_j$ is the true population effect size for comparison $j$ and $e_j \sim N(0, V_j)$ is the sampling error.
In the level-2 model, the true population effect size $\delta_j$ depends on comparison characteristics:
$$\delta_j = \gamma_0 + \gamma_1 W_{1j} + \cdots + \gamma_k W_{kj} + u_j, \qquad [4]$$
where $W_{1j}, \ldots, W_{kj}$ are the predictors explaining $\delta_j$ across the effect sizes (see Table 1 for the list of predictors); $\gamma_0, \ldots, \gamma_k$ are the regression coefficients associated with the intercept and each predictor $W_k$; and $u_j \sim N(0, \tau)$ is the level-2 random effect.
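To illustrate this two-level logic, the sketch below (not the authors' HLM 6.08 analysis) converts hypothetical Pearson correlations to Fisher's $Z$ effect sizes with their known sampling variances and then fits a precision-weighted regression on a single hypothetical predictor, i.e., a fixed-effects approximation that ignores the random effect $u_j$:

```python
import numpy as np

# Minimal sketch (not the authors' HLM 6.08 analysis) of the two-step logic above.
# Hypothetical comparisons: Pearson correlations r_j with sample sizes n_j.
r = np.array([0.29, 0.60, 0.97, 0.55])
n = np.array([40, 120, 35, 80])

# Level 1: Fisher's Z effect sizes and their known sampling variances.
d = 0.5 * np.log((1 + r) / (1 - r))   # equivalent to np.arctanh(r)
V = 1.0 / (n - 3)

# Level 2 (simplified): precision-weighted regression of d_j on predictors W_kj,
# ignoring the random effect u_j (a fixed-effects approximation of the model).
W = np.array([1, 0, 0, 1])                 # hypothetical dummy: computer-assisted
X = np.column_stack([np.ones_like(d), W])  # intercept + predictor
w = 1.0 / V                                # precision weights

gamma = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * d))
print("Coefficients in Fisher's Z metric:", gamma)
print("Intercept back-transformed to Pearson r:", np.tanh(gamma[0]))
```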
Results
With the procedure described in the methods section, 269 comparisons from 70 studies
were included in the present meta-analysis. The distribution of the Pearson correlations is
illustrated in Figure 1. The correlations ranged from -.19 to .98, with a mean of .57 and a
standard deviation of .24. There were no obvious outliers beyond the range of -3 and 3 standard
deviation units. In addition, Table 1 shows the predictors and the corresponding frequencies. The
variable of the constellation of assessors was not included because the frequency of assessors in some of its categories was too small.
To begin, an intercept-only model was fit with no predictors included. The intercept (i.e.,
the estimated grand-mean effect size) was .74, which was statistically different from zero (t (268)
= 28.69, p < .001). This value indicates that on average the agreement between peer and teacher
ratings was .74 in the Fisher's Z metric, which is equivalent to .63 in the Pearson correlation metric. Furthermore, the estimated variance of the effect sizes was .14, significantly different from zero (χ² = 2,471.35, df = 268, p < .001). This result suggests that variability existed in the true effect sizes across the comparisons. Therefore, we proceeded to a level-2 conditional model. Before fitting this model, we conducted a multicollinearity diagnostic test with the 17 predictors (Belsley, Kuh, & Welsch, 1980). The multicollinearity among these predictors did not appear to be serious. All 17 predictors were thus
entered into the model simultaneously. For model parsimony, the predictors that lacked statistical
significance at the .05 level were dropped from the model one at a time starting from the one
with the largest p value. Because assessment mode was our focal interest, this predictor was
always kept in the model. Among the predictors listed in Table 1, eight were eventually dropped, including the indicator for K-12 course level, W8 (number of peer raters is between 6 and 10), W9 (number of peer raters is larger than 10), W10 (number of teachers per assignment), W15 (there are explicit rating criteria), and W17 (peer raters receive training). As a result, nine significant predictors were retained in the final
model. Below we report detailed results of the final model shown in Table 2. Specifically, we
describe the regression coefficient of each predictor in the Fisher's Z metric, i.e., how a predictor influenced the correlation between peer and teacher ratings when all the other predictors were held constant.
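The pruning rule described above can be sketched as follows (again not the authors' code); it assumes a pandas DataFrame `predictors` of dummy-coded W variables, a vector `d` of Fisher's Z effect sizes, precision weights `w`, and a hypothetical column name "computer_assisted" standing in for the focal assessment-mode predictor:

```python
import statsmodels.api as sm

def prune_predictors(d, predictors, w, keep=("computer_assisted",), alpha=0.05):
    """Drop non-significant predictors one at a time (largest p-value first),
    always retaining the focal predictor(s) named in `keep`."""
    X = predictors.copy()
    while True:
        fit = sm.WLS(d, sm.add_constant(X), weights=w).fit()
        # p-values of the droppable predictors (ignore the intercept and focal terms)
        droppable = fit.pvalues.drop(labels=["const", *keep], errors="ignore")
        if droppable.empty or droppable.max() <= alpha:
            return fit                              # remaining predictors all significant
        X = X.drop(columns=[droppable.idxmax()])    # drop the least significant predictor
```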
Among the variables related to peer-assessment settings, to begin with, the assessment
mode turned out to be a significant factor. When the peer assessment was computer-assisted, the
correlation between peer and teacher ratings was significantly lower by .14 standard deviation
units than when the peer assessment was paper-based. Second, when the subject area was
medical/clinical compared to when the subject area was social science/arts, the correlation was
significantly lower by .35 standard deviation units. In the full model with all the predictors, the
correlation was slightly higher when the subject area was science/engineering than when it was
social science/arts, though the difference was not statistically significant. Finally, when the
course was graduate level, the correlation was significantly higher by .18 standard deviation
units than when the course was undergraduate level. Undergraduate and K-12 courses, however, did not differ significantly from each other.
Among the variables related to peer assessment procedures, to begin with, the
constellation of assessees was significant. When assessees were a group, the correlation between
peer and teacher ratings was significantly lower by .26 standard deviation units compared to
when assessees were individuals. Second, when assessors and assessees were not matched at
random, the correlation was significantly lower by .27 standard deviation units. Third, when peer
assessment was voluntary, the correlation was significantly higher by .28 standard deviation
units than when peer assessment was compulsory. Fourth, when peer raters were non-
anonymous, the correlation was significantly higher by .15 standard deviation units than when
peer raters were anonymous. Fifth, when peer raters provided both scores and comments, the
correlation was also significantly higher by .15 standard deviation units than when peer raters
provided only scores. Finally, when peer raters were involved in developing the rating criteria,
the correlation was significantly higher by .60 standard deviation units than when peer raters were not involved.
In the final model, when all nine significant predictors were included, the estimated
variance of the effect sizes was .09, significantly different from zero (χ² = 1,623.73, df = 259, p < .001). Using the variance component in the intercept-only model as the baseline (Raudenbush & Bryk, 2002), we found that these nine predictors explained 34.30% of the variance in the effect sizes. This indicates that other sources of variability still exist among the effect sizes.
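The 34.30% is the standard variance-explained statistic of Raudenbush and Bryk (2002), presumably computed with the unrounded variance estimates; with the rounded components reported above, the ratio is approximately
$$\frac{\hat{\tau}_{\text{unconditional}} - \hat{\tau}_{\text{conditional}}}{\hat{\tau}_{\text{unconditional}}} = \frac{.14 - .09}{.14} \approx .36.$$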
To illustrate how these estimates can be used, we computed estimated Pearson correlations for several scenarios based on the final model. First, let us define a default scenario in which all the predictors are zero, i.e., the peer assessment is paper-based; the subject area is social science/arts; the course is at the undergraduate level; assessees are individuals; assessors and assessees are matched at random; peer assessment
is compulsory; peer raters are anonymous; only scores are provided; and peer raters are not
involved in developing criteria. In this default scenario, the estimated correlation in the Fisher's Z metric is .69 (i.e., the intercept in the final model), which is equivalent to .60 in the Pearson
correlation metric. When assessment mode is computer-assisted and all the other predictors are
held at zero as described in the default scenario, the estimated correlation in the Fisher's Z metric is .55 (i.e., .69 - .14). This value is equivalent to .50 in the Pearson correlation metric. In a similar
way, we calculated the estimated Pearson correlations for other scenarios listed in Table 3. In
general, the estimated Pearson correlations between peer and teacher ratings vary substantially across these scenarios.
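These scenario values follow directly from adding the relevant coefficients in the Fisher's Z metric and back-transforming with the hyperbolic tangent, as in the minimal sketch below (the two coefficients are taken from the text above; other rows of Table 3 would add their own coefficients):

```python
import numpy as np

# Reproducing the two scenario values quoted above (coefficients in Fisher's Z metric).
intercept = 0.69     # default scenario: paper-based, social science/arts, undergraduate, ...
b_computer = -0.14   # shift when the peer assessment is computer-assisted

for label, z in [("default (paper-based)", intercept),
                 ("computer-assisted", intercept + b_computer)]:
    print(f"{label}: Z = {z:.2f}, estimated Pearson r = {np.tanh(z):.2f}")
# default (paper-based): Z = 0.69, estimated Pearson r = 0.60
# computer-assisted:     Z = 0.55, estimated Pearson r = 0.50
```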
Discussion
As shown in the intercept-only model, based on 269 effect sizes from 70 studies, the
estimated average Pearson correlation between peer and teacher ratings was .63. This is
significantly different from zero and moderately strong in a practical sense. This result does not
depart much from what was reported by Falchikov and Goldfinch (2000), in which the
weighted Pearson correlation between peer and teacher ratings was .69. In both the present meta-
analysis and the one conducted by Falchikov and Goldfinch, the correlations were adjusted based
on sample sizes, and thus the results are comparable. The present meta-analysis confirms that
peer ratings generally show a moderately high level of agreement with teacher ratings. It also
finds that the peer-teacher rating agreement based on studies since 1999 is only slightly lower
than that based on studies before 1999. At the same time, the present meta-analysis reveals
insights about factors influencing peer assessment conducted in the digital age, discussed below.
Computer-assisted peer assessment showed significantly lower agreement between peer and teacher ratings than traditional paper-based peer assessment (γ = -.14, t = -2.25, p < .05). Though computer-assisted peer assessment is seen as
more efficient (Lin et al., 2001; Wen & Tsai, 2008), peer raters in a computer-assisted
environment might perform worse, perhaps due to reduced attention, effort, or instructional
support (Suen, 2014). It is also the case that computer-assisted peer assessment is still in its
infancy and thus has yet to show its full potential. Furthermore, the computer-assisted mode could
cover a broad range regarding the extent to which the computer technology is used in peer
assessment. For example, in Lin et al. (2001), a sophisticated web-based peer assessment system,
named NetPeas, was used to administer the peer assessment tasks, whereas in Chen and Tsai
(2009), the peer raters mainly used computer technology for uploading and downloading peer
assessment materials. It would be desirable to further classify computer-assisted mode into more
refined categories regarding how much technology is used. However, many studies included in
the meta-analysis did not provide sufficient information on this point, so we included only a
broad category of computer-assisted peer assessment. Further research is needed to study the
effective design and use of technologies to improve the accuracy of peer assessment in digital
environments.
When the subject area was medical/clinical compared to when it was social science/arts
or science/engineering, the correlation between peer and teacher ratings was significantly lower
(γ = -.35, t = -3.19, p < .001). This accords with Falchikov and Goldfinch (2000) in that peer
ratings in professional practice (e.g., clinical skills or teaching practice) were more problematic
than those in academic practices. The present meta-analysis also shows that the correlation was
slightly higher when the subject area was science/engineering compared to when it was social
science/arts although the difference was not statistically significant. A plausible reason is that the
science and engineering tasks are more likely to have clear-cut right or wrong answers, making
them easier for peers to assess. Determining the proper level of granularity in categorizing
subject areas has been a challenge and is always somewhat arbitrary. Having too many
categories will not only reduce the sample size of each category but also introduce too many
dummy-coded predictors for the analysis. We opted to use three categories for the subject areas:
medical/clinical, science/engineering, and social science/arts. With only four studies in the arts, further subdividing the category of social science/arts would be problematic.
In addition, the correlation between peer and teacher ratings for graduate courses was
significantly higher than for undergraduate courses (γ = .18, t = 2.49, p < .05). This was to be
expected as students taking advanced courses are likely to be more cognitively advanced and
potentially have higher reflection skills than those taking introductory courses (Falchikov &
Boud, 1989; Nulty, 2011). Reflection skills are argued to be an important factor related to the
peer assessment quality (Sluijsmans, Brand-Gruwel, van Merriënboer, & Bastiaens, 2002). We
also expected peer and teacher ratings on undergraduate courses to have a higher correlation than
those on K-12 courses. However, the difference we found was not statistically significant,
probably because the present meta-analysis included only a small number of studies involving K-
12 students.
As shown in the present meta-analysis, when assessees were groups, the correlation
between peer and teacher ratings was significantly lower than when assessees were individuals (γ = -.26, t = -3.56, p < .001). When assessees are groups, the work being assessed is typically
group work, which usually involves interactions and dynamics among group members. Such group work is more complex to assess than individual work (Panitz, 2003). This partially explains why peer ratings were less accurate when group work was assessed.
When assessors and assessees were matched at random, the correlation between peer and
teacher ratings was higher than when the matching was not random (γ = -.27, t = -3.83, p < .001).
Assessment bias exists when there is a systematic tendency for peer assessment scores to be
influenced by anything other than the trait, behavior, or outcome they are supposed to be
measuring (Kane & Lawler, 1978, p. 558). It is reasonable that randomly matching assessors
and assessees helps to reduce certain systematic biases and thus leads to higher agreement between peer and teacher ratings.
Voluntary peer ratings showed more agreement with teacher ratings than did compulsory
peer ratings (γ = .28, t = 3.40, p < .01). When students are given choices, they are more likely to
engage in the task and the learning process (Boud, 2012). Also, voluntary peer assessment
usually happens when peers are interested in and motivated to participate in peer assessment
activities, which might lead to more accurate peer ratings. Therefore, it will be helpful for
teachers to boost students' interest in and motivation for conducting peer assessment.
The correlation between peer and teacher ratings was significantly higher when peer
raters were non-anonymous than when they were anonymous (γ = .15, t = 2.40, p < .05).
Anonymity is believed to lead to a fairer environment and more honest ratings (Joinson, 1999).
However, as discussed by Cestone, Levine, and Lane (2008), when peer assessment is
anonymous, students may provide harsher criticism and evaluations. Also, previous research
noted that peer raters may lack effort or seriousness when performing the assessment (Hanrahan
& Isaacs, 2001). It is possible that non-anonymity may lead peer raters to take the assessment
more seriously, thereby generating more accurate ratings. Bloxham and West (2004) asked peer
raters to evaluate each other's peer rating quality and found that this practice encouraged peer raters to take the task more seriously.
When peer raters provided both scores and comments, the correlation between peer and
teacher ratings was significantly higher than when peer raters provided only scores (γ = .15, t =
2.57, p < .05). It is reported that students feel more comfortable giving qualitative feedback than
giving a purely quantitative evaluation of their peers' work (Cestone et al., 2008). Moreover, it is likely that qualitative comments enable students to reflect more actively on their peers' work,
thereby supporting a clearer rationale for their numerical rating (Avery, 2014). This reflection
and documentation helps to improve the accuracy of peer ratings. In addition, Davies (2006, 2009) reported strong correlations between the qualitative comments (i.e., negative/positive) peer raters provided and the numerical marks awarded.
A striking finding is that peer rater involvement in developing rating criteria yielded
much higher correlations between peer and teacher ratings than when peer raters were not
involved in developing the criteria (γ = .60, t = 5.55, p < .001). Discussion, negotiation, and joint construction of assessment criteria are likely to give students a greater sense of ownership and
investment in their evaluations (Cestone et al., 2008; Topping, 2003). Such student involvement
can also make the rating criteria more understandable and easier to apply (Orsmond, Merry, &
Reiling, 1996). In addition, through this involvement, students can gain an opportunity to reflect
on their own learning. For this reason, we suggest the practice of involving peer raters in developing the rating criteria.
However, the correlation between peer and teacher ratings was not significantly higher
when peer raters had received training. A possible reason for this is that the quality of peer-rater
training may vary across the studies included in this meta-analysis such that a dichotomous
coding of whether peers had received training may not have been sufficient to capture its effect.
In addition, raters might have performed peer assessment or have received training prior to the
peer assessment activity reported in the study. Nevertheless, this information was not available in
most articles and thus was not included in the present meta-analysis. Further, we did not find the
correlation to be significantly higher when explicit criteria were used. This is probably because
almost all the studies included in this meta-analysis used explicit criteria, so that it was difficult to detect the effect of this variable.
Neither the number of teachers per assignment nor the number of peer raters per
assignment was statistically significant. In the original full model, when the number of peer
raters per assignment was medium (6–10) or high (larger than 10) compared to when the number of peer raters per assignment was low (1–5), the correlation between peer and teacher ratings was
slightly higher, although the difference was not statistically significant at the .05 level. This
direction agrees with the previous literature (Cho et al., 2006; Winch & Anderson, 1967)
wherein the agreement between peer and teacher ratings is higher when more peer raters are
involved. Falchikov and Goldfinch (2000), however, found that the agreement was lower when
the number of peer raters was more than 20. Nevertheless, there is no agreed-upon value for the optimal number of peer raters.
Based on empirical studies published since 1999, this meta-analysis investigates the
agreement between peer and teacher ratings and factors that might significantly influence this
agreement. We found that peer and teacher ratings overall agree with each other at a moderately
high level (r = .63). This correlation is significantly higher when (a) the peer assessment is
paper-based rather than computer-assisted; (b) the subject area is not medical/clinical; (c) the
course is graduate level rather than undergraduate or K-12; (d) individual work instead of group
work is assessed; (e) the assessors and assessees are matched at random; (f) the peer assessment
is voluntary instead of compulsory; (g) the peer assessment is non-anonymous; (h) peer raters
provide both scores and qualitative comments instead of only scores; and (i) peer raters are
involved in developing the rating criteria. It is important to note that the effect of each variable
was examined when the other variables were controlled for. The findings of this meta-analysis
are expected to inform educational practitioners on how to structure peer assessment in ways that are more likely to yield accurate peer ratings.
Given the prevailing practice of peer assessment in higher education, this meta-analysis
especially has important implications for higher education in the digital age. A noteworthy
finding is that the agreement between peer and teacher ratings was significantly lower when the
peer assessment was computer-assisted rather than paper-based. With the rapid growth of
technology and calls for cost containment in higher education, we anticipate that peer assessment
will be increasingly mediated by computer technology. Such trends are evident in the eruption of
massive open online courses (MOOCs) throughout higher education, where computer-assisted
peer assessment is the primary assessment (Suen, 2014). Given these trends, it is essential to
conduct research on improving the quality and accuracy of peer assessment in the digital age. In
addition, the present meta-analysis confirms that agreement between peer and teacher ratings is
higher for graduate courses than for undergraduate or K-12 courses. This finding provides a basis
for educators to use peer assessment with more confidence in graduate-level courses.
The present meta-analysis focused on the agreement between peer and teacher ratings and the factors likely to influence the extent of this agreement.
This is based on the assumption that teacher ratings are the gold standard, which is the norm in
the field but one worthy of further investigation. Furthermore, peer assessment involves a wide
range of activities, and a list of variables influencing the agreement between peer and teacher
ratings will never be exhaustive (Gielen et al., 2011). We included only theoretically meaningful
predictors that could be reliably coded. As a result, the current meta-analysis explained only
about one third of the variation in the agreement between peer and teacher ratings, and it is
important to examine the influence of other potential factors in future research. For instance, peer
assessment can be formative or summative, but the present meta-analysis did not examine
whether the purpose of peer assessment influences the agreement between peer and teacher
ratings. Also, peer raters from different cultural backgrounds could show different performances
or styles in terms of allocating scores to their peers (Fan, 2011). Finally, the present meta-
analysis did not look into the reliability of peer assessment (i.e., the inter-rater consistency across
peer raters) and the effect of peer assessment on learning. All these issues need to be addressed
in future work.
A final note is to reflect on the use of meta-analysis. Meta-analysis is a quantitative
synthesis of findings from different studies. One criticism against meta-analysis is that
researchers compare apples and oranges across studies because each individual study is different
in nature (Sharpe, 1997). As stated by Glass (2000), a meta-analysis asks questions about fruit,
for which both apples and oranges contribute important information. Further, meta-analysis
researchers take into account the characteristics of different studies while combining their
results. In the present meta-analysis, the level-2 analysis ascertains whether differences in course level, subject area, and other variables explain variance in the effect sizes, thus minimizing the problem of taking an unqualified average effect size across all studies.
Nevertheless, the results of our meta-analysis are necessarily influenced by the level-2 predictors
we included and the way we chose to operationalize them. For example, we chose to take the
coarsest level of granularity when we divided subject areas into only three categories:
medical/clinical, science/engineering, and social sciences/arts. Also, there are potentially other
predictors not included due to a lack of information provided in the studies. It is, therefore,
necessary to replicate and/or extend the current meta-analysis when new studies become available.
References
*Ahangari, S., Rassekh-Alqol, B., & Hamed, L. A. (2013). The effect of peer assessment on oral
electronic voting system and an automated feedback tool. Proceedings of CAA 2010
http://caaconference.co.uk/pastConferences/2010/Barker-CAA2010.pdf
*Basehore, P. M., Pomerantz, S. C., & Gentile, M. (2014). Reliability and benefits of medical
student peers in rating complex clinical skills. Medical Teacher, 36(5), 409–414.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential
data and sources of collinearity. New York, NY: John Wiley and Sons.
Berg, E. C. (1999). The effects of trained peer response on ESL students' revision types and
Bloxham, S., & West, A. (2004). Understanding the rules of the game: Making peer assessment
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-
Boud, D. (2012). Developing student autonomy in learning (2nd ed.). New York, NY: Nichols
Publishing Company.
*Bouzidi, L., & Jaillet, A. (2009). Can online peer assessment be trusted? Journal of Educational
Burnett, W., & Cavaye, G. (1980). Peer assessment by fifth year students of surgery. Assessment
Cestone, C. M., Levine, R. E., & Lane, D. R. (2008). Peer assessment and evaluation in team-
based learning. New Directions for Teaching and Learning, 2008(116), 69–78.
*Chen, Y. C., & Tsai, C. C. (2009). An educational research course facilitated by online peer
*Cheng, W., & Warren, M. (1999). Peer and teacher assessment of the oral and written tasks of a
*Cho, K., Schunn, C., & Wilson, R. (2006). Validity and reliability of scaffolded peer
*Coulson, M. (2009). Peer marking of talks in a small, second year biosciences course. 2009
prod.library.usyd.edu.au/index.php/IISME/article/viewFile/6197/6845
*Daniel, R. (2005). Sharing the learning process: Peer assessment applications in practice.
Proceedings of the Effective Teaching and Learning Conference 2004. Retrieved from
http://researchonline.jcu.edu.au/5024/
Davies, P. (2006). Peer assessment: Judging the quality of students work by comments rather
Falchikov, N., & Boud, D. (1989). Student self-assessment in higher education: A meta-analysis.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-
analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287–322.
Fan, M. (2011). International students' perceptions, practices and identities of peer assessment. Retrieved from https://www.heacademy.ac.uk/node/3770
*Freeman S. & Parks, J.W. (2010). How accurate is peer grading? CBE Life Science Education,
9(4), 482–488.
*Garcia-Ros, R. (2011). Analysis and validation of a rubric to assess oral presentation skills in
1,043–1,062.
Gielen, S., Dochy, F., & Onghena, P. (2011). An inventory of peer assessment diversity.
Education, 5, 351–379.
Glass, G. V. (2000). Meta-analysis at 25. Retrieved from
http://glass.ed.asu.edu/gene/papers/meta25.html
*Gracias, N., & Garcia, R. (2013). Can we trust peer grading in oral presentations? Towards
http://users.isr.ist.utl.pt/~ngracias/publications/Gracias13_Edulearn_1329.pdf
*Griesbaum, J. & Gortz, M. (2010). Using feedback to enhance collaborative learning and
exploratory study concerning the added value of self- and peer-assessment by first-year
503.
*Hafner, J., & Hafner, P. (2003). Quantitative analysis of the rubric as an assessment tool: An
25(12), 1,509–1,528.
*Hamer, J. Purchase, H., Luxton-Reilly, A., & Denny, P. (2014, online first). A comparison of
DOI:10.1080/02602938.2014.893418.
*Hammer, R., Ronen, M., & Kohen-Vacs, D., (2012). On-line project-based peer assessed
*Han, Y., James, D. H., & McLain, R. M. (2013). Relationships between student peer and
for marking laboratory reports and a review of related practices. Advances in Physiology
*Heyman, J. E., & Sailors, J. J. (2011). Peer assessment of class participation: Applying peer
36(5), 605–618.
*Hidayat, M. T. (2013). Self-, peer- and teacher-assessment in translation course. Retrieved from
http://file.upi.edu/Direktori/FPBS/JUR._PEND._BAHASA_INGGRIS/19670609199403
1-
DIDI_SUKYADI/SELF,%20PEER%20AND%20TEACHER%20ASSESSMENT%20IN
%20TRANSLATION%20COURSE.pdf
*Jones, I., & Alcock, L. (2013, online first). Peer assessment without assessment criteria. Studies
*Kakar, S. P., Catalanotti, J. S., Flory, A. L., Simmerns, S. J., Lewis, K. L., Mintz, M. L.,
Hwaywood, Y.C., & Blatt, B.C. (2013). Evaluating oral case presentations using a
88(9), 1,363–1,367.
Kane, J. S., & Lawler, E. E. (1978). Methods of peer assessment. Psychological Bulletin, 85(3),
555–586.
*Killic, G. B., & Cakan, M. (2007). Peer assessment of elementary science teaching skills.
*Kovach, A. R., Resch, S. R., & Verhulst, J. S. (2009). Peer assessment of professionalism: A
742–746.
*Langan, M. A., Shuker, D. M., Cullen, W. R., Penney, D., Preziosi, F .R., & Wheater, P. C.
(2008). Relationships between student characteristics and self-, peer and tutor evaluations
*Lanning, S. K., Brickhouse, T. H., Gunsolley, J. C., Ranson, S. L., & Wilett, R. M. (2011).
*Liang, J. C., & Tsai, C. C. (2010). Learning through science writing via online peer assessment
*Lin, S. S. J., Liu, E. X. F. & Yuan, S. M. (2001). Web-based peer assessment: Feedback for
420–432.
*Liow, J-L. (2008). Peer assessment in thesis oral presentation. European Journal of
*Lirely, R., Keech, M. K., Vanhook, C., & Little, P. (2011). Development and evaluative
accounting course. International Journal of Business and Social Science, 2(23), 89–94.
Liu, N. F, & Carless, D. (2006). Peer feedback: The learning element of peer assessment.
study of comparing self and peer assessment with instructor assessment under a
*Liu, C. C., & Tsai, C. M. (2005). Peer assessment through web-based knowledge acquisition:
Magin, D. (2001). Reciprocity as a source of bias in multiple peer assessment of group work.
*Magin, D., & Helmore, P. (2001). Peer and teacher assessments of oral presentation skills: How
*Mehrdad, N., Bigdeli, S., & Ebrahimi, H. (2012). A comparative study on self, peer and teacher
*Mika, S. (2006). Peer- and instructor assessment of oral presentations in Japanese university
EFL classrooms: A pilot study. Waseda Global Forum, 3, 99–107. Retrieved from
https://dspace.wul.waseda.ac.jp/dspace/bitstream/2065/11344/1/13M.Shimura.pdf
Miller, P. J. (2003). The effect of scoring criteria specificity on peer and self-assessment.
Nulty, D. D. (2011). Peer and self-assessment in the first year of university. Assessment &
*Okuda, R. & Otsu, R. (2010). Peer assessment for speeches as an aid to teacher grading. The
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking criteria in the use of
techniques used in design courses. Paper presented at 2012 ASEE Annual Conference,
http://www.asee.org/public/conferences/8/papers/4547/view
*Otoshi, J. & Heffernan, N. (2007). An analysis of peer assessment in EFL college oral
*Panadero, E., Romero, M., & Strijbos, J. M. (2013). The impact of a rubric and friendship on
Robinson, & D. Ball (Eds), Small group instruction in higher education: Lessons from
the past, visions of the future (pp.193-200). Stillwater, OK: New Forums Press.
*Papinczak, T., Young, L., Groves, M., & Haynes, M. (2007). An analysis of peer, self, and tutor
Pond, K., Ul-Haq, R., & Wade, W. (1995). Peer review: A precursor to peer assessment.
*Raes, A., Vanderhoven, E., & Schellens, T. (2013, online first). Increasing anonymity in peer
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data
Raudenbush, S. W., Bryk, A. S, & Congdon, R. (2004). HLM 6 for Windows [Computer
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological
*Rudy, D. W., Fejfar, M. C., Griffith, C. H., & Wilson, J. F. (2001). Self- and peer assessment
*Sadler, M. P., & Good, E. (2006). The impact of self- and peer-grading on student learning.
*Saito, H. (2008). EFL classroom peer assessment: Training effects on rating and commenting.
*Sealey, R. M. (2013). Peer assessing in higher education: Perspectives of studies and staff.
Senger, J. L., & Kanthan, R. (2012). Student evaluations: Synchronous tripod of learning
Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-
*Sila A. & Bartan, O. (2010). Self and peer assessment in different ability groups. Paper
http://www.iconte.org/FileUpload/ks59689/File/8.pdf
Sims, G. K. (1989). Student peer review in the classroom: A teaching and grading tool. Journal
Sluijsmans, D., Brand-Gruwel, S., van Merriënboer, J. J., & Bastiaens, T. J. (2002). The training
Speyer, R., Pilz, W., van Der Kruis, J., & Brunings, J. (2011). Reliability and validity of student
572–585.
*Sitthiworachart, J., & Joy, M. (2008). Computer support of effective peer assessment in an
231.
*Steensels, C., Leemans, L., Buelens, H., Laga, E., Lecoutere, A., Laekeman, G., & Simoens, S.
Suen, H. K. (2014). Peer assessment for massive open online courses (MOOCs). The
*Sullivan, M. E., Hitchcock, M. A., & Dunnington, G. L. (1999). Peer and self assessment during
*Thomas, P. A., Gebo, K. A., & Hellmann, D. B. (1999). A pilot study of peer review in
Topping, K. J. (1998). Peer assessment between students in colleges and universities. Review of
Topping, K. J. (2003). Self and peer assessment in school and university: Reliability, validity and
modes of assessment: In search of qualities and standards (pp. 55–87). Dordrecht, The
*Tsai, C.-C., & Liang, J.-C. (2009).The development of science activities via on-line peer
293–310.
*Tsai, C.-C., Lin, S. S .J., & Yuan, S.-M. (2002). Developing science activities through a
*Tseng, S.-C., & Tsai, C.-C. (2007). On-line peer assessment and the role of the peer feedback:
A study of high school computer course. Computers & Education, 49(4), 1,161–1,174.
*Tseng, S-C., & Tsai, C.-C. (2009). Exploring the role of student online peer assessment self-
van den Berg, I., Admiraal, W., & Pilot, A. (2006). Peer assessment in university teaching:
Evaluating seven course designs. Assessment & Evaluation in Higher Education, 31(1), 19–36.
van Gennip, N, Segers, M., & Tillema, H. (2009). Peer assessment for learning from a social
*Wen, L. M., & Tsai, C.-C. (2008). Online peer assessment in an in service science and
Winch, R. F., & Anderson, R. B. W. (1967). Two problems involved in the use of peer rating
316–322.
*Xiao, Y., & Lucking, R. (2008). The impact of two types of peer assessment on students'
performance and satisfaction within a Wiki environment. Internet and Higher Education,
11(3), 186–193.
*Yinjaroen, P., & Chiramanee, T. (2011). Peer assessment of oral English proficiency. Paper
presented at The 3rd International Conference on Humanities and Social Sciences. Hat
http://tar.thailis.or.th/bitstream/123456789/660/1/001.pdf
*Zakian, M., Moradan, A., Naghibi, S. E. (2012). The relationship between self-, peer-, and