Neal M. Kingston and Neil J. Dorans

June 1982
Copyright © 1982 by Educational Testing Service. All rights reserved.
ABSTRACT
The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as practice effect when the result is improved examinee performance and as fatigue effect when the result is poorer examinee performance. Item response theory item statistics were used to assess position effects because, unlike traditional item statistics, they are sample invariant. In addition, the use of item response theory statistics allows one to make a reasonable adjustment for speededness, which is important when, as in this research, the same item administered in different positions is likely to be affected differently by speededness, depending upon its location in the test.

Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted.

The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.

Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section.

Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1.
The results of this research strongly suggest an important implication for equating. If an item type is subject to a within-test context effect, any equating method that uses item data, either directly or as part of a precalibration design, should provide for administration of the items in the same position within old and new forms. Although a within-test context effect might have a negligible influence on a single equating, it could contribute to scale drift because of the systematic bias.
TABLE OF CONTENTS

INTRODUCTION
    Traditional Analysis of Practice Effects
    Potential Advantages of Using IRT Item Statistics to Investigate the Effects of Item Position on Item Responding Behavior

RESEARCH DESIGN
    Test Forms
    Item Calibration Procedures
    IRT Linking Procedure

Verbal Item Types
Quantitative Item Types
Analytical Item Types

SUMMARY AND IMPLICATIONS

REFERENCES
INTRODUCTION
When test forms are equated one wants to minimize the error variance of the process. Generally, one attempts to do so by choosing an appropriate data collection design. The standard errors of equating associated with linear methods for various data collection designs are well known (Angoff, 1971; Lord, 1950; Lord & Stocking, 1973). Designs that yield the smallest standard errors of equating for a given sample size are those in which the examinees taking both the old and the new forms take some common items. This can be accomplished in several ways. Each test may have a set of items, which may or may not count toward an examinee's score, identical to a set in the other test or, to carry the common item idea to its extreme, each examinee can take both forms of the test.
It has long been assumed (Lord (1950) is the earliest reference the authors have found, but the idea is probably older) that if all examinees took form 1 followed by form 2, the equating might be biased by an order effect. To overcome this effect, a counterbalanced administration of forms has been used: a random half of the examinees takes form 1 followed by form 2; the other half takes form 2 followed by form 1. The order effect could then be estimated and accounted for in the equating process (Lord, 1950; Angoff, 1971). This estimation procedure assumes that the order effect is proportional to the standard deviation of scores on each test form and that there is no form by order interaction. We are aware of no empirical evidence supporting these assumptions.
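A minimal sketch of the counterbalanced design just described, using simulated data and a simple difference-of-means estimate of the order effect; the Lord (1950) and Angoff (1971) procedure referred to above additionally assumes the effect is proportional to the standard deviation of scores on each form, which this toy example does not model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # examinees per order group (illustrative)

# Hypothetical abilities and order effect (assumptions, not data from this study).
ability = rng.normal(50, 10, size=2 * n)
order_boost = 1.5  # score gain on whichever form is taken second (practice)

# Group A takes form 1 then form 2; group B takes form 2 then form 1.
a, b = ability[:n], ability[n:]
form1_a = a + rng.normal(0, 5, n)                # form 1 taken first
form2_a = a + order_boost + rng.normal(0, 5, n)  # form 2 taken second
form2_b = b + rng.normal(0, 5, n)                # form 2 taken first
form1_b = b + order_boost + rng.normal(0, 5, n)  # form 1 taken second

# Simple difference-of-means estimate of the order effect on each form.
order_effect_form1 = form1_b.mean() - form1_a.mean()
order_effect_form2 = form2_a.mean() - form2_b.mean()
print(f"estimated order effect: form 1 {order_effect_form1:.2f}, form 2 {order_effect_form2:.2f}")
```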
It is usually difficult to get a sufficient number of examinees willing to take two full-length tests. In addition, new legislative test disclosure requirements are producing new constraints on the collection of data for equating. One relatively new equating method, item response theory (IRT) based true score equating (Lord, 1980), has been the subject of considerable interest (for example, Cowell, 1981; Kingston & Dorans, 1982; Petersen, Cook, & Stocking, 1981; Scheuneman & Kay, 1981). Of particular interest is the use of a data collection scheme known as precalibration, which would allow the collection of item statistics (and thus the equating of the test form) before the test form is operationally administered.

The appropriateness of IRT equating based on precalibration requires either that the position of an item within a test have no effect on examinees' item responding behavior or that items be calibrated and operationally administered in the same position within old and new forms. The latter solution, same positioning of items in old and new forms, assumes that form-specific context variance is negligible. Even if this is so, the administrative complexities of such a solution make it less than appealing.
The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as practice effect when the result is improved examinee performance and as fatigue effect when the result is poorer examinee performance. It is realized that this simplistic labeling in no way fully describes these effects and that there might be other explanations for these results. The interested reader is referred to Greene (1941) or Wing (1980) for further discussion of the underlying psychology of these effects.
Traditional Analysis of Practice Effects
Traditional item level analyses have focused on p, the proportion of examinees responding to an item correctly, or a normalized version of p such as delta. Attempts have been made to adjust these statistics by taking into account only the examinees who had sufficient time to respond to the item. There is, however, no demonstrably correct way based on classical test theory to make this adjustment.
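To make the kind of adjustment being described concrete, the sketch below computes both the conventional p (all examinees, with omits scored wrong) and a p restricted to examinees who reached the item, treating a trailing run of omits as not reached. The response matrix and its None coding for omits are assumptions made for the example.

```python
import numpy as np

# Hypothetical 0/1/None response matrix: rows are examinees, columns are items
# in presentation order; None marks an omitted response.
responses = [
    [1, 0, 1, 1, None, None],    # last two items not reached
    [1, 1, 0, None, None, None],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, None],
]

def reached(row):
    """Number of items reached: everything before the trailing run of omits."""
    k = len(row)
    while k > 0 and row[k - 1] is None:
        k -= 1
    return k

n_items = len(responses[0])
for item in range(n_items):
    scores = [r[item] for r in responses]
    p_all = np.mean([0 if s is None else s for s in scores])   # every omit scored wrong
    reachers = [0 if s is None else s                           # interior omits still scored wrong
                for r, s in zip(responses, scores) if reached(r) > item]
    p_reached = np.mean(reachers) if reachers else float("nan")
    print(f"item {item + 1}: p (all) = {p_all:.2f}, p (reachers only) = {p_reached:.2f}")
```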
Lord (1977) has pointed out that p, the proportion of examinees responding correctly to an item, is not a true measure of item difficulty. If two items are administered to two groups with different distributions of ability, the p of item 1 might be larger than that of item 2 for the first group but smaller for the second group. Thus, neither proportion correct nor normalized proportion correct (e.g., delta) is a particularly good statistic to analyze in order to ascertain the effect of the position of an item within a test on item responding behavior. In addition, a within-test context effect might well affect the discrimination of an item as well as its difficulty. Classical measures of item discrimination (biserial or point-biserial correlation between item score and total test score) are both difficult to estimate accurately and confounded with item difficulty (Kingston & White, 1980; Lord & Novick, 1968).
Item response theory characterizes item responding behavior with the three-parameter logistic model:

$$P_g(\theta) = c_g + \frac{1 - c_g}{1 + e^{-1.7\,a_g(\theta - b_g)}} \qquad (1)$$

where $P_g(\theta)$ is the probability that an examinee with ability $\theta$ answers item g correctly, $a_g$ is the discrimination parameter for item g, $b_g$ is the difficulty parameter for item g, $c_g$ is the pseudoguessing parameter for item g, e is the base of the natural logarithms, and 1.7 is a scaling constant that makes the logistic function approximately equal to the normal ogive. In equation (1), $\theta$ is the ability parameter, a characteristic of the examinee, and $a_g$, $b_g$, and $c_g$ are item parameters that determine the shape of the item response function (see Figure 1).
Figure 1. Item Response Function (probability of a correct response plotted against ability).
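For readers who prefer code to formulas, the function below evaluates equation (1) directly. It is a minimal sketch; the parameter values in the example are illustrative and are not estimates from this study.

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic model, equation (1): probability that an examinee
    with ability theta answers an item with parameters a, b, c correctly."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# Illustrative parameter values (not estimates from this study).
a, b, c = 0.8, 0.5, 0.17
for theta in (-3, -1, 0, 1, 3):
    print(f"theta = {theta:+d}: P = {p_correct(theta, a, b, c):.3f}")
```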
Potential Advantages of Using IRT Item Statistics to Investigate the Effects of Item Position on Item Responding Behavior

RESEARCH DESIGN
Test Forms

Two operational forms of the GRE Aptitude Test were used in this study: ZGR1 and 3CGR1. Form ZGR1 is composed of four separately timed operational sections:

Section   Item Type                     Number of Items   Timing in Minutes
I         Verbal                              80                 50
            analogies                         17
            antonyms                          18
            sentence completion               20
            reading comprehension             25
II        Quantitative                        55                 50
            quantitative comparison           30
            data interpretation               10
            regular mathematics               15
III       Analytical                          40                 25
            analysis of explanations          40
IV        Analytical                          30                 25
            logical diagrams                  15
            analytical reasoning              15

Form 3CGR1 is also composed of four separately timed operational sections:

Section   Item Type                     Number of Items   Timing in Minutes
I         Verbal                              75                 50
            analogies                         18
            antonyms                          22
            sentence completion               13
            reading comprehension             22
II        Quantitative                        55                 50
            quantitative comparisons          30
            data interpretation               10
            regular mathematics               15
III       Analytical                          36                 25
            analysis of explanations          36
IV        Analytical                          30                 25
            logical diagrams                  15
            analytical reasoning              15
Table 1

Six Section V's for Form 3CGR1

Designation   Item Type      Number of Items   Location in ZGR1
C41           Verbal               39          Section I
C42           Verbal               41          Section I
C43           Quantitative         27          Section II
C44           Quantitative         28          Section II
C45           Analytical           40          Section III
C46           Analytical           30          Section IV
Table 2

Six Section V's for Form ZGR1

Designation   Item Type      Number of Items   Location in 3CGR1
C47           Verbal               37          Section I
C48           Verbal               38          Section I
C49           Quantitative         27          Section II
C50           Quantitative         28          Section II
C51           Analytical           36          Section III
C52           Analytical           30          Section IV
Table 3

Description of Samples for Forms ZGR1 and 3CGR1

                                Section V Formula Score   Operational Formula Score
Pretest Section   Sample Size      Mean       S.D.            Mean       S.D.
ZGR1C47               2483         13.23       8.01           31.61      15.86
ZGR1C48               2486         14.62       8.10           31.53      16.30
ZGR1C49               2898         11.94       6.43           24.46      10.47
ZGR1C50               2484         12.88       5.93           24.26      10.34
ZGR1C51               2488         18.73       9.13           32.89      15.21
ZGR1C52               2482         14.14       6.98           32.69      15.66
3CGR1C41              1489         15.54       8.42           30.17      15.38
3CGR1C42              1495         15.91       8.80           30.43      15.55
3CGR1C43              1487         11.65       5.59           24.94      11.51
3CGR1C44              1497         12.27       5.43           24.41      11.74
3CGR1C45              1526         24.26      11.75           28.86      15.41
3CGR1C46              1476         15.92       7.19           28.52      14.87
Operational formula raw scores are for the operational score (V, Q, or A) corresponding to the pretest section listed under the heading Pretest Section. When comparing these statistics, keep in mind that the number of items going into each score is not constant. Tables 1 and 2 contain the number of items in each of the pretest sections. The tables embedded in the Test Forms section above contain the number of operational items for each section of Forms ZGR1 and 3CGR1.
Item Calibration Procedures

A total of 10 different item types were administered within each form. Parameter estimates were based on the set of all verbal items (analogies, antonyms, sentence completion, and reading comprehension), all quantitative items (quantitative comparisons, data interpretation, and regular mathematics), or all analytical items (analysis of explanations, logical diagrams, and analytical reasoning).

All item parameter estimates and ability estimates were obtained with the program LOGIST (Wood, Wingersky, & Lord, 1978). The function of LOGIST is to estimate, for each item, the three item parameters of the three-parameter logistic model: a (discrimination), b (difficulty), and c (pseudoguessing parameter); and, for each examinee, theta (ability). The following constraints were imposed on the estimation process: a was restricted to values between 0.01 and 1.50 inclusive, except for analytical item calibrations, where the upper bound was 1.20; the lower limit for estimated theta was -7; and c was restricted to values between 0.0 and 0.5. We also required each examinee to have responded to at least 20 items in order to insure stable ability estimates. Choosing appropriate constraints is a complex procedure, but necessary to speed convergence and produce stable estimates.

Since IRT based item parameters are sample invariant and the IRT ability parameter is item invariant, IRT parameter estimation allows a reasonable correction for speededness. If one assumes that an examinee answers test questions in a sequential progression, one can consider all contiguous unanswered items at the end of a test as not reached and ignore them in the ability estimation process. Hence, the item responses (actually, lack of responses) of all examinees whose item responses are coded "not reached" are ignored in the item parameter estimation process. Use of this coding convention will minimize any differences in item calibrations due to differential speededness in the two administrations of each item set. Each item was calibrated twice, once as an operational item and once when it appeared in section V.
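The not-reached convention described above can be made concrete with a short sketch. The code below marks only the trailing run of omitted responses as not reached, leaving interior omits alone; the 1/0/None response coding is an assumption made for the illustration and is not LOGIST's actual input format.

```python
NOT_REACHED = "NR"

def code_not_reached(responses):
    """Recode the trailing run of omits (None) as not reached.

    Interior omits are left alone: only contiguous unanswered items at the
    end of the test are treated as not reached.
    """
    coded = list(responses)
    i = len(coded)
    while i > 0 and coded[i - 1] is None:
        coded[i - 1] = NOT_REACHED
        i -= 1
    return coded

# One examinee who ran out of time after item 4 (illustrative data).
print(code_not_reached([1, 0, None, 1, None, None]))
# -> [1, 0, None, 1, 'NR', 'NR']
```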
IRT Linking Procedure

Verbal Item Types
Table 5

IRT Item Difficulty (b) Parameter Estimates for Operational Verbal Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .3136   1.1223     .3466   1.0337      -.0330   .2574      -1.15
Analogy          .6597   1.3122     .5849   1.1941       .0748   .1774       1.79
Antonyms         .7217   1.2481     .6760   1.1689       .0457   .3023        .68
Sent. Comp.     -.0148    .8786     .0654    .8836      -.0802   .2050      -1.61
Read. Comp.     -.0387    .7764     .1029    .7267      -.1415   .2603      -2.72*
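These parameter tables report, for each item type, the means and standard deviations of the estimates from the two calibrations, the mean and standard deviation of the per-item differences, and a t statistic for the mean difference. A minimal sketch of such a paired comparison is given below; computing t as the mean difference divided by its standard error over the n items of a type is an assumption about how the tabled values were obtained, since the report does not spell out the computation.

```python
import math

def paired_t(ops, pretests):
    """Mean difference, S.D. of differences, and t for paired item parameter estimates."""
    diffs = [o - p for o, p in zip(ops, pretests)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    t = mean_d / (sd_d / math.sqrt(n))
    return mean_d, sd_d, t

# Illustrative b estimates for the same five items in two administrations (not study data).
operational = [0.31, 0.78, -0.42, 1.05, 0.10]
section_v   = [0.25, 0.70, -0.30, 0.98, 0.05]
print("mean diff = %.4f, S.D. = %.4f, t = %.2f" % paired_t(operational, section_v))
```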
Table 6

IRT Item Discrimination (a) Parameter Estimates for Operational Verbal Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .8722    .2764     .8886    .2858      -.0164   .1486       -.99
Analogy          .8413    .2572     .9087    .2206      -.0674   .1220      -2.34*
Antonyms        1.0482    .3344    1.0242    .3755       .0240   .2046        .52
Sent. Comp.      .7831    .2535     .7936    .2740      -.0106   .1312       -.33
Read. Comp.      .8143    .1714     .8302    .1949      -.0159   .1191       -.67
Table 7

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Verbal Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .1778    .0545     .1784    .0604      -.0006   .0504       -.11
Analogy          .1673    .0386     .1727    .0458      -.0054   .0212      -1.08
Antonyms         .2219    .0707     .2149    .0874       .0070   .0828        .38
Sent. Comp.      .1526    .0301     .1571    .0313      -.0045   .0158      -1.17
Read. Comp.      .1672    .0405     .1677    .0433      -.0005   .0487       -.05

*p < .05
Table 8

IRT Item Difficulty (b) Parameter Estimates for Operational Verbal Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .2534   1.1739     .2630   1.1371       .0095   .2817        .29
Analogy          .4535   1.3307     .4819   1.1823       .0284   .2350        .51
Antonyms         .1140   1.0975     .2342   1.0663       .1202   .2527       2.23*
Sent. Comp.     -.0369   1.1342    -.0003   1.1339       .0365   .1464        .90
Read. Comp.      .4007   1.1734     .2682   1.0524      -.1325   .3523      -1.76
Table 9

IRT Item Discrimination (a) Parameter Estimates for Operational Verbal Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .8926    .3048     .9027    .3153       .0101   .1522        .57
Analogy          .8518    .2750     .9022    .3129       .0504   .1634       1.31
Antonyms        1.0693    .3699    1.1406    .3075       .0714   .1267       2.65*
Sent. Comp.      .8662    .1829     .8722    .2213       .0060   .1088        .20
Read. Comp.      .7649    .2211     .6830    .1729      -.0819   .1521      -2.53*
Table 10

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Verbal Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Verbal           .1682    .0460     .1729    .0441       .0047   .0372       1.09
Analogy          .1602    .0365     .1660    .0370       .0057   .0265        .91
Antonyms         .1862    .0498     .1946    .0473       .0084   .0408        .97
Sent. Comp.      .1577    .0460     .1697    .0520       .0120   .0265       1.63
Read. Comp.      .1629    .0440     .1588    .0309      -.0041   .0461       -.42

*p < .05
Figure 2. Effect of Item Position on the Equating of the Verbal Section of Form 3CGR1 (difference between converted scores plotted against formula score; see text).
lesser opportunity for bias entering the system during the linking of the metrics. It is assumed, however, that error variance, based on the sample sizes used in this research, is small in relation to bias.

Figure 2 shows the differences between the two equatings. At each formula raw score, the converted score based on the equating that used section V calibrations was subtracted from the converted score based on the equating that used the operational section calibrations. A horizontal line, what one would expect if there were no context effect and no error of equating, is drawn for a no-effect reference. The results, despite the earlier finding of a moderate fatigue effect for reading comprehension items, lend no support to the hypothesis of an overall verbal context effect.
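The comparison plotted in Figure 2, and later in Figures 3 and 6 for the quantitative and analytical sections, amounts to subtracting one raw-to-converted score conversion from another at every formula raw score. The sketch below assumes each equating is available as a lookup table from formula score to converted score; the conversions shown are placeholders, not those produced in this study.

```python
# Hypothetical raw-score-to-converted-score conversions from two equatings
# of the same new form (placeholder values, not the study's conversions).
operational_equating = {raw: 200 + 5.0 * raw for raw in range(0, 81)}
section_v_equating   = {raw: 198 + 5.1 * raw for raw in range(0, 81)}

# Difference plotted in Figures 2, 3, and 6: operational-based minus section-V-based.
differences = {raw: operational_equating[raw] - section_v_equating[raw]
               for raw in operational_equating}

for raw in (0, 20, 40, 60, 80):
    print(f"formula score {raw:2d}: difference = {differences[raw]:+.1f}")
```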
Quantitative Item Types
Table 11

IRT Item Difficulty (b) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative     .0051   1.5040     .0154   1.5301      -.0103   .3123       -.24
Reg. Math        .6870   1.3936     .7317   1.3510      -.0447   .1159      -1.11
Data Int.      -1.0795   1.0441    -.8789    .9985      -.2007   .1472      -4.31**
Quant. Comp.     .0257   1.4791    -.0447   1.5877       .0704   .3786       1.02
Table 12

IRT Item Discrimination (a) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative     .8560    .3947     .8484    .3950       .0076   .1545        .36
Reg. Math        .9308    .4135     .9430    .4296      -.0122   .1213       -.39
Data Int.        .5894    .2305     .6283    .2552      -.0389   .0706      -1.74
Quant. Comp.     .9075    .3915     .8745    .3883       .0331   .1849        .98
Table 13

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Quantitative Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative     .1826    .0733     .1667    .0771       .0159   .0315       3.74**
Reg. Math        .1780    .0894     .1602    .0979       .0178   .0241       2.86*
Data Int.        .1376    .02_2     .1303    .0249       .0074   .0357        .66
Quant. Comp.     .1999    .0686     .1821    .0721       .0178   .0338       2.88**

**p < .01
*p < .05
Table 14

IRT Item Difficulty (b) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative    -.2571   1.4435    -.1175   1.3656       .1396   .5403       1.92
Reg. Math        .2431    .9190     .3352    .9945       .0921   .2436       1.46
Data Int.       -.6966   1.9727    -.3104   1.9248       .3861  1.1055       1.10
Quant. Comp.    -.3607   1.3786    -.2796   1.2443       .0811   .3315       1.34
Table 15

IRT Item Discrimination (a) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative     .8376    .3231     .8825    .3284       .0449   .1735       1.92
Reg. Math        .8792    .2605     .9690    .2895       .0898   .1728       2.01
Data Int.        .4801    .1879     .5012    .1681       .0211   .2074        .32
Quant. Comp.     .9360    .3042     .9663    .2952       .0304   .1638       1.02
Table 16

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Quantitative Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Quantitative     .1693    .0591     .1655    .0770      -.0039   .0700       -.41
Reg. Math        .1288    .0500     .1415    .0740       .0128   .0498       1.00
Data Int.        .1645    .0708     .1522    .1172      -.0123   .1456       -.27
Quant. Comp.     .1912    .0466     .1818    .0543      -.0094   .0340      -1.51
the inclusion of this item. The mean difference between b's of .14 given in Table 14 would have been only .08 without the questionable item.

Table 13 indicates a statistically significant effect on estimated c for all quantitative, regular mathematics, and quantitative comparison items for Form ZGR1. These findings were not replicated for Form 3CGR1, as can be seen in Table 16. The problems in interpreting the results of the analysis of estimated c's, mentioned earlier, apply to this analysis.

Figure 3 compares the two quantitative equatings of Form 3CGR1 to Form ZGR1. As with Figure 2, the converted score resulting from the equating based on the section V estimation of item parameters is subtracted from the converted score resulting from the equating based on the operational section estimation of item parameters. Although the mean difference in b's indicated in Table 14, .14, was not statistically significant, this result is reflected in the difference between the two equatings.
Analytical Item Types
Table 17

IRT Item Difficulty (b) Parameter Estimates for Operational Analytical Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical      -.0467   1.0853    -.2322   1.1094       .1855   .2605       5.96**
Anal. of Exp.    .1514   1.0619    -.0959   1.0920       .2473   .2378       6.58**
Log. Diag.      -.2729    .9009    -.4950    .8348       .2221   .2609       3.30**
Anal. Reas.     -.3486   1.1963    -.3327   1.2069      -.0159   .2305       -.27
Table 18

IRT Item Discrimination (a) Parameter Estimates for Operational Analytical Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical       .7683    .2595     .7287    .2666       .0396   .1460       2.27*
Anal. of Exp.    .7385    .2593     .6919    .2716       .0466   .1512       1.95
Log. Diag.       .9806    .2176     .8930    .2254       .0876   .1191       2.85*
Anal. Reas.      .6356    .1504     .6626    .2217      -.0270   .1406       -.74
Table 19

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Analytical Items from Form ZGR1

               ZGR1 Operational    3CGR1 Section V    ZGR1-3CGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical       .1608    .0480     .1421    .0499       .0186   .0447       3.48**
Anal. of Exp.    .1602    .0392     .1452    .0360       .0150   .0408       2.33*
Log. Diag.       .1652    .0473     .1326    .0238       .0326   .0520       2.43*
Anal. Reas.      .1579    .0663     .1435    .0866       .0144   .0471       1.18

**p < .01
*p < .05
Table 20

IRT Item Difficulty (b) Parameter Estimates for Operational Analytical Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical      -.1688    .9043     .0431    .8821       .2118   .2022       8.51**
Anal. of Exp.   -.2105    .8639     .0919    .8234       .3024   .1837       9.88**
Log. Diag.      -.0806    .5914     .0700    .6437       .1507   .1346       4.33**
Anal. Reas.     -.1566   1.2022    -.1010   1.1639       .0556   .1897       1.14
Table 21

IRT Item Discrimination (a) Parameter Estimates for Operational Analytical Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical       .8221    .2439     .8033    .2344      -.0188   .1120      -1.36
Anal. of Exp.    .8716    .2016     .8733    .1841       .0017   .1235        .08
Log. Diag.       .9261    .2781     .8499    .2688      -.0762   .0677      -4.36**
Anal. Reas.      .5994    .1434     .5888    .1681      -.0105   .1044       -.39
Table 22

IRT Item Pseudoguessing (c) Parameter Estimates for Operational Analytical Items from Form 3CGR1

               ZGR1 Section V     3CGR1 Operational   3CGR1-ZGR1 Difference
Item Type        Mean     S.D.      Mean     S.D.        Mean     S.D.        t
Analytical       .1613    .0498     .1594    .0555      -.0019   .0250       -.62
Anal. of Exp.    .1609    .0507     .1580    .0547      -.0029   .0232       -.75
Log. Diag.       .1783    .0571     .1871    .0619       .0088   .0315       1.08
Anal. Reas.      .1454    .0304     .1354    .0349      -.0100   .0189      -2.05

**p < .01
item type. By the time an examinee has responded to some number of analysis of explanations items, perhaps 30 or more, the examinee might not have to refer back to the directions. Until the directions are internalized by the examinee, items will be more difficult than they would be otherwise. If this is so, it might be reasonable to expect that items that appear early in the operational section will undergo a larger practice effect than those appearing late in the operational section. If the law of diminishing returns is applicable, some value for the difference between b's will be approached asymptotically.
To investigate this hypothesis, the regression of the difference between estimated b's on item position was plotted. In order to smooth the regression, item position was grouped; that is, the mean difference between b's of items 1 through 4 (based on the operational appearance of the items) was plotted against grouped item position 1, the mean difference between b's of items 5 through 8 was plotted against grouped item position 2, et cetera. Figures 4 and 5 show these regressions for the operational items from Forms ZGR1 and 3CGR1, respectively.
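The grouping used to smooth the regressions in Figures 4 and 5 can be sketched as follows; the item positions and b estimates below are illustrative stand-ins rather than values from either form.

```python
from statistics import mean

# (operational position, operational b, section V b) for each item -- illustrative values only.
items = [(pos, 0.50 - 0.010 * pos, 0.10 - 0.002 * pos) for pos in range(1, 41)]

# Group positions four at a time: items 1-4 -> group 1, items 5-8 -> group 2, and so on.
grouped = {}
for pos, b_oper, b_pretest in items:
    group = (pos - 1) // 4 + 1
    grouped.setdefault(group, []).append(b_oper - b_pretest)

for group in sorted(grouped):
    print(f"grouped item position {group}: mean difference between b's = {mean(grouped[group]):.3f}")
```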
The relationship between practice effect and item position is clearly shown for Form ZGR1 and less clearly shown but still evident for Form 3CGR1. The regression shows no clear sign of leveling off, perhaps because the operational section was not of sufficient length. Interpretation of this result might be confounded by at least two factors: (1) the existence of a similar effect within the section V administration of the items and (2) item ordering within a section is not random. This result does, however, point out that data collection design has a potentially large impact on the results of a practice effect study.
Logical diagrams items also exhibit a strong practice effect: a mean difference between b's of .22 for Form ZGR1 and .15 for Form 3CGR1. Analytical reasoning items showed no evidence of a practice or fatigue effect.

Although three out of eight tests yield statistically significant differences at either the .05 or .01 level, Tables 18 and 21 show a statistically significant result for logical diagrams, but in opposite directions.
Tables 19 and 22 present the results for the analysis of differences between estimated c's. Three of the four analyses produced statistically significant differences at either the .05 or .01 level for Form ZGR1, but none of these results was replicated for Form 3CGR1. The difficulties inherent in interpreting the results of the analysis of c's have been mentioned earlier in this report.
Figure 6 compares the two equatings of the analytical section of Form 3CGR1 to Form ZGR1 as Figures 2 and 3 did for the verbal and quantitative equatings. The results reflect the differences found in Table 20.
Figure 4. Form ZGR1 Operational Items: Item Position Versus Mean Difference Between b's.
Figure 5. Form 3CGR1 Operational Items: Item Position Versus Mean Difference Between b's.
Figure 6. Effect of Item Position on the Equating of the Analytical Section of Form 3CGR1 (difference between converted scores plotted against formula score; see text).
SUMMARY AND IMPLICATIONS

Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted.

The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items² and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.

Analysis of the difference between equatings for Form 3CGR1 reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section.

Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGR1 and a weaker but still definite relationship for Form 3CGR1.
The results of this research strongly suggest an important implication for equating. If an item type is subject to a within-test context effect, any equating method that uses item data, either directly or as part of a precalibration design, should provide for administration of the items in the same position within old and new forms. Although a within-test context effect might have a negligible influence on a single equating, it could contribute to scale drift because of the systematic bias.
² Because of these large practice effects, the GRE analytical section has been revised. Both analysis of explanations items and logical diagrams items have been dropped from the test.
REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education, 1971.

Conrad, L., Trismen, D., & Miller, R. (Eds.) Graduate Record Examinations Technical Manual. Princeton, NJ: Educational Testing Service, 1977.

Cowell, W. R. Applicability of a simplified three-parameter logistic model for equating tests. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Dorans, N. J., & Kingston, N. M. Assessing the local independence assumption of item response theory in GRE item types and populations. Paper presented at the annual meeting of the Psychometric Society, Chapel Hill, NC, May 1981.

Greene, E. B. Measurement of human behavior. New York: The Odyssey Press, 1941.

Kingston, N. M., & Dorans, N. J. The feasibility of using item response theory as a psychometric model for the GRE Aptitude Test. GRE Board Professional Report 79-12bP. Princeton, NJ: Educational Testing Service, 1982.

Kingston, N. M., & White, E. B. Item response theory statistics, classical test and item statistics: What does it all mean when you sit down to construct a test? Paper presented at the annual meeting of the American Educational Research Association, Boston, April 8, 1980.

Lord, F. M. Notes on comparable scales for test scores. Research Bulletin 50-48. Princeton, NJ: Educational Testing Service, 1950.

Lord, F. M. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, 1980.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968.

Petersen, N., Cook, L., & Stocking, M. L. IRT versus conventional equating methods: A comparative study of scale stability. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Scheuneman, J. D., & Kay, E. F. Homogeneity of ability and certification decisions. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Swinton, S., Wild, C. L., & Wallmark, M. Investigation of practice effects on item types in the Graduate Record Examinations Aptitude Test. GRE Board Professional Report 80-01bP. Princeton, NJ: Educational Testing Service, 1982.

Wing, H. Practice effects with traditional mental test items. Applied Psychological Measurement, 1980, 4, 141-155.