
THE EFFECT OF THE POSITION OF AN ITEM WITHIN
A TEST ON ITEM RESPONDING BEHAVIOR:
AN ANALYSIS BASED ON ITEM RESPONSE THEORY

Neal M. Kingston
and
Neil J. Dorans

GRE Board Professional Report GREB No. 79-12bP
ETS Research Report 82-22

June 1982

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

THE EFFECT OF THE POSITION OF AN ITEM WITHIN
A TEST ON ITEM RESPONDING BEHAVIOR:
AN ANALYSIS BASED ON ITEM RESPONSE THEORY

Neal M. Kingston
and
Neil J. Dorans

GRE Board Professional Report GREB No. 79-12bP

May 1982

Copyright © 1982 by Educational Testing Service. All rights reserved.

ABSTRACT

The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as practice effect when the result is improved examinee performance and as fatigue effect when the result is poorer examinee performance. Item response theory item statistics were used to assess position effects because, unlike traditional item statistics, they are sample invariant. In addition, the use of item response theory statistics allows one to make a reasonable adjustment for speededness, which is important when, as in this research, the same item administered in different positions is likely to be affected differently by speededness, depending upon its location in the test.
Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted.

The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.
Analysis of the difference between equatings for Form 3CGRl reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section.

Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGRl and a weaker but still definite relationship for Form 3CGRl.
The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within-test context effect, any equating method, e.g., IRT based equating, that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within-test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.

TABLE OF CONTENTS

INTRODUCTION
     Traditional Analysis of Practice Effects
     Item Response Theory
     Potential Advantages of Using IRT Item Statistics to Investigate
          the Effects of Item Position on Item Responding Behavior

RESEARCH DESIGN
     Test Forms
     Item Calibration Procedures
     IRT Linking Procedure

THE EFFECT OF ITEM POSITION ON ITEM RESPONDING BEHAVIOR
     Verbal Item Types
     Quantitative Item Types
     Analytical Item Types

SUMMARY AND IMPLICATIONS

REFERENCES


INTRODUCTION

When test forms are equated one wants to minimize the error variance of the process. Generally, one attempts to do so by choosing an appropriate data collection design. The standard errors of equating associated with linear methods for various data collection designs are well known (Angoff, 1971; Lord, 1950; Lord & Stocking, 1973). Designs that yield the smallest standard errors of equating for a given sample size are those in which the examinees taking both the old and the new forms take some common items. This can be accomplished in several ways. Each test may have a set of items, which may or may not count toward an examinee's score, identical to a set in the other test or, to carry the common item idea to its extreme, each examinee can take both forms of the test.
It has long been assumed (Lord (1950) is the earliest reference the authors have found, but the idea is probably older) that if all examinees took form 1 followed by form 2, the equating might be biased by an order effect. To overcome this effect, a counterbalanced administration of forms has been used: a random half of the examinees takes form 1 followed by form 2; the other half takes form 2 followed by form 1. The order effect could then be estimated and accounted for in the equating process (Lord, 1950; Angoff, 1971). This estimation procedure assumes that the order effect is proportional to the standard deviation of scores on each test form and that there is no form by order interaction. We are aware of no empirical evidence supporting these assumptions.
It is usually difficult to get a sufficient number of examinees willing to take two full-length tests. In addition, new legislative test disclosure requirements are producing new constraints on the collection of data for equating. One relatively new equating method, item response theory (IRT) based true score equating (Lord, 1980), has been the subject of considerable interest (for example, Cowell, 1981; Kingston & Dorans, 1982; Petersen, Cook, & Stocking, 1981; Scheuneman & Kay, 1981). Of particular interest is the use of a data collection scheme known as precalibration, which would allow the collection of item statistics (and thus the equating of the test form) before the test form is operationally administered.
The appropriateness of IRT equating based on precalibration requires either that the position of an item within a test have no effect on examinees' item responding behavior or that items be calibrated and operationally administered in the same position within old and new forms. The latter solution, same positioning of items in old and new forms, assumes that form-specific context variance is negligible. Even if this is so, the administrative complexities of such a solution make it less than appealing.
The research described in this paper deals solely with the effect of the position of an item within a test on examinees' responding behavior at the item level. For simplicity's sake, this effect will be referred to as practice effect when the result is improved examinee performance and as fatigue effect when the result is poorer examinee performance. It is realized that this simplistic labeling in no way fully describes these effects and that there might be other explanations for these results. The interested reader is referred to Greene (1941) or Wing (1980) for further discussion of the underlying psychology of these effects.


Traditional Analysis of Practice Effects

Traditional item level analyses have focused on p, the proportion of examinees responding to an item correctly, or a normalized version of p such as delta. Attempts have been made to adjust these statistics by taking into account only the examinees who had sufficient time to respond to the item. There is, however, no demonstrably correct way based on classical test theory to make this adjustment.
Lord (1977) has pointed out that p, the proportion of examinees responding correctly to an item, is not a true measure of item difficulty. If two items are administered to two groups with different distributions of ability, the p of item 1 might be larger than that of item 2 for the first group but smaller for the second group. Thus, neither proportion correct nor normalized proportion correct (e.g., delta) is a particularly good statistic to analyze in order to ascertain the effect of the position of an item within a test on item responding behavior. In addition, a within-test context effect might well affect the discrimination of an item as well as its difficulty. Classical measures of item discrimination (biserial or point-biserial correlation between item score and total test score) are both difficult to estimate accurately and confounded with item difficulty (Kingston & White, 1980; Lord & Novick, 1968).

Item Response Theory


Item response theory provides a mathematical expression for the probability of success on an item as a function of a single characteristic of the individual answering the item, his or her ability, and multiple characteristics of the item. This mathematical expression is called an item response function. A reasonable mathematical form for a multiple choice item (both on psychometric grounds and for reasons of tractability) employed for the item response function is the three-parameter logistic model,

(1)   P_g(\theta) = c_g + \frac{1 - c_g}{1 + e^{-1.7 a_g (\theta - b_g)}}

where

     P_g(θ) is the probability that an examinee with ability θ answers item g correctly,

     e is the base of the natural logarithm, approximately equal to 2.7183,

     a_g is a measure of item discrimination for item g,

     b_g is a measure of item difficulty for item g, and

     c_g is the lower asymptote of the item response curve, the probability of very low ability examinees answering item g correctly.


In equation (1), θ is the ability parameter, a characteristic of the examinee, and a_g, b_g, and c_g are item parameters that determine the shape of the item response function (see Figure 1).
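To make equation (1) concrete, the following minimal sketch (in Python) computes P_g(θ) for a single hypothetical item; the parameter values are illustrative only and are not estimates from this study.

    import math

    def p_correct(theta, a, b, c):
        # Equation (1): three-parameter logistic item response function.
        return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

    # Hypothetical item: moderate discrimination (a = 1.0), average
    # difficulty (b = 0.0), lower asymptote c = .2.
    for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(theta, round(p_correct(theta, 1.0, 0.0, 0.2), 3))

As θ grows, the probability rises from the lower asymptote c toward 1, tracing the S-shaped curve sketched in Figure 1.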

Figure 1. Item response function of the three-parameter logistic model. (Vertical axis: probability of a correct response; horizontal axis: ability. The example curve plotted appears to have b = 0 and c = .2.)

One of the major assumptions of IRT embodied in equation (1) is that the set of items under study is unidimensional, i.e., the probability of successful response by examinees to a set of items can be modelled by a mathematical model with only one ability parameter, θ. The second major assumption is that performance on an item can be adequately described by the three-parameter logistic model. Previous research has shown that, to some extent, GRE items violate these assumptions but, for the most part, IRT based methods seem robust to these violations (Kingston & Dorans, 1982).

Potential Advantages of Using IRT Item Statistics to Investigate
the Effects of Item Position on Item Responding Behavior

IRT item statistics have several advantages over classical item statistics when one is investigating practice and fatigue effects. To the extent that model assumptions are correct, IRT item statistics are sample invariant. The IRT ability metric provides interval level data. The IRT item discrimination statistic is sample invariant, not confounded with item difficulty, and is more accurately estimated than classical item discrimination indices (Kingston & White, 1980). In addition, the use of IRT statistics allows one to make a reasonable adjustment for speededness (see the section on item calibration procedures). This is important when, as in this research, the items administered in one position in a test are likely to be affected differently by speededness than the same items administered in a different position.
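Lord's (1977) point about the sample dependence of p, and the contrasting invariance of the IRT item parameters, can be illustrated with a small simulation. The sketch below is illustrative only; the item parameters and group distributions are hypothetical, not taken from this study.

    import math, random

    random.seed(0)

    def p_correct(theta, a, b, c):
        # Equation (1): three-parameter logistic item response function.
        return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

    def observed_p(thetas, a, b, c):
        # Classical item difficulty: proportion answering correctly.
        hits = sum(random.random() < p_correct(t, a, b, c) for t in thetas)
        return hits / len(thetas)

    # Two hypothetical items with fixed IRT parameters; they differ
    # only in discrimination.
    item1 = dict(a=1.5, b=0.0, c=0.2)
    item2 = dict(a=0.4, b=0.0, c=0.2)

    low = [random.gauss(-1.0, 1.0) for _ in range(5000)]   # low-ability group
    high = [random.gauss(1.0, 1.0) for _ in range(5000)]   # high-ability group

    for name, group in (("low-ability", low), ("high-ability", high)):
        print(name, round(observed_p(group, **item1), 3),
              round(observed_p(group, **item2), 3))

For the low-ability group item 1 has the smaller p; for the high-ability group it has the larger p. The ordering of the two items by p reverses across groups even though the item parameters never changed.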


RESEARCH DESIGN

Test Forms

Two operational forms of the GRE Aptitude Test were used in this study: ZGRl and 3CGRl. Form ZGRl is composed of four separately timed operational sections:

Section   Item Type                      Timing in Minutes   Number of Items
I         Verbal                                50                  80
            analogies                                               17
            antonyms                                                18
            sentence completion                                     20
            reading comprehension                                   25
II        Quantitative                          50                  55
            quantitative comparison                                 30
            data interpretation                                     10
            regular mathematics                                     15
III       Analytical                            25                  40
            analysis of explanations                                40
IV        Analytical                            25                  30
            logical diagrams                                        15
            analytical reasoning                                    15

Form 3CGRl is also composed of four separately timed operational sections:

Section   Item Type                      Timing in Minutes   Number of Items
I         Verbal                                50                  75
            analogies                                               18
            antonyms                                                22
            sentence completion                                     13
            reading comprehension                                   22
II        Quantitative                          50                  55
            quantitative comparisons                                30
            data interpretation                                     10
            regular mathematics                                     15
III       Analytical                            25                  36
            analysis of explanations                                36
IV        Analytical                            25                  30
            logical diagrams                                        15
            analytical reasoning                                    15

Examples of the various GRE Aptitude Test item types can be found in Conrad, Trismen, and Miller (1977).


Form 3CGRl was administered with six different 25-minute fifth sections. The items in these fifth sections were not experimental pretest items. Instead they were items taken from the four operational sections of Form ZGRl. Table 1 lists the six fifth sections of 3CGRl, the number of items in each section, and the section of ZGRl from which they were drawn. Form ZGRl was administered with six section V's at the same administration at which Form 3CGRl was administered with the six section V's listed in Table 1. Table 2 lists the six fifth sections of Form ZGRl, indicating the number of items in each section and the section of 3CGRl from which they were drawn.

Inspection of Tables 1 and 2 reveals that each operational item from Form ZGRl appears in one of the six section V's of Form 3CGRl and each operational item from 3CGRl appears in one of the six "C-subforms" of Form ZGRl. This commonality of items was used to study position effects.

Table 1
Six Section V's for Form 3CGRl

Designation   Item Type      Number of Items   Location in ZGRl
C41           Verbal               39          Section I
C42           Verbal               41          Section I
C43           Quantitative         27          Section II
C44           Quantitative         28          Section II
C45           Analytical           40          Section III
C46           Analytical           30          Section IV

Table 2
Six Section V's for Form ZGRl

Designation   Item Type      Number of Items   Location in 3CGRl
C47           Verbal               37          Section I
C48           Verbal               38          Section I
C49           Quantitative         27          Section II
C50           Quantitative         28          Section II
C51           Analytical           36          Section III
C52           Analytical           30          Section IV

Table 3
Description of Samples Used in this Research

                              Formula Score Means (M) and Standard Deviations (S)
Form/Pretest                      Pretest              Operational*
Section        Sample Size      M        S            M        S

ZGRlC47           2483        13.23     8.01        31.61    15.86
ZGRlC48           2486        14.62     8.10        31.53    16.30
ZGRlC49           2898        11.94     6.43        24.46    10.47
ZGRlC50           2484        12.88     5.93        24.26    10.34
ZGRlC51           2488        18.73     9.13        32.89    15.21
ZGRlC52           2482        14.14     6.98        32.69    15.66
3CGRlC41          1489        15.54     8.42        30.17    15.38
3CGRlC42          1495        15.91     8.80        30.43    15.55
3CGRlC43          1487        11.65     5.59        24.94    11.51
3CGRlC44          1497        12.27     5.43        24.41    11.74
3CGRlC45          1526        24.26    11.75        28.86    15.41
3CGRlC46          1476        15.92     7.19        28.52    14.87

*Operational formula raw scores are for the operational score (V, Q, or A) corresponding to the pretest section listed under the heading Pretest Section. When comparing these statistics, keep in mind that the number of items going into each score is not constant. Tables 1 and 2 contain the number of items in each of the pretest sections. The section tables embedded in the text above contain the number of operational items for each section of Forms ZGRl and 3CGRl.


Item Calibration Procedures

A total of 10 different item types were administered within each form. Parameter estimates were based on the set of all verbal items (analogies, antonyms, sentence completion, and reading comprehension), all quantitative items (quantitative comparisons, data interpretation, and regular mathematics), or all analytical items (analysis of explanations, logical diagrams, and analytical reasoning).

All item parameter estimates and ability estimates were obtained with the program LOGIST (Wood, Wingersky, & Lord, 1978). The function of LOGIST is to estimate, for each item, the three item parameters of the three-parameter logistic model: a (discrimination), b (difficulty), and c (pseudoguessing parameter); and, for each examinee, theta (ability). The following constraints were imposed on the estimation process: a was restricted to values between 0.01 and 1.50 inclusive, except for analytical item calibrations where the upper bound was 1.20; the lower limit for estimated theta was -7; and c was restricted to values between 0.0 and 0.5. We also required each examinee to have responded to at least 20 items in order to insure stable ability estimates. Choosing appropriate constraints is a complex procedure, but necessary to speed convergence and produce stable estimates.
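These bounds can be restated in a few lines of code; the sketch below merely summarizes the constraints named above and does not reproduce LOGIST's actual estimation internals.

    def clamp(value, lo, hi):
        # Keep an estimate inside its permitted range.
        return max(lo, min(hi, value))

    def constrain(a, c, theta, analytical=False):
        # a in [0.01, 1.50], or [0.01, 1.20] for analytical calibrations;
        # c in [0.0, 0.5]; theta bounded below at -7.
        a_hi = 1.20 if analytical else 1.50
        return clamp(a, 0.01, a_hi), clamp(c, 0.0, 0.5), max(theta, -7.0)

    print(constrain(a=1.8, c=0.6, theta=-9.0))                   # (1.5, 0.5, -7.0)
    print(constrain(a=1.8, c=0.2, theta=0.3, analytical=True))   # (1.2, 0.2, 0.3)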
Since IRT based item parameters are sample invariant and the IRT ability parameter is item invariant, IRT parameter estimation allows a reasonable correction for speededness. If one assumes that an examinee answers test questions in a sequential progression, one can consider all contiguous unanswered items at the end of a test as not reached and ignore them in the ability estimation process. Hence, the item responses (actually, lack of responses) of all examinees whose item responses are coded "not reached" are ignored in the item parameter estimation process. Use of this coding convention will minimize any differences in item calibrations due to a differential speededness in the two administrations of each item set. Each item was calibrated twice, once as an operational item and once when it appeared in section V.
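A minimal sketch of this not-reached coding convention follows; the response codes and the function name are illustrative, not LOGIST's.

    def code_not_reached(responses):
        # responses: list of "1" (right), "0" (wrong), "O" (omitted).
        # All contiguous omissions at the end are recoded "N" (not
        # reached) and would then be ignored in estimation; an embedded
        # omission is kept, since it is an intentional skip.
        coded = list(responses)
        i = len(coded) - 1
        while i >= 0 and coded[i] == "O":
            coded[i] = "N"
            i -= 1
        return coded

    print(code_not_reached(list("1101O110OOOO")))
    # -> ['1', '1', '0', '1', 'O', '1', '1', '0', 'N', 'N', 'N', 'N']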

IRT Linking Procedure

Spiralling¹ of test forms at the June 1980 administration of the GRE Aptitude Test was used to link parameter estimates on Form 3CGRl to parameter estimates on the base form, Form ZGRl. Linking by spiralling assumes that alternating forms administered to examinees results in a random assignment of examinees to forms. Since large equivalent groups take each form, the distributions of ability in the two groups should be the same, and separate parameterizations based on these two random groups via separate LOGIST runs should produce a single ability metric.

¹Spiralling is a term used to describe a test administration practice in which test books are packaged in spiralled order, i.e., alternating Form A with Form B, such that half the examinees at any testing center take Form A while the other half take Form B.


THE EFFECT OF ITEM POSITION ON ITEM RESPONDING BEHAVIOR

Verbal Item Types

Tables 5, 6, and 7 summarize the effect of item position on IRT difficulty, discrimination, and pseudoguessing parameter estimates, respectively, for the verbal item types of Form ZGRl. Tables 8, 9, and 10 do likewise for Form 3CGRl. Each table contains means and standard deviations of the appropriate estimated item parameter for the five item types (all verbal, analogies, antonyms, sentence completion, and reading comprehension), as well as mean differences, standard deviations of the differences, and their associated dependent sample t statistics. In these tables and in Tables 11-16, significant mean difficulty differences at the .01 level between item parameters estimated in their operational and nonoperational locations are marked by double asterisks, while a single asterisk denotes a significant mean difference at the .05 level. For all t-tests, the number of degrees of freedom (df) is one less than the number of items of each type.
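The dependent sample t statistic reported in these tables can be computed directly from the per-item differences between the two estimates. A sketch follows; the difference values are made up for illustration.

    import math

    def dependent_t(diffs):
        # t statistic for the mean of paired differences; df is one
        # less than the number of items.
        n = len(diffs)
        mean = sum(diffs) / n
        sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
        return mean, sd, mean / (sd / math.sqrt(n)), n - 1

    # Hypothetical differences between b's for a small item type.
    mean, sd, t, df = dependent_t([-0.21, -0.05, -0.30, 0.02, -0.18])
    print(f"mean={mean:.4f}  sd={sd:.4f}  t={t:.2f}  df={df}")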
Tables 5 and 8 show no strong consistent evidence of a practice or fatigue effect. The reading comprehension items, however, exhibit a consistent (though statistically significant only for Form ZGRl) moderate fatigue effect. The mean difference between b's for Form ZGRl was -.14 (t = -2.72, df = 24, significant at the .05 level) and for Form 3CGRl was -.13 (t = -1.76, df = 21).
Tables 6 and 9 show three of the 10 mean differences between estimated a's to be statistically significant at the .05 level. Since none of these findings is consistent across the two forms, these results indicate either a highly complex relationship between within-test item context and item discrimination or, more likely, a chance result.
Tables 7 and 10 do not indicate any relationship between within-test item position and estimated c. In interpreting these results, one must consider the difficulties in estimating c and the artificially constrained variance of the estimated c's (Kingston & Dorans, 1982; Wood, Wingersky, & Lord, 1978).
As stated previously, the primary goal in this research is to assess the effect of within-test context on IRT true score equating. To estimate this effect, Form 3CGRl was equated to Form ZGRl twice, once based on the item parameters for Form 3CGRl estimated when they appeared in their operational form and positions, and once based on the item parameters estimated when they appeared in section V of Form ZGRl. In the first case, there was no context effect, but the estimated parameters were linked to scale by a relatively weak procedure, dependent on group equivalence. In the latter case, there might be a context effect, but the linking of the metrics was by a relatively strong procedure, common items within a single LOGIST run. The operational position calibrations, relative to the section V calibrations, have a greater opportunity for error variance but lesser opportunity for bias entering the system during the linking of the metrics. It is assumed, however, that error variance, based on the sample sizes used in this research, is small in relation to bias.

Table 5
IRT Item Difficulty (b) Parameter Estimates for
Operational Verbal Items from Form ZGRl

                   ZGRl                3CGRl             ZGRl-3CGRl
                Operational          Section V           Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .3136    1.1223     .3466    1.0337    -.0330     .2574     -1.15
Analogy        .6597    1.3122     .5849    1.1941     .0748     .1774      1.79
Antonyms       .7217    1.2481     .6760    1.1689     .0457     .3023       .68
Sent. Comp.   -.0148     .8786     .0654     .8836    -.0802     .2050     -1.61
Read. Comp.   -.0387     .7764     .1029     .7267    -.1415     .2603     -2.72*

Table 6
IRT Item Discrimination (a) Parameter Estimates for
Operational Verbal Items from Form ZGRl

                   ZGRl                3CGRl             ZGRl-3CGRl
                Operational          Section V           Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .8722     .2764     .8886     .2858    -.0164     .1486      -.99
Analogy        .8413     .2572     .9087     .2206    -.0674     .1220     -2.34*
Antonyms      1.0482     .3344    1.0242     .3755     .0240     .2046       .52
Sent. Comp.    .7831     .2535     .7936     .2740    -.0106     .1312      -.33
Read. Comp.    .8143     .1714     .8302     .1949    -.0159     .1191      -.67

Table 7
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Verbal Items from Form ZGRl

                   ZGRl                3CGRl             ZGRl-3CGRl
                Operational          Section V           Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .1778     .0545     .1784     .0604    -.0006     .0504      -.11
Analogy        .1673     .0386     .1727     .0458    -.0054     .0212     -1.08
Antonyms       .2219     .0707     .2149     .0874     .0070     .0828       .38
Sent. Comp.    .1526     .0301     .1571     .0313    -.0045     .0158     -1.17
Read. Comp.    .1672     .0405     .1677     .0433    -.0005     .0487      -.05

*p < .05


Table 8
IRT Item Difficulty (b) Parameter Estimates for
Operational Verbal Items from Form 3CGRl

                   ZGRl                3CGRl             3CGRl-ZGRl
                Section V           Operational          Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .2534    1.1739     .2630    1.1371     .0095     .2817       .29
Analogy        .4535    1.3307     .4819    1.1823     .0284     .2350       .51
Antonyms       .1140    1.0975     .2342    1.0663     .1202     .2527      2.23*
Sent. Comp.   -.0369    1.1734    -.0003    1.1339     .0365     .1464       .90
Read. Comp.    .4007    1.1342     .2682    1.0524    -.1325     .3523     -1.76

Table 9
IRT Item Discrimination (a) Parameter Estimates for
Operational Verbal Items from Form 3CGRl

                   ZGRl                3CGRl             3CGRl-ZGRl
                Section V           Operational          Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .8926     .3048     .9027     .3153     .0101     .1522       .57
Analogy        .8518     .2750     .9022     .3129     .0504     .1634      1.31
Antonyms      1.0693     .3699    1.1406     .3075     .0714     .1267      2.65*
Sent. Comp.    .8662     .1829     .8722     .2213     .0060     .1088       .20
Read. Comp.    .7649     .2211     .6830     .1729    -.0819     .1521     -2.53*

Table 10
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Verbal Items from Form 3CGRl

                   ZGRl                3CGRl             3CGRl-ZGRl
                Section V           Operational          Difference
Item Type      Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Verbal         .1682     .0460     .1729     .0441     .0047     .0372      1.09
Analogy        .1602     .0365     .1660     .0370     .0057     .0265       .91
Antonyms       .1862     .0498     .1946     .0473     .0084     .0408       .97
Sent. Comp.    .1577     .0460     .1697     .0520     .0120     .0265      1.63
Read. Comp.    .1629     .0440     .1588     .0309    -.0041     .0461      -.42

*p < .05

Figure 2. Effect of item position on the equating of the verbal section of Form 3CGRl. (Vertical axis: difference between converted scores; horizontal axis: formula score.)


Figure 2 shows the differences between the two equatings. At each formula raw score, the converted score based on the equating that used section V calibrations was subtracted from the converted score based on the equating that used the operational section calibrations. A horizontal line, what one would expect if there were no context effect and no error of equating, is drawn for a no-effect reference. The results, despite the earlier finding of a moderate fatigue effect for reading comprehension items, lend no support to the hypothesis of an overall verbal context effect.
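Computationally, each of the two equatings is an IRT true score equating: a formula score on the new form is carried through the new form's test characteristic curve to an ability, and that ability is carried through the old form's test characteristic curve to a converted score. The difference plotted in Figure 2 is the difference between two such conversions. A rough sketch follows; the three-item parameter lists are placeholders, not the LOGIST estimates from this study, and real use would also need to handle scores below the sum of the c's.

    import math

    def p3pl(theta, a, b, c):
        # Equation (1): three-parameter logistic item response function.
        return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

    def tcc(theta, items):
        # Test characteristic curve: expected number-right true score.
        return sum(p3pl(theta, *item) for item in items)

    def theta_for_score(score, items, lo=-8.0, hi=8.0):
        # Invert the (increasing) TCC by bisection.
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if tcc(mid, items) < score:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    def equate(score, new_items, old_items):
        # Old-form true score equivalent to `score` on the new form.
        return tcc(theta_for_score(score, new_items), old_items)

    # Placeholder (a, b, c) triples: the new form under its two
    # calibrations, and the old form.
    new_oper = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.15), (1.2, 0.7, 0.2)]
    new_secv = [(1.0, -0.3, 0.2), (0.8, 0.2, 0.15), (1.2, 0.9, 0.2)]
    old_form = [(0.9, -0.2, 0.2), (1.1, 0.4, 0.15), (0.7, 1.0, 0.2)]

    for score in (1.0, 1.5, 2.0, 2.5):
        diff = (equate(score, new_oper, old_form)
                - equate(score, new_secv, old_form))
        print(score, round(diff, 3))  # analogue of the curve in Figure 2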

Quantitative Item Types

Tables 11, 12, and 13 summarize the effect of item position on IRT difficulty, discrimination, and pseudoguessing parameter estimates for the quantitative item types (all quantitative, quantitative comparisons, data interpretation, regular mathematics) from Form ZGRl, using the same format as the similar verbal tables. Tables 14, 15, and 16 present similar summaries for Form 3CGRl.
Tables 11 and 14 do not provide evidence of any strong consistent practice or fatigue effect. The results for data interpretation items, however, were peculiar and demanded further scrutiny. Form ZGRl showed a strong, highly significant fatigue effect (mean difference = -.20, t = -4.31 with 9 df), but Form 3CGRl showed the largest practice effect found in this study (mean difference = .39), although this result was not statistically significant (t = 1.10 with 9 df). The large standard deviation of the differences (1.11) helps explain this.
Further investigation showed this large standard deviation was due to a single item that had a b of 1.66 when administered operationally, but a b of -1.75 when administered in section V, a difference of 3.41. Analysis of the actual proportion of examinees getting the item correct when grouped by estimated theta proved illuminating. Just as very different regression lines may result from two samples from a population with an underlying correlation of zero, very different item response functions resulted from the three-parameter logistic regressions for this particular item. It should be noted that this extreme difference between b's is highly unusual and that the median absolute difference for all quantitative items was about .15.
The discrepancy between the mean difference between b's for the data interpretation items in forms ZGRl and 3CGRl is much less extreme when this questionable item is not included in the analysis. The mean difference between b's for the data interpretation items of .39 in Table 14 would only be a mean difference of .05. Similarly, the mean difference between b's for all quantitative items in form 3CGRl is artificially inflated by the inclusion of this item. The mean difference of .14 given in Table 14 would have been only .08 without the questionable item.


Table 11
IRT Item Difficulty (b) Parameter Estimates for
Operational Quantitative Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative    .0051    1.5040     .0154    1.5301    -.0103     .3123      -.24
Reg. Math       .6870    1.3936     .7317    1.3510    -.0447     .1159     -1.11
Data Int.     -1.0795    1.0441    -.8789     .9985    -.2007     .1472     -4.31**
Quant. Comp.    .0257    1.4791    -.0447    1.5877     .0704     .3786      1.02

Table 12
IRT Item Discrimination (a) Parameter Estimates for
Operational Quantitative Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative    .8560     .3947     .8484     .3950     .0076     .1545       .36
Reg. Math       .9308     .4135     .9430     .4296    -.0122     .1213      -.39
Data Int.       .5894     .2305     .6283     .2552    -.0389     .0706     -1.74
Quant. Comp.    .9075     .3915     .8745     .3883     .0331     .1849       .98

Table 13
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Quantitative Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative    .1826     .0733     .1667     .0771     .0159     .0315      3.74**
Reg. Math       .1780     .0894     .1602     .0979     .0178     .0241      2.86*
Data Int.       .1376     .02_2     .1303     .0249     .0074     .0357       .66
Quant. Comp.    .1999     .0686     .1821     .0721     .0178     .0338      2.88**

**p < .01
*p < .05


Table 14
IRT Item Difficulty (b) Parameter Estimates for
Operational Quantitative Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative   -.2571    1.4435    -.1175    1.3656     .1396     .5403      1.92
Reg. Math       .2431     .9190     .3352     .9945     .0921     .2436      1.46
Data Int.      -.6966    1.9727    -.3104    1.9248     .3861    1.1055      1.10
Quant. Comp.   -.3607    1.3786    -.2796    1.2443     .0811     .3315      1.34

Table 15
IRT Item Discrimination (a) Parameter Estimates for
Operational Quantitative Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative    .8376     .3231     .8825     .3284     .0449     .1735      1.92
Reg. Math       .8792     .2605     .9690     .2895     .0898     .1728      2.01
Data Int.       .4801     .1879     .5012     .1681     .0211     .2074       .32
Quant. Comp.    .9360     .3042     .9663     .2952     .0304     .1638      1.02

Table 16
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Quantitative Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Quantitative    .1693     .0591     .1655     .0770    -.0039     .0700      -.41
Reg. Math       .1288     .0500     .1415     .0740     .0128     .0498      1.00
Data Int.       .1645     .0708     .1522     .1172    -.0123     .1456      -.27
Quant. Comp.    .1912     .0466     .1818     .0543    -.0094     .0340     -1.51


The mean differences between b's for the data interpretation items in the two forms are quite discrepant even when the peculiar item is removed (-.20 versus .05). This discrepancy might be due to a lack of local independence for the data interpretation items. This dependence has been demonstrated empirically (Dorans & Kingston, 1982) but also follows intuition. Data interpretation items come in sets that refer to a single graphical display. It is possible that in general there is a fatigue effect for sets of data interpretation items but for idiosyncratic sets there is either no effect or a practice effect.

Tables 12 and 15 show that none of the eight mean differences between estimated a's is statistically significant at the .05 level.

Table 13 indicates a statistically significant effect on estimated c for all quantitative, regular mathematics, and quantitative comparison items for Form ZGRl. These findings were not replicated for Form 3CGRl, as can be seen in Table 16. The problems in interpreting the results of the analysis of estimated c's, mentioned earlier, apply to this analysis.

Figure 3 compares the two quantitative equatings of Form 3CGRl to Form ZGRl. As with Figure 2, the converted score resulting from the equating based on the section V estimation of item parameters is subtracted from the converted score resulting from the equating based on the operational section estimation of item parameters. Although the mean difference in b's indicated in Table 14, .14, was not statistically significant, this result is reflected in the difference between the two equatings.

Analytical Item Types

Tables 17 through 22 summarize the effect of item position on IRT parameter estimates for the analytical item types from Form ZGRl and Form 3CGRl. The format of these tables is the same as for the previous verbal and quantitative tables.

Tables 17 and 20 show a large, consistent, statistically significant difference between item difficulty estimates based on operational section and section V administration of the same items for analysis of explanations and logical diagrams items, and consequently for analytical items as a whole. Analysis of explanations items are considerably easier (mean difference between b's of .25 for Form ZGRl and .30 for Form 3CGRl) when administered after examinees have already answered an operational section containing this item type. This might be due to both a pervasive relative unfamiliarity with this item type within the GRE test-taker population and the complexity of the directions. It is likely that initially most examinees frequently refer back to the directions for this item type.

Table 17
IRT Item Difficulty (b) Parameter Estimates for
Operational Analytical Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical     -.0467    1.0853    -.2322    1.0920     .1855     .2605      5.96**
Anal. of Exp.   .1514    1.0619    -.0959    1.1094     .2473     .2378      6.58**
Log. Diag.     -.2729     .9009    -.4950     .8348     .2221     .2609      3.30**
Anal. Reas.    -.3486    1.1963    -.3327    1.2069    -.0159     .2305      -.27

Table 18
IRT Item Discrimination (a) Parameter Estimates for
Operational Analytical Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical      .7683     .2595     .7287     .2666     .0396     .1460      2.27*
Anal. of Exp.   .7385     .2593     .6919     .2716     .0466     .1512      1.95
Log. Diag.      .9806     .2176     .8930     .2254     .0876     .1191      2.85*
Anal. Reas.     .6356     .1504     .6626     .2217    -.0270     .1406      -.74

Table 19
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Analytical Items from Form ZGRl

                    ZGRl                3CGRl             ZGRl-3CGRl
                 Operational          Section V           Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical      .1608     .0480     .1421     .0499     .0186     .0447      3.48**
Anal. of Exp.   .1602     .0392     .1452     .0360     .0150     .0408      2.33*
Log. Diag.      .1652     .0473     .1326     .0238     .0326     .0520      2.43*
Anal. Reas.     .1579     .0663     .1435     .0866     .0144     .0471      1.18

**p < .01
*p < .05


Table 20
IRT Item Difficulty (b) Parameter Estimates for
Operational Analytical Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical     -.1688     .9043     .0431     .8821     .2118     .2022      8.51**
Anal. of Exp.  -.2105     .8639     .0919     .8234     .3024     .1837      9.88**
Log. Diag.     -.0806     .5914     .0700     .6437     .1507     .1346      4.33**
Anal. Reas.    -.1566    1.2022    -.1010    1.1639     .0556     .1897      1.14

Table 21
IRT Item Discrimination (a) Parameter Estimates for
Operational Analytical Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical      .8221     .2439     .8033     .2344    -.0188     .1120     -1.36
Anal. of Exp.   .8716     .2016     .8733     .1841     .0017     .1235       .08
Log. Diag.      .9261     .2781     .8499     .2688    -.0762     .0677     -4.36**
Anal. Reas.     .5994     .1434     .5888     .1681    -.0105     .1044      -.39

Table 22
IRT Item Pseudoguessing (c) Parameter Estimates for
Operational Analytical Items from Form 3CGRl

                    ZGRl               3CGRl             3CGRl-ZGRl
                 Section V           Operational          Difference
Item Type       Mean      S.D.      Mean      S.D.      Mean      S.D.        t

Analytical      .1613     .0498     .1594     .0555    -.0019     .0250      -.62
Anal. of Exp.   .1609     .0507     .1580     .0547    -.0029     .0232      -.75
Log. Diag.      .1783     .0571     .1871     .0619     .0088     .0315      1.08
Anal. Reas.     .1454     .0304     .1354     .0349    -.0100     .0189     -2.05

**p < .01


By the time an examinee has responded to some number of analysis of explanations items, perhaps 30 or more, the examinee might not have to refer back to the directions. Until the directions are internalized by the examinee, items will be more difficult than they would be otherwise. If this is so, it might be reasonable to expect that items that appear early in the operational section will undergo a larger practice effect than those appearing late in the operational section. If the law of diminishing returns is applicable, some value for the difference between b's will be approached asymptotically.
To investigate this hypothesis, the regression of the difference between estimated b's on item position was plotted. In order to smooth the regression, item position was grouped; that is, the mean difference between b's of items 1 through 4 (based on the operational appearance of the items) was plotted against grouped item position 1, the mean difference between b's of items 5 through 8 was plotted against grouped item position 2, et cetera. Figures 4 and 5 show these regressions for the operational items from Forms ZGRl and 3CGRl, respectively.
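The grouping used to smooth these regressions is simple to state in code; the difference values below are made up, and real input would be the per-item b differences ordered by operational position.

    def grouped_means(b_diffs, group_size=4):
        # Mean difference between b's for items 1-4, 5-8, ...,
        # indexed by grouped item position 1, 2, ...
        means = []
        for start in range(0, len(b_diffs), group_size):
            block = b_diffs[start:start + group_size]
            means.append(sum(block) / len(block))
        return list(enumerate(means, start=1))

    # Hypothetical b differences for 12 items in operational order.
    diffs = [0.45, 0.38, 0.41, 0.33, 0.30, 0.27, 0.31, 0.22,
             0.20, 0.18, 0.21, 0.15]
    for position, mean in grouped_means(diffs):
        print(position, round(mean, 3))  # the points in Figures 4 and 5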
The relationship between practice effect and item position is clearly shown for Form ZGRl and less clearly shown but still evident for Form 3CGRl. The regression shows no clear sign of leveling off, perhaps because the operational section was not of sufficient length. Interpretation of this result might be confounded by at least two factors: (1) existence of a similar effect within the section V administration of the items and (2) item ordering within section is not random. This result does, however, point out that data collection design has a potentially large impact on the results of a practice effect study.
Logical diagrams items also exhibit a strong practice effect: mean difference between b's of .22 for Form ZGRl and .15 for Form 3CGRl. Analytical reasoning items showed no evidence of a practice or fatigue effect.
Although three out of eight tests yield statistically significant differences at either the .05 or .01 level, Tables 18 and 21 show a statistically significant result for logical diagrams, but in opposite directions.

Tables 19 and 22 present the results for the analysis of differences between estimated c's. Three of the four analyses produced statistically significant differences at either the .05 or .01 level for Form ZGRl, but none of these results was replicated for Form 3CGRl. The difficulties inherent in interpreting the results of the analysis of c's have been mentioned earlier in this report.

Figure 6 compares the two equatings of the analytical section of Form 3CGRl to Form ZGRl as Figures 2 and 3 did for the verbal and quantitative equatings. The results reflect the differences found in Table 20.

Figure 4. Form ZGRl: mean difference between b's plotted against grouped item position for operational items.

Figure 5. Form 3CGRl: mean difference between b's plotted against grouped item position for operational items.

Figure 6. Effect of item position on the equating of the analytical section of Form 3CGRl. (Vertical axis: difference between converted scores; horizontal axis: formula score.)


SUMMARY AND IMPLICATIONS

Five types of analyses were performed as part of this research. The first three types involved analyses of differences between the two estimations of item difficulty (b), item discrimination (a), and pseudoguessing (c) parameters. The fourth type was an analysis of the differences between equatings based on items calibrated when administered in the operational section and equatings based on items calibrated when administered in section V. Finally, an analysis of the regression of the difference between b's on item position within the operational section was conducted.

The analysis of estimated item difficulty parameters showed a strong practice effect for analysis of explanations and logical diagrams items² and a moderate fatigue effect for reading comprehension items. Analysis of other estimated item parameters, a and c, produced no consistent results for the two test forms analyzed.

Analysis of the difference between equatings for Form 3CGRl reflected the differences between estimated b's found for the verbal, quantitative, and analytical item types. A large practice effect was evident for the analytical section, a small practice effect, probably due to capitalization on chance, was found for the quantitative section, and no effect was found for the verbal section.

Analysis of the regression of the difference between b's on item position within the operational section for analysis of explanations items showed a rather consistent relationship for Form ZGRl and a weaker but still definite relationship for Form 3CGRl.
The results of this research strongly suggest one particularly important implication for equating. If an item type exhibits a within-test context effect, any equating method, e.g., IRT based equating, that uses item data either directly or as part of an equating section score should provide for administration of the items in the same position in the old and new forms. Although a within-test context effect might have a negligible influence on a single equating, a chain of such equatings might drift because of the systematic bias.

Item responding behavior might be affected by other within-test variables besides item position. The difficulty (relative or absolute) of items preceding an item in question might influence responding behavior. The presence or absence of dissimilar item types also might affect responding behavior. Variables such as these might have confounded the results of this study. Further research should prove enlightening. Interested readers may want to consult Swinton, Wild, and Wallmark (1982) for further research on GRE item types.

²Because of these large practice effects, the GRE analytical section has been revised. Both analysis of explanations items and logical diagrams items have been dropped from the test.


REFERENCES

Angoff, W. H. Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education, 1971.

Conrad, L., Trismen, D., & Miller, R. (Eds.). Graduate Record Examinations technical manual. Princeton, NJ: Educational Testing Service, 1977.

Cowell, W. R. Applicability of a simplified three-parameter logistic model for equating tests. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Dorans, N. J., & Kingston, N. M. Assessing the local independence assumption of item response theory in GRE item types and populations. Paper presented at the annual meeting of the Psychometric Society, Chapel Hill, NC, May 1981.

Greene, E. B. Measurement of human behavior. New York: The Odyssey Press, 1941.

Kingston, N. M., & Dorans, N. J. The feasibility of using item response theory as a psychometric model for the GRE Aptitude Test. GRE Board Professional Report 79-12bP. Princeton, NJ: Educational Testing Service, 1982.

Kingston, N. M., & White, E. B. Item response theory statistics, classical test and item statistics: What does it all mean when you sit down to construct a test? Paper presented at the annual meeting of the American Educational Research Association, Boston, April 8, 1980.

Lord, F. M. Notes on comparable scales for test scores. Research Bulletin 50-48. Princeton, NJ: Educational Testing Service, 1950.

Lord, F. M. A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology. Amsterdam: Swets and Zeitlinger, 1977, pp. 19-29.

Lord, F. M. Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, 1980.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968.

Lord, F. M., & Stocking, M. L. Autest: Program to perform automated hypothesis tests for nonstandard problems. Research Memorandum 73-7. Princeton, NJ: Educational Testing Service, 1973.

Petersen, N., Cook, L., & Stocking, M. L. IRT versus conventional equating methods: A comparative study of scale stability. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Scheuneman, J. D., & Kay, E. F. Homogeneity of ability and certification decisions. Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, April 14, 1981.

Swinton, S., Wild, C. L., & Wallmark, M. Investigation of practice effects on item types in the Graduate Record Examinations Aptitude Test. GRE Board Professional Report 80-01bP. Princeton, NJ: Educational Testing Service, 1982.

Wing, H. Practice effects with traditional mental test items. Applied Psychological Measurement, 1980, 4, 141-155.

Wood, R. L., Wingersky, M. S., & Lord, F. M. LOGIST: A computer program for estimating examinee ability and item characteristic curve parameters. ETS Research Memorandum 76-6 (modified 1/78). Princeton, NJ: Educational Testing Service, 1978.
