JOURNAL OF GEOSCIENCE EDUCATION 63, 73–85 (2015)

Development and Validation of a Science Inquiry Skills Assessment


Yiping Lou,1,a Pamela Blanchard,2 and Eugene Kennedy2

1 Department of Educational and Psychological Studies, University of South Florida, Tampa, Florida 33647
2 School of Education, Louisiana State University, Room 223, Peabody Hall, Baton Rouge, Louisiana 70803
a Author to whom correspondence should be addressed. Electronic mail: ylou@usf.edu. Tel.: 813-974-7886.

Received 1 May 2014; revised 19 November 2014; accepted 28 November 2014; published online 18 February 2015.

ABSTRACT
This paper reports on the development and validation of a science inquiry skills assessment for Earth science (iSA–Earth
Science) in middle schools that classroom teachers can use to assess and monitor their students’ science inquiry skills. An
assessment framework with six primary inquiry skills, each with three to six subskills, was developed based on a
comprehensive review of the literature and following the recommendations of the National Science Education Standards (NSES;
NRC, 1996) and A Framework for K–12 Science Education (FSE; NRC, 2012). The six primary skills are: (1) identify questions for
scientific investigations, (2) design scientific investigations, (3) use tools and techniques to gather data, (4) analyze and
describe data, (5) explain results and draw conclusions, and (6) recognize alternative explanations and predictions. The
assessment development and validation process followed guidelines from the Standards for Educational and Psychological
Testing (American Educational Research Association [AERA], American Psychological Association [APA], and National
Council of Measurement in Education [NCME], 1999). Psychometric analysis using both classical test statistics and Item
Response Theory showed the instrument was internally consistent, reliable, and did not function differentially for male and
female students. Implications for science education practice and research are discussed. © 2015 National Association of
Geoscience Teachers. [DOI: 10.5408/14-028.1]

Key words: science education, inquiry skill, assessment, validity/reliability, teacher/student empowerment, standards,
formative assessment, Rasch modeling

INTRODUCTION
Science as inquiry has been at the core of science education reform for several decades (National Research Council [NRC], 1996, 2007, 2012, 2013). The National Science Education Standards (NSES; NRC, 1996) calls for a "codevelopment of the skills of students in acquiring science knowledge, in using high-level reasoning, in applying their existing understanding of scientific ideas, and in communicating scientific information" (145–146). The vision goes beyond "science as process," when it was thought that individual skills, such as observation, inference, and experimentation could be taught independently of content and context. The integrated character of scientific knowledge and how science is practiced is again stressed in A Framework for K–12 Science Education (FSE; NRC, 2012) and the Next Generation Science Standards (NGSS; NRC, 2013), emphasizing that the work of scientists and engineers requires the concurrent use of both science knowledge and scientific inquiry practices. Furthermore, FSE argues that teachers need to assess student understanding and monitor their progress, and calls for assessments that "test students' understandings of science as a content domain and their understanding of science as an approach" (263).

Effective educational assessments inform instruction and improve student learning (NRC, 2000). Helping students engage in scientific inquiry and develop science inquiry skills in the context of learning science is one of the most important goals of science education (NRC, 2007). One of the ways to assist teachers in understanding and helping students develop science inquiry skills is the development and use of a teacher-friendly science inquiry skill assessment tool that can help both the teachers and students identify strengths and weaknesses in their science inquiry skills and to "chart students' progress over time instead of simply measuring performance at a particular point" (NRC, 2012, 318). If students are to learn how science is done, then at least three important conditions are necessary: (1) teachers must understand science processes themselves (Lewis, 2008), (2) students must have multiple opportunities to practice their skills, and (3) students must understand whether they are progressing in their skill attainment or not (Radford, DeTure, and Doran, 1992). Validated and reliable standards-aligned assessments are crucial instruments for aiding teachers in diagnosing their students' understanding and competencies as well as providing feedback about their performance (Cooper, Shepardson, and Harber, 2002). The need for a formative assessment tool in the area of Earth system science "is paramount in teaching and learning to show teachers what their students understand and how to adjust their instruction" (Lewis, 2008, 453). In fact, the FSE encourages researchers to explore "how new forms of assessment can be made both accessible to teachers and practical for use in classrooms" and that teachers should "use assessments to plan for, revise, and adapt instruction" (NRC, 2012, 319).

This paper reports on the development and validation of a science inquiry skills assessment for Earth science (iSA–Earth Science), which is part of a larger project on designing and developing two interrelated online tools, an Earth science inquiry skill analyzer and an Earth science inquiry activities portal. In this paper, we will first review the research literature on science inquiry assessment. We will then describe the iterative development and validation process using the five types of validation evidence and reliability issues as articulated in the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999). Finally, we will discuss the implications of the results.




WHY EARTH SCIENCE?
The focus on integrating inquiry into classroom teaching is not new. Science teachers are well aware that the National Science Education Standards (NRC, 1996) and the Next Generation Science Standards (NRC, 2013) both call for the integration of inquiry skills within the context of science content. However, for some Earth science teachers, integrating inquiry skills into the teaching of geosciences is a far-off and daunting goal. After all, in Earth science, how can one isolate variables when dealing with plate tectonics or types of faults?

As Ault (1998) argues:

  Unless balanced by attention to how disciplines have evolved in response to particular phenomena of interest, the representation of science as process fails to portray the rationality of scientific inquiry properly. Good science teaching depends upon resources that represent the demands characteristic of solving problems in distinctive fields of inquiry. . . . Geology is similar to but distinct from other sciences. . . . In summary, geology is not physics. (210–211)

Unlike physics, where variables can be isolated with relative ease, many geological inquiries rely on " 'experiments' that have already been conducted by nature. Consequently, many geological inquiries are of a retrospective type—trying to unravel what happened in the past" (Orion and Kali, 2005, 387). This is the first difficulty that Earth science teachers encounter—they generally have little experience in how geoscientists conduct inquiry. Their second challenge is recognized in Blueprint for Change: Report from the National Conference on the Revolution in Earth and Space Science Education (Barstow and Geary, 2001), where conference attendees called for teachers to "understand how to assess student performance in inquiry-based learning, how to write assessments and score assessment items, and use assessment results to improve classroom practice" (53). Cooper and others (2002) write that scenario-based assessments in environmental geosciences professional development programs should serve both to model assessment techniques and as an instructional evaluation method, and that "all of these pieces of information can be used to design more effective instruction" (64).

APPROACHES TO SCIENCE INQUIRY ASSESSMENT
What and how to assess science inquiry skills have been the focus of science education researchers for several decades. Many of the inquiry skills assessments from the 1960s through the mid-1970s were developed to research the effectiveness of a specific science curriculum's capacity to improve students' science process skills. Examples of research assessments associated with a particular curriculum include the Science—A Process Approach elementary science curriculum (Walbesser, 1965; Walbesser and Carter, 1970; Fyffe, 1971; Ludeman, 1975; McLeod et al., 1975), the Biological Sciences Curriculum Study (BSCS, 1962; Lavinghousez, 1973), the Science Curriculum Improvement Study (Riley, 1972), and the Science Education for Public Understanding Program (SEPUP, 1995). One limitation of curriculum-dependent assessments is that they are only available to the teachers who adopted the specific curriculum. Another limitation is that most of these assessments, with the exception of SEPUP (1995), were constructed primarily to provide information to curriculum developers and/or researchers. These tests were not constructed to necessarily provide useful feedback to teachers or students.

As researchers began to look at inquiry skills independent of specific curricula in the 1970s and 1980s, they developed inquiry assessments designed for particular grade levels or grade ranges rather than a specific published curriculum. Examples include research assessments for lower elementary (Dietz and George, 1970), upper elementary (Molitor and George, 1976), middle school (Tannenbaum, 1971), middle school to high school (Dillashaw and Okey, 1980; Okey, Wise, and Burns, 1982; Burns, Okey, and Wise, 1985), and middle school to college (Tobin and Capie, 1982). While these assessments were not associated with a particular curriculum, questions were raised about content coverage and the ability of these tools to provide useful information for classroom teachers. In arguing for a broad-based inquiry skill framework, Means and Haertel (2002) noted that curriculum-specific inquiry assessments, including more recent efforts in technology-based assessments, are not broad enough to cover a variety of science contexts, but content-independent assessments, on the other hand, may be too decontextualized.

In the 1990s and early 2000s, more focus was placed on performance-based assessments, typically graded by use of a rubric. Examples of performance-based assessments include work by Alonzo and Aschbacker (2004) and Germann and others (1996). Although performance-based assessments have the strength of providing a more authentic assessment, there are several limitations. First, they are often time consuming to administer and grade. For example, the three performance assessments created by Alonzo and Aschbacker (2004) took over 2.5 hours to administer. Consequently, they were often used by researchers and not very feasible for classroom teachers to use, given that they might have multiple classes of students and a limited amount of class time to devote to the assessment. Second, limiting inquiry assessments, and consequently inquiry teaching, to performance-based assessments in experimental labs can create a view of science investigations that is limited to those that can only be carried out in labs that typically involve one independent variable that can be manipulated and one dependent variable (Roberts and Gott, 2003).

A recent assessment, the Hypothesis, Observation, Conclusion Test, was developed using a multiple-choice format by Orion and Kali (2005) for a guided-inquiry Earth science unit entitled "The Rock Cycle." The test includes a total of 20 items, and each item consists of a statement that the student was asked to identify as hypothesis, observation, or conclusion. This assessment is one of the few assessments meant to assess specific inquiry skills developed within Earth science content. The limited number of items and the multiple-choice format made this assessment easy to use in the classroom and quick to grade. However, as is the case for many other process skill assessments, the instrument was

developed for assessing the effectiveness of a specific curriculum. Furthermore, it is limited in scope.

INQUIRY SKILL REPRESENTATION AND THE NEED FOR VALIDATED INSTRUMENT FOR CLASSROOM USE
While early inquiry assessments focused on specific inquiry skills of interest to researchers and were often curriculum- or grade-specific, most were not comprehensive with regard to the inquiry skill set they assessed, and none are aligned with NSES. Furthermore, few were designed to provide formative feedback to students or teachers.

In trying to fill the need for a comprehensive test of science inquiry skills, Wenning (2007) developed a 35-item multiple-choice Scientific Inquiry Literacy Test (ScInqLiT) for high school students. This assessment was developed following a framework of nine stages of inquiry based on Science and Its Ways of Knowing (Hatton and Plouffe, 1997). However, Wenning did not explicitly align the inquiry skills in his assessment with the science inquiry abilities as described in NSES. Two pilot studies were conducted with high school students. The KR20 reliability coefficient for the second pilot with 61 students was 0.71. After the second pilot, one question was replaced and several reworded to increase clarity for the final version of the test, which, however, was not field-tested. Based on the limitations in the development process and the relatively highly "motivated and homogeneous" group of high school students that participated in the second pilot testing, the author advised caution in using the test and that the best use of the instrument was for research purposes only (Wenning, 2007, 23).

In addition to questions of content and instructional use, early inquiry assessments, including the most recent ones by Orion and Kali (2005) and Wenning (2007), were developed using classical test development approaches, and none used more recent psychometric techniques based on Item Response Theory (IRT). IRT has the advantage of producing sample-free estimates of item parameters, item-free estimates of person ability, and permitting the test developer to explicitly target coverage (test information) of the underlying trait reflected in test performance during the test assembly process (Bond and Fox, 2007). IRT techniques such as Rasch modeling also facilitate item calibration and scaling of different forms to enable more accurate tracking of growth in students' development of inquiry skills.

Additionally, the development process used in early assessments of inquiry skills was often not as rigorous as recommended by the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999), which advocates demonstration of five types of validity evidence: content, response process, internal structure, relationship to other variables, and consequences of testing, in addition to issues of reliability. Most early inquiry assessments focused on content validity and reliability only.

DEVELOPMENT PLAN FOR THE iSA–EARTH SCIENCE
Development of the iSA–Earth Science was part of a larger project to design and develop two interrelated Web-based tools: an Inquiry Skills Analyzer (iSA) and an Inquiry Activity Portal (iAP), which focused on middle school Earth science. The iSA is a Web-based assessment system that teachers can use for their students and themselves to take the iSA–Earth Science, view a variety of tabular and graphic reports to identify strengths and weaknesses of their science inquiry skills, and to monitor progress throughout the school year. The iAP is an online Earth science inquiry activities portal with matrices that are keyed to both science inquiry skills and Earth science content typically taught in middle schools. This paper reports on the iSA–Earth Science development and validation.

The development of the iSA–Earth Science was a rigorous and iterative process that spanned three years while the online iSA application was simultaneously being developed. It is anchored to a comprehensive inquiry skill framework and is intended to provide teachers with an overall measure of the inquiry skills of their students and to indicate specific areas in which additional development may be needed.

The construction and validation of the iSA–Earth Science followed a process designed to ensure that the resulting measure would provide classroom teachers with accurate and useful information on the inquiry skills of their students. This process was framed around conceptions of validity and reliability as articulated in the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999).

Validity Evaluation
As noted in the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999), validity concerns the body of evidence and theory that supports the interpretation and use of a test score. The question is whether or not there is sufficient evidence to justify the interpretation of test scores and the subsequent use of a test for a given purpose. The processes of constructing and validating a test are not separate activities but are intricately interwoven into the validity argument, meaning that a case can be made that the test scores can be used for a given purpose. The Standards articulate five types of validity evidence that test developers should address. These standards and the methods for addressing each of the five types of validity evidence for the development of the iSA–Earth Science are outlined in Table I.

TABLE I: Types of validity evidence based on the Standards for Educational and Psychological Testing.

Evidence based on test content
  Definition/questions: To what extent is the content domain explicitly described? To what extent does the content domain match the intended use? To what extent does the content of the test match the targeted domain?
  Methods used in iSA–Earth Science: Academic experts and practitioners collaborate on an iterative and structured process of development of an inquiry skills framework, test specifications, and test items. Pilot studies and reviews from experts not involved in test development are used to revise the framework and items.

Evidence based on response processes
  Definition/questions: To what extent do examinee response processes reflect targeted behaviors?
  Methods used in iSA–Earth Science: Classroom teacher feedback; classic item difficulty and discrimination indices from pilot and field tests; Rasch item and person fit statistics.

Evidence based on internal structure
  Definition/questions: To what extent does the internal structure of the test match that predicted by theory? To what extent is the targeted construct adequately covered?
  Methods used in iSA–Earth Science: Exploratory and confirmatory factor analysis of item responses; dimensionality and local item independence; Rasch item–person maps.

Evidence based on relationships to other variables
  Definition/questions: Do test scores relate to other variables in predicted ways? Convergent: Do test scores correlate with teacher ratings and expectations? Discriminant: Do test scores differentiate between knowledgeable examinees and naïve examinees? Criterion: Do test scores correlate with grades?
  Methods used in iSA–Earth Science: Correlations with standardized test scores.

Evidence based on consequences of testing
  Definition/questions: To what extent do test scores unfairly penalize one or more groups?
  Methods used in iSA–Earth Science: Differential item functioning.

To ensure content validity, we used both a priori efforts (through careful conceptualization of the assessment framework, item specification, and development based on Science Education Standards and frameworks [NRC, 1996, 2007, and 2012], and existing science inquiry assessment literature) and a posteriori efforts (through iterative review by expert teachers and external panel review of content relevance and alignment). To determine response process validity, we conducted a series of pilot studies with PTI project participants and their students over three years. Based on responses from students and their teachers, and classical item analyses, modifications were made. Evidence on internal structure was examined through exploratory and confirmatory factor analysis of item responses, dimensionality and local independence, and Rasch item–person maps. Relationship to other variables was examined through correlation with standardized test scores and growth trends across the academic year from field tests.

Consequence of testing was examined via differential item functioning (DIF) on gender. Research suggested that male and female students often have different interests and achievement levels in science (Greenfield, 1998; Kutnick, 2000; Hoffmann, 2002). An important feature of the measure developed for this project is that it not be differentially biased towards male or female students, as is recommended by NRC (2014). To address this, we conducted a DIF analysis for each of the individual items as well as the overall assessment.

Reliability Evaluation
It is widely accepted that a test cannot be valid without first being reliable. Reliability refers to the extent to which test scores are influenced by errors of measurement. While there are a variety of different ways of determining the reliability of a measure, for the iSA–Earth Science we focused on internal consistency of the overall measure as well as the subscales. Reliability was examined through classical Cronbach's alpha in pilot tests and IRT analyses on person and item separation via Rasch modeling. We also focused on the standard error of measurement and the person separation index, which is a measure of the ability of a test to reliably differentiate among examinees.

In summary, the iSA–Earth Science is designed primarily as an assessment instrument for classroom teachers. The development and validation of the iSA–Earth Science occurred over a three-year period. Multiple samples were involved in pilot and field testing, and a wide range of statistical techniques were utilized. These details are presented in the results section below.

EMPIRICAL RESULTS
In this section we detail the development and validation of the iSA–Earth Science. It is organized around the five types of validity evidence and issues of reliability as articulated in the Standards (AERA, APA, and NCME, 1999).

Content Validity Evidence
PTI Science Inquiry Skills Framework
The PTI Science Inquiry Skills Framework (iSF; see Table II) was developed based on a comprehensive review of the literature on science inquiry and assessment, including: the eight fundamental abilities outlined in the Science as Inquiry Strand of the NSES content standards for grades 5–8 (NRC, 1996), prior science inquiry assessments (e.g., Dillashaw and Okey, 1980; Burns, Okey, and Wise, 1985; Smith and Welliver, 1990; Beeth et al., 2001), state standards such as the Integrated Louisiana Educational Assessment Program Assessment Guide (Louisiana Department of Education, 2003, 2006), science education methods textbooks (e.g., Koballa and Chiappetta, 2005), the science proficiency framework, Taking Science to School: Learning and Teaching Science in Grades K–8 (NRC, 2007), and A Framework for K–12

Science Education: Practices, Crosscutting Concepts, and Core Ideas (NRC, 2012).

To ensure that inquiry skills were correctly identified and test items could be appropriately aligned with the iSF, 50 Earth science multiple-choice assessment items were selected from a variety of resources, including Earth science textbooks, science inquiry process skill textbooks, and several released items from various state exams. A panel of four expert Earth science teachers, including two with extensive experience in serving on state science assessment development review panels, participated in classifying each item.

Several modifications were made based on this blind classification process and piloting results. For example, one inquiry skill area, the use of mathematical skills, was too broad to be adequately assessed within the iSA–Earth Science. The team decided that use of mathematics (as represented in data collection, measurement, and analysis) would be more feasible and appropriate to assess within the scope of the iSA–Earth Science.

Test Specifications and Items Development
Using accepted test development guidelines (Popham, 1980; Haladyna, 1994; Downing and Haladyna, 2006) and the initial PTI Inquiry Skill Framework, a Test Development Specifications Document was created to guide item and test development. This document provides specifications on the format of the items and the percentage of coverage for each primary inquiry skill area, and included guidelines for developing assessment items.

Since the purpose for developing the iSA–Earth Science was to create a teacher-friendly assessment tool that classroom teachers could use to identify strengths and weaknesses of their students' inquiry skills and track their progress, teacher recommendations were important to the development of the specifications document. A multiple-choice format for all test items was chosen because of efficiency in administration and automated scoring via the Web interface. It was determined by the panel of expert teachers that each item would consist of a stem describing an Earth science context and four options. Teacher recommendations resulted in a blueprint of the iSA–Earth Science that includes three to six items for each of the primary skill areas, with a total of 30 items that could be completed in a typical 45–50 minute class. The number of items in each skill area is noted in parentheses after each inquiry skill title in Table II. The resulting scores in each inquiry skill area are considered "indicators" of potential strengths or weaknesses with respect to a student's skill set. Since most state testing programs provide subskill scores that also function as indicators due to a limited number of items, the test developers believe that classroom teachers would be well versed in utilizing this information productively.

All iSA–Earth Science assessment items were developed using a middle school Earth science context. Most inquiry assessments used mixed content areas; we decided to choose one content for two reasons. First, this follows the recommendations of NSES and A Framework for K–12 Science Education (NRC, 2012) that inquiry practices should not be separated from the teaching of core disciplinary ideas. Secondly, many teachers feel that Earth science is especially difficult for inquiry activities, and they tend to teach inquiry using other science content such as physics (Blanchard and Lou, 2009). By developing all items in Earth science, this assessment can help the teachers better envision the range of inquiry-based Earth science practices.

The process of creating and vetting iSA–Earth Science assessment items was iterative. Items were written to assess specific subskills from the PTI iSF. Care was taken to provide sufficient elaboration in the question stem so that the item tested the target inquiry skill rather than the Earth science content. A Review Checklist was used by each development team member to independently identify the subskill addressed by the item and to evaluate whether each item met best practices in item design. At periodic team meetings each item was reviewed and the decision was made to accept the item as written for piloting, revise the item, or reject the item. If the item was revised, the team offered suggestions to improve the item, and the newly revised item was again evaluated using the review protocol sheet. The review process was repeated until the item was accepted with revisions or rejected. Figure 1 presents a sample item before and after revision to improve clarity.

External Expert Review
A panel, external to the project, of science educators and researchers with expertise in science inquiry and measurement independently reviewed the iSF and one of the iSA–Earth Science forms for the field study in Year 3. They were asked to review the iSF and the items with respect to: (1) alignment with broad definitions of scientific investigation practices, (2) coverage and representation of the six primary inquiry skills and related subskills, (3) alignment of assessment items with the iSF, and (4) whether items met technical requirements for multiple-choice items. The goal of this process was to ensure the construct validity of the iSF and the content validity of the items; that is, that the items are testing the specific inquiry skills they were designed to test. The external reviewers agreed that the iSF provides a good representation of primary skills and subskills of science inquiry and that it aligns with broad definitions of scientific inquiry practices. All also agreed with the skill and item matching, the accuracy of the answer choices, and the technical quality of the items. These external experts provided additional evidence of content validity of the assessment instrument.

Response Process Evidence
Response process evidence that examinee responses reflect targeted behaviors was determined through a series of pilot studies. Following the initial panel review, a process was established by which the pool of "acceptable items" was piloted. While the details of the pilot studies are described below, all shared similar characteristics. First, test items were compiled into assessment forms based on the opinions of the review panel. The goal of this aspect of the development process was for each form to constitute a representative sampling of the different skill categories in the framework. A consensus model was used for decision making concerning the composition of the forms.

TABLE II: PTI Science Inquiry Skills Framework.1

iSkills: Operational Definition

1. Asking questions (FSE Practice 1: Asking questions)
1.1 Identify testable questions: Given a description of a research interest and investigation set up, identify the testable question.
1.2 Refine/refocus ill-defined questions: Given a description of an investigation and a broad and ill-defined question, identify a more focused question that can be answered through scientific investigation.
1.3 Formulate hypotheses: Given a description of variables involved in an experiment, select a reasonable hypothesis that is being or can be tested.

2. Planning investigations (FSE Practice 3: Planning and carrying out investigations)
2.1 Design investigations to test a hypothesis: Given a hypothesis, select a suitable design for an investigation to test it.
2.2 Identify independent variables, dependent variables, and variables that need to be controlled: Given a description of an investigation or research question, identify the independent, dependent, and controlled variables involved in it.
2.3 Operationally define variables based on observable characteristics: Given a description of an investigation, identify how the variables are operationally defined.
2.4 Identify flaws in investigative design: Given a description of an investigation, identify design flaws.
2.5 Utilize safe procedures: Given a description of an investigation, identify safety procedures and equipment to conduct the investigation.
2.6 Conduct multiple trials: Given a description of an investigation, understand the need and reasons for conducting multiple trials.

3. Carrying out investigations (FSE Practice 3: Planning and carrying out investigations)
3.1 Gather data by using appropriate tools and techniques: Given a description of an investigation and type of data to collect, select appropriate tools and techniques for gathering the data.
3.2 Measure using standardized units of measure: Given a description of an investigation and type of data to collect, identify the appropriate unit of measurement.
3.3 Compare, group, and/or order objects by characteristics: Given a description or a graphic representation of a set of objects, identify a classifying feature.
3.4 Construct and/or use classification systems: Given a classification system, use it appropriately to identify specific objects.
3.5 Use consistency and precision in data collection: Given a description of an investigation and the type of data to collect, select an appropriate unit of measurement to achieve precision and consistency.
3.6 Describe an object in relation to another object (e.g., its position, motion, direction, symmetry, spatial arrangement, or shape): Given a graphic representation of investigation data, describe an object in relation to another object.

4. Analyzing and interpreting data (FSE Practice 4: Analyzing and interpreting data)
4.1 Differentiate explanation from description: Given several statements about a data sample of an object or event, differentiate explanations from descriptions.
4.2 Construct and use graphical representations: Given a description of an investigation and obtained data, identify a graph that represents the data; or given data in a table, graph, or map representation, identify a specific data point.
4.3 Identify patterns and relationships of variables in data: Given a graph or table of data from an investigation, identify the relationship between the variables.
4.4 Use mathematical skills to analyze and/or interpret data (FSE Practice 5: Using mathematics): Given data such as in a table or figure, use mathematical skills to calculate and interpret the data.

5. Constructing explanations (FSE Practice 6: Constructing explanations)
5.1 Differentiate observation from inference: Given several statements about a data sample of an object or event, differentiate inference from observations.
5.2 Propose an explanation based on observation: Use data and information gathered to develop an explanation of the observation.
5.3 Use evidence to make inferences and/or predict trends: Given data in a table, graph, or verbal description, use the evidence to make an inference or prediction.
5.4 Form a logical explanation about the cause and effect relationships in data from an experiment: Given data from an investigation, draw a conclusion about the cause and effect relationships in the data.

TABLE II: continued.

6. Engaging in arguments from evidence (FSE Practice 7: Engaging in arguments from evidence)
6.1 Consider alternate explanations: Given a set of data collected and a possible explanation of the results, identify another possible explanation.
6.2 Identify faulty reasoning not supported by data: Given a set of data collected, differentiate concluding statement(s) that follow logically from the data and those that may relate to faulty reasoning and misinterpretation of the data.

1 The PTI Science Inquiry Skills Framework is aligned with six of the eight scientific practices in FSE (NRC, 2012).

The early pilot studies were often limited to only a few classes of students, and as a result item analysis relied heavily on classical measures of difficulty and discrimination. Specifically, p values (item difficulty) were calculated for each item. The ideal p value is .3 to .7. Item discrimination was computed based on item-total correlations. These values were deemed acceptable if they were positive and greater than .20. Finally, distractor analysis focused on the extent to which the distractors functioned as "foils" for less able examinees. If the proportion of knowledgeable examinees (upper 25%) selecting a distractor was lower than the proportion of less knowledgeable examinees (bottom 25%) that attempted an item, the distractor was deemed to be acceptable. Interitem correlations and Cronbach's alpha were also computed but used cautiously. Based on the results of these analyses, items were either selected for revision or set aside for field testing.

During the pilot testing, teachers were also asked to write down any questions that students asked during the test. Several problems were identified, such as the need for clarity in wording and for more elaboration of the Earth science context in the assessment items, as well as cases in which items relied too heavily on knowledge of Earth science content. Improvements were made for the problematic items.
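The classical indices just described are straightforward to reproduce from a scored response matrix. The following Python sketch is illustrative only (the project's own analysis scripts are not published): it computes the p value, the corrected item-total discrimination, and the upper-25%/lower-25% distractor check on simulated data, with all names chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items, n_options = 120, 30, 4
key = rng.integers(0, n_options, size=n_items)                    # keyed option per item
chosen = rng.integers(0, n_options, size=(n_examinees, n_items))  # simulated option choices
scored = (chosen == key).astype(int)                              # 1 = correct, 0 = incorrect
total = scored.sum(axis=1)

upper = total >= np.quantile(total, 0.75)   # "knowledgeable" examinees (upper 25%)
lower = total <= np.quantile(total, 0.25)   # less knowledgeable examinees (bottom 25%)

for i in range(n_items):
    p_value = scored[:, i].mean()                             # difficulty; ideal range .3-.7
    rest = total - scored[:, i]                               # corrected total (item removed)
    discrimination = np.corrcoef(scored[:, i], rest)[0, 1]    # keep if positive and > .20
    for opt in range(n_options):
        if opt == key[i]:
            continue
        # A distractor functions as a "foil" when the lower group selects it
        # at least as often as the upper group does.
        foil_ok = (chosen[lower, i] == opt).mean() >= (chosen[upper, i] == opt).mean()
```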
Internal Structure Evidence
A field test was conducted in year three with a total of 630 eighth-grade Earth science students in nine middle schools. Of the nine schools, six were public and three were private schools. The school sizes ranged from 317 to 944. An average of 50% of the students in the public schools received free or reduced lunch. Prior to the academic year, training was provided to the participating teachers on how to utilize the features of the assessment Web application, iSA. The teachers had the opportunity to explore iSA features and take the assessment themselves before giving it to their classes.

Analyses used to examine the internal structure of the test include: exploratory and confirmatory factor analyses of item responses, dimensionality (principal component analysis of Rasch model residuals) and local item independence, and Rasch item–person maps.
Rasch item–response maps. Random guessing has the impact of increasing the
likelihood that low-ability examinees will obtain a correct
Exploratory and Confirmatory Factor Analyses response to difficult items or conversely, high-ability
The internal structure of the iSA–Earth Science was first examinees will fail to correctly respond to an easy item. It
evaluated via exploratory and confirmatory factor analysis. The was anticipated that guessing would be minimal, in that the
PRELIS software package, which is part of LISREL 8, was used distracters or foils were selected to represent common
to produce polychoric correlations, which were entered into the misconceptions among less knowledgeable examinees; a key
SPSS 18.0 software and subjected to exploratory factor analysis goal of the distractor analysis described above was to identify
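A rough, self-contained check for a single dominant factor can be run without the PRELIS/LISREL tool chain used here. The sketch below substitutes ordinary Pearson correlations for polychoric correlations and simply inspects the leading eigenvalues of the inter-item correlation matrix, so it is only an approximation of the principal-axis analysis reported above; a first eigenvalue that dwarfs the second is the pattern consistent with one dominant factor.

```python
import numpy as np

def leading_eigenvalues(scored):
    """Eigenvalues of the inter-item correlation matrix (Pearson used here as a
    simplification of the polychoric correlations appropriate for 0/1 items)."""
    corr = np.corrcoef(scored, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# scored: examinees x items matrix of 0/1 responses (simulated here for illustration)
scored = np.random.default_rng(1).integers(0, 2, size=(630, 30))
eigvals = leading_eigenvalues(scored)
dominance = eigvals[0] / eigvals[1]   # a ratio much greater than 1 suggests one dominant factor
```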
IRT Analyses
In addition to factor analysis, the final phase of development of the iSA–Earth Science was based on item response theory (IRT). IRT has the advantage of producing sample-free estimates of item parameters and permitting the test developer to explicitly target the coverage of the underlying trait reflected in test performance. There are numerous IRT models being used in test development and operational examinations nationally. The most prominent models for dichotomously scored items (Rasch, 2-parameter logistic, 3-parameter logistic) vary on the assumptions they make regarding the influence of guessing behavior by examinees and whether or not the discrimination of items on an examination varies. The model that seemed to best fit the iSA–Earth Science assessment is the 1-parameter logistic model, also called the Rasch model. The Rasch model is based on the idea that the probability that an examinee will get an item correct is a function of the examinee's ability and the difficulty of the item. As with other item response theory models, it has the potential to provide estimates of a person's ability that are independent of the specific examination they have taken and estimates of item difficulty that are independent of the examinees that attempted the item. These models have the potential to provide detailed information about the functioning of items and persons across the range of ability measured in a test. This model is currently used in several state achievement testing programs.

To examine and estimate the parameters of the model, the Winsteps (Version 3.68.2) computer program (Linacre, 2009) was used. The process was as follows: (1) the model was fit to data and examined to determine if the model provided a reasonable fit to the data; and (2) as part of the fitting process, items or persons that appeared to be inconsistent with model predictions would be discarded.
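In the Rasch model, the probability of success depends only on the difference between person ability (θ) and item difficulty (b), both expressed in logits. The snippet below is a minimal illustration of that item characteristic curve and of the item information it implies; it is not the Winsteps calibration routine.

```python
import numpy as np

def rasch_probability(theta, b):
    """P(correct) under the Rasch (1-PL) model: a function of (theta - b) only."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def rasch_item_information(theta, b):
    """Fisher information of a Rasch item; it peaks where theta equals b."""
    p = rasch_probability(theta, b)
    return p * (1.0 - p)

# A person whose ability matches an item's difficulty has a 50% chance of success,
# which is how the centered item-person map (Figure 2) is read.
print(rasch_probability(theta=0.36, b=0.00))   # mean person vs. mean item from Table III
```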

FIGURE 1: (a) Sample item for Skill 2.4: Identify flaws in experimental design, before revision. (b) Sample item for
Skill 2.4: Identify flaws in experimental design, after revision.

Item and Person Fit
Random guessing has the impact of increasing the likelihood that low-ability examinees will obtain a correct response to difficult items or, conversely, that high-ability examinees will fail to correctly respond to an easy item. It was anticipated that guessing would be minimal, in that the distractors or foils were selected to represent common misconceptions among less knowledgeable examinees; a key goal of the distractor analysis described above was to identify incorrect options that "appeared" to be effective at separating more and less knowledgeable examinees. In the Rasch model, person and item fit statistics provide the best evidence of randomness in the data. Specifically, we focused on the mean square (MNSQ) of the infit and outfit statistics (Linacre, 2009). Values of MNSQ greater than 1.5 would be flagged as potential misfits, meaning that persons or items were a poor fit to the model (see Linacre, 2009).

Table III presents the infit and outfit, as well as summary statistics, for the test. The infit and outfit statistics measure the degree to which each item or person fits the model. In accordance with general practices, we placed more emphasis on item infit statistics in identifying misfitting items and persons (Bond and Fox, 2007). Infit is based on the chi-square statistic with each observation weighted by its statistical information (model variance) and is more sensitive to unexpected patterns of observations by persons on items that are roughly targeted on them (and vice versa). Outfit is based on the conventional chi-square statistic and is more sensitive to unexpected observations by persons on items that are relatively very easy or very hard for them (and vice versa). A small number of students with infit exceeding 1.50 were removed. No item had infit or outfit statistics exceeding the limit. For a more detailed discussion of the statistics in the table, see Linacre (2009).

TABLE III: Summary statistics for person ability and item difficulty.

                  Raw Score   Measure (θ)   Model Error   Infit MNSQ   Outfit MNSQ
Person Ability
  Mean               17.02        0.36          0.44          1.00          1.00
  SD                  6.03        1.08          0.08          0.13          0.24
  Max                29.00        3.60          1.02          1.51          2.67
  Min                 2.00       -2.85          0.39          0.73          0.30
Item Difficulty
  Mean              357.43        0.00          0.09          1.00          1.00
  SD                 85.58        0.73          0.01          0.11          0.18
  Max               508.00        1.52          0.11          1.25          1.44
  Min               176.00       -1.37          0.09          0.79          0.74
Dimensionality
The results of the principal component analysis of residuals (differences between expected and actual performance) from Rasch model predictions were used to determine if a common factor was present. If the model holds, no common factors would be present among the residuals. This approach is recommended by Bond and Fox (2007) and Linacre (2009).

The percentage of variance explained by person ability and item difficulty is consistent with predictions of the Rasch model, suggesting a good fit of the data to the model (Table IV). The percentage of unexplained variance in the data is also close in value to the level of randomness predicted by the model. The variance associated with the various contrasts is similar in magnitude and all are significantly smaller than that associated with items, a pattern that suggests that multidimensionality is not present in these data. For a more detailed discussion of the statistics in the table, see Linacre (2009).

TABLE IV: Standardized residual variance (in eigenvalue units).

                                      Eigenvalue   Empirical,        Empirical,             Modeled,
                                                   % of total        % of unexplained       % of variance
Total raw variance in observations       39.7        100%                                     100%
Variance explained by measures            9.7        24.5%                                    23.1%
Variance explained by persons             4.4        11.0%                                    10.4%
Variance explained by items               5.3        13.4%                                    12.7%
Raw unexplained variance (total)         30.0        75.5%              100%                  76.9%
1st contrast                              1.8         4.5%               5.9%
2nd contrast                              1.5         3.8%               5.0%
3rd contrast                              1.3         3.4%               4.5%
4th contrast                              1.3         3.2%               4.3%
5th contrast                              1.3         3.2%               4.3%
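The contrasts in Table IV come from a principal component analysis of the standardized Rasch residuals. A hedged sketch of that computation outside Winsteps is shown below; the exact variance-explained bookkeeping in Table IV follows Winsteps' own conventions, so this is an approximation intended only to show where the "contrast" eigenvalues come from.

```python
import numpy as np

def residual_contrast_eigenvalues(responses, theta, b):
    """Eigenvalues ('contrasts') of the correlation matrix of standardized Rasch
    residuals. Small, similar-sized leading eigenvalues (here roughly 1.3-1.8
    out of 30 items) are read as evidence against a secondary dimension."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))   # Rasch expected scores
    z = (responses - p) / np.sqrt(p * (1.0 - p))               # standardized residuals
    corr = np.corrcoef(z, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]
```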

Local Independence
Local independence was evaluated using a principal component analysis of the residuals. We also examined the relationship between item order and difficulty and item fit, as well as correlations among item residuals (see Linacre, 2009). Local independence was assumed in that the items were designed to be independent, stand-alone items. Analysis of correlations between standardized residuals for pairs of items showed that none of the correlations are large in value (with the largest being .20, well below the .70 criterion), indicating that local dependence is not a challenge for this instrument (Table V). For a more detailed discussion of the statistics in the table, see Linacre (2009).

TABLE V: Correlations between standardized residuals (local dependence).

  Correlation Between Standardized Residuals   Item Sequence Number   Item Sequence Number
   0.15                                               23                     27
  -0.20                                               14                     24
  -0.19                                                6                     26
  -0.15                                                7                     12
  -0.14                                                6                     27
  -0.14                                                7                     17
  -0.13                                               14                     23
  -0.13                                                5                     29
  -0.13                                               15                     21
  -0.13                                                4                     10

Rasch Item–Person Map
Figure 2 shows the item–person map for the 30 items included on the May 2009 field test. The map shows the range of person abilities on the left side of a vertical line and item difficulties on the right side of the line. Low ability/difficulty is represented at the bottom of the map and high ability/difficulty at the top. As can be seen on the map, the student abilities and the item difficulties have approximately normal distributions, and there is a high degree of overlap between ability and difficulty. The mean item difficulty appears to match the mean person ability, which means that a student of average ability has a 50% chance of answering an item of average difficulty correctly. Along most of the range of student ability, there are test items that provide a reliable measure of inquiry skills, but at the highest and lowest parts of the ability distribution there are few items.

FIGURE 2: Item–person map showing the distribution of student abilities on the left and item difficulties on the right. The letters and numbers on the right indicate items with associated inquiry skills.

Relationship to Other Variables Evidence
Evidence on relationships to other variables was determined via correlations of the iSA–Earth Science with the inquiry skills subtest scores in the Louisiana Educational Assessment Program (LEAP; Louisiana Department of Education, 2006). These scores are part of the student and school accountability program in Louisiana, and they carry significant consequences. Teachers on the development panel emphasized that evidence of a relationship with state-mandated tests is critical for building-level support of classroom use of the iSA–Earth Science. To investigate this, iSA–Earth Science scores were correlated with the inquiry skills subtests from the grade 7 and grade 8 LEAP tests. These correlations were .64 and .61, respectively. Both indicated that the iSA–Earth Science had a high level of association with these subtests.

Consequences of Testing Evidence
Differential item functioning (DIF) across gender was analyzed to determine if items differentially penalize female or male students. A difference in difficulty of .5 logits was the criterion used to determine if an item was problematic.

DIF results showed one item, Item 21, with a DIF contrast of .50 (Welch's t, p = 0.006). The value of the contrast suggests that Item 21 is more challenging for female examinees. However, this value is small and likely does not represent a notable difference.
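The gender DIF contrast reported above is the difference between an item's difficulty calibrated for female examinees and for male examinees. Winsteps computes this with anchored Rasch calibrations and a Welch t test; the sketch below is a crude stand-in that uses centered log-odds of the group proportions correct, which is enough to show how a .5-logit screening rule is applied.

```python
import numpy as np

def dif_contrasts(scored, is_female):
    """Approximate DIF contrast per item (logits): female difficulty minus male
    difficulty. |contrast| >= 0.5 logits was the screening criterion in the study."""
    def centered_difficulty(group):
        p = np.clip(group.mean(axis=0), 0.01, 0.99)   # proportion correct per item
        d = -np.log(p / (1.0 - p))                    # higher value = harder item
        return d - d.mean()                           # center so both groups share an origin
    return centered_difficulty(scored[is_female]) - centered_difficulty(scored[~is_female])
```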

Reliability and Overall Fit
To determine the extent to which the Rasch model provides an overall fit to the data, we focused on the item and person separation indexes and reliabilities. The separation index represents the spread of the abilities or difficulties and indicates the approximate number of different levels of difficulty or ability that can be reliably differentiated. A separation index greater than 1.5 is generally considered acceptable. We also examined the overall goodness-of-fit chi-square.

The person separation and reliability were estimated to be 2.15 and 0.82, and the item separation and reliability were estimated to be 7.50 and .98. The overall reliability of the measure, KR20, was estimated to be .84. These results suggest that the iSA–Earth Science has excellent reliability in measuring the latent variable of interest, science inquiry skill.
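The separation and reliability figures quoted here are related quantities and can be reproduced from the person (or item) measures and their standard errors; KR-20 is the internal-consistency coefficient for dichotomous items (equivalent to Cronbach's alpha in this case). The functions below follow the standard formulas (e.g., Bond and Fox, 2007) and are an illustrative sketch, not the project's analysis script.

```python
import numpy as np

def separation_and_reliability(measures, standard_errors):
    """Rasch separation index G and separation reliability from measures and SEs (logits)."""
    observed_var = measures.var(ddof=1)
    error_var = np.mean(standard_errors ** 2)
    true_var = max(observed_var - error_var, 0.0)
    separation = np.sqrt(true_var / error_var)    # values above 1.5 generally acceptable
    reliability = true_var / observed_var         # equivalently G**2 / (1 + G**2)
    return separation, reliability

def kr20(scored):
    """KR-20 from a 0/1 response matrix (persons x items)."""
    k = scored.shape[1]
    p = scored.mean(axis=0)
    total_var = scored.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)
```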
DISCUSSION AND CONCLUSIONS
This paper reports on the development and validation of iSA–Earth Science, a science inquiry skills assessment, in the context of middle school Earth science, that classroom teachers can use to assess and monitor their students' science inquiry skills. The construction and validation of the iSA–Earth Science followed an iterative and rigorous process using both classical psychometric analysis and Item Response Theory, based on the guidelines articulated in the Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999).

As recommended by the Standards, we examined five sources of validity evidence: content, response process, internal structure, relationship to other variables, and consequences of testing. The results of this study show that the iSA–Earth Science has strong validity evidence on content, response process, internal structure, relationship with other variables, and consequence of testing, as well as person and item reliability. The psychometric analyses using both the classical test development method and Item Response Theory via Rasch modeling showed that the instrument was internally consistent, reliable, and did not function differentially for male and female students.

One limitation identified in the item difficulty/person ability map is the limited number of items at the highest and lowest ends of the difficulty range. One of our future efforts is to develop more items at these two levels to better separate students at the highest and lowest ability levels.

A second issue concerns the question of dimensionality. The test specifications indicate a complex construct with several subdimensions. The empirical data consistently indicated one dominant underlying trait. While this supports use of the unidimensional Rasch model for these analyses, it does pose a question as to the utility of the different subscales. Our recommendation for the user is that the subscales be used cautiously, much like the standard-level scores that are commonly reported in the score reports classroom teachers receive from state testing programs. These scores tend to be based on few items and are best used as an "indicator" of a specific area in which a student may need instruction.

Although science inquiry has been the focus of science reform efforts for decades, teachers have found it difficult to implement effective inquiry-based science learning in the classroom (NRC, 2007). One of the reasons was a lack of understanding of what constitutes authentic and effective science practices. NRC (2000, 2007, 2012) advocates inquiry assessments as crucial instruments for aiding teachers in diagnosing their students' comprehension and competencies as well as providing feedback about their performance.

The iSA–Earth Science was developed based on a comprehensive assessment framework with six primary inquiry skills, each with three to six subskills, that are aligned with the National Science Education Standards and A Framework for K–12 Science Education (NRC, 2012). The six primary skills are: (1) identify questions for scientific investigations, (2) design scientific investigations, (3) use tools and techniques to gather data, (4) analyze and describe data, (5) explain results and draw conclusions, and (6) recognize alternative explanations and predictions. All assessment items are integrated within the context of middle school Earth science.

The iSA–Earth Science instrument has important pedagogical and research implications. It is hoped that inquiry skill assessment using a variety of Earth science inquiry scenarios will encourage teachers to better integrate inquiry with Earth science content and teach Earth science as inquiry rather than as an accumulation of information and facts only. When students are engaged in more inquiry-based science, they will be better motivated to learn science, become scientifically literate, and ultimately contribute to scientific discoveries in the future. An assessment that is not gender-biased will encourage both female and male students to achieve in science learning. Future studies should also examine other types of possible bias, such as race and ethnicity or urban and rural environments, to determine whether the instrument is equally valid in urban and rural environments and across racial and ethnic groups.

The iSA–Earth Science can also provide opportunities to evaluate the efficacy of alternative approaches to supporting inquiry-based science learning and of innovative preservice and inservice professional development programs. To date, professional development programs have typically been evaluated via end-of-training surveys that elicit perceived learning; it is important to measure actual learning. The PTI Inquiry Skill Framework as well as the iSA–Earth Science instrument can be used as a model to develop inquiry skill assessments that are embedded in other science content, such as biology, physics, and environmental science.

The iSA–Earth Science is available free for use at the PTI project Web site (http://scienceinquiry.cloudapp.net) by middle school Earth science teachers in their classrooms and by interested researchers. The online assessment system is easy to use and provides immediate feedback as well as archived results to students on their inquiry skills. Teachers can view basic and comparative student scores in a number of ways, including skills and subskills by class, individual student, and testing time. It is hoped that the results and feedback from iSA–Earth Science can help teachers and their students to better understand, track, and improve science inquiry skills. Teachers can also take the tests themselves to identify the strengths and limitations of their own science inquiry skills.

Acknowledgment
This research was supported by the National Science Foundation under Grant No. 0554556. Any opinions,

findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES
Alonzo, A.C., and Aschbacker, P.R. 2004, March. Value-added? Long assessment of students' scientific inquiry skills. Paper presented at the annual meeting of the American Education Research Association, San Diego, CA.
American Educational Research Association (AERA), American Psychological Association (APA), and National Council of Measurement in Education (NCME). 1999. Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Ault, C.R. 1998. Criteria of excellence for geological inquiry: The necessity of ambiguity. Journal of Research in Science Teaching, 35:189–212.
Barstow, D., and Geary, E., eds. 2002. Blueprint for change: Report from the national conference on the revolution in Earth and space science education. Cambridge, MA: Technical Education Research Centers. Available at http://www.earthscienceedrevolution.org/RevEarthSciEd.pdf (accessed 2 January 2015).
Beeth, M.E., Cross, L., Pearl, C., Pirro, J., Yagnesak, K., and Kennedy, J. 2001. A continuum for assessing science process knowledge in grades K–6. Electronic Journal of Science Education, 5(3). Available at http://wolfweb.unr.edu/homepage/crowther/ejse/beethetal.html (accessed 19 December 2014).
Biological Sciences Curriculum Study (BSCS). 1962. Processes of science test. New York: Psychological Corporation.
Black, P., and Wiliam, D. 2004. The formative purpose: Assessment must first promote learning. In M. Wilson, ed., Toward coherence between classroom assessment and accountability, Yearbook of the National Society for the Study of Education, Part II, 103(2):20–50.
Blanchard, P., and Lou, Y. 2009. Pathways to inquiry—Earth science: Online tools to strengthen classroom understanding and integration of science inquiry skills. Paper presented at the National Association for Research in Science Teaching Annual International Conference (NARST), Garden Grove, CA, April 17–21.
Bond, T.G., and Fox, C.M. 2007. Applying the Rasch model: Fundamental measurement in the human sciences, 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates.
Burns, J.C., Okey, J.R., and Wise, K.C. 1985. Development of an integrated process skill test: TIPS II. Journal of Research in
Greenfield, T.A. 1998. Gender- and grade-level differences in science interest and participation. Science Education, 81(3):259–276.
Haladyna, T.M. 1994. Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum Associates.
Hatton, J., and Plouffe, P.B. 1997. Science and its ways of knowing. Upper Saddle River, NJ: Prentice Hall.
Hoffmann, L. 2002. Promoting girls' interest and achievement in physics classes for beginners. Learning and Instruction, 12(4):447–465.
Jöreskog, K.G., and Sörbom, D. 1993. LISREL 8 [Computer software]. Chicago, IL: Scientific Software International.
Koballa, T.R., and Chiappetta, E.L. 2005. Science instruction in the middle and secondary schools: Developing fundamental knowledge and skills for teaching, 6th ed. Boston, MA: Allyn & Bacon.
Kutnick, P. 2000. Girls, boys and school achievement: Critical comments on who achieves in schools and under what economic and social conditions achievement takes place: A Caribbean perspective. International Journal of Educational Development, 20(1):65–84.
Lavinghousez, W.E., Jr. 1973. The analysis of the Biology Readiness Scale as a measure of inquiry skills required. Paper presented at BSCS Biology, College of Education, University of Central Florida, February 25, 1973.
Lewis, E. 2008. Content is not enough: A history of secondary Earth science teacher preparation with recommendations for today. Journal of Geoscience Education, 56(5):445–455.
Linacre, J.M. 2009. WINSTEPS Rasch measurement and analysis, Version 3.68.2 [Computer software]. Beaverton, OR: Winsteps.
Louisiana Department of Education. 2003. Louisiana grade level expectations. Available at http://www.louisianabelieves.com/docs/academics/grade-5.pdf?sfvrsn=2 (accessed 19 December 2014).
Louisiana Department of Education. 2006. LEAP assessment guide: English language arts, mathematics, science and social studies, grade 8 (LEAP science, grade 8), p. 3–4. Available at http://www.doe.state.la.us/lde/uploads/9844.pdf (accessed 7 December 2014).
Ludeman, R.R. 1975. Development of the science processes test (TSPT) [abstract, Ph.D. dissertation]. East Lansing, MI: Michigan State University. Retrieved from Dissertation Abstracts International, 36, 203–A.
McLeod, R.J., Berkheimer, G.D., Fyffe, D.W., and Robison, R.W. 1975. The development of criterion-validated test items for four integrated science processes. Journal of Research in Science
Science Teaching, 22(2):169–177. Teaching, 12:415–421.
Cooper, B.C., Shepardson, D.P., and Harber, J.M. 2002. Assess- Means, B., and Haertel, G. 2002. Technology supports for assessing
ments as teaching and research tools in an environmental science inquiry. In National Research Council, eds., Technol-
problem-solving program for in-service teachers. Journal of ogy and assessment: Thinking ahead—Proceedings of a
Geoscience Education, 50(1):64–71. workshop. Board on Testing and Assessment, Center for
Dietz, M.A., and George, K.D. 1970. A test to measure problem- Education, Division of Behavioral and Social Sciences and
solving skills in science of children in grades one, two, and Education. Washington, DC: National Academies Press.
three. Journal of Research in Science Teaching, 7(4):341–351. Molitor, L.L., and George, K.D. 1976. Development of a test of
Dillishaw, F.G., and Okey, J. 1980. Test of the integrated science science process skills. Journal of Research in Science Teaching,
process skills for secondary science students. Science Education, 13:405–412.
64(5):601–608. National Research Council (NRC). 1996. National science education
Downing, S.M., and Haladyna, T.M., eds. 2006. Handbook of test standards. Washington, DC: National Academies Press.
development. Mahwah, NJ: Lawrence Erlbaum Associates. National Research Council (NRC). 2000. Inquiry and the national
Fyffe, D.W. 1971. The development of test items for the integrated science education standards. Washington, DC: National
science processes: Formulating hypotheses and defining oper- Academies Press.
ationally [abstract, Ph.D. dissertation]. East Lansing, MI: National Research Council (NRC). 2007. Taking science to school:
Michigan State University. Retrieved from Dissertation Ab- Learning and teaching science in grades K–8. Washington,
stracts International, 32, 6823–A. DC: National Academies Press.
Germann, P.J., Aram, R., Odom, A.L., and Burke, G. 1996. Student National Research Council (NRC). 2012. A framework for K–12
performance on asking questions, identifying variables, and science education: Practices, crosscutting concepts, and core
formulating hypotheses. School Science and Mathematics, ideas. Committee on a Conceptual Framework for New K–12
96(4):192–201. Science Education Standards. Board on Science Education,
J. Geosci. Educ. 63, 73–85 (2015) Science Inquiry Skills Assessment 85

Division of Behavioral and Social Sciences and Education. State University. Retrieved from Dissertation Abstracts Inter-
Washington, DC: National Academies Press. national, 33, 6200A–6201A.
National Research Council (NRC). 2013. The next generational Roberts, R., and Gott, R. 2003. Assessment of biology investiga-
science standards. Washington, DC: National Academies tions. Journal of Biological Education, 37(3):114–121.
Press. Science Education for Public Understanding Project (SEPUP). 1995.
National Research Council (NRC). 2014. Developing assessment for Issues, evidence and you: Teacher’s guide. Berkeley, CA:
the next generational science standards. Washington, DC: University of California, Lawrence Hall of Science.
National Academies Press. Smith, K.A., and Welliver, P.W. 1990. The development of a science
Okey, J.R., Wise, K.C., and Burns, J.C. 1982. Test of integrated process assessment for fourth grade students. Journal of
process skills (TIPS II). Unpublished manuscript, Department Research in Science Teaching, 27(8):727–738.
of Science Education, University of Georgia, Athens, GA. Tannenbaum, R.S. 1971. Development of the test of science
Orion, N., and Kali, Y. 2005. The effect of an Earth-science learning processes. Journal of Research in Science Teaching, 8(2):123–136.
program on students’ scientific thinking skills. Journal of Tobin, K.G., and Capie, W. 1982. Development and validation of a
Geoscience Education, 53(4):387–393. group test of integrated science processes. Journal of Research in
Polit, D., and Beck, C. 2006. The content validity index: Are you Science Teaching, 19(2):133–141.
sure you know what is being reported? Research in Nursing and Walbesser, H.H. 1965. An evaluation model and its application.
Health, 29(5):489–497. AAAS Miscellaneous Publication No. 65-9. Washington,
Popham, W.J. 1980. Evaluation in education. Berkeley, CA: DC: The American Association for the Advancement of
McCutchan Publishing. Science.
Radford, D.L., DeTure, L.R., and Doran, R.L. 1992, March. A Walbesser, H.H., and Carter, H.L. 1970. The effects of test results of
preliminary assessment of science process skills of preservice changes in task and response format required by altering test
elementary teachers. A paper presented at National Associa- administration from an individual to a group form. Journal of
tion of Researchers in Science Teaching, Boston, MA. Research in Science Teaching, 7:1–8.
Riley, J.W. 1972. The development and use of a group process test Wenning, C. 2007. Assessing inquiry skills as a component of
for selected processes of the Science Curriculum Improvement scientific literacy. Journal of Physics Teacher Education Online,
Study [abstract, Ph.D. dissertation]. East Lansing: Michigan 4(2):21–24.

You might also like