An Investigation of the Effect of Task Type on the Discourse Produced by Students at Various Score Levels
in the TOEFL iBT® Writing Test. (2014). Knoch, U., Macqueen, S., & O’Hagan, S. TOEFL iBT Report No. TOEFL iBT-23.
Princeton, NJ: ETS.
Contextualizing Performances: Comparing Performances During TOEFL iBT® and Real-Life Academic
Speaking Activities. (2014). Brooks, L., & Swain, M. Language Assessment Quarterly, 11(4), 353–373.
From One to Multiple Accents on a Test of L2 Listening Comprehension. (In press). Ockey, G. J., & French, R.
Applied Linguistics.
Stakeholders’ Beliefs About the TOEFL iBT® Test as a Measure of Academic Language Ability. (2014).
Malone, M. E., & Montee, M. TOEFL iBT Report No. TOEFL iBT-22. Princeton, NJ: ETS.
Test-Takers’ Writing Activities During TOEFL iBT® Writing Tasks: A Stimulated Recall Study. (In press).
Barkaoui, K. TOEFL iBT Report. Princeton, NJ: ETS.
The Extent to Which TOEFL iBT® Speaking Scores are Associated with Performance on Oral Language
Tasks and Oral Ability Components for Japanese University Students. (2015). Ockey, G. J., Koyama, D.,
Setoguchi, E., & Sun, A. Language Testing, 32(1), 39–62.
Monitoring Students’ Progress in English Language Skills Using the TOEFL ITP® Assessment Series. (2014).
Choi, I., & Papageorgiou, S. TOEFL Research Memorandum RM 14–11. Princeton, NJ: ETS.
Are Teacher Perspectives Useful? Incorporating EFL Teacher Feedback in the Development of a
Large-scale International English Test. (2014). So, Y. Language Assessment Quarterly, 11(3), 283–303.
Developing and Validating Band Levels for Reporting Overall Examinee Performance. (In press).
Papageorgiou, S., Xi, X., Morgan, R., & So, Y. Language Assessment Quarterly.
www.toefl.org/toeflresearch
ETS — Listening. Learning. Leading.®
Copyright © 2015 by Educational Testing Service. All rights reserved. ETS, the ETS logo, LISTENING. LEARNING. LEADING., TOEFL, TOEFL IBT and TOEFL ITP are registered trademarks of Educational Testing Service
(ETS) in the United States and other countries. 29996
table of contents
11 The Davies Lecture
13 Cambridge/ILTA Distinguished Achievement Award Lecture
19 Conference Support
21 ILTA 2015
22 Award Winners
46 Paper Abstracts: Wed, Mar. 18
66 Paper Abstracts: Thurs, Mar. 19
123 Poster Abstracts
145 Work-in-Progress Abstracts
161 Index
1
Studies in Language Testing
An indispensable resource for anyone interested in
new developments and research in language testing
3
Canada’s premier English language
testing organization and proud
producer of the Tests.
INNOVATION
VALIDITY
FAIRNESS
ACCURACY
5
The Canadian Association of Language Assessment/
L'association canadienne pour l'évaluation des langues
www.cala-acel.org
welcomes you to

Official Co-hosts of LTRC 2015, March 16-20, Toronto

The Canadian Association of Language Assessment (CALA) was officially established in 2009. We have over 100 members from academia as well as other educational, institutional and government settings. The general objective of the Association is to promote best practices in language assessment design, practice, research, and scholarship in the Canadian context, in both academic and occupational/professional settings.

L'Association canadienne pour l'évaluation des langues (ACÉL) est un regroupement de plus de 100 chercheurs et d'autres professionnels dans le domaine d'évaluation de la langue au Canada. L'ACÉL existe depuis 2009. L'Association a pour but général de promouvoir des pratiques de qualité dans le domaine d'évaluation des langues au Canada, en tenant compte des domaines académiques ainsi que des domaines professionnels.
6
International Center for Language Research & Development

iSmart test

iSmart test is an integrated solution published by HEP in 2014 to meet the varied assessment requirements of colleges and universities. Drawing on experience of organizing Internet-based tests, iSmart caters to changes in language teaching and testing standards and integrates the latest cloud computing, speech recognition and artificial intelligence technology. System safety, usability, flexibility, and compatibility have been taken into full consideration.

System features
• A solution for daily homework, grading tests, mid-term exams, final exams and other assessment requirements
• Fully supports mock tests of written and computer-based examinations such as College English Test (CET) 4, CET-6, the National Business English Level 4 and Level 8 Examinations, the Practical English Test for Colleges Levels A and B, the National Vocational Skills Contest, PETS, BETS, IELTS, TOEFL iBT, etc.
• Supports automatic speech recognition for oral testing
• Powerful data analysis functions support deep mining of performance information to analyze students' knowledge and skills, supporting teaching, research and the evaluation of teaching reform
• Disciplines: English for non-English majors; English for English majors; English at postgraduate level
8
samuel j. messick memorial lecture
The proposed framework for validation research will meet two challenges: to exploit these rich connections among teaching, learning, and assessment and at the same time to seek parsimony so that the classroom perspective remains a useful tool for examining validity and is not overly broad.

Susan M. Brookhart, Ph.D., is an independent educational consultant and author based in Helena, Montana. She also currently serves as a Senior Research Associate in the Center for Advancing the Study of Teaching and Learning in the School of Education at Duquesne University, where she was a full-time faculty member for fourteen years. Dr. Brookhart's research interests include the role of both formative and summative classroom assessment in student motivation and achievement, the connection between classroom assessment and large-scale assessment, and grading. Dr. Brookhart was the 2007-2009 editor of Educational Measurement: Issues and Practice, a journal of the National Council on Measurement in Education. She is author or co-author of sixteen books and over 70 articles and book chapters on classroom assessment, teacher professional development, and evaluation. She serves on several editorial boards and national advisory panels. She has been named the 2014 Jason Millman Scholar by the Consortium for Research on Educational Assessment and Teaching Effectiveness (CREATE).
9
English courses
and testing aligned
to a single scale:
the Global Scale
of English
The Global Scale of English is a standardised granular scale from 10 to 90 for measuring English language proficiency. It forms the backbone of Pearson English course materials and assessment, including Progress, the new fully automated online test. Progress accurately measures student progression in English, highlighting strengths and weaknesses to inform teaching.
Visit English.com/GSE or PearsonELT.com/Progress to learn more.
the davies lecture
Thurs, Mar 19
Connecting writing assessments to teaching and
learning: Distinguishing alternative purposes
9:00-10:00am Churchill Ballroom
Alister Cumming
Centre for Educational Research on Languages and Literacies (CERLL)

Four conceptually-distinct options exist for relating assessments of writing to teaching and learning in programs of language education, each fulfilling different but interrelated purposes. The assessment purposes may be normative, diagnostic, formative, or summative. The options are (a) proficiency tests and curriculum standards, each based on different kinds of normative principles and data, (b) diagnostic and dynamic assessments focused on individual learners and their learning potential within a specific educational context, (c) responding to students' written drafts for formative purposes of informing their improvement, and (d) grades or local tests of summative achievements in a particular course.

Alister Cumming, Ph.D., is a professor in the Centre for Educational Research on Languages and Literacies (CERLL, formerly the Modern Language Centre) at the Ontario Institute for Studies in Education, University of Toronto, where he has been employed since 1991 following briefer periods at the University of British Columbia, McGill University, Carleton University, and Concordia University. His research and teaching focus on writing in second languages, language assessment, language program evaluation and policies, and research methods. Alister's recent books are Agendas for Language Learning Research (with Lourdes Ortega and Nick Ellis, Wiley-Blackwell, 2013), Adolescent Literacies in a Multicultural Context (Routledge, 2012), A Synthesis of Research on Second Language Writing in English (with Ilona Leki and Tony Silva, Routledge, 2008), and Goals for Academic Writing (John Benjamins, 2006).
11
cambridge/ilta distinguished achievement award
Fri, Mar 20
Turning points: Implications for the training of language testers
9:00-10:00am Churchill Ballroom
Tim McNamara
The University of Melbourne

The trajectory of research in any field is a mixture of opportunities created by broader developments in theory and practice and the chance elements of individual histories. I try to illustrate this by looking back at turning points in my own career as a language tester as it unfolded in the context of broader trends in communicative language testing, performance assessment, the testing of languages for specific purposes, the development of new measurement techniques, particularly Rasch measurement, the discursive and social turn in applied linguistics and language testing, and the challenge presented by the relatively new field of English as a lingua franca. What enables a language tester to grow as a researcher in an unpredictable and constantly evolving intellectual and research environment? Good luck, certainly, particularly the luck of encountering and learning from the work of colleagues who act as role models and mentors and who are leading the exploration of points of significant change. But also openness to the unexpected, underpinned by intellectual interests that go beyond the specific familiar topics of a single research field. I argue that breadth as well as depth of intellectual preparation, and the encouragement to think laterally, should be emphasized in the education of the next generation of language testing researchers.

Tim McNamara has taught Applied Linguistics at Melbourne since 1987. He helped establish the graduate program in Applied Linguistics from 1987, and with Professor Alan Davies founded the Language Testing Research Centre. His language testing research has focused on performance assessment, theories of validity, the use of Rasch models, and the social and political meaning of language tests. Tim has acted as a consultant with Educational Testing Service, Princeton, where he worked on the development of the speaking sub-test of TOEFL iBT; he was also part of the team involved in the original development of IELTS. His work on language and identity has focused on poststructuralist approaches to identity and subjectivity. Tim is the author of Measuring Second Language Performance (Longman, 1996), Language Testing (OUP, 2000) and co-author (with Carsten Roever) of Language Testing: The Social Dimension (Blackwell, 2006). He is currently working on a book entitled Language and Subjectivity, to be published by De Gruyter. Tim has been elected 2nd Vice-President of the American Association for Applied Linguistics (AAAL) and will be Conference Chair for the 2017 conference.
13
conference sponsors
Paragon
www.paragontesting.ca
Local Organization Sponsor

ETS TOEFL and TOEIC
www.ets.org
Preconference Workshop Refreshments
Conference Badges
Conference Bags
Conference T-shirts
Subsidy for Student Registration
Subsidy for Student Volunteers
14
conference sponsors
Foreign Language Teaching and Routledge
Research Press www.routledge.com/
http://en.fltrp.com/ Refreshment Breaks (March 18)
Refreshment Breaks (March 19)
15
LTRC 2015
Find out why so many organizations trust
CaMLA for their language assessments
CambridgeMichigan.org
© CaMLA 2015
conference exhibitors
Paragon
www.paragontesting.ca

ETS TOEFL and TOEIC
www.ets.org

Touchstone Institute
www.touchstoneinstitute.ca

Routledge
www.routledge.com/
17
CHINA LANGUAGE ASSESSMENT
中 国 外 语 测 评 中 心
In the Service of All Learners
Co-established by
中国外语教育研究中心
NATIONAL RESEARCH CENTRE FOR FOREIGN LANGUAGE EDUCATION
www.fltrp.com www.sinotefl.org.cn
conference support
LTRC Organizing Committee 2015
Chair: Liying Cheng, Faculty of Education, Queen's University
Local Chair: Khaled Barkaoui, Faculty of Education, York University
Beverly Baker, University of Ottawa
Christine Doe, Mount Saint Vincent University
Heike Neumann, Concordia University

LTRC 2015 Website
Tom Fullerton
Beverly Baker

LTRC 2015 Program Book and Logo Designer
Kris Pierre Johnston, kpj@yorku.ca

LTRC 2015 Program Preparation Support
Jessica Chan
Jia Ma
Yi Mei
Cheng Zhou

LTRC 2015 Session Chairs
Poster chair: Yong-Won Lee
WIP chair: Lorena Llosa

Professional Events Chairs
Publishing: Guoxing Yu
Career: Eunice Jang

Abstract Reviewers
Beverly Baker, Jayanti Banerjee, Liying Cheng, Ikkyu Choi, Christian Colby, Alister Cumming, Fred Davidson, Larry Davies, Barbara Dobson, Samira ElAtia, Janna Fox, April Ginther, Anthony Green, Luke Harding, Ching-Ni Hsieh, Ari Huhta, Talia Isaacs, Eunice Jang, Yan Jin, Dorry Kenyon, Yong-won Lee, Lorena Llosa, Sari Luoma, Lynnette May, Heike Neumann, Spiros Papageorgiou, Lia Plakans, David Qian, John Read, Daniel Reed, Steven Ross, Shahrzad Saif, Yasuyo Sawaki, Toshihiko Shiotse, Youngsoon So, Charles Stansfield, Lynda Taylor, Carolyn Turner, Elvis Wagner, Yoshinori Watanabe, Sara Weigle, Xiaoming Xi, Guoxing Yu, Ying Zheng

Volunteers
Farhana Ahmed, Shaheda Akter, Poonam Anand, Bo Bao, Esther Bettney, Tina Beynen, Qinghui Chen, Andrew Coombs, Shaima Dashti, Jessica Freitag, Wenqian Fu, Valerie Kolesova, Jeannie Larson, Zhi Li, Jia Ma, Yi Mei, Jason Yoongoo Nam, Gina Park, Shujiao Wang, Yiqi Wang, Yan Wei, Yongfei Wu, Luna Yang, Cheng Zhou
19
IELTS
Research
Grants –
IELTS Joint Funded
Research Program 2015
apply now!
Educational institutions and suitably qualified
individuals are invited to apply for funding
to undertake applied research projects in
relation to IELTS for a period of one or two
years. Projects will be supported up to a
maximum of £45,000/AU$70,000.
21
award winners
Cambridge/ILTA Distinguished Achievement Award 2015
Tim McNamara, University of Melbourne, AUS

ILTA Best Article Award 2012
Harsch, C. & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17, 228-250.

ILTA Best Article Award 2013

Caroline Clapham IELTS Masters Award

Jacqueline Ross TOEFL Dissertation Award 2015
Hans Rutger Bosker, Utrecht University, NLD. The Processing and Evaluation of Fluency in Native and Non-Native Speech.
Supervisors: Dr. Nivja H. de Jong & Dr. Hugo Quené

TOEFL Small Grants for Doctoral Research in Second or Foreign Language Assessment 2014
Recipient: Khoi Ngoc Mai
Institution: The University of Queensland
22
TIRF Doctoral Dissertation Award

Justin Cubilo, University of Hawaii at Manoa, US. Video-Mediated Listening Passages and Typed Notetaking: Investigating Their Impact on Comprehension, Test Structure, and Item Performance.
Supervisor: James Dean Brown

Iftikhar Haider, University of Illinois at Urbana-Champaign. Re-Envisioning Assessment of Interlanguage Pragmatics (ILP) through Computer Mediated Communicative Tasks.

Jia Ma, Queen's University, CAN. Chinese Students' Journey to Success on High Stakes English Language Tests: Nature, Effects and Values of Their Test Preparation.
Supervisor: Liying Cheng

Jing Wei, New York University, US. Assessing Speakers of World Englishes: The Roles of Rater Language Background, Language Attitude and Training.
Supervisor: Lorena Llosa
Canadian Academic
English Language (CAEL) Assessment
• Recognized as a reliable assessment of English language
proficiency for over 20 years.
• Tests a student’s ability to use English in an academic
context.
• Closely resembles the experience of being in a tertiary
level classroom where the readings, lecture and the
writing components are all on the same topic.
• Presents a constructed-response question format with
tasks common to the Canadian university context.
• Item types include: filling in the blanks; labeling
diagrams; completing tables; note-taking; short answer
and longer essay.
• Promotes the interests of higher education in Canada.
• Accepted across Canada and in a growing number of European and American institutions.
www.cael.ca
23
Université d’Ottawa | University of Ottawa
OLBI
Proud leader in the assessment of Canada’s official languages
Contact Information: 613-562-5743 | l2test@uOttawa.ca
ILOB
À la fine pointe de l’évaluation des langues officielles du Canada
Contactez-nous : 613-562-5743 | l2test@uOttawa.ca
26
conference overview
27
York University’s Graduate
Program in Education
Language, Culture and Teaching
Our unique program will allow you to study education within the broad fields of Language,
Culture and Teaching. As a student in our program, you will share our commitment to the
interdisciplinary study of education through rigorous intellectual inquiry.
Master of Education
Doctor of Philosophy
Specialized Diplomas
Small classes allow for individual attention and collaboration with peers.
www.edu.yorku.ca
WORLD CONGRESS of modern languages
CONGRÈS MONDIAL des langues vivantes
www.caslt.org/WCML-CMLV-2015
30
conference schedule
Tues, Mar 17
8:30am-6:30pm Conference Registration Churchill Court
12:00-1:00pm Lunch
31
Wed, Mar 18
8:30am-5:00pm Conference Registration Churchill Court
32
10:20-10:45am Llosa, Malone, Wei, Donovan, & Stevens: Comparability of writing tasks in TOEFL iBT and university writing courses: Insights from students and instructors
11:20-11:45am Brau & Brooks: Testing the right skill: The misapplication of reading scores as a predictor of translation ability
3:30-3:55pm Youn: The interlocutor effect in paired speaking tasks to assess interactional competence
Churchill Ballroom A
33
3:00-3:25pm Phakiti: Structural equation models of calibration, performance appraisals, and strategy use in an IELTS listening test
34
6:30-7:30pm Professional Event 1: Publishing in Language Assessment
Special Sponsor: Higher Education Press
Mountbatten Salon
Chair: Guoxing Yu
Guoxing Yu, Executive Editor, Assessment in Education (Taylor & Francis)
Liz Hamp-Lyons, Editor, Assessing Writing (Elsevier)
James Purpura, Editor, Language Assessment Quarterly (Taylor & Francis)
April Ginther, Co-Editor, Language Testing (Sage)
Fred Davidson, Distinguished Editorial Adviser, Language Testing in Asia (Springer)
Thurs, Mar 19
8:30am-5:00pm Conference Registration Churchill Court
10:00-10:20am Break
10:50-11:15am Saito: Junior and senior high school EFL teachers' practice of formative assessment: A mixed method study
Churchill Ballroom A
35
10:20-10:45am Volkov, Stone & Gesicki: Empirically derived rating rubric in a large-scale testing context
10:50-11:15am Wu & Wu: Constructing a common scale for a multi-level test to enhance interpretation of learning outcomes
Churchill Ballroom B
3:00-3:25pm Knoch, Macqueen, May, Pill, & Storch: Test use in the transition from university to the workplace: Stakeholder perceptions of academic and professional writing demands
3:30-3:55pm Kim & Billington: Tackling the issue of L1 influenced pronunciation in English as a lingua franca communication contexts: The case of aviation
Churchill Ballroom A
36
3:00-3:25pm Papageorgiou & Ockey: Does accent strength affect performance on a listening comprehension test for interactive lectures?
3:30-3:55pm Min & He: Examining the effect of DIF anchor items on equating invariance in computer-based EFL listening assessment
Churchill Ballroom B
37
6:30-7:30pm Professional Event 2: Careers in language assessment
Mountbatten Salon
Chair: Eunice Jang
Ardeshir Geranpayeh, Cambridge English Language Assessment
Jonathan Schmidgall, Educational Testing Service
May Tan, Canadian Armed Forces
Eunice Jang, OISE/University of Toronto
Jake Stone, Paragon Testing Enterprises (UBC)
Fri, Mar 20
8:30am-5:00pm Conference Registration Churchill Court
10:00-10:20am Break
38
10:20-10:45am Huhta, Alderson, Nieminen, & Ullakonoja: Diagnostic profiling of foreign language readers and writers
10:50-11:15am Jin, Zou, & Zhang: What CEFR level descriptors mean to college English teachers and students in China
Churchill Ballroom B
39
1:15-2:40pm Papers Session 6 (3 Parallel)
2:40-3:00pm Break
40
3:00-4:30pm Symposium Session 4 (3 Parallel)
Located in Toronto, Canada's largest city, the Graduate Program in Linguistics and Applied Linguistics at York University is well known for the excellence of its faculty, students and teaching. Faculty research and supervision interests cover a broad spectrum of areas in the two offered fields of Linguistics and Applied Linguistics.

We offer a one-year M.A. and a four-year Ph.D. program with courses and specialization in either Linguistics or Applied Linguistics.

In Linguistics, students can focus on sociolinguistics (language variation and change, discourse analysis), formal linguistics (phonology, syntax), or psycholinguistics.

In Applied Linguistics, students can focus on second language education and assessment, language policy and planning, narrative studies, as well as multi-modal and e-learning.
41
Workshop Abstracts
WORKSHOP 1 (2 days)

Mon, Mar 16 & Tues, Mar 17

Working with (and teaching) language test specifications for large-scale testing and classroom assessment

Fred Davidson, Youngshin Chi, Sun Joo Chung & Stephanie Gaillard

A test specification is a generative document from which many equivalent test tasks or items can be produced. It is a well-established classical tool in test development. "Specs" are relevant in all language testing contexts: from small-scale classroom-based assessment to large high-stakes testing programs. Recent years have seen the emergence of empirical research and statements of theory about the use of specs in language testing, inter alia:

• Chi (2011) featured test specs in her exploration of factors in the assessment of listening
• Chung (2014) reported on the evolution and changes in a major high-stakes writing placement test, for which test specs (and their changes) provided key documentary evidence
• Davidson (2012a) and Davidson (2012b) are recent statements of theoretical considerations in specs in language testing
• Gaillard (2014) utilized specs in a central way during her development of an elicited imitation task for the assessment of French (for both practical and research purposes)
• Li (2006) investigated how specs change over time and, in so doing, explored the metaphor of an "audit" of the evolution of test specs
• Kim, J.T. (2006) employed test-takers in the development of his test, and specs were the common vehicle by which all parties communicated
• Kim, J. et al. (2010) reported on a spec management project with a group of trained item writers who were producing operational test items and tasks

Furthermore, personal experience (Davidson) and anecdotal reports (from other language testers) suggest that test specs are a useful feature in the teaching of language testing and the growth of assessment literacy.

The workshop is designed for maximum applicability to many language test settings, from classroom testing, to medium-stakes or institutional testing, to national and major high-stakes assessment projects. The goals of this workshop are to:

1. Teach about language test specs
2. Critically analyze and obtain feedback on specs brought by workshop participants (a "spec clinic")
3. Practice with spec editing and revision
4. Explore and expand spec theory (Davidson 2012b)
5. Explore how specs apply to different settings
6. Provide information and tips on how to teach specs

The two days will be organized along themes as shown here: (a) Overview and basic theory of specs, (b) Exploration of various models of test specs (this includes the bring-your-own Spec Clinic), (c) The role of test specs in evolutionary test change and validation arguments, (d) Use of test specs in development of research instruments, (e) Management of operational test-spec driven projects, (f) Theory, teaching (of specs), and wrap-up.

Workshop activities will follow a general model as follows: presentation of specs (e.g. samples brought by participants and published
42
samples), then critical discussion of the samples (in small groups or as a whole), then revision of the specs (with justification as to why), capped by reflection about the revision and its potential contribution to test validity (Davidson 2012a; Li, 2006). Depending on the goal and theme of the moment, this general activity model may be tailored to a specific need -- e.g. in the theme of management of operational spec-driven testing projects, we will also explore matters of budgeting and organizational management of item/task production and how revision of a given spec should take those matters into account.

Fred Davidson is a Professor Emeritus of Linguistics at the University of Illinois at Urbana-Champaign. He currently resides in Phoenix, Arizona. His interests include language test specifications and the history and philosophy of testing.

Youngshin Chi is an independent consultant working in the United States. She received her PhD in Educational Psychology from the University of Illinois at Urbana-Champaign. She has research interests in test development, test validation, and second language listening.

Sun Joo Chung is a Lecturer at Hankuk University of Foreign Studies in Seoul, Korea. She received her PhD in Educational Psychology from the University of Illinois at Urbana-Champaign. She has research interests in placement testing, quality management in language assessment programs, and second language writing.

Stephanie Gaillard is an Assistant Professor of French and the language program coordinator at Louisiana State University. She received her PhD in French with a concentration in Second Language Acquisition and Teacher Education at the University of Illinois at Urbana-Champaign. Her research interests include test validation, placement testing, and test specifications, especially their impacts on second language classroom curricula.

WORKSHOP 2 (1 day)

Mon, Mar 16

Using Rasch analysis to gain insights into test development

Rita Green

This one-day workshop has two main aims: firstly, to provide a basic introduction to the insights that Rasch analysis can offer test developers; and secondly, to encourage participants to use such procedures to help them better understand the tests they have developed. Participants will learn how to construct and run a Winsteps control file based on a set of raw data. Item difficulty and person ability estimates, person and item reliabilities, and the amount of error associated with each item and person will be explored. In addition, distracter performance will be investigated. Participants will be expected to bring their own laptop to the workshop. Information about the availability of the Winsteps software (Linacre) will be provided in due course. This workshop is for Rasch beginners.

Dr. Rita Green has been involved in the field of language testing for more than 30 years, as a trainer and adviser to many projects around the world. She has run numerous seminars and workshops in Europe, Africa, Asia and South America. She is a Visiting Teaching Fellow in the Department of Linguistics & English Language at Lancaster University, where she has been the Director of a four-week course in language testing and statistics for language testers since 2001. She is also an EALTA Expert Member. She is the author of Statistical Analyses for Language Testers (2013, Palgrave Macmillan, ISBN 9781137018281) and has presented her work at LTRC and EALTA conferences.
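For readers new to Rasch analysis of the kind the Winsteps workshop above introduces, the core idea (item difficulty and person ability estimated jointly on one logit scale) can be sketched in a few lines of Python. This is only an illustration, not Winsteps and not the presenters' material: the names `rasch_p` and `jml_estimate` are hypothetical, and the crude gradient-based joint maximum-likelihood loop merely stands in for Winsteps' estimation routines.

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch model: probability of a correct response
    for a person of ability theta on an item of difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def jml_estimate(data, iters=200, lr=0.1):
    """Crude joint maximum-likelihood sketch: alternate gradient-ascent
    steps on person abilities and item difficulties. data[p][i] is 0 or 1."""
    n_persons, n_items = len(data), len(data[0])
    theta = [0.0] * n_persons  # person abilities
    b = [0.0] * n_items        # item difficulties
    for _ in range(iters):
        for p in range(n_persons):
            grad = sum(data[p][i] - rasch_p(theta[p], b[i]) for i in range(n_items))
            theta[p] += lr * grad
        for i in range(n_items):
            grad = sum(rasch_p(theta[p], b[i]) - data[p][i] for p in range(n_persons))
            b[i] += lr * grad
        mean_b = sum(b) / n_items  # anchor the scale: mean item difficulty = 0
        b = [x - mean_b for x in b]
    return theta, b

# Toy data: 4 test takers x 3 items. Item 2 is answered correctly
# least often, so it should receive the largest difficulty estimate.
data = [[1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [1, 1, 0]]
theta, b = jml_estimate(data)
print(b[2] > b[1] > b[0])  # → True: harder items get larger difficulty estimates
```

On real response data one would of course use dedicated software (Winsteps, or an IRT package in R or Python); the sketch only shows that abilities and difficulties end up on a single logit scale, which is what makes the person-item comparisons explored in the workshop possible.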
43
WORKSHOP 3 (2 half days)

Mon, Mar 16 & Tues, Mar 17

Learning oriented language assessment for the classroom: A primer

Anthony Green & Liz Hamp-Lyons

This workshop will introduce participants to the principles which are emerging for learning-oriented assessment (LOA), focusing in particular on learning-oriented language assessment (LOLA) by teachers in the classroom.

Outline:
1. Principles of LOLA
2. Role of teachers in diagnosing learners' needs for LOLA-focussed instruction
3. LOLA tasks
4. LOLA-oriented feedback
5. Helping students to self and peer assess

The purpose is to introduce the concept of learning oriented language assessment as a means of improving the coherence of educational assessment systems that involve both assessment by teachers in the classroom and large scale external testing. The workshop will a) explore how teachers can use self-designed assessments to provide feedback to students and diagnostic evidence to guide instructional practices and b) consider how external tests – often regarded as a threat to preferred classroom practices – might be more effectively integrated into local curricula.

This workshop begins with an overview of the formative functions of learning oriented language assessment, presenting sample performance assessment activities that illustrate key concepts such as task authenticity, cognitive engagement and uptake of feedback. Tasks from large-scale language tests will then be evaluated as tools for LOLA, and participants will explore the apparent tensions between test preparation, test administration and learning processes. Participants will explore the differences between formal testing and classroom assessment, consider how formal tests might be reformed to become valuable learning experiences and how tests might be better integrated into the teaching and learning cycle.

Goal: To encourage debate about the roles of assessment and testing in educational programs. By the end of the workshop, participants will:
• Understand the key features of learning oriented language assessment
• Understand the role of cognitive engagement in assessment by learners and teachers
• Be equipped to evaluate language test tasks as tools for effective learning
• Build their awareness of how external tests might be more effectively integrated with teaching and learning

Anthony Green is Deputy Director of the Centre for English Language Learning and Assessment at the University of Bedfordshire (UK). His publications include the books Exploring Language Assessment and Testing (Routledge), Language Functions Revisited and IELTS Washback in Context (both Cambridge University Press). He has extensive experience in training teachers in language assessment, including acting as the academic coordinator of the EU funded ProSET project carried out in partnership with the Ministry of Education of the Russian Federation and providing training in assessment for secondary school teachers nationwide. He has carried out assessment training workshops in over 30 countries.

Liz Hamp-Lyons is Visiting Professor at CRELLA (English Language Assessment) and at Shanghai Jiao Tong University, China (International Studies). Her research encompasses development and validation of English language performance (i.e. writing and speaking) assessments, classroom assessment and assessment for learning as well as EAP materials and methods. She is the Editor of Assessing Writing and joint Editor of the Journal of English for Academic Purposes.
44
WORKSHOP 4 (1 day)

Tues, Mar 17

Situating mixed methods research in language assessment: Evolution and practice

Carolyn E. Turner & Janna Fox

Mixed methods research (MMR) design, which draws on both quantitative and qualitative approaches in addressing a research problem, is increasingly evident in studies within language testing and assessment (hereafter, LT). This is in keeping with the general trend toward MMR, which is evident in: 1) the increasing number of prominent publications describing MMR procedures, pitfalls, and potentials (e.g., Creswell & Plano Clark, 2011; Tashakkori & Teddlie, 2010); 2) the successful launch of the Journal of Mixed Methods Research in 2007, which now ranks 9 out of 92 Social Science/Interdisciplinary journals; 3) the number of doctoral dissertations which apply MMR approaches in addressing research problems and questions (Cheng & Fox, 2013); and 4) revisions taking place in textbooks used in social science research (e.g., McMillan & Schumacher, 2010), which in more recent editions place MMR design alongside quantitative and qualitative as a separate, but equally important, approach.

Arguably, the original challenges to MMR (see the paradigm wars, e.g., Caracelli & Greene, 1997) have ceased to be the focus of debate, as MMR gained legitimacy as a pragmatic (Tashakkori & Teddlie, 2010) alternative to purely quantitative or qualitative research approaches for all but the most rigidly conservative researchers. Indeed, some have suggested (e.g., Creswell, 2012) that MMR is a natural and inevitable extension of a priori triangulating strategies in research, which seek alternative perspectives or sources of evidence to cross-validate findings. Within LT, the overriding concerns for validity and validation in assessment practices over the past decades (Cronbach, 1988; Kunnan, 2000; Messick, 1989), and the concomitant need for convergent sources of evidence in arguments for test interpretation, mesh well with an MMR perspective. Increasing recognition of the role of context (see Fox, 2003; McNamara & Roever, 2007), the greater presence of qualitative approaches in LT research, and the need to address ever more complex research questions have led to greater use of and interest in MMR approaches in LT. As Turner (2013) points out, researchers in LT are latecomers to considerations of MMR issues, but their “rationale for the use of MMR follows closely the philosophical orientation most often associated with it, that is, pragmatism” (p. 1403).

Carolyn E. Turner, PhD, is Associate Professor of Second Language Education at McGill University, where she teaches assessment and research methods. Her research interests include language testing/assessment in educational settings and in healthcare contexts concerning access for linguistic minorities. A former President of the International Language Testing Association (ILTA) and Associate Editor of Language Assessment Quarterly, her work appears in journals such as Language Testing, TESOL Quarterly, and Health Communication. She is currently co-authoring a book, “Learning-oriented L2 assessment,” with James Purpura. Recently she published chapters on mixed methods research (MMR) in the Companion to Language Assessment and the Encyclopedia of Applied Linguistics.

Janna Fox, PhD, is Associate Professor and Director of the Language Assessment and Testing Research Unit in the School of Linguistics and Language Studies at Carleton University, where she teaches courses on research methods to graduate students in Applied Linguistics. Her research in language policy, curriculum, and assessment has increasingly drawn on mixed methods approaches to research, when neither a solely quantitative nor qualitative approach was adequate in addressing a research problem. She is currently co-editing a book (with Vahid Aryadoust) entitled Current Trends in Language Testing in the Pacific Rim and Middle East for Cambridge Scholars Press.
Paper abstracts
Wed, Mar 18

10:20-10:45am

Impact of an international English proficiency test on young learners—voices from learners and parents

Xiangdong Gu, Chongqing University, China / Cambridge English, UK, xiangdonggu@263.net
Qiaozhen Yan, Chongqing University, qiaozhencqu@163.com
Jie Tian, Chongqing University, China, tianjiemolly@qq.com

points out that it may be necessary to call on surrogate stakeholders, for example, parents, to speak on behalf of their children. Indeed, there is emerging evidence that parents play a key role in their children’s school attainment (Feinstein et al., 2010). Previous studies (e.g. European Commission, 2012) also highlight the necessity of investigating parents’ perspectives on their children’s English language learning. Therefore, in our study, semi-structured interviews were also conducted with twenty parents individually to gain deeper information about the impact of KET for Schools on young learners. The findings of the student survey revealed that KET for Schools has exerted a strongly positive impact on young learners’ learning motivation, self-awareness of learning strengths and weaknesses, and the improvement of their English proficiency. Remarkably, the findings of the parent interviews were highly consistent with
promote young learners’ English language learning with the help of a proficiency test.

Assessment literacy of foreign language teachers: Research, challenges and future prospects

Dina Tsagari, University of Cyprus, dinatsa@ucy.ac.cy
Karin Vogt, University of Heidelberg, vogt@ph-heidelberg.de

analysed using descriptive and inferential statistics (frequency distributions, correlation analyses, regression analysis) and qualitative content analysis (following Mayring, 2010). Despite the small differences across countries, the results show that only certain elements of teachers’ LAL expertise are developed, such as testing microlinguistic aspects (grammar/vocabulary) and language skills (e.g. reading and writing). LAL aspects such as compiling and designing non-traditional assessment methods (e.g. self-/peer-assessment or portfolio assessment), grading and placing students onto courses, as well as establishing quality criteria of assessments (e.g. validity, reliability or using statistics), are not. To compensate for insufficient training, teachers reported that they learn about LAL on the job or use teaching materials for their assessment purposes. Teachers overall expressed a need to receive training across the range of LAL features identified in the study, with varying priorities depending on their local educational contexts. The results of the study offer insights into teacher cognition of their LAL needs and outline pressing challenges which exist for fostering LAL among

Although many studies have been conducted to validate the writing tasks on various versions of the TOEFL, none have explored the comparability of students’ performance on TOEFL writing tasks and actual academic writing tasks. Investigating this comparability is essential for providing backing for the extrapolation inference in the TOEFL validity argument. We conducted a study that examines the comparability of TOEFL iBT writing tasks and tasks assigned in required university writing courses.
Because such courses constitute a graduation requirement for students in many North American universities, they represent an important part of the TLU domain. Specifically, we investigated whether students viewed their performance on the TOEFL iBT writing tasks as comparable to their writing performance in their required writing class. We also examined instructors’ and students’ perceptions of the comparability of the characteristics of the TOEFL writing tasks and rubrics and those assigned in required writing classes. One hundred international, nonnative-English-speaking, undergraduate students enrolled in required university writing courses from eight universities in the U.S. and their writing instructors (N=15) participated in the study. During a two-hour data collection session, students completed a background questionnaire, two TOEFL iBT tasks, one Integrated and one Independent, and a post questionnaire about their perceptions of the TOEFL writing tasks, the level of correspondence between the TOEFL writing tasks and the tasks they encounter in required university writing courses, and their views about the extent to which the TOEFL writing tasks measure their writing abilities accurately. Instructors also completed a questionnaire that included screenshots of the TOEFL writing tasks used in the study and the TOEFL writing rubrics; instructors were asked to comment on the degree of comparability between the characteristics of the TOEFL tasks and rubrics and those of the writing tasks they assign in their courses. Preliminary analyses of the student questionnaire data indicate that the majority of students believed that their performance on the TOEFL tasks was similar to or worse than their performance on tasks in their writing class. Also, 60% of students reported that the Independent task most resembled the types of tasks they encountered in their writing class. By contrast, only 17% of students thought the Integrated task was most comparable. In general, writing instructors reported that the TOEFL tasks only partially resembled the types of writing they assign, with the Independent task perceived as more similar by many of the instructors—an interesting finding considering that integrated tasks are typically believed to be more similar to the types of tasks that students will likely encounter in academic settings. Even though instructors believed that the TOEFL tasks only partially resembled the tasks they assigned, most of them reported that the criteria in the TOEFL rubrics are similar to the criteria they use to assess student writing. Detailed findings of quantitative and qualitative data from the student and teacher questionnaires will be presented, and implications of the findings for the validity of score interpretations and the nature of the TLU domain (writing courses vs. content courses) will be discussed.

Evaluating and revising a reading test blueprint through multiple methods: The case of the CELPIP-G

Maryam Wagner, OISE/University of Toronto, maryam.wagner@utoronto.ca
Michelle Yue Chen, The University of British Columbia / Paragon Testing Enterprises, mchen@paragontesting.ca
Gina Park, OISE/University of Toronto, gina.park@mail.utoronto.ca
Jake Stone, Paragon Testing Enterprises, stone@paragontesting.ca
Eunice Eunhee Jang, OISE/University of Toronto, eun.jang@utoronto.ca

In Canada, prospective immigrants need to demonstrate their language proficiency through one of two approved tests, one of which is the Canadian English Language Proficiency Index Program General (CELPIP-G) (Citizenship and Immigration Canada). This high-stakes test has significant consequences for its test-takers; accordingly, the quality of the CELPIP-G is of critical importance (Shohamy, 2007). As part of an on-going development and validation process to evaluate the quality of inferences made from the test, we engaged in a systematic review and analysis of the CELPIP-G. This paper focuses on
the processes related to the reading sub-tests, which aim to assess test-takers’ ability to engage with, understand, interpret, and make use of written English texts to achieve day-to-day and general workplace communicative functions. The primary purpose of the investigation was to strengthen the reading test’s blueprint through multiple methods, including expert panel analysis and psychometric modelling. The investigation involved three main tasks: 1) Developing an overarching reading assessment framework based on a review of relevant theoretical and empirical literature (e.g., Alderson, 2000; Grabe, 1991); 2) Expert analysis of the CELPIP-G reading test items using the framework to identify the knowledge and cognitive skills underlying each item; and 3) Developing a cognitive diagnostic model (CDM) using test performance data to empirically substantiate the proposed knowledge/skill relationships with the items. Based on the literature review, 17 fine-grained sub-skills were identified as underlying assessment of functional reading proficiency, which served as the foundation for the expert panel’s work of reviewing test items on two operational CELPIP-G reading test forms. The purpose of the panel’s work was not to confirm the framework’s set of linguistic knowledge and cognitive skills, but to use it as a basis to identify a wide range of knowledge and skills, enriching their theoretical meanings through collaboration and negotiations. The skills identified by the expert panel were specified into a Q-matrix (Tatsuoka, 1990). A Q-matrix maps the relationships among knowledge and skills with items. These relationships were empirically evaluated through cognitive diagnosis modelling using the Fusion Model skill diagnosis procedure (Roussos et al., 2007). All discrepancies emerging from the modelling were carefully considered to ensure that changes made to the Q-matrix based on psychometrics did not lead to theoretically less interpretable construct representation. The iterative process revealed ten reading processing skills: vocabulary knowledge; grammatical knowledge; textually explicit information; non-explicit information; inferencing; knowledge of cohesive markers and organizational features; distinguishing ideas (e.g., relevant and irrelevant information); knowledge of pragmatics (e.g., tone, mood, formality); interpreting information across different text forms; and synthesising. The psychometric analysis also provided integral information about the quality of test items in terms of difficulty and discrimination, and further revealed the relationship amongst skills, as well as the distribution of test-takers across skill mastery classes. This paper discusses key lessons learned from our efforts to strike a balance between the two different methods and sources of evidence crossing paradigmatic borders: human judgement and psychometrics.

10:50-11:15am

Distinguishing features in scoring young language students’ oral performances

Lin Gu, Educational Testing Service, lgu001@ets.org
Ching-Ni Hsieh, Educational Testing Service, chsieh@ets.org

Research has examined the components of speaking proficiency and their associations with human ratings among adult learners of English (e.g., Iwashita, Brown, McNamara, & O’Hagan, 2008; Plough, Briggs, & Van Bonn, 2010), and has contributed significantly to our understanding of the construct of speaking ability. However, little attempt has been made to explore the nature of speaking proficiency among young learners, whose communication needs are distinctively different from those of adult learners. This study examined the spoken features of young English language learners using performance data extracted from the operational TOEFL Junior Comprehensive (TJC) speaking test database. Our goal was to gain a detailed understanding of how different linguistic features distinguish levels of oral proficiency among young learners. Performance data samples were produced by 57 TJC test takers, between 9 and 12 years of age, with varying degrees of oral proficiency.
Four distinct TJC task types were included in the samples. Oral responses were transcribed by human transcribers. Audio recordings and transcripts were then analyzed using automated scoring technologies in terms of pronunciation, fluency, and vocabulary use. ANOVA analyses were performed for each feature separately to determine the degree to which each linguistic feature differed across proficiency levels. Significant differences across levels were found for all three aspects of speaking ability. The results also indicated that features related to fluency have the strongest association with ratings of young learners’ oral proficiency. This study provides an important piece of empirical evidence in support of the interpretation about young learners’ language abilities made on the basis of the scores awarded by TJC human raters. Implications for the development of young learner language assessments and suggestions for future research will be discussed.

Task equivalence in writing assessment: Which task features can and should be controlled?

Mark Chapman, CRELLA, chmark@umich.edu

One difficulty facing those who attempt to assess spoken or written language on repeated occasions is how to ensure that the tasks learners are presented with are equivalent. If certain test tasks elicit responses that are significantly different from other tasks, then test takers’ scores may vary depending on which prompt they respond to. These are serious concerns for high-stakes, standardized assessments where parallel test forms are essential. Although there is a developing understanding of how varying complexity of input in integrated writing tests affects the final written product (Cumming, Kantor, Baba, Erdosy, Keanre, & James, 2005; O’Loughlin & Wigglesworth, 2007), the specific features of the task that may affect the final written product are not well understood. This paper describes a study of task equivalence within a high-stakes test of English as a second language. The study focuses on the writing component of the exam and investigates which features of the writing task may be controlled so as to better establish task equivalence. This paper will report on a study of six retired writing prompts from the Michigan English Language Assessment Battery (MELAB), an advanced-level test of English for nonnative speakers. The prompts were selected and categorized by task features after applying previously developed prompt taxonomies (Hirokawa and Swales, 1986; Horowitz, 1991; Swales, 1982; Lim, 2010) and interviewing test takers and experienced MELAB raters. The resulting task features that best categorize the writing prompts are domain, response mode, number of rhetorical cues, and focus (the extent to which the writer is constrained when selecting a topic to write on). 20 essays at three distinct proficiency levels were collected in response to each of the six prompts (n=360), and each essay was analyzed with respect to a range of discourse variables, selected from the existing literature on task effect and to represent key features from the MELAB rating scale. The discourse variables were selected to operationalize the traits of fluency (essay length), syntactic complexity (average sentence length and average number of words before the first verb), lexical sophistication (average word length, lexical frequency profile, and type-token ratio), cohesion (incidence of all connectives, causative connectives, and temporal connectives), and accuracy (total number of errors and average number of errors per 100 words). The discourse variable data were analyzed using MANOVA for significant differences across prompts. The results revealed significant differences in lexical sophistication, syntactic complexity, and cohesion. The task features that were associated with these significant differences in written product were response mode, prompt focus, and domain. A series of interviews with test takers who had completed the MELAB Writing test confirmed that domain and prompt focus were characteristics of the task that should be controlled. The interview data also indicated that the number of rhetorical cues within the task was an additional task variable that influenced how test takers responded to the prompts. The talk
will conclude with an account of the revisions that have been made to the test specifications to promote task equivalence for all writing prompts.

Impact and consequences of university-based assessment in China: A case study

Jinsong Fan, Fudan University, jinsongfan@fudan.edu.cn
Xiaomei Song, Georgia Southern University, xsong@georgiasouthern.edu

performed of the questionnaire data, including exploratory factor analysis, descriptive statistics, two-way ANOVA, and two-way MANOVA. The qualitative interview data were analyzed through an inductive approach with the aid of NVivo. Results revealed that: 1) despite students’ overall positive perceptions of the test, most of them merely considered it as another external English test imposed on them; 2) the test had more impact on students’ test preparation strategies than on learning content, attitudes, and motivation; 3) students commented negatively on the transparency of test information and the lack of test preparation materials and learning support; 4) the effect of gender was statistically
Development and validation of a placement test for a multi-polar graduate institute

James Elwood, Meiji University, elwood@meiji.ac.jp
Katerina Petchko, Graduate Institute of Policy Studies, kpetchko@grips.ac.jp

tasks more germane to academic writing. Such tasks would assess students’ ability to understand academic assignments, evaluate the logic and validity of presented arguments, and respond by producing “text-responsible prose” (Leki & Carson, 1997, p. 58). Our dissatisfaction with existing tests resulted in the development of a pre-matriculation test in which examinees respond to two carefully selected, culturally sensitive prompts. The test was administered in three consecutive years to 408 students. Data were examined using multi-faceted Rasch measurement (MFRM) to elucidate the performance of the examinees, raters, and the rating rubric along the dimensions of (a) the extent to which the students were able to evaluate the presented arguments and produce text-responsible prose and (b) language

In response to the recent call for offering detailed feedback about language test results (e.g., Kunnan, 2008; Kunnan & Jang, 2009), some large-scale English language tests have started to report fine-grained information about learners’
test performance. Effective communication of test results to stakeholders is essential in order to promote stakeholders’ accurate understanding and appropriate use of language test results. Some recent studies (e.g., Knoch, 2012; Yin, Sims, & Cothran, 2012) showed students’ general preference for detailed performance feedback in the case of diagnosis. Likewise, we need investigations into how large-scale test score reports should be designed, as well as how students and teachers perceive the reported information and use it when such tests are used as part of L2 instruction. This small-scale qualitative study investigated the current score reporting practice and Japanese learners’ and teachers’ perception and use of results reported for two large-scale English language tests in Japan: the Global Test of English Communication for Students (GTECfS) and the Eiken Test in Practical English Proficiency (Eiken). Two types of qualitative analyses were conducted. First, in order to summarize the current score reporting practice for the GTECfS and Eiken, two coders independently conducted a content analysis of the score reports and supplementary materials by employing Roberts and Gierl’s (2010) test score report analysis framework. Next, in order to explore how learners and teachers perceive and use the score reports and supplementary materials for further learning and instruction, 16 students and five EFL teachers were recruited from three upper secondary schools in the Tokyo metropolitan area, where the two focal tests were used for various purposes in EFL instruction. The student and teacher participants were interviewed in small groups, and the interview data were independently coded by the same coders for the comments made about features of specific parts of the documents and their use, taking an inductive approach. For both analyses, the coders resolved coding discrepancies by discussion. Key study findings can be summarized in terms of three points. First, the analysis of the score reports and supplementary materials suggested that the current score reporting practice of both tests is generally consistent with many principles of good score reporting practice identified in previous studies in educational assessment (e.g., Goodman & Hambleton, 2004; Zenisky & Hambleton, 2012). Second, the group interview results showed that the student and teacher participants’ perception of the content and format of the score reports and supplementary materials for both tests was generally favorable, although both the students and teachers attended to and used only limited parts of the reported information in subsequent learning and teaching. Finally, results of the teacher interviews revealed that factors such as differences across the schools in the purpose of using the two tests for EFL instruction affected how, and the extent to which, the teachers used results of the two tests. Implications of the study findings for improving language test score report design and communication of test results to stakeholders, for better bridging assessment and instruction, will be discussed.

Enhancing test validation through rigorous mixed methods design features

Angeliki Salamoura, Cambridge English Language Assessment, salamoura.a@cambridgeenglish.org
Tim Guetterman, University of Nebraska, tcguetterman@gmail.com
Hanan Khalifa, Cambridge English Language Assessment, khalifa.h@cambridgeenglish.org

As test validation calls for the integration of multiple strands of evidence, mixed methods research carries many potential ways to add value to it. However, conducting a mixed methods study in language testing can be a challenging task. The purpose of this paper is to discuss essential design features of mixed methods studies in language testing. In order to develop recommendations for designing mixed methods studies in language testing, we reviewed a small body (n = 10) of such studies conducted over the last ten years. For inclusion in the review, the studies had to meet three criteria: (i) addressed a language assessment topic, (ii) described a true mixed methods approach
(i.e., the collection, analysis and integration of quantitative and qualitative data), and (iii) contained a complete description of methods adequate for our analysis. Throughout the review, we identified nine methodological issues that commonly arise in mixed methods language testing studies, including the specification of the rationale for collecting both quantitative and qualitative data and the description of mixed methods integration procedures (e.g., connecting, building, merging). Each of the nine identified issues can potentially negatively affect the rigour of a study. To address these methodological issues and assist language testing researchers in publishing their studies, we will present ten design features derived from recent scientific developments in the mixed methods field (Creswell, in press) and discuss what each will add to a language testing study. When applied, these design features will produce a mixed methods study that is consistent with current mixed methods quality standards for publication. We will conclude by discussing how a rigorous mixed methods design can strengthen the investigation of test validity and the building of a validation argument. We argue that, if applied consistently, the proposed design features enhance the methodological rigour of a test development or revision project and, by extension, the validity of the evidence presented in support of the test validation argument.

A multifaceted study using web-based video conferencing technology in the assessment of spoken language

Fumiyo Nakatsuhara, CRELLA, University of Bedfordshire, fumiyo.nakatsuhara@beds.ac.uk
Chihiro Inoue, CRELLA, University of Bedfordshire, chihiro.inoue@beds.ac.uk
Vivien Berry, British Council, vivien.berry@britishcouncil.org
Evelina Galaczi, Cambridge English Language Assessment, galaczi.e@cambridgeenglish.org

Rapid advances in computer technology over the past few decades have facilitated changes in the practice of language teaching, learning and assessment. As one of the ways to offer spoken language lessons, it is now possible for teachers and students in different locations to ‘meet’ using a web-based video conferencing system, offering opportunities to those who would not otherwise be able to attend classes. While the new technology has made face-to-face teaching and learning less logistically complex and thus accessible to more learners, it seems that the use of the technology has not been fully explored in the field of spoken assessment. Most computer-delivered spoken tests to date are semi-direct tests, in which the test-taker speaks in response to recorded input delivered via a computer. The delivery mode does not permit reciprocal interaction between the test-takers and interlocutor(s) in the same way as a face-to-face format, inevitably constraining the speaking ability construct to be measured. This research therefore explored how web-based video conferencing technology could be used to deliver and conduct the face-to-face version of an existing speaking test, and what similarities and differences between the two modes can be discerned. The study sought to examine (a) test-takers’ linguistic output and scores on the two modes and their perceptions of the two modes, and (b) examiners’ test management and rating
behaviours across the two modes, including their perceptions of the two conditions for delivering the speaking test. 32 test-takers and four trained examiners participated in the study. The test-takers took two speaking tests under face-to-face and computer-delivery conditions using Zoom technology. The order of the two test conditions was counter-balanced for both the test-takers and examiners. A convergent parallel mixed methods design was used to allow for a more in-depth and comprehensive set of findings from multiple types of information. The research included an analysis of feedback interviews with test-takers as well as their linguistic output during the tests (especially types of language functions) and rating scores awarded under the two conditions. Examiners provided written comments justifying the scores they awarded, completed a feedback questionnaire, and participated in retrospective verbal report sessions to elaborate on their interlocuting and rating behaviour. Three researchers also observed all test sessions and took field notes. The findings suggested that while the two modes generated similar test score outcomes, there were some differences in functional output and examiner interviewing and rating behaviours. In particular, some interactional language functions were elicited differently from the test-takers between the two modes, and the examiners seemed to use different turn-taking techniques between the two conditions. Although the face-to-face mode tended to be preferred, some examiners and test-takers with more exposure to computer-based communication and teaching felt more comfortable with the computer mode than the face-to-face mode. This presentation will conclude with recommendations for further research, including examiner and test-taker training and resolution of technical issues, which should be addressed before any examination boards make decisions about introducing (or not) a speaking test using video conferencing technology.

Testing the right skill: the misapplication of reading scores as a predictor of translation ability

Maria Brau, Federal Bureau of Investigation, maria.brau@ic.fbi.gov
Rachel Brooks, Federal Bureau of Investigation, rachel.brooks@ic.fbi.gov

US Federal Government agencies often use a measure of reading comprehension in the foreign language to qualify a person for translation work. This practice is based on two assumptions: that as a citizen of the United States, the examinee has a sufficiently strong proficiency in writing English; and that this assumed English proficiency, in combination with a measure of reading comprehension in a foreign language, constitutes sufficient qualification to perform translation tasks. However, the Interagency Language Roundtable (ILR) Skill Level Descriptions (SLDs) for Translation Performance state that reading comprehension of the source language and ability to write the target language are prerequisite but not sufficient for translation. The SLDs posit that translation also involves “congruity judgment,” defined as the ability to choose equivalent expressions in the target language that best match the meaning intended in the source language (ILR, 2007). Still, there has been little research into the relationship between reading ability and translation ability that proves the need for an additional ability (“congruity judgment”) and therefore the need to measure it. Past research has shown that for Arabic, there is a moderate to weak correlation between reading proficiency and translation performance (Brau & Lunde, 2005), but that research was limited to one language and did not consider demographic variables of the examinees. Therefore, the answer to the research question “is reading comprehension a good predictor of translation ability?” is unclear. At the Federal Bureau of Investigation (FBI), applicants for translator positions take both reading and ILR-based translation tests as a part
Paper abstracts
of their qualifying battery. A regression analysis was conducted of reading and translation scores for over 5,000 applicants. Examinees came from a wide variety of language backgrounds, including Arabic, Chinese, Farsi, French, Korean, and Russian. Results were analyzed for the entire sample, as well as for the different language groups. Additionally, examinees with lower-level scores were compared to those receiving higher-level scores. Results were interpreted with consideration of various demographic variables, such as educational attainment. Preliminary results indicate that reading comprehension is a weak predictor of translation ability. This conclusion is of consequence for USG agencies. The over-application of reading test scores shows the need for modification of testing procedures. Additionally, the research results suggest that government training programs should include teaching not only reading but also translation skills, in order to prepare government linguists for the task they would be performing, and point to an overall need for increased assessment literacy.

3:00-3:25pm

Features of discourse competence in IELTS speaking task performances at different band levels

Noriko Iwashita, The University of Queensland, n.iwashita@uq.edu.au

Claudia Vasquez, The University of Queensland, claudia.vasquez@uqconnect.edu.au

been undertaken within a search for further understanding of communicative competence (e.g., Bachman & Palmer, 1996; Chalhoub-Deville, 2003; Purpura, 2008). The search for deeper understanding has prompted the development of various frameworks attempting to provide dependable theoretical foundations of the nature of communicative language ability: these various theoretical models conceptualize communicative language ability as a composite of different sub-competencies which come to explain the degrees of mastery of an L2. Among the sub-competencies suggested to constitute communicative language ability, discourse competence has been assumed to be at the core of the knowledge needed for L2 communication (Bachman, 1990; Bachman & Palmer, 1996). While there seems to be consensus on the importance of greater understanding of discourse competence as a means for further clarity about communicative language ability and L2 proficiency in general, a detailed study of discourse competence appears to be somewhat neglected (Kormos, 2011; Purpura, 2008), particularly in speaking performance. Discourse competence is one of the four categories identified in the IELTS Speaking Band Descriptor. In order to fill this gap, the current study has undertaken a detailed examination of test-taker oral discourse at three proficiency levels, addressing the research question: What are the distinctive features of performance that characterise test-taker discourse at the three IELTS speaking band levels? The transcribed 60 speech samples (20 at each level) of the IELTS Academic Speaking task 2 were analysed both quantitatively and qualitatively by adapting the method employed by Banerjee et al. (2004) in the analysis of the IELTS Academic Writing tasks. The features of discourse competence
taker performance than lower level test-takers, but other features (e.g., ellipsis, use of reference) were not clearly distinctive according to the levels. Other factors, such as test-takers' L1 and task type, have an impact on some aspects of coherence regardless of the proficiency levels. These findings contribute to further understanding of the nature of oral proficiency and communicative language ability, as well as supplement the IELTS speaking band descriptors with oral features of test-taker discourse empirically identified in the test-taker performances. Furthermore, the results provide test-takers with useful diagnostic information about their oral proficiency and will inform language teachers of characteristics of oral proficiency to be targeted in L2 instruction. The study also contributes to research on the interface between language assessment and SLA.

Structural equation models of calibration, performance appraisals, and strategy use in an IELTS listening test

Aek Phakiti, The University of Sydney, aek.phakiti@sydney.edu.au

or complete a test task. Their CSU and MSU were measured prior to the test taking, during the test taking and after the test taking. 300 English as a second language (ESL) international students took part in the study. The data collection stages were as follows. First, prior to taking the listening test, test takers were asked to report on their general CSU and MSU in listening comprehension and IELTS listening tests. Second, they took the IELTS listening test, which had four parts. While completing each part, they were asked to report on the level of their confidence in their test performance (e.g., 0% to 100%) for each question. At the end of each IELTS listening section, they were asked to report on their CSU and MSU during that part. At the end of the entire listening test, they were also asked to report their overall CSU and MSU. Test takers' calibration, performance appraisals (i.e., confidence of test correctness), and reported strategy use scores were analyzed together using a structural equation modeling (SEM) approach. It was found that on average the test takers were not well calibrated and had a tendency to be overconfident across the test sections. Their calibration scores and confidence ratings in performance were positively, yet marginally, related to their reported metacognitive strategy use. The present study
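Calibration, in the sense used in this abstract, is the match between a test taker's confidence ratings and their actual correctness: overconfidence is a positive gap between mean confidence and the proportion of items answered correctly. A minimal sketch of that arithmetic, with invented numbers rather than the study's data:

```python
# Calibration sketch: compare mean confidence (0-1) with proportion correct.
# Hypothetical per-question data for one test taker in one listening section.
confidences = [0.9, 0.8, 0.7, 0.9, 0.6]   # self-reported confidence ratings
correct     = [1,   0,   1,   0,   1]     # scored responses (1 = correct)

mean_confidence = sum(confidences) / len(confidences)
accuracy = sum(correct) / len(correct)

# Positive bias = overconfidence; negative bias = underconfidence.
bias = mean_confidence - accuracy

print(f"confidence={mean_confidence:.2f} accuracy={accuracy:.2f} bias={bias:+.2f}")
# → confidence=0.78 accuracy=0.60 bias=+0.18
```

In the study itself such calibration indices were related to strategy use via SEM; the sketch only shows how the raw overconfidence gap is formed.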
Applying academic vocabulary lists to validating test contents

David Qian, The Hong Kong Polytechnic University, david.qian@polyu.edu.hk

Academic vocabulary lists play an important role in facilitating English as a second/foreign language learners in their academic studies. Since the introduction of the University Word List (Xue & Nation, 1984), more academic vocabulary lists have appeared, such as the Academic Word List (AWL; Coxhead, 2000), Academic Formulaic List (AFL; Simpson-Vlach & Ellis, 2010), PHRASE List (Martinez & Schmitt, 2012), and Academic Vocabulary List (AVL; Gardner & Davies, 2013). However, as these lists were created following different theoretical frameworks and selection criteria, they may differ in the scope of coverage of academic lexical elements. The present study aimed to analyse the frameworks and criteria researchers resorted to in creating their respective vocabulary lists, and apply the resultant lists to profiling the academic lexical coverage of the Spoken Sub-Corpus of the TOEFL 2000 Spoken and Written Academic Language (T2K-SWAL) corpus, which is used as an important database for developing new test items for the speaking and listening subtests of the TOEFL iBT. The results of our analysis indicate that, in terms of overall coverage, AVL seems to capture a much higher percentage of individual academic words than does AWL, and can therefore be considered more robust in detecting the level of intensity of academic discourse. On the phraseology dimension, however, both AFL and PHRASE List appear to be similarly represented in the T2K Spoken Corpus; nevertheless, when the top 30 high-frequency phrases are examined, nuances appear. The paper will explore those fine-grained differences and identify possible reasons which may have resulted in different scopes of coverage between AWL and AVL, and between AFL and PHRASE List. In addition to determining the usability of the above academic vocabulary lists for supporting validity studies of English language for academic purposes, the results of this study also point to the need that, to make a better judgment of item/test quality, it is advisable to employ multiple vocabulary lists for measuring the academicality of a given text or discourse so that not only word and phraseology coverage but also the ranking of the most frequently occurring words and phraseologies can be more reliably determined. Furthermore, since these lists may be purpose-specific when created, in applying them to test validation, it is advisable to ensure with preliminary screening that the vocabulary lists chosen for such validation are relevant and suitable for the intended purpose of the test being evaluated.

Eye tracking of learner interactions with diagnostic feedback on French as a Second Language skill reports

Maggie Dunlop, OISE/University of Toronto, maggie.dunlop@utoronto.ca

This paper reports a study that investigated the cognitive processes taking place when French as a Second Language (FSL) learners interpret diagnostic feedback presented in different presentation modes (e.g., textual vs. visual). In particular, the study examined how these processes differ among students with different learner characteristics (i.e., FSL proficiency level, goal-orientations (Dweck, 1986)). The present study contributes to current theories of learner-feedback interaction in second language learning, a central component of the assessment-for-learning practices (Hattie & Timperley, 2007) that are a key feature of effective foreign language instruction. Specifically, as learner interaction with feedback greatly influences how feedback is used (VandeWalle, 2003), we need to understand the ways in which students process feedback to ensure effective integration of assessment with learning (Jang, 2008). The objectives of this study were to investigate how learners engage with and process feedback differently, and what features of descriptive feedback facilitate deeper cognitive processing among language learners. Non-Francophone students at a bilingual Canadian undergraduate institution received diagnostic feedback reports on their French language skills after writing a placement test designed according to cognitive diagnostic assessment principles (Leighton & Gierl, 2007). The test and reports were delivered by computer. Six students with various background characteristics participated in this study. The students were randomly assigned to one of two groups, to receive either 'holistic textual' or 'specific visual' descriptive feedback. The students received their reports while an eye tracking device was operating, then participated in a stimulated recall interview. These activities were designed to elicit data on cognitive aspects of students' attention, processing and interaction with their report. The study analysed the eye tracking data and interviews and found that learners of different psychosocial backgrounds engage effectively with different forms of feedback. Lower proficiency students, who were also highly motivated to master French, reflected on feedback with few emotions and fairly uncritical acceptance of advice. These students read the whole report carefully but paid little attention to achievement charts, and struggled to define learning plans. The higher proficiency students, who were also highly concerned about performing well, engaged in various reflection patterns. The more emotionally reactive students viewed the feedback as an evaluation of themselves rather than their French, and were less likely to engage in careful reading, deeper reflection, and planning. All the students were prompted to deeper reflection by comparisons of self-assessments with test results, sought to understand skill descriptions, and read study advice carefully. This study suggests that students with highly emotive learning approaches would benefit from feedback that allows them to focus on learning. Secondly, learners may not benefit substantially from achievement charts, which are interpreted similarly to numerical marks even when no numbers are displayed. Thirdly, skill descriptions provide a foundational framework for learners to interpret feedback, and must be clearly written and well-developed. Finally, including self-assessment comparisons is key to prompting learners to meaningfully engage with feedback. These study results enhance our understanding of learner-feedback interaction, with future applications for delivering second language feedback meaningfully to facilitate continued acquisition.

3:30-3:55pm

The interlocutor effect in paired speaking tasks to assess interactional competence

Soo Jung Youn, Northern Arizona University, Soo-Jung.Youn@nau.edu

Paired speaking tasks can be useful classroom assessment tools to measure interactional competence, as students can engage in meaningful conversations and a teacher can efficiently administer a speaking test. Despite this potential, the co-constructed nature of interactional competence (Chalhoub-Deville, 2003; McNamara, 1997; Taylor & Wigglesworth, 2009) presents validity challenges, for example, in applying rating criteria to assign an objective and fair score to each examinee and in interpreting score meanings. Thus, further research on the co-constructed nature of interactional competence is crucial for valid uses of paired speaking tasks for classroom assessment. One of the research directions in the field of assessing interactional competence is to examine the interlocutor effect in paired speaking tasks (e.g., Davis, 2009; Nakatsuhara, 2004). More specifically, the peer examinee's proficiency level might be a potential source that influences the quality of resulting performances and raters' scoring decisions. Employing a mixed method (Green, 2007; Tashakkori & Teddlie, 2003), this paper aims to investigate the interlocutor effect in paired speaking task performances. Specifically, the paper examines whether and how the peer examinee's proficiency level systematically influences the co-constructed nature of the resulting interaction. For data, two
dyadic role-play tasks were developed to measure L2 English learners' interactional competence for classroom assessment in an English for Academic Purposes setting. Additionally, analytical rating criteria with detailed descriptions were developed for stakeholders to make meaningful score interpretations and for ensuring raters' reliable decisions. Examinees ranged from low-intermediate to advanced-level L2 English speakers (N = 102), consisting of 51 pairs for the dyadic role-play tasks. Six trained raters scored the examinees' performances using the analytical rating criteria. For data analysis, a many-facet Rasch measurement was conducted using FACETS (Linacre, 2006) with five facets (i.e., examinee, rater, task, interlocutor, and rating criteria) to investigate whether the peer interlocutor's proficiency influences raters' scoring decisions. The FACETS analysis indicated that interlocutor proficiency did not have a significant effect on averaged scores. However, the raters' sub-scale scores, particularly on interaction-sensitive sub-scales, were influenced by the peer interlocutor's proficiency. In order to further understand the nature of interaction elicited from paired speaking tasks, select examinees' performances were qualitatively analyzed drawing on conversation analysis (CA) (Sacks, Schegloff, & Jefferson, 1974). CA findings indicated that the examinees generally oriented to completing the tasks at hand. However, the pairs consisting of low-intermediate and advanced-level examinees showed distinct interactional features and sequential organizations compared to pairs of examinees with similar proficiency levels. Based on the research findings, implications for pairing examinees and assigning valid scores when administering paired speaking tasks, and ways to utilize examinees' speaking performances for teaching interactional competence in the classroom, will be discussed.

Producing actionable feedback in EFL diagnostic assessment

Hongwen Cai, Guangdong University of Foreign Studies, hwcai@gdufs.edu.cn

This study answers an updated call for relating second and foreign language diagnostic assessment to intervention (Alderson, Brunfaut, & Harding, 2014). It is observed that most existing practices in diagnostic language assessment do not lead to actionable procedures, whereas the effect of different types of feedback is typically investigated through group-level comparison in second language acquisition studies, which does not result in intervention plans that target the specific needs of individual learners. Clearly, the missing link between second language acquisition and diagnostic assessment is actionable feedback that is suited to the needs of individual learners. To explore how such feedback can be produced, an ethnographic study was conducted on 178 college freshmen in an EFL listening course in China. Information about their comprehension problems and how the problems are related to their learning experience was collected on multiple occasions in the classroom setting, through the students' self-assessment reports, their performance in in-class and take-home tests, interviews based on the students' performance in the tests, and the teacher's classroom notes. The data from the multiple sources were coded, frequencies of various decoding problems were counted, and students' accounts of their learning experience were matched to the problems. With a focus on the decoding process, eight broad categories of problems were identified from the data: failure to recognize link-ups and weak forms, incorrect word recognition due to mispronunciation, ill adaptation to unfamiliar accents, failure to identify unfamiliar collocations, failure to recognize unfamiliar meanings of familiar words, failure to guess the meaning of new words, failure to identify set phrases, and failure to understand new structures. While most of these categories could find their counterpart in Field's (2008) list of decoding processes, further analyses showed that
a large part of these problems could be traced to the students' learning experience, especially the language input they had been exposed to, the instruction they had received up to the point of this study, and their habits of listening. On the basis of this exploration, a qualitative procedure for producing actionable feedback is proposed in the context of EFL listening comprehension. Two key elements of this framework are post-listening diagnosis of the listening process and relation of the diagnosis to the learning experience of the learners. The post-listening diagnosis is based on the cognitive framework proposed by Field (2008), comprising decoding and meaning-building processes. Elements of the learning experience include the language input the learners have been exposed to, the instruction they have received, and their habits of listening (Norris & Ortega, 2000). In a nutshell, learners are encouraged to find out about their problems, trace the causes of their problems to their learning experience, and then make their specific decisions as to the choice of learning materials, habits of listening, and immediate steps to take.

C-tests outperform Yes/No vocabulary size tests as predictors of receptive language skills

Claudia Harsch, University of Warwick, C.Harsch@warwick.ac.uk

Johannes Hartig, German Institute for International Pedagogical Research, hartig@dipf.de

Placement and screening tests serve important functions, not only with regard to placing learners at appropriate levels of language courses but also with a view to maximizing the effectiveness of administering test batteries. With the advent of computer-administered tests (CAT), adaptive testing becomes more and more attractive (e.g. Frey & Seitz, 2009). Here, reliable pre-tests become a prerequisite to optimize the CAT procedure. As stated by e.g. Bachman & Palmer (1996) or Read (2000), the test purpose determines all future decisions; with regard to placement, screening or pretesting, these purposes require a format which has been reported as a reliable and valid measure for screening and placement purposes, is simple and quick in administration and scoring, and poses little demand on the test takers while covering as many items as possible. We examined two widely reported formats suitable for these purposes: the discrete, decontextualized Yes/No vocabulary test (e.g. Alderson & Huhta, 2005; Meara, 2005) and the embedded, contextualized C-test format (e.g. Eckes & Grotjahn, 2006). Using Read's (2000) and Read & Chapelle's (2001) framework for assessing vocabulary as a means of classification, the two formats operationalize opposing ends of the framework domains 'discrete vs. embedded' and 'decontextualized vs. contextualized'. We employed these opposing formats in order to determine which format can explain more variance in measures of listening and reading comprehension. Our data stem from a large-scale assessment with over 3000 students in the German secondary educational context; the four measures relevant to our study were administered to a subsample of 559 students. Using regression analysis on observed scores and SEM on a latent level (Kunnan, 1998; Purpura, 1998), we found that the C-test outperforms the Yes/No format in both methodological approaches. The contextualized nature of the C-test seems to be able to explain large amounts of variance in measures of receptive language skills. The C-test, being a reliable, economical and robust measure, appears to be an ideal candidate for placement and screening purposes. With regard to the contextualized nature of the C-test format and its embedded items, positive washback for the foreign language classroom is to be expected (Qian, 2008), since solving C-test items draws on a range of processes and skills, all of which are needed to process language in context (Sigott, 2005). In a side-line of our study, we also explored different scoring approaches for the Yes/No format (Huibregtse, Admiraal & Meara, 2002). We found that using the hit rate and the false-alarm rate as two separate indicators
yielded the most reliable results. As an additional benefit, these two indicators can be interpreted as measures of vocabulary breadth and as a guessing factor, respectively, and they allow controlling for guessing. This substantive interpretation offers benefits for the language classroom, giving feedback to learners and teachers alike on vocabulary knowledge and on the use of guessing.

How young children respond to computerized reading and speaking test tasks

Laura Ballard, Michigan State University, ballar94@msu.edu

Shinhye Lee, Michigan State University, leeshin2@msu.edu

tasks compared to the native-English-speaking children and their attitudes toward computerized assessments by incorporating a wide array of analyses. The ELL children and the native-English-speaking children will complete two test sections: oral (six TOEFL Primary sample speaking tasks) and reading (seven TOEFL Primary sample reading tasks). We will videotape each child as he or she takes the test, which will be programmed on a Tobii TX300 eye-tracking computer to record eye movements. After the test, we will ask each child (in their L1) to draw a picture to show how he or she felt during testing (draw-a-picture technique; Carless, 2012) and to participate in stimulated recall about their test-taking experience in response to selected eye-gaze recordings. We will compute eye fixations and gaze duration calculations to explore the amount of attention paid to the different test components, which we can aggregate and analyze by test outcome.
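Fixation counts and gaze durations of the kind just described are typically aggregated per area of interest (AOI) before being related to test outcomes. A minimal sketch of that aggregation step, using invented fixation records (a real Tobii export has a much richer format):

```python
from collections import defaultdict

# Hypothetical fixation records for one child on one item: (AOI label, duration in ms).
# The AOI labels are invented for illustration, not taken from the study.
fixations = [
    ("question_text", 420), ("picture", 310), ("question_text", 180),
    ("answer_options", 650), ("picture", 90), ("answer_options", 275),
]

counts = defaultdict(int)     # number of fixations per AOI
durations = defaultdict(int)  # total gaze duration per AOI (ms)
for aoi, ms in fixations:
    counts[aoi] += 1
    durations[aoi] += ms

for aoi in sorted(durations):
    print(aoi, counts[aoi], durations[aoi])
# → answer_options 2 925
#   picture 2 400
#   question_text 2 600
```

Per-AOI totals like these can then be grouped by learner (ELL vs. native speaker) or by item outcome, as the abstract proposes.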
4:00-4:25pm

Challenges of blended learning assessments in network-limited areas

Alberto Enrique Acedo Bravo, Universidad de Oriente, translator-2@vlir.uo.edu.cu

Yoennis Díaz Moreno, Universidad de Oriente, translator-3@vlir.uo.edu.cu

Lutgarde Baten, Leuven Language Institute, lut.baten@gmail.com

paper sheds new light on the neglected issue of getting accountability through assessing language achievements in a blended learning environment.

Test-takers' use of visual information from context and content videos in the video-based academic listening test

Ruslan Suvorov, Center for Language and Technology, University of Hawaii at Manoa, rsuvorov@hawaii.edu
focused on when watching two types of videos—context and content videos—used in the VALT, and (b) the aspects of visual information that they had perceived as helpful and as distracting for their test performance. The results revealed (a) differences between aspects of visual information that the participants focused on in context videos and content videos, and (b) differences between the aspects of visual information from the two video types that the participants perceived as helpful and as distracting. The implications of this study will be discussed with regard to the use of visuals for the development of L2 listening skills and the use of cued retrospective reporting and eye-tracking methodology for research on video-based L2 listening assessment.

Comparing aural and written receptive vocabulary knowledge of the first 5k and the AWL

Brandon Kramer, Momoyama Gakuin University, brandon.L.kramer@gmail.com

Stuart McLean, Graduate School of Foreign Language Education and Research, Kansai University, stuart93@me.com

that has attracted very little systematic interest from researchers' (pp. 92-93), while Stæhr (2009) lamented that 'no suitable test of phonological vocabulary size existed' (p. 597). Milton and Hopkins (2006) investigated aural vocabulary size, finding that very proficient EFL learners frequently have much larger reading vocabularies than listening vocabularies, but the test used (A-Lex) is a size test and does not measure knowledge of the respective frequency levels. Within the LVLT and NLVT, each 1000-word band is tested using 24 items created through the retrofit and redesign of previous Vocabulary Size Test (VST) items (Nation & Beglar, 2007). The 30 AWL items, however, were newly created using randomly selected target words and item specifications reverse-engineered from published descriptions of previous VST items. As in the previous VST format, each target word is first presented in isolation, then within a generic sentence designed to provide context and aid recall while not providing any hints for the correct answer choice. Quantitative analyses utilized several aspects of Messick's validation framework to show that each test is a reliable measure, while follow-up interviews and qualitative analyses indicated that each test measured the intended construct of aural vocabulary knowledge, that the formats are easily understood, and that the tests have high face validity. For those students who
validity of the above tests as accurate, reliable, and distinct measures of each construct. The data support the necessity of using construct-specific vocabulary tests when conducting high-quality research.

B2 or C1? Investigating the equivalence of CEFR-based proficiency level classifications

Johanna Fleckenstein, Leibniz-Institute for Science and Mathematics Education, fleckenstein@ipn.uni-kiel.de

Michael Leucht, Leibniz-Institute for Science and Mathematics Education, leucht@ipn.uni-kiel.de

Olaf Köller, Leibniz-Institute for Science and Mathematics Education, koeller@ipn.uni-kiel.de

The Common European Framework of Reference (CEFR) was developed in order to provide a basis for mutual recognition and comparability of language qualifications. It is commonly used for the classification of performance in standardized language tests (Little, 2007). According to its self-description, the CEFR can be used for specifying the content of tests and examinations, for stating the criteria to determine the attainment of a learning objective, and for describing the levels of proficiency in existing tests and examinations to enable comparisons across different systems of qualifications (Council of Europe, 2001). The question remains, however, whether CEFR-based proficiency classifications are really comparable. After all, most tests were developed or implemented for different contexts and target populations, and some also differ in their construct of language performance (Fulcher, 2004). The validity and equivalence of such CEFR-based proficiency level classifications is the subject of critical discussion (Weir, 2005). Thus, as Charles Alderson put it: "How do I know that my Level B1 is your Level B1?" (Council of Europe, 2003, p. ix). While such critical evaluations of the CEFR are mostly theoretical in nature, there is a shortage of empirical linking studies that analyze foreign language tests as to their compliance in terms of proficiency level classification. This study compares two assessment programs for students of English as a foreign language (EFL) that are mapped onto the CEFR: the German National Assessment (NA) and the international Test of English as a Foreign Language (TOEFL). A sample of N = 3,639 German EFL students in the last year of upper secondary education was assessed with a short version of a released paper-and-pencil TOEFL (ETS, 1997) as well as a test that had originally been developed for the NA in Germany 2009 (Köller, Knigge & Tesch, 2010). Student abilities were estimated by drawing plausible values (PVs) in a multi-dimensional, one-parameter (1-PL) Rasch model in ConQuest, Version 3.0 (Wu, Adams, Wilson & Haldane, 2007). PVs were recoded into a categorical variable according to the corresponding CEFR proficiency levels (A1, A2, B1, B2 and C1), based on cut-scores specified by Tannenbaum and Baron (2011) for the TOEFL ITP and by Harsch, Pant & Köller (2010) for the German NA, respectively. We found a systematic shift in the classifications of proficiency levels for the two scales: students tend to score at a lower level on the TOEFL scale than on the NA scale. Thus, B2 on the one scale equals C1 on the other. The two classifications show an only moderate relationship of ρ = .63 (reading) and ρ = .61 (listening). These findings indicate that two EFL tests do not necessarily locate a student on the same CEFR level. Thus, the interpretation of a student's actual language proficiency according to CEFR descriptors seems to depend largely on the individual test's proficiency level classification. By bringing CEFR-linked tests together, we can identify deficiencies in the operationalization of the framework for language assessment and standard-setting procedures. This can also contribute to the validation of policy-defined national educational standards.
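The level classification described in this line of work amounts to recoding a continuous ability score into CEFR bands through test-specific cut-scores, so two tests with different cut-scores can place the same learner on different levels. A minimal sketch with invented cut-scores and scores (the study's actual cut-scores come from the standard-setting reports it cites):

```python
import bisect

LEVELS = ["A1", "A2", "B1", "B2", "C1"]

def classify(score, cuts):
    """Map a continuous score to a CEFR band given ascending cut-scores.

    A score equal to a cut-score falls into the higher band.
    """
    return LEVELS[bisect.bisect_right(cuts, score)]

# Invented cut-scores for two hypothetical scales measuring the same skill.
cuts_test_a = [20, 40, 60, 80]   # boundaries A1/A2, A2/B1, B1/B2, B2/C1
cuts_test_b = [25, 45, 70, 90]   # a "harder" scale: same score, lower band

for score in (65, 85):
    print(score, classify(score, cuts_test_a), classify(score, cuts_test_b))
# → 65 B2 B1
#   85 C1 B2
```

The toy output mirrors the reported finding: identical scores land one band lower on the scale with higher cut-scores, which is why agreement between two CEFR-linked classifications can be only moderate.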
Thurs, Mar 19

10:20-11:45am

Assessing the unobservable: Can AFL inform on nascent learning in an L2 classroom setting?

Christian Colby-Kelly, christian.colby@mail.mcgill.ca

It is known that L2 learning is not always accompanied by an observable increase in the accuracy of language produced (Lightbown, 2000). Several researchers (Turner, 2012; Turner & Upshur, 1995; Young, 1995) have noted the difficulty this brings to the accurate assessment of L2 competence. Despite this limitation, it is important for L2 teachers to be aware of the unobservable, nascent learning of their students in order to take the best and most effective direction in subsequent lesson planning. An Assessment for Learning (AFL) methodology may be useful in helping teachers meet this challenge. AFL includes a diagnostic component to help both teachers and students better gauge 'where students are in their learning' for more effective lesson planning (Black and Wiliam, 1998). AFL also encourages students to be more aware of the learning process, and to take greater responsibility for determining their own learning progress. The present research will report on teacher and student perceptions of learning in a quasi-experimental investigation of an application of AFL in two classes of pre-university English for Academic Purposes in Quebec, Canada, as part of a larger study centering on the students' learning of a challenging L2 grammatical feature. The pedagogical materials used included computer-assisted language learning (CALL), an online individual concept mapping (CM) exercise, and peer-group and teacher-class CM exercises. The data collection instruments included the concept maps produced, classroom observation field notes, transcribed group and class discourse, as well as teacher and student survey questionnaires. In addition, pre- and post-treatment tests of L2 production accuracy were administered to indicate trends. The data were analysed by mixed methods and the results triangulated. The qualitative survey data showed strong teacher and student perceptions that learning had taken place in the CALL and CM exercises. Evidence in the form of classroom observation field notes, audio-recordings, and audio transcriptions of group and class interactions supports this view. The survey data included student perceptions of their learning processes during and following the AFL activities. The results of pre- and post-tests of grammatical production accuracy showed measurable gains in accuracy of grammatical form usage in one of the classes but not in the other. Thus the triangulated results suggest that learning may have taken place in the two classes, while measurable gains in production accuracy were demonstrated in only one of the classes under investigation. The overall results suggest that learner attestations may more completely inform on the status of internal and unobservable student learning than do production accuracy measures alone. These results have implications for the use of AFL in L2 classrooms to more accurately recognize nascent learning, as teachers build on and scaffold lessons learned as a basis for future lessons. Using AFL, learners may be well placed to work with teachers in helping assess the unobservable, to inform on individual learning processes, to capitalize on learning progress, and to more accurately determine where students are in their learning, to better plan for future learning.
Empirically derived rating rubric in a large-scale testing context

Alex Volkov, Paragon Testing Enterprises, volkov@paragontesting.ca
Jacob Stone, Paragon Testing Enterprises, stone@paragontesting.ca
Adam Gesicki, Paragon Testing Enterprises/UBC, gesicki@paragontesting.ca

rubric and rater interview data, the descriptor pool included 800 descriptors. Following the methods described by Knoch (2009) and North and Schneider (1998), norm-referenced modifiers were removed, all the descriptors were encapsulated into stand-alone statements, and redundancy was minimized. Next, the CELPIP writing construct was employed to discard the descriptors not directly applicable to the context. At this stage the descriptor pool included 300 unique, positively worded descriptors. 2. Q-Sort: To obtain a better representation of the CELPIP language framework in the new rubric, an expert panel was organized. The panelists worked together to sort the descriptors into 5 conceptual categories.

Test preparation (TP) for high-stakes English language tests has increasingly been a research interest in washback studies
(e.g., Alderson & Hamp-Lyons, 1996; Green, 2007). Mixed findings of TP effects (e.g., Elder & O'Loughlin, 2003; Liu, 2011) indicated the necessity to first explore the nature of test-takers' TP practices, then examine the effects of TP to make legitimate interpretations of its effects on test performance, and further delve into test-takers' perspectives on the values of their engagement in TP, in order to put the complete process of washback under an empirical lens. This study, with a sequential multi-phase, multi-method design, investigated the nature, effects and perceived values of Chinese students' test preparation for two major English tests in China – the TOEFL iBT and the College English Test (CET). The inclusion of Chinese students taking TP courses for these two tests provided the opportunity to reach a range of potential participants in terms of their cognitive (e.g., incentives), linguistic (e.g., English proficiency) and sociocultural (e.g., attitudes) characteristics, and in turn served as a contextual foundation to depict a fuller picture of Chinese students' test preparation. Phase 1 used a case study approach and collected multiple-source evidence (e.g., documents, interviews, classroom observations) from multiple stakeholders (e.g., students, teachers, administrative staff) to understand the TOEFL iBT TP practices at a TP centre in China. Phase 2 administered a questionnaire to 532 undergraduate students in 8 universities in China who had taken a live CET and CET TP courses before the test, and collected their CET test scores. Drawing on Scriven's (1998) framework of conceptualizing values from the perspectives of merit, worth and significance, Phase 3 unpacked how Chinese students (N=12) who had started their academic studies in a Canadian university perceived the values of their TOEFL iBT TP courses. A one-hour semi-structured interview and a half-hour follow-up interview were conducted with each participant. Test orientation, linguistic emphasis, and sociocultural and cognitive engagements, which were derived from various incentives for test preparation, depicted the nature of Chinese students' preparation for the TOEFL iBT in a particular test preparation context. Results indicated that TP practices for the CET, although with an unbalanced weight on test demands over English competence improvement, were not significant predictors of students' CET test scores. Chinese students perceived the values of engaging in test preparation as: access to structured instruction on both test orientation and English learning (merit); the possibility of increasing test scores and the need to maintain motivation and persistence (worth); and the opportunity to be involved in communities formed by peers to prepare for both test-taking and academic applications (significance). Taken together, these findings revealed the nature, effects, and perceived values of Chinese students' test preparation practices, all of which depend on factors both within tests (e.g., the stakes of tests) and in the test setting (e.g., the social and educational functions of tests). This study contributes to our understanding of how washback functions at the interface of tests and test setting, an aspect of washback studies that needs further research (Cheng, 2014).

10:50-11:15am

Junior and senior high school EFL teachers' practice of formative assessment: A mixed method study

Hidetoshi Saito, Ibaraki University, cldwtr@mx.ibaraki.ac.jp

The present study is an explanatory sequential design study (Creswell & Clark, 2011) of the current formative assessment (FA) practice of junior and senior high school EFL teachers in Japan. EFL teachers use FA to make instructional decisions, which directly affect students' learning. However, there have been surprisingly few systematic surveys of language teachers' FA practice in Japan or in any other country, except for Cheng et al.'s trailblazing work (2004). The main purpose of this study is to describe how Japanese junior and senior high school EFL teachers implement classroom formative assessments. The study
comprises two stages: a large-scale survey and a qualitative follow-up study. In the first stage, a 40-item questionnaire was constructed by drawing upon a framework of FA by Wiliam (2010) and relevant sources, including a previous large-scale survey of classroom assessment (Cheng, Rogers, & Hu, 2004) and classroom evaluation (Genesee & Upshur, 1996). The questionnaire consists of items that cover four dimensions of FA—intentions/criteria, assessment method, feedback, and assessment purposes—and the teacher's account of actual FA use on a sample language activity. The questionnaire was piloted and refined through feedback from 41 practitioners. A necessary sample size of 380 was determined by considering the estimated population. The questionnaire was mailed to 512 junior high schools and 205 senior high schools across Japan by way of stratified random sampling—the strata being 44 prefectures. A total of 732 teachers responded. The data were analyzed first by Rasch analysis. This revealed how frequently the teachers employ intentions, assessment methods, feedback, and assessment purposes. I then ran cluster analyses of the Rasch scores of the four FA dimensions along with years of teaching experience, type of school (junior or senior high school), hours of teaching, and class size as covariates. The three-cluster solution turned out to be most explanatory and efficient. The four dimensions were all statistically significant contributors to clustering, while none of the background variables were. The three clusters were identified as "high," "middle," and "low" users of FA. Based on the Rasch logit scores of questionnaire respondents, one low, one middle and two high users of FA were recruited for the follow-up qualitative study. In the qualitative stage, the statistical profiles of these four FA users were explored qualitatively through interviews, observations of their lessons, a follow-up questionnaire, and email exchanges. The results gave both supportive and negative evidence for the three-cluster solution of FA users. That is, although the salient difference in FA use among teachers seemed to lie in intentions and purposes, there seemed to be great overlap between "high" and "low" FA users, thus making it difficult to draw a clear demarcation between them. The present study reveals not only what kinds of FA are most and least commonly used in Japanese junior and senior high schools, but also patterns of typical FA use by the teachers. This has significant implications for assessment and testing literacy in in-service and pre-service EFL teacher education programs.

Constructing a common scale for a multi-level test to enhance interpretation of learning outcomes

Jessica Row-Whei Wu, Language Training and Testing Center, jw@lttc.ntu.edu.tw
Rachel Yi-fen Wu, Language Training and Testing Center, rwu@lttc.ntu.edu.tw

A criterion-referenced test may contain multiple levels. When a learner takes a higher level of a given test, his or her progress cannot be measured by directly comparing the scores from two different levels. The General English Proficiency Test (GEPT), a five-level criterion-referenced EFL testing system, was developed with reference to the English curriculum in Taiwan. The GEPT is taken by more than 500,000 learners annually for various purposes, among which assessing learning outcomes is the most common. Recently, an increasing number of schools have begun to encourage their students with better English proficiency to take the GEPT at a higher level. Under this circumstance, scores for more than one GEPT level are used in a single school setting. In other words, these schools compare the learning outcomes of students who do not necessarily take the same level of the GEPT. To establish more constructive relationships among teaching, learning, and assessment, it is desirable to create a common scale across all levels of the GEPT by vertical scaling (Kolen & Brennan, 2004). To this end, the present paper reports procedures for constructing a vertical scale for four GEPT levels (Elementary, Intermediate, High-Intermediate, Advanced; roughly equivalent to
CEFR A2-C1, respectively) through the use of Item Response Theory. The study employed a non-equivalent groups anchor test design, and the test items were grouped into three testlets, each containing shortened versions of two adjacent levels, to prevent the data from being contaminated by test-taker fatigue. The between-item multidimensionality model was used to analyze the score data from a total of 1,270 learners. According to the concurrent estimations of learners' ability (θ) on the common score scale, the corresponding θ of GEPT scores across levels were estimated (in logits). Thus, scores from different levels can be compared on the vertical scale. Major findings include: 1. A good fit with the Rasch model was obtained. 2. An increase of difficulty across the GEPT levels provided internal validation evidence to support the hierarchical sequence. 3. A difference of 1.0 logit was found between the Elementary level and the Intermediate level, and between the Intermediate level and the High-Intermediate level. However, a wider gap of 1.5 logits was found between the High-Intermediate level and the Advanced level, suggesting that more learning is required for students to achieve the Advanced level. It is worth noting, however, that vertically linking tests of different levels provides a weaker degree of comparability than horizontal linking or equating. Despite the constraints on the inferences that vertical linking can support, it is recommended that test-takers' logit scores on the common scale be provided as a supplement to their GEPT scores when the test results are reported. In this way, the usefulness of the GEPT can be enhanced in order to better inform teaching and learning.

A students' perspective in a high-stakes test: Attitudes, contributing factors and test performance

Ying Bai, University of Melbourne, yingb@student.unimelb.edu.au

Test takers/students, predominantly passive receptors in the testing process, have rarely had their voices and opinions taken into account in test development. When the social dimensions and consequences of tests are included as evidence of validity (Bachman, 1996; Bachman and Palmer, 2010; Davies, 2008; Kane, 2001, 2002; Kunnan, 2000, 2003; McNamara and Roever, 2006; Messick, 1989; Shohamy, 2001), the test stakeholders' position has to be re-evaluated. However, students'/test takers' views and their possible contributions to test development are still under investigation. This project seeks to investigate a high-stakes test (the College English Test, CET) in China from the students' perspective. The CET, since its inception in the 1980s, has aimed to play a positive role in promoting effective English teaching and learning commensurate with the College English Curriculum Requirements (CECR), which define the requirements of each English language skill at tertiary level and give special emphasis to English communicative skills. To achieve this goal, the CET has gone through several major revisions in its development. According to the latest CET syllabus (2006, p. 3), it aims to "accurately examine students' integrated language skills and promote the implementation of the CECR". This study is composed of two phases, with Phase I exploring students' attitudes towards test quality, test use and test preparation approach using qualitative data, and Phase II exploring the possible relationships between students' attitudes, potential contributing factors to these attitudes and their test performance, using structural equation modelling (SEM) analysis. Twenty second-year non-English-major students from two universities took part in focus group interviews in Phase I, with 2-4 students in each 1-to-1.5-hour semi-structured discussion. Students
demonstrated a mixed attitude towards test quality: they believe the CET in general is a valid test, but its major test format (multiple choice) and familiarity with test content, which could be acquired through sufficient test preparation exercises, may have had an effect on their test performance. They showed a supportive attitude towards test use, but the reasons for such attitudes differ from the intended rationale. Their attitudes towards the test preparation approach are largely influenced by language policy at the local level: they are more supportive of their English teaching when the policy focuses on improving English use competence, and less so when it focuses more on test results. However, due to the pressure of this high-stakes test, students' English learning activities outside the classroom mainly focus on CET preparation. The Phase II questionnaire, based on the interview results, was completed by 358 university students. The SEM analysis demonstrated that students' English learning motivations and test-related learning experience positively affected their attitudes. Students' attitudes towards test use and test preparation approach were positively and negatively related to their test performance, respectively. Students' attitudes towards test quality had no effect on their test performance. It seems that students' perceptions can make a significant contribution to revealing the complexity of the interplay between testing policies at a macro level and their implementation in local settings.

11:20-11:45am

What does the cloze test really test? A replication with eye-tracking data

Paula Winke, Michigan State University, winke@msu.edu
Shinhye Lee, Michigan State University, leeshin2@msu.edu
Daniel Walter, dswalter@gmail.com
Katie Weyant, katie.weyant@gmail.com
Suthathip Thirakunkovit, Purdue, thirakun@purdue.edu
Xun Yan, Purdue, yan50@purdue.edu

One way to assess students' language skills is through a Cloze test (Bachman & Palmer, 2010; Brown, 2005). To make a Cloze test, teachers can take a paragraph, delete certain words, such as every 7th or 9th one, and have students fill in the blanks. Teachers can also select a specific linguistic feature to delete, such as articles or modals. Over the years, while researchers have used Cloze tests to assess the proficiency of language learners (Fairclough, 2011; Horst, Cobb & Nicolae, 2005; Baltova, 2000), other researchers have debated what the construct of a Cloze test is (Carr, 2011; Purpura, 2004). That is, what does it measure? Reading comprehension, vocabulary or grammatical knowledge, or something else? In 2011, Tremblay (in a paper published in SSLA) argued that a French cloze test she created was a measure of reading comprehension and French proficiency. She suggested researchers could use her test to assess the French of learners at any level. Intrigued by her work, we had 22 individuals with different levels of French knowledge take her Cloze test online. Specifically, the participants were 4 groups of 4 to 5 students, each group from 1st- through 4th-year university-level French classes, respectively; 1 non-traditional learner; 2 professors of French; and 1 native speaker. By
tracking test takers' eye movements while taking the test, we were able to observe whether the participants read the passage, or whether they focused their eyes solely on the text around the blanks. We found that the test's average item discrimination was 0.345 when we set the test's cut score at below versus above 2 years of French instruction, and 0.341 with the cut at 3 years. These data support Tremblay's argument that the test can be used to segregate low-level learners from upper-level ones. But we argue that the test may not be appropriate for all learner levels. We use interview data and eye-movement records to argue that the test construct (what is being measured) differs depending on proficiency level, with lower-level students in particular demonstrating that they did not read (or were unable to read) the text. We argue that Cloze testing is a discrete-point vocabulary and grammar test, depending on the missing text, and is not a test or task that normally involves reading comprehension in the traditional sense.

Factors contributing to fluency ratings in classroom-based assessment

Douglas Margolis, University of Wisconsin-River Falls, douglas.margolis@uwrf.edu

When language teachers focus on pronunciation, what should be prioritized to most help learners? The pronunciation literature is somewhat split between emphasizing prosody versus phonemic articulation. Many learners frequently express the desire to speak like a native speaker, suggesting a focus on both areas is necessary. But a growing body of literature repudiates the native speaker norm for both practical and ideological reasons, calling into question instructional choices and assessment approaches aimed at helping learners advance their pronunciation to the level of fluent and comprehensible speech that avoids spoiling learners' second language interactions. For this reason, this study investigated the features of pronunciation most affecting perceptions of fluency and comprehensibility. The objective of the study was to examine exactly what features of student utterances contributed most to oral assessment scores, particularly the fluency and comprehensibility ratings. The researcher collected digitally recorded speech samples from eleven Taiwanese English learners and eleven Korean English learners in three task conditions: (a) reading a script, (b) comparing two pictures of houses, and (c) narrating a story. All speakers were university students living in either Taipei or Seoul. Undergraduates were then asked to rate the speech samples for fluency and comprehensibility. After the ratings were completed, the voice samples were transcribed, analyzed, and coded for grammatical, phonemic, and prosodic errors, and then subjected to correlational and factor analyses to assess the types of errors contributing most to fluency and comprehensibility ratings and the degree to which task type influenced ratings. Vowel errors were aggregated into two categories: (a) tense-lax pairs and (b) other. Consonant errors were aggregated into three categories: (a) L1 variants, (b) L1 absent, and (c) other. Both vowel and consonant errors were also aggregated into a Phonemic category. Stress and intonation issues were aggregated into one Prosodic category. Mean length of pauses in milliseconds and mean length of utterances were also calculated. Finally, the total number of grammatical errors was included in the analysis, though these issues were not coded for finer-grained analyses. The results suggest that prosodic issues most influenced the ratings, especially the fluency scores. Nonetheless, phonemic errors also lowered fluency ratings. Grammatical errors may also have impacted fluency ratings, but the data in this study are too limited to draw a firm conclusion. Moreover, significant differences were found between fluency ratings of the different tasks. Thus, this study suggests that best practice in classroom assessment of pronunciation dictates having multiple task conditions, and that prosody should perhaps be weighted more than phonemic articulation. The results will be presented and discussed in terms of the implications for teaching, studying, and
assessing pronunciation.

Language assessment literacy in practice: A case study of a Chinese university English teacher

Yueting Xu, University of Hong Kong, traceyxu@hku.hk

What is the language assessment literacy of the teacher? How is it developed and shaped in a particular context? The findings reveal five themes that comprise Linda's assessment literacy: 1) heightened awareness of the importance of quality feedback to student learning; 2) structured plans and procedures to engage students in the assessment process; 3) strategic responses to unethical practices arising from assessment; 4) uncertainties surrounding grading; and 5) inadequate preparedness for test construction. These themes are discussed in relation to Linda's
3:00-3:25pm

Test use in the transition from university to the workplace: Stakeholder perceptions of academic and professional writing demands

Ute Knoch, University of Melbourne, uknoch@unimelb.edu.au
Susy Macqueen, University of Melbourne, susym@unimelb.edu.au
Lyn May, Queensland University of Technology, lynette.may@qut.edu.au
John Pill, University of Melbourne, tpill@unimelb.edu.au
Neomy Storch, University of Melbourne, neomys@unimelb.edu.au

students' and lecturers' perceptions of the writing demands both at university and in the workplace, the students' perceptions of their preparedness for the language demands of their chosen profession, as well as the similarities and differences between IELTS language, university language and the perceived language demands of the profession. Lecturers of final year core courses were asked about common problems in students' writing and about typical writing tasks in university courses. In addition, interviews were conducted with new graduates, employers of new graduates and members of professional bodies to explore their experience of workplace writing and its relevance to the university writing tasks and those used in IELTS. While all stakeholders agreed on the importance of 'effective' writing in both university and professional contexts, findings point to concern regarding the prevalence of group writing projects used for assessment in engineering degrees. When compared with the university writing tasks, a potential disjunct was apparent in terms of text types, the tailoring of a text for a specific audience and processes of
Does accent strength affect performance on a listening comprehension test for interactive lectures?

Spiros Papageorgiou, Educational Testing Service, spapageorgiou@ets.org
Gary Ockey, Iowa State University, garyockey@hotmail.com

The construct of second language listening ability is increasingly viewed as the ability to comprehend multiple dialects of the target language, and as a result, a growing number of researchers and practitioners advocate the use of a variety of accents for assessing second language listening ability. However, research indicates that the use of multiple accents, some of which are unfamiliar to many of the targeted test takers, threatens test fairness, because familiarity with an accent has been shown to relate to better comprehension (Adank et al., 2009; Adank & Janse, 2010; Derwing & Munro, 1997; Gass & Varonis, 1984; Ockey & French, 2014). These studies have focused mainly on monologic discourse, i.e., a single speaker giving a talk, lecture or presentation. However, discourse researchers point out that most presenters monitor their speech and observe the reactions of their audience (Fox Tree, 1999). Because of this monitoring, monologues normally contain two types of discourse: main discourse, that is, the actual topic being discussed, and subsidiary discourse, that is, discourse concerned with the reception of the subject matter and monitoring the discussion (Coulthard & Montgomery, 1981). Academic lectures often contain such elements of interactivity, for example questions by the students and clarifications and further elaboration by the lecturers. It is not known whether the effect on performance found by previous accent studies also holds with these commonly encountered interactive lectures in a listening comprehension test, in particular when only the presenter's accent is unfamiliar. Investigation of interactive lectures is particularly important because, as found in the listening assessment literature, multiple perspectives by different speakers facilitate comprehension of aural input. If this is the case with interactive academic lectures, less familiar accents might be fairly used in tests containing such input. Using Ockey and French's (2014) Strength of Accent Scale, nine accent strengths were used to determine the extent to which comprehension of interactive lectures is affected by accent strength. All speakers were recorded giving the same lecture. A large sample of test takers (N = 21,726) were then randomly assigned to listen to the lecture given by one of the speakers and respond to six identical comprehension items and a survey designed to assess their familiarity with various accents. The test taker responses were further analyzed with the Rasch computer program WINSTEPS (Linacre, 2009) to investigate the relative difficulty of the items associated with the nine versions of the interactive lectures. Results indicated that comprehension of the interactive lectures was diminished after a threshold similar to that found by Ockey and French (2014) for monologic speech, but there was some evidence that this threshold may be slightly higher for the interactive lectures. The implications for the development of listening comprehension tests are discussed.

An investigation into cognitive evidence for validating a visual literacy task in listening assessment

Luna Yang, School of Foreign Languages and Literature, Beijing Normal University, yanglvna@163.com
Zunmin Wu, School of Foreign Languages and Literature, Beijing Normal University, wzm@bnu.edu.cn

Listening comprehension is now perceived as an interactive process rather than a receptive one. Meanwhile, it is necessary to recognize the importance of visual literacy for the children of
the technology generation. Despite the popularity of different visual representations, such as photographs and pictures, maps, videos, etc., in textbooks and classrooms, much research evidence shows that teaching and learning of visual literacy skills are not sufficient (Bowen & Roth, 2005; Eilam, 2012; Miller & Eilam, 2008). Since it is common practice in language testing to provide visual aids for low-proficiency language learners, and it is also authentic to listen while perceiving and interpreting the meaning of images at the same time, it is essential to assess students' visual literacy in listening comprehension. This presentation reports on a study collecting cognitive evidence from both students and item writers to validate the listening task Look, Listen and Match in the listening component of an English test for Grade 8 students (G8, 14-year-olds). The English test is administered together with Math, Chinese and Science subjects in the nationwide Regional Diagnostic Assessment project launched by the National Innovation Center for Assessment of Basic Education Quality in China in 2013. The overall purpose is to provide the educational administration at different levels and middle schools with feedback information, as well as to improve people's understanding of assessment as learning and teachers' assessment literacy, concurrently, from the analysis of data collected from 50,000 to 100,000 G8 students annually. The listening task is designed in alignment with the objectives for listening skills and resource strategy at the fourth level for G8 students set in the National Compulsory Education English Curriculum Standards mandated by the Ministry of Education of China in 2011. In the integrative listening task, students are required to listen to recordings and understand the information and directions so as to accomplish the task of matching information with visual representations of photos and maps. Using IRT, all the test items were calibrated with ConQuest to determine their difficulty levels in the form of item difficulty measures. The generated Item Characteristic Curves (ICCs) illustrated the expected and observed responses for each item. Students of different proficiency levels (mastery and non-mastery) were invited to re-take the test, and their test-taking process was video-recorded. They were then asked to recall what they were thinking and how they processed the audio and visual information and worked out the answer to each item. In addition, retrospective data from the item writers who constructed these items were also collected. The qualitative data were transcribed and analyzed with NVivo 10. The quantitative performances and cognitive processes of students at different levels are compared, while qualitative data from both students and item writers are analyzed and discussed. The implications of the study for item writing, and for the instruction of students' visual literacy and listening proficiency, are presented.

3:30-3:55pm

Tackling the issue of L1 influenced pronunciation in English as a lingua franca communication contexts: The case of aviation

Hyejeong Kim, University of Melbourne, hykim@unimelb.edu.au
Rosey Billington, University of Melbourne, rbil@unimelb.edu.au

English has been used as the default language in pilot-air traffic controller radiotelephony communication since 1951, and is therefore used as a lingua franca (ELF) in the international radiotelephony communication context. In 2003, concerns over the role of insufficient English proficiency on the part of non-native English speaking aviation personnel led to the establishment of Language Proficiency Requirements (LPRs) by the International Civil Aviation Organization (ICAO). In the LPRs, six assessment criteria, including Pronunciation, Structure, Vocabulary, Fluency, Comprehension and Interactions, and associated rating scales and descriptors are specified as the basis for
any English proficiency tests developed to meet the LPRs. However, there is a lack of fit between the stipulated requirements and the actual communicative demands of radiotelephony communication from the viewpoint of domain specialists, an issue that has been previously addressed and criticised (e.g., Kim and Elder, 2014, 2009; Knoch, 2014). Another challenge in the field of language testing is the lack of a means to easily incorporate additional features into descriptors of ELF communication. Focusing on the role of the Pronunciation and Comprehension criteria in radiotelephony exchanges between speakers with various L1 backgrounds, we provide a response to some of these challenges. This paper investigates the potential effects of L1 transferred pronunciations on the understanding of messages in radiotelephony communication. To explore these possible influences, a radiotelephony discourse between a French pilot and a Korean air traffic controller in an abnormal situation was collected from the international air traffic control tower in Korea. Within the framework of grounded ethnography (Douglas, 2000; Douglas and Selinker, 1994), the discourse was analysed from the perspectives of aviation experts: three Korean airline pilots and three Korean air traffic controllers. Their feedback on the performance of the two contributors in the discourse was analysed using a thematically based coding scheme. Observations the experts made on pronunciation features present in the discourse were then supplemented by a close inspection of the phonetic detail of relevant sections of the interaction, using Praat (Boersma and Weenink, 2014). The features from these analyses were then compared with the Pronunciation and Comprehension criteria and the associated proficiency descriptors specified in the ICAO rating scale. Results reveal that L1 influenced pronunciation is one of the major factors causing miscommunication on the part of the speaker, together with unfamiliarity with that pronunciation on the part of the listener. Findings also suggest that the descriptors for the Pronunciation and Comprehension criteria in the ICAO proficiency rating scale do not reflect the ways the language users (i.e., pilots and air traffic controllers) deal with the issue of L1 influenced pronunciation, which is one of the challenges in the international radiotelephony communication context. Accordingly, it is concluded that a new conceptualisation is needed to take up the real matters in the assessment of ELF in the aviation context, as well as in other ELF contexts at large.

Examining the effect of DIF and anchor items on equating invariance in computer-based EFL listening assessment

Shangchao Min, Zhejiang University, msc@zju.edu.cn
Lianzhen He, Zhejiang University, hlz@zju.edu.cn

While differential item functioning (DIF) analysis has played a principal role in the evaluation of fairness at the item level (Kunnan, 2014), equating invariance concerns fairness at the test score level (Huggins & Penfield, 2012); a lack of invariance indicates that test takers with the same score on the scale, yet belonging to different subpopulations, obtain different expected test scores on the equated scale. Given the abundant literature on DIF analysis in language assessment (e.g., Elder, 1996; Pae, 2012; Ryan & Bachman, 1992), it is surprising that little effort has been devoted to examining equating invariance in language tests, especially considering the growing use of parallel test forms in high-stakes language tests in the past decade. Additionally, to our knowledge no empirical study has investigated the connection between DIF and equating invariance in the field of language assessment. To fill this gap, the present study investigates the equating invariance across gender of a computer-based EFL listening test (CBLT) and, in particular, the effect of DIF manifested in anchor items on equating dependence. The CBLT is part of the graduation requirements for non-English majors at a major Chinese university. The datasets used were the item-level responses of 8,203 test takers, 60% male and 40% female, to 15 different forms of the CBLT. Anchor items were assessed for
both uniform and non-uniform DIF across gender in the different forms using the item response theory likelihood ratio test. The Stocking and Lord (1983) scaling method under IRT true-score equating was implemented via IRTEQ (Han, 2009). Equating invariance over gender subpopulations was compared under three conditions: uniform DIF anchor items, non-uniform DIF anchor items, and no DIF anchor items. The findings showed that population invariance over gender subpopulations did not hold for some forms of the CBLT, as evidenced by RMSD and REMSD values larger than a difference-that-matters (DTM) benchmark of 0.5 score points. However, eliminating DIF anchor items, both uniform and non-uniform, resulted in stronger population invariance in equating across all score levels. Most importantly, equating based on only DIF-free anchor items led to a more consistent pass/fail classification decision between the subpopulation and total-population equating functions at the recommended CBLT cut-off score of 30, which is what really matters for the CBLT. The findings indicated that the presence of DIF anchor items indeed dampened population invariance in equating, verifying previous researchers' conjectures (von Davier & Wilson, 2007) and simulation results (Huggins, 2014) with real-data examples. Theoretically, the present study, by combining these two aspects of measurement invariance and pinpointing the causes of a lack of equating invariance, might help develop a more holistic definition of fairness. Practically, it points to the need for language testing practitioners to study DIF in anchor items before conducting equating, and it provides them with guidelines on how to maximize score equity in equating.

Developing and validating achievement-based assessments of student learning outcomes in an intensive English program

Ryan Lidster, Indiana University, rflidste@umail.iu.edu
Sun-Young Shin, Indiana University, shin36@indiana.edu

Until recently, our Intensive English Program (IEP), like many, used a test of global proficiency to assess student readiness for advancement, despite the curriculum's stated goal and the requirement of many language program accreditation agencies, including the Commission on English Program Language Accreditation, that students be advanced based on their level of achievement of student learning outcomes (SLOs). Typically, to measure SLO achievement, IEPs use course grades either alone or in combination with a standardized assessment tool (Jensen & Hansen, 2003). However, grades are composite measures which include many factors beyond SLO mastery and which are subject to variation in the means of assessment and rating severity across instructors. Meanwhile, score gains on standardized proficiency tests cannot be linked directly to SLO achievement. Thus, assessing SLO mastery in a reliable and valid manner presents a formidable challenge for language programs, but doing so is crucial for evaluating program effectiveness and demonstrating to stakeholders what students in the program actually become able to do through instruction (Norris, 2006). This study examines the development and validation of a set of in-house language assessments specifically designed to measure mastery of curricular learning outcomes in a multi-level IEP housed in a large Midwestern university. There were three primary goals for developing a battery of level-based, achievement-oriented assessment instruments for each level. First, developing the assessments would lead instructors and administrators to refine and operationalize the curriculum's stated goals and evaluate the appropriateness of its current design
and implementation (Byrnes, 2002). Second, the assessments would provide models enabling students and novice instructors to understand what is expected of successful students at each level. Third, performance data would provide valuable information on how students were actually performing at different levels over time, which in turn could inform curriculum revision (Sempere et al., 2009). In order to realize these goals, assessment development was designed as an iterative process involving cycles of item writing, teacher and administrator evaluation, piloting, analysis, and revision. Content validity was examined at different levels, with separate teacher judgments on the appropriateness of each item's topic (especially important in a content-based instruction program), discourse structure, grammar, and vocabulary. Student performance was analyzed both test-internally, using traditional CTT analyses, and in comparison to course grades and separately collected teacher judgments of student achievement levels for individual SLOs. Results provide evidence that the tests corresponded well to other measures of SLO mastery; more importantly, the act of explicating the connection between program-level assessment and classroom instruction has resulted in a clearer curricular focus. Feedback from both teachers and students suggests the tests were successful in promoting positive washback. Thus, considerable progress has already been made toward the first two major goals of this assessment, and tracking student performance over time will reveal how and where instructional reform is most warranted, underscoring the importance and value of greater coordination between teaching and assessment practices.

4:00-4:25pm

Promoting credible language assessment in courts: Two forensic cases

Margaret van Naerssen, Dr., margaret.vannaerssen@gmail.com

Users of language assessment in legal cases need to promote credibility for language assessment in courts by going beyond simply providing a score. Unfortunately, a judge might believe a test score is enough: "Looks scientific. It's a test, isn't it?" Two cases involving non-native English speakers provide the legal contexts: a murder and an undercover sting operation. Language assessment and linking evidence argumentation are presented. The audience has time to briefly become opposing attorneys, cross-examining the "expert witness" (the presenter) on choices of language assessment and links to the legal question. Here are two situations commonly encountered. You are an applied linguist/language assessment expert. An attorney calls you, asking you to give an opinion on a client's English proficiency. The attorney offers to send you a video of the police interacting with the client and asks you to say whether the client understood what the police said at the police station. You might be asked to use the interrogation video for your language testing, or even to do it by phone, both also saving the cost of a live language assessment. Lacking an audio recording, you might be asked simply to test the client's English and give an opinion on whether the client understood the police officer during a traffic stop. You need to be prepared to explain that it is not enough to simply assess a defendant's language proficiency and then jump to opining on whether the defendant understood. However, attorneys, and even judges and jury members, may not realize the limits of a language test. Also, using a video to test language proficiency would not meet assessment standards. The objectives of this presentation are to illustrate: 1. the legal/court requirements placed on a language
testing expert; 2. the varied roles language assessment might have in a case in the US legal system; 3. the importance of linking language assessment to the legal question through evidence argumentation; and 4. the importance of being prepared to answer the question: How do you know the person isn't just "faking" his/her language proficiency? The Federal Rules of Evidence set the criteria for expert witness testimony in US federal courts and in most state courts, though some states still follow the earlier Frye standard. (Some other countries have similar rules.) A prospective expert witness should understand the rules determining whether the expert is allowed to testify and the acceptability of the language assessment methods. (A Rules handout is provided.) Judges are sometimes persuaded to accept assessment instruments based on reliability data; experts should be prepared to argue for validity as the primary principle for assessing an individual's communication abilities in a specific context. Insights from several disciplines can help build links between language evidence and legal questions. Experts may also need to quickly develop a supplementary task, taking care to ground it in theory, research and practice. Forensic linguists usually do not conduct traditional research: analyses are done on a single-subject basis. However, experts can draw on precedents and research from relevant disciplines.

High-stakes test preparation programs and learning outcomes: A context-specific study of learners' performance on IELTS

Shahrzad Saif, Université Laval, shahrzad.saif@lli.ulaval.ca
Liying Cheng, Queen's University, liying.cheng@queensu.ca
Mohammad Rahimi, Université du Québec à Montréal, rahimi.mohammad@uqam.ca

Standardized English language test scores are increasingly used worldwide by university admission offices and government officials for making high-stakes decisions regarding university admissions and immigration to English-speaking countries. This trend has, in turn, given rise to the institutionalization of test preparation activities, particularly in EFL contexts. Whereas previous research has established that high-stakes tests tend to affect the teaching and learning activities in the immediate classroom environment (Alderson & Hamp-Lyons, 1996; Alderson, 2004; Wall & Horak, 2006), there is no systematic evidence that test-preparation courses offered by test-preparation centers significantly raise test scores and, more importantly, English language proficiency, particularly in EFL contexts (Green, 2007; Matoush & Fu, 2012). This study explores the nature of the test-preparation effect on learners' test performance and their language proficiency in an Iranian tertiary EFL context. The research questions addressed are (1) whether, and to what extent, test preparation enhances test performance, as shown by an increase in test scores, and (2) whether preparing for a high-stakes test also improves test-takers' proficiency in English. Using a sequential exploratory mixed-methods design (Creswell & Plano Clark, 2011), the study was conducted in two consecutive phases in a major IELTS preparation center in Iran. The focus of Phase I was to understand the nature of test preparation activities, different stakeholders'
perceptions of the test, and their interactions with the test content. In this phase, qualitative data were gathered from 65 stakeholders (ESL teachers, test center administrators, and test-takers) using questionnaires, interviews, focus group interviews, and observations. In Phase II of the study, the effects of 10-week-long Academic IELTS preparation courses on learners' test performance and English language proficiency were empirically examined. Based on the results of a proficiency test administered before training, 201 EFL learners in two homogeneous groups, control (General English) and experimental (IELTS preparation), participated in the study. The results of two-way repeated-measures ANOVAs and independent- and paired-samples t-tests showed a significant effect for time and group; i.e., both the control and experimental groups significantly improved their IELTS and proficiency test scores by the end of the experiment. However, the post-test mean score for the experimental (IELTS) group turned out to be significantly higher than that of the control (General English) group, pointing to a possible effect of test preparation. Test preparation helped the IELTS group raise their test scores and, at the same time, improve their English language proficiency more than the General English group by the end of the treatment. The evidence from the qualitative analysis of data gathered in Phase I revealed that teachers' and learners' perceptions of the test content, as well as a variety of context-specific factors (such as the culture of the test preparation center, the length of the program, test-takers' expectations, and their motivation for learning English), influence what is taught and learned in test preparation classes. The implications of this study for the validity of high-stakes test scores are discussed.

A mixed-methods investigation into young learners' cognitive and metacognitive strategy use in a reading test

Gina Park, University of Toronto, gina.park@mail.utoronto.ca
Maggie Dunlop, University of Toronto, maggie.dunlop@mail.utoronto.ca
Edith van der Boom, University of Toronto, edith.vanderboom@mail.utoronto.ca
Eunice Eunhee Jang, University of Toronto, eun.jang@utoronto.ca

Comprehending academic reading materials requires a wide range of cognitive skills and metacognitive strategies. Cognitive reading skills refer to the mental processes and subconscious actions readers engage in to comprehend reading materials, and metacognitive strategies pertain to the regulative mechanisms that readers deliberately use to monitor their own reading (Pang, 2008; Sheorey & Mokhtari, 2001). As both play a critical role in reading comprehension (Williams, 2002), learners who have a limited repertoire of reading skills and strategies may have difficulty comprehending academic texts with abstract or unfamiliar concepts (Oxford, 2002). While a considerable amount of literature has been published on strategy use in reading comprehension (e.g., Anderson, 1991; Block, 1992; Pressley & Afflerbach, 1995), there is little research on how young language learners process academic texts and how their psychosocial characteristics influence their mental processes during reading. The present study investigated the extent to which young learners with different learning profiles implement different cognitive skills and metacognitive strategies. The study employed a developmental mixed-methods research design (Maxwell & Loomis, 2003). In the quantitative phase, cognitive reading skill profiles for Grade 6 students (n=90) were obtained by applying cognitive diagnostic modeling (CDM) to the
population's reading test performance, while profiles of their learning goal orientations and perceived ability were created from latent trait profiling and self-assessment. In the qualitative phase, think-aloud verbal protocols with students (n=14) were used to understand whether students with different profiles process academic reading materials differently. The results showed that although inferencing (e.g., predicting future events, making connections between the text and personal experience) is a highly sought strategy among young learners, its effect was subject to students' reading skill mastery levels. For example, while high cognitive skill masters' use of inferencing skills led to successful performance on reading tasks, low cognitive skill masters tended to recall irrelevant background knowledge, resulting in misunderstanding of the text. The results also showed that learners' goal orientations influence reading strategy use. While performance goal-oriented students tended to use strategies associated with re-reading texts, or test-taking strategies such as eliminating distractors, mastery-oriented students tended to use planning/monitoring strategies and were less interested in confirming correct answers. Notably, mastery-oriented, low skill masters showed a high level of engagement in learning throughout the think-aloud session. The research concludes that successful application of the inferencing strategy requires both basic textual comprehension and making connections; without the former, inferencing does not compensate for low skill masters' limited comprehension. Moreover, given that students' goal orientations prompt different reading skills, it is essential to understand their goal orientations in order to help them develop and utilize diverse metacognitive strategies so that they can self-regulate their own strategy use.

Fri, Mar 20

10:20-10:45am

Aligning teaching, learning, and assessment in EAP instruction? Stakeholders' views, external influences, and researchers' perspectives

Talia Isaacs, University of Bristol, talia.isaacs@bristol.ac.uk
Carolyn Turner, McGill University, carolyn.turner@mcgill.ca

This exploratory case study examines the alignment between teacher and student perceptions and classroom practices in a Canadian pre-sessional English for Academic Purposes (EAP) program with respect to the assessment bridge: the link between teaching, learning, assessment, and curricular objectives (Colby-Kelly & Turner, 2007). Due to EAP teachers' pivotal role in curriculum development, the research site was viewed as having a heightened potential for assessment activities (both formative and summative) to be viewed by stakeholders as integral to classroom activity, rather than as an afterthought detached from teaching and learning. The overarching goal was to investigate stakeholders' perceptions of what "counts" as assessment, the purpose and value of tasks not directly linked to summative assessments, the role of pre-emptive and incidental feedback in gauging what students know and facilitating learning, and the ways in which external pressures mediate classroom activity. Forty hours of field notes from classroom observations of two advanced-level intact classes (4 experienced teachers; 21 students) were inductively coded and analyzed using macro-level categories for (1) the assessment bridge (instances of everyday assessment), (2) teachers' and learners' perceptions of assessment, (3)
formal classroom assessments, (4) feedback incidents, and (5) extraneous influences beyond the classroom. The observational data were triangulated with post-course interview data and content analyses of curricular materials. The results paint a rich and varied picture of classroom assessment practices, with each teacher relying on idiosyncratic strategies to facilitate learning (e.g., using abstract analogies; considering developmental readiness in feedback provision). Occasionally, students' preoccupation with assessments that are consequential (i.e., count for marks), and their awareness of the prestige value accorded to standardized (e.g., TOEFL) versus in-house tests, conflicted with teachers' promotion of the educational value of classroom activities, as reflected in teachers' frequent and often frustrated attempts to redirect the focus back to formative components. Overall, students did not appear to subscribe to the view, as articulated by one teacher, that "everything that you're doing is a part of learning," revealing a complex power play between competing notions of what counts in the classroom. The results exemplify ways in which the teachers prioritized learning across all classroom activity, in line with current research investigating learning-oriented assessment (LOA) in classroom contexts (Turner & Purpura, in press). The data generate a snapshot of classroom stakeholders' mindsets demonstrating an expanded view of assessment (i.e., broader than formal quizzes/tests only) on the part of the teachers, but not necessarily of the students. Teachers capitalized on "assessable moments" in an attempt to enhance learning, which provided evidence of the use of different forms of assessment situations (both planned and unplanned) to benefit further learning on an ongoing basis. This corresponds to an LOA approach in which student performance information is used to guide and support the learning process. On the other hand, evidence from the student data has implications for the work needed to create a learning-oriented culture in classrooms, where an expanded definition of assessment is considered a vehicle for further learning by all stakeholders.

Diagnostic profiling of foreign language readers and writers

Ari Huhta, University of Jyväskylä, ari.huhta@jyu.fi
Charles Alderson, Lancaster University, c.alderson@lancaster.ac.uk
Lea Nieminen, University of Jyväskylä, lea.s.m.nieminen@jyu.fi
Riikka Ullakonoja, University of Jyväskylä, riikka.ullakonoja@jyu.fi

In recent years, language testing researchers have shown interest in diagnostic testing, an area that intersects language testing and second language acquisition research. Diagnostic testing, however, requires a better and more detailed understanding of language abilities than is currently available, and this has challenged testers to define their diagnostic constructs both theoretically and operationally. This paper is based on a research project on the diagnosis of reading and writing, which studied two groups of Finnish learners of English as a foreign language: 14- and 17-year-olds. Both groups consisted of about 200 learners who completed a range of linguistic and cognitive tasks and replied to questionnaires about their background (e.g., reading and writing habits, attitudes to reading/writing) and their motivation to use and learn English. The learners also completed several reading and writing tasks from an international English language examination and a previous research project. Writing performances were double-rated by trained raters, and the ratings were analyzed with the multifaceted Rasch program Facets. Rasch analysis was also employed in the analysis of the reading test scores. The results reported here complement our earlier variable-centered findings, which indicated that certain cognitive, linguistic and motivational/background measures could predict FL reading and writing in the above-mentioned groups. The current analyses build on the results of those measures and employ them in person-centered analyses that focus on similarities and differences among people. The approach adopted here is
among people. The approach adopted here is
the latent profile analysis (LPA), which seeks to identify homogeneous subgroups of individuals, with each subgroup possessing a unique set of characteristics that differentiates it from the other subgroups. At the conceptual level, LPA is similar to cluster analysis, but the key difference is that the validity of LPA results can be tested statistically. The software used to estimate the latent profiles was Mplus v. 7.2. The cognitive dimensions included in the analyses were the efficiency of working memory, phonological processing capacity, and ease of access to vocabulary in memory. The linguistic dimensions were the size of FL vocabulary, the ability to (linguistically) segment texts in the L1 and FL, and L1 reading and writing. Background variables included learners' frequency of using the FL and their motivation to use it. The results of the study contribute to the emerging research base on diagnosing FL proficiency and its development by providing insights into how FL readers and writers are similar or different with respect to their cognitive, linguistic and motivational features. The findings may also allow us to characterize strong versus weak readers and writers more precisely.

Validation of a web-based test of ESL sociopragmatics

Carsten Roever, University of Melbourne, carsten@unimelb.edu.au
Catherine Elder, University of Melbourne, caelder@unimelb.edu.au
Catriona Fraser, University of Melbourne, fraserc@unimelb.edu.au

2010; Mey, 2001). While pragmatic competence is widely acknowledged as a component of overall communicative competence (Bachman & Palmer, 2010; Canale & Swain, 1980; Purpura, 2004), not many tests exist that assess second language learners' pragmatic ability, and most have narrow construct coverage. Tests of L2 sociopragmatics have traditionally focused on metapragmatic judgment of speech act politeness and/or appropriateness of production (Hudson, Detmer, & Brown, 1992, 1995; Liu, 2006; Yamashita, 1996; Yoshitake, 1997), whereas pragmalinguistic tests assess comprehension of implied meaning and conventional production in social situations (Itomitsu, 2009; Roever, 2005). Recent tests have also evaluated learners' ability to convey pragmatic meanings in extended role play conversations (Grabowski, 2009; Youn, 2013). However, most sociopragmatic tests have been limited to a small range of speech acts, and role play tests are less practical than non-interactive formats (Roever, 2011). In this study, we developed and validated a web-based test of sociopragmatics for second language learners of Australian English. Following an argument-based validation approach (Chapelle, 2008; Kane, 2006, 2013), we defined our target domain as everyday language use in social, workplace and service settings. We designed a web-based test delivery and rating system which tested sociopragmatic knowledge using nine Likert-type appropriateness judgment tasks, five extended discourse completion tasks in which a short conversation was reproduced with one party's turns missing, four dialog choice tasks asking test takers to judge which of two dialogs was more successful, and twelve appropriateness judgment and correction tasks (based on Bardovi-Harlig & Dörnyei, 1998) in which test takers
… correlations between sections as well as a four-factor structure. As expected, the native speaker group outscored both learner groups, and the ESL group outscored the EFL group. A factorial ANOVA showed that a model incorporating length of residence, proficiency, and amount of daily interaction in English best explained the data, which is in line with findings from previous research in sociopragmatics (Bardovi-Harlig & Bastos, 2011; Matsumura, 2003; Taguchi, 2011; Wang, Brophy & Roever, 2013). Our empirical findings support the generalization, explanation and extrapolation inferences of the argument-based validation approach. In terms of utilization, we concluded that our test is most appropriate for pedagogical and diagnostic decisions, with limited utility for higher-stakes decisions.

The concurrent and predictive validity of university entrance tests

Bart Deygers, KU Leuven, Belgium, bart.deygers@arts.kuleuven.be

In order to gain access to higher education in Flanders, Belgium, non-native speakers are required to pass either PTHO or ITNA, language tests at the B2 level of the Common European Framework of Reference for Languages (CEFR). Every year, some 800 L2 students take such a language test before entering higher education. To date, little is known about the concurrent or predictive validity of either test. In spite of their highly dissimilar constructs and operationalizations, both tests are assumed to measure language at the same level and to discriminate among candidates in the same way. However, the relationship of the two tests to each other and to the target language use domain has never been the subject of rigorous study. In order to determine the concurrent validity of PTHO and ITNA, 200 prospective L2 students took both tests a week apart. The primary question guiding this leg of the study is a basic one: do both tests pass or fail the same students? Additionally, a principal component analysis was conducted to shed light on the relative importance of different criteria in each test. A multifaceted Rasch analysis serves to determine the item difficulty of both tests as well as the true ability of the candidates based on their performance on both tests. A common issue in predictive validity studies is the truncated sample problem: since it is usually only possible to track those candidates who passed a test, it is impossible to determine how failed students would have fared. This study bypasses the truncated sample problem by following students who passed one test but failed the other. The predictive validity study is primarily qualitative in nature. A group of 20 students were interviewed on a monthly basis about their linguistic functioning at university and about the language functions they were expected to perform. A small subgroup of students was followed more intensively: adopting an ethnographic approach, the researcher accompanied these students to class and observed which linguistic obstacles they faced at university. In order to determine the generalizability of the qualitative study, the whole population (N=200) filled out a questionnaire that tackled the same themes. The results of this study shed light on the validity of L2 university entrance tests and on issues of fairness in high-stakes language testing. They show how testing constructs are sometimes very distant from real-life demands and how there is not necessarily a direct link between passing a test and functioning in that test's target language use domain. Finally, this study raises salient questions concerning academic language testing constructs and the extent of the inferences one can reasonably make based on a language test score.
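The concurrent-validity question posed here, whether the two tests pass or fail the same students, amounts to cross-tabulating the two sets of decisions and isolating the discordant cases. A minimal sketch with invented toy decisions (hypothetical data, not the study's results):

```python
# Toy illustration of the pass/fail agreement question (invented data,
# not the actual PTHO/ITNA decisions).
ptho = ["pass", "pass", "fail", "pass", "fail", "fail"]
itna = ["pass", "fail", "fail", "pass", "pass", "fail"]

# Proportion of candidates receiving the same decision on both tests.
agree = sum(a == b for a, b in zip(ptho, itna))

# Candidates who passed one test but failed the other; in the study,
# this is the group the predictive-validity leg follows up.
discordant = [i for i, (a, b) in enumerate(zip(ptho, itna)) if a != b]

print(agree / len(ptho))  # agreement rate for the toy data
print(discordant)         # indices of pass-one/fail-other candidates
```

The same tally generalizes to the full 2×2 contingency table (pass/pass, pass/fail, fail/pass, fail/fail) if agreement needs to be broken down by decision.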
10:50-11:15am

Integrating task-based language teaching and assessment: towards a unified task specification framework

Koen Van Gorp, Centre for Language and Education, KU Leuven, koen.vangorp@arts.kuleuven.be

… allowing all stakeholders to track and interpret important aspects of learner development over time. This paper addresses the theoretical and practical challenges of creating a unified framework for Task-Based Language Teaching (TBLT) and assessment (TBLA). It discusses the task/test characteristics/specifications guiding the selection of task types and guaranteeing content relevance and representativeness for both pedagogic and assessment tasks. The development of such a set of task characteristics for a task-based language syllabus for both L1 and L2 learners in primary education is presented as a case. The …
What CEFR level descriptors mean to college English teachers and students in China

Yan Jin, Shanghai Jiaotong University, yjin@sjtu.edu.cn
Shaoyan Zou, Shanghai Jiaotong University, amandazsy@163.com
Xiaoyi Zhang, Shanghai Jiaotong University, zhxiaoyi1201@163.com

… Survey data were collected from a valid sample of 114 teachers and 412 students. Follow-up interviews were conducted with 8 teachers and 15 students. The teacher-participants were required to map 52 CEFR descriptors (25 of reading, 24 of writing and 3 of the global scale) at the B1, B2 and C1 levels onto the three levels of the Curriculum Requirements: basic, intermediate and advanced. The student-participants were required to assess their level of English proficiency by stating whether they could fulfil the task described in each of the 52 descriptors. Descriptive and Rasch analyses showed that both teachers and …
Investigating a construct of pragmatic and communicative language ability through email writing tasks

Jin Ryu, Teachers College, Columbia University, cellest2@hanmail.net

Despite the importance of communicative language ability (CLA) in today's globalized society, pragmatic ability, a significant part of CLA, has not been sufficiently and thoroughly investigated in the second language assessment field. Few studies have operationalized pragmatic ability within a CLA model and employed authentic tasks (Grabowski, 2013). To address such problems, the current study operationalized pragmatic ability within Purpura's (2004) theoretical model of language ability and examined all five subcomponents of pragmatic ability in his model (i.e., contextual, sociocultural, sociolinguistic, psychological, and rhetorical meanings) through authentic email writing tasks. Eighty-two ESL learners at the University Settlement Society of New York participated in the research and were each asked to write four out of eight email writing tasks for a pragmatics test while performing the speech act of apology. The eight writing tasks were administered to each examinee in various four-item testlet versions so as to maintain connectivity during data analysis. The eight pragmatic tasks in the test varied in terms of three variables: the power and distance of the sociolinguistic relationship, and the context in the task (i.e., interactional vs. transactional) (Brown & Yule, 1983; Hudson et al., 1992, 1995). Since the present research had an experimental focus, some variables, such as the speech act and degree of imposition, were controlled for to enable more specific comparisons between the pragmatic sub-scores achieved on the tasks (Hudson et al., 1992, 1995). The collected data were scored by five native-English-speaking raters on the five pragmatic subcomponents, two grammatical subcomponents (i.e., grammatical form and semantic meaningfulness), and one holistic score of pragmatic ability (Purpura, 2004). Many-facet Rasch measurement (MFRM) analyses were conducted, taking into account rater severity and task difficulty, and the true scores that were derived were used for all subsequent analyses. The findings from correlation analyses suggested a moderate amount of interaction between grammatical ability (GA) and pragmatic ability (PA) (r = .694). However, when the interaction was examined across different proficiency levels, only the low-proficiency groups had a statistically significant moderate correlation between GA and PA. Further correlation analyses also revealed the distinctiveness of the five pragmatic subcomponents from each other. While the highest correlation between the pragmatic subcomponents was found between contextual and sociocultural abilities (r = .832), rhetorical ability correlated relatively weakly with the other subcomponents. Lastly, the findings from a multiple linear regression analysis revealed that most of the variability in the holistic score was accounted for by the five pragmatic subcomponents (R² = .912), but the source of the remaining variance remains a question for the overall PA construct. Four of the five pragmatic subcomponents, all except rhetorical meaning, contributed almost equally to the variability of PA. In summary, the results provide strong evidence that one's pragmatic ability should not be inferred from one's grammatical ability alone, but should rather be diagnosed in terms of these five pragmatic subcomponents.
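The correlation figures reported in studies like this one (e.g., an r of .694 between two ability measures) are Pearson correlations computed over examinees' derived scores, with r² giving the shared variance. A minimal sketch with hypothetical scores (toy numbers, not the study's data; the helper name is illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical MFRM-derived true scores for five examinees (invented values).
grammatical = [2.1, 3.4, 2.8, 4.0, 3.1]
pragmatic = [2.5, 3.0, 3.2, 3.9, 2.9]

r = pearson_r(grammatical, pragmatic)
print(round(r, 3))      # correlation between the two toy score sets
print(round(r * r, 3))  # r-squared: proportion of shared variance
```

On Python 3.10+ the standard library offers `statistics.correlation`, which computes the same quantity.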
Modeling the relationships between construct-relevant and -irrelevant strategic behaviors and lexico-grammar test performance

Nick Zhiwei Bi, University of Sydney, zhbi6097@uni.sydney.edu.au

… cognitive strategic processing directly explained a much higher proportion of variance in test performance (18%) than metacognitive strategic processing did (1%). It was found that metacognitive strategic processing had an indirect effect (β = 0.97 × 0.43; R² = 0.17) on lexico-grammar test performance via cognitive strategic processing. This indicates that metacognitive strategies alone do not influence test performance; rather, this influence was accomplished through metacognitive processing exerting an effect over cognitive …
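The indirect effect quoted above is standard mediation arithmetic: the product of the two path coefficients. A minimal check of that multiplication, using the values as reported in the abstract (no reanalysis of the study's data):

```python
# Mediation arithmetic with the path coefficients quoted in the abstract.
# This only redoes the multiplication; it does not reanalyse the study.
beta_meta_to_cognitive = 0.97  # path: metacognitive -> cognitive processing
beta_cognitive_to_test = 0.43  # path: cognitive processing -> test performance

# Indirect effect of metacognitive processing on test performance,
# transmitted entirely via cognitive processing.
indirect_effect = beta_meta_to_cognitive * beta_cognitive_to_test
print(round(indirect_effect, 2))  # 0.42
```

Squaring this product (0.4171² ≈ 0.17) appears consistent with the R² = 0.17 the abstract reports for the indirect path.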
11:20-11:45am

Large-scale assessment for learning – a collaborative approach

Gudrun K. Erickson, University of Gothenburg, gudrun.erickson@ped.gu.se

Testing and assessment are often perceived as somewhat different, or distant, from teaching and learning – something that has to be done but should be quick and cause as little harm as possible, at individual, pedagogical and systemic levels. In particular, this goes for summative aspects of assessment, not least large-scale testing and examinations. Furthermore, there is often a sharp dividing line, drawn or experienced, between formative and summative functions of assessment; assessment for learning and assessment of learning are surprisingly seldom discussed within the same conceptual and pedagogical framework. What remains to be seen is whether the movement towards so-called learning-oriented assessment will manage to encompass, clarify and operationalize both basic functions of educational assessment, namely to enhance learning as well as fairness and equity. The current study is set within the Swedish educational context, which has a long tradition of assessment at the national level. Mandatory summative tests, important for students' continued education and life chances – and for their personal and academic self-esteem – are provided alongside extensive, formatively oriented materials. Building on an expanded view of validity, which focuses on the use and possible consequences of the procedures and products provided, the development of national language tests and other assessment materials is conducted using a collaborative approach. This aims to optimize the quality of the materials – with regard to learning as well as to fairness – but also to contribute to increased awareness and development among those involved. Consequently, different categories of teachers work together with researchers in developing and validating the materials. Moreover, in the different piloting rounds, students are interviewed and also asked to fill in questionnaires focusing on different aspects of the tasks presented. The opinions expressed are included in the overall analyses and decisions, together with other results from the pilot tests, including anchor items. The focus of the proposed paper presentation is on test-taker feedback collected from 15-year-old students during regular, large-scale pretesting rounds of summative tests (n ≈ 400/task). The data derive from Likert scales and open responses. The responses have been analysed with a multi-faceted approach, encompassing qualitative as well as quantitative methods, and are combined with achievement data. The presentation of results will include a tentative categorization of aspects of assessment perceived as positive and negative, respectively, according to students' definitions and expectations, as well as analyses of retrospective, task-related self-assessment in relation to achievement. In the latter, certain interesting gender differences have been found. In addition, some comparisons between students' and teachers' perceptions of the same tasks will be presented. The practical and conceptual use of collaboration and test-taker feedback in test design, including the creation of construct-based profiles for presenting results, is described and discussed. Finally, some data from post-test questionnaire surveys are considered in relation to the pedagogical potential of assessment – for students' learning as well as for teachers' teaching.
Detecting evidence behind the college English curriculum requirements in China: A mixed-methods study

Wen Zhao, Northeastern University, dianawen@hotmail.com
Tan Jin, Northeastern University, alextanjin@gmail.com
Boran Wang, Northeastern University, brwang1972@hotmail.com

… vocabulary tests are commonly used as indicators of language proficiency, a vocabulary test was first conducted to screen around 5,000 students from a national key university, with two dimensions of lexical knowledge measured (see Nation, 2008). A learner corpus was then established, including a unique collection of scripts written by the students screened across proficiency levels. It is worth noting that, to represent the English proficiency levels of college students in China, the corpus will involve more students from over 40 universities around the country. The collected scripts were judged by trained raters and assigned relevant CECR levels. Linguistic features, including lexical features, syntactic complexity, …
… systems can also provide instant feedback on the quality of essays by flagging ungrammatical constructions (i.e., error feedback) and offering suggestions on discourse elements such as organization and development of ideas (i.e., discourse feedback). However, little research evidence can be found to support the instructional application of current AEE systems to the ELL population. This study investigated the applicability of WriteToLearn to Chinese undergraduate English majors in terms of its scoring ability and the accuracy of its error feedback. Participants included 163 second-year English majors from a university located in Sichuan province of China, who wrote 326 essays from two writing prompts selected from the preloaded prompts in WriteToLearn. The whole set of essays was graded by WriteToLearn and four trained human raters, based on the analytic scoring rubric incorporated in WriteToLearn. Many-facet Rasch measurement (MFRM) was conducted to measure WriteToLearn's rating performance as compared to those of the human raters. Of the sample, 60 randomly selected essays (30 essays per prompt) were also annotated by the human raters for the analysis of the accuracy of WriteToLearn's error feedback in terms of its precision and recall. The MFRM analysis showed that WriteToLearn's preloaded prompts and scoring rubric functioned well in evaluating Chinese undergraduate English majors' writing ability. In addition, WriteToLearn was more consistent yet highly stringent relative to the four trained human raters in scoring essays, and it failed to score seven essays. Given its scoring consistency, severity in rating and occasional failure to score essays, it is suggested that WriteToLearn be used as a second or third rater along with one or more experienced human raters, if it is to be used for the assessment of students' writing ability. WriteToLearn's error feedback, with an overall precision of 49% and recall of 18.7%, did not meet the minimum threshold of 90% precision for it to be a reliable error-detecting tool (Burstein, Chodorow, & Leacock, 2003; Leacock, Chodorow, Gamon, & Tetreault, 2010). Furthermore, it had difficulty identifying the frequent errors made by Chinese undergraduate English majors, such as errors of article, word choice, preposition and expression. These findings extend previous AEE research on the efficacy of WriteToLearn in ELL classroom settings and provide suggestions for its application in such contexts. Currently, the implementation of WriteToLearn in ELL writing instruction may be more effective if it is used as a supplement to, rather than a replacement of, writing instructors.

Investigating constructs of L2 pragmatics through L2 learners' oral discourse and interview data

Naoki Ikeda, The University of Melbourne, nikeda@student.unimelb.edu.au

Over the past decades, developments in pragmatics have focused on speech acts (Searle, 1969) and politeness (Brown & Levinson, 1987). Recent research has examined L2 pragmatic behavior through a discursive approach (Kasper, 2006; Roever, 2011). In language assessment, pragmatics is increasingly recognized as a key domain (Ross & Kasper, 2013), although research exploring the construct of L2 pragmatics remains limited (for exceptions see Grabowski, 2009; Youn, 2013). The present study attempts to shed further light on the construct of L2 pragmatics in speaking by examining learners' oral discourse in response to open role-play assessment tasks simulating university activities. Interview data were also collected to further explain learners' oral performances. 47 participants in higher education with equivalent TOEFL iBT 59 / IELTS 6.0 or above completed six speaking tasks simulating university activities as well as an interview. The tasks were designed to elicit oral discourse in sequential organization. Three context variables (Brown & Levinson, 1987) – relative power, social distance and the degree of imposition in the task situations – were operationalized. A Likert-scale questionnaire included (a) learners' consciousness of the constructs of pragmatics in communication in the academic domain; (b) learner perceptions
of how well the tasks reflected the real-world academic domain; (c) self-assessment of pragmatic performance in relation to the construct. The construct was defined to include the ability to deliver pragmatically appropriate meaning depending on power relations and level of social distance with the interlocutor, and degree of imposition. The follow-up interview questions were designed to identify possible construct-irrelevant factors (Messick, 1989) in the task administration. Discourse analyses revealed that several common traits identified in the literature were displayed in the participants' extended discourse. These included laying pre-expansion (Schegloff, 2007); mitigating imposition of the act; laying closing; and choices of linguistic resource (e.g., bi-clausal structure, modal verbs). The general findings are consistent with those in recent empirical studies (e.g., Al-Gahtani & Roever, 2012). The quantitative analysis revealed a tendency for participants to perceive their language use in the tasks to be similar to that in reality, and that their performances were not unnecessarily affected by the administration procedures. Participants with higher proficiency especially perceived a stronger similarity between the tasks and the target language use domain. Also, participants in different proficiency groups self-assessed their language use differently, which is reflected in their task performances. Combined with some learners' volunteered feedback on the meaningfulness of the tasks, this study will discuss the constructs of pragmatics that are measurable and recognized as important by L2 learners, with relevance to the design of effective language assessment tasks to better assess L2 learners' readiness for entry into the academic domain.

1:15-1:40pm

The construct of language assessment literacy as perceived by foreign language teachers in a specific context

Deena Boraie, The American University in Cairo, dboraie@aucegypt.edu

Assessment literacy (AL) or language assessment literacy (LAL) was conceptualized based on a deficit model and considered a gap in the knowledge base of educators which should be addressed by professional development or training programs. This competence-based approach defined AL in terms of abilities or competencies to be acquired, such as those listed by Popham (2009) as well as those found in several standards documents. In 2013, a special issue of the Language Testing journal devoted to AL/LAL contributed significantly to the conceptualization and definition of the LAL construct. The LAL construct is positioned in real-life areas associated with the professional beliefs, perceptions and practices of educators, and conceptualized taking into account the context or the community of practice it is located in (Taylor, 2013). This means that LAL must be conceptualized adopting a more complex and dynamic approach than the previous deficit model, in the middle of tensions between the positivist psychometric and interpretivist socio-cultural paradigms of assessment and education (Scarino, 2013). The purpose of this study is to establish a contextualized definition of LAL in a specific language context by teachers of English (EFL) and Arabic (AFL) as a foreign language to adults. Semi-structured interviews as well as classroom observations were conducted for 6 EFL and 6 AFL teachers, selected taking into account gender and teaching experience. The interviews focused on teachers' knowledge, skills and attitudes about language assessment as well as how their LAL developed over time. The observations explored teachers' classroom assessment practices. A
qualitative methodology was adopted to identify emerging patterns and themes from all the data collected to inform a contextualized construct definition of LAL. The results of this study help language testing and assessment professionals understand assessment issues and needs in this context and accordingly engage with teachers more effectively.

Can planning at the functional level affect test score?

Gwendydd Caudwell, British Council, Gwendydd.Caudwell@britishcouncil.org
Barry O'Sullivan, British Council, Barry.O'Sullivan@britishcouncil.org

Previous research focusing on planning strategies (e.g. Foster & Skehan, 1996; Ortega, 1999; Wigglesworth) has concentrated on the impact that planning strategies have on the accuracy, fluency or complexity of spoken performance. Following on from these, this study focuses on planning in what Weir (2005) refers to as an extended informational routine, and examines more closely the interaction between planning strategies and the language functions used in spoken production. This study builds on the findings of a previous study (Caudwell, 2014) using functional analysis (O'Sullivan, Weir & Saville, 2002), which examined the effect of the reduction of cognitive load on the performance of young teenagers (13 to 15 year olds) taking a long-turn speaking task. The study highlighted the absence of a number of expected functions (e.g. staging) in the responses examined. Since any evidence of a limited range of language functions might well be perceived by examiners as reflecting a limited language range, and therefore as impacting negatively on test score, it is clear that further exploration of the area is required. In order to work towards a more complete understanding of the relationship between functions produced and any effect on overall score, it is important to take into consideration the use made by candidates of the planning time typically built into this task type. In other words, it is important to know to what extent candidates consider specific functions when planning their responses to the task input. Over 100 participants ranging in ability level from A2 to C1 were asked to complete a long-turn speaking task as part of a larger, multi-task, computer-delivered test of English proficiency. In addition, they completed a questionnaire immediately post-test. This questionnaire focused primarily on the planning time included in the task, and was designed to gather evidence of any strategies used by the candidates. In order to present a multi-faceted perspective of the situation, data from 40 candidates at all levels from the completed questionnaires were compared to a functional analysis and to the scores awarded by trained raters. The findings allow us to draw meaningful conclusions as to whether the effective use of planning time can make a difference in overall score and language produced. These findings, as well as their associated implications, will be discussed.

Language constructs revisited for practical test design, development and validation

Xiaoming Xi, Educational Testing Service, xxi@ets.org

While theoretical approaches to defining language constructs continue to be debated, recent conceptualizations have converged on the interactionalist view, which highlights the important role of language use context (Chapelle, 1998; Chalhoub-Deville, 2004; Bachman, 2007). However, the practical implications of the interactionalist approach for defining and operationalizing constructs remain to be explored. This presentation will focus on explicating the interactionalist approach for the purpose of guiding the practical design, development and validation of speaking tests. In alignment with the interactionalist view, I argue for a multi-faceted approach to construct definition, consisting of two key elements: foundational and higher-order language skills, and use contexts. Theoretical language ability models focus on
delineating the foundational and higher-order knowledge and skills called for in language use activities (Canale & Swain, 1980; Bachman, 1990; Bachman & Palmer, 1996, 2010; Fulcher, 2003). Frameworks for language use have attempted to describe personal attributes of examinees and characteristics of the language use task and situation that shape language use (Bachman & Palmer, 2010). Characteristics of the language use task and situation essentially define the context of language use. Prevalent theoretical models attempt to articulate the language knowledge components called upon in language use activities in all modalities to facilitate generalization across them. I propose a model of oral communicative competence to provide an organizing structure for describing all underlying components relevant to oral communication, and to provide opportunities for highlighting salient components in this modality-specific model. In addition to knowledge components, this model includes processing components, and oral communication competences as evidenced in examinees' speech, which form the basis for scoring rubrics. Construct definitions could include all or a selected subset of these skills depending on the purposes of a test, and both task and rubric designs need to be aligned with the construct definition. Given the critical role of context in interactionalist-oriented constructs, I describe a scheme to characterize key contextual factors of oral communication as layers, each of which adds specificity to constrain the contextual parameters of tasks: language use domains, speech genre, goal of communication, medium of communication and characteristics of the setting. I argue that while a model of contextual factors needs to be rich and all-inclusive, including all of the contextual facets would make the construct definition unwieldy and less useful in guiding construct operationalization. Therefore, test designers need to prioritize the inclusion of key facets of the language use context that characterize the target language use domain most meaningfully, and that are expected to account for variation in the quality and features of examinees' responses. Moreover, test designers must decide which facets to control for and which ones to vary in test design. Appropriate inferences can then be made to specify the contexts to which generalizations about language skills can be made. I will use examples to illustrate how the interactionalist approach could be applied to drive optimal test design and development and to guide validation, and discuss how priorities may be different when defining constructs for tests intended for summative vs. formative purposes.

1:45-2:10pm

Implementing formative classroom assessment initiatives: What language assessment literacy knowledge is required?

Ofra Inbar-Lourie, Tel Aviv University, ofrain@post.tau.ac.il
Tziona Levi, ORT Israel, z_levi@netvision.net.il

Formative assessment-for-learning theories, which link teaching and learning with assessment, currently form part of the stated policies of many central educational systems, disseminated via assessment tools and teacher professionalization sessions. Questions arise as to the effectiveness of such initiatives, i.e., if and to what extent teachers actually embrace, internalize and implement them, and why. Findings from previous research on EFL teachers' alternative classroom-based assessment practices place teachers' assessment use in wide social and educational paradigms, accentuating its complexity and pointing at the involvement of additional forces in the assessment decision-making process (Inbar-Lourie & Donitsa-Schmidt, 2009). Recent interest in language assessment literacy (LAL) reinforces this concept as it calls for acknowledging the role of additional stakeholders, such as policy makers and administrators, in the assessment process. Such broadening of the assessment circle necessitates a redefinition of the LAL required by the various stakeholders in order to fulfill their role in bridging the gap
between policy and classroom practice (Taylor, 2013). Hence, this research tried to shed light on the LAL components needed by the different parties involved in a teaching-learning assessment initiative so as to inform future similar undertakings. The tool in question was an assessment-of-speaking kit developed by the Israeli Ministry of Education and the National Authority for Assessment and Evaluation for use by EFL teachers in junior high classes. It follows a formative assessment rationale and comprises both teaching and assessment activities intended to promote speaking skills aligned with the national curriculum. Pilot research findings, conducted at the end of the development stage, point at generally high levels of satisfaction among both teachers and students who used the kit. However, two years after its introduction the digitally available kit is hardly used, and this research looked at the reasons why from the LAL perspective. The research sample includes Ministry officials, assessment and EFL experts who decided to create the assessment tool and took part in the process, as well as school officials (principals and administrators) who oversee the teaching and assessment processes in their schools, and EFL teachers. The research tools comprise semi-structured interviews and focus groups intended to elicit views regarding the kit and its implementation. The data analysis follows the qualitative inquiry tradition, analyzed according to the eight dimensions set out in Taylor's (2013) LAL framework. Preliminary results point at the LAL knowledge components the stakeholders lack and thus need in order to disseminate the assessment initiative: policy makers lacked understanding as to the implementation phase, with resources allotted mainly to the creation of the tool, passing on responsibility to the school administrators, who are unaware of the importance of formative assessment, of teaching language strategies, and of the resources needed. Finally, teachers were found to lack understanding of formative assessment principles in general and know-how in the teaching and assessment of speaking in particular. Implications for future formative classroom assessment endeavors at the policy and teacher education levels will be presented and discussed.

Investigating the impact of language background on the rating of spoken and written performances

Jamie Dunlea, British Council, jamie.dunlea@britishcouncil.org
Judith Fairbairn, British Council, Judith.Fairbairn@britishcouncil.org

This paper investigates the impact of the linguistic background of raters on the rating of productive skills in English as a Foreign or Second Language (EFL/ESL) assessment contexts. Previous studies have investigated differences between Native-Speaking (NS) raters and Non-Native-Speaking (NNS) raters, but have focused either on the rating of spoken performances (e.g. Brown, 1995; Kim, 2009; Xi & Mollaun, 2011) or written performances (Hill, 1996; Johnson & Lim, 2009) separately. Rater language background in relation to speaking has also been addressed from the perspective of raters' familiarity with a test taker's accent (Winke, Gass, & Myford, 2012). The present study builds on this background and makes a particular contribution to the literature by addressing the rating of speaking and writing performances from the same candidates by the same raters, thus allowing for an examination of any differential effect of rater L1 background on the rating of these skills. The study utilized performances collected from a large-scale, international testing program administered via computer to adult EFL/ESL learners in diverse contexts. A total of 60 test takers' performances on both the writing and speaking components of the test were selected for use in the study. Twenty test takers were chosen from each of three geographically and linguistically distinct backgrounds: Spanish, Japanese, and speakers of an Indian language. Twenty trained and experienced operational raters from the rating pool of the testing program were invited to take part in the study: 10 Native Speakers of English
96
Paper abstracts
(NS) and 10 Non-Native Speakers of English (NNS). The NNS raters were all based in India and had an Indian language as their first language. A questionnaire was administered to the raters to collect information on their familiarity with the accents of the three groups of test takers used in the study, as well as other background variables. All raters rated all 60 speaking and writing performances, and the results were analyzed using many-facet Rasch measurement (MFRM) to investigate interactions between rater language background and the L1 of the test takers. The results demonstrate that for the raters in this study, the rating quality and severity of the NS and NNS groups were comparable, with all raters demonstrating appropriate consistency in terms of the fit indices generated by the MFRM analysis. No statistically significant interactions were noted between rater L1 background and test taker L1 background. However, a number of individual raters showed statistically significant bias interactions with particular L1 groups of test takers, with more NNS raters than NS raters showing such interactions. The results were generally similar for both writing and speaking. There was no discernible pattern in terms of the direction of bias, and the results indicate that for this group of trained raters using explicit rating scales, differences between raters may be idiosyncratic rather than dependent on language background. The presentation will also discuss the implications for the use of raters from diverse linguistic backgrounds in EFL/ESL assessment.

A review of the content and statistical qualities of three English language proficiency assessments

Mikyung Kim Wolf, ETS, mkwolf@ets.org

Molly Faulkner-Bond, University of Massachusetts, mfaulkne@educ.umass.edu

In the United States, K-12 public schools are mandated to annually assess and report on their English learners' (ELs) English language proficiency (ELP) and progress in ELP development. For this accountability purpose, each state administers a summative ELP assessment. These ELP assessment results are also used to determine ELs' readiness to exit from EL designation. Despite these high-stakes score uses, relatively little empirical research is available to examine the content and measurement qualities of the current ELP assessments. This study aimed to investigate the construct validity of three ELP assessments. Understanding the construct of state ELP assessments is of fundamental importance in making adequate inferences from scores about the types and level of English language proficiency that students have. While the current ELP assessments claim that they measure the academic English proficiency that students need to participate in K-12 school settings, it is not clear how this construct is being measured. Additionally, a positive, strong relationship between ELP and content assessment performance would provide evidence to support the construct validity of ELP assessments. Specifically, we addressed the following research questions: (1) What are the characteristics of the academic and social English incorporated in three sample ELP assessments, and how are they represented across the assessments? (2) To what extent are the students' academic and social language subscores on the ELP assessments related to their performance on their states' content assessments? We obtained three states' ELP assessments and the students' scores on both the ELP and content assessments in English language arts and mathematics. The total sample size included 7,496 ELs (State A: 4,123, State B: 1,133, and State C: 2,240). We first conducted a linguistic content analysis of the items to examine the linguistic demands of each item. Our content analysis rubric included various categories such as types of vocabulary and syntactic complexity. One of the categories classified the items as social, general-academic, or technical-academic language items based on the degree of academic specificity in the item content. Descriptive statistics of the ratings were compared across the assessments. To examine the relationship between ELP and content assessment performance, a hierarchical
linear modeling (HLM) analysis was performed, since students' data were nested within schools and districts. The results indicate that the three ELP assessments contained different degrees and types of social and academic language items, both across assessments and across the four language domains (listening, reading, speaking, and writing). Compared to other assessments, State B's ELP assessment was found to include more items contextualized in specific content areas such as mathematics and science, as well as more technical academic vocabulary words and more complex syntactic features. The HLM analysis results showed that the technical-academic language scores had a significant relationship with math assessment scores. We will present the detailed results of our linguistic and HLM analyses and discuss practical implications of the findings for the valid uses of ELP assessments. We will also discuss suggestions for classroom instruction regarding the types of English language skills that are measured in the assessments.

2:15-2:40pm

Development of innovative assessment tasks to measure spoken language abilities of young English learners

Youngsoon So, Educational Testing Service, yso@ets.org

Diego Zapata, Educational Testing Service, dzapata@ets.org

Keelan Evanini, Educational Testing Service, kevanini@ets.org

Jidong Tao, Educational Testing Service, jtao@ets.org

Christine Luce, Educational Testing Service, cluce@ets.org

Laura Battistini, Educational Testing Service, lbattistini@ets.org

In this presentation we will demonstrate the potential of using technology-assisted tasks to measure the English proficiency of young English language learners. The 'trialogue' system, a three-party conversational system between a student and two virtual characters that has largely been applied to teaching content knowledge in online tutoring environments (e.g., Cai, Graesser, Millis, Halpern, Wallace, & Moldovan, 2009), was adopted to develop language assessment tasks. One presumed benefit of using the system in language assessment is that it provides test takers with scenario-based tasks in which multiple language skills, such as listening and speaking, are naturally integrated. The integration of multiple language skills allows test developers to better simulate real-life language use situations. This authenticity of the assessment tasks enables a more direct interpretive link between test performance and success in performing real-life language use tasks. To explore the feasibility of a trialogue system for assessing spoken language abilities, a series of language assessment tasks was embedded within a scenario-based language use context. By design, the tasks incorporated real-life situations into the conversations and provided immediate feedback about the spoken responses in order to engage young students of English at the elementary school level. In each task, students interacted with two virtual characters, and the spoken responses were processed by an Automated Speech Recognition (ASR) system, the CMU PocketSphinx system (Huggins-Daines, Kumar, Chan, Black, Ravishankar, & Rudnicky, 2006). The ASR-generated text of the spoken responses was then processed by the Natural Language Understanding (NLU) and dialogue management modules in the AutoTutor system (D'Mello & Graesser, 2013) to provide an appropriate response to a student's utterance. A small usability study was conducted in May 2014 with eighteen 3rd-5th grade English language learners in a public school in the USA. Results from the study suggested that the trialogue-based system is a viable method for measuring the English language constructs targeted in the tasks: the correlation between the system-based speaker-level task-completion scores using ASR output and human ratings was r = 0.80. Moreover, student perception data, collected at the end of the study session,
generally suggested that students perceived the trialogue-based tasks positively. For example, most (14 of 18) participating students reported feeling that the virtual characters understood their responses. The presentation will conclude with a discussion of the potential usefulness and current limitations of trialogue-based assessment tasks for measuring the English proficiency of English language learners with respect to target age groups, the range of constructs, and the types of expected responses.

Analysis of norm-referencing modifiers as a component of rating rubrics

Alex Volkov, Paragon Testing Enterprises, volkov@paragontesting.ca

Jacob Stone, Paragon Testing Enterprises, stone@paragontesting.ca

Rating scales are an instrumental part of a performance language assessment, such as The Canadian English Language Proficiency Index Program® General Test (CELPIP-G Test). Recently more attention has been given to empirically derived rating scales, as opposed to 'a priori' theoretically created rubrics (Knoch, 2009). Knoch (2009) stresses that many existing rating scales have been criticized for using relativistic wording such as modifiers. However, not much research has been done in relation to the modifiers that are commonly used to construct rating scale bands. Currently CELPIP-G is revising the writing rating scale by constructing an empirically derived rubric. The present study employs a Many-Facet Rasch Measurement (MFRM; Linacre, 1989) approach to analyze to what extent modifiers affect raters' use of different descriptors and how modifiers may discriminate across different writing proficiency levels. Research questions: In their recent studies, North and Schneider (1998) and Knoch (2009) suggest that normative modifiers should be removed before descriptors are empirically scaled. Scarce research in the domain of rating scale modifiers has pushed us to conduct a study to better understand the effects of different modifiers in terms of their power of discriminating between different proficiency levels. More specifically, the current study attempted to answer the following questions: 1) Is the discrimination power of descriptors different when they are attached to different modifiers? 2) Will the same modifiers attached to different descriptors nonetheless be more discriminating at a specific ability range? Method: Eight participants were given 30 writing samples and 15 different descriptors with normative modifiers (partially, sufficiently, and mostly). All the participants were experienced CELPIP benchmark raters, and their task was to identify whether the descriptors applied to each sample response or not. All the writing samples had been pre-scored in operational CELPIP settings and created a basis for ranking. MFRM (Linacre, 1989) was utilized to rank the modifiers. Results: The analysis revealed good overall infit and outfit statistics (Infit MnSq 0.71 to 1.18, average 0.97; Outfit MnSq 0.70 to 1.26, average 0.76), which allows us to draw useful conclusions from the data. 1. Descriptors with different modifiers attached to them showed clear differences in performance on the writing ability scale. This suggests that modifiers are discriminating and have a major effect on raters' perceptions. 2. The same modifier attached to different descriptors would still cluster quite closely on the language ability scale. This implies that modifiers have a strong discriminatory power at a specific ability range and can provide invaluable information if their performance and parameters are carefully analyzed. In addition, the rater participants reported that they were able to detect a pattern in the modifiers' distribution and started applying them more systematically. This finding suggests that using a limited number of modifiers systematically in rating rubrics can facilitate rater training and calibration. The evidence suggests that utilizing modifiers consistently directly affects raters' judgment and provides good discriminating capabilities. We will also suggest future empirical research into modifiers for rating scales.
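The infit and outfit mean-square statistics reported in the abstract above are standard Rasch/MFRM fit indices: outfit is the unweighted mean of squared standardized residuals, while infit weights the squared residuals by their model variance. As a hedged illustration only, the sketch below computes both for dichotomous data; all numbers are invented for demonstration, and operational MFRM analyses estimate the expected probabilities with dedicated software (e.g., FACETS).

```python
# Illustrative computation of Rasch infit/outfit mean-square statistics
# for one rater over a set of dichotomously scored responses.
# All values are invented for demonstration; they are not the study's data.

def fit_statistics(observed, expected):
    """Return (infit_mnsq, outfit_mnsq).

    observed: list of 0/1 scores
    expected: model-expected probabilities of a score of 1
    """
    variances = [p * (1 - p) for p in expected]            # binomial variance
    sq_resid = [(x - p) ** 2 for x, p in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    # Infit: information-weighted mean square.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

obs = [1, 0, 1, 1, 0, 1, 0, 1]
exp = [0.8, 0.3, 0.6, 0.7, 0.4, 0.9, 0.2, 0.5]
infit, outfit = fit_statistics(obs, exp)
```

Mean squares near 1.0 indicate that responses are about as noisy as the model predicts; values well above 1 flag misfit (erratic ratings), and values well below 1 flag overfit.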
The relationship between TOEFL and GPA: The case of Chinese students

April Ginther, Purdue University, aginther@purdue.edu

Xun Yan, Purdue University, yan50@purdue.edu

Joe Potts, Purdue University, jdpotts@purdue.edu

This study examines the predictive validity of the TOEFL iBT with respect to academic achievement as measured by first- and second-semester GPA of Chinese students enrolled in a large public university in North America. The TOEFL iBT scores submitted in students' admission files and the first-year GPAs of 1,738 students from mainland China, enrolled across three academic years (N2011 = 702, N2012 = 514, N2013 = 522), were examined. Correlations between GPA, TOEFL iBT total, and subsection scores were included in the analyses. The coefficients were also adjusted for attenuation using the Thorndike Case 2 formula to correct for restriction of range. In addition, cluster analyses on the three cohorts' TOEFL subsection scores were conducted to determine whether different score profiles might help explain the correlational patterns found between TOEFL subscale scores and GPA. For the 2011 and 2012 cohorts, strong and positive correlations were observed between speaking and writing subscale scores and GPA, and strong but negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, only writing subscale scores were positively correlated with GPA. Results of the cluster analyses suggest that the negative correlations in the 2011 and 2012 cohorts may be associated with a distinctive TOEFL iBT score profile: high on reading and listening (>27), but relatively low on speaking and writing (<20), with a total score above 90. Interestingly, this group had the lowest average GPA (<3.0) as compared with the other score profile groups that emerged in the cluster analyses. In the cluster analysis of the 2013 cohort, no comparable score profiles were present. Not only did the 2013 cohort not contain students with the distinctive TOEFL iBT score profile, but all score profile groups also achieved an average GPA above 3.0. Assuming an expected positive but weak relationship between TOEFL iBT subscale scores and GPA, the negative correlations between the reading and listening subsection scores and GPA for the 2011 and 2012 cohorts were surprising. These correlational patterns may be associated with the Reading/Listening vs. Speaking/Writing discrepant score profile of subgroups within the 2011 and 2012 cohorts. The findings underscore the need to consider subsection scores when evaluating ESL students' language proficiency for both admissions and research purposes. Possible explanations for the uneven score profile and its consequences for academic achievement will also be discussed, along with the important implications for admissions decisions and the provision of language support for Chinese students accepted into North American universities.
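The Thorndike Case 2 adjustment mentioned above corrects an observed correlation for restriction of range: because only higher scorers are admitted, the predictor's variance shrinks and the observed correlation with GPA is attenuated. A minimal sketch with invented numbers (not the study's values):

```python
import math

def thorndike_case2(r_obs, sd_unrestricted, sd_restricted):
    """Correct a correlation for direct range restriction on the predictor
    (Thorndike's Case 2). r_obs is the correlation in the restricted sample."""
    # Ratio of applicant-pool SD to admitted-sample SD.
    k = sd_unrestricted / sd_restricted
    return (r_obs * k) / math.sqrt(1 + r_obs ** 2 * (k ** 2 - 1))

# Hypothetical example: r = .20 observed in an admitted cohort whose
# TOEFL score SD is half that of the applicant pool.
r_corrected = thorndike_case2(0.20, 10.0, 5.0)
# The corrected value is larger than the observed one, as expected.
```

With no restriction (equal standard deviations, k = 1) the formula returns the observed correlation unchanged.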
Symposium Abstracts
SYMPOSIUM 1

Wed, Mar 18

4:50-6:20pm (Churchill Ballroom)

Testing as social practice: Canada and context in construct representation, test interpretation, and arguments for consequential validity

Janna Fox, Carleton University, Janna_Fox@carleton.ca

Beverly A. Baker, University of Ottawa

Michel Laurier, University of Ottawa, mlaurier@uottawa.ca

Dean Mellow, Simon Fraser University, dmellow@sfu.ca

Kaitlyn Begg, McGill University

David Slomp, University of Lethbridge, david.slomp@uleth.ca

Bruno Zumbo, University of British Columbia, bruno.zumbo@ubc.ca

Bryan Maddox, University of East Anglia

Discussant: Carolyn Turner, McGill University

Canada's multilingual diversity, its bilingual history, and interactions between Official languages (French and English) and indigenous languages have led to a tradition of language assessment which is increasingly marked by recognitions regarding context: in construct representation, test performance, and assessment practices which impact teaching and learning.

Within the broader field of language testing, this more socially informed view was reinvigorated by Messick (1989), who argued that construct definition was a value-laden process embodying both social and political meanings. Following Messick, a growing body of research has focused on the social consequences of test use, test impact (Wall, 2005), and washback (e.g., Cheng, 2005; Cheng, Watanabe & Curtis, 2004). Social theory (e.g., Bourdieu, 1991; Foucault, 1972; Vygotsky, 1987) has increasingly informed research in language testing (e.g., McNamara & Roever, 2007; Shohamy, 2001; Swain & Brooks, 2013), leading to new recognitions and prompting scholars such as McNamara (2007) to suggest that "arguably the greatest challenge facing language testing is the issue of the context in which the language testing is carried out" (p. 131).

Although we currently find fewer proponents of the view that measurement is a purely cognitive enterprise which occurs "in a social vacuum" (McNamara, 2008, p. 417), it is still the case that many texts on testing and test development (e.g., Downing & Haladyna, 2006): 1) avoid explicit discussion of the local or social context within which tests act; 2) are largely detached from considerations of individual stakeholders, socio-cultural groups, indigenously drawn constructs, and the social identities of test takers; and 3) only rarely examine testing practices through a broader social lens of fairness (Cole & Willingham, 1999; Kunnan, 2014), equity, and justice. If fairness is discussed in such textbooks, it is typically related to the merits of such psychometrically driven testing formats as multiple-choice.

Within the Canadian context, however, discussions of fairness inevitably lead to the recognition that socio-cultural context is an inescapable element of testing practice. The present symposium brings together Canadian researchers who are informed by socio-cultural theory in investigating testing practices and their
washback, impact, and consequences, at both the micro and macro levels of context. Taken together, these papers provide a window on the unique contributions of Canadian research to language testing, and provide evidence that such research is indeed a response to calls in the language testing literature for a clearer understanding of context (see, for example, Chalhoub-Deville, 1995; Maddox, 2007; McNamara, 2007) in test development, validation, and interpretation. All of the researchers in this symposium have taken an ecological approach (Fox, 2003), acknowledging socio-cultural context as a critical component of construct development, test interpretation, and consequential validity arguments.

The consequential validity of language tests for pre-service teachers

Candidates are enrolled in programs in French and have to take the French examination (TECFEE). This exam was created in 2008. The following year, an English exam (EETC) was developed for candidates who will be working in English school boards. During each examination's independent development process, differing interpretations of the same construct have led to two very different final products.

TECFEE is a handwritten examination. In Part 1, candidates are asked to write a 350-word text related to an educational issue. Raters must assess two facets of the production: Discourse and Accuracy (based on the number of grammar and vocabulary errors). Part 2 consists of multiple-choice questions on grammar and vocabulary where candidates usually have to identify the correct form or definition.

EETC is a two-hour computer-based exam. Part A is an objective section consisting of sentence-editing items and multiple-choice items. Part B consists of two integrated reading and writing tasks which are aligned with authentic interactions with parents and colleagues and can be considered an ESP language performance assessment. The performance is assessed holistically.

The social demand, the expected impact, and the representation of teachers' language competence may explain why these two assessment instruments appear to be so different. In addition to the comparability issue, we identified four concerns in relation to the impact of TECFEE which also apply to EETC.

The languages of the Indigenous people of North America are remarkably diverse in terms of both distinct language families and regional variations in use. For these languages, local decision-making about the assessment of the language abilities that are valued and viable to achieve is necessitated by the limitations of published information about these languages, by the dynamic and variable patterns of multilingual language use, by the diverse types of teaching and learning contexts, and by the historical context of colonization with its difficult political relationships. In most cases, empirical evidence for the construct validity of assessment instruments (correlations or comparisons to pre-existing tests that have
been deemed valid) is not possible because such pre-existing tests do not exist for most of these languages. Content-related or logical evidence for construct validity primarily depends on the expertise of local speakers and educators, especially because language is so closely intertwined with identity and culture.

This paper discusses potential types of collaborations between local communities and language assessment developers, including a recent collaboration in the speech-language pathology profession between McGill University, the Montreal Children's Hospital, and the Cree Board of Health at James Bay. This collaboration attempted to provide culturally appropriate speech-language services to northern communities through telecommunication assessment sessions and community resource development.

The paper also discusses the tensions in political situations in which the amount of local community input in educational policies and assessment is contested by local educators and federal politicians (e.g., challenges with Canadian Bill C-33, the First Nations Education Act; and the U.S. federal policy "No Child Left Behind").

Integrating consequential validity evidence into the design and validation process for both large-scale and classroom assessment

David Slomp, University of Lethbridge

The increasing diversity of students in contemporary classrooms highlights the importance of developing assessment programs that are sensitive to the challenges of assessing across diverse contexts and with diverse populations. The issues of diversity and contextual sensitivity also make clear the importance of attending to the social consequences that result from the development and use of assessment tools/programs.

Within the measurement community, however, there has been considerable debate about the role consequential validity evidence plays within the broader validation process. On the one hand, assessment scholars such as Popham and Cizek have largely dismissed consequential validity evidence as being a pseudo form of validity. They also argue that studies of consequential validity evidence often focus too heavily on large-scale assessments while ignoring classroom assessments (which often carry more significant social consequences for students than do large-scale assessments). On the other hand, scholars such as Lane and Brennan have argued that consequential validity is an essential but too often ignored category of validity evidence. Lane further argues that one reason consequential validity evidence is missing in much of the validation research is that a systematic approach for collecting such evidence has not been developed.

This paper addresses both of these concerns. First, it presents a framework for systematically integrating considerations of consequential validity issues into each phase of Kane's validation model. Second, drawing on this model and fusing it with Huot's concept of assessment as research, it presents both a process guide and a design heuristic based on this framework that can guide classroom teachers through a more rigorous assessment design process, one that compels them to attend to the social consequences of their assessment design choices.

Inviting the study of consequences by using ethnographic-psychometrics in operational testing programs

Bruno D. Zumbo, University of British Columbia, & Bryan Maddox, University of East Anglia

Recently, Zumbo and Chan (2014) presented a systematic review of validation practices as a study of the scholarly genre of validation reports – its focus, style, orientation, and structure – and how this genre frames validity
theory and practices across many disciplines and international jurisdictions. One of their starkest findings was that the study of consequences is consistently under-represented in validation practices. One of their recommendations is that it is important to expand the evidential basis of validity to include qualitative and mixed methods evidence as a way to invite the study of consequences. The purpose of this paper is to present an overview of our program of research that draws upon (a) an ecological model of item responding, (b) the Hubley-Zumbo (2011) framework of consequences in validation, and (c) what we have termed 'ethnographic-psychometrics' or 'psychometric-ethnographics', depending on your methodological orientation and proclivities. We will provide several cases from a forthcoming paper that demonstrate how the ethnography of testing situations, in an operational testing context, can combine with and complement psychometric analysis to identify and explore substantive hypotheses about test, item, and group performance. Our aim is to marry ethnography and psychometrics in a manner that does not privilege one over the other, and hence we are going beyond the mere engagement of subject matter experts to aid in the interpretation of statistical findings after the fact. We will demonstrate how our method engages experiential experts in ethnography to both help inform as well as shape conclusions about test performance. In short, the ethnography is not meant to simply provide explanations for the statistical findings but rather is meant to go beyond that to provide a richer sense of the ecology of the testing situation, and in so doing invite the study of consequences in test validation.

SYMPOSIUM 2

Thurs, Mar 19

10:20-11:45am (Mountbatten A)

Assessment for learning (AfL): teaching, learning and assessment in the writing classroom

Wenxia Zhang, Tsinghua University, wxzhang@mail.tsinghua.edu.cn

Mingbo Shen, Tsinghua University, shenmingbo01@163.com

Discussant: Jing Huang, China West Normal University, annahuang9466@sina.com

The goal of this symposium is to present how Assessment for Learning (AfL) theories are implemented and researched in the target writing course and to discuss some emergent issues, such as how to develop students' autonomy with all parties (teacher, the student writer, peer, and machine) involved and how to set assessment criteria to guide and integrate teaching, learning, and assessment in the classroom.

This symposium consists of four presentations by a classroom-based EFL writing instruction and assessment research team in mainland China, from three perspectives: teaching, learning, and assessment. The organizer and moderator, Prof. Zhang Wenxia, a full Professor in language testing and assessment at Tsinghua University in China, will chair this symposium. Presenters are Associate Prof. Sun Haiyang and Shen Mingbo, two Lecturers, Huang Jing and Gao Manman, and one current PhD student, Wu Yong. The first presentation is a conceptual framework that umbrellas the other three presentations. It is a logical argumentation of the integrated teaching and assessment frame with an Assessment Use Argument (AUA), which offers the theoretical backup for the design of the course (presentation
2) and the impact of the integrated feedback approach in EFL writing instruction (presentations 3 and 4). The second presenter will introduce how the writing course is designed and how the integrated teaching and assessment framework is operationalized in the course. The third presentation is mainly focused on how the integrated feedback approach is applied and its impact on students' writing conceptions and performance. The fourth presenter will discuss how the integrated feedback helps develop the student writers' language autonomy. Finally, the moderator will wrap up the four presentations, the issues that may emerge, and implications for future research.

An AUA validation study of the integrated teaching and assessment framework

Haiyang Sun

… of the interpretive argument of the integrated approach in this stage is mainly from theoretical analysis. The inferences, warrants, assumptions, and backing will be clearly stated to demonstrate the feasibility and viability of the integrated teaching and assessment approach on students' performance and perceptions, in order to explore a new way to improve the teaching of College English writing in the new era.

The implementation of assessment for learning theories in a writing course

Mingbo Shen

This paper intends to explore how to implement the integrated teaching and assessment framework in course design and classroom management. The presenter will also talk about how student voice is heard in the course and the function of criteria in the course management. The presenter will first give a brief introduction to the design of the course, which involves two writing instruction processes guided by the integrated teaching and assessment framework. The presenter will then investigate …

In order to overcome the limitation of the summative one-shot feedback approach, this paper attempts to propose an integrated feedback approach and further explores its impact on students' writing conceptions and their writing
performance. In view of the research issue, a mixed-method case study research design of a longitudinal nature was employed. The study was conducted with a triangulation of data collection and analysis: multiple sources of data included pre- and post-course writing tests, student writing samples, reflective journals, questionnaires, interviews, and classroom observation; multiple data analyses included content analysis, frequency analysis, textual analysis, t-test analysis and ANOVA. The major findings are as follows: through the analysis of students' perceptions of writing learning, results show that, due to more interactions, their attitudes towards writing and revising changed, they self-reported writing progress, and their writing confidence was enhanced; through the analysis of students' writing performance, the participating students wrote more drafts for one task, revised more by themselves with the help of AWE software and peers, paid more attention to language use after receiving multiple-sourced feedback, and improved the quality of their writing products to different degrees between, within and across tasks/drafts, as well as between the pre-test and post-test. The contribution of the current research can be summarized as follows: (1) the integrated feedback approach was put forward to guide integrated feedback practice; (2) this study originally carried out the integrated feedback practice, which contributed to the feedback literature theoretically and practically; (3) the integration of written feedback, automated writing evaluation (AWE) feedback and recorded oral feedback from teacher, peers and AWE software was successfully carried out; (4) this study enriched feedback research on writing in the EFL tertiary context of Mainland China.

The impact of integrated feedback approach on student autonomy

Yong Wu

The study was conducted to investigate how the Integrated Feedback Approach influenced students' autonomy in EFL writing. In the first stage of the present study, the participants were the students in an English writing class at a prestigious university in Mainland China. The drafts they wrote and the feedback they received over two weeks were collected and analyzed. Secondly, stimulated recall data were collected when the students were asked to tell explicitly whether their revisions were influenced by the feedback given by AWE, the writing instructor, peers or other agents. In the second stage, based on the quantitative analysis of students' text revisions and the qualitative analysis of the stimulated recall data, some case study students were selected and interviewed to provide an in-depth view of how the Integrated Feedback Approach activated and developed their autonomy in EFL writing. Preliminary results showed that the number of teacher-triggered revisions comprised the majority of the total revisions, and that the students made more self-corrections than AWE-prompted revisions.
SYMPOSIUM 3

Thurs, Mar 19
4:45-6:15pm (Churchill Ballroom A)

Roles and needs of learning-oriented language assessment

Jim Purpura, Teachers College Columbia, jp248@columbia.edu
Hansun Waring, Teachers College Columbia, hz30@columbia.edu
Liz Hamp-Lyons, CRELLA, University of Bedfordshire, liz.hamp-lyons@beds.ac.uk
Anthony Green, CRELLA, University of Bedfordshire, tony.green@beds.ac.uk
Kathryn Hill, University of Melbourne, kmhill@unimelb.edu.au

The organizing questions for this symposium concern what the language interaction demands may be for learners to be effective in learning-oriented (language) assessment, and what assessment literacy demands are placed on language teachers if they are to move successfully into learning-oriented assessment, either in their own classrooms or in their roles as raters in the human rating of large-scale assessments. All learning events are situated in complex social situations; when assessment becomes the dual and concurrent focus of learning, the learning-oriented language assessment (LOLA) event, taking place through a language interaction, becomes significantly more complex. The symposium as a whole looks closely at three specific yet varied assessment contexts and argues that, whether assessment is informal and classroom-based or more formal and high-stakes, and whether the focus is linguistic or pragmatic, learners' engagement in the dynamic process of learning to use the language effectively is supported through a consciousness of teachers' assessment literacy needs.
Examining the interactional patterns of feedback and assistance in spontaneous learning-oriented assessments and their effect on the dynamic processes of learning

Many assessment researchers (e.g., Turner, 2012; Hill & McNamara, 2012; Lantolf & Poehner, 2011; Purpura & Turner, 2012) have highlighted the central role that assessment plays in L2 classrooms and have expressed the need to relate assessment principles and practices to teaching and learning in L2 instructional contexts. In addressing this need, Turner and Purpura (2015) and Purpura and Turner (forthcoming) have proposed an approach to classroom-based assessment, called learning-oriented assessment, in which learning and learning processes are prioritized while learners are attempting to narrow achievement gaps as a result of assessments. This approach focuses on how planned and unplanned assessments, embedded in instruction, are conceptualized and implemented from a learning perspective in relation to several interacting dimensions (i.e., the contextual, the proficiency, the elicitation, the interactional, the instructional, and the affective dimensions).

The current paper is concerned with the interactional dimension of LOA. It specifically examines the interactional patterns of feedback and assistance in unplanned, spontaneous assessments embedded in instruction, and then endeavors to describe how these patterns perform collaboratively in social interaction to contribute (or not) to the learners' knowledge of and ability to use the passive voice to describe operational processes (e.g., desalination). One intermediate ESL class was videotaped from the beginning of a lesson to its culmination, when the results of an achievement test were returned (approximately 4 days). The video data were then uploaded onto NVivo, transcribed, and examined through a conversation analysis approach. Instances of planned and unplanned assessments were then identified and examined iteratively and recursively through several lenses (e.g., interactional features, proficiency features, processing and learning features). This study shows a complex mix of patterns related to how spontaneous assessments, mediated through interaction, further the dynamic processes of learning to use the passive voice in these contexts.

Introducing opportunities for learning-oriented assessment to large-scale speaking tests

Drawing on recent developments in formative assessment and in broader concepts of learning-oriented assessment [LOA] (Carless 2003, 2007; Carless, Joughin, Liu et al. 2006; Joughin 2009), we describe a model of learning-oriented language assessment (LOLA) which we have applied to a set of speaking assessment standardization videos in order to understand whether and how LOLA opportunities might be inserted into a formal, large-scale exam. Further, we describe our findings about the ways in which official teacher support materials for this speaking test may, or may not, encourage the teachers who serve as speaking assessment interlocutors to put into practice aspects of LOLA through formative or dynamic assessment practices during the test event. The study suggests that putting LOLA into practice brings the personas of assessor and language teacher closer together, giving assessors the potential to implement teacherly skills, but that for this potential to become actual, some changes would need to be made to the current speaking test structure, the preparation materials, and the constraints imposed on interlocutors.

We have identified some opportunities for learning-oriented practices to be inserted into the available resources for teachers, and for the strategies implied by the identification of these opportunities also to be included in teacher and learner preparation materials/resources. This study has shown that if effective LOLA is to occur in large-scale speaking tests, training of prospective interlocutors for the interlocutor role
is critical, as is further development of the support materials for teachers and learners planning to take this test, so that learners may take advantage of these new opportunities. With a view to advising the testing body on future development, a fully fleshed-out programme of refresher training for interlocutors, and a programme of training materials development, trainer training, and ongoing observations of interlocutors in action are proposed. The resulting speaking assessment could be a model of the best in human engagement in live speaking performance assessment, but how far LOLA insertions into the existing ethos and design of the speaking tests can in fact produce a more learner/learning-oriented approach to language test preparation will not be known until the attempt is made.

The implications of LOLA for teacher professional development

… more generally. It is based on a definition of classroom-based assessment (CBA) designed to account for the full spectrum of CBA practices, including the types of assessment occurring in the course of routine classroom interactions (Leung, 2005; Purpura, 2014). The self-assessment tool is designed to facilitate reflection on the nature of the assessments, as well as the beliefs and understandings (about the subject, SLA, language teaching and assessment) which underpin them. It also explores the relationship between assessment and the relevant curriculum standards and frameworks, and how prior knowledge, ability level, interest level and learning needs are taken into account. It emphasizes the learner perspective and agency in assessment and highlights the relationship between feedback, motivation and goal orientation. Furthermore, by including a specific focus on how practice is shaped by contextual factors at the local (classroom and institutional) level as well as the social and political level, the tool acknowledges the inevitable gap between policy and practice as …
SYMPOSIUM 4

Thurs, Mar 19
4:45-6:15pm (Churchill Ballroom B)

New models and technologies for blurring the distinction between language testing and language learning

A. Jackson Stenner, MetaMetrics, jstenner@lexile.com
Barry O'Sullivan, British Council, barry.o'sullivan@britishcouncil.org
Jamie Dunlea, British Council, Jamie.Dunlea@britishcouncil.org
Todd Sandvik, MetaMetrics, tsandvik@lexile.com
Esther Geva, University of Toronto, esther.geva@utoronto.ca

… Lyra and Chaudharg, 2009)); modern metrological thinking applied to education and behavioral science (Fisher, 2010); personalised learning platforms that blur the distinction between language testing and language learning (Hanlon et al., 2012); and cloud-based delivery platforms that make these new technologies scalable and accessible around the globe. This symposium explores these intersections with a decided focus on illustrations and applications.

First, the symposium will introduce a Causal Rasch Model for language testing that represents a fusion of a substantive theory of text complexity with a Rasch Model. Major advances in educational measurement are evidenced by this approach and will be demonstrated in the context of individual-centered reading assessment. Second, the symposium will acknowledge critical issues in language testing when considering the perspective of the test-taker, and explore how the concept of localisation may be applied to individualise feedback through a linking of complementary theoretical approaches to reading assessment. Third, the symposium will describe how the widely used descriptive labels of the Common European Framework of Reference (CEFR) can be combined with quantitative and qualitative measures to define reading text and task difficulty in a test of English proficiency. Efforts to integrate a Rasch-based measure of reading into specifications for an English as a Foreign Language (EFL) assessment will be examined in context for practical utility in language-test development and validation. Finally, the symposium will explore ways in which the technical perspectives presented by the symposium can be integrated with and applied to day-to-day activities, as demonstrated by an online reading platform that fosters individualised learning. Special emphasis will be given to illustrating the cumulative benefits for educational systems moving toward a unified approach to reading measurement.

Causal Rasch models in language testing

Rasch's unidimensional models for measurement are conjoint models that make it possible to put both texts and readers on the same scale. Causal Rasch Models (Stenner, Fisher, Stone and Burdick, 2013) for language testing fuse a theory of text complexity to a Rasch Model, making it possible for a computer to score responses to texts read by language learners during daily practice. Causal Rasch Models are doubly prescriptive. First, they are prescriptive as to data structure (e.g., non-intersecting item characteristic curves). Second, they are prescriptive as to the requirements of a substantive theory for the attribute being measured; i.e., item calibrations take values dictated by the substantive theory, not the data. One consequence of this infusion of a Rasch Model with a substantive theory is that individual-
centered growth trajectories can be estimated for each reader, even though no two readers ever read the same article or respond to a single common item. Rather than common items or common persons being the connective tissue that makes comparisons of readers possible, common theory is the connective tissue, just as is true in, say, human temperature measurement, where each person is paired with a unique thermometer. Thus, although the instrument is unique for each person on each occasion, a text complexity theory makes it possible to convert counts correct to a common reading ability metric in each and every application.

In this paper we will richly illustrate the above conceptual breakthroughs with applications. As an example, consider the implications for the measurement of a student responding to 10,000+ reader items per year, embedded in her self-directed daily 40 minutes of reading practice. Additionally, we will illustrate how the text complexity theory can be tested by comparing the empirical text complexity measures based on reader responses with the theoretical text complexity measures produced by a computer-based text analyzer.

Localising global assessment: Integrating the socio-cognitive model of test validation with applications of a causal Rasch model

… various elements of the test, from the perspectives of the test taker, the test task and the scoring system, and has been successfully applied in a range of language test development and validation projects over the past decade (see O'Sullivan & Weir (2011) for an overview of these projects).

In considering the test-taker, two critical issues emerge. The first is the need to detail the model of language ability being tested, since the more explicit this model, the more convincing any later validity argument. The British Council's approach to testing reading, for example, involves writing tasks to reflect the underlying reading model and then accepting only those tasks which can be shown through their psychometric properties to meet the expectations of the model. The second issue is that of localisation (the appropriateness of the test for use with a specified population across a range of test-taker characteristics). The ultimate degree of localisation is the personalisation of the testing process. Though technology offers us innovative ways of reaching this goal, it is still some way off. However, one way in which personalisation of reading assessment is achievable is through the provision of individualised feedback. By linking the theoretical approaches of the British Council and Lexile, we have begun to build a systematic understanding of how we can apply the concept of localisation/personalisation to a whole range of language learning and assessment initiatives. Offering personalised feedback based on reading test performance is just the first of what we see as a number of potential applications of the process.
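As a point of reference for the Rasch-based papers above: the abstracts do not spell the model out, but the conversion of counts correct to a common ability metric rests on the standard dichotomous Rasch model, sketched here as a reader's aid:

```latex
% Dichotomous Rasch model: probability that reader n responds
% correctly to item i, given reader ability \theta_n and
% item (text) calibration \delta_i.
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

In a causal Rasch model, as described above, the calibrations \delta_i are supplied a priori by the substantive text complexity theory (e.g., a text analyzer) rather than estimated from response data, which is what allows counts correct on any unique set of texts to be converted to the common reading ability metric.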
… expected to engage in when completing the test task. When developing such specifications, we are now able to utilize a number of automated quantitative text analysis tools, as well as a growing body of research on the relationship between the measures generated and empirical text difficulty. From a different perspective, descriptive frameworks of language proficiency have also seen rapid uptake over the last decade, with the Common European Framework of Reference (CEFR) increasingly used as a reference point for the description of language examinations. The CEFR is intuitively easy to apply through the use of can-do descriptors, which can be said to fulfill the function of Performance Level Descriptors in standard-setting terminology. At the same time, the illustrative scales of proficiency in the CEFR—always intended as user-oriented scales—lack the depth of description required for language-test development and validation.

This paper describes how a recently developed test of English proficiency has combined the widely used descriptive labels of the CEFR with both quantitative and qualitative measures to define reading text and task difficulty at different levels of the CEFR, and will provide empirical evidence of the relationship between these measures and task difficulty. The paper will then describe attempts to integrate the Lexile measure—a Rasch-based measure of text difficulty originally developed for use with native speakers of English in the United States, which places the reading ability of test takers and the reading difficulty of a text on a common metric—into the specification of the test. Test specification faces the challenge of teasing out a realistic and practical balance between comprehensive coverage and the use of a limited number of efficient and manageable measures, and this research aims to investigate the utility of using the Lexile measure in an EFL reading context to help achieve that balance.

Personalized learning: integrating testing and instruction in data-rich digital environments

The unfolding digital era is presenting unprecedented opportunities to enhance and personalise learning. The internet is bringing worldwide access to an ever-growing array of services, tools and content. Online learning activities can reach the spectrum of connected devices at any given hour around the globe. Cloud-based infrastructure is enabling rich data collection and imperceptibly fast data processing, which in turn is driving increasingly "intelligent" online experiences. More engagement with such technology over time can empower deeper personalisation with ever more sophistication. As a result, a simple activity such as silent reading—once a strictly passive task—can now be transformed into a highly interactive and responsive endeavor.

While general technology advances are bringing educational opportunities, the promise of personalised learning is dependent on the coherence and utility of the measurements that fuel the system and empower the underlying analytics that drive user-centric features. The advances represented by Causal Rasch Models, applied within the context of an online platform that integrates testing and instruction with daily practice, present a compelling example to consider. New analytic options open up when the data structure is small n (few to one person) but large P (many item responses observed over 5-8 years). These new analytic methods resemble time series methods and can answer questions such as the effect of summer vacation or of daily practice on an individual student's growth trajectory. These methods address the valid criticism that, in general, studying inter-individual variation to infer anything about intra-individual variation is fraught with complications.
… further dimension of personalized, day-to-day learning. This addition will be exhibited in the form of a digital platform that harnesses trends in technology and advances in educational measurement. Exhibits using student data from around the world will be used to demonstrate the power of these new approaches in personalisation. Lastly, we will update our illustrative model of an educational system to show how common measures can flow freely throughout as a valuable new form of currency.

SYMPOSIUM 5

Fri, Mar 20
3:00-4:30pm (Mountbatten A)

Vocabulary in assessment: what do we mean and what do we assess?

Rob Schoonen, University of Amsterdam, rob.schoonen@uva.nl
… also have to question whether all relevant aspects of lexical-semantic processing are covered by the traditional measures of vocabulary knowledge.

This symposium aims to bridge the gap between cognitive models of vocabulary in language use and various measures of vocabulary. Three complementary perspectives will be taken. First, vocabulary size will be addressed. Using vocabulary size in research or teaching suggests that we have a clear understanding of when a word is known. However, as the first presentation shows, different formats of vocabulary size assessment lead to different outcomes, which potentially has consequences for language teaching. The second perspective concerns how well language learners employ the vocabulary knowledge they have, especially in language production. How do (traditional) measures of lexical variability in language use (based on types and tokens) relate to language proficiency? Furthermore, do they allow valid inferences about language proficiency? Some alternative approaches will be suggested. The third perspective concerns how language learners access their vocabulary knowledge in online language use. One may know a lot of words, but still have problems accessing that knowledge in conversation or reading/writing. Slow lexical-semantic processing clearly affects language performance. The questions become, then, to what extent do our vocabulary assessments take this accessibility aspect into account, and what are the best ways of measuring fluency in use?

Addressing these issues is an essential part of building validity arguments for the vocabulary measures that are so frequently used in language research and assessment. It may also shed light on new directions for vocabulary assessment. Additionally, valid assessment can inform test users about opportunities for learning and teaching.

The more, the merrier? Issues in measuring vocabulary size

Vocabulary tests are a key tool in the assessment repertoire of classroom teachers. They hold great potential for integrating learning and assessment, as they could be used for the diagnosis and identification of lexical strengths and weaknesses and for tailoring instruction to the learners' needs. Several tests have been developed and suggested for pedagogic use, such as the Vocabulary Levels Test (VLT) (Nation, 1990; Schmitt et al., 2001) or the Vocabulary Size Test (Nation & Beglar, 2007). Most of these have been commonly labelled measures of vocabulary size, suggesting that this is a well-defined construct, useful for pedagogic practitioners in their daily work with language learners. However, the interpretation of scores generated by tests of vocabulary size is not as straightforward as it seems. Size scores of learners are likely to be greater if only little depth of word knowledge is tested; vice versa, learners will know less vocabulary to a greater depth. The paper will therefore argue that size figures need to be interpreted in conjunction with the degree of knowledge tapped into in order for this size estimate to be useful for teachers and learners in pedagogic settings. The item format thus plays a crucial role, as it directly affects size measurement. However, there has been little research into what different size test item formats show about depth of knowledge. This paper will report on a study that subjected four commonly used item formats to empirical investigation. The informativeness of two recognition formats (VLT and 4-option multiple choice) and two recall formats (with and without a context sentence) was compared by administering 36 vocabulary items in the different formats in a Latin square design to native and non-native speakers of English. The word knowledge of the participants was afterwards verified in face-to-face interviews and written meaning recall measures, probing whether their scores derived from guessing, partial knowledge, or mastery of different word knowledge aspects of the target items. The results provide insights into the workings of the different
test formats, and will help both teachers and researchers interpret the resulting size figures stemming from vocabulary tests using these frequently used formats.

Variability in vocabulary use

… distinctions made between receptive (or passive) vocabulary and productive (or active) vocabulary. Fluency, or the speed and ease with which a person can access and use his or her vocabulary, is also sometimes added as yet another dimension of vocabulary to assess. Fluency of vocabulary access, however, poses an interesting challenge.
SYMPOSIUM 6

Fri, Mar 20
3:00-4:30pm (Mountbatten B)

The evaluation of school-based EFL teaching and learning in …

… the assessment of students' well-being in schools and factors affecting students' academic achievement, using questionnaire surveys of students, teachers and school principals. For example, in 2011 alone, the NAEQ assessed 50,000 students in 11 Chinese provinces (out of a total of 31). The results were reported primarily to the Ministry of Education, but also to provincial or regional educational administrative officials and teachers.
… broader field of language testing and assessment.

Assessing EFL learning in schools and factors affecting learning outcomes

… sleep time affect students' academic achievement. It was quite alarming to discover that in some rural areas Grade 4 students did not have English classes due to a lack of teaching staff. The fact that, when English is taught right from Grade 1 in some schools, students tend to confuse the English sound system with Chinese pinyin is also cause …
A baseline study of the washback of the English assessment in the NAEQ system

Sampling and scaling processes for the national assessment of education quality in China
… where 13,421 students were sampled from 446 schools in 140 regions, which provided a guarantee that the assessed population would be representative of Eighth Graders in general for that year. It will also describe the process of scaling student scores and how scaled scores can be used to interpret education quality in relation to the various factors.

SYMPOSIUM 7

Fri, Mar 20
3:00-4:30pm (Rossetti)

Applications of automated scoring tools for student feedback and learning

Sara Weigle, Georgia State University, sweigle@gsu.edu
… present current research on automated tools that can be used to support assessment and diagnostic feedback for student learning. The symposium includes four papers—three on writing and one on speaking—to be followed by a discussant who will synthesize the papers and discuss them from the perspectives of both an assessment specialist and a classroom teacher. Paper #1 analyzes data from an 8th-grade timed writing assessment, exploring relationships between product features, process features (i.e., keystroke logs), and student scores. Paper #2 reports on the use of an online system to provide diagnostic feedback within a learning-oriented, self-access, and/or teacher-directed assessment context. Paper #3 presents the current capabilities of a fully automated writing system that is used in both low-stakes, classroom-based assessments and high-stakes assessments, describing how such automated writing systems are developed and presenting validation evidence. Paper #4 presents research on the use of automated scoring of unconstrained speech, describing an initial exploration into making use of detailed speech data by including it in an augmented score report for a practice test of speaking that elicits unconstrained speech.

Combining product and process data: Exploring the potential for automated feedback in writing

… everything they know about a subject with little attention to addressing rhetorical or conceptual problems.) AES systems score the final written product, so they provide no information about planning, revision, or editing. On the other hand, these activities typically involve interruptions to the normal flow of text production and, as such, should be identifiable from a keystroke log. It should therefore be possible to gain a richer picture of student performance by examining a profile that combines AES product features with process features derived from a keystroke log. The paper analyzes data from an 8th-grade timed writing assessment, exploring relationships between product features, process features, and student scores. Our results suggest that, in this population, students obtain higher scores almost entirely because they are better prepared to undertake the writing task fluently and accurately. Higher-scoring students write longer and more fluently, producing longer texts with more sophisticated vocabulary that contain fewer errors. Lower-scoring students are less persistent, show more hesitations and corrections, and produce shorter, simpler, more error-prone text. Unfortunately, in this population, students seem to use editing and revision behaviors ineffectively (and primarily when they are already having difficulties). These results suggest that profiles combining product and process features can help teachers identify specific kinds of support students may require, either to increase their ability to …
… of language learning. E-assessments allow learners to receive individualized feedback based on their performance, with recommendations directing them to remedial, development or extension exercises, depending on their test scores. As the learning program progresses, more classroom test data are collected and an electronic student profile emerges, saved in the form of an electronic portfolio. Formal e-records of achievement can be maintained by individual learners or for the whole class. This in turn has the potential to inform teacher planning, feeding back into the teaching cycle.

In this paper we report on the use of an online system to provide diagnostic feedback within a learning-oriented, self-access, and/or teacher-directed assessment context, and address issues of construct validity, as well as of assessment use and impact. The system is an automated writing assessment tool which maps learners' output to proficiency levels defined by external benchmarks and frameworks of language ability. Implicit in this mapping is the identification of positive and negative writing features related to topic relevance, organization and structure, and language and style. We consider these features in terms of the degree to which they cover models and constructs of writing ability. These features can be weighted in different ways to maximize their predictive power within different language use contexts and for different L1 backgrounds. Learners can therefore receive overall and specific feedback that diagnoses the quality of their writing according to context, increasing construct validity.

Within a learning-oriented assessment context, learners can, perhaps assisted by their teacher, repeatedly access the system and work on improving various aspects of their writing, referencing their earlier work as desired. Using questionnaires to collect feedback from teachers and learners, we show that the system can be a useful supplement to other modes of learning and can help to promote learners' writing development.

Use of an automated writing scoring system in classroom and high-stakes contexts

Whether a language test is used for low-stakes or high-stakes testing, the test-taker's learning should ideally be enhanced as a result of the test event. One way to promote such learning is to provide feedback as quickly as possible, while the test-taker's experience with the test is still relatively fresh in memory. The importance of immediate feedback is similar in concept to the Output Hypothesis (Swain, 1993, 1995) in second language acquisition. In the context of language assessment, if the test-taker notices the gap between the task demands and his or her own ability to complete them during test-taking (i.e., the output process), receiving scores immediately could help the test-taker plan how to narrow the experienced gap. Therefore, automated scoring systems, in general, have the potential to help enhance the test-taker's learning.

In this paper, the authors present the current capabilities of an automated writing system that is used in both low-stakes, classroom-based assessments and high-stakes assessments. In the context of classroom-based assessment, the test-taker writes an essay in response to a prompt, or first reads a passage and then writes a summary of it. Once the learner's writing sample is submitted, it is immediately evaluated by the automated writing system on traits such as content, organization, word choice, and sentence fluency. The learner then has opportunities to improve the writing based on the feedback and to submit a revised text for another round of immediate feedback.

Such an automated writing scoring system has also been embedded in a few different high-stakes tests. Written responses, such as a reconstruction of a passage the test-taker has read and an email-style writing task, have been scored on a few traits including vocabulary, organization, and voice and tone. Sufficiently high correlation coefficients (above r = 0.90) between human scores
121
symposium Abstracts
and machine scores have been demonstrated in validation studies.

The paper describes how such automated writing systems are developed and presents validation evidence. The authors also discuss how this automated scoring system differs from other automated essay evaluation systems because of the natural language processing technique it uses, latent semantic analysis.

Use of automated scoring technology to provide feedback on unconstrained speech

scores were obtained from scoring models trained to predict human sub-scores for the language domains of delivery (fluency and pronunciation) and language use (vocabulary and grammar). Measurements of individual language features were selected based on the extent to which each measurement (1) provided actionable information, (2) was understandable to test takers, and (3) discriminated between differing levels of performance. User perceptions of the clarity and usefulness of the expanded score report were also collected in a small-scale usability study and will be described. The presentation will conclude with a summary of design considerations for presenting automated feedback on unconstrained speech.
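As a rough illustration of the kind of human–machine agreement check reported in these validation studies (the scores below are invented for illustration, not the authors' data), Pearson's r between two score vectors can be computed as follows:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical human and machine trait scores on a 0-5 scale
# (illustrative values only; not taken from the validation studies).
human   = [3, 4, 2, 5, 3, 4, 1, 5, 2, 4]
machine = [3, 4, 2, 5, 4, 4, 1, 5, 2, 3]

r = pearson_r(human, machine)  # about 0.94 for these invented vectors
```

In the studies referred to above, coefficients above 0.90 on checks of this kind were taken as evidence that machine scores track human judgments closely.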
Poster Abstracts
1. Assessing Chinese teaching assistants in the US

Dingding Jia, The Pennsylvania State University, duj144@psu.edu

International teaching assistants' communicative competence is one of the most prominent topics in studies of advanced second language speakers. Although many studies to date have focused on the linguistic features of international teaching assistants, most have been concerned with pronunciation and lexical variety; few have examined their use of discourse markers. Discourse markers are not covered in international students' previous English education in their home countries or in assessment in the US, but they are truly important in classroom interactions. Discourse markers are important linguistic features because they may display the teacher's stance and build rapport between the teacher and the students. This work examines the similarities and differences among Chinese teaching assistants at a large public American university. The first research question investigates the similarities and differences among Chinese teaching assistants in terms of the frequency of discourse markers. The second research question examines similarities and differences among Chinese teaching assistants with respect to the functions of discourse markers. The theoretical orientation of this research is guided by Schiffrin's study (1988) of discourse markers. In particular, this study analyzed several of the most commonly used discourse markers, such as I mean, uh, so, actually, and okay. Chinese teaching assistants who are doctoral students in science, technology, engineering, and mathematics participated in the study. Their lectures were transcribed using conversation analysis conventions. A conversation analysis approach was used because it allowed the researcher to identify the functions of discourse markers. Results showed that in terms of quantity, so, okay, and uh were the most frequently used discourse markers. With respect to functions, the Chinese teaching assistants primarily used discourse markers to manage their speech, but they did not use discourse markers very frequently to display attitudes or feelings. This study hopes to shed light on the assessing, teaching, and training of international teaching assistants.

2. Development of WIDA early years language assessments: From assessment to instruction

Carsten Wilmes, WIDA, wilmes@wisc.edu
Ahyoung Alicia Kim, WIDA, akim49@wisc.edu

The WIDA Early Years suite of language assessments, designed for ages 2.5–5.5, aims not only to screen and identify incoming DLLs to early care and education programs, but also to observe their language development over an extended period of time in everyday settings. Therefore, the assessments will have a strong formative characteristic. Due to the unique nature of early learners, a wide range of information will be collected from multiple sources, including parents/caregivers and practitioners. Detailed information obtained using the suite of assessments could be further used by practitioners for providing appropriate language support and instruction to DLLs. This poster describes the current work by WIDA to develop a suite of assessment tools to help practitioners understand, support, and monitor DLLs' progress in language development over time. In both design and mode of administration, these assessment tools take into account that children, ages 2.5–5.5 years, primarily learn language through the context of important relationships with significant caregivers during daily routines and play-based learning activities. The assessment
tools are designed to be used within a variety of early care and education programs and can be easily incorporated into existing routines and learning activities. In alignment with K-12 WIDA assessments, the WIDA Early Years Language Assessments will be standards-based and reflect the WIDA Early Language Development Standards. In addition, the assessments are to be used in conjunction with states' Early Learning Standards. Therefore, the assessments will measure the language of six standards: language of social and emotional development; language of early language development and literacy; language of mathematics; language of science; language of social studies; and language of physical development. In addition, the assessments will measure the two language domains—receptive and expressive—at three developmental levels—Level 1 (Entering), Level 3 (Developing), and Level 5 (Bridging).

3. Weighing the consequences of introducing an international large-scale English language test in Cuba

Jan Van Maele, KU Leuven (University of Leuven), jan.vanmaele@kuleuven.be
Roberquis Rodríguez González, Universidad de Oriente, translator-1@vlir.uo.edu.cu
Yoennis Díaz Moreno, Universidad de Oriente, translator-3@vlir.uo.edu.cu
Frank van Splunder, University of Antwerp, frank.vansplunder@uantwerpen.be
Lut Baten, KU Leuven (University of Leuven), lut.baten@ilt.kuleuven.be

made on the basis of an assessment constitute two essential components of any assessment use argument. In this poster we investigate what these components could mean in view of the possible introduction of an internationally recognized and commercially available large-scale English language test in Cuba. Bearing in mind how new a phenomenon nationwide foreign language certification is in the country, the general question that we seek to answer is: What should and what can be done in order to maximize the chance that introducing the test would bring about good in Cuban society, taking account of the (evolving) educational, sociopolitical, and legal environment? The authors are academics who have collaborated in long-term institutional university cooperation and network projects of the Flemish Interuniversity Council (Belgium) with universities in Cuba. In the Council's country strategy for Cuba, English language learning and communication is listed as one of three transversal themes for all current and future interventions (http://www.vliruos.be/en/countries/countrydetail/cuba_3850/). Based on their experiences in a previous ten-year project in central Cuba, the authors concluded in 2011 that there was a need for establishing a regional testing center, noting the following key issues: (a) intercultural citizenship for Cuban academics, (b) sustainable local institutional development, (c) validity of administered tests, and (d) privacy and confidentiality issues. This need was reiterated in the external assessors' final evaluation report (Carpenter & Vigil Tarquechel, 2013). Since then, initiatives have been started up to establish testing centers in the context of a new project in eastern Cuba, in view of the stated project goal of strengthening foreign language skills for intercultural and international academic purposes. It is expected that the introduction of an internationally recognized test at regional testing centers could contribute to increased
address some of the more elusive competences than those commonly captured in standardized testing, particularly the intercultural competences required for a successful academic mobility experience. The poster will report an initial answer to the general research question based on a stakeholder mapping exercise (identifying, analyzing, mapping, prioritizing) at the project, institutional, and regional levels that was conducted during the first two project years. Given that outward mobility for Cuban academics in the project concerned is almost exclusively Europe-bound, perspectives of European policymakers and students on academic mobility will also be included (Beaven, Borghetti, Van Maele & Vassilicos, 2013).

4. Investigating the effectiveness of formative assessment for an integrated reading-to-write task

Hyun Jung Kim, Hankuk University of Foreign Studies, hjk2104@gmail.com
Ahyoung Alicia Kim, WIDA, alicia.a.kim@gmail.com

student's interest. Then they wrote a summary and critique of the article as part of their classroom assessment. The instructor evaluated each summary/critique in terms of four dimensions—content, language, organization, and mechanics—and provided the individual students with detailed written feedback on the four dimensions. After the six-week period, the instructor interviewed each student to understand their perceptions of the effectiveness of written feedback for the development of their academic writing ability. Qualitative analysis indicated that the students had diverse developmental patterns across the four dimensions. Students demonstrated quick improvement in the coherence and cohesion pertaining to the organization of their writing. However, they struggled to include appropriate content for summary and meaningful critique. Interview findings indicated that the students' weakness in content was partly due to the nature of the integrated reading-to-write task. Overall, repeated writing practice and subsequent formative assessments helped the students improve reading skills as well as academic writing ability. The results provide pedagogical implications for the use of sustained formative feedback for L2 learners' development of academic reading-to-write skills.
a monopoly", and stated that the Duolingo test "could produce equally representative results of a student or job seeker's English proficiency" as the IELTS and the TOEFL iBT. Is the Duolingo English test the comet that will wipe out us language-testing dinosaurs? Or is this simply the case of a company with a large marketing budget trying to make a splash by ridiculing the (language testing) establishment? This work in progress tries to answer those questions by providing a critical overview of the Duolingo English test, the test tasks, how the test links to learning, and an exploration of possible washback effects. The presentation is directly related to the theme of the conference, in that we examine how Duolingo's English test is connected to their teaching/learning program. Because the Duolingo test is very new, there is almost no published research on it (cf. Vesselinov & Grego, 2012), and a validity argument has not been made for this assessment. Therefore, this critique uses a number of conceptual models, including test usefulness (Bachman & Palmer, 1996), test fairness (Kunnan, 2000), and consequential validity (Messick, 1989, 1994), to critique the test in terms of its supposed reliability, practicality, construct and consequential validity, and fairness. Beyond this conventional critique, however, we employ Shohamy's (2001) model of critical language testing to examine the Duolingo English test's potential to have a disruptive influence (both positive and negative) on the language testing industry, and the possible washback effects (both positive and negative) of the test becoming widely used. The examination of its washback effects will focus not only on test-takers themselves, but also on other test users, including universities that might use the Duolingo test for admissions purposes, and teaching and learning systems that exist in many countries to prepare learners for North American English-medium university admissions. This is a work in progress in that, while our critique of the test will be complete by the time of the conference, we are also interested in conducting an empirical study to collect evidence for the validity, reliability, fairness, access, and consequences of the test in order to examine its overall beneficial nature to the community.

6. Exploring the relationships among teacher feedback, students' self-efficacy, self-regulatory control, and academic achievement in the Chinese EFL context

Yongfei Wu, Queen's University, yongfei.wu@queensu.ca

This study aims to explore the relationships among teacher feedback, students' self-efficacy, self-regulatory control, and their academic achievement in the Chinese EFL context. The existing research on classroom assessment has been largely practical, i.e., seeking successful assessment practices, rather than theoretical (Brookhart, 2013). Without theoretical guidance, classroom assessment practices tend to be episodic and dispersed (Allal & Lopez, 2005), which may make our understanding of classroom assessment fragmented and inadequately explicated (Torrance, 1993). The research on feedback, the central aspect of classroom assessment, needs "considerably better theoretical grounding" (Wiliam, 2013, p. 207) so that it may play a powerful role in promoting students' learning. Theories of self-regulation have been discussed in depth to develop and solidify the theory of classroom assessment, especially feedback (see Andrade, 2013; Nicol & Macfarlane-Dick, 2006; Wiliam, 2013). Yet empirical studies that examine the relationships among the key aspects of feedback, self-regulation, and achievement are rare. This study adopts a sequential explanatory mixed-methods design with a quantitative and a qualitative phase. In the quantitative stage, a questionnaire will be designed to measure how students perceive teacher feedback, their self-efficacy for English learning, and their self-regulatory control. The survey will be conducted among approximately 1,000 EFL students at a comprehensive university in China. Students' academic achievement data will be collected from the test scores in their English final exams. Descriptive statistics, correlation,
and structural equation modeling will be used to analyze the collected data to examine the interrelationships among feedback variables, self-efficacy variables, self-regulatory control variables, and academic achievement. In the qualitative stage, individual interviews (n=12) and focus groups (n=20) with students sampled from the quantitative phase will be conducted to complement the relationships detected in the quantitative stage. A standard thematic coding process will be used and the co-occurrence of the codes will be identified to explore the relationships among feedback, self-efficacy, self-regulatory control, and academic achievement. The findings of this study potentially have both theoretical and practical implications. First, this study is interdisciplinary in nature, introducing cognitive psychology into the study of classroom assessment. This interdisciplinary inquiry makes a concerted effort with other researchers in repositioning classroom assessment, formative assessment in particular, within self-regulated learning in order to develop the theory of classroom assessment. The findings may provide a nuanced, in-depth understanding of the intertwined relationships between the core aspect of classroom assessment, i.e., teacher feedback, and the key components of self-regulation, i.e., self-efficacy and self-regulatory control, which may contribute to the development of classroom assessment theory. Second, the findings of this study have potential pedagogical implications. They may provide guidance for EFL teachers to avoid superficial adoption of formative assessment and to capture its spirit, i.e., empower students to become self-regulated learners (Black & Wiliam, 2009).

7. Development of the Norwegian formative writing assessment package

Arne Johannes Aasen, Norwegian Centre for Writing Education and Research, arne.j.aasen@skrivesenteret.no
Lars Sigfred Evensen, Norwegian University of Science and Technology, lars.evensen@ntnu.no
Anne Holten Kvistad, Norwegian Centre for Writing Education and Research, anne@skrivesenteret.no
Siri Natvig, Norwegian Centre for Writing Education and Research, siri@skrivesenteret.no
Jorun Smemo, Norwegian Centre for Writing Education and Research, jorun@skrivesenteret.no
Gustaf Bernhard Uno Skar, Norwegian Centre for Writing Education and Research, gustaf.b.skar@hist.no

This poster describes the development process of the Formative Writing Assessment Package (FWAP), which is funded by the Norwegian government. The FWAP is an online tool available to teachers in Norway. It consists of items, generic standards applicable to writing in a range of different contexts, and exemplar texts associated with each standard. It also includes guidelines on how to use information from writing assessments when planning instruction. The purpose of the FWAP is to promote data-driven, formative writing instruction in Norwegian classrooms. An additional purpose is to establish a national norm community, i.e., shared meanings of what writing is and what it means to be a proficient writer. This has resulted in a data-based "bottom-up" design in the development of scoring rubrics (cf. Fulcher, 2012). Writing is defined as a key competency in the Norwegian National Curriculum. Writing is, therefore, taught across the curriculum, and all teachers share a responsibility for teaching writing within their subjects. The National Curriculum is informed by functionalist theories of writing in general
and the theoretical model of the Wheel of Writing (Berge, Evensen, & Thygesen, 2014) in particular. Commissioned by the government agency the Norwegian Directorate for Education and Training, the FWAP is developed by the Norwegian Centre for Writing Education and Research. A single writing assessment package is developed during a three-year cycle. In the 1st year, items are constructed and field tested on a small scale. A panel of a few teachers and test developers reviews the functionality of items by reading and marking scripts, as well as reviewing the alignment between the items and the National Curriculum, subject syllabi, and the theoretical model. Through this process a number of items are often dropped. In the 2nd year, items are piloted on a national scale with a representative sample of schools. Data from the main pilot is used to refine prompts, to revise scoring rubrics, and to find exemplar texts. In the 3rd year, four items are taken by 400 test takers per item. This data is used to continue to refine the scoring rubrics and identify exemplar texts. Consistent with the goal of establishing a norm community, the rating procedure and the development of the generic scoring rubric have a "bottom-up" design. In the two final stages of the development cycle, the scripts are rated by two independent pairs of raters (i.e., 2x2), drawn from a national panel of 90+ trained teachers. To establish a culture of negotiation and eventually shared norms, the composition of pairs is altered so that two raters form a pair for only one rating session. Moreover, on the assumption that the panel of teacher-raters represents professional norms, the scoring rubrics are developed based on the 2x2 ratings, being thus entirely empirical (i.e., representing "actual" rather than "aspirational" standards) and coming from "below" rather than top down from the test developer.

8. Bridging the gap between large scale assessment and classroom assessment for learning

Phallis Vaughter, Educational Testing Service, pvaughter@ets.org
Pablo Garcia Gomez, Educational Testing Service, pggomez@ets.org
Trina Duke, Educational Testing Service, tduke@ets.org
Susan Hines, Educational Testing Service, shines@ets.org
Emilie Pooler, Educational Testing Service, epooler@ets.org

Assessment for learning (AfL) is an approach to formative assessment that is increasingly being explored in ESL/EFL contexts (Davis, 2007). Effective formative assessment encourages students to ask three fundamental questions about their learning process: "Where am I going?," "Where am I now?," and "How can I close the gap?" (Stiggins, 2005). AfL helps students answer these questions by using a few strategies that, taken together, provide the basic conceptual framework behind the approach. Some of the basic strategies are:

- providing a clear and understandable vision of learning targets
- using examples and models of strong and weak work
- offering regular descriptive feedback
- teaching students to self-assess and set goals
- designing lessons to focus on one aspect of quality at a time
- teaching students focused revision
- engaging students in self-reflection, letting them keep track of and share their learning

Because AfL is a relatively new concept in ESL/EFL, few materials and concrete examples of possible implementations of the approach have been made available for teachers. Drawing on their expertise as ESL/EFL teachers, assessment specialists, and teacher trainers, the presenters have designed a professional development workshop that highlights AfL approaches. The workshop helps teachers
implement AfL through the use of large-scale assessment tools and practices, such as scoring rubrics, sample responses, descriptive information about responses, can-do statements, performance descriptors, and calibration processes and techniques. In this poster, the presenters will provide a brief introduction to AfL principles based on their extensive review of research on the topic. They will also provide a detailed description of the workshop content and the tasks in which participating teachers engage, such as:

- discussing strategies to make test scoring rubrics student-friendly for use in the classroom
- working with test-taker responses at different levels of proficiency
- practicing scoring sample responses using scoring rubrics and benchmark responses as a reference
- engaging students in self- and peer-assessment
- practicing how to use assessment tools to guide focused revision

Finally, the poster will describe how this workshop framework can serve as a reference in replicating the process with different content materials and for ELT courses of various kinds.

9. Maximizing feedback for learning: Investigating language learners' differing cognitive processing when receiving computer-based feedback

Maggie Dunlop, OISE/University of Toronto, maggie.dunlop@utoronto.ca

although the language assessment field has paid much attention to the validation processes of test constructs and scoring methods (Messick, 1989; Weir, 2005), it is only beginning to explore issues of how test feedback is delivered to and used by learners (Chapelle, 2012; Jang, 2005). As testing becomes increasingly computer-based (cf. Criterion, DELNA, DIALANG, PTE, TOEFL iBT), understanding how to maximize the impact of computer-based feedback is imperative for language testers. This study addresses these gaps in our understanding of how learners negotiate computer-based feedback on foreign language tests by investigating ways in which learners of varying learning and affective backgrounds engage in processing computer-based feedback. The study adopts an experimental research design and utilizes mixed methods. Data is being gathered from adult immigrant English as a Second Language learners in Canada, and uses a reading test from the CELPIP, a Canadian test of English for immigration purposes. The reading test is being adapted to identify current learner abilities in key cognitive language skills. Adult immigrants to Canada who are currently studying English are invited to participate in this study. Study participants (n=100) receive their feedback report and plan their learning, all online. After participating in data collection activities, they download their report and learning plan to keep. Three data collection activities are taking place. The three activities are designed to collect multiple sources of data to inform the following research questions: How do students vary in their processing of different aspects of
in a follow-up interview one month later to elicit information on ongoing processing, interaction, and application. Data collection will be completed in December 2014. The results of this study, which will be available by March 2015, will inform our understanding of the cognitive processes by which language learners engage with making meaning of descriptive, process-oriented computer-based feedback. These enriched understandings will assist educational institutions and test developers in developing meaningful, personalized computer-based feedback processes for learners, particularly where direct individual interaction with a subject expert is unavailable.

10. Consequences reconsidered: validating the new FSI speaking assessment

Erwin Tschirner, University of Leipzig, tschirner@uni-leipzig.de

has given rise to the ILR Level Descriptors and ACTFL Proficiency Guidelines and has influenced the Common European Framework of Reference for Languages (CEFR), among many others. In line with current calls to expand the notion of validity to include the impact and consequences foreign language assessments have, FSI decided to revise its speaking assessment on the basis of a needs assessment completed at US embassies and consulates worldwide. FSI then commissioned a validation study to have the new format validated independently. The validity study consisted of a side-by-side study of the old and new formats of the speaking assessment across six languages: Arabic, Chinese, French, Russian, Spanish, and Turkish. A total of 41 speaking assessments following the new format were analyzed, as well as a total of 36 speaking assessments following the old format. Old and new speaking assessments were administered to the same examinee for comparison purposes. These assessments were reviewed and re-rated by certified ILR testers who were native speakers of the language they assessed. In addition to providing ratings following two different protocols (FSI and DLI), the ILR testers filled in a 129-item questionnaire focusing on test administration and rating procedures, appropriateness of topics and tasks, examinee performance, and a comparison of the old and new formats. Moreover, they participated in a focus group after completing the reanalysis of the assessments. The poster will describe the changes to the FSI speaking assessment and why they were implemented. It will also describe the results of the study to validate the new speaking assessment using Bachman/Palmer's (2010) assessment use argument.

Pragmatics has long been anchored in language competence frameworks. Instruments such as role-plays as well as written and oral discourse completion tasks have been used successfully to assess L2 pragmatic competence. However, while the majority of pragmatic tests are aimed at measuring productive skills, the assessment of receptive pragmatic skills is still an underrepresented field. This study introduces and explores the validity of the American English Sociopragmatic Comprehension Test (AESCT), a web-based sociopragmatics test consisting of 54 multiple-choice discourse completion tasks. These tasks were designed to measure knowledge of speech
acts/implicatures (Section 1), routine formulae (Section 2), and culture-dependent differences in lexis (Section 3). The main purpose of this study was to investigate systematically the performance of the AESCT and to explore internal, structural, and external aspects of construct validity as identified by Messick (1989). Therefore, the following research questions were proposed: (1) What are the AESCT's psychometric characteristics? (2) Does the empirical factor structure of the AESCT reflect the construct? (3) Do students with high frequency of exposure to the target language score higher on the AESCT and its subsections than students with less frequent exposure? (4) Which types of target language input contribute to higher sociopragmatic competence as measured by the AESCT? To explore these questions, university-level EFL/ESL learners (n=97) were asked to take the test along with (a) the Cambridge Placement Test (CPT) and (b) a background questionnaire about their exposure to different types of English language media. To address the internal aspect of construct validity, descriptive statistics, G-theory, and item response theory were used. For the structural aspect, factor analysis was employed. Finally, to investigate the external aspect of construct validity, linear regression and multilevel modeling were used to analyze the relationship between the test's total and section scores and different background variables. Findings provide preliminary evidence for the AESCT's construct validity. First, descriptive statistics and item response theory showed that the AESCT provides reliable estimates of learners' sociopragmatic knowledge. Moreover, correlations between the sections and factor analysis suggested that the three sections of the AESCT (measuring speech acts/implicatures, routine formulae, and culture-dependent differences in lexis) are highly related and that the test is essentially unidimensional. Finally, linear regression and multilevel modeling identified general English proficiency, living experience in the TL culture, and exposure to target language audiovisual media as features that improve learners' sociopragmatic comprehension. Surprisingly, frequent use of print media was not found to be a significant predictor of higher AESCT scores – an interesting finding given the large amount of text-based learning materials used in EFL/ESL instruction. In sum, the results reveal a reliable prototype of a test that can be used to assess students' receptive pragmatic abilities. Moreover, they also provide ideas as to how the AESCT can be used in EFL/ESL education for the purpose of raising learners' awareness of culture-specific sociopragmatic differences.

12. Low-level learners and computer-based writing assessment: The effects of computer proficiency on written test scores

Laura Ballard, Michigan State University, ballar94@msu.edu

In recent years, many large-scale, high-stakes tests have moved from paper to computer-based formats (e.g., TOEFL, IELTS, WIDA, Smarter Balanced). In response, education, technology, and language testing researchers have conducted comparability studies on these testing modes (Endres, 2012; Kim & Huynh, 2008; Kingston, 2009). While existing research on testing mode addresses the validity and ethics behind each mode, there is a lack of research examining their effects on specific populations of language learners; in particular, low-level ESL learners who potentially have limited computer exposure. To address this, I investigated: 1) whether low-level learners perform differently on paper-based and computer-based writing assessments; and 2) if so, whether score differences could be attributed to computer proficiency. Twenty-nine low-level English learners completed a paper-based and a computer-based version of a university ESL writing placement test. The participants took an L1 and L2 computer-proficiency test and responded to a questionnaire about computer habits. Results indicated no statistical difference between participants' paper-based and computer-based writing scores; thus, computer proficiency had a negligible effect on writing scores. Students' computer proficiency, however, mainly affected
131
poster abstracts
their writing fluency, yet length was not taken into account in the rubric. Additional analyses revealed an interaction between writing fluency and writing-mode preference, a positive correlation between computer proficiency (in both the L1 and the L2) and L2 proficiency, and a staggeringly low computer proficiency overall (17 WPM in the L1 and 15 in the L2). Importantly, the study revealed that students' L2 computer proficiency lagged behind their L1 computer proficiency until they reached an intermediate English-language proficiency, suggesting a lack of automatization in typing and letter encoding in novice- and beginner-level students. I use the results to discuss test-mode effects and computer-based test policies and protocols (time limits and mode optionality). I conclude by calling for computer processing instruction in ESL writing courses and more in-depth research on low-level L2-learner populations.

13. The role of visuals in listening tests for academic purposes

Seyedeh Saeedeh Haghi, University of Warwick, S.S.Haghi@warwick.ac.uk
Khosro Vahabi, Ozyegin University, khosro.vahabi@ozyegin.edu.tr

With the increase in the use of visual components in teaching listening comprehension, researchers in language testing began to investigate the role of these visuals in the testing of listening comprehension. Despite the growing body of research examining the inclusion of visual stimuli in listening tests, there are aspects which still invite further investigation. One apparent gap in the literature is the unequal scrutiny given to the two types of visuals used in academic listening texts. While many studies have assessed the role of one type of visual, 'context visuals', in listening tests, very few have investigated the role of the other type, known as 'content visuals'. In addition, a considerable amount of research has relied heavily on quantitative data obtained from comparative research designs, which, though addressing the outcomes of experiments, have failed to properly explain the role of individual factors in those outcomes. This study therefore intended to broaden this area of research by investigating the role of these visuals in listening tests for academic purposes, focusing on both context and content visuals and using a mixed-method research approach. To this end, the effects of three types of delivery modes (audio-only, audio-video and audio-multimedia) on test-takers' performance and note-taking practices in a listening test for academic purposes, as well as their perceptions towards the inclusion of these visuals, were surveyed and analyzed using a quantitative and qualitative mixed-method approach. The quantitative strand included the analysis of test scores (making comparisons between the three groups), the content of the notes (both quantity and quality), and the responses to the Likert-scale items in the exit questionnaire (the perceptions of test takers towards the three delivery modes). The qualitative strand consisted of the analysis of the open-ended items in the exit questionnaire. The results from the analysis of the primary and the secondary databases were then mixed in order to address the three research questions. Findings from this study revealed no significant difference in performance between the audio-only and content visual groups; however, a significant difference was noted between the audio-only group and the context visual group, with the audio-only group gaining significantly higher scores. These findings suggest that while content visuals seemed to have neither debilitative nor facilitative effects on test performance, the context visuals did appear to be inhibitive. The analysis of participants' perceptions towards the inclusion of visuals also confirmed these results, with the content visual group showing a more positive attitude than the context visual group. These results also demonstrated that while the content visuals did not seem to contribute to, or take away from, test-takers' performance in the test, they did seem to have significantly facilitated the efficacy and organization indices of notes. These findings make contributions to both testing and pedagogical aspects of listening for academic purposes, which are also identified in this study.
14. Factor structure of the Test of English for Academic Purposes

Yo In'nami, Shibaura Institute of Technology, innami@shibaura-it.ac.jp
Rie Koizumi, Juntendo University, rie-koizumi@mwa.biglobe.ne.jp
Keita Nakamura, Eiken Foundation of Japan, ke-nakamura@eiken.or.jp

are empirically supported relationships between the intended interpretation of scores and the constructs being measured (e.g., Messick, 1996; Sawaki, Stricker, & Oranje, 2009). Data were obtained from 100 Japanese EFL university students taking both the TEAP and the TOEFL iBT—another test of English in an academic setting. We first analyzed the TEAP data to examine its factor structure; the best-fitting model was then tested with the TOEFL iBT to examine if both tests measured similar constructs. For both analyses, we used confirmatory factor analysis. First, for
15. The JLTA online tutorials for the promotion of language assessment literacy in Japan

Yasuhiro Imao, Osaka University, imao@lang.osaka-u.ac.jp
Rie Koizumi, Juntendo University, rkoizumi@juntendo.ac.jp
Yukie Koyama, Nagoya Institute of Technology, koyama@nitech.ac.jp

This poster reports on online tutorials created by the Japan Language Testing Association (JLTA), partially funded by a Workshops and Meetings Grant from ILTA. While enhancing language assessment literacy through face-to-face workshops is a useful endeavor, many teachers do not have time to attend them, yet still face a variety of difficulties related to language assessment every day. There are websites that cater to this need, such as the "Companion Online Tutorial" by the Center for Applied Linguistics (Malone, 2013) and the "Language Testing Resource Website" (Fulcher, 2014). These English-language websites allow interested people to access information on matters such as how to compose good language tests and how to interpret test scores. As an organization dedicated to serving language teachers in Japan, we decided to serve local needs and create online resources in both Japanese and English. JLTA published useful materials on its website, such as web tutorials for language testing, which this presentation focuses on. We conducted a needs analysis using a questionnaire to investigate language instructors' understanding of language testing and their related needs (Thrasher, 2012). The results, from approximately 60 language teachers from lower secondary to university level, showed that they felt their knowledge of language testing to be somewhat insufficient but that they were interested in a wide range of topics in language testing, although with a focus on basic practical issues over theoretical ones. These results led to the decision to prioritize basic topics and materials written in Japanese. We selected two groups of topics for the JLTA language testing online tutorial: (a) the basics of language testing and (b) test data analysis for test improvement and basic statistics for language education/test research. The first group of topics, which were the initial focus, included validity, reliability, practicality, test constructs, test specifications, and testing the four skills, vocabulary, and grammar. JLTA members were asked to contribute slideshow materials to the project—annotated slides to help the main target learners (secondary school English language teachers) learn by themselves. Next, 17 English language instructors from elementary school to college level were asked to evaluate the online tutorial materials and give suggestions for revision. Overall, the tutorials were well received by the reviewers, with some minor suggestions for revisions. The authors of the materials were asked to revise their materials based on this feedback. To allow access not only from PCs but also from tablet devices, whose use is increasing, the tutorial slides are in the JPEG image format; they were watermarked, and a web slideshow was generated for each slide file using HTML and simple JavaScript. At the time of this writing, 11 web tutorials are available on the JLTA website (see https://jlta.ac/course/view.php?id=35). We will next create further web tutorials covering the second group of topics, improve the tutorials based on feedback from users, and continue to promote understanding of language testing among language education professionals in Japan.
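The slide-delivery approach described above (watermarked JPEG slides wrapped in a page of HTML with a few lines of JavaScript) can be sketched as follows. This is a minimal illustration only, not the JLTA's actual generator; the function name, file names, and markup are assumptions.

```python
def slideshow_html(jpeg_files, title="Tutorial"):
    """Build a single-file HTML slideshow over JPEG slides with a few
    lines of inline JavaScript. Illustrative sketch only: the JLTA's
    actual generator is not shown in the abstract, and the markup and
    names here are assumptions. Assumes jpeg_files is non-empty."""
    slides = ", ".join('"{}"'.format(name) for name in jpeg_files)
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body>
<img id="slide" src="{jpeg_files[0]}" alt="slide">
<button onclick="step(-1)">Prev</button>
<button onclick="step(1)">Next</button>
<script>
var slides = [{slides}], i = 0;
function step(d) {{
  i = (i + d + slides.length) % slides.length;
  document.getElementById("slide").src = slides[i];
}}
</script>
</body></html>"""
```

Writing one such page next to each deck's JPEG files would serve PC and tablet browsers alike, which matches the access goal stated above.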
16. Health care Finnish: Developing and assessing Finnish proficiency among health care professionals

Tiina Lammervo, Centre for Applied Language Studies, University of Jyväskylä, tiina.lammervo@jyu.fi
Marja Seilonen, Department of Languages, University of Jyväskylä, marja.a.p.seilonen@jyu.fi
Minna Suni, Department of Languages, University of Jyväskylä, minna.suni@jyu.fi

the current language test. Results of this analysis will contribute to the design of an assessment experiment. The data of 25 medical professionals originate from the FNCLP corpus (levels B1-C2). The findings are discussed in relation to the participants' background information, which includes participants' migration histories, authorisation processes, language studies and self-assessments. The information is available from the FNCLP corpus and provided by the National Supervisory Authority for Welfare and Health (Valvira). The experiment of professional language proficiency assessment is conducted with reference to the National Certificates of Language Proficiency. In the experiment, productive professional language skills will be assessed in a
proficiency of college students who have undertaken compulsory English study, based on the College English Curriculum Requirements (Ministry of Education, 2007). The TOPE adopts a face-to-face format consisting of three major sections: 1) Reading aloud and retelling; 2) Presentation and discussion; and 3) Impromptu speech and Question & Answer. With the aim of determining the overall validity and acceptability of the assessment, the study was designed to explore the following two research questions: 1) To what extent does the TOPE test elicit the intended language functions? 2) Is there any evidence from candidate performance that validates the scale descriptors? Students' speaking performances in real testing situations were video-recorded and transcribed. As the actual assessments are conducted by college teachers, the views and professional judgments of teacher-assessors regarding the testing and rating procedures were also consulted using questionnaires and interviews. To analyze the data, a language function analysis was conducted for each speaking task, and instances of functions derived from O'Sullivan et al. (2002) were counted. In addition, discourse analysis of students' speech performances was performed in terms of the salient features characterizing each level. Results of different proficiency groups were compared to identify to what extent each of these features differs between adjacent proficiency levels. Overall, the study lent support to the validity of the TOPE. Implementing the school-based assessment is highly challenging in mainland China; issues involved in the process of developing and validating the test, including teacher-assessor motivation, assessment management and implementation, and assessment for learning, were also discussed.

18. Assessing low educated L2-learners – issues of fairness and validity

Cecilie Hamnes Carlsen, Vox - Norwegian agency for lifelong learning, cecilie.carlsen@vox.no

When constructing language tests, one of the many factors test constructors need to take into account is the group of test takers for which the test is intended. A test for young learners needs to be constructed differently from a test for adult learners, and a test which may yield valid scores for one group may not be suitable for another (Bachman 1990). One test taker characteristic which poses specific challenges in language assessment is the candidates' educational background, or to be precise: the lack of such. Adult L2-learners with no or very poor literacy skills form part of the immigrant population. As more and more European countries make language requirements for entry to the host country, for permanent residency or citizenship as well as for entrance into the work market, low educated learners form an increasing part of the test population as well (Extramiana, Pulinx & Van Avermaet 2014). Test fairness is defined in different ways in the testing literature. Most professionals seem to agree that a test should be free of bias, and that all candidates should have an equal opportunity to demonstrate their language skills (Shaw & Imam 2013, Kunnan 2014). Fairness is closely linked to validity. To be valid, a test should measure as much as possible of the construct in question, and as little as possible of other skills and abilities. The latter is referred to as
construct-irrelevant variance, and it is a matter of
particular challenge when assessing learners with
limited school background and prior experience
with testing (Allemano 2013). The aim of this paper
is twofold: Firstly, I will present the development
of a new CEFR-based test of Norwegian as a
second language in which the concern for the low-
educated learners has been highly prioritized from
the start. In the presentation I will highlight specific
measures that were taken in the test format and
test system to accommodate the needs of this group of test-takers. I will also show how specific attention to the group of low-educated learners necessitates heavy contextualization, adapted test formats, extensive use of pictures, avoidance of hypothetical situations where possible, etc. Secondly, I will present the test results of this particular group of test takers as compared to the results of learners with more school background, in order to demonstrate that, if constructed with low educated learners in mind, it is indeed possible to construct a fair and valid test for this group, giving everybody a fair chance to demonstrate their language abilities. This is a matter of existential importance to the individuals it concerns, and an ethical responsibility of policy makers and test constructors.

19. Developing an online, automated system to provide diagnosis, feedback and tracking of academic English writing ability

Alan Urmston, Hong Kong Polytechnic University, alan.urmston@polyu.edu.hk
Michelle Raquel, Lingnan University, michelleraquel@ln.edu.hk

to assess the proficiency in reading, listening, grammar and vocabulary of NNS students at English-medium universities using a web-based platform. Students receive a report which shows their overall proficiency level (DELTA Measure), Component Skills Profile and Component Diagnostic Reports. From the reports, students know which of the four components they are relatively weaker in and the sub-skills that caused them unexpected difficulty. Given the importance of writing at tertiary level, a writing component for DELTA was deemed essential, and so this has been in development since 2012, with the features described above, i.e. diagnosis, feedback and tracking in an online, automated system. This poster describes the development of the DELTA Writing Component. It will provide information on task and rubric design, the utilisation of Lightside Labs as its automated essay scoring engine, and the design of an automated feedback system. The poster will also report on the results of the trial of this system at five universities in Hong Kong, in particular students' perceptions of the effectiveness of this system as a tool to enhance their writing ability. It is envisaged that the addition of this Writing Component will further enhance the effectiveness of DELTA as a diagnostic and tracking tool that will support students' development of academic literacy at the tertiary level.
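As a rough illustration of the surface features an automated essay scoring engine commonly starts from (the abstract does not describe the Lightside Labs feature set, so the features and names below are assumptions, not the DELTA system's):

```python
import re

def essay_features(text):
    """Toy surface features of the kind automated essay scoring systems
    often use as a starting point. Purely illustrative; real engines add
    many more lexical, syntactic, and content features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "word_count": n_words,
        "avg_sentence_length": n_words / max(len(sentences), 1),
        # type-token ratio: a crude index of vocabulary diversity
        "type_token_ratio": len(set(words)) / max(n_words, 1),
    }
```

A trained model would then map such features, plus much richer ones, to rubric scores, with automated feedback keyed to the features driving the prediction.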
20. Rubric development through
part of the test development and validation process. As documents that realize test constructs, mediate between test performances and scores, and communicate constructs to test stakeholders, rubrics are values-driven documents. In this poster, we discuss a rubric development project in terms of the underlying social and cultural values communicated through the testing materials. We consider how the explicit values of the World-Class Instructional Design and Assessment (WIDA) Consortium, as defined by a "Can Do" philosophy, were realized within the scoring criteria for the writing component of a large-scale English language proficiency assessment. The new rubric is being developed as part of work on a new, computerized writing assessment to be used with more than one million English language learners (ELLs) in the United States annually. The Can Do philosophy focuses on the strengths and assets of linguistically and culturally diverse students in order to support ELLs' academic language development and academic achievement. One goal of the assessment is to build awareness of language learners' strengths and provide valuable information about this to educators, students, and families. During the rubric development process, researchers addressed the following question: How can rubric descriptors reflect a positive, Can Do framework for describing second language development in a way that is meaningful to test raters and other stakeholders? Stages in the development process are described, including: comparing the previous rubric and the newly amplified standards, discussions with and review by stakeholders, piloting the rubric with ELL writing samples, and addressing issues in using a single rubric to evaluate the writing of ELLs in grades 1-12. The new writing rubric is also provided, with explanations of the revisions and of how it frames ELLs' strengths. The discussion focuses on the social values communicated by the rubric and how this relates to the test's assessment use argument.

21. Developing a speaking rubric for use across developmental levels

Samantha Musser, Center for Applied Linguistics, smusser@cal.org
Megan Montee, Center for Applied Linguistics, mmontee@cal.org
Margaret Malone, Center for Applied Linguistics, mmalone@cal.org

Rubrics are the link between performances on a test and the test score, and rubric descriptors mediate how raters and other stakeholders understand examinee performances. Test developers face a challenge in creating rubric descriptors that are meaningful to raters and grounded in the underlying test construct. In reference to scoring scales, North (2000) has described this challenge as "trying to describe complex phenomena in a small number of words using incomplete theory." This poster describes the development of a rubric and scoring descriptors for a computerized assessment of academic English language proficiency for students in U.S. grades 1-12. In this project, a single rubric was developed to score the oral performance of students ranging in age from approximately 7-18 years old. Researchers faced the challenge of creating a single scale that would be interpreted in relation to the cognitive development of examinees. In developing the rubric and related scoring materials, researchers addressed the following questions: • How are language proficiency levels realized across various ages and developmental levels? • What criteria do raters need to understand and apply the rubric across grade levels? The development process for the rubric drew from both empirical and theoretical approaches. Researchers analyzed examinee performances within and across grade levels in order to understand how rubric features are realized by examinees at different grade levels. The research team investigated the following features of examinee performances: comprehensibility, fluency, linguistic complexity,
138
poster abstracts
language forms, and vocabulary. Researchers also analyzed grade-level standards for language and content knowledge development as well as theory related to second language acquisition and cognitive development. Draft rubric descriptors were then applied to performances across grade levels, and raters provided feedback that was used to further refine descriptors and supporting materials. The result of this project was a single rubric to use across grade levels as well as a set of rubric interpretation guides. The rubric interpretation guides describe how the rubric should be interpreted at different grade levels. Finally, the results of the project informed the design of self-access rater training materials. In the poster, we present the rubric development process, discuss how the analysis of examinee performances affected rubric descriptors, and consider implications for approaches to rubric development.

22. Assessing future learning: Testing an ability to identify definitions for technical vocabulary in introductory textbooks

Daniel Isbell, Northern Arizona University, daniel.r.isbell@nau.edu

This paper reports on an innovative measure of prospective ESL university students' abilities to develop knowledge of new terms presented in a text. When students enter an American university, they typically take a variety of introductory courses. In these courses, information is commonly (if not primarily) provided to students through textbooks, where they begin learning the foundational concepts and technical vocabulary of a field. This technical vocabulary accounts for a considerable portion of words in textbooks (Chung & Nation, 2003), and much of it is considered beyond the realm of L2 instruction (Read, 2000). However, introductory textbooks are notable for providing in-text definitions (Carkin, 2001). Assessing whether or not L2 students have sufficient ability to connect in-text definitions to technical vocabulary is thus potentially valuable information. Existing direct measures of L2 vocabulary primarily focus on existing vocabulary knowledge (Read, 2007), and common indirect measures involve the robustness of vocabulary in spoken/written production. Neither of these methods would appear to account for the learning of unknown technical vocabulary through reading. The present study aimed to 1) design a test which measures the ability to correctly identify in-text definitions for technical vocabulary and 2) compare two item formats for delivery of the test. To those ends, a test called the Vocabulary through Reading Test was developed. An excerpt from an introductory biology textbook was chosen as input, and technical terms supported by in-text definitions were substituted with nonwords to control for existing vocabulary knowledge (K = 10). In a short-answer format, examinees (N = 158) read the text and wrote definitions for each of the nonwords. Responses were scored dichotomously based on meaning. A subset of responses was scored by two human raters, who had acceptably high agreement. This allowed for scoring to be completed by a single rater, yielding an internal consistency (Cronbach's alpha) of .79. The test was found to effectively make distinctions among examinees (M = 5.12, SD = 2.84), and items were within desirable ranges of facility and discrimination. However, while human scoring was acceptably consistent, it was also labor intensive. To explore the issue of item format, a multiple-choice version of the test was developed using responses from the short-answer version to create keys and distractors for each item. 147 different examinees took the multiple-choice version, which also effectively made distinctions among examinees (M = 5.48, SD = 2.23), though its internal consistency was somewhat lower (Cronbach's alpha = .61). Mean scores for the two groups were not significantly different, suggesting that the two versions of the test provided equivalent information. Items in both formats generally performed similarly, as did first-language examinee subgroups. Further analysis revealed some difference in one of the two subconstructs, which may suggest a small degree of construct-
irrelevant variance due to format. In terms of practicality, the multiple-choice version was more easily and quickly scored. These results should be of interest to higher education stakeholders and of relevance to language testers interested in item formats and the integration of reading and vocabulary measurement.

23. Comparing dimensionality assessment approaches using empirical data: Investigating the dimensionality of the CELPIP-G reading test

Michelle Yue Chen, The University of British Columbia / Paragon Testing Enterprises, mchen@paragontesting.ca

Background: Determining test dimensional structures serves important purposes (Tate, 2003). The widely used passage-based reading comprehension items, however, pose challenges to the determination and interpretation of reading test dimensionality. Previous research showed controversial findings about whether the dependence of items leads to multidimensionality (e.g., Schedl et al., 1996; Zwick, 1987). Many approaches have been used to assess dimensionality. The most common parametric approaches include the eigenvalue rule (Kaiser, 1960), the scree plot (Cattell, 1966), Velicer's minimum average partial (MAP) test (Velicer, 1976), parallel analysis (Horn, 1965), and the difference chi-square (Schilling & Bock, 2005). Besides these, several nonparametric approaches are also used in the literature, including HCA/CCPROX (Roussos, 1995), DETECT (Zhang & Stout, 1999), and DIMTEST (Nandakumar & Stout, 1993). However, these different approaches do not usually lead to the same conclusion. Purpose: Few studies have provided comprehensive comparisons and discussions of these commonly used approaches, so this study attempted to conduct an empirical comparison of these methods in real language test contexts. The research questions we aimed to answer include: 1) For each form of the reading test, do different dimensionality assessment approaches produce consistent results about the test's dimensionality? 2) How do the items cluster, how should the results be interpreted, and is there a consistent pattern between the two forms? Method: Two forms of the reading test in the Canadian English Language Proficiency Index Program (CELPIP-G) Test, each consisting of 4 passages with 38 passage-based items, were used as examples in this study. The CELPIP-G reading test was designed to assess test-takers' ability to engage with, understand, interpret, and make use of written English texts. As in many other testing programs, multiple equivalent forms were constructed under the same test blueprint in CELPIP-G. This study randomly selected two forms of the reading test to be examined, which allowed us to compare the results from different forms. Eight commonly used approaches to assessing dimensionality were compared: 1) the eigenvalue rule, 2) the scree plot, 3) the MAP test, 4) parallel analysis, 5) the difference chi-square, 6) HCA/CCPROX, 7) DETECT, and 8) DIMTEST. Results: As expected, these different dimensionality investigation methods do not produce exactly identical solutions for the same test. The eigenvalue rule and scree plot methods tend to suggest a larger number of underlying dimensions. The results were similar across the two forms. Item clusters were found, but the pattern of how the items clustered together was not clear based on the two forms analyzed in this study. Essential unidimensionality was observed on both forms, which supported the current unidimensional item response theory (IRT) based scoring procedure employed by CELPIP-G. Although a dominant factor was found underlying the reading test, several minor dimensions were detected as well. Differential item functioning (DIF) analyses did not flag any item, which suggested that these minor dimensions may not be caused by DIF items. In other words, these minor dimensions can also be meaningful. More studies are needed to understand these important but usually ignored minor factors in reading assessment.
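Of the eight methods compared, Horn's (1965) parallel analysis is the most straightforward to sketch: retain factors whose observed eigenvalues exceed the average eigenvalues obtained from random data of the same dimensions. The following standard-library-only sketch (not the author's code; in practice one would use an established R or Python routine) illustrates the idea for a small item set:

```python
import math
import random

def correlation_matrix(data):
    # data: list of rows (examinees); columns = items.
    # Assumes no constant columns (sd > 0 for every item).
    n, p = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
           for j in range(p)]
    r = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            cov = sum((row[i] - means[i]) * (row[j] - means[j])
                      for row in data) / n
            r[i][j] = cov / (sds[i] * sds[j])
    return r

def eigenvalues(a, rotations=200):
    # Jacobi rotations for a small symmetric matrix, zeroing the
    # largest off-diagonal element each pass; eigenvalues, descending.
    a = [row[:] for row in a]
    p = len(a)
    for _ in range(rotations):
        i, j = max(((i, j) for i in range(p) for j in range(i + 1, p)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[i][j]) < 1e-12:
            break
        theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(p):  # rotate rows i and j
            aik, ajk = a[i][k], a[j][k]
            a[i][k], a[j][k] = c * aik + s * ajk, -s * aik + c * ajk
        for k in range(p):  # rotate columns i and j
            aki, akj = a[k][i], a[k][j]
            a[k][i], a[k][j] = c * aki + s * akj, -s * aki + c * akj
    return sorted((a[k][k] for k in range(p)), reverse=True)

def parallel_analysis(data, reps=50, seed=1):
    # Horn (1965): count leading observed eigenvalues that exceed the
    # mean eigenvalues of same-sized random-normal data.
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    observed = eigenvalues(correlation_matrix(data))
    sums = [0.0] * p
    for _ in range(reps):
        fake = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        for k, ev in enumerate(eigenvalues(correlation_matrix(fake))):
            sums[k] += ev
    n_factors = 0
    for obs, ref in zip(observed, (s / reps for s in sums)):
        if obs <= ref:
            break
        n_factors += 1
    return n_factors
```

On strongly unidimensional data, of the kind reported for the CELPIP-G reading forms, this tends to retain a single factor, whereas the raw eigenvalue (Kaiser) rule often keeps additional minor dimensions, consistent with the pattern the study describes.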
24. A healthy context: Navigating language access policies and regulations to test the language abilities of Canadian hospital employees

Lauren Kennedy, Second Language Testing, Inc., lkennedy@2lti.com
Kevin Joldersma, Second Language Testing, Inc., kjoldersma@2lti.com
Marie-Soleil Charest, The Ottawa Hospital, mcharest@toh.on.ca

organizations, historical internal practices, and changing community profiles, Canadian hospital administrators have a difficult task navigating language access and language assessment policies within hospitals. The first part of this presentation will describe the federal and provincial language access policy and language assessment regulatory environment that hospitals must navigate in order to remain in compliance. It will also review the internal language policy and assessment policy decisions one major Canadian hospital has made to best meet the needs of its community. The planned, unplanned, and hoped-for effects of requiring and promoting staff bilingualism within the hospital on patients, staff, administrators, and the community will also be presented. In the second part of the presentation, the combined policies and regulations' washback on the language assessment of hospital staff will be examined. The authors will describe test design decisions that were made to address particular policies, regulations, or assessment contexts. Challenges encountered during the operationalization of the Linguistic Assessments will be presented to exemplify conflicts and their resolution. The authors will also show how the Linguistic Assessments are linked to the job tasks of staff within a hospital, and how this linking has facilitated the acceptance of the assessments within the hospital community. The creation of language assessments for bilingual or multilingual populations imposes unique challenges on test designers and stakeholders to ensure balanced collaboration, access, and representation by all relevant administration, regulatory, and language groups at key points during creation, analysis, and administration phases. This paper will conclude with recommendations for policy makers, test

Yes/No Vocabulary Testing for Assessing Class Achievement

Yes/no vocabulary testing is typically considered or used for obtaining rough estimates of language learners' vocabulary size, but it has not been much utilised in language classrooms for other purposes. This presentation reports on an effort to utilise yes/no vocabulary testing in university EFL classes as an instrument for (1) providing the students with opportunities for self-monitoring of their learning, (2) facilitating their vocabulary reviews and their target word prioritisation processes, and (3) obtaining an index of their achievement. In an EFL course involving a reading component, a list of words appearing in the course materials was presented to the students for vocabulary reviews and preparation for subsequent yes/no testing. The vocabulary items on the list were sorted according to frequency for prioritisation purposes. Students were made aware that the yes/no testing would involve many more words
as question items than other types of vocabulary testing, and that they would not only be asked to self-report their knowledge but also to complete a post-yes/no knowledge test on a subset of words to adjust their scores. The above was implemented in four groups differing in academic background, level of motivation, and proficiency. Quantitative and qualitative data were collected to evaluate the performance of the tests and to explore the students' perspectives on the approach. The data indicate fairly high correlations between the yes/no test results and student performance on independent measures of achievement, and a positive attitude held by a large proportion of the students involved.

26. A validation study on the newly developed Vietnam standardized English proficiency test

Phuong Tran, Vietnam National University, p.tran76@gmail.com
Hoa Nguyen, Vietnam National University, hoadoe@yahoo.com
Trang Dang, Vietnam National University, trangdangthu411@gmail.com
Minh Nguyen, Vietnam National University, hmynh102@gmail.com
Lan Nguyen, Vietnam National University, lanthuy.nguyen@gmail.com
Tuan Huynh, Vietnam National University, thonhontuan@yahoo.com
Ha Do, Vietnam National University, habong54@yahoo.com
Huu Nguyen, Vietnam National University, maihuu@yahoo.com
Fred Davidson, University of Illinois, fgd@illinois.edu

This work-in-progress will present the results of the development and validation of the new Vietnamese Standardized English Proficiency Test (VSTEP), an important intended outcome of the National Foreign Language 2020 Project, which aims to enhance English foreign language proficiency among Vietnamese people in preparation for the country's economic growth and regional integration. The VSTEP is a four-skill test which aims to measure students' English language proficiency from the B1 to C1 levels, the expected English levels for post-secondary English learners according to the CEFR-VN (a version of the CEFR adapted to the Vietnamese context). The issues presented include the psychometric properties of the four VSTEP sub-tests, the linkage between the items and the CEFR-VN, and the linkage of the items with the original CEFR. Implications will be presented regarding the capability of the VSTEP as a measurement tool and its potential impact on the teaching and learning of English in Vietnam once the test is officially used with hundreds of thousands of test-takers annually.

27. Towards building a validity argument for grammaticality judgment tasks

Payman Vafaee, University of Maryland, payman.vafaee@gmail.com
Illina Stojanovska, University of Maryland, ilina.stojanovska@gmail.com
Yuichi Suzuki, University of Maryland, yuicchi0819@gmail.com

The current study, a validation study of grammaticality judgment tests (GJTs), draws on advances in L2 assessment validation research to address one of the challenging issues of SLA research: measuring implicit vs. explicit L2 knowledge. The results of the present work may be used as backing for several inferences in an interpretive argument for interpreting GJT scores. GJTs are among the most commonly used assessment tools in the SLA field, and for this reason they have been subject to many validity investigations. Several
studies have concluded that manipulating GJTs' time condition may turn them into measures of either explicit or implicit knowledge, i.e., that untimed GJTs are measures of explicit knowledge and timed GJTs are measures of implicit knowledge (Bowles, 2011; R. Ellis, 2005; R. Ellis & Loewen, 2007; Loewen, 2009). However, all of these studies relied on traditional approaches to validation; rather than building an argument for and against the interpretations that can be drawn from scores on different types of GJTs, they provided only criterion-related evidence for their claims. In many instances, the validity of the criterion against which GJTs were compared was itself in question. In addition, many of the previous studies suffered from methodological and statistical flaws such as small sample sizes or a failure to test rival models when confirmatory factor analysis (CFA) was used. Contrary to the previous GJT validation studies, we hypothesize that applying time pressure to GJTs does not turn them into measures of implicit knowledge, and that timed GJTs measure automatized explicit knowledge at best. The current work improves on the previous studies in the following ways. A) To show that GJTs are too coarse to measure implicit knowledge, we operationalized the use of implicit knowledge as occurring when an error in the stimulus is registered without awareness, or with minimal involvement of explicit knowledge. We attempted to measure implicit knowledge with two psycholinguistic methods, the Self-Paced Reading Task (SPR) and the Word Monitoring Task (WMT), and we also included a measure of meta-linguistic knowledge as a measure of explicit knowledge. By comparing participants' performance on different types of GJTs with these more robust measures of implicit and explicit knowledge, we will establish better criteria for validity arguments. B) We also aim to provide evidence for our hypotheses by including two measures of cognitive aptitude for implicit (Serial Reaction Time) and explicit (LLAMA F) learning. Previous studies (e.g., Granena, 2013b; Suzuki, 2013) showed that such cognitive measures may provide evidence about the nature of the underlying knowledge participants rely on to perform different assessment tasks. C) We will avoid the previous methodological and statistical flaws: we will have a sufficiently large sample size (N = 100) and will test several rival CFA models. Presenting the current study, which is at the data collection stage, and receiving feedback will help the researchers improve its design, materials, analyses, and conclusions.

28. Development of a reading achievement assessment tool for Anglophone Canadian military personnel

Geneviève Le Bel Beauchesne, Défense nationale du Canada, genevieve.lebelbeauchesne@forces.gc.ca
Sara Trottier, Défense nationale du Canada, sara.trottier@forces.gc.ca

The Canadian Defence Academy, which provides language training to members of the Canadian Armed Forces (CAF) through its own French curriculum, has created a reading comprehension achievement test to assess Anglophone Canadian military members completing their French training. A new test was needed because the content of the previous achievement test had been compromised by some ten years of use and, moreover, was not specifically tied to the CAF French curriculum. The new achievement test is in the tradition of multiple-choice question (MCQ) tests, delivered in paper-and-pencil format. The items consist of texts followed by questions or statements to complete, each with four answer options (the key and three distractors). The construct definition is based on a literature review on the assessment of L2 reading comprehension under multiple-choice testing conditions. To ensure content validity, the curriculum texts were examined in detail and classified by type and content (vocabulary, word count, number of paragraphs, propositional density, etc.).
Computer-assisted text analysis techniques supported this examination, and the analysis made it possible to specify the characteristics of the items to be created. The items were produced by a team of professionals from the fields of education and linguistics, were drawn from authentic material (military newspaper articles, professional reports, narratives, etc.), and were revised by two committees, one pedagogical and one military. The items were validated using a methodology based on classical test theory. Nearly 80 volunteers from the target population participated in the item validation study. The facility index, the item-total correlation, and response distribution patterns were used to retain the best-performing items. In addition, data on the perceived difficulty of the texts and questions were systematically collected for each item, which revealed a correlation between perceived and actual item difficulty, thus ensuring excellent face validity. A second validation study was conducted with 73 volunteers from the target population to measure the internal consistency of the test, using Cronbach's alpha (0.78), and to set a pass mark. The cut score was determined by the contrasting-groups method: the Public Service Commission reading comprehension test, which all participants also took, was used as the criterion measure, making it possible to maximize the sensitivity and specificity of the tool. This test, developed over a period of thirty months, adequately measures the reading comprehension achievement of Anglophone CAF members completing their L2 training, and thus contributes indirectly to ensuring bilingualism within the CAF.
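The item statistics named in the abstract above — the facility index, the item-total correlation, and Cronbach's alpha — can all be computed directly from a scored response matrix. The sketch below is a generic illustration in plain Python, not the authors' actual analysis code; the function name and the 0/1 scoring convention are assumptions.

```python
from statistics import mean, variance

def item_analysis(responses):
    """Classical test theory item statistics for a dichotomously scored
    test. `responses` is a list of rows (one per test taker), each row a
    list of 0/1 item scores."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]

    # Facility index: proportion of test takers answering each item correctly.
    facility = [mean(row[j] for row in responses) for j in range(n_items)]

    def pearson(xs, ys):
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        return cov / (vx * vy) ** 0.5

    # Corrected item-total correlation: each item against the total score
    # with that item removed.
    item_total = [
        pearson([row[j] for row in responses],
                [t - row[j] for row, t in zip(responses, totals)])
        for j in range(n_items)
    ]

    # Cronbach's alpha: internal consistency of the whole test
    # (sample variances, as in the usual formula).
    sum_item_var = sum(variance([row[j] for row in responses])
                       for j in range(n_items))
    alpha = n_items / (n_items - 1) * (1 - sum_item_var / variance(totals))
    return facility, item_total, alpha
```

Run over a matrix with one row per volunteer, these three statistics reproduce the item-retention screen and the reliability figure described in the abstract; low facility or low item-total correlation flags items to drop.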
Work-in-progress abstracts
1. The (re)construction of 4th-year pre-service teachers' practices about formative assessment and its impact on their learning-to-teach process

2. New directions: Investigating new tasks for integrated skills assessment

Lia Plakans, The University of Iowa, lia-plakans@uiowa.edu
[…] about language learners' abilities, and whether they can be supported by validity evidence for use in assessing English for academic purposes. If such tasks can be used effectively, language programs and testing companies may consider adopting them for assessment. Furthermore, such tasks can be used as a research instrument to study an area important to second language literacy: the connection between reading and writing abilities. Currently, a literature review is underway; after the review phase, specifications for two assessment types will be designed: (1) a writing-to-read assessment and (2) an iterative writing-reading-writing assessment. Once the tasks have been outlined, their development will entail selecting topics and texts, developing prompts, building a rating scale, and piloting the tasks and scale. The results of the piloting will be analyzed, with revisions made to the tasks and rating scale. Once a set of integrated tasks/assessments has been developed, the next stage is to investigate issues related to validity. The study will use both qualitative (think-alouds and interviews) and quantitative (performances and scores) data collection and analysis to explore questions related to both the products and the processes of these tasks. The following questions will be investigated: 1. Are scores on writing-to-read, reading-to-write, and iterative reading-writing tasks related? 2. How are constructs of L2 reading and writing elicited in the new tasks? 3. Is there evidence of discourse synthesis processes in completing the new tasks? 4. What are the shared and divergent processes in test takers' composing on the new tasks?

3. Is true comparability possible?

India Plough, Michigan State University, ploughi@msu.edu

The current state of foreign language education in the United States has been characterized as a "language crisis" (Berman, 2011). More and more colleges and universities are eliminating or lowering their foreign language (FL) requirements for admission and/or graduation. Additionally, as entire FL programs are dismantled, a degree in certain foreign languages is no longer an option for students on some American campuses. In contrast to this trend, a relatively new residential college within a large land-grant university in the United States remains committed to making the study and use of a world language an essential part of its curriculum, as evidenced in its degree requirement of proficiency in a world language. In 2012, the college conducted a comprehensive review of its language proficiency program with an eye toward revisions built on the same methodological principles, thus contributing to the inherent interdependence of the processes of teaching, learning, and assessment. Two significant changes came out of that review and are currently underway: the adoption of a Culture and Languages across the Curriculum (CLAC) pedagogical approach and the development of a local test of language proficiency. The performance-based test of speaking proficiency employs a paired format and is intended to assess intermediate speaking proficiency in both the more commonly taught and the less commonly taught languages. This work in progress focuses on the development of the proficiency scale. Using an intuitive method (Fulcher, 2003), linguistic and interactional features of a preliminary scoring rubric have been assembled from level descriptors used in internationally accepted proficiency scales (e.g., ACTFL OPI, CEFR, CLB, ILR, TOEFL iBT). During test trials, examiners provide feedback on these features and also assist with revising level descriptors; in this way, empirical development of the scale is in progress. Test trials in English, French, German, and Spanish have been completed; test trials in Turkish and Korean begin in late Fall 2014. Preliminary analysis of transcripts of completed test trials, together with feedback from native-speaker examiners and participants, highlights the difficulty of using the same scale for multiple languages, some of which are quite diverse. The intricate and dynamic array of interactional and linguistic features that are important to effective communication in one language, culture, and context are not necessarily equally important, if at all, in another language, culture, and context. Specifically, initial analysis
of data in the current study brings into question the place traditionally held by syntactic complexity [as measured by Analysis of Speech Units (Foster, Tonkyn, & Wigglesworth, 2000)] in construct definition and related scale descriptors. Issues to be explored with LTRC colleagues include: How is test comparability maintained when adapting a test for multiple languages? When empirical evidence dictates variations in level descriptors across languages, at what point has the construct that the test is intended to assess been changed?

4. Professional language competency, the CELBAN and professional licensure

Stefanie Bojarski, Queen's University, stefanie.bojarski@queensu.ca

The Canadian Nursing Association has raised concerns that as large numbers of Canadian nurses approach retirement age, a "critical shortage of nursing professionals" could result in Canada (Epp & Lewis, 2009, p. 286). The College of Nurses of Ontario (CNO) requires Internationally Educated Nurses (IENs) who are seeking membership to show evidence of language competency by successfully completing either the CELBAN or IELTS exams (CNO, 2013). Researchers of testing validity recommend an approach to "obtain[ing] additional information from test-takers for validating large-scale assessments" (Cheng & DeLuca, 2011, p. 104). This study will therefore implement a series of focus groups and interviews immediately following the completion of the CELBAN in Ontario in order to analyse the assessment experience, thus contributing to a wider understanding of construct validation from the perspective of test-takers. Given that the CELBAN is a relatively new testing format with widespread use for high-stakes decisions (a component of certification and licensure), further research validating IEN test-taker responses to construct representation is critical to our understanding of the role of competency testing in the certification and licensure of IENs. A semi-structured interview guide adapted from DeLuca, Cheng, Fox, Doe and Li (2013) will be used to direct data collection in both focus group and interview settings. Focus groups will be conducted immediately following the completion of the reading, writing, and listening sections of the CELBAN with IENs who are required to complete a language assessment (IELTS or CELBAN) as a component of their certification with the CNO. Individual interviews will be conducted with IEN test-takers immediately following their individual speaking assessments. Focus groups and interviews will be audio recorded and transcribed verbatim. Transcripts will be coded abductively via the theoretical framework derived from the Cheng and DeLuca (2011) study of test-taker responses to English language proficiency exams as well as the 2013 study of the TOEFL iBT (DeLuca et al., 2013). This research hopes to add to the critical understanding of what is tested by the CELBAN and the role of large-scale language testing in the certification and licensure experience of IENs.

5. Using audio-only and video tasks for a computer-delivered listening test

May Tan, Canadian Defence Academy, Department of National Defence, may.tan@forces.gc.ca
Nancy Powers, Canadian Defence Academy, Department of National Defence, nancy.powers@forces.gc.ca
Ruth Lucas, Canadian Defence Academy, Department of National Defence, ruth.lucas@forces.gc.ca
Roderick Broeker, Canadian Defence Academy, Department of National Defence, rod.broeker@forces.gc.ca

The English Language Test Section of the Canadian Defence Academy is in the early stages of developing a new listening proficiency test for NATO military personnel, which will be computer delivered. It will be a multilevel test developed to meet the specification of the NATO
Standardization Agreement 6001 language proficiency descriptors. This test will replace the older version of the listening test, in which candidates listen to recorded texts and mark their responses on an answer sheet. In line with current ideas about the importance of authenticity and interactiveness in defining construct validity (Bachman & Palmer, 1996; Ockey, 2007), the new listening test incorporates audio and video listening tasks that are representative of the target language use domain (Messick, 1996; Wagner, 2014). As part of the development process, candidates will be interviewed using a verbal feedback protocol to see how they perceive the older version of the test versus the new computer-delivered trial version. Their reactions to audio-only versus video listening tasks within the new test will also be recorded. A Benchmark Advisory Test (BAT) for listening proficiency is made available to NATO members by the Bureau for International Language Coordination and is the recommended test against which other listening tests should be validated. However, the BAT uses audio only in its test tasks and constrains the number of times candidates may listen to a text: the higher the level of difficulty, the fewer the replays allowed. At this point, the test development team is grappling with a few issues and is interested in initiating discussions around the following concerns: How can the newer version of the test be validated against a benchmark that uses audio-only tasks? Would a rigorous item development process and expert opinion (via a modified Angoff method) in determining the suitability and difficulty level of items suffice? In contrast to traditional listening tests, where students have very little control over the input, time, and item sequence (no pausing, no replay, pre-set time for each item, items accessed in linear order, etc.), candidates may be allowed to control various aspects of input and time in the new test. For each of these aspects, how much is enough for optimal candidate performance? How do these changes impact item and student performance? What are the consequences for construct validity?

6. Validity and validation in the Singapore context

Yun-Yee Cheong, The Institute of Education, University of London, ycheong@ioe.ac.uk

The most important consideration in assessment is validity and validation, the nature of which is complex and heavily contested. This thesis attempts to deconstruct the evaluation process into a series of more manageable questions grouped into three categories, namely the micro, meso, and macro levels. Designed (with guidelines from Cambridge Assessment), set, scored, calibrated, and offered twice a year (June and November) in Singapore, the Singapore–Cambridge GCE O-Level Chinese Language Examination (GCE 1162) has the largest number of candidates among all mother tongue language papers. At the micro level, this ex post facto study intends to determine the degree of validity in using GCE 1162 Paper 2 (Reading) for the purpose of demonstrating Chinese language reading achievement and proficiency. Newton and Shaw's neo-Messickian framework (2014) will be employed to highlight critical areas for analysis, with measurement objectives being the main focal point of my research. The research design is then framed by the investigative steps proposed by Weir (2005). Special attention will be given to the cognitive and contextual aspects of validity evidence. At the meso level, the discussion is taken forward by examining the evidence amassed within the broader context of Singapore. Since Messick (1989), the impact and washback effect of test use on institutions and society have been a significant component of test validation. There is, however, little discussion about how society can influence and shape the process of validation itself. The relationship is bi-directional rather than linear. The specific social, political, cultural, and educational environment in which validation occurs will inevitably determine its feasibility and meaningfulness. The meso level of analysis will therefore consider how these contextual factors can guide, steer, or hinder validation practice, thereby strengthening (or weakening) our claims to validity. The micro and
meso levels are subsequently superimposed onto a theoretical plane. A fitting validation process will depend equally on an in-depth understanding of the prevalent validation frameworks today and the theories that underlie them. Most of the influential validation frameworks were conceptualised in Europe and the USA, calling their generalizability and transferability into question. The potential utility of different frameworks in the Singapore context will therefore demand investigation. The macro perspective will not be a major focus of the research process, although useful insights are likely to be gained from the attempt to apply a range of different frameworks. Discourse around assessment in Singapore has been dominated by issues of bias and fairness, assessment load, and the balance of assessment forms. Although various assumptions regarding validity are embedded in these matters, informed discussion about validity and concerted effort devoted to validation have assumed too low a profile. The first of its kind in Singapore, this study hopes to encourage genuine and open dialogue among persons involved with or invested in the examination process, including policy owners, specialists, administrators, students, teachers, parents, and many others. It will also be of interest and relevance to researchers with an academic interest in educational testing, especially those examining the application of Western validation frameworks in global settings.

7. Translation tests in China: Is there any opportunity for assessment for learning?

Manman Gao, Anhui University, gaomm03@gmail.com
Yuanyuan Mu, University of Science and Technology of China, yuanyuan_mu@hotmail.com

Taking CATTI (the China Accreditation Test for Translators and Interpreters) as an example, the number of test takers increased almost tenfold from 2003 to 2009. In 2008, CATTI started to play a substantial role in the certification of the MTI (Master of Translation and Interpretation) program: in a number of universities, MTI postgraduates can only graduate if they take the test and reach a certain level. By doing so, the government attempts to apply the test criteria to translation instruction. This may present an opportunity to promote students' translation skills through assessment. Nevertheless, few empirical studies have been done on the washback effect of translation tests in China, not to mention tapping their potential for improving learning. In the pilot study, interviews with 20 test takers were conducted in 2014 to investigate their test motivation, perception, and preparation. Regarding test motivation, students admit that they are eager to obtain a certificate for future job-hunting, but many, especially those strongly interested in translation, want to use the test to examine their English or translation proficiency and, more importantly, to improve their skills. Most of the interviewees took the test in their junior year, when the curriculum burden was lighter, and signed up because they believed the preparation process could enhance their language and translation ability. Before the test, most of them spent at least one month in intensive preparation, which in the students' words remarkably improved their English and translation ability. For English majors, the intensive practice also pushed them to acquire an in-depth understanding of the translation strategies taught in class. However, it is worth noting that students paid no attention to the translation criteria, as they had no access to them. Since students tend to regard preparation for translation tests as a learning opportunity, the next step aims to incorporate AfL into translation
teaching: each week, students will be asked to translate passages from the translation tests and will be given feedback; they will be guided to fully understand the rating criteria through self- and peer assessment, and to appreciate the reference answer provided with the test. The study will last for one semester, and the students' eight pieces of translation work will be collected and analyzed to examine the developmental trend of their translation. Each month, two teachers and four groups of students will be invited for interviews. The study attempts to explore effective ways to integrate assessment with learning in translation instruction, as well as possible adaptations of translation tests.

8. The washback effects of the high-stakes writing assessment in Taiwan

Yi-Ching Pan, National Pingtung University, yichingpan@yahoo.com.tw

Over the course of the past two decades, the quality of the English components of the four-year technical and vocational education joint college exams in Taiwan has improved to a remarkable degree, shifting emphasis from discrete linguistic knowledge to real-life language application. Nevertheless, these English tests have been criticized because they tend to focus on the evaluation of reading skills and utilize multiple-choice questions, and are thus unable to assess the communicative competence of students. In view of this criticism, beginning in 2015 there will be an essential change to the administration of the joint vocational college English exam, with the addition of a non-multiple-choice writing assessment. On this English test, controlled writing tasks (e.g. vocabulary assessment tasks, ordering tasks, translation tasks, and short-answer and sentence completion tasks) will account for one-fourth of the total score. It is therefore the goal of this study to explore the washback effects that this major change to a high-stakes test will have on vocational high school English teaching and learning. This study is being undertaken for the following reasons: 1) the lack of empirical washback studies on high-stakes "writing" tests in the field of language assessment, 2) the minimal extant washback research on the interaction of tests, teaching, and learning, and 3) insufficient knowledge of whether washback varies with the timing of the investigation. To bridge this research gap, this study intends to investigate particular aspects of washback effects (Shohamy et al., 1996; Watanabe, 1997; Messick, 1989; Cheng, 2005; Green, 2007), including: 1) general and specific washback; 2) positive and negative washback; 3) strong and weak washback; 4) intended and unintended washback; and 5) long-influence and short-influence washback. One example is whether or not the aforementioned writing assessment influences the practices of both teachers and students in the course and therefore improves students' communicative competence in real-life writing. Another example is the extent to which the test format, test tasks, the weight of the test, and its purpose influence writing instruction and learning. In addition, whether or not students' proficiency levels, their years of study, and their teachers' professional development play a role in generating the five aspects of washback mentioned above, and what these aspects look like, are all areas worthy of further exploration. Instruments utilized in this study include questionnaires, interviews, and classroom observations. Sixty English teachers and 600 business students of different years at three vocational high schools in Taiwan (whose required entrance scores are ranked top, medium, and bottom) are being recruited. The findings drawn from this study will generate a better understanding of the complexity of the effects of washback on teaching and learning. Furthermore, this research will provide teachers with valuable information for preparing students for tests and for facilitating English instruction and learning objectives so as to elicit positive and eliminate negative effects. Finally, test developers can utilize evidence drawn from this study as one key component of test validation.
9. Improving Cree students' reading comprehension through formative assessment practices and explicit teaching

Maria-Lourdes Lira Gonzales, Université du Québec en Abitibi-Témiscamingue, maria-lourdes.lira-gonzales@uqat.ca

The Cree School Board [CSB] has jurisdiction for elementary, secondary and adult education for members of the Cree Nation of Eastern James Bay and other people resident in one of its nine communities. Numerous reports have documented serious shortcomings in almost every aspect of service delivery of instruction in Cree (level of understanding of bilingual models of education, language development and acquisition in general, effective literacy approaches, and curriculum development). These shortcomings have an impact on the performance of CSB students, who have remained well below the Canadian norm in all academic areas. According to the CSB annual report (2013-2014), 15% of Grade 6 students achieved a Stanine 4 on the Canadian Achievement Tests (CAT), indicating that they are meeting end-of-grade expectations. This is a decrease of 3% from the 18% who were achieving at the expected level in 2012-2013. The target of 35% has not been met for 2013-2014. Therefore, there is still work to do in order to close the gap between Cree student achievement and that of other students in Canada. The goal of the present study is to improve Cree students' reading comprehension – and therefore CAT results – through formative assessment practices (Black & Wiliam, 2009; Hattie & Timperley, 2007; Isaacs, 2001; Rodet, 2000) and explicit teaching (Gauthier et al., 2013; Tardif, 1997). Three research questions are posed: 1. How can teachers be led to develop reading strategies in English as a second language (ESL)? 2. What types of feedback do teachers provide to enhance learning during reading strategies lessons? 3. What are the self-competency perceptions before and after the teacher training sessions? This collaborative action research (Desgagné et al., 2001; Desgagné, 1997; Maxwell, 2005; Creswell, 2014) will take place in two Grade 6 classes. The participants will be two Grade 6 ESL teachers. Data will be collected through class observation (Laperrière, 1997), video and audio recordings (Flick, 2002), as well as semi-structured interviews (Vermersch, 2006). The study comprises three phases, each with the same structure: (1) participating in a training session, (2) applying in class what has been learnt during the training session, and (3) receiving feedback on their class performance.

10. An investigation into alignment of Lexile scores and analytical rubrics

Roxanne Wong, City University of Hong Kong/University of Jyväskylä, elroxann@gapps.cityi.edu.hk
Carl Swartz, MetaMetrics, cswartz@lexile.com

The Diagnostic English Language Tracking Assessment (DELTA) is a low-stakes multicomponential online diagnostic assessment designed specifically for undergraduate students in the Hong Kong educational context. Students are given a detailed report on their overall proficiency and performance in reading, listening, grammar, and vocabulary. The DELTA currently does not have a writing component. One of the key features of the planned DELTA writing component is that it will use an Automated Essay Scoring system. This is essential, as the scoring of writing papers is labor intensive as well as expensive. Between 2012 and 2014 a writing component was designed and trialed. Approximately 1500 university-level students in Hong Kong participated in the writing pilot. The pilot test used MetaMetrics' EdSphere Lexile Framework for writing, an online platform where students can immediately receive a Lexile score on their work. The Lexile program was chosen because, like the DELTA, it has a built-in tracking function that can record and report on students' progress. While the students are given a Lexile score upon completion of their writing, it is
not enough for diagnostic assessment. According to Alderson (2005), "A crucial component of any diagnostic test must be the feedback that is offered to users on their performance. Merely presenting users with a test score is quite inappropriate on diagnostic tests." In a joint project with MetaMetrics and one school district in the United States, a similar test was conducted with 539 students in upper secondary school. These tests were then scored simultaneously by members of the DELTA team, staff from the MetaMetrics corporation, and teachers from the school district. The scoring instrument was specifically designed for the DELTA and has the same general domains as the EdSphere self-assessment for writing. This study investigates the use of the DELTA rubric for scoring, and the initial phases of feedback design using a combination of the Lexile system and the scoring rubric. The study aims to look at the reliability of the scoring instrument as used by a variety of stakeholders in a cross-national context. The instrument designed for the DELTA is an empirically based analytical scale with six domains: Task fulfillment, Academic register and stance, Organization, Semantic complexity, Grammatical accuracy, and Vocabulary. It was initially designed for formative feedback purposes. Weigle (2002) indicates that analytic scales are more appropriate for helping L2 learners develop their writing ability, as does Knoch (2011). The main purpose of this study is to investigate the ability to provide students with diagnostic feedback using analytical scales. The Lexile score provided to students has no easily discernible function for most students; the rubric is therefore being trialed to determine its usefulness for diagnostic feedback purposes. The steps taken to ensure reliability, and the trialing of the rubric in this context, will be discussed.

11. Improving language teaching and learning through language testing: A washback study on the Hanyu Shuiping Kaoshi (HSK)

Shujiao Wang, McGill University, shujiao.wang@mail.mcgill.ca

The rapid growth of China in the global economy has increased the need for the Chinese language. As a result, the number of learners studying Chinese as a second/foreign language (CSL) has reached 40 million worldwide. Hanyu Shuiping Kaoshi (HSK), literally "Chinese Proficiency Test," also known as the "Chinese TOEFL," is the central national standardized test of Chinese language proficiency for non-native speakers, and it plays a vital role in certifying language proficiency for higher education and professional purposes. Despite the significance and the claimed high validity and reliability of this test (Chen, 2009; Luo et al., 2011), very few empirical studies related to its validity have been conducted, in particular with regard to the washback effect (Huang, 2013; Huang & Li, 2009). The research questions of the ongoing study are: 1) What are the perceptions of CSL teachers and learners towards the HSK content, use and impact? 2) What are the relationships between learners' test-taking expectations, test preparation practices and test outcomes? 3) What is the nature and intent of washback from the HSK? and 4) Within the educational and societal context, how do HSK stakeholders and score users interpret and react to washback? These questions are addressed through a mixed-methods sequential explanatory design to provide
a holistic account of the phenomena under
investigation (Creswell, 2009; Creswell & Plano
Clark, 2011). In Phase 1, participants are 300 HSK
test-takers who register for the HSK test and
30 CSL teachers from 5 countries. The survey
questions include the learning/teaching strategies
and the perceptions towards the test design, test
preparation, test expectation and test use. In
Phase 2, after taking the HSK test, the test-takers
are asked to report their test scores by category.
Quantitative analysis in these two phases includes descriptive statistics and item-level exploratory factor analysis, confirmatory factor analysis, and structural equation modeling. In Phase 3, individual semi-structured interviews will be conducted with 10 HSK stakeholders from various education (e.g., student, teacher, institution), government, and business contexts on their test score interpretations and uses. In the end, both types of data sources will be cross-examined and synthesized. This research is the first washback study on the HSK from the perspective of multiple stakeholders at both the micro (classroom) and macro (society) levels. It obtains multiple perspectives on the washback effects of high-stakes tests, and not only provides findings applicable to pedagogical and methodological issues of CSL teaching and learning, but also contributes a more comprehensive model to enrich the existing knowledge in the washback literature. Last but not least, this study has significant social implications for the reform of the HSK and for score users.

12. Lexical performance and rater judgements of English language proficiency: Implications for teaching, learning, and assessment

Scott Roy Douglas, University of British Columbia, Okanagan Campus, scott.douglas@ubc.ca

This report outlines an ongoing project exploring lexical validity and standardized English language testing, in which lexical evidence extracted from test-takers' responses to speaking and writing prompts on the CELPIP-General Test is gathered to contribute to an understanding of what is represented by that test's scores. Lexical validity is conceptualized as the uncovering of evidence supporting the recognition that vocabulary elicited by a standardized test's prompts unfolds as expected against that test's rating scale. CELPIP-General Test test-takers are scored for each component of the test on a scale from 0 to 12. The CELPIP levels correspond to assigned Canadian Language Benchmark (CLB) level equivalencies. The current research project is a timely response to the ongoing need to contribute to evidence supporting the overall validity of high-stakes standardized English language proficiency tests that are used to provide evidence of the English language abilities needed for high-stakes immigration and citizenship decisions. It also points the way to incorporating appropriate vocabulary targets for candidates preparing for the CELPIP-General Test. For the current study, a corpus is being built of approximately 200 speaking samples and 500 writing samples from responses to the CELPIP-General Test. Quota sampling is employed to gather, from the test-taker samples available, as even a number of samples as possible from each of the different CELPIP levels of proficiency. To analyse breadth of vocabulary usage, lexical profiles are generated for each individual oral and written text using digital corpus analysis tools (Cobb, 2014). Profiles consist of measures of lexical frequency based on the percentage coverage of text by the General Service List (GSL) (West, 1953), the Academic Word List (AWL) (Coxhead, 2000), text not covered by either of those lists, and the percentage of text covered by the British National Corpus (BNC) (Cobb, 2014). To analyse depth of vocabulary usage, Vocabulary Error Ratios (VER) are calculated based on the detailed coding of the oral and written texts. The VER is calculated for each text by the following formula: (total vocabulary errors / total words) * 100. On compilation of the measures of lexical sophistication in the corpus as a whole, data are correlated using the product moment correlation coefficient (Pearson r), and scatter plots are generated for the relationships under investigation as a preliminary analysis. Next, hierarchical multiple regression analysis is employed to determine the predictive value of the measures of the independent variables of lexical performance and the dependent variable of the determined CELPIP levels of proficiency. This analysis is carried out to determine which of the lexical variables under analysis offer the most predictive value for CELPIP rating scale performance. Once
the relationship between measures of lexical sophistication and rater judgements of test-takers' performance in comparison to the CELPIP levels of performance is explored, the current study turns to the implications for teaching, learning, and assessment. In particular, implications connected to how vocabulary in use unfolds over the CLB descriptors can be postulated, along with the concomitant implications for teaching and learning within the CLB framework.

13. EFL teachers' professional knowledge of assessment

Hossein Farhady, Yeditepe University, Istanbul, hossein_farhady@yahoo.com
Kobra Tavassoli, Karaj Branch, Islamic Azad University, Karaj, kobitavassoli@yahoo.com

Many scholars have recently addressed the importance of teachers' professional knowledge of assessment from different perspectives (Inbar-Lourie, 2013; Malone, 2013; Popham, 2009, 2011; Stiggins, 2005, 2008; Taylor, 2009). At a minimum level, it is often referred to as "assessment literacy", and defined as "the knowledge, skills and abilities required to design, develop, maintain or evaluate large-scale standardized and/or classroom based tests, familiarity with test processes, and awareness of principles and concepts that guide and underpin practice, including ethics and codes of practice" (Fulcher, 2012, p. 125). Despite some good-quality research on the topic (Çakan, 2004; Razavipour, Riazi, & Rashidi, 2011), findings are not conclusive on the extent to which teachers' professional knowledge of assessment exerts a positive effect on students' learning and achievement. This paper is a progress report on a three-phase research project designed to investigate teachers' professional knowledge of assessment and its effect on teacher-produced classroom tests as well as students' achievement. The first phase of the research was an attempt to conduct a needs assessment and investigate teachers' perception of the importance of their professional knowledge of assessment in an EFL context. A modified version of the professional knowledge of assessment needs questionnaire developed by Fulcher (2012), along with some open-ended questions on teachers' needs, was administered to 143 EFL teachers in Iran. Preliminary analysis showed that the majority of EFL teachers considered the major topics in language testing either important or essential to include in language testing courses in teacher education programs. In the second phase, EFL teachers' claims about their professional knowledge of assessment will be checked against their actual performance on a professional knowledge test. Further, the effect of factors such as teachers' age, gender, experience, and academic degree on their professional knowledge of assessment will also be investigated. Qualitative data will be collected by first observing the classes of teachers who participated in the first stage of the research and then interviewing them. The purpose is to investigate the extent to which they apply their assessment knowledge to their classroom assessment, and whether there is a relationship between teachers' professional knowledge of assessment and the quality of their tests and students' achievement. In the third and final phase of the study, the procedures will be extended to non-EFL teachers to compare and contrast their professional knowledge of assessment with that of EFL teachers. The purpose is to inform the education community that all teacher education programs may need to reformulate and refine their treatment of assessment and pay more attention to improving teachers' professional knowledge in all disciplines. Further, comparative studies will be performed in other countries such as Turkey, Malaysia, and the UAE to investigate the levels of professional knowledge of teachers in different countries. The findings are expected to contribute to our understanding of the significance of teachers' professional knowledge in general, and in language education in particular. They are also expected to offer guidelines for reforms in planning in-service training for teachers in all major disciplines.
14. The washback of the ENEM EFL test in Brazil

Flávia J. de Sousa Avelar, University of Campinas, SP; CAPES Foundation, Ministry of Education of Brazil, Brasília – DF 70.040-020, avelarfj@yahoo.com.br
Matilde V. R. Scaramucci, University of Campinas, SP, matilde@scaramucci.com.br

This work-in-progress session will discuss preliminary results of a study investigating the washback of the EFL exam which is part of the ENEM (Exame Nacional do Ensino Médio, or High School National Examination), implemented by the Brazilian Ministry of Education in 1998. Initially proposed as a diagnostic test to be taken by students leaving high school to assess the quality of education at this level, the exam had its function and status changed in 2009, when it became a mandatory high-stakes entrance examination for high school students who intend to enter federal universities. The change was justified as a means to improve the quality of education and, at the same time, to make access to public university easier. Entrance exams (or "vestibulares") have a long tradition in the Brazilian educational system. The ENEM comprises 180 multiple-choice questions divided into four areas: Natural Sciences (Biology, Chemistry and Physics), Human Sciences (History, Geography and Philosophy), Mathematics, and Languages (a written essay in Portuguese and either an English or Spanish reading test). Although English instruction starts at grade 5 of primary school, an English test was only included in the exam in 2010. Despite the relevance of the new exam and the intended purpose of using it to influence high school teaching and learning, few studies have been carried out to gather evidence of its impact. Therefore, in an attempt to contribute to EFL teaching policies in Brazilian public schools, this ongoing study aims at investigating the washback of this EFL reading test on the perceptions and practices of two teachers from different high schools, as far as their approach, methodology, and teaching sequence are concerned. It also provides information for further discussions of the social role and relevance of assessment in educational contexts. In each of the schools, two groups were observed throughout 2013. Aiming at data triangulation, information was gathered through interviews with the two teachers and school coordinators, classroom observations, and field notes. Questionnaires were answered by students, and analyses of the textbooks and performance tests used by the teachers complement the data. Preliminary results suggest that, apart from the relevance of the ENEM, teachers and students from the private and public schools reacted differently to the exam. In the private school context, the influence of the traditional entrance exam (vestibular) is much more intense than that of the ENEM itself. In the public school context, on the other hand, although the teacher seems to be familiar with the EFL exam, classroom observation shows that test-wiseness was the main focus of the classes, suggesting a negative washback of the exam on her teaching.

15. Understanding essay rating activity as a socially-mediated practice

Yi Mei, Queen's University, yi.mei@queensu.ca

This study aims to understand raters' essay rating activity (RA) as mediated by the sociocultural context of a high-stakes university entrance examination in China, the National Matriculation English Test (NMET). RA has long been considered a cognitive process of individual raters interacting with artifacts (e.g., essays, rating scales; e.g., Bejar, 2012; Crisp, 2010), but this decontextualized approach fails to consider potential interactions between individual raters, sociocultural contexts, and their influences on rating processes. RA is a socially motivated practice with social meanings and consequences, and situating it within its sociocultural contexts can make findings more meaningful (Barkaoui, 2008). A comprehensive understanding of RA has implications for rater training and fair assessment.
I apply cultural-historical activity theory (CHAT; Engeström, 1987, 2001) from sociocultural theory to reconceptualize RA as a three-level hierarchical, socially mediated process comprising collective object-oriented activity (top), individuals' goal-directed actions (middle), and individuals' automatic operations (bottom). The distinction between the top and middle levels is critical: the central RA (top), in its networked relationships to other interconnected activities, is the primary unit of analysis, while individuals' actions (middle) are the subordinate unit of analysis, which are more understandable when interpreted against the background of collective activity (Engeström, 2000). In an examination-driven society such as China's, the consequences of the NMET and the role of high school English teachers as NMET raters make them good candidates for understanding how their NMET RA is socially mediated. Two research questions (RQs) will guide this study: 1) What is the nature of raters' NMET rating experiences? 2) How do NMET raters assess NMET essays to achieve their goals, against the background of their NMET rating experiences? The study will employ a three-phase qualitative case study design (Yin, 2009). In Phase 1, in one provincial NMET centre, I will interview the centre's director, 4 trainers/team leaders, and 8 NMET raters with varying teaching and NMET rating experience, to investigate NMET raters' central NMET RA. In Phase 2, I will interview the NMET raters, their principals, and 16 of their colleagues to investigate NMET rating-related activities at school. NMET rating-related documents will also be reviewed for additional contextual information. In Phase 3, the NMET raters will first conduct concurrent think-aloud protocols (Ericsson & Simon, 1993), followed by consecutive stimulated recalls (Gass & Mackey, 2000), to verbalize their thought processes while assessing 8 NMET sample essays. Lastly, individual interviews will be conducted on the raters' specific goals in assessing NMET essays and their perceptions of the NMET rating task. A constant comparative method (CCM; Corbin & Strauss, 2008) will be used to analyze Phase 1 and 2 data, to describe the central NMET RA and its interconnected activities in the school context (RQ1). Phase 3 data will also be analyzed using the CCM, but additionally informed by Cumming, Kantor, and Powers' (2002) descriptive framework of rater decision-making behaviors, to describe raters' specific goal-directed actions and situate them within collective activity experiences (RQ2). This study hopes to make an initial attempt at determining the potential value of a CHAT-based approach.

16. The development and application of a cognitive diagnostic model for reading assessment

Junli Wei, University of Illinois at Urbana-Champaign, wei5@illinois.edu

Assessing students' achievement and providing constructive feedback is a critical responsibility of teachers. Cognitive Diagnostic Assessment (CDA) provides valuable diagnostic profile information about students' underlying knowledge and the cognitive processing skills required for answering problems, and it has thus become popular in the testing field. However, most current empirical studies on CDA rely on coding and analyzing existing items for cognitive attributes (Gorin, 2006), although the items were not originally intended for CDA use (Buck & Tatsuoka, 1998; Jang, 2005). Few efforts have been made to construct new tests with CDA in mind; hence, few guidelines are available for developing cognitive diagnostic language assessments (Lee & Sawaki, 2009). The present study employs cognitive diagnostic approaches to develop a diagnostic test model for fine-grained assessment and for classification of learners' knowledge states and the presence or absence of certain attributes. The study also develops a validation argument for the constructed CDA, and addresses the diagnostic report and corresponding remedial guidance. The questions addressed are: What attributes count in reading comprehension and how are they related hierarchically? How can we construct and validate the reading CDA model? What are the considerations in presenting the diagnostic reports? How can we guide learners based
on their diagnostic reports? This study adapts the Evidence-Centered Design paradigm (Mislevy & Riconscente, 2006) to describe a framework for the entire implementation process for diagnostic assessment. Participants are experienced test developers, content experts, teachers, and more than 2,000 students from senior high schools in China. The study consists of three steps. First, a preliminary diagnostic model is proposed through literature analysis and examined in a pilot study, including attribute identification, initial Q-matrix construction, and Item Response Theory (IRT) and CDA analyses. Second, the hypothetical model is proposed and validated by comparing test developers' judgments, examinees' think-aloud protocols, and domain experts' evaluations. The G-DINA model is employed psychometrically to analyze the test responses against the hypothetical model and to obtain absolute and relative model-data fit statistics. The best-fit model will then be verified with new data and its goodness of fit confirmed. Third, diagnostic reports will be proposed, followed by guidance for learners to cope with their diagnosed problems. Quantitative and qualitative information is collected to develop a validation argument for the composed CDA. The first two steps are completed. Preliminary findings identify six attributes for reading comprehension and suggest that the CDA method is scientific and feasible. It is expected that later analysis will verify that the CDA diagnostic report can facilitate teaching and learning. Combining assessment of learning with assessment for learning has become a global trend in the education field. Complementing this trend, the study suggests that CDA is an effective way to achieve this goal by elaborating on practical aspects of cognitively based assessment design and application. In addition, the use of diagnostic reports for providing feedback would allow teachers to identify a learner's specific deficiencies and to plan individualized instruction. It also helps learners on their journey to autonomy.

17. Construction of rating scales for the assessment of peer-to-peer interactive speaking performance

Romulo Guedez-Fernandez, The University of the West Indies, romulo.guedez@sta.uwi.edu

In the Anglophone Caribbean, Spanish plays an important role, both in high school education and as a desired communication tool and asset. This study forms part of ongoing research aimed at developing rating scales for the assessment of interactive speaking performance in an advanced-level Spanish course at the university level in an English-speaking Caribbean country. Participants (n=40) are Spanish majors and minors in the final year of their degree. All of the participants have received at least nine (9) years of Spanish language instruction. In this academic context, Spanish language instruction and the respective curriculum follow the Common European Framework of Reference (CEFR) level C1. Classroom teaching activities include discussions and debates on complex topics such as politics, immigration, and gender, as these types of tasks demand more interactive ability of test takers at level C1. Similar types of tasks are used in each of the two peer-to-peer oral assessments during the semester. These two tests are assessed by one rater who does not participate in the test, and one rater acting as an interlocutor with minimal participation during the test. The set of descriptors provided by the CEFR's rating scales for informal/formal discussion and conversation at the C1 level allows for a wide range of discrepancies for similar test performances, either when raters allocate marks during the rating process or when these performances are re-rated. For this specific context of teaching and assessment, a performance data-driven approach has been chosen in order to develop rating scales for the assessment of peer-to-peer interactive speaking performance. Feedback sessions on participants' performance will be provided after the test using think-aloud protocols. Students' performances, focus groups
with raters and teachers, feedback sessions, as well as raters' verbal reports while assessing students' performances, will be video-recorded, transcribed, and further analysed. Not only will the data collected allow us to arrive at a more robust definition of the construct of interactional competence for this particular speaking test, but they will also enable the operationalization of this construct, provide more specific descriptors for the new rating scales, and facilitate the validation process. In the work-in-progress session, data from this study will be presented and feedback about methods of analysis and interpretation will be sought.

18. Designing a post-entry language assessment for international students in a business administration co-operative education program

Christine Doe, Mount Saint Vincent University, christine.doe@msvu.ca

(Young & Schartner, 2014). Background. The intent of the study was to identify the strengths and weaknesses of EAL students in a business administration co-op education program to inform the construct development of a PELA. Previous research has identified course-based challenges international EAL students encounter in business-related programs (Foster & Stapleton, 2012) and the strengths the students bring to those programs (Macdonald, 2013), but such research has not been extended to workplace learning situations. In engineering, Fox and Haggerty (2014) conducted a needs analysis of EAL students by drawing on multiple data sources, including interviews with various stakeholders. Based on the needs analysis, an existing PELA was refined to inform targeted instruction for an English for Academic Purposes (EAP) course for engineering students (Fox & Haggerty, 2014). Methods. Taking a case study approach (Yin, 2009), this study was theoretically grounded in Davidge-Johnston's (2007) framework for evaluating co-op education curricula as well as Read and von Randow's (2013) approach to the development of a PELA in a New Zealand university context. Interview data were collected from 5 stakeholder groups, including 4 Business
audience regarding the interpretation of the initial findings and design of the larger study.

19. Planning for washback in the development of a large-scale English language proficiency assessment

Kevin Joldersma, SLTI, kevin.joldersma@gmail.com
Karen Feagin, SLTI, kfeagin@2lti.com
Charis Walikonis, SLTI, cwalikonis@2lti.com
Catherine Pulupa, SLTI and University of Maryland, cpulupa@2lti.com
Jennie Norviel, SLTI, jnorviel@2lti.com
Mika Hama, SLTI, mhama@2lti.com
Toshi Watanabe, Benesse, t-watana@mail.benesse.co.jp

Washback effects are often believed to be negative. Buck (1988), commenting on the inability of many Japanese high school graduates to perform basic communicative functions after years of English language instruction, blames grammar- and translation-focused testing and the instruction practices that result. But according to Swain (1985), the washback effects of a well-designed test should be positive. Buck (1988) offers a solution to the problem in Japan: "If we want our students to learn to communicate in English, then we must give them tests which require them to process communicative English" (p. 18). In light of the need for tests which assess communicative English skills, a multinational group is currently developing a large-scale assessment at the university admissions level. Currently, the test is launching for administration in Japan to Japanese students seeking university admission, with plans for future expansion after continued development and research. It was designed within the framework of Task-Based Language Assessment (TBLA) and in alignment with the Common European Framework of Reference (CEFR). It tests four skills (reading, listening, speaking, and writing) and targets specific knowledge, skills, and abilities (KSAs) that students would use in real-life social and academic contexts. The KSAs were derived from syllabuses and aligned with the "Can do" statements of the CEFR. All items are designed to provide the test-taker with an authentic context in which s/he interacts with the item. This presentation will address multiple aspects of washback effects related to the development and implementation of the new large-scale assessment. There is already evidence of washback in Japan among end-users including universities, students, and the Japanese Ministry of Education. Assuming continued interest from the Ministry, there is the potential for shaping future education policy in Japan. This study will aim to assess washback effects that are already taking place, as well as make recommendations for curriculum development, instruction, and education policy within Japan. Specifically, we will describe the preliminary steps of our research into the washback effects of the assessment. We hope to use this research to plan for washback, continue to develop the test with washback in mind, develop teacher training resources aligned with our assessment framework, and plan research methodology for addressing washback effects after the test goes live. We hope to gain, from this work-in-progress session, input from the audience on our proposed methodologies for investigating washback of the assessment as well as suggestions for alternative approaches or sources of data collection.
159
work-in-progress abstracts
20. What does the passage translation task measure? Connecting translation assessment with instruction by investigating test-taking processes

Jian Xu, Chongqing University, 857119178@qq.com
Xiangdong Gu, Chongqing University/Cambridge English, xiangdonggu@263.net

One of the most common testing methods used in foreign language teaching is translation (Buck, 1992). Although it is a very common testing method, there has been very little research to indicate how effective a testing method translation really is (Buck, 1992; Ito, 2004). House (2009) put forward a system for analyzing and evaluating texts for translation purposes, yet, as noted by Klein-Braley (1982), it is still not clear what translation is supposed to measure. The National College English Test Band 4 (CET-4 hereafter) adjusted its Chinese-English translation task from a sentence format (5% of the total score) to a short-passage format (15% of the total score) in 2013. However, the test syllabus does not give a clear construct definition for this task, so it is difficult for teachers to teach and for students to prepare for passage translation in and out of class. We therefore see a need for an empirical study investigating what the passage translation task actually measures in CET-4. Guided by Bachman and Palmer's (1996) description of language ability, we examine test-takers' processes of handling the CET-4 Chinese-English passage translation task by adopting think-aloud protocol analysis and retrospective interviews. Six participants (3 males and 3 females) took part in our exploratory study. The instrument was a CET-4 passage translation task from December 2013. The preliminary findings suggest that vocabulary and grammar are the two main elements assessed by the task. As to vocabulary, the high-scoring group tends to choose the most appropriate English words or to find synonyms as compensation for memory loss, while the low-scoring group may ignore unfamiliar words. As to grammar, the two groups generally focus on singular and plural forms, tenses, and the use of conjunctions, with no significant difference. Most test-takers translate word by word and then synthesize the words into a sentence. The most frequently used translation strategies are finding synonyms as substitutes, reading, repeating, revising, reasoning, and monitoring. Significant differences in strategy use between the two groups were identified. In addition, test-wiseness strategies and construct-irrelevant factors were detected in our study. Building on this exploratory study, we plan to design a questionnaire with items on translation competence and strategies and to ask 300 college students to tick the items they use when taking a CET-4 passage translation task, in order to further reveal what the passage translation task measures. The study intends to provide helpful suggestions for better classroom translation instruction and practice in and out of class. We hope the research findings may serve as a bridge connecting translation assessment with translation instruction and practice, in addition to bearing implications for passage translation test development.
Index of presenters
Aasen, Arne Johannes 127-128
Acedo Bravo, Alberto Enrique 33, 63
Alderson, Charles 39, 83-84
Bai, Ying 36, 70-71
Baker, Beverly A. 34, 101-104
Ballard, Laura 34, 62, 131-132
Bärenfänger, Olaf 130
Baten, Lutgarde 33, 63, 124-125
Battistini, Laura 40, 98-99
Begg, Kaitlyn 34, 101-104
Berry, Vivien 33, 54-55
Bi, Nick Zhiwei 39, 89
Billington, Rosey 36, 76-77
Bitterman, Tanya 137-138
Bojarski, Stefanie 147
Boraie, Deena 40, 93-94
Brau, Maria 33, 55-56
Broeker, Roderick 147-148
Brookhart, Susan M. 8, 9, 32
Brooks, Rachel 33, 55-56
Cai, Hongwen 34, 60-61
Carlsen, Cecilie Hamnes 136-137
Caudwell, Gwendydd 40, 93-94
Chapman, Mark 32, 50-51
Charest, Marie-Soleil 141
Chen, Michelle Yue 33, 48-49, 140
Dang, Trang 142
Davidson, Fred 30-31, 35, 42-43, 142
Duke, Trina 128-129
Dunlea, Jamie 37, 40, 96-97, 110-113
Dunlop, Maggie 34, 37, 58-59, 81-82, 129-130
Elder, Catherine 39, 84-85
Elwood, James 33, 52
Erickson, Gudrun K. 38, 90
Evanini, Keelan 40, 98-99
Evensen, Lars Sigfred 127-128
Fairbairn, Judith 40, 96-97
Fan, Jinsong 33, 51
Farhady, Hossein 154
Faulkner-Bond, Molly 40, 96-97
Feagin, Karen 159
Fleckenstein, Johanna 34, 65
Fox, Janna 31, 34, 45, 101-104
Fraser, Catriona 39, 84-85
Gaillard, Stephanie 30-31, 42-43
Galaczi, Evelina 33, 54-55
Gao, Manman 105, 149-150
Geranpayeh, Ardeshir 38, 41, 119-122
Gesicki, Adam 36, 67
Geva, Esther 37, 110-113
Ginther, April 119-122
Gomez, Pablo Garcia 128-129
Green, Anthony 3, 30-31, 37, 44, 107-109
Hamp-Lyons, Liz 30-31, 35, 37, 41, 44, 107-109
Han, Baocheng 41, 116-119
Huynh, Tuan 142
Ikeda, Naoki 39, 92-93
Imao, Yasuhiro 134
In’nami, Yo 133
Inbar-Lourie, Ofra 40, 95-96
Inoue, Chihiro 33, 54-55
Isaacs, Talia 38, 82-83
Isbell, Daniel 139-140
Iwashita, Noriko 33, 56-57
Jang, Eunice Eunhee 33, 37-38, 48-49, 81-82
Jarvis, Scott 41, 113-115
Jia, Dingding 123
Jia, Guodong 135-136
Jin, Tan 91
Jin, Yan 39, 87
Joldersma, Kevin 141, 159
Kennedy, Lauren 141
Khalifa, Hanan 32, 53-54
Kim, Ahyoung Alicia 123-125
Kim, Hyejeong 36, 76-77
Kim, Hyun Jung 125
Knoch, Ute 36, 74
Koizumi, Rie 32, 52-53, 133-134
Köller, Olaf 34, 65
Kremmel, Benjamin 41, 113-115
Maddox, Bryan 34, 101-103
Malone, Margaret 33, 47-48, 138-139
Margolis, Douglas 36, 72
May, Lyn 36, 74
McLean, Stuart 34, 64-65
McNamara, Tim 13, 38, 41
Mei, Yi 155-156
Mellow, Dean 34, 101-104
Mesquita, Alexandre 145
Min, Shangchao 37, 77-78
Montee, Megan 137-139
Mu, Yuanyuan 149-150
Musser, Samantha 138-139
Nakamura, Keita 133
Nakatsuhara, Fumiyo 33, 54-55
Natvig, Siri 127-128
Nguyen, Hoa 142
Nguyen, Huu 142
Nguyen, Lan 142
Nguyen, Minh 142
Nieminen, Lea 39, 83-84
Norviel, Jennie 159
O'Sullivan, Barry 37, 40-41, 93-94, 110-113
Ockey, Gary 37, 75
Park, Gina 33, 37, 48-49, 81-82
Salamoura, Angeliki 32, 53-54
Sandvik, Todd 37, 110-113
Sawaki, Yasuyo 32, 52-53
Schmidgall, Jonathan 38
Schmitt, Norbert 41, 113-115
Schoonen, Rob 41, 113-115
Segalowitz, Norman 41, 113-115
Seilonen, Marja 135
Shen, Mingbo 35, 104-105
Shin, Sun-Young 37, 78-79
Shiotsu, Toshihiko 141-142
Skar, Gustaf Bernhard Uno 127-128
Slomp, David 34, 101-104
Smemo, Jorun 127-128
So, Youngsoon 40, 98-99
Song, Xiaomei 33, 51
Stenner, A. Jackson 37, 110-113
Stevens, Lilly 33, 47-48
Stojanovska, Illina 142-143
Stone, Jake 33, 36, 38, 40, 48-49, 67, 99
Storch, Neomy 36, 74
Sun, Haiyang 105
Suni, Minna 135
Suvorov, Ruslan 34, 63-64
Suzuki, Masanori 41, 119-122
Suzuki, Yuichi 142-143
Swartz, Carl 151-152
Tan, May 38, 147-148
Tao, Jidong 40, 98-99
Tavassoli, Kobra 154
Thirakunkovit, Suthathip 35, 71-72
Tian, Jie 32, 46
Tran, Phuong 142
Trottier, Sara 143-144
Tsagari, Dina 32, 47
Tschirner, Erwin 130
Turner, Carolyn E. 31, 34, 38, 45, 82-83, 101-102
Ullakonoja, Riikka 39, 83-84
Urmston, Alan 137
Vafaee, Payman 142-143
Vahabi, Khosro 132
van der Boom, Edith 37, 81-82
Van Gorp, Koen 38, 86
Van Maele, Jan 124-125
van Naerssen, Margaret 36, 79-80
van Splunder, Frank 124-125
Vasquez, Claudia 33, 56-57
Vaughter, Phallis 128-129
Vogt, Karin 32, 47
Volkov, Alex 36, 40, 67, 99
Wagner, Elvis 125-126
Wagner, Maryam 33, 48-49
Walikonis, Charis 159
Walter, Daniel 35, 71-72
Wang, Boran 39, 91
Wang, Shujiao 152-153
Waring, Hansun 37, 107-109
Watanabe, Toshi 159
Wei, Jing 33, 47-48
Wei, Junli 156-157
Wei, Na 41, 116-119
Weigle, Sara 41, 119-122
Weyant, Katie 35, 71-72
Wilmes, Carsten 123-124
Winke, Paula 35, 71-72
Wolf, Mikyung Kim 40, 97-98
Wong, Roxanne 151-152
Wu, Jessica Row-Whei 36, 69-70
Wu, Rachel Yi-fen 36, 69-70
Wu, Yong 35, 106
Wu, Yongfei 126-127
Wu, Zunmin 37, 41, 75-76, 116-119
Xi, Xiaoming 40, 94-95
Xu, Jian 160
Xu, Yueting 36, 73
Yan, Qiaozhen 32, 46
Yan, Xun 35, 40, 71-72, 100
Yang, Luna 37, 75-76
Youn, Soo Jung 33, 59-60
Yu, Guoxing 35
Zapata, Diego 40, 98-99
Zhang, Wenxia 35, 104-105
Zhang, Xiaoyi 39, 87
Zhao, Wen 39, 91
Zou, Shaoyan 39, 87
Zumbo, Bruno D. 34, 101-104
Language Testing in Asia (LTA) was founded in 2011 as a high-quality, peer-reviewed online academic journal. The purpose of LTA is to acknowledge and showcase scholarly findings in the field of language testing in Asia.