
Assessing Grammatical and Textual Features in L2 Writing Samples: The Case of French as a Foreign Language
STEVE Y. CHIANG
Da Yeh University
Changhua, Taiwan
Email: ytchiang@aries.dyu.edu.tw

This article investigates the relative importance of various grammatical and discourse features in the evaluation of second language (L2) writing samples produced by college students enrolled in beginning and intermediate French courses. Three native-speaking instructors of French rated 172 essays using a scale that was constructed by the researcher and based on theory and research from discourse analysis. The scale contained 4 areas of evaluation (morphology, syntax, cohesion, and coherence) encompassing a total of 35 language/textual features, in addition to a holistic judgment of overall quality. Among the findings are that (a) raters relied heavily on discourse features, especially those for cohesion, in judging the overall quality of an essay; and (b) the rating scale exhibits content validity and reliability, although refinement is still needed to achieve a desired construct validity. Future research should focus on discovering other elements involved in the rating practice through analytical delineations and validation procedures and on adapting the proposed rating instrument for large-scale assessment contexts.

THIS ARTICLE REPORTS A PRELIMINARY study of the evaluation of second language (L2) writing samples. It addresses perceived inadequacies in current evaluative practices with the intent of increasing their validity. These inadequacies include (a) the "isolated" approach, such as that adopted in error gravity studies, which focuses on discrete, formal features of language devoid of discourse context, and (b) the summary approach, as is commonly found in the field of direct writing assessment, which is primarily concerned with practical value but largely ignores theoretical justification and the practitioner's role as contributor to knowledge about language proficiency. The present study reflects an integrated approach that consists of the following three interrelated steps: (a) assessing language features within a relatively complete discourse universe so as to take account of the full significance of their qualities, (b) moving beyond the narrow focus on grammatical elements to incorporate the extrasentential domain of this universe, and (c) providing theoretical and empirical justifications for the measures adopted.

For this study, I constructed a rating scale (see Tables 1 and 2), primarily as a research instrument, that is based on current theories in communicative competence modeling and discourse analysis, and applied it to the evaluation of page-long essays written by students enrolled in college French courses in the U.S. The purposes of the study were (a) to delineate and identify language features relevant to the evaluation of writing samples of beginning and intermediate learners of French as a Foreign Language, and (b) to examine the relative importance of these features in raters' perception of writing quality.

BACKGROUND

Traditionally, the assessment of L2 samples has centered around discrete, formal elements in language. Error gravity studies, for example, have sought to establish a hierarchy of various errors in terms of the effect they produce on the reader or listener. Two of these studies, by Ensz (1982) and Piazza (1980), were largely responsible for the prevailing belief in the language teaching community that the French have a low tolerance for grammatical inaccuracy.



However, the language samples used in most of these studies, such as those used by Delisle (1982), Khalil (1985), and Tardif and d'Anglejan (1981), consisted only of isolated and artificially contrived sentences that obviously are not representative of actual communication situations. Even those studies that employed more or less naturalistic stimuli within larger discourse contexts paid only occasional attention to extrasentential features such as conjunctions and the loosely defined concept of "organization" (Santos, 1988; Tomiyama, 1980). These were the observations that prompted Rifkin and Roberts (1995) to speculate that "[i]t may well be that linguistic errors take a 'back seat' to larger rhetorical issues once language is assessed in the context of a discourse universe larger than a single utterance" (p. 532). The same authors thus called for "a more holistic, naturalistic, qualitative approach to examining the full context of interaction" (p. 533) for future studies of this kind.

As in the field of error analysis, the focus of writing assessment has rarely surpassed the sentence boundary. According to Huot (1990), the T-unit, which was used to measure syntactic maturity and, therefore, writing quality, was the major form of textual analysis. However, as the same author also pointed out, in numerous studies (e.g., Crowhurst, 1980; Nold & Freedman, 1977; Witte, Daly, & Cherry, 1986; see also Kern & Schultz, 1992), the relationship between syntactic maturity and writing quality was eventually found to be, at best, a weak one. It was not until the 1980s that researchers began to apply intersentential grammars (e.g., Halliday & Hasan, 1976) to the analysis of written texts. A few studies were carried out to investigate the relationship of cohesion to coherence, writing quality, or both, in first language (L1) as well as in L2 texts, but there were mixed results (Anderson, 1980; Connor, 1984; Evola, Mamer, & Lentz, 1980; Fitzgerald & Spiegel, 1986; Jafarpur, 1991; Linnarud, 1978; McCulley, 1985; Neuner, 1987; Spiegel & Fitzgerald, 1990; Tierney & Mosenthal, 1983; Witte & Faigley, 1981). Despite general agreement that cohesion was an important property of writing quality, attempts to link cohesion (based on frequency analyses) to writing quality were often unsuccessful. The fundamental incongruity in these studies, it seems, is the search for a quantitative relationship, even though cohesion and, to a more important degree, coherence are largely qualitative phenomena. This overly formalistic approach toward extrasentential features has left us with little useful information (Anderson, 1980) concerning the intricate propositional relationships within a text and has limited our capacity to look beyond intersentential connections toward a more comprehensive view of the performance domain (see Enkvist, 1990; Farghal, 1992). This is the reason why, until fairly recently, few researchers in the field seemed willing to venture beyond analysis of formal cohesive devices, such as sentence connectors and pronouns, into the realm of logical or rhetorical relationships among propositions within the entirety of a text (see Grabe, 1985).

This reluctance continues to be seen even when researchers do move beyond the formal analysis of overt, concretely describable entities. Despite the fact that "organization" has repeatedly been found to be a textual aspect that influences essay raters' judgment of writing quality more than other features (Breland & Jones, 1984; Freedman, 1977, 1979a, 1979b), there has been practically no attempt to define what it is in analytical terms (Astika, 1993; Brown & Bailey, 1984; Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981; Kobayashi, 1992; Mullen, 1980; Sasaki & Hirose, 1996; Sweedler-Brown, 1993). As a result, a curious phenomenon has arisen: Scores of researchers continue to conduct studies and evaluations using whatever categories they see fit and to make claims about these studies without describing their theoretical underpinning or discussing their significance within any particular theoretical framework. Indeed, as Faigley, Cherry, Jolliffe, and Skinner (1985) put it, "[P]ractice has far outrun theory in writing assessment" (p. 205).

But writing assessment is not the only research area facing the issue of theoretical justification. The entire discipline of language testing had to be, and in fact has been, concerned with construct validation. In contrast to the situation in writing assessment, however, theory seems to "outrun" practice in communicative competence modeling. In this area, there has been no lack of theoretical propositions concerning the nature or structure of language proficiency. However, empirical evidence in support of these propositions has been sparse. In terms of the discourse aspect of language samples, controversies still exist as to its constituent features and its relative independence from all the other hypothesized components of the model. As Shohamy (1990) pointed out, clarification of its constituent linguistic features is insufficient. Moreover, researchers have paid little heed to theory and research that come from discourse analysis. In the assessment of language samples produced by participants, discourse features have hardly received any emphasis or attention. All of the rating scales proposed so far or currently in use, including the 1986 scale of the American Council on the Teaching of Foreign Languages (ACTFL), provide only general descriptions in which hardly any reference is made to discourse features other than those that are dependent on grammatical knowledge. As a result, Shohamy concluded that

there is an urgent need to devise rating scales and other criteria for assessing discourse elements in the language sample obtained. The construction of rating scales that will focus on different elements of discourse are [sic] essential for providing meaningful information on the quality of oral and written texts which test takers produce. (p. 125)

Several years have passed since the publication of her study, but little seems to have been accomplished to address this need.

THE PILOT STUDY

I conducted a pilot study during the spring semester of 1994. The primary purpose of the study was to test the application of the rating scale I had constructed. It contained four subscales aiming respectively at the four areas of evaluation (morphology, syntax, cohesion, and coherence) and included a total of 38 features: 8 each for morphology and for syntax, 10 for cohesion, and 12 for coherence. Theoretical justification for the content of the instrument is presented in the following section.

One essay from each of the 30 students was collected. These essays had been written outside of class and had a general theme of leisure activities. All students received extra credit for their participation. The essays were then given to the raters, who evaluated them independently. Prior to the start of the grading, the researcher and raters participated in a training session. During the session, the raters received three sample essays and evaluated them as a group according to the features on the rating scale. As they reviewed these features, the raters explained and discussed their decisions with one another in order to reach a consensus. In the process, I also gave them further clarification on the meanings of certain items. The raters then took the 30 essays home and had one month to complete their assessment. All received monetary compensation.

Following an analysis of the data gathered, the rating scale underwent minor revision: Items with a Pearson product-moment correlation coefficient significantly lower than .3 (level of significance = .05 for a one-tailed test) when correlated with the overall ratings for their respective subgroups were either revised or discarded (see Henning, 1987, pp. 57, 170, for theoretical justification of these decisions). The final count of items in each subscale is as follows: 7 in morphology, 8 in syntax, 9 in cohesion, and 11 in coherence.
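The screening rule just described can be made concrete with a short sketch. The following Python fragment is illustrative only: the function name, the data layout, and the synthetic ratings are all invented, and the test shown is a standard Fisher Z test of whether an item's correlation with the overall rating falls significantly below .30 (one-tailed, alpha = .05).

```python
# Hypothetical sketch of the pilot-study item screening rule: an item is
# flagged when its correlation with the overall rating is significantly
# below .30 (one-tailed test at alpha = .05). Data and names are invented.
import numpy as np
from scipy.stats import norm

def flag_weak_items(item_scores, overall, r0=0.30, alpha=0.05):
    """item_scores: dict of item name -> array of ratings (one per essay);
    overall: array of overall quality ratings for the same essays."""
    flagged = []
    n = len(overall)
    for name, scores in item_scores.items():
        r = np.corrcoef(scores, overall)[0, 1]
        # Fisher Z test of H0: rho >= r0 against H1: rho < r0 (one-tailed)
        z = (np.arctanh(r) - np.arctanh(r0)) * np.sqrt(n - 3)
        if norm.cdf(z) < alpha:          # significantly below .30
            flagged.append((name, round(float(r), 2)))
    return flagged

rng = np.random.default_rng(0)
overall = rng.integers(1, 6, 30).astype(float)              # 30 pilot essays
items = {"cohesion_e": rng.integers(1, 6, 30).astype(float)}
print(flag_weak_items(items, overall))
```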

THE RATING SCALE

After a review of literature on writing assessment, communicative language modeling, and discourse analysis, I decided to define the grammar trait as including morphology and syntax, and the discourse trait as composed of cohesion and coherence. The rating scale, therefore, consists of two intrasentential subscales (morphology and syntax) and two extrasentential subscales (cohesion and coherence), in addition to a separate 5-point scale on overall writing quality. The detailed discussion of these subscales below provides the theoretical or empirical justification, or both, of the content of each trait domain.

The Intrasentential Subscales

The intrasentential subscales, morphology and syntax (Table 1), were designed to measure grammatical accuracy within individual sentences. The morphology subscale addresses grammatical accuracy within individual word and noun phrase boundaries, whereas the syntax subscale deals with the accuracy with which syntactical rules of the target language (TL) are applied to order or relate different elements within the sentence boundary. The selection of certain grammatical features instead of others within these two categories stemmed primarily from practical or empirical considerations. Given the proficiency level of the participants, the number of grammatical elements that could be investigated fruitfully was limited. Certain features, such as the subjunctive mood, although a pertinent domain to look into for grammatical control, do not appear often, if at all, in such students' language production.

TABLE 1
Intrasentential Subscales: Syntax and Morphology

Please circle the number that reflects the accuracy of the feature in the essay. Circle NA (Not Applicable) when insufficient or no information is available concerning the particular feature.

5 = Very Good   4 = Good   3 = Fair   2 = Poor   1 = Very Poor

SYNTAX
5 4 3 2 1 NA   (a) Prepositions
5 4 3 2 1 NA   (b) Relative pronouns
5 4 3 2 1 NA   (c) Position of adjectives
5 4 3 2 1 NA   (d) Position of adverbs
5 4 3 2 1 NA   (e) Subject-verb agreement
5 4 3 2 1 NA   (f) Verb tenses and moods
5 4 3 2 1 NA   (g) Passive voice/reflexive verbs
5 4 3 2 1 NA   (h) Negation

MORPHOLOGY
5 4 3 2 1 NA   (a) Articles (definite, indefinite, partitif)
5 4 3 2 1 NA   (b) Gender and number agreement (articles, nouns, adjectives, past participle)
5 4 3 2 1 NA   (c) Verb conjugation
5 4 3 2 1 NA   (d) Word class (noun, adjective, adverb, verb, etc.)
5 4 3 2 1 NA   (e) Spelling
5 4 3 2 1 NA   (f) Accents
5 4 3 2 1 NA   (g) Elision

Please rate the essay on its overall quality: 5 4 3 2 1
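To make the shape of the instrument concrete, here is a minimal sketch of how one rater's completed sheet for a single essay might be stored and collapsed into subscale scores, with NA items simply excluded from the average (the study reports area ratings as averages of their constituent features; the NA handling and all names below are illustrative assumptions, not part of the original study). The item letters follow Table 1 above and Table 2 below.

```python
# Illustrative representation of one rater's completed sheet for one essay.
# Items marked NA are stored as None and excluded from the subscale average.
SUBSCALES = {
    "syntax":     list("abcdefgh"),    # 8 items, Table 1
    "morphology": list("abcdefg"),     # 7 items, Table 1
    "coherence":  list("abcdefghijk"), # 11 items, Table 2
    "cohesion":   list("abcdefghi"),   # 9 items, Table 2
}

def subscale_means(sheet):
    """sheet: dict like {'syntax': {'a': 4, 'b': None, ...}, ...};
    returns the NA-excluded mean for each subscale."""
    means = {}
    for scale, item_list in SUBSCALES.items():
        scores = [sheet[scale][i] for i in item_list
                  if sheet[scale].get(i) is not None]
        means[scale] = sum(scores) / len(scores) if scores else None
    return means

example = {s: {i: 4 for i in item_list} for s, item_list in SUBSCALES.items()}
example["syntax"]["h"] = None   # negation never attempted in this essay -> NA
print(subscale_means(example))
```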

The present study did not include "vocabulary" as a variable of interest. In previous research in communicative competence modeling, some confusion arose as to where "vocabulary" belonged. Most researchers (e.g., Bachman, 1990; Canale, 1983; Canale & Swain, 1980; Verhoeven & Vermeer, 1992) have included it under grammatical competence, but others (e.g., Bachman & Palmer, 1982) viewed it as part of pragmatic (discourse) competence, along with cohesion and coherence. Here, the decision to omit vocabulary was a result of the belief that lexical knowledge seems to enter at all levels of a text. Such was the belief of Schachter (1990), who, citing Chomsky (1980), viewed "lexical knowledge" as a conceptual system that "provides common sense understanding of the world" and that has "a central role in all sorts of mental acts and processes in which language plays no special part" (p. 41). Grabe (1985), in his model of text, also identifies "lexicon" as being apart from the other six components (syntax, semantics, cohesion, coherence, posture, and style) "in that it is not a component with independent properties from the other components" (p. 112; see also Grabe & Kaplan, 1996, on the centrality of "lexicon"). Since maximum separation of traits is of great importance to a study such as this one, which investigates the relative contribution of various language and textual features to the judgment of writing quality, the decision to exclude "vocabulary" seemed appropriate. In fact, empirical findings by Harley, Cummins, Swain, and Allen (1990), who were able to provide some evidence of the separateness of traits when no lexical measures were included, are consistent with the above theoretical view of lexical knowledge.

The Extrasentential Subscales

The extrasentential subscales (Table 2) were designed to permit assessment of effectiveness and appropriateness in the organization of ideas above the sentence level and among different parts of the text; that is, they used the notions of cohesion and coherence. Bamberg (1983), Carrell (1982), de Beaugrande (1984), Grabe (1985), van Dijk (1977), and Witte and Faigley (1981) have argued for the conceptual or theoretical distinction between these two traits. Empirical studies such as those by Anderson (1980), Connor (1984), and Tierney and Mosenthal (1983), moreover, have supported this distinction. Grabe (1985) provided perhaps the best articulated definitions of these two intersentential (or textual) level features:


TABLE 2
Extrasentential Subscales: Coherence and Cohesion

Please circle the number that reflects the degree to which you agree with the statement about the essay. Circle NA (Not Applicable) when insufficient or no information is available concerning the particular feature.

5 = Strongly Agree   4 = Agree   3 = Undecided   2 = Disagree   1 = Strongly Disagree

COHERENCE
5 4 3 2 1 NA   (a) The beginning section is effective in introducing the reader to the subject.
5 4 3 2 1 NA   (b) The ideas in the essay are all very relevant to the topic.
5 4 3 2 1 NA   (c) The ideas in the essay are well-related one to another.
5 4 3 2 1 NA   (d) The causal relationship between ideas is clear.
5 4 3 2 1 NA   (e) Problem statements are followed up by responses/solutions.
5 4 3 2 1 NA   (f) Different ideas are effectively compared/contrasted.
5 4 3 2 1 NA   (g) Ideas mentioned are elaborated.
5 4 3 2 1 NA   (h) The writer's overall point of view is clear.
5 4 3 2 1 NA   (i) The division of paragraphs is justifiable in terms of content relevance.
5 4 3 2 1 NA   (j) Transition between paragraphs is smooth.
5 4 3 2 1 NA   (k) The ending gives the reader a definite sense of closure.

COHESION
5 4 3 2 1 NA   (a) The exact same vocabulary/expressions/structures are repeated consistently.
5 4 3 2 1 NA   (b) Equivalent words/paraphrases, when used, are used appropriately.
5 4 3 2 1 NA   (c) Pronouns of reference are used appropriately and accurately.
5 4 3 2 1 NA   (d) Ellipsis is used where needed.
5 4 3 2 1 NA   (e) Junction words are used judiciously and accurately.
5 4 3 2 1 NA   (f) Where no junction words are used, transition between sentences is smooth.
5 4 3 2 1 NA   (g) New information is introduced in an appropriate place or manner.
5 4 3 2 1 NA   (h) Examples are introduced judiciously, not just to form an exhaustive list.
5 4 3 2 1 NA   (i) Punctuation is employed appropriately to separate ideas and sentences.

Cohesion is the means available in the surface forms of the text to signal relations that hold between sentences or clausal units in the text; it is the surface manifestation of the underlying relations that bind a text. . . . Coherence, as a theoretical construct in text structure, refers to the underlying relations that hold between assertions (or propositions), and how they contribute to the overall discourse theme (or macro-structure). . . . Much as cohesion represents the formal features of text beyond the limits of the sentence, so also does coherence represent the semantic relations of text beyond the level of the sentence. (p. 110)

The ideas in these statements constituted the fundamental principles in the construction of the extrasentential rating scales, cohesion and coherence, in the present study.

Cohesion. The cohesion subscale consists of statements pertaining to the degree of compactness or efficiency with which ideas in individual sentences are related to each other. It is based on de Beaugrande and Dressler's (1981) taxonomy of cohesive ties, which includes four main types: (a) those expressing equivalence, (b) those constituting compactness and efficiency, (c) those signaling relationships among events or situations in the textual world, and (d) those showing importance or newness of content.

Ways to express equivalence include (a) recurrence, the straightforward repetition of elements or patterns; (b) partial recurrence, the shifting of already used elements to different classes (such as from noun to verb); (c) parallelism, repeating a structure but filling it with new elements; and (d) paraphrase, repeating content but conveying it with different expressions. These features are subsumed under items (a) and (b) in the cohesion subscale in Table 2.

Devices employed to achieve compactness and efficiency include the use of (a) proforms, which are short placeholders with no independent content that replace content-carrying elements, and of (b) ellipsis, which consists of the repetition of a structure and its content with some of the surface expressions omitted. These devices are represented by items (c) and (d) in the cohesion subscale in Table 2.

Surface signals of the relationships among events or situations in the textual world are said to be created by the use of tense, aspect, and junction. Item (e) in the pilot version of the cohesion subscale attempted to represent the consistent use of verb tense, aspect, and mood, but was deleted as a consequence of its low correlation to the total in this category. The reason for this low correlation is unclear. One possibility may be that the raters confused this particular item with item (f) in syntax, which, in contrast, represents the formal accuracy of these features. Further study is needed to clarify this issue. To represent the notion of connection and transition between propositions more thoroughly, two items, (e) and (f), were included in the final version of the cohesion subscale.

Finally, the relative importance, or newness, of content is shown by the ordering or the timing of the use of expressions. For example, the salience of the information "human beings" is less in the flatly stated instance "Human beings are the living creatures that produced the largest amount of damage to the earth" than it is in the emphatic position in "The living creatures that produced the largest amount of damage to the earth are the human beings." One placement of an expression can be more appropriate than the other depending on the context. Item (g) in the cohesion subscale is formulated to represent this feature.

In addition to the above features drawn from discourse theory, two items were added to the list. Item (h) in the cohesion subscale is motivated by the observation that certain learners tend to introduce examples in an inventorial fashion. This is usually accomplished by means of a long list of items following "par exemple," which inevitably makes the text look mechanical. According to Magnan (1985), "listing" is typical in novice-level writing. But this is a phenomenon that does not seem to have been addressed in de Beaugrande and Dressler's taxonomy. Finally, item (i) in the cohesion subscale, concerning punctuation, aims at the linear grouping of propositions, which may appear in an appropriate order but in incongruent places due to inappropriate use (or nonuse) of punctuation.

Coherence. The coherence subscale in Table 2 pertains to discourse-level features, or macropropositions, which, according to Meyer (1985), are responsible for giving prose its overall organization. At this level, of concern are the relationships among ideas represented in complexes of propositions or paragraphs, which tend to be logical or rhetorical (see Meyer, 1985).

The subscale is an adaptation of Meyer's functionally based system (1975, 1985), which identifies five groups of relationships: collection, causation, response, comparison, and description. Collection refers to the kind of relation that shows how ideas or events are related on the basis of some commonality. Causation represents the cause-effect or antecedent-consequent relationship between ideas. Response denotes the kind of relationship between problem and solution, remark and reply, and question and answer. Comparison has to do with differences and similarities between two or more topics. Finally, description refers to giving more information about a topic by presenting attributes, specifics, manners, or settings. Items (b), (c), (d), (e), (f), and (g) in the coherence subscale are direct operationalizations of these discourse traits.

A few weaknesses in the students' essays that appear to detract from coherence were unaccounted for by Meyer's system (1975, 1985). The beginning section of an essay, for example, is important in helping the reader anticipate forthcoming messages, thereby facilitating his or her understanding of the text. Ideas in this section may be relevant to the topic or well-related to other parts of the text but fail to fulfill their introductory function. The same type of failure may occur in the final section, which is supposed to announce to the reader that the discourse is coming to a close. However, this function is not necessarily well served when students provide the description of a corollary or the proposition of a solution, or even a contrastive point of view or further elaboration of the central theme of the thesis. Within the middle section, the body of the text, appropriate and effective grouping of all relevant, contrastive, or causal propositions is also important in fostering an orderly sense within and across idea groups. Transitions between these internally coherent domains of propositions are also necessary in order to give a free-flowing sense to the text as a whole. All these conditions, however, do not necessarily make the writer's purpose clear. Not unusually, the content of a text becomes obscure with too many ideas, even those that are interrelated and discussed in detail. It would appear that, unless it conveys a clear sense of overall purpose, a text can hardly be viewed as coherent. These were the reasons behind the "additional" items, namely items (a), (h), (i), (j), and (k) in the coherence subscale in Table 2.



ADMINISTRATION OF THE MAIN STUDY


Participants

Participants consisted of over 200 college students enrolled in second-, third-, and fourth-semester French language courses at The Pennsylvania State University, University Park. The total number of students from whom data were included was 172, of whom 126 were female and 46 male. With few exceptions, participants were between the ages of 17 and 22. For participating in the study, students received extra credit.

Data Collection

Near the end of Fall 1994, all students were assigned the following topic: "Le Temps Libre": Leisure activities are gradually becoming an important part of the modern lifestyle. How do you think people nowadays are making use of their free time and profiting from their leisure activities? All were instructed to write between 250 and 300 words and to turn in their work within a week's time.

Rater Training

As in the pilot stage, I held a meeting with the three raters. Prior to the meeting, I selected 5 sample essays from the 30 essays used in the pilot stage. Each essay represented a quality level on the rating scale, and I distributed all five with detailed ratings to the three raters. I selected these particular essays because they had received the most consistent ratings from the three raters. They were intended to serve as reference for the raters in evaluating the new essays. However, I took care not to provide any explanation as to what I meant by the "overall quality" of an essay for fear that this would artificially direct the raters' attention toward certain aspects rather than others, and thus, possibly invalidate the results. Similar precautions have also been taken by Henry (1996) and Kobayashi and Rinnert (1996).

DATA ANALYSES

Pearson Product-Moment Correlations

To determine how perceptions of the quality of constituent elements relate to those of the whole, correlations were calculated between ratings for individual features and the corresponding overall quality ratings for all essays. I did this for the four areas of evaluation, as well as for all individual items within them. For all 35 analytical features across all three raters, all but two ratings (rater 3: syntax [a] and [h]) were found to correlate significantly with their corresponding overall quality ratings. The significant coefficients ranged from 0.16 (p < .05 for a one-tailed test, rater 3: morphology [b]) to 0.68 (rater 3: cohesion [f]).

Ratings for each of the four areas of evaluation were obtained by averaging those of their constituent features. Correlations of these averaged scores to overall ratings are listed in Table 3. It appears that raters 1 and 2 were quite well-balanced in taking into account all four areas of evaluation when deciding the overall rating, whereas rater 3 showed a clear tendency to give more weight to discoursal features. The average scores for all three raters reduced deviations in judgment and visibly increased the correlations of the ratings in each area with those of overall quality. But in order to obtain a clearer picture of the relative contribution and predictive power of individual areas, regression analysis was needed.
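The correlational step just described can be sketched as follows; the arrays, names, and data are hypothetical stand-ins for the actual ratings of the 172 essays, and scipy's pearsonr supplies both the coefficient and a (two-sided) significance value.

```python
# Sketch of the correlational analysis: each analytic feature's ratings are
# correlated with the overall quality ratings; area scores are the means of
# their constituent items. Names and data here are hypothetical.
import numpy as np
from scipy.stats import pearsonr

n_essays = 172
rng = np.random.default_rng(1)
overall = rng.integers(1, 6, n_essays).astype(float)
# e.g., the 9 cohesion items rated for every essay by one rater
cohesion_items = rng.integers(1, 6, (n_essays, 9)).astype(float)

for j in range(cohesion_items.shape[1]):
    r, p = pearsonr(cohesion_items[:, j], overall)   # p is two-sided here
    print(f"cohesion item ({chr(97 + j)}): r = {r:.2f} (p = {p:.3f})")

# Area score = average of constituent feature ratings, then correlated
area = cohesion_items.mean(axis=1)
print("cohesion area vs. overall: r = %.2f" % pearsonr(area, overall)[0])
```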

Multiple Regression Analyses

I conducted multiple regression analyses in order to determine the degree to which perceptions of overall writing quality could be predicted from the score on each of the subareas of investigation. These subareas are ordered by rank below on the basis of their importance or contribution to this prediction. The R-squared coefficients (in parentheses) indicate the amount of variance accounted for in the overall quality after a variable and all its preceding variables have been entered.

TABLE 3
Correlation of the Four Areas of Evaluation with Overall Ratings

           Morphology   Syntax   Cohesion   Coherence
Rater 1       .78        .76       .70        .67
Rater 2       .77        .72       .80        .68
Rater 3       .49        .46       .82        .63
Average*      .79        .78       .89        .74

* Averaged raw scores from all three raters.



These coefficients were arrived at using the forward selection procedure in multiple regression analysis. The final R-squared coefficients, when converted to percentages, indicate the total amount of variance accounted for when all four subscales enter the regression equation.

Rater 1: Coherence (60.47%) > Syntax (78.66%) > Cohesion (82.84%) > Morphology (85.14%)
Rater 2: Cohesion (63.48%) > Coherence (72.77%) > Morphology (78.07%) > Syntax (79.05%)
Rater 3: Cohesion (66.69%) > Coherence (69.97%) > Syntax (74.30%) > Morphology (75.53%)
Average: Cohesion (78.95%) > Coherence (86.62%) > Morphology (89.64%) > Syntax (90.78%)
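The forward selection procedure behind these cumulative R-squared figures can be sketched as follows. The data and weights below are synthetic and purely illustrative; the code simply adds, at each step, the subscale that most increases R-squared.

```python
# Forward selection over the four subscale scores, reporting cumulative R^2,
# in the spirit of the analysis above. All data here are synthetic.
import numpy as np

def r_squared(cols, y):
    X = np.column_stack([np.ones(len(y))] + cols)        # intercept + predictors
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_selection(predictors, y):
    """predictors: dict name -> 1-D array; returns [(name, cumulative R^2)]."""
    remaining, chosen, path = dict(predictors), [], []
    while remaining:
        best = max(remaining, key=lambda name: r_squared(
            [predictors[m] for m in chosen] + [remaining[name]], y))
        chosen.append(best)
        del remaining[best]
        path.append((best, r_squared([predictors[m] for m in chosen], y)))
    return path

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, 172)                             # overall ratings (synthetic)
preds = {name: w * y + rng.normal(0.0, 1.0, 172)          # correlated predictors
         for name, w in [("cohesion", 0.9), ("coherence", 0.7),
                         ("morphology", 0.5), ("syntax", 0.5)]}
for name, r2 in forward_selection(preds, y):
    print(f"{name}: cumulative R^2 = {r2:.2%}")
```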

Reliabilities

Interrater reliabilities were calculated for each subscale using the Spearman-Brown Prophecy Formula with Fisher Z transformation. The resulting coefficients were .92 for morphology, .81 for syntax, .77 for cohesion, .78 for coherence, and .88 for overall quality. These estimates indicate the degree to which the three raters agreed in their assessment of these components in individual essays. The general trend here was one of decreasing agreement as the scope of evaluation enlarged and the nature of the features changed from accuracy of form to appropriateness of meaning.

As one of the purposes of the study was to ascertain the relative contribution of component features to the raters' perception of writing quality, reliability coefficients with their corresponding overall quality ratings were also calculated for each of the four areas of evaluation. The resulting coefficients indicate the degree of consistency across the three raters in relating the quality of each component area to that of the entire essay. These reliability coefficients are .86 for morphology, .85 for syntax, .90 for cohesion, and .85 for coherence.

Reliability was also calculated for each of the 35 constituent language features. The 10 items that have the highest coefficients follow:

1. cohesion (f): .83
2. cohesion (b): .80
3. cohesion (e): .79
4. coherence (j): .79
5. cohesion (g): .78
6. morphology (a): .78
7. coherence (h): .78
8. cohesion (i): .76
9. coherence (c): .75
10. morphology (c): .75
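One common reading of the reliability procedure named above, assumed in the sketch below, is to average the pairwise interrater correlations through Fisher's Z transformation and then step the average up to the full three-rater panel with the Spearman-Brown prophecy formula. The ratings here are synthetic.

```python
# Interrater reliability via Fisher-Z-averaged pairwise correlations stepped
# up with the Spearman-Brown prophecy formula. Ratings are synthetic.
import numpy as np
from itertools import combinations

def spearman_brown_reliability(ratings):
    """ratings: array of shape (n_raters, n_essays) for one subscale."""
    k = ratings.shape[0]
    zs = [np.arctanh(np.corrcoef(ratings[i], ratings[j])[0, 1])
          for i, j in combinations(range(k), 2)]
    r_bar = np.tanh(np.mean(zs))                  # average pairwise r
    return k * r_bar / (1 + (k - 1) * r_bar)      # prophecy formula

rng = np.random.default_rng(3)
true_quality = rng.normal(3.0, 1.0, 172)
raters = np.stack([true_quality + rng.normal(0.0, 0.6, 172) for _ in range(3)])
print(f"estimated 3-rater reliability: {spearman_brown_reliability(raters):.2f}")
```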

Given the findings from the regression and correlational analyses, these results are hardly surprising, but they do provide more detailed information on the rating practice, which was clearly dominated by discourse features.

Construct Validity

In order to make meaningful claims about the relative contribution of individual parts to the whole, one must be sure that the proposed components are sufficiently independent among themselves. Therefore, to provide information about the construct validity of the four traits investigated in the study, I performed a multitrait-multimethod (MTMM) analysis (Campbell & Fiske, 1959) on the averaged scores from each of the four traits while treating the three raters as different methods. This matrix appears in Table 4.

The multitrait-multimethod validation procedure calls for (a) inspection of the convergent validity coefficients, or monotrait-heteromethod correlation coefficients (designated as C), which should be significantly higher than zero for the trait to be claimed to manifest convergent validity; (b) comparisons of the convergent validity coefficients with the heterotrait-monomethod coefficients (designated as M); and (c) comparisons of the convergent validity coefficients with the heterotrait-heteromethod coefficients (designated as H). For a trait to be claimed to exhibit either heterotrait-monomethod or heterotrait-heteromethod discriminant validity, its convergent validity coefficient must be greater in either group of comparisons (Campbell & Fiske, 1959; Henning, 1987; Lord & Novick, 1968; Magnussen, 1967).

Application of this validation procedure to the data in Table 4 revealed that all monotrait-heteromethod correlation coefficients for the four traits were significantly above zero, which indicates convergent validity. The results of their comparisons with discriminant validity coefficients are summarized in Table 5. Because there are 3 convergent validity coefficients, 6 heterotrait-monomethod correlation coefficients, and 12 heterotrait-heteromethod correlation coefficients for each trait, these coefficients were combined to provide an overall listing of the comparisons for individual traits. The results are seen in Table 6.


TABLE 4
A MTMM Matrix of Scores for Four Traits by Three Raters

         R1-M  R1-S  R1-Cs R1-Cr R2-M  R2-S  R2-Cs R2-Cr R3-M  R3-S  R3-Cs
R1-S     .77
R1-Cs    .53   .53
R1-Cr    .33   .37   .52
R2-M     .77   .66   .50   .36
R2-S     .63   .65   .43   .35   .75
R2-Cs    .48   .48   .49   .44   .66   .61
R2-Cr    .36   .30   .39   .53   .41   .41   .62
R3-M     .80   .62   .39   .25   .68   .58   .41   .26
R3-S     .51   .51   .37   .26   .48   .57   .35   .22   .50
R3-Cs    .44   .43   .52   .51   .42   .47   .54   .50   .35   .38
R3-Cr    .19   .14   .39   .56   .08   .00    -    .55   .41   .23   .20

Note. R = rater; M = morphology; S = syntax; Cs = cohesion; Cr = coherence. In the original layout, heterotrait-monomethod coefficients are set off in solid triangles and heterotrait-heteromethod coefficients in broken triangles; a dash marks an entry that is illegible in the source.

It would appear that morphology, as measured, is the most valid of the hypothesized traits, because it exceeded all but one discriminant coefficient in its comparisons. It is followed by (a) coherence, which succeeded in all but two comparisons; (b) syntax; and finally, (c) cohesion, which still met the criteria over three fourths of the time.

TABLE 5
Summary of MTMM Comparisons of Convergent and Discriminant Validity Coefficients

       r1,5   r2,6   r3,7   r4,8   r1,9   r2,10  r3,11  r4,12  r5,9   r6,10  r7,11  r8,12
C>M    6/6    4/6    0/6    5/6    6/6    5/6    3/6    6/6    5/6    4/6    3/6    5/6
C>H    12/12  11/12  11/12  12/12  12/12  10/12  12/12  12/12  12/12  11/12  12/12  12/12

Note. r = rater; C = monotrait-heteromethod correlation coefficients; M = heterotrait-monomethod correlation coefficients; H = heterotrait-heteromethod correlation coefficients.


TABLE 6
Combined MTMM Comparisons for Four Traits

                                        Morphology   Syntax   Cohesion   Coherence
Discriminant Validities
  Heterotrait-Monomethod (C>M)            17/18      13/18      6/18       16/18
  Heterotrait-Heteromethod (C>H)          36/36      32/36     35/36       36/36

These ratios, though not perfect, are encouraging signs that the delineations sketched for the constructs in the rating instrument may be valid in terms of defining and distinguishing these psychological realities.
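The tallies in Tables 5 and 6 can be reproduced mechanically from a correlation matrix laid out as in Table 4. The sketch below is schematic only: it applies the row-and-column comparison rule associated with Campbell and Fiske to the heteromethod block, and the exact composition (and hence size) of the comparison sets used in the original analysis may differ.

```python
# Schematic Campbell-Fiske counting: how often each convergent coefficient
# exceeds its heterotrait-monomethod (M) and heterotrait-heteromethod (H)
# rivals. Illustrative only; comparison-set definitions are assumptions.
import numpy as np

TRAITS, RATERS = 4, 3

def mtmm_counts(R):
    """R: symmetric 12x12 matrix ordered rater-major (R1-M, R1-S, R1-Cs, ...)."""
    results = {}
    for t in range(TRAITS):
        for a in range(RATERS):
            for b in range(a + 1, RATERS):
                c = R[a * TRAITS + t, b * TRAITS + t]       # convergent value
                mono = [R[m * TRAITS + t, m * TRAITS + u]   # same rater, other trait
                        for m in (a, b) for u in range(TRAITS) if u != t]
                hetero = [R[a * TRAITS + t, b * TRAITS + u]
                          for u in range(TRAITS) if u != t]
                hetero += [R[a * TRAITS + u, b * TRAITS + t]
                           for u in range(TRAITS) if u != t]
                results[(t, a, b)] = (sum(v < c for v in mono), len(mono),
                                      sum(v < c for v in hetero), len(hetero))
    return results

rng = np.random.default_rng(4)
A = rng.uniform(0.1, 0.8, (12, 12))
R = (A + A.T) / 2
np.fill_diagonal(R, 1.0)
(t, a, b), (cm, nm, ch, nh) = next(iter(mtmm_counts(R).items()))
print(f"trait {t}, raters {a + 1} & {b + 1}: C>M {cm}/{nm}, C>H {ch}/{nh}")
```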

INTERPRETATION OF RESULTS

The fact that scores from practically each and every item in the rating scale correlated significantly with the corresponding overall quality ratings for all essays provided some evidence of content validity for the rating scale's measures. However, there may be room for suspicion that such high correlations are but an artifact of the rating procedure. That is, having been directed to evaluate a writing sample using a number of detailed criteria, the raters might have felt obliged to take all of them into consideration in making their final, overall judgment. But this is a question that could not be answered by the present study in light of its research design. Researchers who are interested in the method factor in composition rating may want to manipulate the rating procedure in various ways, such as by reversing the order of the analytical and the holistic evaluations or by introducing a time lapse between the evaluations, in order to resolve the issue.

An examination of the results of the multiple regression analyses suggests that (a) discoursal features, especially cohesion, appeared to exert the greatest influence on the raters' judgment; and (b) the three raters differed in their approach to evaluation. Whereas rater 1 gave nearly equal consideration to all four areas, she also seemed to include fewer of her own criteria. The contrary can be said about rater 3, who based her decisions primarily on cohesion features and probably judged the essays on other aspects as well. Just what these additional factors are cannot be known from the design of the present study. Rater 2 appeared to be somewhere in between in these tendencies. Once again, the averaged scores provided a sharpened general picture of evaluation and tangibly improved the predictive power of the analytic rating scales.

In the analysis of reliability, the list of 10 features that relate most reliably to perceptions of overall quality is an interesting one. The most logical interpretation of this list is that these are the features that raters pay attention to most consistently in grading L2 samples at this proficiency level. But before rushing to generalize this finding to all situations, the reader should be reminded of its potential weaknesses and limitations. First, numerous studies have documented rater variability in direct assessment of writing samples. In addition to the fact that different raters apply different criteria in judging writing quality due to their personal backgrounds or experiences (Brindley, 1991; Connor-Linton, 1995a, 1995b; Davies, 1983; Griffin, 1989; Hamp-Lyons, 1991; Henry, 1996; Huot, 1990; Kobayashi & Rinnert, 1996; Sweedler-Brown, 1993; Vaughan, 1991), the rating process is also a complex mental activity, often involving hierarchical considerations of combinations of characteristics (Cumming, 1990; Hamp-Lyons, 1991; Homburg, 1984). Results from the multiple regression analyses reported earlier in this study are merely symptomatic of such a general tendency toward variability. Second, this study involved only three raters. Although there exists the possibility that the detailed approach adopted by the rating instrument in this study exerted substantial influence in calibrating the raters' agendas, further research is needed to verify this point in order to make generalizations about the outcome of applying this rating scale to the general rater population.

In addition to interpreting the above listed features as the ones to which raters pay most attention, one might be tempted to infer that the features identified in the list are the most salient developmental features that distinguish good and bad writings at this L2 proficiency level. This inference is plausible not only because of the large number of writing samples included but also because of previous studies, involving detailed text analysis or comparison with native speakers, that have identified connectives (Doushaq, 1986; Hamdan, 1988; Shakir, 1991), synonyms, and collocations (Connor, 1984) as problem areas in L2 writing. These weaknesses correspond to items (e), (f), and (b) in cohesion, which happened to be the three most reliable measures in relation to perceptions of overall quality in the present study. However, an examination of the difference between the mean scores of the 60 essays judged best in overall quality and those of the 60 judged worst revealed that these language/textual features are not always the areas in which raters perceived the largest discrepancy between the good and the bad essays. On the one hand, therefore, the interpretation that the features identified in the list are the most salient developmental features that distinguish good and bad writings must be taken with caution. On the other hand, the largest differences in mean scores between the strong and the weak writings were found in the two morphological features: articles (a) and verb conjugation (c). This finding suggests that these may indeed be important developmental features, but that they were not given as much weight as were cohesion features by the raters in evaluating writing quality. Given the considerations above, it seems reasonable to assume that the holistic rating process is at least as complex as the interplay between the objective (the strengths and weaknesses in the essay) and the subjective (the rater's criteria as to what constitutes good writing).
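The extreme-groups comparison mentioned above reduces to a simple contrast of feature means between the 60 essays rated best overall and the 60 rated worst. A sketch with invented scores and feature names follows.

```python
# Sketch of the extreme-groups contrast: mean feature scores of the 60 essays
# rated best overall minus those of the 60 rated worst. Data are synthetic.
import numpy as np

rng = np.random.default_rng(5)
n = 172
features = ["morphology_a", "morphology_c", "cohesion_e", "cohesion_f"]
overall = rng.normal(3.0, 1.0, n)
scores = {f: 0.6 * overall + rng.normal(0.0, 1.0, n) for f in features}

order = np.argsort(overall)
worst, best = order[:60], order[-60:]
for f in features:
    diff = scores[f][best].mean() - scores[f][worst].mean()
    print(f"{f}: mean difference (best - worst) = {diff:.2f}")
```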

DISCUSSION

The strong showing of cohesion and, to a lesser extent, coherence in correlation to overall quality in the present study may have confirmed Rifkin and Roberts's (1995) speculation that the discoursal aspects of written essays take precedence over grammatical accuracy in native speakers' judgment of writing quality. This finding is also congruent with previous research that identified a few similar cohesion conditions as salient weaknesses in L2 writing.

All correlational statistics from the data analysis, including the high interrater reliability coefficients, attest, if not to the construct validity (Brindley, 1991; Cumming, 1990), at least to the well-founded nature of the contents of the rating scale. At the same time, it is interesting to note that although the three raters appeared to disagree the most in terms of the cohesion features in student writing (r = .77), they also relied most heavily on them to arrive at their decisions about the overall quality of individual essays (r = .90). As reported above, cohesion features have generally been found to be the best predictors of writing quality, occupying five spots on the top-ten list of these predictors. A close examination of the discoursal features making the list indicates that they may well be very characteristic of the writing at this proficiency level: They all seem to point to the observation that learners at lower proficiency levels tend to be bound to the transcription of meaning at the sentence level, and that they spare little attention for logical transition or connection between sentences. As a result, they may be successful in collecting a group of ideas that are all pertinent to the topic of the essay, but they may not be able to make their overall intention clear to the reader, or be able to provide a smooth transition between groups of ideas/propositions within the essay. However, as I pointed out earlier, these features are merely the ones that the raters perceived as most pertinent to writing quality and are not necessarily indicative of any developmental pattern of learners' proficiency.

The validation attempt at the four-trait distinction has not been entirely successful. It is clear that it necessarily becomes more difficult to achieve accuracy in delineation (or consensus among raters, for that matter) as the subject domain expands from relatively well-defined linguistic items to extended stretches of discourse, and the nature of evaluation switches from accuracy of form to appropriateness of meaning. But the process of refining definitions must go on. Whereas the present study has produced substantial information on the identity of the construct of discourse, much work remains to be done. Future research is needed to refine its delineations and to apply the revised instrument in actual rating instances, that is, to carry on the "chicken-and-egg," circular practice of theorizing and testing in construct validation. The resulting information will be crucial for the construction of valid (criterion-referenced) rating scales that distribute proper weight to appropriate features and, in this way, assess learners' writing ability in a more efficient manner.

CONCLUSION

Setting out to address perceived inadequacies in the practices of error analysis and writing assessment, the present study has applied theories in discourse analysis, delineated textual dimensions in analytical terms, and conducted evaluation within a relatively complete discourse universe. The rating scale has proven to be fairly reliable and to have good content validity, even though refinement is still needed to achieve perfect construct validity.


Detailed delineation of traits and analytical scoring in the study (a) have helped provide information about variables involved in the rating process and profiles of student writing at the beginning and intermediate proficiency levels and (b) will be useful in producing positive backwash effects on instruction and learning (Connor-Linton, 1995b). Given the preliminary nature of the present study, generalization of its findings to other assessment conditions must be made with caution. Its limitations pertain to the following four areas:

1. Theoretical constructs. This study has limited its scope to what the author perceived to be of most obvious pedagogical value, namely grammatical and discoursal features. Therefore, it makes no claim about other areas of competence that may also be contributing factors to perceptions of writing quality.
2. Rater population. A larger pool of evaluators would increase the stability and validity of findings and make more generalizations possible. Also, because aspects of raters' background, such as education or ethnicity, may be factors in judging writing quality, there is a need to control for this variable.
3. Generalizability of results. Because this study involved only expository essays written by students in the early stage of L2 learning, some of the features identified in the rating scale may be insufficient or inappropriate for assessing a different discourse genre or for subjects at higher proficiency levels, or both.
4. Practical value of instrument. As stated in the beginning section of this article, this study was primarily concerned with the theoretical justification and validity of the rating practice. As a result, the rating scale produced in this study may not be directly applicable to large-scale assessments. More likely, it serves as a foundation on which practical instruments can be constructed for actual use.

As the present study is the first of its kind, replication studies are needed to bolster its claims. In the wake of the findings and limitations, however, future research should address the following four areas:

1. The proposed rating scale needs refinement and empirical verification for both theoretical and practical purposes. In addition to modifying the trait delineations provided in the instrument, future research efforts should direct their attention toward discovering other nonredundant elements, such as sociolinguistic features that enter into play in assessing writing, in order to arrive at a complete and coherent picture of both the object and the process of assessment.
2. The relationship between rater characteristics and their rating patterns should be investigated by including and controlling for factors such as rater background and assessment contexts.
3. The rating scale should be adapted to reflect the discourse genre and the proficiency level of the subjects in evaluative instances. Future studies that include these variables will provide valuable insight into the dynamic relationships among these various features in discourse and in the developmental pattern of L2 proficiency.
4. Practical grading instruments that reflect the findings in this study on the hierarchy of importance of individual features need to be constructed for large-scale assessment.

Research in L2 writing assessment has thus far focused on fairly advanced learners of English as a Second Language from diverse cultural/language backgrounds. As such, its findings may not be applicable to foreign language learners enrolled in beginning and intermediate college courses (Henry, 1996). Only recently have two studies aimed specifically at this context: Henry (1996) in Russian, and Valdes, Haro, and Echevarriarza (1992) in Spanish. Although their findings are not directly comparable to those of the present study, due to differences in research purposes, research designs, and features investigated, the point in contention, namely that of L1-to-L2 transfer of discourse skills, is very much relevant to the current research direction in writing assessment. With the present study, I hope to have contributed to this enterprise with theoretical delineations of textual features and with empirical findings on the relative importance of these features in the assessment of L2 discourse.

REFERENCES
American Council on the Teaching of Foreign Languidelines.Hasguages. (1986). ACTFLproficiency tings-on-Hudson, NY:ACTFL. as an index Anderson, P. L. (1980, November). Cohesion and oralcomposition Paper ofESLLearners. for written presented at Midwest Modern Language Association, Linguistics I. Astika, G. G. (1993). Analytical assessments of foreign students' writing. RELCJournal: A Journal of Language Teachingand Researchin SoutheastAsia, 24, 61-72. in lanBachman, L. (1990). Fundamentalconsiderations guage testing.Oxford: Oxford University Press. Bachman, L., & Palmer, A. (1982). The construct valida-

This content downloaded from 158.170.6.222 on Sat, 29 Mar 2014 17:08:38 PM All use subject to JSTOR Terms and Conditions

Steve Y. Chiang tion of some components of communicative profi16, 449-465. ciency. TESOLQuarterly, Bamberg, B. (1983). What makes a text coherent? College and Communication, 34, 417-429. Composition Breland, H., &Jones, R.J. (1984). Perceptions of writing skills. Written Communication, 1, 101-109. Brindley, G. (1991). Defining language ability: The criteria for criteria. In S. Anivan (Ed.), Currentdevelopments in language testing (pp.139-164). Singapore: SEAMEO. Brown,J. D., & Bailey, K. M. (1984). A categorical instrument for scoring second language writing skills. LanguageLearning,34, 21-41. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discrimant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105. Canale, M. (1983). From communicative competence to communicative performance. In J. Richards & R. Schmidt (Eds.), Languageand Communication (pp. 2-27). New York:Longman. Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. AppliedLinguistics,1, 1-47. Carrell, P. (1982). Cohesion is not coherence. TESOL 16, 479-488. Quarterly, Chomsky, N. (1980). Rules and representations. Oxford, UK: Blackwell. Connor, U. (1984). A study of cohesion and coherence in English as a second language students' writing. Papersin Linguistics: InternationalJournal of Human Communication, 17, 301-316. Connor-Linton, J. (1995a). Crosscultural comparison of writing standards: American ESL and Japanese EFL. World Englishes,14, 99-115. Connor-Linton,J. (1995b). Looking behind the curtain: What do L2 composition ratings really mean? TESOLQuarterly, 29, 762-765. Crowhurst, M. (1980). Syntactic complexity and teachers' ratings of narrations and arguments. Research in the Teaching of English, 14, 223-231. Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31-51. Davies, E. (1983). Error evaluation: The importance of viewpoint. English Language Teaching Journal, 37, 304-311. de Beaugrande, R. (1984). Text, discourse,and process: Towarda multidisciplinary scienceof texts.Norwood, NJ:Ablex. de Beaugrande, R., & Dressler, W. (1981). Introduction to textlinguistics.London: Longman. Delisle, H. H. (1982). Native speaker judgment and the evaluation of error in German. Modern Language Journal, 66, 39-48. Doushaq, M. (1986). An investigation into the stylistic errors of Arab students learning English for academic purposes. English for Specific Purposes, 3, 27-39. Enkvist, N. E. (1990). Seven problems in the study of coherence and interpretability. In U. Connor & A. M. Johns (Eds.), Coherence in writing, research

231
and pedagogicalperspectives(pp. 11-28). Alexandria, VA:TESOL. Ensz, K. Y. (1982). French attitudes toward typical speech errors of American speakers of French. ModernLanguage Journal, 66, 133-139. Evola,J., Mamer, E., & Lentz, B. (1980). Discrete point versus global scoring for cohesive devices. In J. W. in Language Oller, Jr., & K. Perkins (Eds.), Research Testing (pp. 171-176). Rowley, MA: Newbury House. Faigley, L., Cherry, R. D.,Jolliffe, D. A., & Skinner, A. M. and processes (1985). Assessingwriters'knowledge of Norwood, NJ: Ablex. composing. Farghal, M. (1992). Naturalness and the notion of cohesion in EFL writing classes. InternationalReviewof AppliedLinguistics,30, 45-51. Fitzgerald,J., & Spiegel, D. L. (1986). Textual cohesion and coherence in children's writing. Research in the Teaching of English, 20, 263-280. Freedman, S. W. (1977). Influences on the evaluators of student writing. Dissertation AbstractsInternational, 37, 5306A, (University Microfilms No. AAC 7802159). Freedman, S. W. (1979a). How characteristics of student essays influence teachers' evaluation. Journal of EducationalPsycholog, 71, 328-338. Freedman, S. W. (1979b). Why do teachers give the and Communicagrades they do? College Composition tion, 30, 161-164. Grabe, W. (1985). Written discourse analysis. Annual Reviewof AppliedLinguistics,5, 101-123. Grabe, W., & Kaplan, R. (1996). Theoryand practiceof London: writing: An applied linguisticsperspective. Longman. Griffin, P. E. (1989, September). Latent trait estimatesof raterreliabilityin IELTS.Paper presented at Fourteenth Annual Congress of the Applied Linguistics Association of Australia, Melbourne. in English.New Haliday, R., & Hasan, D. (1976). Cohesion York:Longman. and cohesion in textswritten Hamdan, A. (1988). Coherence in English byJordanian universitystudents.Unpublished doctoral dissertation, University of Manchester. Hamp-Lyons, L. (Ed.). (1991). Assessingsecondlanguage Norwood, NJ:Ablex. writingin academiccontexts. Harley, B., Cummins, J., Swain, M., & Allen, P. The nature of language proficiency. In B. Harley, J. Cummins, M. Swain, & P. Allen (Eds.), Thedevelopmentof secondlanguageproficiency (pp. 7-25). Cambridge: Cambridge University Press. Henning, G. H. (1987). A guide to languagetesting: Development, evaluation, research.New York: Newbury House/Harper and Row. Henry, K. (1996). EarlyL2 writing development: A study of autobiographical essays by university-level students of Russian. Modern LanguageJournal, 80, 309-326. Homburg, T.J. (1984). Holistic evaluation of ESL compositions: Can it be validated objectively? TESOL 18, 87-106. Quarterly,


Huot, B. (1990). The literature of direct writing assessment: Major concerns and prevailing trends. Review of Educational Research, 60, 237-263.
Jacobs, H. L., Zinkgraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Jafarpur, A. (1991). Cohesiveness as a basis for evaluating compositions. System, 19, 456-465.
Kern, R., & Schultz, J. (1992). The effects of composition instruction on intermediate level French students' writing performance: Some preliminary findings. Modern Language Journal, 76, 1-13.
Khalil, A. M. (1985). Communicative error evaluation: Native speakers' evaluation and interpretation of written errors of Arab EFL learners. TESOL Quarterly, 19, 335-351.
Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL context: Cultural rhetorical pattern and readers' background. Language Learning, 46, 397-437.
Kobayashi, T. (1992). Native and nonnative reactions to ESL compositions. TESOL Quarterly, 26, 81-112.
Linnarud, M. (1978). Cohesion and communication in the target language. Interlanguage Studies Bulletin, 3, 23-34.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. New York: Addison-Wesley.
Magnan, S. S. (1985). Teaching and testing proficiency in writing: Skills to transcend the second-language classroom. In A. C. Omaggio (Ed.), Proficiency, curriculum, articulation: The ties that bind (pp. 109-136). Middlebury, VT: Northeast Conference.
Magnusson, D. (1967). Test theory. Reading, MA: Addison-Wesley.
McCulley, G. A. (1985). Writing quality, coherence, and cohesion. Research in the Teaching of English, 19, 269-282.
Meyer, B. J. F. (1975). The organization of prose and its effects on memory. Amsterdam: North-Holland.
Meyer, B. J. F. (1985). Prose analysis: Purposes, procedures, and problems. In B. K. Britton & J. B. Black (Eds.), Understanding expository text (pp. 11-64). Hillsdale, NJ: Erlbaum.
Mullen, K. A. (1980). Evaluating writing proficiency in ESL. In J. W. Oller & K. Perkins (Eds.), Research in language testing (pp. 160-170). Rowley, MA: Newbury House.
Neuner, J. L. (1987). Cohesive ties and chains in good and poor freshman essays. Research in the Teaching of English, 21, 92-105.
Nold, E. W., & Freedman, S. W. (1977). An analysis of readers' responses to essays. Research in the Teaching of English, 11, 164-174.
Piazza, L. G. (1980). French tolerance for grammatical errors made by Americans. Modern Language Journal, 64, 422-427.
Rifkin, B., & Roberts, F. (1995). Error gravity: A critical review of research design. Language Learning, 45, 511-537.
Santos, T. A. (1988). Professors' reactions to the academic writing of nonnative-speaking students. TESOL Quarterly, 22, 69-90.
Sasaki, M., & Hirose, K. (1996). Explanatory variables for EFL students' expository writing. Language Learning, 46, 137-174.
Schachter, J. (1990). Communicative competence revisited. In B. Harley, J. Cummins, M. Swain, & P. Allen (Eds.), The development of second language proficiency (pp. 39-49). Cambridge: Cambridge University Press.
Shakir, A. (1991). Coherence in EFL student-written texts: Two perspectives. Foreign Language Annals, 24, 399-411.
Shohamy, E. (1990). Discourse analysis in language testing. Annual Review of Applied Linguistics, 11, 115-131.
Spiegel, D. L., & Fitzgerald, J. (1990). Textual cohesion and coherence in children's writing revisited. Research in the Teaching of English, 24, 48-66.
Sweedler-Brown, C. O. (1993). ESL essay evaluation: The influences of sentence-level and rhetorical features. Journal of Second Language Writing, 2, 3-17.
Tardif, C., & d'Anglejan, A. (1981). Les erreurs en français langue seconde et leurs effets sur la communication orale [Errors in French as a second language and their effects on oral communication]. Canadian Modern Language Review, 37, 706-723.
Tierney, R. J., & Mosenthal, J. H. (1981). Cohesion and textual coherence. Research in the Teaching of English, 17, 215-230.
Tomiyama, M. (1980). Grammatical errors and communication breakdown. TESOL Quarterly, 14, 71-79.
Valdés, G., Haro, P., & Echevarriarza, M. P. (1992). The development of writing abilities in a foreign language: Contributions toward a general theory of L2 writing. Modern Language Journal, 76, 333-352.
van Dijk, T. A. (1977). Text and context. New York: Longman.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-125). Norwood, NJ: Ablex.
Verhoeven, L., & Vermeer, A. (1992). Modeling communicative second language competence. In L. Verhoeven & J. H. A. L. de Jong (Eds.), The construct of language proficiency (pp. 163-173). Amsterdam/Philadelphia: John Benjamins.
Witte, S. P., Daly, J. A., & Cherry, R. D. (1986). Syntactic complexity and writing quality. In D. A. McQuade (Ed.), The territory of language (pp. 150-164). Carbondale, IL: Southern Illinois University Press.
Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing quality. College Composition and Communication, 32, 189-204.

