
The 1st National Conference on Emerging Horizons in ELT and Literature (EHELTL, 2013)

Reliability, Validity, and Practicality: Fundamental Cornerstones of Language Testing


Farhad Nouriade 1

1 MA, Department of English Language and Literature, Shahid Rajaee Teacher Training University, Tehran, Iran

Abstract
Language testing at any level is a highly complex undertaking that must be based on theory as well as practice. The results of assessments are used for one or more purposes, so they have an effect on those who are assessed and on those who use the results. It therefore matters that the results are dependable and that the users of the results can have confidence in them. An understanding of the basic principles of testing is thus essential. This study aims mainly at clarifying the notions of validity, reliability and practicality on the one hand, and their relationships as three basic cornerstones of the testing procedure on the other.

Key Words: Testing, Validity, Reliability, Practicality

Email address:

Farhadnourizade@yahoo.com

Tel.: 98-938-037-9681

1. Introduction
No measuring instrument is perfect. If we use a thermometer to measure the temperature of a liquid, we get small variations in the readings even if the temperature does not change. We generally assume that these are random fluctuations around the true temperature, and we take the average of the readings as the temperature of the liquid. But there are more subtle problems. If we place the thermometer in a long narrow tube of the liquid, then the thermometer itself might affect the reading: if the thermometer is warmer than the liquid, it will heat the liquid up. In contrast to the problems of reliability mentioned above, this effect is not random but a systematic bias, and so averaging across a lot of readings does not help. Sometimes we are not really sure what we are measuring. For example, bathroom scales sometimes change their reading according to how far apart the feet are placed; the scales are not just measuring weight but also the way the weight is distributed.

Extending the analogy to the test construction process, it should be noted at the outset that if the items, which are the building blocks of a test, meet the criteria introduced earlier, the whole test will most likely be acceptable. However, the assumption that good items will necessarily produce a good test does not always hold, and test developers should go one step further and determine the characteristics of the total test. In other words, simply putting good individual items together is not sufficient for a test to function satisfactorily (Farhady, Jafarpur, & Birjandi, 2010). To guarantee the quality of a test, other factors and guiding principles, including the administration process and the scoring procedures, must also be considered. In general, an efficient test is marked by certain characteristics that can be considered the cornerstones of testing: reliability, validity and practicality (Farhady et al., 2010). Because these concepts are absolutely essential in testing, they require fairly comprehensive elaboration. Reviewing the existing literature, this article is devoted not only to clarifying these notions but also to explaining the relationships between reliability, validity and practicality.

2. Validity

2.1. Definitions and Conceptualizations


The concept of validity has evolved over more than 60 years, and various definitions of it have been proposed. Validity in testing and assessment has traditionally been understood to mean the degree to which a test measures accurately what it is intended and supposed to measure (Hughes, 1989), or the determination of whether a given test, or any of its component parts, is appropriate as a measure of what it is purported to measure (Henning, 1987). This view of validity presupposes that when we write a test we have an intention to measure something, that the something is real, and that validity enquiry concerns finding out whether the test actually does measure what is intended. Although this definition is relatively common and straightforward, it oversimplifies the issue somewhat. A better definition, reflecting the most contemporary perspective, is that validity is the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test (AERA, APA, & NCME, 1999). Most recent views of validity, supporting this definition, focus their attention not on the instrument itself but on the interpretation and meaning of the scores derived from the instrument. Ary, Jacobs and Razavieh, 2002 (as cited in Oluwatayo, 2012) conceptualize validity as the extent to which theory and evidence support the proposed interpretation of test scores for an intended purpose. Relatedly, Whiston, 2005 (as cited in Oluwatayo, 2012) views validity as the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests.

In this regard, Bachman (1990) argues that validity is the most important quality of test use, concerning the extent to which meaningful inferences can be drawn from test scores. Taking a similar position, Cronbach (as cited in Crocker & Algina, 1986) notes that examining the validity of a test requires a validation process in which a test user presents evidence to support the inferences or decisions made on the basis of test scores. Criticizing traditional approaches and conceptualizations, Messick (1989) put forward a unified validity framework in which he considers validity as nothing less than an evaluative summary of both the evidence for, and the actual as well as potential consequences of, score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view integrates considerations of content, criteria and consequences into a single framework for empirically testing rational hypotheses about score meaning and utility. In this view, validity is not a property of a test or assessment but the degree to which we are justified in making an inference to a construct from a test score (for example, whether a score of 20 on a reading test indicates the ability to read first-year business studies texts), and whether any decisions we might make on the basis of the score are justifiable.

In order to improve assessment practices, Killen (2003) argues that a different understanding of validity needs to be invoked. Drawing on insights from the work of Messick (1989), he notes that the meaning of the term validity has evolved over time and that a more recent understanding concerns the extent to which justifiable inferences can be made on the basis of the evidence gathered. The interest, therefore, is not to validate a test but to validate the inferences that can be drawn from the learner's results on the test or assessment task. Viewing validity as inference places a greater responsibility on teachers. As Killen (2003: 6) writes: "No longer can a teacher claim to be using a valid assessment task simply because it is clearly linked to the curriculum content, or because someone else has used the test and decided that it is valid, or because it produces results similar to those obtained from other assessment tasks. Instead the teacher must question the validity of the inferences they are making as a result of having used the assessment task."

Based on these modern views, two important concepts can be attributed to validity. The first is the notion of a construct. Constructs are defined as labels used to describe an unobserved behavior, such as honesty. Such behaviors vary across individuals; although they may leave an impression, they cannot be directly measured. In the social sciences virtually everything is a construct. Linda Crocker and James Algina described constructs as the product of informed scientific imagination. Constructs represent something abstract, but they can be operationalized through the measurement process. Measurement can involve a paper-and-pencil test, a performance, or some other observed activity, which then produces variables that can be quantified, manipulated, and analyzed. The second important concept is inference. An inference is an informed leap from an observed or measured value to an estimate of some underlying variable. It is called an inferential leap because we cannot directly observe what is being measured, but we can observe its manifestations. Through observation of the results of the measurement process, an inference can be made about the underlying characteristic and its nature or status.

2.2. Types of Validity


In the early days of validity investigation, validity was broken down into three types that were typically seen as distinct, each related to the kind of evidence that would count towards demonstrating that a test was valid. Cronbach and Meehl, 1955 (as cited in Weir, 2005) described these as criterion-oriented validity (predictive validity and concurrent validity), content validity and construct validity. In fact, traditional ways of cutting and combining evidence of validity have led to three major categories of evidence: content-related, criterion-related, and construct-related. However, because content- and criterion-related evidence contributes to score meaning, they have come to be recognized as aspects of construct validity. In a sense, then, this leaves only one category, namely construct-related evidence. Messick set out to produce a unified validity framework in which different types of evidence contribute in their own way to our understanding of construct validity. Supporting Messick's unified view, Bachman (1990) asserted that none of these various categories by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores; he thus sees the complementarity of the different sources of evidence formerly considered as separable validities, and he came to be seen as representing a new orthodoxy in approaching construct validity as a superordinate category for test validities. The literature on psychometric validation is saturated with the different types of validity that are used in research. However, in educational research, four aspects of validity are of particular importance: face validity, content validity, construct validity and criterion-related validity.

2.2.1. Face Validity


Face validity is concerned with whether the test appears to test what its name implies (Dick & Hagerty, 1971, as cited in Sak, 2008). Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well constructed? Does it seem as though it will work reliably? Face validity is therefore determined impressionistically; for example, by asking students whether the exam was appropriate to their expectations, or by giving questionnaires to administrators or other users. Anastasi and Urbina (2007) argue that face validity is a desirable feature of tests, noting that if tests originally designed for children and developed in a classroom setting are extended to adult use, such tests are likely to meet with resistance and criticism because of their lack of face validity. In other words, if the content of a measuring instrument appears irrelevant, inappropriate, silly or childish, there is every likelihood that the results obtained will provide false information and lead to misleading decisions, whereas if it appears to do what it claims, it is more likely to be accepted by all concerned. Jafarpur (1987: 199) considers face validity to be subordinate to the other categories of validity; indeed, it is a qualitative measure, not a scientific one. In light of harsh economic realities, however, a test that lacks face validity may simply not be accepted by those who take it or use its results.

2.2.2. Content Validity


As the name suggests, content validity is concerned with whether or not the content of the test is sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure (Henning, 1987). Content validity is demonstrated by any attempt to show that the content of the test is a representative sample of the domain that is to be tested. To address this issue, testers or people interested in test validation may need to focus particularly on the organization of the different types of items that they have included on the test and the specifications for each of those item types (Brown, 2005; as cited in Sak, 2008). Although this validation process may take many forms, the goal is to show that the test is a representative sample of the content it claims to measure (Jafarpur, 1987). If the test actually requires the test-taker to perform all the skills that it claims to measure, it can be said to have content validity (Hughes, 2003). Bachman (1990) described content validity in terms of two crucial concepts: content relevance and content coverage. Content relevance refers to the extent to which the aspects of ability to be assessed are actually tested by the task, indicating the requirement to specify the ability domain and the test method facets. Content coverage concerns the extent to which the test tasks adequately represent performance in the target context, which may be achieved by randomly selecting representative samples. This second aspect of content validity is similar to content representativeness, which also concerns the extent to which the test accurately samples the behavioral domain of interest (Bachman, 2002).

2.2.3. Construct Validity


The next type of validity is construct validity, which is the most difficult type to explain, as it is regarded as a superordinate form of validity to which external and internal validity contribute (Alderson et al., 1995, p. 183). Similarly, Anastasi (as cited in Weir, 1990) states that content, criterion-related and construct validation do not correspond to distinct or logically coordinate categories; on the contrary, construct validity is a comprehensive concept which includes the other types. Construct validity is related to the question of whether the test meets certain theoretical requirements (Jafarpur, 1987): to what extent is the test's content based on the theoretical construct it claims to measure, and thus to what extent does a test score actually allow us to make an appropriate interpretation of an individual's ability? (Bachman & Palmer, 1996). Practically, construct validity comprises two elements, namely convergent validity and discriminant validity. Convergent validity requires that the scores derived from the measuring instrument correlate with the scores derived from measures of similar variables (Campbell & Fiske, 1959). Discriminant validity suggests that using similar methods to research different constructs should yield relatively low inter-correlations; that is, the construct in question is different from other potentially similar constructs. Such discriminant validity can also be demonstrated by factor analysis, which clusters together similar items and separates them from others (Cohen, Manion & Morrison, 2008).

2.2.4. Criterion-Related Validity


Criterion-related validity involves demonstrating validity by showing that the scores on the test being validated correlate highly with some other, well-respected measure of the same construct (Brown, 2005; as cited in Sak, 2008). A criterion is a standard of judgment, an established standard against which another measure is compared (Kaplan & Saccuzzo, 2005). Criterion-related validity therefore rests on correlations of the measure with another criterion measure which is accepted as valid (Bowling, 2009); in other words, it holds where a high correlation coefficient exists between the scores on a measuring instrument and the scores on another, existing instrument which is accepted as valid. As Weir (1990) states, this is a predominantly quantitative and a posteriori concept and is divided into two types: concurrent and predictive validity (p. 27). In predictive validity there is a time lag between when the instrument is administered and when the criterion information is gathered; in concurrent validity, the measure and the criterion measure are taken at the same time, because concurrent measures are usually designed to provide diagnostic information that can help guide the educational development of learners (Kaplan & Saccuzzo, 2007, as cited in Oluwatayo, 2012). Predictive validity is achieved if the data acquired at the first round of research correlate highly with data acquired at a future date. For example, if the results of an examination taken by a student on entry into university (say, Unified Tertiary Matriculation Examination scores) correlate highly with the examination results at his or her graduation, then it might be concluded that the first examination demonstrated strong predictive validity (Oluwatayo, 2004). Concurrent validity compares a new instrument with more established instruments that supposedly measure the same things; it is established when the test and the criterion are administered at about the same time (Hughes, 1990, p. 27). When concurrent validity is investigated, one administers a reputable test of the same ability to the same test takers concurrently, or within a few days of the administration of the test to be validated. The scores of the two tests are then correlated using some formula for the correlation coefficient, and the resultant correlation is reported as a concurrent validation. Predictive validity differs from concurrent validity in that, instead of collecting the external measures at the same time as the administration of the test, the external measures are gathered only some time after the test has been given (Alderson et al., 1995, p. 180).
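The correlation step described above can be sketched in a few lines. This is only an illustrative sketch with invented scores and names; it is not a procedure or data taken from the article.

```python
# A small sketch of concurrent validation: correlating scores on the test
# being validated with scores on an established criterion measure given at
# about the same time. All names and numbers are hypothetical.
import numpy as np

new_test  = [62, 55, 70, 48, 81, 66, 59, 74]   # test being validated
criterion = [60, 50, 72, 45, 85, 70, 55, 78]   # well-respected measure of the same construct

r = np.corrcoef(new_test, criterion)[0, 1]     # Pearson correlation coefficient
print(f"concurrent validity coefficient r = {r:.2f}")   # a high r supports validity
```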

3. Reliability

3.1. Definition


The concept of reliability is defined as the consistency of measurement (Bachman & Palmer, 1996). In other words, a test is reliable to the extent that whatever it measures, it measures it consistently: a measure is considered reliable if a person's score on the same test given twice is similar. It is important to remember that reliability is not measured; it is estimated. Jones, 2001 (as cited in Weir, 2005) argued that reliability is a word whose everyday meaning adds powerful positive connotations to its technical meaning in testing. Reliability is a highly desirable quality in a friend, a car or a railway system. Reliability in testing also denotes dependability, in the sense that a reliable test can be depended on to produce very similar results in repeated uses. Reliability is defined as the conformity of measurement results derived from measurement procedures that are identical or as similar as possible (Churchill, 1983); it is the extent to which the measurements are repeatable and consistent (Torabi, 1994). Put another way, reliability involves the consistency, or reproducibility, of test scores: the degree to which one can expect relatively constant deviation scores for individuals across testing situations on the same, or parallel, testing instruments. This property is not a static attribute of the test. Rather, reliability estimates change with different populations (i.e. population samples) and as a function of the error involved. These facts underscore the importance of consistently reporting reliability estimates for each administration of an instrument, as test samples, or subject populations, are rarely the same across situations and research settings. More importantly, reliability estimates are a function of the test scores yielded by an instrument, not of the test itself (Thompson, 1999). Accordingly, reliability estimates should be considered in light of the various sources of measurement error involved in test administration (Crocker & Algina, 1986).

Highlighting the significance of reliability in testing, Wells and Wollack (2003) suggest that it is important to be concerned with a test's reliability for two reasons. First, reliability provides a measure of the extent to which an examinee's score reflects random measurement error. Measurement errors are caused by one of three factors: (a) examinee-specific factors such as motivation, concentration, fatigue, boredom, momentary lapses of memory, carelessness in marking answers, and luck in guessing; (b) test-specific factors such as the specific set of questions selected for a test, ambiguous or tricky items, and poor directions; and (c) scoring-specific factors such as nonuniform scoring guidelines, carelessness, and counting or computational errors. In an unreliable test, students' scores consist largely of measurement error, and such a test offers no advantage over randomly assigning scores to students. It is therefore desirable to use tests with good measures of reliability, so as to ensure that the test scores reflect more than just random error. The second reason to be concerned with reliability is that it is a precursor to test validity: if test scores cannot be assigned consistently, it is impossible to conclude that they accurately measure the domain of interest.
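In classical test theory terms (a standard formulation implied by the references to true scores and measurement error above, though not spelled out as a formula in this article), each observed score is treated as a true score plus random error, and reliability is the share of observed-score variance that is true-score variance:

X = T + E

Reliability = Var(T) / Var(X) = 1 - Var(E) / Var(X)

The smaller the proportion of observed variance that is due to error, the closer the reliability coefficient is to 1.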

3.2. Estimating Reliability


There are many methods of estimating reliability, each capturing a different dimension of reliability, and the application of these methods will vary depending on the testing situation and how the test results will be used. Firstly, there are instances in which multiple testing sessions of the same instrument occur. Test-retest reliability refers to the temporal stability of a test from one measurement session to another, in which test proctors administer the same test to a set of examinees more than once. The test is administered, a sufficient period of time elapses, and the test is then administered once again. Upon completion of the second administration, one can calculate the correlation coefficient between scores on the two measures, which yields information on how stable the test results (i.e. observed scores) are over time (Gregory, 1992). Rosenthal and Rosnow, 1991 (as cited in Drost, n.d.) state that, despite its appeal, the test-retest technique has several limitations. When the interval between the first and second test is too short, respondents might remember what was on the first test and their answers on the second test could be affected by memory. Alternatively, when the interval between the two tests is too long, maturation happens. Maturation refers to changes in the subjects or respondents (other than those associated with the independent variable) that occur over time and cause a change from the initial measurements to the later measurements (t and t + 1). During the time between the two tests, the respondents could have been exposed to things which changed their opinions, feelings or attitudes about the behavior under study.

A second type of reliability estimate is the alternate form method. This test-retest technique evaluates the consistency of alternate forms of a single test. The approach is particularly useful in the context of standardized testing procedures, where it is ideal to have multiple, equivalent forms of the same test. In this method, participants take one form of the test, a period of time elapses, and they then take a second form of the test. Once results are gathered from both sessions, the correlation coefficient between the two sets of scores is calculated, yielding a coefficient of equivalence (Crocker & Algina, 1986).

The most common scenario for classroom exams involves administering one test to all students at one time point, and the corresponding estimates are often referred to as internal consistency measures of reliability. In this case, a single score is used to indicate a student's level of understanding of a particular topic. However, the purpose of the exam is not simply to determine how many items students answered correctly on a particular test, but to measure how well they know the content area. To achieve this goal, the items on the test must be sampled in such a way as to be representative of the entire domain of interest. These methods are concerned with the consistency of scores within the test itself, or rather the consistency of scores among the items (ibid, 1986). Here, the key is to have a homogeneous set of items that reflects a unified underlying construct; high reliability estimates of this kind result in high inter-item correlations among the items or subscales. The most common way of assessing internal consistency is coefficient alpha. Though there are three different measures of coefficient alpha, the most widely used is Cronbach's coefficient alpha, which is actually an average of all the possible split-half reliability estimates of an instrument (Crocker & Algina, 1986; DeVellis, 1991). It is important to note that coefficient alpha is the lower bound of reliability. The two lesser-used techniques for estimating coefficient alpha are appropriate in limited circumstances: the Kuder-Richardson 20 (KR-20) is appropriate for dichotomously scored items, and Hoyt's method is useful in particular testing situations that involve computer programming (ibid, 1986).
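As a concrete illustration of the internal consistency estimates just described, the following minimal sketch computes coefficient alpha from an examinees-by-items score matrix. The response data are invented for illustration; with dichotomous (0/1) items such as these, the same formula reduces to KR-20.

```python
# A minimal sketch of Cronbach's alpha computed from an examinees-by-items
# score matrix (rows = examinees, columns = items). Invented data; for
# dichotomously scored items this is equivalent to KR-20.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array of shape (n_examinees, n_items)."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Six examinees answering five dichotomously scored items (hypothetical).
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
])
print(f"alpha (= KR-20 here) = {cronbach_alpha(responses):.2f}")
```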

Another type of internal consistency measure that requires only one test administration is split-half reliability. This technique literally takes an instrument, assesses the reliability of the first half, and compares this estimate to the reliability measure of the second half. The method demands equal item representation across the two halves of the instrument; clearly, comparing dissimilar samples of items will not yield an accurate reliability estimate. Researchers can ensure equal item representation through random item selection, by matching items from one half to the other, or by assigning items to halves based on an odd/even distribution (Gregory, 1992). Several aspects make the split-half approach more desirable than the test-retest and alternate forms methods. First, the effect of memory discussed previously does not operate with this approach. Also, a practical advantage is that split-half data are usually cheaper and more easily obtained than over-time data (Bollen, 1989, as cited in Drost, n.d.). A disadvantage of the split-half method is that the two halves must be parallel measures; the correlation between the two halves will vary slightly depending on how the items are divided. Nunnally (1978) suggests using the split-half method when measuring the variability of behaviours over short periods of time, when alternative forms are not available.
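A brief sketch of the odd/even split just mentioned follows, with the standard Spearman-Brown step-up used to estimate the reliability of the full-length test from the half-test correlation. The score matrix is invented purely for illustration.

```python
# A rough sketch of split-half reliability with an odd/even item split and
# the Spearman-Brown correction to full test length. Data are illustrative.
import numpy as np

def split_half_reliability(scores):
    """scores: (n_examinees, n_items) matrix; odd/even split by column index."""
    scores = np.asarray(scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]   # half-test correlation
    # Spearman-Brown step-up: estimated reliability of the full-length test.
    return (2 * r_half) / (1 + r_half)

responses = np.array([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
])
print(f"split-half reliability = {split_half_reliability(responses):.2f}")
```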

3.3. Factors that Affect Reliability


Many factors prevent measurements from being exactly repeatable or replicable, and these factors depend on the nature of the test and how the test is used (Nunnally, 1978). It is important to distinguish between errors of measurement that cause variation in performance within a test, and errors of instrumentation that are apparent only in variation in performance on different forms of a test. As explained before, reliability is an indicator of the extent to which a test produces consistent scores. This consistency can be over time (test-retest or parallel forms) or over equal parts of the same test (split-half or KR-21). Taking these points into account, two broad factors contribute to test reliability: the testees and the test itself. Fluctuations in the psychological and physiological conditions of the testees influence their performance; the influence may increase either their true scores or their error scores. In the former case reliability will be overestimated, and in the latter case it will be underestimated. In addition, the homogeneity of the testees' ability on the measured attribute is another effective factor in this regard. Put another way, because reliability is a function of the amount of variation in the testees' performance, homogeneous testees will not show a great deal of variation among their scores (Farhady et al., 2010). According to Crocker and Algina (1986), other factors can equally reduce the reliability coefficient. Group homogeneity is particularly influential when a norm-referenced test is applied to a homogeneous testing sample: the restriction of range in the testing group (i.e. low variability) translates into a smaller proportion of variance explained by the testing instrument, ultimately deflating the reliability coefficient. It is essential to bear in mind the intended use of an instrument when considering these circumstances and deciding how to use it.
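The restriction-of-range effect described above can be illustrated with a small simulation. The latent-ability model, error magnitudes and cut-off below are hypothetical choices made only for this sketch, not figures from the article.

```python
# A hypothetical illustration of restriction of range: the same measurement
# procedure yields a lower reliability estimate (here, a test-retest
# correlation) when the examinee group is homogeneous. Invented simulation.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
true_ability = rng.normal(50, 10, size=n)           # heterogeneous group
form_a = true_ability + rng.normal(0, 5, size=n)    # two parallel measurements
form_b = true_ability + rng.normal(0, 5, size=n)

def retest_r(mask):
    """Correlation between the two measurements for a subgroup."""
    return np.corrcoef(form_a[mask], form_b[mask])[0, 1]

everyone = np.ones(n, dtype=bool)
homogeneous = true_ability > np.percentile(true_ability, 75)  # top quarter only

print(f"r, full range       = {retest_r(everyone):.2f}")      # around .80
print(f"r, restricted range = {retest_r(homogeneous):.2f}")   # noticeably lower
```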

The test factors, on the other hand, can potentially influence the consistency of a test to a great extent. Fluctuations in these factors can originate from the structure of the test, the administration procedures and the scoring process (Farhady et al., 2010). In line with the points mentioned earlier, Crocker and Algina state that low internal consistency estimates are often the result of poorly written items or an excessively broad content area of measurement. Item quality has a large impact on reliability: poor items tend to reduce reliability while good items tend to increase it. How does one know if an item is of low or high quality? The answer lies primarily in the item's discrimination. Items that discriminate between students with different degrees of mastery of the course content are desirable and will improve reliability; an item is considered discriminating if the better students tend to answer it correctly while the poorer students tend to respond incorrectly (Wells & Wollack, 2003).

Imposed time constraints in a testing situation pose a different type of problem. Time limits ultimately affect a test taker's ability to fully answer questions or to complete an instrument. As a result, variance in test takers' ability to work at a specific rate becomes enmeshed in their score variance. Ultimately, test takers who work at similar rates have higher degrees of variance correspondence, artificially inflating the reliability of the testing instrument. Clearly, this becomes problematic when the construct the instrument intends to measure has nothing to do with speed (Gregory, 1992). The relationship between reliability and item difficulty raises a variability issue once again: if the testing instrument has little to no variability in the questions (i.e. the items are either all too difficult or all too easy), the reliability of scores will be affected. Aside from this, reliability estimates can also be artificially deflated if the test has too many difficult items, as they promote uneducated guesses (Crocker & Algina, 1986).

Lastly, test length also factors into the reliability estimate. Simply put, longer tests yield higher estimates of reliability. This phenomenon can best be explained through the Spearman-Brown prophecy equation, which indicates that as the number of items increases, the reliability estimate increases as well. However, one must weigh the reliability gains earned in such situations, as infinitely long tests are not desirable. For instance, if you have an 80-item instrument with an internal consistency reliability coefficient of .78 and the Spearman-Brown prophecy indicates that your reliability estimate will increase to .85 if you add 25 items, you may decide that the slightly lower reliability estimate is more desirable than an excessively long instrument (Mehrens & Lehman, 1991). In general, longer tests produce higher reliabilities. This may be seen in the old carpenter's adage, "measure twice, cut once." Intuitively, this also makes a great deal of sense. Most instructors would feel uncomfortable basing midterm grades on students' responses to a single multiple-choice item, but are perfectly comfortable basing midterm grades on a test of 50 multiple-choice items. This is because, for any given item, measurement error represents a large percentage of students' scores.
The percentage of measurement error decreases as test length increases. Even very low achieving students can answer a single item correctly, even through guessing; however, it is much less likely that low achieving students can correctly answer all items on a 20-item test (Wells & Wollack, 2003).
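The Spearman-Brown prophecy formula referred to above is usually written as follows (this is the standard formula; the worked figure is an added illustration, not a number from the article):

r_k = k * r / (1 + (k - 1) * r)

where r is the reliability of the original test and k is the factor by which its length is changed. For example, doubling a test whose reliability is .78 (k = 2) gives 2(.78) / (1 + .78) ≈ .88, a modest gain for twice the testing time.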


Other things being equal, it is good that a test should be easy and cheap to construct, administer, score and interpret (Hughes, 2003: 56).

4. Practicality

Practicality = Available resources / Required resources

Figure 2.2: Practicality (Bachman & Palmer, 1996: 36)

If practicality > 1, test development and use is practical; if practicality < 1, it is not. Before we even examine the content of a test, we must ask whether it is feasible: a good test must be practical. Whether a test is practical or not is a matter of examining available resources. It may be possible to develop a test which is highly valid and reliable for a particular situation, but if that test requires more resources than are available, it is doomed. The test may initially be used if test development does not exceed available resources, but cutting corners in administering or marking the test, in order to save time or money, will immediately lead to an unacceptable deterioration in reliability. Test practicality involves the nitty-gritty (H. D. Brown, 2001; as cited in Lawson, 2008) of manpower, materials and time: we can only make the best use of what is available to us. It refers to the facilities available to test developers regarding both the administration and the scoring procedures of a test. As far as administration is concerned, test developers should be attentive to the possibility of giving the test under reasonably acceptable conditions. Regarding scoring procedures, one should pay attention to ease of scoring as well as ease of interpreting the scores. For instance, even if composition tests are excellent indicators of language ability, scoring them consistently is time-consuming and demanding; test developers should therefore be very careful in selecting and administering a test. The test should be practical, i.e., it should be easy to administer, easy to score, and easy to interpret the scores of (Farhady et al., 2010). Indeed, in many respects, practicality is the whole raison d'être of testing. In elaborating the overall concept of practicality in the testing process, Owen, 2001 (as cited in Lawson, 2008) provides an interesting and succinct analogy: for the complete bath water test, just dump the baby into the bath and see what happens. For the complete test of an employee's English communication skills in a business context, a firm should send the employee out to work overseas and do business in English. If he or she is able to do the work required, the company may deem the employee's language abilities sufficient. If there are many costly errors due to communication problems, and the employee becomes chronically depressed as a consequence of being unable to do the job properly, make friends or even make a dental appointment where he or she is living, then the company may deem the employee's language skills insufficient. Clearly, there are reasons why companies do not do this (just as there are obvious reasons why we do not test bathwater temperature by tossing in our infants and waiting to see whether they boil). The cost to a company of sending staff into foreign-language business situations without examining their readiness beforehand would often be financially colossal, as well as morally questionable. A test is a more economical (and hopefully more reliable) substitute for a more complete procedure (ibid).

5. Relationship between validity, reliability and practicality


Addressing the question of the quality of a language test, Bachman and Palmer argue that a model of test usefulness is the essential basis for quality control throughout the entire test development process (1996: 17). This model may be expressed as follows:

Usefulness = Reliability + Construct validity + Authenticity + Interactiveness + Impact + Practicality

Figure 2.1: Usefulness (Bachman & Palmer, 1996: 18)

For Bachman and Palmer (1996), the most important consideration in designing and developing a language test is the use for which it is intended, and it is against this use that we can evaluate not only the test itself but all aspects of test development and use. On their model of test usefulness, any language test must be developed with a specific purpose, a particular group of test-takers and a specific language use in mind. Accordingly, Bachman and Palmer emphasize the need for test developers to avoid considering any of the above test qualities independently of the others. Trying to maximize all of these qualities inevitably leads to conflict, so the test developer must strive to attain an appropriate balance of the qualities for each specific test. Obviously, there is no perfect language test, much less a single test which is perfectly suitable for each and every testing situation.

In the early days of language testing, reliability and validity were often seen as dichotomous concepts (Weir, 2005). Reliability is in fact a prerequisite to validity in performance assessment, in the sense that the test must provide consistent, replicable information about candidates' language performance (Clark, 1975); that is, no test can achieve its intended purpose if the test results are unreliable. In contrast to Clark's viewpoint regarding the positive link between reliability and validity, Underhill, 1982 (as cited in Fulcher, 2000) argued that there is no real-life situation in which we go around asking or answering multiple-choice questions. (It should also be remembered that at this time validity was perceived to be a quality of the test, which is no longer the case.) Thus, the more validity a test had, the less reliability it had, and vice versa. This was expressed most clearly by Underhill (1982) when he wrote: if you believe that real language use only occurs in creative communication between two or more parties with genuine reasons for communication, then you may accept that the trade between reliability and validity is unavoidable. In a similar vein, Davies (1978) put forward that there was a "tension" between reliability and validity: validity involved making the test truer to an activity that would take place in the real world, yet reliability could only be increased through using objective item types like multiple choice.

Wiliam and Black (1996) claim that these two aspects are not entirely independent. One check on construct validity would be the inter-correlation between relevant test items. If this inter-correlation were low, neither criterion is satisfied, whereas if it were high, the test would satisfy one criterion for reliability; it could also be a valid measure of the intended construct, although it is always possible that it is actually measuring something else.


For example, a test designed to measure comprehension using a set of different and varied questions about a given prose passage might turn out to be testing familiarity with the subject content of the passage rather than the ability to understand unseen prose passages in general. Similarly, as pointed out above, adequate reliability is a necessary but not a sufficient condition for predictive validity. Whether or not reliability is an essential prior condition for validity will depend on the circumstances and purposes of an assessment, as two different cases illustrate. For any test or assessment where decisions are taken which cannot easily be changed, as in examinations for public certificates, reliability will be an essential prior condition for validity. However, there may be a high price to pay, because the testing might have to take a long time and there would have to be careful checks on the marking and grading to ensure the reliability required. Given limited resources, the two ideals of validity and reliability may often be in conflict: narrowing the range of the testing aims can make it possible to enhance reliability at the expense of validity, whilst broadening the range might do the reverse. Such conditions would not apply in the same way to classroom assessment by a teacher. In classroom use, an action taken on the basis of a result can easily be changed if it turns out that the result is misleading and the action inappropriate. In these circumstances, a teacher may infer that a pupil has grasped an idea on the basis of achievement on one piece of work and may not wish to spend the time needed to ensure high reliability by taking an average of the pupil's performance over several tasks before moving the pupil on to the next piece of work. However, the decision that the pupil should move on would be quite unjustified if the achievement result were invalid, for then it would bear no relation to what the pupil might need as a basis for moving on to the next piece of work (Wiliam & Black, 1996).

As stated by Gerbing and Anderson (1988) and Cortina (1993), the reliability of a scale is determined by the number of items that define the scale and the reliabilities of those items. Churchill and Peter (1984) show that alpha is directly related to both the number of items and the average inter-item correlation. The fundamental question is whether the multiple items chosen truly describe the construct or are merely an artifact of the required methodology; that is, whether a large number of items is chosen just to be consistent with the vogue of using multiple items, without ensuring that the items cover the entire domain of the construct. The crux of the problem lies in the way multi-item scales are assessed. In attempting to ascertain the reliability of their scales, researchers overlook the fact that a highly reliable scale may not necessarily be valid. A highly reliable scale may be obtained from items that correlate highly with each other and with the construct to be measured. For example, consider a construct with four different dimensions, say the perceived quality of running shoes, where the four dimensions are comfort, durability, protection and workmanship. A multiple-item scale for this measure should include at least two items for each of these dimensions, and for the scale to be reliable these items should all be highly correlated with each other and with the construct itself.
However, high reliability estimates could also be obtained even if the scale includes highly correlated items for comfort, durability and workmanship but does not measure protection at all.


In such an instance, though the study would reveal high reliability estimates for the scale, construct validity would be low, because the scale fails to measure one of the important dimensions of the construct (i.e. the problem of underidentification). Therefore, a highly reliable scale does not necessarily confirm the validity of the scale. Thus the view that the more items, the greater the internal consistency or reliability, and hence the better the scale, is not always true unless high construct validity is also ascertained: alpha can be high through the influence of the number of items even if they do not measure the construct, or measure only one aspect of it repetitiously (Jaju & Crask, n.d.).

Wiliam (1992, 1996) has commented on the difficulties involved in untangling this relationship. He refers to some authors who consider validity to be more important than reliability, and notes that in one sense this is true, since there is no point in measuring something reliably unless one knows what one is measuring. After all, that would be like saying "I've measured something, and I know I'm doing it right, because I get the same reading consistently, although I don't know what I'm measuring." On the other hand, reliability is a prerequisite for validity: no assessment can have any validity at all if the mark a student gets varies radically from occasion to occasion, or depends on who does the marking. To confuse matters further, reliability and validity are often in tension, with attempts to increase reliability (e.g. by making the marking scheme stricter) having a negative effect on validity (e.g. because students with good answers not foreseen in the mark scheme cannot be given high marks).

In addition to the above points, Wiliam also argues that reliability and validity are not absolutes but matters of degree, and that the relationship between the two can be clarified through the metaphor of stage lighting. For a given amount of lighting power (testing time), one can use a spotlight to illuminate a small part of the stage very brightly, so that one gets a very clear picture of what is happening in the illuminated area (high reliability), but one has no idea what is going on elsewhere, and the people in darkness can get up to all kinds of things, knowing that they won't be seen (parts of the curriculum that are not tested go untaught). Alternatively, one can use a floodlight to illuminate the whole stage, so that we get some idea of what is going on across the whole stage (high validity), but no clear detail anywhere (low reliability). The validity-reliability relationship is thus one of focus. For a given amount of testing time, one can get a little information across a broad range of topics, as is the case with national curriculum tests, although the trade-off is that the scores for individuals are relatively unreliable. One could get more reliable tests by testing only a small part of the curriculum, but that would provide an incentive for schools to improve their test results by teaching only those parts of the curriculum actually tested, which raises the issue of the social consequences of assessment.

In a similar attempt to clarify the notions of validity and reliability and the links between them, Trochim (2008) uses the metaphor of a target. He explains: think of the center of the target as the concept that you are trying to measure. Imagine that for each person you are measuring, you are taking a shot at the target.
If you measure the concept perfectly for a person, you are hitting the center of the target. If you don't, you are missing the center. The more you are off for that person, the further you are from the center.
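Returning to the running-shoe example above, the point that a scale can show high internal consistency while missing an entire dimension of the construct can be illustrated with a small simulation. The latent-factor setup and all numbers below are hypothetical, introduced only for this sketch.

```python
# A hypothetical simulation of the running-shoe point: a scale whose items
# cover only three of four intended dimensions can still show high internal
# consistency (alpha), even though construct coverage is incomplete.
import numpy as np

rng = np.random.default_rng(0)
n = 300  # simulated respondents

def alpha(scores):
    """Cronbach's alpha for an (n_respondents, n_items) matrix."""
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                            / scores.sum(axis=1).var(ddof=1))

# One latent factor drives the comfort, durability and workmanship items;
# "protection" is modelled as a separate, uncorrelated dimension of quality.
general = rng.normal(size=n)
protection = rng.normal(size=n)

def items(factor, n_items, noise=0.6):
    """Generate n_items noisy indicators of a latent factor."""
    return np.column_stack([factor + rng.normal(scale=noise, size=n)
                            for _ in range(n_items)])

covered_items = items(general, 6)        # comfort/durability/workmanship only
full_scale = np.hstack([covered_items, items(protection, 2)])

print(f"alpha, protection omitted : {alpha(covered_items):.2f}")   # high
print(f"alpha, all four dimensions: {alpha(full_scale):.2f}")      # lower
```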


6. Conclusion
Test construction may seem a tedious and difficult task. However, regardless of the complexity of the work involved in determining the reliability, validity, and practicality of a test, these concepts are indispensable parts of test construction. In order to have an acceptable and defensible test, upon which reasonably sound decisions can be made, test developers should go through planning, preparing, reviewing, and pretesting processes. Furthermore, both item and test characteristics should be determined. On the basis of the results of pre-testing and reviewing, test developers should make all necessary modifications to the test. Only then can a test be used for practical purposes, and only then can one make fairly sound decisions about the lives of people. Without determining these parameters, nobody is ethically entitled to use a test for practical purposes; otherwise, test users are bound to make inexcusable mistakes, unreasonable decisions and unrealistic appraisals. Taking the significance of these parameters into account, this article has elaborated on the three fundamental characteristics of a good test that are considered the cornerstones of testing. In studying the links between these three important elements of the testing process, the main problem has lain in the vagueness of their relationship. As stated earlier, some authors believe that validity is more important than reliability, others regard reliability as a prerequisite for validity, and a third group claims that the two are in tension. Bachman and Palmer's model of test usefulness helps to clarify and untangle this complex issue: on their model, any language test must be developed with a specific purpose, a particular group of test-takers and a specific language use in mind. Test developers should therefore avoid considering any of the test qualities independently of the others. Trying to maximize all of these qualities inevitably leads to conflict, so the test developer must strive to attain an appropriate balance of the qualities for each specific test. Obviously, there is no perfect language test, much less a single test which is perfectly suitable for each and every testing situation. In general, from the discussion of the relationship between validity and reliability, one can conclude that a reliable test may not be valid, but a valid test is to some extent reliable.

7. References
1. AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, D.C.: Author.
2. Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
3. Anastasi, A. & Urbina, S. (2007). Psychological testing (2nd impression). NJ: Pearson Prentice-Hall.
4. Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.


5. Bachman, L. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453-476.
6. Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
7. Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
8. Churchill, G. A., Jr. & Peter, P. J. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, XXI (Nov.), 360-375.
9. Clark, J. (1975). Theoretical and technical considerations in oral proficiency testing. In S. Jones & B. Spolsky (Eds.), Language testing proficiency (pp. 10-24). Arlington, VA: Center for Applied Linguistics.
10. Cohen, L., Manion, L. & Morrison, K. (2008). Research methods in education (6th ed.). London & New York: Routledge Taylor & Francis Group, 133-164.
11. Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth Group/Thomson Learning.
12. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
13. Davies, A. (1978). Language testing. Language Teaching and Linguistics Abstracts, 11, 3 & 4; reprinted in Kinchella, V. (Ed.) (1982), Surveys 1: Eight State-of-the-art Articles on Key Areas in Language Teaching. Cambridge: Cambridge University Press, 127-159.
14. DeVellis, R. F. (1991). Scale development: Theory and applications (Applied Social Research Methods Series, Vol. 26). Newbury Park: Sage.
15. Drost, E. A. (n.d.). Validity and reliability in social research. Education Research and Perspectives, 38(1), 105.
16. Farhady, H. (1986). Fundamental concepts in language testing (4): Characteristics of language tests: Total test characteristics. Roshd Foreign Language Teaching Journal, 2(3).
17. Farhady, H., Jafarpur, A., & Birjandi, P. (2010). Testing language skills: From theory to practice. Iran: SAMT.
18. Fulcher, G. (2000). The 'communicative' legacy in language testing. English Language Institute, University of Surrey, Guildford, UK.
19. Gerbing, D. W. & Anderson, J. C. (1988). An updated paradigm for scale development incorporating unidimensionality and its assessment. Journal of Marketing Research, XXV (May), 186-192.
20. Gregory, R. J. (1992). Psychological testing: History, principles and applications. Boston: Allyn and Bacon.
21. Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
22. Hughes, A. (1989). Testing for language teachers (1st ed.). Cambridge: Cambridge University Press.
23. Jafarpur, A. (1987). The short-context technique: An alternative for testing reading comprehension. Language Testing, 4(2), 195-220.


24. Jaju, A. (n.d.). The perfect design: Optimization between reliability, validity, redundancy in scale items and response rates. University of Georgia, Athens. Retrieved June 2013 from http://www.eu.toeic.eu/toeic-sites/toeic-europe/language-schools/
25. Kaplan, R. M. & Saccuzzo, D. P. (2005). Psychological testing: Principles, applications and issues (6th ed.). Thomson Wadsworth, 132-154.
26. Lawson, A. J. (2008). Testing the TOEIC: Practicality, reliability and validity in the Test of English for International Communication. University of Birmingham, Centre for English Studies, MA TEFL Module 6.
27. Mehrens, W. A. & Lehman, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Orlando, FL: Holt, Rinehart and Winston.
28. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan/American Council on Education.
29. Nunnally, J. C. (1978). Psychometric theory. McGraw-Hill, pp. 86-113, 190-255.
30. Oluwatayo, J. A. (2004). Mode of entry and performance of university undergraduates in science courses. Unpublished Ph.D. thesis, University of Ado-Ekiti, Nigeria.
31. Oluwatayo, J. A. (2012). Validity and reliability issues in educational research. Journal of Educational and Social Research, 2(2). doi:10.5901/jesr.2012.v2n2.391
32. Sak, G. (2008). An investigation of the validity and reliability of the speaking exam at a Turkish university. Unpublished thesis, Graduate School of Social Sciences, Middle East Technical University.
33. Rosenthal, R. & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). McGraw-Hill, pp. 46-65.
34. Thompson, B. (1999). Understanding coefficient alpha, really. Paper presented at the annual meeting of the Education Research Exchange, College Station, Texas, February 5, 1999. Retrieved October 30, 2007, from http://ltj.sagepub.com/cgi/content/abstract/19/1/33
35. Torabi, M. R. (1994). Reliability methods and number of items in development of health instruments. Health Values: The Journal of Health Behavior, Education and Promotion, 18(6), 56-59.
36. Weir, C. J. (2005). Language testing and validation. New York: Palgrave Macmillan.
37. Wells, C. S. & Wollack, J. A. (2003). An instructor's guide to understanding test reliability. Testing & Evaluation Services, University of Wisconsin.
38. Wiliam, D. (1992). Some technical issues in assessment: A user's guide. British Journal for Curriculum and Assessment, 2(3), 11-20.
39. Wiliam, D. (1996). National curriculum assessments and programmes of study: Validity and impact. British Educational Research Journal, 22(1), 129-141.
40. Wiliam, D. & Black, P. J. (1996). Meanings and consequences: A basis for distinguishing formative and summative functions of assessment. British Educational Research Journal, 22(5), 537-548.

