MA, Department of English Language and Literature, Shahid Rajaee Teacher Training University, Tehran, Iran
Abstract
Language testing at any level is a highly complex undertaking that must be grounded in theory as well as practice. The results of assessments are used for one or more purposes, so they have an effect both on those who are assessed and on those who use the results. It therefore matters that the results are dependable and that the users of the results can have confidence in them. An understanding of the basic principles of testing is thus essential. This study aims mainly at clarifying the notions of validity, reliability and practicality on the one hand, and their relationships as the three cornerstones of the testing procedure on the other.
Key Words: Testing, Validity, Reliability, Practicality
Email address: Farhadnourizade@yahoo.com
Tel: 98-938-037-9681
1. Introduction
No measuring instrument is perfect. If we use a thermometer to measure the temperature of a liquid, we get small variations even if the temperature does not change. We generally assume that these are random fluctuations around the true temperature, and we take the average of the readings as the temperature of the liquid. But there are more subtle problems. If we place the thermometer in a long, narrow tube of the liquid, then the thermometer itself might affect the reading: if the thermometer is warmer than the liquid, it will heat the liquid up. In contrast to the problems of reliability mentioned above, this effect is not random,
but a systematic bias, so averaging across many readings does not help. Sometimes we are not really sure what we are measuring. For example, bathroom scales sometimes change their reading according to how far apart the feet are placed; the scales are then measuring not just weight but also the way the weight is distributed. Extending the analogy to the test construction process, it should be noted at the outset that if the items, which are the building blocks of a test, meet the criteria introduced before, the whole test will most likely be acceptable. However, the assumption that good items will necessarily produce a good test may not always hold. Test developers should go one step further and determine the characteristics of the total test; in other words, simply putting good individual items together is not sufficient for a test to function satisfactorily (Farhady, Jafarpur, & Birjandi, 2010). To guarantee the quality of a test, other factors and guiding principles, including the administration process and the scoring procedures, must be considered. In general, an efficient test is marked by certain characteristics that can be considered the cornerstones of testing: reliability, validity and practicality (Farhady et al., 2010). Because these concepts are absolutely essential in testing, they require fairly comprehensive elaboration. Reviewing the existing literature, this article is devoted not only to clarifying these notions but also to explaining the relationships between reliability, validity and practicality.
In this regard, Bachman (1990) argues that validity is the most important quality of test use, concerning the extent to which meaningful inferences can be drawn from test scores. Taking a similar position, Cronbach (as cited in Crocker & Algina, 1986) notes that examining the validity of a test requires a validation process by which a test user presents evidence to support the inferences or decisions made on the basis of test scores. Criticizing traditional approaches and conceptualizations, Messick (1989) put forward a unified validity framework in which he considers validity as nothing less than an evaluative summary of both the evidence for, and the actual as well as potential consequences of, score interpretation and use (i.e., construct validity conceived comprehensively). This comprehensive view integrates considerations of content, criteria and consequences into a single framework for empirically testing rational hypotheses about score meaning and utility. In this view, validity is not a property of a test or assessment but the degree to which we are justified in making an inference to a construct from a test score (for example, whether 20 on a reading test indicates the ability to read first-year business studies texts), and whether any decisions we might make on the basis of the score are justifiable. In order to improve assessment practices, Killen (2003) argues that a different understanding of validity needs to be invoked. Drawing on insights from the work of Messick (1989), he notes that the meaning of the term validity has evolved over time and that a more recent understanding concerns the extent to which justifiable inferences can be made on the basis of the evidence gathered. The interest, therefore, is not to validate a test but to validate the inferences that can be drawn from the learner's results on the test or assessment task.
Viewing validity as inference places a greater responsibility on teachers. As Killen (2003: 6) writes: "No longer can a teacher claim to be using a valid assessment task simply because it is clearly linked to the curriculum content, or because someone else has used the test and decided that it is valid, or because it produces results similar to those obtained from other assessment tasks. Instead the teacher must question the validity of the inferences they are making as a result of having used the assessment task." Based on these modern views, two important concepts can be attributed to validity. The first is the notion of a construct. Constructs are defined as labels used to describe an unobserved behavior, such as honesty. Such behaviors vary across individuals; although they may leave an impression, they cannot be directly measured. In the social sciences, virtually everything is a construct. Linda Crocker and James Algina described constructs as the product of informed scientific imagination. Constructs represent something abstract that can nevertheless be operationalized through the measurement process. Measurement can involve a paper-and-pencil test, a performance, or some other observed activity, producing variables that can be quantified, manipulated, and analyzed. The second important concept is inference. An inference is an informed leap from an observed or measured value to an estimate of some underlying variable. It is called an inferential leap because we cannot directly observe what is being measured, but we can observe its manifestations. Through observation of the results of the measurement process, an inference can be made about the underlying characteristic and its nature or status.
In essence, reliability concerns the degree to which the results of a test given twice are similar. It is important to remember that reliability is not measured; it is estimated. Jones (2001, as cited in Weir, 2005) argued that reliability is a word whose everyday meaning adds powerful positive connotations to its technical meaning in testing: reliability is a highly desirable quality in a friend, a car or a railway system. Reliability in testing likewise denotes dependability, in the sense that a reliable test can be depended on to produce very similar results in repeated uses. Reliability is defined as the conformity of measurement results derived from measurement procedures that are identical or as similar as possible (Churchill, 1983). It is the extent to which measurements are repeatable and consistent (Torabi, 1994). Put another way, reliability involves the consistency, or reproducibility, of test scores: the degree to which one can expect relatively constant deviation scores for individuals across testing situations on the same, or parallel, testing instruments. This property is not a static function of the test. Rather, reliability estimates change with different populations (i.e. population samples) and as a function of the error involved. These facts underscore the importance of reporting reliability estimates for each administration of an instrument, as test samples, or subject populations, are rarely the same across situations and research settings. More importantly, reliability estimates are a function of the test scores yielded by an instrument, not of the test itself (Thompson, 1999). Accordingly, reliability estimates should be considered in light of the various sources of measurement error involved in test administration (Crocker & Algina, 1986). Highlighting the significance of reliability in testing, Wells and Wollack (2003) suggest that it is important to be concerned with a test's reliability for two reasons.
First, reliability provides a measure of the extent to which an examinee's score reflects random measurement error. Measurement errors are caused by one of three factors: (a) examinee-specific factors such as motivation, concentration, fatigue, boredom, momentary lapses of memory, carelessness in marking answers, and luck in guessing; (b) test-specific factors such as the specific set of questions selected for a test, ambiguous or tricky items, and poor directions; and (c) scoring-specific factors such as non-uniform scoring guidelines, carelessness, and counting or computational errors. In an unreliable test, students' scores consist largely of measurement error; such a test offers no advantage over randomly assigning scores to students. It is therefore desirable to use tests with good reliability, so as to ensure that the test scores reflect more than just random error. The second reason to be concerned with reliability is that it is a precursor to test validity: if test scores cannot be assigned consistently, it is impossible to conclude that they accurately measure the domain of interest.
In the test-retest method, the test is administered once, a sufficient period of time is allowed to elapse, and the test is then administered again. Upon completion of the second administration, one can calculate the correlation coefficient between the scores on the two measures, which yields information on how stable the test results (i.e. observed scores) are over time (Gregory, 1992). Rosenthal and Rosnow (1991, as cited in Drost, n.d.) state that despite its appeal, the test-retest technique has several limitations. When the interval between the first and second test is too short, respondents might remember what was on the first test, and their answers on the second test could be affected by memory. Alternatively, when the interval between the two tests is too long, maturation occurs. Maturation refers to changes in subject factors or respondents (other than those associated with the independent variable) that occur over time and cause a change from the initial measurements to the later measurements (t and t + 1). During the time between the two tests, the respondents could have been exposed to things that changed their opinions, feelings or attitudes about the behavior under study. A second type of reliability estimate is the alternate-form method, which evaluates the consistency of alternate forms of a single test. This approach is particularly useful in the context of standardized testing, where it is ideal to have multiple, equivalent forms of the same test. In this method, participants take one form of the test, a period of time elapses, and they then take a second form. Once results are gathered from both sessions, the correlation coefficient between the two sets of scores is calculated.
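Both the test-retest and the alternate-form estimates come down to correlating two sets of scores from the same examinees. A minimal sketch, using hypothetical scores for five examinees, is:

```python
# Estimating test-retest (or alternate-form) reliability as the Pearson
# correlation between two administrations of a test.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on two administrations of the same test
first = [72, 85, 64, 90, 78]
second = [70, 88, 61, 93, 75]
print(round(pearson(first, second), 3))  # → 0.992
```

A coefficient this close to 1 would indicate highly stable observed scores over time; real data, of course, rarely behave so neatly.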
In this technique, a coefficient of equivalence is yielded (Crocker & Algina, 1986). The most common scenario for classroom exams involves administering one test to all students at one point in time; reliability estimated from a single administration is referred to as internal consistency. In this case, a single score is used to indicate a student's level of understanding of a particular topic. However, the purpose of the exam is not simply to determine how many items students answered correctly on a particular test, but to measure how well they know the content area. To achieve this goal, the items on the test must be sampled so as to be representative of the entire domain of interest. Internal consistency methods are concerned with the consistency of scores within the test itself, or rather the consistency of scores among the items (ibid). Here, the key is to have a homogeneous set of items that reflect a unified underlying construct. High reliability estimates of this kind result in high inter-item correlations among the items or subscales. The most common method of assessing internal consistency is coefficient alpha. Though there are three different measures of coefficient alpha, the most widely used is Cronbach's coefficient alpha, which is in effect an average of all possible split-half reliability estimates of an instrument (Crocker & Algina, 1986; DeVellis, 1991). It is important to note that coefficient alpha is a lower bound of reliability. The two lesser-used techniques for estimating coefficient alpha are appropriate in limited circumstances: the Kuder-Richardson 20 is appropriate for dichotomously scored items, and Hoyt's method is useful in particular testing situations that involve computer programming (ibid).
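Cronbach's alpha can be sketched directly from its definition: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The score matrix below is hypothetical; note that for dichotomous (0/1) items this same computation reproduces the Kuder-Richardson 20 estimate.

```python
# A minimal sketch of Cronbach's coefficient alpha, computed from a
# person-by-item score matrix (rows = examinees, columns = items).

def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def cronbach_alpha(scores):
    k = len(scores[0])                  # number of items
    items = list(zip(*scores))          # column-wise item scores
    item_var = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

scores = [
    [1, 1, 1, 0],   # each row: one examinee's item scores
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 3))  # → 0.8
```

The formula makes the text's point visible: alpha rises when items covary strongly (large total-score variance relative to the item variances), which is exactly what a homogeneous set of items produces.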
Another internal consistency measure that requires only one test administration is split-half reliability. This technique splits an instrument into two halves, estimates the reliability of each, and compares the two estimates. The method demands equal item representation across the two halves of the instrument; clearly, comparing dissimilar samples of items will not yield an accurate reliability estimate. Researchers can ensure equal item representation through random item selection, by matching items from one half to the other, or by assigning items to halves on an even/odd basis (Gregory, 1992). Several aspects make the split-half approach more attractive than the test-retest and alternate-forms methods. First, the memory effect discussed previously does not operate with this approach. A practical advantage is also that split-half data are usually cheaper and more easily obtained than over-time data (Bollen, 1989, as cited in Drost, n.d.). A disadvantage of the split-half method is that the halves must be parallel measures; that is, the correlation between the two halves will vary slightly depending on how the items are divided. Nunnally (1978) suggests using the split-half method when measuring the variability of behaviours over short periods of time and when alternate forms are not available.
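The even/odd split mentioned above can be sketched as follows. Because each half is only half the length of the full test, the half-test correlation is stepped up with the Spearman-Brown correction; the score matrix is hypothetical.

```python
# Split-half reliability: correlate odd-item and even-item half scores,
# then apply the Spearman-Brown step-up for full test length.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def split_half_reliability(scores):
    odd = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)           # Spearman-Brown step-up

scores = [
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 3))
```

A different assignment of items to halves would give a slightly different estimate, which is precisely the disadvantage the text notes.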
Test factors, on the other hand, can strongly influence the consistency of a test. Fluctuations in these factors can originate from the structure of the test, the administration procedures and the scoring process (Farhady et al., 2010). In line with the points mentioned earlier, Crocker and Algina state that low internal consistency estimates are often the result of poorly written items or an excessively broad content area of measurement. Item quality has a large impact on reliability: poor items tend to reduce it while good items tend to increase it. How does one know if an item is of low or high quality? The answer lies primarily in the item's discrimination. Items that discriminate between students with different degrees of mastery of the course content are desirable and will improve reliability. An item is considered discriminating if the better students tend to answer it correctly while the poorer students tend to respond incorrectly (Wells & Wollack, 2003). Imposed time constraints in a testing situation pose a different type of problem: time limits ultimately affect a test taker's ability to fully answer questions or to complete an instrument. As a result, variance in test takers' ability to work at a specific rate becomes enmeshed in their score variance. Ultimately, test takers who work at similar rates have higher degrees of variance correspondence, artificially inflating the reliability of the testing instrument. Clearly, this situation becomes problematic when the construct that an instrument intends to measure has nothing to do with speed (Gregory, 1992). The relationship between reliability and item difficulty is again a variability issue: if the testing instrument has little to no variability in the questions (i.e. the items are either all too difficult or all too easy), the reliability of the scores will be affected.
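The notion of discrimination described above is often quantified with a simple upper-lower index: the proportion of high-scoring examinees answering an item correctly minus the proportion of low-scoring examinees doing so. A sketch with hypothetical 0/1 item data:

```python
# A simple item-discrimination index D: p(correct, upper group) minus
# p(correct, lower group), where groups are formed from total scores.

def discrimination_index(scores, item):
    """D for one item, given a person-by-item 0/1 score matrix."""
    ranked = sorted(scores, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(row[item] for row in upper) / half
    p_lower = sum(row[item] for row in lower) / half
    return p_upper - p_lower

scores = [
    [1, 1, 1, 1],   # strong examinees at the top
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],   # weak examinees at the bottom
]
# The last item (index 3) is answered correctly only by high scorers:
print(discrimination_index(scores, 3))  # → 1.0
```

D near 1 marks a strongly discriminating item; values near zero (or negative) flag items that fail to separate stronger from weaker students and thus depress reliability.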
Aside from this, reliability estimates can also be artificially deflated if the test has too many difficult items, as these promote uneducated guesses (Crocker & Algina, 1986). Lastly, test length also factors into the reliability estimate: simply put, longer tests yield higher estimates of reliability. This phenomenon is best explained through the Spearman-Brown prophecy formula, which indicates that as the number of items increases, the reliability estimate increases. However, one must weigh the reliability gains in such situations, as infinitely long tests are not desirable. For instance, if you have an 80-item instrument with an internal consistency reliability coefficient of .78, and the Spearman-Brown prophecy indicates that the estimate will increase to .85 if you add an additional 50 items, you may decide that the slightly lower reliability estimate is preferable to an excessively long instrument (Mehrens & Lehman, 1991). In general, longer tests produce higher reliabilities, as in the old carpenter's adage, measure twice, cut once. Intuitively, this makes a great deal of sense. Most instructors would feel uncomfortable basing midterm grades on students' responses to a single multiple-choice item, but are perfectly comfortable basing them on a test of 50 multiple-choice items. This is because, for any given item, measurement error represents a large percentage of a student's score, and the percentage of measurement error decreases as test length increases. Even very low-achieving students can answer a single item correctly, even through guessing; however, it is much less likely that they can correctly answer all items on a 20-item test (Wells & Wollack, 2003). Other things being equal, it is
good that a test should be easy and cheap to construct, administer, score and interpret (Hughes, 2003:56).
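The Spearman-Brown prophecy mentioned above has a simple closed form: for a test lengthened by a factor k with parallel items, r_new = k * r / (1 + (k - 1) * r). A sketch, using a doubled test and an 80-item scenario as worked examples:

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened (or shortened) by a factor k, assuming added items are
# parallel to the existing ones.

def spearman_brown(r, k):
    """r: current reliability; k: ratio of new length to old length."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability .50 raises the estimate to about .67
print(round(spearman_brown(0.50, 2), 2))         # → 0.67

# Lengthening an 80-item test with alpha = .78 to 130 items (k = 130/80)
print(round(spearman_brown(0.78, 130 / 80), 2))  # → 0.85
```

The diminishing returns are visible in the formula itself: each additional block of items buys less reliability than the last, which is why an excessively long instrument is rarely worth the marginal gain.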
4. Practicality
Practicality = Available resources / Required resources
If practicality > 1, test development and use are practical; if practicality < 1, they are not (Figure 2.2: Practicality; Bachman & Palmer, 1996: 36). Before we even examine the content of a test, we must ask whether it is feasible: a good test must be practical. Whether a test is practical is a matter of examining the available resources. It may be possible to develop a test that is highly valid and reliable for a particular situation, but if that test requires more resources than are available, it is doomed. The test may initially be used if its development does not exceed the available resources, but cutting corners in administering or marking it, in order to save time or money, will immediately lead to an unacceptable deterioration in reliability. Test practicality involves the nitty-gritty (H. D. Brown, 2001, as cited in Lawson, 2008) of manpower, materials and time: we can only make the best use of what is available to us. It refers to the facilities available to test developers with regard to both the administration and the scoring procedures of a test. As far as administration is concerned, test developers should be attentive to the possibility of giving a test under reasonably acceptable conditions. Regarding scoring procedures, one should pay attention to ease of scoring as well as ease of interpreting the scores. For instance, even if composition tests are excellent indicators of language ability, they are demanding to administer and score, so test developers should be very careful in selecting and administering such a test. The test should be practical, i.e., easy to administer, easy to score, and easy to interpret (Farhady et al., 2010). Indeed, in many respects, practicality is the whole raison d'être of testing.
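Bachman and Palmer's ratio can be sketched as a trivial function. The resource figures below are hypothetical (in any common unit, such as person-hours or budget):

```python
# Bachman & Palmer's practicality ratio: available resources divided by
# required resources; a ratio of at least 1 means development and use
# of the test are feasible with the resources at hand.

def practicality(available, required):
    return available / required

ratio = practicality(available=120, required=150)
print(ratio, "practical" if ratio >= 1 else "not practical")
```

The point of the formalism is not the arithmetic but the discipline: resources must be inventoried and compared with requirements before a test design is committed to.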
In elaborating the overall concept of practicality in the testing process, Owen (2001, as cited in Lawson, 2008) provides an interesting and succinct analogy: for the complete bath-water test, just dump the baby into the bath and see what happens. For the complete test of an employee's English communication skills in a business context, a firm should send the employee out to work overseas and do business in English. If he or she is able to do the work required, the company may deem the employee's language abilities sufficient. If there are lots of costly errors due to communication problems, and the employee becomes chronically depressed as a consequence of being unable to do their job properly, make friends or even make a dental appointment where they are living, then the company may deem the employee's language skills insufficient. Clearly, there are reasons why companies do not do this (just as there are obvious reasons why we don't test bathwater temperature by tossing in our infants and waiting to see whether they boil). The costs to a company of sending staff into foreign-language-speaking business situations without examining their readiness beforehand would often be financially colossal, as well as morally
questionable. A test is a more economical (and hopefully more reliable) substitute for a more complete procedure (ibid).
designed to measure comprehension using a set of different and varied questions about a given prose passage might turn out to be testing familiarity with the subject content of the passage rather than the ability to understand unseen prose passages in general. Similarly, as also pointed out above, adequate reliability is a necessary but not sufficient condition for predictive validity. Whether reliability is an essential prior condition for validity depends on the circumstances and purposes of an assessment, as two different cases illustrate. For any test or assessment where decisions are taken that cannot easily be changed, as in examinations for public certificates, reliability will be an essential prior condition for validity. However, there may be a high price to pay, because the testing might have to take a long time, and careful checks on the marking and grading would be needed to ensure the required reliability. Given limited resources, the two ideals of validity and reliability may often be in conflict: narrowing the range of the testing aims can enhance reliability at the expense of validity, whilst broadening the range might do the reverse. Such conditions do not apply in the same way to classroom assessment by a teacher. In classroom use, an action taken on the basis of a result can easily be changed if the result turns out to be misleading and the action inappropriate. In these circumstances, a teacher may infer that a pupil has grasped an idea on the basis of achievement on one piece of work, and may not wish to spend the time needed to ensure high reliability, by averaging the pupil's performance over several tasks, before moving the pupil on to the next piece of work.
However, this decision that the pupil should move on would be quite unjustified if the achievement result were invalid, for then it would bear no relation to what the pupil might need as a basis for moving on to the next piece of work (Wiliam & Black, 1996). As stated by Gerbing and Anderson (1988) and Cortina (1993), the reliability of a scale is determined by the number of items that define the scale and the reliabilities of those items. Churchill and Peter (1984) show that alpha is directly related to both the number of items and the average inter-item correlation. The fundamental question is whether the multiple items chosen are true constructs describing the event or merely an artifact of the required methodology; that is, whether a large number of items are chosen simply to follow the vogue of using multiple items, without ensuring that the items cover the entire domain of the construct. The crux of the problem lies in the way multi-item scales are assessed. In attempting to ascertain the reliability of their scales, researchers overlook the fact that a highly reliable scale may not necessarily be valid. A highly reliable scale may be obtained with items that correlate highly with each other and with the construct to be measured. For example, consider a construct with four different dimensions, say, the perceived quality of running shoes, where the four dimensions are comfort, durability, protection and workmanship. A multiple-item scale for this measure should include at least two items for each of these dimensions. For the scale to be reliable, these items should all be highly correlated with each other and with the construct itself. However, high reliability estimates could also be obtained even if the scale includes highly correlated items for comfort, durability and workmanship but does not measure protection at all. In
such an instance, though the study would reveal high reliability estimates for the scale, the construct validity would be low, because the scale fails to measure one of the important dimensions of the construct (i.e. the problem of underidentification). Therefore, a highly reliable scale does not necessarily confirm the validity of the scale. Thus the view that the more items, the greater the internal consistency or reliability, and hence the better the scale, is not always true unless high construct validity is also ascertained: alpha can be high by virtue of the sheer number of items even if they do not measure the construct, or measure only one aspect of it repetitiously (Jaju & Crask, n.d.). Wiliam (1992, 1996) has commented on the difficulties involved in untangling this relationship. In this regard, he refers to some authors who consider validity to be more important than reliability. He continues that in one sense this is true, since there is no point in measuring something reliably unless one knows what one is measuring; after all, that would be like saying "I've measured something, and I know I'm doing it right, because I get the same reading consistently, although I don't know what I'm measuring." On the other hand, reliability is a prerequisite for validity: no assessment can have any validity at all if the mark a student gets varies radically from occasion to occasion, or depends on who does the marking. To confuse matters further, reliability and validity are often in tension, with attempts to increase reliability (e.g. by making the marking scheme stricter) having a negative effect on validity (e.g. because students with good answers not foreseen in the mark scheme cannot be given high marks). In addition to the points mentioned above, Wiliam also believes that reliability and validity are not absolutes but matters of degree, and that the relationship between the two can be clarified through the metaphor of stage lighting.
For a given amount of lighting power (testing time), one can use a spotlight to illuminate a small part of the stage very brightly, so that one gets a very clear picture of what is happening in the illuminated area (high reliability), but one has no idea what is going on elsewhere, and the people in darkness can get up to all kinds of things, knowing that they won't be seen (parts of the curriculum that are not tested go untaught). Alternatively, one can use a floodlight to illuminate the whole stage, so that we get some idea of what is going on across the whole stage (high validity), but no clear detail anywhere (low reliability). The validity-reliability relationship is thus one of focus. For a given amount of testing time, one can get a little information across a broad range of topics, as with national curriculum tests, although the trade-off is that the scores for individuals are relatively unreliable. One could get more reliable tests by testing only a small part of the curriculum, but that would give schools an incentive to improve their test results by teaching only those parts of the curriculum actually tested, a point that bears on the social consequences of assessment. In a similar attempt to clarify the notions of validity and reliability and their links, Trochim (2008) uses the metaphor of a target. He explains: think of the center of the target as the concept you are trying to measure. Imagine that for each person you are measuring, you are taking a shot at the target. If you measure the concept perfectly for a person, you hit the center of the target; if you don't, you miss it. The more you are off for that person, the further you are from the center.
6. Conclusion
Test construction may seem a tedious and difficult task. However, regardless of the complexity of the tasks involved in determining the reliability, validity, and practicality of a test, these concepts are indispensable parts of test construction. In order to have an acceptable and defensible test, upon which reasonably sound decisions can be made, test developers should go through planning, preparing, reviewing, and pretesting processes. Furthermore, both item and test characteristics should be determined, and on the basis of the results of pretesting and reviewing, test developers should make all necessary modifications to the test. Only then can a test be used for practical purposes, and only then can one make fairly sound decisions about the lives of people. Without determining these parameters, nobody is ethically entitled to use a test for practical purposes; otherwise, test users are bound to make inexcusable mistakes, unreasonable decisions and unrealistic appraisals. Taking the significance of these parameters into account, this article has elaborated on the three fundamental characteristics of a good test that are considered the cornerstones of testing. In studying the links between these three elements of the testing process, the main problem has lain in the vagueness of their relationship. As noted earlier, some authors believe that validity is more important than reliability, others regard reliability as a prerequisite for validity, and a third group claims that the two are in tension. Bachman and Palmer's test usefulness model clarifies and untangles this complex issue: based on their model, any language test must be developed with a specific purpose, a particular group of test-takers and a specific language use in mind. Therefore, test developers should avoid considering any of the test qualities independently of the others. Trying to maximize all of these qualities inevitably leads to conflict.
Thus, the test developer must strive to attain an appropriate balance of the above qualities for each specific test. Obviously, there is no perfect language test, much less a single test perfectly suitable for every testing situation. In general, from the discussion of the relationship between validity and reliability, one can conclude that a reliable test may not be valid, but a valid test is, to some extent, reliable.
7. References
1. AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, D.C.: Author.
2. Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
3. Anastasi, A., & Urbina, S. (2007). Psychological testing (2nd impression). NJ: Pearson Prentice-Hall.
4. Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
5. Bachman, L. (2002). Some reflections on task-based language performance assessment. Language Testing, 19, 453-476.
6. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
7. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
8. Churchill, G. A., Jr., & Peter, P. J. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21(Nov.), 360-375.
9. Clark, J. (1975). Theoretical and technical considerations in oral proficiency testing. In S. Jones & B. Spolsky (Eds.), Testing language proficiency (pp. 10-24). Arlington, VA: Center for Applied Linguistics.
10. Cohen, L., Manion, L., & Morrison, K. (2008). Research methods in education (6th ed.). London & New York: Routledge/Taylor & Francis Group, 133-164.
11. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Belmont, CA: Wadsworth Group/Thomson Learning.
12. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
13. Davies, A. (1978). Language testing. Language Teaching and Linguistics Abstracts, 11(3 & 4); reprinted in Kinsella, V. (Ed.) (1982), Surveys 1: Eight state-of-the-art articles on key areas in language teaching (pp. 127-159). Cambridge: Cambridge University Press.
14. DeVellis, R. F. (1991). Scale development: Theory and applications (Applied Social Research Methods Series, Vol. 26). Newbury Park: Sage.
15. Drost, E. A. (n.d.). Validity and reliability in social research. Education Research and Perspectives, 38(1), 105.
16. Farhady, H. (1986). Fundamental concepts in language testing (4): Characteristics of language tests: Total test characteristics. Roshd Foreign Language Teaching Journal, 2(3).
17. Farhady, H., Jafarpur, A., & Birjandi, P. (2010). Testing language skills: From theory to practice. Iran: SAMT.
18. Fulcher, G. (2000). The 'communicative' legacy in language testing. English Language Institute, University of Surrey, Guildford, UK.
19. Gerbing, D. W., & Anderson, J. C. (1988). An updated paradigm for scale development incorporating unidimensionality and its assessment. Journal of Marketing Research, 25(May), 186-192.
20. Gregory, R. J. (1992). Psychological testing: History, principles and applications. Boston: Allyn and Bacon.
21. Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
22. Hughes, A. (1989). Testing for language teachers (1st ed.). Cambridge: Cambridge University Press.
23. Jafarpur, A. (1987). The short-context technique: An alternative for testing reading comprehension. Language Testing, 4(2), 195-220.
24. Jaju, A. (n.d.). The perfect design: Optimization between reliability, validity, redundancy in scale items and response rates. University of Georgia, Athens. Retrieved June 2013 from http://www.eu.toeic.eu/toeic-sites/toeic-europe/language-schools/
25. Kaplan, R. M., & Saccuzzo, D. P. (2005). Psychological testing: Principles, applications and issues (6th ed.). Thomson Wadsworth, 132-154.
26. Lawson, A. J. (2008). Testing the TOEIC: Practicality, reliability and validity in the Test of English for International Communication. The University of Birmingham, Centre for English Studies, MA TEFL, Module 6.
27. Mehrens, W. A., & Lehman, I. J. (1991). Measurement and evaluation in education and psychology (4th ed.). Orlando, FL: Holt, Rinehart and Winston.
28. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan/American Council on Education.
29. Nunnally, J. C. (1978). Psychometric theory. McGraw-Hill, pp. 86-113, 190-255.
30. Oluwatayo, J. A. (2004). Mode of entry and performance of university undergraduates in science courses. Unpublished Ph.D. thesis, University of Ado-Ekiti, Nigeria.
31. Oluwatayo, J. A. (2012). Validity and reliability issues in educational research. Journal of Educational and Social Research, 2(2). doi:10.5901/jesr.2012.v2n2.391
32. Sak, G. (2008). An investigation of the validity and reliability of the speaking exam at a Turkish university. Unpublished thesis, Graduate School of Social Sciences, Middle East Technical University.
33. Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed.). McGraw-Hill, pp. 46-65.
34. Thompson, B. (1999). Understanding coefficient alpha, really. Paper presented at the annual meeting of the Education Research Exchange, College Station, Texas, February 5, 1999. Retrieved October 30, 2007, from http://ltj.sagepub.com/cgi/content/abstract/19/1/33
35. Torabi, M. R. (1994). Reliability methods and number of items in development of health instruments. Health Values: The Journal of Health Behavior, Education and Promotion, 18(6), 56-59.
36. Weir, C. J. (2005). Language testing and validation. New York: Palgrave Macmillan.
37. Wells, C. S., & Wollack, J. A. (2003). An instructor's guide to understanding test reliability. Testing & Evaluation Services, University of Wisconsin.
38. Wiliam, D. (1992). Some technical issues in assessment: A user's guide. British Journal for Curriculum and Assessment, 2(3), 11-20.
39. Wiliam, D. (1996). National curriculum assessments and programmes of study: Validity and impact. British Educational Research Journal, 22(1), 129-141.
40. Wiliam, D., & Black, P. J. (1996). Meanings and consequences: A basis for distinguishing formative and summative functions of assessment. British Educational Research Journal, 22(5), 537-548.