VALIDITY - CONSEQUENTIALISM
Assoc. Prof. Dr. Sehnaz Sahinkarakas

“Effect-driven testing” (Fulcher & Davidson, 2007)
“the effect that the test is intended to have and to structure the test development to achieve that effect” (p.144)
What does this mean?

DEFINITION OF VALIDITY
“Overall judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment” (Messick, 1995, p. 741).
What is a score? In general it is “any coding or summarization of observed consistencies or performance regularities on a test, questionnaire, observation procedure, or other assessment devices such as work samples, portfolios, and realistic problem simulations” (p. 741).
Validity, then, is about making inferences from scores; scores reflect a test taker’s knowledge and/or skills as shown on the test tasks.

This differs from early definitions of validity: the degree of correlation between the test and the criterion (validity coefficient).
In the early definition:
  there is an upper limit for the possible correlation, and this limit is directly related to the reliability of the test (without high reliability a test cannot be valid)
In the new definition (especially after Messick), validity is the meaning of the test scores, not a property of the test.

Final remarks on validity (and reliability, fairness…):
  they are not based on measurement principles alone; they are social values
  correlation coefficients and/or content validity analysis are not enough to assume validity (Messick)
So, “score validation is an empirical evaluation of the meaning and consequences of measurement” (Messick)

CONSTRUCT VALIDITY
What is a construct?
  A concept defined in such a way that it becomes measurable (operational definition)
  It can be related to other, different constructs (e.g.
the more anxious, the less self-confident)
Construct validity is the degree to which inferences can be made from the operational definitions to the theoretical constructs on which those definitions are based.
What does this mean?
Two things to consider in construct validation:
  Theory (what goes on in our mind: ideas, theories, beliefs…)
  Observation (what we see happening around us; our actual program/treatment)
i.e., we develop something (observation) to reflect what is in our mind (theory)
Construct validity is assessing how well we have transformed our ideas/theories into our actual programs/measures.
What does this mean in testing? How do we do it in testing?

SOURCES OF INVALIDITY
Two major threats:
  Construct underrepresentation: the assessment is too narrow; it does not include important dimensions of the construct
  Construct-irrelevant variance: the assessment is too broad; it contains variance associated with other, distinct constructs

CONSTRUCT-IRRELEVANT VARIANCE
Two kinds:
  Construct-irrelevant difficulty (e.g., undue reading demands in a test of subject-matter knowledge): leads to invalidly low scores
  Construct-irrelevant easiness (e.g., texts that are highly familiar to some test takers): leads to invalidly high scores
What do you think about KPDS/YDS in terms of threats to validity?

SOURCES OF EVIDENCE IN CONSTRUCT VALIDITY (MESSICK, 1995)
Construct validity = the evidential basis for score interpretation
How do we interpret scores?
This applies to any score interpretation, not just those involving ‘theoretical constructs’
How do we do this?

EVIDENCE-RELATED VALIDITY
Two types:
  Convergent validity consists of providing evidence that two tests that are believed to measure closely related skills or types of knowledge correlate strongly (i.e. the test MEASURES what it claims to measure)
  Discriminant validity consists of providing evidence that two tests that do not measure closely related skills or types of knowledge do not correlate strongly (i.e.
The test does NOT MEASURE irrelevant attributes)

ASPECTS OF CONSTRUCT VALIDITY
Validity is a unified concept, but it can be differentiated into distinct aspects:
  Content
  Substantive
  Structural
  Generalizability
  External
  Consequential

CONTENT ASPECT
Content relevance; representativeness; technical quality (to what extent does it represent the domain?)
It requires identifying the construct DOMAIN to be assessed
To what extent does the domain/task cover the construct?
All important parts of the construct domain should be covered

SUBSTANTIVE ASPECT
The processes underlying the construct and the degree to which the assessment tasks reflect these processes
It includes the content aspect, but empirical evidence is also needed. This can be gathered from a variety of sources, e.g. think-aloud protocols
The concept bridging the content and substantive aspects is representativeness.
Representativeness has two distinct meanings:
  Mental representation (cognitive psychology)
  The Brunswikian sense of ecological sampling: the correlation between a cue and a property (e.g. the color of a banana is a cue that indicates the ripeness of the fruit)

STRUCTURAL ASPECT
Related to scoring
The scoring criteria and rubrics should be rationally developed (based on the constructs)

GENERALIZABILITY
Interpretations should not be limited to the task assessed
They should be generalizable to the construct domain (degree of correlation between the assessed task and other tasks)

EXTERNAL VARIABLES
The scores’ relationship with other measures and non-assessment behaviours
Convergent evidence (correspondence between measures of the same construct) and discriminant evidence (distinctness from measures of other constructs) are both important

CONSEQUENCES
Evaluating the intended and unintended consequences of score interpretation, including both positive and negative impact
But negative impact should NOT result from construct underrepresentation or construct-irrelevant variance.
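
The convergent and discriminant evidence described under EVIDENCE-RELATED VALIDITY and EXTERNAL VARIABLES can be illustrated with a small correlation check. This is only a minimal sketch: the three score lists are hypothetical numbers invented for illustration (they are not from any real test), the names pearson_r, reading_test_a, reading_test_b and math_test are assumptions, and a real validation study would need far more data and more than correlation coefficients.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of deviations from the means
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores of the same six test takers (invented for illustration)
reading_test_a = [52, 60, 65, 70, 78, 85]  # the test being validated
reading_test_b = [50, 63, 61, 72, 80, 83]  # another test of the same construct
math_test = [70, 55, 80, 62, 75, 58]       # a test of a different construct

# Convergent evidence: measures of the same construct should correlate strongly
convergent_r = pearson_r(reading_test_a, reading_test_b)

# Discriminant evidence: measures of different constructs should correlate weakly
discriminant_r = pearson_r(reading_test_a, math_test)

print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

With these invented scores, the two reading tests correlate strongly and the reading and math tests correlate only weakly, which is the pattern convergent and discriminant evidence looks for. As the notes stress, such coefficients alone are not enough to assume validity.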

FACETS OF VALIDITY AS A PROGRESSIVE MATRIX (MESSICK, 1995, P. 748)
Two facets:
  (a) justification of the testing, based on score meaning or on consequences contributing to score valuation
  (b) function or outcome of the testing, as interpretation or applied use
When they are crossed with each other, a four-fold classification is obtained:

                      Test Interpretation            Test Use
Evidential Basis      Construct Validity (CV)        CV + Relevance/Utility (R/U)
Consequential Basis   CV + Value Implication (VI)    CV + R/U + VI + Social Consequences

Construct validity appears in every cell of the matrix. This means:
  Validity issues are unified into a unitary concept
  But the distinct features of construct validity should also be emphasized
What is the implication here?
Both meaning and values are intertwined in the validation process. Thus, ‘Validity and values are one imperative, not two, and test validation implicates both the science and the ethics of assessment, which is why validity has force as a social value’ (Messick, 1995, p. 749).

CONSEQUENTIAL VALIDITY & WASHBACK
Messickian view (unified version) of construct validity = considering the consequences of test use (i.e., washback)
What does this mean in validity studies?
Washback is a particular instance of the consequential aspect of construct validity
Investigating washback and other consequences is a crucial step in the process of test validation
i.e., washback is one (not the only) indicator of the consequential aspect of validity
It is important to investigate washback to establish the validity of a test
To put it differently:
  The modern paradigm of validity comes with its consequential nature
  Test impact is part of a validation argument
Thus, effect-driven testing should be considered: testers should build tests with the intended effects in mind

To put it all together:
Value implication + Social consequences = CONSEQUENTIAL VALIDITY (two fairness-related elements of Messick’s consequential validity)

IMPLICATION
Positive washback: consequential validity; promoting learning
Negative washback: lack of validity; unfairness
But who brings about washback (positive or negative)? People in classrooms (T / Ss)? Test developers?
For Fulcher and Davidson, it is the people in classrooms
Thus more attention should be given to teachers’ beliefs about teaching and learning and to the degree of their PROFESSIONALISM

TASK A9.2 Course book (p.143)
Select one large-scale test you are familiar with. What is its influence, and upon whom? Does it seem reasonable to define such tests by their influence as well?