Academic Year 2019: Intercultural Language Education Assessment Theory

Chapter 1

Mediating Assessment Innovation: Why Stakeholder Perspectives Matter

1.4 Assessment Validation

1.4.1 Fundamental Considerations

- assessment is the basis for the evolution of teaching and learning processes

Assessment Stakes

- low-stakes assessment has a small effect on stakeholders and the teaching/learning process = small consequences

- high-stakes assessment (e.g., one that can lead to 'failing the grade', so negative consequences are included) has a great effect on stakeholders

- As Shohamy (2001b) explains, assessment results have substantial social consequences. High-stakes decisions (winner/loser, acceptance/rejection) are usually made based on limited media (grades, marks, percentages or comments).

- Tests are often experienced as unpleasant: a source of anger, pressure, competition and humiliation. Test takers feel that tests do not measure real knowledge, and they are given no proper explanation of why tests are important. Learning is supposed to be fun and rewarding, so testing feels like a betrayal (Shohamy, 2007).

- It is important that assessment results closely represent students' abilities.

- In foreign language education, a test should be an opportunity for its takers to present their "best" performance. A test also needs to discriminate accurately between different levels.

1.4.2 The Contributions of Assessment Score Evidence to a Validity Argument

- To test the validity and reliability of newly proposed assessments, a psychometric model (a measurement approach drawn from psychology) can be applied.

- construct validity = agreement between a test score or measure and the quality it is believed to measure (Do the scores of a test tell us how well students are able to perform in a certain activity?)

- reliability: how scores are awarded and whether the assessment process is consistent

- parallel forms: comparison with a different assessment of the same construct

- test-retest reliability: do results differ when the same assessment is completed at a different time?

- fairness: the construct is clearly defined, meaningfully operationalized and the scores are reliable.

- If we guarantee the fairness of a test, high-stakes decisions become more beneficial and can be made with peace of mind.

- Do performance outcomes alone provide sufficient evidence for evaluating fairness?
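The reliability checks listed in this section (test-retest and parallel forms) are commonly quantified as a correlation between two sets of scores from the same test takers. The following is a minimal sketch of that idea, not a procedure taken from the chapter: the function name `pearson_r` and all score lists are hypothetical and invented purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five students (invented data).
first_sitting = [12, 15, 9, 18, 14]
second_sitting = [13, 14, 10, 17, 15]  # same assessment, later date
parallel_form = [11, 16, 8, 19, 13]    # different assessment, same construct

test_retest_r = pearson_r(first_sitting, second_sitting)
parallel_forms_r = pearson_r(first_sitting, parallel_form)

print(f"test-retest r    = {test_retest_r:.2f}")
print(f"parallel forms r = {parallel_forms_r:.2f}")
```

A coefficient close to 1 suggests that students are ranked consistently across the two sets of scores. As the question above implies, a high coefficient only shows that scores are consistent; it does not show that the assessment is fair or that its consequences are acceptable.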

1.4.3 The Limitations of Assessment Score Evidence to a Validity Argument

- assessment in practice is often unpredictable, and frequent assessment innovation can frustrate teachers.

- Norris (2008) argues that assessment validity evaluation requires more than a small sample of data. McNamara and Roever (2006) criticize the psychometric approach as being too disconnected from the reality of "far-reaching and unanticipated social consequences."

- In high-stakes decisions, ranking or grading (abbreviating results into a numerical score) is unavoidable, so the validity of ranking/grading itself will not be discussed.

- McNamara and Roever (2006) and others suggest that performance scores are not sufficient evidence for the validity of an assessment.

1.4.4 Towards a Broader Understanding of Assessment Validation

- several reasons for measurement errors:

- construct under-representation: the assessment task does not include important aspects of the construct

- construct-irrelevant variance: the assessment includes variables that are not relevant to the construct (when some aspects of the task make the task irrelevantly easy / difficult for some test takers)

converse:

- positive effect on test takers' performance

- not fully relevant to speaking proficiency

- created with focus on reliability and validity

interact:

- more representative of spoken communication

- peer interaction might cause variance in difficulty, causing one or both test takers to under-perform

- therefore, validity should have both an evidential (scores etc.) and a consequential (value implications and social consequences) basis

- the opposing view is represented by Newton and Shaw (2014): ethical concern about the impacts of assessment takes attention away from the 'simple' focus on scores.

- not only the conditions within the test, but also the conditions around it (teaching and learning process, assessment process) should be considered.

- assessment results (performance outcome) alone can't provide enough evidence for the validity of interact.

1.4.5 A Qualitative Perspective on Assessment Validation

- Lazaraton (2002) argues that the most important development in language testing in the 1990s was the introduction of qualitative research methods. There are multiple qualitative approaches. These methods help researchers better understand phenomena in qualitative data, and should play a bigger role in applied linguistics (Lazaraton, 2002).

- stakeholders' judgements about an assessment are important for determining consequential validity. Teachers' judgements are particularly important, as teachers have insight into administering tests and into the consequences of assessment.

- The positive effects of tests can also be promoted by involving test takers in the design and development of the test.

- Fundamental questions posed by this book:

- What are teachers and students making of interact?

- What is working, what is not working, what could work better?

- What are the implications, both for on-going classroom practice and for on-going evaluation of the assessment?

1.6 Conclusion

- ideological and theoretical perspectives affect language teaching, but we need to confirm whether the resulting innovations are truly beneficial

- central control: students change their behavior to match the test form → authorities use tests to bring students' behavior into line with their own priorities (Shohamy, 2001b)

- proponents of a new assessment method must provide a convincing justification. However, quantitative data alone are no longer enough for such a justification (McNamara, 1997).

Discussion Points

(1) Try to think of actual examples of construct under-representation and construct-irrelevant variance.

(2) If interact scores depend on the conversational partner, can we consider this test to lack the quantitative property of reliability, since the outcome will be different each time? Is this really a quantitative / qualitative issue?