Principles of Language Assessment

In evaluating a test, there are five cardinal criteria: practicality, reliability, validity, authenticity, and washback. We will look at each one in turn; the order of presentation implies no order of priority.

Practicality
An effective test is practical. This means that it

is not excessively expensive,
stays within appropriate time constraints,
is relatively easy to administer, and
has a scoring/evaluation procedure that is specific and time-efficient.

Reliability
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The reliability of a test may best be addressed by considering a number of factors that can contribute to its unreliability.
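One common way to put a number on "similar results on two occasions" is to correlate the two sets of scores. The sketch below is only an illustration: the score lists are made-up data, and the helper name pearson_r is an assumption of this example, not something taken from the source text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary on both occasions")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five students on two occasions.
first_sitting = [72, 85, 60, 90, 78]
second_sitting = [70, 88, 62, 91, 75]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {test_retest_r:.2f}")
```

A value close to 1 suggests that students were ranked in much the same way on both occasions; a low value points to one of the sources of unreliability discussed in this section.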

A. Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, anxiety and other physical or psychological factors, which may make an "observed" score deviate from one's "true" score.
B. Rater Reliability
Human error, subjectivity and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of inattention to the scoring criteria, inexperience, or even preconceived biases.
Rater-reliability issues are not limited to contexts where two or more scorers are involved. Intra-rater unreliability, where a single scorer is inconsistent, is a common occurrence for classroom teachers because of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness.
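Agreement between two sets of ratings can be checked with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch, not a method prescribed by the source: the band scores are invented, and the function name cohens_kappa is an assumption of this example.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(ratings_a)
    # Proportion of scripts on which the two ratings agree exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rating set's category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both sets use one and the same category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical band scores (1-5) given by two raters to the same eight essays.
rater_a = [3, 4, 2, 5, 3, 4, 2, 3]
rater_b = [3, 4, 3, 5, 3, 3, 2, 3]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

The same calculation could be applied to a single teacher's scores from two separate marking passes over the same papers, which would give a rough check on that teacher's own consistency.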
C. Test Administration Reliability
Unreliability may also result from the conditions in which the test is administered. For example, in one administration of a test of aural comprehension, a tape recorder played the items, but because of street noise outside the building, students sitting next to windows could not hear the tape accurately. This was a clear case of unreliability caused by the conditions of the test administration.
D. Test Reliability
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well under a time limit. We all know people who "know" the course material perfectly, but who are adversely affected by the presence of a clock ticking away. Poorly written test items may be a further source of test unreliability.

Validity
By far the most complex criterion of an effective test, and arguably the most important principle, is validity: "the extent to which inferences made from assessment results are appropriate, meaningful and useful in terms of the purpose of the assessment" (Gronlund, 1998).

How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. We will look at five types of evidence below.

1. Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003).
2. Criterion-Related Evidence
    Criterion-related evidence usually falls into one of two categories:

    a. Concurrent validity
A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
    b. Predictive validity
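The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, language aptitude tests and the like. Here the aim is not to measure concurrent ability but to assess (and predict) a test-taker's likelihood of future success.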
3. Construct-Related Evidence
Construct-related evidence does not play as large a role for classroom teachers. Constructs may or may not be directly or empirically measured; their verification often requires inferential data.
4. Consequential Validity
Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test's interpretation and use.
5. Face Validity
"Face validity refers to the degree to which a test looks right and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers" (Mousavi, 2002).

Authenticity
Bachman and Palmer (1996) define authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task," and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.

Washback
Washback generally refers to the effects that tests have on instruction in terms of how students prepare for the test. Reporting numerical scores for the various subsections of a test can also give students a little beneficial washback.


Further reading:
Brown, H. D. (2004). Language Assessment: Principles and Classroom Practices. New York: Pearson Longman.