3 Kinds of Tests and Testing
This
material begins by considering the purposes for which language testing is carried out. It
goes on to make a number of distinctions: between direct and indirect testing,
between discrete point and integrative testing, between norm-referenced and
criterion-referenced testing, and between objective and subjective testing.
Tests
can be categorised according to the types of information they provide. This
categorisation will prove useful both in deciding whether an existing test is
suitable for a particular purpose and in writing appropriate new tests where
these are necessary. The four types of test which we will discuss in the
following sections are: proficiency tests, achievement tests, diagnostic tests,
and placement tests.
Proficiency tests
Proficiency tests are
designed to measure people's ability in a language, regardless of any training
they may have had in that language. The content of a
proficiency test, therefore, is not based on the content or objectives of
language courses that people taking the test may have followed. Rather, it is
based on a specification of what candidates have to be able to do in the
language in order to be considered proficient. This raises the question of what
we mean by the word 'proficient'.
In
the case of some proficiency tests, 'proficient' means having sufficient
command of the language for a particular purpose. An example of this would be a
test designed to discover whether someone can function successfully as a United
Nations translator. Another example would be a test used to determine whether a
student's English is good enough to follow a course of study at a British
university. Such a test may even attempt to take into account the level and
kind of English needed to follow courses in particular subject areas. It might,
for example, have one form of the test for arts subjects, another for sciences,
and so on. Whatever the particular purpose to which the language is to be put,
this will be reflected in the specification of test content at an early stage
of a test's development.
There
are other proficiency tests which, by contrast, do not have an occupation or
course of study in mind. For them the concept of proficiency is more general.
British examples of these would be the Cambridge First Certificate in English
examination (FCE) and the Cambridge Certificate of Proficiency in English
examination (CPE). The function of such tests is to show whether candidates
have reached a certain standard with respect to a set of specified abilities.
The examining bodies responsible for such tests are independent of teaching
institutions and so can be relied on by potential employers, etc. to make fair
comparisons between candidates from different institutions and different
countries. Though there is no particular purpose in mind for the language,
these general proficiency tests should have detailed specifications saying just
what it is that successful candidates have demonstrated that they can do. Each
test should be seen to be based directly on these specifications. All users of
a test (teachers, students, employers, etc.) can then judge whether the test is
suitable for them, and can interpret test results. It is not enough to have
some vague notion of proficiency, however prestigious the testing body concerned.
The Cambridge examinations referred to above are linked to levels in the ALTE
(Association of Language Testers in Europe) framework, which draws heavily on
the work of the Council of Europe (see Further Reading).
Despite
differences between them of content and level of difficulty, all proficiency
tests have in common the fact that they are not based on courses that
candidates may have previously taken. On the other hand, as we saw in Chapter
1, such tests may themselves exercise considerable influence over the method
and content of language courses. Their backwash effect - for this is what it is
- may be beneficial or harmful. In my view, the effect of some widely used
proficiency tests is more harmful than beneficial. However, the teachers of
students who take such tests, and whose work suffers from a harmful backwash
effect, may be able to exercise more influence over the testing organisations
concerned than they realise. The supplementing of TOEFL with a writing test,
referred to in Chapter 1, is a case in point.
Achievement tests
Most
teachers are unlikely to be responsible for proficiency tests. It is much more
probable that they will be involved in the preparation and use of achievement
tests. In contrast to proficiency tests, achievement
tests are directly related to
language courses, their purpose being to establish how successful individual
students, groups of students, or the courses themselves have been in achieving
objectives. They are of two kinds: final
achievement tests and progress
achievement tests.
Final
achievement tests are those administered at
the end of a course of study. They may be written and administered by
ministries of education, official examining boards, or by members of teaching
institutions. Clearly the content of these tests must be related to the courses
with which they are concerned, but the nature of this relationship is a matter
of disagreement amongst language testers.
In
the view of some testers, the content of a final achievement test should be
based directly on a detailed course syllabus or on the books and other
materials used. This has been referred to as the syllabus-content approach. It
has an obvious appeal, since the test only contains what it is thought that the
students have actually encountered, and thus can be considered, in this respect
at least, a fair test. The disadvantage is that if the syllabus is badly
designed, or the books and other materials are badly chosen, the results of a
test can be very misleading. Successful performance on the test may not truly
indicate successful achievement of course objectives. For example, a course may
have as an objective the development of conversational ability, but the course
itself and the test may require students only to utter carefully prepared
statements about their home town, the weather, or whatever. Another course may
aim to develop a reading ability in German, but the test may limit itself to
the vocabulary the students are known to have met. Yet another course is
intended to prepare students for university study in English, but the syllabus
(and so the course and the test) may not include listening (with note taking)
to English delivered in lecture style on topics of the kind that the students
will have to deal with at university. In each of these examples - all of them
based on actual cases - test results will fail to show what students have
achieved in terms of course objectives.
The
alternative approach is to base the test content directly on the objectives of
the course. This has a number of advantages. First, it compels course designers
to be explicit about objectives. Secondly, it makes it possible for performance
on the test to show just how far students have achieved those objectives. This
in turn puts pressure on those responsible for the syllabus and for the
selection of books and materials to ensure that these are consistent with the
course objectives. Tests based on objectives work against the perpetuation of
poor teaching practice, something which course-content-based tests, almost as
if part of a conspiracy, fail to do. It is my belief that to base test content
on course objectives is much to be preferred; it will provide more accurate
information about individual and group achievement, and it is likely to promote
a more beneficial backwash effect on teaching.
Now
it might be argued that to base test content on objectives rather than on
course content is unfair to students. If the course content does not fit well
with objectives, they will be expected to do things for which they have not
been prepared. In a sense this is true. But in another sense it is not. If a
test is based on the content of a poor or inappropriate course, the students
taking it will be misled as to the extent of their achievement and the quality
of the course. Whereas if the test is based on objectives, not only will the
information it gives be more useful, but there is less chance of the course
surviving in its present unsatisfactory form. Initially some students may
suffer, but future students will benefit from the pressure for change. The
long-term interests of students are best served by final achievement tests
whose content is based on course objectives.
The
reader may wonder at this stage whether there is any real difference between
final achievement tests and proficiency tests. If a test is based on the
objectives of a course, and these are equivalent to the language needs on which
a proficiency test is based, there is no reason to expect a difference between
the form and content of the two tests. Two things have to be remembered,
however. First, objectives and needs will not typically coincide in this way.
Secondly, many achievement tests are not in fact based on course objectives.
These facts have implications both for the users of test results and for test
writers. Test users have to know on what basis an achievement test has been
constructed, and be aware of the possibly limited validity and applicability of
test scores. Test writers, on the other hand, must create achievement tests
that reflect the objectives of a particular course, and not expect a general
proficiency test (or some imitation of it) to provide a satisfactory
alternative.
Progress
achievement tests, as their name suggests, are
intended to measure the progress that students are making. They contribute to
formative assessment (referred to in Chapter 1). Since 'progress' is towards
the achievement of course objectives, these tests, too, should relate to
objectives. But how? One way of measuring progress would be repeatedly to
administer final achievement tests, the (hopefully) increasing scores
indicating the progress made. This is not really feasible, particularly in the
early stages of a course. The low scores obtained would be discouraging to
students and quite possibly to their teachers. The alternative is to establish
a series of well-defined short-term objectives. These should make a clear
progression towards the final achievement test based on course objectives. Then
if the syllabus and teaching are appropriate to these objectives, progress
tests based on short-term objectives will fit well with what has been taught.
If not, there will be pressure to create a better fit. If it is the syllabus
that is at fault, it is the tester's responsibility to make clear that it is
there that change is needed, not in the tests.
In
addition to more formal achievement tests that require careful preparation,
teachers should feel free to set their own 'pop quizzes'. These serve both to
make a rough check on students' progress and to keep students on their toes.
Since such tests will not form part of formal assessment procedures, their
construction and scoring need not be too rigorous. Nevertheless, they should be
seen as measuring progress towards the intermediate objectives on which the
more formal progress achievement tests are based. They can, however, reflect
the particular 'route' that an individual teacher is taking towards the
achievement of objectives.
It
has been argued in this section that it is better to base the content of
achievement tests on course objectives rather than on the detailed content of a
course. However, it may not be at all easy to convince colleagues of this,
especially if the latter approach is already being followed. Not only is there
likely to be natural resistance to change, but such a change may represent a
threat to many people. A great deal of skill, tact and, possibly, political
manoeuvring may be called for – topics on which this book cannot pretend to
give advice.
Diagnostic tests
Diagnostic tests are used
to identify learners' strengths and weaknesses.
They are intended primarily to ascertain what learning still needs to take
place. At the level of broad language skills this is reasonably
straightforward. We can be fairly confident of our ability to create tests that
will tell us that someone is particularly weak in, say, speaking as opposed to
reading in a language. Indeed existing proficiency tests may often prove
adequate for this purpose.
We
may be able to go further, and analyse samples of a person's performance in
writing or speaking in order to create profiles of the student's ability with
respect to such categories as 'grammatical accuracy' or 'linguistic
appropriacy'. Indeed Chapters 9 and 10 suggest that raters of writing and oral
test performance should provide feedback to the test takers as a matter of
course.
But it is not so easy to obtain a
detailed analysis of a student's command of grammatical structures - something
that would tell us, for example, whether she or he had mastered the present
perfect/past tense distinction in English. In order to be sure of this, we
would need a number of examples of the choice the student made between the two
structures in every different context that we thought was significantly
different and important enough to warrant obtaining information on. A single
example of each would not be enough, since a student might give the correct
response by chance. Similarly, if one wanted to test control of the English
article system, one would need several items for each of the twenty or so uses
of the articles (including the 'zero' article) listed in Collins Cobuild
English Usage (1992). Thus, a comprehensive diagnostic test of English grammar
would be vast (think of what would be involved in testing the modal verbs, for
instance). The size of such a test would make it impractical to administer in a
routine fashion. For this reason, very few tests are constructed for purely
diagnostic purposes, and those that there are tend not to provide very detailed
or reliable information.
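The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: the figure of twenty uses comes from the text above, but the value of three items per use (standing in for "several") and the function name `items_needed` are assumptions made for the example.

```python
def items_needed(distinct_uses: int, items_per_use: int) -> int:
    """Total items required if every distinct use is tested several times."""
    if distinct_uses < 0 or items_per_use < 0:
        raise ValueError("counts must be non-negative")
    return distinct_uses * items_per_use


# About twenty uses of the articles (from the text); three items per use
# is an assumed value, chosen only to stand in for "several".
article_items = items_needed(distinct_uses=20, items_per_use=3)
print(article_items)  # 60 items for the article system alone
```

Repeating the same calculation for every other area of grammar shows how quickly a comprehensive diagnostic test would grow beyond what can be administered routinely.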
The
lack of good diagnostic tests is unfortunate. They could be extremely useful
for individualised instruction or self-instruction. Learners would be shown
where gaps exist in their command of the language, and could be directed to
sources of information, exemplification and practice. Happily, the ready
availability of relatively inexpensive computers with very large memories
should change the situation. Well-written computer programs will ensure that
the learner spends no more time than is absolutely necessary to obtain the
desired information, and without the need for a test administrator. Tests of
this kind will still need a tremendous amount of work to produce. Whether or
not they become generally available will depend on the willingness of
individuals to write them and of publishers to distribute them. In the
meantime, there is at least one very interesting web-based development,
DIALANG. Still at the trialling stage as I write this, this project is planned
to offer diagnostic tests in fourteen European languages, each having five modules:
reading, writing, listening, grammatical structures, and vocabulary.
Placement tests
Placement
tests, as their name suggests, are
intended to provide information that will help to place students at the stage
(or in the part) of the teaching programme most appropriate to their abilities.
Typically they are used to assign students to classes at different levels.
Placement tests can be bought, but this is to be recommended only when the
institution concerned is sure that the test being considered suits its
particular teaching programme. No one placement test will work for every
institution, and the initial assumption about any test that is commercially
available must be that it will not work well. One possible exception is
placement tests designed for use by language schools, where the similarity of
popular textbooks used in them means that the schools' teaching programmes
also tend to resemble each other.
The
placement tests that are most successful are those constructed for particular
situations. They depend on the identification of the key features at different
levels of teaching in the institution. They are tailor-made rather than bought
off the peg. This usually means that they have been produced 'in house'. The
work that goes into their construction is rewarded by the saving in time and
effort through accurate placement. An example of how a placement test might be
developed is given in Chapter 7; the validation of placement tests is referred
to in Chapter 4.
Direct versus Indirect Testing
So far in this chapter we have
considered a number of uses to which test results are put. We now distinguish
between two approaches to test construction.
Testing
is said to be direct when it
requires the candidate to perform precisely the skill that we wish to measure.
If we want to know how well candidates can write compositions, we get them to
write compositions. If we want to know
how well they pronounce a language, we get them to speak. The tasks, and the
texts that are used, should be as authentic as possible. The fact that
candidates are aware that they are in a test situation means that the tasks
cannot be really authentic. Nevertheless every effort is made to make them as
realistic as possible.
Direct
testing is easier to carry out when it is intended to measure the productive
skills of speaking and writing. The very acts of speaking and writing provide
us with information about the candidate's ability. With listening and reading,
however, it is necessary to get candidates not only to listen or read but also to
demonstrate that they have done this successfully. Testers have to devise
methods of eliciting such evidence accurately and without the method
interfering with the performance of the skills in which they are interested.
Appropriate methods for achieving this are discussed in Chapters 11 and 12.
Interestingly enough, in many texts on language testing it is the testing of
productive skills that is presented as being most problematic, for reasons
usually connected with reliability. In fact these reliability problems are by
no means insurmountable, as we shall see in Chapters 9 and 10.
Direct
testing has a number of attractions. First, provided that we are clear about
just what abilities we want to assess, it is relatively straightforward to
create the conditions which will elicit the behaviour on which to base our
judgements. Secondly, at least in the case of the productive skills, the
assessment and interpretation of students' performance is also quite
straightforward. Thirdly, since practice for the test involves practice of the
skills that we wish to foster, there is likely to be a helpful backwash effect.
Indirect testing
attempts to measure the abilities that
underlie the skills in which we are interested. One section of the TOEFL, for
example, was developed as an indirect measure of writing ability. It contains
items of the following kind where the candidate has to identify which of the
underlined elements is erroneous or inappropriate in formal standard English:
At first the old woman seemed unwilling to accept anything that was offered her
by my friend and I.
While
the ability to respond to such items has been shown to be related statistically
to the ability to write compositions (although the strength of the relationship
was not particularly great), the two abilities are far from being identical.
Another example of indirect testing is Lado's (1961) proposed method of testing
pronunciation ability by a paper and pencil test in which the candidate has to
identify pairs of words which rhyme with each other.
Perhaps
the main appeal of indirect testing is that it seems to offer the possibility
of testing a representative sample of a finite number of abilities which
underlie a potentially indefinitely large number of manifestations of them. If,
for example, we take a representative sample of grammatical structures, then,
it may be argued, we have taken a sample which is relevant for all the
situations in which control of grammar is necessary. By contrast, direct
testing is inevitably limited to a rather small sample of tasks, which may call
on a restricted and possibly unrepresentative range of grammatical structures.
On this argument, indirect testing is superior to direct testing in that its
results are more generalisable.
The
main problem with indirect tests is that the relationship between performance
on them and performance of the skills in which we are usually more interested
tends to be rather weak in strength and uncertain in nature. We do not yet know
enough about the component parts of, say, composition writing to predict
accurately composition writing ability from scores on tests that measure the
abilities that we believe underlie it. We may construct tests of grammar,
vocabulary, discourse markers, handwriting, punctuation, and what we will. But
we will still not be able to predict accurately scores on compositions (even if
we make sure of the validity of the composition scores by having people write
many compositions and by scoring these in a valid and highly reliable way).
It
seems to me that in our present state of knowledge, at least as far as
proficiency and final achievement tests are concerned, it is preferable to rely
principally on direct testing. Provided that we sample reasonably widely (for
example require at least two compositions, each calling for a different kind of
writing and on a different topic), we can expect more accurate estimates of the
abilities that really concern us than would be obtained through indirect
testing. The fact that direct tests are generally easier to construct simply
reinforces this view with respect to institutional tests, as does their greater
potential for beneficial backwash. It is only fair to say, however, that many
testers are reluctant to commit themselves entirely to direct testing and will
always include an indirect element in their tests. Of course, to obtain
diagnostic information on underlying abilities, such as control of particular
grammatical structures, indirect testing may be perfectly appropriate.
Before
ending this section, it should be mentioned that some tests are referred to as semi-direct.
The most obvious examples of these are speaking tests where candidates
respond to tape-recorded stimuli, with their own responses being recorded and
later scored. These tests are semi-direct in the sense that, although not
direct, they simulate direct testing.
Discrete Point versus Integrative Testing
Discrete
point testing refers to the testing of one element at
a time, item by item. This might, for example, take the form of a series of
items, each testing a particular grammatical structure. Integrative testing,
by contrast, requires the candidate to combine many language elements in
the completion of a task. This might involve writing a composition, making
notes while listening to a lecture, taking a dictation, or completing a cloze
passage. Clearly this distinction is not unrelated to that between indirect and
direct testing. Discrete point tests will almost always be indirect, while
integrative tests will tend to be direct. However, some integrative testing methods,
such as the cloze procedure, are indirect. Diagnostic tests of grammar of the
kind referred to in an earlier section of this chapter will tend to be discrete
point.
Norm-Referenced versus Criterion-Referenced Testing
Imagine
that a reading test is administered to an individual student. When we ask how
the student performed on the test, we may be given two kinds of answer. An
answer of the first kind would be that the student obtained a score that placed
her or him in the top 10 per cent of candidates who have taken that test, or in
the bottom 5 per cent; or that she or he did better than 60 per cent of those
who took it. A test which is designed to give this kind of information is said
to be norm-referenced. It
relates one candidate's performance to that of other candidates. We are not
told directly what the student is capable of doing in the language.
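A norm-referenced statement such as "did better than 60 per cent of those who took it" can be worked out as in the following sketch. The scores and the function name `percent_outperformed` are hypothetical, introduced only for illustration.

```python
def percent_outperformed(score: float, all_scores: list[float]) -> float:
    """Percentage of candidates whose score is strictly below `score`."""
    if not all_scores:
        raise ValueError("need at least one score")
    below = sum(1 for s in all_scores if s < score)
    return 100.0 * below / len(all_scores)


# Hypothetical scores for ten candidates on the same reading test.
scores = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
print(percent_outperformed(27, scores))  # 60.0
```

The same raw score of 27 would give a different figure with a different group of candidates, which is why the result says nothing directly about what the student can do.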
The
other kind of answer we might be given is exemplified by the following, taken
from the Interagency Language Roundtable (ILR) language skill level descriptions
for reading:
Sufficient
comprehension to read simple, authentic written materials in a form equivalent
to usual printing or typescript on subjects within a familiar context. Able to
read with some misunderstandings straightforward, familiar, factual material,
but in general insufficiently experienced with the language to draw inferences
directly from the linguistic aspects of the text. Can locate and understand the
main ideas and details in materials written for the general reader . . . The individual
can read uncomplicated but authentic prose on familiar subjects that are
normally presented in a predictable sequence which aids the reader in
understanding. Texts may include descriptions and narrations in contexts such
as news items describing frequently occurring events, simple biographical
information, social notices, formulaic business letters, and simple technical
information written for the general reader. Generally the prose that can be read
by the individual
is predominantly in straightforward/high-frequency sentence patterns. The
individual does not have a broad active vocabulary . . . but is able to use
contextual and real-world clues to understand the text.
Similarly,
a candidate who is awarded the Berkshire Certificate of Proficiency in German
Level 1 can 'speak and react to others using simple language in the following
contexts':
· to greet, interact with and take leave of others;
· to exchange information on personal background, home, school life and interests;
· to discuss and make choices, decisions and plans;
· to express opinions, make requests and suggestions;
· to ask for information and understand instructions.
In these two cases we learn nothing
about how the individual’s performance compares with that of other candidates.
Rather we learn something about what he or she can actually do in the language.
Tests that are designed to provide this kind of information directly are said
to be criterion-referenced.
The
purpose of criterion-referenced tests is to classify people according to whether
or not they are able to perform some task or set of tasks satisfactorily. The
tasks are set, and the performances are evaluated. It does not matter in
principle whether all the candidates are successful, or none of the candidates
is successful. The tasks are set, and those who perform them satisfactorily
'pass'; those who don't, 'fail'. This means that students are encouraged to
measure their progress in relation to meaningful criteria, without feeling
that, because they are less able than most of their fellows, they are destined
to fail. In the case of the Berkshire German Certificate, for example, it is
hoped that all students who are entered for it will be successful.
Criterion-referenced tests therefore have two positive virtues: they set meaningful
standards in terms of what people can do, which do not change with different
groups of candidates, and they motivate students to attain those standards.
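By way of contrast with norm-referenced scoring, the sketch below classifies each candidate against fixed tasks only. The task names are loosely modelled on the 'can do' list above, and the rule that every task must be performed satisfactorily in order to pass is an assumption made for the example, not something the source specifies.

```python
def classify(task_results: dict[str, bool]) -> str:
    """Return 'pass' only if every set task was performed satisfactorily."""
    if not task_results:
        raise ValueError("at least one task must be set")
    return "pass" if all(task_results.values()) else "fail"


# Hypothetical candidates and outcomes.
candidates = {
    "A": {"greet and take leave": True, "exchange personal information": True},
    "B": {"greet and take leave": True, "exchange personal information": False},
}
results = {name: classify(tasks) for name, tasks in candidates.items()}
print(results)  # {'A': 'pass', 'B': 'fail'}
```

Each result depends only on that candidate's own performance, so in principle every candidate could pass, or none.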
The
need for direct interpretation of performance means that the construction of a
criterion-referenced test may be quite different from that of a norm-referenced
test designed to serve the same purpose. Let us imagine that the purpose is to
assess the English language ability of students in relation to the demands made
by English medium universities. The criterion-referenced test would almost
certainly have to be based on an analysis of what students had to be able to do
with or through English at university. Tasks would then be set similar to those
to be met at university. If this were not done, direct interpretation of
performance would be impossible. The norm-referenced test, on the other hand,
while its content might be based on a similar analysis, is not so restricted.
The Michigan Test of English Language Proficiency, for instance, has multiple choice
grammar, vocabulary, and reading comprehension components. A candidate's score
on the test does not tell us directly what his or her English ability is in
relation to the demands that would be made on it at an English medium
university. To know this, we must consult a table which makes recommendations
as to the academic load that a student with that score should be allowed to
carry, this being based on experience over the years of students with similar
scores, not on any meaning in the score itself. In the same way, university
administrators have learned from experience how to interpret TOEFL scores and
to set minimum scores for their own institutions. The fact that these minimum
scores can be thought of as criterial for entry does not, however, make the
TOEFL criterion-referenced.
Books
on language testing have tended to give advice which is more appropriate to
norm-referenced testing than to criterion-referenced testing. One reason for
this may be that procedures for use with norm-referenced tests (particularly
with respect to such matters as the analysis of items and the estimation of
reliability) are well established, while those for criterion-referenced tests
are not. The view taken in this book, and argued for in Chapter 6, is that
criterion-referenced tests are often to be preferred, not least for the
beneficial backwash effect they are likely to have. The lack of agreed
procedures for such tests is not sufficient reason for them to be excluded from
consideration. Chapter 5 presents one method of estimating the consistency
(more or less equivalent to 'reliability') of criterion-referenced tests.
The
Council of Europe publications referred to in Further Reading are a valuable
resource for those wishing to write specifications for criterion-referenced tests.
The highly detailed learning objectives specified in those publications,
expressed in terms of notions and functions, lend themselves readily to the
writing of 'can do' statements, which can be included in test specifications.
Objective Testing versus Subjective Testing
The
distinction here is between methods of scoring, and nothing else. If no
judgement is required on the part of the scorer, then the scoring is objective.
A multiple choice test, with the correct responses unambiguously
identified, would be a case in point. If judgement is called for, the scoring
is said to be subjective. There are different degrees of subjectivity in
testing. The impressionistic scoring of a composition may be considered more
subjective than the scoring of short answers in response to questions on a
reading passage.
Objectivity
in scoring is sought after by many testers, not for itself, but for the greater
reliability it brings. In general, the less subjective the scoring, the greater
agreement there will be between two different scorers (and between the scores
of one person scoring the same test paper on different occasions). However,
there are ways of obtaining reliable subjective scoring, even of compositions.
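The link between objectivity and agreement between scorers can be illustrated with a rough sketch. The marks are invented, and the proportion of exact agreement used here is a deliberately crude index chosen for simplicity; it should not be taken as the reliability estimate discussed elsewhere in the book.

```python
def exact_agreement(scorer_a: list[int], scorer_b: list[int]) -> float:
    """Proportion of papers on which two scorers awarded the same mark."""
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("need two equal-length, non-empty lists of marks")
    same = sum(1 for a, b in zip(scorer_a, scorer_b) if a == b)
    return same / len(scorer_a)


# Hypothetical marks awarded to the same five papers by two scorers.
objective_a, objective_b = [8, 6, 9, 5, 7], [8, 6, 9, 5, 7]    # keyed multiple choice
subjective_a, subjective_b = [8, 6, 9, 5, 7], [7, 6, 8, 5, 6]  # impressionistic marking
print(exact_agreement(objective_a, objective_b))    # 1.0
print(exact_agreement(subjective_a, subjective_b))  # 0.4
```

With an unambiguous key the two scorers cannot disagree, whereas impressionistic marking leaves room for the differences shown in the second pair of lists.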
Summarized by:
Moch. Kusen, M.Pd (Lecturer of Language Assessment 1, Nusantara PGRI Kediri University)
Sources: