Designation: F2889 – 11

Standard Practice for Assessing Language Proficiency¹

This standard is issued under the fixed designation F2889; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.

1. Scope

1.1 Purpose – This practice describes best practices for the development and use of language tests in the modalities of speaking, listening, reading, and writing for assessing ability according to the Interagency Language Roundtable (ILR)² scale. This practice focuses on testing language proficiency in use of language for communicative purposes.

1.2 Limitations – This practice is not intended to address testing and test development in the following specialized areas: Translation, Interpretation, Audio Translation, Transcription, other job-specific language performance tests, or Diagnostic Assessment.

1.2.1 Tests developed under this practice should not be used to address any of the above excluded purposes (for example, diagnostics).

2. Referenced Documents

2.1 ASTM Standards:³
F1562 Guide for Use-Oriented Foreign Language Instruction
F2089 Guide for Language Interpretation Services
F2575 Guide for Quality Assurance in Translation

¹ This practice is under the jurisdiction of ASTM Committee F43 on Language Services and Products and is the direct responsibility of Subcommittee F43.04 on Language Testing. Current edition approved May 1, 2011. Published June 2011. DOI: 10.1520/F2889-11.
² Interagency Language Roundtable, Language Skill Level Descriptors (http://www.govtilr.org/Skills/ILRscale1.htm).
³ For referenced ASTM standards, visit the ASTM website, www.astm.org, or contact ASTM Customer Service at service@astm.org. For Annual Book of ASTM Standards volume information, refer to the standard's Document Summary page on the ASTM website.

Copyright ASTM International, 100 Barr Harbor Drive, PO Box C700, West Conshohocken, PA 19428-2959, United States.

3. Terminology

3.1 Definitions:

3.1.1 achievement test, n – an instrument designed to measure what a person has learned within or up to a given time, based on a sampling of what has been covered in the syllabus.
3.1.2 adaptive test, n – a form of individually tailored testing in which test items are selected from an item bank where test items are stored in rank order with respect to their item difficulty and presented to test takers during the test on the basis of their responses to previous items, until it is determined that sufficient information regarding test takers' abilities has been collected. The opposite of a fixed-form test.

3.1.3 authentic texts, n – texts not created for language learning purposes that are taken from newspapers, magazines, etc., and tapes of natural speech taken from ordinary radio or television programs, etc.

3.1.4 calibration, n – the process of determining the scale of a test or tests.

3.1.4.1 Discussion – Calibration may involve anchoring items from different tests to a common difficulty scale (the theta scale). When a test is constructed from calibrated items, then scores on the test indicate the candidate's ability, i.e., their location on the theta scale.

3.1.5 cognitive lab, n – a method for eliciting feedback from examinees with regard to test items.

3.1.5.1 Discussion – Small numbers of examinees take the test, or subsets of the items on the test, and provide extensive feedback on the items by speaking their thought processes aloud as they take the test, answering questionnaires about the items, being interviewed by researchers, or other methods intended to obtain in-depth information about items. These examinees should be similar to the examinees for whom the test is intended. For tests scored by raters, similar techniques are used with raters to obtain information on rubric functioning.

3.1.6 computer adaptive test, n – a test administered by a computer in which the difficulty level of the next item to be presented to test takers is estimated on the basis of their responses to previous items and adapted to match their abilities.
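The item-selection loop described in 3.1.2 and 3.1.6 can be sketched in a few lines of code. The fragment below is illustrative only and is not part of this practice: it assumes a one-parameter (Rasch) item bank, a pick-the-closest-difficulty selection rule, and a deliberately crude ability update; operational adaptive tests use proper IRT estimation (see 3.1.24).

import math
import random

# Hypothetical item bank: each item is represented by its Rasch
# difficulty b on the theta scale (see 3.1.4.1), in rank order.
ITEM_BANK = [-2.0, -1.2, -0.5, 0.0, 0.4, 1.1, 1.8, 2.5]

def p_correct(theta, b):
    # Rasch model: probability that an examinee of ability theta
    # answers an item of difficulty b correctly.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def next_item(theta, remaining):
    # Present the unused item whose difficulty is closest to the
    # current ability estimate (most informative under the Rasch model).
    return min(remaining, key=lambda b: abs(b - theta))

def run_cat(answer, max_items=5):
    # answer(b) -> True/False plays the role of the test taker.
    theta, remaining = 0.0, list(ITEM_BANK)
    for k in range(1, max_items + 1):
        b = next_item(theta, remaining)
        remaining.remove(b)
        # Crude stepwise update: move up after a correct response,
        # down after an incorrect one, with a shrinking step.
        theta += (1.0 if answer(b) else -1.0) / k
    return theta  # provisional location on the theta scale

# Simulate an examinee whose true ability is 0.8.
print(run_cat(lambda b: random.random() < p_correct(0.8, b)))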
3.1.7 construct, n – the knowledge, skill or ability that is being tested.

3.1.7.1 Discussion – The construct provides the basis for a given test or test task and for interpreting scores derived from this task.

3.1.8 constructed response, adj – a type of item or test task that requires test takers to respond to a series of open-ended questions by writing, speaking, or doing something rather than choose answers from a ready-made list.

3.1.8.1 Discussion – The most commonly used types of constructed-response items include fill-in, short-answer, and performance assessment.

3.1.9 content validity, n – a conceptual or non-statistical validity based on a systematic analysis of the test content to determine whether it includes an adequate sample of the target domain to be measured.

3.1.9.1 Discussion – In order to achieve content validity, an adequate sample involves ensuring that all major aspects are covered and in suitable proportions.

3.1.10 criterion-referenced scale, n – a graduated and systematic description of the domain of subject matter that a test is designed to assess; (or) a rating scale that provides for translating test scores into a statement about the behavior to be expected of a person with that score and/or their relationship to a specified subject matter.

3.1.10.1 Discussion – A criterion-referenced test is one that assesses achievement or performance against a cut score that is determined as a reflection of mastery or attainment of specified objectives. Focus is on ability to perform tasks rather than group ranking.

3.1.11 cut score, n – a score that represents achievement of the criterion, the line between success and failure, mastery and non-mastery.

3.1.12 dichotomous scoring, n – scoring based on two categories, e.g., right/wrong, pass/fail. Compare to polytomous scoring.

3.1.13 equated forms, n – two or more forms of a test whose test scores have been transformed onto the same scale so that a comparison across different forms of a test is made possible.
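One common way to place scores from two forms on the same scale, offered here purely as an illustration and not as a requirement of this practice, is linear (mean-sigma) equating, in which a score x on form X is mapped onto the scale of form Y using the two forms' means and standard deviations:

y^* = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)

Here μ_X, μ_Y, σ_X, σ_Y denote the form means and standard deviations (conventional notation, assumed for this example); IRT-based equating through calibrated common items (3.1.4.1) is an alternative approach.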
3.1.14 expert panel, n – a group of target-language experts who take a test under test-like conditions and provide comments about any problem areas.

3.1.14.1 Discussion – An expert panel should include at least 8 members. Panel members receive training before they take the test in order to ensure that their comments will be helpful.

3.1.15 face validity, n – the degree to which a test appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of an observer.

3.1.16 fixed-form test, n – a test whose content does not vary in order to better accommodate to the examinee's level of knowledge, skill, ability or proficiency. The opposite of an adaptive test.

3.1.17 genre, n – a type of discourse that occurs in a particular setting, that has distinctive and recognizable patterns and norms of organization and structure, and that has particular and distinctive communicative functions.

3.1.18 ILR scale, n – a scale of functional language ability of 0 to 5 used by the Interagency Language Roundtable.²

3.1.18.1 Discussion – The range of the ILR scale is from 0 (no knowledge of a language) to 5 (equivalent to a highly educated native speaker).

3.1.19 indirect test, n – a test that measures ability indirectly, rather than directly.

3.1.19.1 Discussion – An indirect test requires examinees to perform tasks that are not directly reflective of an authentic target-language use situation. Inferences are drawn about the abilities underlying the examinees' observed performance on the indirect test.

3.1.20 interpretation, n – the process of understanding and analyzing a spoken or signed message and re-expressing that message faithfully, accurately and objectively in another language, taking the cultural and social context into account.

3.1.20.1 Discussion – Although there are correspondences between the skills of interpreting and translating, an interpreter conveys meaning orally, while a translator conveys meaning from written text to written text. As a result, interpretation requires skills different from those needed for translation.

3.1.21 inter-rater reliability, n – the degree to which different examiners or judges making different subjective ratings of ability agree in their evaluations of that ability.

3.1.22 intra-rater reliability, n – the degree to which an individual examiner or judge renders consistent and reliable ratings.
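Agreement of the kind defined in 3.1.21 is often quantified with a chance-corrected index. The sketch below is illustrative only and is not prescribed by this practice; it computes Cohen's kappa for two raters, and the sample ratings are invented.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    # Chance-corrected agreement between two raters who rate the
    # same set of performances on a categorical scale.
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(freq_a) | set(freq_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Two raters assign ILR levels to the same eight speaking samples.
print(cohens_kappa(["2", "2+", "3", "2", "1+", "3", "2+", "2"],
                   ["2", "2+", "3", "2+", "1+", "3", "2", "2"]))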
3.1.23 item, n – one of the assessment units, usually a problem or a question, that is included on a test.

3.1.23.1 Discussion – Test items provide a means to measure whether a test taker can perform a task and are scorable using a scoring rubric or answer key. Successful or unsuccessful performance on an item contributes information to the test taker's overall score. Examples of item types include: multiple choice, constructed response, cloze, matching and essay prompts.

3.1.24 item response theory (IRT), n – the theory underlying statistical models that are used to describe the relationship between a student's ability level and the probability of success on a test question.

3.1.24.1 Discussion – IRT encompasses latent trait theory; logistic models; Rasch models; 1, 2, and 3 parameter IRT; normal ogive models; Generalized Partial Credit models; and Samejima's Graded Response model.
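As one concrete instance of the models listed in 3.1.24.1, the three-parameter logistic (3PL) model gives the probability that an examinee of ability θ answers item i correctly (the notation is conventional, not taken from this practice):

P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}

where a_i is the item's discrimination, b_i its difficulty on the theta scale (3.1.4.1), and c_i its pseudo-guessing lower asymptote; setting a_i = 1 and c_i = 0 recovers the one-parameter (Rasch) model.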
3.1.25 language proficiency, n – the degree of skill with which a person can use a language for communicative purposes.

3.1.25.1 Discussion – Language proficiency encompasses a person's ability to read, write, speak, or understand a language and can be contrasted with language achievement, which describes language ability as a result of learning. Proficiency may be measured through the use of a proficiency test.

3.1.26 operational validity, n – the extent to which item tasks, items, or interviewers on a test perform as intended and function to create an accurate score in a real world setting, as opposed to a setting involving an experiment, a simulation or training.

3.1.27 performance test, n – a test in which the ability of candidates to perform particular tasks, usually associated with job or study requirements, is assessed using "real-life" performance requirements as a criterion.

3.1.28 polytomous scoring, n – a model for scoring an item using a scale of at least three points.

3.1.28.1 Discussion – Using a polytomous scoring model, for example, the answer to a question can be assigned 0, 1, or 2 points. Open-ended questions are often scored polytomously. Also referred to as scalar or polychotomous scoring. Compare to dichotomous scoring.
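A minimal sketch of the contrast between the two scoring models in 3.1.12 and 3.1.28, offered only as an illustration (the rubric categories and point values are invented):

# Dichotomous scoring (3.1.12): exactly two categories.
def score_dichotomous(response, key):
    return 1 if response == key else 0

# Polytomous scoring (3.1.28): an ordered scale of three or more
# points, e.g. a 0-2 rubric for a short constructed response.
RUBRIC = {"full and accurate answer": 2,
          "partially correct answer": 1,
          "incorrect or no answer": 0}

def score_polytomous(rating):
    return RUBRIC[rating]

print(score_dichotomous("B", key="B"))               # -> 1
print(score_polytomous("partially correct answer"))  # -> 1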
3.1.29 predictive validity, n – the degree to which a test accurately and reliably predicts future performance in the domain being tested.

3.1.30 protocol, n – a standardized method or procedure for executing a given task, often formalized in documents.

3.1.31 quality assurance, v – the process of ensuring that the test planning and development phases are executed properly and satisfy the needs of all stakeholders.

3.1.31.1 Discussion – Quality assurance (QA) applies (1) when a new test is being created, (2) when a test that already exists is being repurposed or revised, (3) during certain aspects of the implementation process of the test (that is, replenishment of test items), (4) during item replenishment to ensure that new test items and prompts that will be used in the test conform to the original specifications that were used in creating the original items of that type, and (5) to train new personnel to administer the test to the same standards that were specified for the first testing personnel.

3.1.32 quality control, v – the system of post-development evaluations used at and after product acceptance to determine whether the test and testing practices used by an organization continue to meet and adhere to all standards and relevant testing policies.

3.1.32.1 Discussion – Quality control (QC) is used at, and at any time after, product acceptance. QC verifies the continued validity and reliability of the test and shows the test is being used in an appropriate manner on an ongoing basis. Quality control is part of the test maintenance process.

3.1.33 rater, n – a suitably qualified and trained person who assigns a rating to a test taker's performance based on a judgment usually involving the matching of features of the performance to descriptors on a rating scale.

3.1.34 rating, v – to exercise judgment about an examinee's performance on a given task.

3.1.35 rating scale, n – a scale for the description of language proficiency consisting of a series of constructed levels against which a language learner's performance is judged.

3.1.36 reliability, n – the consistency of a test in measuring what it is intended to measure across the life of the test, or the degree to which an instrument measures the same way each time it is used; reproducibility.

3.1.36.1 Discussion – Consistency is the essential notion of classical reliability. Reliability is defined as the extent that separate measurements (for example, items, scales, test administrations, and interviews) yield comparable results under the same or similar conditions. For example, test items measuring the same construct should yield similar results when administered to the same group of test-takers under comparable testing situations. Simply put, reliability is the extent to which an item, scale, procedure, or test will yield the same value when administered under similar or dissimilar conditions.
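The classical notion of reliability mentioned in 3.1.36.1 is often formalized through classical test theory, sketched here for illustration only (standard textbook notation, not part of this practice): an observed score X is decomposed into a true score T and an error term E, and reliability is the proportion of observed-score variance attributable to true scores:

X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}

Estimates such as the test-retest correlation (3.1.42) and internal-consistency coefficients approximate ρ_XX' from observable data.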
3.1.37 scoring rubric, n – a standardized method or procedure used by a rater in assigning a score to an examinee's performance on a given task.

3.1.37.1 Discussion – A scoring rubric is a detailed document that is used by trained raters to assess test taker performance. Correct interpretation and application of the scoring rubric requires training.

3.1.38 selected response, adj – any item which requires the examinee to choose between response options which are provided to the examinee, including, but not limited to, true/false and multiple-choice items.

3.1.39 skill modality, n – any one of the four receptive and productive language skills of listening, reading, speaking, and writing as defined in the ILR.

3.1.40 specifications, n – a detailed description of the characteristics of a test, including what is tested, how it is tested, details such as number and length of papers, item types used, etc.

3.1.41 task, n – an activity performed by a test taker in order to demonstrate functions and other proficiency criteria stated in the ILR Skill Level Descriptors.

3.1.42 test-retest reliability, n – an estimate of the reliability of a test as determined by the extent to which a test gives the same results if it is administered at two different times under the same conditions with the same group of test takers.

3.1.42.1 Discussion – Test-retest reliability is estimated from the coefficient of correlation that is obtained from the two administrations of the test. An assessment should provide a stable measurement of a construct across multiple administrations, especially when the time interval between the administrations limits the potential for the amount of change in the underlying ability.