Designation: E2849 − 18    An American National Standard

Standard Practice for Professional Certification Performance Testing¹

This standard is issued under the fixed designation E2849; the number immediately following the designation indicates the year of original adoption or, in the case of revision, the year of last revision. A number in parentheses indicates the year of last reapproval. A superscript epsilon (ε) indicates an editorial change since the last revision or reapproval.

¹ This practice is under the jurisdiction of ASTM Committee E36 on Accreditation …

1. Scope

1.1 This practice covers both the professional certification performance test itself and specific aspects of the process that produced it.

1.2 This practice does not include management systems. In this practice, the test itself and its administration, psychometric properties, and scoring are addressed.

1.3 This practice primarily addresses individual professional performance certification examinations, although it may be used to evaluate exams used in training, educational, and aptitude contexts. This practice is not intended to address on-site evaluation of workers by supervisors for competence to perform tasks.

1.4 This standard does not purport to address all of the safety concerns, if any, associated with its use. It is the responsibility of the user of this standard to establish appropriate safety, health, and environmental practices and determine the applicability of regulatory limitations prior to use.

1.5 This international standard was developed in accordance with internationally recognized principles on standardization established in the Decision on Principles for the Development of International Standards, Guides and Recommendations issued by the World Trade Organization Technical Barriers to Trade (TBT) Committee.

2. Terminology

2.1 Definitions–Some of the terms defined in this section are unique to the performance testing context. Consequently, terms defined in other standards may vary slightly from those defined in the following.

2.1.1 automatic item generation (AIG), n–a process of computationally generating multiple forms of an item.

2.1.2 candidate, n–someone who is eligible to be evaluated through the use of the performance test; a person who is or will be taking the test.

2.1.3 construct validity, n–degree to which the test evaluates an underlying theoretical idea resulting from the orderly arrangement of facts.

2.1.4 differential system responsiveness, n–measurable difference in response latency between two systems.

2.1.5 examinee, n–candidate in the process of taking a test.

2.1.6 gating item, n–unit of evaluation that shall be passed to pass a test.

2.1.7 inter-rater reliability, n–measurement of rater consistency with other raters.

2.1.7.1 Discussion–See rater reliability.

2.1.8 item, n–scored response unit.

2.1.8.1 Discussion–See task. A task can be scored as one item; a task may also be comprised of multiple components, each of which is scored as an item.

2.1.21 test, n–sampling of behavior over a limited time in which an authenticated examinee is given specific tasks under specified conditions, tasks that are scored by a uniformly applied rubric.

2.1.21.1 Discussion–A test can also be referred to as an assessment, although typically "assessment" is used for formative evaluation. This practice specifically addresses certification and licensure, as stated in 1.3. A test is designed to predict the examinee's behavior in a specified context, the "target context."

2.1.22 trajectory, n–candidate's path through the solution to a single item, task, or test.

2.1.22.1 Discussion–Also termed the response trajectory.

2.1.23 validity, n–extent to which a test predicts target behavior for multiple candidates within a target context.

3. Significance and Use

3.1 This practice for performance testing provides guidance to performance test sponsors, developers, and delivery providers for the planning, design, development, administration, and reporting of high-quality performance tests. This practice assists stakeholders from both the user and consumer communities in determining the quality of performance tests. This practice includes requirements, processes, and intended outcomes for the entities that are issuing the performance test; developing, delivering, and evaluating the test; users and test takers interpreting the test; and the specific quality characteristics of performance tests. This practice provides the foundation for both the recognition and accreditation of a specific entity to issue and use effectively a quality performance test.

3.2 Accreditation agencies are presently evaluating performance tests with criteria that were developed primarily or exclusively for multiple-choice examinations. The criteria by which performance tests shall be evaluated and accredited are ones appropriate to performance testing. As accreditation becomes more critical for acceptance by federal and state governments, insurance companies, and international trade, it becomes more critical that appropriate standards of quality and application be developed for performance testing.

4. Candidate Preparation

4.1 Number of Practice Items–A candidate shall be given access to sufficient practice items so that the novelty of the item format shall not inhibit the examinee's ability to demonstrate his or her capabilities.

4.2 Scoring Rubric Available to Candidates:

4.2.1 Candidates shall have sufficient information about the scoring rubric to be able to appropriately prioritize their efforts in completing the item or test.

4.2.2 The examinee shall not be provided so much information about the scoring rubric that it diminishes the ability of stakeholders to generalize the examinee's skills from his or her test score.

4.3 Practice Tests:

4.3.1 There are two types of practice tests: one for gaining familiarity with the user interface of the test items and the other to allow the candidate to self-evaluate mastery of the content.

4.3.1.1 User Interface Preparation–A practice test or tests to familiarize candidates with the user interface shall be made available to the candidate at no charge. The practice test shall be sufficient to assure adequate candidate practice time so that the degree of familiarity with the user interface does not impair the validity of the test.

4.3.1.2 Content Self-Assessment–Practice tests that evaluate content mastery may be made available at no charge or for a fee. There is no obligation on the part of the test provider to provide a self-assessment practice test to evaluate content mastery.

NOTE 1–If a practice test is provided, it shall sample test content sufficiently to allow the candidate to reasonably predict success or failure on the test.

4.3.2 Candidates shall know specifically which type of practice test they are requesting.

4.3.3 Both types of practice test shall help candidates understand how their responses are going to be scored.

5. Procedure

5.1 Item Development–All requirements in Section 5 may be superseded by empirical, logical, or statistical arguments demonstrating that the practices of a certification body are equivalent to or superior to the practices required to meet this practice.

5.1.1 Item Time Limits:

5.1.1.1 When items or test sections can be accessed repeatedly, no item time limit is required to be enforced or recommended to the candidate.

5.1.1.2 When items can be accessed only once, item time limits shall be either suggested or enforced, with a visual timekeeping option for the examinee.

5.1.1.3 For a power test, item time limits shall be set using a standard practice, such as the mean item response time measured in beta testing plus two standard deviations for successful candidates within the calibration sample (see the sketch after 5.1.1.4). When sufficient data have been collected from test administrations, the item time shall be recalibrated to reflect performance on the actual test.

5.1.1.4 For a speeded test, item time limits shall be determined by measuring minimum acceptable time limits in the target context.
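The power-test rule in 5.1.1.3 can be illustrated concretely. The sketch below is not part of the standard: it computes a provisional item time limit as the mean response time of successful calibration candidates plus two sample standard deviations, using hypothetical beta-test data; all function and variable names are illustrative.

```python
import statistics

def power_test_time_limit(response_times_s, passed):
    """Provisional item time limit per the example rule in 5.1.1.3:
    mean response time of successful calibration candidates
    plus two sample standard deviations.

    response_times_s -- per-candidate response times in seconds
    passed           -- parallel list of pass/fail flags
    """
    successful = [t for t, ok in zip(response_times_s, passed) if ok]
    if len(successful) < 2:
        raise ValueError("need at least two successful candidates to calibrate")
    mean = statistics.mean(successful)
    stdev = statistics.stdev(successful)  # sample standard deviation
    return mean + 2 * stdev

# Example: hypothetical beta-test data for one item (seconds).
times = [95.0, 120.0, 88.0, 140.0, 110.0, 132.0]
passed = [True, True, False, True, True, True]
print(round(power_test_time_limit(times, passed), 1))  # ~155.0 s
```

Per 5.1.1.3, such a limit would then be recalibrated once sufficient data from live administrations exist.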

5.1.2 Differential System Responsiveness–Differential system responsiveness may be due to variance in network bandwidth, network latency, random-access memory (RAM), storage speed, operating systems, central processing unit (CPU) count and performance, bus speed, or other factors.

NOTE 2–It is the obligation of the test developer to attempt to measure differences in latency and system responsiveness whenever possible and, if possible, to compensate appropriately for these variations.

5.1.2.1 There shall be compensation in test scoring for variances in the hardware and software environment to assure that all examinees are scored fairly.

NOTE 3–Compensation may be in adjusting item time limits, item latency scoring factors, or other compensatory variables (one possible mechanism is sketched after 5.1.2.2).

5.1.2.2 An examinee taking a test under one set of conditions shall receive the same score as if he or she took the test under any admissible alternative set of conditions.
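NOTE 3 leaves the compensation mechanism open. One possible approach, sketched below under the assumption that per-station system latency can be measured against a calibration baseline (the practice does not prescribe a method), is to deduct excess latency from the recorded response time before applying the item time limit; all names here are hypothetical.

```python
def effective_response_time(raw_time_s, measured_latency_s, baseline_latency_s):
    """Remove excess system latency from a recorded response time so that
    examinees on slower stations are not penalized (see 5.1.2.1).
    Only latency above the calibration baseline is deducted."""
    excess = max(0.0, measured_latency_s - baseline_latency_s)
    return raw_time_s - excess

def within_time_limit(raw_time_s, measured_latency_s, baseline_latency_s, limit_s):
    """Apply the item time limit to the latency-compensated response time."""
    return effective_response_time(
        raw_time_s, measured_latency_s, baseline_latency_s) <= limit_s

# Two stations, identical examinee behavior: the slower station's extra
# 6 s of system latency is deducted, so both score identically (5.1.2.2).
print(within_time_limit(150.0, 2.0, 2.0, 155.0))  # fast station -> True
print(within_time_limit(156.0, 8.0, 2.0, 155.0))  # slow station -> True
```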

5.1.3 References/Citations–When possible, codes, guidelines, industry standards, application source code, or other evidence shall be sufficient to establish the correctness of scoring a procedure. Where such documentation does not exist, correct responses may be documented as standard practice by a vote of the subject matter expert (SME) advisory panel for the test.

5.1.4 Rater Reliability–When human raters are involved in assessing item success, rater reliability shall correlate with an established performance standard greater than 0.80.

5.1.4.1 When multiple raters are used to rate a single performance, inter-rater reliability shall correlate higher than 0.80.
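The 0.80 thresholds in 5.1.4 and 5.1.4.1 can be checked with an ordinary correlation coefficient. A minimal sketch using Pearson's r, one common choice (the practice does not mandate a particular coefficient), on hypothetical scores:

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two parallel score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two parallel lists with at least two scores")
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# 5.1.4: one rater against an established performance standard.
standard = [4, 2, 5, 3, 1, 4, 5, 2]
rater_a  = [4, 3, 5, 3, 1, 4, 4, 2]
print(pearson_r(standard, rater_a) > 0.80)  # rater reliability check

# 5.1.4.1: two raters scoring the same performances.
rater_b  = [5, 2, 5, 3, 2, 4, 5, 3]
print(pearson_r(rater_a, rater_b) > 0.80)   # inter-rater reliability check
```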

5.1.5 Automated Scoring–To verify automated scoring, the test developer shall develop test cases that verify the scoring of a minimum of 95 % of anticipated responses. When items are scored automatically, for the first 100 administrations of the test, the test developer shall verify that the scoring algorithm is scoring responses correctly. Verification may be done by human observation, alternate scoring mechanisms, playback of recorded performance, or audit of collected data. Initial verification shall be performed for at least 5 % of failed items. After 100 administrations, the developer shall verify 1 % of failed items until at least 200 failed items have been checked.
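The verification quotas in 5.1.5 lend themselves to a simple audit-sampling rule. The sketch below reflects one reading of those quotas; the batch structure, random sampling, and handling of the 200-item cutoff are illustrative assumptions, not requirements of the practice.

```python
import random

def audit_sample(failed_item_ids, administrations, audited_so_far):
    """Select failed item responses for human or alternate-method
    verification per the quotas in 5.1.5.

    failed_item_ids -- identifiers of failed responses from this batch
    administrations -- total test administrations to date
    audited_so_far  -- failed responses already verified
    """
    if audited_so_far >= 200:          # quota met: 200 failed items checked
        return []
    # At least 5 % during the first 100 administrations, 1 % thereafter.
    rate = 0.05 if administrations <= 100 else 0.01
    n = max(1, round(rate * len(failed_item_ids))) if failed_item_ids else 0
    return random.sample(failed_item_ids, min(n, len(failed_item_ids)))

random.seed(1)  # deterministic output for the example
# Example: 40 failed responses within the first 100 administrations -> audit >= 2.
batch = [f"resp-{i}" for i in range(40)]
print(audit_sample(batch, administrations=80, audited_so_far=10))
```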

5.1.6 Item Stimulus Construction–The item solution space shall enable options that would be used by at least 95 % of practitioners in addressing the problem represented by the item.

NOTE 4–The estimate of the practitioner percentage can be derived empirically from usability studies, use cases, expert panels, observation, or other empirical means.

5.1.7 Simulation Representation of Reality–Simulation rules shall represent reality as it is encountered in the target context or accurately abstract essentials of reality in the target context, unless the content of the item is for the candidate to infer the rules of the simulation.

5.1.8 Access to Help–Support available to the candidate during the examination shall reflect the support available in the target context, unless the test is designed to predict candidate behavior in an unsupported environment.

5.1.9 Reconfiguration–Reconfiguration is so commonplace in many work environments that it shall be taken into account when evaluating the valid range of interpretations of a performance test.

5.1.9.1 If minimal reconfiguration is encountered in the field, requiring the examinee to take the test with the default configuration is acceptable.

5.1.9.2 If field practice normally involves extensive reconfiguration of the tools, then the test shall allow candidates to import their industry-standard configurations into the test environment, provided that doing so does not compromise exam security, provide unfair advantage over other candidates, or impact the generalizability of results.

5.1.9.3 The criterion the test developer shall use to determine "minimal reconfiguration" is whether competence measured with the default configuration will predict performance with a reconfigured system.

5.1.10 Level of Feedback–Feedback during the test shall reflect the feedback available when doing similar tasks in the target context.

NOTE 5–Feedback may be time compressed to minimize testing time. Interim results may be omitted if they do not impact success in performing the item.

5.1.11 Americans with Disabilities Act (ADA) Accommodations–Accommodations shall be fair to the candidate, the testing administrator, other candidates, and the potential employer alike, with no interest predominating. Before awarding accommodations, the test administrator shall discuss with the candidate what the candidate feels would be reasonable accommodations and, when feasible, shall allow the methods candidates use for accomplishing tasks in the target context. The candidate shall possess the capability to perform the required test item in full with the agreed-upon accommodations. In no case shall a verbal option be given in place of a performance requirement.

5.1.12 Sensitivity and Bias–Items shall be developed with sensitivity toward the cultural context within which the candidate will be practicing the skills evaluated. The items shall not include content that would prevent people of equal ability or skill from exhibiting those abilities or skills.

5.1.13 Item Response Termination–Item termination methods used shall create an environment in which the examinee's response during a test will best predict performance in the target context.

NOTE 6–If, in the target context, an examinee determines completion of the task, then the examinee shall indicate completion of the task on the test. If, in the target context, an external individual determines completion of the task, then an examiner or external indication shall terminate the item.

5.1.14 Observer Item Effects–The test developer shall minimize the intrusiveness of the item observer on the process being evaluated to at or below the normal level of supervision encountered by the candidate in the target context.

5.1.15 Item Scoring:

5.1.15.1 Item scoring shall be both consistent and fair. The scoring rubric shall be applied in the same manner to all examinees' responses. The scoring rubric shall give credit to all correct responses.

5.1.15.2 There shall be a method that allows an auditor to evaluate scored states of the item, evaluate the accuracy of task and item timing, and assess the accuracy of the weighting scheme if one is applied.

5.1.15.3 When the universe of response trajectories is undefined, scoring for a reasonable set of correct paths to the correct answer shall be verified.

5.2 Test Development:

5.2.1 Equivalent Forms:

5.2.1.1 Difficulty: IRT–Test information functions shall have integrals within 2 % of each other and not depart more than 5 % anywhere along the theta range from −3.0 to +3.0.

5.2.1.2 Difficulty: Classical Test Theory–Difficulty between forms shall be equated. The recommended range of P-values is from 0.35 to 0.95.
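As an illustration of the 5.2.1.1 criterion, the sketch below assumes a two-parameter logistic (2PL) IRT model (the practice does not fix the model): it computes each form's test information function over theta from −3.0 to +3.0, compares the integrals (within 2 %), and checks the pointwise departure, taken here relative to the first form (within 5 %). Item parameters are illustrative.

```python
import math

def tif(items, theta):
    """Test information of a 2PL form at ability theta.
    items -- list of (a, b) discrimination/difficulty pairs."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)   # 2PL item information: a^2 * P * Q
    return total

def forms_equivalent(form1, form2, lo=-3.0, hi=3.0, steps=120):
    """5.2.1.1 check: TIF integrals within 2 %, and pointwise departure
    (computed here relative to form 1) within 5 % over [lo, hi]."""
    grid = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    f1 = [tif(form1, t) for t in grid]
    f2 = [tif(form2, t) for t in grid]
    h = (hi - lo) / steps
    area1 = sum((f1[i] + f1[i + 1]) / 2 * h for i in range(steps))  # trapezoid rule
    area2 = sum((f2[i] + f2[i + 1]) / 2 * h for i in range(steps))
    integrals_ok = abs(area1 - area2) / area1 <= 0.02
    pointwise_ok = all(abs(x - y) / x <= 0.05 for x, y in zip(f1, f2))
    return integrals_ok and pointwise_ok

# Illustrative near-parallel forms: (a, b) per item.
form_a = [(1.2, -1.0), (0.9, 0.0), (1.1, 1.0), (1.0, -0.5)]
form_b = [(1.2, -0.95), (0.9, 0.05), (1.1, 0.95), (1.0, -0.45)]
print(forms_equivalent(form_a, form_b))  # True for these parameters
```

For the classical-test-theory check in 5.2.1.2, the P-value of an item is simply the proportion of examinees answering it correctly, so the 0.35 to 0.95 range can be verified directly from administration counts.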

5.2.1.3 Difficulty: AIG Equivalence–The test developer shall periodically evaluate variant forms of items to assure …