The competency I can certify after a practical — and the one I can’t after a multiple-choice exam.
Disagreement about assessment in undergraduate science is often framed as disagreement about values: tradition vs. innovation, standards vs. equity, rigor vs. accessibility. The framing is not quite right. The underlying question is a measurement question, and the measurement literature has answers.
Every assessment is a claim about a student. The claim a multiple-choice exam can defend — honestly, with the support of a century of psychometric research — is roughly this: this student can, under timed conditions, select the correct option from a set of plausible alternatives at a rate distinguishable from chance. That is a real claim. It is testable, it is reproducible, and for many purposes it is exactly the right one.
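To make that claim concrete, here is a minimal sketch (invented numbers, not drawn from any real exam) of what "distinguishable from chance" means for a four-option instrument: a pure guesser expects about 25% correct, and a binomial tail probability says how surprising an observed score would be if guessing were all that was happening.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance a pure guesser scores this well."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Invented exam for illustration: 40 items, four options each, so chance is 25%.
n_items, chance_rate = 40, 0.25
observed_correct = 22  # hypothetical raw score

p_value = binomial_tail(observed_correct, n_items, chance_rate)
print(f"P(a guesser scores >= {observed_correct}/{n_items}) = {p_value:.6f}")
# A tiny tail probability supports exactly the narrow claim above: the student
# selects correct options at a rate distinguishable from chance.
```

Nothing in that calculation speaks to whether the student could produce the answer unprompted, which is where the next claim comes in.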
The claim a lab practical can defend is different. It is roughly: this student can, presented with a real specimen and an unfamiliar question, demonstrate the procedural steps and identify the relevant features required to answer it. That is also a real claim. It is also testable, also reproducible, and for many purposes also exactly the right one.
Neither instrument is inherently better. They measure different constructs and underwrite different claims. The question for any program is not "which one is real assessment" but "which claim does this course need to be able to defend?" That question has an answer, and the answer is course-specific.
What a multiple-choice exam can and cannot certify
Multiple-choice instruments do real work, and the work matters. They sample a content domain at scale, at low cost, with high inter-form reliability when the items are well-constructed. They are excellent for tracking knowledge growth over a term. They are excellent for board-exam preparation, because most professional board exams in the health sciences are themselves multiple-choice. The format has earned its place.
What a well-designed multiple-choice exam cannot do is certify application in any rich sense — it cannot close the gap between recognizing the correct answer when shown four options and producing the correct action when shown a patient. Norman and colleagues spent two decades documenting that the format reaches its ceiling at recognition and reasoning-from-given-information, not at procedural execution or open-ended clinical decision-making.1 The AERA / APA / NCME Standards for Educational and Psychological Testing — the closest thing the field has to a settled rulebook — says the same thing in formal language: the validity of an assessment is bounded by the construct it can plausibly be said to measure.2
What a lab practical can and cannot certify
A lab practical is, in measurement-theory terms, a performance assessment: the student is presented with the actual stimulus they will encounter downstream — a real specimen, a real instrument, a real procedural sequence — and produces an observable response that can be scored against a defined criterion. What it can certify is precisely what its construct is: applied identification, procedural competence, and the decision-making that goes with both.
What it cannot do, and is honest to admit, is sample a content domain quickly or cheaply. A practical takes time per student, requires physical materials and supervision, and demands rater training that a multiple-choice exam does not. The OSCE literature in medical education — the Objective Structured Clinical Examination, originating in Harden and Gleeson's 1979 paper — has spent more than four decades working out how to make performance assessments reliable and feasible at scale.3 The conclusion of that literature is encouraging: with structured checklists, anchor examples, and rater training, practical assessments achieve inter-rater reliability comparable to multiple-choice exams of similar length, while measuring something the multiple-choice format cannot.4
Norm-referenced vs. criterion-referenced: a real distinction
A short technical aside that turns out to matter. There are two fundamentally different ways an assessment can interpret a score. A norm-referenced instrument tells us how a student performed relative to other students — the traditional bell curve, the percentile rank, the class average as a benchmark. A criterion-referenced instrument tells us whether a student has demonstrated a defined competency, independent of how anyone else performed. The licensure exam, the OSCE station, the lab practical with its rubric — all criterion-referenced by design.
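A minimal sketch of the two interpretations, with invented scores and an invented cut score, shows how the same raw mark underwrites two different claims.

```python
# Invented class scores and cut score, purely for illustration.
class_scores = [62, 66, 71, 74, 75, 77, 81, 84, 88, 93]
student_score = 78
cut_score = 80  # the published competency standard

# Norm-referenced reading: where does this student sit relative to peers?
percentile = 100 * sum(s < student_score for s in class_scores) / len(class_scores)

# Criterion-referenced reading: did this student meet the defined standard?
meets_standard = student_score >= cut_score

print(f"Percentile rank: {percentile:.0f}th (norm-referenced claim)")
print(f"Meets standard: {meets_standard} (criterion-referenced claim)")
```

On these invented numbers a 78 sits at the 60th percentile and still fails the 80-point criterion; the two frameworks answer different questions about the same score.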
The distinction was formalized by Glaser in 1963 and has been a foundation of educational measurement ever since.5 Both frameworks are legitimate; they answer different questions. Mastery learning, in the educational-research sense Bloom developed, is criterion-referenced by definition: the standard does not move based on cohort performance, and students who do not yet meet it are given additional opportunities until they do.6 The Kulik meta-analyses confirmed that mastery-based instruction produces consistently larger learning gains than time-based instruction across a wide range of subjects, with effect sizes in the moderate-to-large range.7
The credentialing-adjacency principle
From the two distinctions above — what each instrument measures, and which interpretive framework it lives in — a working principle emerges that turns out to make most assessment decisions for you. The principle is this: the closer a course sits to a downstream credentialing decision, the stronger the case for criterion-referenced assessment of applied competence.
The reasoning is straightforward, almost mechanical. The downstream credential — the nursing license, the PA certification, the medical-school admission — is itself criterion-referenced. We are not preparing students to be ranked. We are preparing them to demonstrate, against a defined standard, the competencies the credential exists to certify. A course one or two steps upstream of that credential should measure what the credential measures, in the format the credential measures it. Where the credential will require an applied performance, the course should require an applied performance. Where it will require multiple-choice recognition of facts, the course can use the same.
A grade of B in a norm-referenced course tells me a student outperformed roughly half the class. A pass on a criterion-referenced practical tells me they can do the thing the course exists to teach. For a course that prepares students for licensure, the second statement is the one I can defend.
The honest counter: reliability cost
The strongest argument against expanding the use of practicals is, to be fair, a measurement argument: practicals are harder to score reliably than multiple-choice exams. This is a real challenge, well documented, and not something a curriculum committee should hand-wave away. Two raters watching the same performance can disagree on its quality; the same rater on a different day can score the same performance differently.
What the OSCE psychometrics literature has shown, however, is that the gap is an engineering problem, not a fundamental ceiling. Structured rubrics with operationalized anchors, frame-of-reference rater training, multiple raters per station, and pre-assessment calibration sessions all narrow the gap substantially. Generalizability-theory studies of well-designed OSCEs report reliability coefficients in the same range as multiple-choice exams of comparable length.4 Reliability is a design problem to be solved, not a reason to abandon the instrument that measures the construct the course needs to certify.
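To put the rater-agreement problem in concrete terms, here is a minimal sketch with invented ratings (not data from any study cited here): two raters mark the same ten checklist items pass/fail, and Cohen's kappa corrects their raw agreement for the agreement chance alone would produce. Anchored rubrics and frame-of-reference training are, in effect, interventions to push this statistic up.

```python
# Invented pass/fail judgments from two raters watching the same ten checklist items.
rater_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

n = len(rater_a)
observed_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement from each rater's marginal pass rate: the probability both
# say "pass" plus the probability both say "fail", if they rated independently.
p_a_pass = sum(rater_a) / n
p_b_pass = sum(rater_b) / n
chance_agreement = p_a_pass * p_b_pass + (1 - p_a_pass) * (1 - p_b_pass)

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(f"Observed agreement {observed_agreement:.2f}, Cohen's kappa {kappa:.2f}")
```

Here the raw agreement of 0.80 shrinks to a kappa of about 0.52 once chance is accounted for, which is why the literature reports chance-corrected or generalizability coefficients rather than raw percent agreement.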
What this means for grading philosophy
The bridge from measurement theory to grading practice is shorter than it sometimes feels. If a course's purpose is to certify a competency, then the appropriate grading philosophy is the one that supports certification: clear, published standards; multiple opportunities for the student to demonstrate the competency; and no penalty for an earlier attempt that did not yet meet the standard. A grade of 95% on the third attempt at a procedural checklist means the same thing — the student can do the procedure to standard — as a 95% on the first attempt. The downstream credential will not record which attempt it was.
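A minimal sketch of the bookkeeping this implies, assuming an invented 95% checklist threshold and invented attempt scores (none of this is a real gradebook system): what the record preserves is whether the standard was met, not which attempt met it.

```python
from dataclasses import dataclass, field

MASTERY_THRESHOLD = 0.95  # invented standard: 95% of checklist steps performed correctly

@dataclass
class CompetencyRecord:
    student: str
    attempts: list[float] = field(default_factory=list)

    def record_attempt(self, score: float) -> None:
        self.attempts.append(score)

    def meets_standard(self) -> bool:
        # Criterion-referenced: any attempt at or above the threshold certifies
        # the competency; earlier attempts carry no penalty.
        return any(score >= MASTERY_THRESHOLD for score in self.attempts)

record = CompetencyRecord("student_017")
for score in (0.72, 0.88, 0.97):   # invented attempt history
    record.record_attempt(score)
print(record.meets_standard())      # True: the third attempt met the standard
```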
This is not a softer grading philosophy. It is grading that matches the construct being measured. A norm-referenced course design, with its bell curve and one-shot summative format, is appropriate for courses where the goal is to rank students reliably for a downstream selection decision — a deliberate, defensible use, and one that genuinely exists. It is the wrong instrument for a course whose purpose is to certify whether each individual student has met the standard the next program, or the licensure board, will hold them to.
A practical implication
For any course in the credentialing-adjacent category, the assessment audit is straightforward. Take each major assessment. Name the construct it measures. Name the claim it underwrites. Ask whether the claim is one the course needs to be able to defend. Where the answer is yes, keep the instrument. Where the answer is no, replace it with the instrument that matches the construct.
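One way to keep the audit honest is to write it down in one place. The sketch below uses invented assessments and invented verdicts (placeholders, not a real course inventory) just to show the shape of the exercise.

```python
# Invented audit rows for a hypothetical credentialing-adjacent course.
audit = [
    {"assessment": "Unit exam (multiple-choice)", "construct": "content recognition",
     "claim": "selects correct options under timed conditions", "claim_needed": True},
    {"assessment": "Curved final ranking",        "construct": "relative standing",
     "claim": "outperformed roughly half the cohort",           "claim_needed": False},
    {"assessment": "Lab practical",               "construct": "applied identification",
     "claim": "performs the procedure on a real specimen",      "claim_needed": True},
]

# Keep instruments whose claims the course must defend; flag the rest for replacement
# with an instrument that matches the construct.
for row in audit:
    verdict = "keep" if row["claim_needed"] else "replace"
    print(f'{row["assessment"]}: {verdict}')
```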
A program that does this exercise honestly tends to produce more lab practicals, not fewer, and clearer multiple-choice exams, not muddier ones. The two outcomes are not in tension. The multiple-choice instrument keeps the work it is good at; the practical takes on the work that requires it. Both become better at the construct they were designed to measure, and the course as a whole earns the right to defend the claims its grades make.
References & further reading
1. Norman, G. R., Swanson, D. B., & Case, S. M. (1996). “Conceptual and methodological issues in studies comparing assessment formats.” Teaching and Learning in Medicine, 8(4), 208–216. doi:10.1080/10401339609539791. One of the canonical short references on what multiple-choice instruments can and cannot measure in health-professions education. See also van der Vleuten, C. P. M., & Schuwirth, L. W. T. (2005), “Assessing professional competence: from methods to programmes,” Medical Education, 39(3), 309–317.
2. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: AERA. testingstandards.net. The field's settled rulebook on validity, reliability, and fair use of assessment instruments.
3. Harden, R. M., & Gleeson, F. A. (1979). “Assessment of clinical competence using an objective structured clinical examination (OSCE).” Medical Education, 13(1), 41–54. doi:10.1111/j.1365-2923.1979.tb00918.x. The originating paper for the OSCE as a structured performance assessment.
4. Brannick, M. T., Erol-Korkmaz, H. T., & Prewett, M. (2011). “A systematic review of the reliability of objective structured clinical examination scores.” Medical Education, 45(12), 1181–1189. doi:10.1111/j.1365-2923.2011.04075.x. A meta-analytic review showing that well-designed OSCEs achieve reliability coefficients comparable to multiple-choice exams of similar length.
5. Glaser, R. (1963). “Instructional technology and the measurement of learning outcomes: some questions.” American Psychologist, 18(8), 519–521. doi:10.1037/h0049294. The originating paper on the criterion-referenced / norm-referenced distinction.
6. Bloom, B. S. (1968). “Learning for mastery.” UCLA Evaluation Comment, 1(2), 1–12. The foundational treatment of mastery learning as criterion-referenced instruction with multiple opportunities to demonstrate it.
7. Kulik, C.-L. C., Kulik, J. A., & Bangert-Drowns, R. L. (1990). “Effectiveness of mastery learning programs: a meta-analysis.” Review of Educational Research, 60(2), 265–299. doi:10.3102/00346543060002265. The meta-analytic synthesis of mastery-learning outcomes against time-based instruction.
Drafted May 2026.