Thesis statement
A defensible rubric for assessing “assessment practice” in educational or workplace settings must (a) be construct-aligned, (b) contain behaviorally anchored performance descriptors, (c) support reliable scoring through rater training and quality control, and (d) generate validity evidence for score use. The rubric below provides a parsimonious, evidence-informed structure grounded in authoritative standards and scholarship (AERA, APA, & NCME, 2014; Kane, 2013; Messick, 1995; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
- Intended use, population, and deliverables
- Intended use: Summative evaluation of candidates’ competence in designing, implementing, scoring, and using results from an assessment in their domain (e.g., education, training, certification).
- Target population: Pre-service or in-service educators, instructional designers, assessors, or program evaluators.
- Required artifacts:
- Assessment plan with purpose, construct definition, claims/targets, and blueprint.
- Task(s)/items, scoring tools (rubrics, keys), and administration materials.
- Fairness and accessibility plan (including accommodations and bias review).
- Quality assurance plan for scoring and standard setting as appropriate.
- Evidence report with data (pilot or simulated), reliability/consistency estimates, validity argument, and use-of-results plan.
- Reflective memo on improvement decisions and consequences.
- Performance levels and weights
- Scale: 4 = Exemplary; 3 = Proficient; 2 = Developing; 1 = Beginning.
- Recommended weights (sum to 100):
- C1 Purpose and construct definition: 15
- C2 Design and blueprinting: 20
- C3 Fairness, accessibility, and ethics: 15
- C4 Scoring and standard setting: 15
- C5 Evidence and interpretation (reliability/validity): 20
- C6 Reporting and improvement: 15
- Criteria with behaviorally anchored descriptors
C1. Purpose and construct definition (15)
- 4 Exemplary: States specific intended uses and decisions; articulates a defensible construct with boundaries and grain size; aligns claims to learning/competency targets and context; specifies consequences to be monitored (AERA et al., 2014; Messick, 1995).
- 3 Proficient: States uses and decisions; defines construct with minor ambiguities; shows clear alignment to targets.
- 2 Developing: Vague uses or decisions; construct definition incomplete or overly broad; partial alignment.
- 1 Beginning: No clear decision use; construct undefined or conflated with proxies; misaligned to targets.
C2. Design and blueprinting (20)
- 4 Exemplary: Provides blueprint mapping targets to tasks/items, cognitive processes, and score points; sampling is representative; tasks elicit intended evidence with clear directions; difficulty and cognitive demand are justified; administration plan addresses logistics and security (AERA et al., 2014).
- 3 Proficient: Logical blueprint with acceptable coverage; tasks largely elicit intended evidence; administration plan adequate.
- 2 Developing: Coverage gaps or imbalance; tasks partly mismatched to targets or pitched at inappropriate difficulty; administration plan incomplete.
- 1 Beginning: No blueprint or severe misalignment; tasks fail to elicit targeted evidence; administration plan absent.
C3. Fairness, accessibility, and ethics (15)
- 4 Exemplary: Documents bias/sensitivity review procedures; integrates universal design for learning and access features; specifies accommodation policies; anticipates and mitigates construct-irrelevant barriers; addresses privacy and informed consent; articulates fairness monitoring (AERA et al., 2014; CAST, 2018).
- 3 Proficient: Incorporates key fairness/accessibility elements with minor omissions; accommodation policy present.
- 2 Developing: Fairness considerations ad hoc; access features limited; accommodation guidance vague.
- 1 Beginning: No evidence of bias review, access planning, or ethical safeguards.
C4. Scoring and standard setting (15)
- 4 Exemplary: Rubrics/keys have clear performance indicators and anchors; rater training and calibration plan specified; quality control includes double-scoring and drift checks; criterion-referenced standard-setting approach selected and justified (e.g., Angoff/Bookmark) with cut score documentation when applicable (AERA et al., 2014).
- 3 Proficient: Rubrics clear with minor ambiguities; basic rater training plan; appropriate standard-setting choice with partial documentation.
- 2 Developing: Rubrics/keys lack anchors; limited or informal rater guidance; standard setting ill-specified or weakly justified.
- 1 Beginning: Scoring rules unclear; no rater training; cut scores arbitrary or absent when required.
C5. Evidence and interpretation: reliability/validity (20)
- 4 Exemplary: Provides a coherent validity argument spanning scoring, generalization, extrapolation, and decision inferences; includes relevant evidence (e.g., internal consistency or inter-rater reliability with appropriate coefficients; item/task analysis; alignment indices; relationships to external measures when feasible); limitations and alternative explanations addressed (Kane, 2013; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
- 3 Proficient: Presents multiple pertinent indices with correct interpretation; tentative validity narrative with minor gaps.
- 2 Developing: Limited or inappropriate indices; interpretations exceed evidence; validity argument superficial.
- 1 Beginning: No empirical checks; claims unsupported or inaccurate.
C6. Reporting, use, and improvement (15)
- 4 Exemplary: Communicates results for intended audiences with accuracy and transparency; provides actionable feedback; specifies decision rules; documents intended and unintended consequences; proposes concrete revisions based on evidence (AERA et al., 2014; Messick, 1995).
- 3 Proficient: Clear, audience-appropriate reporting and plausible improvement steps.
- 2 Developing: Reporting uneven or lacks actionability; improvement suggestions weakly connected to evidence.
- 1 Beginning: Results opaque or misleading; no plan for use or improvement.
- Scoring procedure and quality assurance
- Rater selection and training:
- Provide raters with construct definitions, exemplars/anchors, and a scoring guide; conduct calibration using anchor artifacts spanning the scale (Moskal & Leydens, 2000).
- Require agreement thresholds before operational scoring (e.g., percent exact agreement ≥ 70% and adjacent agreement ≥ 90% during training).
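As a quick computational check of those calibration thresholds, the sketch below computes percent exact and adjacent agreement for a pair of raters; the function name, the ten-artifact calibration set, and the scores are illustrative assumptions, while the 70%/90% thresholds come from the bullet above.

```python
# Minimal sketch (assumed data): calibration agreement between two raters,
# checked against the illustrative thresholds above (exact >= 70%, adjacent >= 90%).
import numpy as np

def calibration_agreement(rater_a, rater_b):
    """Return percent exact and adjacent (within one level) agreement for paired scores."""
    a = np.asarray(rater_a, dtype=float)
    b = np.asarray(rater_b, dtype=float)
    exact = np.mean(a == b)                  # identical level assigned
    adjacent = np.mean(np.abs(a - b) <= 1)   # within one scale point (includes exact)
    return exact, adjacent

# Hypothetical calibration set of ten anchor artifacts scored on the 1-4 scale.
rater_a = [4, 3, 3, 2, 1, 4, 2, 3, 3, 2]
rater_b = [4, 3, 2, 2, 1, 4, 3, 3, 3, 2]
exact, adjacent = calibration_agreement(rater_a, rater_b)
print(f"Exact: {exact:.0%}, Adjacent: {adjacent:.0%}")
qualified = exact >= 0.70 and adjacent >= 0.90
print("Proceed to operational scoring" if qualified else "Recalibrate before scoring")
```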
- Operational scoring:
- Double-score at least 20% of portfolios/practicum submissions; resolve discrepancies via adjudication rules.
- Monitor rater drift with periodic recalibration and feedback.
- Inter-rater reliability:
- Report an appropriate coefficient for ordinal rubric scores, such as a two-way random-effects intraclass correlation (ICC[2,k]) with 95% CIs; aim for ≥ 0.75 for high-stakes uses and ≥ 0.60 for moderate stakes, interpreted in context (Jonsson & Svingby, 2007).
- For categorical pass/fail decisions, report Cohen’s kappa or weighted kappa; examine decision consistency.
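The sketch below illustrates one way to compute ICC(2,k), the two-way random-effects, average-measures coefficient named above, directly from the Shrout and Fleiss ANOVA decomposition; the score matrix and its dimensions are assumed for illustration.

```python
# Minimal sketch: ICC(2,k) -- two-way random effects, average measures -- from the
# classic Shrout & Fleiss mean-square formula. Data and shapes are illustrative.
import numpy as np

def icc_2k(scores):
    """scores: (n_subjects, k_raters) array of rubric scores; returns ICC(2,k)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between-subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between-raters
    ss_total = np.sum((x - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols                  # residual
    bms = ss_rows / (n - 1)                                # between-subjects mean square
    jms = ss_cols / (k - 1)                                # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))                     # error mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical matrix: 6 submissions scored by 3 raters on the 1-4 scale.
scores = np.array([[4, 4, 3],
                   [3, 3, 3],
                   [2, 3, 2],
                   [1, 1, 2],
                   [4, 3, 4],
                   [2, 2, 2]])
print(f"ICC(2,k) = {icc_2k(scores):.2f}")
```

Operationally, an established implementation that also reports confidence intervals (for example, pingouin's intraclass_corr) is preferable to a hand-rolled calculation; for pass/fail classifications, a weighted kappa such as sklearn's cohen_kappa_score with quadratic weights is a common choice.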
- Internal structure and score quality:
- If multiple tasks/indicators form a composite, examine internal structure (e.g., inter-item correlations; factor structure if sample permits) with construct coherence as the goal rather than maximizing alpha.
- Use item/task analyses where appropriate (e.g., facility, discrimination trends) to inform revisions.
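One common operationalization of "facility and discrimination trends" for rubric-scored tasks is shown below: facility as the proportion of available points earned and discrimination as the corrected item-total correlation. The candidate-by-task matrix and the 4-point maximum are assumptions for illustration.

```python
# Minimal sketch (illustrative data): per-task facility and corrected item-total
# discrimination for a small composite of rubric-scored tasks on the 1-4 scale.
import numpy as np

def item_analysis(scores, max_points=4):
    """scores: (n_candidates, n_tasks) array; returns facility and discrimination per task."""
    x = np.asarray(scores, dtype=float)
    facility = x.mean(axis=0) / max_points      # proportion of available points earned
    discrimination = []
    for j in range(x.shape[1]):
        rest = x.sum(axis=1) - x[:, j]          # total score excluding the task itself
        discrimination.append(np.corrcoef(x[:, j], rest)[0, 1])
    return facility, np.array(discrimination)

# Hypothetical scores for 8 candidates on 4 tasks.
scores = np.array([[4, 3, 4, 3],
                   [3, 3, 3, 2],
                   [2, 2, 3, 2],
                   [1, 2, 2, 1],
                   [4, 4, 3, 4],
                   [3, 2, 2, 3],
                   [2, 1, 2, 2],
                   [3, 3, 4, 3]])
facility, discrimination = item_analysis(scores)
for j, (f, d) in enumerate(zip(facility, discrimination), start=1):
    print(f"Task {j}: facility = {f:.2f}, corrected item-total r = {d:.2f}")
```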
- Standard setting (when required):
- Choose a method aligned to task type and decision stakes (e.g., modified Angoff for selected-response, Body of Work for performance tasks); document panel qualifications, training, performance level descriptors, and cut score computations (AERA et al., 2014).
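For selected-response forms, the core modified Angoff arithmetic is simple enough to sketch: each panelist estimates, per item, the probability that a minimally competent candidate answers correctly, and the recommended cut score is the sum of item-level means. The panel size, item count, and ratings below are illustrative assumptions; full documentation of panelist qualifications and training still applies.

```python
# Minimal sketch of a modified Angoff cut-score computation under assumed inputs.
import numpy as np

# Rows = panelists, columns = dichotomous items; entries are probability judgments in [0, 1].
ratings = np.array([
    [0.70, 0.55, 0.80, 0.40, 0.65],
    [0.75, 0.60, 0.85, 0.45, 0.60],
    [0.65, 0.50, 0.75, 0.50, 0.70],
])

item_means = ratings.mean(axis=0)    # expected score of the borderline candidate per item
cut_score = item_means.sum()         # recommended raw cut score on the 5-item form
panelist_cuts = ratings.sum(axis=1)  # per-panelist cuts, useful for checking variability
print(f"Recommended cut score: {cut_score:.2f} of {ratings.shape[1]} points")
print(f"Panelist spread (SD of individual cuts): {panelist_cuts.std(ddof=1):.2f}")
```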
- Fairness monitoring:
- Document accommodations provided; collect qualitative feedback from examinees; where data permit, screen for subgroup anomalies while recognizing small-sample limitations; prioritize qualitative bias review for performance tasks.
- Implementation notes and adaptations
- Contextualization: Tailor task specificity and weighting to discipline and stakes while preserving the six criteria and four-level scale to support comparability.
- Evidence proportionality: For lower-stakes settings with small N, prioritize inter-rater agreement, alignment evidence, and qualitative validity argument; for higher stakes, augment with broader evidence and decision-consistency studies (AERA et al., 2014; Kane, 2013).
- Consequential validity: Track intended uses and potential unintended effects (e.g., narrowing of instruction), integrating them into periodic rubric review (Messick, 1995).
- Scoring form (compact, rater-facing)
- Enter a level (1–4) for each criterion; multiply each level by its weight, sum the products, and divide by 100 (the total weight) to obtain the weighted average on the 1–4 scale; record qualitative comments anchored to the descriptors (see the scoring sketch after this section).
- Decision guidance:
- Exemplary: 3.5–4.0 average with no criterion below 3.
- Proficient: 2.75–3.49 average with no criterion below 2.
- Developing: 2.0–2.74 average, or a higher average where any criterion sits at 2 with notable deficiencies.
- Beginning: < 2.0 average or any criterion at 1 for C1, C3, or C5.
- Note: Use cut scores only after a documented standard-setting procedure consistent with stakes.
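The sketch below shows the rater-facing computation referenced above: the weighted average on the 1–4 scale from the C1–C6 levels and weights, plus one possible operationalization of the decision guidance. The entered levels are illustrative, the rule ordering is an assumption where the guidance overlaps, and the "notable deficiencies" qualifier still requires rater judgment.

```python
# Minimal sketch of the scoring-form computation using the weights defined above.
WEIGHTS = {"C1": 15, "C2": 20, "C3": 15, "C4": 15, "C5": 20, "C6": 15}

def composite(levels):
    """levels: dict mapping criterion -> level (1-4); returns weighted average on the 1-4 scale."""
    total_weight = sum(WEIGHTS.values())  # 100
    return sum(levels[c] * w for c, w in WEIGHTS.items()) / total_weight

def decision(levels):
    """One possible ordering of the decision guidance; adjudication may override it."""
    avg = composite(levels)
    if avg < 2.0 or any(levels[c] == 1 for c in ("C1", "C3", "C5")):
        return "Beginning"
    if avg >= 3.5 and min(levels.values()) >= 3:
        return "Exemplary"
    if avg >= 2.75 and min(levels.values()) >= 2:
        return "Proficient"
    return "Developing"

# Hypothetical scoring of one submission.
levels = {"C1": 3, "C2": 3, "C3": 4, "C4": 2, "C5": 3, "C6": 3}
print(f"Weighted average: {composite(levels):.2f} -> {decision(levels)}")
```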
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.
- CAST. (2018). Universal Design for Learning Guidelines version 2.2. CAST. https://udlguidelines.cast.org
- Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
- Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
- Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
- Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). https://doi.org/10.7275/q7rm-gg74
Note on evidence base
- The rubric’s emphasis on alignment, fairness, scoring quality, and a validity argument follows the Standards (AERA et al., 2014) and contemporary validity theory (Kane, 2013; Messick, 1995). The structure and rater procedures reflect empirical findings on rubric reliability and validity (Jonsson & Svingby, 2007; Moskal & Leydens, 2000).