Thesis statement
A defensible rubric for assessing “assessment practice” in educational or workplace settings must (a) be construct-aligned, (b) contain behaviorally anchored performance descriptors, (c) support reliable scoring through rater training and quality control, and (d) generate validity evidence for score use. The rubric below provides a parsimonious, evidence-informed structure grounded in authoritative standards and scholarship (AERA, APA, & NCME, 2014; Kane, 2013; Messick, 1995; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
- Intended use, population, and deliverables
- Intended use: Summative evaluation of candidates’ competence in designing, implementing, scoring, and using results from an assessment in their domain (e.g., education, training, certification).
- Target population: Pre-service or in-service educators, instructional designers, assessors, or program evaluators.
- Required artifacts:
- Assessment plan with purpose, construct definition, claims/targets, and blueprint.
- Task(s)/items, scoring tools (rubrics, keys), and administration materials.
- Fairness and accessibility plan (including accommodations and bias review).
- Quality assurance plan for scoring and standard setting as appropriate.
- Evidence report with data (pilot or simulated), reliability/consistency estimates, validity argument, and use-of-results plan.
- Reflective memo on improvement decisions and consequences.
- Performance levels and weights
- Scale: 4 = Exemplary; 3 = Proficient; 2 = Developing; 1 = Beginning.
- Recommended weights (sum to 100):
- C1 Purpose and construct definition: 15
- C2 Design and blueprinting: 20
- C3 Fairness, accessibility, and ethics: 15
- C4 Scoring and standard setting: 15
- C5 Evidence and interpretation (reliability/validity): 20
- C6 Reporting and improvement: 15
- Criteria with behaviorally anchored descriptors
C1. Purpose and construct definition (15)
- 4 Exemplary: States specific intended uses and decisions; articulates a defensible construct with boundaries and grain size; aligns claims to learning/competency targets and context; specifies consequences to be monitored (AERA et al., 2014; Messick, 1995).
- 3 Proficient: States uses and decisions; defines construct with minor ambiguities; shows clear alignment to targets.
- 2 Developing: Vague uses or decisions; construct definition incomplete or overly broad; partial alignment.
- 1 Beginning: No clear decision use; construct undefined or conflated with proxies; misaligned to targets.
C2. Design and blueprinting (20)
- 4 Exemplary: Provides blueprint mapping targets to tasks/items, cognitive processes, and score points; sampling is representative; tasks elicit intended evidence with clear directions; difficulty and cognitive demand are justified; administration plan addresses logistics and security (AERA et al., 2014).
- 3 Proficient: Logical blueprint with acceptable coverage; tasks largely elicit intended evidence; administration plan adequate.
- 2 Developing: Coverage gaps or imbalance; tasks partly mismatched to targets or pitched at inappropriate difficulty; administration plan incomplete.
- 1 Beginning: No blueprint or severe misalignment; tasks fail to elicit targeted evidence; administration plan absent.
C3. Fairness, accessibility, and ethics (15)
- 4 Exemplary: Documents bias/sensitivity review procedures; integrates universal design for learning and access features; specifies accommodation policies; anticipates and mitigates construct-irrelevant barriers; addresses privacy and informed consent; articulates fairness monitoring (AERA et al., 2014; CAST, 2018).
- 3 Proficient: Incorporates key fairness/accessibility elements with minor omissions; accommodation policy present.
- 2 Developing: Fairness considerations ad hoc; access features limited; accommodation guidance vague.
- 1 Beginning: No evidence of bias review, access planning, or ethical safeguards.
C4. Scoring and standard setting (15)
- 4 Exemplary: Rubrics/keys have clear performance indicators and anchors; rater training and calibration plan specified; quality control includes double-scoring and drift checks; criterion-referenced standard-setting approach selected and justified (e.g., Angoff/Bookmark) with cut score documentation when applicable (AERA et al., 2014).
- 3 Proficient: Rubrics clear with minor ambiguities; basic rater training plan; appropriate standard-setting choice with partial documentation.
- 2 Developing: Rubrics/keys lack anchors; limited or informal rater guidance; standard setting ill-specified or weakly justified.
- 1 Beginning: Scoring rules unclear; no rater training; cut scores arbitrary or absent when required.
C5. Evidence and interpretation: reliability/validity (20)
- 4 Exemplary: Provides a coherent validity argument spanning scoring, generalization, extrapolation, and decision inferences; includes relevant evidence (e.g., internal consistency or inter-rater reliability with appropriate coefficients; item/task analysis; alignment indices; relationships to external measures when feasible); limitations and alternative explanations addressed (Kane, 2013; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
- 3 Proficient: Presents multiple pertinent indices with correct interpretation; tentative validity narrative with minor gaps.
- 2 Developing: Limited or inappropriate indices; interpretations exceed evidence; validity argument superficial.
- 1 Beginning: No empirical checks; claims unsupported or inaccurate.
C6. Reporting, use, and improvement (15)
- 4 Exemplary: Communicates results for intended audiences with accuracy and transparency; provides actionable feedback; specifies decision rules; documents intended and unintended consequences; proposes concrete revisions based on evidence (AERA et al., 2014; Messick, 1995).
- 3 Proficient: Clear, audience-appropriate reporting and plausible improvement steps.
- 2 Developing: Reporting uneven or lacks actionability; improvement suggestions weakly connected to evidence.
- 1 Beginning: Results opaque or misleading; no plan for use or improvement.
- Scoring procedure and quality assurance
- Rater selection and training:
- Provide raters with construct definitions, exemplars/anchors, and a scoring guide; conduct calibration using anchor artifacts spanning the scale (Moskal & Leydens, 2000).
- Require agreement thresholds before operational scoring (e.g., percent exact agreement ≥ 70% and adjacent agreement ≥ 90% during training).
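As a quick computational check of those calibration thresholds, the sketch below computes percent exact and adjacent agreement for a pair of raters; the function name, the ten-artifact calibration set, and the scores are illustrative assumptions, while the 70%/90% thresholds come from the bullet above.

```python
# Minimal sketch (assumed data): calibration agreement between two raters,
# checked against the illustrative thresholds above (exact >= 70%, adjacent >= 90%).
import numpy as np

def calibration_agreement(rater_a, rater_b):
    """Return percent exact and adjacent (within one level) agreement for paired scores."""
    a = np.asarray(rater_a, dtype=float)
    b = np.asarray(rater_b, dtype=float)
    exact = np.mean(a == b)                  # identical level assigned
    adjacent = np.mean(np.abs(a - b) <= 1)   # within one scale point (includes exact)
    return exact, adjacent

# Hypothetical calibration set of ten anchor artifacts scored on the 1-4 scale.
rater_a = [4, 3, 3, 2, 1, 4, 2, 3, 3, 2]
rater_b = [4, 3, 2, 2, 1, 4, 3, 3, 3, 2]
exact, adjacent = calibration_agreement(rater_a, rater_b)
print(f"Exact: {exact:.0%}, Adjacent: {adjacent:.0%}")
qualified = exact >= 0.70 and adjacent >= 0.90
print("Proceed to operational scoring" if qualified else "Recalibrate before scoring")
```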
- Operational scoring:
- Double-score at least 20% of portfolios/practicum submissions; resolve discrepancies via adjudication rules.
- Monitor rater drift with periodic recalibration and feedback.
- Inter-rater reliability:
- Report an appropriate coefficient for ordinal rubric scores, such as a two-way random-effects intraclass correlation (ICC[2,k]) with 95% CIs; aim for ≥ 0.75 for high-stakes uses and ≥ 0.60 for moderate stakes, interpreted in context (Jonsson & Svingby, 2007).
- For categorical pass/fail decisions, report Cohen’s kappa or weighted kappa; examine decision consistency.
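The sketch below illustrates one way to compute ICC(2,k), the two-way random-effects, average-measures coefficient named above, directly from the Shrout and Fleiss ANOVA decomposition; the score matrix and its dimensions are assumed for illustration.

```python
# Minimal sketch: ICC(2,k) -- two-way random effects, average measures -- from the
# classic Shrout & Fleiss mean-square formula. Data and shapes are illustrative.
import numpy as np

def icc_2k(scores):
    """scores: (n_subjects, k_raters) array of rubric scores; returns ICC(2,k)."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)   # between-subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)   # between-raters
    ss_total = np.sum((x - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols                  # residual
    bms = ss_rows / (n - 1)                                # between-subjects mean square
    jms = ss_cols / (k - 1)                                # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))                     # error mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Hypothetical matrix: 6 submissions scored by 3 raters on the 1-4 scale.
scores = np.array([[4, 4, 3],
                   [3, 3, 3],
                   [2, 3, 2],
                   [1, 1, 2],
                   [4, 3, 4],
                   [2, 2, 2]])
print(f"ICC(2,k) = {icc_2k(scores):.2f}")
```

Operationally, an established implementation that also reports confidence intervals (for example, pingouin's intraclass_corr) is preferable to a hand-rolled calculation; for pass/fail classifications, a weighted kappa such as sklearn's cohen_kappa_score with quadratic weights is a common choice.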
- Internal structure and score quality:
- If multiple tasks/indicators form a composite, examine internal structure (e.g., inter-item correlations; factor structure if sample permits) with construct coherence as the goal rather than maximizing alpha.
- Use item/task analyses where appropriate (e.g., facility, discrimination trends) to inform revisions.
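One common operationalization of "facility and discrimination trends" for rubric-scored tasks is shown below: facility as the proportion of available points earned and discrimination as the corrected item-total correlation. The candidate-by-task matrix and the 4-point maximum are assumptions for illustration.

```python
# Minimal sketch (illustrative data): per-task facility and corrected item-total
# discrimination for a small composite of rubric-scored tasks on the 1-4 scale.
import numpy as np

def item_analysis(scores, max_points=4):
    """scores: (n_candidates, n_tasks) array; returns facility and discrimination per task."""
    x = np.asarray(scores, dtype=float)
    facility = x.mean(axis=0) / max_points      # proportion of available points earned
    discrimination = []
    for j in range(x.shape[1]):
        rest = x.sum(axis=1) - x[:, j]          # total score excluding the task itself
        discrimination.append(np.corrcoef(x[:, j], rest)[0, 1])
    return facility, np.array(discrimination)

# Hypothetical scores for 8 candidates on 4 tasks.
scores = np.array([[4, 3, 4, 3],
                   [3, 3, 3, 2],
                   [2, 2, 3, 2],
                   [1, 2, 2, 1],
                   [4, 4, 3, 4],
                   [3, 2, 2, 3],
                   [2, 1, 2, 2],
                   [3, 3, 4, 3]])
facility, discrimination = item_analysis(scores)
for j, (f, d) in enumerate(zip(facility, discrimination), start=1):
    print(f"Task {j}: facility = {f:.2f}, corrected item-total r = {d:.2f}")
```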
- Standard setting (when required):
- Choose a method aligned to task type and decision stakes (e.g., modified Angoff for selected-response, Body of Work for performance tasks); document panel qualifications, training, performance level descriptors, and cut score computations (AERA et al., 2014).
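For selected-response forms, the core modified Angoff arithmetic is simple enough to sketch: each panelist estimates, per item, the probability that a minimally competent candidate answers correctly, and the recommended cut score is the sum of item-level means. The panel size, item count, and ratings below are illustrative assumptions; full documentation of panelist qualifications and training still applies.

```python
# Minimal sketch of a modified Angoff cut-score computation under assumed inputs.
import numpy as np

# Rows = panelists, columns = dichotomous items; entries are probability judgments in [0, 1].
ratings = np.array([
    [0.70, 0.55, 0.80, 0.40, 0.65],
    [0.75, 0.60, 0.85, 0.45, 0.60],
    [0.65, 0.50, 0.75, 0.50, 0.70],
])

item_means = ratings.mean(axis=0)    # expected score of the borderline candidate per item
cut_score = item_means.sum()         # recommended raw cut score on the 5-item form
panelist_cuts = ratings.sum(axis=1)  # per-panelist cuts, useful for checking variability
print(f"Recommended cut score: {cut_score:.2f} of {ratings.shape[1]} points")
print(f"Panelist spread (SD of individual cuts): {panelist_cuts.std(ddof=1):.2f}")
```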
- Fairness monitoring:
- Document accommodations provided; collect qualitative feedback from examinees; where data permit, screen for subgroup anomalies while recognizing small-sample limitations; prioritize qualitative bias review for performance tasks.
- Implementation notes and adaptations
- Contextualization: Tailor task specificity and weighting to discipline and stakes while preserving the six criteria and four-level scale to support comparability.
- Evidence proportionality: For lower-stakes settings with small N, prioritize inter-rater agreement, alignment evidence, and qualitative validity argument; for higher stakes, augment with broader evidence and decision-consistency studies (AERA et al., 2014; Kane, 2013).
- Consequential validity: Track intended uses and potential unintended effects (e.g., narrowing of instruction), integrating them into periodic rubric review (Messick, 1995).
- Scoring form (compact, rater-facing)
- Enter a level (1–4) for each criterion; multiply each level by its weight, sum the products, and divide by 100 (the total weight) to obtain the weighted average on the 1–4 scale; record qualitative comments anchored to the descriptors (see the scoring sketch after this section).
- Decision guidance:
- Exemplary: 3.5–4.0 average with no criterion below 3.
- Proficient: 2.75–3.49 average with no criterion below 2.
- Developing: 2.0–2.74 average, or a higher average where any criterion sits at 2 with notable deficiencies.
- Beginning: < 2.0 average or any criterion at 1 for C1, C3, or C5.
- Note: Use cut scores only after a documented standard-setting procedure consistent with stakes.
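The sketch below shows the rater-facing computation referenced above: the weighted average on the 1–4 scale from the C1–C6 levels and weights, plus one possible operationalization of the decision guidance. The entered levels are illustrative, the rule ordering is an assumption where the guidance overlaps, and the "notable deficiencies" qualifier still requires rater judgment.

```python
# Minimal sketch of the scoring-form computation using the weights defined above.
WEIGHTS = {"C1": 15, "C2": 20, "C3": 15, "C4": 15, "C5": 20, "C6": 15}

def composite(levels):
    """levels: dict mapping criterion -> level (1-4); returns weighted average on the 1-4 scale."""
    total_weight = sum(WEIGHTS.values())  # 100
    return sum(levels[c] * w for c, w in WEIGHTS.items()) / total_weight

def decision(levels):
    """One possible ordering of the decision guidance; adjudication may override it."""
    avg = composite(levels)
    if avg < 2.0 or any(levels[c] == 1 for c in ("C1", "C3", "C5")):
        return "Beginning"
    if avg >= 3.5 and min(levels.values()) >= 3:
        return "Exemplary"
    if avg >= 2.75 and min(levels.values()) >= 2:
        return "Proficient"
    return "Developing"

# Hypothetical scoring of one submission.
levels = {"C1": 3, "C2": 3, "C3": 4, "C4": 2, "C5": 3, "C6": 3}
print(f"Weighted average: {composite(levels):.2f} -> {decision(levels)}")
```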
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.
- CAST. (2018). Universal Design for Learning Guidelines version 2.2. CAST. https://udlguidelines.cast.org
- Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
- Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
- Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
- Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). https://doi.org/10.7275/q7rm-gg74
Note on evidence base
- The rubric’s emphasis on alignment, fairness, scoring quality, and a validity argument follows the Standards (AERA et al., 2014) and contemporary validity theory (Kane, 2013; Messick, 1995). The structure and rater procedures reflect empirical findings on rubric reliability and validity (Jonsson & Svingby, 2007; Moskal & Leydens, 2000).