Popular roles are more than a source of inspiration; they are your productivity assistant. With carefully curated role prompts, you can quickly generate high-quality content, spark new ideas, and find the solution that best fits your needs. Creation made easier, value made more direct.
We continually update the role library around different user needs, so you can always find the right entry point for inspiration.
Create a basic rubric for evaluating a specific task type, offering accurate and professional recommendations.
Statement of intent: This proposal presents a foundational analytic rubric for evaluating empirical evaluation research reports. Its aim is to judge, in an operational, scoreable, and evidence-based way, the quality of a report across core dimensions: research purpose, design and methods, measurement quality, data analysis, results and interpretation, validity and fairness of inferences, and scholarly writing and compliance. The rubric follows professional standards in educational and psychological measurement and established reporting guidelines, and it includes guidance on use and reliability safeguards to support fair, consistent scoring (AERA, 2014; Brookhart, 2013; Jonsson & Svingby, 2007; Appelbaum et al., 2018).

I. Rubric structure and weights
- Type: Analytic rubric (each dimension scored separately)
- Levels: 4 Exemplary, 3 Proficient, 2 Developing, 1 Beginning
- Total: 100 points (suggested weights in parentheses)

1) Research purpose and evaluation questions (10%)
- 4: Purpose is explicit, focused, and evaluable; evaluation questions are operationalized and tightly aligned to the intervention/program logic model; criteria of success and expected effects are pre-specified.
- 3: Clear purpose; questions mostly operational and reasonably aligned to program goals.
- 2: Purpose somewhat vague; questions partially operationalized; success criteria largely defined post hoc.
- 1: Purpose and questions unclear or misaligned.

2) Literature and theoretical foundation (10%)
- 4: Current, systematic, and critical synthesis; an explicit theoretical framework guides design and hypotheses; key evidence gaps identified.
- 3: Adequate coverage with limited critical analysis; theoretical linkage mostly in place.
- 2: Limited or outdated scope; largely descriptive summary; weak theoretical linkage.
- 1: Minimal literature and no theoretical framework.

3) Design–method alignment and rigor (20%)
- 4: Design fits the questions (e.g., randomized/quasi-experimental, mixed methods) and is strongly justified; sampling strategy and statistical power are supported; procedures are reproducible; ethics approval and participant protections are clearly documented.
- 3: Design generally appropriate; sampling reasonable; procedures mostly replicable; ethics addressed.
- 2: Partial mismatches; convenience sampling without justification; gaps in procedural information; minimal attention to ethics.
- 1: Inappropriate design; sampling and procedures unclear; no ethics statement.

4) Measurement quality: instruments and evidence (15%)
- 4: Instruments aligned to constructs; content/structural validity and reliability (e.g., alpha/omega, ICC) reported with interval estimates; fairness and accessibility considered; adequate piloting or cognitive-interview evidence.
- 3: Some validity or reliability evidence; limited fairness discussion; basic piloting.
- 2: Face validity only; reliability unknown or inappropriately estimated.
- 1: No measurement evidence; instruments misaligned with constructs.

5) Data analysis and assumptions (15%)
- 4: Analyses strictly aligned to the questions; assumptions tested and reported (independence, normality/variance homogeneity, model fit); principled missing-data handling (e.g., multiple imputation); effect sizes and confidence intervals reported; robustness/sensitivity analyses conducted.
- 3: Appropriate core analyses; limited assumption checks; basic missing-data handling; partial reporting of effect sizes or intervals.
- 2: Analyses partly mismatched to the questions; no assumption checks; unjustified listwise deletion; no effect sizes.
- 1: Incorrect analyses, or analyses misaligned with the questions.

6) Results reporting and interpretation (10%)
- 4: Transparent, auditable reporting (clear tables/figures); interpretation weighs effect magnitude and uncertainty; practical significance discussed; negative or nonsignificant results reported faithfully without overclaiming.
- 3: Clear reporting; uncertainty addressed briefly; minor overreach.
- 2: Key information missing; overreliance on p values; overgeneralization.
- 1: Opaque conclusions or claims unsupported by the data.
7) Validity, reliability, and fairness of inferences; limitations (10%)
- 4: Builds a coherent validity argument under a unified view of validity; identifies and mitigates threats to internal/external validity; conducts key subgroup/equity analyses; limitations and potential biases articulated, with paths for improvement.
- 3: Addresses major threats and some equity issues; acknowledges limitations.
- 2: Limitations discussed superficially; no attention to equity or sources of bias.
- 1: Discussion absent or misleading.

8) Organization, writing, and referencing (10%)
- 4: Rigorous structure and precise language; citations and formatting consistent (e.g., APA); references complete and accurate; adheres to reporting standards (e.g., JARS); includes data/code availability statements and reproducibility information.
- 3: Clear structure; occasional minor formatting or citation errors.
- 2: Loose organization; multiple citation errors; writing impedes comprehension.
- 1: Seriously nonconforming; potential academic-integrity concerns.

II. Guidance for scoring and interpretation
- Level anchoring: Ground each level judgment in evidence, citing specific passages, tables, or appendices of the report (Brookhart, 2013).
- Weighted aggregation: Compute the total as the weighted sum of dimension scores; report both dimension scores and the total, rather than letting the total stand in for diagnostic information (AERA, 2014). A computational sketch appears after the notes below.
- Cut scores: Set level boundaries empirically or with the Angoff method, then verify and tune them through small-scale trial scoring (Jonsson & Svingby, 2007).

III. Reliability and rater procedures
- Raters: At least two independent raters; calibrate consensus standards and evidence requirements in advance (Moskal, 2000).
- Calibration: Score 2–3 anchor documents independently and discuss until raters share an understanding of what counts as "sufficient evidence."
- Interrater agreement: Quantify agreement (e.g., ICC[2,1] or weighted kappa); ICC ≥ 0.75 indicates good agreement, while values below 0.5 call for retraining and revision of anchor language (Koo & Li, 2016).
- Bias control: Blind scoring (mask author and institution information); score one dimension at a time across reports to reduce carryover effects between dimensions (AERA, 2014).

IV. Validity and fairness safeguards
- Content alignment: Align rubric dimensions with disciplinary standards and reporting guidelines (AERA, 2014; Appelbaum et al., 2018).
- Evidence chain: Retain scoring rationales and calibration records so that score interpretations remain traceable and transparent (Messick, 1995).
- Accommodations: Provide reasonable accommodations for authors/students with special needs while holding evaluation criteria constant (AERA, 2014).

V. Practical notes for use
- Formative feedback: For each dimension, give a short "evidence + suggestion" comment, e.g., "Data analysis: effect sizes reported, but model assumptions untested; add residual diagnostics and robustness analyses."
- Templates and checklists: Encourage report appendices that include a variable dictionary, analysis code, and reproducibility notes, improving auditability and scoring transparency (Appelbaum et al., 2018).

References (APA style)
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Appelbaum, M., Cooper, H., Kline, R. B., Mayo-Wilson, E., Nezu, A. M., & Rao, S. M. (2018). Journal article reporting standards for quantitative research in psychology: The APA Publications and Communications Board task force report. American Psychologist, 73(1), 3–25. https://doi.org/10.1037/amp0000191
- Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. ASCD.
- Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
- Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012
- Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
- Moskal, B. M. (2000). Scoring rubrics: What, when and how? Practical Assessment, Research & Evaluation, 7(3). https://doi.org/10.7275/rh6g-e3x0

Note: If the object of evaluation is a strict randomized controlled trial or a large-scale program evaluation, add explicit requirements for preregistration, intervention-fidelity monitoring, heterogeneity analyses, and multiple-comparison adjustment under "Design and methods" and "Data analysis" to further strengthen inferential quality and replicability.
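To make the weighted-aggregation and interrater-agreement checks above concrete, here is a minimal Python sketch. It is an illustration, not part of the rubric: the example ratings are hypothetical, the weights simply restate the suggested percentages, and the ICC(2,1) formula is the standard two-way random-effects, absolute-agreement form discussed in Koo and Li (2016).

```python
import numpy as np

# Suggested dimension weights from the rubric above: 10/10/20/15/15/10/10/10.
WEIGHTS = np.array([0.10, 0.10, 0.20, 0.15, 0.15, 0.10, 0.10, 0.10])

def weighted_total(levels: np.ndarray, weights: np.ndarray = WEIGHTS) -> float:
    """Map 1-4 level judgments per dimension to a 0-100 weighted total.

    Each level is rescaled to a 0-100 dimension score (1 -> 25, ..., 4 -> 100)
    before weighting; other rescalings are equally defensible.
    """
    dimension_scores = levels / 4.0 * 100.0
    return float(np.dot(dimension_scores, weights))

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: an (n_reports, k_raters) matrix of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # reports
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # raters
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return float((ms_rows - ms_err) /
                 (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n))

# Hypothetical example: one report's eight dimension levels, then two raters
# scoring five reports on the 0-100 total scale.
print(f"weighted total: {weighted_total(np.array([4, 3, 3, 4, 3, 4, 3, 4])):.1f}")

scores = np.array([[82.5, 80.0], [61.3, 65.0], [90.0, 88.8],
                   [71.3, 67.5], [55.0, 58.8]])
print(f"ICC(2,1): {icc_2_1(scores):.2f}")  # >= 0.75 suggests good agreement
```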
Thesis statement: To improve the consistency, comparability, and predictive validity of interview scoring, organizations should use structured interviews scored with a behaviorally anchored, multi-dimensional rating scale. The basic rubric below is grounded in job analysis, provides clear level definitions and behavioral anchors, and comes with scoring and decision rules plus quality-control essentials, consistent with professional standards and empirical evidence in personnel assessment.

I. Design principles and evidence base
- Job relevance and structure: Dimensions and weights should derive from job analysis, and questions, scoring standards, and probes should be held constant across administrations to improve reliability and validity (Campion, Palmer, & Campion, 1997; SIOP, 2018).
- Behaviorally anchored ratings: Provide observable behavioral descriptors for every level to reduce rater subjectivity and scale drift (Smith & Kendall, 1963).
- Evidence-based, standardized records: Require scores to rest on situation–task–action–result (STAR) evidence from the candidate, and retain scoring rationales and behavioral examples for traceability and compliance (AERA, APA, & NCME, 2014).
- Quality control: Control error sources through rater training, double-rating, and reliability monitoring (e.g., ICC) to strengthen the usability of results (Woehr & Huffcutt, 1994; Shrout & Fleiss, 1979).
- Empirical support: Structured interviews show robust advantages in reliability and validity over unstructured interviews, supported by multiple reviews and meta-analyses (Campion et al., 1997; Levashina, Hartwell, Morgeson, & Campion, 2014; Schmidt & Hunter, 1998).

II. Basic rubric structure
1. Rating levels and generic anchors (five-point scale, integer scores)
- 5 Excellent (well above job expectations): ample evidence and structured narration; systematic, forward-looking behavior with high-impact results.
- 4 Good (above job expectations): clear, mostly structured evidence; consistently high-quality performance with occasional minor gaps.
- 3 Acceptable (meets job expectations): usable, question-relevant evidence that meets the basic standard, with room for improvement.
- 2 Weak (below job expectations): fragmented or unspecific evidence; poor fit with the job's key requirements.
- 1 Insufficient (well below job expectations): no relevant evidence, or inappropriate/incorrect behavior.

2. Dimensions and behavioral anchors (generic; adjust dimensions and weights per job analysis)

Dimension A: Communication (suggested weight: equal weighting absent a job analysis)
- Definition: clear expression; listening and questioning; tailoring the message to the audience; logical organization and concision.
- Anchor examples:
  - 5: Clearly structured; proactively clarifies needs and restates key information; supports points with data/examples; adjusts terminology and depth to the audience.
  - 3: Answers questions reasonably clearly; occasionally digresses or loses structure; limited examples.
  - 1: Vague or off-topic answers; lacks logic; cannot provide relevant examples.

Dimension B: Problem solving and analytical reasoning
- Definition: problem framing; hypotheses and evidence; option generation and trade-offs; risk weighing and retrospection.
- Anchor examples:
  - 5: Systematic process (define–diagnose–solve–evaluate); uses data and criteria; articulates trade-offs and impact; reflects on lessons learned.
  - 3: Proposes workable options; partially considers constraints and risks; limited evidentiary support.
  - 1: Intuition-driven choices; no steps or rationale; cannot explain trade-offs.

Dimension C: Professional knowledge and situational application
- Definition: accuracy of key concepts and terminology; translating knowledge into practical problem solving; adherence to quality/safety/compliance requirements.
- Anchor examples:
  - 5: Concepts accurate; transfers to unfamiliar situations; identifies relevant standards and constraints.
  - 3: Grasps core concepts; applies them in common scenarios; occasional imprecise terminology.
  - 1: Key concepts wrong; cannot explain how knowledge applies in practice.

Dimension D: Collaboration and influence
- Definition: cross-functional collaboration; stakeholder management; conflict handling; information sharing and shared-goal orientation.
- Anchor examples:
  - 5: Actively integrates multiple viewpoints; resolves disagreements and builds consensus; credits the team appropriately.
  - 3: Works within a team division of labor; handles disagreements adequately; aligns on goals but with limited influence.
  - 1: Prefers to act alone; avoids or escalates conflict; dismisses others' contributions.

Dimension E: Execution and accountability
- Definition: goal setting; priority management; progress and risk tracking; delivery and keeping commitments.
- Anchor examples:
  - 5: Sets milestones and contingency plans; delivers reliably; proactively reports deviations and corrects course.
  - 3: Plans and mostly delivers on time; occasional delays with limited follow-up improvement.
  - 1: No clear plan; frequently misses deadlines; lacks accountability.

Note: Scores of 4 and 2 denote performance between adjacent anchored levels. Dimensions such as "professional ethics and compliance" or "customer focus" may be added for specific roles, but "culture fit" should not substitute for job-relevant behavioral evidence (Campion et al., 1997).

III. Scoring and decision rules
- Scoring unit: Each dimension is scored 1–5; "N/A (not applicable)" may be recorded and is excluded from the total.
- Evidence requirement: Record at least one STAR behavioral example (in bullet form) per dimension as the scoring rationale.
- Aggregation: Use a weighted average (weights from job analysis or subject-matter-expert judgment; absent such evidence, use equal weights and recalibrate after piloting). A computational sketch appears after the references below.
- Pass standard: Pilot first to gather validity and fairness evidence before setting thresholds; calibrate expert-set cut scores (e.g., a modified Angoff procedure) against historical data in parallel, rather than cutting arbitrarily (SIOP, 2018; AERA et al., 2014).

IV. Implementation and quality control
- Structured administration:
  - Use a standardized question bank and probe scripts; map each question to specific dimensions and behavioral anchors (Campion et al., 1997; Levashina et al., 2014).
  - Interviewers follow the same order and timing and avoid irrelevant questions.
- Rater training:
  - Cover behavioral-evidence identification, anchor alignment, control of common biases (halo, primacy/recency, similarity, leniency/severity), and documentation norms; calibrate through example scoring plus discussion (Woehr & Huffcutt, 1994).
- Reliability monitoring:
  - Use at least two raters; periodically compute interrater consistency (e.g., ICC, two-way random effects, absolute agreement, with the model chosen to match the design); analyze agreement at both the dimension and total-score level, and refine anchors and training accordingly (Shrout & Fleiss, 1979).
- Validity and fairness evidence:
  - Content validity: Dimensions and questions derive from job analysis and are reviewed by experts (SIOP, 2018).
  - Criterion-related validity: Relate scores to post-hire performance metrics, training outcomes, or objective output, and test for incremental validity (Schmidt & Hunter, 1998).
  - Fairness: Monitor differential functioning and differential prediction, follow principles of test fairness and accessibility, and provide accommodations when needed (AERA et al., 2014).

V. Scoring form template (example)
- Candidate: ________ Position: ________ Interview date: ______
- Interviewer: ________
- Dimension A Communication (1–5; N/A): __ points
  - Evidence notes: ____________________________________
- Dimension B Problem solving and analytical reasoning (1–5; N/A): __ points
  - Evidence notes: ____________________________________
- Dimension C Professional knowledge and situational application (1–5; N/A): __ points
  - Evidence notes: ____________________________________
- Dimension D Collaboration and influence (1–5; N/A): __ points
  - Evidence notes: ____________________________________
- Dimension E Execution and accountability (1–5; N/A): __ points
  - Evidence notes: ____________________________________
- Weighted total (computed automatically): __ points
- Overall comments and risk flags (required, evidence-based): ________________________
- Advance to next stage (per current pass standard): Yes / No / Pending

Conclusion: Built around job-relevant, multi-dimensional behavioral anchoring and standardized procedures, this basic rubric balances practical feasibility with evidence traceability, giving organizations a reusable scoring framework for structured interviews across roles and settings. Before formal high-stakes decisions, calibrate dimensions, anchors, weights, and thresholds locally through small-sample piloting and post hoc analysis.

References (APA 7th ed.)
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. AERA.
- Campion, M. A., Palmer, D. K., & Campion, J. E. (1997). A review of structure in the selection interview. Personnel Psychology, 50(3), 655–702.
- Levashina, J., Hartwell, C. J., Morgeson, F. P., & Campion, M. A. (2014). The structured employment interview: Narrative and quantitative review of the research literature. Personnel Psychology, 67(1), 241–293.
- Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262–274.
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47(2), 149–155.
- Society for Industrial and Organizational Psychology. (2018). Principles for the validation and use of personnel selection procedures (5th ed.). SIOP.
- Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67(3), 189–205.
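The aggregation rule in section III (weighted average with N/A dimensions excluded and the remaining weights renormalized) can be stated precisely in a few lines of Python. This is a hedged sketch: the dimension keys and the equal default weights are placeholders pending a job analysis.

```python
from typing import Mapping, Optional

# Hypothetical equal weights for dimensions A-E, pending job-analysis weights.
DEFAULT_WEIGHTS = {"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.2, "E": 0.2}

def weighted_interview_score(
    ratings: Mapping[str, Optional[int]],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of 1-5 dimension ratings; None marks N/A.

    N/A dimensions are dropped and the remaining weights renormalized,
    so the result stays on the 1-5 scale.
    """
    scored = {d: r for d, r in ratings.items() if r is not None}
    if not scored:
        raise ValueError("at least one dimension must be rated")
    for d, r in scored.items():
        if not 1 <= r <= 5:
            raise ValueError(f"dimension {d}: rating {r} outside 1-5")
    total_weight = sum(weights[d] for d in scored)
    return sum(weights[d] * r for d, r in scored.items()) / total_weight

# Example: dimension C marked N/A for a role with no domain-specific content.
print(weighted_interview_score({"A": 4, "B": 3, "C": None, "D": 4, "E": 5}))
# -> 4.0
```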
Thesis statement: A defensible rubric for assessing "assessment practice" in educational or workplace settings must (a) be construct-aligned, (b) contain behaviorally anchored performance descriptors, (c) support reliable scoring through rater training and quality control, and (d) generate validity evidence for score use. The rubric below provides a parsimonious, evidence-informed structure grounded in authoritative standards and scholarship (AERA, APA, & NCME, 2014; Kane, 2013; Messick, 1995; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).

1) Intended use, population, and deliverables
- Intended use: Summative evaluation of candidates' competence in designing, implementing, scoring, and using results from an assessment in their domain (e.g., education, training, certification).
- Target population: Pre-service or in-service educators, instructional designers, assessors, or program evaluators.
- Required artifacts:
  - Assessment plan with purpose, construct definition, claims/targets, and blueprint.
  - Task(s)/items and scoring tools (rubrics, keys), administration materials.
  - Fairness and accessibility plan (including accommodations and bias review).
  - Quality assurance plan for scoring and standard setting as appropriate.
  - Evidence report with data (pilot or simulated), reliability/consistency estimates, validity argument, and use-of-results plan.
  - Reflective memo on improvement decisions and consequences.

2) Performance levels and weights
- Scale: 4 = Exemplary; 3 = Proficient; 2 = Developing; 1 = Beginning.
- Recommended weights (sum to 100):
  - C1 Purpose and construct definition: 15
  - C2 Design and blueprinting: 20
  - C3 Fairness, accessibility, and ethics: 15
  - C4 Scoring and standard setting: 15
  - C5 Evidence and interpretation (reliability/validity): 20
  - C6 Reporting, use, and improvement: 15

3) Criteria with behaviorally anchored descriptors

C1. Purpose and construct definition (15)
- 4 Exemplary: States specific intended uses and decisions; articulates a defensible construct with boundaries and grain size; aligns claims to learning/competency targets and context; specifies consequences to be monitored (AERA et al., 2014; Messick, 1995).
- 3 Proficient: States uses and decisions; defines construct with minor ambiguities; shows clear alignment to targets.
- 2 Developing: Vague uses or decisions; construct definition incomplete or overly broad; partial alignment.
- 1 Beginning: No clear decision use; construct undefined or conflated with proxies; misaligned to targets.

C2. Design and blueprinting (20)
- 4 Exemplary: Provides blueprint mapping targets to tasks/items, cognitive processes, and score points; sampling is representative; tasks elicit intended evidence with clear directions; difficulty and cognitive demand are justified; administration plan addresses logistics and security (AERA et al., 2014).
- 3 Proficient: Logical blueprint with acceptable coverage; tasks largely elicit intended evidence; administration plan adequate.
- 2 Developing: Coverage gaps or imbalance; tasks partly mismatched to targets or over/under-difficult; administration plan incomplete.
- 1 Beginning: No blueprint or severe misalignment; tasks fail to elicit targeted evidence; administration plan absent.

C3. Fairness, accessibility, and ethics (15)
- 4 Exemplary: Documents bias/sensitivity review procedures; integrates universal design for learning and access features; specifies accommodation policies; anticipates and mitigates construct-irrelevant barriers; addresses privacy and informed consent; articulates fairness monitoring (AERA et al., 2014; CAST, 2018).
- 3 Proficient: Incorporates key fairness/accessibility elements with minor omissions; accommodation policy present.
- 2 Developing: Fairness considerations ad hoc; access features limited; accommodation guidance vague.
- 1 Beginning: No evidence of bias review, access planning, or ethical safeguards.

C4. Scoring and standard setting (15)
- 4 Exemplary: Rubrics/keys have clear performance indicators and anchors; rater training and calibration plan specified; quality control includes double-scoring and drift checks; criterion-referenced standard-setting approach selected and justified (e.g., Angoff/Bookmark) with cut score documentation when applicable (AERA et al., 2014).
- 3 Proficient: Rubrics clear with minor ambiguities; basic rater training plan; appropriate standard-setting choice with partial documentation.
- 2 Developing: Rubrics/keys lack anchors; limited or informal rater guidance; standard setting ill-specified or weakly justified.
- 1 Beginning: Scoring rules unclear; no rater training; cut scores arbitrary or absent when required.

C5. Evidence and interpretation: reliability/validity (20)
- 4 Exemplary: Provides a coherent validity argument spanning scoring, generalization, extrapolation, and decision inferences; includes relevant evidence (e.g., internal consistency or inter-rater reliability with appropriate coefficients; item/task analysis; alignment indices; relationships to external measures when feasible); limitations and alternative explanations addressed (Kane, 2013; Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
- 3 Proficient: Presents multiple pertinent indices with correct interpretation; tentative validity narrative with minor gaps.
- 2 Developing: Limited or inappropriate indices; interpretations exceed evidence; validity argument superficial.
- 1 Beginning: No empirical checks; claims unsupported or inaccurate.

C6. Reporting, use, and improvement (15)
- 4 Exemplary: Communicates results for intended audiences with accuracy and transparency; provides actionable feedback; specifies decision rules; documents intended and unintended consequences; proposes concrete revisions based on evidence (AERA et al., 2014; Messick, 1995).
- 3 Proficient: Clear, audience-appropriate reporting and plausible improvement steps.
- 2 Developing: Reporting uneven or lacks actionability; improvement suggestions weakly connected to evidence.
- 1 Beginning: Results opaque or misleading; no plan for use or improvement.

4) Scoring procedure and quality assurance
- Rater selection and training:
  - Provide raters with construct definitions, exemplars/anchors, and a scoring guide; conduct calibration using anchor artifacts spanning the scale (Moskal & Leydens, 2000).
  - Require agreement thresholds before operational scoring (e.g., percent exact agreement ≥ 70% and adjacent agreement ≥ 90% during training; see the sketch after the note on evidence base).
- Operational scoring:
  - Double-score at least 20% of portfolios/practicum submissions; resolve discrepancies via adjudication rules.
  - Monitor rater drift with periodic recalibration and feedback.
- Inter-rater reliability:
  - Report an appropriate coefficient for ordinal rubric scores, such as a two-way random-effects intraclass correlation (ICC[2,k]) with 95% CIs; aim for ≥ 0.75 for high-stakes uses and ≥ 0.60 for moderate stakes, interpreted in context (Jonsson & Svingby, 2007).
  - For categorical pass/fail decisions, report Cohen's kappa or weighted kappa; examine decision consistency.
- Internal structure and score quality:
  - If multiple tasks/indicators form a composite, examine internal structure (e.g., inter-item correlations; factor structure if sample permits) with construct coherence as the goal rather than maximizing alpha.
  - Use item/task analyses where appropriate (e.g., facility, discrimination trends) to inform revisions.
- Standard setting (when required):
  - Choose a method aligned to task type and decision stakes (e.g., modified Angoff for selected-response, Body of Work for performance tasks); document panel qualifications, training, performance level descriptors, and cut score computations (AERA et al., 2014).
- Fairness monitoring:
  - Document accommodations provided; collect qualitative feedback from examinees; where data permit, screen for subgroup anomalies while recognizing small-sample limitations; prioritize qualitative bias review for performance tasks.

5) Implementation notes and adaptations
- Contextualization: Tailor task specificity and weighting to discipline and stakes while preserving the six criteria and four-level scale to support comparability.
- Evidence proportionality: For lower-stakes settings with small N, prioritize inter-rater agreement, alignment evidence, and qualitative validity argument; for higher stakes, augment with broader evidence and decision-consistency studies (AERA et al., 2014; Kane, 2013).
- Consequential validity: Track intended uses and potential unintended effects (e.g., narrowing of instruction), integrating them into periodic rubric review (Messick, 1995).

6) Scoring form (compact, rater-facing)
- Enter level (1–4) for each criterion; multiply by weight; sum total score; record qualitative comments anchored to descriptors.
- Decision guidance (a computational sketch follows the note on evidence base below):
  - Exemplary: 3.5–4.0 average with no criterion below 3.
  - Proficient: 2.75–3.49 average with no criterion below 2.
  - Developing: 2.0–2.74 average or any criterion at 2 with notable deficiencies.
  - Beginning: < 2.0 average or any criterion at 1 for C1, C3, or C5.
  - Note: Use cut scores only after a documented standard-setting procedure consistent with stakes.

References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- CAST. (2018). Universal Design for Learning Guidelines version 2.2. CAST. https://udlguidelines.cast.org
- Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144. https://doi.org/10.1016/j.edurev.2007.05.002
- Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
- Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
- Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). https://doi.org/10.7275/q7rm-gg74

Note on evidence base
- The rubric's emphasis on alignment, fairness, scoring quality, and a validity argument follows the Standards (AERA et al., 2014) and contemporary validity theory (Kane, 2013; Messick, 1995). The structure and rater procedures reflect empirical findings on rubric reliability and validity (Jonsson & Svingby, 2007; Moskal & Leydens, 2000).
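As an illustration of the calibration thresholds in section 4 and the decision guidance in section 6, the following Python sketch computes exact/adjacent agreement between two raters and maps weighted criterion levels to performance levels. It is a sketch under stated assumptions: the function names are mine, the thresholds simply restate the figures above, and the "notable deficiencies" qualifier for Developing is left to rater judgment rather than encoded.

```python
from typing import Sequence

def agreement_rates(r1: Sequence[int], r2: Sequence[int]) -> tuple[float, float]:
    """Percent exact and adjacent (within one level) agreement for two raters."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rater score lists must be non-empty and equal length")
    exact = sum(a == b for a, b in zip(r1, r2)) / len(r1)
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(r1, r2)) / len(r1)
    return exact, adjacent

def calibrated(r1: Sequence[int], r2: Sequence[int]) -> bool:
    """Training gate from section 4: exact >= 70% and adjacent >= 90%."""
    exact, adjacent = agreement_rates(r1, r2)
    return exact >= 0.70 and adjacent >= 0.90

def decision_level(levels: dict[str, int], weights: dict[str, int]) -> str:
    """Numeric part of the section 6 decision guidance for levels 1-4."""
    avg = sum(levels[c] * weights[c] for c in levels) / sum(weights.values())
    # Beginning overrides: sub-2.0 average or a 1 on a gating criterion.
    if avg < 2.0 or any(levels[c] == 1 for c in ("C1", "C3", "C5")):
        return "Beginning"
    if avg >= 3.5 and min(levels.values()) >= 3:
        return "Exemplary"
    if avg >= 2.75 and min(levels.values()) >= 2:
        return "Proficient"
    return "Developing"

# Hypothetical training calibration data for two raters on ten artifacts.
print(calibrated([4, 3, 3, 2, 4, 3, 2, 3, 4, 4],
                 [4, 3, 2, 2, 4, 3, 2, 4, 4, 3]))  # -> True

# Hypothetical candidate: weighted average 3.5 with no criterion below 3.
weights = {"C1": 15, "C2": 20, "C3": 15, "C4": 15, "C5": 20, "C6": 15}
levels = {"C1": 4, "C2": 3, "C3": 4, "C4": 3, "C5": 4, "C6": 3}
print(decision_level(levels, weights))  # -> "Exemplary"
```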
Quickly generate evaluation dimensions and level descriptors for assignments, reports, labs, and closed-book exams; align them with course learning outcomes; produce bilingual rubrics with examples; use them for TA training and grading-consistency calibration.
Build rubrics for role competencies and performance goals; set weights and thresholds; generate behaviorally anchored examples; use them for quarterly reviews, promotion panels, and standardized interview scorecards.
Design reaction-, learning-, behavior-, and results-level rubrics for training evaluation; define scoring rules for hands-on assessments; use them for instructor evaluation, course iteration, and diagnosing learner gaps.
Construct scoring sheets for questionnaire item quality and sample validity; generate screening and exclusion criteria; produce coding standards and review checklists; improve data reliability and comparability.
Design rubrics for process audits and service quality inspection; define critical defects and deduction rules; use them to standardize store inspections, call-center call sampling, and supplier evaluation.
Draft research instruments grounded in the literature and relevant standards; generate citation formats and evidence chains; offer suggestions for improving reliability and validity; support pretesting and data-collection plans.
Quickly turns an "evaluation task" into a rigorous, actionable, reusable basic rubric. From minimal input (task type and output language), it automatically produces clear evaluation dimensions, level standards, behavioral or evidence descriptors, scoring examples and caveats, weight suggestions, and preliminary reliability/validity checks with pilot-revision advice, helping teams establish consistent, fair, auditable evaluation standards for educational measurement, performance appraisal, course evaluation, customer quality inspection, user research, and similar settings. Its scholarly, evidence-based language ensures professionalism, supports cross-team and cross-language collaboration, and markedly shortens rubric design cycles while reducing rework.
Copy the prompt generated from the template into your usual chat application (e.g., ChatGPT or Claude) and use it directly in conversation, with no additional development. Suited to quick personal trials and lightweight use.
Turn the prompt template into an API: your program can modify template parameters freely and call it directly through the interface, making automation and batch processing straightforward. Suited to developer integration and embedding in business systems. A hedged example call follows.
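For scripting the API workflow just described, a call might look like the sketch below. The endpoint URL, template ID, parameter names, and authentication header are hypothetical placeholders; substitute the values documented by your actual prompt-template provider.

```python
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint and payload - replace with your provider's real values.
API_URL = "https://api.example.com/v1/prompt-templates/render"
payload = {
    "template_id": "basic-rubric-generator",  # hypothetical template name
    "params": {"task_type": "lab report", "output_language": "en"},
}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.post(API_URL, json=payload, headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())  # the rendered prompt, ready to send to an LLM
```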
Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly between different AI tools.