This design targets upper-secondary to early-undergraduate students and builds a standardized, scorable, reusable role-play assessment scenario on the theme of climate negotiations. It controls three key factors (role identity, evidence, and difficulty) so that evidence about student performance can be collected and interpreted reliably and validly. The design follows argument-based validation and performance-assessment principles (Kane, 2013; Messick, 1994; Pellegrino, Chudowsky, & Glaser, 2001), borrows the standardization practices of the objective structured clinical examination (OSCE) to improve scoring consistency (Harden et al., 1975), and includes fairness and accessibility provisions (AERA, APA, & NCME, 2014; CAST, 2018).
I. Assessment goals and constructs
- Focal constructs:
1) Evidence-based argumentation (building claims and their support from data and authoritative reports); 2) Negotiation and strategic reasoning (designing feasible concessions and trades under constraints); 3) Quantitative and policy literacy (interpreting climate indicators, treaty provisions, and policy options); 4) Collaboration and ethical reasoning (weighing fairness, climate justice, and intergenerational responsibility).
- Evidentiary sources and representations:
Oral statements, behaviors during negotiation (observation scale), negotiated text (joint communiqué/draft decision), and an individual post-task policy memo.
- Intended use:
Compatible with both formative improvement and summative scoring; can be combined with multi-round tasks and multiple raters to improve generalizability (Shavelson & Webb, 1991).
II. Standardized scenario (climate negotiation) overview
- Context: A simulated informal ministerial consultation at a Conference of the Parties (COP), focused on adopting a draft “Global Stocktake Action Package” that covers wording on 2030/2035 mitigation ambition, finance and loss-and-damage arrangements, and transparency and stocktake mechanisms.
- Success criteria (public and uniform): The draft is “adopted” if it meets the predefined adoption thresholds (for example: it covers the four agenda items, its text is free of internal contradictions, and it cites only material inside the evidence packet with coherent logic); otherwise it is “not adopted.”
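The adoption rule above is a conjunction of criteria, so a facilitator could record it as a simple checklist. The following is a minimal sketch only: the `DraftCheck` name and its fields are hypothetical, treating finance and loss-and-damage as separate items (to reach four) is an assumption, and contradiction and coherence are assumed to be judged by the facilitator rather than computed.

```python
from dataclasses import dataclass

# Agenda items taken from the scenario context. Splitting finance and
# loss-and-damage into two items is an assumption made to reach four.
REQUIRED_ITEMS = {"mitigation", "finance", "loss_and_damage", "transparency"}


@dataclass
class DraftCheck:
    """Facilitator's record of one draft (hypothetical structure)."""
    items_covered: set          # agenda items the draft addresses
    contradiction_free: bool    # facilitator judgement
    citations_in_packet: bool   # every citation points into the evidence packet
    logic_coherent: bool        # facilitator judgement

    def adopted(self) -> bool:
        # The draft passes only if every criterion holds at once.
        return (
            REQUIRED_ITEMS <= self.items_covered
            and self.contradiction_free
            and self.citations_in_packet
            and self.logic_coherent
        )
```

Because the rule is all-or-nothing, a draft that skips a single agenda item is “not adopted” even when every other criterion is met.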
III. Controlled role identities
- Principle: Use role archetypes rather than real country names to reduce outside priors and value conflicts; give every role an equal-length, equal-structure brief covering its mandate, domestic constraints, red lines, admissible concession range, preference ordering, and negotiation indicators.
- Standardized role set (choose 5 of 6, balanced mix):
- HDE: High-emitting developed economy
- EIE: Emerging industrializing economy
- SIS: Climate-vulnerable small island state
- LDC: Least developed country
- FFE: Fossil-fuel exporting economy
- MCF: Multilateral climate finance official (coordinates and gives technical explanations; non-voting)
- Standardized role brief (1 page per role):
- Mandate (3 priority goals, with a statement of the minimally acceptable outcome)
- Constraints (political, economic, and technical; at most 2 per category)
- Red lines (at most 3; each must be observable)
- Concessions and bargaining chips (at most 3, each with a trigger condition)
- Evidence priorities (which types of evidence to use first, and from what interpretive angle)
- Quantitative indicators (e.g., desired funding range, preferred key wording)
- Uniform speaking rules: every speaker has the same time limit in each round; the facilitator (examiner) follows a script to control rounds and timing so that opportunities are equal (Harden et al., 1975).
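The limits in the role brief (three goals, at most two constraints per category, at most three red lines and three concessions) can be checked mechanically before briefs are distributed. This is a hedged sketch: the `RoleBrief` class and its field names are illustrative assumptions, not a format prescribed by the design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RoleBrief:
    """One-page role brief (hypothetical representation)."""
    code: str                          # e.g. "SIS"
    mandate: List[str]                 # priority goals
    constraints: Dict[str, List[str]]  # keyed by political/economic/technical
    red_lines: List[str]
    concessions: List[str]
    evidence_priorities: List[str] = field(default_factory=list)
    indicators: Dict[str, str] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return the structural rules this brief breaks (empty if none)."""
        issues = []
        if len(self.mandate) != 3:
            issues.append("mandate must list exactly 3 priority goals")
        for category in ("political", "economic", "technical"):
            if len(self.constraints.get(category, [])) > 2:
                issues.append(f"more than 2 {category} constraints")
        if len(self.red_lines) > 3:
            issues.append("more than 3 red lines")
        if len(self.concessions) > 3:
            issues.append("more than 3 concessions")
        return issues
```

Running `problems()` on every brief in a role set is one way to confirm that the briefs really are equal in structure before a session.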
IV. Evidence packet design (evidence control)
- Core packet (shared by all roles, uniformly paginated):
A. Science: Excerpts from the Summary for Policymakers and key figures of the IPCC AR6 Synthesis Report (IPCC, 2023)
B. Law: Excerpts of Paris Agreement provisions on goals, NDCs, finance, and transparency (UNFCCC, 2015)
C. Policy gaps: An overview of the emissions gap and progress, drawn mainly from authoritative annual reports (e.g., key messages of the UNEP Emissions Gap Report); do not include unverified figures
D. Glossary: Uniform definitions of key concepts (e.g., “just transition,” “loss and damage”)
- Role-specific addendum (1 page per role, matched in length and density):
- Role-relevant angles for interpreting the data, plus citation prompts (no new external data; only the narrative focus changes)
- Sample wording preferences (e.g., “phase down” versus “phase out”)
- Evidence use rules:
- Every citable claim must come from the evidence packet and give a page number; scoring rewards accurate citation and appropriate interpretation, and external searching is excluded because it would create unequal access (AERA et al., 2014).
- No runnable simulator is provided; pre-made “scenario summary cards” (e.g., a side-by-side comparison of how different wordings affect accountability and verifiability) are used instead, to reduce error introduced by differences in tools.
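Because every claim must cite a packet page, a scorer could pre-screen statements and drafts for citations that fall outside the packet. The sketch below assumes a citation format of `(EP p. N)` and a known packet length; neither is specified in the design, and both names (`CITATION`, `check_citations`) are hypothetical.

```python
import re

# Hypothetical citation format: "(EP p. 12)" for evidence-packet page 12.
CITATION = re.compile(r"\(EP p\. (\d+)\)")


def check_citations(text, packet_pages):
    """Return (cited pages, cited pages that do not exist in the packet)."""
    pages = [int(number) for number in CITATION.findall(text)]
    out_of_range = [p for p in pages if not 1 <= p <= packet_pages]
    return pages, out_of_range


statement = "Current pledges fall short (EP p. 12). Finance is lagging (EP p. 40)."
print(check_citations(statement, packet_pages=30))  # ([12, 40], [40])
```

A check like this only confirms that a cited page exists; whether the citation is accurate and well interpreted still requires a human rater.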
V. Difficulty control
- Dimensions and switches:
- Topic breadth: number of mandatory sub-topics (2, 3, or 4)
- Evidence clarity: how explicit or ambiguous the evidence wording is (more or fewer explanatory footnotes)
- Time pressure: total duration and number of rounds (e.g., 60, 75, or 90 minutes)
- Stance polarity: how strongly the red lines conflict (more mutually incompatible items)
- Event intervention: a “breaking news” card inserted mid-session (e.g., extreme weather or a macroeconomic shock) that changes one role's concession range
- Sample levels:
- Basic (A): 2 topics, clearly annotated evidence, 90 minutes, no event
- Intermediate (B): 3 topics, partly ambiguous evidence, 75 minutes, 1 event
- Advanced (C): 4 topics, tension within the evidence, 60 minutes, 1–2 events, pronounced red-line conflicts
- Instruction-assessment alignment: difficulty adjustments do not change the constructs being measured; they manipulate cognitive load and negotiation complexity mainly through task conditions (Sitzmann, 2011).
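The three sample levels can be written down as configuration so that a session is set up from one record rather than from memory. This is a sketch under assumptions: the `DifficultyLevel` class and field names are hypothetical, and marking strong red-line conflict as absent at levels A and B is an assumption, since the design only states it for level C.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DifficultyLevel:
    label: str
    topics: int               # mandatory sub-topics
    evidence: str             # clarity of the evidence wording
    minutes: int              # total session length
    events: Tuple[int, int]   # (min, max) number of event cards
    strong_conflict: bool     # assumed False unless the design says otherwise


LEVELS = {
    "A": DifficultyLevel("Basic", 2, "clear", 90, (0, 0), False),
    "B": DifficultyLevel("Intermediate", 3, "partly ambiguous", 75, (1, 1), False),
    "C": DifficultyLevel("Advanced", 4, "in tension", 60, (1, 2), True),
}


def is_monotone(levels):
    """Check that each level is at least as demanding as the one before it."""
    ordered = [levels[key] for key in sorted(levels)]
    return all(
        a.topics <= b.topics and a.minutes >= b.minutes and a.events[1] <= b.events[1]
        for a, b in zip(ordered, ordered[1:])
    )
```

`is_monotone(LEVELS)` is a quick sanity check that a revised level table still orders the levels consistently on topics, time, and events.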
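VI. Administration and materials
- Timeline (suggested total 60–90 minutes):
- Preparation (10–15 min): common briefing, distribution of role briefs and the evidence packet, silent reading and annotation
- Opening statements (2 min each): state a position and its basis (with evidence-packet page numbers)
- Group negotiation (2–3 rounds × 10 min): the facilitator follows the script to prompt focus points and time limits
- Drafting (15 min): jointly draft a short action package (≤300 words, with at least 3 parenthetical evidence citations)
- Closing confirmation (5 min): the parties vote and give a compliance self-assessment
- Post-task memo (within 24 hours after class, 300–500 words): explain trade-offs, the weighing of evidence, and consistency with the role
- Records and scoring materials:
Audio recordings/transcripts (if permitted), observation scale, draft text, individual memos, and citation list.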
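The in-session phases of the timeline can be added up to confirm that a chosen configuration fits its time slot. This is a rough sketch: the function name is hypothetical, and the count of five speakers assumes one student per role in a five-role set.

```python
def session_minutes(prep, speakers, rounds,
                    opening_each=2, round_len=10, drafting=15, closing=5):
    """Total in-session minutes; the post-task memo is written afterwards."""
    return prep + speakers * opening_each + rounds * round_len + drafting + closing


print(session_minutes(prep=10, speakers=5, rounds=2))  # 60
print(session_minutes(prep=15, speakers=5, rounds=3))  # 75
```

Under these assumptions the phases sum to between 60 and 75 minutes, so a 90-minute Basic session would have about 15 minutes that the timeline does not allocate explicitly.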
VII. Scoring framework and rubrics
- Analytic rubric (6 dimensions, 4 levels each, with explicit anchor exemplars; Jonsson & Svingby, 2007):
1) Evidence use and accuracy (compliant sources, appropriate interpretation, proper citation)
2) Argumentation quality (completeness of the claim-evidence-warrant chain; aligned with the Toulmin framework)
3) Negotiation strategy and consistency (consistent with role goals and constraints; feasible trade design)
4) Quantitative/policy literacy (correct reading of indicators and of the logic of provisions, without over-extrapolation)
5) Collaboration and ethical reasoning (attention to fairness and justice; respect for other parties' constraints)
6) Product quality (text that is clear, actionable, and consistent with the evidence; attainment is judged relative to role goals to avoid bias toward any absolute position)
- Mapping of evidence to dimensions:
- Oral statements and process behaviors → dimensions 1–5
- Text and memo → dimensions 1–6
- Scoring and agreement:
- Double-score at least 20% of the sample; compute the ICC (for continuous scores) and the kappa coefficient (for categorical checks) (Hallgren, 2012)
- Train raters with an anchor set (borderline and typical samples) and recalibrate each round (AERA et al., 2014)
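For the double-scored subsample, agreement on rubric levels can be summarized with Cohen's kappa. The function below is a plain-Python sketch of the unweighted statistic for two raters; the scores are invented for illustration, and because rubric levels are ordinal a weighted variant may suit them better. The ICC is not shown.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the raters gave the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[level] * count_b[level] for level in count_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical level throughout
    return (observed - expected) / (1 - expected)


# Invented rubric levels (1-4) for six double-scored performances.
scores_a = [1, 2, 3, 4, 2, 3]
scores_b = [1, 2, 3, 3, 2, 4]
print(round(cohen_kappa(scores_a, scores_b), 3))  # 0.538
```

Here the raters agree on 4 of 6 items, but chance agreement is 10/36, which is why kappa is noticeably lower than the raw agreement rate.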
VIII. Standardization, reliability, and validity evidence
- Standardization:
- Uniform script (facilitator prompts, time markers, trigger conditions for event cards)
- Uniform materials (equal-length role briefs and evidence packet, uniform pagination)
- Uniform procedure (speaking time limits and rounds)
- Reliability:
- Multi-task/multi-round and multi-rater design, with a G study to estimate sources of error (Shavelson & Webb, 1991)
- Raters score text products blind; process behaviors are recorded by independent observers
- Validity argument (Kane, 2013; Messick, 1994):
- Construct level: task demands align with the target constructs (evidence-based argumentation and negotiation)
- Scoring inference: anchored exemplars and rater training support interpretable, consistent scores
- Extrapolation inference: stability checks across difficulty levels and rounds
- Decision consequences: monitor for positive consequences in classroom feedback and relearning
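Once a G study has estimated variance components, the effect of adding tasks or raters can be projected with the relative generalizability coefficient. The sketch below assumes a fully crossed person × task × rater design; the function name is hypothetical and the variance components are invented for illustration, not results from this assessment.

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative G coefficient for a crossed person x task x rater design."""
    # Relative error: person-by-facet interactions, each averaged over
    # the number of conditions sampled for that facet.
    relative_error = (
        var_pt / n_tasks
        + var_pr / n_raters
        + var_ptr_e / (n_tasks * n_raters)
    )
    return var_p / (var_p + relative_error)


# Invented components: person, person x task, person x rater, residual.
components = dict(var_p=0.5, var_pt=0.2, var_pr=0.1, var_ptr_e=0.4)
print(round(g_coefficient(**components, n_tasks=1, n_raters=1), 3))  # 0.417
print(round(g_coefficient(**components, n_tasks=2, n_raters=2), 3))  # 0.667
```

With these made-up numbers, moving from one task and one rater to two of each raises the coefficient from about 0.42 to about 0.67, which illustrates why the design recommends multiple rounds and raters.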
IX. Fairness and accessibility
- Measures:
- Use neutral role archetypes to avoid geopolitical labels; simplify the language and provide a glossary
- Allow reasonable accommodations (extended reading time, alternative modes of expression); material accessibility follows UDL recommendations (CAST, 2018)
- Explicitly prohibit external materials to avoid resource disparities; score only on the basis of the evidence packet (AERA et al., 2014)
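X. Sample administration elements (excerpts)
- Facilitator's opening script (uniform): explains the task, timing, success criteria, and evidence rules.
- Sample event card (Intermediate): a new extreme-weather bulletin. SIS's conditions for conceding on the ordering of finance priorities are relaxed; FFE requires that “technology-neutral” wording be kept in the text.
- Draft template: three headed sections, “Mitigation,” “Finance/Loss and Damage,” and “Transparency/Stocktake,” each ≤100 words and each with at least one parenthetical evidence-packet page citation.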
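The draft template's limits (three headed sections, a word cap per section, and at least one citation in each) lend themselves to a quick pre-check before scoring. This is a sketch only: the English heading strings, the `(EP p. N)` citation format, and the function name are assumptions rather than part of the design.

```python
import re

# Assumed English headings for the three template sections.
HEADINGS = ("Mitigation", "Finance/Loss and Damage", "Transparency/Stocktake")
# Hypothetical citation format: "(EP p. 12)" for evidence-packet page 12.
CITATION = re.compile(r"\(EP p\. \d+\)")


def template_problems(sections, max_words=100):
    """List template violations for a draft given as {heading: text}."""
    issues = []
    for heading in HEADINGS:
        body = sections.get(heading, "")
        if not body.strip():
            issues.append(f"{heading}: section missing")
            continue
        if len(body.split()) > max_words:
            issues.append(f"{heading}: over {max_words} words")
        if not CITATION.search(body):
            issues.append(f"{heading}: no evidence-packet citation")
    return issues
```

An empty list means the draft meets the formal template rules; it says nothing about the quality of the content, which remains a matter for the rubric.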
References (APA 7th edition)
- AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.
- CAST. (2018). Universal Design for Learning guidelines version 2.2. CAST.
- Hallgren, K. A. (2012). Computing inter-rater reliability for observational data. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34.
- Harden, R. M., Stevenson, M., Downie, W. W., & Wilson, G. M. (1975). Assessment of clinical competence using an objective structured clinical examination (OSCE). BMJ, 1, 447–451.
- IPCC. (2023). Climate Change 2023: Synthesis report. Intergovernmental Panel on Climate Change.
- Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130–144.
- Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
- Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
- Pellegrino, J. W., Chudowsky, N., & Glaser, R. (2001). Knowing what students know: The science and design of educational assessment. National Academy Press.
- Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Sage.
- Sitzmann, T. (2011). A meta-analytic examination of the instructional effectiveness of computer-based simulation games. Personnel Psychology, 64(2), 489–528.
- UNFCCC. (2015). Paris Agreement. United Nations Framework Convention on Climate Change.
Implementation essentials (summary)
- Tightly control roles, evidence, and difficulty to reduce construct-irrelevant variance and bring out construct-relevant performance (Messick, 1994).
- Standardize the procedure OSCE-style, with scripted facilitation, to ensure equal opportunity (Harden et al., 1975).
- Manage scoring error through double-scoring, anchored exemplars, and a G study (Shavelson & Webb, 1991; Hallgren, 2012).
- Use analytic rubrics and in-packet citation rules to make scores more interpretable and transferable (Jonsson & Svingby, 2007; AERA et al., 2014).