Data Science: Regression Modeling Project Assignment (with Rubric and Timeline)
1. Learning Objectives
- Define a regression problem, then select and justify appropriate evaluation metrics (e.g., RMSE, MAE, R²).
- Execute an end-to-end pipeline: data cleaning, feature engineering, model training, hyperparameter tuning, and validation.
- Check linear regression assumptions, run model diagnostics, and perform error analysis and a basic fairness review.
- Interpret the model (coefficients, feature importance, SHAP or partial dependence plots) and write a clear report.
- Ensure reproducibility and compliance (random seeds, environment dependencies, data licenses, and privacy).
2. Project Overview
- Choose a real-world dataset with a continuous target variable, frame a prediction question around a business or research scenario, and build a regression model.
- Deliver code with reproducible experiments, interpretable results, a reflection on errors and risks, and conclusions written for a non-technical audience.
3. Dataset Options and Requirements
- Option A: Use a dataset from the instructor-provided list (posted on the course platform).
- Option B: Use a self-sourced dataset that meets all of the following requirements:
  - Continuous target variable; 1,000–100,000 rows; 5–50 features.
  - Publicly licensed or used with authorization for academic work; no personally identifiable information (PII) unless it has been properly anonymized.
  - No data leakage (e.g., no features recorded after the outcome; target encoding fitted inside each CV fold).
  - Time-series data must use time-based splits or rolling validation.
- Submit a Data Sheet covering source, field definitions, time span, license, and potential biases.
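The time-based split requirement for time-series data can be illustrated with a minimal sketch. It assumes scikit-learn is installed and uses a hypothetical toy series of 12 rows already sorted by time; the point is only that every training index precedes every validation index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 12 observations, already sorted by time.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Training rows always come before validation rows,
    # so the model never sees the future while being fitted.
    print(
        f"fold {i}: train rows {train_idx.min()}..{train_idx.max()}, "
        f"validation rows {val_idx.min()}..{val_idx.max()}"
    )
```

A plain shuffled split on the same data would mix later rows into training, which is the leakage this requirement is meant to prevent.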
4. Tasks and Deliverables
Problem Definition and Metrics
- Specify the scenario, audience, business value, and constraints (latency, interpretability, cost function).
- Choose a primary metric (RMSE or MAE recommended; report at least one of the two) and secondary metrics (R²; add MAPE only where appropriate, e.g., when the target is never zero or close to zero), and justify the choice.
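As a reference for the metrics named above, here is a minimal sketch that computes MAE, RMSE, and R² from their standard definitions with NumPy. The function name `regression_metrics` and the four-point example are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalizes large errors more
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # zero if the target is constant
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical example values.
metrics = regression_metrics([3, 5, 2, 7], [2, 5, 4, 8])
print(metrics)
```

RMSE weights large errors more heavily than MAE, so the choice of primary metric should follow from how costly large misses are in the chosen scenario.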
Data Understanding and Cleaning
- Provide a data dictionary, a missing-value summary, and an outlier strategy (winsorization, transformation, or a robust loss).
- Split the data into training, validation, and test sets; 70/15/15 is suggested, or a time-based split for time series. Fix random_state.
- Build a reusable preprocessing pipeline (ColumnTransformer + Pipeline).
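One way to combine the 70/15/15 split with a ColumnTransformer + Pipeline is sketched below. The synthetic data, the column names (`area`, `age`, `district`), the median imputation, and the Ridge regressor are all assumptions chosen for illustration; substitute your own dataset and choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Hypothetical synthetic dataset with two numeric and one categorical feature.
n = 200
df = pd.DataFrame({
    "area": rng.normal(100, 20, n),
    "age": rng.integers(0, 50, n).astype(float),
    "district": rng.choice(["north", "south", "east"], n),
})
df.loc[::25, "area"] = np.nan  # inject a few missing values
y = 3.0 * df["area"].fillna(100) - 0.5 * df["age"] + rng.normal(0, 5, n)

# 70/15/15: hold out 30% first, then split the holdout in half.
X_train, X_hold, y_train, y_hold = train_test_split(
    df, y, test_size=0.30, random_state=SEED)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=SEED)

numeric_cols = ["area", "age"]
categorical_cols = ["district"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Imputer, scaler, and encoder are fitted on the training rows only.
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])
model.fit(X_train, y_train)

val_r2 = model.score(X_val, y_val)
print(f"validation R²: {val_r2:.3f}")
```

Because preprocessing lives inside the pipeline, the same fitted object can be reused for cross-validation and for inference, and the test split stays untouched until the final evaluation.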
Exploratory Data Analysis (EDA)
- Examine the target distribution, feature distributions, a correlation heatmap, and early warning signs of multicollinearity.
- Record initial insights and hypotheses, and use them to propose feature engineering ideas.
Feature Engineering and Encoding
- Numeric scaling or transformation; one-hot or target encoding for categorical features (target encoding fitted inside CV folds to avoid leakage); date-derived features, interaction terms, and binning.
- Record the marginal contribution of each step to the target metric (ablation study).
Baselines and Candidate Models
- Baselines: mean or median predictor, simple linear regression.
- Candidates: Ridge, Lasso, Elastic Net, and tree-based models (RandomForest, Gradient Boosting, XGBoost/LightGBM); KNN and SVR are optional.
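A baseline comparison might look like the following sketch. It assumes scikit-learn, uses hypothetical synthetic data, and takes a plain `LinearRegression` on all features as the linear baseline; the mean and median predictors come from `DummyRegressor`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: the target depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "median baseline": DummyRegressor(strategy="median"),
    "linear regression": LinearRegression(),
}

val_mae = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    val_mae[name] = mean_absolute_error(y_val, estimator.predict(X_val))
    print(f"{name}: validation MAE = {val_mae[name]:.3f}")
```

Any candidate model that cannot clearly beat the mean or median predictor on the validation set is not yet learning anything useful from the features.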
Training, Tuning, and Validation
- Use K-fold cross-validation (K=5 suggested); use TimeSeriesSplit for time-series data.
- Run a hyperparameter search (randomized, grid, or Bayesian) and report the search space and compute budget.
- Use the test set exactly once, for the final evaluation; never peek at it during development.
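The sketch below shows one possible shape for 5-fold cross-validation with a small grid search, reporting the CV score as mean ± standard deviation. It assumes scikit-learn; the synthetic data, the Ridge model, and the alpha grid are illustrative choices. In a real project, `X` and `y` here would be the training portion only, with the test set held back.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=200)

# Scaling sits inside the pipeline so it is refitted on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},  # the search space to report
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X, y)

best = search.best_index_
cv_rmse_mean = -search.cv_results_["mean_test_score"][best]
cv_rmse_std = search.cv_results_["std_test_score"][best]
print(f"best alpha: {search.best_params_['ridge__alpha']}")
print(f"CV RMSE: {cv_rmse_mean:.3f} ± {cv_rmse_std:.3f}")
```

The grid has 4 values and 5 folds, so the compute budget here is 20 fits plus one refit; stating that number alongside the search space is what the reporting requirement asks for.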
Assumption Checks and Diagnostics (required for linear models)
- Residual plots (residuals vs. fitted, Q-Q plot), a heteroskedasticity test (e.g., Breusch–Pagan), multicollinearity (VIF), and influential points (Cook's distance).
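To make the VIF check concrete, here is a NumPy-only sketch of the definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features plus an intercept. The helper name `vif` and the synthetic columns are assumptions for illustration; statsmodels offers `variance_inflation_factor` and `het_breuschpagan` if you prefer library implementations.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors[j] = 1.0 / (1.0 - r2)
    return factors

# Hypothetical features: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 1))))
```

Columns a and c get large factors because each is nearly a linear function of the other, while b stays close to 1; a common rule of thumb treats values above roughly 5 to 10 as a warning sign.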
Interpretation, Error, and Fairness
- Coefficient interpretation (after standardization), tree-model feature importances, and SHAP or PDP.
- Stratified error analysis (by value range, category, or time period) to locate systematic bias.
- If any features relate to groups of people, compare subgroup errors (e.g., gaps in MAE or RMSE) and document the limitations.
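A subgroup error comparison can be as simple as a pandas group-by over the prediction table. The sketch below uses a hypothetical six-row table with a `group` column; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical validation results with a subgroup label per row.
results = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "y_true": [10.0, 12.0, 11.0, 20.0, 22.0, 21.0],
    "y_pred": [11.0, 12.0, 10.0, 17.0, 26.0, 23.0],
})
results["abs_err"] = (results["y_true"] - results["y_pred"]).abs()
results["sq_err"] = (results["y_true"] - results["y_pred"]) ** 2

by_group = results.groupby("group").agg(
    n=("abs_err", "size"),
    mae=("abs_err", "mean"),
    rmse=("sq_err", lambda s: np.sqrt(s.mean())),
)
mae_gap = by_group["mae"].max() - by_group["mae"].min()

print(by_group)
print(f"MAE gap between subgroups: {mae_gap:.3f}")
```

Report the subgroup sizes next to the errors: a gap measured on a handful of rows is weak evidence, which is one of the limitations worth documenting.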
Robustness and Reproducibility
- Report stability across random seeds and repeated resampling, and sensitivity to small amounts of added noise or missing values.
- Fix random seeds; submit requirements.txt or environment.yml; save the trained model (joblib/pkl).
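One simple way to probe stability across seeds is to repeat the train/validation split with several seeds and look at the spread of the metric. The sketch below assumes scikit-learn; the synthetic data, the Ridge model, and the choice of five seeds are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=400)

scores = []
for seed in range(5):
    # A different seed gives a different resample of train/validation rows.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    scores.append(mean_absolute_error(y_va, model.predict(X_va)))

scores = np.array(scores)
print(f"validation MAE over {len(scores)} seeds: "
      f"{scores.mean():.3f} ± {scores.std(ddof=1):.3f}")
```

If the spread across seeds is as large as the difference between two candidate models, that difference should not be presented as a real improvement.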
Lightweight Deployment and Usage
- Provide a predict function and a command-line or script example that loads the saved model and runs inference on new data.
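A predict function with a small command-line wrapper could be sketched as follows. The flag names (`--model`, `--input`, `--output`), the `id` column convention, and the tiny demo model are assumptions for illustration; the demo at the bottom trains and saves a toy model in a temporary directory so the sketch runs end to end.

```python
import argparse
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict(model_path, input_csv, output_csv, id_col="id"):
    """Load a saved model, score input_csv, and write id and prediction columns."""
    model = joblib.load(model_path)
    data = pd.read_csv(input_csv)
    features = data.drop(columns=[id_col])
    out = pd.DataFrame({
        id_col: data[id_col],
        "prediction": model.predict(features),
    })
    out.to_csv(output_csv, index=False)
    return out

def main(argv=None):
    parser = argparse.ArgumentParser(description="Batch inference with a saved model.")
    parser.add_argument("--model", required=True, help="path to model.joblib")
    parser.add_argument("--input", required=True, help="CSV with an id column plus features")
    parser.add_argument("--output", required=True, help="where to write predictions")
    args = parser.parse_args(argv)
    predict(args.model, args.input, args.output)

# Demo with hypothetical data: train a toy model, save it, then run the CLI entry point.
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "model.joblib")
input_csv = os.path.join(workdir, "new_data.csv")
output_csv = os.path.join(workdir, "predictions.csv")

train = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
target = 2.0 * train["x1"] + train["x2"]
joblib.dump(LinearRegression().fit(train, target), model_path)

pd.DataFrame({"id": [101, 102], "x1": [4.0, 5.0], "x2": [0.0, 1.0]}).to_csv(
    input_csv, index=False)

main(["--model", model_path, "--input", input_csv, "--output", output_csv])
predictions = pd.read_csv(output_csv)
print(predictions)
```

In a standalone script such as cli_predict.py, the demo section would be replaced by an `if __name__ == "__main__": main()` guard, and the saved object would normally be the full preprocessing-plus-model pipeline rather than a bare regressor.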
Deliverables
- Technical report as PDF (8–12 pages, including figures and references).
- Slide deck (at most 10 slides, aimed at a non-technical audience).
- Executable notebook (.ipynb) and scripts (.py), with a README and an environment file.
- Data Sheet and data dictionary.
- Final test-set predictions file (CSV with id and prediction columns).
- Model file and inference example (.pkl/.joblib plus usage instructions).
5. Assessment Rubric (100 points total)
- Problem Framing and Metrics (10 points): clear scenario; metric choice is justified and interpretable in business terms.
- Data Understanding and EDA (15 points): thorough data quality review, valuable insights, well-formed figures.
- Preprocessing and Feature Engineering (15 points): proper pipelines, correct encoding and missing-value handling, ablations backed by evidence.
- Modeling and Tuning (20 points): reasonable baselines, diverse candidates, appropriate tuning strategy and search space.
- Validation and Diagnostics (15 points): rigorous CV or time-based splits, complete linear diagnostics, solid error analysis.
- Interpretation and Communication (10 points): credible explanations that do not overreach, clear text and figures, actionable conclusions.
- Reproducibility and Code Quality (10 points): reproducible environment, clear structure, complete comments and documentation, no leakage.
- Ethics and Fairness (3 points): license compliance, privacy awareness, and a subgroup error comparison or a stated caveat.
- Professional Presentation (2 points): tidy layout, good time management, accurate answers in Q&A.
- Bonus (up to +5 points): innovation (e.g., Bayesian regression or uncertainty estimation), rigorous ablations, automated experiment scripts or a Makefile.
Scoring Notes
- Excellent: methods fit the problem, the chain of evidence is complete, and results are robust and deployable.
- Satisfactory: the pipeline is complete; details are occasionally lacking, but the conclusions are reliable.
- Needs improvement: leakage, improper validation, or unclear reporting.
6. Timeline and Milestones (by week; exact dates to be announced by the instructor)
Week 1
- Submit a 1-page project proposal with proof of the data license; define metrics and success criteria; draft an initial risk list.
Week 2
- Submit a draft of the EDA and data cleaning; complete the baseline models; lock in the data split strategy.
Week 3
- Submit the feature engineering and candidate model plan; start CV and tuning; upload interim code and README.
Week 4
- Complete tuning and diagnostics; carry out error and fairness analysis; the instructor releases the hidden test set or the final evaluation requirements.
Week 5
- Submit the final package: report, slides, code, environment, model, and predictions CSV; in-class presentation and Q&A.
Late Policy and Thresholds
- Late submissions lose 10% of the total score per 24 hours, for up to 3 days; submissions more than 3 days late receive no credit unless approved in advance.
- The passing threshold is 60 points; regrade requests must be submitted within 7 days of grade release.
7. Technical and Tooling Requirements
- Language and environment: Python 3.9+. Recommended libraries: numpy, pandas, scikit-learn, statsmodels, matplotlib/seaborn, shap. xgboost and lightgbm are optional for tree-based models. PyTorch or TensorFlow may be used, for regression models only.
- Code standards: use Git for version control; keep a clear structure (src/data, src/features, src/models, etc.); write comments and docstrings; black and flake8 are preferred.
- Reproducibility: fix random seeds; provide requirements.txt or environment.yml; include a one-command run instruction in the README.
- Data handling: use Pipeline to prevent data leakage; fit scalers on the training set only; perform target encoding inside CV folds.
- Evaluation: report CV results as mean ± standard deviation; report the test-set result only once, at the end; bootstrap confidence intervals are optional.
- Linear models: check VIF, the Breusch–Pagan test, and residual plots; apply a log or Box-Cox transform where needed; use standardized coefficients for interpretation.
- Deployment example: save model.joblib; provide a cli_predict.py example (CSV in, predictions out).
- Tests: write at least 1–2 unit tests for key preprocessing and inference functions.
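For the optional bootstrap confidence interval, a percentile bootstrap over per-row absolute errors is one straightforward approach. The sketch below is NumPy-only; the function name `bootstrap_mae_ci`, the 2,000 resamples, and the synthetic predictions are assumptions for illustration.

```python
import numpy as np

def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for MAE; returns (point estimate, lower, upper)."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    rng = np.random.default_rng(seed)
    # Resample rows with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(abs_err), size=(n_boot, len(abs_err)))
    boot_mae = abs_err[idx].mean(axis=1)
    lower, upper = np.quantile(boot_mae, [alpha / 2, 1 - alpha / 2])
    return abs_err.mean(), lower, upper

# Hypothetical test-set targets and predictions.
rng = np.random.default_rng(1)
y_true = rng.normal(50, 10, size=200)
y_pred = y_true + rng.normal(0, 3, size=200)

point, lower, upper = bootstrap_mae_ci(y_true, y_pred)
print(f"test MAE = {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Resampling the already-computed errors does not refit the model or touch the test set again, so it stays consistent with reporting the test result only once.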
8. Academic Integrity and Use of AI
- Work individually or in teams of 2–3; the report must state each member's contribution.
- You may consult open-source code and use generative AI, provided that you:
  - Clearly mark the resources used, the prompts, and where generated content appears.
  - Technically review and empirically validate any generated content; you are responsible for the results.
- Submitting someone else's project or leaking test-set answers is prohibited.
9. Suggested Report Structure
- Abstract and summary of conclusions (readable by a non-technical audience).
- Problem definition and data source (including license and ethics).
- Methods: preprocessing, feature engineering, models, and tuning plan.
- Results: CV and test-set metrics, comparison against baselines, diagnostics, and error analysis.
- Interpretation and business implications; limitations and future work.
- Reproducibility and deployment notes (run commands, environment, model files).
10. Submission Checklist
- PDF report; slide deck of at most 10 slides.
- Code and notebook; README; requirements.txt or environment.yml.
- Data Sheet and data dictionary; trained model file; predictions CSV.
- Contribution statement and citations/acknowledgments; AI usage note (if applicable).
Tips and Common Pitfalls
- Avoid data leakage (date-derived features, target encoding, the order of scaling and splitting).
- Benchmark complex models against simple, strong baselines to guard against overfitting.
- Align metrics with business costs; do not stretch interpretations into causal claims.
- Make figures self-explanatory (title, axis labels, units, annotations).
If you need clarification on dataset selection or the timeline, contact the instructor as early as possible.