课程名称:数据科学导论(16周)/ Course Title: Introduction to Data Science (16 Weeks)
- 课程概览 / Course Overview
- 课程定位:面向初学者的基础课程,覆盖数据科学全流程:获取与清洗、探索与可视化、统计推断、机器学习入门、结果沟通与复现。/ Scope: Foundational course covering the end-to-end data science lifecycle: data acquisition/cleaning, EDA/visualization, statistical inference, introductory machine learning, communication and reproducibility.
- 教学形式:讲授 + 实操实验课 + 项目。/ Modality: Lectures + hands-on labs + project.
- 接触时数:每周3小时(2小时讲授 + 1小时实验/讨论),16周。/ Contact hours: 3 per week (2h lecture + 1h lab/discussion), 16 weeks.
- 适用对象与先修:具备基础编程(任一语言)与高中层次代数/函数知识;不要求先修统计学。/ Audience & prerequisites: Basic programming (any language) and high-school algebra/functions; no prior statistics required.
- 教学语言与提交:代码与报告可用中文或英文;建议英文代码注释以便通用复现。/ Language & submissions: Chinese or English; recommend English code comments for portability.
- 课程学习成果(CLOs)/ Course Learning Outcomes
完成课程后,学生能够 / Upon completion, students will be able to:
- 描述数据科学工作流与常见角色、任务与伦理边界 / Describe DS workflow, roles, tasks, and ethical boundaries.
- 使用Python、NumPy、pandas进行数据摄取、清洗、变换与合并 / Use Python, NumPy, pandas for ingestion, cleaning, transformation, merging.
- 进行探索性数据分析并构建有效可视化 / Perform EDA and create effective visualizations.
- 以SQL查询关系型数据并进行分组聚合与连接 / Query relational data with SQL including grouping and joins.
- 应用概率与统计推断(估计、区间、假设检验、A/B测试) / Apply probability and inference (estimation, CIs, hypothesis tests, A/B testing).
- 训练与评估基础监督学习模型(回归与分类),并执行交叉验证与调参 / Train and evaluate baseline supervised models (regression/classification) with CV and tuning.
- 应用聚类与PCA等无监督方法进行模式发现与降维 / Use clustering and PCA for pattern discovery and dimensionality reduction.
- 构建可复现的分析工作流(Git、环境管理、流水线)并清晰沟通结果 / Build reproducible workflows (Git, environments, pipelines) and communicate findings.
- 教学材料与工具 / Materials and Tools
- 编程环境 / Programming: Python 3.10+;JupyterLab;Conda或venv。/ Python 3.10+; JupyterLab; Conda or venv.
- 主要库 / Core libraries: NumPy, pandas, matplotlib, seaborn, scikit-learn; statsmodels(可选)。/ Optional: statsmodels.
- 数据库 / Database: SQLite(必备); PostgreSQL(可选)。/ SQLite required; PostgreSQL optional.
- 版本控制 / Version control: Git/GitHub或等效平台。/ Git/GitHub or equivalent.
- 文档与报告 / Documentation: Jupyter Notebook/Quarto;Streamlit(可选)用于展示。/ Streamlit optional for demos.
- 推荐教材与参考 / Recommended references:
- Wes McKinney, Python for Data Analysis, 3rd ed., 2022.
- Gareth James et al., An Introduction to Statistical Learning with Applications in Python, 2023.
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 3rd ed., 2022.
- Peter Bruce, Andrew Bruce, Peter Gedeck, Practical Statistics for Data Scientists, 2nd ed., 2020.
- 官方文档:pandas、scikit-learn、seaborn、SQLite。/ Official docs for pandas, scikit-learn, seaborn, SQLite.
- 评估组成与权重 / Assessment and Weights
- 课堂与实验参与 Participation/Labs: 10%
- 小测(4次) Quizzes (4): 10%
- 编程作业(6次) Programming Assignments (6): 20%
- 期中考试 Midterm Exam: 20%
- 课程项目 Capstone Project(含提案、里程碑、报告与展示): 30%
- 同行评审 Peer Review: 10%
说明:强调学术诚信与可追溯性;代码需可运行、结果可复现。/ Note: Emphasis on academic integrity and reproducibility; code must run and results be reproducible.
- 每周进度与任务 / Weekly Schedule and Tasks
第1周:课程导入与环境搭建 / Week 1: Orientation and Environment Setup
- 主题/Topic:数据科学生态、工作流、工具栈;Jupyter、Git、环境。/ DS landscape, workflow, stack; Jupyter, Git, environments.
- 目标/Objectives:完成环境安装;能运行Notebook;理解课程期望与项目要求。/ Install environment; run notebooks; understand course and project.
- 实验/Lab:环境检查、Git基础、数据读写。/ Env check, Git basics, I/O.
- 作业/HW:工具链再现报告(含环境说明与样例分析)。/ Reproducible setup report with sample analysis.
第2周:Python与NumPy基础 / Week 2: Python and NumPy Essentials
- 主题:数据结构、控制流、函数、向量化与广播。/ Data structures, control flow, functions, vectorization and broadcasting.
- 目标:编写可读、可测试函数;使用NumPy进行高效计算。/ Write testable functions; use NumPy efficiently.
- 实验:数值计算与基准比较。/ Numeric computing and benchmarking.
- 作业:实现数据清洗实用函数库。/ Implement utility functions for cleaning.
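The Week 2 topics of vectorization and broadcasting can be illustrated with a short comparison. The following is a minimal sketch, not course-issued code: the matrix `X`, the seed, and the function names `standardize_loop` and `standardize_vectorized` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # hypothetical 1000 x 3 matrix


def standardize_loop(data):
    """Column-wise z-scores computed with explicit Python loops."""
    n_rows, n_cols = data.shape
    out = np.empty_like(data)
    for j in range(n_cols):
        col = data[:, j]
        mean, std = col.mean(), col.std()
        for i in range(n_rows):
            out[i, j] = (col[i] - mean) / std
    return out


def standardize_vectorized(data):
    """Same computation via broadcasting.

    data has shape (1000, 3); the column means and stds have shape (3,),
    so NumPy stretches them across all rows without a Python loop.
    """
    return (data - data.mean(axis=0)) / data.std(axis=0)


Z_loop = standardize_loop(X)
Z_vec = standardize_vectorized(X)
print("results match:", np.allclose(Z_loop, Z_vec))
```

For the benchmarking part of the lab, both functions could be timed with the standard-library `timeit` module; the vectorized version is typically much faster, though the exact ratio depends on the machine.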
第3周:pandas数据整理 / Week 3: Data Wrangling with pandas
- 主题:索引/选择、缺失值、连接/合并、重塑(melt/pivot)。/ Indexing, missing values, joins/merge, reshape.
- 目标:构建整洁数据;记录数据字典。/ Produce tidy data; document a data dictionary.
- 实验:清洗“脏数据”案例。/ Clean a messy dataset.
- 作业:完成数据清洗与整合报告。/ Deliver a cleaning/integration report.
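The Week 3 operations (missing values, merge, melt/pivot) can be chained in a few lines. This is a sketch under stated assumptions: the `sales` and `stores` tables, their column names, and the choice of median imputation are hypothetical and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format table with one duplicate row and one missing value.
sales = pd.DataFrame({
    "store_id": [1, 2, 2, 3],
    "q1": [100.0, 80.0, 80.0, np.nan],
    "q2": [110.0, 95.0, 95.0, 60.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["North", "South", "South"],
})

# Drop exact duplicate rows, then fill the missing q1 with the column median.
clean = (
    sales.drop_duplicates()
         .assign(q1=lambda d: d["q1"].fillna(d["q1"].median()))
)

# Left join; validate raises if store_id is not unique in the lookup table.
merged = clean.merge(stores, on="store_id", how="left", validate="many_to_one")

# Wide to long (tidy): one row per store and quarter.
tidy = merged.melt(
    id_vars=["store_id", "region"],
    value_vars=["q1", "q2"],
    var_name="quarter",
    value_name="revenue",
)

# Long back to wide, to check that the reshape is reversible.
wide_again = tidy.pivot(index="store_id", columns="quarter", values="revenue")
print(tidy)
```

Median imputation is only one option; whichever rule is used belongs in the data dictionary so the cleaning step stays traceable.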
第4周:探索性数据分析与可视化 / Week 4: EDA and Visualization
- 主题:分布、关系、异常点、图形原则。/ Distributions, relationships, outliers, viz principles.
- 目标:形成可复现EDA;选择恰当图表。/ Reproducible EDA; appropriate chart selection.
- 实验:用seaborn制作EDA图谱。/ EDA gallery with seaborn.
- 作业:EDA Notebook与洞见摘要。/ EDA notebook and insight brief.
第5周:数据伦理与复现 / Week 5: Data Ethics and Reproducibility
- 主题:隐私、偏见、公平、同意;随机种子、日志、依赖锁定。/ Privacy, bias, fairness, consent; seeds, logging, dependency locking.
- 目标:完成伦理清单;建立可复现项目模板。/ Ethics checklist; reproducible project template.
- 实验:偏见风险初筛;requirements.txt生成。/ Bias pre-mortem; requirements.txt.
- 项目里程碑/Milestone:选题与数据源确认 + 伦理清单草案。/ Topic and data source + ethics checklist draft.
第6周:概率基础与模拟 / Week 6: Probability Foundations and Simulation
- 主题:随机变量、期望/方差、常见分布、LLN/CLT(概念)、蒙特卡罗。/ RVs, moments, common distributions, LLN/CLT (conceptual), Monte Carlo.
- 目标:用模拟验证概率结论;选择合适分布建模。/ Validate via simulation; choose distributions.
- 实验:置信区间覆盖率模拟。/ CI coverage simulation.
- 作业:概率建模与模拟报告。/ Probability modeling report.
- 项目里程碑:项目提案1页(问题、数据、指标、风险)。/ 1-page proposal (problem, data, metrics, risks).
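The Week 6 lab on confidence-interval coverage can be sketched with the standard library alone. The example below assumes normally distributed data with a known standard deviation, so a z-interval applies; the function name `ci_coverage` and all parameter values are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean


def ci_coverage(n_trials=2000, n=30, mu=10.0, sigma=2.0, level=0.95, seed=42):
    """Fraction of z-intervals (known sigma) that contain the true mean mu."""
    rng = random.Random(seed)                   # local, seeded generator
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 when level=0.95
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / n_trials


coverage = ci_coverage()
print(f"empirical coverage: {coverage:.3f}")
```

The empirical coverage should land near the nominal level. The gap between the two shrinks as `n_trials` grows, which ties the simulation back to the law of large numbers.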
第7周:统计推断与实验设计 / Week 7: Statistical Inference and Experimentation
- 主题:抽样、点估计与区间、t检验/卡方、p值与效应量、检验力、A/B测试。/ Sampling, point estimates and CIs, t/chi-square tests, p-values, effect size, power, A/B testing.
- 目标:构建与解释假设检验;避免常见谬误。/ Build and interpret tests; avoid pitfalls.
- 实验:A/B功效分析与结果解读。/ A/B power analysis and interpretation.
- 作业:实验设计备忘录。/ Experiment design memo.
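For the Week 7 A/B power analysis, a common normal-approximation formula gives the per-group sample size for comparing two proportions. The sketch below assumes a two-sided z-test with equal group sizes and unpooled variance; the function name and the conversion rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2,
    a normal approximation that assumes equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical baseline conversion of 10% with two candidate lifts.
n_small_lift = sample_size_two_proportions(0.10, 0.12)
n_large_lift = sample_size_two_proportions(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

Required sample size scales with the inverse square of the effect, so halving the detectable lift roughly quadruples the traffic needed. An experiment design memo can make that trade-off explicit.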
第8周:SQL与期中 / Week 8: SQL + Midterm
- 主题:SELECT、WHERE、GROUP BY、HAVING、JOIN、子查询;数据摄取与API概览。/ SELECT, WHERE, GROUP BY, HAVING, JOINs, subqueries; ingestion and APIs overview.
- 目标:从关系数据库提取特征数据集。/ Extract feature-ready datasets from RDBMS.
- 实验:SQLite业务查询实战。/ SQLite query lab.
- 评估/Assessment:期中考试(覆盖第1–7周)。/ Midterm exam (Weeks 1–7).
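The Week 8 SQL clauses can be exercised against an in-memory SQLite database from Python's built-in `sqlite3` module. The schema, rows, and thresholds below are invented for illustration and are not the lab dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing written to disk
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES
    (1, 'Ana', 'North'), (2, 'Bo', 'South'), (3, 'Chen', 'South');
INSERT INTO orders VALUES
    (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0), (4, 3, 200.0), (5, 3, 40.0);
""")

# WHERE filters rows before grouping; HAVING filters groups after aggregation.
query = """
SELECT c.region, COUNT(o.id) AS n_orders, SUM(o.amount) AS revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
WHERE o.amount >= ?
GROUP BY c.region
HAVING SUM(o.amount) > ?
ORDER BY revenue DESC;
"""
# Parameters are passed separately rather than formatted into the SQL string.
rows = conn.execute(query, (30.0, 150.0)).fetchall()
conn.close()
print(rows)  # [('South', 2, 240.0)]
```

The 20.0 order is removed by `WHERE`, and the North region (revenue 120.0) is removed by `HAVING`, which shows the two filters acting at different stages of the query.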
第9周:监督学习I—回归 / Week 9: Supervised Learning I—Regression
- 主题:一元/多元线性回归、假设与诊断、正则化(Ridge/Lasso)入门。/ Linear/multiple regression, assumptions/diagnostics, intro to Ridge/Lasso.
- 目标:拟合与解读回归;处理多重共线性与过拟合。/ Fit/interpret; handle multicollinearity and overfitting.
- 实验:房价或类似数据回归。/ Housing-like regression lab.
- 作业:回归建模与残差分析。/ Regression + residual analysis.
- 项目里程碑:EDA初稿提交。/ EDA draft due.
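The Week 9 link between multicollinearity and regularization can be seen in a small synthetic example. This sketch uses the closed-form ridge solution in NumPy rather than a library estimator. The data-generating process, the penalty value `lam=10.0`, and the helper `ridge_fit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1: multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 2.0 * x2 + rng.normal(scale=1.0, size=n)

# Center features and target so no intercept term is needed
# and the penalty never applies to an intercept.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge_fit(features, target, lam):
    """Closed-form ridge: solve (X'X + lam*I) beta = X'y; lam=0 gives OLS."""
    p = features.shape[1]
    return np.linalg.solve(features.T @ features + lam * np.eye(p),
                           features.T @ target)


beta_ols = ridge_fit(Xc, yc, lam=0.0)
beta_ridge = ridge_fit(Xc, yc, lam=10.0)
residuals = yc - Xc @ beta_ols  # basis for residual diagnostics

print("OLS coefficients:  ", beta_ols)
print("ridge coefficients:", beta_ridge)
```

With two nearly identical predictors, the individual OLS coefficients are poorly determined even though their sum is estimated well. The ridge penalty shrinks the coefficient vector and stabilizes the split between them.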
第10周:监督学习II—分类 / Week 10: Supervised Learning II—Classification
- 主题:逻辑回归、kNN;混淆矩阵、精确率/召回率、ROC-AUC;类别不平衡。/ Logistic regression, kNN; confusion matrix, precision/recall, ROC-AUC; class imbalance.
- 目标:选择恰当评估指标;应对不平衡。/ Choose metrics; address imbalance.
- 实验:欺诈/流失分类任务。/ Fraud/churn classification.
- 作业:分类器比较与报告。/ Classifier comparison report.
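The Week 10 point about metric choice under class imbalance can be made with a hand-computed confusion matrix. The labels below are fabricated for illustration (5 positives out of 100), and the helper names are hypothetical; in the lab the same quantities would normally come from scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative imbalanced labels: 5 fraud cases among 100 transactions.
y_true = [1] * 5 + [0] * 95
pred_always_negative = [0] * 100                # never flags fraud
pred_model = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89  # 4 hits, 1 miss, 6 false alarms

metrics_a = summarize(y_true, pred_always_negative)
metrics_b = summarize(y_true, pred_model)
print(metrics_a)
print(metrics_b)
```

The do-nothing classifier scores higher on accuracy (0.95 versus 0.93) while catching no fraud at all, which is why precision and recall are the more informative metrics here.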
第11周:树模型与集成 / Week 11: Trees and Ensembles
- 主题:决策树、随机森林、梯度提升;变量重要性与可解释性概念。/ Decision trees, Random Forest, Gradient Boosting; feature importance and interpretability concepts.
- 目标:训练稳健基线;理解偏差-方差权衡。/ Train robust baselines; bias-variance trade-off.
- 实验:随机森林调参。/ Random Forest tuning.
第12周:模型评估与流水线 / Week 12: Model Evaluation and Pipelines
- 主题:训练/验证/测试分割、交叉验证、数据泄漏、预处理(缩放/编码)、管道、超参搜索。/ Train/val/test splits, CV, leakage, preprocessing (scaling/encoding), pipelines, hyperparameter search.
- 目标:构建端到端可复现ML流水线。/ Build end-to-end reproducible ML pipelines.
- 实验:scikit-learn Pipeline + GridSearchCV。/ Pipelines + GridSearchCV.
- 项目里程碑:建模计划与原型。/ Modeling plan + prototype.
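The Week 12 combination of a scikit-learn `Pipeline` with `GridSearchCV` guards against leakage because the scaler is refit inside each cross-validation fold. The following is a minimal sketch on synthetic data; the dataset parameters, the logistic-regression model, and the grid of `C` values are assumptions, not the lab specification.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with a fixed seed.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_clusters_per_class=1, random_state=0,
)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # fit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# score() reuses the scoring passed above, so this is ROC-AUC on the test set.
test_auc = search.score(X_test, y_test)
print(search.best_params_, round(test_auc, 3))
```

Scaling the full dataset before splitting would let test-set statistics influence training. Keeping preprocessing inside the pipeline avoids that by construction.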
第13周:无监督学习 / Week 13: Unsupervised Learning
- 主题:k-means、层次聚类、轮廓系数;PCA;高维可视化注意事项(t-SNE/UMAP概念)。/ k-means, hierarchical clustering, silhouette; PCA; high-dim viz caveats (t-SNE/UMAP concept).
- 目标:执行聚类并评估可分性;使用PCA降维与解释方差。/ Run clustering and evaluate separability; use PCA for dimensionality reduction and interpret explained variance.
- 实验:客户分群或文本TF-IDF聚类(可选)。/ Customer segmentation or TF-IDF clustering (optional).
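The Week 13 workflow of choosing k by silhouette score and then inspecting PCA explained variance can be sketched as follows. The data are synthetic blobs with three deliberately well-separated centers, an assumption chosen so the expected answer is known in advance; real customer data will rarely be this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three hypothetical cluster centers in 4 dimensions, far apart on purpose.
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0, 10.0],
    [-10.0, 10.0, -10.0, 10.0],
])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Compare candidate k values by mean silhouette (higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
best_k = max(scores, key=scores.get)

# Project to 2 components and check how much variance they retain.
pca = PCA(n_components=2).fit(X_scaled)
explained = pca.explained_variance_ratio_

print("silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
print("best k:", best_k, "| variance kept by 2 PCs:", round(explained.sum(), 3))
```

Silhouette is one separability measure among several. On real data it is worth checking whether the chosen clusters are also interpretable, not only well scored.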
第14周:时间序列与特征工程 / Week 14: Time Series and Feature Engineering
- 主题:日期时间处理、重采样、滚动窗口、平稳性直觉、简单基线预测;特征工程与编码策略。/ Datetime handling, resampling, rolling windows, stationarity intuition, baseline forecasting; feature engineering and encoding strategies.
- 目标:构建时间序列基线与特征;避免数据泄漏。/ Build TS baselines and features; avoid leakage.
- 实验:销量或传感器数据基线预测。/ Sales/sensor baseline forecast.
- 作业:特征工程备忘录。/ Feature engineering memo.
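The Week 14 warning about leakage in time series features comes down to using only past values at each time step. This sketch uses a made-up daily series; the column names and window size are illustrative assumptions.

```python
import pandas as pd

# Hypothetical daily sales: 1, 2, ..., 10 starting on a Monday.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
sales = pd.Series(range(1, 11), index=idx, dtype="float64", name="sales")

features = pd.DataFrame({
    "sales": sales,
    # Yesterday's value: also serves as the naive baseline forecast.
    "lag_1": sales.shift(1),
    # shift(1) BEFORE rolling, so the window ends at yesterday and never
    # includes the value being predicted.
    "roll_mean_3": sales.shift(1).rolling(window=3).mean(),
})
features["day_of_week"] = features.index.dayofweek  # Monday = 0

# Mean absolute error of the naive forecast (NaN in the first row is skipped).
naive_mae = (features["sales"] - features["lag_1"]).abs().mean()

# Resampling: aggregate the daily series into 2-day totals.
two_day_totals = sales.resample("2D").sum()

# Time-ordered split: train on the past, test on the most recent rows.
train, test = features.iloc[:8], features.iloc[8:]
print(features.head())
```

A random train/test split would mix future rows into training. Splitting by time keeps the evaluation aligned with how a forecast would actually be used.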
第15周:沟通、可视化叙事与复现交付 / Week 15: Communication, Storytelling, and Reproducible Delivery
- 主题:受众导向的信息架构;图表选择;不确定性表达;报告与仪表板;环境与数据卡。/ Audience-focused narratives; chart choices; uncertainty communication; reports/dashboards; environment and data cards.
- 目标:完成可复现报告与演示材料。/ Finalize reproducible report and presentation.
- 实验:用Quarto/Jupyter或Streamlit打包成果。/ Package deliverables with Quarto/Jupyter or Streamlit.
- 项目里程碑:项目草稿与同伴演练。/ Draft and peer rehearsal.
第16周:项目展示与课程总结 / Week 16: Final Presentations and Wrap-up
- 评估:项目口头展示、代码审阅、最终报告提交。/ Assessment: Oral presentations, code review, final report submission.
- 总结:课程回顾、扩展学习路径建议。/ Wrap-up and pathways for further study.
- 课程项目要求 / Capstone Project Requirements
- 交付物/Deliverables:1页提案、伦理清单、EDA报告、建模原型、最终报告(≤10页)与演示(8–10分钟)、可运行代码与环境文件。/ 1-page proposal, ethics checklist, EDA report, modeling prototype, final report (≤10 pages) and 8–10 min presentation, runnable code with environment file.
- 评分维度/Rubric dimensions:问题定义与价值(20%)、数据质量与处理(15%)、方法与严谨性(30%)、结果与不确定性沟通(20%)、复现与代码质量(15%)。/ Problem/value (20%), data quality/wrangling (15%), methods/rigor (30%), communication of results/uncertainty (20%), reproducibility/code (15%).
- 数据来源/Data sources:公开数据集(如UCI、政府开放数据等)或教师批准的自选数据;遵循使用条款与隐私规范。/ Public datasets (e.g., UCI, open government) or approved data; comply with ToS and privacy.
- 教学与学习策略 / Instructional and Learning Strategies
- 翻转+实作:课前微视频与阅读;课堂聚焦练习与代码演示。/ Flipped micro-lectures + in-class coding.
- 渐进复杂度:从清洗与EDA过渡到建模与评估。/ Progressive complexity from wrangling/EDA to modeling/evaluation.
- 形成性评估:每周短测与即时反馈。/ Frequent formative quizzes and feedback.
- 同伴互评:提高沟通质量与代码可读性。/ Peer review to improve communication and code readability.
- 学术诚信与可访问性 / Academic Integrity and Accessibility
- 学术规范:独立完成个人作业;允许讨论但需署名引用;禁止抄袭与未经授权的模型/代码共享。/ Individual assignments; discussion allowed with attribution; no plagiarism or unauthorized code/model sharing.
- 可访问性:提供可机读资料、字幕与大纲;为必要的学习便利提供支持。/ Accessible materials (machine-readable, captions); reasonable accommodations available.
- 建议每周阅读(与主题对应)/ Suggested Weekly Readings (aligned to topics)
- W1–W2: Python for Data Analysis (PfDA) Ch. 1–4.
- W3: PfDA Ch. 5–8(数据整理与重塑)。/ Wrangling and reshape.
- W4: seaborn/matplotlib 官方指南;数据可视化最佳实践文章。/ Official guides; viz best practices.
- W5: 负责任AI与数据伦理入门文献;Git与可复现研究实践。/ Responsible AI/data ethics; reproducible research practices.
- W6–W7: Practical Statistics for Data Scientists(概率/推断章节)。/ Probability and inference chapters.
- W8: SQL官方教程或SQL入门教材。/ SQL tutorials.
- W9–W12: ISL with Python(回归/分类/模型评估章节);Hands-On ML(模型评估与管道)。/ ISLP regression/classification/evaluation; Hands-On ML pipelines.
- W13–W14: ISLP/Hands-On ML(无监督与降维;时间序列基础参考资料)。/ Unsupervised and dimensionality reduction; TS basics references.
- W15: 数据叙事与技术写作指南;Streamlit/Quarto文档。/ Data storytelling and technical writing; Streamlit/Quarto docs.
- 评估基准与通过标准 / Performance Benchmarks
- 通过标准:总评≥60%、项目≥60%,且期中考试≥50%。/ Passing requires all three: overall ≥60%, project ≥60%, and midterm ≥50%.
- 代码质量:通过自动化测试/样例数据运行;随机种子固定;读我文档与依赖文件完整。/ Code passes automated tests and runs on sample data; random seeds fixed; README and dependency files complete.
备注 / Notes
- 所列教材与工具为业界主流与稳定版本,适合入门课程;若版本更新不影响核心API与概念,课程材料将同步微调。/ Listed texts and tools are mainstream and stable for introductory courses; minor updates will be accommodated without changing core APIs and concepts.
- 本大纲为基础框架,授课进度可根据学习诊断与班级背景进行微调。/ This outline is a baseline; pacing may be adjusted based on diagnostics and cohort background.