Acknowledged. I will respond as a data analysis specialist with precise, context-aware methods across preprocessing, statistical analysis, visualization, and interpretation. To proceed efficiently and maintain accuracy, please share your task or dataset along with the following details:
Required inputs
- Objective and key questions: What decisions will this analysis inform? Any hypotheses to test?
- Unit of analysis and granularity: Row meaning (e.g., transaction, user-day), time coverage, sampling frame.
- Target variable(s) and key features: Definitions, data dictionary, expected relationships.
- Data sources and joins: Primary keys, foreign keys, expected constraints, known quality issues.
- Experiment/causal design (if applicable): Randomization, assignment, exposure, inclusion/exclusion criteria.
- Metrics/KPIs: Business metrics to optimize or monitor; success thresholds.
- Context constraints: Regulatory, operational, domain-specific rules.
- Preferred environment: Python/R/SQL; whether you can share sample data (schema + 10–50 rows).
Standard analysis workflow
1. Problem framing
- Define hypotheses, metrics, unit of analysis, cohorts/segments, and time windows.
- Identify confounders and potential sources of bias.
2. Data audit and preprocessing
- Validate schema, types, uniqueness, referential integrity, and time coverage.
- Handle missingness (MCAR/MAR/MNAR) using appropriate strategies: listwise deletion (if MCAR and small), simple imputation (median/mode), multiple imputation (MICE), or model-based imputation; document the rationale.
- Detect and treat outliers using robust statistics (IQR/median absolute deviation) or domain rules; consider winsorization and robust models (see the sketch after this list).
- Address duplicates, inconsistent labels, and timezone issues; standardize units and encodings.
- Prevent target leakage by ensuring features are available at prediction time.
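A minimal pandas sketch of the missingness audit and IQR-based outlier check described above (the `amount` column, values, and thresholds are hypothetical placeholders):

```python
import numpy as np
import pandas as pd

def audit_missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and rates, sorted by severity."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(df),
    }).sort_values("pct_missing", ascending=False)

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data with a hypothetical 'amount' column.
df = pd.DataFrame({"amount": [10, 12, 11, 9, 13, 250, np.nan, 10]})
print(audit_missingness(df))

amounts = df["amount"].dropna()
print(amounts[iqr_outlier_mask(amounts)])  # the 250 value is flagged

# Simple median imputation; under MAR patterns, prefer multiple imputation (MICE).
df["amount"] = df["amount"].fillna(df["amount"].median())
```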
3. Exploratory data analysis (EDA)
- Univariate: distributions, tails, missingness maps.
- Bivariate/multivariate: correlations (Pearson/Spearman), mutual information, pivoted group summaries, pairwise plots.
- Temporal: trend/seasonality, stationarity diagnostics (ADF/KPSS), lag features, cohort retention curves (see the sketch after this list).
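A compact sketch of two of these checks (a rank-based Spearman correlation and an ADF stationarity test) on synthetic data; statsmodels is assumed to be available:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 0.6 * df["x"] + rng.normal(scale=0.5, size=n)

# Spearman correlation is robust to monotone nonlinearity and outliers.
print(df.corr(method="spearman"))

# ADF test: the null hypothesis is a unit root (non-stationary series).
random_walk = np.cumsum(rng.normal(size=n))
adf_stat, p_value, *_ = adfuller(random_walk)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
```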
4. Statistical inference
- Variable-type and design-appropriate tests:
- Two-group numeric: t-test (if normal, equal variances) or Welch’s t-test; Mann–Whitney U if non-normal.
- Multi-group numeric: ANOVA or Kruskal–Wallis; post-hoc Tukey/Holm.
- Categorical associations: chi-square or Fisher’s exact (small counts).
- Correlation: Pearson (linear, normal), Spearman (monotonic), partial correlations controlling for confounders.
- Regression: linear/logistic/Poisson/negative binomial; regularization (L1/L2); robust SEs; check assumptions (linearity, homoscedasticity, multicollinearity, residual diagnostics).
- Multiple testing control via FDR (Benjamini–Hochberg) or Bonferroni where appropriate.
- Report effect sizes and confidence intervals, not just p-values (illustrated in the sketch below).
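To illustrate the two-group case, a minimal sketch of Welch's t-test with a Cohen's d effect size and Benjamini–Hochberg correction over a hypothetical family of p-values (SciPy and statsmodels assumed):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treated = rng.normal(loc=10.4, scale=2.5, size=480)

# Welch's t-test does not assume equal variances.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# Cohen's d with a simple pooled standard deviation as the effect size.
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")

# Benjamini-Hochberg FDR control across a (hypothetical) family of tests.
p_values = [p_value, 0.003, 0.20, 0.04]
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(np.round(p_values, 4), np.round(p_adj, 4), reject)))
```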
5. Modeling (if predictive or forecasting)
- Baselines: simple mean (regression) or class-prevalence/plain logistic baseline (classification); naive or seasonal naive for time series.
- Feature engineering: domain-driven features, interactions, lags/rolling stats; avoid leakage.
- Models:
- Classification: logistic regression, tree-based (RF/GBM), calibrated probabilities (Platt/Isotonic).
- Regression: linear/regularized, tree-based (GBM/XGBoost/LightGBM).
- Time series: ARIMA/SARIMA, ETS; for complex seasonality consider TBATS; cross-validated ML with time-aware splits.
- Panel/causal: fixed/random effects, difference-in-differences, synthetic controls; check parallel trends.
- Evaluation:
- Classification: ROC-AUC, PR-AUC (for imbalance), log-loss, precision/recall/F1, calibration (reliability curves), decision curves (see the sketch after this list).
- Regression: MAE, RMSE, MAPE/sMAPE (mind zero-valued targets).
- Forecasting: sMAPE, MASE, rolling-origin validation.
- Interpretability: SHAP/feature importance, partial dependence; stability across folds.
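As a sketch of the baseline-versus-model comparison with calibrated probabilities and the classification metrics listed above (synthetic data; scikit-learn assumed):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic classification problem (about 10% positives).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Simple baseline: regularized logistic regression.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tree-based model with isotonic calibration of predicted probabilities.
gbm = CalibratedClassifierCV(HistGradientBoostingClassifier(random_state=0),
                             method="isotonic", cv=5).fit(X_train, y_train)

for name, model in [("logistic baseline", baseline), ("calibrated GBM", gbm)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.3f}, "
          f"Brier = {brier_score_loss(y_test, proba):.3f}")
```

For time series, the same comparison would use time-aware splits (e.g., scikit-learn's TimeSeriesSplit) rather than a random train/test split, to avoid leakage from the future.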
6. Visualization
- Choose charts aligned with data type and message: distributions (histograms/ECDF/box/violin), relationships (scatter with CI/binned heatmaps), time series (decomposition, anomaly overlays), categorical comparisons (bar/point-range with CIs).
- Always include axis units, uncertainty bands where relevant, and clear labeling of cohorts/time windows (see the example below).
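For example, a minimal matplotlib sketch of a daily metric with a 95% confidence band and labeled axes (synthetic data; the metric and units are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
days = np.arange(60)
# Hypothetical daily metric: slow upward trend plus noise, 200 replicate series.
samples = 50 + 0.2 * days + rng.normal(scale=3, size=(200, days.size))
mean = samples.mean(axis=0)
ci = 1.96 * samples.std(axis=0, ddof=1) / np.sqrt(samples.shape[0])

fig, ax = plt.subplots(figsize=(7, 3))
ax.plot(days, mean, label="daily mean")
ax.fill_between(days, mean - ci, mean + ci, alpha=0.3, label="95% CI")
ax.set_xlabel("Day in observation window")
ax.set_ylabel("Metric value (units)")
ax.legend()
fig.tight_layout()
plt.show()
```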
7. Results interpretation and recommendations
- Summarize key findings with quantified effects and uncertainty.
- Identify limitations, assumptions, sensitivity analyses, and robustness checks.
- Translate findings into actionable steps and expected impact on KPIs.
Quality and bias considerations
- Confounding: Include known covariates; stratify or adjust via regression/propensity methods.
- Selection bias: Verify sampling frame; compare sample vs population.
- Measurement error: Assess reliability of key variables; conduct sensitivity analyses.
- Data drift: For ongoing models, monitor feature distributions and performance over time (a drift-metric sketch follows this list).
- Reproducibility: Version data, code, parameters; provide seeds and environment details.
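One common way to quantify feature drift is the Population Stability Index (PSI); below is a minimal sketch, with illustrative binning and a conventional (not universal) alert threshold noted in a comment:

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a reference (e.g., training) sample and a current sample of one feature."""
    # Bin edges come from the reference distribution (deciles by default).
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid division by zero and log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
shifted = rng.normal(0.3, 1.2, 10_000)    # production values with drift
psi = population_stability_index(baseline, shifted)
print(f"PSI = {psi:.3f}")  # values above ~0.25 are often treated as significant drift
```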
Deliverables (tailored to your task)
- Cleaned dataset summary and data quality report.
- EDA visuals and statistical test results with effect sizes and CIs.
- Model artifacts (if applicable): training pipeline, metrics, interpretability outputs.
- Executive summary: findings, limitations, and recommended actions.
- Reproducible notebook/script with instructions.
To begin, please provide:
- The exact question(s) you want answered and the decision context.
- Schema or sample rows, and a brief data dictionary.
- Time range, granularity, and any inclusion/exclusion rules.
- Target variable definition (if modeling) and key KPIs.
- Any known constraints, experimental setup, or suspected issues.
- Preferred tools (Python/R/SQL) and any formatting requirements for outputs.
Once I have these, I will propose a concise analysis plan and proceed with rigorous, verifiable results.