数据可视化脚本生成

0 浏览
0 试用
0 购买
Sep 27, 2025更新

生成基于matplotlib的数据可视化Python脚本,技术性强且结构清晰。

示例2

下面给出一套可直接运行的标准化可视化脚本(仅使用matplotlib+numpy+pandas)。脚本包含统一风格设置,并输出三张图稿:
- 新用户7日留存热力图
- 注册→激活→付费漏斗图
- 渠道转化链路热力图 + 分群(按渠道)转化趋势图

数据未提供时将自动生成示例数据;若提供CSV,将优先读取。脚本内注明各数据文件的标准字段。

代码(保存为plot_growth_analytics.py):
```python
# -*- coding: utf-8 -*-
"""
标准化图稿:新用户7日留存、注册→激活→付费漏斗、渠道转化链路与分群趋势
- 依赖: pandas, numpy, matplotlib
- 数据来源:
  1) retention.csv: 队列留存(按注册日期聚合)
     列: cohort_date, size, d1, d2, d3, d4, d5, d6, d7  (d1~d7为比例 0~1)
  2) funnel.csv: 总体漏斗
     列: stage, count  (stage取值: 注册, 激活, 付费)
  3) channel.csv: 渠道按日数据
     列: date, channel, registered, activated, paid

若CSV文件不存在,自动生成模拟数据。
输出:
  - retention_heatmap.png
  - funnel.png
  - channel_chain_trend.png
"""

import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# ---------------------------
# 全局风格与常量
# ---------------------------
def set_style():
    mpl.rcParams.update({
        "figure.dpi": 120,
        "savefig.dpi": 150,
        "figure.figsize": (12, 7),
        "axes.titlesize": 14,
        "axes.labelsize": 12,
        "axes.grid": True,
        "grid.linestyle": "--",
        "grid.linewidth": 0.5,
        "axes.edgecolor": "#333333",
        "axes.linewidth": 0.8,
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
        "legend.fontsize": 10,
        "legend.frameon": False,
        "font.sans-serif": ["Noto Sans CJK SC", "Microsoft YaHei", "SimHei", "Arial", "DejaVu Sans"],
        "axes.unicode_minus": False,
    })

PALETTE = {
    "brand_blue": "#2F6BFF",
    "brand_green": "#2CB67D",
    "brand_orange": "#FF7F3F",
    "brand_purple": "#7C5CFF",
    "gray": "#7A7A7A",
    "bg": "#F7F8FA",
}

OUT_RET = "retention_heatmap.png"
OUT_FUNNEL = "funnel.png"
OUT_CHANNEL = "channel_chain_trend.png"

FILE_RET = "retention.csv"
FILE_FUNNEL = "funnel.csv"
FILE_CHANNEL = "channel.csv"

np.random.seed(42)


# ---------------------------
# 数据加载或生成
# ---------------------------
def load_or_mock_retention(path: str | Path) -> pd.DataFrame:
    path = Path(path)
    if path.exists():
        df = pd.read_csv(path)
        # 校验列
        required = ["cohort_date", "size"] + [f"d{i}" for i in range(1, 8)]
        missing = [c for c in required if c not in df.columns]
        if missing:
            raise ValueError(f"retention.csv缺少字段: {missing}")
        # 类型处理
        df["cohort_date"] = pd.to_datetime(df["cohort_date"]).dt.date
        # 安全裁剪比例
        for c in [f"d{i}" for i in range(1, 8)]:
            df[c] = pd.to_numeric(df[c], errors="coerce").clip(lower=0, upper=1)
        df["size"] = pd.to_numeric(df["size"], errors="coerce").fillna(0).astype(int)
        df = df.sort_values("cohort_date", ascending=False)
        return df
    # 生成模拟队列(最近10个注册日)
    today = datetime.today().date()
    cohorts = [today - timedelta(days=i) for i in range(9, -1, -1)]
    data = []
    base_sizes = np.random.randint(800, 2000, size=len(cohorts))
    for i, c in enumerate(cohorts):
        size = int(base_sizes[i])
        # 构造一个递减且带噪声的留存曲线
        d = np.maximum(0, 0.42 - 0.05*np.arange(1, 8) + np.random.normal(0, 0.015, 7))
        d = np.clip(d, 0.02, 0.6)
        data.append([c, size, *d])
    df = pd.DataFrame(data, columns=["cohort_date", "size"] + [f"d{i}" for i in range(1, 8)])
    df = df.sort_values("cohort_date", ascending=False)
    return df


def load_or_mock_funnel(path: str | Path) -> pd.DataFrame:
    path = Path(path)
    if path.exists():
        df = pd.read_csv(path)
        required = ["stage", "count"]
        missing = [c for c in required if c not in df.columns]
        if missing:
            raise ValueError(f"funnel.csv缺少字段: {missing}")
        # 强制顺序
        order = ["注册", "激活", "付费"]
        df["stage"] = pd.Categorical(df["stage"], categories=order, ordered=True)
        df = df.sort_values("stage")
        df["count"] = pd.to_numeric(df["count"], errors="coerce").fillna(0).astype(int)
        return df
    # 模拟
    stages = ["注册", "激活", "付费"]
    counts = [50000, 32000, 9000]
    df = pd.DataFrame({"stage": stages, "count": counts})
    df["stage"] = pd.Categorical(df["stage"], categories=stages, ordered=True)
    return df


def load_or_mock_channel(path: str | Path) -> pd.DataFrame:
    path = Path(path)
    if path.exists():
        df = pd.read_csv(path)
        required = ["date", "channel", "registered", "activated", "paid"]
        missing = [c for c in required if c not in df.columns]
        if missing:
            raise ValueError(f"channel.csv缺少字段: {missing}")
        df["date"] = pd.to_datetime(df["date"]).dt.date
        for c in ["registered", "activated", "paid"]:
            df[c] = pd.to_numeric(df[c], errors="coerce").fillna(0).astype(int)
        return df
    # 模拟近8周x 4渠道
    end = datetime.today().date()
    dates = [end - timedelta(days=i) for i in range(55, -1, -1)]  # 56天
    channels = ["渠道A", "渠道B", "渠道C", "渠道D"]
    rows = []
    for d in dates:
        for ch in channels:
            base = {"渠道A": 1200, "渠道B": 800, "渠道C": 500, "渠道D": 300}[ch]
            # 周期性与噪声
            reg = max(0, int(base * (1 + 0.2*np.sin((d.toordinal()%14)/14*2*np.pi)) + np.random.normal(0, base*0.05)))
            act = int(reg * np.clip(0.55 + np.random.normal(0, 0.03), 0.35, 0.85))
            pay = int(act * np.clip(0.28 + np.random.normal(0, 0.02), 0.12, 0.5))
            rows.append([d, ch, reg, act, pay])
    df = pd.DataFrame(rows, columns=["date", "channel", "registered", "activated", "paid"])
    return df


# ---------------------------
# 绘图函数
# ---------------------------
def fmt_pct(x: float) -> str:
    return f"{x*100:.0f}%"


def pick_text_color(bg_rgb):
    # 依据亮度选择黑/白文字
    r, g, b = bg_rgb[:3]
    luminance = 0.2126*r + 0.7152*g + 0.0722*b
    return "black" if luminance > 0.6 else "white"


def plot_retention_heatmap(df_ret: pd.DataFrame, out_path: str = OUT_RET):
    set_style()
    fig, ax = plt.subplots(figsize=(12, 7))
    # 构造矩阵(d0=100%)
    days = ["D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7"]
    mat = np.c_[np.ones(len(df_ret)), df_ret[[f"d{i}" for i in range(1, 8)]].values]
    # 显示
    cmap = plt.get_cmap("YlGnBu")
    im = ax.imshow(mat, aspect="auto", cmap=cmap, vmin=0, vmax=1)
    ax.set_title("新用户7日留存(按注册队列)", pad=12)
    ax.set_xlabel("天数")
    ax.set_ylabel("注册队列(cohort_date)")
    ax.set_xticks(range(len(days)), days)
    ax.set_yticks(range(len(df_ret)), [str(d) for d in df_ret["cohort_date"]])
    # 标注格内数值
    for i in range(mat.shape[0]):
        for j in range(mat.shape[1]):
            val = mat[i, j]
            rgba = cmap(val)
            color = pick_text_color(rgba)
            ax.text(j, i, fmt_pct(val), ha="center", va="center", color=color, fontsize=9)
    # 颜色条
    cbar = fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    cbar.ax.set_ylabel("留存率", rotation=-90, va="bottom")
    # 右侧附加信息:队列规模
    for i, size in enumerate(df_ret["size"].tolist()):
        ax.text(len(days)+0.2, i, f"n={size}", va="center", fontsize=9, color=PALETTE["gray"])
    ax.set_xlim(-0.5, len(days)+1.0)
    ax.set_facecolor(PALETTE["bg"])
    plt.tight_layout()
    plt.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


def plot_funnel(df_funnel: pd.DataFrame, out_path: str = OUT_FUNNEL):
    set_style()
    fig, ax = plt.subplots(figsize=(10, 6))
    stages = df_funnel["stage"].tolist()
    counts = df_funnel["count"].astype(float).tolist()
    max_count = max(counts) if counts else 1.0

    # 计算转化率
    rate2prev = [1.0]
    for i in range(1, len(counts)):
        denom = counts[i-1] if counts[i-1] > 0 else np.nan
        rate2prev.append((counts[i] / denom) if denom else np.nan)
    rate2first = [(c / max_count) if max_count > 0 else np.nan for c in counts]

    # 居中漏斗效果:使用左右留白
    y = np.arange(len(stages))[::-1]  # 顶部是注册
    left = [(max_count - c)/2 for c in counts]
    colors = [PALETTE["brand_blue"], PALETTE["brand_green"], PALETTE["brand_orange"]]

    bars = ax.barh(y=y, width=counts, left=left, color=colors[:len(counts)], edgecolor="none", alpha=0.9)
    ax.set_title("注册→激活→付费 漏斗", pad=12)
    ax.set_yticks(y, stages)
    ax.set_xlabel("人数")
    ax.grid(axis="x", linestyle="--", alpha=0.5)

    # 标注文字:绝对值 + 环比转化率
    for i, b in enumerate(bars):
        c = counts[i]
        ax.text(b.get_x() + b.get_width()/2, b.get_y() + b.get_height()/2,
                f"{int(c):,}\n({fmt_pct(rate2prev[i]) if not np.isnan(rate2prev[i]) else 'NA'})",
                ha="center", va="center", color="white", fontsize=10, weight="bold")

    # 辅助信息:首段基准
    ax.text(0.98, 0.02, f"首段基准={int(max_count):,}", transform=ax.transAxes,
            ha="right", va="bottom", color=PALETTE["gray"])

    ax.set_facecolor(PALETTE["bg"])
    plt.tight_layout()
    plt.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


def plot_channel_chain_and_trend(df_channel: pd.DataFrame, out_path: str = OUT_CHANNEL, topn: int = 4):
    set_style()
    # 聚合与计算转化
    df = df_channel.copy()
    df["date"] = pd.to_datetime(df["date"])
    # 按渠道汇总(用于链路热力图)
    grp_ch = df.groupby("channel", as_index=False)[["registered", "activated", "paid"]].sum()
    grp_ch = grp_ch.sort_values("registered", ascending=False)
    top_channels = grp_ch["channel"].head(topn).tolist()
    grp_ch = grp_ch[grp_ch["channel"].isin(top_channels)].copy()
    grp_ch["reg_to_act"] = grp_ch["activated"] / grp_ch["registered"].replace(0, np.nan)
    grp_ch["reg_to_pay"] = grp_ch["paid"] / grp_ch["registered"].replace(0, np.nan)
    grp_ch["act_to_pay"] = grp_ch["paid"] / grp_ch["activated"].replace(0, np.nan)

    # 趋势(按日,取TopN渠道)
    df_top = df[df["channel"].isin(top_channels)].copy()
    daily = (df_top.groupby(["date", "channel"], as_index=False)
             .agg(registered=("registered", "sum"), paid=("paid", "sum")))
    daily["pay_rate"] = daily["paid"] / daily["registered"].replace(0, np.nan)
    # 按周重采样更平滑(可选)
    # 这里保留日级,若需周级:daily = daily.set_index("date").groupby("channel").resample("W")...

    fig = plt.figure(figsize=(14, 6.5))
    gs = fig.add_gridspec(1, 2, width_ratios=[1, 1.2], wspace=0.2)

    # 左侧:渠道转化链路热力图
    ax1 = fig.add_subplot(gs[0, 0])
    stages = ["注册→激活", "注册→付费", "激活→付费"]
    mat = grp_ch[["reg_to_act", "reg_to_pay", "act_to_pay"]].values
    cmap = plt.get_cmap("Greens")
    im = ax1.imshow(mat, aspect="auto", cmap=cmap, vmin=0, vmax=1)
    ax1.set_title("渠道转化链路(Top渠道)", pad=12)
    ax1.set_xticks(range(len(stages)), stages)
    ax1.set_yticks(range(len(grp_ch)), grp_ch["channel"].tolist())

    # 标注格内数值
    for i in range(mat.shape[0]):
        for j in range(mat.shape[1]):
            val = mat[i, j]
            rgba = cmap(0 if np.isnan(val) else val)
            color = pick_text_color(rgba)
            text = "NA" if np.isnan(val) else fmt_pct(val)
            ax1.text(j, i, text, ha="center", va="center", color=color, fontsize=10)

    cbar = fig.colorbar(im, ax=ax1, fraction=0.046, pad=0.04)
    cbar.ax.set_ylabel("转化率", rotation=-90, va="bottom")
    ax1.set_facecolor(PALETTE["bg"])

    # 右侧:分群(按渠道)转化趋势
    ax2 = fig.add_subplot(gs[0, 1])
    ax2.set_title("按渠道的付费转化率趋势", pad=12)
    colors = [PALETTE["brand_blue"], PALETTE["brand_green"], PALETTE["brand_orange"], PALETTE["brand_purple"]]
    for idx, ch in enumerate(top_channels):
        sub = daily[daily["channel"] == ch].sort_values("date")
        ax2.plot(sub["date"], sub["pay_rate"], label=ch, color=colors[idx % len(colors)], linewidth=2)

    ax2.set_xlabel("日期")
    ax2.set_ylabel("付费转化率")
    ax2.yaxis.set_major_formatter(mpl.ticker.PercentFormatter(1.0))
    ax2.grid(True, axis="both", linestyle="--", alpha=0.5)
    ax2.legend(ncol=2, loc="upper left")
    ax2.set_facecolor(PALETTE["bg"])
    plt.tight_layout()
    plt.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


# ---------------------------
# 主流程
# ---------------------------
def main():
    # 加载或生成数据
    df_ret = load_or_mock_retention(FILE_RET)
    df_fun = load_or_mock_funnel(FILE_FUNNEL)
    df_ch = load_or_mock_channel(FILE_CHANNEL)

    # 绘图
    plot_retention_heatmap(df_ret, OUT_RET)
    plot_funnel(df_fun, OUT_FUNNEL)
    plot_channel_chain_and_trend(df_ch, OUT_CHANNEL)

    print(f"输出完成:\n- {OUT_RET}\n- {OUT_FUNNEL}\n- {OUT_CHANNEL}")


if __name__ == "__main__":
    main()
```

使用说明与数据规范
- 运行
  - python3 plot_growth_analytics.py
  - 若同目录存在retention.csv、funnel.csv、channel.csv,将优先读取;否则自动生成示例数据。
- retention.csv
  - 字段: cohort_date(日期), size(队列规模), d1~d7(1~7日留存率,0~1)
  - 示例: 2025-08-01, 1500, 0.36, 0.31, 0.28, 0.24, 0.21, 0.19, 0.17
- funnel.csv
  - 字段: stage(注册|激活|付费), count(人数)
- channel.csv
  - 字段: date(日期), channel(渠道名), registered, activated, paid
  - 按日汇总的绝对人数。脚本计算注册→激活、注册→付费、激活→付费的转化率,并绘制按渠道的付费转化率趋势。

可视化标准化要点(已在脚本中实现)
- 统一风格:字体、配色、网格、坐标轴线宽、DPI、留白。
- 颜色规范:主色(蓝、绿、橙、紫)用于不同指标或渠道,背景浅灰以提升对比。
- 数值标注:热力图单元格和漏斗条内均标注百分比;自动选择黑/白文字以保证对比度。
- 比例表现:所有转化率均以0~1表示,图中渲染为百分比。
- 布局:标题、轴标签、图例位置与字号统一,图像导出为PNG,适合嵌入报告或看板。

数据挖掘建议(与上述图稿联动)
- 留存:优先监控近几期队列的D1与D3陡降点,结合产品事件日志做分段对比(如首日关键动作完成率)。
- 漏斗:在漏斗各层插入“归因切片”(如渠道、端、版本)进行分层输出,定位瓶颈层的异质性。
- 渠道:以注册→付费为主目标,注册→激活为中间目标,热力图发现弱项渠道后,在趋势图验证其是否结构性偏低或阶段性波动。结合投放日历与预算做前后窗口对照检验。

示例3

Below is a self-contained, reproducible Python script that uses matplotlib to visualize experimental and control observations. It includes data loading, preprocessing (NaN handling and robust winsorization), statistical estimation (bootstrap confidence intervals), and three plots: distribution, mean with error intervals, and scatter correlation with a bootstrap-based regression band.

Usage:
- If you have a CSV, provide either:
  - Wide format: columns "control" and "experiment" (paired).
  - Long format: columns "group" (values: "control" or "experiment") and "value". Optional "subject_id" for pairing.
- If no input is provided, the script generates synthetic reproducible data.
- Example:
  - python visualize_groups.py --input your_data.csv --outdir figures
  - python visualize_groups.py  (uses synthetic data)

Required: Python 3.9+, numpy, pandas, matplotlib. Optional: scipy for KDE. If SciPy is not available, the script falls back to histograms without KDE.

Script (save as visualize_groups.py):

import argparse
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Optional SciPy
try:
    from scipy.stats import gaussian_kde
    SCIPY_AVAILABLE = True
except Exception:
    SCIPY_AVAILABLE = False

def set_seed(seed: int = 42):
    np.random.seed(seed)

def winsorize_by_mad(x: np.ndarray, zmax: float = 3.0) -> np.ndarray:
    """
    Robust winsorization based on MAD. Caps values beyond zmax * MAD from the median.
    """
    x = np.asarray(x).astype(float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    if mad == 0:
        return x  # No variability; nothing to winsorize
    z = (x - median) / (1.4826 * mad)  # 1.4826 scales MAD to be comparable to std under normality
    x_w = x.copy()
    x_w[z > zmax] = median + zmax * 1.4826 * mad
    x_w[z < -zmax] = median - zmax * 1.4826 * mad
    return x_w

def bootstrap_mean_ci(x: np.ndarray, n_boot: int = 5000, alpha: float = 0.05, seed: int = 123):
    """
    Bootstrap mean and 95% CI (percentile) for a 1D array.
    Returns: mean, ci_low, ci_high, se_boot
    """
    x = np.asarray(x).astype(float)
    x = x[~np.isnan(x)]
    rng = np.random.default_rng(seed)
    boots = rng.choice(x, (n_boot, x.size), replace=True).mean(axis=1)
    mean = x.mean()
    se_boot = boots.std(ddof=1)
    ci_low = np.quantile(boots, alpha / 2)
    ci_high = np.quantile(boots, 1 - alpha / 2)
    return mean, ci_low, ci_high, se_boot

def fisher_r_ci(r: float, n: int, alpha: float = 0.05):
    """
    Fisher z-transformation CI for Pearson correlation r with sample size n.
    Returns (ci_low, ci_high). Assumes |r|<1 and n>3.
    """
    r = np.clip(r, -0.999999, 0.999999)
    z = np.arctanh(r)
    se = 1 / np.sqrt(n - 3)
    z_low = z - 1.96 * se
    z_high = z + 1.96 * se
    return np.tanh(z_low), np.tanh(z_high)

def load_data(csv_path: Path | None):
    """
    Load data from CSV, or generate synthetic paired data if csv_path is None.
    Returns: control, experiment, paired (bool)
    """
    if csv_path is None:
        # Synthetic, reproducible paired data
        set_seed(42)
        n = 200
        control = np.random.normal(loc=50, scale=10, size=n)
        effect = 5.0
        experiment = control + effect + np.random.normal(loc=0, scale=8, size=n)
        # Inject some missing values
        miss_idx = np.random.choice(n, size=int(0.05 * n), replace=False)
        control[miss_idx[: len(miss_idx)//2]] = np.nan
        experiment[miss_idx[len(miss_idx)//2:]] = np.nan
        return control, experiment, True

    df = pd.read_csv(csv_path)
    cols = {c.lower(): c for c in df.columns}

    if "control" in cols and "experiment" in cols:
        control = df[cols["control"]].to_numpy(dtype=float)
        experiment = df[cols["experiment"]].to_numpy(dtype=float)
        return control, experiment, True

    if "group" in cols and "value" in cols:
        # Long format; attempt pairing by subject_id if present
        gcol = cols["group"]
        vcol = cols["value"]
        df = df[[gcol, vcol] + ([cols["subject_id"]] if "subject_id" in cols else [])].copy()
        df[gcol] = df[gcol].str.lower().str.strip()
        df = df[df[gcol].isin(["control", "experiment"])]
        if "subject_id" in cols:
            # Pivot to paired wide
            wide = df.pivot_table(index=cols["subject_id"], columns=gcol, values=vcol, aggfunc="mean")
            control = wide.get("control", pd.Series(index=wide.index, dtype=float)).to_numpy()
            experiment = wide.get("experiment", pd.Series(index=wide.index, dtype=float)).to_numpy()
            return control, experiment, True
        else:
            # Unpaired
            control = df.loc[df[gcol] == "control", vcol].to_numpy(dtype=float)
            experiment = df.loc[df[gcol] == "experiment", vcol].to_numpy(dtype=float)
            return control, experiment, False

    raise ValueError("CSV must contain either (control, experiment) columns or (group, value) [and optionally subject_id].")

def preprocess(control: np.ndarray, experiment: np.ndarray, paired: bool, zmax: float = 3.0):
    """
    - Remove NaN/inf.
    - Robust winsorization by MAD.
    - If paired, keep rows where both are valid.
    """
    control = np.asarray(control, dtype=float)
    experiment = np.asarray(experiment, dtype=float)

    if paired:
        mask = np.isfinite(control) & np.isfinite(experiment)
        control, experiment = control[mask], experiment[mask]
        control = winsorize_by_mad(control, zmax=zmax)
        experiment = winsorize_by_mad(experiment, zmax=zmax)
    else:
        control = control[np.isfinite(control)]
        experiment = experiment[np.isfinite(experiment)]
        control = winsorize_by_mad(control, zmax=zmax)
        experiment = winsorize_by_mad(experiment, zmax=zmax)

    return control, experiment

def kde_or_none(x: np.ndarray, grid: np.ndarray):
    if SCIPY_AVAILABLE and x.size > 1:
        try:
            kde = gaussian_kde(x)
            return kde(grid)
        except Exception:
            return None
    return None

def plot_distributions(control: np.ndarray, experiment: np.ndarray, outdir: Path):
    plt.figure(figsize=(8, 5), dpi=120)
    bins = max(10, int(np.sqrt(control.size + experiment.size)))
    # Histograms
    plt.hist(control, bins=bins, density=True, alpha=0.4, color="#1f77b4", label=f"Control (n={control.size})")
    plt.hist(experiment, bins=bins, density=True, alpha=0.4, color="#ff7f0e", label=f"Experiment (n={experiment.size})")
    # KDE curves if available
    xmin = min(np.min(control), np.min(experiment))
    xmax = max(np.max(control), np.max(experiment))
    grid = np.linspace(xmin, xmax, 200)
    d_control = kde_or_none(control, grid)
    d_experiment = kde_or_none(experiment, grid)
    if d_control is not None:
        plt.plot(grid, d_control, color="#1f77b4", lw=2)
    if d_experiment is not None:
        plt.plot(grid, d_experiment, color="#ff7f0e", lw=2)
    # Means
    plt.axvline(control.mean(), color="#1f77b4", ls="--", lw=1)
    plt.axvline(experiment.mean(), color="#ff7f0e", ls="--", lw=1)
    plt.title("Distribution: Control vs Experiment")
    plt.xlabel("Value")
    plt.ylabel("Density")
    plt.legend()
    plt.tight_layout()
    fp = outdir / "01_distributions.png"
    plt.savefig(fp, bbox_inches="tight")
    plt.close()
    return fp

def plot_error_intervals(control: np.ndarray, experiment: np.ndarray, paired: bool, outdir: Path):
    m_c, ci_c_low, ci_c_high, se_c = bootstrap_mean_ci(control)
    m_e, ci_e_low, ci_e_high, se_e = bootstrap_mean_ci(experiment)

    # Effect size: mean difference
    if paired and control.size == experiment.size:
        diff = experiment - control
        md, md_low, md_high, _ = bootstrap_mean_ci(diff)
        effect_text = f"Mean diff (paired): {md:.2f} [{md_low:.2f}, {md_high:.2f}]"
    else:
        md = m_e - m_c
        # Unpaired bootstrap of difference
        rng = np.random.default_rng(123)
        n_boot = 5000
        boots = []
        for _ in range(n_boot):
            b_c = rng.choice(control, control.size, replace=True).mean()
            b_e = rng.choice(experiment, experiment.size, replace=True).mean()
            boots.append(b_e - b_c)
        md_low, md_high = np.quantile(boots, [0.025, 0.975])
        effect_text = f"Mean diff (unpaired): {md:.2f} [{md_low:.2f}, {md_high:.2f}]"

    plt.figure(figsize=(7, 5), dpi=120)
    means = [m_c, m_e]
    ci_lows = [ci_c_low, ci_e_low]
    ci_highs = [ci_c_high, ci_e_high]
    x = np.arange(2)
    colors = ["#1f77b4", "#ff7f0e"]
    plt.errorbar(x, means, yerr=[np.array(means) - np.array(ci_lows), np.array(ci_highs) - np.array(means)],
                 fmt="o", capsize=5, color="black", ecolor="black")
    plt.scatter(x, means, c=colors, s=80, zorder=3)
    plt.xticks(x, ["Control", "Experiment"])
    plt.ylabel("Mean ± 95% CI")
    plt.title("Group Means with 95% Confidence Intervals")
    plt.text(0.5, 0.05, effect_text, transform=plt.gca().transAxes, ha="center")
    plt.tight_layout()
    fp = outdir / "02_error_intervals.png"
    plt.savefig(fp, bbox_inches="tight")
    plt.close()
    return fp

def plot_scatter_correlation(control: np.ndarray, experiment: np.ndarray, paired: bool, outdir: Path):
    if not paired or control.size != experiment.size:
        # Skip if not paired
        return None

    x = control
    y = experiment
    n = x.size

    # Pearson r
    x_c = x - x.mean()
    y_c = y - y.mean()
    r = (x_c @ y_c) / (np.sqrt((x_c**2).sum()) * np.sqrt((y_c**2).sum()))
    r_low, r_high = fisher_r_ci(r, n)

    # Regression line and bootstrap band
    beta = np.polyfit(x, y, deg=1)  # y = beta[0]*x + beta[1]
    x_line = np.linspace(x.min(), x.max(), 200)
    y_line = beta[0] * x_line + beta[1]

    # Bootstrap regression uncertainty
    rng = np.random.default_rng(123)
    n_boot = 2000
    y_boot = np.empty((n_boot, x_line.size))
    idx = np.arange(n)
    for i in range(n_boot):
        b = rng.choice(idx, size=n, replace=True)
        xb, yb = x[b], y[b]
        bcoef = np.polyfit(xb, yb, deg=1)
        y_boot[i] = bcoef[0] * x_line + bcoef[1]
    band_low = np.quantile(y_boot, 0.025, axis=0)
    band_high = np.quantile(y_boot, 0.975, axis=0)

    plt.figure(figsize=(7, 6), dpi=120)
    plt.scatter(x, y, alpha=0.7, s=40, color="#2ca02c", edgecolor="white", linewidth=0.5)
    plt.plot(x_line, y_line, color="black", lw=2, label="OLS fit")
    plt.fill_between(x_line, band_low, band_high, color="gray", alpha=0.2, label="95% bootstrap band")
    plt.xlabel("Control")
    plt.ylabel("Experiment")
    plt.title("Scatter and Correlation (Paired)")
    plt.legend()
    plt.text(0.05, 0.02, f"Pearson r = {r:.3f} [{r_low:.3f}, {r_high:.3f}] (95% CI)", transform=plt.gca().transAxes)
    plt.tight_layout()
    fp = outdir / "03_scatter_correlation.png"
    plt.savefig(fp, bbox_inches="tight")
    plt.close()
    return fp

def main():
    parser = argparse.ArgumentParser(description="Visualize experimental vs control observations.")
    parser.add_argument("--input", type=str, default=None, help="CSV file path. See header format in script description.")
    parser.add_argument("--outdir", type=str, default="figures", help="Output directory for figures.")
    parser.add_argument("--zmax", type=float, default=3.0, help="MAD-based winsorization threshold.")
    args = parser.parse_args()

    outdir = Path(args.outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    csv_path = Path(args.input) if args.input else None
    try:
        control, experiment, paired = load_data(csv_path)
    except Exception as e:
        print(f"Error loading data: {e}", file=sys.stderr)
        sys.exit(1)

    control, experiment = preprocess(control, experiment, paired=paired, zmax=args.zmax)

    dist_fp = plot_distributions(control, experiment, outdir)
    err_fp = plot_error_intervals(control, experiment, paired, outdir)
    scat_fp = plot_scatter_correlation(control, experiment, paired, outdir)

    print("Saved figures:")
    print(f"- {dist_fp}")
    print(f"- {err_fp}")
    if scat_fp is not None:
        print(f"- {scat_fp}")
    else:
        print("- Scatter correlation skipped (requires paired data).")

if __name__ == "__main__":
    main()

Notes on methodology:
- Preprocessing: NaN/inf values are removed. Robust winsorization uses MAD to cap extreme outliers without discarding data, which stabilizes estimates and plots.
- Estimation: Means and 95% confidence intervals use bootstrap percentiles, which are reliable for non-normal distributions. For correlations, Fisher’s z-transform provides an analytical 95% CI for r.
- Visualization:
  - Distribution plots show histograms (and KDE when SciPy is available) with mean indicators.
  - Error interval plot displays mean ± 95% CI per group and annotates the mean difference with its CI (paired or unpaired, based on data).
  - Scatter correlation plot requires paired data and includes an OLS regression line with a bootstrap-based 95% uncertainty band.

适用用户

数据分析师与BI从业者

把表格数据一键生成多图对比脚本;自动标注关键波动;快速完成周报、月报可视化并统一风格。

产品经理与增长负责人

生成留存、漏斗、转化链路与分群趋势图;清晰呈现实验结果与结论;用于评审会与复盘的标准化图稿。

研究人员与数据科学从业者

将实验与观测数据可视化;输出误差区间、分布与相关性图;脚本可复现、可复用,便于论文与附录。

市场与运营团队

制作渠道对比、活动转化、地域分布图;高清导出用于海报、PPT与公众号;高效交付增长复盘。

教师、培训讲师与教育创作者

按课堂主题生成示例脚本与练习模板;中英注释清晰讲解步骤;让学生专注理解方法而非排版。

解决的问题

让业务与数据团队在最短时间内把“数据描述”转化为可直接运行的可视化脚本与清晰说明,快速产出高质量图表并用于汇报、决策与复盘。具体目标:- 降低从数据到图表的门槛,减少反复调试与搜代码时间,通常能显著缩短制图周期。- 统一图表规范(命名、标题、注释、颜色与版式),提升报告的专业度与一致性。- 以“专家视角”生成脚本与说明,兼顾准确性与可读性,减少误解与偏差。- 根据你的数据与场景,自动匹配合适的图表类型(趋势、对比、分布、相关性、热力等)。- 支持多语言输出,满足跨部门与跨地区沟通。- 促使首次试用即可完成一个可嵌入报告的图表;付费升级可获得风格库、批量生成、质量清单与团队协作能力,进一步提升交付效率与品牌一致性。

特征总结

一键生成可运行的matplotlib脚本,按数据类型自动选图与布局,报告、看板与演示即刻成型。
自动补全标题、坐标与图例,统一配色与字体风格,成图专业耐看,减少反复微调时间。
内置清洗与异常提示,附带可选代码片段,修正缺失与离群,确保图表更准更可信。
支持多图联动与子图网格,轻松比较分组与时间趋势,复杂关系一屏讲清。
依据业务目标自动添加注释与关键数值,突出增长拐点与波动来源,快速讲清数据故事。
模板参数可调与多语言说明,中文/英文随时切换,跨团队协作与对外汇报更顺畅。
生成结构清晰、注释完善的脚本骨架,便于二次修改与复用,缩短从草稿到定稿周期。
提供高清导出与自适应比例设置,PPT、报告与网页一稿多用,发布即达展示标准。
面向大数据场景给出采样与性能建议,保持绘图流畅稳定,避免卡顿与无效重绘。

如何使用购买的提示词模板

1. 直接在外部 Chat 应用中使用

将模板生成的提示词复制粘贴到您常用的 Chat 应用(如 ChatGPT、Claude 等),即可直接对话使用,无需额外开发。适合个人快速体验和轻量使用场景。

2. 发布为 API 接口调用

把提示词模板转化为 API,您的程序可任意修改模板参数,通过接口直接调用,轻松实现自动化与批量处理。适合开发者集成与业务系统嵌入。

3. 在 MCP Client 中配置使用

在 MCP client 中配置对应的 server 地址,让您的 AI 应用自动调用提示词模板。适合高级用户和团队协作,让提示词在不同 AI 工具间无缝衔接。

¥15.00元
平台提供免费试用机制,
确保效果符合预期,再付费购买!

您购买后可以获得什么

获得完整提示词模板
- 共 255 tokens
- 2 个可调节参数
{ 数据类型或描述 } { 输出语言 }
自动加入"我的提示词库"
- 获得提示词优化器支持
- 版本化管理支持
获得社区共享的应用案例
限时免费

不要错过!

免费获取高级提示词-优惠即将到期

17
:
23
小时
:
59
分钟
:
59