Generate Data Profiling Analysis Results

Updated Sep 25, 2025

Generates professional data profiling analysis results for a given dataset.

Example 1

The following is a data profiling analysis template and evaluation methodology for the "New User Conversion Data" dataset. Since no actual data or field structure was provided, this report is based on a standardized data quality framework and reusable computation logic; result items are given as placeholders, and numeric conclusions can be generated once real data is connected.

I. Scope and Objectives
- Data object: conversion behavior of newly registered users within a specified observation window (e.g., first purchase, key feature usage, KYC completion).
- Target output: data profile (distribution, completeness, consistency, uniqueness, timeliness), identification of key quality issues, and actionable remediation and monitoring plans.
- Key parameters (to be confirmed): conversion definition, observation window (e.g., 7/14/30 days after signup), attribution model (last touch / first touch), time window, and timezone.

II. Fields and Structure (reference standard; actual schema per your system)
- Main table: user_conversion_facts (user-level or event-level)
  - user_id (primary key or business key)
  - signup_ts (signup timestamp)
  - conversion_flag (converted or not; boolean/enum)
  - conversion_ts (first conversion timestamp)
  - conversion_days (days to conversion, integer)
  - source, campaign_id, channel (attribution fields)
  - device_type, os_version, app_version (device and version)
  - geo_country, geo_region (geography)
  - revenue_first_purchase (first-purchase amount, numeric)
  - event_id (used for event-level data)
- Dimension tables (optional): campaign_dim, channel_dim, geo_dim, app_version_dim

III. Metric Definitions and Computation
- Volume and coverage
  - Total records: count(*)
  - New users: count(distinct user_id)
  - Observation-window coverage: record and new-user counts grouped by date
- Completeness (nulls/missing)
  - Per-field missing rate: sum(is_null(col))/count(*)
- Uniqueness / duplicates
  - user_id uniqueness: count(distinct user_id)/count(*)
  - Event duplication: share of duplicate event_id; share of duplicate signup events
- Consistency / logic checks
  - conversion_flag vs. conversion_ts: conversion_ts non-null when flag=1, null when flag=0
  - Temporal order: conversion_ts >= signup_ts; conversion_days = datediff(conversion_ts, signup_ts)
  - Revenue consistency: revenue is 0 or null when flag=0; revenue >= 0 when flag=1
- Validity / value ranges
  - Enum values are legal (channel/source within the whitelist)
  - Timestamp ranges are legal (not before the system launch date, not after the current time)
  - Numeric ranges (revenue non-negative; conversion_days non-negative and within a set upper bound)
- Distribution and anomalies
  - Conversion rate: sum(flag)/count(distinct user_id)
  - Days-to-conversion distribution: P50/P90/P99
  - Revenue distribution: P50/P90/P99, maximum, and heavy-tail detection (e.g., multiples beyond P99)
  - Attribution distribution: channel and campaign shares, Top N
  - Device/region distribution: top-value shares and the long tail
- Timeliness
  - Data latency: gap between max(ingest_ts) and event time
  - Partition/date completeness: whether each day has data; whether partitions are missing
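As an illustration, the completeness, uniqueness, consistency, and conversion-rate formulas above can be sketched in Python over a toy record set. The field names follow the reference schema in section II; the data itself is invented for demonstration.

```python
from datetime import datetime

# Toy user-level records following the user_conversion_facts schema above.
rows = [
    {"user_id": "u1", "signup_ts": datetime(2025, 1, 1), "conversion_flag": 1,
     "conversion_ts": datetime(2025, 1, 3), "revenue_first_purchase": 19.9},
    {"user_id": "u2", "signup_ts": datetime(2025, 1, 2), "conversion_flag": 0,
     "conversion_ts": None, "revenue_first_purchase": None},
    {"user_id": "u2", "signup_ts": datetime(2025, 1, 2), "conversion_flag": 0,
     "conversion_ts": None, "revenue_first_purchase": None},  # duplicate signup
    {"user_id": "u3", "signup_ts": None, "conversion_flag": 1,
     "conversion_ts": datetime(2025, 1, 5), "revenue_first_purchase": 49.0},
]

total = len(rows)
new_users = len({r["user_id"] for r in rows})

# Completeness: per-field missing rate = sum(is_null(col)) / count(*)
null_rate = {
    col: sum(r[col] is None for r in rows) / total
    for col in ("signup_ts", "conversion_ts", "revenue_first_purchase")
}

# Uniqueness: count(distinct user_id) / count(*)
uniqueness = new_users / total

# Consistency: flag=1 requires conversion_ts non-null; flag=0 requires null
inconsistent = sum(
    (r["conversion_flag"] == 1) != (r["conversion_ts"] is not None) for r in rows
)

# Conversion rate: converted users / distinct users, counting each user once
converted_users = {r["user_id"] for r in rows if r["conversion_flag"] == 1}
conv_rate = len(converted_users) / new_users

print(null_rate["signup_ts"], uniqueness, inconsistent, conv_rate)
```

In production the same arithmetic would run in SQL (as in section VII); the sketch is just the computation logic made concrete.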

IV. Data Profiling Results (template, to be filled)
- Basic volume
  - Observation window: [start date] to [end date]
  - Total records: [TBD]
  - New users: [TBD]
  - Share of event-level data (if any): [TBD]
- Completeness
  - Fields with the highest missing rates: [field A: X%], [field B: Y%]
  - Missing rates of key (required) fields: signup_ts [X%]; user_id [X%]; conversion_flag [X%]
- Uniqueness and duplicates
  - user_id uniqueness ratio: [TBD]
  - Share of duplicate signup events: [TBD]
  - Share of duplicate event_id: [TBD]
- Consistency
  - Share of flag/ts-inconsistent records: [TBD]
  - Share with conversion_ts < signup_ts: [TBD]
  - Share with inconsistent conversion_days computation: [TBD]
  - Share with revenue inconsistent with flag: [TBD]
- Validity
  - Share of illegal enum values (channel/source): [TBD]; top illegal values: [TBD]
  - Share of anomalous timestamps (future / too early): [TBD]
  - Share of negative or abnormally large revenue: [TBD]
- Distribution and anomalies
  - Overall conversion rate: [TBD]
  - conversion_days distribution (P50/P90/P99): [TBD]
  - First-purchase revenue distribution (P50/P90/P99): [TBD]
  - Top channel/campaign shares: [channel A X%], [campaign B Y%]
  - Top device/region distribution: [TBD]
- Timeliness
  - Data latency P95: [TBD] hours
  - Dates or partitions with missing data: [date list or "none"]

V. Quality Assessment and Risk Grading (rule list)
- High risk (fix first)
  - user_id non-unique, or missing rate > 0.5%
  - signup_ts missing or invalid > 0.5%
  - flag=1 but conversion_ts missing > 0.1%
  - conversion_ts < signup_ts > 0.01%
- Medium risk
  - Illegal enum values > 1%
  - Negative revenue > 0.1%
  - Days to conversion beyond the set upper bound (e.g., > 90 days) > 1%
- Low risk
  - Abnormal long-tail channel/campaign shares destabilizing attribution
  - Data latency P95 > 24 hours
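A rule list like this reduces to a table of (metric, threshold, risk level) triples checked against the profiled values. A minimal sketch, assuming the profiling metrics have already been computed into a flat dict (the metric names and values here are invented; the thresholds mirror the rules above):

```python
# Profiled metric values (invented) keyed by check name.
metrics = {
    "user_id_missing_rate": 0.001,
    "signup_ts_invalid_rate": 0.008,
    "flag1_missing_ts_rate": 0.0005,
    "ts_before_signup_rate": 0.0,
    "invalid_enum_rate": 0.02,
    "negative_revenue_rate": 0.0,
}

# (metric, threshold, risk level) mirroring section V; a rate above its
# threshold fires a finding.
rules = [
    ("user_id_missing_rate", 0.005, "high"),
    ("signup_ts_invalid_rate", 0.005, "high"),
    ("flag1_missing_ts_rate", 0.001, "high"),
    ("ts_before_signup_rate", 0.0001, "high"),
    ("invalid_enum_rate", 0.01, "medium"),
    ("negative_revenue_rate", 0.001, "medium"),
]

findings = [
    {"metric": m, "value": metrics[m], "threshold": t, "risk": risk}
    for m, t, risk in rules
    if metrics[m] > t
]

for f in sorted(findings, key=lambda f: f["risk"]):  # "high" sorts before "medium"
    print(f"[{f['risk']}] {f['metric']} = {f['value']:.4f} (> {f['threshold']})")
```

Keeping rules as data rather than code makes the thresholds easy to review and version alongside the report.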

VI. Issue Localization and Remediation
- Completeness
  - Enforce NOT NULL constraints on required fields in the ETL layer; backfill or drop missing data (per business tolerance)
  - Normalize timezone and format parsing for timestamps; introduce a raw-event landing time (ingest_ts)
- Uniqueness
  - Introduce a business key constraint (user_id + signup_ts) or an idempotency key (event_id)
  - Deduplication strategy: merge records with the same user_id within an approximate time window
- Consistency
  - Validate conversion logic against the event sequence (signup → activation → conversion)
  - Recompute derived fields (conversion_days, revenue) and reconcile against stored values
- Validity
  - Enum whitelist validation with a rejection policy; isolate anomalies into an audit table
  - Threshold rules: flag revenue < 0 or above a reasonable business cap as anomalous
- Timeliness
  - Establish partition-completeness checks; latency alerting (e.g., alert when delay exceeds 12 hours)
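The deduplication strategy above (merge records for the same user_id that fall within an approximate time window) can be sketched as follows. The 5-minute window is an assumed parameter, not a recommendation from the source:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed merge window; tune per business tolerance

def dedupe_signups(events):
    """Keep the first signup per user; drop repeats within WINDOW of the kept one."""
    kept = []
    last_kept = {}  # user_id -> signup_ts of the most recently kept event
    for ev in sorted(events, key=lambda e: (e["user_id"], e["signup_ts"])):
        prev = last_kept.get(ev["user_id"])
        if prev is not None and ev["signup_ts"] - prev <= WINDOW:
            continue  # near-duplicate: merge (drop) it
        last_kept[ev["user_id"]] = ev["signup_ts"]
        kept.append(ev)
    return kept

events = [
    {"user_id": "u1", "signup_ts": datetime(2025, 1, 1, 10, 0, 0)},
    {"user_id": "u1", "signup_ts": datetime(2025, 1, 1, 10, 2, 0)},  # dup, +2 min
    {"user_id": "u1", "signup_ts": datetime(2025, 1, 1, 12, 0, 0)},  # distinct
    {"user_id": "u2", "signup_ts": datetime(2025, 1, 1, 10, 1, 0)},
]
print(len(dedupe_signups(events)))  # 3 rows survive
```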

VII. Monitoring and Auditing (sample SQL / rule expressions)
- Daily volume and conversion rate
  - select dt, count(distinct user_id) as new_users, sum(case when conversion_flag=1 then 1 else 0 end)::float / count(distinct user_id) as conv_rate from user_conversion_facts group by dt;
- Completeness and consistency
  - Missing rate: select 'signup_ts' as col, sum(case when signup_ts is null then 1 else 0 end)::float / count(*) as null_rate from user_conversion_facts;
  - Logical inconsistency: select sum(case when conversion_flag=1 and conversion_ts is null then 1 else 0 end)::float / count(*) as inconsistent_rate from user_conversion_facts;
  - Temporal order: select sum(case when conversion_ts < signup_ts then 1 else 0 end)::float / count(*) from user_conversion_facts;
- Uniqueness and duplicates
  - select count(*) - count(distinct user_id) as duplicate_user_records from user_conversion_facts;
- Validity
  - Enum whitelist: select (count(*) filter (where channel not in (select channel from channel_dim)))::float / count(*) as invalid_channel_rate from user_conversion_facts;
- Timeliness
  - select percentile_cont(0.95) within group (order by extract(epoch from (ingest_ts - event_ts))/3600.0) as p95_delay_hours from user_conversion_facts;

VIII. Deliverables and Next Steps
- Please provide the following to generate a numeric profiling result:
  - Field dictionary and primary key definition
  - Business definition of conversion and the observation window
  - Data time range and sample size
  - Optional dimension tables (channel/campaign/app_version) and whitelists
- Upon receipt, the output will include:
  - A complete numeric profiling report
  - An issue list with priorities
  - A remediation and monitoring rollout plan (rules, thresholds, jobs, and alerts)

Note: this report is a standard template for data quality profiling, designed to run directly and produce accurate results once real data is connected. Rules and thresholds can be quickly customized to your specific data and business definitions.

Example 2

Order Fact Table – Data Profiling Analysis Results (Template + Illustrative Example)

Scope and grain
- Assumed grain: one row per order (order header). If your fact grain is order-line, adjust metrics that rely on totals and referential checks to the product dimension accordingly.
- Intended use: accuracy, completeness, consistency, and integrity assessment to inform cleansing, validation rules, and monitoring.

Table overview (replace illustrative values with actuals)
- Row count: [Example] 1,238,417
- Coverage window (order_date): [Example] 2023-01-01 to 2025-09-20
- Primary key: order_id
  - Uniqueness: [Example] 99.997% unique
  - Duplicate order_ids: [Example] 32
- Freshness (max(updated_at) vs extraction): [Example] 1h 42m

Column-level profiling (core fields)

1) order_id
- Type: string/integer (non-nullable)
- Distinct count: [Example] 1,238,385
- Null rate: [Example] 0.000%
- Duplicates: [Example] 32 rows across 16 ids
- Notes: Enforce not null + uniqueness at source; quarantine duplicate keys.

2) order_date (date/timestamp)
- Min/Max: [Example] 2023-01-01 / 2025-09-20
- Null rate: [Example] 0.000%
- Future-dated (> current_date): [Example] 0.06%
- Invalid dates/timezones: [Example] 0.00% invalid parse; mixed TZ flags detected
- Seasonality: [Example] weekend share 27%; end-of-month spikes present

3) customer_id (FK to dim_customer)
- Null rate: [Example] 0.42%
- Distinct count: [Example] 214,903
- Orphan rate vs dim_customer: [Example] 0.18%
- Notes: Backfill anonymous/guest strategy or surrogate for nulls; address orphans via late-arriving dimension handling.

4) currency_code (ISO 4217)
- Cardinality: [Example] 5 (USD, EUR, GBP, CAD, AUD)
- Null rate: [Example] 0.00%
- Invalid values: [Example] 0.04% (case issues: 'usd')
- Notes: Standardize to uppercase; enforce against dim_currency.

5) order_status (enumerated)
- Allowed set (example): Pending, Completed, Cancelled, Refunded, Partially_Refunded
- Null rate: [Example] 0.00%
- Top distribution: [Example] Completed 82.5%; Pending 9.8%; Cancelled 5.6%; Refunded 2.1%
- Invalid statuses: [Example] 0.03% (legacy codes)
- Notes: Map legacy to canonical set; add check constraint or validation UDF.

6) sales_channel (enumerated)
- Values: [Example] Web, App, Store, Marketplace
- Null rate: [Example] 0.11%
- Distribution: [Example] Web 56%, App 22%, Store 18%, Marketplace 4%
- Notes: Normalize spelling/casing; handle unknowns explicitly.

7) subtotal_amount (numeric)
- Null rate: [Example] 0.01%
- Min/Max: [Example] 0.00 / 12,300.00
- P50/P95: [Example] 72.10 / 420.00
- Negative values: [Example] 0.00%
- Currency consistency: [Example] 100% aligned to currency_code
- Notes: Validate precision/scale; non-negative expectation except adjustments per policy.

8) discount_amount (numeric; sign convention must be confirmed)
- Null rate: [Example] 0.03%
- Min/Max: [Example] 0.00 / 2,000.00 (stored as positive discount to be subtracted)
- P50/P95: [Example] 0.00 / 20.00
- Unexpected sign: [Example] 0.07% negative values
- Notes: Enforce consistent sign; reconcile with pricing engine rules.

9) tax_amount (numeric)
- Null rate: [Example] 0.02%
- Min/Max: [Example] 0.00 / 1,100.00
- Negative values: [Example] 0.01% (likely tax reversals)
- Notes: Negative taxes should align with refund/cancel events.

10) shipping_amount (numeric)
- Null rate: [Example] 0.05%
- Min/Max: [Example] 0.00 / 400.00
- Zero with shipped status: [Example] 3.9% (free shipping or missing fees)
- Notes: Cross-check with shipping method and promo flags.

11) total_amount (numeric; order grand total)
- Null rate: [Example] 0.01%
- Min/Max: [Example] -2,500.00 / 12,499.00
- P50/P95: [Example] 79.99 / 459.00
- Negative totals: [Example] 0.12% (refunds); 413 rows negative with non-refund status
- Notes: Enforce consistency with status and calculation rule.

12) payment_method (categorical)
- Null rate: [Example] 2.30%
- Top values: [Example] Visa 39%, Mastercard 28%, PayPal 18%, Amex 8%, COD 3%, Other 4%
- Incoherent with status (e.g., captured but pending): [Example] 0.09%
- Notes: Validate against payment provider codes; ensure PCI-safe tokenization fields only.

13) created_at / updated_at (timestamps)
- Null rates: [Example] created_at 0.00%; updated_at 0.02%
- updated_at >= created_at: [Example] 96.8% (3.2% violations; clock skew or ingest issues)
- Staleness: [Example] 7.4% not updated > 90 days while still Pending
- Notes: Normalize timezones; enforce monotonic update constraint where applicable.

Cross-field consistency checks

Calculation coherence (define per business rule)
- Expected rule (example): total_amount ≈ subtotal_amount − discount_amount + tax_amount + shipping_amount
- Tolerance (epsilon): [Example] 0.01 currency units
- Result: [Example] 98.9% within tolerance; 1.1% mismatches
- Root causes (observed): [Example] rounding, missing shipping after partial refund, discount applied post-tax in some sources
- Action: Standardize calculation order and rounding precision; compute canonical totals in ETL.
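The coherence rule can be expressed directly in code. A minimal sketch, using the example calculation rule and epsilon above; the field names follow the column list earlier in this example, and the sample orders are invented:

```python
EPSILON = 0.01  # tolerance in currency units, per the example rule above

def total_mismatch(order):
    """Absolute gap between the stored total and the canonical calculation."""
    expected = (order["subtotal_amount"] - order["discount_amount"]
                + order["tax_amount"] + order["shipping_amount"])
    return abs(order["total_amount"] - round(expected, 2))

orders = [
    {"subtotal_amount": 100.00, "discount_amount": 10.00, "tax_amount": 7.20,
     "shipping_amount": 4.99, "total_amount": 102.19},   # coherent
    {"subtotal_amount": 50.00, "discount_amount": 0.00, "tax_amount": 4.00,
     "shipping_amount": 0.00, "total_amount": 53.50},    # 0.50 mismatch
]

mismatch_rate = sum(total_mismatch(o) > EPSILON for o in orders) / len(orders)
print(mismatch_rate)  # 0.5
```

Fixing the rounding precision inside the canonical calculation (here, two decimal places) is what makes the epsilon comparison stable across sources.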

Status-to-amount coherence
- Completed: total_amount > 0.00 — [Example] 99.6% pass
- Cancelled (no fulfillment): total_amount = 0.00 — [Example] 92.3% pass
- Refunded: total_amount <= 0.00 or separate refund fact — [Example] 88.1% pass
- Action: Encode explicit monetary state model (authorization, capture, refund) and align status semantics.

Temporal coherence
- order_date within [created_at − 1d, created_at + 1d]: [Example] 99.2% pass
- updated_at present for state changes: [Example] 94.7% pass
- Action: Recompute order_date from event stream or enforce event-sourced derivation.

Referential integrity (to dimensions)
- customer_id in dim_customer: [Example] 99.82% (0.18% orphan)
- currency_code in dim_currency: [Example] 99.96%
- sales_channel in dim_channel: [Example] 99.89%
- date keys resolvable in dim_date: [Example] 99.94%
- Action: Late-arriving dimension handling, conformance mappings, and reject/quarantine policies.

Outliers and anomaly signals
- Total amount outliers (> P95 × 10): [Example] 27 orders; investigate high-value promotions or currency scaling errors.
- Negative subtotal or tax: [Example] 0.02% combined; likely corrective entries; confirm policy.
- Daily order volume anomalies (7-day z-score > 3): [Example] 2 spikes (marketing campaigns) and 1 dip (ETL delay).
- Duplicate order_id with differing amounts: [Example] 11 cases; deduplicate by latest updated_at or source_of_truth.
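The daily-volume anomaly signal (7-day z-score > 3) can be sketched like this; the window and threshold follow the bullet above, and the order-count series is invented:

```python
from statistics import mean, pstdev

def zscore_anomalies(daily_counts, window=7, z_threshold=3.0):
    """Flag days whose count deviates > z_threshold sigmas from the trailing window."""
    anomalies = []
    for i in range(window, len(daily_counts)):
        ref = daily_counts[i - window:i]
        mu, sigma = mean(ref), pstdev(ref)
        if sigma > 0 and abs(daily_counts[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies

# 14 days of order counts with one campaign-like spike at index 10.
counts = [1000, 1020, 980, 1010, 995, 1005, 990,
          1015, 1002, 998, 2500, 1000, 1010, 990]
print(zscore_anomalies(counts))  # [10]
```

Note that a large spike inflates the trailing window's standard deviation for the next several days, which naturally suppresses repeat alerts right after an event; whether that behavior is desirable depends on the monitoring policy.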

Data quality risks identified (examples)
- Inconsistent total computation across sources leads to 1.1% mismatches.
- Status semantics not aligned to monetary state (refund/cancel), producing negative totals with non-refund statuses.
- Small but material orphan FK rate on customer_id (0.18%).
- Timestamp incoherence (3.2%) likely due to timezone/clock issues.
- Casing/format issues in currency_code and categorical fields.

Recommended cleansing and validation rules
- Enforce order_id uniqueness; reject duplicates or retain highest updated_at per id.
- Standardize calculation: compute canonical_total with fixed order and rounding; compare to source totals and flag discrepancies > epsilon.
- Enforce non-negative constraints for subtotal, tax, shipping; define and document discount sign convention; correct records violating the chosen convention.
- Status-amount rule set (examples):
  - Completed: total_amount > 0 and payment_captured = true
  - Cancelled: total_amount = 0 and fulfillment_state = none
  - Refunded: total_amount <= 0 or attach linked refund records
- Referential integrity checks on load; quarantine or delayed-load for late dimensions.
- Normalize enumerations (uppercase currency_code, trim spaces, canonical status/channel values).
- Time normalization to UTC; enforce updated_at >= created_at.

Monitoring KPIs and thresholds (set alerts)
- PK duplicates: target 0; warn > 0, critical > 10/day
- FK orphan rate (customer_id): target < 0.10%; warn ≥ 0.25%
- Null rate by monetary fields: target 0; warn ≥ 0.01%
- Total calculation mismatch rate: target < 0.50%; warn ≥ 1.00%
- Negative totals with non-refund status: target 0; critical ≥ 5/day
- Future-dated orders: target 0; warn ≥ 0.01%
- Data freshness: max(updated_at) lag < 2h; warn ≥ 4h

Notes on interpretation
- All numeric “Example” values are illustrative to show expected outputs and typical ranges. Replace with computed metrics from your dataset.
- Confirm business rules for total calculation, discount sign, refund handling, and status semantics before enforcing rules.

If you provide a schema sample and row extracts, I can replace the illustrative figures with precise metrics and produce a finalized profiling report.

Example 3

The following is the design and generation plan for the data profiling analysis results of the "Core Metrics Monitoring Data" dataset. Since no concrete dataset was provided, this result assumes a generic core-metrics monitoring data model and gives reusable profiling metrics, computation methods, and output structures. Please run the corresponding computations to produce numeric results after confirming the fields and business rules.

I. Objectives and Scope
- Objective: systematically profile the core metrics monitoring data across completeness, uniqueness, validity, distribution and anomalies, timeliness, consistency, and drift, providing a quantifiable baseline for subsequent quality monitoring and alerting.
- Scope: detail table at the "metric - date - segment (dimension)" grain; a window of the most recent 90 days is recommended (adjustable per business needs).

II. Data Model Assumptions (to be confirmed)
Core table: core_metrics
- Primary key (composite): metric_id, as_of_date, segment_1, segment_2 (segment_* may be null if unsegmented)
- Fields:
  - metric_id STRING: metric ID
  - metric_name STRING: metric name
  - as_of_date DATE: statistical date the metric belongs to
  - value NUMERIC: metric value
  - unit STRING (nullable): unit
  - target_value NUMERIC (nullable): target value
  - threshold_min NUMERIC (nullable): lower threshold
  - threshold_max NUMERIC (nullable): upper threshold
  - segment_1/segment_2 STRING (nullable): segmentation dimensions (e.g., channel, region)
  - source_system STRING: source system
  - event_time TIMESTAMP (nullable): time the metric was produced (if available)
  - ingested_at TIMESTAMP: warehouse ingestion time
- Dimension table (foreign-key checked): metric_dict(metric_id, metric_name, owner, definition, unit, expected_frequency)

III. Profiling Metrics and Result Structure
Output as multiple result tables or a single summary table; the recommended structure:

1. dq_summary (overall profile)
- Time range: start_date, end_date
- Total records: record_count
- Metric count: distinct_metric_count
- Date coverage rate: date_coverage_rate (dates with data / expected dates)
- Segment coverage rate: segment_coverage_rate (share of records with at least one segment value)
- Primary-key uniqueness rate: primary_key_uniqueness_rate
- Completeness pass rate (key fields): completeness_pass_rate
- Validity pass rate (rule set): validity_pass_rate
- Timeliness pass rate (within SLA): timeliness_pass_rate
- Drift risk (overall PSI): psi_overall
- Outlier record share (IQR or z-score): outlier_rate

2. dq_completeness (field-level completeness)
- field_name
- non_null_rate
- null_count
- expected_not_null (boolean, from rules)
- pass_flag

3. dq_uniqueness (primary-key uniqueness)
- duplicate_count
- duplicate_rate
- sample_keys (for auditing; may be masked)

4. dq_validity (rule-level validity)
- rule_id
- rule_description
- fail_count
- fail_rate
- pass_flag
Sample rules:
- R1 Type and parsability: value is numeric and not NaN/Inf
- R2 Threshold interval: if thresholds exist, threshold_min ≤ value ≤ threshold_max
- R3 Target logic: if target_value exists, the deviation of value from target_value is within tolerance (business-defined)
- R4 Threshold sanity: threshold_min ≤ threshold_max
- R5 Foreign-key integrity: metric_id ∈ metric_dict
- R6 Frequency consistency: records exist for the dates implied by expected_frequency (missing dates are violations)

5. dq_distribution (distribution and anomalies)
- metric_id
- stats (min, max, mean, median, std)
- iqr_outlier_rate (IQR rule: < Q1 − 1.5·IQR or > Q3 + 1.5·IQR)
- zscore_outlier_rate (optional)

6. dq_timeliness (timeliness)
- ingestion_delay_seconds (or hours)
- delay_stats (min, max, mean, median, p95)
- pass_rate (delay ≤ SLA, e.g., ≤ 4 hours)
- late_count

7. dq_drift (drift)
- metric_id
- drift_metric (PSI or relative mean change)
- ref_period (reference period)
- cur_period (current period)
- pass_flag (low drift passes)

IV. Computation Methods and Sample SQL (standard SQL; adjust to your dialect)
Pass the time window as parameters: :start_date, :end_date, :sla_hours, :ref_start_date, :ref_end_date

1) Overall counts and coverage
- Total records and metric count:
  SELECT
    COUNT(*) AS record_count,
    COUNT(DISTINCT metric_id) AS distinct_metric_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- Date coverage rate (ideally derived from a calendar table or the expected frequency; this example approximates with the count of dates that have records):
  SELECT
    COUNT(DISTINCT as_of_date)::float / NULLIF(DATE_PART('day', :end_date - :start_date) + 1, 0) AS date_coverage_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- Segment coverage rate (at least one non-null segment):
  SELECT
    SUM(CASE WHEN segment_1 IS NOT NULL OR segment_2 IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS segment_coverage_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

2) Primary-key uniqueness (example uses segment_1/segment_2; use metric_id + as_of_date only if unsegmented)
  SELECT
    COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text, COALESCE(segment_1,''), COALESCE(segment_2,''))) AS duplicate_count,
    (COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text, COALESCE(segment_1,''), COALESCE(segment_2,''))))::float / COUNT(*) AS duplicate_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

3) Field completeness (key fields: metric_id, as_of_date, value, source_system, ingested_at)
  SELECT
    'metric_id' AS field_name,
    SUM(CASE WHEN metric_id IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
    SUM(CASE WHEN metric_id IS NULL THEN 1 ELSE 0 END) AS null_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
  UNION ALL
  SELECT
    'as_of_date',
    SUM(CASE WHEN as_of_date IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
    SUM(CASE WHEN as_of_date IS NULL THEN 1 ELSE 0 END) AS null_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
  UNION ALL
  SELECT
    'value',
    SUM(CASE WHEN value IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
    SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END) AS null_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
  UNION ALL
  SELECT
    'source_system',
    SUM(CASE WHEN source_system IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
    SUM(CASE WHEN source_system IS NULL THEN 1 ELSE 0 END) AS null_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
  UNION ALL
  SELECT
    'ingested_at',
    SUM(CASE WHEN ingested_at IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
    SUM(CASE WHEN ingested_at IS NULL THEN 1 ELSE 0 END) AS null_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

4) Validity checks
- Type and parsability (numeric and finite):
  SELECT
    SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END) AS invalid_numeric_count, -- NaN detection is dialect-dependent
    SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END)::float / COUNT(*) AS invalid_numeric_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- Threshold interval:
  SELECT
    SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND (value < threshold_min OR value > threshold_max) THEN 1 ELSE 0 END) AS out_of_threshold_count,
    SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND (value < threshold_min OR value > threshold_max) THEN 1 ELSE 0 END)::float / COUNT(*) AS out_of_threshold_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- Threshold sanity:
  SELECT
    SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND threshold_min > threshold_max THEN 1 ELSE 0 END) AS invalid_threshold_pair_count
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- Foreign-key integrity (requires metric_dict):
  SELECT
    COUNT(*) - COUNT(md.metric_id) AS fk_missing_count,
    (COUNT(*) - COUNT(md.metric_id))::float / COUNT(*) AS fk_missing_rate
  FROM core_metrics cm
  LEFT JOIN metric_dict md ON cm.metric_id = md.metric_id
  WHERE cm.as_of_date BETWEEN :start_date AND :end_date;

5) Distribution and anomalies (IQR)
  WITH stats AS (
    SELECT
      metric_id,
      PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
      PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date
    GROUP BY metric_id
  )
  SELECT
    cm.metric_id,
    SUM(CASE WHEN cm.value < (s.q1 - 1.5*(s.q3 - s.q1)) OR cm.value > (s.q3 + 1.5*(s.q3 - s.q1)) THEN 1 ELSE 0 END)::float / COUNT(*) AS iqr_outlier_rate
  FROM core_metrics cm
  JOIN stats s ON cm.metric_id = s.metric_id
  WHERE cm.as_of_date BETWEEN :start_date AND :end_date
  GROUP BY cm.metric_id;

6) Timeliness (based on ingested_at vs. event_time or as_of_date)
- If event_time is available:
  SELECT
    SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - event_time))/3600 <= :sla_hours THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_median_seconds,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_p95_seconds
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

- If event_time is unavailable, approximate with the gap from as_of_date to ingested_at:
  SELECT
    SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - (as_of_date::timestamp))) / 3600 <= :sla_hours THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date;

7) Drift (PSI, binned)
Example: for each metric_id, align the :ref_period and :cur_period on quantile bins and compute PSI:
  WITH bins AS (
    SELECT
      metric_id,
      PERCENTILE_CONT(array[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]) WITHIN GROUP (ORDER BY value) AS quantiles
    FROM core_metrics
    WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
    GROUP BY metric_id
  ),
  ref AS (
    SELECT metric_id, value, 'ref' AS p FROM core_metrics WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
  ),
  cur AS (
    SELECT metric_id, value, 'cur' AS p FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
  ),
  all_data AS (
    SELECT * FROM ref UNION ALL SELECT * FROM cur
  ),
  binned AS (
    SELECT
      a.metric_id,
      a.p,
      CASE
        WHEN a.value < q[1] THEN 0
        WHEN a.value < q[2] THEN 1
        WHEN a.value < q[3] THEN 2
        WHEN a.value < q[4] THEN 3
        WHEN a.value < q[5] THEN 4
        WHEN a.value < q[6] THEN 5
        WHEN a.value < q[7] THEN 6
        WHEN a.value < q[8] THEN 7
        WHEN a.value < q[9] THEN 8
        ELSE 9
      END AS bin_id
    FROM all_data a
    JOIN (
      SELECT metric_id, quantiles AS q FROM bins
    ) b ON a.metric_id = b.metric_id
  ),
  dist AS (
    SELECT metric_id, p, bin_id, COUNT(*)::float / SUM(COUNT(*)) OVER (PARTITION BY metric_id, p) AS prob
    FROM binned
    GROUP BY metric_id, p, bin_id
  )
  SELECT r.metric_id,
         SUM(CASE WHEN r.prob > 0 AND c.prob > 0 THEN (r.prob - c.prob) * LN(r.prob / c.prob) ELSE 0 END) AS psi
  FROM dist r
  JOIN dist c ON r.metric_id = c.metric_id AND r.bin_id = c.bin_id AND r.p = 'ref' AND c.p = 'cur'
  GROUP BY r.metric_id;
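For validating the SQL above offline, the same PSI computation (quantile bin edges from the reference period, then Σ (p_ref − p_cur)·ln(p_ref/p_cur)) can be sketched in Python. This is a minimal sketch with nearest-rank decile edges, not a drop-in equivalent of every dialect's percentile_cont:

```python
import math
from bisect import bisect_right

def psi(ref_values, cur_values, n_bins=10):
    """Population Stability Index using quantile bin edges from the reference period."""
    srt = sorted(ref_values)
    # Interior cut points (0.1 ... 0.9 for deciles), nearest-rank style.
    edges = [srt[min(int(len(srt) * k / n_bins), len(srt) - 1)]
             for k in range(1, n_bins)]

    def distribution(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [c / len(values) for c in counts]

    total = 0.0
    for p_ref, p_cur in zip(distribution(ref_values), distribution(cur_values)):
        if p_ref > 0 and p_cur > 0:  # skip empty bins, as the SQL does
            total += (p_ref - p_cur) * math.log(p_ref / p_cur)
    return total

ref = [float(i) for i in range(1000)]   # stable baseline
shifted = [v + 200.0 for v in ref]      # mean-shifted current period
print(round(psi(ref, ref), 4))          # ~0.0: no drift against itself
```

A series compared against itself yields PSI 0, while the mean-shifted series lands above the common 0.2 "significant drift" rule of thumb cited in section V.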

V. Suggested Quality Thresholds (confirm per business and risk tolerance)
- Primary-key uniqueness: = 100%
- Key-field completeness: ≥ 99.5% (metric_id, as_of_date, value, source_system, ingested_at)
- Validity (type/range/FK/threshold sanity): pass rate ≥ 99%
- Timeliness: delay ≤ SLA (e.g., 4 hours) with pass rate ≥ 99%
- Outlier share (IQR): ≤ 1% (per metric)
- PSI drift: overall ≤ 0.2; per metric ≤ 0.1 (common rule-of-thumb thresholds; calibrate to business stability)

VI. Monitoring and Alerting Recommendations
- Daily/hourly batch: generate the result tables above (dq_summary, dq_completeness, dq_uniqueness, dq_validity, dq_distribution, dq_timeliness, dq_drift).
- Alert routing: when any threshold is breached, notify the metric owner and the data engineering team; escalate metrics that stay anomalous for two or more consecutive periods.
- Traceability and auditing: retain historical DQ results and sample anomalous records (with primary keys) to support retrospectives and root-cause analysis.
- Versioning and change management: when metrics are added or definitions/thresholds change, update metric_dict and the rule set in sync, and re-baseline drift.

VII. Information Needed to Generate Results (please provide)
- Actual table structure and field mapping: confirm whether event_time, segmentation dimensions, and threshold fields exist.
- Time window and SLA: start_date, end_date, sla_hours; drift reference period ref_start_date, ref_end_date.
- Rule refinement:
  - Target-value tolerance (e.g., relative error ≤ 5%)
  - Which fields must be non-null (the expected_not_null list)
  - Segment aggregation consistency rules (e.g., parent/child reconciliation)
- Dimension and reference data: metric_dict definition, expected frequency (daily/weekly/monthly), and the expected date set.

Note
- The above is a standardized structure and computation implementation for data profiling results. Once you provide the data and parameters, the SQL can run as-is to produce numeric results and alert conclusions. This design avoids guessing concrete values, ensuring accuracy and verifiability.

Target Users

Data analysts

Quickly assess a new dataset, generate a quality profile and risk list, plan cleansing, and produce visualization highlights for reporting and collaboration.

Data engineers

Complete quality assessment and rule setup before data onboarding, generate self-test checklists and threshold suggestions in one step, and reduce launch incidents and rollback risk.

Product managers / operations

Use the business-facing report to gauge data trustworthiness, identify quality issues affecting core metrics, drive fix priorities, and report to management.

BI / report developers

Spot field anomalies and gaps, propose remediation paths, keep report definitions consistent, and compare versions to avoid post-release metric drift.

Data governance leads / compliance

Establish unified quality standards and templates, operationalize monitoring and alert thresholds, track improvements over time, and raise audit and compliance pass rates.

Startup teams / project owners

Get professional quality analysis without complex tooling, decide quickly whether to onboard or purchase data, and save trial-and-error and communication costs.

Problems Solved

Get your team a decision-ready data profiling report in the shortest possible time. This prompt guides the AI to work from a data quality analyst's professional perspective across four modules (cleansing, validation, profiling, and monitoring), producing structured, objective, non-redundant conclusions and improvement recommendations. You only need to enter the dataset name and choose the desired output language to receive a clear, readable analysis that quickly pinpoints missing values, anomalies, duplicates, and field inconsistencies, along with an actionable remediation and monitoring plan. It suits new-data onboarding reviews, pre-training data health checks, post-refresh report checks, third-party data acceptance, and compliance audits, helping you shorten analysis cycles, improve data trustworthiness, lower decision risk, and build a reusable team-wide quality standard.

Feature Highlights

Generates a dataset profile report in one step, covering field distributions, missing rates, and outliers, saving time on initial assessment.
Automatically flags quality risks and high-risk fields, locating sources and impact scope with clear priorities for fast decisions.
Provides actionable cleansing suggestions such as deduplication, standardization, and missing-value repair, with operational notes to reduce rework.
Auto-generates validation rules and sampling checklists so teams can self-test before launch, reducing incidents and complaints.
Produces both business and technical report versions: one shows results to leadership, the other shows methods to colleagues.
Supports multilingual output and a customizable structure, so results embed into your existing docs or slides as deliverable material.
Suggests monitoring metrics and alert thresholds for a lightweight monitoring setup that catches quality fluctuations early.
Breaks complex concepts into concise structured explanations so non-data roles can quickly understand and participate in governance.
Compares dataset versions in one step to assess impact and regression risk, keeping releases safe and controlled.
Outputs reusable templates and parameter suggestions; later runs only need a new dataset name to reuse the whole workflow.

How to Use the Purchased Prompt Template

1. Use directly in an external chat app

Copy the generated prompt into your preferred chat app (ChatGPT, Claude, etc.) and use it in conversation with no extra development. Suited to quick personal trials and lightweight use.

2. Publish as an API endpoint

Turn the prompt template into an API; your program can modify template parameters freely and call it through the interface for automation and batch processing. Suited to developer integration and embedding in business systems.

3. Configure in an MCP client

Configure the corresponding server address in an MCP client so your AI application calls the prompt template automatically. Suited to advanced users and team collaboration, letting prompts work seamlessly across AI tools.

¥15.00
The platform offers a free trial so you can confirm the results meet expectations before paying.

What you get after purchase

The complete prompt template
- 246 tokens total
- 2 adjustable parameters
{ 数据集名称 } (dataset name), { 输出语言 } (output language)
Automatically added to "My Prompt Library"
- Prompt optimizer support
- Versioned management support
Community-shared application examples
Free for a limited time