Generate Data Profiling Analysis Results

Updated Sep 25, 2025

Generates a professional data profiling analysis for a given dataset.

Below is the data profiling template and evaluation methodology for the "New User Conversion Data" dataset. Since no actual data or field structure was provided, this report is built on a standardized data quality framework and reusable calculation logic; result items use placeholders, and numeric conclusions can be generated once real data is connected.

1. Scope and Objectives

  • Data object: conversion behavior of newly registered users within a specified observation window (e.g., first purchase, key feature usage, KYC completion).
  • Target output: data profile (distribution, completeness, consistency, uniqueness, timeliness), identification of key quality issues, and actionable remediation and monitoring plans.
  • Key parameters (to be confirmed): conversion definition, observation window (e.g., 7/14/30 days after signup), attribution model (last touch/first touch), time window, and timezone.

2. Fields and Structure (reference standard; actual schema depends on your system)

  • Main table: user_conversion_facts (user-level or event-level)
    • user_id (primary key or business key)
    • signup_ts (registration timestamp)
    • conversion_flag (converted or not; boolean/enum)
    • conversion_ts (first conversion timestamp)
    • conversion_days (days to conversion, integer)
    • source, campaign_id, channel (attribution fields)
    • device_type, os_version, app_version (device and version)
    • geo_country, geo_region (geography)
    • revenue_first_purchase (first-purchase amount, numeric)
    • event_id (used for event-level data)
  • Dimension tables (optional): campaign_dim, channel_dim, geo_dim, app_version_dim

3. Metric Definitions and Calculation Methods

  • Record volume and coverage
    • Total records: count(*)
    • New users: count(distinct user_id)
    • Observation-window coverage: record and new-user counts grouped by date
  • Completeness (nulls/missing)
    • Missing rate per field: sum(is_null(col))/count(*)
  • Uniqueness/duplicates
    • user_id uniqueness: count(distinct user_id)/count(*)
    • Event duplicates: share of duplicate event_id; share of duplicate signup events
  • Consistency/logic checks
    • conversion_flag consistent with conversion_ts: when flag=1, conversion_ts is non-null; when flag=0, conversion_ts is null
    • Time ordering: conversion_ts >= signup_ts; conversion_days = datediff(conversion_ts, signup_ts)
    • Revenue consistency: when flag=0, revenue is 0 or null; when flag=1, revenue >= 0
  • Validity/value ranges
    • Enum legality per field (channel/source values within the whitelist)
    • Timestamp range legality (not earlier than the system launch date, not later than now)
    • Numeric ranges (revenue non-negative; conversion_days non-negative and below the configured cap)
  • Distribution and anomalies
    • Conversion rate: sum(flag)/count(distinct user_id)
    • Time-to-conversion distribution: P50/P90/P99
    • Revenue distribution: P50/P90/P99, maximum, and heavy-tail detection (e.g., multiples above P99)
    • Attribution distribution: channel and campaign shares and Top N
    • Device/region distribution: top-value shares and long tail
  • Timeliness
    • Data latency: difference between max(ingest_ts) and event time
    • Partition/date completeness: data present for every day; no missing partitions
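The rate and consistency metrics above can be sketched in pandas; this is a minimal illustration on a toy stand-in for the user_conversion_facts table (the column names follow the reference schema, which is itself an assumption, not a confirmed layout):

```python
import pandas as pd

# Toy data standing in for user_conversion_facts (schema is assumed).
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "signup_ts": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-02",
                                 "2025-01-03", "2025-01-04"]),
    "conversion_flag": [1, 0, 0, 1, 0],
    "conversion_ts": pd.to_datetime(["2025-01-05", None, None, "2025-01-03", None]),
})

total_records = len(df)                                 # count(*)
new_users = df["user_id"].nunique()                     # count(distinct user_id)
uniqueness_rate = new_users / total_records             # user_id uniqueness
null_rate = df["conversion_ts"].isna().mean()           # per-field missing rate
conv_rate = df.groupby("user_id")["conversion_flag"].max().mean()  # converted users / users
# Logic check: flag=1 must have a non-null conversion_ts.
inconsistent = ((df["conversion_flag"] == 1) & df["conversion_ts"].isna()).mean()

print(total_records, new_users, uniqueness_rate, conv_rate, inconsistent)
```

Each value maps directly onto one of the formulas listed above; the same logic scales to the full field list.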

4. Data Profiling Results (template, to be filled)

  • Basic volume
    • Observation window: [start date] to [end date]
    • Total records: [TBD]
    • New users: [TBD]
    • Share of event-level data (if present): [TBD]
  • Completeness
    • Fields with the highest missing rates: [field A: X%], [field B: Y%]
    • Missing rates for key (required) fields: signup_ts [X%]; user_id [X%]; conversion_flag [X%]
  • Uniqueness and duplicates
    • user_id uniqueness ratio: [TBD]
    • Share of duplicate signup events: [TBD]
    • Share of duplicate event_id: [TBD]
  • Consistency
    • Share of records with flag/ts mismatch: [TBD]
    • Share with conversion_ts < signup_ts: [TBD]
    • Share with inconsistent conversion_days calculation: [TBD]
    • Share with revenue inconsistent with flag: [TBD]
  • Validity
    • Share of invalid enum values (channel/source): [TBD]; top invalid values: [TBD]
    • Share of anomalous timestamps (future/too early): [TBD]
    • Share of negative or abnormally large revenue: [TBD]
  • Distribution and anomalies
    • Overall conversion rate: [TBD]
    • conversion_days distribution (P50/P90/P99): [TBD]
    • First-purchase revenue distribution (P50/P90/P99): [TBD]
    • Top channel/campaign shares: [channel A X%], [campaign B Y%]
    • Top device/region distribution: [TBD]
  • Timeliness
    • P95 data latency: [TBD] hours
    • Missing dates or partitions: [date list or "none"]

5. Quality Assessment and Risk Tiers (rule list)

  • High risk (fix first)
    • user_id non-unique, or missing rate > 0.5%
    • signup_ts missing or invalid > 0.5%
    • flag=1 but conversion_ts missing > 0.1%
    • conversion_ts < signup_ts > 0.01%
  • Medium risk
    • Invalid enum values > 1%
    • Negative revenue > 0.1%
    • Time-to-conversion above the configured cap (e.g., > 90 days) for > 1% of records
  • Low risk
    • Abnormal long-tail channel/campaign shares making attribution unstable
    • P95 data latency > 24 hours

6. Issue Diagnosis and Remediation Recommendations

  • Completeness
    • Enforce NOT NULL constraints on required fields in the ETL layer; backfill or drop missing data (per business tolerance)
    • Standardize timezone and format parsing for timestamps; add a raw-event load time (ingest_ts)
  • Uniqueness
    • Introduce a business key constraint (user_id + signup_ts) or an idempotency key (event_id)
    • Dedup strategy: merge rows for the same user_id within an approximate time window
  • Consistency
    • Validate conversion logic against the event sequence (signup → activation → conversion)
    • Recompute derived fields (conversion_days, revenue) and reconcile against stored values
  • Validity
    • Enum whitelist validation with a rejection policy; isolate anomalies to an audit table
    • Threshold rules: flag revenue < 0 or above a reasonable business cap as anomalous
  • Timeliness
    • Add partition-completeness checks; latency alerting (e.g., alert after 12 hours)
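The dedup strategy above (merge rows for the same user_id within an approximate time window) can be sketched as follows; the 5-minute window and column names are illustrative assumptions, not confirmed business rules:

```python
import pandas as pd

# Toy signup events with a near-duplicate for user 2 (schema assumed).
events = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "signup_ts": pd.to_datetime(["2025-01-01 10:00", "2025-01-02 09:00",
                                 "2025-01-02 09:03", "2025-01-03 08:00"]),
})

window = pd.Timedelta(minutes=5)  # assumed merge tolerance
events = events.sort_values(["user_id", "signup_ts"])
# A row is a duplicate if the previous row for the same user falls within the window.
prev_ts = events.groupby("user_id")["signup_ts"].shift()
is_dup = (events["signup_ts"] - prev_ts) <= window
deduped = events[~is_dup]
print(len(deduped))  # → 3
```

In production the same rule would typically run in the ETL layer, keeping the earliest (or canonical) event per user and window.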

7. Monitoring and Auditing (sample SQL / rule expressions)

  • Daily volume and conversion rate
    • select dt, count(distinct user_id) as new_users, sum(case when conversion_flag=1 then 1 else 0 end)::float / count(distinct user_id) as conv_rate from user_conversion_facts group by dt;
  • Completeness and consistency
    • Missing rate: select 'signup_ts' as col, sum(case when signup_ts is null then 1 else 0 end)::float / count(*) as null_rate from user_conversion_facts;
    • Logical inconsistency: select sum(case when conversion_flag=1 and conversion_ts is null then 1 else 0 end)::float / count(*) as inconsistent_rate from user_conversion_facts;
    • Time ordering: select sum(case when conversion_ts < signup_ts then 1 else 0 end)::float / count(*) from user_conversion_facts;
  • Uniqueness and duplicates
    • select count(*) - count(distinct user_id) as duplicate_user_records from user_conversion_facts;
  • Validity
    • Enum whitelist: select count(*) filter (where channel not in (select channel from channel_dim))::float / count(*) as invalid_channel_rate from user_conversion_facts;
  • Timeliness
    • select percentile_cont(0.95) within group (order by extract(epoch from (ingest_ts - event_ts))/3600.0) as p95_delay_hours from user_conversion_facts;

8. Deliverables and Next Steps

  • Please provide the following to generate numeric profiling results:
    • Field dictionary and primary key definition
    • Business definition of conversion and the observation window
    • Data time range and sample size
    • Optional dimension tables (channel/campaign/app_version) and whitelists
  • Once received, the output will include:
    • A complete numeric profiling report
    • An issue list with priorities
    • A remediation and monitoring rollout plan (rules, thresholds, jobs, and alerts)

Note: This report is a standard template for data-quality profiling, designed to run directly against real data and produce accurate results once connected. Rules and thresholds can be quickly customized to your specific data and business definitions.

Order Fact Table – Data Profiling Analysis Results (Template + Illustrative Example)

Scope and grain

  • Assumed grain: one row per order (order header). If your fact grain is order-line, adjust metrics that rely on totals and referential checks to the product dimension accordingly.
  • Intended use: accuracy, completeness, consistency, and integrity assessment to inform cleansing, validation rules, and monitoring.

Table overview (replace illustrative values with actuals)

  • Row count: [Example] 1,238,417
  • Coverage window (order_date): [Example] 2023-01-01 to 2025-09-20
  • Primary key: order_id
    • Uniqueness: [Example] 99.997% unique
    • Duplicate order_ids: [Example] 32
  • Freshness (max(updated_at) vs extraction): [Example] 1h 42m

Column-level profiling (core fields)

  1. order_id
  • Type: string/integer (non-nullable)
  • Distinct count: [Example] 1,238,385
  • Null rate: [Example] 0.000%
  • Duplicates: [Example] 32 rows across 16 ids
  • Notes: Enforce not null + uniqueness at source; quarantine duplicate keys.
  2. order_date (date/timestamp)
  • Min/Max: [Example] 2023-01-01 / 2025-09-20
  • Null rate: [Example] 0.000%
  • Future-dated (> current_date): [Example] 0.06%
  • Invalid dates/timezones: [Example] 0.00% invalid parse; mixed TZ flags detected
  • Seasonality: [Example] weekend share 27%; end-of-month spikes present
  3. customer_id (FK to dim_customer)
  • Null rate: [Example] 0.42%
  • Distinct count: [Example] 214,903
  • Orphan rate vs dim_customer: [Example] 0.18%
  • Notes: Backfill anonymous/guest strategy or surrogate for nulls; address orphans via late-arriving dimension handling.
  4. currency_code (ISO 4217)
  • Cardinality: [Example] 5 (USD, EUR, GBP, CAD, AUD)
  • Null rate: [Example] 0.00%
  • Invalid values: [Example] 0.04% (case issues: 'usd')
  • Notes: Standardize to uppercase; enforce against dim_currency.
  5. order_status (enumerated)
  • Allowed set (example): Pending, Completed, Cancelled, Refunded, Partially_Refunded
  • Null rate: [Example] 0.00%
  • Top distribution: [Example] Completed 82.5%; Pending 9.8%; Cancelled 5.6%; Refunded 2.1%
  • Invalid statuses: [Example] 0.03% (legacy codes)
  • Notes: Map legacy to canonical set; add check constraint or validation UDF.
  6. sales_channel (enumerated)
  • Values: [Example] Web, App, Store, Marketplace
  • Null rate: [Example] 0.11%
  • Distribution: [Example] Web 56%, App 22%, Store 18%, Marketplace 4%
  • Notes: Normalize spelling/casing; handle unknowns explicitly.
  7. subtotal_amount (numeric)
  • Null rate: [Example] 0.01%
  • Min/Max: [Example] 0.00 / 12,300.00
  • P50/P95: [Example] 72.10 / 420.00
  • Negative values: [Example] 0.00%
  • Currency consistency: [Example] 100% aligned to currency_code
  • Notes: Validate precision/scale; non-negative expectation except adjustments per policy.
  8. discount_amount (numeric; sign convention must be confirmed)
  • Null rate: [Example] 0.03%
  • Min/Max: [Example] 0.00 / 2,000.00 (stored as positive discount to be subtracted)
  • P50/P95: [Example] 0.00 / 20.00
  • Unexpected sign: [Example] 0.07% negative values
  • Notes: Enforce consistent sign; reconcile with pricing engine rules.
  9. tax_amount (numeric)
  • Null rate: [Example] 0.02%
  • Min/Max: [Example] 0.00 / 1,100.00
  • Negative values: [Example] 0.01% (likely tax reversals)
  • Notes: Negative taxes should align with refund/cancel events.
  10. shipping_amount (numeric)
  • Null rate: [Example] 0.05%
  • Min/Max: [Example] 0.00 / 400.00
  • Zero with shipped status: [Example] 3.9% (free shipping or missing fees)
  • Notes: Cross-check with shipping method and promo flags.
  11. total_amount (numeric; order grand total)
  • Null rate: [Example] 0.01%
  • Min/Max: [Example] -2,500.00 / 12,499.00
  • P50/P95: [Example] 79.99 / 459.00
  • Negative totals: [Example] 0.12% (refunds); 413 rows negative with non-refund status
  • Notes: Enforce consistency with status and calculation rule.
  12. payment_method (categorical)
  • Null rate: [Example] 2.30%
  • Top values: [Example] Visa 39%, Mastercard 28%, PayPal 18%, Amex 8%, COD 3%, Other 4%
  • Incoherent with status (e.g., captured but pending): [Example] 0.09%
  • Notes: Validate against payment provider codes; ensure PCI-safe tokenization fields only.
  13. created_at / updated_at (timestamps)
  • Null rates: [Example] created_at 0.00%; updated_at 0.02%
  • updated_at >= created_at: [Example] 96.8% (3.2% violations; clock skew or ingest issues)
  • Staleness: [Example] 7.4% not updated > 90 days while still Pending
  • Notes: Normalize timezones; enforce monotonic update constraint where applicable.

Cross-field consistency checks

Calculation coherence (define per business rule)

  • Expected rule (example): total_amount ≈ subtotal_amount − discount_amount + tax_amount + shipping_amount
  • Tolerance (epsilon): [Example] 0.01 currency units
  • Result: [Example] 98.9% within tolerance; 1.1% mismatches
  • Root causes (observed): [Example] rounding, missing shipping after partial refund, discount applied post-tax in some sources
  • Action: Standardize calculation order and rounding precision; compute canonical totals in ETL.
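A minimal sketch of the tolerance check described above, assuming the column names used in this section and treating discount_amount as a positive value to be subtracted (the sign convention, as noted, must be confirmed):

```python
import pandas as pd

# Toy order rows; the second total is deliberately off by 0.05.
orders = pd.DataFrame({
    "subtotal_amount": [100.00, 50.00],
    "discount_amount": [10.00, 0.00],
    "tax_amount":      [8.00, 4.00],
    "shipping_amount": [5.00, 0.00],
    "total_amount":    [103.00, 54.05],
})

EPSILON = 0.01  # tolerance in currency units
expected = (orders["subtotal_amount"] - orders["discount_amount"]
            + orders["tax_amount"] + orders["shipping_amount"])
mismatch = (orders["total_amount"] - expected).abs() > EPSILON
print(mismatch.mean())  # share of rows outside tolerance → 0.5
```

The mismatch rate computed this way feeds directly into the "Total calculation mismatch rate" KPI later in this report.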

Status-to-amount coherence

  • Completed: total_amount > 0.00 — [Example] 99.6% pass
  • Cancelled (no fulfillment): total_amount = 0.00 — [Example] 92.3% pass
  • Refunded: total_amount <= 0.00 or separate refund fact — [Example] 88.1% pass
  • Action: Encode explicit monetary state model (authorization, capture, refund) and align status semantics.

Temporal coherence

  • order_date within [created_at − 1d, created_at + 1d]: [Example] 99.2% pass
  • updated_at present for state changes: [Example] 94.7% pass
  • Action: Recompute order_date from event stream or enforce event-sourced derivation.

Referential integrity (to dimensions)

  • customer_id in dim_customer: [Example] 99.82% (0.18% orphan)
  • currency_code in dim_currency: [Example] 99.96%
  • sales_channel in dim_channel: [Example] 99.89%
  • date keys resolvable in dim_date: [Example] 99.94%
  • Action: Late-arriving dimension handling, conformance mappings, and reject/quarantine policies.

Outliers and anomaly signals

  • Total amount outliers (> P95 × 10): [Example] 27 orders; investigate high-value promotions or currency scaling errors.
  • Negative subtotal or tax: [Example] 0.02% combined; likely corrective entries; confirm policy.
  • Daily order volume anomalies (7-day z-score > 3): [Example] 2 spikes (marketing campaigns) and 1 dip (ETL delay).
  • Duplicate order_id with differing amounts: [Example] 11 cases; deduplicate by latest updated_at or source_of_truth.

Data quality risks identified (examples)

  • Inconsistent total computation across sources leads to 1.1% mismatches.
  • Status semantics not aligned to monetary state (refund/cancel), producing negative totals with non-refund statuses.
  • Small but material orphan FK rate on customer_id (0.18%).
  • Timestamp incoherence (3.2%) likely due to timezone/clock issues.
  • Casing/format issues in currency_code and categorical fields.

Recommended cleansing and validation rules

  • Enforce order_id uniqueness; reject duplicates or retain highest updated_at per id.
  • Standardize calculation: compute canonical_total with fixed order and rounding; compare to source totals and flag discrepancies > epsilon.
  • Enforce non-negative constraints for subtotal, tax, shipping; define and document discount sign convention; correct records violating the chosen convention.
  • Status-amount rule set (examples):
    • Completed: total_amount > 0 and payment_captured = true
    • Cancelled: total_amount = 0 and fulfillment_state = none
    • Refunded: total_amount <= 0 or attach linked refund records
  • Referential integrity checks on load; quarantine or delayed-load for late dimensions.
  • Normalize enumerations (uppercase currency_code, trim spaces, canonical status/channel values).
  • Time normalization to UTC; enforce updated_at >= created_at.
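Several of the rules above can be expressed as vectorized checks; a sketch assuming the column names from this section (the "keep last row per id" dedup stands in for "retain highest updated_at", and the status-amount pairing is the example rule, not a confirmed policy):

```python
import pandas as pd

# Toy rows with a duplicate order_id and a lowercase currency code.
orders = pd.DataFrame({
    "order_id":      [1, 1, 2],
    "currency_code": ["usd", "USD", "EUR"],
    "order_status":  ["Completed", "Completed", "Refunded"],
    "total_amount":  [99.0, 99.0, -20.0],
})

# Normalize enumerations (trim and uppercase currency_code).
orders["currency_code"] = orders["currency_code"].str.strip().str.upper()
# Enforce order_id uniqueness: keep the last row per id.
orders = orders.drop_duplicates(subset="order_id", keep="last")
# Status-amount rule: Completed orders must have total_amount > 0.
violations = ((orders["order_status"] == "Completed") & (orders["total_amount"] <= 0)).sum()
print(len(orders), violations)  # → 2 0
```

Rows failing such checks would be routed to a quarantine table rather than silently dropped.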

Monitoring KPIs and thresholds (set alerts)

  • PK duplicates: target 0; warn > 0, critical > 10/day
  • FK orphan rate (customer_id): target < 0.10%; warn ≥ 0.25%
  • Null rate by monetary fields: target 0; warn ≥ 0.01%
  • Total calculation mismatch rate: target < 0.50%; warn ≥ 1.00%
  • Negative totals with non-refund status: target 0; critical ≥ 5/day
  • Future-dated orders: target 0; warn ≥ 0.01%
  • Data freshness: max(updated_at) lag < 2h; warn ≥ 4h

Notes on interpretation

  • All numeric “Example” values are illustrative to show expected outputs and typical ranges. Replace with computed metrics from your dataset.
  • Confirm business rules for total calculation, discount sign, refund handling, and status semantics before enforcing rules.

If you provide a schema sample and row extracts, I can replace the illustrative figures with precise metrics and produce a finalized profiling report.

Below is the design and generation plan for the data profiling analysis of the "Core Metrics Monitoring Data" dataset. Since no specific dataset was provided, this design assumes a generic core-metrics monitoring data model and presents reusable profiling metrics, calculation methods, and an output structure. Please run the corresponding calculations to produce numeric results after confirming the fields and business rules.

1. Objectives and Scope

  • Objective: systematically profile the core metrics monitoring data across the quality dimensions of completeness, uniqueness, validity, distribution and anomalies, timeliness, consistency, and drift, establishing a quantifiable baseline for subsequent quality monitoring and alerting.
  • Scope: a detail table keyed by metric, date, and segment (dimension); a window of the last 90 days is suggested (adjustable per business needs).

2. Data Model Assumptions (to be confirmed)

Core table: core_metrics

  • Primary key (composite): metric_id, as_of_date, segment_1, segment_2 (segment_* may be null if there is no segmentation)
  • Fields:
    • metric_id STRING: metric ID
    • metric_name STRING: metric name
    • as_of_date DATE: statistics date the metric refers to
    • value NUMERIC: metric value
    • unit STRING (nullable): unit
    • target_value NUMERIC (nullable): target value
    • threshold_min NUMERIC (nullable): lower threshold
    • threshold_max NUMERIC (nullable): upper threshold
    • segment_1/segment_2 STRING (nullable): segmentation dimensions (e.g., channel, region)
    • source_system STRING: source system
    • event_time TIMESTAMP (nullable): time the metric was produced (if available)
    • ingested_at TIMESTAMP: warehouse load time
  • metric_dict (dimension table; enforce as a foreign key): metric_id, metric_name, owner, definition, unit, expected_frequency

3. Profiling Metrics and Result Structure

Results are presented as multiple result tables or one consolidated summary table; the suggested structure is:

  1. dq_summary (overall profile)
  • Time range: start_date, end_date
  • Total records: record_count
  • Metric count: distinct_metric_count
  • Date coverage rate: date_coverage_rate (days with data / expected days)
  • Segment coverage rate: segment_coverage_rate (share of records with at least one segment value)
  • Primary key uniqueness rate: primary_key_uniqueness_rate
  • Completeness pass rate (key fields): completeness_pass_rate
  • Validity pass rate (rule set): validity_pass_rate
  • Timeliness pass rate (within SLA): timeliness_pass_rate
  • Drift risk (overall PSI): psi_overall
  • Outlier share (IQR or Z-score): outlier_rate
  2. dq_completeness (field-level completeness)
  • field_name
  • non_null_rate
  • null_count
  • expected_not_null (boolean, from rules)
  • pass_flag
  3. dq_uniqueness (primary key uniqueness)
  • duplicate_count
  • duplicate_rate
  • sample_keys (for auditing; may be masked)
  4. dq_validity (rule-level validity)
  • rule_id
  • rule_description
  • fail_count
  • fail_rate
  • pass_flag
  Example rules:
  • R1 type and parseability: value is numeric and not NaN/Inf
  • R2 threshold interval: if thresholds exist, threshold_min ≤ value ≤ threshold_max
  • R3 target logic: if target_value exists, the deviation of value from target_value stays within tolerance (to be set by the business)
  • R4 threshold sanity: threshold_min ≤ threshold_max
  • R5 foreign key integrity: metric_id ∈ metric_dict
  • R6 frequency consistency: records exist for the dates implied by expected_frequency (missing dates are violations)
  5. dq_distribution (distribution and anomalies)
  • metric_id
  • stats (min, max, mean, median, std)
  • iqr_outlier_rate (by IQR: < Q1 − 1.5·IQR or > Q3 + 1.5·IQR)
  • zscore_outlier_rate (optional)
  6. dq_timeliness (timeliness)
  • ingestion_delay_seconds (or hours)
  • delay_stats (min, max, mean, median, p95)
  • pass_rate (delay ≤ SLA, e.g., ≤ 4 hours)
  • late_count
  7. dq_drift (drift)
  • metric_id
  • drift_metric (PSI or relative change in mean)
  • ref_period (reference period)
  • cur_period (current period)
  • pass_flag (low drift passes)

4. Calculation Methods and Sample SQL (standard SQL; adjust for your database dialect)

Pass the time window in as parameters: :start_date, :end_date, :sla_hours, :ref_start_date, :ref_end_date

  1. Overall counts and coverage
  • Total records and metric count:
    SELECT COUNT(*) AS record_count, COUNT(DISTINCT metric_id) AS distinct_metric_count
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • Date coverage rate (ideally derived from a calendar table or the expected frequency; this example approximates with the count of dates that have records):
    SELECT COUNT(DISTINCT as_of_date)::float / NULLIF(DATE_PART('day', :end_date - :start_date) + 1, 0) AS date_coverage_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • Segment coverage rate (at least one segment non-null):
    SELECT SUM(CASE WHEN segment_1 IS NOT NULL OR segment_2 IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS segment_coverage_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;

  2. Primary key uniqueness (the example uses segment_1/segment_2; if absent, use metric_id + as_of_date only)
    SELECT COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text, COALESCE(segment_1,''), COALESCE(segment_2,''))) AS duplicate_count,
           (COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text, COALESCE(segment_1,''), COALESCE(segment_2,''))))::float / COUNT(*) AS duplicate_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;

  3. Field completeness (key fields: metric_id, as_of_date, value, source_system, ingested_at)
    SELECT 'metric_id' AS field_name,
           SUM(CASE WHEN metric_id IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
           SUM(CASE WHEN metric_id IS NULL THEN 1 ELSE 0 END) AS null_count
    FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
    UNION ALL
    SELECT 'as_of_date',
           SUM(CASE WHEN as_of_date IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
           SUM(CASE WHEN as_of_date IS NULL THEN 1 ELSE 0 END)
    FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
    UNION ALL
    SELECT 'value',
           SUM(CASE WHEN value IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
           SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
    FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
    UNION ALL
    SELECT 'source_system',
           SUM(CASE WHEN source_system IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
           SUM(CASE WHEN source_system IS NULL THEN 1 ELSE 0 END)
    FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
    UNION ALL
    SELECT 'ingested_at',
           SUM(CASE WHEN ingested_at IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
           SUM(CASE WHEN ingested_at IS NULL THEN 1 ELSE 0 END)
    FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date;

  4. Validity checks
  • Type and parseability (numeric and finite; NaN detection is dialect-dependent):
    SELECT SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END) AS invalid_numeric_count,
           SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END)::float / COUNT(*) AS invalid_numeric_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • Threshold interval:
    SELECT SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND (value < threshold_min OR value > threshold_max) THEN 1 ELSE 0 END) AS out_of_threshold_count,
           SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND (value < threshold_min OR value > threshold_max) THEN 1 ELSE 0 END)::float / COUNT(*) AS out_of_threshold_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • Threshold sanity:
    SELECT SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL AND threshold_min > threshold_max THEN 1 ELSE 0 END) AS invalid_threshold_pair_count
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • Foreign key integrity (requires metric_dict):
    SELECT COUNT(*) - COUNT(md.metric_id) AS fk_missing_count,
           (COUNT(*) - COUNT(md.metric_id))::float / COUNT(*) AS fk_missing_rate
    FROM core_metrics cm
    LEFT JOIN metric_dict md ON cm.metric_id = md.metric_id
    WHERE cm.as_of_date BETWEEN :start_date AND :end_date;

  5. Distribution and anomalies (IQR)
    WITH stats AS (
      SELECT metric_id,
             PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
             PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
      FROM core_metrics
      WHERE as_of_date BETWEEN :start_date AND :end_date
      GROUP BY metric_id
    )
    SELECT cm.metric_id,
           SUM(CASE WHEN cm.value < (s.q1 - 1.5*(s.q3 - s.q1)) OR cm.value > (s.q3 + 1.5*(s.q3 - s.q1)) THEN 1 ELSE 0 END)::float / COUNT(*) AS iqr_outlier_rate
    FROM core_metrics cm
    JOIN stats s ON cm.metric_id = s.metric_id
    WHERE cm.as_of_date BETWEEN :start_date AND :end_date
    GROUP BY cm.metric_id;

  6. Timeliness (based on ingested_at against event_time or as_of_date)
  • If event_time exists:
    SELECT SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - event_time))/3600 <= :sla_hours THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_median_seconds,
           PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_p95_seconds
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;
  • If event_time is absent, approximate with as_of_date to ingested_at:
    SELECT SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - (as_of_date::timestamp))) / 3600 <= :sla_hours THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate
    FROM core_metrics
    WHERE as_of_date BETWEEN :start_date AND :end_date;

  7. Drift (PSI, by bins)
  Example: for each metric_id, align the reference and current periods on quantile bins and compute PSI:
    WITH bins AS (
      SELECT metric_id,
             PERCENTILE_CONT(array[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]) WITHIN GROUP (ORDER BY value) AS quantiles
      FROM core_metrics
      WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
      GROUP BY metric_id
    ), ref AS (
      SELECT metric_id, value, 'ref' AS p FROM core_metrics
      WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
    ), cur AS (
      SELECT metric_id, value, 'cur' AS p FROM core_metrics
      WHERE as_of_date BETWEEN :start_date AND :end_date
    ), all_data AS (
      SELECT * FROM ref UNION ALL SELECT * FROM cur
    ), binned AS (
      SELECT a.metric_id, a.p,
             CASE WHEN a.value < q[1] THEN 0 WHEN a.value < q[2] THEN 1 WHEN a.value < q[3] THEN 2
                  WHEN a.value < q[4] THEN 3 WHEN a.value < q[5] THEN 4 WHEN a.value < q[6] THEN 5
                  WHEN a.value < q[7] THEN 6 WHEN a.value < q[8] THEN 7 WHEN a.value < q[9] THEN 8
                  ELSE 9 END AS bin_id
      FROM all_data a
      JOIN (SELECT metric_id, quantiles AS q FROM bins) b ON a.metric_id = b.metric_id
    ), dist AS (
      SELECT metric_id, p, bin_id,
             COUNT(*)::float / SUM(COUNT(*)) OVER (PARTITION BY metric_id, p) AS prob
      FROM binned
      GROUP BY metric_id, p, bin_id
    )
    SELECT r.metric_id,
           SUM(CASE WHEN r.prob > 0 AND c.prob > 0 THEN (r.prob - c.prob) * LN(r.prob / c.prob) ELSE 0 END) AS psi
    FROM dist r
    JOIN dist c ON r.metric_id = c.metric_id AND r.bin_id = c.bin_id AND r.p = 'ref' AND c.p = 'cur'
    GROUP BY r.metric_id;

5. Suggested Quality Thresholds (to be confirmed against business needs and risk tolerance)

  • Primary key uniqueness rate: = 100%
  • Key-field completeness: ≥ 99.5% (metric_id, as_of_date, value, source_system, ingested_at)
  • Validity (type/interval/foreign key/threshold sanity): pass rate ≥ 99%
  • Timeliness: pass rate ≥ 99% for delay ≤ SLA (e.g., 4 hours)
  • Outlier share (IQR): ≤ 1% (per metric)
  • PSI drift: overall ≤ 0.2; per metric ≤ 0.1 (common rule-of-thumb thresholds; calibrate against business stability)
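The Population Stability Index referenced by these thresholds can be sketched in a few lines, binning the current sample on reference-period quantiles in the same spirit as the SQL earlier (the data here is synthetic):

```python
import numpy as np

def psi(ref: np.ndarray, cur: np.ndarray, n_bins: int = 10) -> float:
    """PSI between a reference and a current sample, binned on reference quantiles."""
    edges = np.quantile(ref, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # open outer bins
    ref_p = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_p = np.histogram(cur, bins=edges)[0] / len(cur)
    eps = 1e-6                                     # avoid log(0) on empty bins
    ref_p, cur_p = np.clip(ref_p, eps, None), np.clip(cur_p, eps, None)
    return float(np.sum((ref_p - cur_p) * np.log(ref_p / cur_p)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 10_000)
assert psi(ref, rng.normal(0, 1, 10_000)) < 0.1   # same distribution: low PSI
assert psi(ref, rng.normal(1, 1, 10_000)) > 0.2   # shifted mean: flagged as drift
```

The two assertions mirror the ≤ 0.1 per-metric and ≤ 0.2 overall thresholds above; as noted, these cutoffs are conventions to be calibrated, not fixed rules.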

6. Monitoring and Alerting Implementation Recommendations

  • Daily/hourly batch jobs: produce the result tables above (dq_summary, dq_completeness, dq_uniqueness, dq_validity, dq_distribution, dq_timeliness, dq_drift).
  • Alert routing: when any threshold is breached, notify the metric owner and the data engineering team; escalate metrics that remain anomalous for two or more consecutive periods.
  • Audit trail: retain historical DQ results and sample anomalous records (with primary keys) to support retrospectives and root-cause analysis.
  • Versioning and change management: when metrics are added or definitions/thresholds change, update metric_dict and the rule set in step, and re-baseline drift.

7. Information Needed to Generate Results (please provide)

  • Actual table structure and field mapping: confirm whether event_time, segment dimensions, and threshold fields exist.
  • Time window and SLA: start_date, end_date, sla_hours; drift reference period ref_start_date, ref_end_date.
  • Rule refinement:
    • Tolerance around target values (e.g., relative error ≤ 5%)
    • Which fields must be non-null (the expected_not_null list)
    • Segment roll-up consistency rules (e.g., parent/child reconciliation)
  • Dimension and reference data: metric_dict definitions, expected frequency (daily/weekly/monthly), and the expected date set.

Note

  • The above is a standardized structure and calculation implementation for data profiling results. Once you provide data and parameters, the SQL can be run directly to produce numeric results and alert conclusions. This design avoids guessing at concrete numbers, ensuring accuracy and verifiability.

Example Details

Problems Solved

Gets your team a decision-ready data profiling report in the shortest possible time. This prompt guides the AI to work from the professional perspective of a data quality analyst across four modules (cleansing, validation, profiling, and monitoring) and produce structured, objective, concise conclusions and improvement recommendations. Simply enter the dataset name and choose an output language to receive a clear, readable analysis that quickly pinpoints missing values, anomalies, duplicates, and field inconsistencies, along with actionable remediation and monitoring plans. It suits new-data onboarding reviews, pre-training data health checks, post-refresh report checks, third-party data delivery acceptance, and compliance audits, helping you shorten analysis cycles, raise data credibility, reduce decision risk, and build reusable quality standards for your team.

Target Users

Data analysts

Quickly size up new datasets, generate quality profiles and risk lists, plan cleansing work, and produce visualization talking points for reporting and collaboration.

Data engineers

Complete quality assessment and rule setup before onboarding; generate self-test checklists and threshold suggestions in one step to reduce launch failures and rollback risk.

Product managers / operations

Use a business-facing report to understand data credibility, identify quality issues affecting core metrics, drive remediation priorities, and report to management.

Feature Summary

One-click dataset profile reports covering field distributions, missing rates, and outliers, saving time on the initial assessment.
Automatically flags quality risks and high-risk fields, locating sources and impact scope with clear priorities for fast decisions.
Actionable cleansing suggestions (deduplication, standardization, missing-value repair) with operational notes to reduce rework.
Auto-generated validation rules and sampling checklists that let teams self-test before launch, reducing incidents and complaints.
Dual reports for business and technical audiences: one shows leadership the results, the other shows colleagues the method.
Multi-language output and customizable structure, ready to embed into your existing docs or slides as deliverable material.
Suggested monitoring metrics and alert thresholds for a lightweight monitoring setup that catches quality fluctuations early.
Breaks complex concepts into clear, structured explanations so non-data roles can understand and participate in governance.
One-click comparison of dataset versions to assess impact and regression risk and keep releases safe and controlled.
Reusable templates and parameter suggestions: for later tasks, just swap in the dataset name and reuse the whole workflow.

How to Use the Purchased Prompt Template

1. Use it directly in an external chat app

Copy the prompt generated from the template into your usual chat app (such as ChatGPT or Claude) and start a conversation, with no extra development required. Suited for quick personal trials and lightweight use.

2. Publish it as an API

Turn the prompt template into an API: your program can modify the template parameters freely and call it through the interface, enabling automation and batch processing. Suited for developer integration and embedding into business systems.

3. Configure it in an MCP client

Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suited for advanced users and team collaboration, letting prompts move seamlessly between AI tools.

AI Prompt Price
¥15.00
Try before you buy: pay only after it works for you.

What you get after purchase

The complete prompt template
- 246 tokens
- 2 adjustable parameters
{ dataset name } { output language }
Usage rights to community-contributed content
- Curated community examples to help you get started quickly
Free for a limited time
