Generate professional data profiling analysis results for a given dataset.
The following is a data profiling results template and evaluation method for the "New User Conversion Data" dataset. Because no actual data or field structure was provided, this report is built on a standardized data quality framework and reusable calculation logic; result items use placeholders, and numeric conclusions can be generated as soon as real data is connected.

I. Scope and objectives
- Data object: conversion behavior of newly registered users within a specified observation window (e.g., first purchase, key feature usage, KYC completion).
- Target output: data profile (distribution, completeness, consistency, uniqueness, timeliness), identification of key quality issues, and an actionable remediation and monitoring plan.
- Key parameters (to confirm): conversion definition, observation window (e.g., 7/14/30 days after signup), attribution model (last touch / first touch), time window, and timezone.

II. Fields and structure (reference standard; the actual structure follows your system)
- Main table: user_conversion_facts (user-level or event-level)
  - user_id (primary key or business key)
  - signup_ts (registration timestamp)
  - conversion_flag (converted or not; boolean/enum)
  - conversion_ts (first conversion timestamp)
  - conversion_days (days to convert, integer)
  - source, campaign_id, channel (attribution fields)
  - device_type, os_version, app_version (device and version)
  - geo_country, geo_region (geography)
  - revenue_first_purchase (first purchase amount, numeric)
  - event_id (used for event-level data)
- Dimension tables (optional): campaign_dim, channel_dim, geo_dim, app_version_dim

III. Metric definitions and calculation methods
- Volume and coverage
  - Total records: count(*)
  - New users: count(distinct user_id)
  - Observation coverage: record and new-user counts grouped by date
- Completeness (nulls/missing)
  - Missing rate per field: sum(is_null(col)) / count(*)
- Uniqueness/duplication
  - user_id uniqueness: count(distinct user_id) / count(*)
  - Event duplication: share of duplicate event_id; share of duplicate signup events
- Consistency/logic checks
  - conversion_flag consistent with conversion_ts: when flag=1, conversion_ts is non-null; when flag=0, conversion_ts is null
  - Temporal order: conversion_ts >= signup_ts; conversion_days = datediff(conversion_ts, signup_ts)
  - Revenue consistency: flag=0 implies revenue is 0 or null; flag=1 implies revenue >= 0
- Validity/value ranges
  - Enum validity (channel/source values within the whitelist)
  - Timestamp range validity (not before the system launch date, not after the current time)
  - Numeric ranges (revenue non-negative; conversion_days non-negative and below a configured cap)
- Distribution and anomalies
  - Conversion rate: sum(flag) / count(distinct user_id)
  - Days-to-convert distribution: P50/P90/P99
  - Revenue distribution: P50/P90/P99, maximum, and heavy-tail detection (e.g., multiples above P99)
  - Attribution distribution: channel and campaign shares, Top N
  - Device/region distribution: top-value shares and long tail
- Timeliness
  - Data latency: difference between max(ingest_ts) and event time
  - Partition/date completeness: data present for every day; no missing partitions

IV. Profiling results (template, to be filled)
- Basic scale
  - Observation window: [start date] to [end date]
  - Total records: [TBD]
  - New users: [TBD]
  - Event-level data share (if any): [TBD]
- Completeness
  - Fields with the highest missing rates: [field A: X%], [field B: Y%]
  - Missing rates of required fields: signup_ts [X%]; user_id [X%]; conversion_flag [X%]
- Uniqueness and duplication
  - user_id uniqueness ratio: [TBD]
  - Duplicate signup event share: [TBD]
  - Duplicate event_id share: [TBD]
- Consistency
  - flag/ts inconsistent record share: [TBD]
  - conversion_ts < signup_ts share: [TBD]
  - conversion_days miscalculation share: [TBD]
  - revenue/flag inconsistency share: [TBD]
- Validity
  - Invalid enum value (channel/source) share: [TBD]; top invalid values: [TBD]
  - Anomalous timestamp (future/too early) share: [TBD]
  - Negative or abnormally large revenue share: [TBD]
- Distribution and anomalies
  - Overall conversion rate: [TBD]
  - conversion_days distribution (P50/P90/P99): [TBD]
  - First-purchase revenue distribution (P50/P90/P99): [TBD]
  - Top channel/campaign shares: [channel A X%], [campaign B Y%]
  - Device/region top distribution: [TBD]
- Timeliness
  - P95 data latency: [TBD] hours
  - Missing dates or partitions: [date list or "none"]

V. Quality assessment and risk tiers (rule list)
- High risk (fix first)
  - user_id non-unique, or missing rate > 0.5%
  - signup_ts missing or invalid > 0.5%
  - flag=1 but conversion_ts missing > 0.1%
  - conversion_ts < signup_ts > 0.01%
- Medium risk
  - Invalid enum values > 1%
  - Negative revenue > 0.1%
  - Days-to-convert above the configured cap (e.g., > 90 days) with share > 1%
- Low risk
  - Unstable attribution caused by abnormal long-tail channel/campaign shares
  - P95 data latency > 24 hours

VI. Issue localization and remediation suggestions
- Completeness
  - Enforce not-null constraints on required fields in the ETL layer; backfill or drop missing data per business tolerance
  - Unify timezone and format parsing for timestamps; introduce a raw ingestion timestamp (ingest_ts)
- Uniqueness
  - Introduce a business key constraint (user_id + signup_ts) or an idempotency key (event_id)
  - Deduplication strategy: merge records with the same user_id within an approximate time window
- Consistency
  - Validate conversion logic against the event sequence (signup → activation → conversion)
  - Recompute derived fields (conversion_days, revenue) and reconcile them against stored values
- Validity
  - Enum whitelist validation with a rejection policy; quarantine anomalies in an audit table
  - Threshold rules: flag revenue < 0 or above a reasonable business cap as anomalous
- Timeliness
  - Establish partition completeness checks; latency alerts (e.g., alert after 12 hours)

VII. Monitoring and audit (example SQL / rule expressions)
- Daily volume and conversion rate:

```sql
select dt,
       count(distinct user_id) as new_users,
       sum(case when conversion_flag = 1 then 1 else 0 end)::float
         / count(distinct user_id) as conv_rate
from user_conversion_facts
group by dt;
```

- Completeness and consistency:

```sql
-- Missing rate
select 'signup_ts' as col,
       sum(case when signup_ts is null then 1 else 0 end)::float / count(*) as null_rate
from user_conversion_facts;

-- Logical inconsistency
select sum(case when conversion_flag = 1 and conversion_ts is null then 1 else 0 end)::float
         / count(*) as inconsistent_rate
from user_conversion_facts;

-- Temporal order
select sum(case when conversion_ts < signup_ts then 1 else 0 end)::float
         / count(*) as out_of_order_rate
from user_conversion_facts;
```

- Uniqueness and duplication:

```sql
select count(*) - count(distinct user_id) as duplicate_user_records
from user_conversion_facts;
```

- Validity (enum whitelist):

```sql
select count(*) filter (where channel not in (select channel from channel_dim))::float
         / count(*) as invalid_channel_rate
from user_conversion_facts;
```

- Timeliness:

```sql
select percentile_cont(0.95) within group (
         order by extract(epoch from (ingest_ts - event_ts)) / 3600.0
       ) as p95_delay_hours
from user_conversion_facts;
```

VIII. Delivery and next steps
- Please provide the following to generate numeric profiling results:
  - Field dictionary and primary key definition
  - Business definition of conversion and the observation window
  - Data time range and sample size
  - Optional dimension tables (channel/campaign/app_version) and whitelists
- Upon receipt, outputs will include:
  - A fully numeric profiling report
  - An issue list with priorities
  - A remediation and monitoring implementation plan (rules, thresholds, jobs, and alerts)

Note: this report is a standard template for data quality profiling; once real data is connected it can be executed directly to produce accurate results. Rules and thresholds can be quickly customized to your specific data and business definitions.
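As a supplement to the monitoring SQL in section VII, the percentile distributions referenced in sections III and IV can be computed in a single pass. A minimal sketch, assuming PostgreSQL-style percentile_cont and the illustrative user_conversion_facts schema from section II:

```sql
-- Sketch: P50/P90/P99 of days-to-convert and first-purchase revenue,
-- computed over converted users only. Table and column names follow the
-- illustrative schema above; adjust percentile syntax to your dialect.
select
  percentile_cont(0.50) within group (order by conversion_days) as days_p50,
  percentile_cont(0.90) within group (order by conversion_days) as days_p90,
  percentile_cont(0.99) within group (order by conversion_days) as days_p99,
  percentile_cont(0.50) within group (order by revenue_first_purchase) as revenue_p50,
  percentile_cont(0.90) within group (order by revenue_first_purchase) as revenue_p90,
  percentile_cont(0.99) within group (order by revenue_first_purchase) as revenue_p99
from user_conversion_facts
where conversion_flag = 1;
```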
Order Fact Table – Data Profiling Analysis Results (Template + Illustrative Example)

Scope and grain
- Assumed grain: one row per order (order header). If your fact grain is order-line, adjust metrics that rely on totals and referential checks to the product dimension accordingly.
- Intended use: accuracy, completeness, consistency, and integrity assessment to inform cleansing, validation rules, and monitoring.

Table overview (replace illustrative values with actuals)
- Row count: [Example] 1,238,417
- Coverage window (order_date): [Example] 2023-01-01 to 2025-09-20
- Primary key: order_id
- Uniqueness: [Example] 99.997% unique
- Duplicate order_ids: [Example] 32
- Freshness (max(updated_at) vs extraction): [Example] 1h 42m

Column-level profiling (core fields)

1) order_id
- Type: string/integer (non-nullable)
- Distinct count: [Example] 1,238,385
- Null rate: [Example] 0.000%
- Duplicates: [Example] 32 rows across 16 ids
- Notes: Enforce not null + uniqueness at source; quarantine duplicate keys.

2) order_date (date/timestamp)
- Min/Max: [Example] 2023-01-01 / 2025-09-20
- Null rate: [Example] 0.000%
- Future-dated (> current_date): [Example] 0.06%
- Invalid dates/timezones: [Example] 0.00% invalid parse; mixed TZ flags detected
- Seasonality: [Example] weekend share 27%; end-of-month spikes present

3) customer_id (FK to dim_customer)
- Null rate: [Example] 0.42%
- Distinct count: [Example] 214,903
- Orphan rate vs dim_customer: [Example] 0.18%
- Notes: Backfill anonymous/guest strategy or surrogate for nulls; address orphans via late-arriving dimension handling.

4) currency_code (ISO 4217)
- Cardinality: [Example] 5 (USD, EUR, GBP, CAD, AUD)
- Null rate: [Example] 0.00%
- Invalid values: [Example] 0.04% (case issues: 'usd')
- Notes: Standardize to uppercase; enforce against dim_currency.

5) order_status (enumerated)
- Allowed set (example): Pending, Completed, Cancelled, Refunded, Partially_Refunded
- Null rate: [Example] 0.00%
- Top distribution: [Example] Completed 82.5%; Pending 9.8%; Cancelled 5.6%; Refunded 2.1%
- Invalid statuses: [Example] 0.03% (legacy codes)
- Notes: Map legacy to canonical set; add check constraint or validation UDF.

6) sales_channel (enumerated)
- Values: [Example] Web, App, Store, Marketplace
- Null rate: [Example] 0.11%
- Distribution: [Example] Web 56%, App 22%, Store 18%, Marketplace 4%
- Notes: Normalize spelling/casing; handle unknowns explicitly.

7) subtotal_amount (numeric)
- Null rate: [Example] 0.01%
- Min/Max: [Example] 0.00 / 12,300.00
- P50/P95: [Example] 72.10 / 420.00
- Negative values: [Example] 0.00%
- Currency consistency: [Example] 100% aligned to currency_code
- Notes: Validate precision/scale; non-negative expectation except adjustments per policy.

8) discount_amount (numeric; sign convention must be confirmed)
- Null rate: [Example] 0.03%
- Min/Max: [Example] 0.00 / 2,000.00 (stored as positive discount to be subtracted)
- P50/P95: [Example] 0.00 / 20.00
- Unexpected sign: [Example] 0.07% negative values
- Notes: Enforce consistent sign; reconcile with pricing engine rules.

9) tax_amount (numeric)
- Null rate: [Example] 0.02%
- Min/Max: [Example] 0.00 / 1,100.00
- Negative values: [Example] 0.01% (likely tax reversals)
- Notes: Negative taxes should align with refund/cancel events.

10) shipping_amount (numeric)
- Null rate: [Example] 0.05%
- Min/Max: [Example] 0.00 / 400.00
- Zero with shipped status: [Example] 3.9% (free shipping or missing fees)
- Notes: Cross-check with shipping method and promo flags.

11) total_amount (numeric; order grand total)
- Null rate: [Example] 0.01%
- Min/Max: [Example] -2,500.00 / 12,499.00
- P50/P95: [Example] 79.99 / 459.00
- Negative totals: [Example] 0.12% (refunds); 413 rows negative with non-refund status
- Notes: Enforce consistency with status and calculation rule.

12) payment_method (categorical)
- Null rate: [Example] 2.30%
- Top values: [Example] Visa 39%, Mastercard 28%, PayPal 18%, Amex 8%, COD 3%, Other 4%
- Incoherent with status (e.g., captured but pending): [Example] 0.09%
- Notes: Validate against payment provider codes; ensure PCI-safe tokenization fields only.

13) created_at / updated_at (timestamps)
- Null rates: [Example] created_at 0.00%; updated_at 0.02%
- updated_at >= created_at: [Example] 96.8% (3.2% violations; clock skew or ingest issues)
- Staleness: [Example] 7.4% not updated > 90 days while still Pending
- Notes: Normalize timezones; enforce monotonic update constraint where applicable.

Cross-field consistency checks

Calculation coherence (define per business rule)
- Expected rule (example): total_amount ≈ subtotal_amount − discount_amount + tax_amount + shipping_amount
- Tolerance (epsilon): [Example] 0.01 currency units
- Result: [Example] 98.9% within tolerance; 1.1% mismatches
- Root causes (observed): [Example] rounding, missing shipping after partial refund, discount applied post-tax in some sources
- Action: Standardize calculation order and rounding precision; compute canonical totals in ETL (see the SQL sketch after this template).

Status-to-amount coherence
- Completed: total_amount > 0.00 — [Example] 99.6% pass
- Cancelled (no fulfillment): total_amount = 0.00 — [Example] 92.3% pass
- Refunded: total_amount <= 0.00 or separate refund fact — [Example] 88.1% pass
- Action: Encode an explicit monetary state model (authorization, capture, refund) and align status semantics.

Temporal coherence
- order_date within [created_at − 1d, created_at + 1d]: [Example] 99.2% pass
- updated_at present for state changes: [Example] 94.7% pass
- Action: Recompute order_date from the event stream or enforce event-sourced derivation.

Referential integrity (to dimensions)
- customer_id in dim_customer: [Example] 99.82% (0.18% orphan)
- currency_code in dim_currency: [Example] 99.96%
- sales_channel in dim_channel: [Example] 99.89%
- date keys resolvable in dim_date: [Example] 99.94%
- Action: Late-arriving dimension handling, conformance mappings, and reject/quarantine policies.

Outliers and anomaly signals
- Total amount outliers (> P95 × 10): [Example] 27 orders; investigate high-value promotions or currency scaling errors.
- Negative subtotal or tax: [Example] 0.02% combined; likely corrective entries; confirm policy.
- Daily order volume anomalies (7-day z-score > 3): [Example] 2 spikes (marketing campaigns) and 1 dip (ETL delay).
- Duplicate order_id with differing amounts: [Example] 11 cases; deduplicate by latest updated_at or source_of_truth.

Data quality risks identified (examples)
- Inconsistent total computation across sources leads to 1.1% mismatches.
- Status semantics not aligned to monetary state (refund/cancel), producing negative totals with non-refund statuses.
- Small but material orphan FK rate on customer_id (0.18%).
- Timestamp incoherence (3.2%) likely due to timezone/clock issues.
- Casing/format issues in currency_code and categorical fields.

Recommended cleansing and validation rules
- Enforce order_id uniqueness; reject duplicates or retain highest updated_at per id.
- Standardize calculation: compute canonical_total with fixed order and rounding; compare to source totals and flag discrepancies > epsilon.
- Enforce non-negative constraints for subtotal, tax, shipping; define and document the discount sign convention; correct records violating the chosen convention.
- Status-amount rule set (examples):
  - Completed: total_amount > 0 and payment_captured = true
  - Cancelled: total_amount = 0 and fulfillment_state = none
  - Refunded: total_amount <= 0 or attach linked refund records
- Referential integrity checks on load; quarantine or delayed-load for late dimensions.
- Normalize enumerations (uppercase currency_code, trim spaces, canonical status/channel values).
- Time normalization to UTC; enforce updated_at >= created_at.

Monitoring KPIs and thresholds (set alerts)
- PK duplicates: target 0; warn > 0, critical > 10/day
- FK orphan rate (customer_id): target < 0.10%; warn ≥ 0.25%
- Null rate by monetary fields: target 0; warn ≥ 0.01%
- Total calculation mismatch rate: target < 0.50%; warn ≥ 1.00%
- Negative totals with non-refund status: target 0; critical ≥ 5/day
- Future-dated orders: target 0; warn ≥ 0.01%
- Data freshness: max(updated_at) lag < 2h; warn ≥ 4h

Notes on interpretation
- All numeric “Example” values are illustrative to show expected outputs and typical ranges. Replace with computed metrics from your dataset.
- Confirm business rules for total calculation, discount sign, refund handling, and status semantics before enforcing rules.

If you provide a schema sample and row extracts, I can replace the illustrative figures with precise metrics and produce a finalized profiling report.
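To make the calculation-coherence and mismatch-rate checks concrete, here is a minimal SQL sketch. It assumes the example rule above (discounts stored as positive values to subtract), the illustrative epsilon of 0.01 currency units, and a hypothetical table name order_fact; confirm the actual business rule and table/column names before use:

```sql
-- Sketch: mismatch rate between the stored total and the canonical
-- calculation. order_fact is a hypothetical table name; the rule and
-- the 0.01 tolerance must be confirmed against business policy.
select
  count(*) as checked_orders,
  sum(case
        when abs(total_amount
                 - (coalesce(subtotal_amount, 0)
                    - coalesce(discount_amount, 0)
                    + coalesce(tax_amount, 0)
                    + coalesce(shipping_amount, 0))) > 0.01
        then 1 else 0
      end) as mismatch_count,
  sum(case
        when abs(total_amount
                 - (coalesce(subtotal_amount, 0)
                    - coalesce(discount_amount, 0)
                    + coalesce(tax_amount, 0)
                    + coalesce(shipping_amount, 0))) > 0.01
        then 1 else 0
      end)::float / count(*) as mismatch_rate
from order_fact
where total_amount is not null;
```

Rows flagged by this query feed the "Total calculation mismatch rate" KPI above; quarantining them alongside their source-system identifiers makes root-cause analysis (rounding vs. rule differences) much faster.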
The following is a profiling-results design and generation plan for the "Core Metrics Monitoring Data" dataset. Since no specific dataset was provided, it assumes a generic core-metrics monitoring data model and provides reusable profiling metrics, calculation methods, and output structures. Please run the corresponding calculations to produce numeric results after confirming fields and business rules.

I. Objectives and scope
- Objective: systematically profile the core metrics monitoring data across the quality dimensions of completeness, uniqueness, validity, distribution and anomalies, timeliness, consistency, and drift, providing a quantifiable baseline for subsequent quality monitoring and alerting.
- Scope: a detail table at the "metric – date – segment (dimension)" grain; a window of the last 90 days is suggested (adjust to business needs).

II. Data model assumptions (to confirm)
Core table: core_metrics
- Composite primary key: metric_id, as_of_date, segment_1, segment_2 (segment_* may be null if unsegmented)
- Fields:
  - metric_id STRING: metric ID
  - metric_name STRING: metric name
  - as_of_date DATE: statistics date the metric refers to
  - value NUMERIC: metric value
  - unit STRING (nullable): unit
  - target_value NUMERIC (nullable): target value
  - threshold_min NUMERIC (nullable): lower threshold
  - threshold_max NUMERIC (nullable): upper threshold
  - segment_1/segment_2 STRING (nullable): segmentation dimensions (e.g., channel, region)
  - source_system STRING: source system
  - event_time TIMESTAMP (nullable): metric generation time (if available)
  - ingested_at TIMESTAMP: warehouse ingestion time
- metric_dict (dimension table, needed for FK checks): metric_id, metric_name, owner, definition, unit, expected_frequency

III. Profiling metrics and result structure
Present the output as multiple result tables or a single summary table; suggested structure:

1. dq_summary (overall profile)
- Time range: start_date, end_date
- record_count: total records
- distinct_metric_count: number of metrics
- date_coverage_rate: dates with data / expected dates
- segment_coverage_rate: share of records with at least one segment value
- primary_key_uniqueness_rate
- completeness_pass_rate (key fields)
- validity_pass_rate (rule set)
- timeliness_pass_rate (within SLA)
- psi_overall: drift risk (overall PSI)
- outlier_rate (IQR or z-score)

2. dq_completeness (field-level completeness)
- field_name
- non_null_rate
- null_count
- expected_not_null (boolean, from rules)
- pass_flag

3. dq_uniqueness (primary-key uniqueness)
- duplicate_count
- duplicate_rate
- sample_keys (for audit; may be masked)

4. dq_validity (rule-level validity)
- rule_id
- rule_description
- fail_count
- fail_rate
- pass_flag
Example rules:
- R1 type and parseability: value is numeric and not NaN/Inf
- R2 threshold interval: if thresholds exist, threshold_min ≤ value ≤ threshold_max
- R3 target logic: if target_value exists, the deviation of value from target_value stays within a tolerance (business-defined; see the SQL sketch after this template)
- R4 threshold sanity: threshold_min ≤ threshold_max
- R5 foreign-key integrity: metric_id ∈ metric_dict
- R6 frequency consistency: records exist for the dates implied by expected_frequency (missing dates are violations)

5. dq_distribution (distribution and anomalies)
- metric_id
- stats (min, max, mean, median, std)
- iqr_outlier_rate (by IQR: < Q1 − 1.5·IQR or > Q3 + 1.5·IQR)
- zscore_outlier_rate (optional)

6. dq_timeliness (timeliness)
- ingestion_delay_seconds (or hours)
- delay_stats (min, max, mean, median, p95)
- pass_rate (delay ≤ SLA, e.g., ≤ 4 hours)
- late_count

7. dq_drift (drift)
- metric_id
- drift_metric (PSI or relative mean change)
- ref_period (reference period for comparison)
- cur_period (current period)
- pass_flag (low drift passes)

IV. Calculation methods and example SQL (standard SQL; adjust to your database dialect)
Pass the time window in as parameters: :start_date, :end_date, :sla_hours, :ref_start_date, :ref_end_date.

1) Overall counts and coverage

```sql
-- Total records and metric count
SELECT COUNT(*) AS record_count,
       COUNT(DISTINCT metric_id) AS distinct_metric_count
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- Date coverage rate (ideally derived from a calendar table or the expected
-- frequency; this approximation counts the dates that have records)
SELECT COUNT(DISTINCT as_of_date)::float
       / NULLIF(DATE_PART('day', :end_date::timestamp - :start_date::timestamp) + 1, 0)
       AS date_coverage_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- Segment coverage rate (at least one segment non-null)
SELECT SUM(CASE WHEN segment_1 IS NOT NULL OR segment_2 IS NOT NULL THEN 1 ELSE 0 END)::float
       / COUNT(*) AS segment_coverage_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;
```

2) Primary-key uniqueness (example uses segment_1/segment_2; use only metric_id + as_of_date if unsegmented)

```sql
SELECT COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text,
           COALESCE(segment_1, ''), COALESCE(segment_2, ''))) AS duplicate_count,
       (COUNT(*) - COUNT(DISTINCT CONCAT_WS('|', metric_id, as_of_date::text,
           COALESCE(segment_1, ''), COALESCE(segment_2, ''))))::float
       / COUNT(*) AS duplicate_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;
```

3) Field completeness (key fields: metric_id, as_of_date, value, source_system, ingested_at)

```sql
SELECT 'metric_id' AS field_name,
       SUM(CASE WHEN metric_id IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*) AS non_null_rate,
       SUM(CASE WHEN metric_id IS NULL THEN 1 ELSE 0 END) AS null_count
FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
UNION ALL
SELECT 'as_of_date',
       SUM(CASE WHEN as_of_date IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
       SUM(CASE WHEN as_of_date IS NULL THEN 1 ELSE 0 END)
FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
UNION ALL
SELECT 'value',
       SUM(CASE WHEN value IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
       SUM(CASE WHEN value IS NULL THEN 1 ELSE 0 END)
FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
UNION ALL
SELECT 'source_system',
       SUM(CASE WHEN source_system IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
       SUM(CASE WHEN source_system IS NULL THEN 1 ELSE 0 END)
FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date
UNION ALL
SELECT 'ingested_at',
       SUM(CASE WHEN ingested_at IS NOT NULL THEN 1 ELSE 0 END)::float / COUNT(*),
       SUM(CASE WHEN ingested_at IS NULL THEN 1 ELSE 0 END)
FROM core_metrics WHERE as_of_date BETWEEN :start_date AND :end_date;
```

4) Validity checks

```sql
-- Type and parseability (numeric and finite; NaN detection is dialect-dependent:
-- value = value evaluates to false for NaN in some engines)
SELECT SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END) AS invalid_numeric_count,
       SUM(CASE WHEN value IS NULL OR NOT (value = value) THEN 1 ELSE 0 END)::float
       / COUNT(*) AS invalid_numeric_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- Threshold interval
SELECT SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL
                 AND (value < threshold_min OR value > threshold_max)
                THEN 1 ELSE 0 END) AS out_of_threshold_count,
       SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL
                 AND (value < threshold_min OR value > threshold_max)
                THEN 1 ELSE 0 END)::float / COUNT(*) AS out_of_threshold_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- Threshold sanity
SELECT SUM(CASE WHEN threshold_min IS NOT NULL AND threshold_max IS NOT NULL
                 AND threshold_min > threshold_max
                THEN 1 ELSE 0 END) AS invalid_threshold_pair_count
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- Foreign-key integrity (requires metric_dict)
SELECT COUNT(*) - COUNT(md.metric_id) AS fk_missing_count,
       (COUNT(*) - COUNT(md.metric_id))::float / COUNT(*) AS fk_missing_rate
FROM core_metrics cm
LEFT JOIN metric_dict md ON cm.metric_id = md.metric_id
WHERE cm.as_of_date BETWEEN :start_date AND :end_date;
```

5) Distribution and anomalies (IQR)

```sql
WITH stats AS (
  SELECT metric_id,
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS q1,
         PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS q3
  FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
  GROUP BY metric_id
)
SELECT cm.metric_id,
       SUM(CASE WHEN cm.value < (s.q1 - 1.5 * (s.q3 - s.q1))
                  OR cm.value > (s.q3 + 1.5 * (s.q3 - s.q1))
                THEN 1 ELSE 0 END)::float / COUNT(*) AS iqr_outlier_rate
FROM core_metrics cm
JOIN stats s ON cm.metric_id = s.metric_id
WHERE cm.as_of_date BETWEEN :start_date AND :end_date
GROUP BY cm.metric_id;
```

6) Timeliness (based on ingested_at versus event_time or as_of_date)

```sql
-- If event_time exists
SELECT SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - event_time)) / 3600 <= :sla_hours
                THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate,
       PERCENTILE_CONT(0.5) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_median_seconds,
       PERCENTILE_CONT(0.95) WITHIN GROUP (
         ORDER BY EXTRACT(EPOCH FROM (ingested_at - event_time))) AS delay_p95_seconds
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;

-- If event_time is unavailable, approximate with as_of_date to ingested_at
SELECT SUM(CASE WHEN EXTRACT(EPOCH FROM (ingested_at - as_of_date::timestamp)) / 3600 <= :sla_hours
                THEN 1 ELSE 0 END)::float / COUNT(*) AS timeliness_pass_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date;
```

7) Drift (PSI over bins)
Example: for each metric_id, align the reference period (:ref_start_date to :ref_end_date) and the current period (:start_date to :end_date) on shared quantile bins and compute PSI:

```sql
WITH bins AS (
  SELECT metric_id,
         PERCENTILE_CONT(array[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
           WITHIN GROUP (ORDER BY value) AS quantiles
  FROM core_metrics
  WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
  GROUP BY metric_id
),
ref AS (
  SELECT metric_id, value, 'ref' AS p FROM core_metrics
  WHERE as_of_date BETWEEN :ref_start_date AND :ref_end_date
),
cur AS (
  SELECT metric_id, value, 'cur' AS p FROM core_metrics
  WHERE as_of_date BETWEEN :start_date AND :end_date
),
all_data AS (
  SELECT * FROM ref
  UNION ALL
  SELECT * FROM cur
),
binned AS (
  SELECT a.metric_id, a.p,
         CASE
           WHEN a.value < q[1] THEN 0
           WHEN a.value < q[2] THEN 1
           WHEN a.value < q[3] THEN 2
           WHEN a.value < q[4] THEN 3
           WHEN a.value < q[5] THEN 4
           WHEN a.value < q[6] THEN 5
           WHEN a.value < q[7] THEN 6
           WHEN a.value < q[8] THEN 7
           WHEN a.value < q[9] THEN 8
           ELSE 9
         END AS bin_id
  FROM all_data a
  JOIN (SELECT metric_id, quantiles AS q FROM bins) b
    ON a.metric_id = b.metric_id
),
dist AS (
  SELECT metric_id, p, bin_id,
         COUNT(*)::float / SUM(COUNT(*)) OVER (PARTITION BY metric_id, p) AS prob
  FROM binned
  GROUP BY metric_id, p, bin_id
)
SELECT r.metric_id,
       SUM(CASE WHEN r.prob > 0 AND c.prob > 0
                THEN (r.prob - c.prob) * LN(r.prob / c.prob)
                ELSE 0 END) AS psi
FROM dist r
JOIN dist c
  ON r.metric_id = c.metric_id
 AND r.bin_id = c.bin_id
 AND r.p = 'ref'
 AND c.p = 'cur'
GROUP BY r.metric_id;
```

V. Suggested quality thresholds (confirm against business needs and risk tolerance)
- Primary-key uniqueness rate: = 100%
- Key-field completeness: ≥ 99.5% (metric_id, as_of_date, value, source_system, ingested_at)
- Validity (type/interval/FK/threshold sanity): pass rate ≥ 99%
- Timeliness: delay ≤ SLA (e.g., 4 hours) with pass rate ≥ 99%
- Outlier share (IQR): ≤ 1% per metric
- PSI drift: overall ≤ 0.2; per metric ≤ 0.1 (common rules of thumb; calibrate to business stability)

VI. Monitoring and alerting implementation suggestions
- Daily/hourly batches: generate the result tables above (dq_summary, dq_completeness, dq_uniqueness, dq_validity, dq_distribution, dq_timeliness, dq_drift).
- Alert routing: when any threshold is missed, notify the metric owner and the data engineering team; escalate metrics that stay anomalous for two or more consecutive periods.
- Audit trail: retain historical DQ results and sample anomalous records (with keys) to support retrospectives and root-cause analysis.
- Version and change management: when metrics are added or definitions/thresholds change, update metric_dict and the rule set in step, and re-baseline drift.

VII. Information required to generate results (please provide)
- Actual table structure and field mapping: confirm whether event_time, segmentation dimensions, and threshold fields exist.
- Time window and SLA: start_date, end_date, sla_hours; drift reference period ref_start_date, ref_end_date.
- Rule refinement:
  - Tolerance for target values (e.g., relative error ≤ 5%)
  - Which fields must be non-null (the expected_not_null list)
  - Segment rollup consistency rules (e.g., parent/child alignment)
- Dimension and reference data: the metric_dict definition, expected frequency (daily/weekly/monthly), and the expected date set.

Note
- The above is a standardized structure and calculation implementation for profiling results. Once you provide the data and parameters, the SQL can be run directly to produce numeric results and alert conclusions. This design avoids guessing specific values, which keeps it accurate and verifiable.
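Rule R3 has no example SQL above, so here is a minimal sketch under the same conventions. It assumes a relative-error tolerance passed in as a hypothetical parameter :target_tolerance (e.g., 0.05 for ±5%) and the core_metrics model from section II:

```sql
-- Sketch of rule R3: records whose value deviates from target_value by
-- more than a relative tolerance. :target_tolerance is a hypothetical
-- parameter (e.g., 0.05 for +/-5%); the tolerance semantics (relative vs
-- absolute) must be confirmed with the business. Rows with a null target
-- are excluded; a zero target yields NULL via NULLIF and is not flagged.
SELECT SUM(CASE WHEN ABS(value - target_value) / NULLIF(ABS(target_value), 0)
                      > :target_tolerance
                THEN 1 ELSE 0 END) AS target_deviation_fail_count,
       SUM(CASE WHEN ABS(value - target_value) / NULLIF(ABS(target_value), 0)
                      > :target_tolerance
                THEN 1 ELSE 0 END)::float / COUNT(*) AS target_deviation_fail_rate
FROM core_metrics
WHERE as_of_date BETWEEN :start_date AND :end_date
  AND target_value IS NOT NULL;
```

The results slot directly into the dq_validity table (rule_id = 'R3', with fail_count and fail_rate as computed above).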
Quickly scope a new dataset, generate a quality profile and risk list, draw up a cleansing plan, and produce visualization talking points for reporting and collaboration.
Complete quality assessment and rule setup before data onboarding; generate a self-test checklist and threshold suggestions in one click, reducing the risk of production incidents and rollbacks.
Use a business-facing report to understand data trustworthiness, identify quality issues that affect core metrics, drive remediation priorities, and report to management.
Identify field anomalies and missing values and provide remediation paths to keep report definitions consistent; compare version changes to avoid post-release metric drift.
Establish unified quality assessment standards and templates, implement monitoring and alert thresholds, continuously track improvement, and raise audit and compliance pass rates.
Get professional quality analysis without complex tooling, quickly decide whether to onboard or purchase data, and cut trial-and-error and communication costs.
Give your team a decision-ready "data profiling analysis" report in the shortest possible time. This prompt guides the AI to take the professional perspective of a data quality analyst and, across four modules (cleansing, validation, profiling, and monitoring), generate structured, objective, non-redundant conclusions and improvement suggestions. Simply enter the dataset name and choose the desired output language to receive clear, readable analysis results: quickly locate missing values, anomalies, duplicates, and field inconsistencies, and get an actionable remediation and monitoring plan. It suits new-data onboarding reviews, pre-training data health checks, post-refresh report health checks, third-party data delivery acceptance, and compliance audits, helping you shorten analysis cycles, improve data trustworthiness, reduce decision risk, and build a reusable quality assessment standard for your team.
Copy the prompt generated from the template into your preferred chat application (such as ChatGPT or Claude) and use it directly in conversation, with no extra development. Suited to quick personal trials and lightweight use cases.
Turn the prompt template into an API: your program can modify the template parameters at will and call it directly through the interface, enabling automation and batch processing. Suited to developer integration and embedding in business systems.
Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly across different AI tools.