Provide data validation rules for specified table columns to ensure data accuracy.
Below are the data validation rules and implementation examples for column age of table ods_users. They are positioned as "low-intrusion validation" at the ODS layer: avoid altering source data where possible, and only identify, flag, or quarantine anomalies.

Rule definition

- Field meaning and type
  - age is the user's age in whole years.
  - The type should be integral (SMALLINT/INT recommended; IntegralType in wide tables or big-data engines).
- Value range and validity (hard checks)
  - NULLs are allowed (ODS typically preserves source-system gaps), but a quality threshold applies.
  - Non-null values must satisfy 0 ≤ age ≤ 120.
  - Sentinel values such as -1, 999, and 8888 are forbidden.
- Consistency rules (with related fields)
  - If dob (date of birth) and record_date (record creation or partition date) exist:
    - Compute computed_age = floor(months_between(record_date, dob) / 12).
    - When both age and computed_age are available, age = computed_age must hold; mismatches are flagged as "inconsistent".
- Time-series stability (per user)
  - Across records of the same user_id, age must not decrease as record_date advances.
  - Within one year (record_date gap < 365 days), the age increment should be in [0, 1]; anything larger is an anomalous update.
- Quality thresholds (usable as data quality metrics)
  - Non-null rate ≥ 95% (evaluated per partition/batch).
  - Out-of-range ratio = 0% (block or quarantine on detection).
  - Sentinel-value ratio = 0%.
  - dob-mismatch ratio ≤ 0.5% (alert above the threshold).
- Anomaly handling strategy
  - Out-of-range and sentinel values: route to a quarantine table/partition, recording the original value and its source.
  - Consistency and time-series anomalies: keep the original value and tag it (data_quality_flags) for downstream correction or backfill.
  - When metrics exceed thresholds, set the pipeline status to "yellow/red" and trigger alerts.

Implementation examples

SQL constraints and checks (PostgreSQL)

- Table-level constraint (only where adding constraints is allowed):

    ALTER TABLE ods_users
      ADD CONSTRAINT chk_age_range
      CHECK (age IS NULL OR age BETWEEN 0 AND 120);

- Batch quality checks:

    -- Non-null rate
    SELECT 1 - (COUNT(*) FILTER (WHERE age IS NULL)::float / COUNT(*)) AS non_null_ratio
    FROM ods_users
    WHERE dt = CURRENT_DATE;

    -- Out-of-range and sentinel values
    SELECT COUNT(*) AS invalid_age_cnt
    FROM ods_users
    WHERE age IS NOT NULL AND (age < 0 OR age > 120);

    SELECT COUNT(*) AS sentinel_cnt
    FROM ods_users
    WHERE age IN (-1, 999, 8888);

    -- Consistency with dob (when record_date exists)
    SELECT COUNT(*) AS mismatch_cnt
    FROM ods_users
    WHERE age IS NOT NULL AND dob IS NOT NULL AND record_date IS NOT NULL
      AND age <> EXTRACT(YEAR FROM age(record_date, dob));

    -- Time-series stability (per user)
    WITH t AS (
      SELECT user_id, record_date, age,
             LAG(age) OVER (PARTITION BY user_id ORDER BY record_date) AS prev_age,
             LAG(record_date) OVER (PARTITION BY user_id ORDER BY record_date) AS prev_date
      FROM ods_users
    )
    SELECT COUNT(*) AS ts_anomaly_cnt
    FROM t
    WHERE age IS NOT NULL AND prev_age IS NOT NULL
      AND (
        age < prev_age
        OR ((record_date - prev_date) < INTERVAL '365 days'
            AND (age - prev_age) NOT BETWEEN 0 AND 1)
      );

Spark (PySpark) checks

- Basic rules:

    from pyspark.sql import functions as F, types as T

    df = spark.table("ods_users")
    total = df.count()
    null_ratio = df.filter(F.col("age").isNull()).count() / total
    invalid_df = df.filter(F.col("age").isNotNull()
                           & ((F.col("age") < 0) | (F.col("age") > 120)))
    sentinel_df = df.filter(F.col("age").isin(-1, 999, 8888))

- Consistency with dob:

    computed_age = F.floor(
        F.months_between(F.coalesce(F.col("record_date"), F.current_date()),
                         F.col("dob")) / 12)
    mismatch_df = df.filter(F.col("age").isNotNull() & F.col("dob").isNotNull()
                            & (F.col("age") != computed_age))

- Time-series stability:

    from pyspark.sql.window import Window

    w = Window.partitionBy("user_id").orderBy("record_date")
    lag_age = F.lag("age").over(w)
    lag_date = F.lag("record_date").over(w)
    ts_anomaly_df = (
        df.select("user_id", "record_date", "age",
                  lag_age.alias("prev_age"), lag_date.alias("prev_date"))
          .filter(
              F.col("age").isNotNull() & F.col("prev_age").isNotNull()
              & ((F.col("age") < F.col("prev_age"))
                 | ((F.datediff(F.col("record_date"), F.col("prev_date")) < 365)
                    & ((F.col("age") - F.col("prev_age") < 0)
                       | (F.col("age") - F.col("prev_age") > 1))))
          )
    )

Great Expectations rules (Spark/Pandas)

- suite.expect_column_values_to_be_between("age", min_value=0, max_value=120, mostly=1.0)
- suite.expect_column_values_to_not_be_in_set("age", [-1, 999, 8888])
- suite.expect_column_values_to_not_be_null("age", mostly=0.95)  # at ODS this can be a metric rather than a hard failure
- If dob/record_date exist, derive computed_age in the pipeline; built-in multicolumn expectations such as expect_multicolumn_sum_to_equal do not apply here, so use a custom expectation or derive the column first and then assert equality.

dbt tests (using dbt-utils)

- models/ods_users.yml:

    version: 2
    models:
      - name: ods_users
        columns:
          - name: age
            tests:
              - dbt_utils.expression_is_true:
                  expression: "age is null or (age between 0 and 120)"
              - dbt_utils.expression_is_true:
                  expression: "age not in (-1, 999, 8888)"
          - name: dob
            tests: []  # if present

- For the consistency rule, add a custom test whose SQL compares age with the age computed from dob/record_date.

Notes

- Age 0 is valid (newborns/infants under one year).
- The upper bound 120 is a common business ceiling; if the business needs more headroom, raise it to 130/150 and update the rules and thresholds accordingly.
- At the ODS layer, prefer flagging/quarantining over overwriting source values; corrections belong in the DWD/DIM layers.
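As a compact, engine-independent reference for the hard checks and flag taxonomy above, the rules can be condensed into a pure-Python routine. The function name and flag strings below are hypothetical, not part of any existing pipeline:

```python
def validate_age(age, computed_age=None):
    """Return the data_quality_flags for one age value (hypothetical helper).

    Mirrors the rules above: NULLs allowed but tracked, sentinels rejected,
    0 <= age <= 120 enforced, and dob-derived consistency checked when given.
    """
    flags = []
    if age is None:
        return ["missing"]                # counted against the non-null threshold
    if age in (-1, 999, 8888):
        flags.append("sentinel")          # known placeholder values
    if not 0 <= age <= 120:
        flags.append("out_of_range")      # hard range check
    if computed_age is not None and age != computed_age:
        flags.append("dob_mismatch")      # inconsistency with computed_age
    return flags
```

Records with a non-empty flag list would be quarantined (sentinel, out_of_range) or merely tagged (dob_mismatch), following the anomaly handling strategy above.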
Below are pragmatic, production-grade data validation rules for column dwd_customer.email, focused on warehouse-friendly enforcement and pipeline checks. The rules aim to prevent malformed, duplicate, or low-quality values while remaining implementable in SQL and data quality frameworks.

Scope and assumptions
- Column type: string (VARCHAR).
- Email values are single addresses (no lists).
- Unicode domains may be present; normalization is required.
- Enforcement strategy may differ by platform; where constraints are not enforced, run validation queries in ETL/ELT or DQ jobs.

Validation rules

1) Presence and nullability
- If the business requires an email for a customer, enforce NOT NULL; otherwise allow NULL.
- If NULLs are allowed, ensure empty strings are normalized to NULL.
- Rule: TRIM(email) IS NOT NULL AND TRIM(email) <> '' (if mandatory).

2) Length constraints
- Total length ≤ 254 characters.
- Local part (before @) ≤ 64 characters.
- Rules:
  - LENGTH(email) <= 254
  - LENGTH(SPLIT_PART(email, '@', 1)) <= 64 (or engine-specific string split)

3) Whitespace and control characters
- No leading/trailing whitespace; internal whitespace not allowed.
- No control characters (ASCII < 32), no tabs, no newlines.
- Rules:
  - email = TRIM(email)
  - NOT REGEXP_LIKE(email, '[\\s\\p{Cntrl}]') (adjust for engine)

4) Single address with a single "@"
- Exactly one "@".
- Rule: email LIKE '%@%' AND email NOT LIKE '%@%@%'

5) Syntactic pattern (pragmatic ASCII; excludes quoted local parts)
- Local part: atoms separated by single dots; allowed chars A-Z a-z 0-9 and !#$%&'*+/=?^_`{|}~-
- No leading/trailing/consecutive dots in the local part.
- Domain: labels 1–63 chars, alphanumeric plus hyphen (not starting/ending with a hyphen), at least one dot.
- TLD length ≥ 2.
- Rule (regex, anchored):
  ^(?:[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*)@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\\.(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+$
- Additional TLD check (optional if the engine cannot easily enforce it in the regex):
  LENGTH(REGEXP_SUBSTR(email, '\\.([A-Za-z0-9-]{2,63})$')) > 0

6) Unicode/IDN domain handling
- If Unicode domains are permitted, normalize the domain to punycode before storage, or create a derived field email_domain_ascii for matching/dedup.
- Rule: email_normalized = local part unchanged + '@' + lowercased punycode domain.
- If punycode conversion is unavailable in-warehouse, perform it in the ingestion/transformation step.

7) Reserved/test addresses
- Reject known placeholders: local part in {'test', 'admin', 'null'} combined with example domains (example.com, example.org, example.net).
- Reject comma/semicolon-separated values.
- Rules:
  - NOT REGEXP_LIKE(email, '[,;]')
  - NOT (LOWER(email) LIKE 'test@%' OR LOWER(email) LIKE '%@example.%')

8) Disposable/temporary domains
- Maintain a reference table ref_disposable_domains(domain_name).
- Rule: LOWER(SPLIT_PART(email, '@', 2)) NOT IN (SELECT domain_name FROM ref_disposable_domains)

9) Uniqueness and normalization
- Define email_canonical for deduplication: the lowercased full address after trimming; if IDN is supported, the domain punycoded and then lowercased.
- Enforce uniqueness on email_canonical among active customers.
- Rules:
  - email_canonical = LOWER(TRIM(email)) (plus domain punycode if available)
  - UNIQUE(email_canonical) for active records (or a pipeline check)

10) Optional deliverability check (outside warehouse constraints)
- Periodic DNS MX (or fallback A) lookup for domains; store deliverability_status.
- Rule: the domain has at least one MX record (a DQ process, not a strict storage constraint).
Example implementations

A) Warehouse-side SQL checks (generic pattern; adapt functions to the engine)

- Invalid-row detection:

    SELECT customer_id, email
    FROM dwd_customer
    WHERE email IS NULL
       OR TRIM(email) = ''
       OR LENGTH(email) > 254
       OR email NOT LIKE '%@%'
       OR email LIKE '%@%@%'
       OR REGEXP_LIKE(email, '[\\s\\p{Cntrl}]')
       OR NOT REGEXP_LIKE(TRIM(email), '^(?:[A-Za-z0-9!#$%&''*+/=?^_`{|}~-]+(?:\\.[A-Za-z0-9!#$%&''*+/=?^_`{|}~-]+)*)@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\\.(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+$')
       OR REGEXP_LIKE(email, '[,;]')
       OR LENGTH(SPLIT_PART(email, '@', 1)) > 64
       OR LOWER(SPLIT_PART(email, '@', 2)) IN (SELECT domain_name FROM ref_disposable_domains);

- Canonicalization (ETL step):

    SELECT customer_id, email, LOWER(TRIM(email)) AS email_canonical
    FROM staging_customer;

B) dbt tests (schema.yml; accepted_regex and custom_sql are illustrative generic-test names, so map them to the dbt-utils/dbt-expectations equivalents available in your project)

    models:
      - name: dwd_customer
        columns:
          - name: email
            tests:
              - not_null: {config: {where: "email_required = true"}}
              - accepted_regex:
                  regex: "^(?:[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*)@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\\.(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+$"
              - custom_sql:
                  sql: "LENGTH(email) <= 254 AND LENGTH(SPLIT_PART(email,'@',1)) <= 64"
          - name: email_canonical
            tests:
              - unique: {config: {where: "is_active = true"}}

C) Great Expectations (expectation suite pseudocode)

- expect_column_values_to_not_be_null (conditional on email_required)
- expect_column_values_to_match_regex for the pattern above
- expect_column_values_to_be_unique (on email_canonical for active customers)
- expect_column_values_to_not_match_regex for '[,;]'
- expect_column_values_to_not_match_regex for '\\s|\\p{Cntrl}'
- Custom expectation: domain not in the disposable list, via a query

D) Delta Lake / Spark SQL table constraint (if supported)

    ALTER TABLE dwd_customer ADD CONSTRAINT chk_email_format CHECK (
      LENGTH(email) <= 254
      AND email RLIKE '^(?:[A-Za-z0-9!#$%&''*+/=?^_`{|}~-]+(?:\\.[A-Za-z0-9!#$%&''*+/=?^_`{|}~-]+)*)@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)(?:\\.(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+$'
      AND LENGTH(split(email, '@')[0]) <= 64
      AND NOT (email RLIKE '[\\s\\p{Cntrl}]')
      AND NOT (email RLIKE '[,;]')
    );

Operational notes

- Constraint enforcement varies by platform; if constraints are informational, run the validation queries in ETL/ELT or DQ jobs and quarantine failures.
- Store both the raw email and the normalized email_canonical. Use email_canonical for uniqueness and joins, and the raw value for auditing.
- Maintain and regularly update the disposable-domain list.
- If the system must support quoted local parts or internationalized local parts, expand the parser accordingly; the provided regex intentionally excludes rare, complex RFC cases to remain practical for data quality.
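Before deploying the anchored pattern as a warehouse constraint, it helps to spot-check edge cases offline. A small Python harness with the same pattern translated to Python regex syntax (the function and constant names are assumptions for illustration):

```python
import re

# Same pragmatic ASCII pattern used in the warehouse checks above
EMAIL_RE = re.compile(
    r"^(?:[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
    r"(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*)"
    r"@(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?)"
    r"(?:\.(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?))+$"
)

def is_valid_email(email: str) -> bool:
    """Apply the length caps, single-'@' rule, and the syntactic pattern."""
    if len(email) > 254 or email.count("@") != 1:
        return False
    local = email.split("@", 1)[0]
    return len(local) <= 64 and EMAIL_RE.fullmatch(email) is not None
```

Cases worth exercising include consecutive dots in the local part, a missing TLD, multiple '@' signs, and hyphen-edged domain labels, all of which the rules above reject.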
Below is the data validation rule design for column kpi_value of table ads_marketing_report, with implementation examples. The rules are layered (hard constraints, conditional constraints, quality alerts) and driven by configuration, with generic SQL checks that can be implemented in different warehouses or pipelines.

1. Scope and assumptions
- kpi_value is a numeric field storing marketing-report metric values (e.g., clicks, impressions, CTR, CPC, CPA, CPM, spend, revenue, ROAS).
- The table carries context fields related to kpi_value (e.g., kpi_name, unit, currency_code); if these are absent, validation can be driven by a configuration table.
- Validation is split into hard constraints (must hold), conditional constraints (vary by KPI type/unit), and alerts (flag anomalies without blocking).

2. Hard constraints (must hold)
- Type and admissible values
  - The type must be fixed-point decimal (DECIMAL/NUMERIC), DECIMAL(38,6) recommended; avoid FLOAT/DOUBLE (NaN/Infinity forbidden).
  - NULL is not allowed (unless the business explicitly defines a missing-value scenario with a fail_reason or status marker).
- Non-negativity
  - Non-negative by default. Negative values are allowed only for KPI types explicitly marked "negative-allowed" (e.g., refunds, adjustments).
- Finiteness
  - Must be a finite number (not NaN, not Inf). This holds automatically when fixed-point decimal types are used.

3. Conditional constraints (by KPI type/unit)
Maintain a configuration table dim_kpi_validation_config that defines each KPI's unit, range, and precision requirements, and validate against it.

- Unit classes and ranges
  - count (counting metrics: impressions, clicks, conversions)
    - Integer requirement: must be an integer (fractional part is 0).
    - Range: kpi_value >= 0 and kpi_value <= 1e12 (upper bound configurable).
  - ratio (ratio metrics: ctr, conversion_rate, cost_share, roas when expressed as a ratio)
    - Range: 0 <= kpi_value <= 1 (when the unit is a proportion).
    - Precision: no more decimal places than configured (e.g., 6).
  - pct (percentage metrics: e.g., CTR when stored as a percentage)
    - Range: 0 <= kpi_value <= 100.
    - Precision: no more decimal places than configured (e.g., 4).
  - money (monetary metrics: spend, revenue, cpc, cpm, cpa)
    - Range: kpi_value >= 0 and kpi_value <= 1e12 (upper bound configurable per business).
    - Currency precision: must match the minor unit of currency_code (e.g., USD = 2 places, JPY = 0). Enforce decimal places per currency configuration.
  - Unconfigured KPI
    - If kpi_name has no match in the configuration table, flag it as an error (to keep unknown or misspelled KPIs out).

4. Quality alerts (soft rules)
- Abnormal jumps/outliers
  - Significant deviation from the historical distribution (e.g., period-over-period change beyond a threshold, or z-score > 3) triggers an alert without blocking.
- Computational consistency (when a record carries the related metrics)
  - CTR consistency: kpi_value should approximate clicks / impressions (small tolerance, e.g., absolute error <= 1e-6 or relative error <= 1%).
  - CPC consistency: kpi_value should approximate spend / clicks.
  - CPA consistency: kpi_value should approximate spend / conversions.
  - ROAS consistency: kpi_value should approximate revenue / spend.

5. Configuration table example (recommended)
Create a configuration table to drive the rules and avoid hard-coding:

- Structure:

    dim_kpi_validation_config(
      kpi_name STRING,
      unit STRING,            -- 'count' | 'ratio' | 'pct' | 'money'
      min_val DECIMAL(38,6),
      max_val DECIMAL(38,6),
      integer_required BOOLEAN,
      allow_negative BOOLEAN,
      scale_limit INT,        -- allowed number of decimal places
      currency_required BOOLEAN
    )

- Sample rows (adjust to the actual business):

    ('impressions','count',0,1000000000000,true,false,0,false)
    ('clicks','count',0,1000000000000,true,false,0,false)
    ('conversions','count',0,1000000000000,true,false,0,false)
    ('ctr','ratio',0,1,false,false,6,false)
    ('conversion_rate','ratio',0,1,false,false,6,false)
    ('cpc','money',0,1000000000000,false,false,4,true)
    ('cpm','money',0,1000000000000,false,false,4,true)
    ('cpa','money',0,1000000000000,false,false,4,true)
    ('spend','money',0,1000000000000,false,false,4,true)
    ('revenue','money',0,1000000000000,false,false,4,true)
    ('roas','ratio',0,1000,false,false,6,false)

Also create a currency minor-unit table dim_currency_precision(code STRING, minor_unit INT), e.g.:

    ('USD',2), ('EUR',2), ('JPY',0), ('GBP',2), ...

6. Generic SQL checks (portable across warehouses)
The checks below join ads_marketing_report (r) with dim_kpi_validation_config (c), using widely available functions:

1) Unconfigured KPI (error)

    SELECT r.*
    FROM ads_marketing_report r
    LEFT JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    WHERE c.kpi_name IS NULL;

2) Non-null (error)

    SELECT COUNT(*) AS null_cnt
    FROM ads_marketing_report
    WHERE kpi_value IS NULL;

3) Illegal negative values (error)

    SELECT r.*
    FROM ads_marketing_report r
    JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    WHERE c.allow_negative = FALSE AND r.kpi_value < 0;

4) Range check (error)

    SELECT r.*
    FROM ads_marketing_report r
    JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    WHERE r.kpi_value < c.min_val OR r.kpi_value > c.max_val;

5) Decimal-place limit (error)

    -- Compare against the rounded value to ensure the scale limit is respected
    SELECT r.*
    FROM ads_marketing_report r
    JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    WHERE ABS(r.kpi_value - ROUND(r.kpi_value, c.scale_limit)) > 0;

6) Integer requirement (error)

    SELECT r.*
    FROM ads_marketing_report r
    JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    WHERE c.integer_required = TRUE AND r.kpi_value <> ROUND(r.kpi_value, 0);

7) Unit consistency (error)

    -- ratio must be in [0,1]; pct in [0,100]; money must match the currency minor unit
    SELECT r.*
    FROM ads_marketing_report r
    JOIN dim_kpi_validation_config c
      ON LOWER(r.kpi_name) = LOWER(c.kpi_name)
    LEFT JOIN dim_currency_precision p
      ON r.currency_code = p.code
    WHERE (c.unit = 'ratio' AND (r.kpi_value < 0 OR r.kpi_value > 1))
       OR (c.unit = 'pct'   AND (r.kpi_value < 0 OR r.kpi_value > 100))
       OR (c.unit = 'money' AND p.code IS NULL AND c.currency_required = TRUE)
       OR (c.unit = 'money' AND p.code IS NOT NULL
           AND ABS(r.kpi_value - ROUND(r.kpi_value, p.minor_unit)) > 0);

8) Computational consistency (alert; requires the related fields)

    -- CTR approximately equals clicks/impressions
    SELECT r.*
    FROM ads_marketing_report r
    WHERE LOWER(r.kpi_name) = 'ctr'
      AND r.impressions > 0
      AND ABS(r.kpi_value - (r.clicks * 1.0 / r.impressions)) > 0.01;  -- tolerance configurable

    -- CPC approximately equals spend/clicks
    SELECT r.*
    FROM ads_marketing_report r
    WHERE LOWER(r.kpi_name) = 'cpc'
      AND r.clicks > 0
      AND ABS(r.kpi_value - (r.spend * 1.0 / r.clicks)) > 0.01;

7. Implementation recommendations
- Prefer configuration-driven validation, which makes adding KPIs and adjusting thresholds easy.
- Run the checks above at load time (ELT/ETL) or in the modeling layer (e.g., dbt tests, Great Expectations):
  - Hard-constraint failures block or roll back the batch.
  - Conditional-constraint failures are recorded and alerted on.
  - Alert-class checks (outliers, computational consistency) feed data quality metrics and monitoring.
- Use DECIMAL/NUMERIC storage to avoid floating-point NaN/Inf and precision issues.
- Validate currency precision by currency_code so amounts match the currency's minor unit.

Together, these rules and checks safeguard the type correctness, range validity, unit consistency, and computational consistency of ads_marketing_report.kpi_value, improving the trustworthiness and usability of report data.
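The configuration-driven checks above can also run as a pipeline step before load. A simplified Python sketch follows; the in-memory CONFIG dict and the flag strings are hypothetical stand-ins for dim_kpi_validation_config, not an existing API:

```python
from decimal import Decimal

# Hypothetical in-memory mirror of two dim_kpi_validation_config rows
CONFIG = {
    "clicks": {"unit": "count", "min": 0, "max": 10**12,
               "integer_required": True, "allow_negative": False, "scale_limit": 0},
    "ctr":    {"unit": "ratio", "min": 0, "max": 1,
               "integer_required": False, "allow_negative": False, "scale_limit": 6},
}

def validate_kpi(kpi_name, value):
    """Return the list of rule violations for one kpi_value."""
    cfg = CONFIG.get(kpi_name.lower())
    if cfg is None:
        return ["unconfigured_kpi"]       # unknown or misspelled KPI
    if value is None:
        return ["null_value"]             # hard constraint: NOT NULL
    errors = []
    v = Decimal(str(value))               # fixed-point view of the value
    if not cfg["allow_negative"] and v < 0:
        errors.append("negative")
    if v < cfg["min"] or v > cfg["max"]:
        errors.append("out_of_range")
    if cfg["integer_required"] and v != v.to_integral_value():
        errors.append("not_integer")
    # Scale check: rounding to scale_limit places must not change the value
    if v != v.quantize(Decimal(1).scaleb(-cfg["scale_limit"])):
        errors.append("scale_exceeded")
    return errors
```

Following the implementation recommendations above, hard failures such as null_value and out_of_range would block the batch, while alert-class checks (outliers, computational consistency) run separately against the related fields.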
Quickly generate column-level validation rules for the import, transformation, and loading stages, with sample data and explanations, shortening release cycles and reducing rollbacks.
Build a unified field-validation template, cover core tables in batches, standardize data quality, and support audits and walkthroughs.
Define format and range limits for fields critical to metric definitions, intercept dirty data early, reduce report rework, and speed up weekly and monthly report delivery.
Generate validation rules for form fields such as email, phone number, and region to raise lead validity and cut waste in outbound calls and ad spend.
Generate traceable validation documentation and change records per policy requirements, respond to inspections quickly, and reduce compliance risk.
Establish consistent validation specifications for event tracking and user attributes, reduce missing and abnormal data, and improve the credibility of growth experiments.
Use one efficient prompt to automatically generate executable data validation rules, giving every column clear boundaries and traceable standards, and helping teams turn "data cleanliness" into a repeatable competitive advantage.
- Fast: enter the table name, column name, and preferred language; rules and check steps are generated in seconds.
- Professional: authoritative, rigorous specifications and explanations from a "data engineering expert" perspective.
- Comprehensive: covers formats, value ranges, uniqueness, missing values, dependencies, and common anomaly handling.
- Actionable: includes check points and implementation advice, ready to execute in spreadsheets and data platforms.
- Shareable: clearly structured output, suitable for data dictionaries and team handbooks, lowering communication costs.
- Cost-effective: significantly reduces rework, shortens report release cycles, and improves metric credibility.
- Multi-scenario: suitable for pre-release report validation, routine data governance, quality inspections, and compliance-audit preparation.
Copy the generated prompt into your preferred chat application (such as ChatGPT or Claude) and use it directly in conversation, with no extra development. Suited to individual quick trials and lightweight use.
Turn the prompt template into an API: your program can modify template parameters freely and call it directly through the interface, enabling automation and batch processing. Suited to developer integration and embedding in business systems.
Configure the corresponding server address in an MCP client so your AI application can invoke the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly between AI tools.