Popular roles are not just a source of inspiration; they are also your efficiency assistant. With carefully curated role prompts, you can quickly generate high-quality content, spark creative ideas, and find the solution that best fits your needs. Creation made easier, value made more direct!
We continuously update the role library for different user needs, so you can always find the right entry point for inspiration.
Summarize five methods for dataset quality checks and provide professional technical advice.
Below are five categories of data quality checks for the dataset events (fields: id INT, ts TIMESTAMP, src STRING, val DOUBLE; partitioned by dt). Each category includes an objective, rules, and runnable example SQL (written in Spark SQL/Hive SQL style; adjust as needed).
Objective: ensure the partition key dt is consistent with the record timestamp ts, so that records crossing day boundaries do not land in the wrong partition.
Rule examples: ts must not be NULL, and to_date(ts) must equal the partition date; the per-partition violation rate should stay below an agreed threshold (e.g., 0.1%).
Example SQL:
-- Count of violating records (per partition)
SELECT dt, COUNT(*) AS bad_cnt
FROM events
WHERE ts IS NULL OR to_date(ts) <> to_date(dt)
GROUP BY dt;
-- Violation-rate monitoring (set a threshold, e.g. bad_rate < 0.1%)
SELECT dt,
       SUM(CASE WHEN ts IS NULL OR to_date(ts) <> to_date(dt) THEN 1 ELSE 0 END) AS bad_cnt,
       COUNT(*) AS total_cnt,
       SUM(CASE WHEN ts IS NULL OR to_date(ts) <> to_date(dt) THEN 1 ELSE 0 END) / COUNT(*) AS bad_rate
FROM events
GROUP BY dt;
Objective: key fields must be present and of usable type.
Rule examples: id and ts must not be NULL; src must not be NULL or blank; values arriving as strings must be castable to the declared types.
Example SQL:
SELECT dt,
       SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS id_nulls,
       SUM(CASE WHEN ts IS NULL THEN 1 ELSE 0 END) AS ts_nulls,
       SUM(CASE WHEN src IS NULL OR trim(src) = '' THEN 1 ELSE 0 END) AS src_nulls
FROM events
GROUP BY dt;
-- A type-parseability check can be added (if the raw value is a string, the CAST must succeed).
-- Example: if ts arrives as STRING, use TRY_CAST or a regex pre-check (the exact function depends on the engine); a sketch follows below.
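As a minimal sketch of such a pre-check, assuming a hypothetical raw staging table events_raw with ts_str as a STRING column and an engine that supports TRY_CAST (e.g., Spark SQL 3.2+):

-- Hypothetical staging table: events_raw(ts_str STRING, dt STRING)
SELECT dt,
       SUM(CASE WHEN ts_str IS NOT NULL
                 AND TRY_CAST(ts_str AS TIMESTAMP) IS NULL
           THEN 1 ELSE 0 END) AS unparsable_ts_cnt
FROM events_raw
GROUP BY dt;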
Objective: identify duplicate events to keep downstream aggregation and auditing accurate.
Rule examples: the business key (id, ts, src) should be unique within a partition; the partition-level duplicate rate should stay near zero.
Example SQL:
-- Duplicate rows based on the business key
SELECT dt, id, ts, src, COUNT(*) AS dup_cnt
FROM events
GROUP BY dt, id, ts, src
HAVING COUNT(*) > 1;
-- Or a quick estimate of the duplicate rate per partition
SELECT dt,
       COUNT(*) AS total_cnt,
       COUNT(DISTINCT CONCAT_WS('|', CAST(id AS STRING), CAST(ts AS STRING), src)) AS distinct_cnt,
       1 - COUNT(DISTINCT CONCAT_WS('|', CAST(id AS STRING), CAST(ts AS STRING), src)) / COUNT(*) AS dup_rate
FROM events
GROUP BY dt;
Objective: numeric fields stay within an acceptable range; categorical fields come from a controlled dictionary.
Rule examples: val must not be NULL or NaN and must fall within business-agreed bounds (the example uses -1e6 to 1e6); src must exist in the dim_src dictionary.
Example SQL:
-- Range check on val (example bounds: -1e6 to 1e6)
SELECT dt,
       SUM(CASE WHEN val IS NULL OR isnan(val) OR val < -1000000 OR val > 1000000 THEN 1 ELSE 0 END) AS bad_val_cnt,
       COUNT(*) AS total_cnt,
       SUM(CASE WHEN val IS NULL OR isnan(val) OR val < -1000000 OR val > 1000000 THEN 1 ELSE 0 END) / COUNT(*) AS bad_val_rate
FROM events
GROUP BY dt;
-- Referential integrity for src (find unknown sources)
-- Requires a dimension table dim_src(src_code)
SELECT e.dt, e.src, COUNT(*) AS unknown_src_cnt
FROM events e
LEFT JOIN dim_src d ON e.src = d.src_code
WHERE d.src_code IS NULL
GROUP BY e.dt, e.src;
Objective: ensure partitions arrive on time and row counts stay within a reasonable range, so data delays and large swings are caught promptly.
Rule examples: every expected daily partition for the last N days must be present; today's row count should stay within roughly ±20% of the recent baseline.
Example SQL:
-- Detect expected partitions over the last N days (last 7 days here; Spark can generate the date series with sequence)
WITH cal AS (
  SELECT explode(sequence(date_sub(current_date(), 6), current_date())) AS dt
),
arrived AS (
  SELECT to_date(dt) AS dt, COUNT(*) AS row_cnt
  FROM events
  WHERE to_date(dt) BETWEEN date_sub(current_date(), 6) AND current_date()
  GROUP BY to_date(dt)
)
SELECT c.dt,
       COALESCE(a.row_cnt, 0) AS row_cnt,
       CASE WHEN a.row_cnt IS NULL THEN 'MISSING_PARTITION' ELSE 'OK' END AS status
FROM cal c
LEFT JOIN arrived a USING (dt);
-- Today's row count vs. the last 7 days' baseline (median of daily row_cnt via percentile_approx)
WITH hist AS (
  SELECT to_date(dt) AS dt, COUNT(*) AS row_cnt
  FROM events
  WHERE to_date(dt) BETWEEN date_sub(current_date(), 7) AND date_sub(current_date(), 1)
  GROUP BY to_date(dt)
),
today AS (
  SELECT to_date(dt) AS dt, COUNT(*) AS row_cnt
  FROM events
  WHERE to_date(dt) = current_date()
  GROUP BY to_date(dt)
),
baseline AS (
  SELECT percentile_approx(row_cnt, 0.5) AS p50_last7 FROM hist
)
SELECT t.dt,
       t.row_cnt AS today_cnt,
       b.p50_last7,
       CASE WHEN t.row_cnt < 0.8 * b.p50_last7 THEN 'LOW_VOLUME'
            WHEN t.row_cnt > 1.2 * b.p50_last7 THEN 'HIGH_VOLUME'
            ELSE 'OK' END AS volume_status
FROM today t
CROSS JOIN baseline b;
Implementation advice
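One way to operationalize the checks above is to consolidate them into a single status query that a scheduler runs after each partition lands, alerting on any FAIL row. Below is a minimal sketch in the same Spark SQL style; the check names and the dq_check_results sink table are illustrative assumptions, not part of the original specification:

-- Assumed sink table: dq_check_results(run_dt DATE, check_name STRING, status STRING, bad_cnt BIGINT)
INSERT INTO dq_check_results
SELECT current_date() AS run_dt,
       'partition_consistency' AS check_name,
       CASE WHEN COUNT(*) = 0 THEN 'PASS' ELSE 'FAIL' END AS status,
       COUNT(*) AS bad_cnt
FROM events
WHERE to_date(dt) = current_date()
  AND (ts IS NULL OR to_date(ts) <> to_date(dt))
UNION ALL
SELECT current_date(),
       'duplicate_business_key',
       CASE WHEN COUNT(*) = 0 THEN 'PASS' ELSE 'FAIL' END,
       COUNT(*)
FROM (
  SELECT id, ts, src
  FROM events
  WHERE to_date(dt) = current_date()
  GROUP BY id, ts, src
  HAVING COUNT(*) > 1
) dups;

The remaining checks can be appended as further UNION ALL branches, each reusing the predicates from its example query.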
Below are five data quality checks for a “paid” dataset derived from table orders with fields: oid (INT), uid (INT), pay_ts (TIMESTAMP), amt (DECIMAL). Each check includes its objective, rule, and an example SQL query (ANSI SQL) applicable to most data warehouses.
Objective: Ensure the dataset strictly adheres to the “paid” definition and contains required values.
Rule: every row in the paid dataset must have a non-NULL pay_ts, a positive non-NULL amt, and non-NULL oid and uid.
Example SQL:
SELECT SUM(CASE WHEN pay_ts IS NULL THEN 1 ELSE 0 END) AS violations_missing_pay_ts,
       SUM(CASE WHEN amt IS NULL OR amt <= 0 THEN 1 ELSE 0 END) AS violations_amt_nonpositive_or_null,
       SUM(CASE WHEN oid IS NULL THEN 1 ELSE 0 END) AS violations_missing_oid,
       SUM(CASE WHEN uid IS NULL THEN 1 ELSE 0 END) AS violations_missing_uid
FROM orders;
-- Optional hard-filter audit (ensures no unpaid rows leak into the "paid" dataset)
SELECT COUNT(*) AS unpaid_rows_present
FROM orders
WHERE pay_ts IS NULL;
Objective: Prevent duplicate paid orders for the same oid.
Rule: Each oid appears at most once in the paid dataset.
Example SQL:
SELECT oid, COUNT(*) AS dup_count
FROM orders
WHERE pay_ts IS NOT NULL
GROUP BY oid
HAVING COUNT(*) > 1;
-- If duplicates exist, inspect conflicting values
SELECT o.*
FROM (
  SELECT oid
  FROM orders
  WHERE pay_ts IS NOT NULL
  GROUP BY oid
  HAVING COUNT(*) > 1
) d
JOIN orders o USING (oid)
ORDER BY o.oid, o.pay_ts;
Objective: Ensure that uid in orders maps to a valid user record.
Rule: For paid rows, uid must exist in the users dimension (or source of truth).
Example SQL:
SELECT COUNT(*) AS missing_users
FROM orders o
LEFT JOIN dim_users u ON o.uid = u.uid
WHERE o.pay_ts IS NOT NULL AND u.uid IS NULL;
-- If there is a known set of valid uids (e.g., from a lookup table), use that instead of dim_users; a sketch follows below.
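As a minimal variant of the same check against a hypothetical lookup table valid_uids(uid), named here purely for illustration:

-- Hypothetical lookup table: valid_uids(uid INT)
SELECT COUNT(*) AS missing_users
FROM orders o
LEFT JOIN valid_uids v ON o.uid = v.uid
WHERE o.pay_ts IS NOT NULL
  AND v.uid IS NULL;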
Implementation notes: run these queries as pre-publish gates; any non-zero violation count should block release of the "paid" dataset until investigated.
Below are five practical data quality checks tailored for the training dataset clicks with fields: sid (int), label (binary), age (int), last_ts (timestamp), ctr (double). Each check includes what to assert and an example implementation (PySpark) you can embed in a data validation job.
Assertion: label ∈ {0, 1}; when present, 0 ≤ age ≤ 120 and 0.0 ≤ ctr ≤ 1.0; last_ts is not in the future.
Rationale: Catches impossible or out-of-contract values that indicate upstream defects or parsing issues.
Example (PySpark):
from pyspark.sql.functions import col, current_timestamp
invalid = df.filter(
    (~col("label").isin(0, 1)) | col("label").isNull()
    | (col("age").isNotNull() & ((col("age") < 0) | (col("age") > 120)))
    | (col("ctr").isNotNull() & ((col("ctr") < 0.0) | (col("ctr") > 1.0)))
    | (col("last_ts") > current_timestamp())
)
n_invalid = invalid.count()
assert n_invalid == 0, f"Domain/range violations: {n_invalid}"
Assertion: max(last_ts) is within the 24-hour freshness SLA, and the data spans at least 7 days.
Rationale: Ensures the model trains on timely and representative data; stale inputs degrade performance.
Example (PySpark):
from pyspark.sql.functions import max as spark_max, min as spark_min
import datetime
stats = df.agg(spark_min("last_ts").alias("min_ts"), spark_max("last_ts").alias("max_ts")).collect()[0]
now = datetime.datetime.utcnow()  # assumes the Spark session timezone is UTC so last_ts aligns with utcnow()
freshness_ok = (now - stats["max_ts"]).total_seconds() <= 24 * 3600  # 24h SLA
assert freshness_ok, f"Data not fresh. max(last_ts)={stats['max_ts']} UTC"
coverage_ok = (stats["max_ts"] - stats["min_ts"]).days >= 7
assert coverage_ok, f"Insufficient coverage: {stats['min_ts']}..{stats['max_ts']}"
Assertion: null rates for age and ctr stay at or below 1%; the label rate and the mean/std of ctr stay close to their historical baselines.
Rationale: Detects upstream shifts and silent failures that maintain schema but change data characteristics.
Example (PySpark):
from pyspark.sql.functions import col, mean, stddev
total = df.count()
age_null_rate = df.filter(col("age").isNull()).count() / total
ctr_null_rate = df.filter(col("ctr").isNull()).count() / total
assert age_null_rate <= 0.01, f"age null rate too high: {age_null_rate:.4f}"
assert ctr_null_rate <= 0.01, f"ctr null rate too high: {ctr_null_rate:.4f}"
label_rate = df.filter(col("label") == 1).count() / total
ctr_stats = df.select(mean("ctr").alias("ctr_mean"), stddev("ctr").alias("ctr_std")).collect()[0]
# Compare label_rate, ctr_stats["ctr_mean"], and ctr_stats["ctr_std"] against stored baselines to flag drift
Notes for productionization: wrap these assertions in a scheduled validation job, log the computed metrics over time, and fail the training pipeline when any assertion trips.
Give any data team the key attributes of a dataset and immediately receive a tailored "five data quality checks" plan, presented in a clear, executable, reviewable structure for fast adoption. Expert-level output across languages helps standardize quality-check processes, reduce repeated communication and rework, and lower the business risk caused by data issues. Enter your dataset's key information now to quickly generate an adapted checklist and action recommendations, moving your team from reactive troubleshooting to proactive prevention.
When building or refactoring a data pipeline, quickly produce a checklist covering missing values, anomalies, duplicates, consistency, and freshness, for pre-production validation and release approval. When incidents occur, follow the steps to localize the problem and shorten recovery time.
When reports look wrong or metrics fluctuate, get a clear investigation path and extraction-consistency checks that keep metric definitions stable. Generate readable explanations that make it easy to communicate with the business, reproduce issues, and correct them.
During training-set and feature construction, generate sample-quality checks and repair suggestions to keep data stable and improve model performance. Multilingual explanations support team collaboration.
Copy the prompt generated from the template into your usual chat app (such as ChatGPT or Claude) and start the conversation directly, with no extra development. Suited to quick personal trials and lightweight use.
Turn the prompt template into an API: your program can modify template parameters at will and call it directly through the interface, enabling automation and batch processing. Suited to developer integration and embedding in business systems.
Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly between AI tools.