Recommended Data Cleaning Steps


Provides professional data cleaning recommendations to resolve data issues accurately and efficiently.

Example 1

The following cleaning steps target user behavior log data containing views, clicks, purchases, session IDs, timestamps, and join keys to product and user tables. They cover field-level normalization, event-level processing, session reconstruction, dimension-table consistency checks, anomaly filtering, and quality monitoring. Adjust thresholds and rules to your business context.

1. Schema and field type validation
- Define required fields explicitly: event_type, timestamp, and at least one identifier (user_id or session_id). Purchase events must carry order_id, item_id, price, and quantity.
- Type normalization:
  - user_id, item_id, session_id: standardize to string or integer; make sure leading zeros are not lost (e.g., store as strings).
  - timestamp: standardize to a parseable datetime type (ISO 8601), preserving millisecond precision.
  - price, quantity: cast to numeric types (float/integer).
  - event_type: map to a standard enumeration (e.g., view, click, add_to_cart, purchase).
- Range and validity checks:
  - timestamp must fall within the data collection window; filter out clearly invalid times (e.g., earlier than system launch, or later than the current time by more than a reasonable buffer such as 72 hours).
  - quantity ≥ 1; price ≥ 0; cap extreme outliers and flag them.
- String cleanup: strip whitespace, normalize case, and remove invisible characters; normalize source fields (e.g., utm, referrer) by unifying case and removing invalid prefixes.

2. Timestamp and timezone standardization
- Standardize on UTC; if local timezones are present, convert to UTC using a user- or site-level timezone mapping.
- Handle unit differences: unify seconds and milliseconds to milliseconds; when an event carries multiple time fields, pick one authoritative field (e.g., server_ts over client_ts).
- Order events within a session by timestamp ascending; for multiple events within the same millisecond, keep the original arrival-order field (ingested_at) as a tiebreaker.
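
For illustration, a minimal pandas sketch of the unit unification, UTC conversion, and tie-breaking described above; the column names server_ts, client_ts, ingested_at, and session_id are assumptions rather than a fixed schema.

```python
import pandas as pd

def normalize_timestamps(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Prefer the authoritative server timestamp; fall back to the client one.
    raw_ts = df["server_ts"].fillna(df["client_ts"])
    # Treat values below ~1e12 as seconds and promote them to milliseconds.
    ms = raw_ts.where(raw_ts >= 1e12, raw_ts * 1000)
    df["event_time_utc"] = pd.to_datetime(ms, unit="ms", utc=True)
    # Sort within a session by event time, breaking same-millisecond ties
    # with the original arrival order.
    return df.sort_values(["session_id", "event_time_utc", "ingested_at"])
```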

3. Event type normalization and semantic consistency
- Normalize event_type: map synonyms and variants (e.g., page_view → view, tap → click, checkout → a pre-purchase stage).
- Standardize purchase status: distinguish order placed, payment succeeded, cancelled, refunded, etc., via a unified field (order_status), so unpaid events are not counted as purchases.
- Remove or flag test/dry-run events: filter events from test accounts, internal IPs, specific user agents, sandbox environments, and clearly abnormal source parameters (e.g., utm_source=test).

4. Deduplication and idempotency
- Event-level dedup:
  - Prefer a unique event key (event_id). When it is missing, use a composite key: user_id + session_id + item_id + event_type + timestamp (with a time tolerance such as ±1s) + key context fields (page, device).
  - Use a window function to keep the first record by arrival time (ingested_at) and drop the rest.
- Order-level dedup: dedupe on order_id + item_id to avoid duplicate payments or duplicate reporting; keep partial shipments and split orders, but merge their amounts and quantities according to business logic.
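
A minimal dedup sketch for the composite-key case above, assuming an ingested_at arrival column and approximating the ±1s tolerance by bucketing timestamps to whole seconds; adjust the key to your actual schema.

```python
import pandas as pd

DEDUP_KEY = ["user_id", "session_id", "item_id", "event_type", "event_time_bucket", "page", "device"]

def dedupe_events(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Bucket timestamps to whole seconds before building the composite key.
    df["event_time_bucket"] = df["event_time_utc"].dt.floor("1s")
    # Flag every record that participates in a duplicate group, for auditing.
    df["dedup_flag"] = df.duplicated(subset=DEDUP_KEY, keep=False)
    # Keep the first arrival per key, mirroring a ROW_NUMBER()-style window.
    return (df.sort_values("ingested_at")
              .drop_duplicates(subset=DEDUP_KEY, keep="first")
              .drop(columns=["event_time_bucket"]))
```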

5. Missing value handling
- Missing user_id: keep as an anonymous user (e.g., user_id=null, guest_user=true), but require session_id to be present; downstream analysis must distinguish anonymous from registered users.
- Missing/abnormal session_id: rebuild sessions from an inactivity threshold (split after 30 minutes of inactivity by default), generate a new session_id, and record a reconstruction flag (see the sketch after this list).
- Missing item_id:
  - View/click events: may be kept but flagged as missing; restrict their use in item-level analysis.
  - Purchase events: drop or route to manual review if the item cannot be identified.
- Missing timestamp: drop, or substitute the arrival time and flag the record as low confidence.
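
A minimal session-rebuild sketch using the default 30-minute inactivity threshold; it only fills in missing session_ids and flags them, and assumes the column names used above.

```python
import pandas as pd

def rebuild_sessions(df: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    df = df.sort_values(["user_id", "event_time_utc"]).copy()
    # A new session starts at a user's first event or after a long gap.
    gap = df.groupby("user_id")["event_time_utc"].diff() > pd.Timedelta(minutes=gap_minutes)
    session_index = gap.groupby(df["user_id"]).cumsum().astype(int)
    df["reconstructed_session_flag"] = df["session_id"].isna()
    # Fill only the missing session_ids with a synthetic per-user counter.
    df["session_id"] = df["session_id"].fillna(
        df["user_id"].astype(str) + "_s" + session_index.astype(str)
    )
    return df
```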

6. Session consistency and repair
- Session ownership check: a given session_id should belong to a single user_id; rename or split session_ids reused across users.
- Session length and boundaries:
  - Filter or split extremely long sessions (e.g., >12 hours); flag anomalies likely caused by idle pages or heartbeat events.
  - Disallow sessions that drift unreasonably across dates/timezones; partition by UTC date.
- Event sequence sanity checks: e.g., a purchase should be preceded by at least one view/click; flag sequences that violate this.

7. Dimension table joins and referential integrity
- User dimension (users):
  - Verify that user_id exists; tag events with missing or invalid user_id as unknown_user; keep them where needed to support anonymous analysis.
  - Handle slowly changing dimensions (SCD): join user attributes by their validity period at event time so attributes do not leak across time (e.g., membership tier changes); see the as-of join sketch after this list.
- Item dimension (items):
  - Verify that item_id exists and is valid at event time; flag delisted or inactive items.
  - Price validation: check the purchase event price against the item price or order price at that time; when promotional or post-coupon prices exist, treat the order price as authoritative.
- Foreign key policy: keep events whose foreign keys cannot be resolved but tag them referential_flag=false; prioritize manual review or backfill for purchase events, while a limited share of unknowns can be tolerated for view events.
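
A minimal as-of join sketch for attaching time-valid user attributes (SCD), assuming a users_dim snapshot with a valid_from column; both key columns are assumed to share the same datetime dtype.

```python
import pandas as pd

def join_user_attributes(events: pd.DataFrame, users_dim: pd.DataFrame) -> pd.DataFrame:
    events = events.sort_values("event_time_utc")
    users_dim = users_dim.sort_values("valid_from")
    # For each event, pick the attribute row whose validity started most
    # recently before the event time, so attributes never "time travel".
    joined = pd.merge_asof(
        events, users_dim,
        left_on="event_time_utc", right_on="valid_from",
        by="user_id", direction="backward",
    )
    joined["referential_flag"] = joined["valid_from"].notna()
    return joined
```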

8. Numeric and currency conventions
- Currency unification: convert price to a single settlement currency (e.g., USD/CNY) while retaining the original currency and exchange rate; flag prices with a missing currency as not comparable.
- Amount consistency: check that price × quantity matches the line-item total; settle on one tax-inclusive or tax-exclusive convention.
- Extreme values and anomaly detection: identify obviously low/high prices and very large quantities; flag records beyond thresholds as anomalies rather than deleting them outright, so they can be audited later.
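
A minimal sketch of currency unification and the price × quantity consistency check; the FX_RATES table, the line_total column, and the 0.01 tolerance are illustrative assumptions.

```python
import pandas as pd

FX_RATES = {"USD": 1.0, "CNY": 0.14, "EUR": 1.08}  # rates to the reporting currency (USD)

def normalize_amounts(orders: pd.DataFrame, tolerance: float = 0.01) -> pd.DataFrame:
    orders = orders.copy()
    rate = orders["currency"].map(FX_RATES)
    orders["price_reporting"] = orders["price"] * rate
    orders["currency_comparable"] = rate.notna()
    # Flag line items whose stated total deviates from price x quantity.
    expected = orders["price"] * orders["quantity"]
    orders["amount_mismatch_flag"] = (orders["line_total"] - expected).abs() > tolerance
    return orders
```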

9. Bot, abnormal traffic, and noise filtering
- Bot identification: based on user_agent (known crawler lists), event frequency (e.g., multiple clicks per second), navigation path patterns, invalid referrers, and IP blocklists; set bot_flag and exclude by default.
- Internal traffic: filter out corporate office network ranges, test devices, and QA accounts; keep samples for quality evaluation but exclude them from business analysis.
- Spike/attack detection: flag and quarantine abnormal short bursts of high-concurrency event sequences.
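
A minimal bot-flagging sketch combining a user-agent pattern with a click-rate heuristic; the regex and the 5-clicks-per-second threshold are assumptions to be tuned against your traffic.

```python
import pandas as pd

BOT_UA = r"bot|crawler|spider|headless"

def flag_bots(df: pd.DataFrame, max_clicks_per_sec: int = 5) -> pd.DataFrame:
    df = df.copy()
    ua_hit = df["user_agent"].str.contains(BOT_UA, case=False, na=False)
    # Count clicks per user per second and collect users above the threshold.
    clicks = df[df["event_type"] == "click"]
    cps = clicks.groupby(["user_id", clicks["event_time_utc"].dt.floor("1s")]).size()
    noisy_users = cps[cps > max_clicks_per_sec].index.get_level_values("user_id").unique()
    df["bot_flag"] = ua_hit | df["user_id"].isin(noisy_users)
    return df
```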

10. Order event consistency and post-sale handling
- Purchase events must carry an order_id; align them with the order fact table on payment status, cancellations, refunds, and partial refunds.
- Record returns/refunds as separate event types or as order_status changes; do not fold negative quantities into positive purchases; maintain both net and gross metrics.
- Merging split orders: process multi-item orders line by line; dedupe repeatedly generated order records and keep the latest status.

11. Output data model and quality tagging
- Event fact table (fact_events): event_id, user_id, session_id, item_id, event_type, event_time_utc, price, quantity, device, referrer, source, flags (dedup_flag, bot_flag, referential_flag, reconstructed_session_flag, validity_flag).
- Session fact table (fact_sessions): session_id, user_id, session_start/end, event_count, duration, flags.
- Order fact table (fact_orders): order_id, user_id, order_time, payment_status, currency, total_amount, refund_amount, flags.
- Dimension snapshots: manage users_dim and items_dim with validity periods to ensure time-consistent joins.

12. Data privacy and compliance
- Mask or hash PII (e.g., email, phone number, IP); keep only the anonymous identifiers needed for analysis.
- Follow data retention policies and access controls; log processing lineage and changes.

13. Quality monitoring metrics and thresholds (daily or per partition)
- Completeness: missing rate of required fields (target <0.5%); unresolvable foreign key rate (views <3%, purchases <0.1%).
- Uniqueness: event duplicate rate (<0.2%); order duplicate rate (≈0%).
- Validity: timestamp anomaly rate; price/quantity violation rate.
- Consistency: share of sessions spanning multiple users; share of purchases without preceding behavior.
- Timeliness: proportion of late-arriving events; monitor tiered reports with alert thresholds.
- Sampled back-checks: compare random samples against raw logs to verify dedup and normalization.
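
A minimal sketch of a per-partition quality report that rolls up the flags produced in earlier steps; the thresholds mirror the targets above and the column names are assumed.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    required = ["event_type", "event_time_utc"]
    views = df["event_type"] == "view"
    purchases = df["event_type"] == "purchase"
    report = {
        "missing_required_rate": df[required].isna().any(axis=1).mean(),
        "orphan_fk_rate_views": (~df.loc[views, "referential_flag"]).mean(),
        "orphan_fk_rate_purchases": (~df.loc[purchases, "referential_flag"]).mean(),
        "event_duplicate_rate": df["dedup_flag"].mean(),
        "bot_rate": df["bot_flag"].mean(),
    }
    # Compare against the targets above before publishing the partition.
    report["within_targets"] = (
        report["missing_required_rate"] < 0.005
        and report["orphan_fk_rate_views"] < 0.03
        and report["orphan_fk_rate_purchases"] < 0.001
        and report["event_duplicate_rate"] < 0.002
    )
    return report
```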

Implementation notes and recommendations
- Layered cleaning: Raw → Staging → Curated, tagging records at each layer to keep everything traceable.
- Incremental processing and idempotent design: use unique event keys and upserts to avoid reprocessing.
- Configurable thresholds: manage the session inactivity threshold, time tolerance, price extremes, and bot rules via configuration.
- Audit trail: store quality flags and cleaning reason codes on each record to support later error analysis and rule tuning.

These steps map directly onto common data stacks (SQL/ETL/stream processing). Depending on business needs, first ensure strong consistency and traceability for purchase-related events, then improve the normalization and session quality of view/click events, so that funnel and conversion analyses are trustworthy.

Example 2

Recommended data cleaning steps for an advertising and behavior dataset (impressions/clicks/conversions, UTM parameters, channel and creative IDs, A/B experiment grouping)

1) Schema normalization and typing
- Define a canonical event schema with required fields: event_id, event_type (impression/click/conversion), event_ts, platform, ad_account_id, campaign_id, ad_group_id, creative_id, channel_id, url, referrer_url, utm_source, utm_medium, utm_campaign, utm_content, utm_term, user_id, device_id, cookie_id, session_id (if present), experiment_id, variant, order_id/transaction_id, revenue, currency, consent_flag.
- Enforce data types:
  - Timestamps: parse to UTC ISO 8601, store as datetime with timezone; retain raw ingestion_ts for auditing.
  - IDs: cast to string, strip whitespace; if numeric IDs are expected, validate parsability and store canonical string representation to preserve leading zeros.
  - Monetary fields: numeric with explicit currency; validate non-negative.
  - Categorical: lowercase, trimmed, normalized encoding (UTF-8).
- URL and parameter parsing:
  - Decode URLs (percent-encoding), extract query parameters to structured fields.
  - If UTM fields are missing, parse from landing/referrer URLs; preserve raw and parsed versions.
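
A minimal sketch of the URL/UTM parsing step, keeping the raw URL alongside the extracted parameters; field names follow the canonical schema above.

```python
from urllib.parse import urlparse, parse_qs

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")

def parse_utm(url: str) -> dict:
    # parse_qs percent-decodes values while splitting the query string.
    params = parse_qs(urlparse(url or "").query)
    out = {"url_raw": url}
    for key in UTM_KEYS:
        out[key] = params.get(key, [None])[0]
    return out

# parse_utm("https://example.com/p?utm_source=Google&utm_medium=cpc")
# -> {"url_raw": ..., "utm_source": "Google", "utm_medium": "cpc", ...}
```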

2) Event integrity and validation
- Validate allowed event_type values and remap common synonyms (e.g., “view” → impression; “purchase” → conversion) into the canonical set.
- Timestamp sanity checks:
  - Drop or flag events with event_ts outside campaign or data collection windows.
  - Remove obvious clock errors (e.g., year far in past/future); if platform ingestion_ts exists, cap discrepancies with a defined tolerance (e.g., ±3 days) and flag.
- Event ordering constraints:
  - Conversions must not precede the attributed click/impression for the same user within an attribution window; flag negative time-to-event as data errors.
  - Ensure click timestamps precede conversion timestamps; if not, investigate source-specific latency or duplicate conversion.

3) Identity resolution
- Construct a persistent person_id using deterministic keys in priority order: hashed_email > login user_id > device_id > cookie_id. Maintain a mapping table with versioning.
- Normalize device identifiers (consistent casing, remove separators for some IDs where applicable).
- Remove or flag events lacking any identifier if user-level attribution is required; otherwise route to aggregate-only analyses.

4) Deduplication
- Enforce uniqueness on event_id where available; if missing, generate a synthetic key (hash of source_system, event_type, user_id/device_id, ad_id/creative_id, event_ts rounded to suitable precision).
- Deduplicate within event type using keys and short time windows:
  - Impressions: drop exact duplicates; optionally compress millisecond-level bursts from the same ad server when the duplication pattern is known.
  - Clicks: collapse repeated identical clicks within 1–2 seconds per user-creative to mitigate multi-fire.
  - Conversions: dedupe by transaction_id/order_id and user_id; keep the first occurrence. If transaction_id missing, dedupe by (user_id, revenue, event_ts within N minutes) with conservative rules and flag.
- Cross-source dedup:
  - If multiple platforms report the same conversion, prefer primary source-of-truth (ecommerce or CRM) and mark ad-platform conversions as secondary.
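
A minimal sketch of the synthetic event key described in this step: a stable hash over source and identity fields with the timestamp rounded to one second; the exact field list is an assumption.

```python
import hashlib
import pandas as pd

KEY_FIELDS = ("source_system", "event_type", "user_id", "device_id", "creative_id")

def synthetic_event_id(row: pd.Series) -> str:
    # Round the timestamp so retried or slightly skewed duplicates collide.
    ts_rounded = pd.Timestamp(row["event_ts"]).floor("1s").isoformat()
    parts = [str(row.get(k, "")) for k in KEY_FIELDS] + [ts_rounded]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

# events["event_id"] = events["event_id"].fillna(events.apply(synthetic_event_id, axis=1))
```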

5) UTM canonicalization
- Normalize utm_* fields: lowercase, trim, remove surrounding quotes, decode URL-encoding.
- Map synonyms into canonical values:
  - Medium: map to controlled vocabulary (cpc, display, social, email, affiliate, referral, organic, paid_social, other). Examples: “ppc” → cpc; “paid social” → paid_social.
  - Source: unify common variants (e.g., “google”, “googleads”, “adwords” → google; “fb”, “facebook” → facebook).
  - Campaign/content/term: enforce naming conventions (no spaces if policy requires; replace illegal characters; trim long values; optionally split structured names into parts using agreed delimiters).
- Validate UTM coherence:
  - Medium/source combinations must be allowed (e.g., cpc + google valid; email + google invalid).
  - Flag missing or inconsistent UTMs; backfill source/medium from channel_id/platform metadata where possible.
- Preserve both original and canonical UTM fields to avoid loss of detail.
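
A minimal canonicalization sketch for utm_source/utm_medium using small synonym maps and an allow-list of source/medium pairs; the vocabularies shown here are illustrative, not exhaustive.

```python
SOURCE_MAP = {"googleads": "google", "adwords": "google", "fb": "facebook"}
MEDIUM_MAP = {"ppc": "cpc", "paid social": "paid_social", "paidsocial": "paid_social"}
ALLOWED_PAIRS = {("google", "cpc"), ("facebook", "paid_social"), ("newsletter", "email")}

def canonicalize_utm(source, medium) -> dict:
    # Lowercase, trim, and strip surrounding quotes before mapping synonyms.
    src = (source or "").strip().strip('"').lower()
    med = (medium or "").strip().strip('"').lower()
    src = SOURCE_MAP.get(src, src)
    med = MEDIUM_MAP.get(med, med)
    return {
        "utm_source_canonical": src or None,
        "utm_medium_canonical": med or None,
        "utm_pair_valid": (src, med) in ALLOWED_PAIRS,
    }

# canonicalize_utm("AdWords", "PPC")
# -> {"utm_source_canonical": "google", "utm_medium_canonical": "cpc", "utm_pair_valid": True}
```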

6) Channel and creative metadata validation
- Join fact events to dimension tables for channel, campaign, ad group, creative. Validate existence and foreign key integrity.
- Fix known legacy ID changes via a mapping table; flag orphaned IDs with no metadata.
- Validate one-to-many relationships: a creative_id should map to a single campaign/ad_group within the same platform/time window; flag violations.
- Enforce active date ranges: drop or flag events occurring outside the metadata’s valid period.

7) A/B experiment data quality
- Validate experiment assignment:
  - Check that each person_id is assigned to exactly one variant within an experiment; flag crossovers (saw both variants) and decide on handling (exclude or assign by first exposure).
  - Ensure stable assignment over time; detect re-randomization.
- Experiment timing:
  - Exclude events before experiment start or after end for that experiment.
  - Confirm exposure prior to outcome for intent-to-treat vs per-protocol definitions; tag compliance.
- Sample ratio mismatch (SRM) check:
  - Compare observed variant counts to the expected allocation using a chi-square test; flag significant mismatches (see the sketch after this list).
- Contamination control:
  - Remove internal traffic from experiment measurements (see bot/internal filter below).
  - Ensure variants do not share creatives or UTMs that could blur attribution.
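
A minimal SRM check sketch using a chi-square goodness-of-fit test (scipy.stats.chisquare); the 0.001 significance level is a common convention and an assumption here.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha: float = 0.001) -> dict:
    total = sum(observed_counts)
    expected = [r * total for r in expected_ratios]
    stat, p_value = chisquare(f_obs=observed_counts, f_exp=expected)
    return {"chi2": stat, "p_value": p_value, "srm_detected": p_value < alpha}

# srm_check([50400, 49600], [0.5, 0.5]) -> p ≈ 0.011, so no SRM at alpha = 0.001
```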

8) Bot, fraud, and internal traffic filtering
- User-agent filtering: remove known bot/crawler UA patterns; apply IAB/industry bot lists where available.
- IP and ASN filtering: exclude data center ranges and internal corporate IP ranges; use maintained lists.
- Behavioral heuristics:
  - Excessive clicks per minute, zero dwell time, impossible CTR (e.g., CTR > 1.0 for display), conversions occurring seconds after impression with no click where click is required.
  - Cookie-less sequences with high frequency across geographies.
- Platform fraud signals: use platform flags (invalid click, suspected bot) to exclude or down-weight.
- Document and tag exclusions for auditability.

9) Sessionization and sequencing
- Build sessions per person_id using inactivity threshold (commonly 30 minutes). Assign session_id and order within session.
- Sequence events: impression → click → landing page → downstream events → conversion; check for missing steps and tag inferred paths (e.g., view-through).
- Compute derived fields: time_to_click, time_to_convert, touchpoint_index.

10) Attribution-prep flags
- Create is_click_through and is_view_through flags for conversions.
- Assign attributed_touchpoint_id(s) based on chosen model (e.g., last-click within 7-day window, or position-based). For cleaning, ensure window boundaries and event ordering are correct and tag unattributed conversions.
- Retain both raw and modeled attribution fields.

11) Handling missing, malformed, and outliers
- Quantify missingness per critical field (IDs, timestamps, UTMs). Set thresholds for exclusion (e.g., drop events missing event_ts or event_type; retain events with missing utm_term).
- Do not impute keys or timestamps. For categorical fields used in grouping, assign “unknown” category rather than drop unless analysis requires otherwise.
- Outlier checks:
  - CTR, CVR by creative/channel outside reasonable bounds; flag and review for tracking errors.
  - Revenue outliers; check currency mismatches; standardize currency to a reporting currency with a dated FX rate.
- Remove duplicated transactions and extreme anomalies caused by tracking misfires.

12) Timezone, localization, and calendar normalization
- Standardize all event_ts to UTC; retain local_time and timezone where user-level analysis depends on local behavior.
- Handle daylight saving transitions carefully (use timezone-aware libraries).
- Align reporting calendars (ISO week, fiscal period) and store precomputed date keys.

13) Consent and privacy compliance
- Respect consent_flag and jurisdictional rules (GDPR/CCPA). Exclude or aggregate events without consent as required.
- Mask or hash PII; maintain salted hashes consistently across systems.
- Apply data retention policies; remove expired identifiers.

14) Quality assurance checks and metrics
- Duplicates rate per event type after dedup.
- Join rates to metadata dimensions; orphaned ID percentage.
- Missing rate per UTM field and proportion successfully backfilled.
- SRM p-value for experiments; crossover rate.
- Bot/internal exclusion share; monitor over time.
- Volume reconciliation against platform reports (impressions, clicks, spend, conversions) within acceptable variance.
- Lag distributions (impression→click, click→convert) to detect ingestion delays or ordering errors.

15) Outputs and documentation
- Produce a cleaned fact table with canonical fields:
  - event_id, event_type, event_ts_utc, person_id, session_id, platform, channel_id, campaign_id, ad_group_id, creative_id, url, referrer_url, utm_source_raw/canonical, utm_medium_raw/canonical, utm_campaign_raw/canonical, utm_content_raw/canonical, utm_term_raw/canonical, experiment_id, variant, order_id, revenue, currency, is_click_through, is_view_through, attributed_touchpoint_id, consent_flag, quality_flags (dedup, bot, timestamp_error, orphan_metadata, unattributed).
- Maintain a data dictionary covering field definitions, valid values, and cleaning rules.
- Version and log the cleaning pipeline; store anomaly reports for audit.

Implementation notes
- Prefer SQL for deterministic dedup and joins; use window functions for time-window de-duplication.
- Use robust URL parsing and timezone-aware datetime libraries.
- Keep original raw tables immutable; write cleaned tables with lineage columns (source_system, load_batch_id).

These steps establish consistent, analyzable data for attribution, channel/creative performance, and A/B experiment evaluation while preserving auditability and minimizing bias introduced by data quality issues.

Example 3

The following are recommended cleaning steps and implementation notes for multi-channel operations data (app/web/mini-program) where duplicate user identifiers and inconsistent field naming are present. The goal is a unified, traceable, reusable data foundation that supports downstream statistical analysis and attribution.

1. Overall flow and layering
- Raw layer: keep sources as-is; only basic decoding and partitioned landing.
- Staging layer: field naming, data types, timezones, event mapping, deduplication, and ID unification.
- Canonical layer: a unified entity model (users, events, orders, items, channels) with primary/foreign key joins and line-item splitting.
- Mart layer: analysis-facing wide tables and aggregates (conversion funnels, retention, attribution, LTV, etc.).
- Increments and backfills: idempotent loading based on event-time partitions (event_date) plus primary-key dedup.

2. Field naming and data type standardization
- Establish a unified naming convention (lowercase, snake_case, English, stable semantics). Suggested core fields:
  - Common: source (app/web/mp), channel, event_id (generate if missing), event_name, event_ts_utc, event_local_ts, ingest_ts, user_key (unified user key), session_id, device_id, cookie_id, openid, unionid, login_id (account), page_url, page_referrer, screen_name, app_version, os, browser, geo_country/region/city, utm_source/medium/campaign/content/term, event_params (JSON).
  - Orders: order_id_global, source_order_id, order_ts_utc, user_key, order_status, payment_status, total_amount, currency, discount_amount, shipping_fee, tax_amount, line_items (split into a child table: order_id_global, sku_id, product_id, qty, unit_price, currency).
- Unify data types: timestamps in UTC (seconds or milliseconds, applied consistently), numeric columns with explicit precision (amounts as decimal(18,2)), booleans normalized to true/false.
- Field

Who it is for

Data analysts

Quickly produce a cleaning plan and execution order, unify definitions, shorten preparation time, and improve the accuracy and stability of models and reports.

Growth/marketing managers

Clean advertising and behavioral data, correct tracking and naming conventions, and make A/B tests and ROI evaluations more trustworthy, so budgets and creatives can be optimized.

Product operations

Consolidate multi-channel data, deduplicate and standardize fields, stabilize dashboards, locate the root causes of abnormal fluctuations, and adjust operations in time.

Business/strategy consultants

Deliver client-grade cleaning SOPs on tight projects, safeguard data credibility, reduce the risk in proposals and decisions, and build client trust.

Researchers and academic assistants

Standardize experimental data preprocessing, improve reproducibility and the reliability of conclusions, shorten data preparation cycles, and speed up publication.

BI and data engineering teams

Turn the recommendations into executable steps and validation checklists, reduce rework, and speed up delivery of reports and data products.

Problems it solves

Makes the path from messy data to usable data clear, fast, and reliable: given a brief description of your dataset and your analysis goal, the prompt immediately generates an expert-level, prioritized checklist of data cleaning steps covering missing and abnormal value handling, duplicate record merging, field consistency checks, time and encoding normalization, group-level validation, and sampled re-checks. It focuses on practical execution and concrete results, helping you shorten preparation time, improve data credibility, and reduce rework, so that newcomers can work to the standard of a senior analyst, and it can be quickly reused and extended across e-commerce, growth operations, marketing CRM, event tracking, experimentation, and reporting scenarios.

Feature summary

Generates a personalized cleaning checklist in one step, with executable steps and priorities tailored to the data.
Automatically identifies missing values, duplicates, and outliers, and proposes fixes and fallback strategies.
Easily produces reusable SOPs that state the purpose and output of each step, lowering communication and execution costs.
Adapts the cleaning strategy to the business scenario (e.g., marketing, operations, risk control) to improve metric credibility.
Provides structured explanations and examples, helping newcomers get started and experienced analysts work faster.
Supports multilingual output with a consistent style, making cross-team collaboration and external delivery easier.
Maps steps to operation paths in common tools, reducing trial and error and speeding up execution.
Generates quality checklists and visualization suggestions to keep data usable and strengthen analytical persuasiveness.

How to use the purchased prompt template

1. Use it directly in an external chat app

Copy the prompt generated by the template into your preferred chat app (such as ChatGPT or Claude) and start the conversation directly, with no extra development. Suitable for quick personal trials and lightweight use.

2. Publish it as an API endpoint

Turn the prompt template into an API: your program can modify the template parameters freely and call it through the endpoint, enabling automation and batch processing. Suitable for developer integration and embedding in business systems.

3. Configure it in an MCP client

Configure the corresponding server address in your MCP client so that your AI application can invoke the prompt template automatically. Suitable for advanced users and team collaboration, letting prompts work seamlessly across different AI tools.

¥3.00
The platform offers a free trial so you can confirm the results meet your expectations before purchasing.

What you get after purchase

The full prompt template
- 250 tokens in total
- 2 adjustable parameters
{ 数据集简述 } (dataset brief), { 输出语言 } (output language)
Automatically added to "My Prompt Library"
- Prompt optimizer support
- Versioned management
Community-shared application cases (limited-time free)
