Recommended Data Cleaning Steps

Updated Sep 24, 2025

Provides professional data cleaning recommendations that resolve data problems accurately and efficiently.

The following cleaning steps target user behavior log data containing views, clicks, purchases, session IDs, timestamps, and join keys to the item and user tables. They cover field-level normalization, event-level processing, session reconstruction, dimension-table consistency checks, anomaly filtering, and quality monitoring. Adjust thresholds and rules to fit your business.

1. Schema and Field Type Validation

  • Define required fields explicitly: event_type, timestamp, and at least one identifier (user_id or session_id). Purchase events must carry order_id, item_id, price, and quantity.
  • Type normalization (see the sketch after this list):
    • user_id, item_id, session_id: unify as string or integer; make sure leading zeros are not lost (e.g., store as strings).
    • timestamp: unify to a parseable time type (ISO 8601), retaining millisecond precision.
    • price, quantity: cast to numeric (float/integer).
    • event_type: unify to a standard enum (e.g., view, click, add_to_cart, purchase).
  • Range and validity checks:
    • timestamp must fall within the data collection window; filter clearly invalid times (e.g., earlier than system launch, or later than the current time by more than a reasonable buffer such as 72 hours).
    • quantity ≥ 1; price ≥ 0; cap extreme outliers and flag them.
  • String cleanup: trim whitespace, normalize case, and remove invisible characters; normalize source fields (e.g., utm, referrer) by unifying case and stripping invalid prefixes.
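
A minimal pandas sketch of the type normalization and validity checks above, assuming a raw DataFrame with the column names listed (the allowed-enum set and flag name follow the text):

```python
import pandas as pd

def normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Cast core fields to canonical types and flag invalid rows."""
    df = df.copy()
    # IDs as strings so leading zeros survive
    for col in ["user_id", "item_id", "session_id"]:
        df[col] = df[col].astype("string").str.strip()
    # Timestamps parsed to timezone-aware UTC; unparsable values become NaT
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
    # Numeric casts; unparsable values become NaN for later handling
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    # Canonical event_type enum
    df["event_type"] = df["event_type"].astype("string").str.strip().str.lower()
    allowed = {"view", "click", "add_to_cart", "purchase"}
    df["validity_flag"] = (
        df["event_type"].isin(allowed)
        & df["timestamp"].notna()
        & (df["quantity"].fillna(1) >= 1)
        & (df["price"].fillna(0) >= 0)
    )
    return df
```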

2. Timestamp and Timezone Standardization

  • Standardize the timezone to UTC; if local timezones exist, convert to UTC using the user or site timezone mapping.
  • Handle unit differences: normalize seconds and milliseconds to milliseconds; when one event carries multiple time fields, pick an authoritative field explicitly (e.g., server_ts over client_ts), as sketched below.
  • Order events within a session by timestamp ascending; for multiple events in the same millisecond, keep the original arrival-order field (ingested_at) as a tiebreaker.
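
A hedged sketch of unifying second- vs. millisecond-epoch timestamps and preferring server_ts over client_ts, continuing the events DataFrame `df` from above (the 1e11 cutoff for telling seconds from milliseconds is an assumption):

```python
import pandas as pd

def to_utc(epoch: pd.Series) -> pd.Series:
    """Treat epochs below 1e11 as seconds, otherwise as milliseconds."""
    epoch = pd.to_numeric(epoch, errors="coerce")
    ms = epoch.where(epoch >= 1e11, epoch * 1000)  # seconds -> milliseconds
    return pd.to_datetime(ms, unit="ms", utc=True)

# server_ts is the authoritative clock; fall back to client_ts when missing
df["event_time_utc"] = to_utc(df["server_ts"]).fillna(to_utc(df["client_ts"]))
# Stable ordering: timestamp first, arrival time as the same-millisecond tiebreaker
df = df.sort_values(["session_id", "event_time_utc", "ingested_at"])
```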

3. Event Type Normalization and Semantic Consistency

  • Normalize event_type: map synonyms and variants (e.g., page_view → view, tap → click, checkout → a pre-purchase stage).
  • Standardize purchase event status: distinguish order placed, payment succeeded, cancelled, refunded, etc., in a unified field (order_status); never count unpaid events as purchases.
  • Remove or flag test/rehearsal events: filter traffic from test accounts, internal IPs, specific user agents, sandbox environments, and clearly anomalous source parameters (e.g., utm_source=test).

4. Deduplication and Idempotency

  • Event-level dedup:
    • Prefer a unique event key (event_id). When it is missing, use a composite key: user_id + session_id + item_id + event_type + timestamp (with a time tolerance such as ±1 s) + key context fields (page, device).
    • Use a window function ordered by arrival time (ingested_at) to keep the first record and drop the rest (sketched after this list).
  • Order-level dedup: dedup on order_id + item_id to avoid duplicate payments and duplicate reporting; keep partial shipments and split orders, but merge their amount and quantity logic.
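
A sketch of the window-style event dedup in pandas, keeping the first arrival per composite key. Flooring timestamps to whole seconds only approximates the ±1 s tolerance; the key columns follow the text:

```python
import pandas as pd

# Bucket timestamps to whole seconds to approximate the ±1 s tolerance
df["ts_bucket"] = df["event_time_utc"].dt.floor("1s")
dedup_key = ["user_id", "session_id", "item_id", "event_type",
             "ts_bucket", "page", "device"]

# Equivalent of ROW_NUMBER() OVER (PARTITION BY key ORDER BY ingested_at) = 1
df = df.sort_values("ingested_at")
df["dedup_flag"] = df.duplicated(subset=dedup_key, keep="first")
events_clean = df.loc[~df["dedup_flag"]].drop(columns="ts_bucket")
```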

5. Missing Value Handling

  • Missing user_id: keep as an anonymous user (e.g., user_id=null, assign guest_user=true), but require session_id to be present; downstream analysis must distinguish anonymous from registered users.
  • Missing or abnormal session_id: rebuild sessions based on an inactivity threshold (split after 30 minutes without activity by default), generate new session_id values, and record a reconstruction flag (see the sketch after this list).
  • Missing item_id:
    • View/click events: may be kept but flagged as missing; restrict their use in item-level analysis.
    • Purchase events: if the item cannot be identified, drop the event or route it to manual review.
  • Missing timestamp: drop the event, or substitute the arrival time and flag it as low confidence.
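
For the session rebuild in the second bullet, a minimal pandas sketch that splits on a 30-minute inactivity gap per user and tags rebuilt sessions (the "-s" suffix scheme is an assumption):

```python
import pandas as pd

GAP = pd.Timedelta(minutes=30)  # configurable inactivity threshold

df = df.sort_values(["user_id", "event_time_utc"])
new_session = df.groupby("user_id")["event_time_utc"].diff() > GAP
session_no = new_session.groupby(df["user_id"]).cumsum()  # bumps at each gap
rebuilt = df["user_id"] + "-s" + session_no.astype(int).astype(str)

df["reconstructed_session_flag"] = df["session_id"].isna()
df["session_id"] = df["session_id"].fillna(rebuilt)
```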

6. Session Consistency and Repair

  • Session ownership check: a given session_id should belong to a single user_id; rename or split session_id values reused across users.
  • Session length and boundaries:
    • Filter or split extremely long sessions (e.g., >12 hours); flag anomalies likely caused by hung clients or heartbeat events.
    • Disallow sessions with implausible drift across dates or timezones; partition by UTC date.
  • Event sequence sanity check: e.g., a purchase should be preceded by at least one view/click; flag sequences that violate this.

7. Dimension Table Joins and Referential Integrity

  • User dimension (users):
    • Verify user_id exists; flag events with missing or invalid user_id as unknown_user; keep them when needed to support anonymous analysis.
    • Handle slowly changing dimensions (SCD): join user attributes by their validity period at event time to avoid attributes leaking across time (e.g., membership tier changes).
  • Item dimension (items):
    • Verify item_id exists and was valid at event time; flag delisted or expired items with a status.
    • Price validation: check purchase event price against the item or order price in effect at that time; when promotional or post-coupon prices exist, treat the order price as authoritative.
  • Foreign key cleaning policy: keep events with unresolvable foreign keys but mark referential_flag=false; prioritize manual review or backfill for purchase events, while tolerating a bounded share of unknowns for view events.

8. Numeric and Currency Standards

  • Currency unification: convert price to a single settlement currency (e.g., USD/CNY), retaining the original currency and exchange rate; mark prices with missing currency as non-comparable.
  • Amount consistency: check that price × quantity matches the line-item total (see the sketch after this list); settle on a single convention for tax-inclusive vs. tax-exclusive amounts.
  • Outlier detection: identify implausibly low or high prices and oversized quantities; record values beyond thresholds as anomalies rather than deleting them, for later audit.
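
A sketch of the line-amount consistency check, assuming a hypothetical `line_amount` column for the line-item total and a small tolerance for rounding noise:

```python
import numpy as np

TOL = 0.01  # tolerance for rounding noise; line_amount is an assumed column
mismatch = ~np.isclose(df["price"] * df["quantity"], df["line_amount"], atol=TOL)
df.loc[mismatch, "validity_flag"] = False  # flag for audit rather than delete
```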

9. Bot, Anomalous Traffic, and Noise Filtering

  • Bot identification: use user_agent (known crawler lists), event frequency (e.g., multiple clicks per second), access path patterns, invalid referrers, and IP blocklists; set bot_flag and exclude by default (sketched after this list).
  • Internal traffic: filter corporate office network ranges, test devices, and QA accounts; keep samples for quality evaluation but exclude them from business analysis.
  • Spike/attack detection: flag and quarantine short bursts of abnormally high concurrency.
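
A hedged sketch of the user-agent and frequency heuristics; the UA substrings and the 5-events-per-second ceiling are illustrative assumptions, not a vetted rule set:

```python
# Known crawler substrings and a per-second event ceiling (both illustrative)
BOT_UA = ["bot", "spider", "crawler", "headless"]
ua_bot = df["user_agent"].str.lower().str.contains("|".join(BOT_UA), na=False)

# Events per user per wall-clock second
per_sec = df.groupby(["user_id", df["event_time_utc"].dt.floor("1s")])
per_sec_count = per_sec["event_type"].transform("size")

df["bot_flag"] = ua_bot | (per_sec_count > 5)  # threshold is an assumption
events_for_analysis = df.loc[~df["bot_flag"]]  # flagged rows kept for audit
```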

10. Order Event Consistency and Post-Sale Handling

  • Purchase events must carry order_id; align with the order fact table on payment status, cancellations, refunds, and partial refunds.
  • Record returns and refunds as a separate event type or as order_status changes; never count negative quantities toward positive purchases; maintain both net and gross metrics.
  • Merged and split orders: process multi-item orders line by line; dedup duplicate order records, keeping the latest status.

11. Output Data Model and Quality Flags

  • Event fact table (fact_events): event_id, user_id, session_id, item_id, event_type, event_time_utc, price, quantity, device, referrer, source, flags (dedup_flag, bot_flag, referential_flag, reconstructed_session_flag, validity_flag).
  • Session fact table (fact_sessions): session_id, user_id, session_start/end, event_count, duration, flags.
  • Order fact table (fact_orders): order_id, user_id, order_time, payment_status, currency, total_amount, refund_amount, flags.
  • Dimension snapshots: manage users_dim and items_dim by validity period to guarantee time-consistent joins.

12. Data Privacy and Compliance

  • Mask or hash PII (e.g., email, phone number, IP); retain only the anonymous identifiers needed for analysis.
  • Follow data retention policies and access controls; log processing lineage and changes.

13. Quality Monitoring Metrics and Thresholds (daily or per partition)

  • Completeness: missing rate of required fields (target <0.5%); unresolvable foreign key rate (views <3%, purchases <0.1%).
  • Uniqueness: event duplication rate (<0.2%); order duplication rate (≈0%).
  • Validity: timestamp anomaly rate; price/quantity violation rate.
  • Consistency: cross-user session rate; share of purchases with no preceding behavior.
  • Timeliness: share of late-arriving events; monitor via tiered reports with alert thresholds.
  • Sampled back-checks: compare random samples against raw logs to verify dedup and normalization correctness. A sketch computing several of these rates follows this list.
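
Most of these monitors reduce to simple per-partition aggregations. A sketch over the cleaned event table `df`, using the flags defined in section 11 (assumed to be boolean columns):

```python
import pandas as pd

def daily_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Completeness / uniqueness / consistency rates per UTC date partition."""
    g = df.groupby(df["event_time_utc"].dt.date)
    return pd.DataFrame({
        "missing_user_rate": g["user_id"].apply(lambda s: s.isna().mean()),
        "dup_event_rate": g["dedup_flag"].mean(),
        "orphan_fk_rate": g["referential_flag"].apply(lambda s: (~s).mean()),
        "bot_share": g["bot_flag"].mean(),
    })

report = daily_quality(df)
alerts = report[report["dup_event_rate"] > 0.002]  # 0.2% target from above
```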

Implementation Notes and Recommendations

  • Layered cleaning: Raw → Staging → Curated, tagging records at each layer to preserve traceability.
  • Incremental processing and idempotent design: use unique event keys with upserts to avoid reprocessing.
  • Configurable thresholds: manage the session inactivity threshold, time tolerance, price extremes, and bot rules through configuration.
  • Audit trail: store quality flags and cleaning reason codes per record to support later error analysis and rule tuning.

These steps map directly onto common data stacks (SQL/ETL/stream processing). Depending on business needs, prioritize strong consistency and traceability for purchase-related events first, then refine view/click normalization and session quality, to improve the credibility of funnel and conversion analysis.

Recommended data cleaning steps for an advertising and behavior dataset (impressions/clicks/conversions, UTM parameters, channel and creative IDs, A/B experiment grouping)

  1. Schema normalization and typing
  • Define a canonical event schema with required fields: event_id, event_type (impression/click/conversion), event_ts, platform, ad_account_id, campaign_id, ad_group_id, creative_id, channel_id, url, referrer_url, utm_source, utm_medium, utm_campaign, utm_content, utm_term, user_id, device_id, cookie_id, session_id (if present), experiment_id, variant, order_id/transaction_id, revenue, currency, consent_flag.
  • Enforce data types:
    • Timestamps: parse to UTC ISO 8601, store as datetime with timezone; retain raw ingestion_ts for auditing.
    • IDs: cast to string, strip whitespace; if numeric IDs are expected, validate parsability and store canonical string representation to preserve leading zeros.
    • Monetary fields: numeric with explicit currency; validate non-negative.
    • Categorical: lowercase, trimmed, normalized encoding (UTF-8).
  • URL and parameter parsing:
    • Decode URLs (percent-encoding), extract query parameters to structured fields.
    • If UTM fields are missing, parse from landing/referrer URLs; preserve raw and parsed versions.
  2. Event integrity and validation
  • Validate allowed event_type values and remap common synonyms (e.g., “view” → impression; “purchase” → conversion) into the canonical set.
  • Timestamp sanity checks:
    • Drop or flag events with event_ts outside campaign or data collection windows.
    • Remove obvious clock errors (e.g., year far in past/future); if platform ingestion_ts exists, cap discrepancies with a defined tolerance (e.g., ±3 days) and flag.
  • Event ordering constraints:
    • Conversions must not precede the attributed click/impression for the same user within an attribution window; flag negative time-to-event as data errors.
    • Ensure click timestamps precede conversion timestamps; if not, investigate source-specific latency or duplicate conversion.
  3. Identity resolution
  • Construct a persistent person_id using deterministic keys in priority order: hashed_email > login user_id > device_id > cookie_id. Maintain a mapping table with versioning (see the sketch after this list).
  • Normalize device identifiers (consistent casing, remove separators for some IDs where applicable).
  • Remove or flag events lacking any identifier if user-level attribution is required; otherwise route to aggregate-only analyses.
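
A minimal sketch of the priority coalesce for person_id, assuming the identifier columns above exist on the event DataFrame `df` (hashing and mapping-table versioning are out of scope here):

```python
import pandas as pd

ID_PRIORITY = ["hashed_email", "user_id", "device_id", "cookie_id"]

def resolve_person_id(df: pd.DataFrame) -> pd.Series:
    """First non-null identifier in priority order; null means aggregate-only."""
    person = pd.Series(pd.NA, index=df.index, dtype="string")
    for col in ID_PRIORITY:
        person = person.fillna(df[col].astype("string").str.strip())
    return person

df["person_id"] = resolve_person_id(df)
df["aggregate_only"] = df["person_id"].isna()  # route these to aggregate analyses
```
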
  4. Deduplication
  • Enforce uniqueness on event_id where available; if missing, generate a synthetic key (hash of source_system, event_type, user_id/device_id, ad_id/creative_id, event_ts rounded to suitable precision).
  • Deduplicate within event type using keys and short time windows:
    • Impressions: drop exact duplicates; optionally compress bursts from the same ad server within milliseconds if a known duplication pattern exists.
    • Clicks: collapse repeated identical clicks within 1–2 seconds per user-creative to mitigate multi-fire (see the sketch after this list).
    • Conversions: dedupe by transaction_id/order_id and user_id; keep the first occurrence. If transaction_id missing, dedupe by (user_id, revenue, event_ts within N minutes) with conservative rules and flag.
  • Cross-source dedup:
    • If multiple platforms report the same conversion, prefer primary source-of-truth (ecommerce or CRM) and mark ad-platform conversions as secondary.
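
A sketch of the click burst collapse, keeping the first click of each burst and using 2 s from the 1–2 s range above (column names assumed from the canonical schema):

```python
import pandas as pd

clicks = df[df["event_type"] == "click"].sort_values(
    ["person_id", "creative_id", "event_ts"])
prev_ts = clicks.groupby(["person_id", "creative_id"])["event_ts"].shift()
refire = (clicks["event_ts"] - prev_ts) <= pd.Timedelta(seconds=2)
clicks_dedup = clicks.loc[~refire]  # first click of each burst survives
```
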
  5. UTM canonicalization
  • Normalize utm_* fields: lowercase, trim, remove surrounding quotes, decode URL-encoding.
  • Map synonyms into canonical values (see the sketch after this list):
    • Medium: map to controlled vocabulary (cpc, display, social, email, affiliate, referral, organic, paid_social, other). Examples: “ppc” → cpc; “paid social” → paid_social.
    • Source: unify common variants (e.g., “google”, “googleads”, “adwords” → google; “fb”, “facebook” → facebook).
    • Campaign/content/term: enforce naming conventions (no spaces if policy requires; replace illegal characters; trim long values; optionally split structured names into parts using agreed delimiters).
  • Validate UTM coherence:
    • Medium/source combinations must be allowed (e.g., cpc + google valid; email + google invalid).
    • Flag missing or inconsistent UTMs; backfill source/medium from channel_id/platform metadata where possible.
  • Preserve both original and canonical UTM fields to avoid loss of detail.
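
A sketch of the synonym mapping that keeps raw fields alongside canonical ones; the mapping dicts and the allowed source/medium pairs are small illustrative subsets:

```python
SOURCE_MAP = {"googleads": "google", "adwords": "google", "fb": "facebook"}
MEDIUM_MAP = {"ppc": "cpc", "paid social": "paid_social"}

for field, mapping in [("utm_source", SOURCE_MAP), ("utm_medium", MEDIUM_MAP)]:
    raw = df[field].str.strip().str.strip('"\'').str.lower()
    df[field + "_raw"] = df[field]                   # preserve the original
    df[field + "_canonical"] = raw.replace(mapping)  # map known synonyms

# Coherence: allowed source/medium pairs (illustrative subset)
ALLOWED = {("google", "cpc"), ("facebook", "paid_social")}
df["utm_coherent"] = [
    pair in ALLOWED
    for pair in zip(df["utm_source_canonical"], df["utm_medium_canonical"])
]
```
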
  6. Channel and creative metadata validation
  • Join fact events to dimension tables for channel, campaign, ad group, creative. Validate existence and foreign key integrity.
  • Fix known legacy ID changes via a mapping table; flag orphaned IDs with no metadata.
  • Validate one-to-many relationships: a creative_id should map to a single campaign/ad_group within the same platform/time window; flag violations.
  • Enforce active date ranges: drop or flag events occurring outside the metadata’s valid period.
  7. A/B experiment data quality
  • Validate experiment assignment:
    • Check that each person_id is assigned to exactly one variant within an experiment; flag crossovers (saw both variants) and decide on handling (exclude or assign by first exposure).
    • Ensure stable assignment over time; detect re-randomization.
  • Experiment timing:
    • Exclude events before experiment start or after end for that experiment.
    • Confirm exposure prior to outcome for intent-to-treat vs per-protocol definitions; tag compliance.
  • Sample ratio mismatch (SRM) check:
    • Compare observed variant counts to expected allocation using chi-square; flag significant mismatch (see the sketch after this list).
  • Contamination control:
    • Remove internal traffic from experiment measurements (see bot/internal filter below).
    • Ensure variants do not share creatives or UTMs that could blur attribution.
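
A sketch of the SRM check using a chi-square goodness-of-fit test via `scipy.stats.chisquare`; a 50/50 allocation and the 0.001 alarm threshold are assumptions:

```python
from scipy.stats import chisquare

counts = df.groupby("variant")["person_id"].nunique()  # one row per variant
expected = [counts.sum() / len(counts)] * len(counts)  # assumed equal split
stat, p_value = chisquare(f_obs=counts.to_numpy(), f_exp=expected)
srm_detected = p_value < 0.001  # common alarm threshold for SRM
```
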
  8. Bot, fraud, and internal traffic filtering
  • User-agent filtering: remove known bot/crawler UA patterns; apply IAB/industry bot lists where available.
  • IP and ASN filtering: exclude data center ranges and internal corporate IP ranges; use maintained lists.
  • Behavioral heuristics:
    • Excessive clicks per minute, zero dwell time, impossible CTR (e.g., CTR > 1.0 for display), conversions occurring seconds after impression with no click where click is required.
    • Cookie-less sequences with high frequency across geographies.
  • Platform fraud signals: use platform flags (invalid click, suspected bot) to exclude or down-weight.
  • Document and tag exclusions for auditability.
  9. Sessionization and sequencing
  • Build sessions per person_id using inactivity threshold (commonly 30 minutes). Assign session_id and order within session.
  • Sequence events: impression → click → landing page → downstream events → conversion; check for missing steps and tag inferred paths (e.g., view-through).
  • Compute derived fields: time_to_click, time_to_convert, touchpoint_index.
  10. Attribution-prep flags
  • Create is_click_through and is_view_through flags for conversions.
  • Assign attributed_touchpoint_id(s) based on the chosen model (e.g., last-click within a 7-day window, or position-based). For cleaning, ensure window boundaries and event ordering are correct and tag unattributed conversions (see the sketch after this list).
  • Retain both raw and modeled attribution fields.
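
A sketch of last-click attribution within a 7-day lookback using a backward as-of merge (pandas `merge_asof`; column names assumed from the canonical schema):

```python
import pandas as pd

cols = ["person_id", "event_ts", "event_id"]
clicks = df.loc[df["event_type"] == "click", cols].sort_values("event_ts")
convs = df.loc[df["event_type"] == "conversion", cols].sort_values("event_ts")

# For each conversion, the latest click by the same person within 7 days
attributed = pd.merge_asof(
    convs,
    clicks.rename(columns={"event_id": "attributed_touchpoint_id"}),
    on="event_ts",
    by="person_id",
    direction="backward",
    tolerance=pd.Timedelta(days=7),
)
attributed["is_click_through"] = attributed["attributed_touchpoint_id"].notna()
```
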
  11. Handling missing, malformed, and outlier values
  • Quantify missingness per critical field (IDs, timestamps, UTMs). Set thresholds for exclusion (e.g., drop events missing event_ts or event_type; retain events with missing utm_term).
  • Do not impute keys or timestamps. For categorical fields used in grouping, assign “unknown” category rather than drop unless analysis requires otherwise.
  • Outlier checks:
    • CTR, CVR by creative/channel outside reasonable bounds; flag and review for tracking errors.
    • Revenue outliers; check currency mismatches; standardize currency to a reporting currency with a dated FX rate.
  • Remove duplicated transactions and extreme anomalies caused by tracking misfires.
  12. Timezone, localization, and calendar normalization
  • Standardize all event_ts to UTC; retain local_time and timezone where user-level analysis depends on local behavior.
  • Handle daylight saving transitions carefully (use timezone-aware libraries).
  • Align reporting calendars (ISO week, fiscal period) and store precomputed date keys.
  13. Consent and privacy compliance
  • Respect consent_flag and jurisdictional rules (GDPR/CCPA). Exclude or aggregate events without consent as required.
  • Mask or hash PII; maintain salted hashes consistently across systems.
  • Apply data retention policies; remove expired identifiers.
  14. Quality assurance checks and metrics
  • Duplicates rate per event type after dedup.
  • Join rates to metadata dimensions; orphaned ID percentage.
  • Missing rate per UTM field and proportion successfully backfilled.
  • SRM p-value for experiments; crossover rate.
  • Bot/internal exclusion share; monitor over time.
  • Volume reconciliation against platform reports (impressions, clicks, spend, conversions) within acceptable variance.
  • Lag distributions (impression→click, click→convert) to detect ingestion delays or ordering errors.
  15. Outputs and documentation
  • Produce a cleaned fact table with canonical fields:
    • event_id, event_type, event_ts_utc, person_id, session_id, platform, channel_id, campaign_id, ad_group_id, creative_id, url, referrer_url, utm_source_raw/canonical, utm_medium_raw/canonical, utm_campaign_raw/canonical, utm_content_raw/canonical, utm_term_raw/canonical, experiment_id, variant, order_id, revenue, currency, is_click_through, is_view_through, attributed_touchpoint_id, consent_flag, quality_flags (dedup, bot, timestamp_error, orphan_metadata, unattributed).
  • Maintain a data dictionary covering field definitions, valid values, and cleaning rules.
  • Version and log the cleaning pipeline; store anomaly reports for audit.

Implementation notes

  • Prefer SQL for deterministic dedup and joins; use window functions for time-window de-duplication.
  • Use robust URL parsing and timezone-aware datetime libraries.
  • Keep original raw tables immutable; write cleaned tables with lineage columns (source_system, load_batch_id).

These steps establish consistent, analyzable data for attribution, channel/creative performance, and A/B experiment evaluation while preserving auditability and minimizing bias introduced by data quality issues.

The following are recommended cleaning steps and implementation notes for multi-channel operations data (app/web/mini-program) where duplicate user identifiers and inconsistent field naming are present. The goal is a unified, traceable, reusable data foundation that supports downstream statistical analysis and attribution.

1. Overall Flow and Layering

  • Raw layer: preserve sources as-is; perform only basic decoding and partitioned landing.
  • Staging layer: field naming, data types, timezones, event mapping, deduplication, ID unification.
  • Canonical layer: a unified entity model (users, events, orders, items, channels) with primary/foreign key joins and line-item splitting completed.
  • Mart layer: analysis-facing wide and aggregate tables (conversion funnels, retention, attribution, LTV, etc.).
  • Increments and backfills: idempotent loads based on event-time partitioning (event_date) plus primary-key dedup (sketched below).
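
A minimal pure-pandas stand-in for that idempotent load: overwrite the partitions touched by the batch, then dedup on the primary key so replays and backfills are safe (a warehouse MERGE would play the same role; `event_date` and `event_id` columns assumed):

```python
import pandas as pd

def idempotent_load(existing: pd.DataFrame, batch: pd.DataFrame) -> pd.DataFrame:
    """Overwrite the batch's event_date partitions, then dedup on event_id."""
    parts = batch["event_date"].unique()
    kept = existing[~existing["event_date"].isin(parts)]  # untouched partitions
    merged = pd.concat([kept, batch], ignore_index=True)
    # Replaying the same batch reproduces the same table (idempotent)
    return merged.drop_duplicates(subset="event_id", keep="last")
```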

2. Field Naming and Data Type Standardization

  • Establish a unified naming convention (lowercase, underscores, English, stable semantics). Suggested core fields:
    • Common: source (app/web/mp), channel (traffic channel/media), event_id (generate if missing), event_name, event_ts_utc, event_local_ts, ingest_ts, user_key (unified user key), session_id, device_id, cookie_id, openid, unionid, login_id (account), page_url, page_referrer, screen_name, app_version, os, browser, geo_country/region/city, utm_source/medium/campaign/content/term, event_params (JSON).
    • Orders: order_id_global, source_order_id, order_ts_utc, user_key, order_status, payment_status, total_amount, currency, discount_amount, shipping_fee, tax_amount, line_items (split into a child table: order_id_global, sku_id, product_id, qty, unit_price, currency).
  • Unify data types: timestamps as UTC epochs (consistently seconds or milliseconds), explicit precision for numeric columns (decimal(18,2) for amounts), booleans normalized to true/false.
  • Field

Example Details

Problems Solved

Makes the path from messy data to usable data clear, fast, and reliable: given a brief description of your dataset and your analysis goals, the prompt instantly generates an expert-grade, prioritized checklist of data cleaning steps covering missing value and outlier handling, duplicate record merging, field consistency validation, time and encoding normalization, group-level verification, and sampled re-checks. It focuses on practical execution and results, helping you shorten preparation time, improve data credibility, and reduce rework, so that even newcomers can work to the standard of a senior analyst, and it can be quickly reused and extended across e-commerce, growth operations, marketing CRM, log instrumentation, experimentation, and reporting scenarios.

Target Users

Data Analysts

Quickly draft cleaning plans and execution order, unify definitions, shorten preparation time, and improve the accuracy and stability of models and reports.

Growth / Marketing Managers

Clean advertising and behavior data, fix instrumentation and naming conventions, and make A/B tests and ROI evaluations more trustworthy, so budgets and creatives can be optimized.

Product Operations

Consolidate multi-channel data, dedup and standardize fields, stabilize dashboards, locate the root causes of anomalies, and adjust operational actions in time.

Feature Summary

Generates a personalized cleaning checklist in one step, with executable, prioritized actions tailored to your data.
Automatically identifies missing, duplicate, and outlier issues, and suggests fixes and fallback strategies.
Easily produces reusable SOPs that state each step's purpose and output, cutting communication and execution costs.
Adapts the cleaning strategy to the business scenario (e.g., marketing, operations, risk control) to improve metric credibility.
Provides structured explanations and examples, helping newcomers ramp up and experienced staff work faster.
Supports multi-language output with a consistent style, easing cross-team collaboration and external delivery.
Maps steps to operation paths in common tools, reducing trial and error and speeding up execution.
Generates quality checklists and visualization suggestions, safeguarding data usability and strengthening analytical persuasiveness.

How to Use the Purchased Prompt Template

1. Use it directly in an external chat app

Copy the prompt generated by the template into your preferred chat app (e.g., ChatGPT, Claude) and start the conversation directly, with no extra development. Suited to quick personal trials and lightweight use.

2. Publish it as an API endpoint

Turn the prompt template into an API: your program can modify template parameters freely and call it through the interface, enabling automation and batch processing. Suited to developer integration and embedding in business systems.

3. Configure it in an MCP client

Configure the corresponding server address in an MCP client so your AI application invokes the prompt template automatically. Suited to advanced users and team collaboration, letting prompts move seamlessly between AI tools.

AI Prompt Price
¥15.00
Try before you buy; pay only after it works for you.

What You Get After Purchase

The complete prompt template
- 250 tokens in total
- 2 adjustable parameters
{ 数据集简述 } { 输出语言 }
Usage rights to community-contributed content
- Curated community examples to help you get started with the prompt quickly
Free for a limited time
