Recommended Data Cleaning Steps

幂简 Official

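Updated Sep 24, 2025

Data processing, text-to-text

Provides professional data cleaning advice to resolve data problems accurately and efficiently.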

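The following cleaning steps target user behavior log data containing views, clicks, purchases, session IDs, timestamps, and join keys to the item and user tables. They cover field-level normalization, event-level processing, session reconstruction, dimension-table consistency checks, anomaly filtering, and quality monitoring. Adjust thresholds and rules to fit the business.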

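1. Schema and field type validation

Define required fields explicitly: event_type, timestamp, and at least one identifier (user_id or session_id). Purchase events also require order_id, item_id, price, and quantity.
Type normalization:
- user_id, item_id, session_id: store uniformly as strings or integers, and make sure leading zeros are not lost (for example, by storing them as strings).
- timestamp: convert to a parseable time type (ISO 8601) and keep millisecond precision.
- price, quantity: convert to numeric types (float/integer).
- event_type: map to a standard enumeration (such as view, click, add_to_cart, purchase).
Range and validity checks:
- timestamp must fall within the data collection window; filter out clearly invalid times (for example, earlier than the system launch, or later than the current time by more than a reasonable buffer such as 72 hours).
- quantity ≥ 1; price ≥ 0; set upper bounds for extreme values and flag records that exceed them.
String cleanup: trim whitespace, unify case, and strip invisible characters; normalize source fields (such as utm and referrer) by unifying case and removing invalid prefixes.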
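The required-field and range checks in this step could be sketched as a single validation function. This is a minimal illustration, not a complete validator: the reason-code names and the `validate_event` helper are assumptions made for the example, and the 72-hour buffer is the example value from the text.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_EVENT_TYPES = {"view", "click", "add_to_cart", "purchase"}
PURCHASE_FIELDS = ("order_id", "item_id", "price", "quantity")


def validate_event(event: dict, now: datetime, max_future_hours: int = 72) -> list:
    """Return a list of reason codes; an empty list means the record passed."""
    problems = []
    if not event.get("event_type") or event["event_type"] not in ALLOWED_EVENT_TYPES:
        problems.append("bad_event_type")
    # At least one identifier is required.
    if not (event.get("user_id") or event.get("session_id")):
        problems.append("missing_identifier")
    ts = event.get("timestamp")
    if ts is None:
        problems.append("missing_timestamp")
    elif ts > now + timedelta(hours=max_future_hours):
        problems.append("timestamp_in_future")
    # Purchase events carry extra required fields and numeric bounds.
    if event.get("event_type") == "purchase":
        for field in PURCHASE_FIELDS:
            if event.get(field) is None:
                problems.append(f"missing_{field}")
        if event.get("quantity") is not None and event["quantity"] < 1:
            problems.append("bad_quantity")
        if event.get("price") is not None and event["price"] < 0:
            problems.append("bad_price")
    return problems
```

Returning reason codes instead of a boolean keeps the record available for flagging and audit rather than forcing an immediate drop.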

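2. Timestamp and time zone standardization

Standardize on UTC; where local time zones are present, convert to UTC using the user's or site's time zone mapping.
Handle unit differences: convert seconds and milliseconds to milliseconds; when one event carries several time fields, designate the authoritative one (for example, server_ts takes precedence over client_ts).
Ordering events within a session: sort by timestamp ascending; for events within the same millisecond, keep the original arrival-order field (ingested_at) as a tiebreaker.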
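As a rough sketch of the unit handling and ordering described here, the snippet below treats epoch values under 1e11 as seconds and everything else as milliseconds. That cutoff is a heuristic assumed for illustration (it misreads millisecond values from before 1973), and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def to_epoch_ms(raw: float) -> int:
    """Heuristic: values below 1e11 are read as seconds, otherwise as milliseconds."""
    value = float(raw)
    if value < 1e11:
        return int(round(value * 1000))
    return int(round(value))


def pick_event_time(event: dict) -> datetime:
    """Prefer server_ts over client_ts and return a timezone-aware UTC datetime."""
    raw = event.get("server_ts")
    if raw is None:
        raw = event["client_ts"]
    return datetime.fromtimestamp(to_epoch_ms(raw) / 1000, tz=timezone.utc)


def order_session_events(events: list) -> list:
    """Sort by normalized timestamp, using ingested_at to break same-millisecond ties."""
    return sorted(events, key=lambda e: (to_epoch_ms(e["server_ts"]), e["ingested_at"]))
```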

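3. Event type normalization and semantic consistency

Normalize event_type: map synonyms and variants (for example, page_view → view, tap → click, checkout → a pre-purchase stage).
Standardize purchase event status: distinguish order placed, payment succeeded, cancelled, refunded, and so on in a single field (order_status), so that unpaid events are not counted as purchases.
Remove or flag test/drill events: filter events from test accounts, internal IPs, specific user agents, sandbox environments, and clearly abnormal source parameters (such as utm_source=test).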

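4. Deduplication and idempotency

Event-level deduplication:
- Prefer the unique event key (event_id). When it is missing, use a composite key: user_id + session_id + item_id + event_type + timestamp (with a time tolerance such as ±1s) + key context fields (page, device).
- Use a window function ordered by arrival time (ingested_at) to keep the first record and drop the remaining duplicates.
Order-level deduplication: deduplicate on order_id + item_id to avoid double payments and repeated reporting; records from partial shipments or split orders must be kept, with logic to merge their amounts and quantities.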
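The event-level rule (event_id first, otherwise a composite key with a time tolerance, keeping the earliest arrival) might look like the following in plain Python. It is a sketch under assumptions: `ts_ms` is a hypothetical field holding the event time in milliseconds, and a SQL window function would normally replace the in-memory loop at scale.

```python
KEY_FIELDS = ("user_id", "session_id", "item_id", "event_type", "page", "device")


def dedupe_events(events: list, tolerance_ms: int = 1000) -> list:
    """Keep the first-arriving record per event_id, or per composite key within the tolerance."""
    seen_ids = set()
    kept_ts = {}  # composite key -> timestamps (ms) of records already kept
    kept = []
    for ev in sorted(events, key=lambda e: e["ingested_at"]):
        event_id = ev.get("event_id")
        if event_id is not None:
            if event_id in seen_ids:
                continue  # exact duplicate by unique key
            seen_ids.add(event_id)
            kept.append(ev)
            continue
        key = tuple(ev.get(field) for field in KEY_FIELDS)
        previous = kept_ts.setdefault(key, [])
        if any(abs(ev["ts_ms"] - ts) <= tolerance_ms for ts in previous):
            continue  # same composite key within the time tolerance
        previous.append(ev["ts_ms"])
        kept.append(ev)
    return kept
```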

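5. Missing value handling

Missing user_id: keep the event as an anonymous user (for example, user_id=null with guest_user=true), but make sure session_id is present; downstream analysis must distinguish anonymous from registered users.
Missing or abnormal session_id: rebuild sessions using an inactivity threshold (by default, split after 30 minutes without activity), generate a new session_id, and record a reconstruction flag.
Missing item_id:
- View/click events: may be kept but flagged as missing, with limited use in item-level analysis.
- Purchase events: if the item cannot be identified, drop the event or route it to manual review.
Missing timestamp: drop the event, or substitute the arrival time and flag it as low confidence.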
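Session reconstruction by inactivity threshold can be illustrated with a short pass over time-ordered events. The sketch assumes each event already has some stable grouping key (`user_key`, a hypothetical name that could be a user, device, or cookie ID) and a millisecond timestamp `ts_ms`; the generated ID format is also an assumption.

```python
def rebuild_sessions(events: list, gap_ms: int = 30 * 60 * 1000) -> list:
    """Assign new session IDs, starting a new session after gap_ms of inactivity."""
    last_ts = {}   # user_key -> timestamp of that user's previous event
    counter = {}   # user_key -> number of sessions opened so far
    out = []
    for ev in sorted(events, key=lambda e: (e["user_key"], e["ts_ms"])):
        user = ev["user_key"]
        if user not in last_ts or ev["ts_ms"] - last_ts[user] > gap_ms:
            counter[user] = counter.get(user, 0) + 1
        last_ts[user] = ev["ts_ms"]
        out.append({
            **ev,
            "session_id": f"{user}-r{counter[user]}",
            "reconstructed_session_flag": True,  # marks the ID as rebuilt, not original
        })
    return out
```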

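6. Session consistency and repair

Session ownership check: a session_id should belong to a single user_id; when a session_id is reused across users, rename or split it.
Session length and boundaries:
- Filter or split extremely long sessions (for example, >12 hours); flag anomalies that may be caused by suspended apps or heartbeat events.
- Disallow sessions that drift unreasonably across dates or time zones; partition by UTC date.
Event sequence plausibility check: for example, a purchase should be preceded by at least one view/click; flag sequences that do not satisfy this as anomalous.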

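7. Dimension table joins and referential integrity

User dimension table (users):
- Verify that user_id exists; flag events with a missing or invalid user_id as unknown_user, and keep them where needed to support anonymous analysis.
- Handle slowly changing dimensions (SCD): join user attributes by the validity period that contains the event time, to avoid attribute leakage across time (for example, membership tier changes).
Item dimension table (items):
- Verify that item_id exists and was valid at event time; flag the status of delisted or invalid items.
- Price check: check the purchase event price against the item price or order price at that time; where promotional or post-coupon prices exist, treat the order price as authoritative.
Foreign key cleaning strategy: keep events whose foreign keys cannot be joined, but mark them with referential_flag=false; purchase events should be prioritized for manual review or backfill, while view events can tolerate a certain share of unknowns.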
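The SCD point, joining the attribute version that was in effect at event time, amounts to an as-of lookup. A minimal sketch, assuming each user's history is available as a list of (valid_from, value) pairs sorted by valid_from; the `attribute_at` helper is hypothetical.

```python
from bisect import bisect_right


def attribute_at(history: list, event_ts: int):
    """Return the attribute value in effect at event_ts, or None if none applies yet.

    history: list of (valid_from, value) tuples sorted ascending by valid_from.
    """
    starts = [valid_from for valid_from, _ in history]
    idx = bisect_right(starts, event_ts) - 1
    return history[idx][1] if idx >= 0 else None
```

For example, with a tier history of `[(0, "basic"), (100, "gold")]`, an event at time 50 resolves to the earlier tier rather than the user's current one.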

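8. Numeric and currency normalization

Currency unification: convert price to a single settlement currency (such as USD or CNY), keeping the original currency and exchange rate; flag prices with a missing currency as not comparable.
Amount consistency: check that price × quantity matches the line-item total; settle on one convention for tax-inclusive versus tax-exclusive amounts.
Extreme value and anomaly detection: identify clearly low or high prices and very large quantities; record values beyond the thresholds as anomalies rather than deleting them, so they remain available for audit.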
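The amount checks are easiest to get right with exact decimal arithmetic instead of floats. The following is an illustrative sketch: the function names, the 0.01 tolerance, and half-up rounding are assumptions, and any exchange rate passed in is supplied by the caller.

```python
from decimal import Decimal, ROUND_HALF_UP


def line_total_matches(price: str, quantity: int, line_total: str,
                       tolerance: str = "0.01") -> bool:
    """Check that price x quantity agrees with the reported line total."""
    expected = Decimal(price) * quantity
    return abs(expected - Decimal(line_total)) <= Decimal(tolerance)


def to_reporting_currency(amount: str, rate: str) -> Decimal:
    """Convert an amount using the given rate, rounded to two decimal places."""
    converted = Decimal(amount) * Decimal(rate)
    return converted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Amounts are passed as strings so that values such as 19.99 are not distorted by binary floating point before the comparison.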

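9. Bot, abnormal traffic, and noise filtering

Bot identification: based on user_agent (known crawler lists), event frequency (such as multiple clicks per second), access path patterns, invalid referrers, and IP blocklists; set bot_flag and exclude by default.
Internal traffic: filter company office network ranges, test devices, and QA accounts; keep samples for quality evaluation but exclude them from business analysis.
Spike/attack detection: flag and isolate abnormal sequences of short-term high concurrency.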

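10. Order event consistency and after-sales handling

Purchase events must carry an order_id; align with the order fact table on payment status, cancellation, refund, and partial refund.
Record returns/refunds as a separate event type or as an order_status change, so that negative quantities are not counted into forward purchases; maintain both net and gross metrics.
Merged and split orders: process orders with multiple items line by line; deduplicate repeatedly generated order records and keep the latest status.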

11. Output data model and quality flags

Event fact table (fact_events): event_id, user_id, session_id, item_id, event_type, event_time_utc, price, quantity, device, referrer, source, flags (dedup_flag, bot_flag, referential_flag, reconstructed_session_flag, validity_flag).
Session fact table (fact_sessions): session_id, user_id, session_start/end, event_count, duration, flags.
Order fact table (fact_orders): order_id, user_id, order_time, payment_status, currency, total_amount, refund_amount, flags.
Dimension snapshots: manage users_dim and items_dim by validity period to keep joins time-consistent.

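12. Data privacy and compliance

Mask or hash PII (such as email, phone number, IP); keep only the anonymous identifiers needed for analysis.
Follow data retention policies and access controls; record the processing lineage and change log.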

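13. Quality monitoring metrics and thresholds (daily or per partition)

Completeness: required-field missing rate (target <0.5%), unjoinable foreign key rate (views <3%, purchases <0.1%).
Uniqueness: event duplicate rate (<0.2%), order duplicate rate (≈0%).
Validity: timestamp anomaly rate, price/quantity violation rate.
Consistency: cross-user session rate, share of purchases with no preceding behavior.
Timeliness: share of late-arriving events; monitor with tiered reports and set alert thresholds.
Sampled back-checks: compare random samples against the raw logs to verify that deduplication and normalization are correct.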
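Two of these metrics, the required-field missing rate and the event duplicate rate, could be computed per partition and compared against the stated targets roughly as follows. The `quality_report` helper and its output shape are assumptions for illustration; the thresholds are the 0.5% and 0.2% targets from the list above.

```python
THRESHOLDS = {"missing_required_rate": 0.005, "duplicate_rate": 0.002}


def quality_report(records: list, required=("event_type", "timestamp")) -> dict:
    """Compute metric values for one partition and mark those at or over their target."""
    total = len(records)
    if total == 0:
        return {}
    missing = sum(1 for r in records if any(r.get(f) is None for f in required))
    ids = [r["event_id"] for r in records if r.get("event_id") is not None]
    duplicates = len(ids) - len(set(ids))
    metrics = {
        "missing_required_rate": missing / total,
        "duplicate_rate": duplicates / total,
    }
    return {
        name: {"value": value, "breached": value >= THRESHOLDS[name]}
        for name, value in metrics.items()
    }
```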

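Implementation notes and recommendations

Layered cleaning: raw layer (Raw) → standardized layer (Staging) → fact layer (Curated), tagging records at each layer to preserve traceability.
Incremental processing and idempotent design: use unique event keys and upserts to avoid repeated processing.
Configurable thresholds: manage the session inactivity threshold, time tolerance, price extremes, and bot rules through configuration.
Audit trail: store quality flags and cleaning reason codes on every record to support later error analysis and rule tuning.

These steps map directly onto common data stacks (SQL/ETL/stream processing). Depending on business needs, first ensure strong consistency and traceability for purchase-related events, then improve normalization and session quality for view/click events, to make funnel and conversion analysis more trustworthy.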

Recommended data cleaning steps for an advertising and behavior dataset (impressions/clicks/conversions, UTM parameters, channel and creative IDs, A/B experiment grouping)

Schema normalization and typing

Define a canonical event schema with required fields: event_id, event_type (impression/click/conversion), event_ts, platform, ad_account_id, campaign_id, ad_group_id, creative_id, channel_id, url, referrer_url, utm_source, utm_medium, utm_campaign, utm_content, utm_term, user_id, device_id, cookie_id, session_id (if exists), experiment_id, variant, order_id/transaction_id, revenue, currency, consent_flag.
Enforce data types:
- Timestamps: parse to UTC ISO 8601, store as datetime with timezone; retain raw ingestion_ts for auditing.
- IDs: cast to string, strip whitespace; if numeric IDs are expected, validate parsability and store canonical string representation to preserve leading zeros.
- Monetary fields: numeric with explicit currency; validate non-negative.
- Categorical: lowercase, trimmed, normalized encoding (UTF-8).
URL and parameter parsing:
- Decode URLs (percent-encoding), extract query parameters to structured fields.
- If UTM fields are missing, parse from landing/referrer URLs; preserve raw and parsed versions.
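A small sketch of the URL parsing and UTM backfill described above, using Python's standard `urllib.parse`. The helper names are hypothetical; `parse_qs` handles percent-decoding and drops blank values by default.

```python
from urllib.parse import parse_qs, urlsplit

UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term")


def extract_utms(url: str) -> dict:
    """Parse UTM parameters out of a URL's query string (values are percent-decoded)."""
    query = parse_qs(urlsplit(url).query)
    return {key: query[key][0] for key in UTM_KEYS if key in query}


def backfill_utms(event: dict) -> dict:
    """Fill only the UTM fields that are missing or empty; existing values are kept."""
    filled = dict(event)
    for key, value in extract_utms(event.get("url") or "").items():
        if not filled.get(key):
            filled[key] = value
    return filled
```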

Event integrity and validation

Validate allowed event_type values and remap common synonyms (e.g., “view” → impression; “purchase” → conversion) into the canonical set.
Timestamp sanity checks:
- Drop or flag events with event_ts outside campaign or data collection windows.
- Remove obvious clock errors (e.g., year far in past/future); if platform ingestion_ts exists, cap discrepancies with a defined tolerance (e.g., ±3 days) and flag.
Event ordering constraints:
- Conversions must not precede the attributed click/impression for the same user within an attribution window; flag negative time-to-event as data errors.
- Ensure click timestamps precede conversion timestamps; if not, investigate source-specific latency or duplicate conversion.

Identity resolution

Construct a persistent person_id using deterministic keys in priority order: hashed_email > login user_id > device_id > cookie_id. Maintain a mapping table with versioning.
Normalize device identifiers (consistent casing, remove separators for some IDs where applicable).
Remove or flag events lacking any identifier if user-level attribution is required; otherwise route to aggregate-only analyses.

Deduplication

Enforce uniqueness on event_id where available; if missing, generate a synthetic key (hash of source_system, event_type, user_id/device_id, ad_id/creative_id, event_ts rounded to suitable precision).
Deduplicate within event type using keys and short time windows:
- Impressions: drop exact duplicates; optionally compress bursts from the same ad server within milliseconds if known duplication pattern.
- Clicks: collapse repeated identical clicks within 1–2 seconds per user-creative to mitigate multi-fire.
- Conversions: dedupe by transaction_id/order_id and user_id; keep the first occurrence. If transaction_id missing, dedupe by (user_id, revenue, event_ts within N minutes) with conservative rules and flag.
Cross-source dedup:
- If multiple platforms report the same conversion, prefer primary source-of-truth (ecommerce or CRM) and mark ad-platform conversions as secondary.
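When event_id is missing, the synthetic key could be built as a hash over the listed fields with the timestamp bucketed to a chosen precision. This is one possible sketch: `event_ts_ms` is an assumed millisecond field, and bucketing by integer division means two events just either side of a bucket boundary get different keys.

```python
import hashlib


def synthetic_event_key(event: dict, precision_ms: int = 1000) -> str:
    """Hash source, type, identity, creative, and a bucketed timestamp into a stable key."""
    bucket = event["event_ts_ms"] // precision_ms
    parts = [
        str(event.get("source_system") or ""),
        str(event.get("event_type") or ""),
        str(event.get("user_id") or event.get("device_id") or ""),
        str(event.get("creative_id") or ""),
        str(bucket),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
```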

UTM canonicalization

Normalize utm_* fields: lowercase, trim, remove surrounding quotes, decode URL-encoding.
Map synonyms into canonical values:
- Medium: map to controlled vocabulary (cpc, display, social, email, affiliate, referral, organic, paid_social, other). Examples: “ppc” → cpc; “paid social” → paid_social.
- Source: unify common variants (e.g., “google”, “googleads”, “adwords” → google; “fb”, “facebook” → facebook).
- Campaign/content/term: enforce naming conventions (no spaces if policy requires; replace illegal characters; trim long values; optionally split structured names into parts using agreed delimiters).
Validate UTM coherence:
- Medium/source combinations must be allowed (e.g., cpc + google valid; email + google invalid).
- Flag missing or inconsistent UTMs; backfill source/medium from channel_id/platform metadata where possible.
Preserve both original and canonical UTM fields to avoid loss of detail.
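The normalization, synonym mapping, and coherence check could be combined along these lines. The mapping tables below contain only the examples given in the text, and the allowed source/medium pairs are a hypothetical placeholder for a real controlled vocabulary.

```python
from urllib.parse import unquote

MEDIUM_MAP = {"ppc": "cpc", "paid social": "paid_social"}
SOURCE_MAP = {"googleads": "google", "adwords": "google", "fb": "facebook"}
# Hypothetical allow-list; a real one would come from the team's UTM policy.
ALLOWED_PAIRS = {("google", "cpc"), ("facebook", "paid_social")}


def clean_value(raw: str) -> str:
    """Decode percent-encoding, trim, strip surrounding quotes, and lowercase."""
    return unquote(raw).strip().strip("\"'").strip().lower()


def canonicalize(source_raw: str, medium_raw: str) -> dict:
    """Return raw and canonical UTM values plus a coherence flag."""
    source = clean_value(source_raw)
    medium = clean_value(medium_raw)
    source = SOURCE_MAP.get(source, source)
    medium = MEDIUM_MAP.get(medium, medium)
    return {
        "utm_source_raw": source_raw,
        "utm_medium_raw": medium_raw,
        "utm_source_canonical": source,
        "utm_medium_canonical": medium,
        "utm_coherent": (source, medium) in ALLOWED_PAIRS,
    }
```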

Channel and creative metadata validation

Join fact events to dimension tables for channel, campaign, ad group, creative. Validate existence and foreign key integrity.
Fix known legacy ID changes via a mapping table; flag orphaned IDs with no metadata.
Validate one-to-many relationships: a creative_id should map to a single campaign/ad_group within the same platform/time window; flag violations.
Enforce active date ranges: drop or flag events occurring outside the metadata’s valid period.

A/B experiment data quality

Validate experiment assignment:
- Check that each person_id is assigned to exactly one variant within an experiment; flag crossovers (saw both variants) and decide on handling (exclude or assign by first exposure).
- Ensure stable assignment over time; detect re-randomization.
Experiment timing:
- Exclude events before experiment start or after end for that experiment.
- Confirm exposure prior to outcome for intent-to-treat vs per-protocol definitions; tag compliance.
Sample ratio mismatch (SRM) check:
- Compare observed variant counts to expected allocation using chi-square; flag significant mismatch.
Contamination control:
- Remove internal traffic from experiment measurements (see bot/internal filter below).
- Ensure variants do not share creatives or UTMs that could blur attribution.
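For a two-variant experiment, the SRM check and the crossover check can be sketched with the standard library alone. With one degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)), which avoids a statistics dependency; experiments with more variants would need a general chi-square survival function instead. Function names are illustrative.

```python
import math


def srm_p_value(count_a: int, count_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for two variants (1 degree of freedom)."""
    total = count_a + count_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((count_a - expected_a) ** 2 / expected_a
            + (count_b - expected_b) ** 2 / expected_b)
    return math.erfc(math.sqrt(chi2 / 2))


def find_crossovers(assignments: list) -> set:
    """Return person_ids that appear under more than one variant.

    assignments: iterable of (person_id, variant) pairs.
    """
    variants = {}
    for person_id, variant in assignments:
        variants.setdefault(person_id, set()).add(variant)
    return {person for person, seen in variants.items() if len(seen) > 1}
```

A very small p-value suggests the observed split is unlikely under the intended allocation and the assignment or logging pipeline should be inspected before reading results.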

Bot, fraud, and internal traffic filtering

User-agent filtering: remove known bot/crawler UA patterns; apply IAB/industry bot lists where available.
IP and ASN filtering: exclude data center ranges and internal corporate IP ranges; use maintained lists.
Behavioral heuristics:
- Excessive clicks per minute, zero dwell time, impossible CTR (e.g., CTR > 1.0 for display), conversions occurring seconds after impression with no click where click is required.
- Cookie-less sequences with high frequency across geographies.
Platform fraud signals: use platform flags (invalid click, suspected bot) to exclude or down-weight.
Document and tag exclusions for auditability.

Sessionization and sequencing

Build sessions per person_id using inactivity threshold (commonly 30 minutes). Assign session_id and order within session.
Sequence events: impression → click → landing page → downstream events → conversion; check for missing steps and tag inferred paths (e.g., view-through).
Compute derived fields: time_to_click, time_to_convert, touchpoint_index.

Attribution-prep flags

Create is_click_through and is_view_through flags for conversions.
Assign attributed_touchpoint_id(s) based on chosen model (e.g., last-click within 7-day window, or position-based). For cleaning, ensure window boundaries and event ordering are correct and tag unattributed conversions.
Retain both raw and modeled attribution fields.
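As an illustration of the window and ordering checks, a last-click lookup within a 7-day window might be written as below. It assumes the clicks passed in already belong to the same person as the conversion, and the function name is hypothetical; conversions that return None would be tagged as unattributed.

```python
from datetime import datetime, timedelta


def last_click_touchpoint(conversion_ts: datetime, clicks: list, window_days: int = 7):
    """Return the touchpoint_id of the latest click inside the window, or None.

    clicks: list of (touchpoint_id, click_ts) pairs for the converting person.
    Clicks after the conversion, or older than the window, are ignored.
    """
    window = timedelta(days=window_days)
    eligible = [
        (click_ts, touchpoint_id)
        for touchpoint_id, click_ts in clicks
        if timedelta(0) <= conversion_ts - click_ts <= window
    ]
    return max(eligible)[1] if eligible else None
```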

Handling missing, malformed, and outliers

Quantify missingness per critical field (IDs, timestamps, UTMs). Set thresholds for exclusion (e.g., drop events missing event_ts or event_type; retain events with missing utm_term).
Do not impute keys or timestamps. For categorical fields used in grouping, assign “unknown” category rather than drop unless analysis requires otherwise.
Outlier checks:
- CTR, CVR by creative/channel outside reasonable bounds; flag and review for tracking errors.
- Revenue outliers; check currency mismatches; standardize currency to a reporting currency with a dated FX rate.
Remove duplicated transactions and extreme anomalies caused by tracking misfires.

Timezone, localization, and calendar normalization

Standardize all event_ts to UTC; retain local_time and timezone where user-level analysis depends on local behavior.
Handle daylight saving transitions carefully (use timezone-aware libraries).
Align reporting calendars (ISO week, fiscal period) and store precomputed date keys.

Consent and privacy compliance

Respect consent_flag and jurisdictional rules (GDPR/CCPA). Exclude or aggregate events without consent as required.
Mask or hash PII; maintain salted hashes consistently across systems.
Apply data retention policies; remove expired identifiers.

Quality assurance checks and metrics

Duplicates rate per event type after dedup.
Join rates to metadata dimensions; orphaned ID percentage.
Missing rate per UTM field and proportion successfully backfilled.
SRM p-value for experiments; crossover rate.
Bot/internal exclusion share; monitor over time.
Volume reconciliation against platform reports (impressions, clicks, spend, conversions) within acceptable variance.
Lag distributions (impression→click, click→convert) to detect ingestion delays or ordering errors.

Outputs and documentation

Produce a cleaned fact table with canonical fields:
- event_id, event_type, event_ts_utc, person_id, session_id, platform, channel_id, campaign_id, ad_group_id, creative_id, url, referrer_url, utm_source_raw/canonical, utm_medium_raw/canonical, utm_campaign_raw/canonical, utm_content_raw/canonical, utm_term_raw/canonical, experiment_id, variant, order_id, revenue, currency, is_click_through, is_view_through, attributed_touchpoint_id, consent_flag, quality_flags (dedup, bot, timestamp_error, orphan_metadata, unattributed).
Maintain a data dictionary covering field definitions, valid values, and cleaning rules.
Version and log the cleaning pipeline; store anomaly reports for audit.

Implementation notes

Prefer SQL for deterministic dedup and joins; use window functions for time-window de-duplication.
Use robust URL parsing and timezone-aware datetime libraries.
Keep original raw tables immutable; write cleaned tables with lineage columns (source_system, load_batch_id).

These steps establish consistent, analyzable data for attribution, channel/creative performance, and A/B experiment evaluation while preserving auditability and minimizing bias introduced by data quality issues.

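The following are recommended cleaning steps and implementation notes for multi-channel operations data (app/web/mini program) where “duplicate user identifiers” and “inconsistent field naming” are present. The goal is a unified, traceable, reusable data foundation that supports downstream statistical analysis and attribution.

1. Overall flow and layering

Raw layer (Raw): keep the source as is; only basic decoding and partitioned landing.
Standardized layer (Staging): field naming, data types, time zones, event mapping, deduplication, ID unification.
Canonical layer (Canonical): a unified entity model (users, events, orders, items, channels), with primary/foreign key joins and detail splitting completed.
Application layer (Mart): analysis-oriented wide tables and aggregates (conversion funnel, retention, attribution, LTV, and so on).
Incremental loads and backfills: use idempotent loading based on event-time partitions (event_date) and primary-key deduplication.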

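2. Field naming and data type standardization

Establish a unified naming convention (lowercase, underscores, English, stable meaning). Suggested core fields:
- General: source (app/web/mp), channel (channel/media), event_id (generate if missing), event_name, event_ts_utc, event_local_ts, ingest_ts, user_key (unified user key), session_id, device_id, cookie_id, openid, unionid, login_id (account), page_url, page_referrer, screen_name, app_version, os, browser, geo_country/region/city, utm_source/medium/campaign/content/term, event_params (JSON).
- Orders: order_id_global, source_order_id, order_ts_utc, user_key, order_status, payment_status, total_amount, currency, discount_amount, shipping_fee, tax_amount, line_items (split into a child table: order_id_global, sku_id, product_id, qty, unit_price, currency).
Data type unification: store times as UTC timestamps (consistently seconds or consistently milliseconds), give numeric columns explicit precision (decimal(18,2) for amounts), and normalize booleans to true/false.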
Fields

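Problem solved

Makes the path from “messy data” to “usable data” clear, fast, and reliable: once you provide a brief description of the dataset and the analysis goal, the prompt generates an expert-level checklist of data cleaning steps, ordered by priority, covering key stages such as missing and anomalous value handling, merging duplicate records, field consistency checks, time and encoding normalization, group-level verification, and sampled review. It focuses on hands-on practice and delivered results, helping you shorten preparation time, raise data credibility, and reduce rework, so that newcomers can work to the standard of a senior analyst. It can be reused and extended across e-commerce, growth operations, marketing CRM, log instrumentation, experimentation, and reporting scenarios.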

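Intended users

Data analysts

Quickly draw up a cleaning plan and execution order, unify metric definitions, shorten preparation time, and improve the accuracy and stability of models and reports.

Growth/marketing managers

Clean ad delivery and behavior data and correct instrumentation and naming conventions, making A/B tests and ROI evaluation more trustworthy so that budgets and creatives can be optimized.

Product operations

Integrate multi-channel data, deduplicate, and standardize fields to improve dashboard stability, locate the root causes of abnormal fluctuations, and adjust operations promptly.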

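Feature summary

• Generates a personalized cleaning checklist in one click, with actionable steps and priorities based on the data's characteristics.

• Automatically identifies missing values, duplicates, and outliers, and suggests fixes and alternative strategies.

• Produces reusable SOPs that state the purpose and output of each step, lowering communication and execution costs.

• Adapts the cleaning strategy to the business scenario, such as marketing, operations, or risk control, improving metric credibility.

• Provides structured explanations and examples that help newcomers get started and experienced colleagues work faster.

• Supports multilingual output with a consistent style, convenient for cross-department collaboration and external delivery.

• Maps steps to operation paths in common tools, reducing trial and error and speeding up execution.

• Generates quality checklists and visualization suggestions, safeguarding data usability and strengthening the persuasiveness of the analysis.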

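How to use a purchased prompt template

1. Use directly in an external chat application

Copy and paste the prompt generated from the template into your usual chat application (such as ChatGPT or Claude) and use it in conversation directly, with no extra development. Suited to quick individual trials and lightweight use.

2. Publish as an API endpoint

Turn the prompt template into an API; your program can modify the template parameters freely and call it through the endpoint, making automation and batch processing easy. Suited to developer integration and embedding in business systems.

3. Configure in an MCP client

Configure the corresponding server address in an MCP client so that your AI application calls the prompt template automatically. Suited to advanced users and team collaboration, letting the prompt carry across different AI tools.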

AI prompt price

¥15.00

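What you get after purchase

✓ The complete prompt template

- 250 tokens in total

- 2 adjustable parameters

{ dataset description } { output language }

✓ Usage rights to community-contributed content

- Curated high-quality community examples to help you get started with the prompt quickly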
