Provides professional data cleaning advice to solve data problems accurately and efficiently.
The cleaning steps below target user behavior log data containing views, clicks, purchases, session IDs, timestamps, and join keys to the item and user tables. They cover field-level normalization, event-level processing, session reconstruction, dimension-table consistency checks, anomaly filtering, and quality monitoring. Adjust thresholds and rules to fit your business.

1. Schema and field type validation
- Define required fields explicitly: event_type, timestamp, and at least one identifier (user_id or session_id). Purchase events must carry order_id, item_id, price, and quantity.
- Type normalization:
  - user_id, item_id, session_id: standardize to string or integer; ensure leading zeros are not lost (e.g., store as strings).
  - timestamp: standardize to a parseable time type (ISO 8601), preserving millisecond precision.
  - price, quantity: cast to numeric types (float/integer).
  - event_type: standardize to a canonical enum (e.g., view, click, add_to_cart, purchase).
- Range and validity checks:
  - timestamp must fall within the data collection window; filter obviously invalid times (e.g., earlier than system launch, or later than now beyond a reasonable buffer such as >72 hours).
  - quantity ≥ 1; price ≥ 0; cap and flag extreme outliers.
- String cleanup: trim whitespace, unify case, remove invisible characters; normalize source fields (e.g., utm, referrer) by unifying case and stripping invalid prefixes.

2. Timestamp and timezone standardization
- Standardize on UTC; if local timezones exist, convert to UTC using a user or site timezone mapping.
- Handle unit differences: unify seconds and milliseconds to milliseconds; when an event carries multiple time fields, explicitly choose an authoritative one (e.g., server_ts over client_ts).
- Order events within a session by ascending timestamp; for multiple events in the same millisecond, keep the original arrival-order field (ingested_at) as a tiebreaker.

3. Event type normalization and semantic consistency
- Normalize event_type: map synonyms and variants (e.g., page_view → view, tap → click, checkout → a stage preceding purchase).
- Standardize purchase event states: distinguish placed, paid, cancelled, refunded, etc., using a unified field (order_status); avoid counting unpaid events as purchases.
- Remove or flag test/rehearsal events: filter events from test accounts, internal IPs, specific user agents, sandbox environments, and obviously anomalous source parameters (e.g., utm_source=test).

4. Deduplication and idempotency
- Event-level dedup:
  - Prefer a unique event key (event_id). When missing, use a composite key: user_id + session_id + item_id + event_type + timestamp (with a time tolerance such as ±1 s) + key context fields (page, device).
  - Use a window function to keep the first record by arrival time (ingested_at) and drop the rest (see the pandas sketch after these steps).
- Order-level dedup: dedupe on order_id + item_id to avoid duplicate payments or duplicate reporting; keep partial shipments and split orders but merge their amount and quantity logic.

5. Missing value handling
- Missing user_id: keep as an anonymous user (e.g., user_id = null, guest_user = true), but ensure session_id exists; downstream analysis must distinguish anonymous from registered users.
- Missing/abnormal session_id: rebuild sessions based on an inactivity threshold (by default, split after 30 minutes of no activity), generate new session_ids, and record a reconstruction flag.
- Missing item_id:
  - View/click events: keep but flag as missing; restrict use in item-level analysis.
  - Purchase events: drop or route to manual review if the item cannot be identified.
- Missing timestamp: drop, or substitute the arrival time and flag as low confidence.

6. Session consistency and repair
- Session ownership check: a session_id should belong to a single user_id; rename or split session_ids reused across users.
- Session length and boundaries:
  - Filter or split extremely long sessions (e.g., >12 hours); flag anomalies likely caused by hung clients or heartbeat events.
  - Disallow sessions with unreasonable drift across dates/timezones; partition by UTC date.
- Event sequence sanity check: e.g., a purchase should be preceded by at least one view/click; flag sequences that violate this.

7. Dimension table joins and referential integrity
- User dimension (users):
  - Verify user_id exists; flag events with missing or invalid user_ids as unknown_user; keep them if needed to support anonymous analysis.
  - Handle slowly changing dimensions (SCD): join user attributes by their validity period at event time to avoid attribute leakage across time (e.g., membership tier changes).
- Item dimension (items):
  - Verify item_id exists and was valid at event time; mark delisted or invalid items with a status.
  - Price validation: check purchase-event price against the item price or order price at that time; when promotional or post-coupon prices exist, treat the order price as authoritative.
- Foreign key policy: keep events whose foreign keys cannot be resolved but mark referential_flag = false; prioritize manual review or backfill for purchase events, and allow a bounded proportion of unknowns for view events.

8. Numeric and currency conventions
- Currency unification: convert price to a single settlement currency (e.g., USD/CNY), retaining the original currency and exchange rate; mark prices with missing currency as non-comparable.
- Amount consistency: check that price × quantity matches the line-item total; define a single convention for tax-inclusive vs. tax-exclusive amounts.
- Outlier detection: identify implausibly low/high prices and oversized quantities; record above-threshold values as anomalies rather than deleting them, for later audit.

9. Bot, anomalous traffic, and noise filtering
- Bot identification: use user_agent (known crawler lists), event frequency (e.g., multiple clicks per second), path patterns, invalid referrers, and IP blacklists; set bot_flag and exclude by default.
- Internal traffic: filter corporate office IP ranges, test devices, and QA accounts; keep samples for quality evaluation but exclude them from business analysis.
- Spike/attack detection: flag and quarantine abnormal short-burst high-concurrency sequences.

10. Order event consistency and after-sales handling
- Purchase events must carry an order_id; align with the order fact table on payment status, cancellations, refunds, and partial refunds.
- Record returns/refunds as separate event types or as order_status changes; avoid counting negative quantities as positive purchases; keep both net and gross metrics.
- Merged and split orders: process multi-item orders line by line; dedupe duplicated order records and keep the latest state.

11. Output data model and quality labeling
- Event fact table (fact_events): event_id, user_id, session_id, item_id, event_type, event_time_utc, price, quantity, device, referrer, source, flags (dedup_flag, bot_flag, referential_flag, reconstructed_session_flag, validity_flag).
- Session fact table (fact_sessions): session_id, user_id, session_start/end, event_count, duration, flags.
- Order fact table (fact_orders): order_id, user_id, order_time, payment_status, currency, total_amount, refund_amount, flags.
- Dimension snapshots: manage users_dim and items_dim by validity period to ensure time-consistent joins.

12. Data privacy and compliance
- Mask or hash PII (e.g., email, phone number, IP); retain only the anonymous identifiers analysis requires.
- Follow data retention policies and access controls; log processing lineage and changes.

13. Quality monitoring metrics and thresholds (daily or per partition)
- Completeness: required-field missing rate (target <0.5%); unresolvable foreign key rate (views <3%, purchases <0.1%).
- Uniqueness: event duplicate rate (<0.2%); order duplicate rate (≈0%).
- Validity: timestamp anomaly rate; price/quantity violation rate.
- Consistency: cross-user session rate; proportion of purchases without preceding behavior.
- Timeliness: proportion of late-arriving events; monitor via tiered reports with alert thresholds.
- Sample-based back-checks: compare random samples against raw logs to verify dedup and normalization correctness.

Implementation notes and recommendations
- Layered cleaning: Raw → Staging → Curated, with quality flags at each layer to preserve traceability.
- Incremental, idempotent processing: use unique event keys and upserts to avoid reprocessing.
- Configurable thresholds: manage the session inactivity threshold, time tolerance, price extremes, and bot rules via configuration.
- Audit trail: store quality flags and cleaning reason codes per record to support later error analysis and rule tuning.
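As a rough illustration of steps 4 and 5 above, the pandas sketch below combines composite-key deduplication (a 1-second tolerance bucket, keeping the first record by ingested_at) with 30-minute inactivity session reconstruction. It assumes a DataFrame whose columns match the field names listed above (user_id, session_id, item_id, event_type, timestamp, ingested_at); the function name `clean_events` and the bucketing approach are illustrative choices, not a prescribed implementation.

```python
import pandas as pd

def clean_events(events: pd.DataFrame, inactivity_minutes: int = 30) -> pd.DataFrame:
    """Composite-key dedup plus session rebuild for rows missing session_id."""
    df = events.copy()
    # Assumes timestamp/ingested_at are parseable; normalize both to UTC.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    df["ingested_at"] = pd.to_datetime(df["ingested_at"], utc=True)

    # Dedup: bucket timestamps to 1 s as the time tolerance and keep the
    # earliest arrival (ingested_at) per composite key.
    df["ts_bucket"] = df["timestamp"].dt.floor("1s")
    key = ["user_id", "session_id", "item_id", "event_type", "ts_bucket"]
    df = df.sort_values("ingested_at")
    df["dedup_flag"] = df.duplicated(subset=key, keep="first")
    df = df.loc[~df["dedup_flag"]].drop(columns="ts_bucket")

    # Session rebuild: order events per user and start a new session whenever
    # the gap to the previous event exceeds the inactivity threshold.
    # (Anonymous rows without user_id would need a device/cookie key first.)
    df = df.sort_values(["user_id", "timestamp"])
    gap = df.groupby("user_id")["timestamp"].diff()
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=inactivity_minutes))
    session_seq = new_session.astype(int).groupby(df["user_id"]).cumsum()
    rebuilt = df["user_id"].astype(str) + "_s" + session_seq.astype(str)

    missing = df["session_id"].isna()
    df.loc[missing, "session_id"] = rebuilt[missing]
    df["reconstructed_session_flag"] = missing
    return df
```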
These steps map directly onto common data stacks (SQL/ETL/stream processing). Based on business needs, first ensure strong consistency and traceability for purchase-related events, then refine normalization and session quality for view/click events, to improve the credibility of funnel and conversion analysis.
Recommended data cleaning steps for an advertising and behavior dataset (impressions/clicks/conversions, UTM parameters, channel and creative IDs, A/B experiment grouping)

1) Schema normalization and typing
- Define a canonical event schema with required fields: event_id, event_type (impression/click/conversion), event_ts, platform, ad_account_id, campaign_id, ad_group_id, creative_id, channel_id, url, referrer_url, utm_source, utm_medium, utm_campaign, utm_content, utm_term, user_id, device_id, cookie_id, session_id (if present), experiment_id, variant, order_id/transaction_id, revenue, currency, consent_flag.
- Enforce data types:
  - Timestamps: parse to UTC ISO 8601, store as datetime with timezone; retain raw ingestion_ts for auditing.
  - IDs: cast to string, strip whitespace; if numeric IDs are expected, validate parsability and store the canonical string representation to preserve leading zeros.
  - Monetary fields: numeric with explicit currency; validate non-negative.
  - Categorical: lowercase, trimmed, normalized encoding (UTF-8).
- URL and parameter parsing:
  - Decode URLs (percent-encoding), extract query parameters to structured fields.
  - If UTM fields are missing, parse them from landing/referrer URLs; preserve raw and parsed versions.

2) Event integrity and validation
- Validate allowed event_type values and remap common synonyms (e.g., "view" → impression; "purchase" → conversion) into the canonical set.
- Timestamp sanity checks:
  - Drop or flag events with event_ts outside campaign or data collection windows.
  - Remove obvious clock errors (e.g., year far in past/future); if a platform ingestion_ts exists, cap discrepancies with a defined tolerance (e.g., ±3 days) and flag.
- Event ordering constraints:
  - Conversions must not precede the attributed click/impression for the same user within an attribution window; flag negative time-to-event as data errors.
  - Ensure click timestamps precede conversion timestamps; if not, investigate source-specific latency or duplicate conversions.

3) Identity resolution
- Construct a persistent person_id using deterministic keys in priority order: hashed_email > login user_id > device_id > cookie_id. Maintain a mapping table with versioning.
- Normalize device identifiers (consistent casing; remove separators for some IDs where applicable).
- Remove or flag events lacking any identifier if user-level attribution is required; otherwise route them to aggregate-only analyses.

4) Deduplication
- Enforce uniqueness on event_id where available; if missing, generate a synthetic key (hash of source_system, event_type, user_id/device_id, ad_id/creative_id, event_ts rounded to suitable precision) — a sketch of this fallback follows this list.
- Deduplicate within each event type using keys and short time windows:
  - Impressions: drop exact duplicates; optionally compress bursts from the same ad server within milliseconds where a known duplication pattern exists.
  - Clicks: collapse repeated identical clicks within 1–2 seconds per user-creative to mitigate multi-fire.
  - Conversions: dedupe by transaction_id/order_id and user_id; keep the first occurrence. If transaction_id is missing, dedupe by (user_id, revenue, event_ts within N minutes) with conservative rules and flag.
- Cross-source dedup:
  - If multiple platforms report the same conversion, prefer the primary source of truth (ecommerce or CRM) and mark ad-platform conversions as secondary.
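A minimal pandas sketch of the synthetic-key fallback described in step 4, assuming the columns follow the schema in step 1 plus a source_system column; the function name `add_synthetic_event_id` and the SHA-256, pipe-delimited key format are illustrative assumptions rather than a fixed convention.

```python
import hashlib
import pandas as pd

def add_synthetic_event_id(df: pd.DataFrame, ts_precision: str = "1s") -> pd.DataFrame:
    """Backfill event_id where missing with a deterministic hash, then dedupe on it."""
    out = df.copy()
    out["event_ts"] = pd.to_datetime(out["event_ts"], utc=True)
    rounded_ts = out["event_ts"].dt.floor(ts_precision).astype(str)

    # Deterministic key from the fields named in step 4; missing identifiers
    # contribute an empty string rather than NaN so the hash stays stable.
    parts = (
        out[["source_system", "event_type", "user_id", "device_id", "creative_id"]]
        .fillna("")
        .astype(str)
        .agg("|".join, axis=1)
        + "|" + rounded_ts
    )
    synthetic = parts.map(lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest())

    missing = out["event_id"].isna() | (out["event_id"].astype(str).str.strip() == "")
    out.loc[missing, "event_id"] = synthetic[missing]

    # Enforce uniqueness: keep the earliest occurrence per event_id.
    return out.sort_values("event_ts").drop_duplicates(subset="event_id", keep="first")
```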
5) UTM canonicalization
- Normalize utm_* fields: lowercase, trim, remove surrounding quotes, decode URL-encoding.
- Map synonyms into canonical values:
  - Medium: map to a controlled vocabulary (cpc, display, social, email, affiliate, referral, organic, paid_social, other). Examples: "ppc" → cpc; "paid social" → paid_social.
  - Source: unify common variants (e.g., "google", "googleads", "adwords" → google; "fb", "facebook" → facebook).
  - Campaign/content/term: enforce naming conventions (no spaces if policy requires; replace illegal characters; trim long values; optionally split structured names into parts using agreed delimiters).
- Validate UTM coherence:
  - Medium/source combinations must be allowed (e.g., cpc + google valid; email + google invalid).
  - Flag missing or inconsistent UTMs; backfill source/medium from channel_id/platform metadata where possible.
- Preserve both original and canonical UTM fields to avoid loss of detail.

6) Channel and creative metadata validation
- Join fact events to dimension tables for channel, campaign, ad group, and creative. Validate existence and foreign key integrity.
- Fix known legacy ID changes via a mapping table; flag orphaned IDs with no metadata.
- Validate one-to-many relationships: a creative_id should map to a single campaign/ad_group within the same platform/time window; flag violations.
- Enforce active date ranges: drop or flag events occurring outside the metadata's valid period.

7) A/B experiment data quality
- Validate experiment assignment:
  - Check that each person_id is assigned to exactly one variant within an experiment; flag crossovers (saw both variants) and decide on handling (exclude, or assign by first exposure).
  - Ensure stable assignment over time; detect re-randomization.
- Experiment timing:
  - Exclude events before the experiment start or after its end for that experiment.
  - Confirm exposure prior to outcome for intent-to-treat vs. per-protocol definitions; tag compliance.
- Sample ratio mismatch (SRM) check:
  - Compare observed variant counts to the expected allocation using chi-square; flag significant mismatch (see the sketch after this list).
- Contamination control:
  - Remove internal traffic from experiment measurements (see the bot/internal filter below).
  - Ensure variants do not share creatives or UTMs that could blur attribution.

8) Bot, fraud, and internal traffic filtering
- User-agent filtering: remove known bot/crawler UA patterns; apply IAB/industry bot lists where available.
- IP and ASN filtering: exclude data center ranges and internal corporate IP ranges; use maintained lists.
- Behavioral heuristics:
  - Excessive clicks per minute, zero dwell time, impossible CTR (e.g., CTR > 1.0 for display), conversions occurring seconds after an impression with no click where a click is required.
  - Cookie-less sequences with high frequency across geographies.
- Platform fraud signals: use platform flags (invalid click, suspected bot) to exclude or down-weight.
- Document and tag exclusions for auditability.

9) Sessionization and sequencing
- Build sessions per person_id using an inactivity threshold (commonly 30 minutes). Assign session_id and order within session.
- Sequence events: impression → click → landing page → downstream events → conversion; check for missing steps and tag inferred paths (e.g., view-through).
- Compute derived fields: time_to_click, time_to_convert, touchpoint_index.
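The SRM check in step 7 can be sketched as a chi-square test of observed variant counts against the planned allocation. The snippet assumes pandas and SciPy are available, an assignments DataFrame with experiment_id, person_id, and variant columns as in the schema above (crossovers already handled per step 7), and an expected_split dict supplied by the experiment design; the function name and the 0.001 alpha are illustrative.

```python
import pandas as pd
from scipy.stats import chisquare

def srm_check(assignments: pd.DataFrame, expected_split: dict, alpha: float = 0.001) -> dict:
    """Sample-ratio-mismatch check for one experiment: chi-square on variant counts."""
    # One row per person per experiment, then distinct users per variant.
    observed = (
        assignments.drop_duplicates(subset=["experiment_id", "person_id"])
        .groupby("variant")["person_id"].nunique()
        .reindex(list(expected_split), fill_value=0)
    )
    total = observed.sum()
    expected = [total * share for share in expected_split.values()]
    stat, p_value = chisquare(f_obs=observed.to_list(), f_exp=expected)
    return {"observed": observed.to_dict(), "p_value": float(p_value), "srm_flag": p_value < alpha}

# Example usage for a 50/50 test:
# result = srm_check(assignments, {"control": 0.5, "treatment": 0.5})
```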
10) Attribution-prep flags
- Create is_click_through and is_view_through flags for conversions.
- Assign attributed_touchpoint_id(s) based on the chosen model (e.g., last-click within a 7-day window, or position-based). For cleaning, ensure window boundaries and event ordering are correct and tag unattributed conversions (a last-click sketch appears after these notes).
- Retain both raw and modeled attribution fields.

11) Handling missing, malformed, and outlier values
- Quantify missingness per critical field (IDs, timestamps, UTMs). Set thresholds for exclusion (e.g., drop events missing event_ts or event_type; retain events with missing utm_term).
- Do not impute keys or timestamps. For categorical fields used in grouping, assign an "unknown" category rather than dropping, unless the analysis requires otherwise.
- Outlier checks:
  - CTR, CVR by creative/channel outside reasonable bounds; flag and review for tracking errors.
  - Revenue outliers; check currency mismatches; standardize currency to a reporting currency with a dated FX rate.
  - Remove duplicated transactions and extreme anomalies caused by tracking misfires.

12) Timezone, localization, and calendar normalization
- Standardize all event_ts to UTC; retain local_time and timezone where user-level analysis depends on local behavior.
- Handle daylight saving transitions carefully (use timezone-aware libraries).
- Align reporting calendars (ISO week, fiscal period) and store precomputed date keys.

13) Consent and privacy compliance
- Respect consent_flag and jurisdictional rules (GDPR/CCPA). Exclude or aggregate events without consent as required.
- Mask or hash PII; maintain salted hashes consistently across systems.
- Apply data retention policies; remove expired identifiers.

14) Quality assurance checks and metrics
- Duplicate rate per event type after dedup.
- Join rates to metadata dimensions; orphaned ID percentage.
- Missing rate per UTM field and proportion successfully backfilled.
- SRM p-value for experiments; crossover rate.
- Bot/internal exclusion share; monitor over time.
- Volume reconciliation against platform reports (impressions, clicks, spend, conversions) within acceptable variance.
- Lag distributions (impression→click, click→convert) to detect ingestion delays or ordering errors.

15) Outputs and documentation
- Produce a cleaned fact table with canonical fields:
  - event_id, event_type, event_ts_utc, person_id, session_id, platform, channel_id, campaign_id, ad_group_id, creative_id, url, referrer_url, utm_source_raw/canonical, utm_medium_raw/canonical, utm_campaign_raw/canonical, utm_content_raw/canonical, utm_term_raw/canonical, experiment_id, variant, order_id, revenue, currency, is_click_through, is_view_through, attributed_touchpoint_id, consent_flag, quality_flags (dedup, bot, timestamp_error, orphan_metadata, unattributed).
- Maintain a data dictionary covering field definitions, valid values, and cleaning rules.
- Version and log the cleaning pipeline; store anomaly reports for audit.

Implementation notes
- Prefer SQL for deterministic dedup and joins; use window functions for time-window de-duplication.
- Use robust URL parsing and timezone-aware datetime libraries.
- Keep original raw tables immutable; write cleaned tables with lineage columns (source_system, load_batch_id).

These steps establish consistent, analyzable data for attribution, channel/creative performance, and A/B experiment evaluation while preserving auditability and minimizing bias introduced by data quality issues.
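To make the attribution-prep flags in step 10 concrete, here is a hedged last-click sketch using pandas merge_asof: each conversion is matched to the most recent prior click by the same person_id within a configurable window, and anything unmatched is tagged unattributed. Column names follow the output schema above; view-through matching against impressions is deliberately left out, and the function and flag names are assumptions, not a mandated model.

```python
import pandas as pd

def attach_last_click(conversions: pd.DataFrame, clicks: pd.DataFrame, window_days: int = 7) -> pd.DataFrame:
    """Last-click attribution prep: attach the most recent prior click per person."""
    conv = conversions.copy()
    clk = clicks.copy()
    for df in (conv, clk):
        df["event_ts_utc"] = pd.to_datetime(df["event_ts_utc"], utc=True)
    conv = conv.sort_values("event_ts_utc")
    clk = (
        clk.rename(columns={"event_id": "attributed_touchpoint_id"})
           .sort_values("event_ts_utc")
    )

    # For each conversion, take the latest click at or before it for the same
    # person, no older than the attribution window; unmatched rows stay NaN.
    out = pd.merge_asof(
        conv,
        clk[["person_id", "event_ts_utc", "attributed_touchpoint_id"]],
        on="event_ts_utc",
        by="person_id",
        direction="backward",
        tolerance=pd.Timedelta(days=window_days),
    )
    out["is_click_through"] = out["attributed_touchpoint_id"].notna()
    # View-through handling would need an impression join; anything without a
    # qualifying click is simply tagged unattributed here for later review.
    out["quality_flag_unattributed"] = ~out["is_click_through"]
    return out
```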
Below are recommended cleaning steps and implementation notes for multi-channel operations data (app/web/mini-program) where duplicate user identifiers and inconsistent field naming exist. The goal is a unified, traceable, reusable data foundation that supports downstream statistical analysis and attribution.

1. Overall flow and layering
- Raw layer: preserve the source as-is; only basic decoding and partitioned landing.
- Staging layer: field naming, data types, timezones, event mapping, deduplication, ID unification.
- Canonical layer: a unified entity model (users, events, orders, items, channels) with primary/foreign key joins and line-item splitting.
- Mart layer: analysis-oriented wide tables and aggregates (conversion funnels, retention, attribution, LTV, etc.).
- Incremental loads and backfills: idempotent loading based on event-time partitions (event_date) and primary-key deduplication.

2. Field naming and data type standardization
- Establish a unified naming convention (lowercase, underscores, English, stable semantics). Suggested core fields:
  - Common: source (app/web/mp), channel (channel/media), event_id (generate if missing), event_name, event_ts_utc, event_local_ts, ingest_ts, user_key (unified user key), session_id, device_id, cookie_id, openid, unionid, login_id (account), page_url, page_referrer, screen_name, app_version, os, browser, geo_country/region/city, utm_source/medium/campaign/content/term, event_params (JSON).
  - Orders: order_id_global, source_order_id, order_ts_utc, user_key, order_status, payment_status, total_amount, currency, discount_amount, shipping_fee, tax_amount, line_items (split into a child table: order_id_global, sku_id, product_id, qty, unit_price, currency).
- Unify data types: timestamps in UTC (pick seconds or milliseconds consistently), explicit precision for numeric columns (amounts as decimal(18,2)), booleans normalized to true/false (a renaming and typing sketch follows below).
- Field
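As an illustration of the naming and typing standardization in section 2, the sketch below renames hypothetical web-source columns to the canonical names and coerces timestamp and amount types. The source column names in WEB_COLUMN_MAP are invented for the example; in practice each source (app/web/mini-program) would carry its own configured mapping.

```python
import pandas as pd

# Hypothetical source-specific column names for the web channel; real mappings
# would live in configuration, one map per source (app/web/mini-program).
WEB_COLUMN_MAP = {
    "eventName": "event_name",
    "ts": "event_ts_utc",
    "uid": "login_id",
    "pageUrl": "page_url",
    "orderAmount": "total_amount",
}

def standardize_web_events(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename source columns to canonical names and enforce canonical types."""
    df = raw.rename(columns=WEB_COLUMN_MAP).copy()
    df["source"] = "web"
    # Timestamps unified to UTC; amounts coerced to numeric and rounded to two
    # decimals as an approximation of decimal(18,2) in a DataFrame setting.
    df["event_ts_utc"] = pd.to_datetime(df["event_ts_utc"], utc=True)
    if "total_amount" in df.columns:
        df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce").round(2)
    return df
```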
Quickly define a cleaning plan and execution order, unify definitions, shorten preparation time, and improve the accuracy and stability of models and reports.
Clean ad-delivery and behavior data, fix tracking and naming conventions, make A/B tests and ROI evaluation more trustworthy, and thereby optimize budgets and creatives.
Consolidate multi-channel data, deduplicate and standardize fields, improve dashboard stability, pinpoint the root causes of anomalous fluctuations, and adjust operations promptly.
Produce delivery-grade cleaning SOPs on tight projects, safeguard data credibility, reduce solution and decision risk, and build client trust.
Standardize experimental data preprocessing, improve reproducibility and the reliability of conclusions, shorten data preparation cycles, and speed up paper output.
Turn recommendations into team-executable steps and validation checklists, reduce repeated rework, and accelerate the delivery of reports and data products.
Make the path from messy data to usable data clear, fast, and reliable: when you provide a brief description of your dataset and your analysis goals, the prompt immediately generates an expert-level, prioritized checklist of data cleaning steps covering missing-value and anomaly handling, duplicate-record merging, field consistency checks, time and encoding normalization, group-level verification, and sample-based review. It focuses on hands-on practice and tangible results, helping you shorten preparation time, improve data credibility, and reduce rework, so even newcomers can work to the standard of a senior analyst, and it can be quickly reused and extended across e-commerce, growth operations, marketing CRM, event tracking, experimentation, and reporting scenarios.
Copy and paste the prompt generated from the template into your usual chat app (such as ChatGPT or Claude) and use it directly in conversation, with no extra development. Suitable for quick personal trials and lightweight use.
Turn the prompt template into an API: your program can modify template parameters freely and call it directly through the interface, making automation and batch processing easy. Suitable for developer integration and embedding into business systems.
Configure the corresponding server address in your MCP client so your AI application can invoke the prompt template automatically. Suitable for advanced users and team collaboration, letting prompts move seamlessly across different AI tools.