Dataset name
Conversion Funnel Events (转化漏斗事件集)
Summary
Event-level dataset that standardizes and records user progression through defined conversion funnel stages across web and mobile channels. Designed for measuring funnel performance, diagnosing drop-off points, and supporting compliant attribution and optimization.
Business definition and scope
- Scope: Digital interactions that contribute to user progression from initial touch to conversion (e.g., purchase or signup). Includes on-site/app events and, where available, upstream acquisition signals from ad platforms.
- Use cases: Funnel conversion rate analysis, stage-to-stage drop-off, time-to-conversion, cohort progression, channel/campaign attribution, A/B test evaluation.
- Exclusions: System logs unrelated to user behavior, raw third-party network logs without consent, and any data lacking provenance or required metadata.
Grain
- One record per discrete event occurrence.
- Events are ordered by event_time within a user/session context.
- Canonical event_name values are versioned and governed (see “Event taxonomy”).
Event taxonomy (canonical funnel stages)
Events are standardized into the following categories; implementations may add versioned subtypes.
- Acquisition: ad_impression, ad_click
- Engagement: session_start, page_view, content_view, product_view
- Consideration: add_to_cart, wishlist_add
- Intent: checkout_start, payment_start
- Conversion: purchase_complete (order_placed), signup_complete (if applicable)
- Post-conversion: refund_requested, order_cancelled (optional, for lifecycle analysis)
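The taxonomy above can be held as a simple lookup from event_name to stage. The following is a minimal sketch, not the governed implementation: the names FUNNEL_STAGES, EVENT_TO_STAGE, and stage_of are hypothetical, and the order_placed alias is left out because the taxonomy does not state how it should be handled.

```python
from typing import Optional

# Stage -> canonical event names, copied from the taxonomy list.
# The "order_placed" alias is intentionally omitted (its handling is not specified).
FUNNEL_STAGES = {
    "Acquisition": ["ad_impression", "ad_click"],
    "Engagement": ["session_start", "page_view", "content_view", "product_view"],
    "Consideration": ["add_to_cart", "wishlist_add"],
    "Intent": ["checkout_start", "payment_start"],
    "Conversion": ["purchase_complete", "signup_complete"],
    "Post-conversion": ["refund_requested", "order_cancelled"],
}

# Inverted index: event_name -> stage.
EVENT_TO_STAGE = {
    name: stage for stage, names in FUNNEL_STAGES.items() for name in names
}


def stage_of(event_name: str) -> Optional[str]:
    """Return the funnel stage for a canonical event_name, or None if unknown."""
    return EVENT_TO_STAGE.get(event_name)
```

An event_name that returns None here would be a candidate for the schema conformance check rather than a silent drop.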
Core schema (logical fields)
- Event identifiers
  - event_id: globally unique identifier (UUID)
  - event_name: canonical event type (enum; governed list)
  - event_version: schema version of the event payload
- Time
  - event_time: UTC timestamp of event occurrence
  - ingestion_time: UTC timestamp of data ingestion
- Identity and session
  - user_id: persistent authenticated identifier (nullable before login)
  - anonymous_id: device/browser-scoped identifier for unauthenticated users
  - session_id: server-side session identifier (governed sessionization rules)
- Source and context
  - platform: web, iOS, Android, other
  - device_type, os, app_version, browser
  - page_url / screen_name (web/app context)
  - geo_country, geo_region (derived; IP anonymized or truncated per policy)
- Marketing and attribution
  - source, medium, campaign_id, channel, term, content (UTM-standardized)
  - click_id (e.g., gclid/fbclid) where available
  - attribution_model (e.g., last_non_direct_click), attribution_window_days
- Commerce (if applicable)
  - product_id, sku, quantity, unit_price, currency
  - order_id, order_value, tax, shipping, discount_value
- Consent and privacy
  - consent_analytics, consent_advertising (boolean flags)
  - consent_timestamp, consent_source
  - data_subject_region (e.g., EU/US/Other)
- Quality and lineage metadata
  - source_system, pipeline_name, transform_version
  - is_duplicate (computed flag)
  - data_quality_score (composite metric; see checks below)
Keys and uniqueness
- Primary key: event_id (unique across the dataset).
- Secondary (sort) keys: (user_id, event_time) and (session_id, event_time) for ordered analysis. These are not guaranteed unique, and user_id is null for events recorded before login.
- Deduplication: Exact dedup on event_id; domain-specific fuzzy dedup for purchases (e.g., identical order_id + user_id + order_value within a short window) per governed rule.
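The two deduplication passes can be sketched as follows. This is an illustration under stated assumptions, not the governed rule: the function name deduplicate, the dict-based event representation, and the 5-minute purchase window are all hypothetical, since the source only says "a short window".

```python
from datetime import datetime, timedelta


def deduplicate(events, purchase_window=timedelta(minutes=5)):
    """Return copies of events, in event_time order, with is_duplicate set.

    Pass 1: exact dedup on event_id (first occurrence is kept).
    Pass 2: purchases with identical (order_id, user_id, order_value)
            within purchase_window of the last kept purchase are flagged.
    The 5-minute default is an assumption for illustration only.
    """
    seen_ids = set()
    last_kept_purchase = {}  # (order_id, user_id, order_value) -> event_time
    flagged = []
    for ev in sorted(events, key=lambda e: e["event_time"]):
        is_dup = ev["event_id"] in seen_ids
        seen_ids.add(ev["event_id"])
        if not is_dup and ev["event_name"] == "purchase_complete":
            key = (ev.get("order_id"), ev.get("user_id"), ev.get("order_value"))
            previous = last_kept_purchase.get(key)
            if previous is not None and ev["event_time"] - previous <= purchase_window:
                is_dup = True
            else:
                last_kept_purchase[key] = ev["event_time"]
        flagged.append({**ev, "is_duplicate": is_dup})
    return flagged
```

Flagging rather than deleting keeps the raw record available for audit, which matches the schema's computed is_duplicate field.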
Ordering and sessionization
- Events are ordered by event_time within an identity (user_id, or anonymous_id when user_id is null) and, within that identity, by session_id.
- Sessionization rules are defined and versioned (e.g., inactivity timeout, cross-tab continuity) and must be consistently applied across pipelines.
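An inactivity-timeout rule of the kind mentioned above can be sketched like this. The 30-minute timeout and the function name assign_sessions are assumptions for illustration; the governed rules (including cross-tab continuity, which this sketch does not model) take precedence.

```python
from datetime import datetime, timedelta


def assign_sessions(event_times, timeout=timedelta(minutes=30)):
    """Assign a session index to each timestamp of a single identity.

    A new session starts when the gap to the previous event exceeds
    `timeout`. The 30-minute default is an assumed value.
    Returns a list of (event_time, session_index) in time order.
    """
    assigned = []
    session_index = 0
    previous = None
    for ts in sorted(event_times):
        if previous is not None and ts - previous > timeout:
            session_index += 1
        assigned.append((ts, session_index))
        previous = ts
    return assigned
```

Because the timeout is a parameter, two pipelines using different values would produce different session boundaries, which is why the rule needs to be versioned and applied consistently.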
Identity resolution
- Identity stitching policy links anonymous_id to user_id upon authentication; cross-device resolution must use governed deterministic keys (e.g., login) or approved probabilistic methods with documented confidence.
- Identity graph changes are versioned; downstream consumers should not assume stability of historical joins without reprocessing windows.
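Deterministic stitching on login can be sketched as below. The helper names build_identity_map and resolve_identity are hypothetical, and "first observed link wins" is an assumed tie-break; the governed policy may resolve shared devices differently.

```python
def build_identity_map(events):
    """Map anonymous_id -> user_id from events that carry both.

    An event with both fields populated is treated as a deterministic
    link (for example, the first event after login). When one
    anonymous_id is seen with several user_ids, the first link is kept;
    this tie-break is an assumption.
    """
    mapping = {}
    for ev in events:
        anon, user = ev.get("anonymous_id"), ev.get("user_id")
        if anon and user:
            mapping.setdefault(anon, user)
    return mapping


def resolve_identity(ev, mapping):
    """Return user_id if present, else the stitched user_id, else anonymous_id."""
    anon = ev.get("anonymous_id")
    return ev.get("user_id") or mapping.get(anon) or anon
```

Since the mapping is rebuilt from whatever events are in scope, re-running it over a longer history can change how older anonymous events resolve, which is the instability that the versioning note warns about.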
Attribution rules
- Default reporting uses governed attribution_model and attribution_window_days. Changes require change control approval.
- Last-touch vs. first-touch and multi-touch models must be explicitly selected and documented in downstream reports.
- Direct traffic handling and bot/excluded traffic are defined by governance filters (see “Data quality controls”).
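As an illustration of how attribution_model and attribution_window_days interact, here is a minimal sketch of a last-non-direct-click selection. It assumes direct traffic is marked with channel == "direct" and uses a 30-day default window; both are assumptions, and the function name last_non_direct_touch is hypothetical.

```python
from datetime import datetime, timedelta


def last_non_direct_touch(touches, conversion_time, attribution_window_days=30):
    """Pick the most recent non-direct touch inside the attribution window.

    `touches` is a list of dicts with at least "channel" and "event_time".
    Direct traffic is assumed to carry channel == "direct".
    Returns the chosen touch, or None if no touch qualifies.
    """
    window_start = conversion_time - timedelta(days=attribution_window_days)
    eligible = [
        t for t in touches
        if window_start <= t["event_time"] <= conversion_time
        and t.get("channel") != "direct"
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda t: t["event_time"])
```

A None result means the conversion has no qualifying touch in the window; how such conversions are reported (for example, as direct) is a governance decision, not something this sketch settles.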
Data quality controls
- Schema conformance: event_name must map to a governed schema; mandatory fields cannot be null (e.g., event_id, event_time, platform).
- Primary key uniqueness: no duplicate event_id.
- Referential integrity: product_id and order_id must exist in mastered product/order dimensions when present.
- Timeliness: data availability meets defined SLA (e.g., T+1 for analytics; near-real-time for operational use if applicable).
- Volume and distribution checks: anomaly detection on event counts by platform, channel, and stage.
- Reasonableness checks: counts should not increase from one funnel stage to the next, and stage-to-stage progression rates should stay within expected bounds; spike detection on conversion events.
- Bot/invalid traffic filters: governed lists and heuristics applied consistently (e.g., known bot UA, data center IPs).
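The first two controls (schema conformance and primary key uniqueness) are simple enough to sketch. The function name run_basic_checks and the report layout are hypothetical; the mandatory field list uses only the examples named above and is likely incomplete.

```python
from collections import Counter

# Mandatory fields named as examples in the controls list; the governed
# list may be longer.
MANDATORY_FIELDS = ("event_id", "event_time", "platform")


def run_basic_checks(events, governed_event_names):
    """Return row indexes and ids that fail basic conformance checks."""
    id_counts = Counter(ev.get("event_id") for ev in events)
    return {
        # Rows where any mandatory field is missing or null.
        "missing_mandatory": [
            i for i, ev in enumerate(events)
            if any(ev.get(field) is None for field in MANDATORY_FIELDS)
        ],
        # Rows whose event_name is not in the governed list.
        "unknown_event_name": [
            i for i, ev in enumerate(events)
            if ev.get("event_name") not in governed_event_names
        ],
        # event_id values that occur more than once.
        "duplicate_event_ids": sorted(
            eid for eid, n in id_counts.items() if eid is not None and n > 1
        ),
    }
```

Returning the failing rows, rather than a single pass/fail flag, makes it easier to feed the results into a composite metric such as data_quality_score.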
Lineage and sources
- Source systems may include client-side SDKs, server-side event collection, commerce backend, and ad tech platforms (distinctly tagged via source_system).
- Transformations: normalization, enrichment (UTM parsing, geo derivation), identity stitching, deduplication, and schema versioning recorded in transform_version metadata.
Refresh cadence
- Batch ingestion and/or streaming pipelines depending on source. Cadence is documented per source_system and pipeline_name.
- Late-arriving data and backfill policy are governed; reprocessing windows are documented in change logs.
Privacy and compliance
- Classification: Contains personal data; treat as Confidential/Restricted per Data Classification Policy.
- Legal basis: Analytics processing must respect consent_analytics; advertising use must respect consent_advertising, applicable regulations (e.g., GDPR/UK GDPR/CPRA), and internal policy.
- Data minimization: Store only necessary identifiers; IP must be anonymized/truncated per policy.
- Data subject rights: Deletion requests cascade to this dataset; lineage tags must support discoverability and purge.
- Cross-border transfer: Follow organizational data transfer controls; ensure appropriate safeguards for EU data where applicable.
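IP truncation, as required by the data minimization rule, can be done with Python's standard ipaddress module. This is a sketch only: the /24 (IPv4) and /48 (IPv6) prefix lengths are assumed values, and the actual lengths are set by policy. The function name truncate_ip is hypothetical.

```python
import ipaddress


def truncate_ip(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Zero out the host bits of an IP address.

    The default prefix lengths are assumptions for illustration; use the
    values required by the applicable policy.
    """
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False allows host bits to be set in the input address.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)
```

For example, truncate_ip("203.0.113.77") returns "203.0.113.0". Truncating before geo derivation reduces geo accuracy, so the order of these two steps is worth documenting in transform_version.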
Access and usage restrictions
- Access is role-based with least privilege. Raw event-level identifiers are restricted; use aggregated views for broad access.
- Prohibited: Re-identification, combining with external datasets without approved DPIA/assessment, individual-level decisioning without appropriate legal basis and approvals.
Retention and purging
- Retention follows the Data Retention Policy; raw events retained only as long as necessary for stated purposes. Aggregated derivatives may have different retention.
- Legal hold and deletion workflows are supported via lineage and subject resolution.
Stewardship and ownership
- Data Owner: Designated business owner (e.g., Growth/Marketing Analytics).
- Data Steward: Appointed steward responsible for policy conformance, metadata completeness, and change control.
- Technical Owner: Data engineering team responsible for pipelines and quality enforcement.
Change management
- Breaking changes (schema, event_name taxonomy, sessionization rules, attribution defaults) require governance review, version increment (event_version), deprecation schedule, and communication to consumers.
- Backward-compatible changes are documented and tagged; downstream consumers should validate impact.
Known limitations
- Ad impression completeness and deduplication vary by partner capabilities and consent.
- Cross-device identity stitching may be incomplete, affecting multi-touch attribution.
- Offline conversions are included only if integrated and consented; otherwise, conversion metrics reflect digital channels.
Usage guidance
- Compute funnel metrics by ordering events within a consistent identity and session scope; avoid mixing user_id and anonymous_id without applying the identity stitching policy.
- Use governed attribution fields rather than custom logic to ensure consistency across reports.
- For regulatory compliance, filter on consent_analytics when the purpose is analytics and on consent_advertising when the purpose is advertising.
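The guidance above can be combined into a small funnel calculation. This sketch assumes each event already carries a stitched identity in a hypothetical resolved_id field, scopes progression by identity only (not by session), and keeps only events where consent_analytics is true. The names funnel_counts and step_rates are illustrative.

```python
def funnel_counts(events, stages, identity_field="resolved_id"):
    """Count identities reaching each stage, in the given stage order.

    Only events with consent_analytics set to True are considered.
    `identity_field` is assumed to hold an already-stitched identity.
    An identity counts for stage k only after reaching stages 0..k-1.
    """
    next_stage = {}  # identity -> index of the next stage it must reach
    for ev in sorted(events, key=lambda e: e["event_time"]):
        if not ev.get("consent_analytics"):
            continue
        ident = ev[identity_field]
        idx = next_stage.get(ident, 0)
        if idx < len(stages) and ev["event_name"] == stages[idx]:
            next_stage[ident] = idx + 1
    reached = [0] * len(stages)
    for idx in next_stage.values():
        for i in range(idx):
            reached[i] += 1
    return dict(zip(stages, reached))


def step_rates(counts):
    """Stage-to-stage conversion rates; None when the earlier stage is empty."""
    names = list(counts)
    return {
        f"{a}->{b}": (counts[b] / counts[a] if counts[a] else None)
        for a, b in zip(names, names[1:])
    }
```

Because each identity must pass the stages in order, the resulting counts never increase from one stage to the next, which lines up with the reasonableness check in the quality controls.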