Term: Data Lineage
Definition:
Data lineage is the complete, end-to-end record of how data moves, transforms, and is used across an organization’s ecosystem. It captures the origin of data, the systems and processes that ingest, transform, and deliver it, and the downstream assets and consumers that depend on it. Lineage is expressed as connected metadata describing datasets, fields, processes, jobs, systems, and their directional relationships over time.
Purpose:
- Traceability and accountability for critical data used in reporting, analytics, and operations.
- Impact analysis for changes, incidents, and regulatory inquiries.
- Risk reduction through visibility into dependencies, controls, and data usage.
- Evidence for compliance with internal policies and external regulations.
Scope and Granularity:
- Coverage: End-to-end (from external sources to consumption) across batch, streaming, API, and manual processes.
- Levels:
- System-level: Movement between platforms or domains.
- Dataset-level: Flow between tables, files, topics, reports.
- Field/column-level: Derivations and transformations for individual attributes.
- Record-level (provenance): Event- or record-specific lineage when required by high-assurance use cases.
Lineage Types:
- Business lineage: Abstract, understandable view aligned to business terms and processes.
- Technical lineage: Detailed implementation view derived from code, jobs, and runtime metadata.
- Logical lineage: Conceptual relationships independent of specific technologies.
- Physical lineage: Concrete paths across specific systems and assets.
- Static lineage: Derived from code and configurations.
- Runtime/operational lineage: Derived from executed jobs and runtime telemetry.
- Bidirectional lineage: Upstream (origins) and downstream (impacts).
Core Components:
- Nodes:
- Dataset/asset (e.g., table, file, topic, report, model, API).
- Field/attribute.
- Process/transformation (e.g., SQL query, Spark job, ELT step).
- Job/run (execution instance with timestamps and status).
- System/application (platforms hosting assets/processes).
- Actor/role (human or service).
- Edges:
- Consumes/produces.
- Derives from (field-level).
- Reads/writes.
- Executes/controlled by.
- Attributes:
- Timestamps, versions, code references, configurations, data classifications, controls applied, data contracts, data quality checks.
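The nodes, edges, and attributes above can be sketched as a small typed graph model. This is an illustrative data structure, not the schema of any particular lineage tool; the names (`Node`, `Edge`, `NodeType`, `EdgeType`) are assumptions for the sketch.

```python
from dataclasses import dataclass
from enum import Enum

class NodeType(Enum):
    DATASET = "dataset"
    FIELD = "field"
    PROCESS = "process"
    JOB_RUN = "job_run"
    SYSTEM = "system"
    ACTOR = "actor"

class EdgeType(Enum):
    CONSUMES = "consumes"
    PRODUCES = "produces"
    DERIVES_FROM = "derives_from"  # field-level derivation
    READS = "reads"
    WRITES = "writes"
    EXECUTES = "executes"

@dataclass(frozen=True)
class Node:
    id: str          # unique, resolvable identifier
    type: NodeType

@dataclass(frozen=True)
class Edge:
    source: str      # Node.id
    target: str      # Node.id
    type: EdgeType

# A lineage graph is simply a set of nodes plus directed, typed edges;
# attributes (timestamps, versions, classifications) would hang off both.
orders = Node("warehouse.curated.orders", NodeType.DATASET)
ltv_job = Node("etl.compute_customer_ltv", NodeType.PROCESS)
edges = [Edge(orders.id, ltv_job.id, EdgeType.CONSUMES)]
```

Keeping edges typed (rather than untyped arrows) is what lets one graph serve both impact analysis (follow produces/consumes) and field-level tracing (follow derives_from).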
Representation Model:
- Directed graphs, typically acyclic (DAGs) for deterministic flows; iterative workflows may introduce cycles.
- Temporal lineage: Versioned representations per code version and per job run for point-in-time reconstruction.
- Semantics:
- Entities and relationships must be uniquely identifiable and resolvable.
- Transformations should encode logic where feasible (e.g., SQL parse trees, UDF references).
- Lineage should link to business glossary terms, policies, data quality rules, and ownership.
Capture and Maintenance Methods:
- Automated harvesting:
- Static code parsing (SQL, ETL configurations, orchestration DAGs).
- Runtime instrumentation and event capture from executors and orchestrators.
- Connectors to catalogs, ETL/ELT tools, BI/reporting platforms, and streaming frameworks.
- Declared/manual lineage:
- Curated mappings for opaque processes, legacy systems, or external vendors.
- Reconciliation and validation:
- Compare declared lineage with observed runtime events.
- Detect breaks due to schema drift, refactors, or undocumented processes.
- Change management:
- Version control integration and CI/CD checks for lineage-impact assertions.
- Approval workflows for lineage updates.
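Static harvesting and reconciliation can be sketched together: extract (source, target) pairs from SQL, then compare them with declared lineage. This is a deliberately naive regex sketch; production harvesters use full SQL parsers and handle CTEs, subqueries, and aliases, which this does not.

```python
import re

def harvest_lineage(sql: str):
    """Naively extract (source, target) dataset pairs from an
    INSERT ... SELECT statement. A sketch only: real harvesters
    parse the SQL grammar rather than pattern-match it."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return {(src, target.group(1)) for src in sources} if target else set()

sql = """
INSERT INTO analytics.customer_ltv
SELECT c.id, SUM(o.amount)
FROM curated.customer c
JOIN curated.orders o ON o.customer_id = c.id
GROUP BY c.id
"""

observed = harvest_lineage(sql)

# Reconciliation: any difference between declared and observed
# lineage signals schema drift or an undocumented change.
declared = {("curated.customer", "analytics.customer_ltv"),
            ("curated.orders", "analytics.customer_ltv")}
breaks = declared ^ observed  # symmetric difference; empty when aligned
```

An empty `breaks` set means the declared mapping still matches the code; a non-empty set is exactly the "break detection" case described above.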
Governance Controls and Roles:
- Policies:
- Mandatory lineage for Critical Data Elements (CDEs), regulatory reports, and PII flows.
- Minimum granularity requirements (e.g., column-level for financial metrics).
- Freshness SLAs for lineage updates tied to deployment and execution.
- Access controls to prevent exposure of sensitive code, keys, or PII in lineage views.
- Vendor and data-sharing lineage obligations in contracts.
- Roles and responsibilities:
- Data Owner: Ensures policy adherence for owned domains.
- Data Steward: Curates and validates lineage, resolves breaks, aligns with glossary.
- Engineering/Platform: Implements harvesters, storage, APIs, and access control.
- Producers/Developers: Provide code annotations and data contracts.
- Risk/Compliance/Audit: Consumes lineage as evidence and for testing controls.
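The mandatory-lineage policy for CDEs can be checked mechanically. A minimal sketch, assuming two hypothetical inputs: the set of CDEs in scope and the set of columns for which column-level lineage has actually been captured.

```python
# Hypothetical inputs: CDEs subject to the mandatory column-level
# lineage policy, and columns currently covered by captured lineage.
cdes = {"finance.revenue.amount", "finance.revenue.currency"}
columns_with_lineage = {"finance.revenue.amount"}

def policy_violations(cdes, covered):
    """Return CDEs lacking column-level lineage, sorted for
    steward and owner follow-up."""
    return sorted(cdes - covered)

violations = policy_violations(cdes, columns_with_lineage)
```

A check like this is what turns the policy from a statement into a control: stewards get a concrete worklist, and audit gets evidence the policy is monitored.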
Quality Metrics (examples):
- Coverage: Percentage of CDEs with end-to-end lineage to consumption.
- Granularity: Percentage of lineage at column level for scoped assets.
- Accuracy: Ratio of validated edges to total edges; discrepancy rate against observed runtime events.
- Freshness: Median lag between execution and lineage availability.
- Completeness: Count of orphan nodes and dangling edges (lower is better).
- Stability: Rate of lineage-breaking changes detected per release.
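Coverage and completeness can be computed directly from the graph. A sketch over an illustrative node and edge set; "scoped assets" here is simplified to all known nodes.

```python
# Illustrative lineage graph.
nodes = {"crm.customer", "raw.customer", "curated.customer",
         "legacy.export"}                      # legacy.export: no edges
edges = {("crm.customer", "raw.customer"),
         ("raw.customer", "curated.customer"),
         ("raw.customer", "raw.missing")}      # target not registered

referenced = {n for e in edges for n in e}
orphan_nodes = nodes - referenced              # completeness signal
dangling_edges = {e for e in edges if not set(e) <= nodes}

# Coverage: share of scoped assets participating in at least one edge.
coverage = len(nodes & referenced) / len(nodes)
```

Here coverage is 0.75, with one orphan node and one dangling edge; in practice the denominator would be the scoped CDE population rather than all nodes.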
Common Use Cases:
- Impact analysis for schema changes, deprecations, and platform migrations.
- Root cause analysis for data incidents and quality regressions.
- Regulatory traceability for financial reporting, risk aggregation, and privacy obligations.
- Data access reviews and least-privilege validation.
- Model governance: Track features, training data, and model outputs to sources.
- Cost and performance optimization by identifying redundant pipelines and unused assets.
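Impact analysis, the first use case above, reduces to a graph traversal: everything reachable downstream of a changed asset is in the blast radius. A sketch over illustrative identifiers; root cause analysis is the same walk over reversed edges.

```python
from collections import defaultdict, deque

# Downstream adjacency list over illustrative dataset identifiers.
downstream = defaultdict(set)
for src, dst in [("crm.customer", "raw.customer"),
                 ("raw.customer", "curated.customer"),
                 ("curated.customer", "analytics.customer_ltv"),
                 ("analytics.customer_ltv", "finance.dashboard")]:
    downstream[src].add(dst)

def impacted_assets(start):
    """Breadth-first walk of everything downstream of `start`:
    the blast radius of a schema change or data incident."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Changing `raw.customer` impacts the curated table, the LTV metric, and the Finance dashboard, which is precisely the notification list for a deprecation or incident.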
Risks and Pitfalls:
- Incomplete capture leading to false assurance.
- Overexposure of sensitive logic or data in lineage UI.
- Stale lineage due to missing runtime events or failed harvesters.
- Opaque transformations (UDFs, scripts) without declared mappings.
- Fragmentation across tools without a common model or identifiers.
- Confusing business users with overly technical views; lack of business lineage.
Standards and Interoperability:
- OpenLineage: Open standard for lineage events from data pipelines.
- W3C PROV: Conceptual model for provenance (useful for record-level provenance and interchange).
- ISO/IEC 11179: Metadata registry principles relevant to entity identification.
- Align lineage metadata with enterprise glossary, data contracts, and quality frameworks.
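An OpenLineage run event can be sketched as a plain dict. The top-level field names follow the OpenLineage specification, but this sketch omits facets, and the producer URI and namespaces are placeholders, not real endpoints.

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage-style COMPLETE event for one job run.
# Facets (schema, data quality, ownership) are omitted for brevity.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "etl", "name": "compute_customer_ltv"},
    "inputs": [{"namespace": "warehouse", "name": "curated.customer"},
               {"namespace": "warehouse", "name": "curated.orders"}],
    "outputs": [{"namespace": "warehouse",
                 "name": "analytics.customer_ltv"}],
    "producer": "https://example.com/lineage-harvester",
}

payload = json.dumps(event)  # what a harvester would POST to a lineage API
```

Emitting one such event per run is what produces the runtime/operational lineage described earlier, and a standard shape is what keeps lineage portable across tools.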
Related Terms:
- Data provenance: Event-level or record-level history and evidence; lineage is broader and typically at dataset/field abstraction.
- Data catalog: System that often stores and visualizes lineage with metadata and glossary.
- Data contract: Formal producer–consumer agreement that can be enforced and traced via lineage.
- Critical Data Element (CDE): High-priority elements requiring stricter lineage controls.
Example (conceptual):
Customer table fields are ingested nightly from CRM to a raw zone, standardized in a curated zone, joined with Orders to compute Customer_LTV in the Analytics schema, and visualized in a Finance dashboard. Column-level lineage maps Customer_LTV to its source fields and transformation logic, with job runs timestamped and linked to data quality checks and owners.
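The conceptual example can be expressed as a column-level lineage record. A sketch with illustrative identifiers, linking the derived column to its sources, transformation, job, and owner as described above.

```python
# Column-level lineage for the Customer_LTV example.
# Keys are derived columns; identifiers are illustrative.
column_lineage = {
    "analytics.customer_ltv.ltv": {
        "sources": ["curated.customer.id", "curated.orders.amount"],
        "transformation": "SUM(orders.amount) GROUP BY customer.id",
        "job": "etl.compute_customer_ltv",
        "owner": "finance-data-team",
    }
}

def upstream_columns(column):
    """Resolve the source columns feeding a derived column."""
    return column_lineage.get(column, {}).get("sources", [])
```

Given a question like "where does the LTV figure on the Finance dashboard come from?", `upstream_columns` answers it in one lookup, with the transformation and owner attached as evidence.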
Notes for Implementation:
- Prioritize CDEs and regulated data flows for column-level, runtime-validated lineage.
- Use a central lineage store with APIs; enforce unique identifiers for systems, datasets, and fields.
- Integrate lineage checks into CI/CD and orchestrator hooks; alert on breaks and stale edges.
- Present both business and technical views, filtered by role-based access controls.