
About DataHub Lineage

Feature Availability
Self-Hosted DataHub
Managed DataHub

Lineage is used to capture data dependencies within an organization. It allows you to track the inputs from which a data asset is derived, along with the data assets that depend on it downstream.

Viewing Lineage

You can view lineage under the Lineage tab on an entity page or on the Lineage Visualization screen.

The UI always shows the latest version of the lineage. The time picker filters edges within that latest version, hiding those that were last updated outside the selected time window. Selecting a time window in the past will not show you historical lineage; it only filters the view of the latest version.
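The filtering behaviour described above can be sketched in a few lines of Python. This is only an illustration of the semantics, not DataHub code: the `LineageEdge` record, the `filter_edges` helper, the sample table names, and the choice of an inclusive window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEdge:
    """Hypothetical edge record for illustration; not a DataHub class."""
    upstream: str
    downstream: str
    last_updated: datetime


def utc(year, month, day):
    """Build a timezone-aware UTC timestamp at midnight."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def filter_edges(edges, start, end):
    """Keep edges of the latest lineage whose last update lies in [start, end].

    Treating both bounds as inclusive is an assumption of this sketch.
    """
    return [e for e in edges if start <= e.last_updated <= end]


# The latest version of the lineage: the only version that is ever displayed.
latest_lineage = [
    LineageEdge("raw.orders", "staging.orders", utc(2023, 1, 10)),
    LineageEdge("staging.orders", "mart.revenue", utc(2023, 6, 5)),
]

# A window covering June 2023 hides the edge last updated in January.
# It does not bring back edges that existed in earlier lineage versions.
visible = filter_edges(latest_lineage, utc(2023, 6, 1), utc(2023, 6, 30))
print([(e.upstream, e.downstream) for e in visible])
```

Note that the filter only ever removes edges from the latest lineage; there is no input from which an older version of the graph could be reconstructed.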

The Lineage Tab is greyed out - why can’t I click on it?

This means you have not yet ingested lineage metadata for that entity. Please ingest lineage to proceed.

Adding Lineage

Ingestion Source

If you're using an ingestion source that supports extraction of Lineage (e.g. Table Lineage Capability), then lineage information can be extracted automatically. For detailed instructions, refer to the source documentation for the source you are using.

UI

As of v0.9.5, DataHub supports the manual editing of lineage between entities. Data experts are free to add or remove upstream and downstream lineage edges in both the Lineage Visualization screen as well as the Lineage tab on entity pages. Use this feature to supplement automatic lineage extraction or establish important entity relationships in sources that do not support automatic extraction. Editing lineage by hand is supported for Datasets, Charts, Dashboards, and Data Jobs. Please refer to our UI Guides on Lineage for more information.

Recommendation on UI-based lineage

Lineage added by hand and lineage added programmatically may conflict with one another and cause unwanted overwrites. It is strongly recommended that lineage be edited manually only in cases where lineage information is not also extracted in an automated fashion, e.g. by running an ingestion source.

API

If you are not using an ingestion source that supports lineage extraction, you can programmatically emit lineage edges between entities via the API. Please refer to the API Guides on Lineage for more information.
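As a rough sketch of what a programmatic lineage edge contains, the snippet below builds dataset URNs and an upstream-lineage payload using plain Python dictionaries. It deliberately does not call the DataHub SDK or send anything to a server; the helper names `dataset_urn` and `build_upstream_lineage`, the sample table names, and the `TRANSFORMED` lineage type are assumptions chosen for illustration. The API Guides on Lineage remain the reference for the actual emitter calls.

```python
import json


def dataset_urn(platform, name, env="PROD"):
    """Format a dataset URN (hypothetical helper written for this sketch)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_upstream_lineage(upstream_urns, lineage_type="TRANSFORMED"):
    """Build a dict shaped like an upstream-lineage aspect.

    The payload is attached to the downstream dataset and lists the
    datasets it is derived from. The field layout is an assumption here.
    """
    return {
        "upstreams": [
            {"dataset": urn, "type": lineage_type} for urn in upstream_urns
        ]
    }


# Hypothetical tables: two upstream sources feeding one downstream table.
upstreams = [
    dataset_urn("bigquery", "project.dataset.orders"),
    dataset_urn("bigquery", "project.dataset.customers"),
]
downstream = dataset_urn("bigquery", "project.dataset.order_summary")

edge = {"entityUrn": downstream, "aspect": build_upstream_lineage(upstreams)}
print(json.dumps(edge, indent=2))
```

The point of the sketch is the direction of the relationship: the lineage is recorded on the downstream entity and names its upstreams, so each emitted payload describes one entity's full set of inputs.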

Lineage Support

Automatic Lineage Extraction Support

This is a summary of automatic lineage extraction support across our data sources. For details, please refer to the Important Capabilities table in the documentation for each source. Note that even if a source does not support automatic extraction, you can still add lineage manually using our API & SDKs.

Source | Table-Level Lineage | Column-Level Lineage | Related Configs
Athena
- include_table_location_lineage
BigQuery
- incremental_lineage
- enable_stateful_lineage_ingestion
- include_table_location_lineage
- lineage_use_sql_parser
- lineage_parse_view_ddl
- lineage_sql_parser_use_raw_names
- extract_column_lineage
- extract_lineage_from_catalog
- include_table_lineage
- upstream_lineage_in_report
Business Glossary
ClickHouse (clickhouse)
ClickHouse (clickhouse-usage)
Databricks
- include_table_lineage
- include_column_lineage
- column_lineage_column_limit
DataHub
dbt (dbt)
- incremental_lineage
dbt (dbt-cloud)
- incremental_lineage
Delta Lake
Druid
Elasticsearch
Feast
File
File Based Lineage
Fivetran
Glue
Google Cloud Storage
Hive
Kafka
Kafka Connect
Looker (looker)
Looker (lookml)
- extract_column_level_lineage
MariaDB
Metabase
Microsoft SQL Server
MLflow
Mode
MongoDB
MySQL
NiFi
Okta
Oracle
Postgres
- include_table_location_lineage
- include_view_lineage
PowerBI (powerbi)
- extract_lineage
- convert_lineage_urns_to_lowercase
- enable_advance_lineage_sql_construct
- extract_column_level_lineage
PowerBI (powerbi-report-server)
Presto
Presto on Hive
Redash
Redshift (redshift)
- incremental_lineage
- enable_stateful_lineage_ingestion
- s3_lineage_config
- include_table_location_lineage
- include_table_lineage
- include_copy_lineage
- include_unload_lineage
- capture_lineage_query_parser_failures
- table_lineage_mode
- extract_column_level_lineage
Redshift (redshift-legacy)
- s3_lineage_config
- include_table_location_lineage
- include_table_lineage
- include_copy_lineage
- include_unload_lineage
- capture_lineage_query_parser_failures
- table_lineage_mode
Redshift (redshift-usage-legacy)
S3 Data Lake
SageMaker
Salesforce
SAP HANA
Snowflake
- incremental_lineage
- enable_stateful_lineage_ingestion
- include_table_location_lineage
- include_table_lineage
- include_view_lineage
- ignore_start_time_lineage
- upstream_lineage_in_report
- include_column_lineage
- include_view_column_lineage
SQL Queries
Superset
Tableau
- extract_column_level_lineage
- lineage_overrides
- extract_lineage_from_unsupported_custom_sql_queries
Teradata
- include_table_location_lineage
- include_table_lineage
- include_view_lineage
Trino (trino)
Trino (starburst-trino-usage)
Vertica
- include_table_location_lineage
- include_view_lineage
- include_projection_lineage
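To show where the related configs sit, here is a minimal sketch of an ingestion recipe written as a Python dictionary. The option names come from the Snowflake entry above; the boolean values, the `datahub-rest` sink with a local server URL, the omitted connection settings, and the `lineage_options` helper are assumptions for illustration. Check the source documentation for the accepted values and defaults of each option.

```python
# A recipe expressed as a plain dict; the same keys would appear in a YAML recipe.
recipe = {
    "source": {
        "type": "snowflake",
        "config": {
            # Connection settings (account, credentials, ...) are omitted here.
            "include_table_lineage": True,   # assumed boolean flag
            "include_column_lineage": True,  # assumed boolean flag
            "include_view_lineage": True,    # assumed boolean flag
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},  # assumed local instance
    },
}


def lineage_options(recipe):
    """Return the lineage-related options set on the recipe's source.

    Hypothetical helper: it simply selects config keys containing 'lineage'.
    """
    config = recipe["source"]["config"]
    return {key: value for key, value in config.items() if "lineage" in key}


for name, value in sorted(lineage_options(recipe).items()):
    print(f"{name} = {value}")
```

A helper like this can be handy for reviewing which lineage features a recipe enables before running it, since the options differ from source to source.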

Types of Lineage Connections

DataHub supports the following types of lineage connections. Example code for each type is listed alongside it.

Connection | Examples | A.K.A.
Dataset to Dataset
- lineage_emitter_mcpw_rest.py
- lineage_emitter_rest.py
- lineage_emitter_kafka.py
- lineage_emitter_dataset_finegrained.py
- DataHub BigQuery Lineage
- DataHub Snowflake Lineage
DataJob to DataFlow
- lineage_job_dataflow.py
DataJob to Dataset (a.k.a. Pipeline Lineage)
- lineage_dataset_job_dataset.py
Chart to Dashboard
- lineage_chart_dashboard.py
Chart to Dataset
- lineage_dataset_chart.py
Our Roadmap

We're actively working on expanding lineage support for new data sources. Visit our Official Roadmap for upcoming updates!

References