Data Provenance in ETL Scenarios
Extract – Transform – Load Processes
Data in large organizations are typically distributed in several heterogeneous sources, organized
and stored under different naming conventions, structures, and formats. For supporting the
functionality of On-Line Analytical Processing (OLAP) applications and Decision Support
Systems (DSS), Data Warehouses (DW) are employed to integrate the data and provide a uniform
infrastructure for querying, reporting, mining, and other advanced analysis techniques. The process
of populating the DW with data stemming from the operational sources, in a way that the schema
and business requirements of the DW are met, is referred to as Extract-Transform-Load (ETL)
process. Typically, such processes are handled by ETL tools, which are pieces of software
responsible for the extraction of data from several sources, their cleansing, customization, and
insertion into a DW. However, even though the term ETL is traditionally related to data
warehousing, it may be used in a wider sense to refer to any process of exchanging and
transforming data between data stores. For example, ETL is the core functionality of a recently
emerging type of web applications, called mashups, where information is extracted from various
web sites, and it is appropriately transformed and integrated before presented to the final user.
Data provenance in ETL
In this context, provenance refers essentially to capturing, representing, and managing metadata
for tracing the origin of data elements that take part in various ETL operations or populate the DW.
It can be viewed at different levels of detail, such as: which sources contribute to a given target
element; which particular elements within these sources; which transformations have the target
element undergone. At the finest level of granularity, provenance can be considered as the inverse
of the ETL process: whereas ETL processes facilitate the transferring of data from the sources to
the warehouse, provenance