Paper 043-2008
ETL and Data Quality: Which Comes First?
Emilio Power, ThotWave Technologies, LLC, Chapel Hill, NC
Greg Nelson, ThotWave Technologies, LLC, Chapel Hill, NC
ABSTRACT
Usually, an early task in any data warehousing project is a detailed examination of the source systems, including an
audit of data quality. Data quality issues can include inconsistent data representation, missing values, and poorly
understood relationships among the various source systems.
As ETL and data quality technologies converge, it is important to use the right tool at the right time to take full
advantage of each tool's strengths. Within SAS® Data Integration Server, there are several opportunities to address
data quality issues; this paper walks through the development of an ETL process with an emphasis on data quality. A
course of action is established, with suggested roles for stakeholders outside the direct development team who can
provide input to the ETL process. The paper then covers discovering data issues, how to address them, and which
tool to use to achieve the ultimate goal of clean output suitable for a data warehouse.
INTRODUCTION
Typically, whenever we build a data warehouse we have certain expectations about the level of data quality in the
data. Most data warehouses draw on many raw source systems, and we usually have an idea of the ideal data
structure for the warehouse, but a central foundation for data quality is often overlooked. In a data warehouse, fact
tables are very sensitive to data 'impurities' in the sense that foreign key relationships to dimensions and levels of
granularity can be seriously affected. These errors cascade, degrading the functionality and the value provided by
the warehouse. Thus, developing a plan for building in data quality right from the start becomes extremely
important. In a previous paper (Nelson, 2007), we
ou