University of Southampton
FACULTY OF MEDICINE, HEALTH AND LIFE SCIENCES
School of Biological Sciences
Data Quality Concepts and Techniques Applied to Taxonomic
Databases
by
Eduardo Couto Dalcin
Thesis for the degree of Doctor of Philosophy
February 2005
Doctor of Philosophy
Abstract
Data Quality Concepts and Techniques Applied to Taxonomic Databases
by
Eduardo Couto Dalcin
This thesis investigates the application of data quality concepts and techniques to
taxonomic databases in order to enhance the quality of information services and systems
in taxonomy. Taxonomic data are introduced and arranged into Taxonomic Data Domains to
establish a standard and a working framework that supports the proposed Taxonomic Data
Quality Dimensions, a specialised application of conventional Data Quality Dimensions to
the Taxonomic Data Quality Domains.
The thesis discusses the improvement of data quality in taxonomic databases,
considering conventional Data Cleansing techniques and applying generic data content
error patterns to taxonomic data. Techniques for taxonomic error detection are explored,
with special attention to spelling errors in scientific names.
The spelling error problem is scrutinised through spelling error detection techniques and
algorithms, which are described and analysed. In order to
evaluate the applicability and efficiency of different spelling error detection algorithms,
a suite of experimental spelling error detection tools was developed and a set of
experiments was performed, using a sample of five different taxonomic databases. The
results of the experiments are analysed from both the algorithm and the database points
of view.
Database quality assessment procedures and metrics are discussed in the context of
taxonomic dat