Cleansing Data for Mining and Warehousing
Mong Li Lee
Hongjun Lu
Tok Wang Ling
Yee Teng Ko
School of Computing
National University of Singapore
{leeml, luhj, lingtw}@comp.nus.edu.sg
Abstract. Given the rapid growth of data, it is important to extract,
mine and discover useful information from databases and data ware-
houses. The process of data cleansing is crucial because of the "garbage
in, garbage out" principle. Dirty data files are prevalent as a result of
incorrect or missing data values, inconsistent value naming conventions,
and incomplete information. Hence, we may have multiple records refer-
ring to the same real-world entity. In this paper, we examine the problem
of detecting and removing duplicate records. We present several effi-
cient techniques to preprocess the records before sorting them so that
potentially matching records will be brought to a close neighbourhood.
Based on these techniques, we implement a data cleansing system which
can detect and remove more duplicate records than existing methods.
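The approach described above can be illustrated with a minimal sketch: records are preprocessed into sort keys so that near-identical records land close together after sorting, and only records within a small sliding window are then compared pairwise. The key construction, the similarity measure, and the threshold below are illustrative assumptions, not the techniques of this paper.

```python
# Sketch of duplicate detection via sorting plus a sliding window.
# Key construction and similarity threshold are illustrative only.
from difflib import SequenceMatcher

def sort_key(record):
    # Preprocess: build a key from the leading characters of each
    # field so that near-identical records sort close together.
    return "".join(field[:3].lower() for field in record)

def similar(r1, r2, threshold=0.8):
    # Flag two records as potential duplicates when their joined
    # field values are sufficiently similar.
    s1, s2 = " ".join(r1).lower(), " ".join(r2).lower()
    return SequenceMatcher(None, s1, s2).ratio() >= threshold

def find_duplicates(records, window=3):
    # Sort by the preprocessed key, then compare each record only
    # with the records inside a small window following it.
    records = sorted(records, key=sort_key)
    pairs = []
    for i, rec in enumerate(records):
        for j in range(i + 1, min(i + window, len(records))):
            if similar(rec, records[j]):
                pairs.append((rec, records[j]))
    return pairs

clients = [
    ("John Smith", "12 Clementi Rd"),
    ("Jon Smith", "12 Clementi Road"),
    ("Mary Tan", "5 Orchard Blvd"),
]
print(find_duplicates(clients))
```

Here the two "Smith" records differ only in spelling, so they receive adjacent sort keys and are matched inside the window, while the unrelated record is never flagged.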
Introduction
Organizations today are confronted with the challenge of handling an ever-
increasing amount of data. In order to respond quickly to changes and make
logical decisions, the management needs rapid access to information in order to
research the past and identify relevant trends. This information is usually kept
in very large operational databases, and the easiest way to gain access to this
data and facilitate strategic decision making is to set up a data warehouse. Data
mining techniques can then be used to find optimal clusterings or interesting
irregularities in the data warehouse because these techniques are able to zoom
in on interesting subparts of the warehouse.
Prior to the process of mining information in a data warehouse, data cleans-
ing or data scrubbing is crucial because of the "garbage in, garbage out"
principle. One important task in data cleansing is to de-duplicate records. In
a normal client database, some clients may be represented by several records for
various reasons