Privacy Preserving Distributed Data Mining
Chris Clifton
Department of Computer Sciences
November 9, 2001
Data mining technology has emerged as a means for identifying patterns and trends from large
quantities of data. Data mining and data warehousing go hand-in-hand: most tools operate on
a principle of gathering all data into a central site, then running an algorithm against that data
(Figure 1). There are a number of applications that are infeasible under such a methodology,
leading to a need for distributed data mining. The obvious solution of a “virtual” data warehouse
– heterogeneous access to all the data – is not always possible. The problem is not simply that the
data is distributed, but that it must be distributed. There are several situations where this arises:
1. Connectivity. Transmitting large quantities of data to a central site may be infeasible.
2. Heterogeneity of sources. It may be easier to combine results than to combine sources.
3. Privacy of sources. Organizations may be willing to share data mining results, but not data.
This research will concentrate on issue 3: obtaining data mining results that are valid across a
distributed data set, with limited willingness to share data between sites. We propose to perform
local operations on each site that produce intermediate data that can be used to obtain the results,
without revealing the private information at each site.
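As a rough illustration of this idea (a minimal sketch, not the protocol proposed in this research), suppose the data is partitioned so that each site holds complete records for its own individuals. Each site can then compute a pair of counts locally, and only those counts are combined. The names local_counts and global_fraction, the record fields, and the sample data are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]          # hypothetical record layout
Summary = Tuple[int, int]           # (matching records, total records)


def local_counts(records: Iterable[Record],
                 matches: Callable[[Record], bool]) -> Summary:
    """Run at a single site: count matching records and total records.

    Only this pair of integers leaves the site, never the records.
    """
    hits = 0
    total = 0
    for record in records:
        total += 1
        if matches(record):
            hits += 1
    return hits, total


def global_fraction(summaries: List[Summary]) -> float:
    """Combine per-site summaries into the fraction over all sites."""
    hits = sum(h for h, _ in summaries)
    total = sum(n for _, n in summaries)
    return hits / total if total else 0.0


# Hypothetical data held privately at two sites.
site_a = [{"age": 62}, {"age": 45}, {"age": 30}, {"age": 28}]
site_b = [{"age": 71}, {"age": 55}, {"age": 19}, {"age": 40},
          {"age": 33}, {"age": 47}]


def over_50(record: Record) -> bool:
    return record["age"] > 50


summaries = [local_counts(site_a, over_50), local_counts(site_b, over_50)]
print(global_fraction(summaries))
```

Note that this sketch hides individual records but still reveals each site's counts to whoever combines them, so it offers only a limited degree of privacy.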
There are many variants of this problem, depending on how the data is distributed, what type
of data mining we wish to do, and what restrictions are placed on sharing of information. Some
problems are quite tractable; others are more difficult. For example, if we are trying to learn
association rules with support and confidence thresholds, a common data mining problem, there is
a simple distributed solution that provides a degree of privacy to the individual sites. An example
association rule could be:
Received Flu shot and age > 50 implies hospital admission, where at least 5% of
insured meet all the criteria (support), and at least