Randomization in Privacy Preserving Data Mining
Alexandre Evfimievski
Cornell University
Ithaca, NY 14853, USA
aevf@cs.cornell.edu
ABSTRACT
Suppose there are many clients, each having some personal
information, and one server, which is interested only in ag-
gregate, statistically significant, properties of this informa-
tion. The clients can protect privacy of their data by per-
turbing it with a randomization algorithm and then submit-
ting the randomized version. The randomization algorithm
is chosen so that aggregate properties of the data can be
recovered with sufficient precision, while individual entries
are significantly distorted. How much distortion is needed
to protect privacy can be determined using a privacy mea-
sure. Several possible privacy measures are known; finding
the best measure is an open question. This paper presents
some methods and results in randomization for numerical
and categorical data, and discusses the issue of measuring
privacy.
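As a minimal sketch of the randomization approach described above (the specific noise distribution, range, and data here are illustrative assumptions, not taken from the paper): each client adds independent zero-mean uniform noise to its value before submitting, so any individual entry is significantly distorted, yet the server can still recover an aggregate such as the mean with good precision over many clients.

```python
import random

random.seed(0)

# Hypothetical client data (for illustration only): 10,000 clients,
# each holding one private numerical value, e.g. an age.
true_values = [random.randint(20, 60) for _ in range(10_000)]

# Randomization: each client perturbs its value with independent noise
# drawn uniformly from [-r, r] and submits only the randomized version.
r = 25.0
randomized = [v + random.uniform(-r, r) for v in true_values]

# The noise has zero mean, so the server can recover the aggregate
# (here, the average) with sufficient precision, even though each
# individual submission may be off by as much as r.
true_mean = sum(true_values) / len(true_values)
estimated_mean = sum(randomized) / len(randomized)

print(f"true mean: {true_mean:.2f}, estimated mean: {estimated_mean:.2f}")
```

The standard error of the recovered mean shrinks as the number of clients grows (roughly as r divided by the square root of 3n), which is why aggregate properties survive a level of distortion that masks individual entries.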
1. INTRODUCTION
Suppose that some company needs to construct an aggre-
gate model of its customers’ personal data. For example,
a retail store wants to know the age and income of its cus-
tomers who are more likely to buy DVD players or mountain
ski equipment; a movie recommendation system would like
to learn users’ movie preferences in order to make adver-
tisements more targeted; or an on-line business arranges its
webpages according to an aggregate model of its website
visitors. In all these cases, there is one central server (the
company), and many clients (the customers), each having
a piece of information. The server collects this information
and builds its aggregate model using, for example, a clas-
sification algorithm or an algorithm for mining association
rules. Often the resulting model no longer contains person-
ally identifiable information, but contains only averages over
large groups of clients.
The usual solution to the above problem consists in hav-
ing all clients send their personal information to the server.
However, many people are be