El Emam et al., PUMF Risk Analysis
Model Formulation
Evaluating Predictors of Geographic Area Population Size Cut-offs to Manage Re-identification Risk
KHALED EL EMAM, ANN BROWN, PHILIP ABDELMALIK
Abstract
Objective: In public health and health services research, the inclusion of geographic information
in data sets is critical. Because of concerns over the re-identification of patients, data from small geographic areas
are either suppressed or the geographic areas are aggregated into larger ones. Our objective is to estimate the
population size cut-off at which a geographic area is sufficiently large so that no data suppression or further
aggregation is necessary.
Design: The 2001 Canadian census data were used to conduct a simulation to model the relationship between
geographic area population size and uniqueness for some common demographic variables. Cut-offs were computed for
geographic area population size, and prediction models were developed to estimate the appropriate cut-offs.
Measurements: Re-identification risk was measured using uniqueness. Geographic area population size cut-offs
were estimated using the maximum number of possible values in the data set and a traditional entropy measure.
Results: The model that predicted population cut-offs using the maximum number of possible values in the data
set had R² values around 0.9 and a relative error of prediction below 0.02 across all regions of Canada. The
models were then applied to assess the appropriate geographic area size for the prescription records provided by
retail and hospital pharmacies to commercial research and analysis firms.
Conclusions: To manage re-identification risk, the prediction models can be used by public health professionals, health
researchers, and research ethics boards to decide when the geographic area population size is sufficiently large.
J Am Med Inform Assoc. 2009;16:256–266. DOI 10.1197/jamia.M2902.
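As a rough illustration of the quantities named under Measurements, the following Python sketch computes uniqueness for a set of records and the maximum number of possible value combinations for a set of demographic variables. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy records, and the reading of "maximum number of possible values" as the product of the category counts of each variable are all introduced here for illustration.

```python
from collections import Counter
from math import prod


def uniqueness(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifier values
    occurs exactly once in `records` (assumed definition of uniqueness)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0


def max_combinations(levels):
    """Product of the number of possible values of each variable
    (assumed reading of 'maximum number of possible values')."""
    return prod(levels.values())


# Hypothetical toy geographic area with five residents (illustrative only).
area = [
    {"sex": "F", "age_group": "30-39"},
    {"sex": "F", "age_group": "30-39"},
    {"sex": "M", "age_group": "30-39"},
    {"sex": "M", "age_group": "40-49"},
    {"sex": "F", "age_group": "40-49"},
]

share_unique = uniqueness(area, ["sex", "age_group"])
n_cells = max_combinations({"sex": 2, "age_group": 10})
```

In this toy area three of the five residents have a combination of sex and age group that no one else shares, so a smaller population or a finer set of variables would be expected to push uniqueness higher.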
Introduction
Privacy legislation in Canada applies to identifiable information. This means that if healt