Estimating parameters for probabilistic linkage of privacy-preserved datasets.
dc.contributor.author | Brown, A. | |
dc.contributor.author | Randall, Sean | |
dc.contributor.author | Ferrante, A. | |
dc.contributor.author | Semmens, J. | |
dc.contributor.author | Boyd, J. | |
dc.date.accessioned | 2017-07-27T05:22:41Z | |
dc.date.available | 2017-07-27T05:22:41Z | |
dc.date.created | 2017-07-26T11:11:27Z | |
dc.date.issued | 2017 | |
dc.identifier.citation | Brown, A. and Randall, S. and Ferrante, A. and Semmens, J. and Boyd, J. 2017. Estimating parameters for probabilistic linkage of privacy-preserved datasets. BMC Med Res Methodol. 17 (1). | |
dc.identifier.uri | http://hdl.handle.net/20.500.11937/54929 | |
dc.identifier.doi | 10.1186/s12874-017-0370-0 | |
dc.description.abstract |
Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. The linkage strategy and associated match probabilities are often estimated through investigations into data quality and manual inspection. However, as privacy-preserved datasets comprise encrypted data, such methods are not possible. In this paper, we present a method for estimating the probabilities and threshold values for probabilistic privacy-preserved record linkage using Bloom filters.
Methods: Our method was tested through a simulation study using synthetic data, followed by an application using real-world administrative data. Synthetic datasets were generated with error rates from zero to 20%. Our method was used to estimate parameters (probabilities and thresholds) for de-duplication linkages. Linkage quality was determined by F-measure. Each dataset was privacy-preserved using separate Bloom filters for each field. Match probabilities were estimated using the expectation-maximisation (EM) algorithm on the privacy-preserved data. Threshold cut-off values were determined by an extension to the EM algorithm allowing linkage quality to be estimated for each possible threshold. De-duplication linkages of each privacy-preserved dataset were performed using both estimated and calculated probabilities. The F-measure at the estimated threshold values was also compared with the highest F-measure. Three large administrative datasets were used to demonstrate the applicability of the probability and threshold estimation technique on real-world data.
Results: Linkage of the synthetic datasets using the estimated probabilities produced an F-measure that was comparable to the F-measure using calculated probabilities, even with up to 20% error. Linkage of the administrative datasets using estimated probabilities produced an F-measure that was higher than the F-measure using calculated probabilities. Further, the threshold estimation yielded F-measures that were only slightly below the highest possible for those probabilities.
Conclusions: The method appears highly accurate across a spectrum of datasets with varying degrees of error. As there are few alternatives for parameter estimation, the approach is a major step towards providing a complete operational approach for probabilistic linkage of privacy-preserved datasets. | |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
dc.title | Estimating parameters for probabilistic linkage of privacy-preserved datasets. | |
dc.type | Journal Article | |
dcterms.source.volume | 17 | |
dcterms.source.number | 1 | |
dcterms.source.issn | 1471-2288 | |
dcterms.source.title | BMC Med Res Methodol | |
curtin.department | Centre for Population Health Research | |
curtin.accessStatus | Open access |
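
The abstract states that match probabilities were estimated with the expectation-maximisation (EM) algorithm. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes that each record pair has already been reduced to a binary agreement pattern per field (for example, by thresholding a Bloom filter similarity score) and that fields agree independently given match status, as in the standard Fellegi-Sunter model. The function names, starting values, and synthetic data generator are hypothetical, and the threshold-estimation extension mentioned in the abstract is not covered.

```python
import random


def simulate_patterns(n_pairs, match_rate, m_true, u_true, seed=0):
    """Generate synthetic binary agreement patterns (illustrative data only)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n_pairs):
        is_match = rng.random() < match_rate
        probs = m_true if is_match else u_true
        patterns.append(tuple(1 if rng.random() < q else 0 for q in probs))
    return patterns


def em_match_probabilities(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate p (share of matching pairs), m (P(agree | match)) and
    u (P(agree | non-match)) per field, assuming conditional independence."""
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    eps = 1e-6  # keeps probabilities strictly inside (0, 1)

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        weights = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for k, agree in enumerate(gamma):
                pm *= m[k] if agree else 1.0 - m[k]
                pu *= u[k] if agree else 1.0 - u[k]
            weights.append(pm / (pm + pu))

        # M-step: re-estimate parameters from the weighted agreement counts.
        total_match = sum(weights)
        total_non = len(patterns) - total_match
        p = clamp(total_match / len(patterns))
        for k in range(n_fields):
            agree_match = sum(g * gamma[k] for g, gamma in zip(weights, patterns))
            agree_non = sum((1.0 - g) * gamma[k] for g, gamma in zip(weights, patterns))
            m[k] = clamp(agree_match / total_match)
            u[k] = clamp(agree_non / total_non)
    return p, m, u
```

Calling `em_match_probabilities(simulate_patterns(5000, 0.2, [0.95] * 4, [0.05] * 4))` returns estimates of `p`, `m` and `u` that can be compared with the values used to generate the data, loosely mirroring the abstract's comparison of estimated and calculated probabilities.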