Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets.
Background: Integrating medical data from different source databases by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, the unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand that these identifiers be encrypted. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to their superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. The real-world performance of these techniques on large-scale data has so far been unknown.
Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated.
Results: Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time, but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine-tuning of parameters.
Conclusions: We argue that the increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
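To make the terms in the abstract concrete, the following is a minimal sketch of how a CLK can be built and compared, and of how recall, precision and F-measure are computed against a gold standard. It is an illustration under stated assumptions, not the study's implementation: the function names, the filter length of 1000 bits, the 20 hash functions per bigram, the keyed double-hashing scheme and the example keys are all hypothetical choices. The multibit tree search used in the study to avoid comparing every pair of filters is not shown.

```python
import hashlib
import hmac


def bigrams(value):
    """Return the set of character bigrams of a padded, lower-cased string."""
    padded = f" {value.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk(fields, keys, length=1000, k=20):
    """Hash the bigrams of all fields into ONE Bloom filter (a CLK).

    The filter is represented as the set of bit positions that are set.
    length, k and the double-hashing scheme are illustrative assumptions.
    """
    bits = set()
    for value, key in zip(fields, keys):
        for gram in bigrams(value):
            data = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key, data, hashlib.sha256).digest(), "big")
            for i in range(k):
                bits.add((h1 + i * h2) % length)
    return bits


def dice(a, b):
    """Dice coefficient of two filters given as sets of set-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def linkage_quality(predicted, truth):
    """Precision, recall and F-measure of predicted links versus true links."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical secret keys, one per identifier field.
KEYS = [b"first-name-key", b"last-name-key"]

rec_a = clk(["Anna", "Smith"], KEYS)
rec_b = clk(["Anna", "Smyth"], KEYS)   # spelling variant of the same person
rec_c = clk(["Peter", "Jones"], KEYS)  # unrelated person

print(dice(rec_a, rec_b), dice(rec_a, rec_c))
print(linkage_quality({(1, 2), (3, 4)}, {(1, 2), (5, 6)}))
```

Pairs whose Dice coefficient exceeds a chosen threshold would be classified as links. The threshold, like the filter length and the number of hash functions, is one of the parameters that the abstract notes must be tuned.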
Showing items related by title, author, creator and subject.
Schnell, Rainer; Borgs, Christian (2017) © 2016 IEEE. In most European settings, record linkage across different institutions is based on encrypted personal identifiers, such as names, birthdays, or places of birth, to protect privacy. However, in practice up to ...
Ensuring privacy when integrating patient-based datasets: New methods and developments in record linkage. Brown, Adrian; Ferrante, Anna; Randall, Sean; Boyd, James; Semmens, James (2017) © 2017 Brown, Ferrante, Randall, Boyd and Semmens. In an era where the volume of structured and unstructured digital data has exploded, there has been an enormous growth in the creation of data about individuals that can ...
Brown, A.; Randall, Sean; Ferrante, A.; Semmens, J.; Boyd, J. (2017) Background: Probabilistic record linkage is a process used to bring together person-based records from within the same dataset (de-duplication) or from disparate datasets using pairwise comparisons and matching probabilities. ...