rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Singhofer, Fabian; Garifullina, Aygul; Kern, Mathias; Scherp, Ansgar

Computer Science > Machine Learning

arXiv:2105.08842 (cs)

[Submitted on 18 May 2021 (v1), last revised 6 Dec 2022 (this version, v2)]

Title:rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Authors:Fabian Singhofer, Aygul Garifullina, Mathias Kern, Ansgar Scherp

View PDF

Abstract:Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows to give different weight on the relational and textual attributes during the anonymization process. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.

Comments:	Accepted paper of DocEng 2021
Subjects:	Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)
Cite as:	arXiv:2105.08842 [cs.LG]
	(or arXiv:2105.08842v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2105.08842

Submission history

From: Ansgar Scherp [view email]
[v1] Tue, 18 May 2021 21:50:12 UTC (644 KB)
[v2] Tue, 6 Dec 2022 20:23:22 UTC (1,025 KB)

Computer Science > Machine Learning

Title:rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:rx-anon -- A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

Submission history

Access Paper:

References & Citations

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators