Study: Individuals in Australian open dataset are re-identifiable
A group of scientists from the University of Melbourne has used a public database of medical records to identify several people, even though the database had been stripped of identifying information.
The scientists warn that removing such identifying data is not enough if the database still contains detailed information about individuals, The Register reports. In the accompanying paper, the authors write that re-identifying individuals in the dataset is possible for “anyone with the technical skills of a computer science student.” The dataset contains medical records of 2.9 million Australians and was published as open data by the Department of Health.
The database contains information about patients, such as payment details, prescriptions and medical records. Each patient has an encrypted identification number to which a date of birth and gender are linked, the researchers said. In addition, every treatment date was randomly shifted by up to two weeks before or after the actual treatment. The data is therefore not fully anonymized; according to the scientists, it is more accurately described as de-identified.
According to the researchers, their work shows that it is not difficult to re-identify people in such a dataset. For example, they were able to identify seven well-known Australians using publicly available information. The researchers say a malicious party could also combine the dataset with other, perhaps leaked, databases to identify more individuals. As a result, a database that seems ‘strong’ today might reveal more information in the future when combined with new data.
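As a rough illustration of how this kind of linkage can work, the Python sketch below matches publicly known facts about a person against de-identified records. It is not the researchers' method. The record layout, the pseudonyms, the function names and all of the data are invented for the example; only the two-week tolerance comes from the description above.

```python
from datetime import date, timedelta

# Tolerance taken from the article: dates were shifted by at most two weeks.
MAX_SHIFT = timedelta(days=14)

# Invented toy data: pseudonym -> (date of birth, gender, shifted treatment dates).
records = {
    "a91f": (date(1970, 5, 1), "F", [date(2014, 3, 9), date(2015, 7, 2)]),
    "c07b": (date(1970, 5, 1), "F", [date(2014, 5, 20)]),
    "e55d": (date(1982, 11, 23), "M", [date(2014, 3, 4)]),
}


def matches(known_dates, released_dates):
    """True if every publicly known date lies within MAX_SHIFT of some released date."""
    return all(
        any(abs(released - known) <= MAX_SHIFT for released in released_dates)
        for known in known_dates
    )


def candidates(born, gender, known_dates):
    """Return the pseudonyms that are consistent with the publicly known facts."""
    return sorted(
        pseudonym
        for pseudonym, (dob, sex, dates) in records.items()
        if dob == born and sex == gender and matches(known_dates, dates)
    )


# Date of birth and gender alone leave two candidates in this toy data.
print(candidates(date(1970, 5, 1), "F", []))
# Two roughly known treatment dates narrow the set to a single pseudonym.
print(candidates(date(1970, 5, 1), "F", [date(2014, 3, 1), date(2015, 7, 10)]))
```

In this toy data, each additional known fact shrinks the candidate set, and shifting the dates only widens the matching window rather than preventing a match.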
Finally, the scientists write that their findings are not an isolated result: it has been clear for some time that individuals in large datasets can be re-identified. They note that several sets of recommendations exist for publishing this kind of data securely, and refer, among other things, to an EU report published last year. According to The Register, a law is currently being drafted in Australia that would prohibit research into the re-identification of individuals in datasets.

