Big Data and the Risk of Re-Identification


On 12 September 2016, researchers at the University of Melbourne alerted the Commonwealth Government that it was possible to re-identify ostensibly “de-identified” Medicare Benefits Schedule (MBS) data that had been released for public access and analysis. The MBS Re-identification Event attracted significant media attention, prompted an investigation by the Office of the Australian Information Commissioner (OAIC) and resulted in the introduction of the Privacy Amendment (Re-identification Offence) Bill 2016 (Cth).

The MBS Re-identification Event is a reminder of the importance of considering the privacy implications of big data analytics.

Big data

Big data refers to the advanced capacity to collect, store, handle and analyse data in ever increasing quantities. A common definition is Gartner’s “three Vs”:

…high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight, decision making, and process optimization.

Put simply, the difference between data and big data is one of scale. More data, with more data fields, allow greater insights and observations to be drawn in ways that older datasets could not offer.

Big data in the health sector

Huge amounts of health sector data are already collected for the purposes of administering the complex systems that underpin the provision of health services in Australia.

At a Federal level, the administration of the MBS, Pharmaceutical Benefits Scheme (PBS) and (to a certain extent) the My Health Record system create rich datasets that, when processed, could shed light on emerging trends in the sector, assist in improving system efficiency and inform research. Indeed, the potential of these datasets has recently been recognised by the Productivity Commission, which noted that healthcare data could be used to inform policy decisions, gain a clearer understanding of patient experience and detect positive and negative trends at an early stage.

Public and private health service providers will also collect their own potentially useful datasets. Even smaller organisations, with their correspondingly small datasets, can benefit from big data analytics by creating partnerships with other organisations in order to pool useful information.

However, the public utility of these datasets butts up against the private interests of individuals in maintaining their privacy. The challenge for health service providers is how to benefit from big data while simultaneously protecting sensitive and identifying information.

De-identification: Protecting information

The Privacy Act 1988 (Cth) (which applies to Commonwealth agencies and many private sector organisations) protects information about an identified or reasonably identifiable individual. Similar legislation applies to public sector organisations at State and Territory levels and (in some cases) to organisations handling identifying health information. While these Acts require organisations to handle identifying information in specific ways, they do not generally apply to information that has been de-identified.

This raises the question: when is data de-identified? In short, personal information is ‘de-identified’ where it is no longer about an identifiable or reasonably identifiable individual (see section 6(1) of the Privacy Act 1988 (Cth)). Typically this will require, at the very least, the removal of names, addresses and contact details. However, even with this information removed, a dataset may (depending on its source) remain reasonably identifiable through the application of specialised knowledge, or through connection with other datasets. Thus, it is sometimes necessary to apply further de-identification techniques, statistical as well as cryptographic, in order to truly de-identify the information. Such techniques include:

  • removing attributes (e.g. income, profession) that may, either by themselves or in combination with other information, identify a particular person;
  • aggregating information into ranges (e.g. data expressed as being that of people 25-30 years old);
  • swapping identifying information between individuals in order to maintain the integrity of the information as a whole, but confound attempts to identify a particular person;
  • generating synthetic data with similar patterns to the original dataset, but without identifying features; and
  • suppressing data that may aid in identifying individuals.
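
By way of illustration only, two of the techniques listed above (aggregation into ranges and suppression) might be sketched in Python as follows. The records, field names, band width and minimum group size are all invented for this example and are not drawn from any real dataset or from any method used by the Commonwealth.

```python
from collections import Counter

# Hypothetical records; every field name and value is invented for illustration.
records = [
    {"age": 27, "postcode": "3000", "item": "23"},
    {"age": 29, "postcode": "3000", "item": "23"},
    {"age": 26, "postcode": "3000", "item": "36"},
    {"age": 61, "postcode": "3999", "item": "104"},
]


def age_band(age, width=5):
    """Aggregate an exact age into a non-overlapping range such as '25-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalise(record):
    """Replace the exact age with its band and keep the other fields."""
    return {
        "age_band": age_band(record["age"]),
        "postcode": record["postcode"],
        "item": record["item"],
    }


def suppress_rare(rows, keys, min_count=2):
    """Drop rows whose combination of `keys` is shared by fewer than min_count rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] >= min_count]


generalised = [generalise(r) for r in records]
released = suppress_rare(generalised, keys=("age_band", "postcode"))

for row in released:
    print(row)
```

In this sketch the single 61-year-old in postcode 3999 is withheld because no other record shares that combination. This illustrates the trade-off discussed in this article: each step that reduces identifiability also removes detail that an analyst might have wanted.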

The challenge posed by big data is whether it can be sufficiently de-identified while retaining its integrity for analysis. As the MBS Re-identification Event shows, it is possible for sufficiently skilled and motivated people to decrypt ostensibly de-identified data (cryptographic attacks), or to re-identify the data by comparison with other (identifying) datasets (linkage attacks). It was the former kind of re-identification that occurred in the MBS Re-identification Event.
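
To make the idea of a linkage attack more concrete, the following Python sketch joins a hypothetical ‘de-identified’ release to a hypothetical public list using birth year, postcode and sex. All names, fields and values are invented, and the matching rule (link only where exactly one candidate exists) is an assumption made for the purposes of the example.

```python
from collections import defaultdict

# Hypothetical release: names removed, but some descriptive fields retained.
released = [
    {"birth_year": 1980, "postcode": "3000", "sex": "F", "item": "23"},
    {"birth_year": 1975, "postcode": "3121", "sex": "M", "item": "104"},
]

# Hypothetical identifying dataset that an intruder could obtain elsewhere.
public = [
    {"name": "Alex Example", "birth_year": 1980, "postcode": "3000", "sex": "F"},
    {"name": "Sam Sample", "birth_year": 1975, "postcode": "3121", "sex": "M"},
    {"name": "Pat Placeholder", "birth_year": 1975, "postcode": "3121", "sex": "M"},
]


def link(released_rows, public_rows, keys):
    """Return (name, item) pairs where exactly one public record matches on `keys`."""
    index = defaultdict(list)
    for person in public_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for row in released_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the row
            matches.append((candidates[0], row["item"]))
    return matches


print(link(released, public, keys=("birth_year", "postcode", "sex")))
```

Here only the first released record is linked, because two people in the public list share the second record's attributes. The more fields a release retains, the fewer people share any given combination and the more records can be linked in this way.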

MBS Re-identification Event

The MBS Dataset was a 10% slice of MBS information released by the Commonwealth Government on its new ‘’ website (which allows the Commonwealth Government to share datasets with the public). The dataset contained (ostensibly de-identified) information about services provided by MBS item, the location of service provision and, importantly, the encrypted healthcare provider and recipient numbers for each MBS-funded service.

With the release of this potentially sensitive dataset, University of Melbourne researchers decided to test its security against re-identification. Using only publicly available information about the encryption mechanisms used by the Commonwealth, the researchers were able to decrypt every service provider identifier in the MBS dataset, thereby allowing the individual health practitioners to be identified. In theory, patient identifiers could also have been decrypted, but the researchers did not do so.
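
This article does not describe the mechanism the Commonwealth actually used, and the following Python sketch should not be read as a reconstruction of it. It simply illustrates, under assumed conditions, why public knowledge of a transformation can be enough to reverse it: if identifiers are short numbers and the transformation is deterministic and uses no secret key, every possible identifier can be transformed in advance and looked up. The five-digit identifier format and the use of SHA-256 are assumptions chosen for the example.

```python
import hashlib


def pseudonymise(provider_id: str) -> str:
    """Hypothetical scheme: an unkeyed, deterministic hash of the identifier."""
    return hashlib.sha256(provider_id.encode("ascii")).hexdigest()


def build_lookup(digits: int = 5) -> dict:
    """Transform every possible identifier and map each result back to its input."""
    lookup = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        lookup[pseudonymise(candidate)] = candidate
    return lookup


# Values as they might appear in a hypothetical published dataset.
published = [pseudonymise("04217"), pseudonymise("98001")]

lookup = build_lookup()
recovered = [lookup[value] for value in published]
print(recovered)
```

In this sketch the whole identifier space contains only 100,000 values, so the lookup table is built in well under a second on ordinary hardware. Keeping a secret key, or replacing identifiers with random values that have no mathematical relationship to the originals, would defeat this particular approach.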

On becoming aware that the MBS Dataset had been re-identified, the Government removed the MBS dataset and a similar PBS dataset from the ‘’ website. An OAIC investigation into the matter is pending. However, the Government has not waited for the results of this investigation and has moved swiftly to introduce legislative changes to protect Commonwealth datasets.

The Privacy Amendment (Re-identification Offence) Bill 2016 (Cth)

On 12 October 2016, the Federal Attorney-General introduced the Privacy Amendment (Re-identification Offence) Bill 2016 (Cth) (Bill) into the Senate. The Attorney-General’s second reading speech explains that the purpose of the Bill is:

to ensure that the considerable benefits associated with the release of public sector datasets can be realised whilst upholding the highest standard of information security and protecting the privacy of Australians.

To this end, the Bill would amend the Privacy Act 1988 (Cth) to make it an offence for an entity (which includes individuals, bodies corporate, small businesses and Commonwealth agencies) to: (1) intentionally re-identify de-identified information that has been disclosed by a Commonwealth agency; and (2) intentionally disclose that re-identified information. Importantly, the offences would not apply in respect of datasets made available by State and Territory agencies (which are regulated by other privacy laws). Criminal penalties of up to 2 years’ imprisonment and/or a fine of $21,600 would apply to a contravention, while civil penalties extend to fines of up to $108,000.

There would be some exemptions from the Bill’s provisions, including for entities providing services under contract to the Commonwealth, for acts undertaken under an agreement with a Commonwealth agency and for entities exempted by the Attorney-General. Further, State and Territory authorities (such as universities and health services) would not be subject to the proposed re-identification offences (though their employees may be liable in their individual capacities).

Effect on research

Somewhat controversially, the Bill’s proposed offences would apply to individuals acting in their personal capacity. Individuals have not typically been required to comply with the Australian Privacy Principles (except in some limited circumstances). Some, such as the University of Melbourne cryptographers who were responsible for the MBS Re-identification Event, are concerned that the Bill will chill legitimate research into cryptography while doing little to address re-identification for criminal or fraudulent purposes.

Whether a researcher would be exempt from the proposed offences depends on whether they operate in the public or private sector, and on the scope of their duties of employment. Public sector researchers engaged by a State or Territory authority (such as a university or a health service) could rely on an exemption if they are acting within their duties of employment (see proposed s 16CA(2) of the Bill). However, if these employees acted outside their duties of employment, they would fall foul of the re-identification offences. Private sector researchers would not be exempt from the Bill; however, the Attorney-General has indicated that the Government would consider implementing an appropriate exemption if it were in the public interest.

In response to concerns about the status of researchers (and other matters), a Senate Committee recently considered the Bill and recommended that it be passed, in part because researchers are often public sector employees and will thus be exempt from the re-identification offences. In any event, the Bill remains before the Senate, and is yet to be considered by the House of Representatives.

Protecting personal information

Big data has big potential and correspondingly big risks. The MBS Re-identification Event makes it clear that effective de-identification of data is vital to ensure the benefits are balanced against the risks to personal privacy. In particular, any decision to publish a de-identified dataset should take into account the robustness of the techniques used to de-identify it.

The MBS Re-identification Event also reminds organisations that handle information to consider privacy as an integral part of data management, especially where that data is to be shared publicly or privately. One useful question to ask is what the OAIC calls the ‘motivated intruder test’. It asks:

whether a reasonably competent motivated person with no specialist skills would be able to identify the data or information (the specific motivation of the intruder is not relevant). It assumes that the motivated intruder would have access to resources such as the internet and all public documents, and would make reasonable enquiries to gain more information.

The answer in the MBS Re-identification Event was a resounding yes.
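
The motivated intruder test is a qualitative judgment, but a simple count can help inform it. The Python sketch below, which uses invented records and field names, reports how many rows in a dataset share each combination of descriptive fields. A row that is unique on fields an intruder could find elsewhere is an obvious candidate for re-identification. This is one rough indicator only and is not a substitute for the OAIC's test.

```python
from collections import Counter


def uniqueness_report(rows, keys):
    """Return (smallest group size, share of rows that are unique) for `keys`."""
    groups = Counter(tuple(row[k] for k in keys) for row in rows)
    unique_rows = sum(1 for size in groups.values() if size == 1)
    return min(groups.values()), unique_rows / len(rows)


# Hypothetical rows; field names and values are invented for illustration.
rows = [
    {"age_band": "25-29", "postcode": "3000"},
    {"age_band": "25-29", "postcode": "3000"},
    {"age_band": "25-29", "postcode": "3000"},
    {"age_band": "60-64", "postcode": "3999"},
]

smallest_group, unique_share = uniqueness_report(rows, ("age_band", "postcode"))
print(smallest_group, unique_share)
```

A smallest group size of 1 means at least one person stands alone on the chosen fields, which would warrant further aggregation or suppression before release.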

For Commonwealth agencies, the Bill will (if it is passed) provide legal protections that supplement technical and systemic protections by imposing criminal sanctions for re-identification of de-identified Commonwealth datasets. However, given that the efficacy of such laws will necessarily depend on effective detection and enforcement, the Bill forms only one piece of the data protection puzzle, and the efficacy of de-identification techniques will remain of central importance.

For private sector and State and Territory entities, appropriate de-identification techniques and strong privacy systems are of even greater importance because those entities will not have the added legal protections proposed by the Bill.

Finally, the MBS Re-identification Event reinforces the importance of privacy by design, and reminds all organisations of the reputational (and potentially financial) risks of unintentional public releases, or of re-identification events.

If you have any questions arising out of this article, please contact Chris Chosich on (03) 9865 1329.
