Data de-identification
De-identifying your data
When a dataset is too sensitive to share in its entirety, consider how a version that is safe to share could be created. This involves de-identification (or anonymisation): removing all data that could be used to identify individual participants in a research project, thereby protecting their privacy.
Types of identifiable information
De-identifying a dataset requires considering both direct and indirect identifiers. Direct identifiers are sufficient on their own to identify a participant and are typically simple to remove. Indirect identifiers are data that can become identifying when combined with other information, and de-identifying them requires more careful assessment. These data elements aren't always problematic, but they can identify individuals when a particular combination of filters restricts the data to a very small population. For example, there may be only one Aboriginal woman aged between 20 and 25 living in Maldon, so applying those filters would reveal the other information the dataset holds about her.
- Direct identifiers (e.g. names, addresses, facial images, phone numbers).
  - These should always be omitted from public records.
  - If pseudonyms or ID numbers are used, the re-identification key must be stored as securely as the raw data would be.
- Indirect identifiers (e.g. place of employment, occupation, postcode, ethnicity, age).
  - These should be de-identified with consideration of how information can be combined to identify a participant.
  - Full and transparent documentation should be included with the data.
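As a minimal sketch of replacing a direct identifier with an ID number, the example below swaps a name for a random pseudonym and returns the re-identification key separately. The function name `pseudonymise`, the field names and the seven-digit ID format are assumptions made for illustration; they are not part of any prescribed method.

```python
import secrets


def pseudonymise(records, direct_field="name"):
    """Replace a direct identifier with a random ID.

    Returns (shareable_records, key). The key maps each ID back to the
    original value and must be stored as securely as the raw data.
    """
    ids = {}      # original value -> pseudonym
    used = set()  # pseudonyms already issued, to avoid collisions
    shareable = []
    for record in records:
        original = record[direct_field]
        if original not in ids:
            new_id = f"#{secrets.randbelow(10**7):07d}"
            while new_id in used:
                new_id = f"#{secrets.randbelow(10**7):07d}"
            used.add(new_id)
            ids[original] = new_id
        # Copy every field except the direct identifier
        cleaned = {k: v for k, v in record.items() if k != direct_field}
        cleaned["participant_id"] = ids[original]
        shareable.append(cleaned)
    key = {pid: original for original, pid in ids.items()}
    return shareable, key


# Hypothetical records, for illustration only
raw = [
    {"name": "Participant One", "age": 34},
    {"name": "Participant Two", "age": 61},
    {"name": "Participant One", "age": 34},
]
shareable, key = pseudonymise(raw)
```

Only `shareable` would be considered for release; `key` stays with the raw data. The IDs are drawn at random rather than derived from the name, so they cannot be recomputed by someone who guesses a name.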
Methods of deidentification
- Omission: the simplest method; just don't include the data in the dataset (e.g. full names).
- Rounding or grouping: binning numeric or categorical data into larger groups (e.g. ages or occupations).
  - Would random noise preserve the data structure better?
  - Do large groups lose too much information, or rule out uses that others are likely to have for the data?
  - How much do the group sizes vary (e.g. postcodes can have very different population sizes)?
- Random noise addition: adding or subtracting random amounts from numeric or geographical data (e.g. dates or sample locations).
  - Is there enough added noise? Consider the standard deviation of your data.
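The three methods above can be sketched in a few lines of Python. This is an illustrative example only: the field names, the helper functions `group_to_decade` and `add_date_noise`, and the 50-day standard deviation are assumptions chosen for the sketch, and appropriate values depend on the dataset.

```python
import random
from datetime import date, timedelta


def group_to_decade(age):
    """Grouping: bin an age into its decade, e.g. 34 -> '30-39'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


def add_date_noise(d, st_dev_days, rng):
    """Random noise: shift a date by a normally distributed number of days."""
    shift = round(rng.gauss(0, st_dev_days))
    return d + timedelta(days=shift)


def deidentify(record, rng):
    """Build a shareable record from a raw one."""
    return {
        # Omission: 'name' and 'ip_address' are simply not copied across
        "age": group_to_decade(record["age"]),
        "date_of_entry": add_date_noise(record["date_of_entry"], 50, rng),
    }


# Unseeded generator: a fixed, published seed would make the noise reproducible
rng = random.Random()

# Hypothetical raw record, for illustration only
raw = {
    "name": "Example Person",
    "age": 34,
    "date_of_entry": date(2020, 1, 15),
    "ip_address": "192.0.2.1",
}
shared = deidentify(raw, rng)
```

Note that `group_to_decade(59)` and `group_to_decade(60)` land in different groups even though the ages are adjacent, which is one reason to ask whether noise would preserve the structure of the data better than grouping.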
For some datasets, making them safely non-identifiable may require stripping away so much data that the set loses almost all value. In these cases, share the metadata describing the dataset (further information), with instructions on how and in what circumstances the raw data may be made available upon request.
Key considerations
Name (Anonymised) | Age (Rounded to decade) | Previous country of residence (No changes made) | Date of entry (Random noise added with st.dev. 50 days) | Current address (Grouped to suburb) | IP address (Omitted) |
---|---|---|---|---|---|
#0923485 | 30-39 | Japan | 2020-02-10 | | omitted |
#6506544 | 60-69 | Russia | 2018-04-04 | | omitted |
#6745859 | 50-59 | Tuvalu | 2020-01-03 | | |
Some things to note about this dataset:
- Name: Merely anonymising the name column was clearly not sufficient to anonymise the data.
- Age: Would adding random noise be more useful? e.g. ages 59 and 60 fall into different categories if grouped by decade.
- Previous country of residence: Is different grouping needed for countries with very different population sizes (e.g. Russia = 150,000,000; Tuvalu = 11,000)?
- Date of entry: Too little noise risks leaving records identifiable; too much noise risks just scrambling the data.
- Address: What level of grouping is needed (e.g. suburb, postcode, city, state)?
- IP address: Did this data need to be collected in the first place?
- Combined: How many people could the row apply to? If only one or a small number, it's still identifiable!
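One way to approach the "Combined" question is to count how many rows share each combination of indirect identifiers and flag the combinations that apply to very few rows. The sketch below assumes hypothetical field names and an arbitrary threshold; it only counts rows within the dataset itself, so it says nothing about how rare a combination is in the wider population.

```python
from collections import Counter


def small_groups(records, indirect_fields, threshold=5):
    """Return combinations of indirect identifiers shared by fewer
    than `threshold` rows, with the number of rows for each."""
    counts = Counter(tuple(r[f] for f in indirect_fields) for r in records)
    return {combo: n for combo, n in counts.items() if n < threshold}


# Hypothetical de-identified rows, for illustration only
rows = [
    {"age": "30-39", "country": "Japan"},
    {"age": "30-39", "country": "Japan"},
    {"age": "30-39", "country": "Japan"},
    {"age": "50-59", "country": "Tuvalu"},
]
risky = small_groups(rows, ["age", "country"], threshold=3)
# risky == {("50-59", "Tuvalu"): 1}
```

Any combination that is flagged is a candidate for coarser grouping, more noise, or omission of one of the fields involved.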
Other considerations:
- Will stronger de-identification of one field allow less de-identification of another?
- Can this data be combined with public data to re-identify participants?
- Which aspects of the data should be prioritised when preserving integrity?
Keep a transparent record of the deidentification process alongside the data (especially when data is altered!):
- What has been changed?
- What has been omitted?
- Is there a process to be granted access to the original raw dataset?
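A record like this can be kept as a small machine-readable file next to the data. The sketch below is one possible layout, written as JSON from Python; the key names and the placeholder access text are assumptions, and the entries mirror the example table above rather than any required schema.

```python
import json

# Hypothetical record of the de-identification process
process_record = {
    "changed": {
        "name": "replaced with an ID number",
        "age": "rounded to decade",
        "date_of_entry": "random noise added, st.dev. 50 days",
        "current_address": "grouped to suburb",
    },
    "omitted": ["ip_address"],
    "unchanged": ["previous_country_of_residence"],
    "raw_data_access": "describe here how, and in what circumstances, "
                       "access to the raw data can be requested",
}

# Serialise for saving alongside the shared dataset
record_text = json.dumps(process_record, indent=2)
```

The resulting text can be saved with the shared files so that anyone reusing the data can see what was changed, what was omitted, and how to ask about the original.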
Useful Links
- BMJ - Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers: Iain Hrynaszkiewicz and colleagues propose a minimum standard for de-identifying datasets to ensure patient privacy when sharing clinical research data.
- Research Data Network: Personal data resources: a curated list of resources for managing personal data and best practice for anonymisation and preservation, compiled by Daniela Duca.