What is Data Anonymization?
Data anonymization is the process of protecting private information by de-identifying, cleansing, or erasing data points that may tie back to an individual. The obvious data fields here relate to Personally Identifiable Information (PII) such as first name/last name, social security number, residential address, email address, etc. Anonymization strips out these fields while retaining data that speaks to trends, high-level demographics, and behaviors, all while keeping the underlying source anonymous.
The General Data Protection Regulation (GDPR) lays out a specific set of guidelines that protect an individual’s data. Two terms it highlights are anonymization and pseudonymization:
- anonymization – Recital 26 of the GDPR describes this as “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. The emphasis here is on cleansing out any identifiable information so that it is impossible to derive information about a specific individual. The bar laid out by the GDPR is a very high one, and data controllers often fall short of actually anonymizing data.
- pseudonymization – Article 4(5) of the GDPR defines this as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information”, provided that the additional information is kept separately and protected. This emphasizes that de-identified data becomes identifiable again when it is paired with that additional information.
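The bottom line with both standards is that the data should be nearly impossible to re-identify and/or back into: for anonymized data, by anyone; for pseudonymized data, by anyone who does not hold the separately stored additional information.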
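To make the role of the "additional information" concrete, here is a minimal Python sketch of pseudonymization. The class name `Pseudonymizer` and its methods are hypothetical, chosen only for illustration; the point is that the lookup table is the additional information and would need to be stored separately from the pseudonymized data set.

```python
import secrets


class Pseudonymizer:
    """Hypothetical helper: swaps identifiers for random tokens.

    The two dictionaries below are the "additional information". In a real
    system they would be stored apart from the pseudonymized data and
    access-controlled; this sketch keeps them in memory only.
    """

    def __init__(self):
        self._token_by_id = {}
        self._id_by_token = {}

    def pseudonymize(self, identifier: str) -> str:
        # Reuse the same token for a repeated identifier so that records
        # belonging to one person can still be analysed together.
        if identifier not in self._token_by_id:
            token = secrets.token_hex(8)  # 16 random hex characters
            self._token_by_id[identifier] = token
            self._id_by_token[token] = identifier
        return self._token_by_id[identifier]

    def reidentify(self, token: str) -> str:
        # Only possible for someone who holds the lookup table.
        return self._id_by_token[token]


pseudonymizer = Pseudonymizer()
token = pseudonymizer.pseudonymize("anna.smith@example.com")
```

Anyone who sees only `token` learns nothing about the email address, while whoever holds the `Pseudonymizer` instance can call `reidentify(token)` to recover it. Destroying the lookup table removes this particular route back to the individual, although quasi identifiers left in the data may still allow re-identification.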
Common Data Anonymization Techniques
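- masking/hashing – masking keeps the format of the data but replaces selected characters with a placeholder such as ‘#’, so a DOB of 1/3/1989 becomes ##/##/19## or ##/##/1###, or is consolidated into a year only without month/day. Hashing instead replaces the whole value with a fixed-length digest. Both make reverse engineering much harder, though not strictly impossible: low-variety values such as dates can still be guessed if too little is hidden or if an unsalted hash is used.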
- pseudonymization – as described earlier, this method replaces private identifiers with dummy data or pseudonyms. For example, ‘Anna Smith’ becomes ‘Tammy Spencer’. The real identity is removed, but the statistical information around this individual is retained.
- generalization/aggregation – this technique removes all of the granular data about an individual and rolls it up into a broad category. So, instead of a full residential address like ‘10 Apple Road, X City, X State, 00000’, the address is converted to a city-level data point like ‘X City’ or the ‘00000’ zip code. For businesses, where the full address is publicly available, the road name may be left in the data.
- dummy data – this is fake, algorithmically synthesized data that does not correspond to any real individual. However, the fake data still mirrors the actual data in its format and in the relationships between the other data attributes.
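The techniques above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production implementation: the function names (`mask_dob`, `dob_to_year`, `generalize_address`, `dummy_ssn`) and the dictionary layout of the address are hypothetical.

```python
import random
from datetime import date


def mask_dob(dob: date) -> str:
    """Masking: keep the date layout and the century, hide everything else."""
    return f"##/##/{dob.year // 100}##"


def dob_to_year(dob: date) -> int:
    """Consolidate a full date of birth into the year only."""
    return dob.year


def generalize_address(address: dict, level: str = "city") -> str:
    """Generalization: roll a full address up to city or zip code level."""
    if level not in ("city", "zip"):
        raise ValueError("level must be 'city' or 'zip'")
    return address[level]


def dummy_ssn(rng: random.Random) -> str:
    """Dummy data: a random value in SSN layout (###-##-####).

    It is not derived from any real number; only the format is preserved.
    """
    return f"{rng.randint(0, 999):03d}-{rng.randint(0, 99):02d}-{rng.randint(0, 9999):04d}"


dob = date(1989, 1, 3)
address = {"street": "10 Apple Road", "city": "X City", "state": "X State", "zip": "00000"}

print(mask_dob(dob))                       # ##/##/19##
print(dob_to_year(dob))                    # 1989
print(generalize_address(address))         # X City
print(generalize_address(address, "zip"))  # 00000
print(dummy_ssn(random.Random()))          # random digits in ###-##-#### layout
```

How much to hide is a design choice: the more of the original value that survives (for example, the century of a birth year), the more useful the data stays for analysis and the easier it becomes to narrow down the original.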
Ensure Lower Risks of Re-Identification
To lower the risk of backing into a particular identity, it is important to consider the two main types of identifiers:
- direct identifiers – these identifiers directly link to an individual and include the traditional PII type of content – phone number, email address, social security number, etc.
- quasi identifiers – these are not unique to a particular individual on their own, but when several of them are combined it is possible to back into an identity. For example, job title, industry, company and location can narrow a CEO/SVP or even a General Manager down to exactly who that person is.
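One simple way to check for this risk is to count how many records share each combination of quasi identifier values; a combination that appears only once points at a single person. The sketch below assumes records are plain dictionaries, and the field names, sample rows, and function names (`group_sizes`, `at_risk`) are hypothetical. The minimum group size is the same idea that underlies k-anonymity, a term the text above does not use.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def at_risk(records, quasi_identifiers, min_group_size=2):
    """Return the records whose combination is shared by fewer than min_group_size rows."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < min_group_size
    ]


# Hypothetical sample data: no direct identifiers, only quasi identifiers.
records = [
    {"job_title": "CEO", "industry": "Retail", "location": "X City"},
    {"job_title": "Analyst", "industry": "Retail", "location": "X City"},
    {"job_title": "Analyst", "industry": "Retail", "location": "X City"},
]
QUASI_IDENTIFIERS = ("job_title", "industry", "location")

risky = at_risk(records, QUASI_IDENTIFIERS)
```

Here `risky` contains only the CEO row, because no other record shares its combination of values. Dropping `job_title` from the quasi identifiers, or generalizing it to a broader band, puts all three rows in one group and nothing is flagged.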
Risks of Poor Anonymization
When executed improperly, a gap in appropriate anonymization can result in identity disclosure of particular individuals, disclosure of certain attributes of a specific individual, and linkability of multiple data points to create a more complete picture of a specific individual (like salary, employer, gender, zip code, alma mater, etc.). This can ultimately make the organization the target of an FTC action and/or put it in violation of the GDPR.