What is PII?
PII, an acronym for Personally Identifiable Information, is the cornerstone of an individual's digital identity. Leaks or mishandling of PII can cause serious problems, from privacy breaches to identity theft. Regulations such as the EU's GDPR (General Data Protection Regulation) and the US HIPAA (Health Insurance Portability and Accountability Act of 1996) underscore the significance of PII by laying out strict measures for its protection.
What is PII Classification?
PII Classification is the task of identifying which text tokens are PII. It relies on techniques used for Named Entity Recognition (NER), which is a type of sequence labeling task.
Given an input sequence of tokens (such as words), the classifier produces an output sequence of labels, one per token, indicating whether that token corresponds to a named entity. Using the default settings, Zetaris will pick up the following 18 named entity types:
PERSON | People, including fictional |
NORP | Nationalities or religious or political groups |
FACILITY | Buildings, airports, highways, bridges, etc. |
ORGANIZATION | Companies, agencies, institutions, etc. |
GPE | Countries, cities, states |
LOCATION | Non-GPE locations, mountain ranges, bodies of water |
PRODUCT | Vehicles, weapons, foods, etc. (Not services) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK OF ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language |
DATE | Absolute or relative dates or periods |
TIME | Times smaller than a day |
PERCENT | Percentage (including “%”) |
MONEY | Monetary values, including unit |
QUANTITY | Measurements, as of weight or distance |
ORDINAL | “first”, “second” |
CARDINAL | Numerals that do not fall under another type |
Note: The default named entity settings can be configured for specific elements.
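To make the sequence labeling idea concrete, here is a minimal sketch of how per-token labels can be merged back into entities. It assumes the common BIO tagging scheme (B- marks the beginning of an entity, I- a continuation, O a non-entity token); the scheme, the hand-written labels, and the `group_entities` helper are illustrative assumptions, not a description of how Zetaris represents its labels internally.

```python
# Hypothetical, hand-labelled example; a real classifier would predict the labels.
tokens = ["Amy", "Smith", "joined", "Zetaris", "in", "2021", "."]
labels = ["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "O", "B-DATE", "O"]


def group_entities(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity that was still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent tag) closes the open entity, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


print(group_entities(tokens, labels))
# [('PERSON', 'Amy Smith'), ('ORGANIZATION', 'Zetaris'), ('DATE', '2021')]
```

The input sequence (`tokens`) and the output sequence (`labels`) have the same length, which is what makes this a sequence labeling task rather than a whole-document classification.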
What happens after classifying PII?
Tagging
For structured and unstructured data sets, once PII has been identified, it is assigned the tag "PII".
These tags can then be used to generate automated security policies that obfuscate the tagged data for unprivileged users.
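As a rough illustration of how a tag can drive a policy, the sketch below masks every column tagged "PII" unless the user is privileged. The column names, the `column_tags` mapping, and the `apply_policy` function are all hypothetical; they only show the idea and do not reflect how Zetaris defines or enforces its security policies.

```python
# Hypothetical column tags, as a classifier might have assigned them.
column_tags = {
    "customer_name": {"PII"},
    "phone": {"PII"},
    "order_total": set(),
}


def apply_policy(row, column_tags, user_is_privileged):
    """Return a copy of the row with PII-tagged columns masked for unprivileged users."""
    if user_is_privileged:
        return dict(row)
    return {
        column: "****" if "PII" in column_tags.get(column, set()) else value
        for column, value in row.items()
    }


row = {"customer_name": "Amy", "phone": "85562333", "order_total": 42.5}
print(apply_policy(row, column_tags, user_is_privileged=False))
# {'customer_name': '****', 'phone': '****', 'order_total': 42.5}
```

Because the policy keys off the tag rather than the column name, newly classified columns would be covered without anyone editing the policy by hand.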
Describing
The automated classifier will also attempt to describe the data as accurately as possible, leveraging both the metadata and the field-level values.
Data Anonymization
One approach is to replace the identifier with a dummy value. For example, if a person’s name or phone number is recognized in a sentence like “My name is Amy and my number is 85562333”, then the anonymized text can be “My name is <PERSON_NAME> and my number is <PHONE_NUMBER>”. This method is simple and straightforward, but it has a disadvantage: using one value for every entity of the same type loses information. In this example, we would no longer know the person's gender. Depending on how the anonymized text will be used, it can be desirable to keep such information. Similarly, for addresses, one may want to preserve the geographical distribution to a certain level.
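A minimal sketch of this dummy-value replacement follows. It assumes the detection step has already produced character spans for each entity; the `replace_with_placeholders` and `span_of` helpers are hypothetical names used only for illustration.

```python
def replace_with_placeholders(text, entities):
    """Replace each (start, end, entity_type) span with <ENTITY_TYPE>."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end, entity_type in sorted(entities, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


def span_of(text, value, entity_type):
    """Helper for this example: locate a known value and return its span."""
    start = text.index(value)
    return (start, start + len(value), entity_type)


text = "My name is Amy and my number is 85562333"
# Spans are written by hand here; in practice they come from the classifier.
entities = [
    span_of(text, "Amy", "PERSON_NAME"),
    span_of(text, "85562333", "PHONE_NUMBER"),
]
print(replace_with_placeholders(text, entities))
# My name is <PERSON_NAME> and my number is <PHONE_NUMBER>
```

Every name maps to the same `<PERSON_NAME>` token, which is exactly why attributes such as gender cannot be recovered from the output.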
Another method is to replace the detected entity with a surrogate value. The surrogate can be either randomly sampled from a predetermined list for each entity type or generated according to a rule.
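The two ways of producing a surrogate can be sketched as follows. The surrogate list, the phone-number rule (keep the length, randomize the digits), and the function names are assumptions made for this example. A list restricted to female first names is used to show how a surrogate could retain information, such as gender, that a single dummy value would discard.

```python
import random

# Hypothetical surrogate lists, one per entity type.
SURROGATES = {"PERSON_NAME": ["Alice", "Maria", "Sophie"]}


def rule_based_phone(original, rng):
    """Rule: keep the length and layout, replace every digit with a random digit."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in original)


def surrogate_for(entity_type, original, rng):
    """Pick a surrogate by sampling from a list, by rule, or fall back to a placeholder."""
    if entity_type in SURROGATES:
        return rng.choice(SURROGATES[entity_type])
    if entity_type == "PHONE_NUMBER":
        return rule_based_phone(original, rng)
    return f"<{entity_type}>"


rng = random.Random(0)  # seeded only so that repeated runs give the same values
print(surrogate_for("PERSON_NAME", "Amy", rng))
print(surrogate_for("PHONE_NUMBER", "85562333", rng))
```

Passing the random generator in explicitly keeps the functions easy to test; a production setup would also need to decide whether the same original value should always map to the same surrogate.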
Detailed Process of Detection and Anonymization
The process of detection and anonymization goes through two phases. First, the text runs through the Analyzer, which locates where PII exists. Second, the Anonymizer obfuscates the detected data according to the configured rules.
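The two phases can be sketched end to end as shown below. This is a simplified stand-in under stated assumptions: the regular expressions replace a trained NER model, and the `analyze` and `anonymize` functions, the `PATTERNS` table, and the `RULES` table are hypothetical names that do not correspond to the product's actual Analyzer or Anonymizer interfaces.

```python
import re

# Hypothetical detection patterns standing in for a trained NER model.
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\b\d{8}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Hypothetical per-entity anonymization rules for phase 2.
RULES = {
    "PHONE_NUMBER": lambda value: "*" * (len(value) - 2) + value[-2:],
    "EMAIL": lambda value: "<EMAIL>",
}


def analyze(text):
    """Phase 1: return (start, end, entity_type) for each detected span."""
    results = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((match.start(), match.end(), entity_type))
    return sorted(results)


def anonymize(text, results, rules):
    """Phase 2: rewrite each detected span using the rule for its entity type."""
    # Go right to left so that replacements do not shift the remaining offsets.
    for start, end, entity_type in sorted(results, reverse=True):
        rule = rules.get(entity_type, lambda value: f"<{entity_type}>")
        text = text[:start] + rule(text[start:end]) + text[end:]
    return text


text = "Call 85562333 or write to amy@example.com"
print(anonymize(text, analyze(text), RULES))
# Call ******33 or write to <EMAIL>
```

Keeping the phases separate means the detection results can be inspected or stored (for example, for tagging) before any text is changed, and the rules can vary per entity type without touching detection. The sketch does not handle overlapping spans.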