Automated PII Classification

What is PII?

What is PII Classification?

What happens after classifying PII?

Detailed Process of Detection and Anonymization

What is PII?

PII (Personally Identifiable Information) is any information that can be used to identify a specific individual, which makes it the cornerstone of a person's digital identity. Leaks or mishandling of PII can lead to serious problems, from privacy breaches to identity theft. Regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act of 1996) underscore the significance of PII by laying out strict measures for its protection.

What is PII Classification?

PII Classification is the task of identifying which tokens in a text are PII. In particular, it uses techniques developed for Named Entity Recognition (NER), which is a type of sequence labeling task.

In sequence labeling, the input sequence is the tokens (such as words) and the output sequence is one label per token, indicating whether that token corresponds to a named entity. Using the default settings, Zetaris will pick up on 18 different named entity types:

PERSON	People, including fictional
NORP	Nationalities or religious or political groups
FACILITY	Buildings, airports, highways, bridges, etc.
ORGANIZATION	Companies, agencies, institutions, etc.
GPE	Countries, cities, states
LOCATION	Non-GPE locations, mountain ranges, bodies of water
PRODUCT	Vehicles, weapons, foods, etc. (Not services)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK OF ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language
DATE	Absolute or relative dates or periods
TIME	Times smaller than a day
PERCENT	Percentage (including “%”)
MONEY	Monetary values, including unit
QUANTITY	Measurements, as of weight or distance
ORDINAL	“first”, “second”
CARDINAL	Numerals that do not fall under another type

Note: The default named entity settings can be configured for specific elements.
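To make the idea of sequence labeling concrete, here is a minimal sketch of how per-token labels can be grouped back into named entities. It assumes the common BIO labeling scheme (B- marks the beginning of an entity, I- a continuation, O a non-entity token). The scheme, the function name spans_from_bio, and the sample sentence are illustrative assumptions, not a description of how Zetaris labels tokens internally.

```python
from typing import List, Optional, Tuple


def spans_from_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration only.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; flush any entity still being collected.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity currently being collected.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- label) ends the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Input sequence (tokens) and output sequence (one label per token).
tokens = ["Amy", "Lee", "joined", "Acme", "on", "Monday"]
labels = ["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "O", "B-DATE"]

entities = spans_from_bio(tokens, labels)
print(entities)
```

The two lists have the same length: every token receives exactly one label, which is what distinguishes sequence labeling from classifying a whole document at once.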

What happens after classifying PII?

Tagging

For structured and unstructured data sets, once PII has been identified it will be assigned a tag of "PII".

These tags can then be used to generate automated security policies that obfuscate the tagged data for unprivileged users.
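As a rough illustration of a tag-driven policy, the sketch below masks every column tagged "PII" unless the user is privileged. The function apply_pii_policy, the column_tags mapping, and the mask value are hypothetical names chosen for this example. They do not represent the Zetaris policy engine or its configuration format.

```python
from typing import Dict, Set

MASK = "****"  # assumed placeholder shown to unprivileged users


def apply_pii_policy(
    row: Dict[str, str],
    column_tags: Dict[str, Set[str]],
    privileged: bool,
) -> Dict[str, str]:
    """Return a copy of the row with PII-tagged columns masked if needed."""
    if privileged:
        return dict(row)
    return {
        column: MASK if "PII" in column_tags.get(column, set()) else value
        for column, value in row.items()
    }


# Tags as they might look after classification (illustrative data).
column_tags = {"name": {"PII"}, "phone": {"PII"}, "country": set()}
row = {"name": "Amy", "phone": "85562333", "country": "AU"}

masked = apply_pii_policy(row, column_tags, privileged=False)
unmasked = apply_pii_policy(row, column_tags, privileged=True)
print(masked)
```

Because the policy is keyed on the tag rather than on column names, newly classified columns are covered as soon as they receive the "PII" tag.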

Describing

The automated classifier will also attempt to describe the data as accurately as possible, leveraging both the metadata and the field-level values.

Data Anonymization

One approach is to replace each identifier with a dummy value. For example, if a person’s name and phone number are recognized in a sentence like “My name is Amy and my number is 85562333”, the anonymized text can be “My name is <PERSON_NAME> and my number is <PHONE_NUMBER>”. This method is simple and straightforward, but it has a disadvantage: using a single value for every entity of the same type loses information. In this example, we can no longer infer the person’s gender from the name. Depending on how the anonymized text will be used, it can be desirable to keep such information. Similarly, for addresses one may want to preserve the geographical distribution to a certain level.
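The dummy-value replacement described above can be sketched in a few lines. This example assumes the entity positions have already been detected and are supplied as (start, end, entity_type) tuples. The function name replace_with_placeholders is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity type)


def replace_with_placeholders(text: str, spans: List[Span]) -> str:
    """Replace each detected span with a <ENTITY_TYPE> placeholder."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, entity_type in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


text = "My name is Amy and my number is 85562333"

# Offsets are computed here for the example; normally a detector supplies them.
name_start = text.index("Amy")
phone_start = text.index("85562333")
spans = [
    (name_start, name_start + len("Amy"), "PERSON_NAME"),
    (phone_start, phone_start + len("85562333"), "PHONE_NUMBER"),
]

anonymized = replace_with_placeholders(text, spans)
print(anonymized)
# My name is <PERSON_NAME> and my number is <PHONE_NUMBER>
```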

Another method is to replace the detected entity with a surrogate value. The surrogate can be either randomly sampled from a predetermined list for each entity type or generated according to a rule.
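Both surrogate strategies can be sketched as follows: names are sampled from a predetermined list, and phone numbers follow a simple rule that swaps each digit for a random digit while keeping the length and any separators. The list contents, the rule, and the function name surrogate_for are assumptions made for illustration only.

```python
import random

# Assumed predetermined surrogate lists, one per entity type.
SURROGATES = {
    "PERSON_NAME": ["Alex", "Jordan", "Sam", "Taylor"],
}


def surrogate_for(value: str, entity_type: str, rng: random.Random) -> str:
    """Return a surrogate value for a detected entity."""
    if entity_type == "PHONE_NUMBER":
        # Rule-based: replace every digit, keep length and separators.
        return "".join(
            str(rng.randrange(10)) if ch.isdigit() else ch for ch in value
        )
    candidates = SURROGATES.get(entity_type)
    if candidates:
        # List-based: sample a replacement of the same entity type.
        return rng.choice(candidates)
    # Fall back to a plain placeholder for unknown entity types.
    return f"<{entity_type}>"


rng = random.Random(0)  # seeded only to make the example reproducible
name_surrogate = surrogate_for("Amy", "PERSON_NAME", rng)
phone_surrogate = surrogate_for("8556-2333", "PHONE_NUMBER", rng)
print(name_surrogate, phone_surrogate)
```

Unlike a fixed placeholder, the surrogate keeps the shape of the original value, so downstream code that expects a plausible name or a correctly formatted phone number continues to work.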

Detailed Process of Detection and Anonymization

The process of detection and anonymization goes through two phases. First, the text runs through the Analyzer, which locates where PII exists. Second, the Anonymizer obfuscates the detected data according to the rules configured for PII detection.
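The two phases can be sketched as two separate functions. The Analyzer and Anonymizer names come from the text above; everything else here (the regular expressions, the Finding type, and the per-entity rules) is a simplified assumption and not the actual Zetaris implementation, which relies on NER rather than only on patterns.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Finding(NamedTuple):
    start: int
    end: int
    entity_type: str


# Phase 1 configuration: assumed, deliberately simple detection patterns.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{8,}\b"),
}

# Phase 2 configuration: one obfuscation rule per entity type.
RULES: Dict[str, Callable[[str], str]] = {
    "EMAIL": lambda value: "<EMAIL>",
    "PHONE_NUMBER": lambda value: "*" * (len(value) - 2) + value[-2:],
}


def analyze(text: str) -> List[Finding]:
    """Phase 1 (Analyzer): locate where PII exists in the text."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(match.start(), match.end(), entity_type))
    return sorted(findings)


def anonymize(
    text: str, findings: List[Finding], rules: Dict[str, Callable[[str], str]]
) -> str:
    """Phase 2 (Anonymizer): obfuscate each finding using its rule."""
    # Apply right to left so offsets of earlier findings remain valid.
    # This sketch assumes findings do not overlap.
    for finding in sorted(findings, reverse=True):
        rule = rules.get(finding.entity_type, lambda value: "<REDACTED>")
        original = text[finding.start:finding.end]
        text = text[:finding.start] + rule(original) + text[finding.end:]
    return text


text = "Contact amy@example.com or call 85562333."
findings = analyze(text)
result = anonymize(text, findings, RULES)
print(result)
# Contact <EMAIL> or call ******33.
```

Keeping detection and obfuscation separate means the rules can change (for example, from masking to surrogate values) without touching the detection step.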