Looking for Personally Identifiable Information? Which Method Is Best?

(Hint: Heard of Linguistic Modeling?)

Personally identifiable information (PII) is critical to businesses these days, but just having it on hand comes with great risk. Because of the harm its disclosure can cause an individual, a litany of rules and regulations surrounds it, creating numerous obligations for those that store it. Companies that interact with PII (and these days which ones don’t?) need to be aware of the heightened requirements for handling such data, because regulatory agencies are ready and willing to penalize those with inadequate data security policies.

The U.S. has a hodgepodge of state and federal regulations that govern the use and maintenance of PII, such as the Federal Trade Commission Act, which has been used to regulate online and electronic data security. The Gramm-Leach-Bliley Act has been applied to enforce rights related to financial information, and the Health Insurance Portability and Accountability Act (HIPAA) protects personal health information. U.S. states have passed their own laws and regulations as well, such as the California Online Privacy Protection Act (CalOPPA) and the California Consumer Privacy Act of 2018 (CCPA), which will take effect in 2020. And of course, there’s the imposing EU General Data Protection Regulation (GDPR), a comprehensive law governing the use of personal information that took effect in 2018 and has also been fostering some trepidation in the U.S.

Protect it? Sure. But first you have to identify it. 

In a nutshell, laws and regulations aim to have PII treated differently from other data and kept more secure. But in order to give this type of data special treatment, a company needs to understand where in its electronic storage such data exists, especially in the event of a data breach. You have to be able to identify it, which is much easier said than done.

Accurate identification of PII is becoming something of a holy grail in the data world. In truth, no approach is 100% perfect, but some methods are considerably more successful than others. Let’s consider a few different ways to identify PII.

Manual Review. With manual review, human reviewers examine text and make nuanced judgment calls about whether documents contain PII, but this can be a slow, costly process and is subject to human error. With the volume of data in most companies today, manual review is not really a viable option for identifying the PII at risk in the company.

Technological Solutions. Technology can speed up the process, and there are a few methods commonly used, including Key Terms, Entity Extraction, Regular Expressions and Linguistic Modeling. These methods deploy different flavors of search with varying levels of accuracy. Read on to see which is best.

  • Key Terms. When the key terms method is applied, queries are run for words, phrases and other document content that are believed to be indicative of PII. While faster than manual review, key terms ignore document context and require burdensome testing. The lack of document context also tends to result in low precision (many hits that are not PII) and low recall (PII that the terms miss).
  • Entity Extraction. Entity extraction uses machine learning methods to locate content such as people, places and organizations, and then creates metadata fields so that this data can be searched. This method can also help accelerate the process, but it is not ideal because entities tend to be too broad for PII use cases and usually result in over-inclusion.
  • Regular Expressions. When regular expressions are employed, character patterns written in a standardized syntax are run against the data to locate text configurations potentially of interest, such as Social Security numbers. While useful for identifying patterns and numbers, regular expressions are prone to false positives because they cannot recognize document context.
  • Linguistic Modeling. Linguistic modeling is currently the gold standard and the top performer for identifying PII. It performs well because it can overcome the limitations of the other methods. In this approach, thousands of complex search terms designed by linguists to classify PII are run on the real data population. The terms, which are derived from linguistic analysis of documents containing multiple categories of known PII, can recognize context by taking into account the co-occurrence of numerous features in a document, as opposed to words, phrases or patterns alone. The results are then packaged into categories of PII, providing a clear picture of what types of PII exist in the data. With linguistic modeling, it is also possible to provide granular results, identifying various financial, medical or other PII. The benefits of granular results are significant: knowing what type of PII has been identified can matter greatly, depending on the reasons for undertaking the search, and the accuracy of the approach can make the search much more efficient and cost-effective.
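
To make the difference between pattern matching and context-aware search concrete, here is a minimal Python sketch. It is only an illustration, not a real linguistic model: the cue list, the 40-character window and the function names are all assumptions made up for this example, and an actual model would rely on thousands of linguist-designed terms rather than a handful of keywords.

```python
import re

# Pattern for the common NNN-NN-NNNN Social Security number layout.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical context cues, chosen for illustration only.
CONTEXT_CUES = ("ssn", "social security", "date of birth", "dob", "taxpayer")


def regex_hits(text):
    """Return every substring that matches the SSN layout, ignoring context."""
    return SSN_PATTERN.findall(text)


def context_hits(text, window=40):
    """Keep only matches that co-occur with a context cue nearby.

    `window` is the number of characters inspected on each side of a match
    (an assumed value, not a recommended setting).
    """
    hits = []
    for match in SSN_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        nearby = text[start:end].lower()
        if any(cue in nearby for cue in CONTEXT_CUES):
            hits.append(match.group())
    return hits


# Two made-up documents: one with a (fake) SSN, one with a part number.
docs = [
    "Employee SSN: 123-45-6789, date of birth 01/02/1980.",
    "Please reference part number 987-65-4321 when ordering.",
]

for doc in docs:
    print(regex_hits(doc), context_hits(doc))
```

The pattern alone flags both documents, including the part number, which is the kind of false positive described above. Requiring a nearby cue drops the part number while keeping the real hit, a very rough stand-in for the co-occurrence of features that linguistic modeling looks for.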

Regardless of the reason for needing to identify PII, it has become an important topic for companies and attorneys. Currently, linguistic modeling is the best approach to help companies avoid the risks associated with failing to adequately protect this data, or to respond to worst-case scenarios in which such data may have been breached. If your company falls into one of these categories, contact an expert to discuss how linguistic modeling can help identify PII.

