Greatest Hits: Improving Precision in Keyword Search
When performing a keyword search for relevant documents in litigation, it may seem safer to err on the side of over-tagging so that the odds of missing relevant documents are minimized. But it is often quite easy to improve precision (i.e., eliminate false hits), and from a time and cost perspective it’s well worth making the effort. This doesn’t necessarily mean that you need to come up with different search terms. Precision can be greatly improved by simply leveraging some basic Boolean search operators. A few iterative passes through the population after some simple adjustments have been made can reduce the tagged population considerably. There are several tactics you can use.
- Locate promising keywords to refine
Start by running the basic keyword search and then reviewing the hit count of each individual keyword. Do any keywords hit far more documents than you would have predicted? Keywords with a large number of hits are often the best places to improve precision, as they are more likely to be over-inclusive, adding irrelevant document volume to the overall tagged total. Assessing what got pulled in that doesn’t really belong there will show you how to refine the search so that it captures only what you need.
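As a rough illustration of reviewing per-keyword hit counts, the sketch below counts how many documents each keyword hits and lists the keywords from most to fewest hits. The mini-corpus, the keyword list, and the `hit_counts` helper are all hypothetical; a real review platform would report these counts through its own search report.

```python
import re

# Hypothetical mini-corpus: document ID -> extracted text.
documents = {
    "doc1": "Minutes of the marketing board meeting held in March.",
    "doc2": "Please bring the budget to the staff meeting.",
    "doc3": "The marketing plan targets new customers.",
    "doc4": "Meeting room bookings for next week.",
}

# Hypothetical keyword list under evaluation.
keywords = ["meeting", "marketing", "budget"]


def hit_counts(docs, terms):
    """Return {term: number of documents containing the term as a whole word}."""
    counts = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = sum(1 for text in docs.values() if pattern.search(text))
    return counts


counts = hit_counts(documents, keywords)

# Sort from most to fewest hits so over-inclusive keywords surface first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for term, count in ranked:
    print(f"{term}: {count}")
```

In this toy corpus "meeting" tops the list, which would make it the first candidate for refinement.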
- Add an “anchor”
Sometimes, a keyword needs an additional element in order to increase precision. For example, a keyword like “meeting minutes” might grab the targeted meeting minutes in addition to irrelevant meeting minutes. By adding a subject matter “anchor”—a keyword element that is thought to be highly correlated with the presence of relevant subject matter—precision can be increased: “meeting minutes” could become “marketing board AND meeting minutes” or “marketing board w/10 meeting minutes.”
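The effect of an anchor can be sketched in a few lines. The example below compares the bare phrase, the AND-anchored version, and the w/10 version on three made-up documents. The helper names are hypothetical, and proximity is measured here as the gap between the starting word positions of the two phrases, which is a simplifying assumption; search platforms define w/N in their own ways.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return the token indexes at which the phrase starts."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def contains(text, phrase):
    return bool(phrase_positions(tokenize(text), phrase))


def within(text, phrase_a, phrase_b, distance):
    """True if the two phrases start within `distance` tokens of each other."""
    tokens = tokenize(text)
    pos_a = phrase_positions(tokens, phrase_a)
    pos_b = phrase_positions(tokens, phrase_b)
    return any(abs(a - b) <= distance for a in pos_a for b in pos_b)


# Hypothetical documents for illustration only.
docs = {
    "a": "Attached are the meeting minutes from the marketing board session.",
    "b": "Meeting minutes for the facilities team are attached.",
    "c": "The marketing board approved the budget. "
         + "Unrelated filler text. " * 5
         + "Meeting minutes will follow.",
}

base = [d for d, t in docs.items() if contains(t, "meeting minutes")]
anchored_and = [d for d, t in docs.items()
                if contains(t, "marketing board") and contains(t, "meeting minutes")]
anchored_near = [d for d, t in docs.items()
                 if within(t, "marketing board", "meeting minutes", 10)]

print(base, anchored_and, anchored_near)
```

The bare phrase hits all three documents, the AND version drops the facilities minutes, and the proximity version also drops the document in which the two phrases are far apart.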
- Use exclusion
If you can trace the imprecision to a particular element, you can add exceptions to an existing keyword by using “NOT.” For example, “meeting minutes” could become “meeting minutes NOT executive committee meeting minutes.” This approach can be more time-consuming, however, as it often requires creating exceptions that are document-specific. Also, if a sample of a document population is being used to validate keywords, document-specific exceptions are unlikely to address other irrelevant documents that are present in the population but outside the sample.
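A minimal sketch of an exclusion, assuming document-level NOT semantics (any document containing the excluded phrase is dropped entirely). The documents and helper names are hypothetical.

```python
import re


def has_phrase(text, phrase):
    """True if the phrase appears as whole words, ignoring case and spacing."""
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def matches(text, include, exclude):
    """Rough equivalent of: include NOT exclude (applied per document)."""
    return has_phrase(text, include) and not has_phrase(text, exclude)


# Hypothetical population; assume only p1 is actually relevant.
population = {
    "p1": "Marketing board meeting minutes, second quarter.",
    "p2": "Executive committee meeting minutes, second quarter.",
    "p3": "Safety council meeting minutes, second quarter.",
}

hits_before = [d for d, t in population.items()
               if has_phrase(t, "meeting minutes")]
hits_after = [d for d, t in population.items()
              if matches(t, "meeting minutes", "executive committee meeting minutes")]

print(hits_before, hits_after)
```

The exception removes p2 but leaves p3, a different kind of irrelevant minutes, which is the limitation noted above: a specific exception only removes the false hits you already know about.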
- Revisit the use of wildcards
Wildcards, when used carefully with keywords, are an efficient way to cover variations of a concept. However, wildcards can also go haywire unexpectedly and pull in unintended words, so the results need scrutiny to see whether a revision makes sense. For example, if the original intent of the keyword “sting*” was to return discussions about stinging insects, you may not want documents that contain the word “stingy.” Replacing the wildcard operator with a more limited set of keyword variants (“sting OR stings OR stinger OR stinging”) or using an exception to exclude unwanted hits (“sting* NOT stingy”) can help to boost precision.
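One way to scrutinize a wildcard is to list the distinct words it actually matched. The sketch below approximates “sting*” with a regular expression, tallies the expansions, and then compares the wildcard against the explicit variant list and the NOT exception. The sample texts are hypothetical, and the regular expressions only approximate how a search platform would expand a wildcard.

```python
import re
from collections import Counter

# Hypothetical sample texts.
texts = [
    "The wasp sting caused swelling.",
    "Stinging insects were found near the vent.",
    "He was stingy with the budget.",
    "Remove the stinger and treat bee stings promptly.",
]

# Approximation of the wildcard keyword sting*
wildcard = re.compile(r"\bsting\w*", re.IGNORECASE)

# Tally every distinct word the wildcard expanded to.
expansions = Counter(
    m.group(0).lower() for t in texts for m in wildcard.finditer(t)
)
print(sorted(expansions))

# Option 1: replace the wildcard with an explicit list of variants.
explicit = re.compile(r"\b(?:sting|stings|stinger|stinging)\b", re.IGNORECASE)

# Option 2: keep the wildcard but exclude the unwanted word.
unwanted = re.compile(r"\bstingy\b", re.IGNORECASE)

wildcard_hits = [i for i, t in enumerate(texts) if wildcard.search(t)]
explicit_hits = [i for i, t in enumerate(texts) if explicit.search(t)]
excluded_hits = [i for i in wildcard_hits if not unwanted.search(texts[i])]

print(wildcard_hits, explicit_hits, excluded_hits)
```

Reviewing the expansion list first makes it easier to decide whether an explicit variant list or a single exception is the simpler fix.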
- Use appropriate proximity operators
If a keyword includes a proximity operator, investigate whether narrowing the proximity window might increase precision. For example, the keyword “customer* w/50 marketing” might be too broad and could be replaced with the keyword “customer* w/25 marketing,” especially if you observe that the two keyword elements (“customer” and “marketing”) tend to be closer together in relevant documents than they are in non-relevant documents.
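To check how close the two elements actually sit in a set of documents, one could measure the smallest gap between them in each document, as in the sketch below. The documents and the `min_distance` helper are hypothetical, and distance is counted in word tokens, which may differ from how a given platform counts w/N.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(text, prefix, word):
    """Smallest token gap between any token starting with `prefix`
    and any token equal to `word`; None if either is absent."""
    tokens = tokenize(text)
    starts = [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
    targets = [i for i, tok in enumerate(tokens) if tok == word]
    if not starts or not targets:
        return None
    return min(abs(a - b) for a in starts for b in targets)


# Thirty filler words to push the two terms apart in the second document.
filler = " ".join(["filler"] * 30)

# Hypothetical documents.
docs = {
    "close": "Customers responded well to the marketing campaign.",
    "far": f"Customer parking is limited. {filler} Marketing is on floor two.",
}

distances = {d: min_distance(t, "customer", "marketing") for d, t in docs.items()}

# Rough equivalents of customer* w/50 marketing and customer* w/25 marketing.
hits_50 = [d for d, dist in distances.items() if dist is not None and dist <= 50]
hits_25 = [d for d, dist in distances.items() if dist is not None and dist <= 25]

print(distances, hits_50, hits_25)
```

If the measured gaps in relevant documents cluster well below the current window, that supports tightening the operator.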
- Scrutinize metadata
Sometimes the search syntax lets you draw on metadata fields within a keyword. When reviewing keyword hits, observe whether relevant documents tend to fall within a certain date range or tend to have a certain file extension. The related metadata fields can then be incorporated to refine keywords. For example, if a keyword seems to work only on documents with the .DOC file extension, add a metadata element to the keyword in order to limit the hits for that keyword to documents with this file extension. If a keyword seems to work in most documents, but also hits a number of irrelevant .XLS spreadsheets, add an exception to the keyword to “NOT out” .XLS file extensions from the keyword hits.
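The metadata refinements described above can be sketched as simple filters layered on a keyword hit. The `Doc` record, its field names, and the sample values are hypothetical; actual field names and syntax depend on the review platform.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    extension: str   # hypothetical metadata field, e.g. ".doc"
    sent: date       # hypothetical metadata field
    text: str


# Hypothetical documents.
docs = [
    Doc("d1", ".doc", date(2010, 3, 1), "Marketing board meeting minutes"),
    Doc("d2", ".xls", date(2010, 3, 2), "Meeting minutes tracker"),
    Doc("d3", ".doc", date(2012, 7, 9), "Meeting minutes archive"),
    Doc("d4", ".pdf", date(2010, 4, 15), "Meeting minutes summary"),
]


def keyword_hit(doc, phrase):
    """Case-insensitive substring match on the document text."""
    return phrase.lower() in doc.text.lower()


hits = [d for d in docs if keyword_hit(d, "meeting minutes")]

# Limit the keyword to one file extension.
only_doc = [d.doc_id for d in hits if d.extension == ".doc"]

# "NOT out" spreadsheets from the keyword hits.
not_xls = [d.doc_id for d in hits if d.extension != ".xls"]

# Limit the keyword to a date range.
in_2010 = [d.doc_id for d in hits
           if date(2010, 1, 1) <= d.sent <= date(2010, 12, 31)]

print(only_doc, not_xls, in_2010)
```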
Tania Lihatsh has worked in the legal services industry for more than 7 years. As a Senior Consultant at H5, Ms. Lihatsh supports H5 engagements by providing expertise in information retrieval and subject matter analysis, modeling and research, routinely advising clients in matters related to patent infringement, environmental remediation, and product liability.