
Beyond Words: Linguists Can Change the Game in eDiscovery

This article appeared in Legaltech News on 5/3/2016.

You’re negotiating keywords with the opposing side in a contentious matter. The data volume is huge. They’ve looked at your keyword list and are convinced that your proposed search terms will leave gems on the cutting room floor. The terms you chose all make sense, but how can you show, indeed know, that your list of terms won’t leave relevant material behind? What kind of expert can help you defend your list?

If your list was the result of case-team brainstorming, it may be difficult to back up your claims. But if you were savvy enough to use the services of a linguist—especially a linguist with search expertise—you’ll be surprised to find what solid ground you’re on when it comes to defending your choices. In developing queries for a situation like this, a linguist with search expertise will apply a process with a scientifically sound foundation, enabling you, in the end, to reasonably justify the inclusion of certain terms and the omission of others.

You might think that the subject matter knowledge attorneys generally have in their area of specialization would be enough to cover the bases. Often, though, that knowledge actually hinders their ability to come up with an effective set of search terms. Subject matter experts tend to be biased by their own entrenched ideas about how they would discuss their pet topic, which may differ from the way others communicate about it. The truth is that finding what is sought in the often complex data sets collected for discovery requires a robust knowledge of both language and subject matter, as well as expertise in search term development. How many attorneys studied linguistic variability or information retrieval along with contracts and torts?

First, let’s ensure that no one is confusing “linguist” with “translator,” as often happens. Linguistics is an academic field of study that examines the underlying behavior and structure of natural language. Linguists are trained to examine language the same way scientists are trained to examine data: objectively and without preconceived notions, rather than instinctively, using intuitions about how language is supposed to behave. While the instinctive approach may be good for starters, used alone it typically falls well short of the goal and introduces the risk of both over- and under-production, because the resulting queries are either too broad or too narrow.

A linguist will approach the situation analytically and account for common kinds of language variability that legal teams creating their own lists usually overlook. Exploring each of these variabilities, with the opposition if need be, is the best way to dispel the concern that something valuable will be left behind. In linguistic parlance, these variabilities are conceptual, syntactic, lexical, and morphological; for the layperson, they roughly translate into subject matter, word or phrase construction, vocabulary or word choice, and word forms.

Further, each of these linguistic variabilities has its own world of variation that the linguist knows how to tap, depending on the level of information being sought. Consider lexical variability alone, that is, actual word choice. It includes elements such as synonyms (different words with the same or similar meanings, such as happy and joyful); meronyms (words for parts of a whole, as wheel is to car); and hypernyms (category words that include more specific words, as flower includes daisy). (More detailed descriptions of these “nyms” are not warranted here, but lexical variability alone suggests the layers of complexity involved in effective search term development.)
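To make the idea of lexical variability concrete, here is a minimal sketch in Python of how a single seed term might be expanded into a set of related terms and turned into an OR query. Everything in it is an assumption made for illustration: the tiny hand-built lexicon, the function names, and the query syntax are hypothetical, and a real matter would draw its variants from sampled data and linguistic analysis rather than from a fixed table.

```python
import re

# Hypothetical, hand-built lexicon for illustration only.
# Each seed term maps to related words grouped by lexical relation.
LEXICAL_VARIANTS = {
    "car": {
        "synonyms": ["automobile"],
        "meronyms": ["wheel"],    # parts of the whole
        "hyponyms": ["sedan"],    # more specific words under the seed term
    },
    "flower": {
        "hyponyms": ["daisy"],
    },
}


def expand_term(term, lexicon=LEXICAL_VARIANTS):
    """Return the seed term followed by its listed variants, without duplicates."""
    variants = [term]
    for related_words in lexicon.get(term, {}).values():
        for word in related_words:
            if word not in variants:
                variants.append(word)
    return variants


def to_boolean_query(terms):
    """Join terms into a simple OR group (generic syntax, not tied to any engine)."""
    return "(" + " OR ".join(terms) + ")"


def matches(text, terms):
    """True if any term appears in the text as a whole word, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    terms = expand_term("car")
    print(to_boolean_query(terms))
    print(matches("The sedan was repaired last week.", ["car"]))
    print(matches("The sedan was repaired last week.", terms))
```

In this toy example, a document that mentions only a sedan is missed by the bare term but found once the variants are included. Whether each variant belongs in a negotiated list is exactly the judgment call the linguist helps make, since every added term can also widen the net too far.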

Linguists who possess both search and eDiscovery expertise are truly game changers here. For one thing, such “litigation linguists,” as we might call them, are very skilled at working with counsel to understand key matter concepts and determine what level of analysis may be required to find targeted data. They understand how to sample the data population to seed their initial thinking. They realize that keyword selection must take into account both the variety of data types and sources tapped for the matter and the idiosyncrasies of the content creators, who use their own phrasing, colloquialisms, business shorthand, and trade jargon to express opinions or emotions, present ideas, or ask questions. Linguists know how to address unexpected or covert language use; they can discern through linguistic data analysis, for example, that the phrase “keep them at first base” means “hinder.” They can also consider intent: is this a matter where there are likely attempts to conceal information? If so, linguistic sentiment analyses may uncover information that no subject matter keywords would ever hit.

Just as important in all of this is familiarity with search technology. In addition to expertise in search term construction (e.g., Boolean queries, proximity operators, wildcards), litigation linguists are usually well acquainted with the anomalies of various search engine technologies that legal teams rarely consider. They understand how data populations are indexed (that is, which elements within a document population are made available to a search engine) and can likely determine whether an unexpected result stems from the index or from the search query. Index parameters are often set by the IT department to speed up the indexing process, with little understanding of the impact those settings could have on search results. Are numbers being indexed, for instance? If not, that may explain why those documents about the disputed patent weren’t retrieved in the search. Are “stop words” (common words like the, is, at, which) ignored? If so, this might create problems if you need them in your search (just ask Which Way Signage, Inc.).
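The effect of index settings described above can be illustrated with a toy inverted index in Python. This is a sketch under stated assumptions, not a model of any particular search product: the function names, the `index_numbers` and `drop_stop_words` settings, the sample documents, and the patent number are all hypothetical. The toy also looks up query terms literally; real engines often strip stop words from the query as well, which changes the symptom (over-matching rather than no hits) but not the underlying problem.

```python
import re

# The stop words named in the article; real stop lists are much longer.
STOP_WORDS = {"the", "is", "at", "which"}


def tokenize(text, index_numbers=True, drop_stop_words=False):
    """Lowercase the text and split it into alphanumeric tokens, applying index settings."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not index_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_index(docs, **settings):
    """Build an inverted index mapping each token to the ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text, **settings):
            index.setdefault(token, set()).add(doc_id)
    return index


def search_all(index, query):
    """AND search: ids of documents whose indexed tokens include every query term."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = None
    for term in terms:
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return results


if __name__ == "__main__":
    # Hypothetical documents for illustration only.
    docs = {
        "doc1": "Memo regarding patent 1234567 licensing",
        "doc2": "Invoice from Which Way Signage",
        "doc3": "Which vendor is handling signage?",
    }
    full_index = build_index(docs)
    lean_index = build_index(docs, index_numbers=False, drop_stop_words=True)

    for query in ("1234567", "which way"):
        print(query, search_all(full_index, query), search_all(lean_index, query))
```

With the full index, both the patent number and the company name find their documents. With the "lean" index, the same queries return nothing, even though the documents are in the collection and the queries are well formed. That is the kind of result a litigation linguist would trace back to the index rather than to the search terms.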

As electronically stored content becomes more voluminous and complex, it’s unlikely that attorneys working for law firms or companies will be able to address search imperatives adequately with home-grown keyword lists. What’s needed for the most cost-effective and defensible solution is a combination of linguistic, information retrieval, and eDiscovery expertise. Whether embodied in one person or a small group, such expertise is out there to be found. You don’t even have to search that hard to find it.

