
ESI Problems? Use Your Numbers! (Better yet, let the experts use theirs.)

The power of sampling and statistics to address (un)common ESI challenges.

Those of us who whined that there was no reason to learn high-level math or statistics because we’d never use it in real life were both right and wrong. Most of us don’t need to know how to use complex formulas or algorithms, it’s true, but that’s only because we rely on the skill and expertise of those who do.

Even if we’re short on math or statistical expertise, though, we should at least be able to recognize a problem that such disciplines can solve. It’s not always obvious.

For example, go back in time and consider the idea of election predictions. As a layperson contemplating how one might accurately predict the results of a presidential election, what would your thinking have been?

For many years, the approach was simply to ask a bunch of people (the more, the better, obviously) and hope for the best. But asking more people didn’t guarantee accuracy: without a representative sample, even enormous straw polls could miss badly enough to be useless. Once those with the proper scientific expertise looked at the problem, they understood it to be a statistical sampling exercise, and, voilà, polls as we know them today proliferated.

The experts knew that just a small sample, properly drawn, could more accurately represent the whole population than a huge sample with no statistical design ever could, and that, in fact, they could even calculate how far off the mark the result might be. Nowadays, “margin of error” is part of our vocabulary, and we give due credence to polls that are scientifically conducted.
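If you’re curious how that calculation works, here’s a minimal sketch (in Python, with purely illustrative numbers) of the standard normal-approximation margin of error for a sampled proportion; a properly drawn sample of about 1,000 carries roughly a three-point margin no matter how large the underlying population is.

```python
import math

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sampled proportion
    (z = 1.96 for ~95% confidence; proportion = 0.5 is the worst case)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# A poll of 1,000 respondents carries roughly a +/- 3 point margin of error,
# whether the population is one million people or three hundred million.
print(f"{margin_of_error(1000) * 100:.1f}%")  # prints 3.1%
```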

The point is that there are problems we confront every day whose solutions may defy us, simply because we don’t recognize that a certain type of expertise can be applied to solve them. When it comes to problems involving huge volumes of electronically stored information (ESI), the science of statistics should be an expertise we keep in mind.

Discussion of statistics has recently come to the fore in the eDiscovery realm because of its role in sampling procedures related to so-called “predictive coding.” But that’s just the tip of the iceberg. There are myriad ESI problems that can be addressed using statistical sampling, and we need to be able to recognize situations where it might be applied. Then, if we’re not sophisticated enough on our own to develop a solution (which we probably aren’t), we should at least know we can call in an expert who can.

Let’s consider a few examples.

Problem: Duplicative email archives – maybe.
One of your clients, a large corporation, is involved in litigation. The company has two email archives, each containing several million emails: one resides on its Exchange servers and one has been created by its information archiving technology. You’ve already reviewed and produced from the Exchange population. The client knows that the other archive contains responsive information but believes it is largely duplicative of the Exchange population and thus doesn’t need to be reviewed. If the entire archive were extracted and restored, it could be de-duplicated against the Exchange population, but that would cost a fortune. The client knows it needs some kind of empirical support for the redundancy of the archive so that it can defend its decision not to review it, but the only way to prove that is to restore the whole thing. Or is it?

Did you immediately think: hey, call in the statistician? If you didn’t, you should have.

The statistician’s solution.

The statistician designed and executed a protocol to estimate:

(a) the prevalence of responsive documents in the un-reviewed archive, and

(b) the extent to which those responsive documents are duplicative of documents in the Exchange population.

Instead of restoring the entire archive, small, statistically valid samples were drawn, assessed, and compared to the Exchange population. The protocol combined sampling with both exact and near de-duplication technologies, supplemented by manual search and review. The results showed, to a high degree of confidence, that responsive documents in the archive were entirely, or almost entirely, duplicated in the Exchange population, and so would already have been reviewed and produced (or logged as privileged).
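As a rough sketch of the underlying arithmetic (not the actual protocol), the estimate might look something like the Python below; is_responsive and duplicated_in_exchange are hypothetical stand-ins for the manual review and the exact/near de-duplication steps, and only the sampled documents would ever need to be restored.

```python
import math
import random

def estimate_rate(doc_ids, sample_size, predicate, z=1.96, seed=42):
    """Estimate how often `predicate` holds across a document population
    from a simple random sample, with a ~95% normal-approximation
    confidence interval. `doc_ids` is just a list of identifiers; only
    the sampled documents need to be restored and examined."""
    rng = random.Random(seed)
    sample = rng.sample(doc_ids, sample_size)
    hits = sum(1 for doc in sample if predicate(doc))
    p = hits / sample_size
    moe = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - moe), min(1.0, p + moe)

# Hypothetical stand-ins for the review and de-duplication steps:
#   is_responsive(doc)          -> responsiveness call on a restored document
#   duplicated_in_exchange(doc) -> exact or near-duplicate match against the
#                                  already-reviewed Exchange population
#
# prevalence = estimate_rate(archive_doc_ids, 1500, is_responsive)
# dup_rate   = estimate_rate(responsive_ids, 1000, duplicated_in_exchange)
```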

A lot less costly and time-consuming than restoring or reviewing the entire archive.

Or, consider this:

Problem: Paper v. electronic copies – the same?
Your client’s normal business operations occasion the creation of very large numbers of paper files, which the client then scans and stores in both a physical archive (the paper originals) and an electronic archive (the scanned copies). The files (and the imaging policy that applies to them) vary by business unit, product line, document function, and time period. The client is faced with litigation in which the content of the files is material to the issues being litigated. The client would like to restrict the scope of its responsive search to just the electronic copies, thereby avoiding the time and expense of reviewing (or re-scanning) millions of paper files. The client needs sound empirical support for its decision to restrict the scope of its search.

Call in the statistician? You bet.

The statistician’s solution.

The statistician designed a sampling protocol to assess the quality and completeness of the imaging of the files. The protocol used a stratified design, allowing a distinct assessment for each file type (as defined by business unit, product line, etc.) and permitting any remediation, if necessary, to be focused on just the subset(s) of files found to have issues. A manual comparison of the sampled paper and electronic files was then executed, enabling a sound and defensible decision regarding the scope of the client’s search.
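Again as a rough sketch rather than the actual protocol, a stratified design of this kind can be expressed in a few lines of Python; the stratum labels and the matches_scan helper below are hypothetical, standing in for the manual paper-to-image comparison.

```python
import math
import random

def stratified_sample(strata: dict, per_stratum: int, seed: int = 7) -> dict:
    """Draw an independent random sample from each stratum (e.g., business
    unit / product line / time period) so that match rates can be reported,
    and any remediation targeted, stratum by stratum."""
    rng = random.Random(seed)
    return {label: rng.sample(files, min(per_stratum, len(files)))
            for label, files in strata.items()}

def match_rate(sampled_files, matches_scan, z=1.96):
    """Share of sampled paper files whose scanned copy is complete and
    accurate, with a ~95% normal-approximation confidence interval."""
    n = len(sampled_files)
    p = sum(1 for f in sampled_files if matches_scan(f)) / n
    moe = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - moe), min(1.0, p + moe)

# Hypothetical usage: `strata` maps labels such as "Unit A / Product X / 2005-2010"
# to lists of file identifiers, and matches_scan(f) records the outcome of the
# manual comparison of paper file f with its electronic image.
# samples = stratified_sample(strata, per_stratum=200)
# rates = {label: match_rate(files, matches_scan) for label, files in samples.items()}
```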

Solutions like these are all in a day’s work for a statistics and sampling expert. Next time you’re contemplating some problem involving document populations (what they contain, how they compare to something else, how duplicative the information might be; think backup tapes and archives), ask yourself whether statistics might be applied to solve it. Think of sampling as a way to glean information about a large volume of documents by looking at just a few, and consider that there are additional tools and technologies that can be added to the mix. Then bring in the expert(s). Some of the challenges you face may be easier to solve than you think.

Just be glad that even if you weren’t paying attention in class, somebody else was.

Learn more about how sampling and statistics can help you by downloading our H5 Practice Brief, “Measuring the Accuracy of Technology-assisted Review.”



