Measuring Technology-Assisted Review: 99% Accuracy = 100% Misleading
Not long ago, we came across marketing materials from an e-discovery vendor (we won’t name the vendor) claiming that its technology-assisted review had been demonstrated to achieve 99.9% accuracy. What we found interesting about the claim was not the level of performance demonstrated (we know that the “accuracy” measure says nothing meaningful about real performance), but rather the persistence in our industry of a metric, accuracy, that anyone with even a moderate acquaintance with the measurement of search systems would know to be, at best, uninformative and, at worst, misleading.
Indeed, in the comments section of a blog discussion several months ago, it was shown that the test results allowing the blogger to claim 99.9% accuracy, when restated in a more meaningful way, permitted him to state with a high level of confidence only that he had succeeded in finding at least 1% of what he had set out to find, not (as some, taking “accuracy” in its non-specialized sense, might assume) that 99.9% of the relevant material had been found. The accuracy measure, in spite of the dazzlingly high number associated with it, was shown to be entirely meaningless and misleading. Yet, as evidenced by the marketing materials we mentioned above, the use of the measure persists.
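To make the arithmetic concrete, here is a small sketch with invented numbers (a population of one million documents, 1,000 of them responsive; these figures are purely illustrative and are not drawn from the vendor’s study or the blog discussion). When responsive documents are rare, a review that finds almost none of them can still score 99.9% on “accuracy,” because accuracy is dominated by the enormous count of correctly ignored non-responsive documents:

```python
# Hypothetical confusion matrix for a review of 1,000,000 documents,
# 1,000 of which are actually responsive (0.1% prevalence).
tp = 10       # responsive documents the review found
fp = 0        # non-responsive documents wrongly produced
fn = 990      # responsive documents the review missed
tn = 999_000  # non-responsive documents correctly left out

total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # fraction of all decisions that were "correct"
recall = tp / (tp + fn)        # fraction of responsive documents actually found
precision = tp / (tp + fp)     # fraction of produced documents that are responsive

print(f"accuracy:  {accuracy:.3%}")   # ~99.9%
print(f"recall:    {recall:.0%}")     # only 1%
print(f"precision: {precision:.0%}")
```

The “99.9% accurate” review here missed 99% of the responsive material, which is exactly the gap between accuracy’s lay meaning and its effect in a low-prevalence collection.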
“Accuracy” is not a term of art in this field. The persistence in the use of a vacuous and potentially misleading measure of quality highlights the need for a better-informed marketplace for e-discovery products and services, both on the consumer side and on the producer side. Consumers should approach vendors of e-discovery products and services with an informed skepticism. That means neither swooning at any claim that includes a score of some kind in the high 90s nor regarding all vendors as purveyors of snake oil. It means being aware that there are certain substantive questions about quality that a vendor should be able to answer and requiring that any vendor answer those questions.
Producers, on the other hand, should recognize the long-term advantages of a better-educated marketplace. While the use of vacuous metrics may win some sales in the short term, real quality will become apparent in the post-sale assessment of actual effectiveness. Discordance between pre-sale claims and post-sale evaluations will harm the reputation both of the vendor and of the industry as a whole and will serve only to slow the adoption of truly effective technology-assisted solutions. Long-term interests will be better served by using meaningful measures and setting realistic expectations as to the levels of performance that will be achieved. While the market is certainly still maturing, it is already mature enough to know that absolute perfection, or even “99.9%” of perfection, is not a reasonable standard for any review system (including traditional manual review).
Discussions of the quality of e-discovery products and services could be put on a sounder footing if consumers were prepared to ask a few meaningful questions and producers were prepared to answer those questions directly and honestly. Examples of such questions are the following.
1) For purposes of this discussion, let’s take a closer look at the quality assurance measures typically taken in a review for responsiveness completed using your methodology. Select a typical project and describe some of its salient characteristics (nature of the matter, size and nature of the document population, timelines, etc.).
2) Continuing with this typical project, did you make a statistically valid measurement of the recall and precision achieved by the production? If so, what were your measurements and the confidence levels associated with them?
3) By what method did you obtain your measurements of recall and precision? Is that method generally known and accepted for this purpose?
4) Can you provide documentation of both the design and execution of the sampling protocol used on the review in order to obtain your measurements of recall and precision?
5) What were the academic and professional qualifications of the individual(s) who designed the sampling and measurement protocol and who oversaw its execution?
6) Apart from the specific project we have been discussing, has your document review methodology been evaluated in any independent scientific studies? If so:
a) Briefly describe each of these studies. (What was the venue? Who designed and oversaw the study? What other methodologies were evaluated in the study? Where have the results of the study been published?)
b) What were the estimated precision and recall achieved in each evaluation?
c) Did you maintain your own internal estimates of recall and precision?
d) How did your internal estimates compare to those actually obtained in the study?
Regarding these questions, it should be noted that there is no single “correct” answer to any of them. There are certainly instances, for example, when the prevalence of responsive material in the population is so low as to make precise estimation of recall impractical; in such instances, producers need only be prepared to explain what makes the estimation of recall difficult and to describe the alternative quality-assurance methods they employ in that circumstance. It should also be noted that attorneys will not always be well equipped to evaluate a vendor’s responses to these questions. That does not mean, however, that the questions should not be asked. It means only that attorneys will have to recognize that there are occasions when they will need assistance from those with the appropriate expertise in information retrieval and in sampling and statistics.
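A sketch of why low prevalence makes recall hard to estimate precisely: the recall estimate rests only on the responsive documents that turn up in the sample, and when prevalence is very low that count is tiny. The numbers below (sample size, prevalence, counts) are invented for illustration, and the interval is a standard Wilson score interval rather than any particular vendor’s protocol:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score confidence interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Suppose (hypothetically) a random sample of 10,000 documents is drawn from a
# population with roughly 0.05% prevalence, so only about 5 responsive
# documents appear in the sample, of which the review had found 4.
found_in_sample, responsive_in_sample = 4, 5

lo, hi = wilson_interval(found_in_sample, responsive_in_sample)
point = found_in_sample / responsive_in_sample
print(f"recall point estimate: {point:.0%}, 95% CI roughly ({lo:.0%}, {hi:.0%})")
```

With only five responsive documents to go on, the interval spans most of the range from well under half to nearly all, so the 80% point estimate tells us very little; this is the situation in which a producer should be able to explain its alternative quality-assurance measures.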
H5 believes that if consumers and producers were to frame their discussions of quality around questions such as these, they would be able to bypass meaningless claims such as “99.9% accuracy” and proceed more quickly to a substantive and credible conversation about the real effectiveness of the offering in question. And that would be a good thing for everyone.