Continuous Active Learning (CAL) Basics: An Interview with Amanda Jones
Technology-assisted review comes in a variety of flavors, with protocols that include Simple Passive Learning (SPL), Simple Active Learning (SAL) and a newer approach, Continuous Active Learning (CAL). With the shared goal of separating responsive from non-responsive documents in eDiscovery, each protocol has benefits and drawbacks, and there is a lively debate in the eDiscovery community about which ultimately provides superior performance. The answer may land somewhere around “it depends.”
In this Q&A, we hear from industry expert Amanda Jones, who supervises the development of new processes and offerings for eDiscovery at H5 and designs and oversees the implementation of innovative linguistic, machine learning, and statistical techniques to support document classification and review.
Q: In your view, what are the key aspects of Continuous Active Learning (CAL) that distinguish it from other forms of Technology-Assisted Review (TAR)?
A: Let’s start with the “active” aspect of CAL. Active learning here basically means that the selection of training documents is done algorithmically with no direction from the user, beyond the coding they supply when they review documents. The goal of using an active learning approach is generally to maximize classifier training efficiency. What’s a bit different in CAL systems is that they tend to focus on selecting training documents more likely to be responsive. This skew toward likely responsive documents from the outset of the CAL training process stands in contrast to many traditional active learning paradigms, which tend to select documents deemed most ambiguous (i.e., items falling closest to the dividing line between responsive and non-responsive documents).
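The contrast between the two selection strategies can be illustrated with a minimal Python sketch. It assumes a classifier has already assigned each unreviewed document an estimated probability of responsiveness. The function names, the document IDs, the scores, and the 0.5 boundary are all hypothetical and are not taken from any particular TAR product.

```python
def select_top_ranked(scores, k):
    """CAL-style selection: the k documents scored most likely to be responsive."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:k]


def select_most_ambiguous(scores, k, boundary=0.5):
    """Traditional uncertainty sampling: the k documents scored closest to the boundary."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - boundary))[:k]


# Hypothetical classifier scores (estimated probability of responsiveness).
scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10, "doc_d": 0.47, "doc_e": 0.81}

cal_batch = select_top_ranked(scores, 2)              # highest-scoring documents
uncertainty_batch = select_most_ambiguous(scores, 2)  # gray-area documents
```

With these made-up scores, the first batch contains the two documents the model is most confident about, while the second contains the two borderline documents.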
Q: What difference does this make?
A: In a traditional active learning approach, the thinking behind selecting ambiguous or gray area documents for training is that resolution of these types of documents early on will provide the fastest route for clarifying the dividing line between what is responsive and non-responsive. Yet, in practice, for eDiscovery review, this approach would require reviewers to resolve many gray and borderline documents in the very beginning stages of the review, when they are least knowledgeable about the subject matter and are likely still resolving many interpretation questions.
Q: So if the “active” aspect of CAL is about how training documents are selected algorithmically, what does the “continuous” aspect of CAL speak to?
A: The “continuous” aspect of CAL speaks to the ongoing process of ranking and re-ranking documents for manual review based on the constant stream of incoming coding throughout the review’s lifecycle. In other words, CAL takes into account not only an initial set of training assessments to rank and prioritize documents, but continuously updates those rankings based on the most recent assessments.
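The rank, review, retrain cycle described here can be sketched roughly as follows. This is an illustration under stated assumptions, not how any specific CAL system is implemented: the word-count scorer is a deliberately crude stand-in for a real classifier, the reviewer is simulated by a lookup of known answers (`truth`), and the decision about when to stop is left out entirely. All names and the tiny corpus are hypothetical.

```python
from collections import Counter


def score(text, pos_terms, neg_terms):
    """Toy score: average per-word evidence (responsive count minus non-responsive count)."""
    words = text.split()
    return sum(pos_terms[w] - neg_terms[w] for w in words) / max(len(words), 1)


def cal_review_order(docs, truth, seed_ids, batch_size=2):
    """Return the order in which documents reach review under a continuous loop."""
    coded = {doc_id: truth[doc_id] for doc_id in seed_ids}  # coding received so far
    order = list(seed_ids)
    while len(coded) < len(docs):
        # Retrain: rebuild the term counts from ALL documents coded so far.
        pos_terms, neg_terms = Counter(), Counter()
        for doc_id, responsive in coded.items():
            target = pos_terms if responsive else neg_terms
            target.update(docs[doc_id].split())
        # Re-rank the unreviewed documents and route the top of the ranking to review.
        unreviewed = [d for d in docs if d not in coded]
        unreviewed.sort(key=lambda d: score(docs[d], pos_terms, neg_terms), reverse=True)
        for doc_id in unreviewed[:batch_size]:
            coded[doc_id] = truth[doc_id]  # simulated reviewer decision
            order.append(doc_id)
    return order


# Hypothetical six-document corpus; documents 1 and 2 are the initial coded examples.
docs = {
    1: "merger pricing agreement",
    2: "lunch menu friday",
    3: "pricing agreement draft",
    4: "merger timeline pricing",
    5: "parking lot notice",
    6: "friday lunch order",
}
truth = {1: True, 2: False, 3: True, 4: True, 5: False, 6: False}
order = cal_review_order(docs, truth, seed_ids=[1, 2])
```

In this toy run the two remaining responsive documents are queued for review before the non-responsive ones, which is the prioritization effect the answer describes.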
Q: In this workflow, where training documents are algorithmically selected on an ongoing basis, and then presented to the attorney for manual review, what exactly constitutes the training set or “seed set” for model development?
A: Because training is an ongoing process in CAL workflows, defining a “seed set” for model development is somewhat arbitrary. It is important to remember that CAL workflows generally assume that all responsive documents will be manually reviewed and that all coded documents will be incorporated into the continuously growing training set. Unlike other TAR approaches, the end goal is not to automatically classify documents either as responsive or non-responsive. Rather, CAL is optimized to route likely responsive documents to the manual review queue while curtailing inclusion of non-responsive documents. In this context, ongoing document prioritization is the driver rather than automated one-time classification, making the notion of “seed set” largely irrelevant.
Q: Does CAL require any initial set of assessments to get the process going?
A: Yes, to start the process, an initial set of both responsive and non-responsive documents is required so that CAL can begin inferring characteristics of relevance. The size and composition of this initial set of documents have not been standardized, though, and an agreed-upon set of best practices has not been articulated to guide these choices.
Q: In other TAR workflows, reviewing a “control set” of documents separate from the training process is key for evaluating model accuracy and performance overall. How does CAL incorporate a “control set” in its workflow?
A: Typically, CAL workflows do not incorporate the use of a “control set.” This means that CAL does not require the up-front overhead of reviewing a separate control set that is distinct from the training process. Dispensing with the control set, however, does mean that there is not a readily available estimate of the rate of responsiveness in the review population. And this can limit visibility into ultimate review volume and timeline. Some see this as a drawback.
Q: Control sets are also important for calculating recall in certain systems, a key quantitative indicator used to measure overall review comprehensiveness. Since there is no control set, how is recall calculated in a CAL workflow?
A: Quantitative assessment of review comprehensiveness can be conducted after the fact. For example, an elusion test can be, and often is, implemented; that is, you can measure the rate of responsiveness in the discard pile. And, in fact, it is always possible, if desired, to calculate a rough point estimate for recall or even a formal statistically valid recall metric with narrow margins of error at the conclusion of a CAL review. The latter is simply oftentimes deemed unnecessary or disproportionately burdensome.
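The arithmetic behind a rough point estimate of recall from an elusion sample can be sketched as below. The sketch assumes a simple random sample drawn from the discard pile. The function name and all of the figures are hypothetical, and it yields only a point estimate, not the formal metric with margins of error mentioned above.

```python
def elusion_recall_estimate(found_responsive, discard_size, sample_size, sample_responsive):
    """Rough point estimate of recall from a random sample of the discard pile."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    elusion_rate = sample_responsive / sample_size   # responsiveness rate in the discard pile
    est_missed = elusion_rate * discard_size         # projected responsive documents left behind
    total = found_responsive + est_missed
    recall = found_responsive / total if total else 0.0
    return elusion_rate, est_missed, recall


# Hypothetical figures: 9,000 responsive documents found in review, 100,000 documents
# never reviewed, and 15 responsive documents in a 1,500-document elusion sample.
elusion_rate, est_missed, recall = elusion_recall_estimate(9_000, 100_000, 1_500, 15)
```

With these made-up numbers, a 1% elusion rate projects to about 1,000 missed documents, for an estimated recall of about 90%.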
Q: Prior to conducting an elusion test, or statistically calculating recall, how do you know you are nearing the end of a CAL review?
A: How far along you are in a CAL review lifecycle is gauged somewhat impressionistically. Reviewers know they are reaching the conclusion of the review when they hit a point of diminishing returns. That is, the number of responsive documents in the review queue begins to taper off. This has been likened to listening for the moment when popcorn should be removed from the microwave.
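One way to make the "popcorn" impression a little more concrete is to track the share of responsive documents in each review batch and flag when recent batches stay low. The sketch below is only an illustration: the three-batch window and the 5% threshold are arbitrary assumptions, not an accepted stopping standard, and such a flag would normally prompt validation rather than replace it.

```python
def batch_richness(batches):
    """Share of responsive documents in each non-empty batch of coding decisions."""
    return [sum(batch) / len(batch) for batch in batches if batch]


def tapering_off(batches, window=3, threshold=0.05):
    """True when each of the last `window` batches is at or below `threshold` richness."""
    rates = batch_richness(batches)
    if len(rates) < window:
        return False
    return all(rate <= threshold for rate in rates[-window:])


def make_batch(responsive, size=20):
    """Build a hypothetical batch: True marks a document coded responsive."""
    return [True] * responsive + [False] * (size - responsive)


# Hypothetical review history: responsive documents per 20-document batch.
history = [make_batch(n) for n in (15, 12, 6, 1, 0, 1)]
nearing_end = tapering_off(history)
```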
Q: Given that CAL is heavily designed around providing a queue of documents for manual review, does the composition of the review team change?
A: Typically, CAL workflows require relatively little SME/case team attorney involvement in the day-to-day first pass review process. Instead, since it is assumed that all responsive documents will eventually be reviewed manually, it’s reasonable to limit case team attorney participation to quality control, allowing contract attorney teams to take on the majority of the work for first pass responsive review.
Q: Since all the documents CAL finds responsive have to be reviewed, how well does CAL scale in terms of document volumes or rates of responsiveness? Are there certain data volumes or rates of responsiveness that are considered too high for CAL?
A: The key thing to remember for planning purposes is that CAL workflows require review of all responsive documents in addition to a portion of non-responsive documents in the population. Estimating the rate of responsiveness and the ultimate size of the document population early on gives the review team key information for gauging the ultimate scale of the manual review to be conducted in a CAL workflow.
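An early estimate of the rate of responsiveness is commonly drawn from a simple random sample of the population. The sketch below assumes such a sample and uses the normal approximation for the margin of error, which is a simplification that becomes unreliable for very small samples or very low rates. The function names and figures are hypothetical.

```python
import math


def estimate_prevalence(sample_size, sample_responsive, z=1.96):
    """Point estimate and approximate 95% interval for the rate of responsiveness."""
    p = sample_responsive / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)  # normal approximation
    return p, max(0.0, p - margin), min(1.0, p + margin)


def project_responsive_count(population_size, sample_size, sample_responsive):
    """Scale the sampled rate (and its interval) up to the full population."""
    p, low, high = estimate_prevalence(sample_size, sample_responsive)
    return population_size * p, population_size * low, population_size * high


# Hypothetical figures: 40 responsive documents in a 400-document random sample
# drawn from a population of 500,000 documents.
point, low, high = project_responsive_count(500_000, 400, 40)
```

With these made-up inputs the projection is about 50,000 responsive documents, with an approximate range of 35,000 to 65,000, which shows how wide the planning uncertainty can be with a small sample.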
Q: On average, how many non-responsive documents do review teams need to review for every responsive document found in the course of a CAL review?
A: There are no standardized metrics to answer this question, and of course it will vary from project to project. For lower yielding concepts, where responsive content may be harder to identify, the number of non-responsive documents that must be reviewed for every responsive document will likely go up. It is therefore useful to have some estimate of the prevalence of the different types of responsive material that may exist in the review population to inform planning and budgeting.
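For budgeting, the ratio can be measured on the coding received so far and combined with an estimate of how many responsive documents exist. The sketch below is a back-of-the-envelope illustration with hypothetical names and figures. It assumes the observed ratio holds for the rest of the review, which is optimistic if responsive documents become scarcer as the review proceeds.

```python
def observed_ratio(coding):
    """Non-responsive documents reviewed per responsive document found so far."""
    responsive = sum(coding)
    if responsive == 0:
        raise ValueError("no responsive documents coded yet")
    return (len(coding) - responsive) / responsive


def projected_review_volume(est_responsive, nonresponsive_per_responsive):
    """Total documents to review: all responsive ones plus the non-responsive overhead."""
    return est_responsive * (1 + nonresponsive_per_responsive)


# Hypothetical figures: 300 responsive and 150 non-responsive documents coded so far,
# and an estimated 50,000 responsive documents in the population.
coding = [True] * 300 + [False] * 150
ratio = observed_ratio(coding)
total_to_review = projected_review_volume(50_000, ratio)
```

With these made-up numbers the ratio is 0.5, which projects to roughly 75,000 documents of manual review.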
Amanda Jones is an Associate Director in the H5 Professional Services Group. She supervises the development of new processes and offerings for eDiscovery, and designs and oversees the implementation of innovative linguistic, machine learning, and statistical techniques to support document classification and review. Before joining H5, she oversaw technology-assisted review and search consulting services at Xerox Litigation Services and Recommind.