
The Basics of E-Discovery, Chapter 7B: Predictive Coding

The sections of our Beginner's Guide to E-Discovery are largely broken out by e-discovery phases. We're bucking the trend here by devoting an entire section to an emerging technology-driven process that can span multiple e-discovery phases. At the most basic level, predictive coding is a "machine learning" process that uses software to take information entered by people and apply that logic to much larger data sets. It exploded onto the e-discovery scene around 2012 as a means to revolutionize the process of analyzing documents for relevancy, potentially replacing the traditional manual, human review model. It has been at the center of several important case rulings and has been the subject of several key e-discovery studies.

The Many Names of Predictive Coding

Before we dive into the specifics of how exactly predictive coding works, we need to first address the name. Predictive coding is often used interchangeably with the terms predictive intelligence, technology-assisted review (TAR), or computer-assisted review (CAR). Different experts prefer different terms and point to subtle nuances to each definition. For the purposes of this guide they all essentially mean the same thing.

Predictive Coding Definition

Predictive coding is a machine learning process in which software takes the logic people enter when identifying responsive documents and applies it to much larger datasets, reducing the number of irrelevant and non-responsive documents that need to be reviewed manually.

How Predictive Coding Works

While predictive coding is a relatively new concept in the legal world, machine learning algorithms exist all around us. They are how our email systems filter spam messages from our inboxes and how websites are able to pepper us with advertising tailored to our specific browsing habits.

In e-discovery, predictive coding is primarily used to quickly and accurately locate relevant documents during the review phase and greatly expedite the review process, resulting in significant cost savings. While all predictive algorithms will vary in their exact methodology, the process at a very simplistic level looks something like this:


Seed Sets

Reviewers pull a representative cross-section of documents, known as a "seed set," from the full population of documents that need to be reviewed.


Responsive or unresponsive

Reviewers code (label) each document in the seed set as responsive or unresponsive and input those results into the predictive coding software.


Predictive formula generated

The software analyzes the seed set and creates an internal algorithm (formula) for predicting the responsiveness of future documents.


Sample and refine

Users sample the results of the algorithm on additional documents and refine the algorithm by continually coding and inputting sample documents until they achieve desired results.


Complete review

The software applies that algorithm to the entire review set and codes all remaining documents as responsive or unresponsive.
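The five steps above can be sketched in a few lines of Python. This is only an illustrative toy, assuming a simple naive Bayes word-count model; commercial predictive coding tools use their own (often proprietary) algorithms, and the function names `train` and `score` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return text.lower().split()


def train(seed_set):
    """Build per-label word counts from (text, label) pairs coded by reviewers."""
    word_counts = {"responsive": Counter(), "unresponsive": Counter()}
    doc_counts = Counter()
    for text, label in seed_set:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocabulary = set(word_counts["responsive"]) | set(word_counts["unresponsive"])
    return {"words": word_counts, "docs": doc_counts, "vocab": vocabulary}


def score(model, text):
    """Return the estimated probability (0 to 1) that a document is responsive."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    log_probs = {}
    for label in ("responsive", "unresponsive"):
        counts = model["words"][label]
        total_words = sum(counts.values())
        # Prior: smoothed share of seed documents that carry this label.
        log_p = math.log((model["docs"][label] + 1) / (total_docs + 2))
        for token in tokenize(text):
            if token in model["vocab"]:
                # Add-one smoothing so an unseen word/label pair never zeroes the score.
                log_p += math.log((counts[token] + 1) / (total_words + vocab_size))
        log_probs[label] = log_p
    # Clamp the difference so math.exp cannot overflow on long documents.
    diff = max(min(log_probs["unresponsive"] - log_probs["responsive"], 700.0), -700.0)
    return 1.0 / (1.0 + math.exp(diff))


# Steps 1-2: a (hypothetical) seed set coded by reviewers.
seed_set = [
    ("merger pricing agreement draft", "responsive"),
    ("pricing terms for the merger", "responsive"),
    ("lunch menu for friday", "unresponsive"),
    ("office parking reminder", "unresponsive"),
]

# Step 3: the software derives its predictive formula from the seed set.
model = train(seed_set)

# Step 5: the formula is applied to the documents nobody has reviewed yet.
remaining = ["revised merger pricing", "friday parking update"]
coded = {
    doc: ("responsive" if score(model, doc) >= 0.5 else "unresponsive")
    for doc in remaining
}
```

Step 4 (sample and refine) would amount to spot-checking `coded`, adding any corrected documents to `seed_set`, and calling `train` again until the results look acceptable.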

As you can probably already tell there is a fair amount of technological sophistication (and some dreaded math) underlying how predictive coding works. This is not the venue for a deep dive into predictive coding specifics, but two important concepts to know are:


Active learning

An iterative method whereby the training set is repeatedly augmented by additional documents chosen specifically by the algorithm, and manually coded by a human reviewer.


Passive learning

An iterative method that uses completely random document samples to train the machine learning algorithm until the desired result is achieved.
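The practical difference between the two approaches is how the next batch of training documents gets picked. The sketch below is a simplified illustration, not any vendor's method: it assumes the active learner uses uncertainty sampling (choosing the documents the model is least sure about), which is only one of several possible selection strategies. The function names and scores are hypothetical.

```python
import random


def select_passive(unlabeled_ids, batch_size, rng):
    """Passive learning: pick the next training batch purely at random."""
    return rng.sample(sorted(unlabeled_ids), batch_size)


def select_active(scores, batch_size):
    """Active learning via uncertainty sampling (an assumed strategy):
    pick the documents whose predicted relevance is closest to 0.5."""
    ranked = sorted(scores, key=lambda doc_id: (abs(scores[doc_id] - 0.5), doc_id))
    return ranked[:batch_size]


# Hypothetical relevance scores produced by the current model.
scores = {"doc1": 0.95, "doc2": 0.52, "doc3": 0.10, "doc4": 0.45, "doc5": 0.70}

active_batch = select_active(scores, batch_size=2)   # the two least certain documents
passive_batch = select_passive(scores.keys(), batch_size=2, rng=random.Random(0))
```

In both cases a human reviewer then codes the selected batch and the model is retrained; only the selection rule differs.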

Opinions vary as to which method is most effective, and there are several variations of each one. You can learn a lot more about active learning vs. passive learning by reading the recent study, "Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery." E-discovery attorney and technologist Ralph Losey has done extensive research on the subject as well and reports his findings on the blog, "e-Discovery Team."

Predictive Coding in the Courts

When predictive coding first emerged, there was a great deal of anticipation among legal practitioners about how courts would respond. Federal Magistrate Judge Andrew Peck's decision in Da Silva Moore v. Publicis Groupe (Southern District of New York, 2012) is considered the first official judicial endorsement of predictive coding as a way to review documents. Today, there is little discord among judges that predictive coding has a place in e-discovery.

Most disputes now center on how transparent parties have to be about how they use the technology, namely how they construct and code their seed sets. In another influential ruling, Rio Tinto Plc v. Vale S.A. (Southern District of New York, 2015), Judge Peck addressed the issue of transparency, stating that he generally encourages parties to be transparent about their seed set development. He also noted that there are other means of evaluating the efficacy of predictive coding without agreeing on seed set parameters, including manual sampling of coded documents.

Predictive Coding Adoption Trends

Many experts would contend that predictive coding adoption trends have not lived up to the initial hype. At least not yet. Predictive coding is frequently included as a topic in industry surveys so there is a fair amount of data available on usage.

According to Exterro's 2015 Federal Judges Survey on E-Discovery Best Practices and Trends, 45% of the responding judges said that they only "somewhat agreed" that predictive coding is employed with regularity in cases, while 41% didn't agree with the statement at all. That doesn't exactly suggest widespread adoption. In fact, in FTI's 2015 Advice from Counsel Survey, 42% of corporate respondents said they had never used predictive coding.

Expert Perspective

Technology-assisted Review Continues to be Greatly Underutilized

Despite its potential to dramatically reduce review costs, technology-assisted review ("TAR") continues to be greatly underutilized. Advancements in TAR, including the emergence of a second generation of predictive coding tools (referred to as "TAR 2.0") and the strategic use of TAR technologies, should remove many of the barriers to its use.


Gareth Evans
Co-Chair, E-Discovery Practice Group
Gibson Dunn

There are a few explanations for the industry's tepid response to predictive coding, including:

Human Review Myth

There is a pervasive myth that manual human review represents the most thorough and accurate way to review documents for relevancy. A landmark study in 1985 disproved this theory, revealing that attorneys supervising skilled paralegals believed they had found at least 75% of the relevant documents from a document collection, using search terms and iterative search, when they had in fact found only 20%. Other studies since then have confirmed the fallibility of pure manual human review. However, the widely held perception that human review represents the "gold standard" has certainly slowed adoption of predictive coding.

Technical Unfamiliarity

Introducing a disruptive technology in any environment is going to be met with a fair amount of skepticism. This holds especially true in the legal industry, which has historically been very resistant towards new technology (many lawyers would still take a legal pad over an iPad). Predictive coding is complex. It involves fairly advanced elements of data science and statistical sampling. Even though supporters of predictive coding will argue that many of those complexities are largely hidden from actual users of the technology, the promise of cheaper, more efficient e-discovery has yet to outweigh the fear of the unknown for many lawyers.

Upfront Expense

While predictive coding has been proven to drastically decrease the time and expense of document review, like any technology-driven process, it requires an upfront investment and deployment of a new tool. This often constitutes a capital expenditure, rather than an operational expense, which implicates the IT and procurement teams and requires clear financial justification. Many legal teams don't have the time or energy to build a case for why a predictive coding tool is necessary, so they stick with the status quo.

Limited Case Law

Although courts have endorsed the use of predictive coding, the pool of cases involving the technology is still quite shallow. Risk-averse lawyers are still uneasy with the limited number of rulings that have addressed predictive coding and may be waiting for a more solidified judicial consensus.

Predictive Coding Beyond E-Discovery Review

The focus of predictive coding has centered on document review, where most people believe its adoption can have the greatest impact on reducing e-discovery costs. But machine learning algorithms that classify and predict document relevancy can be applied to any document set. Some technology vendors, including Exterro, have built the capability into the early case assessment (ECA) process, allowing users to leverage predictive analytics prior to document collection. In fact, predictive coding shows tremendous promise outside of the litigation bubble altogether, as an information governance tool allowing organizations to proactively code, categorize and filter documents at the point of creation.

Practical Predictive Coding

You can learn much more about using predictive coding beyond litigation by watching Exterro's on-demand webcast, "Practical Predictive Coding."

Best Practices Checklist

Predictive coding is a powerful technology, but its effectiveness is heavily dependent on human intervention. Don't forget, predictive algorithms are designed to approximate human judgment (via the seed set). That means bad data going into the system will always produce bad data coming out. In other words, predictive coding cannot correct sloppy attorney work. Here are some best practices for getting the most out of predictive coding:

Predictive Coding Best Practices

Get familiar with the technology before using it

You don't need to be a statistician or even a technical expert to use predictive coding, but it's important to get familiar with the basics of how the system works. Does it employ active learning or passive learning? How many documents make up the initial seed set? Does the system employ a relevancy score or not? These are the types of questions that may come up with a judge or opposing party, so it's better to be proactive than to appear uninformed when the process is put under a microscope.

Use expert reviewers for training the model

As we mentioned earlier, predictive coding follows the adage 'garbage in, garbage out.' When training the predictive model, it's smart to employ your very strongest document reviewers, usually senior attorneys with close attention to detail and deep knowledge of the case. It's critical that the seed set is properly coded, or the entire review could be compromised.

Develop a relevancy threshold

A lot of predictive coding tools employ a relevancy score, which essentially rates the system's confidence in a document's relevance. You should come up with a systematic approach for determining which documents will be reviewed by human reviewers after the coding process. Namely, you want to establish a cutoff point: documents scoring below it are set aside, and documents scoring at or above it proceed to the next phase. There isn't a right answer for what the threshold point should be, but showing consistency will improve the defensibility of the process.
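As a rough illustration of applying a cutoff consistently, the sketch below splits a scored document set in two. The cutoff value of 0.30, the scores, and the function name `apply_threshold` are all hypothetical; the right threshold depends on the matter and the tool.

```python
def apply_threshold(scores, cutoff):
    """Split documents into those routed to human review and those set aside."""
    to_review = sorted(doc_id for doc_id, s in scores.items() if s >= cutoff)
    set_aside = sorted(doc_id for doc_id, s in scores.items() if s < cutoff)
    return to_review, set_aside


# Hypothetical relevancy scores assigned by the software (0 = not relevant, 1 = relevant).
scores = {"doc1": 0.92, "doc2": 0.31, "doc3": 0.08, "doc4": 0.30}

CUTOFF = 0.30  # assumed value, applied the same way to every document
to_review, set_aside = apply_threshold(scores, CUTOFF)
```

Recording the cutoff as a single named value, rather than deciding case by case, is one simple way to show the consistency that supports defensibility.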

Validate results

Rather than think of predictive coding as a replacement for human review, it's better to think of it as something that augments the process. You should employ a system for manually sampling documents from both your relevant and non-relevant sets after the coding is complete to look for inconsistencies. No review method is completely perfect. However, discovering a large number of wrongly coded documents may necessitate retraining the algorithm and running the review again. Keep in mind, the opposing party may ask to sample the non-relevant documents as well, so it's smart to validate internally before producing results.
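Manual sampling of this kind can be sketched as follows. This is a simplified example under stated assumptions: `human_review` stands in for a reviewer's judgment (here, a lookup into made-up data), and the sample size of 50 is arbitrary rather than a statistically derived figure.

```python
import random


def sample_error_rate(machine_coded, human_review, sample_size, rng):
    """Draw a random sample and return the share of sampled documents
    where the reviewer disagrees with the software's coding."""
    sample = rng.sample(sorted(machine_coded), sample_size)
    disagreements = sum(
        1 for doc_id in sample if human_review(doc_id) != machine_coded[doc_id]
    )
    return disagreements / sample_size


# Hypothetical non-relevant set: 200 documents the software coded as non-relevant.
machine_coded = {f"doc{i}": "non-relevant" for i in range(200)}

# Made-up reviewer decisions: 10 of those documents are actually relevant.
reviewer_decisions = dict(machine_coded)
for i in range(0, 200, 20):
    reviewer_decisions[f"doc{i}"] = "relevant"

rate = sample_error_rate(machine_coded, reviewer_decisions.get, 50, random.Random(42))
```

The same function can be run against the relevant set. A high disagreement rate in either set is the signal that retraining may be needed.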

Expert Perspective

The Only Constant is the Need to Change

I have learned that the only constant is the need to change, to keep up with the ever-changing technologies like predictive coding. That education process leads to modification of the practices. They must be understandable to work well, and not overly complicated. It is a constant struggle to maintain a beginner's mind. I have always promoted the interdisciplinary team approach to e-discovery where attorneys work with engineers, scientists and information technology specialists of all kinds.


Ralph Losey
E-Discovery Blogger / Attorney,
e-Discovery Team Blog

Predictive Coding Tools

We've tried to explain how predictive coding works at a very basic level, but it's important to recognize that all products include different capabilities that impact the process. Some specific features to look for in predictive coding software include:


Document insights

Beyond just delivering a relevancy score, look for tools that also highlight the content and metadata within each document that are relevant to the prediction. This will make the process of validating the results or revisiting specific documents much easier.


Predictive models

Once you've trained a predictive model, look for software that enables you to save and reuse it for other data sets. It's not uncommon to run into different matters that revolve around similar issues, people, and data.


Robust reporting

Predictive coding may be an acceptable means of review in the eyes of the courts today, but as discussed earlier, there is still ambiguity around the level of transparency required. For this reason, you should make sure your predictive coding software comes with robust reporting that logs system users and all actions taken if and when that information is requested.


Spot Prediction

Anyone who has been involved in an e-discovery project knows that cases rarely follow a perfectly linear path. Your document set is constantly evolving as new information comes to light. The spot prediction feature is useful, because it allows you to apply existing predictive models to individual documents and e-mails that may be brought into the mix after the initial coding process.

Next Section

Predictive coding is one significant e-discovery technology innovation. Our next section looks at e-discovery software in general, exploring what types of tools are available, how to go about acquiring a new application, and highlighting key software characteristics.
