The Basics of E-Discovery, Chapter 7B: Predictive Coding (Technology Assisted Review) & Artificial Intelligence


The sections of our Basics of E-Discovery guide largely map to the phases of the e-discovery process. But in this section, we're focusing on technology, specifically the application of artificial intelligence (AI) during document review. For several years, predictive coding has been the dominant e-discovery AI technology. Predictive coding software is a form of machine learning that takes relevance decisions made by human reviewers on a sample of documents and uses them to predict relevance across much larger document sets. Now, however, new AI technologies are emerging, and e-discovery, document review in particular, is on the cusp of another technological leap forward.


A Brief AI Primer

In our personal lives, we use AI on a daily basis, even if we don’t think about it. Waze helps us find the quickest route home. Netflix recommends movies we watch to relax. Spam filters let us ignore thousands of pointless emails. As Maura Grossman, Research Professor in Computer Science at the University of Waterloo, explains, “One of the odd things about AI is that as soon as we get accustomed to it, we simply call it ‘software.’”

This normalization of AI is already happening in legal technology. E-Discovery professionals often treat technology assisted review, AI, predictive coding, and machine learning as similar, or even interchangeable, terms when in fact they are distinct.

On its own, AI is just a buzzword. There is no consensus definition. The term itself was first used in 1955 by John McCarthy in a research proposal for the Dartmouth Workshop, which was based on the idea that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.”

AI encompasses two broad approaches: rules-based systems, which perform tasks by following human-programmed rules, and machine learning models, which learn from data. Earlier AIs (think Deep Blue, the first chess AI to beat a reigning world champion) were primarily rules-based. Today, most AIs use machine learning.

What is Machine Learning?

Machine learning AIs use algorithms to learn from labeled training examples and/or iterative cycles of prediction and analysis of outcomes. Machine learning has applications across many fields, including predictive coding in the review phase of e-discovery.

How Predictive Coding Works

In e-discovery, predictive coding is primarily used to quickly and accurately locate relevant documents during the review phase and greatly expedite the review process, resulting in significant cost savings. Although predictive algorithms vary in their exact methodology, the process at a very simplified level looks something like this:


Seed Sets

Reviewers pull a representative cross-section of documents, known as a "seed set," from the full population of documents that need to be reviewed.


Responsive or unresponsive

Reviewers code (label) each document in the seed set as responsive or unresponsive and input those results into the predictive coding software.


Predictive formula generated

The software analyzes the seed set and creates an internal algorithm (formula) for predicting the responsiveness of future documents.


Sample and refine

Users sample the results of the algorithm on additional documents and refine the algorithm by continually coding and inputting sample documents until they achieve desired results.


Complete review

The software applies that algorithm to the entire review set and codes all remaining documents as responsive or unresponsive.
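The five steps above can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular predictive coding product works: it assumes a simple naive Bayes word-count model, and the sample documents, the function names (tokenize, train, score), and the 0.5 cutoff are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

LABELS = ("responsive", "unresponsive")


def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(seed_set):
    """Steps 1-3: turn reviewer-coded (text, label) pairs into word counts."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in seed_set:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def score(model, text):
    """Return the model's estimated probability that a document is responsive.

    Assumes the seed set contains at least one document of each label.
    """
    word_counts, doc_counts = model
    vocab = set(word_counts["responsive"]) | set(word_counts["unresponsive"])
    total_docs = sum(doc_counts.values())
    log_probs = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        log_p = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out the score.
            log_p += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_probs[label] = log_p
    # Normalize the two log scores into a probability between 0 and 1.
    peak = max(log_probs.values())
    weights = {label: math.exp(value - peak) for label, value in log_probs.items()}
    return weights["responsive"] / sum(weights.values())


# Hypothetical seed set, already coded by reviewers.
seed_set = [
    ("Please wire the payment before the merger closes", "responsive"),
    ("Merger terms and payment schedule attached", "responsive"),
    ("Lunch menu for the office party on Friday", "unresponsive"),
    ("Reminder: parking garage closed on Friday", "unresponsive"),
]
model = train(seed_set)

# Step 5: apply the model to documents no reviewer has seen.
remaining = ["Updated merger payment terms", "Friday office lunch order"]
coded = {
    doc: "responsive" if score(model, doc) >= 0.5 else "unresponsive"
    for doc in remaining
}
```

In this toy run the first remaining document is coded responsive and the second unresponsive. Step 4 (sample and refine) would amount to having reviewers check some of these predictions, adding the corrected documents to seed_set, and calling train again.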

As you can probably already tell, there is a fair amount of technological sophistication underlying how predictive coding works. This is not the venue for a deep dive into predictive coding specifics, but two important concepts to know are:


Active learning

An iterative method whereby the training set is repeatedly augmented by additional documents chosen specifically by the algorithm, and manually coded by a human reviewer.


Passive learning

An iterative method that uses completely random document samples to train the machine learning algorithm until the desired result is achieved.
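The difference between the two methods comes down to how the next batch of training documents is chosen. The sketch below is a hedged illustration only: the document IDs and scores are made up, and the active selector assumes one common variant (uncertainty sampling, which picks the documents the model is least sure about). Other active learning variants choose differently, for example by picking the highest-scoring documents first.

```python
import random


def select_passive(unlabeled_ids, batch_size, rng):
    """Passive learning: pick the next training batch uniformly at random."""
    return rng.sample(sorted(unlabeled_ids), batch_size)


def select_active(scores, batch_size):
    """Active learning (uncertainty sampling variant, an assumption here):
    pick the documents whose relevance score is closest to 0.5."""
    by_uncertainty = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return by_uncertainty[:batch_size]


# Hypothetical relevance scores from the current model for uncoded documents.
scores = {
    "DOC-001": 0.97,
    "DOC-002": 0.52,
    "DOC-003": 0.08,
    "DOC-004": 0.45,
    "DOC-005": 0.71,
}

active_batch = select_active(scores, 2)
passive_batch = select_passive(scores.keys(), 2, random.Random(0))
```

Here the active selector returns DOC-002 and DOC-004, the two documents scored nearest 0.5, while the passive selector returns two documents regardless of score. In both cases a human reviewer then codes the batch and the model is retrained.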

Opinions vary as to which method is most effective, and there are several variations of each one. You can learn a lot more about active learning vs. passive learning by reading the study, "Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery." E-Discovery attorney and technologist Ralph Losey has done extensive research on the subject as well and reports his findings on the blog, "e-Discovery Team."

Predictive Coding in the Courts

When predictive coding first emerged, there was a great deal of anticipation among legal practitioners about how courts would respond. Federal Magistrate Judge Andrew Peck's decision in Da Silva Moore v. Publicis Groupe (Southern District of New York, 2012) is considered the first official judicial endorsement of predictive coding as a way to review documents. Today, most judges agree that predictive coding has a well-established place in e-discovery.

Most disputes now center on how transparent parties have to be about how they use the technology, namely how they construct and code their seed sets. In another influential ruling, Rio Tinto Plc v. Vale S.A. (Southern District of New York, 2015), Judge Peck addressed the issue of transparency. He stated that, in general, he encourages parties to be transparent about their seed set development, but he also noted that there are other means to evaluate the efficacy of predictive coding without agreeing on seed set parameters, including manual sampling of coded documents.

Many experts would contend that predictive coding adoption trends have not lived up to the initial hype -- at least not yet. There are a few explanations for the industry’s tepid response to predictive coding, including:

Human Review Myth

There is a pervasive myth that manual human review represents the most thorough and accurate way to review documents for relevancy. A landmark study in 1985 cast serious doubt on this belief, revealing that attorneys supervising skilled paralegals believed they had found at least 75% of the relevant documents in a document collection, using search terms and iterative search, when they had in fact found only 20%. Other studies since then have confirmed the fallibility of pure manual human review. However, the widely held perception that human review represents the "gold standard" has certainly slowed adoption of predictive coding.

Now, however, increasing data volumes have made it almost impossible for litigants to demand human eyes on all documents in major corporate litigation. This has only increased the need for organizations to leverage predictive coding and other AI techniques to cull data sets down into something more manageable before moving into human review.

Technical Unfamiliarity

Introducing a disruptive technology in any environment is going to be met with a fair amount of skepticism. This holds especially true in the legal industry, which has historically been very resistant to new technology (many lawyers would still take a legal pad over an iPad). Predictive coding is complex. It involves fairly advanced elements of data science and statistical sampling. Even though supporters of predictive coding argue that those complexities are largely hidden from actual users of the technology, the promise of cheaper, more efficient e-discovery has yet to outweigh the fear of the unknown for many lawyers.

Upfront Expense

While predictive coding has been proven to drastically decrease the time and expense of document review, like any technology-driven process, it requires an upfront investment and deployment of a new tool. This often constitutes a capital expenditure, rather than an operational expense, which involves the IT and procurement teams and requires clear financial justification. Many legal teams don't have the time or energy to build a case for why a predictive coding tool is necessary, so they stick with the status quo. If you are interested in learning how to build a case for investing in e-discovery technology, our Comprehensive Guide to Buying E-Discovery Software is a great place to start.

Expert Perspective

Technology-assisted Review Continues to be Greatly Underutilized

Despite its potential to dramatically reduce review costs, technology-assisted review (“TAR”) continues to be greatly underutilized. Advancements in TAR, including the emergence of a second generation of predictive coding tools referred to as “TAR 2.0,” and the strategic use of TAR technologies should remove many of the barriers to its use.


Gareth Evans
Co-Chair, E-Discovery Practice Group
Gibson Dunn

Moving Beyond Predictive Coding

While predictive coding has long been the principal type of AI in e-discovery review, new technologies are emerging that may rival its primacy. As research into AI has continued, the power of deep learning techniques to surpass the abilities of previous generations of AI has become evident. Google DeepMind's AlphaGo Zero, a Go-playing engine that learned without any human input beyond the rules of the game, surpassed the version of AlphaGo that had defeated world champion Lee Sedol after just three days of training.

Similar technologies are now reshaping AI-augmented document review. Exterro's Smart Labeling builds upon the most advanced innovations in deep learning and natural language processing to eliminate the need for seed sets. Rather than requiring training, it learns in a way that mimics human brain functions, recognizing patterns in words, sentence structures, and more to understand which documents are relevant or privileged and then applying labels to them. As a result, reviewers see the documents most likely to be relevant first, helping legal teams get to the facts of the matter sooner and at less expense.

AI can now translate or summarize documents, helping human reviewers understand critical context around the types of documents frequently found in large data sets. AI can also help orchestrate review processes by guiding human reviewers to documents that are most likely to be responsive or most likely to be relevant to the goals of their assigned tasks. In broader applications of document review, such as document review in response to a data breach, AI can detect personal data (such as addresses, phone numbers, Social Security numbers, and more) to help organizations meet tight regulatory timelines to notify subjects of data breaches.
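To make the personal data example concrete, here is a deliberately simple, rules-based baseline written with regular expressions. It is not an AI model and does not represent how any particular product detects personal data; the patterns (U.S.-style Social Security and phone number formats only) and the function name find_personal_data are assumptions made for illustration. Real tools would need to handle far more formats and context.

```python
import re

# Hypothetical patterns covering only two U.S.-style formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_personal_data(text):
    """Return a dict mapping each pattern name to the matches found in text."""
    hits = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[kind] = matches
    return hits


sample = "Call Jane at 555-867-5309. SSN on file: 078-05-1120."
hits = find_personal_data(sample)
```

For the sample sentence, hits contains one phone number and one Social Security number. A pattern-based pass like this can flag documents for closer human or machine review, but it will miss personal data that does not fit a fixed format, such as street addresses.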

Best Practices Checklist

Whether you're using predictive coding or another form of AI in document review, remember the maxim, "Garbage in, garbage out." Bad data going into the system will always produce bad data coming out. No technology can correct sloppy attorney work. Here are some best practices for getting the most out of predictive coding and other AI tools:

Technology Assisted Review Best Practices

Get familiar with the technology before using it

You don't need to be a statistician or even a technical expert to use AI tools, but it's important to get familiar with the fundamentals of how the system works. Does it employ active learning or passive learning? Does it require an initial seed set and if so, how large? Does the system employ a relevancy score or not?

These are the types of questions that may come up with a judge or opposing party, so it's better to be proactive than appear uninformed when the process is put under a microscope. Be prepared to explain how the technology works and why it will help save time or money or produce a more accurate set of results than other means.

Use expert reviewers for training the model

As we mentioned earlier, AI technology follows the adage, 'garbage in, garbage out.' Whether you're using a seed set or the technology is learning in the background, it's smart to employ your very strongest document reviewers, usually senior attorneys with close attention to detail and deep knowledge of the case. It's critical that the algorithm learns from accurate examples, or the entire review could be compromised.

For self-learning tools, you can rely on expert document reviewers to evaluate whether or not the AI is labeling, summarizing, or translating documents accurately.

Develop a relevancy threshold

Many predictive coding tools employ a relevancy score, which essentially rates the system's confidence in a document's relevance. You should come up with a systematic approach for determining which documents will be reviewed by human reviewers after the coding process. (Of course, with Smart Labeling, the system takes care of this task for you!) Specifically, you want to establish a cutoff score: documents that fall below it are set aside, and documents at or above it proceed to the next phase. There isn't a single right answer for what the threshold should be, but applying it consistently will improve the defensibility of the process.
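As a rough sketch of what a consistently applied cutoff looks like, the snippet below splits scored documents into a review queue and a set-aside pile. The scores, document IDs, the 0.60 cutoff, and the function name apply_threshold are hypothetical; the right cutoff depends on the matter and the tool.

```python
def apply_threshold(scores, cutoff):
    """Split documents into (to_review, set_aside) using one documented cutoff.

    Documents at or above the cutoff go to human review, highest score first.
    """
    to_review = sorted(
        (doc_id for doc_id, s in scores.items() if s >= cutoff),
        key=lambda doc_id: scores[doc_id],
        reverse=True,
    )
    set_aside = sorted(doc_id for doc_id, s in scores.items() if s < cutoff)
    return to_review, set_aside


# Hypothetical relevancy scores produced by a predictive coding tool.
scores = {"DOC-101": 0.91, "DOC-102": 0.64, "DOC-103": 0.35, "DOC-104": 0.12}
CUTOFF = 0.60  # Assumed value; record whatever cutoff you choose and why.

to_review, set_aside = apply_threshold(scores, CUTOFF)
```

With these numbers, DOC-101 and DOC-102 go to reviewers and DOC-103 and DOC-104 are set aside. Keeping the cutoff in one named constant makes it easy to show later that every document was treated the same way.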

Validate results

Rather than think of technology as a replacement for human review, it's better to think of it as something that augments the process. No review method is perfect. You should employ a system for manually sampling documents from both your relevant and non-relevant sets after the coding is complete to look for inconsistencies. Discovering a large number of incorrectly coded documents may necessitate retraining the algorithm and running the review again. Keep in mind that the opposing party may ask to sample the non-relevant documents as well, so it's smart to validate internally before producing results.
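One way to picture this kind of validation is to draw a random sample from each set, have a reviewer check it, and measure how often the reviewer disagrees with the software. The sketch below assumes a made-up population (200 documents coded relevant, 800 coded non-relevant) and a dictionary of human decisions standing in for the reviewer. The function names and sample sizes are illustrative assumptions, not a prescribed validation protocol.

```python
import random


def disagreement_rate(doc_ids, machine_label, human_labels, sample_size, rng):
    """Sample documents and return the share where the reviewer disagrees."""
    sample = rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    wrong = sum(1 for doc_id in sample if human_labels[doc_id] != machine_label)
    return wrong / len(sample)


def estimated_recall(n_relevant, relevant_error, n_non_relevant, non_relevant_error):
    """Estimate the share of all truly relevant documents the software found."""
    found = n_relevant * (1 - relevant_error)
    missed = n_non_relevant * non_relevant_error
    return found / (found + missed)


# Hypothetical coded population and reviewer decisions.
relevant_set = [f"R-{i}" for i in range(200)]
non_relevant_set = [f"N-{i}" for i in range(800)]
human_labels = {doc_id: "relevant" for doc_id in relevant_set}
human_labels.update({doc_id: "non-relevant" for doc_id in non_relevant_set})
for i in range(10):   # 10 documents coded relevant that a reviewer would reject
    human_labels[f"R-{i}"] = "non-relevant"
for i in range(40):   # 40 relevant documents sitting in the non-relevant set
    human_labels[f"N-{i}"] = "relevant"

rng = random.Random(42)
relevant_error = disagreement_rate(relevant_set, "relevant", human_labels, 50, rng)
non_relevant_error = disagreement_rate(
    non_relevant_set, "non-relevant", human_labels, 100, rng
)
recall = estimated_recall(
    len(relevant_set), relevant_error, len(non_relevant_set), non_relevant_error
)
```

Because the rates come from samples, they are estimates rather than exact figures, and larger samples give tighter estimates. A high disagreement rate in either set is the signal that retraining may be needed.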

Expert Perspective

The Only Constant is the Need to Change

I have learned that the only constant is the need to change, to keep up with the ever-changing technologies like predictive coding. That education process leads to modification of the practices. They must be understandable to work well, and not overly complicated. It is a constant struggle to maintain a beginner's mind. I have always promoted the interdisciplinary team approach to e-discovery where attorneys work with engineers, scientists and information technology specialists of all kinds.


Ralph Losey
E-Discovery Blogger / Attorney,
e-Discovery Team Blog

Predictive Coding Tools and AI

We've tried to explain how predictive coding works at a very basic level, but it's important to recognize that products differ in their capabilities, and those differences affect the process. Some specific features to look for in e-discovery review software include:


Document insights

Beyond just delivering a relevancy score, look for tools that also highlight the content and metadata within each document that are relevant to the prediction. This will make the process of validating the results or revisiting specific documents much easier.


Predictive models

Once you've trained a predictive model, look for software that enables you to save and reuse it for other data sets. It's not uncommon to run into different matters that revolve around similar issues, people, and data. Persistent labels across document sets (such as for privilege) can also produce real savings for legal teams.


Robust reporting

Predictive coding may be an acceptable means of review in the eyes of the courts today, but as discussed earlier, there is still ambiguity around the level of transparency required. For this reason, you should make sure your predictive coding software comes with robust reporting that logs system users and all actions taken if and when that information is requested.


Spot Prediction

Anyone who has been involved in an e-discovery project knows that cases rarely follow a perfectly linear path. Your document set is constantly evolving as new information comes to light. The spot prediction feature is useful because it allows you to apply existing predictive models to individual documents and e-mails that are brought into the mix after the initial coding process.

Next Section

Predictive coding and artificial intelligence represent significant e-discovery technology innovation. Our next section looks at e-discovery software in general, exploring what types of tools are available, how to go about acquiring a new application, and highlighting key software characteristics.