The Many Names of Predictive Coding
Before we dive into the specifics of how predictive coding works, we need to address the name. The term predictive coding is often used interchangeably with predictive intelligence, technology-assisted review (TAR), and computer-assisted review (CAR). Different experts prefer different terms and point to subtle nuances in each definition. For the purposes of this guide, they all mean essentially the same thing.
Predictive Coding Definition
Predictive coding is a machine learning process in which software takes coding decisions entered by people for the purpose of finding responsive documents and applies them to much larger datasets, reducing the number of irrelevant and non-responsive documents that must be reviewed manually.
How Predictive Coding Works
While Predictive Coding is a relatively new concept in the legal world, machine learning algorithms exist all around us. It's how our email systems filter spam messages from our inboxes and how websites are able to pepper us with advertising tailored to our specific browsing habits.
In e-discovery, predictive coding is primarily used to quickly and accurately locate relevant documents during the review phase and greatly expedite the review process, resulting in significant cost savings. While all predictive algorithms will vary in their exact methodology, the process at a very simplistic level looks something like this:
1. Create a seed set. Reviewers pull a representative cross-section of documents, known as a "seed set," from the full population of documents that need to be reviewed.
2. Code responsive or unresponsive. Reviewers code (label) each document in the seed set as responsive or unresponsive and input those results into the predictive coding software.
3. Generate the predictive formula. The software analyzes the seed set and creates an internal algorithm (formula) for predicting the responsiveness of future documents.
4. Sample and refine. Users sample the algorithm's results on additional documents and refine the algorithm by continually coding and inputting sample documents until they achieve the desired results.
5. Apply to the full review set. The software applies that algorithm to the entire review set and codes all remaining documents as responsive or unresponsive.
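The steps above can be sketched in miniature. The classifier below is a toy Naive Bayes model standing in for the "predictive formula" a real tool would generate; the documents, labels, and scores are invented for illustration and do not reflect any vendor's actual algorithm:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class SeedSetClassifier:
    """Toy Naive Bayes classifier standing in for the 'predictive formula'
    a real predictive coding tool would generate from a coded seed set."""

    def __init__(self):
        self.word_counts = {"responsive": Counter(), "unresponsive": Counter()}
        self.doc_counts = {"responsive": 0, "unresponsive": 0}

    def train(self, seed_set):
        # seed_set: list of (document_text, label) pairs coded by reviewers
        for text, label in seed_set:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def relevancy_score(self, text):
        # Estimate P(responsive | document) with add-one smoothing --
        # analogous to the "relevancy score" many tools expose.
        log_prob = {}
        total_docs = sum(self.doc_counts.values())
        for label in ("responsive", "unresponsive"):
            vocab = len(self.word_counts[label]) + 1
            total_words = sum(self.word_counts[label].values())
            lp = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                lp += math.log((self.word_counts[label][word] + 1) / (total_words + vocab))
            log_prob[label] = lp
        # Squash the log-probability difference into a 0-1 score
        diff = log_prob["responsive"] - log_prob["unresponsive"]
        return 1 / (1 + math.exp(-diff))

# Step 1-2: reviewers code a small seed set
seed_set = [
    ("merger agreement draft attached for review", "responsive"),
    ("quarterly merger negotiation notes", "responsive"),
    ("lunch menu for friday", "unresponsive"),
    ("office holiday party schedule", "unresponsive"),
]

# Step 3: the software builds its internal formula from the seed set
model = SeedSetClassifier()
model.train(seed_set)

# Step 5: the formula scores a new, unreviewed document
score = model.relevancy_score("notes from the merger call")
print(round(score, 2))  # closer to 1.0 means predicted responsive
```

Step 4, sampling and refining, would simply mean coding more documents and calling `train` again until the scores match reviewer judgment.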
As you can probably already tell, there is a fair amount of technological sophistication (and some dreaded math) underlying how predictive coding works. This is not the venue for a deep dive into predictive coding specifics, but two important concepts to know are active learning and passive learning. Opinions vary as to which method is most effective, and there are several variations of each. You can learn a lot more about active learning vs. passive learning by reading the recent study, "Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery." E-discovery attorney and technologist Ralph Losey has done extensive research on the subject as well and reports his findings on the blog, "e-Discovery Team."
Predictive Coding in the Courts
When predictive coding first emerged there was a great deal of anticipation among legal practitioners on how courts would respond. Federal Magistrate Judge Andrew Peck's decision in Da Silva Moore v. Publicis Groupe (Southern District of New York, 2012) is considered the first official judicial endorsement of predictive coding as a way to review documents. Today, there is little discord among judges that predictive coding has a place in e-discovery.
Most disputes now center on how transparent parties have to be about how they use the technology, namely how they construct and code their seed sets. In another influential ruling, Rio Tinto Plc v. Vale S.A. (Southern District of New York, 2015), Judge Peck addressed the issue of transparency, stating that, in general, he encourages parties to be transparent with their seed set development, but also noting that there are other means to evaluate the efficacy of predictive coding without agreeing on seed set parameters, including manual sampling of coded documents.
Predictive Coding Adoption Trends
Many experts would contend that predictive coding adoption trends have not lived up to the initial hype. At least not yet. Predictive coding is frequently included as a topic in industry surveys so there is a fair amount of data available on usage.
According to Exterro's 2015 Federal Judges Survey on E-Discovery Best Practices and Trends, 45% of the responding judges said that they only "somewhat agreed" that predictive coding is employed with regularity in cases, while 41% didn't agree with the statement at all. That doesn't exactly suggest widespread adoption. In fact, in FTI's 2015 Advice from Counsel Survey, 42% of corporate respondents said they had never used predictive coding.
Technology-assisted Review Continues to be Greatly Underutilized
Despite its potential to dramatically reduce review costs, technology-assisted review ("TAR") continues to be greatly underutilized. Advancements in TAR, including the emergence of a second generation of predictive coding tools, referred to as "TAR 2.0," and the strategic use of TAR technologies should remove many of the barriers to its use.
Co-Chair, E-Discovery Practice Group
There are a few explanations for the industry's tepid response to predictive coding, including:
Human Review Myth
There is a pervasive myth that manual human review represents the most thorough and accurate way to review documents for relevancy. A landmark study in 1985 disproved this theory, revealing that attorneys supervising skilled paralegals believed they had found at least 75% of the relevant documents from a document collection, using search terms and iterative search, when they had in fact found only 20%. Other studies since then have confirmed the fallibility of pure manual human review. However, the widely held perception that human review represents the "gold standard" has certainly slowed adoption of predictive coding.
Fear of the Unknown
Introducing a disruptive technology in any environment is going to be met with a fair amount of skepticism. This holds especially true in the legal industry, which has historically been very resistant to new technology (many lawyers would still take a legal pad over an iPad). Predictive coding is complex. It involves fairly advanced elements of data science and statistical sampling. Even though supporters of predictive coding argue that many of those complexities are largely hidden from actual users of the technology, the promise of cheaper, more efficient e-discovery has yet to outweigh the fear of the unknown for many lawyers.
Upfront Investment
While predictive coding has been proven to drastically decrease the time and expense of document review, like any technology-driven process, it requires an upfront investment in and deployment of a new tool. This often constitutes a capital expenditure, rather than an operational expense, which implicates the IT and procurement teams and requires clear financial justification. Many legal teams don't have the time or energy to build a case for why a predictive coding tool is necessary, so they stick with the status quo.
Limited Case Law
Although courts have endorsed the use of predictive coding, the pool of cases involving the technology is still quite shallow. Risk-averse lawyers remain uneasy with the limited number of rulings that have addressed predictive coding and may be waiting for a more solidified judicial consensus.
Predictive Coding Beyond E-Discovery Review
Predictive coding has so far centered on document review, where most people believe its adoption can have the greatest impact on reducing e-discovery costs. But machine learning algorithms that classify and predict document relevancy can be applied to any document set. Some technology vendors, including Exterro, have built the capability into the early case assessment (ECA) process, allowing users to leverage predictive analytics prior to document collection. In fact, predictive coding shows tremendous promise outside of the litigation bubble altogether, as an information governance tool that allows organizations to proactively code, categorize, and filter documents at the point of creation.
Practical Predictive Coding
You can learn much more about using predictive coding beyond litigation by watching Exterro's on-demand webcast, "Practical Predictive Coding."
Best Practices Checklist
Predictive coding is a powerful technology, but its effectiveness is heavily dependent on human intervention. Don't forget, predictive algorithms are designed to approximate human judgment (via the seed set). That means bad data going into the system will always produce bad data coming out. In other words, predictive coding cannot correct sloppy attorney work. Here are some best practices for getting the most out of predictive coding:
Predictive Coding Best Practices
Get familiar with the technology before using it
You don't need to be a statistician or even a technical expert to use predictive coding, but it's important to get familiar with the fundamentals of how the system works. Does it employ active learning or passive learning? How many documents comprise the initial seed set? Does the system employ a relevancy score? These are the types of questions that may come up with a judge or opposing party, so it's better to be proactive than to appear uninformed when the process is put under a microscope.
Use expert reviewers for training the model
As we mentioned earlier, predictive coding follows the adage 'garbage in, garbage out.' When training the predictive model, it's smart to employ your very strongest document reviewers, usually senior attorneys with close attention to detail and deep knowledge of the case. It's critical that the seed set is properly coded, or the entire review could be compromised.
Develop a relevancy threshold
A lot of predictive coding tools employ a relevancy score, which essentially rates the system's confidence in a document's relevance. You should come up with a systematic approach for determining which documents will be reviewed by human reviewers after the coding process. Namely, you want to establish a cutoff point that determines which documents will be set aside and which will proceed to the next phase. There isn't a right answer for what the threshold should be, but applying it consistently will improve the defensibility of the process.
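Applying a threshold is mechanically simple; the discipline lies in choosing it upfront and applying it consistently. A minimal sketch, with invented document IDs and scores:

```python
# Each tuple: (document_id, relevancy score reported by the predictive coding tool)
# The IDs, scores, and threshold here are hypothetical illustrations.
scored_docs = [
    ("DOC-001", 0.91),
    ("DOC-002", 0.12),
    ("DOC-003", 0.55),
    ("DOC-004", 0.48),
    ("DOC-005", 0.77),
]

THRESHOLD = 0.50  # cutoff agreed on before review begins, then applied uniformly

to_human_review = [doc for doc, score in scored_docs if score >= THRESHOLD]
set_aside = [doc for doc, score in scored_docs if score < THRESHOLD]

print(to_human_review)  # ['DOC-001', 'DOC-003', 'DOC-005']
print(set_aside)        # ['DOC-002', 'DOC-004']
```

Documents just below the cutoff (like the 0.48 here) are exactly the ones worth examining when sampling the non-relevant set, since they are the likeliest misclassifications.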
Sample and validate the results
Rather than think of predictive coding as a replacement for human review, it's better to think of it as something that augments the process. You should employ a system for manually sampling documents from both your relevant and non-relevant sets after the coding is complete to look for inconsistencies. No review method is completely perfect. However, discovering a large number of wrongly coded documents may necessitate retraining the algorithm and running the review again. Keep in mind, the opposing party may ask to sample the non-relevant documents as well, so it's smart to validate internally before producing results.
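The sampling step can be sketched as drawing a random slice of the machine-coded set and comparing it against a senior reviewer's judgment. Everything here is simulated: the document population, the machine's labels, and the two disagreements are all hypothetical stand-ins:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

# Hypothetical machine-coded results: document ID -> predicted label
machine_coded = {
    f"DOC-{i:03d}": ("responsive" if i % 4 == 0 else "unresponsive")
    for i in range(1, 201)
}

SAMPLE_SIZE = 20
sample_ids = random.sample(sorted(machine_coded), SAMPLE_SIZE)

# In practice a senior reviewer re-codes the sampled documents by hand.
# Here we simulate a human who agreed on all but two of them.
human_coded = {doc_id: machine_coded[doc_id] for doc_id in sample_ids}
for doc_id in sample_ids[:2]:
    human_coded[doc_id] = (
        "unresponsive" if machine_coded[doc_id] == "responsive" else "responsive"
    )

disagreements = sum(1 for d in sample_ids if human_coded[d] != machine_coded[d])
error_rate = disagreements / SAMPLE_SIZE
print(f"estimated error rate: {error_rate:.0%}")  # prints "estimated error rate: 10%"
```

If the estimated error rate from a sample like this came back unacceptably high, that would be the signal to retrain the model and rerun the review before producing anything to the opposing party.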
The Only Constant is the Need to Change
I have learned that the only constant is the need to change, to keep up with ever-changing technologies like predictive coding. That education process leads to modification of the practices. They must be understandable to work well, and not overly complicated. It is a constant struggle to maintain a beginner's mind. I have always promoted the interdisciplinary team approach to e-discovery, where attorneys work with engineers, scientists, and information technology specialists of all kinds.
E-Discovery Blogger / Attorney,
e-Discovery Team Blog
Predictive Coding Tools
We've tried to explain how predictive coding works at a very basic level, but it's important to recognize that different products include different capabilities that impact the process. Specific features to look for in predictive coding software include the training method (active vs. passive learning), relevancy scoring, and built-in tools for sampling and validating results.
Predictive coding is one significant e-discovery technology innovation. Our next section looks at e-discovery software in general, exploring what types of tools are available, how to go about acquiring a new application, and highlighting key software characteristics.