I’ve enjoyed reading Karl Schieneman’s recent series on metrics in predictive coding in the eDiscovery Journal. One passage in particular from his most recent post caught my attention:
“I agree it would be nice to know in advance if the number of random sample documents your TAR system uses is enough to train it adequately. If the system is looking at 5,000 documents as a training set, is that enough? Or, should it be something smaller, such as 2,000 documents? Or whether the final recall rate of responsive documents found should be an estimated 70, 80, or 90 percent of the total responsive documents in the collection…These types of metrics ARE NOT LIKELY to emerge for a number of reasons.”
Besides arousing dreadful memories of my first and only foray into statistics in high school, the passage raises some interesting questions about what all these numbers really mean in the grand scheme of things. It also got me thinking about just how sampling works in the context of predictive technologies in e-discovery. We hear so much about random sampling. In truth, that represents only half the story. Judgmental sampling plays an equally essential, albeit slightly different, role.
At the risk of oversimplifying what really is a very complex topic, I thought it might be useful to distinguish the different sampling types. The best way for a mathematics simpleton like me to conceptualize a topic such as this is to think about it purely in the context of predictive technologies in e-discovery.
Random sampling generates a statistically sound sample of a given document corpus. That means the ratio of responsive to non-responsive documents in the random sample should accurately reflect the ratio in the document set as a whole.
For example, suppose I have a set of 50,000 documents and am seeking a 95% confidence level (meaning that if I repeated the sampling exercise many times, roughly 95% of my estimates would fall within the margin of error of the true value) and a 5% margin of error (I am okay with the estimate being off by as much as five percentage points in either direction). Mathematically, I will need a minimum sample of 382 documents. (I used the handy sample size calculator from Raosoft.com to expedite the math; sampling experts will undoubtedly notice that I left out a couple of variables in the scenario defined above. That was done intentionally in the interest of simplicity.) So, what exactly does that mean? In a nutshell, by manually reviewing those 382 randomly sampled documents, an expert reviewer should be able to ascertain, with a fairly high degree of certainty, what proportion of the full document set is responsive. If the reviewer determines that 10 of the 382 documents are responsive (about 2.6%) and the rest unresponsive, I can reasonably conclude that roughly 1,300 of my total corpus of 50,000 documents should be responsive.
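For the mathematically curious, here is a minimal Python sketch of that arithmetic. It uses the standard sample-size formula for a proportion with a finite-population correction, which is the same basic math behind calculators like Raosoft's; the function and variable names are my own, purely for illustration.

```python
import math

def sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    """Minimum random sample size for estimating a proportion at a given
    confidence level (z = 1.96 for 95%), with a finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                   # adjust for corpus size
    return math.ceil(n)

def estimate_responsive(population, reviewed, responsive_found):
    """Extrapolate the responsive-document count in the full corpus from
    the rate observed in the reviewed random sample."""
    rate = responsive_found / reviewed
    return rate, round(population * rate)

corpus = 50_000
n = sample_size(corpus)                                    # -> 382
rate, estimate = estimate_responsive(corpus, n, responsive_found=10)
print(n)                  # 382 documents to review
print(f"{rate:.1%}")      # ~2.6% responsive in the sample
print(estimate)           # 1309 -- roughly 1,300 responsive documents overall
```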
This is where things can get a little confusing, and where judgmental sampling enters the picture.
During the course of a matter, legal teams often become privy to new, valuable information about the case. A series of custodian interviews might unveil a set of “homerun” key terms, or a relevant document that links inexorably to other relevant materials. This is invaluable fodder that needs to be added to the training already done in the predictive system. By adding those judgmental findings to the initial sample set as responsive examples, the training model can be further refined and run against the corpus to identify potentially responsive ESI with even greater accuracy. It works the other way, too: for predictive technologies to work properly, they need to be trained on both responsive and unresponsive documents to ensure an adequate delineation between the two.
For example, Ralph Losey recently engaged a predictive coding tool for a search project in which he attempted to uncover a specific set of Enron documents related to involuntary employee terminations. Losey, whose day-by-day account of the project is an absolute must-read for anyone interested in how predictive technologies work, smartly tagged references to voluntary employee terminations as unresponsive to help the system more accurately identify only those references to involuntary terminations as responsive.
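To make the idea concrete, here is a small, hypothetical sketch (in Python with scikit-learn, and not modeled on any particular vendor's tool or on Losey's actual workflow) of how randomly sampled and judgmentally selected documents, each tagged responsive or unresponsive, might be combined into a single training set for a simple text classifier. The documents and labels are invented for illustration.

```python
# Hypothetical sketch only -- not any vendor's actual implementation. It shows
# randomly sampled and judgmentally selected documents, tagged responsive (1)
# or unresponsive (0), feeding one training set for a simple text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

random_sample = [
    ("Quarterly budget spreadsheet attached for review.", 0),
    ("Termination letter for employee dismissed for cause.", 1),
]
judgmental_sample = [
    # Surfaced through custodian interviews / "homerun" key terms
    ("Notice of involuntary termination effective June 1.", 1),
    # Deliberately tagged unresponsive to sharpen the boundary
    ("Employee resignation letter confirming a voluntary departure.", 0),
]

texts, labels = zip(*(random_sample + judgmental_sample))

# Train a basic model on the combined seed set
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, list(labels))

# Score unreviewed documents; a higher score means more likely responsive
new_docs = ["Letter confirming involuntary separation and severance terms."]
scores = model.predict_proba(vectorizer.transform(new_docs))[:, 1]
print(scores)
```

The point is not the particular classifier, but the workflow: judgmentally found documents, both good and bad examples, go into the same training pool as the random sample.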
At this point, you may be asking yourself why any of this matters. I would argue that understanding sampling and its role in predictive technologies matters greatly. While experts like Losey have been arguing for some time that predictive technologies represent the future of e-discovery search and review, there is still a lot of fear, apprehension, and general confusion about how the technology works, or, maybe more appropriately, how humans work with the technology.
Going back to the beginning of the post and Schieneman’s argument that there will likely never be a set of “uniform metric standards” for predictive technologies, I am inclined to agree with him and take it a step further by suggesting that there shouldn’t have to be. That’s the beauty of predictive technologies. Humans control the variables. The technology simply does the work.
To learn more about how predictive technologies can be applied throughout all phases of e-discovery, register for Exterro’s upcoming webcast here.