
Getting the Scoop on De-dupe

Created on August 7, 2012

Content Marketing Manager

De-duplication is a critical e-discovery process for reducing data volumes ahead of attorney review and, with them, overall costs. Computer hashing has become the most common way to identify duplicate files. A hash function is a one-way mathematical algorithm (not encryption, since the output cannot be reversed to recover the file) that generates a fixed-length value which, for practical purposes, uniquely identifies a file's contents. Hashing serves two main purposes in e-discovery. First, it helps authenticate the data. Any change to a document's contents results in a different hash value, exposing attempts to manipulate potentially relevant evidence. Second, hashing is used for file identification. Because a hash value is based on the contents of a file, legal teams can use it to track down identical documents, even those with different file names, and flag them as duplicates. Hashing can also be used to filter out irrelevant system files. It has become a widely adopted standard for de-duplication, and most data processing technologies employ at least one of the common hash algorithms (MD5, SHA-1, etc.).
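As a rough illustration of content-based identification, the sketch below groups files whose contents hash to the same value, regardless of file name. It uses Python's standard `hashlib` module; the function names and the sample files are hypothetical and are not taken from any particular processing tool.

```python
import hashlib
from collections import defaultdict

def content_hash(data, algorithm="sha1"):
    """Return the hex digest of a file's contents (the file name is not an input)."""
    return hashlib.new(algorithm, data).hexdigest()

def group_duplicates(files):
    """Group file names whose contents hash to the same value."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[content_hash(data)].append(name)
    # Only hashes shared by more than one file represent duplicates.
    return [sorted(names) for names in by_hash.values() if len(names) > 1]

# Hypothetical sample data: two identical files and one edited version.
files = {
    "contract_final.txt": b"Payment is due in 30 days.",
    "copy_of_contract.txt": b"Payment is due in 30 days.",
    "contract_edited.txt": b"Payment is due in 90 days.",
}
duplicates = group_duplicates(files)
```

Here the two identical files land in one group despite their different names, while the edited file, which differs by a single character, hashes to a different value and is left out.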

There are cases, however, where parties may want to pursue more rigorous methods for identifying duplicates. Beyond the surface-level content of a given file lies metadata, a set of data that describes that file. For an email, the metadata might include when the message was sent, to whom it was sent and when it was opened. The list of metadata fields is extensive. When metadata is brought into the fold, the definition of what constitutes a duplicate file becomes less clear. eDiscovery Journal analyst Greg Buckles explained in a recent article that “the fly in the ointment here is that most emails are not true duplicates, but rather similar versions with slight differences in certain fields.”

In one of her many significant e-discovery rulings, Judge Shira Scheindlin concluded in National Day Laborer Organizing Network v. United States Immigration and Customs Enforcement Agency, 10 Civ. 3488 (S.D.N.Y. Feb. 7, 2011) that metadata is part of an electronic document and must be included in Freedom of Information Act (FOIA) productions. While somewhat narrow in its scope, the ruling was viewed by many experts as a potential game changer, portending the growing role of metadata in all types of litigation.

In a typical civil litigation case, most metadata fields are not important enough to warrant production of duplicate files as individual copies. But there are exceptions, as Buckles explains: “I have been involved with many matters where key issues depended on whether specific custodians actually read certain emails.” In such a matter, a traditional hashing approach might prove insufficient, since it would identify all copies of a particular email as identical regardless of who received or opened it. In other instances, the file name itself may be consequential. For example, words like “confidential” or “sensitive” in a file name could be incriminating in and of themselves, depending on the nature of the case.
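To make the email scenario concrete, here is a minimal sketch that hashes only selected fields of an email record. The record layout and the field names (`subject`, `body`, `custodian`, `opened`) are assumptions made for illustration; real processing tools define their own field sets.

```python
import hashlib

def email_hash(email, fields):
    """Hash only the named fields of an email record, in a fixed order."""
    # A separator character keeps adjacent field values from running together.
    joined = "\x1f".join(str(email[f]) for f in fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Hypothetical records: two custodians' copies of the same message,
# only one of which was opened.
copy_a = {"subject": "Q3 forecast", "body": "Numbers attached.",
          "custodian": "alice", "opened": True}
copy_b = {"subject": "Q3 forecast", "body": "Numbers attached.",
          "custodian": "bob", "opened": False}

content_only = ("subject", "body")
with_receipt = ("subject", "body", "custodian", "opened")

same_by_content = email_hash(copy_a, content_only) == email_hash(copy_b, content_only)
same_with_metadata = email_hash(copy_a, with_receipt) == email_hash(copy_b, with_receipt)
```

Hashed on content alone, the two copies are indistinguishable. Once the custodian and opened status are included, they no longer match, so the fact that one recipient never opened the message is preserved.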

Given these complexities, it is becoming increasingly important for organizations to have some flexibility in how they approach the de-duplication process. Hashing will always serve a valuable role in identifying duplicate documents, but there may be cases where it's necessary to add criteria on top of the standard algorithm, such as file location, last modified date or file name, that must also match for a file to be considered an actual duplicate. There may also be a need to specify the scope of the de-duplication effort. If parties can agree that only one unique file must be produced regardless of the underlying metadata, then they can de-dupe across all data involved in the matter in one fell swoop. But if the nature of the case calls for a more deliberate approach, it may be necessary to run de-duplication at just the custodian level, flagging all duplicate files owned by a specific individual but ignoring copies of those same files that are in the possession of someone else.
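The difference between global and custodian-level de-duplication, and the effect of requiring additional fields to match, can be sketched roughly as follows. This is an illustrative outline only: the `dedupe` function, its parameters and the record fields are hypothetical, and SHA-1 stands in for whichever hash a given tool uses.

```python
import hashlib

def dedupe(records, extra_fields=(), per_custodian=False):
    """Return (kept, flagged): the first record seen for each key is kept,
    and later records with the same key are flagged as duplicates."""
    seen = set()
    kept, flagged = [], []
    for rec in records:
        # The key always starts with the content hash.
        key = [hashlib.sha1(rec["content"]).hexdigest()]
        # Optional criteria that must also match, e.g. file name.
        key += [rec[f] for f in extra_fields]
        # Custodian-level scope: the same file held by two people is not a duplicate.
        if per_custodian:
            key.append(rec["custodian"])
        key = tuple(key)
        (flagged if key in seen else kept).append(rec)
        seen.add(key)
    return kept, flagged

# Hypothetical records: identical content held under different names and custodians.
records = [
    {"custodian": "alice", "file_name": "memo.doc", "content": b"Draft terms"},
    {"custodian": "alice", "file_name": "memo_confidential.doc", "content": b"Draft terms"},
    {"custodian": "bob", "file_name": "memo.doc", "content": b"Draft terms"},
]

global_kept, _ = dedupe(records)                             # one copy across the matter
custodian_kept, _ = dedupe(records, per_custodian=True)      # one copy per custodian
name_kept, _ = dedupe(records, extra_fields=("file_name",))  # file name must also match
```

With these sample records, global de-duplication keeps a single copy, the custodian-level run keeps one copy each for the two custodians, and requiring the file name to match keeps the "confidential" variant as a distinct document.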

The important thing to note is that there is no set standard for de-duplication. It is an issue that must be worked out between parties in pre-discovery conferences. “How you handle potential duplicates is just as important as the criteria by which you define duplicates,” writes Buckles. “Retaining user actions (folder, forward, reply, tag, etc.) can be critical in some cases and absolutely irrelevant in others.” As with any ambiguous standard, and there are many in e-discovery, it helps to have a variety of options and, in this case, technologies at your disposal.

To learn more about issues surrounding data culling and defensible data reduction, watch Exterro's on-demand webcast “Proportionality vs. Defensibility in E-Discovery Collections.”

