The Basics of E-Discovery , Chapter 6: Data Collection


Data collection is perhaps the most technically rigorous and complex of all the e-discovery phases. It involves the extraction of potentially relevant electronically stored information (ESI) from its native source into a separate repository. Because collection involves direct interaction with data, most people think of it as principally an IT activity. However, an effective collection strategy demands active involvement of both legal and IT professionals alike.

The particulars of e-discovery collections, such as the timing and methodologies, lead to a lot of confusion among practitioners. There are several misperceptions surrounding collection that we hope to clarify throughout this section; so let's get to it!

Learn about the difference between data collection and data preservation in this video introduction to data collection.

Collection is NOT Preservation

Even relatively experienced legal professionals tend to conflate preservation and collection. While collecting as a way to preserve certainly meets the court's requirements, it is a very costly and inefficient way to do so. Think of preservation in terms of ensuring potentially relevant data isn't deleted. Courts don't prescribe a particular method for preservation, they just require that it gets done. Collection, on the other hand, is the first tangible step towards producing documents to the other side. While certainly not all collected documents will ultimately get produced, the idea is that collection feeds into the review process, which in turn dictates the production set. Put simply, preservation is very process based, while collection is much more action based (and usually way more technical).

Collection is not preservation

Types of ESI That Must Be Collected

Virtually every form of electronic data is up for grabs in e-discovery. And while it's one thing to identify and preserve various forms of ESI, it's often quite another to actually go out and collect it all. Different data sources have different levels of accessibility and present different collection challenges. Here is a breakdown of several common categories of ESI that might need to be collected for e-discovery:



Data that you interact with on a regular basis, such as email and other traditional files that are stored on a local hard drive or network drive. This ESI tends to be fairly easy to access and collect.

6 Esi Cloud


Data created and stored on cloud servers, ranging from software as a service applications to Google Drive and social media accounts, have grown exponentially in the past several years. Cloud providers have their own policies for accessing data, but e-discovery technology solutions have the ability to integrate with many of the most popular cloud services, making collection easier than even three years ago.

6 Esi Mobile


Collecting data from mobile devices often requires sophisticated tools and highly specialized expertise, often requiring consultants. But as more and more companies have embraced bring your own device (BYOD) policies, they have to be prepared to preserve and collect ESI from call logs, text messages, instant messaging, geolocation data, and other applications. Case law elucidates much of current e-discovery practice for mobile data collection. You can stay up-to-date at our Case Law Library.



Data that is no longer in active use but is stored or archived. Even though offline data can't be accessed over a shared server, collecting it usually presents fairly minimal challenges as long as you know the physical location of the data and the system on which it's stored.



Traditional backup tapes or disaster recovery systems are designed to store data in the event that it must be restored. These systems compress files and are not easily searchable or accessible and therefore they tend to present significant collection hurdles.



Previously deleted or fragmented files that exist on various systems and are usually not readily visible to regular system users. These files are highly inaccessible, and attempting to recover them requires specialized tools. Learn more about hidden files below in our section on forensic imaging.

Metadata 101

You can't address collection in e-discovery without talking metadata. You'll come across different definitions for metadata (like all things e-discovery, it seems one definitive explanation for a concept, process, or activity is never enough). Our favorite definition is that metadata is the data about the data.

When you look at regular document on your computer you see the words in the document of course, along with the name of the file, and where it's located. In addition to the visible information is a whole host of information about the document, such as when it was created, modified, last updated, who made edits to the document, etc. In a regular business sense, this information has marginal value. However, in e-discovery, , this contextual information can be hugely important and must be included when the document is ultimately collected.

Forensic Image vs. Logical Copy

If you tell an IT professional that you need to collect data from a computer hard drive, chances are you are going to be presented with a forensic image of the drive (also known as a "bit by bit" or bit stream copy). At the most basic level, a forensic image is a complete copy of a drive – including the portions of the drive that aren't allocated to active files (known as slack space). It is essentially an exact duplicate of the original drive. These types of images give you both the files you'd expect to see if you were browsing a file listing, and also data from previously deleted files. Forensic imaging requires specific tools and is usually performed by an expert.

Alternatively, a logical copy is simply a copy of the contents of the directories on a disk and does not include previously deleted data or other information that a forensic image would capture. They are also much less technically intensive and can be performed by just about anyone with a little training and the right software.

So which is the best approach for e-discovery?

Most experts will tell you that in a great majority of civil matters, a logical copy will meet the court's expectation. There is certainly a place for forensic imaging, but it's usually only necessary when there is a suspicion of data tampering or in cases where previously deleted files are at the center of the controversy. You can learn much more about the difference between a forensic image and a logical copy by viewing our infographic, " E-Discovery Data Collection Considerations for Organizations."

Collection Methodologies

There are a variety of ways that organizations approach the collection process. Questions that might dictate the collection methodology might include:

  • How much data is involved in the legal matter?
  • How many sources of data are implicated, and how accessible are those data sources?
  • Will the collection involve any specialized tools or expertise?
  • Does the legal matter involve encrypted or sensitive data?
  • Are there internal IT resources available to perform/assist with the data collection?
  • What are the time constraints (production deadlines, retention schedules, etc.)?
  • What type(s) of collection technologies are deployed to perform the collections?
  • Is the matter civil or criminal (which informs the decision between forensic and logical copies)?

Answers to these questions will help determine which of the following collection approaches to employ.


Employee Self-Collection

Employee self-collection involves letting the custodians themselves copy relevant files into a shared drive or portable storage device. Most experts advise against employee self-collection pointing out that most employees aren't technically savvy and are highly likely to make mistakes and overlook key documents. Likewise, several courts have also questioned whether employee self-collection constitutes a "defensible" e-discovery response given the potential for bad actors. That being said, for small matters involving low volumes of conventional data (email, word processing documents, etc.) employee self-collection may be reasonable and cost effective , especially when the opposing party and judge have signed off on the plan ahead of time.


IT Collection

By far the most common collection approach, IT collection involves members of the IT department performing the actual data collection at the direction of the legal department. On the surface, involving IT professionals in the collection process makes sense, as they understand the data landscape and usually possess the technical skill to get everything that's needed. However, there are downsides. In organizations with limited internal IT resources, data collections can be time consuming and keep IT professionals from other business-critical projects. As mentioned above, IT professionals also tend to associate data collection with forensic imaging, and without clear guidance from the legal team on what specifically to go after, they are likely to collect very broadly, resulting in more data that has to be processed and reviewed, driving up e-discovery costs.


External Collection

For organizations with very limited IT resources, a third party expert might be called on to perform the data collection. An outside expert is likely to have set procedures and all the necessary tools and skill to perform a collection that will withstand the highest levels of judicial scrutiny. However, there can be a considerable expense associated with bringing in outside assistance, which is why most experts advise that organizations have at least some level of internal collection capacity.


Remote Collection

These collections employ a centralized internal collection system that is integrated with company data sources allowing ESI to be collected remotely. Though the collection might still be performed by an IT professional, it doesn't require any direct interaction with the data sources, themselves and can usually be performed much quicker and more efficiently than with traditional methods. These systems also support more targeted collections by applying search and analytics technologies. Many experts argue remote collection is the most efficient and cost effective approach for large organizations with consistent collection demands. More on this in the technology discussion below.

Developing a Collection Strategy

As opposed to the collection methodologies described above which address the 'how' of collection, your collection strategy addresses the 'why.' Why are you collecting certain data? Why collect data one way instead of another?

Your collection strategy will change with every matter. In some cases - say very high stakes legal matters involving precarious data sources - it may be wise to collect data immediately. In other matters, immediate collections may not be necessary, especially if you have a strong preservation process in place. It's common for litigants to collect very highly relevant data early, since they know it will need to be collected eventually, but collecting very broadly in the early days of a matter is usually not advisable, as this will drive up your costs with little benefit to your case.

Collection strategy can change with every matter

It's also important to consider how your case strategy impacts your collection strategy. If your case is inevitably headed for an early settlement, it doesn't make a lot of sense to collect and process a bunch of data that won't be needed. Other considerations that should go into your collection strategy include:

  • Whether or not outside experts should be involved
  • If there is any sensitive data that warrants greater protection measures
  • Whether any employees might have an incentive to alter or delete relevant data

Validating the Collection

Collection can lead to contentious disputes between parties if there is suspicion that parties are not acting in good faith or that the collection process itself altered the contents of the data. When such controversies surface, parties typically rely on a few mechanisms for proving that a collection was conducted in a defensible manner. To ensure defensibility, proactively monitor and record your compliance in these ways:


Chain of Custody

E-Discovery think tank The Sedona Conference defines chain of custody as the "documentation regarding the possession, movement, handling and location of evidence from the time it is identified to the time it is presented in court or otherwise transferred or submitted." A thorough chain of custody log demonstrates the authenticity of a document and disproves any claims of data tampering.



Commonly referred to as a "digital fingerprint," a hash value is a special encryption code that is associated with each computer file. Hash code provides files with a unique digital identifier. If a file's contents change, the file's hashtag will change as well, indicating that the file is not the same as it was before. By comparing hash values before and after collection, you can easily show that a file is the same pre-collection as it is after. For more on hashing, read e-discovery attorney Ralph Losey's terrific blog post on the topic.


Audit Trail

Audit trails are automated records generated by systems that track user activity. In the context of collection, they can be helpful in showing when a collection took place, the amount of data collected,, and which user initiated it if such information is ever requested by a judge or adversary.

Data Processing

Data processing and collection are closely intertwined. In e-discovery, processing prepares collected data for attorney review. After data is collected, the resulting document set will include a rather messy mix of file types and formats, attachments, meaningless system files, and plenty of duplicates. Processing cleans up the mess and the collected ESI so that it can be culled and searched by attorneys and review tools.

We won't get to into the weeds on data processing, since it's a highly technical process that includes a lot of concepts and jargon that the average e-discovery practitioner doesn't need to know. What is worth discussing, however, is who actually does the processing. Traditionally, most organizations outsourced data processing to third party vendors who would use specialized technologies to winnow data sets down and deliver them back to clients for next steps. Today, many companies still outsource processing, but there are a growing number of companies who have deployed processing software in-house. There is also an emerging class of collection technologies that essentially consolidates collection and processing into one step; we'll look at these tools in the technology section below.

Data Collection Best Practices

Data collection is a dynamic and multi-faceted process that relies on sound e-discovery strategy, as well as solid technical resources and expertise. There are important best practices that fall under each of the various elements of the collection process, but here are four big ones you should know:

Don't over-collect

We know that it's easy to identify a relevant custodian and copy his or her entire hard drive or email folders. But easy doesn't equate to smart. More data collected means more data processed and ultimately reviewed. And that all adds up to more time and money spent on e-discovery. Instead, develop strong preservation and early case assessment processes, and target your collections so that you are only collecting the potentially relevant ESI, nothing more and nothing less. Learn more about avoiding over-collection by reading Exterro's white paper, "Eliminating E-Discovery Over-Collection."

Be proactive

It's inevitable that some matters are going to present unique collection challenges. Maybe the case involves mobile data or highly unorganized data on legacy systems. Whatever the case may be, do yourself a favor and recognize these challenges early on rather than right at the point where data needs to be collected. It's always better – and much cheaper – to assess your needs proactively to determine if outside resources will be needed and, if so, which vendors. Even if outside help isn't necessary, it's important to give your internal IT team a heads up that a potentially big project may be coming their way soon, so they can plan accordingly.

Tier your collections

This relates to our point above about over-collection. It's always best to think of collection in terms of phases or tiers, rather than to try and do everything all at once. A tiered collection strategy involves prioritizing data so that only the most highly relevant data is collected immediately and less relevant data is collected only when needed. Just remember, a strong preservation process is a prerequisite to a defensible tiered collection strategy..

Create an ESI repository

Matters overlap. You may have dozens of lawsuits that revolve around the same or similar issues, people, and data. Rather than create multiple copies of the same data, create a centralized repository of collected ESI and develop a systematic process for reusing data across matters.

Data Collection Tools

Data collection is not a one size fits all endeavor, nor is it a process that is supported by a single technology. There are a variety of tools that you can deploy depending on your specific collection needs and priorities. Here are some specific systems and capabilities you may want to consider:

In-Place Processing:

As mentioned earlier, processing has traditionally taken place post-collection as a separate e-discovery process typically handled by service providers, who charge on a per-gigabyte basis. However, there are new search and collection technologies that process data at the point of collection, eliminating the need to send collected data to a third party vendor.

Pre-Collection Analytics:

It may not specifically be a collection technology, but pre-collection analytics have a huge influence on the collection process. These are tools that crawl data sources and deliver basic insights like document volumes, and can also perform more advanced searching and filtering to really hone in on relevant content. They equip you with the necessary intelligence and visibility to target collections and focus efforts on relevant content.

Data Source Integrations:

Integrating your collection software with your enterprise data sources (email servers, Sharepoint servers, structured databases, etc.) can greatly streamline the collection process by eliminating the need for IT to conduct manual collections. Integrations allow for collection to be handled remotely, and they also minimize a lot of the technical complexities surrounding collections, allowing non-IT professionals to be more involved in the process.

Spot Collectors:

Even when you have an integrated collection environment and can collect over the network, you may occasionally need to grab data off a system that isn't connected to the network, such as field computer that is used by an employee who works remotely. Spot collector tools are portable USB devices that allow IT professionals or custodians to crawl and collect off non-network systems. The major benefit that these tools offer is that they can be pre-configured to collect only relevant files rather than complete copies of a computer's hard drive.

Mobile Collection Tools:

We discussed the challenges of collecting mobile data earlier. Ideally, any data on a mobile device will be located somewhere else that is a little more accessible, such as an email server. But workers in some industries create content that never leaves their phones – like text messages – that may need to be collected. Fortunately, there are specific devices that are designed to extract data off of mobile devices and reformat it for the purposes of attorney review and legal production.

Next Section

Up next in the Beginner's Guide to E-Discovery, we go from data collection to document review, the most costly of all e-discovery phases.