The Basics of E-Discovery, Chapter 6: Data Collection

Data collection is perhaps the most technically rigorous and complex of all the e-discovery phases. It involves the extraction of potentially relevant electronically stored information (ESI) from its native source into a separate repository. Because collection involves direct interaction with data, most people regard it as mainly an IT activity. However, a lot of strategy goes into collection, and it demands the attention of legal and IT professionals alike.

The particulars of e-discovery collections, such as the timing and methodologies, lead to a lot of confusion among practitioners. There are several misperceptions surrounding collection that we hope to clarify throughout this section; so let's get to it!

Collection is NOT Preservation

Even relatively experienced legal professionals tend to conflate preservation and collection. While collecting as a way to preserve certainly would meet the court's intention, it is a very costly and inefficient way to do so. Think of preservation in terms of ensuring potentially relevant data isn't deleted. Courts don't prescribe a particular method for preservation; they just require that it gets done. Collection, on the other hand, is the first tangible step towards producing documents to the other side. While certainly not all collected documents will ultimately get produced, the idea is that collection feeds into the review process, which in turn dictates the production set. Put simply, preservation is very process based, while collection is much more action based (and usually way more technical).


Types of ESI That Must Be Collected

Virtually every form of electronic data is up for grabs in e-discovery. And while it's one thing to identify and preserve various forms of ESI, it's often quite another to actually go out and collect it all. Different data sources have different levels of accessibility and present different collection challenges. Here is a breakdown of five common categories of ESI that might need to be collected for e-discovery:



Active

Data that you interact with on a regular basis, such as email and other traditional files that are stored on a local hard drive or network drive. This ESI tends to be fairly easy to access and collect.


Cloud / Mobile

By far the fastest growing category of ESI, this is data that is created and stored on cloud servers (e.g., cloud-based applications, cloud storage, and social media) or mobile devices, outside the scope of corporate networks or formal IT oversight. Cloud providers have differing policies and processes with respect to accessing data, and it's helpful to familiarize yourself with those details before you actually need to collect the data. Meanwhile, collecting from mobile devices will usually require sophisticated tools and potentially outside experts. To gain a better understanding of the precipitous rise of mobile device ESI, check out our infographic, "The Value of Mobile Data in E-Discovery."



Offline

Data that is no longer in active use but is stored or archived. Even though offline data can't be accessed over a shared server, collecting it usually presents fairly minimal challenges as long as you know the physical location of the data and the system on which it's stored.



Backups

Traditional backup tapes or disaster recovery systems are designed to store data in the event that it must be restored. These systems compress files and are not easily searchable or accessible, so they tend to present significant collection hurdles.



Hidden

Previously deleted or fragmented files that exist on various systems and are usually not readily visible to regular system users. These files are highly inaccessible, and attempting to recover them requires specialized tools. More on hidden files below in our section on forensic imaging.

Metadata 101

You can't address collection in e-discovery without talking metadata. You'll come across different definitions for metadata (like all things e-discovery, it seems one definitive explanation for a concept, process, or activity is never enough). Our favorite definition is that metadata is the data about the data. Let us explain…

When you look at a regular document on your computer, you see the words in the document, of course, along with the name of the file and where it's located. Behind the visible information is a whole host of other information about the document, such as when it was created, when it was last modified, and who made edits to it. In a regular business sense, this information is pretty useless. Why would you need to know that a colleague edited a document three months ago? In the land of e-discovery, this contextual information can be hugely important and has to be included when the document is ultimately collected.
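To make "data about the data" a little more concrete, here is a minimal Python sketch that reads the metadata the file system keeps for a single file. The function name `describe_file` and the field names are our own illustrative choices, not part of any collection tool. Note that this only covers file system metadata; information embedded inside a document (author, edit history, and so on) lives in the file itself and requires format-specific tools to read.

```python
from datetime import datetime, timezone
from pathlib import Path


def describe_file(path):
    """Return basic file system metadata for one file as a plain dict.

    Hypothetical helper for illustration only.
    """
    p = Path(path)
    info = p.stat()
    return {
        "name": p.name,
        "location": str(p.resolve().parent),
        "size_bytes": info.st_size,
        # Last time the file's contents were modified, reported in UTC.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Last time the file was read; some systems update this lazily
        # or not at all, so treat it with caution.
        "accessed_utc": datetime.fromtimestamp(
            info.st_atime, tz=timezone.utc
        ).isoformat(),
    }
```

Calling `describe_file("contract.docx")` on a file of your own would return a dictionary of these values. Simply opening or copying a file carelessly can change some of them, which is one reason collection methods matter.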

Forensic Image vs. Logical Copy

If you tell an IT professional that you need to collect data from a computer hard drive, chances are you are going to be presented with a forensic image of the drive (also known as a "bit by bit" or bit stream copy). At the most basic level, a forensic image is a complete copy of a drive – including the portions of the drive that aren't allocated to active files (known as unallocated space) and the unused remainder of the space allotted to each file (known as slack space). It is what would normally be considered an exact duplicate. These types of images give you both the files you'd expect to see if you were browsing a file listing, and also data from previously deleted files. Forensic imaging requires specific tools and is usually administered by an expert.

Alternatively, a logical copy is simply a copy of the contents of the directories on a disk and does not include previously deleted data or other information that a forensic image would capture. Logical copies are also much less technically intensive and can be performed by just about anyone with a little training and the right software.

So which is the best approach for e-discovery?

Most experts will tell you that in a great majority of civil matters, a logical copy will meet the court's expectation. There is certainly a place for forensic imaging, but it's usually only necessary when there is a suspicion of data tampering or in cases where previously deleted files are at the center of the controversy. You can learn much more about the difference between a forensic image and a logical copy by viewing our infographic, "E-Discovery Data Collection Considerations for Organizations."
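For readers who want to see what a logical copy amounts to in practice, here is a rough Python sketch. It walks a source directory and copies each active file into a destination folder using the standard library's `shutil.copy2`, which attempts to preserve timestamps. The function name `logical_copy` is a hypothetical of ours. This is a simplified illustration, not a defensible collection tool: it captures nothing from deleted files or unallocated space, and reading files this way may update last-accessed times on the source.

```python
import shutil
from pathlib import Path


def logical_copy(source_dir, dest_dir):
    """Copy every active file under source_dir into dest_dir.

    Hypothetical helper for illustration only. Keeps the folder
    layout and returns the list of destination paths.
    """
    source = Path(source_dir)
    dest = Path(dest_dir)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue  # directories are recreated as needed below
        target = dest / src.relative_to(source)
        target.parent.mkdir(parents=True, exist_ok=True)
        # copy2 copies contents and tries to keep timestamps intact.
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

The destination should sit outside the source folder (ideally on separate media) so the copy never picks up its own output.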

Collection Methodologies

There are a variety of ways that organizations approach the collection process. Questions that might dictate the collection methodology include:

  • How much data is involved in the legal matter?
  • How many sources of data are implicated, and how accessible are those data sources?
  • Will the collection involve any specialized tools or expertise?
  • Does the legal matter involve encrypted or sensitive data?
  • Are there internal IT resources available to perform/assist with the data collection?
  • What are the time constraints (production deadlines, retention schedules, etc.)?
  • What type(s) of collection technologies are deployed to perform the collections?
  • Is the matter civil or criminal (which bears on whether a forensic image or a logical copy is appropriate)?

Answers to these questions will help determine which collection approaches to employ, which include:


Employee Self-Collection

Probably the riskiest of all collection approaches, employee self-collection involves letting the custodians themselves copy relevant files into a shared drive or portable storage device. Most experts advise against employee self-collection, pointing out that most employees aren't technically savvy and are highly likely to make mistakes and overlook key documents. Likewise, several courts have questioned whether employee self-collection constitutes a "defensible" e-discovery response. That being said, for small matters involving low volumes of highly conventional data (email, word processing documents, etc.), employee self-collection may be reasonable and cost effective, especially when the opposing party and judge have signed off on the plan ahead of time.


IT Collection

By far the most common collection approach, IT collection involves members of the IT department performing the actual data collection at the direction of the legal department. On the surface, involving IT professionals in the collection process makes sense since they understand the data landscape and usually possess the technical skill to get everything that's needed. However, there are downsides. In organizations with limited internal IT resources, data collections can be time consuming and keep IT professionals from other business-critical projects. As mentioned above, IT professionals also tend to associate data collection with forensic imaging, and without clear guidance from the legal team on what specifically to go after, they are likely to collect very broadly, resulting in more data that has to be processed and reviewed, driving up e-discovery costs.


External Collection

For organizations with very limited IT resources, a third-party expert might be called on to perform the data collection. An outside expert is likely to have set procedures and all the necessary tools and skill to perform a collection that will withstand the highest levels of judicial scrutiny. But there can be a considerable expense associated with bringing in outside assistance, which is why most experts advise that organizations make it a priority to establish at least some level of internal collection capacity.


Remote Collection

These collections employ a centralized internal collection system that is integrated with company data sources, allowing ESI to be collected remotely. Though the collection might still be performed by an IT professional, it doesn't require any direct interaction with the data sources themselves and can usually be performed much more quickly and efficiently than with traditional methods. These systems also support more targeted collections by applying search and analytics technologies. While this type of collection technology is relatively new to the market and does require an upfront investment and extensive deployment process, many experts argue it's the most efficient and cost-effective approach for large organizations with consistent collection demands. More on this in the technology discussion below.

Developing a Collection Strategy

As opposed to the collection methodologies described above, which address the 'how' of collection, your collection strategy addresses the 'why.' Why are you collecting certain data? Why collect data one way instead of another?

Your collection strategy will change with every matter. In some cases – say, very high-stakes legal matters involving precarious data sources – it may be wise to collect data immediately. In other matters, immediate collections may not be necessary, especially if you have a strong preservation process in place. It's common for litigants to collect highly relevant data early, since they know it will need to be collected eventually, but collecting very broadly in the early days of a matter is usually not advisable, as this will typically just drive up your costs with very little associated benefit to your case.


It's also important to consider how your case strategy impacts your collection strategy. If your case is inevitably headed for an early settlement, it probably doesn't make a lot of sense to collect and process a bunch of data that ultimately won't be needed. Other considerations that should go into your collection strategy include whether or not outside experts should be involved, if there is any sensitive data that warrants greater protection measures, and whether any employees – like a person named in an incriminating lawsuit – might have incentive to alter or delete relevant data, in which case a more proactive collection might be warranted.


For a great discussion among e-discovery experts on considerations for creating an e-discovery collection strategy, watch the on-demand webcast, "Data Collection Considerations for Litigation, Criminal and Regulatory Matters."

Validating the Collection

Collection can lead to a lot of contentious disputes between parties when there is suspicion that not all relevant data was collected, or that the collection process itself altered the contents of the data. When such controversies surface, parties typically rely on a few mechanisms for proving that a collection was conducted in a defensible manner. These include:


Chain of Custody

E-Discovery think tank The Sedona Conference defines chain of custody as the "documentation regarding the possession, movement, handling and location of evidence from the time it is identified to the time it is presented in court or otherwise transferred or submitted." A thorough chain of custody log is designed to demonstrate the authenticity of a document and disprove any claims of data tampering.
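As a rough illustration of what a chain of custody log might capture, here is a small Python sketch that appends one record per handling event to a text file, one JSON object per line. The field names, the `CustodyEntry` class, and both helper functions are our own assumptions for the example; the Sedona definition above doesn't prescribe any particular format, and real logs are typically kept in purpose-built systems or signed forms.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class CustodyEntry:
    """One handling event for a piece of evidence (illustrative fields)."""
    evidence_id: str
    action: str        # e.g. "collected", "transferred", "received"
    handled_by: str
    location: str
    timestamp_utc: str


def record_custody_event(log_path, evidence_id, action, handled_by, location):
    """Append one event to the log file and return the entry written."""
    entry = CustodyEntry(
        evidence_id=evidence_id,
        action=action,
        handled_by=handled_by,
        location=location,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
    )
    # Append-only writes keep earlier entries untouched.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(entry)) + "\n")
    return entry


def read_custody_log(log_path):
    """Return all logged events, oldest first, as a list of dicts."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

The point of the sketch is the shape of the record: who handled what, what they did, where, and when, in an order that can be read back from start to finish.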



Hash Values

Commonly referred to as a "digital fingerprint," a hash value is a fixed-length code that a hashing algorithm computes from a file's contents (it is not encryption, since it can't be reversed to recover the file). The purpose of a hash value is to give each file an identifier that is, for practical purposes, unique. If even a single byte of a file changes, including any metadata embedded in the file, the file's hash value will change as well, indicating that the file is not the same as it was before. By comparing hash values before and after collection, you can easily show that a file is the same after collection as it was before. For more on hashing, read e-discovery attorney Ralph Losey's terrific blog post on the topic.
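Here is a minimal Python sketch of the before-and-after comparison using the standard library's `hashlib`. We've assumed SHA-256 as the algorithm; tools in the field may use others, such as MD5 or SHA-1. The function names are hypothetical. Keep in mind that the value is computed from the bytes inside the file, so renaming a file leaves its hash unchanged.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hash of a file as a hex string.

    Reads the file in chunks so large files don't exhaust memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original_path, collected_path):
    """True if the collected file is byte-for-byte identical to the original."""
    return sha256_of_file(original_path) == sha256_of_file(collected_path)
```

In practice, you would record the hash of each file at the moment of collection and compare it again whenever the file's integrity is questioned.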


Audit Trail

Audit trails are automated records generated by systems that track user activity. In the context of collection, they can be helpful in showing when a collection took place, what it entailed (collected data amount), and which user initiated it if such information is ever requested by a judge or adversary.

Data Processing

Rather than cover data processing in its own section of the guide, we've decided to include it with collection, since the two are so closely intertwined. Processing can be simply defined as the process of preparing collected data for attorney review. After data is collected, the resulting document set will include a rather messy mix of file types and formats, attachments, meaningless system files, and plenty of duplicates. Processing is all about cleaning up the mess and formatting the collected ESI so that it can be culled and searched by attorneys and review tools.
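To give a flavor of the cleanup involved, here is a simplified Python sketch of two common processing steps: setting aside system files and removing exact duplicates. The extension list and the function name `cull_collected_files` are illustrative assumptions on our part. Commercial processing tools rely on far more thorough methods, such as comparing files against published lists of known system-file hashes, and they also handle email threads, attachments, and text extraction.

```python
import hashlib
from pathlib import Path

# Illustrative only: a real tool would use a much more reliable test.
SYSTEM_EXTENSIONS = {".dll", ".exe", ".sys", ".tmp"}


def cull_collected_files(collected_dir):
    """Split collected files into kept, duplicate, and system-file lists.

    Two files count as duplicates when their contents hash to the
    same SHA-256 value; the first one encountered is kept.
    """
    seen_hashes = set()
    kept, duplicates, system_files = [], [], []
    for path in sorted(Path(collected_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in SYSTEM_EXTENSIONS:
            system_files.append(path)
            continue
        # Reads the whole file at once, which is fine for a small example.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            duplicates.append(path)
        else:
            seen_hashes.add(digest)
            kept.append(path)
    return kept, duplicates, system_files
```

Only the `kept` list would move on to review; the other two lists are retained so you can account for every file that was collected.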

We won't get too far into the weeds on data processing, since it's a highly technical process that includes a lot of concepts and jargon that the average e-discovery practitioner doesn't need to know. What is worth discussing, however, is who actually does the processing. Traditionally, most organizations outsourced data processing to third-party vendors who would use specialized technologies to winnow data sets down and deliver them back to clients for next steps. Today, many companies still outsource processing, but a growing number of companies have deployed processing software in-house. There is also an emerging class of collection technologies that essentially consolidates collection and processing into one step. More on these tools in the technology section below.

Data Collection Best Practices

Data collection is a dynamic and multi-faceted process that relies on sound e-discovery strategy, as well as solid technical resources and expertise. There are important best practices that fall under each of the various elements of the collection process, but here are four big ones you should know:

Don't over-collect

We know that it's easy to identify a relevant custodian and copy his or her entire hard drive or email folders. But easy doesn't equate to smart. More data collected means more data processed and ultimately reviewed. And that all adds up to more MONEY spent on e-discovery. Instead, develop strong preservation and early case assessment processes, and target your collections so that you are only collecting the potentially relevant ESI, nothing more and nothing less. Learn more about avoiding over-collection by reading Exterro's white paper, "Eliminating E-Discovery Over-Collection."

Be proactive

It's inevitable that some matters are going to present unique collection challenges. Maybe it's a case involving mobile data or one involving highly unorganized data on legacy systems. Whatever the case may be, do yourself a favor and recognize these challenges early on rather than right at the point where data needs to be collected. It's always better – and much cheaper – to assess your needs proactively to determine if outside resources will be needed and, if so, which vendors. Even if outside help isn't needed, it's important to give your internal IT team a heads up that a potentially big project may be coming their way soon, so they can plan accordingly.

Tier your collections

This relates to our point above about over-collection. It's always best to think of collection in terms of phases or tiers, rather than to try to do everything all at once. A tiered collection strategy involves prioritizing data so that only the most highly relevant data is collected immediately and less relevant data is collected only when absolutely needed. Just remember that the only way to execute a defensible tiered collection strategy is to have a very strong preservation process that gives you the ability to not collect everything immediately.

Create an ESI repository

Matters overlap. You may have dozens of lawsuits that revolve around the same or similar issues, people, and data. Rather than create multiple copies of the same data, create a centralized repository of collected ESI and develop a systematic process for reusing data across matters.

Data Collection Tools

Data collection is not a one size fits all endeavor, nor is it a process that is supported by a single technology. There are a variety of tools that you can deploy depending on your specific collection needs and priorities. Here are some specific systems and capabilities you may want to consider:

In-Place Processing:

As mentioned earlier, processing has traditionally taken place post-collection as a separate e-discovery process typically handled by service providers, who charge on a per-gigabyte basis. However, there are new search and collection technologies that process data at the point of collection, eliminating the need to send collected data to a third-party vendor.

Pre-Collection Analytics:

It may not specifically be a collection technology, but pre-collection analytics have a huge influence on the collection process. These are tools that crawl data sources and deliver basic insights like document volumes, and can also perform more advanced searching and filtering to really home in on relevant content. They equip you with the necessary intelligence and visibility to target collections and focus efforts on just the relevant content.
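The most basic of these insights, document volumes, can be sketched in a few lines of Python. The example below crawls a folder and reports how many files of each type it holds and how much space they take up, without copying anything. The function name `summarize_source` is hypothetical, and a real analytics tool would connect to email servers and other enterprise sources rather than a simple folder.

```python
from collections import Counter
from pathlib import Path


def summarize_source(root):
    """Count files and total bytes per file extension under root.

    Returns a dict such as {".docx": {"files": 3, "bytes": 51200}}.
    Files with no extension are grouped under "(none)".
    """
    file_counts = Counter()
    byte_counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            file_counts[ext] += 1
            byte_counts[ext] += path.stat().st_size
    return {
        ext: {"files": file_counts[ext], "bytes": byte_counts[ext]}
        for ext in file_counts
    }
```

Even a summary this simple helps you estimate how much data a custodian holds before deciding how broadly to collect.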

Data Source Integrations:

Integrating your collection software with your enterprise data sources (email servers, SharePoint servers, structured databases, etc.) can greatly streamline the collection process by eliminating the need for IT to conduct manual collections. Integrations allow for collection to be handled remotely, and they also minimize a lot of the technical complexities surrounding collections, allowing non-IT professionals to be more involved in the process.

Spot Collectors:

Even when you have an integrated collection environment and can collect over the network, there are still instances when you may need to grab data off a system that isn't connected to the network, such as a field computer that is used by an employee who works remotely. Spot collector tools are portable USB devices that allow IT professionals or custodians to crawl and collect off non-network systems. The major benefit that these tools offer is that they can be pre-configured to collect only relevant files rather than complete copies of a computer's hard drive.

Mobile Collection Tools:

We discussed the challenges of collecting mobile data earlier. Ideally, any data on a mobile device will be located somewhere else that is a little more accessible, such as an email server. But workers in some industries create content that never leaves their phones – like text messages – that may need to be collected. Fortunately, there are specific devices that are designed to extract data from mobile devices and reformat it for the purposes of attorney review and legal production.

Next Section

Up next in the Beginner's Guide to E-Discovery, we go from data collection to document review, the most costly of all e-discovery phases.
