Digital Forensics

The Importance of Understanding Multimedia Provenance

By Justin Tolman | May 30, 2024

In today's digital landscape, understanding the provenance of multimedia files is crucial for law enforcement, federal agencies, and corporations to ensure authenticity and trust in digital content. The manipulation of multimedia data has a long history, dating back to the 19th century. For instance, in 1869, William Mumler was accused of fraud for producing "spirit photographs" by double-exposing images to create ghostly apparitions. Over the last 150 years, the tools have become more sophisticated and more accessible. With that accessibility, bad actors are able to manipulate politics and business and to destroy reputations. In a recent podcast with Bertram Lyons, CEO of Medex Forensics, Lyons drove home the importance of understanding multimedia provenance data.

What Does Real Mean for Media Files?

Synthetic (or AI-generated) media has quickly become almost impossible to differentiate from media that originated from a camera. Even the photographs we take with our phone cameras today are automatically and heavily modified, from color to lighting and sometimes even the content. Understanding the relationship between "original," "real," and "authentic" is crucial in the context of digital media provenance analysis. In our podcast, Lyons laid out some useful definitions for terms that we often use as synonyms, even though they are not:

Original refers to the source of creation. This may be a device or it could be software (including AI generators). In other words, did the file originate on this device or from this software?

Real pertains to the interpretation of the content and its factual accuracy. This is often subjective and relies heavily on context.

For instance, if a video is recorded on a phone, the file is original to that phone as long as it hasn't been altered since its creation. However, this does not imply that the content within the file is "real." A video could depict a staged altercation, making the content fictitious even though the file is original. 

Authentic is about verifying the claims made about a digital file. Authentication involves proving or disproving these claims, such as verifying whether a video was created using a specific device or software. This process does not directly address whether the content is real or fake; it simply validates the truthfulness of the claim regarding the file's origin or creation process. 

In essence, while "original" describes the unaltered state of a file, "real" questions the factual integrity of the content, and "authentic" focuses on verifying specific claims made about the file. In the example above, showing that the fight in the video was real will most likely require investigation outside of the video file itself.

The saying “pics or it didn’t happen” has long been the standard for “proving” an eyewitness story. But with the rise and easy availability of synthetic media creation tools, investigators may need to “exit the digital” and return to traditional investigative work to prove what media is authentic, and what is real.

Understanding Media File Formats and Internal Structures

The effective analysis of digital multimedia files requires an understanding of file formats and their internal structures. Each file format, whether an MP4, JPEG, or another type, may contain a unique set of internal structures that each store different data points. These data points can include metadata, timestamps, and encoding details, which are critical for forensic analysis. Without this understanding, it's impossible to accurately determine the file's origin, modifications, and overall authenticity.
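To make "internal structures" concrete, here is a minimal sketch of how the top-level layout of an MP4 can be listed. It assumes the standard ISO base media file format header (a 4-byte big-endian size followed by a 4-byte type). The function name `iter_top_level_boxes`, the helper `make_box`, and the sample bytes are illustrative assumptions, not part of any forensic product.

```python
import struct
from typing import Iterator, Tuple


def iter_top_level_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, offset, size) for each top-level box in an MP4-style file."""
    offset = 0
    while offset + 8 <= len(data):
        # Every box starts with a 32-bit big-endian size and a 4-character type.
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header_len = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            size = len(data) - offset
        if size < header_len:
            break  # malformed header; stop rather than loop forever
        yield box_type.decode("latin-1"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes = b"") -> bytes:
    """Build a synthetic box for demonstration (hypothetical helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic sample: a file-type box, a metadata container, and media data.
sample = (
    make_box(b"ftyp", b"isom\x00\x00\x02\x00isommp41")
    + make_box(b"moov", b"\x00" * 4)
    + make_box(b"mdat", b"\x00" * 16)
)

for box_type, offset, size in iter_top_level_boxes(sample):
    print(f"{box_type} at offset {offset}, {size} bytes")
```

The order and sizes of these boxes are one kind of structural data point an examiner might compare against files known to come from a particular device or application.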

Projects like the Coalition for Content Provenance and Authenticity (C2PA) are creating ecosystems for assigning provenance data to digital content. These initiatives aim to embed cryptographic metadata within multimedia files to trace their origin and verify their authenticity. This embedded data helps build a chain of trust, ensuring that content creators, editors, and publishers maintain the integrity of digital files throughout their lifecycle. By fostering collaboration among major technology companies and content creators, the C2PA is working towards a standardized approach to content authenticity, which is essential in combating misinformation and enhancing digital trust.

The challenge for schemas like C2PA is widespread adoption. To illustrate this, we can look at an example we are more familiar with: Exif. Exif data can be stored within JPEGs (and other files), and may store information related to the camera and its settings, location information, date(s), and much more. However, if that JPEG image is converted to a PNG using any of the many free tools available, this information is typically lost. Possibly more important, if these images are uploaded to any of the largest social media platforms, this data is not included in the re-encoded files the platforms create for publication.
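The loss of Exif data during re-encoding can be checked at the byte level. The sketch below assumes the usual JPEG layout, in which Exif is stored in an APP1 segment (marker 0xFFE1) whose payload begins with `Exif\0\0`. The names `list_jpeg_segments`, `has_exif`, and `segment`, as well as the sample bytes, are assumptions made for illustration. The samples are synthetic headers, not decodable images.

```python
import struct
from typing import List, Tuple


def list_jpeg_segments(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (marker, payload) pairs for the segments before the image scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost sync with the marker structure
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        # The 2-byte length counts itself but not the marker bytes.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        if length < 2:
            break  # malformed length
        segments.append((marker, data[pos + 4 : pos + 2 + length]))
        pos += 2 + length
    return segments


def has_exif(data: bytes) -> bool:
    """True if any APP1 segment carries the Exif identifier."""
    return any(
        marker == 0xE1 and payload.startswith(b"Exif\x00\x00")
        for marker, payload in list_jpeg_segments(data)
    )


def segment(marker: int, payload: bytes) -> bytes:
    """Build a synthetic JPEG segment (hypothetical helper)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload


SOI, EOI = b"\xff\xd8", b"\xff\xd9"
jfif = segment(0xE0, b"JFIF\x00\x01\x02\x00\x00\x01\x00\x01\x00\x00")
exif = segment(0xE1, b"Exif\x00\x00" + b"II*\x00\x08\x00\x00\x00")

camera_original = SOI + jfif + exif + EOI  # stands in for a file from a camera
reencoded = SOI + jfif + EOI               # stands in for a re-encoded copy

print(has_exif(camera_original), has_exif(reencoded))
```

A missing Exif segment does not by itself show tampering. It only shows that whatever wrote the file did not carry the data forward.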

For provenance data schemas such as C2PA to be successful, there must be wide industry support. Lyons highlighted the significant role of social media companies in the realm of digital content provenance and authenticity. These platforms must take responsibility for preserving metadata and provenance information within the files they handle. Lyons stated, "Unfortunately, social media platforms never picked up on any of the previous efforts such as Exif or XMP. Social media platforms extract content from uploaded files, create new files with the extracted content and don't bring data forward into the new files from the original files. And that's been a huge challenge for forensics specifically, but also for everybody."

AI and the Ease of Disinformation

The rise of artificial intelligence (AI) has significantly impacted the landscape of disinformation campaigns and the handling of victim data. AI-generated content, including deepfakes, poses a substantial threat by creating realistic yet false multimedia files. This technology can be used maliciously to spread false information, manipulate public opinion, and exploit individuals. 

Forensic tools that analyze the internal structure and metadata of digital files are crucial in detecting AI-generated content and preventing its misuse. By focusing on how a file was created, rather than who created it, these tools help balance the need for privacy with effective content moderation.

In early 2024, OpenAI announced Sora, a text-to-video model that could create hyper-realistic video. The day after Sora was announced, Lyons wrote a blog post illustrating some of the work that can and should be done when multimedia provenance needs to be determined. Summed up, Lyons’ examination of the Sora videos showed editing using Adobe and Apple products across both Windows and Apple operating systems. Lyons even found empty audio data with embedded XMP documents. The findings briefly described in his blog post set a pattern for what investigators should look for when determining a file's authenticity.
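As a rough illustration of how embedded XMP documents can be located, the sketch below scans raw bytes for XMP packets and pulls out the `xmp:CreatorTool` value when it is written as an element. This is not Lyons' method or any vendor's implementation. The function names and the sample bytes (including the tool name "Example Editor 1.0") are hypothetical. It does not handle the attribute form of the property.

```python
import re
from typing import List

# XMP is serialized as XML wrapped in an <x:xmpmeta> element.
XMP_PATTERN = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)
# Only the element form of xmp:CreatorTool is matched here.
TOOL_PATTERN = re.compile(rb"<xmp:CreatorTool>(.*?)</xmp:CreatorTool>", re.DOTALL)


def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return every XMP packet found anywhere in the raw file bytes."""
    return XMP_PATTERN.findall(data)


def creator_tools(data: bytes) -> List[str]:
    """Return the software names recorded in xmp:CreatorTool elements."""
    tools = []
    for packet in find_xmp_packets(data):
        for match in TOOL_PATTERN.finditer(packet):
            tools.append(match.group(1).decode("utf-8", errors="replace").strip())
    return tools


# Synthetic sample: binary padding around a single made-up XMP packet.
sample = (
    b"\x00\x00\x00\x18ftypisom"
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b"<rdf:Description><xmp:CreatorTool>Example Editor 1.0</xmp:CreatorTool>"
    b"</rdf:Description></x:xmpmeta>"
    b"\x00\x00\x00\x08mdat"
)

print(creator_tools(sample))
```

A byte scan like this finds packets wherever they sit in the file, including unexpected places, though a real examination would also record where in the file structure each packet was found.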

On the risks of disinformation, I think this quote sums up the danger quite well: “While the proliferation of deepfakes risks false information being accepted as true, the larger threat is that the possibility of AI manipulation is making it easier for the public or those in power to dismiss real content as fake. If this general mistrust creeps into the courtroom, it may poison the well of all audiovisual digital evidence.” (“Deepfakes in the Dock: Preparing International Justice for Generative AI”, Raquel Llorente)

To make sure we are on the same page about the difference between disinformation and misinformation, here are the definitions used in this article.

Disinformation is false information being created and spread with intent to deceive or mislead.

Misinformation refers to false or inaccurate information that is spread without malicious intent. 

People create synthetic media by entering specific text prompts into an AI system. While it is possible that the generated content is not exactly what the user intended, the user still makes the choice to distribute that content to a wider audience. If the content could cause harm, choosing to distribute it is an act of disinformation. While the focus of this blog article is multimedia, AI can just as easily generate text-based content from prompts or other inputs, and the same definitions apply to AI-generated text.

The Legal Landscape of Synthetic Media

In a law enforcement investigation, there ultimately needs to be a charge or indictment. This can be challenging when the evidence is synthetic media, and it depends heavily on the jurisdiction in which the alleged crime occurred. In the United States, the law differs from state to state. For CSAM investigations, many (but not all) states have language around “any media that depicts” acts involving a minor. So an investigator may make the argument that the media depicts a minor, and thus charge. In jurisdictions where the media must depict an actual victim, the situation is much more challenging.

However, the key factors in determining the legality of manipulated content are harm and intent. Content that causes harm, whether through disinformation, fraud, or exploitation, is more likely to be deemed illegal. Understanding the intent behind content creation and the potential harm it can cause is crucial for navigating the legal implications of manipulated multimedia files.

This is not just an issue for law enforcement. Corporations need to be aware of the risk and equipped to validate the provenance of multimedia files. Lyons addressed the risks of synthetic media, particularly as it relates to insurance fraud. He stated, "Insurance and banking corporations have a challenge with being worried about fraud when it comes to claims. If someone is claiming something happened, how does that claim get generated? Usually a video or photograph is sent in. ‘This happened to my car or my roof fell off my house’. Now it's easy to fake that, right?"

Conclusion

With AI software becoming more accessible, and often included in many social media applications for free, synthetic media is going to become an “every case” occurrence. So what should investigators do?

Prepare, don't panic! Sam Gregory in his TED talk, “When AI can fake reality, who can you trust?” laid out three steps for being prepared for synthetic media: 

  1. Obtain software and other technologies that can aid in the detection of synthetic media and in the authentication of digital evidence. Get training to have the skills necessary to recognize and understand the appropriate metadata relating to multimedia provenance. This may include (but is not limited to) Exif, XMP, and C2PA schemas.
  2. Understand the various projects around content provenance. In step one, we are looking at the technology and the media itself. But it is important to understand what the current trends are for validating these technologies. Examples may include new schemas or watermarking. Understanding the direction of the industry will guide your training. 
  3. This last step may be a bit in the future, but it is important to know the pipeline of responsibility. As industry leaders adopt and maintain data provenance information, that information will be available for analysis at various locations across the pipeline. 

In an era where digital content is ubiquitous and easily manipulated, understanding multimedia provenance data is more important than ever. As the legal landscape continues to evolve, staying informed about the latest developments in multimedia provenance is essential for ensuring that investigators are able to determine the authenticity and integrity of digital content.

Digital forensic software like Exterro FTK Forensic Toolkit gives investigators advanced multimedia analysis tools that can help determine the nature of video evidence.

Justin Tolman has been working in digital forensics for 12 years. He has a bachelor’s degree in Computer Information Technology from BYU-Idaho and a master’s degree in Cyber Forensics from Purdue University. After graduating he worked as a Computer Forensic Specialist with the Ohio Bureau of Criminal Investigation and currently works as the Forensic Subject Matter Expert and Evangelist at Exterro. Justin has written training manuals on computer and mobile device forensics, as well as (his personal favorite) SQLite database analysis. He frequently presents at conferences, on webinars, produces YouTube content, and hosts the FTK Over the Air podcast. 
