Tuesday, July 26, 2016

Extract document #metadata – #Tika and #exiftool


Metadata is critical to any investigation. So much can be gleaned from reviewing the metadata in pictures and documents that it has become a big topic in the news; look at the DNC hack last month. But for those of us in digital forensics and information security, metadata has always been critical to our investigations.

If metadata is a new/confusing term for you then go read about it: http://ift.tt/12SKG6a

Some investigators who use commercial products trust the output of the tools they paid to license without validating the results with a secondary tool or reading the ‘release notes’ for caveats. Always read the release notes! It’s best practice to test your tools to ensure you are getting not only accurate results but also as many results as possible!


I have been reminded once again that commercial tools can miss document metadata. Sometimes it’s because you are running an old version, other times because you did not select the proper check box when processing, and other times the tool just does not support the document format. Support for some obscure document types is spotty even in the most popular commercial software.

In this case the software determines file type from the file extension on a standard pass. Unless you ran an extra processing option that makes it use the magic header/document signature instead, the forensic tool would not parse the metadata completely.
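To illustrate what checking the magic header means, here is a minimal sketch that classifies a file by its leading bytes rather than its extension. The signature list is a small hand-picked sample, and the names `SIGNATURES`, `sniff_type` and `sniff_file` are my own for illustration; this is not how any particular forensic product does it.

```python
# A few well-known leading byte signatures ("magic numbers").
# This list is illustrative only; real tools know hundreds of formats.
SIGNATURES = [
    (b"PK\x03\x04", "zip"),                        # also .docx/.xlsx/.pptx containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc/.xls/.ppt
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff_type(header: bytes) -> str:
    """Return a coarse type label based on the leading bytes only."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

A spreadsheet renamed to `report.txt` would still come back as `zip` here, which is the hint that it deserves a closer look.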

In this case a colleague was reviewing an Office .xlsx document that showed only ‘content created’ and ‘source modified’ as the document properties. This looked fishy to me, so I suggested that more data could be found by inspecting the file manually: in this case, renaming the .xlsx to .zip (an .xlsx file is a ZIP container) and unzipping it to read the .xml files inside.

Sure enough, multiple fields including ‘author’, ‘last modified by’, and more were present! In this instance file properties under Windows only showed the ‘content created’ and ‘date last saved’. The same exercise could have been conducted using Tika or ExifTool. So, if I hadn’t suggested digging deeper, the metadata might have been missed…
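The manual check can also be scripted. Here is a minimal sketch, assuming the file is a normal Office Open XML package that keeps its core properties in `docProps/core.xml`. The function name `read_core_properties` and the choice of four fields are mine for illustration, and a package without that part will raise a `KeyError`.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Friendly name -> element holding the value (an illustrative subset).
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(source) -> dict:
    """Return selected core properties from an OOXML file (path or file object)."""
    # No renaming needed: zipfile opens the container whatever the extension.
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    props = {}
    for name, tag in FIELDS.items():
        node = root.find(tag, NS)
        props[name] = node.text if node is not None else None
    return props
```

Calling `read_core_properties("suspect.xlsx")` (a hypothetical file name) returns a dictionary with `None` for any field the document does not carry. The application-level properties live in a separate part, `docProps/app.xml`, which this sketch does not read.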

Metadata tools

Below I’ll list some tools I recommend for validating metadata results from documents and images. I’m also showing a way to run Apache Tika, which reads metadata from files, under Windows without having to use Java! The reason this is fun is that Tika reads a TON of file formats but is written in Java, and I don’t like to install Java unless required. This approach lets you use it on any machine running .NET.

Free tools that get comprehensive metadata are Apache Tika, ExifTool, and MetaDiver. Sorry for the shameless plug of my own metadata extraction tool, ‘MetaDiver’, which uses Tika heavily for metadata extraction.

Tika in Windows

You can run the latest Tika on Windows to inspect files individually.

To use Tika on Windows you will need to do a few things.

  • Download the Tika jar file here.
  • Download iKVM here.

From a command prompt in the ikvm directory, use the syntax “ikvm.exe -jar tika.jar”. It’s that simple.
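If you want to script that call, here is a minimal sketch that asks Tika for metadata as JSON. It assumes `ikvm.exe` and a jar named `tika.jar` are reachable from the working directory, that IKVM passes the remaining arguments through to the jar the way `java -jar` does, and that your Tika app build accepts the `--json` option. The helper names are hypothetical and I have not run this exact pairing.

```python
import json
import subprocess


def tika_command(target, ikvm="ikvm.exe", jar="tika.jar"):
    """Build the argument list for a Tika metadata run under IKVM."""
    # --json asks the Tika app to print the file's metadata as JSON.
    return [ikvm, "-jar", jar, "--json", target]


def tika_metadata(target, **kwargs):
    """Run Tika on one file and return the parsed metadata dictionary."""
    result = subprocess.run(
        tika_command(target, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise if Tika exits with an error
    )
    return json.loads(result.stdout)
```

Building the command as a list rather than a single string avoids quoting problems with evidence paths that contain spaces.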

To use ExifTool, just download it and run it from a command prompt.
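ExifTool can also be driven from a script. Here is a minimal sketch, assuming `exiftool` is on the PATH; it uses the `-json` output option and `-G1` to prefix each tag with its group name. The function names are my own for illustration.

```python
import json
import subprocess


def parse_exiftool_json(output: str) -> dict:
    """Map each SourceFile to its tag dictionary from `exiftool -json` output."""
    # exiftool -json prints a JSON array with one object per input file.
    return {entry.get("SourceFile"): entry for entry in json.loads(output)}


def exiftool_metadata(*paths, exe="exiftool"):
    """Run ExifTool on one or more files and return the parsed tags."""
    cmd = [exe, "-json", "-G1", *paths]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_exiftool_json(result.stdout)
```

Keeping the group prefix makes it easier to compare results against a second tool, since the same field name can appear in more than one metadata block.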


ExifTool with Tika

According to the Tika documentation, you can wrap Tika around ExifTool to extract even more information. Tika already supports an insane number of document formats.


I haven’t tested this configuration.


ExifTool and Tika are both free, well maintained, cross-platform and regularly updated with the latest changes to document formats. File structures change as software changes, so having the latest version, and testing it, is really important when it comes to metadata to ensure you can read everything from the file you are inspecting!

Beware of how the different tools handle dates with regard to time zones, UTC vs. local time, and daylight saving time.
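To show why this matters, here is a minimal sketch that renders one UTC timestamp at several fixed offsets. The timestamp is a made-up example in the `...Z` style that Office documents use for their created/modified fields, and the fixed offsets stand in for real time zones; a proper conversion would use a time zone database so that daylight saving rules are applied for the date in question.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2016-07-26T14:30:00Z."""
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def render(moment: datetime, offset_hours: int) -> str:
    """Show the same instant as wall-clock time at a fixed UTC offset."""
    zone = timezone(timedelta(hours=offset_hours))
    return moment.astimezone(zone).strftime("%Y-%m-%d %H:%M %z")


saved = parse_utc("2016-07-26T14:30:00Z")  # hypothetical example value
print(render(saved, 0))   # as stored, in UTC
print(render(saved, -4))  # an examiner's machine at UTC-4
print(render(saved, -5))  # the same machine once the offset is UTC-5
```

All three lines describe the same instant, so when two tools disagree by an hour or more, check which offset each one applied before concluding that the timestamps differ.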

I hope you find this post informative and actionable.

by Dave via EasyMetaData