The Nanite project builds on DROID and Apache Tika to provide a rich format identification and characterization system. It aims to make it easier to run identification and characterization at scale, and helps compare and combine the results of different tools.
- nanite-core contains the core identification code, a wrapped version of DROID that can parse InputStreams.
- nanite-hadoop allows nanite-core identifiers to be run on web archives via Map-Reduce on Apache Hadoop. It depends on the (W)ARC Record Readers from the WAP codebase. It can also use Apache Tika and libmagic for identification. Files can be characterized using Tika and output in a format suitable for importing into C3PO.
Nanite has been used at scale; see this blog post.
Version 1.3.1-90 of nanite-core introduced a new API that exposes the PUID-level data. Previously, only the extended MIME type was accessible.
You can use the Nanite API like so:
// Create the detector. This assumes the no-argument constructor, which
// uses the signature files embedded in the release:
DroidDetector dd = new DroidDetector();
// Can use a File or an InputStream:
File inFile = new File("src/test/resources/lorem-ipsum.doc");
// If you use the InputStream, you need to add the resource name if you
// want extension-based identification to work:
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.toURI().toString());
// To get the identification as an extended MIME type:
MediaType mt = dd.detect(inFile);
// Or, from a stream (closed automatically when the block exits):
try (InputStream in = new FileInputStream(inFile)) {
    mt = dd.detect(in, metadata);
}
// Giving:
// MIME Type: application/msword; version=97-2003
System.out.println("MIME Type: " + mt);
// Or, get the raw DROID results
List<IdentificationResult> lir = dd.detectPUIDs(inFile);
for (IdentificationResult ir : lir) {
    System.out.println("PUID: " + ir.getPuid() + " '" + ir.getName()
            + "' " + ir.getVersion() + " (" + ir.getMimeType()
            + ") via " + ir.getMethod() + " identification.");
    // PUID: fmt/40 'Microsoft Word Document' 97-2003
    // (application/msword) via Container identification.
    // Which you can then turn into an extended MIME type if required:
    System.out.println("Extended MIME:"
            + DroidDetector.getMimeTypeFromResult(ir));
    // Extended MIME:application/msword; version=97-2003
}
The DroidDetector is not thread-safe, so multithreaded processes should create a separate DroidDetector instance for each thread.
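One common way to arrange one instance per thread is a `ThreadLocal`. The following is a minimal sketch of that pattern only: `StubDetector` is a hypothetical stand-in for a non-thread-safe detector so that the example runs without Nanite on the classpath, and `PerThreadDetectors` and `distinctInstancesUsed` are illustrative names, not part of the Nanite API.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PerThreadDetectors {

    // Hypothetical stand-in for a detector that must not be shared between threads.
    static final class StubDetector {
        private final StringBuilder scratch = new StringBuilder(); // unsynchronized state

        String detect(String name) {
            scratch.setLength(0);
            scratch.append(name.endsWith(".doc") ? "application/msword"
                                                 : "application/octet-stream");
            return scratch.toString();
        }
    }

    // Each thread lazily creates its own instance on first use, then reuses it.
    static final ThreadLocal<StubDetector> DETECTOR =
            ThreadLocal.withInitial(StubDetector::new);

    // Runs `tasks` detections on a pool of `threads` workers and returns how
    // many distinct detector instances were used across all workers.
    static int distinctInstancesUsed(int threads, int tasks) {
        Set<StubDetector> seen = Collections.synchronizedSet(
                Collections.newSetFromMap(new IdentityHashMap<StubDetector, Boolean>()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final String name = "file-" + i + ".doc";
            pool.submit(() -> {
                StubDetector detector = DETECTOR.get(); // this thread's own instance
                detector.detect(name);
                seen.add(detector);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("Distinct detector instances: " + distinctInstancesUsed(4, 100));
    }
}
```

The number of instances never exceeds the number of worker threads, because no instance is ever handed from one thread to another.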
Nanite deliberately embeds a copy of the latest PRONOM signature files available at the time of release, with the -XX part of the version number tracking the PRONOM release number. For example, 1.3.1-90 includes PRONOM signature file version 90 and the corresponding container signatures.
Nanite does not support auto-updating the signature files, but if you wish, you can download them and pass them to the DroidDetector via the DroidDetector(File fileSignaturesFile, File containerSignaturesFile) constructor.
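If several downloaded signature files sit in one folder, you may want to pick the newest pair before calling that constructor. The sketch below works on plain file names so that it runs without Nanite. It assumes the files follow the naming patterns `DROID_SignatureFile_V<n>.xml` and `container-signature-<yyyymmdd>.xml`; check these against the files you actually downloaded. `SignatureFilePicker` and the file names in `main` are illustrative only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignatureFilePicker {

    // Assumed naming conventions for downloaded signature files.
    private static final Pattern BINARY =
            Pattern.compile("DROID_SignatureFile_V(\\d+)\\.xml");
    private static final Pattern CONTAINER =
            Pattern.compile("container-signature-(\\d{8})\\.xml");

    // Returns the matching name with the largest captured number, if any name matches.
    static Optional<String> newest(List<String> names, Pattern pattern) {
        String best = null;
        long bestNumber = -1;
        for (String name : names) {
            Matcher m = pattern.matcher(name);
            if (m.matches()) {
                long number = Long.parseLong(m.group(1));
                if (number > bestNumber) {
                    bestNumber = number;
                    best = name;
                }
            }
        }
        return Optional.ofNullable(best);
    }

    static Optional<String> newestBinary(List<String> names) {
        return newest(names, BINARY);
    }

    static Optional<String> newestContainer(List<String> names) {
        return newest(names, CONTAINER);
    }

    public static void main(String[] args) {
        // Example file names only, not a statement about real PRONOM releases.
        List<String> names = Arrays.asList(
                "DROID_SignatureFile_V82.xml",
                "DROID_SignatureFile_V90.xml",
                "container-signature-20160629.xml",
                "container-signature-20170330.xml");
        System.out.println("Binary signatures:    " + newestBinary(names).orElse("none"));
        System.out.println("Container signatures: " + newestContainer(names).orElse("none"));
        // With Nanite on the classpath, the two chosen files would then be passed to
        // new DroidDetector(fileSignaturesFile, containerSignaturesFile).
    }
}
```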
Version numbers take the form x.x.x-yy. Changes to the yy part refer to updates to the PRONOM signature files, whereas changes to the x.x.x part refer to changes to the code that uses them. Only the latter are recorded here:
- 1.4.1
  - Updates to how temporary files are handled, attempting to ensure large sets of temporary files are not left in place unnecessarily.
- 1.4.0
  - Significant update to the implementation to take advantage of improvements in DROID 6.5. DROID's improved API means less code is required to run it in Nanite.
- 1.3.1
  - Revert to not falling back on extension-based identification by default, as enabling this is a breaking API change.
- 1.3.0
  - As of this release, the DROID code for guessing based on file extension is also included by default, if binary signature detection fails. New parameters on the DroidDetector allow this to be controlled.
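The x.x.x-yy version scheme can be split mechanically, for example to check which PRONOM release a given artifact embeds. The following is a small sketch; `NaniteVersion` is a hypothetical helper written for this illustration and is not part of Nanite.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nanite: splits "x.x.x-yy" into its two parts.
final class NaniteVersion {

    private static final Pattern FORMAT = Pattern.compile("(\\d+\\.\\d+\\.\\d+)-(\\d+)");

    final String codeVersion;  // the x.x.x part: changes to the code
    final int pronomRelease;   // the yy part: PRONOM signature file release

    private NaniteVersion(String codeVersion, int pronomRelease) {
        this.codeVersion = codeVersion;
        this.pronomRelease = pronomRelease;
    }

    static NaniteVersion parse(String version) {
        Matcher m = FORMAT.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("Expected x.x.x-yy, got: " + version);
        }
        return new NaniteVersion(m.group(1), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        NaniteVersion v = parse("1.3.1-90");
        System.out.println("Code version:   " + v.codeVersion);
        System.out.println("PRONOM release: " + v.pronomRelease);
    }
}
```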
This work was partially supported by the SCAPE project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).