Nanite - a friendly swarm of format-identifying robots

The Nanite project builds on DROID and Apache Tika to provide a rich format identification and characterization system. It aims to make it easier to run identification and characterization at scale, and to help compare and combine the results of different tools.

nanite-core contains the core identification code, a wrapped version of DROID that can parse InputStreams.
nanite-hadoop allows nanite-core identifiers to be run on web archives via Map-Reduce on Apache Hadoop. It depends on the (W)ARC Record Readers from the WAP codebase. It can also use Apache Tika and libmagic for identification. Files can be characterized using Tika and output in a format suitable for importing into C3PO.

Nanite has been used at scale; see this blog post.

Using the Nanite API

Version 1.3.1-90 of nanite-core introduced a new API that exposes the PUID-level data. Previously, only the extended MIME type was accessible.

You can use the Nanite API like so:

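		// Create the detector. This assumes the no-argument constructor,
		// which uses the signature files embedded in the release:
		DroidDetector dd = new DroidDetector();

		// Can use a File or an InputStream:
		File inFile = new File("src/test/resources/lorem-ipsum.doc");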

		// If you use the InputStream, you need to add the resource name if you
		// want extension-based identification to work:
		Metadata metadata = new Metadata();
		metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.toURI().toString());

		// To get the identification as an extended MIME type:
		MediaType mt = dd.detect(inFile);
		// Or:
		mt = dd.detect(new FileInputStream(inFile), metadata);
		// Giving:
		// MIME Type: application/msword; version=97-2003
		System.out.println("MIME Type: " + mt);

		// Or, get the raw DROID results
		List<IdentificationResult> lir = dd.detectPUIDs(inFile);
		for (IdentificationResult ir : lir) {

			System.out.println("PUID: " + ir.getPuid() + " '" + ir.getName()
					+ "' " + ir.getVersion() + " (" + ir.getMimeType()
					+ ") via " + ir.getMethod() + " identification.");
			// PUID: fmt/40 'Microsoft Word Document' 97-2003
			// (application/msword) via Container identification.

			// Which you can then turn into an extended MIME type if required:
			System.out.println("Extended MIME:"
					+ DroidDetector.getMimeTypeFromResult(ir));
			// Extended MIME:application/msword; version=97-2003
		}

The DroidDetector is not thread-safe, so multithreaded processes should create a separate DroidDetector instance for each thread.
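
One common way to give each thread its own instance is a ThreadLocal. The following is a minimal sketch of that pattern, not part of Nanite itself: StubDetector is a hypothetical stand-in so that the example runs without the Nanite dependency, and in real code the ThreadLocal would construct a DroidDetector instead.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class PerThreadDetector {

    /** Hypothetical stand-in for a detector that is not thread-safe. */
    static final class StubDetector {
        String detect(String name) {
            return name.endsWith(".doc") ? "application/msword"
                    : "application/octet-stream";
        }
    }

    // Each thread lazily creates its own instance on first use, then reuses it.
    static final ThreadLocal<StubDetector> DETECTOR =
            ThreadLocal.withInitial(StubDetector::new);

    /** Runs the tasks on a pool and returns the distinct detector instances used. */
    static Set<StubDetector> run(int threads, int tasks) throws InterruptedException {
        Set<StubDetector> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final String name = "file-" + i + ".doc";
            pool.execute(() -> {
                StubDetector detector = DETECTOR.get(); // this thread's own instance
                detector.detect(name);
                seen.add(detector);
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Distinct detector instances: " + run(4, 100).size());
    }
}
```

No detector instance is ever shared between threads here, so the number of instances created never exceeds the number of worker threads, however many files are processed.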

Limitations

The Nanite system deliberately embeds a copy of the latest PRONOM signature files at the time of release, with the -XX part of the version number tracking the PRONOM release number. For example, 1.3.1-90 includes PRONOM signature file version 90 and the corresponding container signatures.

Nanite does not support auto-updating the signature files, but if you wish, you can download them and pass them to the DroidDetector via the DroidDetector(File fileSignaturesFile, File containerSignaturesFile) constructor.

Change Log

Version numbers are like x.x.x-yy - changes to the yy refer to updates to the PRONOM signature files, whereas changes to the x.x.x part refer to changes to the code that uses them. Only the latter are recorded here:

1.4.1
- Updates to how temporary files are handled, attempting to ensure large sets of temporary files are not left in place unnecessarily.
1.4.0
- Significant update to the implementation to take advantage of improvements in DROID 6.5. DROID's improved API means less code is required to run it in Nanite.
1.3.1
- Revert to not falling back on extension-based identification by default, as enabling this is a breaking API change.
1.3.0
- As of this release, the DROID code for guessing based on file extension is also included by default, if binary signature detection fails. New parameters on the DroidDetector allow this to be controlled.
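
The x.x.x-yy version scheme used above can be split mechanically when a script needs to know which PRONOM release a given build embeds. The following is a small illustrative sketch. The NaniteVersion class and its field names are hypothetical and not part of the Nanite API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that splits a version string such as "1.3.1-90". */
final class NaniteVersion {

    // x.x.x code version, a hyphen, then the PRONOM signature release number.
    private static final Pattern FORMAT =
            Pattern.compile("(\\d+\\.\\d+\\.\\d+)-(\\d+)");

    final String codeVersion;
    final int pronomRelease;

    private NaniteVersion(String codeVersion, int pronomRelease) {
        this.codeVersion = codeVersion;
        this.pronomRelease = pronomRelease;
    }

    /** Parses a full version string, rejecting anything not in x.x.x-yy form. */
    static NaniteVersion parse(String version) {
        Matcher m = FORMAT.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in x.x.x-yy form: " + version);
        }
        return new NaniteVersion(m.group(1), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        NaniteVersion v = parse("1.3.1-90");
        System.out.println("Code version: " + v.codeVersion);
        System.out.println("PRONOM signature release: " + v.pronomRelease);
    }
}
```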

Acknowledgements

This work was partially supported by the SCAPE project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
