bluesky/tiled

Explore other ways to identify file types

Closed this issue · 9 comments

When serving a directory of files, there may exist valid data files that lack a feature in the file name (such as a file extension) to identify the type of file. For example, there is no common file extension for SPEC data files and some users are accustomed to omitting a file extension. As shown in #174, the file extension may be too complicated to examine or not one of the recognized values. The .dat and .txt extensions are also used for various types of data files, including CSV.

We need some programmatic technique to identify the type of file, similar to the UNIX file command. Python examples include is_spec_file(filename) and isNeXusFile(filename).

Such routines could be called with unrecognized files.
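As a rough illustration of what such routines can look like, here is a minimal sketch that checks only the leading bytes of a file. The function names are hypothetical and are not the routines named above. The HDF5 check uses the standard 8-byte HDF5 signature at offset 0 (it does not handle files with a user block, and it only shows the container is HDF5, not that the content follows NeXus). The SPEC check assumes a SPEC data file starts with a "#F " line.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file


def looks_like_hdf5(filename):
    """True if the file starts with the HDF5 signature at offset 0."""
    with open(filename, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def looks_like_spec(filename):
    """True if the first line starts with '#F ' (assumed SPEC file header)."""
    with open(filename, "rb") as f:
        return f.readline(256).startswith(b"#F ")
```

Both checks read at most a few hundred bytes, so they stay cheap even for large files.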

The additional technique could be inserted into this block:

if ext in mimetypes_by_file_ext:
    mimetype = mimetypes_by_file_ext[ext]
else:
    # Use the Python's built-in facility for guessing mimetype
    # from file extension. This loads data about mimetypes from
    # the operating system the first time it is used.
    mimetype, _ = mimetypes.guess_type(path)

It might be better to search the known mimetypes first, since identification by file content is the more expensive operation. Each file type identified by content could then be associated with an ad hoc, unique mimetype.
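The ordering could be sketched roughly like this. It is an illustration only, not tiled's code: guess_mimetype and the sniffers argument (a list of callables that take a path and return a mimetype or None) are hypothetical names.

```python
import mimetypes
from pathlib import Path


def guess_mimetype(path, mimetypes_by_file_ext, sniffers):
    """Try cheap extension lookups first; read file content only as a last resort."""
    ext = Path(path).suffix
    if ext in mimetypes_by_file_ext:
        return mimetypes_by_file_ext[ext]
    mimetype, _ = mimetypes.guess_type(path)
    if mimetype is not None:
        return mimetype
    # Expensive path: each sniffer may open the file and returns a mimetype or None.
    for sniffer in sniffers:
        mimetype = sniffer(path)
        if mimetype is not None:
            return mimetype
    return None
```

With this ordering, a file whose extension is already recognized is never opened.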

I support this. I intentionally kept it simple to start, looking at file extension only, but I agree it's time to enable more sophisticated techniques.

I propose to add a configuration setting:

# config.yml
...
mimetype_detection_hook: my_custom_module:my_sniffer

which would enable you and anyone to experiment with this outside the tiled package like this:

# my_custom_module.py

def my_sniffer(filepath):
    ...
    return "..."

The function may inspect the filename and, if it needs to, open the file and read as many bytes as it wants to. The return value should be a MIME type, either a registered one like text/csv or a custom one like text/x-specfile.
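For context on the module:function form of the setting, here is a minimal sketch of how such a string can be resolved to a callable with importlib. load_hook is a hypothetical helper written for illustration. It is an assumption about the general approach and not necessarily how tiled resolves the setting.

```python
import importlib


def load_hook(object_reference):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attribute_path = object_reference.partition(":")
    if not module_name or not attribute_path:
        raise ValueError(f"Expected 'module:function', got {object_reference!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attribute paths such as 'SomeClass.method'.
    for name in attribute_path.split("."):
        obj = getattr(obj, name)
    return obj
```

The named module has to be importable, so my_custom_module.py would need to be on the Python path of the server process.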

This would override the code you excerpted above, so it would be in total control over how types were determined. It could decide whether to copy the mimetype search approach as a first pass or to overrule it.

If people developed "sniffers" that prove to be generally useful, we can always move them into tiled proper at some later point. Either way, I think it will be important to enable people who deploy tiled to customize the sniffer behavior like this on their own.

What do you think?

That seems very general. I like it.

@prjemian This is now implemented in v0.1.0a67 and documented at https://blueskyproject.io/tiled/how-to/read-custom-formats.html.

Let me know if you get a chance to try it out on SPEC or NeXus.

Starting to look at this now. Case 2 is the most likely scenario, since our data files may have extensions. Yet an extension cannot be trusted to be informative when it is overloaded for various data formats (.dat could be SPEC, CSV, binary, ...; .h5 could be NeXus, Data Exchange, or other).

The interface is called for each file:

# custom.py

def detect_mimetype(filepath, mimetype):
    if mimetype is None:
        # If we are here, detection based on file extension came up empty.
        ...
        mimetype = "text/csv"
    return mimetype
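To handle the overloaded-extension case, the hook could re-check content even when the extension-based pass returned something. A minimal sketch under stated assumptions: the AMBIGUOUS_EXTENSIONS set is a site-specific choice, "application/x-hdf5" is used here as an example custom mimetype, and a leading "#F " is assumed to mark a SPEC data file.

```python
from pathlib import Path

# Extensions we do not trust to identify the format (a site-specific assumption).
AMBIGUOUS_EXTENSIONS = {".dat", ".txt"}


def detect_mimetype(filepath, mimetype):
    ext = Path(filepath).suffix.lower()
    if mimetype is None or ext in AMBIGUOUS_EXTENSIONS:
        # Only open the file when the extension told us nothing reliable.
        with open(filepath, "rb") as f:
            head = f.read(8)
        if head.startswith(b"\x89HDF\r\n\x1a\n"):
            return "application/x-hdf5"
        if head.startswith(b"#F "):
            return "text/x-specfile"
    # Not recognized by content: keep whatever the extension-based pass found.
    return mimetype
```

A .dat file that turns out to be plain CSV keeps the mimetype it came in with, so the content check only overrides when it positively recognizes something.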

While this could become expensive when repeated over a directory structure with many similar files (a typical pattern), it could be optimized. One optimization (in the custom handler) would be to recognize that files in a directory likely follow a pattern, such as any combination of these rules:

  • all files in this directory are [this known format], regardless of naming style
  • NeXus/HDF5 area detector files in this directory have .h5 extension
  • SPEC files in this directory have .dat extension
  • custom NeXus/HDF5 files in this directory have .hdf5, .nx, or .nxs extension
  • file starts with recognized pattern

Even if that handling is better suited to a class, the optimizing class would be called from the detect_mimetype() function. Seems straightforward.
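A minimal sketch of that arrangement, assuming the simplest rule (all files with the same extension in one directory share a format). DirectoryPatternCache and sniff_content are hypothetical names, and the SPEC check assumes a leading "#F ".

```python
from pathlib import Path


class DirectoryPatternCache:
    """Remember the mimetype found for each (directory, extension) pair."""

    def __init__(self, sniff):
        self._sniff = sniff  # callable: filepath -> mimetype or None
        self._known = {}

    def lookup(self, filepath):
        path = Path(filepath)
        key = (path.parent, path.suffix)
        if key not in self._known:
            # First file of this kind in this directory: pay for the content check once.
            self._known[key] = self._sniff(filepath)
        return self._known[key]


def sniff_content(filepath):
    """Hypothetical content check: recognizes only SPEC files by a leading '#F '."""
    with open(filepath, "rb") as f:
        return "text/x-specfile" if f.read(3) == b"#F " else None


_cache = DirectoryPatternCache(sniff_content)


def detect_mimetype(filepath, mimetype):
    if mimetype is None:
        mimetype = _cache.lookup(filepath)
    return mimetype
```

The trade-off is that a directory with mixed content under one extension (or with no extension at all) would be misclassified after the first file, so the cache key would need to be refined for such directories.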

Another optimization:

  • directory contains a file that provides the mime type mapping
  • the mapping file could be created manually or by a previous run
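The mapping-file idea might look something like the sketch below. Everything here is an assumption made for illustration: the file name .mimetypes.json, the JSON layout (file name mapped to mimetype), and the helper names.

```python
import json
from pathlib import Path

MAPPING_FILENAME = ".mimetypes.json"  # hypothetical name for the per-directory mapping


def load_mapping(directory):
    """Return the directory's file-name-to-mimetype mapping, or {} if none exists."""
    mapping_file = Path(directory) / MAPPING_FILENAME
    if mapping_file.exists():
        return json.loads(mapping_file.read_text())
    return {}


def lookup_or_detect(filepath, detect):
    """Use the stored mimetype if present; otherwise detect once and store it."""
    path = Path(filepath)
    mapping = load_mapping(path.parent)
    if path.name not in mapping:
        mapping[path.name] = detect(filepath)
        (path.parent / MAPPING_FILENAME).write_text(json.dumps(mapping, indent=2))
    return mapping[path.name]
```

Entries written by hand are honored the same way as entries written by a previous run. Note that this version needs write access to the data directory, which may not be acceptable everywhere.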

This aligns with two optimizations I have been working on:

  • Stash the detection results (and metadata) in an index file so that the detection only has to happen once for each new file, not repeatedly on every tiled server startup.
  • Enable the user to explicitly index (or would a better term be “register”) certain files or directories with a new command like tiled register. This would give the user the opportunity to provide additional guidance on how to handle those specific files or directories, perhaps tiled register dir/ --ext .h5=application/x-nexus. That may be easier for users than going through trial-and-error to guide an automated detection scheme.

The local mapping may provide more flexibility. Our directories tend to have mixed content, so an ignore setting would be good for Python, SPEC macros, MATLAB procedures, Igor Pro procedures, text, markdown, ... But then, this is just another aspect of a custom handler.

Unless you have some specifics in mind, let's work up some custom handlers and compare.

Sounds good, let’s!