Allow extension with custom extractors
Feature idea, relatively low priority:
Right now you have to modify the source in order to add a new extractor class. There should be a way to do this dynamically, e.g. by calling TextExtractor.register(SomeCustomFileHandlerClass) or similar. We will have to find a way to determine the ordering / precedence of extractors since the list will be dynamic.
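To make the proposal concrete, here is a minimal sketch of what such a dynamic registry could look like. Only TextExtractor.register and SomeCustomFileHandlerClass come from the text above; the accept? hook and extractor_for lookup are hypothetical names chosen for illustration, and precedence is simply registration order.

```ruby
# Hypothetical sketch: the library's real internals may look different.
class TextExtractor
  @extractors = []

  class << self
    attr_reader :extractors

    # Adds an extractor class to the end of the list.
    # Registering the same class twice is a no-op.
    def register(klass)
      @extractors << klass unless @extractors.include?(klass)
      klass
    end

    # Returns the first registered extractor that claims the file, or nil.
    # `accept?` is an assumed class-level hook, not an existing API.
    def extractor_for(filename)
      @extractors.find { |klass| klass.accept?(filename) }
    end
  end
end

# Example custom handler living outside the library source.
class SomeCustomFileHandlerClass
  def self.accept?(filename)
    File.extname(filename) == ".custom"
  end
end

TextExtractor.register(SomeCustomFileHandlerClass)
```

With this shape, adding an extractor no longer requires touching the library source, and the ordering question only arises once two registered classes accept the same file.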
The order only becomes relevant if you have multiple extractors capable of extracting text from the same file types, so I would suggest postponing the ordering discussion until someone has a use case for it. That way the code stays simple until ordering is actually required, and you will probably understand the problem better once a use case presents itself.
If you decide to implement such an order-based registry, some prior art for that would be:
- Rails' controller callbacks
- Rackstash's FilterChain (which was originally inspired by the Rails API)
These options allow you to define an ordered list of handlers, which will be tried one after another. With these APIs, you can implement something like:
TextExtractor.register(OCRExtractor)
TextExtractor.register_after(OCRExtractor, FirstParagraphExtractor)
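As a rough illustration of the two calls above, an ordered registry could be sketched as follows. OrderedRegistry and PdfExtractor are hypothetical names, the extractor classes are empty placeholders, and this does not reproduce the Rails or Rackstash implementations.

```ruby
# Hypothetical sketch of an ordered registry; not the library's actual API.
class OrderedRegistry
  def initialize
    @handlers = []
  end

  # Appends a handler to the end of the chain.
  def register(handler)
    @handlers << handler
    self
  end

  # Inserts `handler` directly after `existing`, which must already be registered.
  def register_after(existing, handler)
    index = @handlers.index(existing)
    raise ArgumentError, "#{existing} is not registered" if index.nil?

    @handlers.insert(index + 1, handler)
    self
  end

  # Returns the handlers in the order they would be tried.
  def to_a
    @handlers.dup
  end
end

# Placeholder classes standing in for real extractors.
OCRExtractor = Class.new
FirstParagraphExtractor = Class.new
PdfExtractor = Class.new

registry = OrderedRegistry.new
registry.register(OCRExtractor)
registry.register(PdfExtractor)
registry.register_after(OCRExtractor, FirstParagraphExtractor)
```

Here FirstParagraphExtractor ends up between OCRExtractor and PdfExtractor, even though it was registered last.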
Alternatively, you could ignore the multiple-extractors-per-file-type issue in the registry altogether and instead allow registering a MultiExtractor that handles this transparently for the registry. Then you could use something much simpler, such as Rackstash::ClassRegistry, which Rackstash uses to register classes under names and to build instances of them. See Rackstash::Encoder for how this can be used.
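One way to picture the MultiExtractor idea is as a composite that wraps several extractors and presents itself to the registry as a single one. The sketch below is only an assumption about its shape: the extract(path) interface, the "first non-empty result wins" policy, and the two stub classes are all hypothetical.

```ruby
# Hypothetical composite; the `extract` interface is an assumption.
class MultiExtractor
  def initialize(*extractors)
    @extractors = extractors
  end

  # Tries each wrapped extractor in order and returns the first
  # non-empty result, or nil if none of them produced any text.
  def extract(path)
    @extractors.each do |extractor|
      text = extractor.extract(path)
      return text unless text.nil? || text.empty?
    end
    nil
  end
end

# Stub extractors for illustration only.
class EmptyExtractor
  def extract(_path)
    ""
  end
end

class FixedExtractor
  def extract(path)
    "text from #{path}"
  end
end

multi = MultiExtractor.new(EmptyExtractor.new, FixedExtractor.new)
multi.extract("scan.pdf")
```

The registry then only ever sees one extractor per file type, and any ordering logic stays inside the composite.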