planio-gmbh/plaintext

Allow extension with custom extractors

Feature idea, relatively low priority:

Right now you have to modify the source to add a new extractor class. There should be a way to do this dynamically, e.g. by calling TextExtractor.register(SomeCustomFileHandlerClass) or similar. Since the list will then be dynamic, we will have to find a way to determine the ordering / precedence of extractors.

As the order only becomes relevant if you have multiple extractors capable of extracting text from the same file types, I would suggest postponing the ordering discussion until someone has a use-case for it. This way the code stays simple until the complexity is actually required, and you will likely have a better understanding of the problem once a use-case presents itself.

While working on my last PR I thought about a registry as well. So yes, @jkraemer, there is a point to this. However, my conclusion was similar to the thoughts of @thegcat ("build stuff when you need it").

If you decide to implement such an order-based registry, some prior art for that would be:

These options allow you to define an ordered list of handlers, which will be tried one after another. With these APIs, you can implement something like

TextExtractor.register(OCRExtractor)
TextExtractor.register_after(OCRExtractor, FirstParagraphExtractor)
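A minimal sketch of how such an ordered registry could work. Nothing here is existing plaintext API: OrderedRegistry, handler_for, PdfExtractor and the class-level accept? methods are hypothetical names used for illustration. It also assumes that the first argument to register_after is the already registered anchor and the second is the class to insert after it.

```ruby
# Hypothetical sketch of an ordered extractor registry (not plaintext API).
class OrderedRegistry
  def initialize
    @handlers = []
  end

  # Append a handler at the end of the list, i.e. with the lowest precedence.
  def register(handler)
    @handlers << handler unless @handlers.include?(handler)
    self
  end

  # Place `handler` directly after the already registered `existing` handler.
  # Assumption: the anchor comes first, the new handler second.
  def register_after(existing, handler)
    raise ArgumentError, "#{existing.inspect} is not registered" unless @handlers.include?(existing)
    raise ArgumentError, "handler and anchor must differ" if existing.equal?(handler)

    @handlers.delete(handler) # move the handler if it was registered before
    @handlers.insert(@handlers.index(existing) + 1, handler)
    self
  end

  # Return the first handler that accepts the file, or nil if none does.
  def handler_for(file)
    @handlers.find { |handler| handler.accept?(file) }
  end

  def to_a
    @handlers.dup
  end
end

# Stub extractor classes, for illustration only.
class OCRExtractor
  def self.accept?(file)
    file.end_with?(".png")
  end
end

class FirstParagraphExtractor
  def self.accept?(file)
    file.end_with?(".txt")
  end
end

class PdfExtractor
  def self.accept?(file)
    file.end_with?(".pdf")
  end
end

registry = OrderedRegistry.new
registry.register(OCRExtractor)
registry.register(PdfExtractor)
registry.register_after(OCRExtractor, FirstParagraphExtractor)
```

With this, registry.to_a lists OCRExtractor, FirstParagraphExtractor and PdfExtractor in that order, and handler_for walks the list from front to back.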

Alternatively, you could ignore the multiple-extractors-per-file-type issue in the registry altogether and instead allow registering a MultiExtractor that handles this transparently for the registry. Then, you could use something much simpler such as Rackstash::ClassRegistry, which is used here to register classes to names and to build instances for them. See Rackstash::Encoder for how this can be used.
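A rough sketch of the MultiExtractor idea, under some assumptions: extractors are objects that respond to accept?(file) and text(file), and the registry maps one file extension to exactly one extractor. SimpleRegistry and StubExtractor are hypothetical names. This is neither the plaintext API nor the Rackstash::ClassRegistry API, just an illustration of keeping the ordering logic out of the registry.

```ruby
# Hypothetical sketch: a composite extractor that hides ordering from the registry.
class MultiExtractor
  def initialize(*extractors)
    @extractors = extractors
  end

  def accept?(file)
    @extractors.any? { |extractor| extractor.accept?(file) }
  end

  # Try the wrapped extractors in order and return the first non-empty text.
  def text(file)
    @extractors.each do |extractor|
      next unless extractor.accept?(file)

      result = extractor.text(file)
      return result unless result.nil? || result.empty?
    end
    nil
  end
end

# A deliberately simple registry: one extractor per file extension, no ordering.
class SimpleRegistry
  def initialize
    @by_extension = {}
  end

  def register(extension, extractor)
    @by_extension[extension.to_s.downcase] = extractor
    self
  end

  # Look up the extractor by file extension; returns nil for unknown types.
  def extract(file)
    extractor = @by_extension[File.extname(file).delete_prefix(".").downcase]
    extractor && extractor.text(file)
  end
end

# Stub that always accepts and returns a fixed string, for illustration only.
class StubExtractor
  def initialize(result)
    @result = result
  end

  def accept?(_file)
    true
  end

  def text(_file)
    @result
  end
end

embedded_text = StubExtractor.new("")      # pretends the PDF has no text layer
ocr = StubExtractor.new("text from OCR")   # fallback

registry = SimpleRegistry.new
registry.register(:pdf, MultiExtractor.new(embedded_text, ocr))
pdf_text = registry.extract("scan.pdf")
```

Here the registry only ever sees a single extractor for pdf. The fallback from the empty result to the OCR stub happens inside MultiExtractor.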