ClearTK/cleartk

consider switching to UIMA resources for classifiers, etc.

bethard opened this issue · 1 comments

Original issue 393 created by ClearTK on 2013-11-16T18:08:59.000Z:

We should talk about whether or not we should switch over to the UIMA way of doing things for stuff like CleartkAnnotator:

http://mail-archives.apache.org/mod_mbox/uima-user/201311.mbox/%3CB87CC687-68E8-47C8-91B0-78F8DBEBCBC4%40apache.org%3E

Switching to the UIMA way would mean that instead of:

    AnalysisEngineFactory.createPrimitiveDescription(
        ExamplePOSAnnotator.class,
        CleartkSequenceAnnotator.PARAM_IS_TRAINING,
        true,
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
        outputDirectory,
        DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
        MalletCRFStringOutcomeDataWriter.class);

We would do something like:

    AnalysisEngineFactory.createPrimitiveDescription(
        ExamplePOSAnnotator.class,
        CleartkSequenceAnnotator.PARAM_IS_TRAINING,
        true,
        CleartkSequenceAnnotator.PARAM_DATA_WRITER_FACTORY,
        ExternalResourceFactory.createExternalResourceDescription(
            DefaultSequenceDataWriterFactory.class,
            DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER,
            ExternalResourceFactory.createExternalResourceDescription(
                MalletCRFStringOutcomeDataWriter.class,
                MalletCRFStringOutcomeDataWriter.PARAM_OUTPUT_DIRECTORY,
                outputDirectory)));

This is a bit more verbose, but it does make it clearer where the configuration parameters come from, since they're all scoped by their external resource grouping. Also, with this approach, if you load the same classifier in more than one place, it will only be loaded once if you use the same ExternalResourceDescription in both places. (But using the same classifier twice is probably uncommon.)
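To make the "loaded only once" point concrete, here is a toy model in plain Java. It does not use UIMA or uimaFIT: `Description`, `ResourceManager` and `Classifier` are hypothetical stand-ins invented for this sketch, and the real framework's sharing mechanism may differ in detail. It only illustrates the idea that resolving the same description object twice yields one shared instance, while a separate description triggers a separate load.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class SharedResourceSketch {

    /** Hypothetical stand-in for an external resource description: knows how to build a resource. */
    static final class Description<T> {
        final Supplier<T> loader;

        Description(Supplier<T> loader) {
            this.loader = loader;
        }
    }

    /** Hypothetical stand-in for the framework component that resolves descriptions. */
    static final class ResourceManager {
        // Keyed by description identity: the same description object maps to one resource.
        private final Map<Description<?>, Object> loaded = new IdentityHashMap<>();

        @SuppressWarnings("unchecked")
        <T> T resolve(Description<T> description) {
            // Load on the first request for this description, reuse the instance afterwards.
            return (T) loaded.computeIfAbsent(description, d -> d.loader.get());
        }
    }

    /** Counts how many times the (pretend) expensive classifier model is loaded. */
    static int loadCount = 0;

    static final class Classifier {
        Classifier() {
            loadCount++;
        }
    }

    public static void main(String[] args) {
        ResourceManager manager = new ResourceManager();
        Description<Classifier> shared = new Description<Classifier>(Classifier::new);

        // Two annotators configured with the same description share one classifier.
        Classifier first = manager.resolve(shared);
        Classifier second = manager.resolve(shared);

        // A separately created description is loaded separately.
        Classifier other = manager.resolve(new Description<Classifier>(Classifier::new));

        System.out.println("same description shares instance: " + (first == second));
        System.out.println("different description shares instance: " + (first == other));
        System.out.println("classifier loads: " + loadCount);
    }
}
```

The identity-keyed map is a deliberate simplification: it mirrors the condition stated above (sharing happens when the same ExternalResourceDescription is used in both places) without claiming anything about how UIMA identifies resources internally.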

If we went this route, it would require some substantial changes. DataWriterFactory, DataWriter, ClassifierFactory and Classifier would have to implement SharedResourceObject instead of Initializable. Note that this would involve removing all the XXX(File) constructors in DataWriters, and adding an initialize(UimaContext) method to DataWriter_ImplBase.
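A rough sketch of what the constructor change could look like for a data writer, again in plain Java without any UIMA dependency. `Context` is a hypothetical stand-in for UimaContext (only a `getConfigParameterValue` lookup is modelled), and `ConstructorWriter` / `ContextWriter` are made-up names, not ClearTK classes. The point is only the shape of the change: the File argument moves out of the constructor and into a named parameter read during initialization.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class ConfigurableWriterSketch {

    /** Hypothetical stand-in for UimaContext: a bag of named configuration parameters. */
    static final class Context {
        private final Map<String, Object> params = new HashMap<>();

        Context with(String name, Object value) {
            params.put(name, value);
            return this;
        }

        Object getConfigParameterValue(String name) {
            return params.get(name);
        }
    }

    /** Before: the output directory is handed to a constructor. */
    static final class ConstructorWriter {
        final File outputDirectory;

        ConstructorWriter(File outputDirectory) {
            this.outputDirectory = outputDirectory;
        }
    }

    /** After: no-argument construction, then configuration through initialize(...). */
    static final class ContextWriter {
        static final String PARAM_OUTPUT_DIRECTORY = "outputDirectory";

        File outputDirectory;

        void initialize(Context context) {
            Object value = context.getConfigParameterValue(PARAM_OUTPUT_DIRECTORY);
            if (value == null) {
                throw new IllegalArgumentException("missing parameter: " + PARAM_OUTPUT_DIRECTORY);
            }
            this.outputDirectory = new File(value.toString());
        }
    }

    public static void main(String[] args) {
        // Old style: caller constructs the writer with a File.
        ConstructorWriter before = new ConstructorWriter(new File("target/model"));

        // New style: the framework would instantiate the writer and pass it a context.
        ContextWriter after = new ContextWriter();
        after.initialize(new Context().with(ContextWriter.PARAM_OUTPUT_DIRECTORY, "target/model"));

        System.out.println(before.outputDirectory.equals(after.outputDirectory));
    }
}
```

Any code that currently calls a File constructor directly would need to be rewritten along these lines, which is why the change affects existing callers.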

Of course, this would not be backwards compatible, and it would change some of the core ML APIs. So either we do this for 2.0, or we don't do it until 3.0.

@pkluegl Looks interesting?