dstl/baleen

BaleenCollectionReader.getContentExtractor() results in ClassNotFoundException

aalsup opened this issue · 7 comments

Using the following config which I borrowed from the baleen-runner tests:

sample_pipeline.yaml:

collectionreader:
  class: FolderReader
  folders:
    - /tmp/data

annotators:
  - class: regex.Email
  - class: regex.Url

consumers:
  - class: EntityCount

The application generates the following error in the output:

2018-03-19 21:32:32,138 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

Note that the ClassNotFoundException has the package name repeated twice: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor

I believe the bug is caused by passing a fully qualified classname AND defaultPackage="uk.gov.dstl.baleen.contentextractors" to BuilderUtils.getClassFromString() here:

https://github.com/dstl/baleen/blob/master/baleen-uima/src/main/java/uk/gov/dstl/baleen/uima/BaleenCollectionReader.java#L178

One possible fix would be to modify BuilderUtils.getClassFromString() to test whether the className parameter already starts with the defaultPackage:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BuilderUtils.java#L64
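
A rough sketch of that check, to make the idea concrete. This is not the BuilderUtils source: the class name QualifiedNameGuard and the method candidateName are hypothetical names invented for illustration, and the sketch only builds the name that would be tried first rather than loading any class.

```java
public class QualifiedNameGuard {

    /**
     * Returns the class name to try first. If className already starts with
     * defaultPackage, it is returned unchanged; otherwise defaultPackage is
     * prepended. (Hypothetical helper, not part of Baleen.)
     */
    public static String candidateName(String className, String defaultPackage) {
        if (className.startsWith(defaultPackage + ".")) {
            // Already qualified with the default package, so do not prepend it again.
            return className;
        }
        return defaultPackage + "." + className;
    }

    public static void main(String[] args) {
        String pkg = "uk.gov.dstl.baleen.contentextractors";

        // Short name: the default package is prepended.
        System.out.println(candidateName("StructureContentExtractor", pkg));

        // Fully qualified name: returned as-is, so the package is not doubled.
        System.out.println(candidateName(pkg + ".StructureContentExtractor", pkg));
    }
}
```

Both calls in main produce the same single-package name, which is the behaviour the check is meant to guarantee.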

Lastly, another fix would be to modify BaleenDefaults.DEFAULT_CONTENT_EXTRACTOR so that it does not contain the FQ classname, here:

https://github.com/dstl/baleen/blob/master/baleen-core/src/main/java/uk/gov/dstl/baleen/core/utils/BaleenDefaults.java#L34

Thank you for the detailed error report.
I had a look at this issue this morning and cannot replicate it.
Please could you let me know how you are passing the pipeline to Baleen? I have tried submitting an identical pipeline (other than the folder location), both via a Baleen config file at startup and configured by hand using the Plankton interface.
Am I right in assuming (given your previous issue) that you are running from a version built from the latest 2.5.0-SNAPSHOT code?
Sorry to not be more helpful...

John, thanks for your quick reply. The error appears very early in the console logs, and the application continues to bootstrap and run with no further errors. Here's my configuration:

$ ls -l /tmp/baleen
-rw-r--r--  1 ahalsup  wheel  212079116 Mar 20 10:34 baleen-2.5.0-SNAPSHOT.jar
drwxr-xr-x  3 ahalsup  wheel         96 Mar 20 10:35 data
-rw-r--r--  1 ahalsup  wheel        138 Mar 20 10:38 runner.yaml
-rw-r--r--  1 ahalsup  wheel        192 Mar 20 10:38 sample_pipeline.yaml

$ ls -l /tmp/baleen/data
-rw-r--r--  1 ahalsup  wheel  95 Mar 20 10:34 data.txt

/tmp/baleen/runner.yaml

pipelines:
  - name: sample
    file: /tmp/baleen/sample_pipeline.yaml

logging:
  loggers:
    - name: console
      minLevel: DEBUG

/tmp/baleen/sample_pipeline.yaml

collectionreader:
  class: FolderReader
  folders:
    - /tmp/baleen/data

annotators:
  - class: regex.Email
  - class: regex.Url

consumers:
  - class: EntityCount
  - class: print.Entities

/tmp/baleen/data/data.txt

This is an example email email@example.com which would correspond with a URL http://example.com

Here's the command to run, and a snippet of the console output:

$ java -jar ./baleen-2.5.0-SNAPSHOT.jar runner.yaml
10:48:10.019 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen starting
10:48:10.021 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen about to run
...
2018-03-20 10:48:10,550 DEBUG uk.gov.dstl.baleen.collectionreaders.FolderReader[sample] - Starting function initialize
2018-03-20 10:48:10,553 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.collectionreaders.FolderReader:initialize'
2018-03-20 10:48:10,555 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at uk.gov.dstl.baleen.core.utils.BuilderUtils.getClassFromString(BuilderUtils.java:67)
	at uk.gov.dstl.baleen.uima.BaleenCollectionReader.getContentExtractor(BaleenCollectionReader.java:178)
	at uk.gov.dstl.baleen.collectionreaders.FolderReader.doInitialize(FolderReader.java:146)
	at uk.gov.dstl.baleen.uima.BaleenCollectionReader.initialize(BaleenCollectionReader.java:64)
...
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.contentextractors.StructureContentExtractor[sample] - Starting function initialize
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.contentextractors.StructureContentExtractor:initialize'
...
2018-03-20 10:48:11,401 INFO  org.eclipse.jetty.server.Server - Started @1548ms
2018-03-20 10:48:11,401 DEBUG org.eclipse.jetty.util.component.AbstractLifeCycle - STARTED @1548ms org.eclipse.jetty.server.Server@7eecb5b8
2018-03-20 10:48:11,401 INFO  uk.gov.dstl.baleen.core.web.BaleenWebApi - Server started
2018-03-20 10:48:11,401 INFO  uk.gov.dstl.baleen.core.manager.BaleenManager - Initialisation complete

After stopping the application with Ctrl+C, I see that the entityCount.tsv file has been created:

$ cat entityCount.tsv
/private/tmp/baleen/data/data.txt	2
/private/tmp/baleen/data/data.txt	2

Thank you for the extra information... now that I have set my logging minLevel to the same as yours I am seeing the same result.

It does look like Baleen is working, despite the error, as the entityCount.tsv file is the default output for the EntityCount consumer that you are using. Your output suggests you have run Baleen twice on your single file, so the file and its two entities have been added to the output file twice.

If you set the minLevel to INFO in runner.yaml, you will suppress a lot of the output and you should also see the results of the print.Entities consumer that you are running.

2018-03-20 15:20:49,919 INFO  uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
        Value: email@example.com
        Type: uk.gov.dstl.baleen.types.common.CommsIdentifier
        Span: 25 -> 42

2018-03-20 15:20:49,920 INFO  uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
        Value: http://example.com
        Type: uk.gov.dstl.baleen.types.common.Url
        Span: 77 -> 95

This should demonstrate that Baleen is working, despite the errors, so hopefully you can continue with your use. I will leave this open, as clearly something is not right nonetheless.

@JohnDaws - you rock! Thanks for looking into this for me.

So I believe this is actually expected behaviour and not a problem (although perhaps a little inefficient). The method used to find the class first checks for the class name in the default package, and if it can't be found there, checks the name on its own. Doing it in this order prevents someone creating a class with no package and the same name as an existing class, and having it picked up by mistake.

In the BaleenDefaults class, the default ContentExtractor is fully specified for clarity. But that means that when a component attempts to get the default content extractor, it first tries the name with the default package prepended before trying the already-qualified (correct) name. It logs the exception for debugging purposes, but it's only there for information and is nothing to worry about.
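
The lookup order described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual BuilderUtils.getClassFromString() code: the class ClassLookupSketch and the method resolve are hypothetical names, the debug message is printed rather than logged, and java.util.ArrayList stands in for a content extractor so the example runs without Baleen on the classpath.

```java
public class ClassLookupSketch {

    /**
     * Tries defaultPackage + "." + className first, then className on its own.
     * (Hypothetical helper; names and error handling are assumptions.)
     */
    public static Class<?> resolve(String className, String defaultPackage) {
        try {
            // First attempt: treat className as relative to the default package.
            return Class.forName(defaultPackage + "." + className);
        } catch (ClassNotFoundException e) {
            // Informational only; the second attempt below may still succeed.
            System.out.println("Couldn't find class " + className
                    + " in package " + defaultPackage);
        }
        try {
            // Second attempt: treat className as already fully qualified.
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("No such class: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Short name: found on the first attempt, nothing is reported.
        System.out.println(resolve("ArrayList", "java.util").getName());

        // Fully qualified name: the first attempt looks for
        // java.util.java.util.ArrayList and fails, the second attempt succeeds.
        System.out.println(resolve("java.util.ArrayList", "java.util").getName());
    }
}
```

With a fully qualified name, the first attempt fails on the doubled package name and is reported, and the second attempt then returns the right class, which matches the pattern in the DEBUG log earlier in the thread.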

So perhaps a little inefficient, but not an issue and can be safely ignored. I'd recommend closing this issue.

👍 Agreed.

Thanks, James, for providing a detailed answer. I will close the thread.

Andrewm, for info, my go-to consumer for testing purposes is the Html5 consumer.
It writes each input document to an HTML file and wraps a span around each entity, along with some metadata. It's a bit easier to keep track of than print.Entities once you have more than a few results.

pipeline

...
consumers:
  - class: EntityCount
  - class: print.Entities
  - class: Html5
    outputFolder: .\html_out
    css: .\email_url.css

Without a css file, the Html output will just look like plain text in a browser, but if you add the css file below to your html output folder then the entities within the text should appear 20% larger and colour coded by type. Obviously you can add the other types and be a bit more adventurous with the styling if you wish.

email_url.css

span.Url {color:red;} 
span.CommsIdentifier {color:blue;}
span.baleen {font-size:120%;}