BaleenCollectionReader.getContentExtractor() results in ClassNotFoundException
aalsup opened this issue · 7 comments
Using the following config (sample_pipeline.yaml), which I borrowed from the baleen-runner tests:
collectionreader:
  class: FolderReader
  folders:
  - /tmp/data
annotators:
- class: regex.Email
- class: regex.Url
consumers:
- class: EntityCount
The application generates the following error in the output:
2018-03-19 21:32:32,138 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor
Notice, the CNFE has the package spec repeated twice: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor
I believe the bug is caused by passing a fully qualified classname AND defaultPackage="uk.gov.dstl.baleen.contentextractors" to BuilderUtils.getClassFromString() here:
Another possible fix would be to modify BuilderUtils.getClassFromString() and test if the className parameter contains the defaultPackage:
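A minimal sketch of that second option, assuming the lookup can be reduced to an ordered list of names to try. CandidateNames and candidateNames are hypothetical names for illustration and are not part of Baleen; the real getClassFromString() may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

public class CandidateNames {
    /**
     * Hypothetical helper: lists the class names a lookup would try, in order.
     * The default package is only prepended when className is not already inside it.
     */
    public static List<String> candidateNames(String className, String defaultPackage) {
        List<String> names = new ArrayList<>();
        if (!className.startsWith(defaultPackage + ".")) {
            names.add(defaultPackage + "." + className);
        }
        // The name as given is always tried last.
        names.add(className);
        return names;
    }

    public static void main(String[] args) {
        String pkg = "uk.gov.dstl.baleen.contentextractors";
        // Short name: the package-qualified form is tried first, then the bare name.
        System.out.println(candidateNames("StructureContentExtractor", pkg));
        // Already qualified: only the name as given is tried, so no doubled package.
        System.out.println(candidateNames(pkg + ".StructureContentExtractor", pkg));
    }
}
```

With a check like this, an already-qualified default would produce a single lookup and no ClassNotFoundException in the debug log.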
Lastly, another fix would be to modify BaleenDefaults.DEFAULT_CONTENT_EXTRACTOR so that it does not contain the FQ classname, here:
Thank you for the detailed error report.
I had a look at this earlier this morning and cannot replicate your issue.
Please could you let me know how you are passing the pipeline to Baleen? I have tried submitting an identical pipeline (other than the folder location) via a Baleen config file at startup, and also configuring it by hand using the plankton interface.
Am I right in assuming (given your previous issue) that you are running from a version built from the latest 2.5.0-SNAPSHOT code?
Sorry to not be more helpful...
John, thanks for your quick reply. The error appears very early in the console logs, and the application continues to bootstrap and run with no further errors. Here's my configuration:
$ ls -l /tmp/baleen
-rw-r--r-- 1 ahalsup wheel 212079116 Mar 20 10:34 baleen-2.5.0-SNAPSHOT.jar
drwxr-xr-x 3 ahalsup wheel 96 Mar 20 10:35 data
-rw-r--r-- 1 ahalsup wheel 138 Mar 20 10:38 runner.yaml
-rw-r--r-- 1 ahalsup wheel 192 Mar 20 10:38 sample_pipeline.yaml
$ ls -l /tmp/baleen/data
-rw-r--r-- 1 ahalsup wheel 95 Mar 20 10:34 data.txt
/tmp/baleen/runner.yaml
pipelines:
- name: sample
  file: /tmp/baleen/sample_pipeline.yaml
logging:
  loggers:
  - name: console
    minLevel: DEBUG
/tmp/baleen/sample_pipeline.yaml
collectionreader:
  class: FolderReader
  folders:
  - /tmp/baleen/data
annotators:
- class: regex.Email
- class: regex.Url
consumers:
- class: EntityCount
- class: print.Entities
/tmp/baleen/data/data.txt
This is an example email email@example.com which would correspond with a URL http://example.com
Here's the command to run, and a snippet of the console output:
$ java -jar ./baleen-2.5.0-SNAPSHOT.jar runner.yaml
10:48:10.019 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen starting
10:48:10.021 [main] INFO uk.gov.dstl.baleen.runner.Baleen - Baleen about to run
...
2018-03-20 10:48:10,550 DEBUG uk.gov.dstl.baleen.collectionreaders.FolderReader[sample] - Starting function initialize
2018-03-20 10:48:10,553 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.collectionreaders.FolderReader:initialize'
2018-03-20 10:48:10,555 DEBUG uk.gov.dstl.baleen.core.utils.BuilderUtils - Couldn't find class uk.gov.dstl.baleen.contentextractors.StructureContentExtractor in package uk.gov.dstl.baleen.contentextractors
java.lang.ClassNotFoundException: uk.gov.dstl.baleen.contentextractors.uk.gov.dstl.baleen.contentextractors.StructureContentExtractor
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at uk.gov.dstl.baleen.core.utils.BuilderUtils.getClassFromString(BuilderUtils.java:67)
at uk.gov.dstl.baleen.uima.BaleenCollectionReader.getContentExtractor(BaleenCollectionReader.java:178)
at uk.gov.dstl.baleen.collectionreaders.FolderReader.doInitialize(FolderReader.java:146)
at uk.gov.dstl.baleen.uima.BaleenCollectionReader.initialize(BaleenCollectionReader.java:64)
...
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.contentextractors.StructureContentExtractor[sample] - Starting function initialize
2018-03-20 10:48:10,557 DEBUG uk.gov.dstl.baleen.core.metrics.LoggingMetricListener - Created timer 'sample:uk.gov.dstl.baleen.contentextractors.StructureContentExtractor:initialize'
...
2018-03-20 10:48:11,401 INFO org.eclipse.jetty.server.Server - Started @1548ms
2018-03-20 10:48:11,401 DEBUG org.eclipse.jetty.util.component.AbstractLifeCycle - STARTED @1548ms org.eclipse.jetty.server.Server@7eecb5b8
2018-03-20 10:48:11,401 INFO uk.gov.dstl.baleen.core.web.BaleenWebApi - Server started
2018-03-20 10:48:11,401 INFO uk.gov.dstl.baleen.core.manager.BaleenManager - Initialisation complete
After stopping the application with Ctrl+C, I see that the entityCount.tsv file has been created:
$ cat entityCount.tsv
/private/tmp/baleen/data/data.txt 2
/private/tmp/baleen/data/data.txt 2
Thank you for the extra information... now that I have set my logging minlevel to the same as yours I am seeing the same result.
It does look like Baleen is working, despite the error, as the entityCount.tsv file is the default output for the EntityCount consumer that you are using. The output you have suggests you have run Baleen twice on your single file, so the file with its two entities has been added to the output file twice.
If you set the minLevel to INFO in runner.yaml, you will suppress a lot of the output and should also see the results of the print.Entities consumer that you are running.
2018-03-20 15:20:49,919 INFO uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
Value: email@example.com
Type: uk.gov.dstl.baleen.types.common.CommsIdentifier
Span: 25 -> 42
2018-03-20 15:20:49,920 INFO uk.gov.dstl.baleen.consumers.print.Entities[sample] - uk.gov.dstl.baleen.types.semantic.Entity:
Value: http://example.com
Type: uk.gov.dstl.baleen.types.common.Url
Span: 77 -> 95
This should demonstrate that Baleen is working, despite the errors, so hopefully you can continue with your use. I will leave this open, as clearly something is not right nonetheless.
So I believe this is actually expected behaviour and not a problem (although perhaps a little inefficient). The method used to find the class first checks for the class name in the default package, and then, if it can't be found there, checks the name on its own. Doing it in this order prevents someone creating a class with no package with the same name as an existing class and it being picked up by mistake.
In the BaleenDefaults class, the default ContentExtractor is fully specified for clarity. But that means when a component attempts to get the default content extractor, it first checks by prepending the default package name before trying the already-qualified (correct) name. It logs the exception for debugging purposes, but it's only there for information and is nothing to worry about.
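The lookup order described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual BuilderUtils source: LookupOrderSketch and findClass are hypothetical names, and java.util stands in for the Baleen default package so the example runs without Baleen on the classpath.

```java
public class LookupOrderSketch {
    /**
     * Tries defaultPackage + "." + className first, then className on its own.
     * The first failure is only reported for information; the fallback may still succeed.
     */
    public static Class<?> findClass(String className, String defaultPackage)
            throws ClassNotFoundException {
        try {
            return Class.forName(defaultPackage + "." + className);
        } catch (ClassNotFoundException e) {
            // Comparable to the DEBUG message in the issue: informational, not fatal.
            System.err.println("Couldn't find class " + className
                    + " in package " + defaultPackage + ": " + e.getMessage());
        }
        // Fall back to the name exactly as it was given.
        return Class.forName(className);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Short name: found on the first attempt, nothing is logged.
        System.out.println(findClass("ArrayList", "java.util").getName());
        // Already-qualified name: the first attempt looks up
        // java.util.java.util.ArrayList and fails, then the fallback succeeds.
        System.out.println(findClass("java.util.ArrayList", "java.util").getName());
    }
}
```

The second call mirrors the situation in this issue: the doubled package name appears only in the logged exception, and the class is still resolved by the fallback.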
So perhaps a little inefficient, but not an issue and can be safely ignored. I'd recommend closing this issue.
👍 Agreed.
Thanks James for providing a detailed answer, I will close the thread.
Andrewm, for info, my go-to consumer for testing purposes is the Html5 consumer.
This will write each input document to an HTML file and wrap a span around each entity with some metadata. It's a bit easier to keep track of than print.Entities once you have more than a few results.
pipeline
...
consumers:
- class: EntityCount
- class: print.Entities
- class: Html5
  outputFolder: .\html_out
  css: .\email_url.css
Without a CSS file, the HTML output will just look like plain text in a browser, but if you add the CSS file below to your HTML output folder then the entities within the text should appear 20% larger and colour coded by type. Obviously you can add the other types and be a bit more adventurous with the styling if you wish.
email_url.css
span.Url {color:red;}
span.CommsIdentifier {color:blue;}
span.baleen {font-size:120%;}