processFulltextDocument fails on 0.23% of arXiv PDFs
MarksonChen opened this issue · 6 comments
I ran processFulltextDocument on 22103 arXiv PDFs. 22053 PDFs succeeded and 50 failed.
Running on MacOS M2 chip
Java version: 17.0.10
Server started with Gradle (./gradlew run)
An example error log:
ERROR [2024-05-09 13:13:55,538] org.grobid.service.process.GrobidRestProcessFiles: An unexpected exception occurs.
! java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
! at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
! at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
! at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
! at java.base/java.util.Objects.checkIndex(Objects.java:359)
! at java.base/java.util.ArrayList.get(ArrayList.java:427)
! at org.grobid.core.data.Note.getPageNumber(Note.java:77)
! at org.grobid.core.document.TEIFormatter.lambda$toTEITextPiece$0(TEIFormatter.java:1460)
! at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:178)
! at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
! at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
! at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
! at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
! at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
! at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682)
! at org.grobid.core.document.TEIFormatter.toTEITextPiece(TEIFormatter.java:1461)
! at org.grobid.core.document.TEIFormatter.toTEIBody(TEIFormatter.java:1015)
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2648)
! ... 78 common frames omitted
! Causing: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occurred while running Grobid.
! at org.grobid.core.engines.FullTextParser.toTEI(FullTextParser.java:2708)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:320)
! at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:119)
! at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:587)
! at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:577)
! at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:290)
! at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:291)
! at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:240)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
! at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
! at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
! at java.base/java.lang.reflect.Method.invoke(Method.java:568)
! at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:134)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:177)
! at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
! at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:81)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
! at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
! at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:256)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
! at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
! at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
! at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
! at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:235)
! at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684)
! at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
! at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
! at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
! at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
! at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665)
! at io.dropwizard.servlets.ThreadNameFilter.doFilter(ThreadNameFilter.java:36)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.handle(AllowedMethodsFilter.java:46)
! at io.dropwizard.jersey.filter.AllowedMethodsFilter.doFilter(AllowedMethodsFilter.java:40)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:313)
! at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:267)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
! at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
! at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
! at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202)
! at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635)
! at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
! at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1382)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
! at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484)
! at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
! at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1304)
! at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at io.dropwizard.metrics.jetty11.InstrumentedHandler.handle(InstrumentedHandler.java:307)
! at io.dropwizard.jetty.RoutingHandler.handle(RoutingHandler.java:52)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:822)
! at io.dropwizard.jetty.ZipExceptionHandlingGzipHandler.handle(ZipExceptionHandlingGzipHandler.java:26)
! at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:173)
! at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
! at org.eclipse.jetty.server.Server.handle(Server.java:563)
! at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:505)
! at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:762)
! at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:497)
! at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
! at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314)
! at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
! at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
! at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:936)
! at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1080)
! at java.base/java.lang.Thread.run(Thread.java:842)
The 50 PDFs that failed:
https://arxiv.org/pdf/2202.03169
https://arxiv.org/pdf/2007.10408
https://arxiv.org/pdf/2008.08076
https://arxiv.org/pdf/2203.00397
https://arxiv.org/pdf/2202.00145
https://arxiv.org/pdf/2110.13423
https://arxiv.org/pdf/2006.16218
https://arxiv.org/pdf/2305.01868
https://arxiv.org/pdf/2206.11939
https://arxiv.org/pdf/1711.05715
https://arxiv.org/pdf/2110.11222
https://arxiv.org/pdf/2006.13025
https://arxiv.org/pdf/1902.00450
https://arxiv.org/pdf/2109.04212
https://arxiv.org/pdf/2105.14849
https://arxiv.org/pdf/cs/9906002
https://arxiv.org/pdf/2101.09398
https://arxiv.org/pdf/1911.00536
https://arxiv.org/pdf/1912.02762
https://arxiv.org/pdf/2104.07857
https://arxiv.org/pdf/2106.15093
https://arxiv.org/pdf/1901.09401
https://arxiv.org/pdf/2201.10129
https://arxiv.org/pdf/2010.04879
https://arxiv.org/pdf/1206.5241
https://arxiv.org/pdf/2203.14101
https://arxiv.org/pdf/1905.06214
https://arxiv.org/pdf/2205.05789
https://arxiv.org/pdf/1810.00953
https://arxiv.org/pdf/1910.11856
https://arxiv.org/pdf/1501.02876
https://arxiv.org/pdf/2202.01987
https://arxiv.org/pdf/2303.02186
https://arxiv.org/pdf/2010.05761
https://arxiv.org/pdf/2204.11918
https://arxiv.org/pdf/2002.12361
https://arxiv.org/pdf/1810.07311
https://arxiv.org/pdf/1905.03817
https://arxiv.org/pdf/1901.07846
https://arxiv.org/pdf/2202.03798
https://arxiv.org/pdf/1711.01244
https://arxiv.org/pdf/2006.03040
https://arxiv.org/pdf/2004.10964
https://arxiv.org/pdf/1803.00590
https://arxiv.org/pdf/1612.06109
https://arxiv.org/pdf/1704.03651
https://arxiv.org/pdf/1610.09534
https://arxiv.org/pdf/2202.03555
https://arxiv.org/pdf/2008.04990
Hi @MarksonChen
This is normally fixed with #1075
Are you using the latest master version?
Hi kermitt2,
Thank you for your reply. I was using 0.8.0.
However, after switching to the latest master version (using git clone https://github.com/kermitt2/grobid.git), 49 out of the 50 papers listed above still cannot be parsed with processFulltextDocument.
Thank you @MarksonChen for checking and reporting these arXiv error cases.
Indeed the problem is not related to the issue corresponding to #1075, sorry. I just pushed a quick fix and these files should work too.
Hi, kermitt2, thank you so much for your speedy fix! The amount of continual work put into this open-source project has been remarkable. All 22085 fetchable arXiv PDFs can be parsed successfully with processFulltextDocument.
@kermitt2 I have a sense of déjà vu about this issue from working on PRs #1097 and #1099.
As far as I remember, this happens when several notes with the same "label" are identified in the text. When the notes list is collected from the text using int idx = clusterTokens.indexOf(matching.get()); without updating the search position, the same note ends up being recorded several times with the same positions.
For the first article in the list, 2202.03169, it happens because there are three notes with the same intervals. Maybe we could just filter them out as an additional precaution.
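To make the mechanism concrete, here is a minimal sketch of the behaviour described above. It is not GROBID's code: the class name, the method collectPositions, and the token list are hypothetical, and only the reliance on List.indexOf mirrors the line quoted above. Because indexOf always returns the first match, every occurrence of a repeated label resolves to the same index.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateNotePositions {

    // Hypothetical stand-in for the note collection step: each occurrence of
    // the label is looked up with indexOf on the full token list.
    static List<Integer> collectPositions(List<String> tokens, String label, int occurrences) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < occurrences; i++) {
            // indexOf always scans from the start, so it returns the first match every time.
            positions.add(tokens.indexOf(label));
        }
        return positions;
    }

    public static void main(String[] args) {
        // The label "1" appears at index 2 and again at index 15.
        List<String> tokens = List.of("This", "note", "1", ",", "and", "this", "note", "2",
                ",", "but", "back", "to", "the", "first", "note", "1");
        System.out.println(collectPositions(tokens, "1", 2)); // prints [2, 2]
    }
}
```

Both lookups land on index 2, so the second call-out at index 15 is never recorded, which matches the "same note with the same positions" symptom.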
I'm also writing down some additional information here, as I will have forgotten it in an hour.
I've checked just one example, which is quite messed up:
TakeasanexamplethesetupinFigure2,whereaballcaxn,xsetup2oRfSercetpiorens3en.1t,thceauosbaslerfvaacttiornswatithimouetsateupntiqanudesetoTfakeasanexamplethes1e0tu2pin3F.2ig.uLrea2r,nwinhgerweitahbIanlltecravnentionsoverTime
t
t+
t t+1 t+1 106 Note that when two variables Ci and Cj can only be inter-
We consider a dataset D of tuples {x , x , I } where
I'm reopening this to follow up on my last comment.
The duplicated interval is avoided by narrowing the search space of indexOf, reducing the list of tokens after each match.
However, I noticed that labels2notes does not use the identifier, which should be unique from the notes' point of view.
I'm submitting a PR with two fixes:
- avoid collecting the same position in the text when the note label is the same. For example, if we have "This note1, and this note2, but back to the first note1", we would collect the offset of the first "1" label twice.
- update labels2notes so that it uses the identifier instead.
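As a rough illustration of the first fix, the sketch below narrows the search window after every match, so a repeated label resolves to its next occurrence rather than the first one. This is only one reading of the description, not the actual patch: the class NoteOffsetSearch, the method findLabelPositions, and the assumption that labels arrive in reading order are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NoteOffsetSearch {

    // Hypothetical helper: resolves each label (given in reading order) to a
    // token index, searching only the tokens after the previous match.
    static List<Integer> findLabelPositions(List<String> tokens, List<String> labels) {
        List<Integer> positions = new ArrayList<>();
        int searchStart = 0;
        for (String label : labels) {
            // subList is a view, so indexOf returns an index relative to searchStart.
            int relative = tokens.subList(searchStart, tokens.size()).indexOf(label);
            if (relative < 0) {
                continue; // label not found in the remaining tokens: skip it
            }
            int absolute = searchStart + relative;
            positions.add(absolute);
            searchStart = absolute + 1; // the next search starts after this match
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("This", "note", "1", ",", "and", "this", "note", "2",
                ",", "but", "back", "to", "the", "first", "note", "1");
        System.out.println(findLabelPositions(tokens, List.of("1", "2", "1"))); // prints [2, 7, 15]
    }
}
```

With the moving start offset, the second "1" maps to index 15 instead of repeating index 2. The second fix (using the identifier rather than the label in labels2notes) is independent of this and is not shown here.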