internetarchive/fatcat

arXiv work missing

Closed this issue · 3 comments

Hi. I searched for "Sidewalk Measurements from Satellite Images: Preliminary Findings", a preprint found here: https://arxiv.org/abs/2112.06120

Is there a bot importing regularly from arXiv?

We do have this work, and it shows up in search results for me: https://fatcat.wiki/release/search?q=Sidewalk+Measurements+from+Satellite+Images+Preliminary+Findings

https://fatcat.wiki/release/tphkgaozxbedxnwf35nf5yxey4

The issue may be that when "Images:" (with the trailing colon) is included in a fatcat.wiki search, the query gets passed through to Elasticsearch/Lucene, which interprets the colon as field (facet/filter) syntax and returns no results: https://fatcat.wiki/release/search?q=Sidewalk+Measurements+from+Satellite+Images%3A+Preliminary+Findings&generic=1

Does that match what you experienced?

In scholar.archive.org, we have a kludge that tries to notice this pattern and add quotes around such tokens, but the implementation isn't very good, so I haven't copied it over. A "real" custom query parser is probably the solution, but that is a larger project to bite off. Added a note about that specific issue to #29.
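For illustration, here is a rough sketch of what such a quoting workaround could look like. This is not the scholar.archive.org code: the function name `quote_colon_tokens` and the `KNOWN_FIELDS` set are hypothetical, and the real list of searchable fields would need to come from the actual index schema.

```python
import re

# Assumption: a small, illustrative set of field names that users may
# legitimately filter on. The real set depends on the search index schema.
KNOWN_FIELDS = {"title", "year", "doi"}

# Split on whitespace, but keep already-quoted phrases together as one token.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')


def quote_colon_tokens(query: str) -> str:
    """Wrap colon-containing tokens in quotes unless they look like a
    deliberate field filter (e.g. year:2021) or are already quoted."""
    out = []
    for token in TOKEN_RE.findall(query):
        field, sep, _rest = token.partition(":")
        is_filter = bool(sep) and field.lower() in KNOWN_FIELDS
        if sep and not is_filter and not token.startswith('"'):
            # Escape any embedded quotes, then quote the whole token so the
            # colon is treated as literal text rather than field syntax.
            token = '"' + token.replace('"', '\\"') + '"'
        out.append(token)
    return " ".join(out)
```

With this sketch, the title from this issue would be rewritten so that only the problematic token is quoted, while a query such as `year:2021 sidewalk` is left untouched. It is still a heuristic: anything more robust probably needs the custom query parser mentioned above.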

Oh, and to answer the question: yes, a bot pulls new papers from arXiv every 24 hours using the OAI-PMH feed. New URLs are then enqueued for crawling, though arxiv.org often rate-limits our crawlers, so it can take a while for them to get archived and go through the entire indexing pipeline.
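As a rough idea of what a daily OAI-PMH pull involves, the sketch below only builds the request URL for records changed since a given date. The endpoint URL is an assumption (check arXiv's OAI-PMH documentation for the current one), `list_records_url` is a hypothetical helper, and this is not the actual fatcat import bot; `verb`, `metadataPrefix`, and `from` are standard OAI-PMH request parameters.

```python
from datetime import date
from urllib.parse import urlencode

# Assumption: the thread does not name the endpoint the bot uses.
ARXIV_OAI_ENDPOINT = "https://oaipmh.arxiv.org/oai"


def list_records_url(since: date, base_url: str = ARXIV_OAI_ENDPOINT) -> str:
    """Build an OAI-PMH ListRecords URL for records changed since `since`."""
    params = {
        "verb": "ListRecords",       # standard OAI-PMH verb
        "metadataPrefix": "oai_dc",  # Dublin Core, required by the protocol
        "from": since.isoformat(),   # day-granularity datestamp, YYYY-MM-DD
    }
    return f"{base_url}?{urlencode(params)}"
```

A harvester would fetch this URL once a day, follow any `resumptionToken` values in the responses, and back off when the server asks it to, which fits the rate-limiting noted above.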

Exactly fits the issue. Thanks for the fast response 🤩