jjjake/internetarchive

--fts ignores --parameters, --field, --sort

Opened this issue · 5 comments

Hi,

I am doing ia search --parameters="..."

...but I do not know what parameters it accepts.

Is there a list or documentation anywhere?

My goal is to return a small number of results sorted by most recently "added" first.

  • on the website that is sort=-publicdate
  • and in advanced search it is sort createdate desc
  • this page says sort_by=-addeddate

But those do not seem to work with ia search, or maybe I am doing it wrong?

I have also tried

  • ia search --parameters="rows=10" --sort="addeddate desc" "hanafuda"
  • ia search --parameters="rows:10" --sort="created_on desc" "hanafuda"

Any help appreciated.

Thanks!

OK, I figured it out and support seems to be missing, so I will rename the issue.

ia search 'hanafuda' --parameters rows:10 --field addeddate --sort "addeddate desc"

  • returns expected results (GOOD)

But...

ia search 'hanafuda' --fts --parameters rows:10 --field addeddate --sort "addeddate desc"

  • returns more rows than requested (BAD)
  • returns unsorted results (BAD)

I am using:

  • pip install internetarchive
  • version 3.4.0

jjjake commented

The confusion here is that ia search uses different endpoints depending on the options given. It uses the Scrape API by default, Advanced Search when either the rows or page parameter is specified, and our beta FTS API when either --fts or --dsl-fts is specified.

The reasoning is that the Advanced Search API is not designed for scraping/retrieving full result sets (it's capable of doing so, but it's not designed for it), whereas the Scrape API is designed for dumping full result sets. I assume that most people want full result sets when using ia search, and that's why the Scrape API is the default. When a user specifies that they only want a subset of the results (i.e. via the page or rows params), Advanced Search is used.
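To make the rule above concrete, here is a minimal sketch of the selection logic as described in this comment. It is not the code from the internetarchive package: the function name `choose_endpoint` and the labels it returns are hypothetical, and the precedence of --fts over rows/page is inferred from the behaviour reported at the top of this issue.

```python
from typing import Mapping, Optional


def choose_endpoint(params: Optional[Mapping[str, object]] = None,
                    fts: bool = False,
                    dsl_fts: bool = False) -> str:
    """Return which search backend the rule described above would pick.

    The labels "fts", "advancedsearch" and "scrape" exist only in this
    sketch; they are not identifiers from the internetarchive package.
    """
    params = params or {}
    # --fts / --dsl-fts select the beta full-text search API.
    # Assumption: this wins over rows/page, as the report above suggests.
    if fts or dsl_fts:
        return "fts"
    # Asking for a subset of results (rows or page) selects Advanced Search.
    if "rows" in params or "page" in params:
        return "advancedsearch"
    # Otherwise the full result set is dumped via the Scrape API.
    return "scrape"
```

Under these assumptions, `choose_endpoint({"rows": 10})` picks Advanced Search, while `choose_endpoint({"rows": 10}, fts=True)` picks the FTS API, which would explain why rows appeared to be ignored in the second command above.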

Then there's the FTS API. This is in beta, is not currently documented publicly, and is subject to change. The specific parameter you're after though is size as opposed to rows:

» ia search 'hanafuda' --fts --parameters size:10 | wc -l
      10

--field is not currently supported with --fts; all indexed fields are returned by default. addeddate is not returned, but publicdate is (under .fields.meta_publicdate). Sorting is not supported in the beta FTS API at this time.
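Until sorting is supported, one possible workaround is to sort the returned hits on the client. The sketch below assumes each hit has been parsed into a dict with the date under `fields` then `meta_publicdate`, and that the value is either an ISO 8601 string or a list of such strings; the exact shape is an assumption, since the FTS API is undocumented. The helper names are hypothetical.

```python
from typing import Any, Dict, List


def publicdate_of(hit: Dict[str, Any]) -> str:
    """Extract meta_publicdate from one FTS hit, or "" if it is missing."""
    value = hit.get("fields", {}).get("meta_publicdate", "")
    # Assumption: the value is a string or a list of strings.
    if isinstance(value, list):
        value = value[0] if value else ""
    return str(value)


def newest_first(hits: List[Dict[str, Any]], limit: int = 10) -> List[Dict[str, Any]]:
    """Sort hits by publicdate, newest first, and keep at most `limit`."""
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    return sorted(hits, key=publicdate_of, reverse=True)[:limit]
```

Note that this only reorders the hits that were actually returned. Combined with a small size value it yields the newest of those hits, not necessarily the newest matches in the whole index.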

Sorry for the confusion. We hope to consolidate these endpoints in the future!

Thanks @jjjake, very informative. I'll keep an eye on progress.

It seems very wasteful to query the whole set when I only want the X most recent items (for example, any new items since the last time I ran the query). But maybe I'm overthinking it!? I prefer to keep things lean and save time and electricity on this earth.

chgans commented

The "beta FTS API" doesn't seem to point to the right endpoint.
The results from "ia search" are not the same as those from https://archive.org/search?query=...
The JS on that page uses https://archive.org/services/search/beta/page_production/, which returns cleaner results.

Is there any plan to switch to that endpoint?

jjjake commented

@chgans be-api.us.archive.org/ia-pub-fts-api is the current recommendation from the developers of our FTS beta API. We do hope to consolidate our search endpoints in the future though. Thanks for checking!