yacy/yacy_search_server

/date search is sorted by index time, not by date published

okybaca opened this issue · 1 comments

The /date search sorts results by the date a page was indexed, not the date it was published. If I crawl a huge news site, every page, even the truly historical ones (the NYTimes archive goes back to the 19th century, for example), is dated “today”.

Because I mostly search news, this is a showstopper for me.

Furthermore, some entries in the /date search results are dated in the future, which is undesired behavior.

Would it be possible to apply some heuristics to determine the real publication date, probably using a combination of metadata?
Other search engines manage this somehow.
Could the /date operator be switched to use the publication time instead of the indexing date?
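
To make the idea concrete, here is a rough sketch of the kind of heuristic I mean. It is written in Python only for brevity and is not YaCy code. It assumes the page exposes an ISO 8601 date in one of a few common metadata conventions (Open Graph article:published_time, schema.org datePublished, Dublin Core, a plain "date" meta tag, or a <time datetime> element). The key list, the function name guess_published_date and the fallback rule are my own assumptions. Anything missing, unparsable, or later than the indexing time falls back to the indexing date, which would also take care of the future-dated entries.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Metadata keys commonly used for a publication date, in priority order.
# Assumption for illustration only; not a list of what YaCy reads today.
CANDIDATE_KEYS = (
    "article:published_time",  # Open Graph
    "datepublished",           # schema.org itemprop (lowercased)
    "dc.date.issued",          # Dublin Core (lowercased)
    "date",
)


class _DateMetaParser(HTMLParser):
    """Collects date-like values from <meta> and <time> tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            key = (a.get("property") or a.get("name") or a.get("itemprop") or "").lower()
            if key in CANDIDATE_KEYS and a.get("content"):
                self.found.setdefault(key, a["content"])  # keep first occurrence
        elif tag == "time" and a.get("datetime"):
            self.found.setdefault("time", a["datetime"])


def _parse_iso(value):
    """Parse an ISO 8601 string into an aware UTC datetime, or return None."""
    try:
        parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return parsed.astimezone(timezone.utc)


def guess_published_date(html, indexed_at):
    """Return the best-guess publication date, else the indexing date.

    indexed_at must be a timezone-aware datetime.
    """
    parser = _DateMetaParser()
    parser.feed(html)
    for key in CANDIDATE_KEYS + ("time",):
        value = parser.found.get(key)
        parsed = _parse_iso(value) if value else None
        # A page cannot have been published after it was indexed.
        if parsed is not None and parsed <= indexed_at:
            return parsed
    return indexed_at
```

Libraries such as the ones linked below go much further (URL patterns, free-text dates in the body), so this only shows the metadata part and the fallback.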

Some examples of date extraction from other software:

https://lyndon.codes/2019/05/03/extracting-date-times/
problem analysis + java

https://htmldate.readthedocs.io/en/latest/
python library

https://github.com/agnelvishal/newspaper/blob/master/newspaper/extractors.py
(python, line 192 onwards)

Also mentioned in #193

This is probably a question for @Orbiter: is there a field in the Solr schema intended for storing the publication date?
If not, what should it be named? published_date_dt?