/date search is sorted by index time, not by date published
okybaca opened this issue · 1 comments
/date search sorts results by the date a page was indexed, not the date it was published. If I crawl a huge news site, all pages, even the really historical ones (The New York Times has archives dating back to the 19th century, for example), are dated as "today".
Because I mainly search news, this is a showstopper for me.
Furthermore, in /date search results, some entries are dated even in the future, which is undesired behavior.
Would it be possible to apply some heuristics to determine the real publication date, probably using some combination of page metadata?
Other search engines do, somehow.
Is it possible to switch the /date operator to use the publication time instead of the indexing date?
Some examples of date extraction from other software:
https://lyndon.codes/2019/05/03/extracting-date-times/
problem analysis + java
https://htmldate.readthedocs.io/en/latest/
python library
https://github.com/agnelvishal/newspaper/blob/master/newspaper/extractors.py
(python, line 192 onwards)
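A rough sketch of the kind of metadata heuristic meant here, in Python with only the standard library. It is an illustration, not how YaCy or any of the linked projects actually work: the list of metadata keys, the function name `guess_published`, and the choice to treat timezone-less values as UTC are all assumptions made for this example. It looks for common `<meta>` date fields and `<time datetime="...">`, ignores any candidate that lies after the indexing time, and falls back to the indexing time when nothing usable is found.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Hypothetical priority list of metadata keys that often carry a publication date.
DATE_KEYS = (
    "article:published_time",
    "datepublished",
    "date",
    "dc.date",
    "dc.date.issued",
    "pubdate",
)


class _MetaDateParser(HTMLParser):
    """Collects the first value seen for each known date-bearing key."""

    def __init__(self):
        super().__init__()
        self.candidates = {}

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            key = (a.get("property") or a.get("name") or a.get("itemprop") or "").lower()
            if key in DATE_KEYS and a.get("content"):
                self.candidates.setdefault(key, a["content"])
        elif tag == "time" and a.get("datetime"):
            self.candidates.setdefault("time", a["datetime"])


def _parse_iso(value):
    """Parse an ISO 8601 date or datetime; return None if it does not parse."""
    value = value.strip()
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        dt = datetime.fromisoformat(value)
    except ValueError:
        return None
    if dt.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def guess_published(html, indexed_at):
    """Return a best-guess publication time, never later than indexed_at.

    indexed_at must be a timezone-aware datetime.
    """
    parser = _MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS + ("time",):
        raw = parser.candidates.get(key)
        dt = _parse_iso(raw) if raw else None
        # A page cannot have been published after it was indexed,
        # so future-dated candidates are skipped.
        if dt is not None and dt <= indexed_at:
            return dt
    return indexed_at
```

With something like this, an archive page carrying `<meta property="article:published_time" content="1896-08-18T00:00:00Z">` would sort under 1896 rather than under the crawl date, and a page claiming a date in the future would fall back to the indexing time instead of showing up as dated in the future.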
Also mentioned in #193