cblgh/lieu

Crawler indexes all pages on a particular domain rather than only pages under a path

Amolith opened this issue · 1 comment

When running Lieu over all the sites in the fediring, we've found that the crawler is scoped only by domain rather than by domain+path. This causes quirks with static site hosts like cronut.cafe: the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.
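
To make that concrete, here's a rough sketch in Go of what such a check could look like. This is not Lieu's actual crawler code: `scopePrefix`, `inScope`, and `mustParse` are names I made up for illustration, and the "last segment has a file extension, so treat it as a file" rule is just a heuristic I'm assuming. It only uses the standard library.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// scopePrefix (hypothetical helper) returns the directory of the seed URL's
// path, always ending in "/". Heuristic assumption: a final segment with a
// file extension (e.g. index.html) is a file, anything else is a directory.
func scopePrefix(seed *url.URL) string {
	p := seed.Path
	if path.Ext(p) != "" {
		p = path.Dir(p)
	}
	p = path.Clean("/" + p)
	if !strings.HasSuffix(p, "/") {
		p += "/"
	}
	return p
}

// inScope (hypothetical helper) reports whether link is on the same host as
// seed and sits at or below the seed's directory. Links to parent or sibling
// directories are rejected.
func inScope(seed, link *url.URL) bool {
	if !strings.EqualFold(seed.Hostname(), link.Hostname()) {
		return false
	}
	prefix := scopePrefix(seed)
	// Clean the link path so "." and ".." segments can't slip past the check.
	linkPath := path.Clean("/" + link.Path)
	return linkPath == strings.TrimSuffix(prefix, "/") ||
		strings.HasPrefix(linkPath, prefix)
}

// mustParse is a small convenience wrapper for the example only.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	seed := mustParse("https://cronut.cafe/~sfr/")
	links := []string{
		"https://cronut.cafe/~sfr/blog/post.html",
		"https://cronut.cafe/~other/",
		"../~other/", // relative link pointing at a sibling user
		"https://cronut.cafe/",
	}
	for _, raw := range links {
		// Resolve relative links against the seed before checking scope.
		link := seed.ResolveReference(mustParse(raw))
		fmt.Printf("%-45s in scope: %v\n", link, inScope(seed, link))
	}
}
```

With a check like this, seeding the crawler with https://cronut.cafe/~sfr/ would keep it inside ~sfr's pages, while a seed that is just a bare domain would behave the same as today, since its prefix is "/".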

cblgh commented

@Amolith thanks for the issue! the thing i went with initially was the notion of filtered sites, filtering out webring domains which appeared to crowd out overall useful results. I'll look into changing things to move away from strictly using domains :)