alephdata/memorious

Normalized URLs and non-reproducibility

Closed this issue · 3 comments

Crawling non-normalized (fragile) URLs (#87) is possible thanks to commit e595d18.
However, normalization is still applied to the URL that is used as the storage key and written to the JSON metadata file, and there it cannot be disabled.
This means that the URL stored in the JSON file does not give access to the page that was actually crawled, which effectively prevents manual inspection and reproducibility.
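
To make the problem concrete, here is a minimal sketch (not memorious's actual code) of a typical normalization that sorts query parameters, and how it breaks the round trip between the stored URL and the crawled one:

```python
# Minimal sketch, assuming normalization sorts query parameters;
# this is NOT the actual memorious implementation.
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url):
    """Canonicalize a URL by sorting its query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

crawled = "http://example.com/doc?page=2&id=42"
stored = normalize_url(crawled)
print(stored)  # http://example.com/doc?id=42&page=2
# A fragile server that is sensitive to parameter order may not serve
# the crawled page when `stored` is requested later, so the metadata
# no longer points at a reproducible URL.
```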

I'm willing to write the required fixes, but I'd like to know beforehand how you would like to see this implemented, as my last PR #88 didn't fit your vision :-)

pudo commented

So you propose we get rid of normalization entirely? I'm game with that...

moreymat commented

@pudo I like the idea of normalizing URLs to avoid crawling and storing duplicate copies of the same page.
What surprised me is how fragile some web servers can be: permutations of query parameters are not always spurious, and some servers respond differently (or not at all) depending on parameter order.

So I don't know whether we should get rid of normalization entirely, but I definitely need to be able to disable it for a number of websites, and removing it would indeed solve that.
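
Concretely, the opt-out I have in mind looks something like this sketch; `normalize_urls` is a hypothetical per-crawler setting, not an existing memorious option:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def normalize_url(url):
    # Same assumed parameter-sorting normalization as in the sketch above.
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

def url_key(url, normalize_urls=True):
    """Key under which a crawled page is deduplicated and stored."""
    if normalize_urls:
        return normalize_url(url)  # collapse parameter-order duplicates
    return url  # keep the exact URL that was fetched, for reproducibility

# With normalization disabled, the stored key is the crawled URL itself:
assert url_key("http://example.com/doc?page=2&id=42",
               normalize_urls=False) == "http://example.com/doc?page=2&id=42"
```

That would keep deduplication for well-behaved sites while letting fragile ones keep their exact URLs.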

sunu commented

Closing this since I have removed URL normalization completely in 4c80713. Let us know if anything else is broken, @moreymat.