Different behaviour between FakeRedis and real redis
simonwoerpel opened this issue · 2 comments
Hey,
I encountered an issue while writing a recursive crawler. By recursive, I mean:
- fetch a page
- parse URLs on it
- emit these URLs to fetch again
- parse again for URLs
- emit again to fetch
- parse again for URLs
- emit again to fetch
The config section looks like this (indentation reconstructed):

```yaml
fetch:
  method: fetch
  handle:
    pass: parse
parse:
  method: parse
  params:
    store:
      mime_group: documents
    include_paths:
      - ".//div[@class='artikel']"  # find urls for 1st iteration
      - ".//div[@class='archiveArticleInfo']/ul/li[1]"  # find urls for 2nd iteration
      - ".//div[@id='buttons']/div[@class='save']"  # find urls for 3rd iteration (these are never emitted in debug mode, but in deployed mode)
  handle:
    fetch: fetch
    store: store
```
In this scenario, memorious in debug mode (via `memorious run my_crawler`) never fetches in the third iteration, but when built via docker and run there, it does.
Of course it would be great if crawler execution behaved exactly the same way in local development 🙃
@pudo and I had a short discussion about it, and we came up with the idea that it has something to do with fakeredis vs. "real" redis: the `data`
dictionary is sometimes altered in place in the code, which has different side effects when using fakeredis than when using real redis...
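To illustrate the hypothesis, here is a minimal sketch (not memorious code; the function and variable names are invented): an in-process queue hands the very same dict object to the next stage, while a redis-backed queue serializes on enqueue, so the consumer gets an independent snapshot and later in-place mutations don't leak into it.

```python
import json

def enqueue_in_process(queue, data):
    # Debug-mode analogue: the same dict object is shared with the consumer.
    queue.append(data)

def enqueue_serialized(queue, data):
    # Redis analogue: the payload is serialized at enqueue time,
    # so the consumer later deserializes an independent copy.
    queue.append(json.dumps(data))

data = {"url": "https://example.org/page-1"}
in_process, serialized = [], []
enqueue_in_process(in_process, data)
enqueue_serialized(serialized, data)

# A later stage mutates the dict in place...
data["url"] = "https://example.org/page-2"

print(in_process[0]["url"])              # page-2: the mutation leaked
print(json.loads(serialized[0])["url"])  # page-1: the snapshot is unaffected
```

If memorious's debug runner executes handlers synchronously with shared dicts while the deployed runner round-trips them through redis, that would explain the divergent behaviour.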
I already tried adding a `data = data.copy()`
before this line: https://github.com/alephdata/memorious/blob/master/memorious/operations/parse.py#L58 but this doesn't help.
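One possible reason the copy didn't help, sketched with an invented example (the keys shown are hypothetical, not the actual memorious `data` layout): `dict.copy()` is shallow, so nested structures are still shared with the original and in-place mutation of them leaks through. A `copy.deepcopy()` would isolate them.

```python
import copy

# Hypothetical data dict with a nested sub-dict.
data = {"url": "https://example.org", "context": {"seen": ["a"]}}

shallow = data.copy()
shallow["context"]["seen"].append("b")
print(data["context"]["seen"])  # ['a', 'b']: the nested list is shared

deep = copy.deepcopy(data)
deep["context"]["seen"].append("c")
print(data["context"]["seen"])  # still ['a', 'b']: deep copy is independent
```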
I am sure someone who knows the memorious codebase better (looking at you @sunu 😂) can point me in the right direction on how to fix this...
@simonwoerpel Can you check if the latest version fixes the issue for you?
Hoping the fix worked. Feel free to reopen otherwise.