alephdata/memorious

Different behaviour between FakeRedis and real redis

simonwoerpel opened this issue · 2 comments

Hey,

I encountered an issue when writing a recursive crawler. By recursive I mean:

  • fetch a page
  • parse urls on it
  • emit these again to fetch
  • parse again for urls
  • emit these again to fetch
  • parse again for urls
  • emit again for fetch

the config section looks like this:

  fetch:
    method: fetch
    handle:
      pass: parse

  parse:
    method: parse
    params:
      store:
        mime_group: documents
      include_paths:
        - ".//div[@class='artikel']"  # find urls for 1st iteration
        - ".//div[@class='archiveArticleInfo']/ul/li[1]"  # find urls for 2nd iteration
        - ".//div[@id='buttons']/div[@class='save']"  # find urls for 3rd iteration (these are never emitted in debug mode, but in deployed mode)
    handle:
      fetch: fetch
      store: store

In this scenario, memorious in debug mode (via `memorious run my_crawler`) never fetches anything in the third iteration, but when built and run via Docker, it does.

Of course it would be great if crawler execution behaved exactly the same way in local development 🙃

@pudo and I had a short discussion about this, and we came up with the idea that it has something to do with fakeredis vs. "real" redis: the data dictionary is sometimes altered in place in the code, which has different side effects when using fakeredis than when using the real redis...
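To illustrate the suspected difference, here is a minimal sketch (function and queue names are made up, not memorious APIs): an in-process queue hands the same dict object to every downstream consumer, so later in-place mutations leak between stages, while a redis-backed queue serializes the payload on the way in, freezing a snapshot.

```python
import json

# Hypothetical in-process queue, as a stand-in for how tasks might be
# passed around in debug mode: it stores a reference, not a copy.
in_process_queue = []

def emit_in_process(data):
    in_process_queue.append(data)

# Hypothetical redis-like queue: data is serialized (e.g. to JSON) on
# the way in, so each consumer later gets an independent copy.
redis_like_queue = []

def emit_serialized(data):
    redis_like_queue.append(json.dumps(data))

task = {"url": "https://example.org", "depth": 1}
emit_in_process(task)
emit_serialized(task)

task["depth"] = 2  # in-place mutation *after* emitting

print(in_process_queue[0]["depth"])              # 2 -- mutation is visible
print(json.loads(redis_like_queue[0])["depth"])  # 1 -- snapshot preserved
```

If memorious's debug runner behaves like the first queue and the redis-backed runner like the second, that would explain why the two modes diverge after a few recursive emits.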

I already tried doing a `data = data.copy()` before this line: https://github.com/alephdata/memorious/blob/master/memorious/operations/parse.py#L58 but it doesn't help.
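One possible reason the `data.copy()` attempt didn't help (an assumption on my part, not verified against the memorious code): `dict.copy()` is shallow, so any nested dicts inside the payload are still shared with the original object. A quick sketch, using a made-up payload shape:

```python
import copy

# Hypothetical task payload with a nested dict (field names assumed,
# not taken from memorious).
data = {"url": "https://example.org", "meta": {"depth": 1}}

shallow = data.copy()        # copies only the top-level keys
deep = copy.deepcopy(data)   # recursively copies nested structures

data["meta"]["depth"] = 2    # mutate a nested value in place

print(shallow["meta"]["depth"])  # 2 -- nested dict is still shared
print(deep["meta"]["depth"])     # 1 -- fully independent copy
```

So if the in-place alteration happens inside a nested structure, only a `copy.deepcopy` (or serializing/deserializing the payload) would actually isolate it.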

I am sure someone who knows the memorious codebase better (looking at you @sunu 😂) can point me in the right direction on how to fix this...

sunu commented

@simonwoerpel Can you check if the latest version fixes the issue for you?

sunu commented

Hoping the fix worked. Feel free to reopen otherwise.