decrypto-org/spider

Check why certain pages were filtered out by the blacklist

Opened this issue · 2 comments

The latest counts showed that we filter out a large part of the content, either because it has a "wrong" mimetype (we should not even download those, see #23) or because the parser finds something that resembles base64. We have to crawl through a few of those pages to see whether the parser works correctly or whether there is some sort of bug.
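The exact heuristic the parser applies is not shown in this thread; a minimal sketch of the kind of check that could produce these matches, assuming a simple alphabet-plus-length regex (the regex and function name are illustrative, not the actual implementation):

```typescript
// Assumed heuristic: flag any long run of base64-alphabet characters,
// optionally followed by '=' padding. Real pages with inline images
// would trip this, because data URLs carry base64-encoded payloads.
const BASE64_RUN = /[A-Za-z0-9+/]{80,}={0,2}/;

export function looksLikeBase64(text: string): boolean {
    return BASE64_RUN.test(text);
}

// Example: a tiny inline PNG would be enough to blacklist the whole page.
// looksLikeBase64('<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">')
```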

So, we found the issue: we also filter out pages containing data URLs (URLs that embed their payload inline, typically base64-encoded). Such URLs often carry tiny image elements. We therefore suggest switching to just removing the image data from those elements. This way, the images would sit in volatile memory for a very short time, but without giving anybody access to them. Such behaviour is similar to a node routing Internet traffic: it has to store the content briefly, but it cannot be held responsible for any illegal content it may forward.
Further, we can whitelist things such as SVG images.
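A minimal sketch of the suggested behaviour, assuming the page body is available as an HTML string and that a regex pass over data URLs is acceptable (the function name, regex, and placeholder value are assumptions, not the project's actual code):

```typescript
// Assumed approach: blank out base64 image payloads in data URLs so the
// raw image bytes only exist in volatile memory while the page is being
// processed. SVG data URLs are whitelisted, since SVG is plain XML rather
// than an opaque binary blob.
const DATA_URL = /data:([\w/+.-]+)?(;base64)?,[A-Za-z0-9+/=%]*/g;

export function stripImageDataUrls(html: string): string {
    return html.replace(DATA_URL, (match, mime: string | undefined) => {
        if (mime && mime.startsWith("image/svg")) {
            return match;      // whitelist SVG images
        }
        if (mime && mime.startsWith("image/")) {
            return "data:,";   // drop the payload, keep a valid (empty) data URL
        }
        return match;          // leave non-image data URLs to the existing checks
    });
}
```

Run before the base64 check, this would mean only non-image base64 blobs could still get a page blacklisted.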