decrypto-org/spider

Check why certain pages were filtered out by the blacklist

Opened this issue · 2 comments

The latest counts showed that we filter out a large part of the content, either because it has a "wrong" mimetype (we should not even download those, see #23) or because the parser finds something that resembles base64. We have to crawl through a few of those pages to see whether the parser works correctly or whether there is some sort of bug.
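The exact heuristic the parser applies is not shown in this thread; a minimal sketch of the kind of check that could produce these matches, assuming a simple alphabet-plus-length regex (the regex and function name are illustrative, not the actual implementation):

```typescript
// Assumed heuristic: flag any long run of base64-alphabet characters,
// optionally followed by '=' padding. Real pages with inline images
// would trip this, because data URLs carry base64-encoded payloads.
const BASE64_RUN = /[A-Za-z0-9+/]{80,}={0,2}/;

export function looksLikeBase64(text: string): boolean {
    return BASE64_RUN.test(text);
}

// Example: a tiny inline PNG would be enough to blacklist the whole page.
// looksLikeBase64('<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">')
```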

So, we found the issue: we also filter out pages containing data URLs (URLs that embed their payload inline, typically base64-encoded). Such URLs often carry tiny image elements. We therefore suggest switching to just removing the image data from those elements. This way, the images would sit in volatile memory for a very short time, but without giving anybody access to them. Such behaviour is similar to a node routing Internet traffic: it has to store the content briefly, but it cannot be held responsible for any illegal content it may forward.
Further, we can whitelist things such as SVG images.
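A minimal sketch of the suggested behaviour, assuming the page body is available as an HTML string and that a regex pass over data URLs is acceptable (the function name, regex, and placeholder value are assumptions, not the project's actual code):

```typescript
// Assumed approach: blank out base64 image payloads in data URLs so the
// raw image bytes only exist in volatile memory while the page is being
// processed. SVG data URLs are whitelisted, since SVG is plain XML rather
// than an opaque binary blob.
const DATA_URL = /data:([\w/+.-]+)?(;base64)?,[A-Za-z0-9+/=%]*/g;

export function stripImageDataUrls(html: string): string {
    return html.replace(DATA_URL, (match, mime: string | undefined) => {
        if (mime && mime.startsWith("image/svg")) {
            return match;      // whitelist SVG images
        }
        if (mime && mime.startsWith("image/")) {
            return "data:,";   // drop the payload, keep a valid (empty) data URL
        }
        return match;          // leave non-image data URLs to the existing checks
    });
}
```

Run before the base64 check, this would mean only non-image base64 blobs could still get a page blacklisted.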