mozilla/bleach

feature: strip all URLs

jvanasco opened this issue · 2 comments

It does not seem possible to strip all URLs with Bleach.

For example, the closest we can get from the docs is:

import bleach
def remove_it(attrs, new=False):
    return None
payloads = (
    'a <a href="http://example.com/outer">https://example.com/inner</a> b',
    "a https://example.com/bare b",
)
for payload in payloads:
    print("=====")
    result = bleach.linkify(payload, callbacks=[remove_it])
    print(result)
    result = bleach.clean(payload, protocols=[])
    print(result)

However, the result is:

=====
a https://example.com/inner b
a <a>https://example.com/inner</a> b
=====
a https://example.com/bare b
a https://example.com/bare b

While the desired result is simply:

=====
a  b
a  b
=====
a  b
a  b

In many situations involving user-generated content, it is desirable to prevent any URLs whatsoever, even ones rendered as plain text. Currently, this must be handled outside of bleach in a separate processing step. Filtering them out within bleach would be preferable, since bleach has already parsed the URLs at that point.

I think you need to write a new filter. I bet you could base it on the current LinkifyFilter but change this part here:

bleach/bleach/linkifier.py

Lines 316 to 332 in 4f951d3

if attrs is None:
    # Just add the text--but not as a link
    new_tokens.append(
        {"type": "Characters", "data": match.group(0)}
    )
else:
    # Add an "a" tag for the new link
    _text = attrs.pop("_text", "")
    new_tokens.extend(
        [
            {"type": "StartTag", "name": "a", "data": attrs},
            {"type": "Characters", "data": str(_text)},
            {"type": "EndTag", "name": "a"},
        ]
    )
end = match.end()

Does that help?
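To make the suggestion concrete, here is a minimal sketch of the token-level logic such a filter might use. It is not bleach code: `strip_urls` and `text_of` are hypothetical names, the token dicts are written by hand to mimic the html5lib-style tokens shown in the snippet above, and `URL_RE` is a deliberately simple stand-in for bleach's own URL regex. Instead of re-adding the matched text as a `Characters` token, it drops the match, and it also drops existing `<a>` elements together with their contents.

```python
import re

# Assumption: a deliberately simple pattern, for illustration only.
# bleach's linkifier uses a far more thorough URL regex.
URL_RE = re.compile(r"https?://\S+")


def strip_urls(tokens):
    """Yield tokens with <a> elements and bare URL text removed.

    Hypothetical helper: expects html5lib-style token dicts.
    """
    depth = 0  # how many <a> elements we are currently inside
    for token in tokens:
        kind = token["type"]
        name = token.get("name")
        if name == "a" and kind == "StartTag":
            depth += 1
            continue
        if name == "a" and kind == "EndTag":
            depth = max(depth - 1, 0)
            continue
        if depth:
            # Drop everything inside a link, including its text.
            continue
        if kind == "Characters":
            text = URL_RE.sub("", token["data"])
            if not text:
                continue
            # Copy the token rather than mutating the caller's dict.
            token = dict(token, data=text)
        yield token


def text_of(tokens):
    """Join the text of character tokens (hypothetical helper for the demo)."""
    return "".join(
        t["data"] for t in tokens if t["type"] in ("Characters", "SpaceCharacters")
    )


# Hand-written token streams approximating the two payloads from the report.
linked = [
    {"type": "Characters", "data": "a "},
    {"type": "StartTag", "name": "a",
     "data": {(None, "href"): "http://example.com/outer"}},
    {"type": "Characters", "data": "https://example.com/inner"},
    {"type": "EndTag", "name": "a"},
    {"type": "Characters", "data": " b"},
]
bare = [{"type": "Characters", "data": "a https://example.com/bare b"}]

print(repr(text_of(strip_urls(linked))))
print(repr(text_of(strip_urls(bare))))
```

Inside bleach, a generator like this could presumably be wrapped in a subclass of `bleach.html5lib_shim.Filter` (the base class `LinkifyFilter` uses), with `__iter__` passing the parent's token stream through it, and then handed to `bleach.sanitizer.Cleaner(filters=[...])`. That wiring is an untested assumption here, and a real filter would likely reuse bleach's URL regex rather than the simple pattern above.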

I'm assuming that helped. Closing this out.