mozilla/bleach

Open bracket '<' still cleaned up without closing bracket

Closed · 1 comment

Describe the bug

Thanks for the fix provided in #705!
I think I found a regression after that fix.

  • Python Version: [e.g. 3.9.6]
  • Bleach Version: [e.g. 6.1.0]

To Reproduce

Steps to reproduce the behavior:

# Working!
In [5]: bleach.clean("<test abc")
Out[5]: '&lt;test abc'
# Doesn't work (because of duplicated words?)
In [6]: bleach.clean("<test abc abc")
Out[6]: ''
# However, this works
In [12]: bleach.clean("<test abc abd")
Out[12]: '&lt;test abc abd'
# Doesn't work (with a trailing space)
In [7]: bleach.clean("<test abc ")
Out[7]: ''
# Doesn't work (with a trailing space and a leading prefix)
In [8]: bleach.clean("asd<test abc ")
Out[8]: 'asd'
# However, this works
In [9]: bleach.clean("asd<test abc asd")
Out[9]: 'asd&lt;test abc asd'

Expected behavior

# Duplicated words: the text should be escaped, not dropped
In [6]: bleach.clean("<test abc abc")
Out[6]: '&lt;test abc abc'
# Trailing space: the text should be escaped, not dropped
In [7]: bleach.clean("<test abc ")
Out[7]: '&lt;test abc '
# Trailing space with a leading prefix: the text should be escaped, not dropped
In [8]: bleach.clean("asd<test abc ")
Out[8]: 'asd&lt;test abc '

willkg commented

Thank you for putting so much effort into this bug report--I really appreciate it!

I think there are a couple of issues here:

  1. It looks like the duplicate token does affect things. It kicks up two parse errors and then everything goes sideways:
    {'type': 7, 'data': 'eof-in-attribute-name'}
    {'type': 7, 'data': 'duplicate-attribute'}
  2. It looks like we need to handle another parse error case:
    {'type': 7, 'data': 'expected-end-of-tag-but-got-eof'}

We'll need to fix each issue separately. I'll see what I can do.
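
As a rough illustration of the token dumps above, the sketch below picks the parse errors out of a list of token dicts shaped like the ones quoted in this comment. It assumes that type `7` marks a parse-error token, as it does in the dumps; `parse_error_names`, `hit_eof_inside_tag`, and `EOF_IN_TAG_ERRORS` are hypothetical helpers for inspection only, not Bleach or html5lib API, and this is not the fix itself.

```python
# Token type shown for parse errors in the dumps above (assumption: 7
# always means "parse error" in the token stream being inspected).
PARSE_ERROR = 7

# Parse-error names from this thread that indicate the input ended
# while the tokenizer was still inside a tag.
EOF_IN_TAG_ERRORS = {
    "eof-in-attribute-name",
    "expected-end-of-tag-but-got-eof",
}


def parse_error_names(tokens):
    """Return the 'data' value of every parse-error token, in order."""
    return [tok["data"] for tok in tokens if tok.get("type") == PARSE_ERROR]


def hit_eof_inside_tag(tokens):
    """Return True if any parse error signals EOF inside an unclosed tag."""
    return any(name in EOF_IN_TAG_ERRORS for name in parse_error_names(tokens))
```

Feeding it the two tokens from the first case gives `['eof-in-attribute-name', 'duplicate-attribute']`, which shows why the duplicate-word input is affected differently: it raises a second error on top of the EOF one.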