Open angle bracket '<' followed by a few words is stripped if there's no closing bracket
alyohea opened this issue · 1 comments
alyohea commented
Describe the bug
After #544 was fixed, the issue still seems to persist, but it is now reproducible in a different way.
- Python Version: 3.8.13
- Bleach Version: 6.0.0
To Reproduce
Steps to reproduce the behavior:
# Fixed!
In [2]: bleach.clean("<random")
Out[2]: '<random'
# Fixed!
In [3]: bleach.clean("random<text")
Out[3]: 'random<text'
# Problem!
In [4]: bleach.clean("<random text")
Out[4]: ''
Expected behavior
In [4]: bleach.clean("<random text")
Out[4]: '<random text'
Additional context
Previously this was fixed by #667, so that a < without a > is treated as eof-in-tag-name, but in the case above it is treated as EOF in the attribute name, 'eof-in-attribute-name':
392             if last_error_token:
393 B->             if last_error_token["data"] == "eof-in-tag-name":
394                     # Handle the case where the text being parsed ends with <
395                     # followed by a series of characters. It's treated as a tag
396                     # name that abruptly ends, but we should treat that like
397                     # character data
398                     yield {
(Pdb)
399                         "type": TAG_TOKEN_TYPE_CHARACTERS,
400                         "data": "<" + self.currentToken["name"],
401                     }
402                 else:
403                     yield last_error_token
404
405         def consumeEntity(self, allowedChar=None, fromAttribute=False):
406             # If this tokenizer is set to consume entities, then we can let the
407             # superclass do its thing.
408             if self.consume_entities:
409                 return super().consumeEntity(allowedChar, fromAttribute)
(Pdb) last_error_token
{'type': 7, 'data': 'eof-in-attribute-name'}
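To make the gap concrete, here is a minimal standalone sketch of how the branch at line 393 might be widened so that 'eof-in-attribute-name' is handled the same way as 'eof-in-tag-name'. This is not Bleach's code: recover_trailing_error, EOF_IN_TAG_ERRORS, the numeric constants and the raw_tag_text argument are all hypothetical names for illustration. In particular, it assumes the tokenizer can hand back the raw source text starting at the '<', because "<" + self.currentToken["name"] alone would give '<random' and lose ' text'.

```python
# Hypothetical stand-ins for illustration only; in the real tokenizer this
# state lives on `self` and the constants come from the parser library.
CHARACTERS = 1   # assumed numeric value for a "characters" token
PARSE_ERROR = 7  # matches the 'type': 7 seen in the Pdb output above

# EOF errors meaning "the input ended inside a tag that was never closed".
# Only these two appear in this report; others may need the same treatment.
EOF_IN_TAG_ERRORS = {"eof-in-tag-name", "eof-in-attribute-name"}


def recover_trailing_error(last_error_token, raw_tag_text):
    """Return the token to emit for the final parse error.

    raw_tag_text is assumed to be the unparsed source from the opening
    '<' to the end of the input, for example '<random text'.
    """
    if last_error_token["data"] in EOF_IN_TAG_ERRORS:
        # Treat the unfinished tag as plain character data instead of
        # letting it be dropped.
        return {"type": CHARACTERS, "data": raw_tag_text}
    # Any other parse error is passed through unchanged.
    return last_error_token


error = {"type": PARSE_ERROR, "data": "eof-in-attribute-name"}
token = recover_trailing_error(error, "<random text")
print(token["data"])
```

With a branch like this, the text would reach the sanitizer as character data, so it could be escaped and kept rather than removed.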
willkg commented
Thank you for writing this up!