mozilla/bleach

Text starting with an open angle bracket '<' followed by several words is removed if there's no closing bracket

alyohea opened this issue · 1 comments

Describe the bug

Although #544 was fixed, the issue still persists. It can now be reproduced in a different way.

  • Python Version: 3.8.13
  • Bleach Version: 6.0.0

To Reproduce

Steps to reproduce the behavior:

# Fixed!
In [2]: bleach.clean("<random")
Out[2]: '&lt;random'

# Fixed!
In [3]: bleach.clean("random<text")
Out[3]: 'random&lt;text'

# Problem!
In [4]: bleach.clean("<random text")
Out[4]: ''

Expected behavior

In [4]: bleach.clean("<random text")
Out[4]: '&lt;random text'
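As a reference point, here is a minimal sketch of the expectation, assuming that input with no complete tag should come out the same as plain HTML escaping. It uses the standard library's html.escape only as a yardstick; it does not show how bleach works internally.

```python
from html import escape

# Inputs taken from the report; none of them contains a closing ">".
samples = ["<random", "random<text", "<random text"]

# Assumption: with no complete tag to strip, the expected result is
# the same as plain escaping of the input text.
expected = {text: escape(text, quote=False) for text in samples}

for text, cleaned in expected.items():
    print(f"{text!r} -> {cleaned!r}")
```

The first two values match what bleach.clean returns in the session above; the third is the value this report expects instead of an empty string.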

Additional context

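This was previously fixed by #667, which handles a < without a closing > when the tokenizer reports it as 'eof-in-tag-name'. In the case above, however, the tokenizer reports EOF in the attribute name ('eof-in-attribute-name'), so that branch is not taken and the text is dropped: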

        if last_error_token:
            if last_error_token["data"] == "eof-in-tag-name":
                # Handle the case where the text being parsed ends with <
                # followed by a series of characters. It's treated as a tag
                # name that abruptly ends, but we should treat that like
                # character data
                yield {
                    "type": TAG_TOKEN_TYPE_CHARACTERS,
                    "data": "<" + self.currentToken["name"],
                }
            else:
                yield last_error_token

    def consumeEntity(self, allowedChar=None, fromAttribute=False):
        # If this tokenizer is set to consume entities, then we can let the
        # superclass do its thing.
        if self.consume_entities:
            return super().consumeEntity(allowedChar, fromAttribute)
(Pdb) last_error_token
{'type': 7, 'data': 'eof-in-attribute-name'}
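One possible direction, sketched below as a standalone illustration rather than as bleach's actual code, is to treat 'eof-in-attribute-name' the same way as 'eof-in-tag-name'. The names flush_last_error, raw_tag_text, CHARACTERS, and PARSE_ERROR are hypothetical stand-ins. The sketch assumes the raw source text from the "<" to the end of input is available, because reusing "<" + the tag name would likely recover only '<random' and lose ' text'.

```python
# Hypothetical stand-ins; these are not bleach's real constants or API.
CHARACTERS = "Characters"
PARSE_ERROR = "ParseError"

# Parse errors that mean "input ended inside an unfinished tag".
EOF_IN_TAG_ERRORS = {"eof-in-tag-name", "eof-in-attribute-name"}


def flush_last_error(last_error_token, raw_tag_text):
    """Yield the token that should replace a trailing parse error.

    raw_tag_text is the unparsed source from "<" to the end of input.
    How to obtain it is an assumption left outside this sketch.
    """
    if last_error_token is None:
        return
    if last_error_token["data"] in EOF_IN_TAG_ERRORS:
        # Emit the unfinished tag as character data so that it can be
        # escaped later instead of being dropped.
        yield {"type": CHARACTERS, "data": raw_tag_text}
    else:
        # Any other parse error is passed through unchanged.
        yield last_error_token


tokens = list(
    flush_last_error(
        {"type": PARSE_ERROR, "data": "eof-in-attribute-name"},
        "<random text",
    )
)
print(tokens)
```

With this handling, the input from the report yields a single characters token containing '<random text', which a later escaping step could turn into '&lt;random text'.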
willkg commented

Thank you for writing this up!