python/cpython

Reject invalid escape sequences (and octal escape sequences) in bytes and Unicode strings

In Python 3.6, invalid escape sequences were deprecated in string literals (bytes and str): issue #71551, commit 110b6fe.

What's New in Python 3.6: Deprecated Python behavior:

A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. (Contributed by Emanuel Barry in bpo-27364.)

I propose to now raise a SyntaxError rather than a DeprecationWarning (which is silent in most cases).

Example:

$ python3 -W default -c 'print(list("\z"))'
<string>:1: DeprecationWarning: invalid escape sequence '\z'
['\\', 'z']
$ python3 -W default -c 'print(list(b"\z"))'
<string>:1: DeprecationWarning: invalid escape sequence '\z'
[92, 122]

Note: The Python REPL used to swallow some DeprecationWarnings, which made manual testing harder. This was fixed last month by commit 426d72e in issue gh-96052.
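For manual testing outside the REPL, one option is to compile a snippet while recording warnings. This is a minimal sketch, not part of the PR: the helper name `escape_warnings` is hypothetical, and matching on the substring "escape sequence" is an assumption, since the exact wording and category of the warning vary between Python versions.

```python
import warnings


def escape_warnings(source):
    """Compile `source` and return (category name, message) pairs
    for warnings that mention an escape sequence."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including those that are silent by default.
        warnings.simplefilter("always")
        compile(source, "<snippet>", "exec")
    return [
        (w.category.__name__, str(w.message))
        for w in caught
        if "escape sequence" in str(w.message)
    ]


# The source text compiled here is:  s = "\z"
# The category is DeprecationWarning or SyntaxWarning, depending on the version.
print(escape_warnings('s = "\\z"'))

# The source text compiled here is:  s = r"\z"  (raw string, no warning)
print(escape_warnings('s = r"\\z"'))
```

If a future Python version turns the warning into an error, `compile()` would raise SyntaxError instead, and the helper would need a try/except around it.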


Python 3.11 now emits a deprecation warning for invalid octal escape sequences (issue gh-81548):

Octal escapes in string and bytes literals with value larger than 0o377 now produce DeprecationWarning. In a future Python version they will be a SyntaxWarning and eventually a SyntaxError. (Contributed by Serhiy Storchaka in gh-81548.)

Example:

$ python3.11 -Wdefault -c 'print(list(b"\777"))'
<string>:1: DeprecationWarning: invalid octal escape sequence '\777'
[255]

I created PR #98404 to implement this change.

This change should help to catch some mistakes in regular expressions and Windows paths.

Example with PR #98404 which now raises SyntaxError:

>>> import re
>>> re.findall('.py\B', '1python 2pyc 3pyo 4py')
SyntaxError: invalid escape sequence '\B'
>>> wrong_path = "C:\Program Files\Python\python.exe"
SyntaxError: invalid escape sequence '\P'

Raw strings r'...' should be used instead:

>>> import re
>>> re.findall(r'.py\B', '1python 2pyc 3pyo 4py')
['1py', '2py', '3py']
>>> wrong_path = r"C:\Program Files\Python\python.exe"
>>> wrong_path
'C:\\Program Files\\Python\\python.exe'

See also discussion in #77093, and the reasons that the SyntaxWarning change was rolled back.

When this warning was introduced, there were no plans to make it an error in the near future. It was planned as a warning with a very long deprecation period.

The peculiarity of this warning is that it is not emitted for code that actually contains this kind of bug, only for code that does not. That code may, however, sit close to code that does contain the bug (and that emits no warning itself).

Perhaps it is time to make it more visible (convert it into a SyntaxWarning). Then, after another long period of 4-5 versions, it could be converted into a SyntaxError.

It first went into Python 3.6, now EOL, so all supported Python versions have the deprecation warning.


Serhiy suggested in #71551 (comment) a DeprecationWarning or PendingDeprecationWarning in 3.6, a SyntaxWarning in 3.8 and a SyntaxError in 4.0:

I think "a silent warning" means that it should emit a DeprecationWarning or a PendingDeprecationWarning. Since there is no haste, we should use a 2-release deprecation period. After this, the deprecation can be changed to a SyntaxWarning in 3.8 and to a UnicodeDecodeError (for strings) and a ValueError (for bytes) in 4.0. The latter are converted to SyntaxError by the parser.

So that was a suggestion of 2 releases from deprecation to syntax warning. If we do a syntax warning now, that will have already been 6 releases.

I don't know what the original estimate of 3.8 -> 4.0 was? Also two releases?


Guido suggested in #71551 (comment) waiting "several" releases before the error.

I think ultimately it has to become an error (otherwise I wouldn't
have agreed to the warning, silent or not). But because there's so
much 3rd party code that depends on it we indeed need to take
"several" releases before we go there.

Six releases (3.6 -> 3.12) is perhaps several?


The original issue also had suggestions from Victor and Guido to contact linters/PyCQA to include a warning to help projects prepare, and the original author did so.

And the good news is it was added to pycodestyle (part of Flake8) as W605 in April 2018, so that's 4.5 years of linter warnings. (Thanks to this, I've fixed invalid escape sequences in several packages.)

@vstinner Would it be worth running pycodestyle --select W605 on some top list of packages to get an idea of exposure?

Depending on the result, it may be worth promoting to a SyntaxWarning for an extra release or two, or even keeping the DeprecationWarning a bit longer (in light of #77093 (comment)).

Ok, let's start with replacing DeprecationWarning (silent by default) with SyntaxWarning (displayed once by default): PR #99011.

Fixed by a60ddd3

In the end, it remains a warning, but a SyntaxWarning (shown by default) is now emitted instead of a DeprecationWarning (silent by default). According to @hugovk, sadly many projects in the PyPI top 5000 contain invalid escape sequences. It will take time to update them before we can consider converting the SyntaxWarning to a SyntaxError.

Thanks to everybody who helped me make this change possible!

For affected projects: just add r at the beginning of your strings to disable escape sequences. But be careful: "newline:\n, invalid: \P" cannot simply be converted to a raw string by adding r, since that would turn the newline character (U+000A) into two characters (a backslash followed by the letter n). The correct fix is to double the second backslash: "newline:\n, invalid: \\P" (\P becomes \\P). An alternative is to mix different kinds of strings: "newline:\n" r", invalid: \P" or "newline:\n" + r", invalid: \P" (depending on your personal preference ;-)).
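The equivalence of these spellings can be checked directly. A small sketch, using only the example string from the comment above (the variable names are illustrative):

```python
# The invalid "\P" doubled to "\\P"; "\n" stays a real newline.
fixed = "newline:\n, invalid: \\P"

# Alternatives: implicit concatenation, or "+", with a raw second part.
mixed = "newline:\n" r", invalid: \P"
added = "newline:\n" + r", invalid: \P"

# Naively prefixing the whole literal with r changes its value:
# the newline escape becomes a backslash followed by the letter n.
naive = r"newline:\n, invalid: \P"

print(fixed == mixed == added)  # True
print(fixed == naive)           # False
```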

In the case of regular expressions, '\n' and r'\n' mean the same thing, so in almost all cases you can freely add the r prefix. The only exception is \b outside a character set. But that is extremely rare, and if you have one in a non-raw string representing a regular expression, it is most likely a bug.
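A short illustration of that point, as a sketch with made-up sample strings:

```python
import re

# "\n" is a real newline character; r"\n" is backslash + n, which the
# re module itself interprets as a newline. Both match the same thing.
print(re.findall("\n", "a\nb") == re.findall(r"\n", "a\nb"))  # True

# The exception: outside a character set, "\b" is a backspace character
# (U+0008), while r"\b" is the regex word-boundary assertion.
print(re.findall("\bpy", "py spy"))   # []
print(re.findall(r"\bpy", "py spy"))  # ['py']

# Inside a character set, both spellings mean backspace.
print(re.findall("[\b]", "a\bb") == re.findall(r"[\b]", "a\bb"))  # True
```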

I'm wondering if we'll be forced to roll back these changes before release again due to a lot of warnings in third-party code.

I created https://discuss.python.org/t/collaboration-on-handling-python-3-12-incompatible-changes-distutils-removal-invalid-escape-escape-etc/20721 to discuss this change.

If you work on Windows and have the habit of documenting your code inside triple quotes (a company requirement), you now get a warning whenever a directory name starts with a d. For example, """C:\dist\project\file.dat""" gives a SyntaxWarning.

And not to mention the regexp issue, which has a fix.

That is more work to port a 3.11 project to 3.12 than to move it from Python 2 to Python 3.

Just replace """C:\dist\project\file.dat""" with r"""C:\dist\project\file.dat""": add r prefix.

Is there a tool that fixes the problematic regular expressions?
I have so many SyntaxWarnings in my code that doing it manually sounds too risky.

When I run: python -W default -c 're.compile(r"\d")' I get <string>:1: SyntaxWarning: invalid escape sequence '\d'. Is this how it should be?
I used r, so why do I get a SyntaxWarning?

I cannot reproduce your issue. Also, this issue is closed.

# no warnings are displayed
$ python3.10 -W default -c 'import re; re.compile(r"\d")'
$ python3.11 -W default -c 'import re; re.compile(r"\d")'
$ python3.12 -W default -c 'import re; re.compile(r"\d")'
$ python3.13 -W default -c 'import re; re.compile(r"\d")'

Quick update to say that the warning can be automatically fixed with ruff:

ruff --select W605 --fix

I've checked some edge cases (strings with \n): it either converts to a raw string or doubles the backslash, as needed for each string.