Interned strings are immortal, despite what the documentation says
Closed this issue ยท 37 comments
Bug report
Bug description:
The sys.intern documentation explicitly says:
Interned strings are not immortal; you must keep a reference to the return value of
intern()around to benefit from it.
However, they were made immortal in #19474 (Implement Immortal Objects -- implementation of PEP 683), without a documentation update. The 3.12 What's New also doesn't mention the change. [edit: it's now clear that the PR author (@eduardo-elizondo) intended it but the reviewer (@markshannon) did not]
PEP-683 itself only mentions the change as a possible optimization.
Meanwhile, pathlib interns every path segment it processes, presumably depending on the documented behaviour. In CPython's test suite, that causes names of temporary directories to leak (stealthily, since interned strings are excluded from the total refcount).
[edit: Worse, type_setattro, PyObject_SetAttr or PyDict_SetItemString immortalize keys, so strings used for key/attribute names this way are not reclaimed until interpreter shutdown.]
On the other hand, free-threading (PEP-703) seems to rely [edit: relies] on these being immortal.
What's the correct behaviour? [edit: specifically, in 3.12 and non-free-threading 3.13, should this change be reverted or documented?]
@eduardo-elizondo @ericsnowcurrently
Linked PRs
Related fixes:
I would suggest that using "immortal" here could be confusing. The flagged checked by _Py_IsImmortal() is low-level implementation detail and should not be something that the library documentation discusses. Strings that you call intern() on may or may not have this flag set. Different Python runtimes might implement intern() some other way. The behavior that matters is that the string gets stored in an interned strings table and that future calls to intern() with the same string content will return an object with the same id(), even if the user program doesn't keep a reference to the string.
The behavior that matters is that the string gets stored in an interned strings table and that future calls to intern() with the same string content will return an object with the same id(), even if the user program doesn't keep a reference to the string.
No. That is not the behaviour on Python 3.11.
I get:
$ python3.11
Python 3.11.7 (main, Dec 18 2023, 00:00:00) [GCC 13.2.1 20231205 (Red Hat 13.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> id1 = id(sys.intern('some string'))
>>> other = sys.intern('other string')
>>> id2 = id(sys.intern('some string'))
>>> id1 == id2
False
IMO, the guarantee is that if two strings are interned, they can be compared by identity.
CPython uses this for a fast path in string comparison; so some users call intern on strings that are likely to be compared to one another -- even ones that srent expected to live until interpreter shutdown. We can say that's optmizing for a specific implementation, but it's common -- even pathlib and the compiler (e.g. eval) do it.
The new behaviour can leak strings that are not likely to be used again.
$ cat test.py
import sys
id1 = id(sys.intern('some string'))
other = sys.intern('other string')
id2 = id(sys.intern('some string'))
assert id1 == id2
$ python3.8 ./test.py
$ python3.9 ./test.py
$ python3.10 ./test.py
$ python3.11 ./test.py
$ python3.12 ./test.py
$ python3.13 ./test.pyNo assertions triggered.
The the new string is allowed to reuse the old one's memory; I guess this is build-specific. What platform are you on?
Instead of other = sys.intern('other string'), try putting in gc.collect and something more intensive, like import asyncio?
I'm on macOS using the official build. Strange thing: I can reproduce the behaviour you're seeing in the REPL.
Instead of
other = sys.intern('other string'), try putting ingc.collectand something more intensive, likeimport asyncio?
Nope, cannot reproduce using $ python3.11 ./test.py as above.
Ah! Right, literal constants are part the code object, which does normally stay around. You'll need something that's not folded to a constant, like:
$ cat test.py
import sys
def intern_and_report(string):
string = sys.intern(string)
the_id = id(string)
print(f'{string!r} at {id(string)}: refcount {sys.getrefcount(string):x}')
return the_id
id1 = intern_and_report('a' + __name__)
str_b = sys.intern('b' + __name__)
id2 = intern_and_report('a' + __name__)
assert id1 == id2$ python3.10 ./test.py
'a__main__' at 139775546163056: refcount 3
'a__main__' at 139775546294640: refcount 3
Traceback (most recent call last):
File "/tmp/rttair/./test.py", line 11, in <module>
assert id1 == id2
AssertionError
$ python3.12 ./test.py
'a__main__' at 139964910181616: refcount ffffffff
'a__main__' at 139964910181616: refcount ffffffff
Yes; thanks!
See also #113601
@encukou just getting to this! You're right, as of now, all interned strings should be immortal, so the sys.intern docs should probably be updated to update that they are indeed immortal and the fact that we don't need to keep a reference to benefit from it anymore.
As for the leaks, as @erlend-aasland already called out, we have a PR to correctly clean them up in #113601!
The behavior that matters is that the string gets stored in an interned strings table and that future calls to intern() with the same string content will return an object with the same id(), even if the user program doesn't keep a reference to the string.
No. That is not the behaviour on Python 3.11.
Sorry, I was not very clear. I was describing the behaviour we have with Python 3.12 and we had with older versions of Python (version 2.2 and before, it seems). It looks like this commit changed it: 45ec02a.
Since 3.12 changed this, definitely the documentation should be updated. Further, I think we should have had a discussion and explicitly decide if this is desirable behaviour, since it reverts the effect of 45ec02a. Maybe there was one and I missed it? Perhaps free-threading doesn't require that every interned string become immortal but only some of them (e.g. symbols used by code objects etc). That might preserve free-threaded performance (e.g. interned strings shared between threads) while avoid the "leak" like encountered by pathlib (e.g. interned strings only used for a short time and then discarded).
as of now, all interned strings should be immortal
To be clear, I think this is bad. Some interned strings are temporary, and they now leak. They should be cleaned up sooner that at interpreter shutdown.
The documentation for sys.intern explains, "Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it."
Some open source libraries depend on this behavior, like msgpack-python. Changing the behavior of sys.intern will cause problems for some applications. I work on an application that is experiencing out-of-memory errors because of this issue.
One solution might be to add a separate function called sys.immortalize(string). Alternately, an optional argument to sys.intern could also do the trick. For example, sys.intern(s, immortalize=True).
Restoring the old, documented behavior of sys.intern would save people a lot of trouble. Making the new behavior opt-in (with a separate function or an optional argument) might be a way to have our cake and eat it too.
Does this mean that PyDict_SetItemString will also immortalize its key argument? Will every key passed to PyDict_SetItemString live forever?
It seems like that could also be a source of leaks. Maybe the documentation for PyDict_SetItemString should be updated to indicate that key will be kept alive forever.
Alternately, maybe PyDict_SetItemString should not call PyUnicode_InternInPlace. I notice a comment on that line asks, XXX Should we really?. Seems like a good question.
I don't see a good reason why interning strings should make them immortal.
The two things, immortality and interning, are separate concepts.
Why were interned strings changed to be immortal in 3.12?
Since interned strings that are tied to code objects are effectively immortal, making those actually immortal seems fine if necessary for performance.
But strings interned with sys.intern() should not be made immortal implicitly, IMO.
@jvs suggestion of adding an immortalize argument to sys.intern() seems a good one.
The immortal objects PR seems to allow interned, but not immortals strings:
https://github.com/python/cpython/pull/19474/files#diff-6be7f081fe6c5e9cbc89323a00399b291d1cda855bcb4c6eeaee0fac89c2f8ddR105
so it is a bit surprising that they aren't supported.
It might also be worth making the distinction between "interpreter immortal" and "process immortal" strings, as the latter can be shared between multiple interpreters.
The free-threaded build needs every interned string to be immortal, not just symbols, if you don't want to introduce bottlenecks that prevent scaling in multi-threaded programs.
The documentation can describe the possible behaviors without promising a specific implementation. Something like:
"Depending on the Python version and implementation, interned strings may or may not be immortal; you should keep a reference to the return value of intern() around to benefit from it."
Why do all interned strings need to be immortal?
Strings where there is potentially high contention on the refcount need to be immortal, to allow scaling.
But that only implies that interned strings must be immortal if there is a strong correlation between contention and interning.
Is there such a strong correlation?
Are there not strings that are contended and are not interned, or does the free-threading build intern so many strings that all possibly contended strings are interned?
Interning implies sharing and sharing between threads leads to reference count contention. To take the pathlib example, I'd much rather pathlib did not intern strings, but if it does intern the components, it's important that those strings are immortalized, because otherwise you are likely to have a bunch of strings shared between threads that otherwise would not be shared.
Are there not strings that are contended and are not interned...?
It's definitely possible, but from what I've seen it's not the most common source of reference count contention. If you don't intern a string, it's much less likely to be incidentally shared between threads.
This seems like a breaking change that is kind of flying under the radar. The documentation explains, "Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it."
It looks like it been that way for 20 years
I'd be worried that there are many applications that depend on the old behavior. If there's a way to maintain backwards compatibility, while also finding a way to avoid bottlenecks around shared strings, then that might be the best option.
Can we make it mortal again on --enable-GIL build?
Right. It looks like free-threading needs this, and the ecosystem (and stdlib) will need to adapt, so the question is how soon.
Specifically, what should we do in 3.12 -- revert the change, or document it?
I guess ultimately the RM should decide that.
Since interned strings that are tied to code objects are effectively immortal, making those actually immortal seems fine if necessary for performance.
This is not always true, unfortunately. You can use a temporary code object for eval, like we do for namedtuple.
(Or you can use ast.literal_eval to parse data -- that doesn't create a code object per se, but AFAIK strings are already immortalized in the AST.)
As Python is a dynamic language, the ability to execute dynamic code is important. Not all code is immortal.
Anyway, consider the use case of receiving JSON from the outside and mapping its contents to a Python structure: not only does interning allow large numbers of identical strings in JSON to be combined into one, but it also speeds up comparisons with Python symbols.
However, if that string becomes immortal, it will cause memory leaks, and we will not be able to intern the strings contained in the JSON received from outside.
Therefore, a new API is needed that returns an interned string only if the same string has already been interned.
But we don't want to add new API in Python 3.12 and 3.13. So, I think revert it in GIL build is the best option.
Since interned strings that are tied to code objects are effectively immortal
effectively means "almost all, almost all of the time".
Strings interned compiling a namedtuple, we be effectively immortal, as the namedtuple will almost always be bound to a global variable with module lifetime. Strings interned in eval() will be mostly symbols referring to existing code, and thus, almost always already interned.
The point is that making all strings and symbols used in the compiler immortal will leak very few strings and very little memory.
Rather than diverge the normal and no-gil builds even further than they already are, I think we can keep this behavior.
It the behavior of sys.intern() that is the problem here.
But we don't want to add new API in Python 3.12 and 3.13. So, I think revert it in GIL build is the best option.
For sys.intern() I agree that we should revert it.
I think it is fine to leave the bytecode compiler as it is for both builds.
@colesbury what is the evidence that strings that at explicitly interned with sys.intern() are more prone to contention than other strings?
If there is no evidence , then we should revert the behavior of sys.intern() for both builds.
What about PyUnicode_InternInplace?
msgpack uses both of sys.intern() and PyUnicode_InternInplace().
@markshannon, please see my comments above.
I know the discussion is moving into different places already, but to me, object immortality has always been about being able to have externally allocated and deeply immutable objects/types. The latter is best exemplified by allocating the object/type in read-only memory, but the former is most critical: immortal objects are those that Python did not create1 and is not going to deallocate itself.
The primary scenario I was thinking of when I first suggested them was to allow creating strings and lists pre-initialization (to specify init config). Those would be allocated outside of the normal runtime, would exist throughout the life of the runtime, and would be freed by the host process, not the Python runtime. But because they are "normal" objects, we don't have to copy them. The secondary scenario, which became much more interesting when talking to Eric about subinterpreters, was the ability to avoid having to break our existing API by converting all our static types into heap types.
Under this design, runtime-allocated strings are entirely inappropriate for immortalisation.
Interned or not, if Python is going to free them at/before finalisation, they are not immortal. The way to keep non-immortal objects alive is to add a reference, which is what sys.intern does. If that doesn't play well with threading, then threading needs to figure out a way to solve that without redefining existing orthogonal concepts that look like they might fit.
It seems if pathlib is going to be a problem because it uses sys.intern, then probably pathlib should stop doing that. It can easily have its own interpreter/thread-specific memoization dict, and probably even make it a tunable parameter for users who care more about memory overhead than refcount contention. In a free-threaded world, any process-wide or runtime-wide state needs to be reevaluated, as it's very likely to be the wrong model now.
In terms of compiler allocated stuff, I don't disagree with one object being the "external allocator". I assume it would be the code object, which rather than allocating 100 individual strings could instead create a naive array with 100 immortal strings that reference the memory mapped source file. When the code object is deallocated, it clears out that array itself. Without some clever clone-on-access mechanism for those strings, it means we basically intern code objects until finalization, but that entire array of immortal strings is more efficient than individually allocated ones, and also freely portable across subinterpreters (though the code object would not be - maybe a non-portable proxy is required).
Alternatively, if the compiler has its own allocator, then provided finalization cleans it all up, there's no problem. But we shouldn't leak anything that we allocate. It has to have an owner, and that owner has to release it.
Footnotes
-
"Statically allocated in the Python runtime" is not what I mean here. Those weren't created by the runtime, they existed prior to the runtime starting, so are externally allocated. โฉ
It seems the discussion died down, without a clear next step.
I still think this unplanned change should be reverted.
I don't know the internals here, and learning them looks like a major time investment, so I'd like to first ask if a revert would be OK to backport. @Yhg1s, would it? (If you need SC to decide, let me know and I can ask.)
I'm proposing a partial revert, to be backported to 3.13 and 3.12:
PyUnicode_InternInPlace(and everything that uses it, likesys.internetc.) revert to not immortalizing strings- In free-threaded builds, interned strings will still be always immortalized
- CPython's statically allocated strings will still be immortal
- Strings created by the compiler will be made immortal. (There might be other places where we should switch from
PyUnicode_InternInPlacetoPyUnicode_InternImmortal, to be found during the implementation.)
We then may decide to go back to immortalizing everything in future versions -- with proper discussion and announcements.
If a revert is not OK, then I'll retroactively change the 3.12+ documentation and What's New documents.
Yes, I agree this should be reverted in 3.12, and conditional on free threading in 3.13. That's my opinion as RM, but considering the discussions the DC had about the immortal objects PEP, I'm sure they all agree. Immortalising all interned strings is incorrect and wrong.
(I will say that I see free-threaded builds needing this as a clunky band aid to allow further experimentation, and a better solution should be found before making it the default. If that's not possible, we need more discussion about this pretty fundamental and problematic change. That's more of a question for future RMs/SCs though.)
The PR is up: #120520
It is performance-neutral according to the faster-cpython benchmarks.
It does leaves some C API changes for follow-up PRs, so they can be separately evaluated, backported, and possibly reverted:
PyUnicode_InternInPlacenot immortalizing (part of the plan above)PyDict_SetItemStringnot immortalizing- Re-adding
PyUnicode_InternImmortalto public API - Interning one-character singletons at startup
- Remove the internal
_PyUnicode_InternInPlace(an favour of explicit InternMortal/InternImmortal); preferably in 3.14 only - #113601
- For
Py_DEBUG, add global tablekey_destroy_functo verifystatically_allocated
For a separate change: #120520 (comment) (and #120520 (comment), #120520 (comment), #120520 (comment)).
The big PR is now in 3.13. Thank you, Eric, for reviews!
I'll wait a bit (and watch Fedora rebuilds) before backporting to 3.12.
The next PR I want to send is the behaviour change for public C API, which should fix msgpack's memory leaks:
-
In
PyUnicode_InternInPlace, don't immortalize. -
In a bunch of other functions that take
const char *, I've looked around and found them almost exclusively used with C literals, or other strings that are known at compile time (or otherwise effectively immortal). Both in CPython and in a few third-party modules I checked. These are:PyDict_SetItemStringPyObject_SetAttrString,PyObject_DelAttrStringPyUnicode_InternFromString- the
PyModule_Add*convenience functions
So for these, I think it'd be better if they do immortalize, but we should explicitly mention that in the documentation, and also mention what to call instead for non-static (e.g. user-supplied) strings.
Does that sound good?
For the items after that:
- I haven't seen much use case for a public
PyUnicode_InternImmortal. I'm now not planning to resurrect it. - Refactoring the internals is a bit of a rabbit hole. I might turn it into a mentoring project (which would take a few months at least).
The PR added some assertions that call _PyIsImmortal. These are wrong, since stable ABI extensions can make immortal objects mortal.
@ericsnowcurrently @eduardo-elizondo: could you update PEP 683 (Immortal Objects, Using a Fixed Refcount)? I don't think this part of the specification ever made it to the implementation:
an immortal object will still be considered immortal, even if somehow its refcount were modified (e.g. by an older stable ABI extension).
Per PEP 1, a Final PEP should accurately describe the implementation at the point where it is marked Final.
The 3.12 backport is at #123065.
@Yhg1s, re:
Yes, I agree this should be reverted in 3.12, and conditional on free threading in 3.13. That's my opinion as RM, but considering the discussions the DC had about the immortal objects PEP, I'm sure they all agree.
Unfortunately it's far from a straightforward revert.
Do you still want this in 3.12? Should I ask the SC?