untokenize() does not round-trip for tab characters
Bug report
Bug description:
Tab characters used as whitespace do not round-trip when using untokenize. Here's an example:
import tokenize, io
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
x = tokenize.untokenize(tokens)
print(x)
# a + b
Here, the tab is replaced with a regular space. Given that untokenize tries to match the source string exactly, I think we should fix this. We can do that by getting the characters from the source line rather than always using a normal space:
Line 185 in 9c2bb7d
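For illustration, here is a minimal sketch of the idea (hypothetical code, not the actual patch or the code referenced above): instead of always padding with spaces, copy whatever characters sit between the previous token's end column and the next token's start column from the token's line attribute.

def add_whitespace(prev_end_col, start_col, line):
    # Hypothetical helper, not the real Untokenizer.add_whitespace.
    if line:
        # Reuse the exact whitespace (spaces or tabs) from the source line.
        return line[prev_end_col:start_col]
    # Fall back to plain spaces when no source line is available.
    return " " * (start_col - prev_end_col)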
I'll send a fix in a moment :)
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs
The example behavior is the same back to 3.8, and I believe it is intentional and not a bug. Untokenize does not intend to reproduce the source string exactly. Which is to say, untokenize('tokenize'(source)) == source is not guaranteed. (I quote 'tokenize' as call details are omitted.) The current doc says "the spacing between tokens (column positions) may change."
Rather, the doc says that untokenize should produce a source string that tokenizes back to a matching stream of TokenInfo objects. In other words, 'tokenize'(untokenize(stream)) matches stream. The matching is that the first two items of each tuple, the token type and value, are the same. Since there is no WHITESPACE token, there are no ' ' and '\t' values to match -- or not match.
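To make the guarantee concrete, here is a small check of that invariant (a sketch; exact token details can vary between versions):

import io, tokenize
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_source = tokenize.untokenize(tokens)
new_tokens = list(tokenize.generate_tokens(io.StringIO(new_source).readline))
# The documented guarantee: token types and string values match,
# even though new_source need not equal source.
assert [t[:2] for t in tokens] == [t[:2] for t in new_tokens]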
Even though spacing may change, the current code tries to maintain it by inferring whitespace from the differences between successive end and start columns, so the start and end positions may match as well. But the before and after line values may still differ. The proposal is to avoid this too where possible. (I do not know what other cases might be covered by the existing disclaimer.)
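Continuing the original example, a sketch of that distinction (positions survive the round trip, the line attribute does not):

import io, tokenize
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_tokens = list(tokenize.generate_tokens(
    io.StringIO(tokenize.untokenize(tokens)).readline))
old_b, new_b = tokens[2], new_tokens[2]   # the NAME token for "b"
print(old_b.start == new_b.start)         # True: column positions match
print(old_b.line, new_b.line)             # "a +\tb" vs "a + b"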
There are two backward-compatibility issues. First, there may be users who want or depend upon the current behavior, perhaps because the change would break their tests. Second, adding a required parameter with no default to Untokenizer.add_whitespace could break user code. Even though the absence of "Untokenizer" from tokenize.__all__ (and from the doc) makes it private, it might be in use. If the proposal were otherwise acceptable, an 'ideas' discussion post might be a good idea.
I don't think this is worth changing. As @terryjreedy points out, the intent was never to produce the same source code as the input to tokenize. I think this change would start a slippery slope to trying to achieve perfect fidelity.
Fair enough, if you think this is not worth fixing I'm happy to close the issue :)
Though, the motivation for this was the docstring of untokenize, which says that for 5-tuple tokens the output should match the input exactly:
Lines 321 to 330 in bad3cde
Maybe we should update the docstring to match what the docs say then?
Yes, I think the docstring should be changed to match the docs.
Agree with Eric. The full input part of the docstring was merged by Thomas Wouters 2006-12-12 in 89f507f. The initial version of the current .rst doc was merged by Georg Brandl 2007-08-15 in 116aa62 and is essentially unchanged since. As best I remember, @pablogsal has made comments on other issues that the guarantee is given in the doc.
There are other sentences in both doc and docstring that need revision. I intend to later post suggested revisions here for discussion.
Hm, I was strongly on the side of changing the docs here and leaving the behavior in place, as @terryjreedy suggests. However, I then looked at @tomasr8's PR and the fix is really very simple, so I'd be supportive of actually just fixing this.
Personally, I think it's a slippery slope, and we should not accept the PR, no matter how simple it is. I don't see the point in making things slightly closer to round tripping, but not going all the way.
I think the missing point here is that when we tried to "fix" similar things in the past because they were very simple, even though the behaviour was not guaranteed, there have been many instances where something subtle broke and we had to play weird games to fix everything at once. So I think I agree with @ericvsmith:
I don't see the point in making things slightly closer to round tripping, but not going all the way.
Good point, and there are cases where we cannot go all the way because the tokens lack some information (e.g. for backslash continuations).
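For example (hedged; the exact output may differ between versions), the whitespace around a line-continuation backslash is not stored in any token, so untokenize cannot reproduce it faithfully:

import io, tokenize
source = "x = 1 + \\\n    2\n"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
print(repr(tokenize.untokenize(tokens)))
# Prints something like 'x = 1 +\\\n    2\n': the continuation is
# re-synthesized, but the space before the backslash is lost.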
Let me close this then and we can simply update the docstring.