untokenize() does not round-trip for tab characters
Bug report
Bug description:
Tab characters used as whitespace do not round-trip when using untokenize. Here's an example:
import tokenize, io
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
x = tokenize.untokenize(tokens)
print(x)
# a + b
Here, the tab is replaced with a regular space. Given that untokenize tries to match the source string exactly, I think we should fix this. We can do that by getting the characters from the source line rather than always using a normal space:
Line 185 in 9c2bb7d
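For illustration, here is a minimal sketch of the idea (hypothetical code, not the actual patch or the code referenced above): instead of always padding with spaces, copy whatever characters sit between the previous token's end column and the next token's start column from the token's line attribute.

def add_whitespace(prev_end_col, start_col, line):
    # Hypothetical helper, not the real Untokenizer.add_whitespace.
    if line:
        # Reuse the exact whitespace (spaces or tabs) from the source line.
        return line[prev_end_col:start_col]
    # Fall back to plain spaces when no source line is available.
    return " " * (start_col - prev_end_col)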
I'll send a fix in a moment :)
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux
Linked PRs
The example behavior is the same back to 3.8, and I believe it is intentional and not a bug. Untokenize does not intend to reproduce the source string exactly. Which is to say, untokenize('tokenize'(source)) == source is not guaranteed. (I quote 'tokenize' as call details are omitted.) The current doc says "the spacing between tokens (column positions) may change."
Rather, the doc says that untokenize should produce a source string that tokenizes back to a matching stream of TokenInfo objects. In other words, 'tokenize'(untokenize(stream)) matches stream. The matching is that the first two items of each tuple, the token type and value, are the same. Since there is no WHITESPACE token, there are no ' ' and '\t' values to match -- or not match.
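To make the guarantee concrete, here is a small check of that invariant (a sketch; exact token details can vary between versions):

import io, tokenize
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_source = tokenize.untokenize(tokens)
new_tokens = list(tokenize.generate_tokens(io.StringIO(new_source).readline))
# The documented guarantee: token types and string values match,
# even though new_source need not equal source.
assert [t[:2] for t in tokens] == [t[:2] for t in new_tokens]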
Even though spacing may change, the current code tries to maintain it by inferring whitespace from the differences between successive end and start columns, so the start and end positions may match as well. But the before and after line values may still differ. The proposal is to avoid this too where possible. (I do not know what other cases might be covered by the existing disclaimer.)
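Continuing the original example, a sketch of that distinction (positions survive the round trip, the line attribute does not):

import io, tokenize
source = "a +\tb"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
new_tokens = list(tokenize.generate_tokens(
    io.StringIO(tokenize.untokenize(tokens)).readline))
old_b, new_b = tokens[2], new_tokens[2]   # the NAME token for "b"
print(old_b.start == new_b.start)         # True: column positions match
print(old_b.line, new_b.line)             # "a +\tb" vs "a + b"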
There are two backward-compatibility issues. First, there may be users who want or depend upon the current behavior, perhaps because the change would break their tests. Second, adding a required parameter with no default to Untokenizer.add_whitespace could break user code. Even though the absence of "Untokenizer" from tokenize.__all__ (and from the doc) makes it private, it might be in use. If the proposal were otherwise acceptable, an 'ideas' discussion post might be a good idea.
I don't think this is worth changing. As @terryjreedy points out, the intent was never to produce the same source code as the input to tokenize. I think this change would start a slippery slope to trying to achieve perfect fidelity.
Fair enough, if you think this is not worth fixing I'm happy to close the issue :)
Though, the motivation for this was the docstring of untokenize, which says that for 5-tuple tokens the output should match the input exactly:
Lines 321 to 330 in bad3cde
Maybe we should update the docstring to match what the docs say then?
Yes, I think the docstring should be changed to match the docs.
Agree with Eric. The full input part of the docstring was merged by Thomas Wouters 2006-12-12 in 89f507f. The initial version of the current .rst doc was merged by Georg Brandl 2007-08-15 in 116aa62 and is essentially unchanged since. As best I remember, @pablogsal has made comments on other issues that the guarantee is given in the doc.
There are other sentences in both doc and docstring that need revision. I intend to later post suggested revisions here for discussion.
Hm, I was strongly on the side of changing the docs here and leaving the behavior in place, as @terryjreedy suggests. However, I then looked at @tomasr8's PR and the fix is really very simple, so I'd be supportive of actually just fixing this.
Personally, I think it's a slippery slope, and we should not accept the PR, no matter how simple it is. I don't see the point in making things slightly closer to round tripping, but not going all the way.
I think the missing point here is that when we tried to "fix" similar things in the past because they were very simple, even though the behaviour was not guaranteed, there have been many instances where something subtle broke and we had to play weird games to fix everything at once. So I think I agree with @ericvsmith:
I don't see the point in making things slightly closer to round tripping, but not going all the way.
Good point, and there are cases where we cannot go all the way because the tokens lack some information (e.g. for backslash continuations).
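For example (hedged; the exact output may differ between versions), the whitespace around a line-continuation backslash is not stored in any token, so untokenize cannot reproduce it faithfully:

import io, tokenize
source = "x = 1 + \\\n    2\n"
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
print(repr(tokenize.untokenize(tokens)))
# Prints something like 'x = 1 +\\\n    2\n': the continuation is
# re-synthesized, but the space before the backslash is lost.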
Let me close this then and we can simply update the docstring.