Possible incorrect formatting from remove tabs

Question

Possible incorrect formatting from remove tabs

Closed this issue 2 years ago · 13 comments

If lines contain spaces followed by tabs then replacing with a fixed number of spaces will miss-align which can be a problem for some file types, e.g. fi a file contains 2 lines such as:
"
Line 1
Line 2
"
Then visually, with tab spacing of 4, it will be:

    Line 1
    Line 2

But after tab replacement will become:

    Line 1
     Line 2

Suggest adding:

def line_tab_strip(line: bytes, whitespaces_count: int) -> bytes:
    """Remove tabs and replace with whitespaces_count spaces maintaining alignment"""
    spaces = b" " * whitespaces_count
    return spaces.join(sect.strip(b' ') for sect in line.split(b"\t"))

And changing the line:

    lines = [line.replace(b'\t', b' ' * whitespaces_count) for line in lines]

With something like:

    lines = [line_tab_strip(line, whitespaces_count) for line in lines]

This should give the expected alignment even on multi column lines such as: \tCol 1\t Col 2 \tCol 3 \n

Answer 1 · 2023-01-20T11:07:47.000Z

Hi @SteveBarnes-BH!

Interestng, thank you for reporting this.

Would you like to submit a PR with a small unit test covering this case,
in order to fix this bug?

Answer 2 · 2023-01-20T11:32:42.000Z

Patch for possible test to show the issue:

--- a/tests/remove_tabs_test.py                                                
+++ b/tests/remove_tabs_test.py                                                
@@ -6,8 +6,9 @@ from pre_commit_hooks.remove_tabs import main as remove_tabs   
 @pytest.mark.parametrize(                                                     
     ('input_s', 'expected'),                                                  
     (                                                                         
-        ('foo \t\nbar', 'foo     \nbar'),                                     
+        ('foo \t\nbar', 'foo    \nbar'),                                      
         ('bar\n\tbaz\n', 'bar\n    baz\n'),                                   
+        ('foo \tfoo2\nbar', 'foo    foo2\nbar'),                              
     ),                                                                        
 )                                                                             
 def test_remove_tabs(input_s, expected, tmpdir):

Answer 3 · 2023-01-20T14:32:06.000Z

Alright, thanks!

I can apply those patchs myself, thank you very much for providing them 😊
But would you like to contribute directly through a Pull Request on this GitHub repo?

Some contributors sometimes like to have their name in the repo commits history,
and I just want to let you decide what you prefer.

Answer 4 · 2023-01-20T14:40:41.000Z

Since this is a work account I an raise a bug report or suggestion but contributing code requires a pile of paperwork :-( in advance hence the patch in the issue report ;-) I hope that you understand.

Answer 5 · 2023-01-20T14:45:15.000Z

Sure, I can perfectly understand! No worries 😊
I'm going to apply your suggestions

Answer 6 · 2023-01-20T14:56:35.000Z

I checked more closely your suggested patch on tests/remove_tabs_test.py,
and I think that in fact there may be no bug there.

If you consider the input string foo \t\nbar,
you can see that a whitespace character is present before the \t,
explaining why the output string is foo \nbar with 5 whitespace characters,
after running remove-tabs on it.

Answer 7 · 2023-01-20T15:00:35.000Z

But visually foo \tbar and foo\tbar will be identical in most editors so the result of foo bar with 5 spaces will be unexpected. This is due to tab alignment.

Answer 8 · 2023-01-20T15:03:38.000Z

Strictly each tab should align the text that follows it to the next whitespaces_count after its position so my solution above, while it will work for many cases is still not perfect.

Answer 9 · 2023-01-20T16:00:51.000Z

I think that a much better approach would be:

lines = lines.expandtabs(whitespaces_count)

This should align tabs correctly so that: b' \tA\ttabbed \tstring\twith\tspaces\n\tAn\tother\ttabbed\tline\there!\n'
Gives: b' A tabbed string with spaces\n An other tabbed line here!\n'
Which renders to:

    A   tabbed  string  with    spaces
    An  other   tabbed  line    here!

Answer 10 · 2023-01-20T16:48:54.000Z

I think I now better understand your point.

.expandtabs() is a very good suggestion, sadly 'foo \t\nbar'.expandtabs(4)
produces 'foo \nbar' with 5 whitespace characters

Answer 11 · 2023-01-20T19:05:25.000Z

I opened PR #61: what do you think of it @SteveBarnes-BH?

Answer 12 · 2023-01-21T06:38:02.000Z

Now that I am on my own machine & time I can go into a little more detail. As an old timer who first learnt to use tabs on a mechanical typewriter, (at school I was made to do secretarial studies in an attempt to improve my spelling & learnt a lot just not spelling), my understanding of what is expected with tabs matches the explanation here.
So if you have a tab size of 4, (the most common for programming), this corresponds to setting tab positions like:
---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---
So a test sample such as:
b'No leading\ttab\n\tleading\ttab\n \tSpace then\tTab\nTabs\tbetween\tevery\tword\nSpace \tthen \ttab \tbetween \tevery \tword.'
Would be visually in your editor or word processor (with fixed width fonts) rendered as below _putting the ruler in as well:

---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---
No leading  tab
    leading tab
    Space then  Tab
Tabs    between every   word    in  the line.
Space   then    tab     between     every   word    in  the     line.

This is what the built in string/bytes method .expandtabs(4) gives.

In my personal opinion it is best if the hook leaves the file visually unchanged as, if the originator of the file used a mix of spaces and tabs to align the words in a specific way, then they would expect the output of the hook to look the same.

Answer 13 · 2023-01-21T13:02:01.000Z

Thank you for the detailed explanation & PR! 😊

I now understand why .expandtabs() behaves this way:

input 'foo \t' -> output 'foo '
input 'fo \t' -> output 'fo '

Alright, I'm going to merge your PR #62