Possible incorrect formatting from remove tabs
Closed this issue ยท 13 comments
If lines contain spaces followed by tabs then replacing with a fixed number of spaces will miss-align which can be a problem for some file types, e.g. fi a file contains 2 lines such as:
"
Line 1
Line 2
"
Then visually, with tab spacing of 4, it will be:
Line 1
Line 2
But after tab replacement will become:
Line 1
Line 2
Suggest adding:
def line_tab_strip(line: bytes, whitespaces_count: int) -> bytes:
"""Remove tabs and replace with whitespaces_count spaces maintaining alignment"""
spaces = b" " * whitespaces_count
return spaces.join(sect.strip(b' ') for sect in line.split(b"\t"))
And changing the line:
lines = [line.replace(b'\t', b' ' * whitespaces_count) for line in lines]
With something like:
lines = [line_tab_strip(line, whitespaces_count) for line in lines]
This should give the expected alignment even on multi column lines such as: \tCol 1\t Col 2 \tCol 3 \n
Hi @SteveBarnes-BH!
Interestng, thank you for reporting this.
Would you like to submit a PR with a small unit test covering this case,
in order to fix this bug?
Patch for possible test to show the issue:
--- a/tests/remove_tabs_test.py
+++ b/tests/remove_tabs_test.py
@@ -6,8 +6,9 @@ from pre_commit_hooks.remove_tabs import main as remove_tabs
@pytest.mark.parametrize(
('input_s', 'expected'),
(
- ('foo \t\nbar', 'foo \nbar'),
+ ('foo \t\nbar', 'foo \nbar'),
('bar\n\tbaz\n', 'bar\n baz\n'),
+ ('foo \tfoo2\nbar', 'foo foo2\nbar'),
),
)
def test_remove_tabs(input_s, expected, tmpdir):
Alright, thanks!
I can apply those patchs myself, thank you very much for providing them ๐
But would you like to contribute directly through a Pull Request on this GitHub repo?
Some contributors sometimes like to have their name in the repo commits history,
and I just want to let you decide what you prefer.
Since this is a work account I an raise a bug report or suggestion but contributing code requires a pile of paperwork :-( in advance hence the patch in the issue report ;-) I hope that you understand.
Sure, I can perfectly understand! No worries ๐
I'm going to apply your suggestions
I checked more closely your suggested patch on tests/remove_tabs_test.py
,
and I think that in fact there may be no bug there.
If you consider the input string foo \t\nbar
,
you can see that a whitespace character is present before the \t
,
explaining why the output string is foo \nbar
with 5 whitespace characters,
after running remove-tabs
on it.
But visually foo \tbar
and foo\tbar
will be identical in most editors so the result of foo bar
with 5 spaces will be unexpected. This is due to tab alignment.
Strictly each tab should align the text that follows it to the next whitespaces_count after its position so my solution above, while it will work for many cases is still not perfect.
I think that a much better approach would be:
lines = lines.expandtabs(whitespaces_count)
This should align tabs correctly so that: b' \tA\ttabbed \tstring\twith\tspaces\n\tAn\tother\ttabbed\tline\there!\n'
Gives: b' A tabbed string with spaces\n An other tabbed line here!\n'
Which renders to:
A tabbed string with spaces
An other tabbed line here!
I think I now better understand your point.
.expandtabs()
is a very good suggestion, sadly 'foo \t\nbar'.expandtabs(4)
produces 'foo \nbar'
with 5 whitespace characters
I opened PR #61: what do you think of it @SteveBarnes-BH?
Now that I am on my own machine & time I can go into a little more detail. As an old timer who first learnt to use tabs on a mechanical typewriter, (at school I was made to do secretarial studies in an attempt to improve my spelling & learnt a lot just not spelling), my understanding of what is expected with tabs matches the explanation here.
So if you have a tab size of 4, (the most common for programming), this corresponds to setting tab positions like:
---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---
So a test sample such as:
b'No leading\ttab\n\tleading\ttab\n \tSpace then\tTab\nTabs\tbetween\tevery\tword\nSpace \tthen \ttab \tbetween \tevery \tword.'
Would be visually in your editor or word processor (with fixed width fonts) rendered as below _putting the ruler in as well:
---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---T---
No leading tab
leading tab
Space then Tab
Tabs between every word in the line.
Space then tab between every word in the line.
This is what the built in string/bytes method .expandtabs(4)
gives.
In my personal opinion it is best if the hook leaves the file visually unchanged as, if the originator of the file used a mix of spaces and tabs to align the words in a specific way, then they would expect the output of the hook to look the same.