Auto detection of syntax fails if first line is empty

Question

Auto detection of syntax fails if first line is empty

Closed this issue a month ago · 5 comments

I'm not sure if this is a bug or feature. The bug is not just for html but I use html as an example only:

Will highlight correct:

echo -e "<HTML>" | bat -pp

Will not highlight:

echo -e "\n<HTML>" | bat -pp

At the moment (when stdin is used), batcat checks for the first line to determine the syntax. Instead, it shall check the first line that is not "empty".

Answer 1 · 2025-05-26T20:36:41.000Z

Hi, thanks for bringing this up. The current behaviour matches Sublime Text, which invented the sublime-syntax highlighting grammars which bat is using. I'm a little bit afraid that changing it could cause other compatibility problems later on as well as complicate the code base / syntax detection logic. I personally have practically never encountered a file with a blank first line. But if it can be implemented cleanly, I would probably agree to merge such a change...

Answer 2 · 2025-06-23T23:32:24.000Z

I've looked into this; it seems that InputReader has a first_line property which is just read up til the first newline of the file. This property seems to be used only to check the syntax with the first line (HighlightedAssets::get_first_line_syntax in assets.rs) and to check if the file has anything to print at all (Controller::print_file in controller.rs).

Implementing this would require bat to look ahead through whitespace-only lines until it finds non-whitespace. However, first_line is also used currently to avoid re-reading the input buffer; if it's not empty, the read_line function returns something from there instead of querying the input reader. I don't think it's acceptable to have this function return a vec with more than one newline, so we'd have to change how InputReader stores the data. It could be done efficiently if we stored lines in a Vec<Vec<u8>> , such that they could be popped (the lines would have to be stored in reverse order). Then the only change outside of InputReader would be in get_first_line_syntax, where we use the first element instead of the whole property.

I don't know how clean this would be. I can implement it, but is this over-engineering for such a small case? Or is this reasonable?

Answer 3 · 2025-06-24T03:32:23.000Z

Thanks for investigating @apometta. It does sound like your solution would be clean enough, but it does indeed feel to me a bit like over-engineering, for a use-case which I have never personally encountered and can be trivially handled by a wrapper script to trim leading blank lines...
Let's see what the other maintainers think. What are your thoughts please @Enselic @eth-p ? 🙂

Answer 4 · 2025-06-24T04:52:13.000Z

As far as I know there is never a reason to keep the leading newline in a file, so simply requiring to removing any leading newline seems like a reasonable requirement to me.

I agree it feels like over engineering to handle this case.

Answer 5 · 2025-10-03T01:27:02.000Z

Just to play devil's advocate, this issue only exists because @SkyperTHC did in fact encounter files with blank lines at the start. It's not just a hypothetical 😛