Implicit narrowing conversion might create trouble

queued_dedent_count has type uint32_t whereas buffer has type char*, so you'll wrap around for deeply indented code.

tree-sitter-agda/src/scanner.cc

Line 23 in 59cefa7

buffer[i++] = queued_dedent_count;

This line has a similar problem where uint16_t will get narrowed to char, so non-ASCII characters will be misrepresented.

tree-sitter-agda/src/scanner.cc

Line 29 in 59cefa7

buffer[i++] = *iter;

Maybe I'm grossly misunderstanding something -- the example in the Readme does show non-ASCII symbols and somehow it has been working fine?

Thanks for reporting this!
I'm not quite sure what non-ASCII characters got misrepresented means, could you please provide some screenshot to show what's happening?
The example in the Readme has been working fine for me.

Ah! Okay, I was misreading. Sorry for the false alarm. The problem is only with indentation, because you're only saving indentation values to the buffer, never actual characters. Say you have code which has

foobar = do
  -- depth = 2
  print "Blah"
  -- depth = 130
                                                                                                                                 print "Blep"

This is going to make the parser think that they're both at the same indentation because char is (almost always) 8 bits (Here's a simplified godbolt example which shows this wrap around behaviour.). I don't fully understand what the resulting SExp means though

(source_file [0, 0] - [3, 0]
  (function_clause [0, 0] - [3, 0]
    (lhs [0, 0] - [0, 6]
      (atom [0, 0] - [0, 6]
        (qualified_name [0, 0] - [0, 6])))
    (rhs [0, 7] - [3, 0]
      (expr [0, 9] - [3, 0]
        (do [0, 9] - [3, 0]
          (do_stmt [1, 2] - [2, 141]
            (atom [1, 2] - [1, 7]
              (qualified_name [1, 2] - [1, 7]))
            (atom [1, 8] - [1, 14]
              (literal [1, 8] - [1, 14]))
            (atom [2, 129] - [2, 134]
              (qualified_name [2, 129] - [2, 134]))
            (atom [2, 135] - [2, 141]
              (literal [2, 135] - [2, 141]))))))))

I guess this is not a problem in practice because people don't generate Agda code (do they?), so the indentation level isn't going to exceed 128, but it might cause issues if you do get that kind of code.

I see, but IFAIK there are also other problems regarding the scanner and indentation.
We need this customized scanner because we cannot handle indentation in the parser.
But I've only managed to make the scanner semi-functional because my C programming was so bad. That's why it's so buggy when it comes to indentation.

the indentation rules of Agda should be very similar to that of the Haskell, but we can't just copy their scanner because it has some termination issue. So the scanner we are using now was adapted from the python one.

Understood. I bumped across this because I was writing a similar scanner. I've already fixed this in the code I'm working on. Do you have time to review a PR? I can submit a patch to fix this issue.

Of course, that won't fix the other issues with the scanner but hey, at least it is something.

That would be really helpful!

@theindigamer sorry having to bug you about this, but the CI build is mysteriously failing on Linux, which didn't happen on your pull request. Is there anything I should be aware of?

The problem seems to be that the compilation command needs to be given a flag to understand that it should be compiling with C++11 features. I wonder why the CI didn't fail earlier. I don't have time right now to look into it more but I'll get back to you within ~12 hours.

Thanks a lot!!