lelit/pglast

Token's reported location shifts when query has Turkish characters


A token found by parser.parse_sql_json has a location attribute which I can feed into a substring operation. When the Turkish characters ı and ş are in the query, this location is shifted, by an amount depending on how many such characters there are. (The expected output gets cut off when that happens.)

I am pretty sure it's not just Turkish characters, but I don't know the full set of characters that cause this issue and I don't feel proficient with encodings. Feel free to edit the title.

The script below demonstrates the problem. Its original purpose is to find the schemas and tables mentioned in a given SQL query.

import json
import pglast


def used_tables(sql_query: str):
    json_str = pglast.parser.parse_sql_json(sql_query)
    parse_tree = json.loads(json_str)

    deps = set()
    for subtree in _find(parse_tree, 'RangeVar'):
        # Instead of using `schemaname` and `relname`, I use the location to
        # extract the dependency in a case-sensitive manner.
        dep = sql_query[subtree['location']:].split()[0]
        deps.add(dep)

    return deps


def _find(tree, subtree_key):
    """ Recursive searcher. """
    for key, val in tree.items():
        if key == subtree_key:
            yield val
        elif isinstance(val, list):
            for i in val:
                yield from _find(i, subtree_key)
        elif isinstance(val, dict):
            yield from _find(val, subtree_key)


from sys import version
print(version)

print('\nThis generates expected output:')
no_turkish_chars = '''
select t1.Sales Satis
from Schema.Table t1
join Schema2.Table2 t2
on t1.col = t2.col
'''
print(used_tables(no_turkish_chars))

print('\nThis fails to locate beginnings:')
# Note that the last word of first line is not "Satis" but "Satış" with a
# dotless i and an s with cedilla.
turkish_chars = '''
select t1.Sales Satış
from Schema.Table t1
join Schema2.Table2 t2
on t1.col = t2.col
'''
print(used_tables(turkish_chars))

print('\nPython has no problem indexing:')
print('ış1 ış2 ış3'[4:])

""" OUTPUT:

3.11.4 (main, Jun  8 2023, 02:02:15) [GCC 12.2.0]

This generates expected output:
{'Schema2.Table2', 'Schema.Table'}

This fails to locate beginnings:
{'hema.Table', 'hema2.Table2'}

Python has no problem indexing:
ış2 ış3

"""

This is caused by the fact that internally (that is, at the libpg_query level) the SQL statement is handled as a UTF-8 encoded string, and the reported location is the byte offset within that string, not the offset of the Unicode character.
Try using the pglast.parser.Displacements helper class, as is done in a few places, for example here, and let me know how it goes.
I see that the class is not mentioned in the documentation... I will fix that.
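
To make the byte-offset vs character-offset difference concrete, here is a minimal plain-Python sketch (byte_to_char_offset is just an illustrative name, not a pglast helper) that converts the reported location back into a character index:

import json
import pglast

query = "select t1.Sales Satış from Schema.Table t1"

# 'ı' and 'ş' each take two bytes in UTF-8, so byte offsets and character
# offsets diverge for everything that follows them.
print(query.index('Schema'))                    # character offset: 27
print(query.encode('utf-8').index(b'Schema'))   # byte offset: 29, what the parser reports


def byte_to_char_offset(text: str, byte_offset: int) -> int:
    # Decode the UTF-8 prefix ending at the reported byte offset; its length
    # in characters is the index usable to slice the original string.
    return len(text.encode('utf-8')[:byte_offset].decode('utf-8'))


tree = json.loads(pglast.parser.parse_sql_json(query))
rangevar = tree['stmts'][0]['stmt']['SelectStmt']['fromClause'][0]['RangeVar']
char_loc = byte_to_char_offset(query, rangevar['location'])
print(query[char_loc:].split()[0])   # 'Schema.Table' instead of 'hema.Table'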

Thank you for the quick response! Displacements works. Now I can have comments with all sorts of characters in my queries.

Actually, the exact location you linked was open in another window as I wrote the issue, but that's where I had given up. I looked into the definition of parse_sql_json to see why this happens and decided it was some Cython black magic I wouldn't understand. I didn't notice Displacements.

The file you linked has good documentation, once one decides to follow Displacements (line 180).

I closed the issue by accident, sorry. That is your button to click. I was trying to comment this:

Now that I think about it... it seems to me that adding a location_utf8 attribute next to location would be even better than pointing the user to Displacements. It would be in the place where one would look first, and it would mean less code for the user to write.

That's your decision, of course. Maybe the implementation is not as straightforward as it seems to me. The library works as is and it looks clean & well maintained. People with this problem will eventually find this issue (or your new documentation) and solve their problem.

Well, I bet that most users use the provided AST, returned by parse_sql(): that already has the location corrected.
It is parse_sql_json() that does not, and it is debatable whether it should do the same or not...
Will think about this, and at the very least will expose the class doc.
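
For reference, here is a minimal sketch of that AST-based route, assuming the visitors module of recent pglast releases (the class name UsedTables is just illustrative):

from pglast import parse_sql
from pglast.visitors import Visitor


class UsedTables(Visitor):
    """ Collect the raw text of every RangeVar by slicing the original query. """

    def __init__(self, query):
        super().__init__()
        self.query = query
        self.deps = set()

    def visit_RangeVar(self, ancestors, node):
        # parse_sql() reports node.location as a character offset (see above),
        # so slicing the original string works even with multi-byte characters.
        self.deps.add(self.query[node.location:].split()[0])


query = '''
select t1.Sales Satış
from Schema.Table t1
join Schema2.Table2 t2
on t1.col = t2.col
'''
visitor = UsedTables(query)
visitor(parse_sql(query))
print(visitor.deps)   # {'Schema.Table', 'Schema2.Table2'}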

I added an example of usage, and adapted your snippet to a test of parse_sql_json().

I improved the doc and released this in v5.