Sqlparser parseMultiple doesn’t split some stuff correctly
Jakuzeisme opened this issue · 9 comments
Hey, thanks for this package, but I would love to ask: is the parseMultiple function meant to split the statements, or should we handle that ourselves? And if so, how accurate is it, please?
I got stuck somewhere for a while; it's very basic stuff and it didn't split it well, so I got confused:
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" (
"project_id" BIGINT AUTO_INCREMENT,
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
Please tell me how accurate the splitting in parseMultiple is, thanks, so I know which way to go.
ChatGPT wrote this (the max one):
List<String> splitSqlStatements(String sql) {
  List<String> statements = [];
  StringBuffer buffer = StringBuffer();
  int parenCount = 0;
  String? encloser;
  bool inSingleLineComment = false;
  bool inMultiLineComment = false;
  String? dollarQuoteTag;

  for (int i = 0; i < sql.length; i++) {
    String char = sql[i];
    String nextChar = i + 1 < sql.length ? sql[i + 1] : '';

    // Handle comments
    if (!inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null) {
      if (char == '-' && nextChar == '-') {
        inSingleLineComment = true;
      } else if (char == '/' && nextChar == '*') {
        inMultiLineComment = true;
      }
    } else if (inSingleLineComment && char == '\n') {
      inSingleLineComment = false;
    } else if (inMultiLineComment && char == '*' && nextChar == '/') {
      inMultiLineComment = false;
      buffer.write(char); // Write '*' of '*/'
      buffer.write(nextChar); // Write '/' of '*/'
      i++; // Skip '/'
      continue;
    }

    // Handle enclosers (', ", `, [, ], and PostgreSQL $$)
    if (!inSingleLineComment && !inMultiLineComment) {
      if (dollarQuoteTag != null) {
        if (sql.substring(i).startsWith(dollarQuoteTag)) {
          buffer.write(dollarQuoteTag);
          i += dollarQuoteTag.length - 1;
          dollarQuoteTag = null;
          continue;
        }
      } else if (encloser != null) {
        if (char == encloser) {
          if ((encloser == "'" || encloser == '"') && nextChar == encloser) {
            // Escaped quote by doubling, skip the next character
            buffer.write(char); // Write the first quote
            buffer.write(nextChar); // Write the second quote
            i++; // Skip the next quote
            continue;
          } else {
            encloser = null;
          }
        }
      } else {
        if (char == "'" || char == '"' || char == '`' || char == '[') {
          encloser = (char == '[') ? ']' : char;
        } else if (char == r'$' && nextChar == r'$') {
          // Handle dollar-quoted strings (PostgreSQL)
          RegExp dollarTagExp = RegExp(r'\$[^\$]*\$');
          Match? match = dollarTagExp.matchAsPrefix(sql.substring(i));
          if (match != null) {
            dollarQuoteTag = match.group(0);
            buffer.write(dollarQuoteTag!);
            i += dollarQuoteTag!.length - 1;
            continue;
          }
        }
      }
    }

    // Handle parentheses
    if (!inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null) {
      if (char == '(') {
        parenCount++;
      } else if (char == ')') {
        if (parenCount > 0) {
          parenCount--;
        }
      }
    }

    // Write character to buffer
    buffer.write(char);

    // Check for statement terminator
    if (char == ';' &&
        !inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null &&
        parenCount == 0) {
      statements.add(buffer.toString().trim());
      buffer.clear();
    }
  }

  // Add any remaining SQL as the last statement
  if (buffer.isNotEmpty) {
    String remaining = buffer.toString().trim();
    if (remaining.isNotEmpty) {
      statements.add(remaining);
    }
  }

  return statements;
}
What do you think?
We're already doing something much more sophisticated than splitSqlStatements to tokenize the input source. So the tokenizer is well aware of the semicolon, but the parser chooses to ignore it. The problem is that we don't support this syntax: PRIMARY KEY("employee_id" AUTOINCREMENT) (since SQLite doesn't support autoincrement in table constraints either). Since we want to be able to report multiple syntax errors instead of stopping at the first one, we generally fast-forward to the next semicolon if an error happens, allowing us to parse the next statement. When parsing CREATE TABLE statements, however, we had an additional piece of logic that would try to skip to the next comma for errors in a column definition or table constraint. The intention here was to parse subsequent column definitions / table constraints as part of the same CREATE TABLE statement.
In this case, skipping to the next comma at the place of the syntax error actually skips over the semicolon, so we find ourselves in the middle of the second statement and pick it up there. So the parser thinks you wrote
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
-- < skipped invalid syntax >
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
It's worth noting that this is not fundamentally incorrect. There are legitimate syntax errors reported by the parser here, and the AST for erroneous inputs will always be generated on a best-effort basis. Recovering from parsing errors is more of an art than science.
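As a rough illustration, here is a minimal sketch (assuming parseMultiple collects problems in ParseResult.errors the same way parse does, and that the parsed statements show up as children of the root node) that surfaces both the reported errors and the best-effort AST:
import 'package:sqlparser/sqlparser.dart';
import 'package:sqlparser/utils/node_to_text.dart';

void main() {
  final engine = SqlEngine();

  // A shortened version of the schema from this issue; the AUTOINCREMENT
  // inside the table constraint is what triggers the syntax error.
  final result = engine.parseMultiple('''
CREATE TABLE IF NOT EXISTS "employees" (
  "employee_id" INTEGER,
  PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" ("project_id" INTEGER);
''');

  // Nothing is thrown; all syntax errors are collected here instead.
  for (final error in result.errors) {
    print('error: $error');
  }

  // The AST is still produced on a best-effort basis.
  for (final statement in result.rootNode.childNodes) {
    print('recovered: ${statement.toSql()}');
  }
}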
In the linked commit, I've improved the error recovery to consider nested recovery scopes (so despite the error inside the table definition, we remember that we're in a statement and don't skip into the middle of the next one). The results are pretty great: only the primary key from the first statement and the AUTO_INCREMENT constraint on the first column of the second table are missing due to syntax errors.
By the way, if it's really important that you split statements at semicolons, you can use the tokenization API for that and then parse the statements one at a time. Again, all bets on the returned AST are off if there is a parsing error; we just try to make stuff up to recover from errors.
import 'package:collection/collection.dart';
import 'package:sqlparser/sqlparser.dart';
import 'package:sqlparser/utils/node_to_text.dart';

void main() {
  const source = '''
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" (
"project_id" BIGINT AUTO_INCREMENT,
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
''';

  final engine = SqlEngine();
  final tokens = engine.tokenize(source);
  final statements = tokens.splitAfter((t) => t.type == TokenType.semicolon);

  for (final statement in statements) {
    if (statement case [final first, ..., final last]) {
      final parsed = engine
          .parse(source.substring(first.firstPosition, last.lastPosition));
      print('Found ${parsed.rootNode.toSql()}, errors: ${parsed.errors}');
    }
  }
}
Hmm. Thanks for your response. I think that makes sense, especially the point about it being more of an art than science; I get that.
What do you recommend? I'm not an expert in SQL, to be honest, and I need the best possible solution. Should I stick with the script GPT came up with, or can the package handle it more efficiently?
What is the task you're trying to achieve here? Splitting a string of SQL statements by statement? If so, I'd use the tokenizer and then the snippet from my previous comment (splitting on TokenType.semicolon).
Yes, my intention is to split, not parse.
Thanks, I'm going with this:
import 'package:collection/collection.dart';
import 'package:sqlparser/sqlparser.dart';

List<String> splitSqlStatements(String source) {
  final engine = SqlEngine();
  final tokens = engine.tokenize(source);

  // Split tokens into groups ending with a semicolon
  final tokenGroups = tokens.splitAfter((t) => t.type == TokenType.semicolon);

  final statements = <String>[];
  for (final group in tokenGroups) {
    if (group.isEmpty) continue;

    final firstToken = group.first;
    final lastToken = group.last;

    // Extract the original text corresponding to the token range
    final statementText = source
        .substring(
          firstToken.span.start.offset,
          lastToken.span.end.offset,
        )
        .trim();

    if (statementText.isNotEmpty) {
      statements.add(statementText);
    }
  }

  return statements;
}
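For reference, a quick usage sketch of that helper against (a shortened version of) the schema from the top of the issue; since the split is purely token-based, the syntax error inside the first table doesn't affect it:
void main() {
  const source = '''
CREATE TABLE IF NOT EXISTS "employees" (
  "employee_id" INTEGER,
  PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" ("project_id" INTEGER);
''';

  final statements = splitSqlStatements(source);
  // Should yield two entries, one per CREATE TABLE statement.
  for (var i = 0; i < statements.length; i++) {
    print('--- statement ${i + 1} ---');
    print(statements[i]);
  }
}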
We can't really do too much with a non-SQL input. Are you hitting an infinite loop? If so, it would be good to have the exact input so I can try to reproduce this.
Yeah, I know, but I expected an error.
The intended behavior is that:
- SqlEngine.tokenize throws a CumulatedTokenizerException when encountering invalid tokens.
- SqlEngine.parse does not throw exceptions but instead reports all tokenizer and parsing errors through ParseResult.errors.
I've added a test case for this in b2a0d50, and it seems to work as intended.
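In other words, something like this minimal sketch (the invalid inputs here are made up, and the exact error output may differ):
import 'package:sqlparser/sqlparser.dart';

void main() {
  final engine = SqlEngine();

  // tokenize throws when it runs into characters it can't turn into tokens.
  try {
    engine.tokenize('SELECT \u0001 FROM t;'); // a control character, not valid SQL
  } on CumulatedTokenizerException catch (e) {
    print('tokenize threw: $e');
  }

  // parse does not throw; tokenizer and parser errors are collected instead.
  final result = engine.parse('SELEC 1;'); // misspelled keyword
  for (final error in result.errors) {
    print('parse reported: $error');
  }
}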
I can't really help you with the other question. I think the common approach is to let users pick a file to write to, which will give your app temporary access to it. Setting your target SDK to Android 10 might work, but it's possible that Google will reject your app from the Play Store. Searching for this on StackOverflow will probably give you more conclusive results.