Sqlparser parseMultiple doesn’t split some stuff correctly
Jakuzeisme opened this issue · 9 comments
Hey, thanks for this package, but I would love to ask: is the parseMultiple function meant to split the statements, or should we handle that ourselves? And if so, how accurate is it, please?
I got stuck somewhere for a while; it's very basic stuff and it didn't split it well, so I got confused:
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" (
"project_id" BIGINT AUTO_INCREMENT,
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
Please tell me how accurate the splitting in parseMultiple is, thanks, so I know which way to go.
ChatGPT wrote this (the max one):
List<String> splitSqlStatements(String sql) {
  List<String> statements = [];
  StringBuffer buffer = StringBuffer();
  int parenCount = 0;
  String? encloser;
  bool inSingleLineComment = false;
  bool inMultiLineComment = false;
  String? dollarQuoteTag;

  for (int i = 0; i < sql.length; i++) {
    String char = sql[i];
    String nextChar = i + 1 < sql.length ? sql[i + 1] : '';

    // Handle comments
    if (!inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null) {
      if (char == '-' && nextChar == '-') {
        inSingleLineComment = true;
      } else if (char == '/' && nextChar == '*') {
        inMultiLineComment = true;
      }
    } else if (inSingleLineComment && char == '\n') {
      inSingleLineComment = false;
    } else if (inMultiLineComment && char == '*' && nextChar == '/') {
      inMultiLineComment = false;
      buffer.write(char); // Write '*' of '*/'
      buffer.write(nextChar); // Write '/' of '*/'
      i++; // Skip '/'
      continue;
    }

    // Handle enclosers (', ", `, [, ], and PostgreSQL $$)
    if (!inSingleLineComment && !inMultiLineComment) {
      if (dollarQuoteTag != null) {
        if (sql.substring(i).startsWith(dollarQuoteTag)) {
          buffer.write(dollarQuoteTag);
          i += dollarQuoteTag.length - 1;
          dollarQuoteTag = null;
          continue;
        }
      } else if (encloser != null) {
        if (char == encloser) {
          if ((encloser == "'" || encloser == '"') && nextChar == encloser) {
            // Escaped quote by doubling, skip the next character
            buffer.write(char); // Write the first quote
            buffer.write(nextChar); // Write the second quote
            i++; // Skip the next quote
            continue;
          } else {
            encloser = null;
          }
        }
      } else {
        if (char == "'" || char == '"' || char == '`' || char == '[') {
          encloser = (char == '[') ? ']' : char;
        } else if (char == r'$' && nextChar == r'$') {
          // Handle dollar-quoted strings (PostgreSQL)
          RegExp dollarTagExp = RegExp(r'\$[^\$]*\$');
          Match? match = dollarTagExp.matchAsPrefix(sql.substring(i));
          if (match != null) {
            dollarQuoteTag = match.group(0);
            buffer.write(dollarQuoteTag!);
            i += dollarQuoteTag!.length - 1;
            continue;
          }
        }
      }
    }

    // Handle parentheses
    if (!inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null) {
      if (char == '(') {
        parenCount++;
      } else if (char == ')') {
        if (parenCount > 0) {
          parenCount--;
        }
      }
    }

    // Write character to buffer
    buffer.write(char);

    // Check for statement terminator
    if (char == ';' &&
        !inSingleLineComment &&
        !inMultiLineComment &&
        encloser == null &&
        dollarQuoteTag == null &&
        parenCount == 0) {
      statements.add(buffer.toString().trim());
      buffer.clear();
    }
  }

  // Add any remaining SQL as the last statement
  if (buffer.isNotEmpty) {
    String remaining = buffer.toString().trim();
    if (remaining.isNotEmpty) {
      statements.add(remaining);
    }
  }

  return statements;
}
What do you think?
We're already doing something much more sophisticated than splitSqlStatements to tokenize the input source. So the tokenizer is well aware of the semicolon, but the parser chooses to ignore it. The problem is that we don't support this syntax: PRIMARY KEY("employee_id" AUTOINCREMENT) (since SQLite doesn't support autoincrement in table constraints either). Since we want to be able to report multiple syntax errors instead of stopping at the first one, we generally fast-forward to the next semicolon if an error happens, allowing us to parse the next statement. When parsing CREATE TABLE statements, however, we had an additional piece of logic that would try to skip to the next comma for errors in a column definition or table constraint. The intention here was to parse subsequent column definitions / table constraints as part of the same CREATE TABLE statement.
In this case, skipping to the next comma at the place of the syntax error actually skips over the semicolon, so we find ourselves in the middle of the second statement and pick it up there. So the parser thinks you wrote
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
-- < skipped invalid syntax >
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
It's worth noting that this is not fundamentally incorrect. There are legitimate syntax errors reported by the parser here, and the AST for erroneous inputs will always be generated on a best-effort basis. Recovering from parsing errors is more of an art than science.
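As a rough illustration, here is a minimal sketch (assuming parseMultiple collects problems in ParseResult.errors the same way parse does, and that the parsed statements show up as children of the root node) that surfaces both the reported errors and the best-effort AST:
import 'package:sqlparser/sqlparser.dart';
import 'package:sqlparser/utils/node_to_text.dart';

void main() {
  final engine = SqlEngine();

  // A shortened version of the schema from this issue; the AUTOINCREMENT
  // inside the table constraint is what triggers the syntax error.
  final result = engine.parseMultiple('''
CREATE TABLE IF NOT EXISTS "employees" (
  "employee_id" INTEGER,
  PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" ("project_id" INTEGER);
''');

  // Nothing is thrown; all syntax errors are collected here instead.
  for (final error in result.errors) {
    print('error: $error');
  }

  // The AST is still produced on a best-effort basis.
  for (final statement in result.rootNode.childNodes) {
    print('recovered: ${statement.toSql()}');
  }
}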
In the linked commit, I've improved the error recovery to consider nested recovery scopes (so despite the error inside the table definition, we remember that we're in a statement and don't skip into the middle of the next one). The results are pretty great: only the primary key from the first statement and the AUTO_INCREMENT constraint on the first column of the second table are missing due to syntax errors.
By the way, if it's really important that you split statements at semicolons, you can use the tokenization API for that and then parse the statements one at a time. Again, all bets on the returned AST are off if there is a parsing error; we just try to make stuff up to recover from errors.
import 'package:collection/collection.dart';
import 'package:sqlparser/sqlparser.dart';
import 'package:sqlparser/utils/node_to_text.dart';

void main() {
  const source = '''
CREATE TABLE IF NOT EXISTS "employees" (
"employee_id" INTEGER,
"department_id" INTEGER NOT NULL,
"employee_name" TEXT(100) NOT NULL,
PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" (
"project_id" BIGINT AUTO_INCREMENT,
"employee_id" INT NOT NULL,
"project_title" CHAR(255) NOT NULL,
PRIMARY KEY("project_id", "employee_id")
);
''';

  final engine = SqlEngine();
  final tokens = engine.tokenize(source);
  final statements = tokens.splitAfter((t) => t.type == TokenType.semicolon);

  for (final statement in statements) {
    if (statement case [final first, ..., final last]) {
      final parsed = engine
          .parse(source.substring(first.firstPosition, last.lastPosition));
      print('Found ${parsed.rootNode.toSql()}, errors: ${parsed.errors}');
    }
  }
}
Hmm. Thanks for your response. I think that makes sense, especially the point about it being more of an art than science; I get that.
What do you recommend? I'm not an expert in SQL, to be honest, and I need the best possible solution. Should I stick with the script GPT came up with, or can the package handle it more efficiently?
What is the task you're trying to achieve here? Splitting a string of SQL statements by statement? If so, I'd use the tokenizer and then the snippet from my previous comment (splitting on TokenType.semicolon).
Yes, my intention is to split, not parse.
Thanks, I'm going with this:
import 'package:collection/collection.dart';
import 'package:sqlparser/sqlparser.dart';

List<String> splitSqlStatements(String source) {
  final engine = SqlEngine();
  final tokens = engine.tokenize(source);

  // Split tokens into groups ending with a semicolon
  final tokenGroups = tokens.splitAfter((t) => t.type == TokenType.semicolon);

  final statements = <String>[];
  for (final group in tokenGroups) {
    if (group.isEmpty) continue;

    final firstToken = group.first;
    final lastToken = group.last;

    // Extract the original text corresponding to the token range
    final statementText = source
        .substring(
          firstToken.span.start.offset,
          lastToken.span.end.offset,
        )
        .trim();

    if (statementText.isNotEmpty) {
      statements.add(statementText);
    }
  }

  return statements;
}
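For reference, a quick usage sketch of that helper against (a shortened version of) the schema from the top of the issue; since the split is purely token-based, the syntax error inside the first table doesn't affect it:
void main() {
  const source = '''
CREATE TABLE IF NOT EXISTS "employees" (
  "employee_id" INTEGER,
  PRIMARY KEY("employee_id" AUTOINCREMENT)
);
CREATE TABLE IF NOT EXISTS "projects" ("project_id" INTEGER);
''';

  final statements = splitSqlStatements(source);
  // Should yield two entries, one per CREATE TABLE statement.
  for (var i = 0; i < statements.length; i++) {
    print('--- statement ${i + 1} ---');
    print(statements[i]);
  }
}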
We can't really do too much with a non-SQL input. Are you hitting an infinite loop? If so, it would be good to have the exact input so I can try to reproduce this.
Yeah, I know, but I expected an error.
The intended behavior is that:
- SqlEngine.tokenize throws a CumulatedTokenizerException when encountering invalid tokens.
- SqlEngine.parse does not throw exceptions but instead reports all tokenizer and parsing errors through ParseResult.errors.
I've added a test case for this in b2a0d50, and it seems to work as intended.
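In other words, something like this minimal sketch (the invalid inputs here are made up, and the exact error output may differ):
import 'package:sqlparser/sqlparser.dart';

void main() {
  final engine = SqlEngine();

  // tokenize throws when it runs into characters it can't turn into tokens.
  try {
    engine.tokenize('SELECT \u0001 FROM t;'); // a control character, not valid SQL
  } on CumulatedTokenizerException catch (e) {
    print('tokenize threw: $e');
  }

  // parse does not throw; tokenizer and parser errors are collected instead.
  final result = engine.parse('SELEC 1;'); // misspelled keyword
  for (final error in result.errors) {
    print('parse reported: $error');
  }
}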
I can't really help you with the other question. I think the common approach is to let users pick a file to write to, which will give your app temporary access to it. Setting your target SDK to Android 10 might work, but it's possible that Google will reject your app from the Play Store. Searching for this on StackOverflow will probably give you more conclusive results.