simonw/sqlite-utils

Problem with square bracket in CSV column name

foscoj opened this issue · 7 comments

testing some data from european power information (entsoe.eu), the title of the csv contains square brackets.
as I am playing with glitch, sqlite-utils are used for creating the db.

Traceback (most recent call last):

File "/app/.local/bin/sqlite-utils", line 8, in

sys.exit(cli())

File "/app/.local/lib/python3.7/site-packages/click/core.py", line 764, in call

return self.main(*args, **kwargs)

File "/app/.local/lib/python3.7/site-packages/click/core.py", line 717, in main

rv = self.invoke(ctx)

File "/app/.local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke

return _process_result(sub_ctx.command.invoke(sub_ctx))

File "/app/.local/lib/python3.7/site-packages/click/core.py", line 956, in invoke

return ctx.invoke(self.callback, **ctx.params)

File "/app/.local/lib/python3.7/site-packages/click/core.py", line 555, in invoke

return callback(*args, **kwargs)

File "/app/.local/lib/python3.7/site-packages/sqlite_utils/cli.py", line 434, in insert

default=default,

File "/app/.local/lib/python3.7/site-packages/sqlite_utils/cli.py", line 384, in insert_upsert_implementation

docs, pk=pk, batch_size=batch_size, alter=alter, **extra_kwargs

File "/app/.local/lib/python3.7/site-packages/sqlite_utils/db.py", line 997, in insert_all

extracts=extracts,

File "/app/.local/lib/python3.7/site-packages/sqlite_utils/db.py", line 618, in create

extracts=extracts,

File "/app/.local/lib/python3.7/site-packages/sqlite_utils/db.py", line 310, in create_table

self.conn.execute(sql)

sqlite3.OperationalError: unrecognized token: "]"

entsoe_2016.csv

renamed to txt for uploading compatibility

entsoe_2016.txt

code is remixed directly from your https://glitch.com/edit/#!/datasette-csvs repo

Thanks for the example file - looks like it can be trimmed down to just these two lines to replicate the bug:

"MTU (CET)","Day-ahead Price [EUR/MWh]"
"01.01.2016 00:00 - 01.01.2016 01:00","23.86"

https://stackoverflow.com/a/22694438 looks like the answer:

When using square brackets, it is not possible to have these characters in the identifier.

When using double quotes, you can escape them in the name by doubling them:

CREATE TABLE "hello ""world"""(key INTEGER PRIMARY KEY);

There's something weird about this. I created a test database file like so:

sqlite3 /tmp/demo.db <<EOF
    BEGIN TRANSACTION;
    CREATE TABLE "data" (
        "MTU (CET)" TEXT,
        "Day-ahead Price [EUR/MWh]" TEXT
    );
    INSERT INTO "data" VALUES('01.01.2016 00:00 - 01.01.2016 01:00','23.86');
    COMMIT;
EOF

Then confirmed that it works as expected in SQLite:

 $ sqlite3 /tmp/demo.db 
SQLite version 3.24.0 2018-06-04 14:10:15
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "data" (
        "MTU (CET)" TEXT,
        "Day-ahead Price [EUR/MWh]" TEXT
    );
sqlite> .headers on
sqlite> select * from data;
MTU (CET)|Day-ahead Price [EUR/MWh]
01.01.2016 00:00 - 01.01.2016 01:00|23.86
sqlite> 

BUT... if I open the same database in Python, something weird happens:

In [1]: import sqlite3                                                                                                         
In [2]: conn = sqlite3.connect("/tmp/demo.db")                                                                                 
In [3]: cursor = conn.cursor()                                                                                                 
In [4]: cursor.execute("select * from data")                                                                                   
Out[4]: <sqlite3.Cursor at 0x10c70a0a0>
In [5]: cursor.fetchall()                                                                                                      
Out[5]: [('01.01.2016 00:00 - 01.01.2016 01:00', '23.86')]
In [6]: cursor.description                                                                                                     
Out[6]: 
(('MTU (CET)', None, None, None, None, None, None),
 ('Day-ahead Price', None, None, None, None, None, None))
In [7]: conn.row_factory = sqlite3.Row                                                                                         
In [8]: cursor = conn.cursor()                                                                                                 
In [9]: cursor.execute("select * from data")                                                                                   
Out[9]: <sqlite3.Cursor at 0x10c7a8490>
In [10]: row = cursor.fetchall()                                                                                                     
In [12]: row                                                                                                                   
Out[12]: <sqlite3.Row at 0x10c3fe670>
In [15]: row.keys()                                                                                                            
Out[15]: ['MTU (CET)', 'Day-ahead Price']

Note that in cursor.description AND in row.keys() above the second column is displayed as 'Day-ahead Price' - when we would expect it to be displayed as Day-ahead Price [EUR/MWh]

So.... it looks like there may be a bug in Python's sqlite3 module where columns with square braces in them have that portion of the name stripped out!

I'm not sure what to do about this one.

I can't fix it: this bug in Python's sqlite3 module means that even if I write a database out with column names that include [] I won't be able to read them back again.

So... I could do one of the following:

  • Throw an error if a column name includes those characters. That's my preferred option I think.
  • Automatically replace [ in column names with ( and ] with )
  • Do the automatic replacement but show a user-visible warning when I do it
  • Throw an error, but give the user an option to run with e.g. --fix-column-names which applies that automatic fix.

Since this is likely to be an incredibly rare edge-case I think I'd rather minimize the amount of code that deals with it, so my preferred option is to just throw that error and stop.

Probably the best option to just throw the error.
Is there any active dev chan where we could post the issue to python sqlite3?

I filed a bug in the Python issue tracker here: https://bugs.python.org/issue39652

I pushed a branch with my experiment in it, but I'm going to fix this by throwing an error on [ or ] in a column name instead - I won't implement the changes from that branch.