gadget-framework/mfdb

SQLite & DuckDB support

Supporting SQLite as well as Postgres would open up a whole bunch of use cases:

  • Portable model setups
  • Vignette generation without a Postgres DB
  • Data exports that can then be inspected (even without MFDB)
  • Removing \dontrun from the MFDB documentation examples

At least one problem will be weighted mean, but we could implement that by hand in the SQL by having _sum, _sqsum and _total columns in the group, and then an outer query to do the reduce step. Less neat, but not the end of the world.
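As a rough illustration of that sum/sqsum/total approach (a minimal sketch only; the samples table and its columns are made up, not MFDB's actual schema):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "samples", data.frame(
    grp = c("a", "a", "b"),
    val = c(10, 20, 15),
    weight = c(2, 1, 4)))
# Inner query accumulates _sum / _sqsum / _total per group,
# outer query does the reduce step to a weighted mean
# (val_sqsum would feed a weighted stddev the same way)
dbGetQuery(con, "
    SELECT grp, val_sum / val_total AS val_mean
      FROM (SELECT grp,
                   SUM(val * weight) AS val_sum,
                   SUM(val * val * weight) AS val_sqsum,
                   SUM(weight) AS val_total
              FROM samples
             GROUP BY grp) s")
dbDisconnect(con)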

@bthe Okay, after an hour of wild hacking in wip-sqlite-support this is looking achievable, and I can get the following to successfully produce no results :)

mdb <- mfdb(db_params = list(drv = RSQLite::SQLite(), dbname = ":memory:"))
mfdb_sample_count(mdb, c(), list())

The business logic of MFDB seems portable enough; what needs changing is either the schema-selecting logic (which has no equivalent in SQLite, so it gets disabled) or ancillary queries, like checking whether tables exist, which are subtly different between the two.
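For instance, checking whether a table exists needs a different catalogue query per backend. A hypothetical sketch (table_exists is not MFDB's actual helper):

table_exists <- function(con, name) {
    if (inherits(con, "SQLiteConnection")) {
        # SQLite keeps its catalogue in sqlite_master
        nrow(DBI::dbGetQuery(con,
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            params = list(name))) > 0
    } else {
        # Postgres exposes the standard information_schema
        nrow(DBI::dbGetQuery(con,
            "SELECT 1 FROM information_schema.tables WHERE table_name = $1",
            params = list(name))) > 0
    }
}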

Obviously the db_params syntax above is pretty ugly, so we also need some sugar: mfdb('iceland') should refer to a Postgres DB, while mfdb('iceland.sqlite') and mfdb(':memory:') should refer to SQLite.
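A sketch of the dispatch rule (guess_backend is a made-up name, not MFDB internals):

guess_backend <- function(name) {
    # ':memory:' or a *.sqlite filename implies SQLite,
    # anything else is treated as a Postgres schema name
    if (name == ":memory:" || grepl("\\.sqlite$", name)) "sqlite" else "postgres"
}
guess_backend("iceland")         # "postgres"
guess_backend("iceland.sqlite")  # "sqlite"
guess_backend(":memory:")        # "sqlite"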

bthe commented

It would also be handy to be able to create a SQLite subset from a larger Postgres DB, to facilitate collaboration on specific topics.

Good news: wip-sqlite-support finally seems to be working. Bad news: it's not working very quickly.

You can use the existing mfdb_cs_dump() / mfdb_cs_restore() to move data from a Postgres MFDB to a SQLite one. I've done this for ling, and the queries for model setup aren't finishing in a reasonable amount of time. The answers are right, at least.
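For example (names illustrative, using the filename sugar mentioned earlier):

mdb_pg <- mfdb('ling-model')              # existing Postgres schema
mfdb_cs_dump(mdb_pg, 'ling.tar.gz')       # dump the case study to a tarball
mdb_lite <- mfdb('ling.sqlite')           # fresh SQLite file
mfdb_cs_restore(mdb_lite, 'ling.tar.gz')  # restore the tarball into it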

I suspect most of this is SQLite's query optimizer not being as forgiving as Postgres's.

bthe commented

I assume you still have some tricks up your sleeve?

Yeah, I've got a few things to try; I hadn't really looked at performance up until now.

It may also be worth trying DuckDB at this point, given it's supposed to be a drop-in replacement. If it's dramatically faster without much more work then it may be easier to just stick with that.

I wasn't getting very far with what SQLite can do without changing too much, so I had a bash at DuckDB too whilst the pain points were fresh in my head.

It didn't take too long to get a local version where the tests pass (ish :P), and it's indeed a lot faster, beating PostgreSQL at making lln_landings (2.4s vs 4.4s), let alone SQLite (~10s).

As well as the lack of FOREIGN KEY support, which I mentioned before, there's also some unique-index fruitiness when deleting and re-inserting data, which may prove irritating to work around.
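A minimal sketch of the delete/re-insert pattern that can trip it up (hypothetical table; the exact behaviour depends on the DuckDB version):

library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE t (name VARCHAR PRIMARY KEY)")
dbExecute(con, "INSERT INTO t VALUES ('cod')")
dbExecute(con, "DELETE FROM t WHERE name = 'cod'")
# Depending on the version, this re-insert can still hit a duplicate-key
# error, since the index may hang on to the deleted entry:
dbExecute(con, "INSERT INTO t VALUES ('cod')")
dbDisconnect(con, shutdown = TRUE)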

bthe commented

Interesting. Do you expect that these shortcomings will be addressed in DuckDB any time soon?

Probably not in the near future; both are somewhat outside DuckDB's intended use case, so it's easier to work around the shortcomings than to fix them properly. Which is also true in our case.

Okay, I've merged both SQLite and DuckDB support into 7.x if you'd like a play. You can copy from one to the other via the dump/restore commands:

mdb <- mfdb('Iceland', db_params = list(host = 'mfdb.hafro.is')) ; mfdb_cs_dump(mdb, 'ling.tar.gz')
mdb <- mfdb('Iceland.duckdb') ; mfdb_cs_restore(mdb, 'ling.tar.gz')

Thanks to this I've rewritten the examples to use SQLite / DuckDB, which means we can use \dontrun a lot less; that was one of the pains of releasing to CRAN initially.

@MikkoVihtakari You might be interested in trying the 7.x branch with DuckDB, since you suggested it originally. The only difference in usage should be opening an MFDB connection with mfdb('db_filename.duckdb') instead of the Postgres schema name.

Thanks @lentinj! I will give it a try as soon as I have time to focus on Gadget model development again. Sorry for the silence. Was at sea. Back now.

bthe commented

I'm having issues with uploading stomach data: small data sets are OK, but larger data sets throw peculiar errors about missing stomach names that are indeed there, and crash my R session.

It just can't stomach your data ;)

Any hints as to what the errors look like? Making up large numbers of stomachs seems to be working fine here.

I should not have written in that email that the database works great...

It worked great. I did:

mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE)
mdb <- mfdb("data/mfdb/ghl.duckdb")

And then added data into the database. I used the database, emptied the workspace without disconnecting, and now when I try:

mdb <- mfdb("data/mfdb/ghl.duckdb")
Error in mfdb("data/mfdb/ghl.duckdb") : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

Also:

mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE)
Error in mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE) : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

The solution is to unlink("data/mfdb/ghl.duckdb") (delete the file) and recreate the database.

It works until I restart R, and then I will again get:

if(!exists("mdb")) mdb <- mfdb("data/mfdb/ghl.duckdb")
Error in mfdb("data/mfdb/ghl.duckdb") : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

Destroying connections without explicitly disconnecting, and ending up with a corrupted database, is something I've noticed with DuckDB too. I've not dug deeply, but when it refuses to reconnect you should notice that the write-ahead-log file data/mfdb/ghl.duckdb.wal isn't empty. Deleting the log should mean you can connect again, but at the cost of potential data loss.
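As a stopgap, something like the following (paths as in your example; note this throws away any writes still sitting in the log):

wal <- "data/mfdb/ghl.duckdb.wal"
if (file.exists(wal) && file.size(wal) > 0) {
    unlink(wal)  # discard the stale write-ahead log
}
mdb <- mfdb("data/mfdb/ghl.duckdb")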

I'll dig further and see if I can get a better answer.

@MikkoVihtakari I'm guessing you're using RStudio? Issue 1898 in https://github.com/duckdb/duckdb/issues/ will sound pretty familiar if so. Might be worth trying the example there and commenting.

I doubt there's much MFDB can do to work around it, though.

Yep, I suspected it was an RStudio problem, since it triggers that fancy connection panel. I'll report here if I find a workaround.

It will likely be fixed at some point.

Here is the issue to follow for progress: duckdb/duckdb#1898