gadget-framework/mfdb

SQLite & DuckDB support

Supporting SQLite as well as Postgres would open up a whole bunch of use cases:

  • Portable model setups
  • Vignette generation without a Postgres DB
  • Data exports that can then be inspected (even without MFDB)
  • Removing \dontrun from the MFDB documentation examples

At least one problem will be weighted mean, but we could implement that by hand in the SQL by having _sum, _sqsum and _total columns in the group, and then an outer query to do the reduce step. Less neat, but not the end of the world.
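As a rough illustration of that sum/sqsum/total approach (a minimal sketch only; the samples table and its columns are made up, not MFDB's actual schema):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "samples", data.frame(
    grp = c("a", "a", "b"),
    val = c(10, 20, 15),
    weight = c(2, 1, 4)))
# Inner query accumulates _sum / _sqsum / _total per group,
# outer query does the reduce step to a weighted mean
# (val_sqsum would feed a weighted stddev the same way)
dbGetQuery(con, "
    SELECT grp, val_sum / val_total AS val_mean
      FROM (SELECT grp,
                   SUM(val * weight) AS val_sum,
                   SUM(val * val * weight) AS val_sqsum,
                   SUM(weight) AS val_total
              FROM samples
             GROUP BY grp) s")
dbDisconnect(con)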

@bthe Okay, after an hour of wild hacking in wip-sqlite-support this is looking achievable, and I can get the following to successfully produce no results :)

mdb <- mfdb(db_params = list(drv = RSQLite::SQLite(), dbname = ":memory:"))
mfdb_sample_count(mdb, c(), list())

The business logic of MFDB seems portable enough; what needs changing is either the schema-selecting logic (which has no equivalent in SQLite, so it gets disabled) or ancillary queries, like checking whether tables exist, which are subtly different between the two.
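For instance, checking whether a table exists needs a different catalogue query per backend. A hypothetical sketch (table_exists is not MFDB's actual helper):

table_exists <- function(con, name) {
    if (inherits(con, "SQLiteConnection")) {
        # SQLite keeps its catalogue in sqlite_master
        nrow(DBI::dbGetQuery(con,
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            params = list(name))) > 0
    } else {
        # Postgres exposes the standard information_schema
        nrow(DBI::dbGetQuery(con,
            "SELECT 1 FROM information_schema.tables WHERE table_name = $1",
            params = list(name))) > 0
    }
}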

Obviously the db_params syntax above is pretty ugly, so we also need some sugar: mfdb('iceland') should refer to a Postgres DB, while mfdb('iceland.sqlite') and mfdb(':memory:') should refer to SQLite.
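A sketch of the dispatch rule (guess_backend is a made-up name, not MFDB internals):

guess_backend <- function(name) {
    # ':memory:' or a *.sqlite filename implies SQLite,
    # anything else is treated as a Postgres schema name
    if (name == ":memory:" || grepl("\\.sqlite$", name)) "sqlite" else "postgres"
}
guess_backend("iceland")         # "postgres"
guess_backend("iceland.sqlite")  # "sqlite"
guess_backend(":memory:")        # "sqlite"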

bthe commented

It would also be handy to be able to create a SQLite subset from a larger Postgres DB, to facilitate collaboration on specific topics.

Good news: wip-sqlite-support finally seems to be working. Bad news: it's not working very quickly.

You can use the existing mfdb_cs_dump() / mfdb_cs_restore() to move data from a Postgres MFDB to a SQLite one. I've done this for ling, and the queries for model setup aren't finishing in a reasonable amount of time. The answers are right, at least.
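For example (names illustrative, using the filename sugar mentioned earlier):

mdb_pg <- mfdb('ling-model')              # existing Postgres schema
mfdb_cs_dump(mdb_pg, 'ling.tar.gz')       # dump the case study to a tarball
mdb_lite <- mfdb('ling.sqlite')           # fresh SQLite file
mfdb_cs_restore(mdb_lite, 'ling.tar.gz')  # restore the tarball into it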

I suspect most of this is SQLite's query optimizer not being as forgiving as Postgres's.

bthe commented

I assume you still have some tricks up your sleeve?

Yeah, I've got a few things to try; I hadn't really looked at performance up until now.

It may also be worth trying DuckDB at this point, given it's supposed to be a drop-in replacement. If it's dramatically faster without much more work then it may be easier to just stick with that.

I wasn't getting very far with what SQLite can do without changing too much, so I had a bash at DuckDB too whilst the pain points were fresh in my head.

It didn't take too long to get a local version where the tests pass (ish :P), and it's indeed a lot faster, beating PostgreSQL at making lln_landings (2.4s vs 4.4s), let alone SQLite (~10s).

As well as the lack of FOREIGN KEY support, which I mentioned before, there's also some unique-index fruitiness when deleting and re-inserting data, which may prove irritating to work around.
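A minimal sketch of the delete/re-insert pattern that can trip it up (hypothetical table; the exact behaviour depends on the DuckDB version):

library(DBI)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE t (name VARCHAR PRIMARY KEY)")
dbExecute(con, "INSERT INTO t VALUES ('cod')")
dbExecute(con, "DELETE FROM t WHERE name = 'cod'")
# Depending on the version, this re-insert can still hit a duplicate-key
# error, since the index may hang on to the deleted entry:
dbExecute(con, "INSERT INTO t VALUES ('cod')")
dbDisconnect(con, shutdown = TRUE)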

bthe commented

Interesting. Do you expect that these shortcomings will be addressed in DuckDB any time soon?

Probably not in the near future; both are somewhat outside DuckDB's intended use case, so it's easier to work around the shortcomings than to fix them properly. Which is also true in our case.

Okay, I've merged both SQLite and DuckDB support into 7.x if you'd like a play. You can copy from one to the other via the dump/restore commands:

mdb <- mfdb('Iceland', db_params = list(host = 'mfdb.hafro.is')) ; mfdb_cs_dump(mdb, 'ling.tar.gz')
mdb <- mfdb('Iceland.duckdb') ; mfdb_cs_restore(mdb, 'ling.tar.gz')

Thanks to this I've rewritten the examples to use SQLite / DuckDB, which means we can use \dontrun a lot less; that was one of the pains of releasing to CRAN initially.

@MikkoVihtakari You might be interested in trying the 7.x branch with DuckDB, since you suggested it originally. The only difference in usage should be opening an MFDB connection with mfdb('db_filename.duckdb') instead of the Postgres schema name.

Thanks @lentinj! I will give it a try as soon as I have time to focus on Gadget model development again. Sorry for the silence. Was at sea. Back now.

bthe commented

I'm having issues with uploading stomach data: small data sets are OK, but larger data sets throw peculiar errors about missing stomach names that are indeed there, and crash my R session.

It just can't stomach your data ;)

Any hints as to what the errors look like? Making up large numbers of stomachs seems to be working fine here.

I should not have written in that email that the database works great...

It worked great. I did:

mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE)
mdb <- mfdb("data/mfdb/ghl.duckdb")

And then added data into the database. I used the database, emptied the workspace without disconnecting, and now when I try:

mdb <- mfdb("data/mfdb/ghl.duckdb")
Error in mfdb("data/mfdb/ghl.duckdb") : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

Also:

mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE)
Error in mfdb(schema_name = "data/mfdb/ghl.duckdb", destroy_schema = TRUE) : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

The solution is to unlink("data/mfdb/ghl.duckdb") (delete the file) and recreate the database.

It works until I restart R, and then I will again get:

if(!exists("mdb")) mdb <- mfdb("data/mfdb/ghl.duckdb")
Error in mfdb("data/mfdb/ghl.duckdb") : 
  Could not connect to mf database: $ operator is invalid for atomic vectors

Destroying connections without explicitly disconnecting, and ending up with a corrupted database, is something I've noticed with DuckDB too. I've not dug deeply, but when it refuses to reconnect you should notice that the write-ahead-log file data/mfdb/ghl.duckdb.wal isn't empty. Deleting the log should mean you can connect again, but at the cost of potential data loss.
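As a stopgap, something like the following (paths as in your example; note this throws away any writes still sitting in the log):

wal <- "data/mfdb/ghl.duckdb.wal"
if (file.exists(wal) && file.size(wal) > 0) {
    unlink(wal)  # discard the stale write-ahead log
}
mdb <- mfdb("data/mfdb/ghl.duckdb")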

I'll dig further and see if I can get a better answer.

@MikkoVihtakari I'm guessing you're using RStudio? Issue 1898 in https://github.com/duckdb/duckdb/issues/ will sound pretty familiar if so. Might be worth trying the example there and commenting.

I doubt there's much MFDB can do to work around it, though.

Yep, I suspected it was an RStudio problem, since it triggers that fancy connection panel. I'll report here if I find a workaround.

It will likely be fixed at some point.

Here is the issue to follow for progress: duckdb/duckdb#1898