ohwgiles/laminar

Adding PostgreSQL support

mitya57 opened this issue · 8 comments

Hi!

We would like to use Laminar with PostgreSQL, because SQLite is less scalable and has performance issues on large amounts of data (see related issue #192).

I am willing to add PostgreSQL support to Laminar and make a pull request. In case you are open to such a contribution, I would like to discuss possible ways to do it. I see the following options:

  • Keep the existing interface of Database class, make it abstract and add two implementations: one for SQLite and one for PostgreSQL. The PostgreSQL one will then use either libpq or libpqxx.

    ➕: Less dependencies.
    ➖: Trying to fit PostgreSQL into the interface of Database class may need some non-trivial logic, e.g. with binding parameters. I feel like it may be reinventing the wheel that someone invented before us.
    ➖: Adding other backends like MySQL/MariaDB will be problematic in the future.

  • Use one of the existing C++ libraries which provide a common interface for different DB backends. For example:

    • nanodbc — a small C++ wrapper around the ODBC interface.

      ➕: With ODBC we get support for almost every database in the world.
      ➕: The API it provides is quite nice, and uses C++ templates similar to our Database class.
      ➖: While nanodbc itself is tiny (one .cpp and one .h), it needs ODBC itself, and connectors: psqlODBC or sqliteodbc.

    • SOCI — a library which implements the backends itself. The default interface is quite different from ours, but with dynamic binding we can get some approximation.

      ➕: Packaged in different distros, e.g. in Debian or nixpkgs.
      ➕: All popular DB servers, except maybe MS SQL, are supported.
      ➖: On some platforms, users will need to compile this library themselves.

    • Maybe some other library? There is a list here but I have not looked at every option.

    If we switch to one of these libraries, I will need to rewrite code that does queries. But there are not many queries and most of them are quite straightforward, so it should be less work than writing a PostgreSQL implementation of Database class.

Please let me know what you think.

Thanks for a well described proposal.

I am not wholly opposed to adding support for Postgres. If it does go ahead I think I would prefer your first proposed method, because I don't want Laminar to be a complicated "supports everything" system, and between sqlite and Postgres I think the full spectrum of needs are well served.

However, I am uncomfortable pushing ahead with this without understanding in more detail the performance issues in the sqlite implementation. The best solution IMO is to just fix that, if at all possible.

I am working on same project as mitya57

For me, Postgres would bring more than just performance of the web interface, it has better tooling and logistics.
In no particular order:

  • sqlite3 VACUUM is not as good as re-packing database with sqlite3 laminar.sqlite .dump | sqlite3 compressed.sqlite, which means that I need to take care about concurrent access myself. Postgres VACUUM FULL does it for me.
  • With Postgres I already have incremental backups setup. I don't think there is such thing with sqlite3. Yes, I can use borg or bup to do de-duplication, but that does not play nice with S3.
  • Postgres caters to programmatic access to database much better than "ssh to the box when Laminar is running".
  • Postgres has autoexplain to make it easier to understand what
    queries are running and if they are slow.

Regarding the SQLite performance issues, I think they are related to the fact that laminar stores the whole output in a field of type TEXT. And when you make any query, even if it does not request output column, SQLite still fetches it internally.

PostgreSQL has a TOAST concept: if length of a value is more than 2 kB, it will be compressed and stored out-of-line.

SQLite cannot do that based on a threshold, however it has a BLOB type. Blobs will be stored in a separate tablespace, so queries that do not request that column (e.g. on laminar's main page) will be faster. On the other hand, queries that do request that column may become a little bit slower. This will be especially visible when the outputs are small in size.

ferki commented

I love both SQLite and PostgreSQL, so I have no strong opinion for or against supporting both.

I thought to chip in with some pointers that may be interesting for some of the specific SQLite challenges mentioned above, though.

  • With Postgres I already have incremental backups setup. I don't think there is such thing with sqlite3. Yes, I can use borg or bup to do de-duplication, but that does not play nice with S3.

Litestream "is a standalone disaster recovery tool for SQLite", which "safely replicates changes incrementally to another file or S3".

  • Postgres caters to programmatic access to database much better than "ssh to the box when Laminar is running".

rqlite and dqlite might work for remote programmatic access, though I'm not sure how well these would operate on a single instance without their originally targeted clustered use cases.

For me, Postgres would bring more than just performance of the web interface, it has better tooling and logistics.

Agreed the tooling is better, and in general Postgres is more powerful.

Postgres caters to programmatic access to database much better than "ssh to the box when Laminar is running".

Of course by virtue of being a network capable server. Is this a major concern though? I wouldn't expect large amounts of performance-tuned external queries on laminar's database - there just isn't that much there to analyze. Or am I missing a use case?

Postgres has autoexplain to make it easier to understand what
queries are running and if they are slow.

Yes the tooling is irrefutably more powerful. Sqlite is irrefutably simpler and easier to set up. Supporting both comes with an added complexity cost. Again, not opposed to having added support for Postgres, but I think the performance is a much stronger reason than the tooling, and if it can be fixed in Sqlite then I think there remains rather limited justification for adding Postgres support.

SQLite cannot do that based on a threshold, however it has a BLOB type. Blobs will be stored in a separate tablespace, so queries that do not request that column (e.g. on laminar's main page) will be faster. On the other hand, queries that do request that column may become a little bit slower. This will be especially visible when the outputs are small in size.

This is interesting! The job output should almost certainly be BLOB type then. Anyone tried this out?

Postgres caters to programmatic access to database much better than "ssh to the box when Laminar is running".

Of course by virtue of being a network capable server. Is this a major concern though? I wouldn't expect large amounts of performance-tuned external queries on laminar's database - there just isn't that much there to analyze. Or am I missing a use case?

With SQLite, you can't delete old jobs (e.g. to save on the size of laminar.sqlite) from inside of the Laminar job because the database is locked.

ferki commented

SQLite cannot do that based on a threshold, however it has a BLOB type. Blobs will be stored in a separate tablespace, so queries that do not request that column (e.g. on laminar's main page) will be faster.

I believe this part of the comment might have caused some confusion here and in related issues.

PostgreSQL has its own TOAST concept to store large blobs outside the rows, allowing the planner to skip retrieving them if they don't need to be returned by a given query. PostgreSQL also has its separate tablespaces concept, which allows storing different parts of the database on different storage mediums (like storing different tables on different disks). This might allow storing TOAST data on separate disks.

While other databases, like Oracle certainly has similar concepts, I'm not aware of anything like that for SQLite, and still couldn't find something like that after hours of searching and reading docs.

@mitya57: could you point to such a built-in way to store and/or "skip querying" blobs separately for SQLite, please?

The closest I can imagine, is checking the data size to be stored against a threshold, and decide to store that internally in a BLOB (or perhaps TEXT) field, or store it externally in a file, and save only the filename pointing to it.

ferki commented

With SQLite, you can't delete old jobs (e.g. to save on the size of laminar.sqlite) from inside of the Laminar job because the database is locked.

For reference, I use the following command as part of the Laminar user's crontab for regular job cleanups (at a quiet period, though):

sqlite3 /var/lib/laminar/laminar.sqlite "DELETE FROM builds WHERE date( queuedAt, 'unixepoch', 'localtime' ) < '$(date -d '1 month ago' +%F)';"

I imagine it might need a retry mechanism or frequent execution on a more busy server to find a suitable time to clean up.

From inside a Laminar job, I expect it would possible to offload the execution of a similar command through the standard¹ at daemon.


¹: at least "for some values of standard", right :) At least the systems I managed, always had both cron and at available to schedule command execution as desired by the task at hand.