mzabani/codd

SQL schema specification inconsistencies

Closed this issue · 12 comments

Codd seems to have issues with schema specifications in SQL depending on the search path. For example, consider the following:

ALTER TABLE ONLY public.email
  ADD CONSTRAINT email_employee__id_fkey
    FOREIGN KEY (employee__id)
      REFERENCES public.employee(id);

This code specifies the public schema, but doing so is not required when that schema is at the front of the search path. Indeed, the default search path is "$user", public, so specifying public is not required (as long as no schema named after the current user exists). Note that some people customize the search path to include their own schema, so this is not just an issue with public.

PostgreSQL dump files created with pg_dump set the search path to the empty string in order to avoid potential issues with an unknown (customized) search path, and they consistently schema-qualify all names. Codd can add a dump file without issue, but verification then reports discrepancies because schema qualification becomes inconsistent whenever a schema that is on the search path is used.

As an example, please see my debugging log below.


(1) Create the test schema

Run a test database:

$ docker run \
    --detach \
    --name "pg_codd" \
    --publish "127.0.0.1:5432:5432" \
    --env POSTGRES_DB="postgres" \
    --env POSTGRES_USER="postgres" \
    --env POSTGRES_PASSWORD="password" \
    "postgres:11-alpine"

Create a test project:

$ mkdir /tmp/codd-public
$ cd /tmp/codd-public
$ mkdir bin config sql-migrations expected-schema
$ wget -O bin/codd \
    https://github.com/mzabani/codd/releases/download/v0.1.1/codd
$ chmod 0755 bin/codd

Configure settings in config/codd-dev.sh:

export CODD_CONNECTION=postgres://postgres:password@127.0.0.1/postgres
export CODD_EXPECTED_SCHEMA_DIR=expected-schema
export CODD_MIGRATION_DIRS=sql-migrations

Load settings in the current shell:

$ source config/codd-dev.sh

Create a migration in 0001-init-codd.sql that allows Codd to initialize a database without having to create it (reference):

-- codd: no-txn

SELECT 1;

Add the migration, initializing the database:

$ ./bin/codd add 0001-init-codd.sql

Create a migration in 0002-init-schema.sql that initializes the schema:

CREATE TABLE public.employee (
  id SERIAL NOT NULL,
  name TEXT NOT NULL,
  CONSTRAINT employee_pkey PRIMARY KEY (id)
);

CREATE TABLE public.email (
  employee__id INTEGER NOT NULL,
  address TEXT NOT NULL,
  CONSTRAINT email_pkey PRIMARY KEY (employee__id, address),
  CONSTRAINT email_employee__id_fkey
    FOREIGN KEY (employee__id)
      REFERENCES public.employee(id)
);

CREATE UNIQUE INDEX email_lower_address_unique
  ON public.email(lower(address));

Note that specifying public here does not cause any problems.

Add the migration:

$ ./bin/codd add 0002-init-schema.sql

Verify the schema:

$ ./bin/codd verify-schema
[Info] Database and expected schemas match.

Dump the schema to dump.sql:

$ docker exec pg_codd \
    pg_dump \
      --host 127.0.0.1 \
      --port 5432 \
      --username postgres \
      --no-password \
      --schema-only \
      --exclude-schema=codd_schema \
      --quote-all-identifiers \
      postgres \
  > dump.sql

(2) Start over using the dumped schema

Stop the database and reset the project schema/migrations:

$ docker stop pg_codd
$ docker rm pg_codd
$ rm -rf expected-schema/*
$ mv sql-migrations/*-0001-init-codd.sql 0001-init-codd.sql
$ mv sql-migrations/*-0002-init-schema.sql 0002-init-schema.sql

Start a new (empty) database:

$ docker run \
    --detach \
    --name "pg_codd" \
    --publish "127.0.0.1:5432:5432" \
    --env POSTGRES_DB="postgres" \
    --env POSTGRES_USER="postgres" \
    --env POSTGRES_PASSWORD="password" \
    "postgres:11-alpine"

Add the first migration, initializing the database:

$ ./bin/codd add 0001-init-codd.sql

Add the dumped schema:

$ ./bin/codd add dump.sql

Verifying the schema now results in an error:

$ ./bin/codd verify-schema
[Error] DB and expected schemas do not match. Differing objects and their current DB schemas are:
{
    "schemas/public/tables/email/constraints/email_employee__id_fkey": [
        "different-schemas",
        {
            "deferrable": false,
            "deferred": false,
            "definition": "FOREIGN KEY (employee__id) REFERENCES employee(id)",
            "fk_deltype": "a",
            "fk_matchtype": "s",
            "fk_ref_table": "employee",
            "fk_updtype": "a",
            "inhcount": 0,
            "local": true,
            "noinherit": true,
            "parent_constraint": null,
            "supporting_index": "employee_pkey",
            "type": "f",
            "validated": true
        }
    ],
    "schemas/public/tables/employee/cols/id": [
        "different-schemas",
        {
            "collation": null,
            "collation_nsp": null,
            "default": "nextval('employee_id_seq'::regclass)",
            "hasdefault": true,
            "identity": "",
            "inhcount": 0,
            "local": true,
            "notnull": true,
            "order": 1,
            "privileges": null,
            "type": "int4"
        }
    ]
}

Query the expected email_employee__id_fkey state:

$ cat expected-schema/schemas/public/tables/email/constraints/email_employee__id_fkey \
  | python -m json.tool
{
    "deferrable": false,
    "deferred": false,
    "definition": "FOREIGN KEY (employee__id) REFERENCES public.employee(id)",
    "fk_deltype": "a",
    "fk_matchtype": "s",
    "fk_ref_table": "employee",
    "fk_updtype": "a",
    "inhcount": 0,
    "local": true,
    "noinherit": true,
    "parent_constraint": null,
    "supporting_index": "employee_pkey",
    "type": "f",
    "validated": true
}

The difference is the schema specification:

- "definition": "FOREIGN KEY (employee__id) REFERENCES public.employee(id)",
+ "definition": "FOREIGN KEY (employee__id) REFERENCES employee(id)",

Query the expected employee.id state:

$ cat expected-schema/schemas/public/tables/employee/cols/id \
  | python -m json.tool
{
    "collation": null,
    "collation_nsp": null,
    "default": "nextval('public.employee_id_seq'::regclass)",
    "hasdefault": true,
    "identity": "",
    "inhcount": 0,
    "local": true,
    "notnull": true,
    "order": 1,
    "privileges": null,
    "type": "int4"
}

The difference is the schema specification:

- "default": "nextval('employee_id_seq'::regclass)",
+ "default": "nextval('public.employee_id_seq'::regclass)",

(3) Start over using a dumped schema that is patched

Stop the database and reset the project schema/migrations:

$ docker stop pg_codd
$ docker rm pg_codd
$ rm -rf expected-schema/*
$ mv sql-migrations/*-0001-init-codd.sql 0001-init-codd.sql
$ mv sql-migrations/*-dump.sql dump.sql

The dump file includes the following settings:

$ sed -n '8,21p;22q' dump.sql
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;

SET default_tablespace = '';

SET default_with_oids = false;

This is the setting that clears the search path (on line 13):

SELECT pg_catalog.set_config('search_path', '', false);

Create a patched dump with this setting commented out:

$ sed '13s/^/-- /' dump.sql > dump-patched.sql

Start a new (empty) database:

$ docker run \
    --detach \
    --name "pg_codd" \
    --publish "127.0.0.1:5432:5432" \
    --env POSTGRES_DB="postgres" \
    --env POSTGRES_USER="postgres" \
    --env POSTGRES_PASSWORD="password" \
    "postgres:11-alpine"

Add the first migration, initializing the database:

$ ./bin/codd add 0001-init-codd.sql

Add the patched dump:

$ ./bin/codd add dump-patched.sql

Verify the schema:

$ ./bin/codd verify-schema
[Info] Database and expected schemas match.

Thank you very much for such a detailed bug report. And I'm very happy to see you experimenting with codd like in the article too! Also, I still need to reply to your post in Discourse, which I hope to do Sunday or early next week.

I wasn't aware of this issue, so thanks for figuring it out. The solution to it seems to be very simple - codd could just set search_path = '' before building the database schema (and restore the original value afterwards just for sanity purposes). In fact, I cooked something up quickly over the last hour and it seems to pass existing tests locally.

I haven't merged and released it yet because I'd like to add a test for this first, and double-check I'm not missing something. But fixing this is my next top priority, and I hope to get it done next week because one or two weeks after that I'm afraid I'll be very busy.

In the meantime, I'm attaching a statically compiled version with that change, hoping it might do the trick for you until a proper release comes along. You can also look at the code changes (it includes other unrelated stuff for now) at #134.

codd.zip

Thank you!

I had been thinking about the same approach to resolving the issue. I was unsure about how the search path is scoped, however. Investigating, I found that the set_config documentation states the following:

If is_local is true, the new value will only apply during the current transaction. If you want the new value to apply for the rest of the current session, use false instead.

The term "session" is unfortunately not explicitly defined. A Stack Overflow answer concisely defines it as "synonymous with a TCP connection" with links to various supporting content.

We set the search path to the empty string using a SET command and later restore it to the previous value using set_config. Is there a reason for using different ways to set the value? Perhaps SET is preferred for setting constant values while set_config is preferred/necessary for setting variable values?

The SET documentation states the following:

If SET (or equivalently SET SESSION) is issued within a transaction that is later aborted, the effects of the SET command disappear when the transaction is rolled back. Once the surrounding transaction is committed, the effects will persist until the end of the session, unless overridden by another SET.

We are not doing this within a transaction, correct? In this case, the SET command immediately affects the whole session. This should not cause issues with a session only used (sequentially) by Codd, but I think it would be preferable to minimize the scope to be safe. What do you think about using a transaction and using SET LOCAL search_path='' to change the configuration?
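To illustrate, here is a minimal postgresql-simple sketch of what I have in mind (my own names; nothing from Codd). SET LOCAL only lasts until the transaction ends, so the session's search path is untouched afterwards:

{-# LANGUAGE OverloadedStrings #-}

import Database.PostgreSQL.Simple

-- Hypothetical sketch: clear the search path for the duration of the
-- transaction only, then run the schema queries inside it.
readWithEmptySearchPath :: Connection -> IO [Only String]
readWithEmptySearchPath conn = withTransaction conn $ do
  _ <- execute_ conn "SET LOCAL search_path TO ''"
  -- ... the actual schema queries would go here; with an empty search
  -- path, unqualified names no longer resolve.
  query_ conn "SHOW search_path"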

Using a transaction would also be preferable to ensure that the selected data is consistent, IMHO.

I completely empathize with having a busy schedule. 😄 I really appreciate your time and efforts!

We are not doing this within a transaction, correct? In this case, the SET command immediately affects the whole session. This should not cause issues with a session only used (sequentially) by Codd, but I think it would be preferable to minimize the scope to be safe. What do you think about using a transaction and using SET LOCAL search_path='' to change the configuration?

Good point. By default, the function that builds the schema from the database runs in a transaction before committing, but perhaps only if all pending migrations are in-txn (i.e. not even one no-txn migration pending). I'll have to check if that's the case, but it might be a good opportunity to create a class constraint that could be used to ensure a function only runs inside a transaction, e.g. an InTxn m constraint. That'd be a more solid foundation to make sure we're building the db schema atomically. It might infect test code significantly, but it's a small price to pay.
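Roughly what I'm imagining, as a sketch only (none of these names exist in codd today):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Database.PostgreSQL.Simple (Connection, withTransaction)

-- Hypothetical sketch: InTxn has a single instance, for a newtype
-- whose only entry point opens a transaction. In real code the Txn
-- constructor would not be exported, so functions with an InTxn
-- constraint could not be called from outside withTxn.
class Monad m => InTxn m

newtype Txn a = Txn (IO a)
  deriving (Functor, Applicative, Monad)

instance InTxn Txn

withTxn :: Connection -> Txn a -> IO a
withTxn conn (Txn act) = withTransaction conn act

The schema-building function would then get a signature like InTxn m => m DbSchema, and the type checker would reject any call site outside a transaction.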

We set the search path to the empty string using a SET command and later restore it to the previous value using set_config. Is there a reason for using different ways to set the value? Perhaps SET is preferred for setting constant values while set_config is preferred/necessary for setting variable values?

The only reason is odd parsing rules for SET:

codd-experiments=> show search_path;
   search_path   
-----------------
 "$user", public
(1 row)

codd-experiments=> set search_path to '';
SET
codd-experiments=> set search_path to '"$user", public';
SET
codd-experiments=> show search_path;
     search_path     
---------------------
 """$user"", public"
(1 row)

SET works if I don't wrap the value in single quotes, but since I'm using value interpolation from postgresql-simple, I'm not sure that would work with the ? placeholder. set_config, at least, is a regular function with familiar syntax :)

But using set_config twice to avoid confusion sounds like a good idea.
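For the record, something like this untested sketch is what I have in mind (not codd's actual code): because set_config takes the value as a regular function argument, the ? placeholder goes through with no special quoting rules.

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (void)
import Database.PostgreSQL.Simple

-- Hypothetical helper: set_config is an ordinary function, so the
-- value interpolates like any other query parameter.
setSearchPath :: Connection -> String -> IO ()
setSearchPath conn path =
  void (query conn "SELECT set_config('search_path', ?, false)"
          (Only path) :: IO [Only String])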

By default, the function that builds the schema from the database runs in a transaction before committing

I had been thinking that readSchemaFromDatabase could be implemented using a transaction, but I now realize why this is not viable. This function is sometimes called within an existing transaction (such as when using add without no-txn) and sometimes without (such as when using verify-schema).

it might be a good opportunity to create a class constraint that could be used to ensure a function only runs inside a transaction

Good idea!

But using set_config twice to avoid confusion sounds like a good idea.

Thank you for the explanation.

If readSchemaFromDatabase itself were implemented using a transaction, restoring the previous value would not be necessary (thanks to transaction-local configuration).

I wonder if restoring the previous value is necessary in the actual case? It is if subsequent commands are executed on the same connection, but perhaps reading the schema state is always the last thing done in a session? I guess restoring the value is a good idea regardless, for safety in case the function is used differently by somebody.

One small suggestion is to implement a function withSearchPath instead of interleaving the implementation in readSchemaFromDatabase.
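Something along these lines is what I am imagining, reusing a setter like the setSearchPath sketch above (made-up names; this version is deliberately naive about exceptions):

import Database.PostgreSQL.Simple

-- Hypothetical sketch: save the current value, run the action with
-- the requested search path, and restore the saved value afterwards.
-- The partial pattern match and exception safety are glossed over.
withSearchPath :: Connection -> String -> IO a -> IO a
withSearchPath conn newPath action = do
  [Only oldPath] <- query_ conn "SELECT current_setting('search_path')"
  setSearchPath conn newPath
  result <- action
  setSearchPath conn oldPath
  pure result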

I guess restoring the value is a good idea regardless, for safety in case the function is used differently by somebody.

Yep, that's what I'm thinking, too.

One small suggestion is to implement a function withSearchPath instead of interleaving the implementation in readSchemaFromDatabase.

Sounds good.

I'll start working on this asap. Still, it might take me some time given there's a lot happening - I'm moving to a different place, which means a couple of weeks of reduced free time in my evenings.

But it's only a matter of time. Hopefully the executable I shared in the other comment works for you temporarily?

Sounds good.

No worries about the timing. Good luck with your move!

I have not started building Codd yet. The environment where I use it is a Reflex FRP project that is still using GHC 8.6.5 on an old nixpkgs revision. We are already in the process of migrating to GHC 8.10.7, so I have been waiting to integrate Codd until that is done. I will be building my own executables soon, and the one that you shared is very helpful in the meantime. Thanks!

By the way, I did an experiment with PostgreSQL upgrades and wrote about it in my blog: Codd Experiment 2: PostgreSQL Upgrade. It might be worthwhile to add functionality to Codd to help verify such upgrades, but it is also possible to verify them manually.

I created an issue regarding the error messages (#136). I will create issues for the usability suggestions, but there is a blog entry that I would like to write first so that I can reference it. That blog entry is also a prerequisite for a possible PostgreSQL upgrade issue, and it relates to managing multiple environments as well.

EDIT: I was able to finish that prerequisite blog entry: Codd Design Ideas.

I'm finally kind of back. Sorry it took so long, but I had more issues to fix in the new place than I expected.

Since I'm still dealing with some remnants of the move, I'm not back full time yet. In light of that, I think I'll focus on getting a small PR that fixes this issue in first, and work on enforcing being inside a transaction in the call signature in a second PR.

Also, I'll make sure to arrange time to read your blog entry on design ideas, and thanks once again for experimenting and writing about your thoughts!

Hi! No worries!

First resolving the path issue and then ensuring that the database state is retrieved within a transaction sounds like a good plan.

I had some more design ideas and documented them in a Codd Design Ideas (Part 2) blog entry. In summary, storing hashes of the JSON files ("snapshots") that represent the database state makes it possible to check equality with any recorded state. It is then possible to not commit snapshots (expected-schema) to the repository, making resolving conflicts easier. Snapshots can be stored/shared separately, and it is possible to recreate them on demand using temporary PostgreSQL instances, confirming validity using the hashes. Persisting the hashes for all states makes it easy to implement inverse validation, multi-step validation, and checkpoint validation. The design makes it possible to handle PostgreSQL upgrades, with or without checkpoints. I have since realized that the migration software version should also be tracked in order to handle upgrades that change snapshot details (such as adding or removing a configuration property).
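As a tiny illustration of the core primitive (assuming the cryptonite package; this is not code from my prototype):

import Crypto.Hash (Digest, SHA256, hash)
import qualified Data.ByteString as BS

-- Hypothetical sketch: hash each snapshot file so that equality with
-- any previously recorded state can be checked without committing the
-- snapshot files themselves to the repository.
snapshotHash :: FilePath -> IO (Digest SHA256)
snapshotHash path = hash <$> BS.readFile path

Hashing a sorted list of (path, hash) pairs would then yield a single hash per database state.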

I had some time last week to start a prototype of these ideas, and I hope to find time to work on it again sometime soon. It is based on snapshots of database state but is otherwise very different from how Codd currently works. You are of course welcome to use any ideas that you want, but I totally understand if you do not want to make such huge design changes. Perhaps implementing in a separate tool is more appropriate.

I worked on the snapshot queries a bit, and I now much better appreciate the work that you have put into those. Nice job!

I have mentioned elsewhere that I would like to just manage the database schema and not include roles/permissions. I was thinking that it might be better to check database settings, roles, and permissions separately. I have since realized that ownership complicates this idea, and I am still thinking about this.

One thing that I do not yet fully understand is inclusion of grantors. It seems to me that the role used to grant permissions is not significant. For example, perhaps development servers use the default superuser postgres while production servers use a different superuser as a security measure. I only care that the role that manages the database has consistent permissions, not the name of the superuser that granted those permissions. Would it be problematic to always leave out grantors? What am I missing?

First resolving the path issue and then ensuring that the database state is retrieved within a transaction sounds like a good plan.

Good to hear that. I have such a PR ready at #134. I've been working on codd by myself since its inception, but I thought I should ask if you would like to review PRs. Please feel free to say no, whether because you don't want to, don't have time, or for no reason at all. And even if your answer is "No" for this PR, feel free to mention specific PRs you would like to review in the future.

I had some more design ideas and documented them in a Codd Design Ideas (Part 2) blog entry. In summary,

I think we are already discussing your ideas in more depth in another issue, so I won't comment on them here. In short, I find them very interesting and hope we can find some way to fit them into codd, but I think both the current workflow and your proposed one have pros and cons (unless I misunderstand yours), so I'd like to think we can keep both without making a mess of codd. Like I said in the other issue, we may want to open an issue to discuss your model and how to fit it into codd.

I worked on the snapshot queries a bit, and I now much better appreciate the work that you have put into those. Nice job!

Thanks :)

One thing that I do not yet fully understand is inclusion of grantors.
I only care that the role that manages the database has consistent permissions, not the name of the superuser that granted those permissions. Would it be problematic to always leave out grantors? What am I missing?

There are some DDL statements which can make a difference if you ignore grantors. One example I can think of is if you run:

set role grantor_1; revoke select on table employee from some_user;
set role grantor_2; revoke select on table employee from some_user;
set role some_user; select * from employee;
  • Case 1: If some_user has select privileges granted only by those two grantor users, this query will fail.
  • Case 2: If some_user has select privileges granted by grantor_1 and grantor_9 instead, this query will return results.

So codd needs to differentiate grantors to make sure the schemas differ in these two cases, or else it wouldn't hold its promise that if you have the same schema, the same query and the same data, you get the same results.

For example, perhaps development servers use the default superuser postgres while production servers use a different superuser as a security measure

In the shop I work at we decided to recreate the users that exist in our cloud postgres offering locally. They're not exactly the same since postgres cloud offerings are slightly modified from vanilla postgres, but we've had good success with that model.
But I'm curious about the security aspect. Can you elaborate on how different usernames might be a good thing for security?

Looks good!

I am happy to review when I have time, but I cannot guarantee prompt availability.

One thing that I have been thinking about is use of the Codd library from other applications. For example, the application that I am working on has a development mode that automatically performs migrations (to a Gargoyle-managed local database) on start. Many assumptions can be made when the library is just used by the project executable (codd), while more care must be taken for a library that may be used in other contexts.

IMHO, exception management is one of the trickiest things about Haskell, primarily because exceptions are not tracked in the types. The current implementation of withEmptySearchPath is not exception-safe: the search path is not reset if f raises an exception. This is not an issue when the exception terminates the session and/or program, but perhaps it could cause problems if the exception is caught and the connection/session is used again. This potential issue can be resolved by using bracket. The readSchemaFromDatabase function has a MonadUnliftIO constraint, so UnliftIO.Exception.bracket could be used to provide additional safety.
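For example, a bracketed variant of the withSearchPath sketch from my earlier comment (a sketch only, reusing the setSearchPath helper sketched earlier in the thread):

import Database.PostgreSQL.Simple
import UnliftIO.Exception (bracket)

-- Hypothetical sketch: bracket runs the restore step even when the
-- body throws. Async exceptions and aborted transactions complicate
-- this in practice, so treat this as an outline, not a fix.
withSearchPath :: Connection -> String -> IO a -> IO a
withSearchPath conn newPath action =
  bracket
    (do [Only oldPath] <- query_ conn "SELECT current_setting('search_path')"
        setSearchPath conn newPath
        pure oldPath)
    (setSearchPath conn)   -- restore the saved value on any exit
    (\_oldPath -> action)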

That said, withEmptySearchPath is in a where clause and can only be used within readSchemaFromDatabase, so perhaps there is no need.


Some people avoid default names for administrative users, just to provide one more hurdle for attackers. When password authentication is used, attackers may attempt to brute-force the password for the postgres user if they manage to get access to a system that has access to the database. Assuming that the password is not stored on that system, using a non-standard username makes such attempts moot. Personally, I do not worry about this with PostgreSQL, and it has been many years since I worked a job where it was a concern.

(I am careful about usernames in publicly-accessible services, of course, and I always make sure to set PermitRootLogin no in OpenSSH servers.)

I am happy to review when I have time, but I cannot guarantee prompt availability.

Thanks! And no worries about availability. I'll keep the PR open waiting for your review, but don't hesitate to let me know if you change your mind, and if you do review, don't feel pressured to be thorough; before your arrival, PRs would have been merged without any review anyway.

One thing that I have been thinking about is use of the Codd library from other applications. For example, the application that I am working on has a development mode that automatically performs migrations (to a Gargoyle-managed local database) on start. Many assumptions can be made when the library is just used by the project executable (codd), while more care must be taken for a library that may be used in other contexts.

We currently use codd as a library in the shop I work at, but I'm embarrassed to admit I never really put much effort into making the codebase reasonable for that use-case. I'm not sure what you have in mind here. Can you elaborate a bit more? Feel free to create separate issues for this stuff too.

The current implementation of withEmptySearchPath is not exception-safe: the search path is not reset if f raises an exception. This is not an issue when the exception terminates the session and/or program, but perhaps it could cause problems if the exception is caught and the connection/session is used again. This potential issue can be resolved by using bracket

I think it might be a bit more complicated than that to make this safe against exceptions, since async exceptions make things more difficult and postgresql-simple does not handle them very well just yet. See haskellari/postgresql-simple#69. I'm also the author of a PR trying to improve that (and there's a lot of conversation in that PR from people more experienced than me suggesting this is hard), but I think even if that PR were merged, we'd still have to be very careful when handling exceptions here. If e.g. the code throws a SqlError, the transaction is no longer usable and we can only roll back anyway; but callers would need to know how the called code differentiates exceptions to know whether they're still inside a valid transaction or it has already been rolled back.

Granted, I'm guilty of writing queries in exception handlers carelessly in a few places in codd, and even postgresql-simple is, but maybe this is something to be considered with more care before we implement it?

Some people avoid default names for administrative users, just to provide one more hurdle for attackers.

Seems sensible. I think that can be replicated in development environments too, though. Still, I don't want to make assumptions about what setups out there in the real world might do, so it sounds more and more like codd should allow some custom post-processing of its JSON schema, or custom schema comparison knobs like the ones you've suggested before.

One thing that I have been thinking about is use of the Codd library from other applications.

I'm not sure what you have in mind here. Can you elaborate a bit more? Feel free to create separate issues for this stuff too.

I have not tried to integrate Codd yet, so I am afraid that I do not yet have anything concrete. Generally, I think it would be nice to have a library API that allows core tasks to be integrated into other software.

The project that I am working on currently uses a different migration system, and it automatically performs migrations on server start when in development mode. I currently have a proof-of-concept that uses Codd, and it simply runs a codd up process. I can work on integration once our codebase has been migrated to GHC 8.10.7. Integration should provide consistent logging, if nothing else.

At a previous job, I worked on "scripted" multi-step migrations. I have no plans to do that again any time soon, but it is another example of where the Codd library could be helpful. A Haskell program could coordinate migrations, software upgrades, tests, error monitoring, rollbacks, notifications, etc.

I think it might be a bit more complicated than that to make this safe against exceptions

Thank you very much for the links and explanation! This is very informative, and I now agree that your current implementation (without bracket) is preferred.

Thank you!