ActiveRecord Truncation does not work when using multiple schemas
Opened this issue · 17 comments
Hello - thanks for putting together database_cleaner
! I've been using the gem for a while and I know maintenance takes effort, so appreciate those who contribute to this project.
I'm using the following setup -
- Rails 5.1.0
- ActiveRecord 3.0.0
- Postgres 9.6
Like many multi-tenant applications, my app utilizes postgres schemas which store parrallel, independent copies of my database tables.
So if I have two tenants (earth
, mars
) and two tables (users
, companies
) I would have the following tables in my DB -
schema_migrations
ar_internal_metadata
public.users
public.companies
earth.users
earth.companies
mars.users
mars.companies
When using DatabaseCleaner.strategy = :truncation
, it looks for a list of tables in my DB to truncate (see here).
When I run that query against my local test database -
SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
tablename !~ '_prt_' AND
tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
AND schemaname = ANY (current_schemas(false))
;
?column?
-------------------------------------------
public.users
public.companies
(2 rows)
As you can see, it only returns the public schemas, because of the schemaname =
criteria. This effectively doesn't wipe my non-public schema tables, causing run-over between tests.
I understand that there are options to use a transaction instead, or even the only: [..]
parameter to list all my tables. But it seems like this should work as is without having to use one of those as a workaround (and if my DB is quite large, a whitelist can get messy and inefficient across thousands of tests).
Would it make sense to update the query logic to something like
SELECT schemaname || '.' || tablename
FROM pg_tables
WHERE
tablename !~ '_prt_' AND
tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
AND schemaname NOT IN ('pg_catalog', 'information_schema')
;
Thanks!
Related to: https://github.com/DatabaseCleaner/database_cleaner/issues/225
I've come across a similar thing. For now doing
def with_all_schemas
old = ActiveRecord::Base.connection.schema_search_path
begin
ActiveRecord::Base.connection.schema_search_path = [
your_list_of_schemas_here
].join(',')
yield
ensure
ActiveRecord::Base.connection.schema_search_path = old
end
end
and then wrapping calls to Databasecleaner.clean
with with_all_schemas
seems to do the trick
Honestly, this is pretty awful.
But the solution above works and can be improved by using
select schema_name from information_schema.schemata where schema_name not like 'information_schema' and schema_name not like 'pg_%'
to fetch all schemas in the active database for cleaning, which is the behavior I expected database_cleaner to exhibit.
For me as I am using sequel was to use search_path as described here https://github.com/jeremyevans/sequel/blob/master/doc/opening_databases.rdoc#postgres
My workaround was to just truncate the tables I need myself after each test example:
config.after(:each) do
DatabaseCleaner.clean
['schema.tablename'].each do |table|
ActiveRecord::Base.connection.execute("TRUNCATE #{table}")
end
end
Hi, so I'd love to help get this problem solved, but honestly I'm not familiar with with the concept of multiple schemas in PostgreSQL, so I'm going to need some help understanding. Can someone explain why an app might have more than one schema?
Also, about the current clause that is causing the issue: schemaname = ANY (current_schemas(false))
:
Looking in the git history, the commit that added this line has the following to say:
commit 77bca129d6031571b4da1edf63b9f5ae93fc6ee2
Author: Billy Watson <billy@korrelate.com>
Date: Tue Jul 22 12:31:03 2014 -0400
always return Postgres table names with schema to avoid awfulness
- without the schema, the migrations table gets truncated since the default AR method returns ‘public.schema_migrations’
- also a possibility is truncation of tables other than desired once since without the schema, Postgres has no way of knowing which table to truncate if two schemas have the same table
Can someone unpack this for me?
Finally, in the OP @abhchand has suggested the possible solution of replacing the line with schemaname NOT IN ('pg_catalog', 'information_schema')
. Can I get some opinions on this suggestion? Would it address the problem exhibited here, while still addressing the concerns that Billy Watson was trying to solve in his commit?
Hiya @botandrose . The reason we're using multiple schemata is to do close integration with a proprietary 3rd party application. There are some good reasons not to do it this way (it's a microservices sin and I feel a little bit ill about it) but amongst the advantages we find are...
- We get to have multiple services doing a lot of otherwise complex work with remote data but with the code simplicity as if it were local.
- We get a lot of the database efficiency being able to do SQL joins between the
public
data we fully own and the data we don't. - A lot of rails stuff by default is just unaware of the other schemata. Which is good and the right thing in most cases.
I think a general solution which would ensure truncate
strategy works might be to count records in all tables everywhere at the start of the whole test suite and truncate all that had zero records.
There are many reasons why some apps use several schemas, but for sure it is a thing.
What Billy Watson was trying to solve is totally fine, but there is one piece in his solution
that is not working as as he is expecting (current_schemas(false)
, even when the database has multiple schemas current_schemas(false) is only returning {public} because it depends what is set in the search_path.
select current_schemas(false);
current_schemas
-----------------
{public}
While if you execute the query suggested by @deathwish you will indeed get all your schemas (excluding the Postgres' ones of course)
SELECT schema_name FROM information_schema.schemata WHERE schema_name not like 'information_schema' AND schema_name not like 'pg_%';
schema_name
-------------
public
archive
So, combining the query above with the current code, it will solve the issue while still addressing the concern that Billy Watson was trying to solve since it now lists the tables in all schemas and still appending the schema name.
SELECT schemaname || '.' || tablename as table_name
FROM pg_tables
WHERE tablename !~ '_prt_'
AND tablename <> 'schema_migrations' AND tablename <> 'ar_internal_metadata'
AND schemaname = ANY (
SELECT schema_name
FROM information_schema.schemata
WHERE schema_name not like 'information_schema'
AND schema_name not like 'pg_%'
);
table_name
-----------------------------------------
public.users
public.questions
public.responses
public.versions
archive.logs
archive.versions
Or even simpler as @alejandro-falkowski-gelato suggested, setting up the search_path
in the database configuration will in fact lists all the desired schemas when running current_schemas(false)
test:
adapter: postgresql
host: localhost
database: db_test
...
schema_search_path: 'schema1,public'
So it will worth at least mention in the docs that you should set the search_path
if you expect DatabaseCleaner also truncate the tables in all schemas rather than only public
Thank you, @Adamantish, for the description of your use-case... I think I have a good handle on this now! And thank you, @erlinis, for your superb detective work and proposed solution, which appears to me to be solid.
Is there any reason why one might not want to clean all tables outside of the public schema? Let's say we were to fold this change into a hypothetical v1.8.5 bugfix release of database_cleaner
. Would this, in fact, be a breaking change for some users? In that case, would it be better to release it in the form of a new option to the truncation strategy, like all_schemas: true
etc?
Hi @botandrose,
A new option to the truncation strategy sounds great, but instead of all_schemas: true/false
,
I think would be better to have something like included_schemas: [ 'public', 'schema1']
and if nothing is specified ['public']
is the default value.
This way current behavior won't be affected and the developer will have more control over what clean.
@erlinis I like it! What do others in this thread think? Are you game for taking a stab at it in a PR?
Like the original submitter's per-tenant schema use case, I'd definitely prefer to be able to clean all schemata rather than saying which ones should be cleaned. That said, I agree that being able to specify which schemata are cleaned is also important.
We are in dynamic-language land here, so why not both? Something like schemata: :all
is not incompatible with schemata: ['public', 'schema1']
, and allows the later addition of additional functionality like schemata: {except: 'schema2'}
.
The only other note I have immediately is that this should interact correctly with the cache: true|false
option as documented, which also implies that it should be possible to have a non-fixed list of schemata to be cleaned.
@deathwish All great points! Like you, I think :except
can be punted on into its own subsequent PR, and agree that this functionality should provide an :all
option, for those who don't want to enumerate every schema. Also agree on respecting the :cache
option. Thank you for pointing these things out!
Is this issue still relevant with the latest version of this gem?
@jherdman - I am on the latest version (2.0.1) and can confirm it's still an issue. @botandrose, anything in the works?
This line seems to be the problem.
@only = all_tables.map { |table| table.split(".").last }
For me all_tables
contains the following two entries: public.items
and events.items
. So database cleaner will execute DELETE FROM items
twice.
Is there a reason to remove the schema? Seems like a simple bug fix, change this:
all_tables = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables
@only = all_tables.map { |table| table.split(".").last }
To this:
@only = cache_tables? ? connection.database_cleaner_table_cache : connection.database_tables