LemmyNet/lemmy

[Bug]: Couldn't run DB Migrations from 0.19.3-beta.3 to 0.19.4-beta.4

ticoombs opened this issue · 18 comments

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

Upgrading from -beta.3 to -beta.4 results in failed migrations:

lemmy-1  | Lemmy v0.19.4-beta.4
lemmy-1  | Error: LemmyError { message: Unknown("Couldn't run DB Migrations: Failed to run 2024-02-24-034523_replaceable-schema with: constraint \"person_aggregates_person_id_fkey\" of relation \"person_aggregates\" does not exist"), inner: Couldn't run DB Migrations: Failed to run 2024-02-24-034523_replaceable-schema with: constraint "person_aggregates_person_id_fkey" of relation "person_aggregates" does not exist, context: SpanTrace [] }

Steps to Reproduce

  1. 0.19.3 -> 0.19.4-beta.2
  2. 0.19.4-beta.2 -> 0.19.4-beta.3
  3. 0.19.4-beta.3 -> 0.19.4-beta.4 - Lemmy will not come online

Technical Details

Current Indexes:

indexes.reddthat.log

Version

0.19.4-beta.4

Lemmy Instance URL

No response

Related: #4333

This is a pretty strange one, because the migration ran fine on both my local machine and voyager.lemmy.ml (and CI would of course have failed otherwise).

Grepping for that relation:

migrations/2021-03-09-171136_split_user_table_2/down.sql
183:ALTER TABLE user_aggregates RENAME CONSTRAINT person_aggregates_person_id_fkey TO user_aggregates_user_id_fkey;                                        

migrations/2021-03-09-171136_split_user_table_2/up.sql
112:ALTER TABLE person_aggregates RENAME CONSTRAINT user_aggregates_user_id_fkey TO person_aggregates_person_id_fkey;                                                                                                                                                                                                 

migrations/2024-02-24-034523_replaceable-schema/down.sql
1052:    ALTER CONSTRAINT person_aggregates_person_id_fkey NOT DEFERRABLE;                                                                                 

migrations/2024-02-24-034523_replaceable-schema/up.sql                                                                                                     
17:    ALTER CONSTRAINT person_aggregates_person_id_fkey INITIALLY DEFERRED;

Type \d person_aggregates and show me the output there.

To clarify, \d person_aggregates is typed in psql
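
If you'd rather query it than eyeball \d output, the same information is in the catalog; a minimal sketch, run in the same psql session:

-- List every constraint currently attached to person_aggregates
SELECT conname, contype
FROM pg_constraint
WHERE conrelid = 'person_aggregates'::regclass;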

@ticoombs have you ever run diesel migration revert or diesel migration redo, even on very old migrations?

A buggy down.sql file is the only thing I know of that could cause this kind of inconsistency.
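
For anyone checking this themselves: Diesel tracks applied migrations in the __diesel_schema_migrations table, so listing it shows which versions the database believes have run. A minimal sketch in psql:

-- Show every migration Diesel has recorded as applied, oldest first
SELECT version, run_on FROM __diesel_schema_migrations ORDER BY run_on;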

Do you have any idea on how to fix it?

If the same migration is still failing, do the following to find out which changes to make manually. This must be done with version 0.19.4-beta.4 (the version with the failing migration).

  1. Run migrations if you reverted any previous migrations
  2. In terminal, run pg_dump --no-owner --no-privileges --no-table-access-method --schema-only --no-sync -f broken.sqldump
  3. Compare broken.sqldump with this: schema.zip (e.g. with diff, as sketched below)
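
For step 3, a plain diff is usually enough; a sketch, assuming the dump inside schema.zip has been extracted to reference.sqldump (a hypothetical filename):

# Show only the lines that differ between the broken schema and the reference
diff broken.sqldump reference.sqldump | less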

It also might be as simple as adding that constraint, but I'd be concerned that other areas of your DB have issues if some constraints are missing.
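
For reference, recreating the constraint would look like this (the same statement ticoombs later used to repair the table; assumes the standard public schema):

ALTER TABLE ONLY public.person_aggregates
    ADD CONSTRAINT person_aggregates_person_id_fkey
    FOREIGN KEY (person_id) REFERENCES public.person (id)
    ON UPDATE CASCADE ON DELETE CASCADE;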

I also have a similar problem when upgrading from 0.19.3 to 0.19.4, but in my case it's complaining that the function name "hot_rank" is not unique:

Error: LemmyError { message: Unknown("Couldn't run DB Migrations: Failed to run 2024-02-24-034523_replaceable-schema with: function name \"hot_rank\" is not unique"), inner: Couldn't run DB Migrations: Failed to run 2024-02-24-034523_replaceable-schema with: function name "hot_rank" is not unique, context: SpanTrace [] }

I never made any changes to my db, and did not run any of the diesel migration commands @dullbananas asked about.
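
A quick way to confirm whether duplicate overloads exist is to list them from the catalog; a minimal sketch in psql:

-- regprocedure renders each overload with its full argument list
SELECT p.oid::regprocedure AS signature
FROM pg_proc p
WHERE p.proname = 'hot_rank';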

I'm gonna test this with a prod DB to make sure.

edit: @kroese have you been running other unreleased lemmy versions, like this beta, in production? Because that could also cause some major DB issues.

@dessalines Yes, I always run the betas in production, until now it never caused any problems.

I also had no other choice, since the official 0.19 release contains a bug in the thumbnail-handling code that affected my instance. And since voyager.lemmy.ml is also doing it, it seemed not a big deal. But maybe the voyager instance resets its database to a clean state every time?

Those are the federation test instances; they can get wiped at any time. They usually keep a clean upgrade path, though, so we can see if there are any broken migrations. But yeah, we can't support running alpha / beta builds against a prod DB: you'll probably have broken migrations more often than not, and will have to fix them yourself by running the down.sql migrations manually.

I'll report back if I get any issues running this migration with the lemmy.ml prod DB locally.

Okay @kroese I tested with lemmy.ml's prod DB, and the migration did run fine, so somehow you ended up with two hot_rank functions.

Try doing

sudo docker compose exec -t postgres psql -Ulemmy
...
-- Hopefully this will drop all of them
drop function hot_rank cascade;
-- You might need to copy and paste this block, but probably not:
-- https://github.com/LemmyNet/lemmy/blob/main/migrations/2024-02-24-034523_replaceable-schema/down.sql#L1010

And try running the beta again.
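
To confirm the drop actually removed every overload before retrying, a quick check in psql (a count of 0 means they're all gone):

-- How many hot_rank overloads are still defined?
SELECT count(*) FROM pg_proc WHERE proname = 'hot_rank';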


@dessalines When I run that query:

drop function hot_rank cascade;

I get:

ERROR: function name "hot_rank" is not unique

So it's exactly the same error as during the migration.

See this: https://stackoverflow.com/a/7623246

It should generate the proper drop function lines.
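
Applied here, the approach from that answer boils down to generating one DROP statement per overload; a minimal sketch in psql:

-- Emit a ready-to-run DROP statement for every hot_rank overload
SELECT format('DROP FUNCTION %s CASCADE;', p.oid::regprocedure)
FROM pg_proc p
WHERE p.proname = 'hot_rank';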

Thanks, it turned out I had these two functions:

DROP FUNCTION hot_rank(numeric, timestamp without time zone) CASCADE;
DROP FUNCTION hot_rank(numeric, timestamp with time zone) CASCADE;

So one with a timestamp with timezone and one with a timestamp without timezone? Very strange!

I dropped them both, but then the migration complained:

could not find a function named "hot_rank"

So then I executed the block of code from down.sql, and the migration completed successfully.

So luckily it seems fixed now! But still curious where this timezone-problem could have originated.

Whew, glad that's fixed. I'm sure it was a case where a beta / alpha build of ours had the wrong function type, and we corrected it by overwriting an older migration. We try to avoid that because it breaks a clean upgrade path, but sometimes it's necessary.

I'll re-open if there are any other issues.

@dessalines Seems you helped another person, but in my case it was certainly a DB issue.

I managed to get a diesel container and found that:

  [X] 2024-02-12-211114_add_vote_display_mode_setting
  [X] 2024-02-15-171358_default_instance_sort_type
  [ ] 2024-02-24-034523_replaceable-schema
  [X] 2024-02-27-204628_add_post_alt_text
  [X] 2024-02-28-144211_hide_posts
  [X] 2024-03-06-104706_local_image_user_opt
  [X] 2024-03-06-201637_url_blocklist
  [X] 2024-04-05-153647_alter_vote_display_mode_defaults

It had deployed half of the migrations somehow... 😓

I had to remove all the rows from person_aggregates that caused a failure when I added:

alter table only public.person_aggregates add constraint person_aggregates_person_id_fkey foreign key (person_id) references public.person (id) on update cascade on delete cascade;
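
One way to find the offending rows first, rather than deleting blindly (a sketch, assuming the standard schema):

-- Rows in person_aggregates whose person_id no longer points at an existing person
SELECT pa.person_id
FROM person_aggregates pa
LEFT JOIN person p ON p.id = pa.person_id
WHERE p.id IS NULL;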

Because the local_image migration had happened too, I had to delete all rows where the local_user_id was null:

delete from local_image where local_user_id is null;
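
Before a destructive delete like that, it's worth counting what will be removed; a minimal sketch:

-- How many local_image rows would the delete remove?
SELECT count(*) FROM local_image WHERE local_user_id IS NULL;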

I then told diesel to revert the last 5 migrations (via a separate container):

diesel migration revert -n 5
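
For anyone reproducing this, a sketch of running diesel_cli from a throwaway container; the image, network settings, and credentials are all assumptions that will differ per setup, and the lemmy source must be mounted so the migrations' down.sql files are available:

# Hypothetical invocation: install diesel_cli in a stock Rust image and revert
docker run --rm --network=host \
    -e DATABASE_URL=postgres://lemmy:password@localhost:5432/lemmy \
    -v "$PWD/lemmy:/lemmy" -w /lemmy \
    rust:1.77 \
    sh -c 'cargo install diesel_cli --no-default-features --features postgres \
           && diesel migration revert -n 5'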

Then lemmy would start correctly, perform the needed migrations itself, and we are now on beta.4.

Side Note: For some reason lemmy would fail if I ran the migrations manually with diesel migration run. Technically it would load and work, but whenever I attempted to comment or post, it failed with a CannotResolvePost error. I had to let lemmy perform the migrations itself, otherwise the database would have ended up in a worse state.