Add documentation describing the data migration process we should follow
sarayourfriend opened this issue · 0 comments
Problem
As part of the ECS-ification of the Django service, we are moving towards automated handling of database migrations. Under this new approach, migrations will be applied automatically when a new version of the Django application is deployed. To avoid deployments that take hours, we must not use Django database migrations for data migrations; that is, we cannot rely on migration-time SQL to transform the data in the database. Besides potentially running for hours (depending on their contents), such migrations also put additional load on the database, which we want to avoid.
Description
In lieu of using Django migrations to transform the data in the database, we will follow a data migration strategy that relies on Django management commands to programmatically transform the data. This has several benefits (some repeated from above):
- Can be throttled to prevent overwhelming database load;
- Encourages zero-downtime deployment planning;
- The transformation can be unit tested, which makes it easier to cover data edge cases that are easy to forget about (and even harder to handle) in a regular SQL data migration;
- Prevents deployments from ever going longer than a few minutes because we avoid all long-running migrations.
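To make the throttling and testability points concrete, here is a minimal sketch of the batching loop such a management command could delegate to. It is not an agreed-upon implementation: `migrate_in_batches`, its parameters, and the callback names are all hypothetical, and the loop is kept free of Django imports so that it can be unit tested on its own.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def migrate_in_batches(
    fetch_batch: Callable[[int], List[T]],
    transform: Callable[[T], T],
    save_batch: Callable[[List[T]], None],
    batch_size: int = 1000,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Transform rows batch by batch, pausing after each batch.

    Hypothetical helper: fetch_batch(n) must return up to n rows that
    still need migrating, and an empty list once nothing is left.
    Returns the total number of rows processed.
    """
    total = 0
    while True:
        batch = fetch_batch(batch_size)
        if not batch:
            return total
        # Transform in Python rather than in SQL, so edge cases can be
        # handled (and tested) as ordinary code.
        save_batch([transform(row) for row in batch])
        total += len(batch)
        # Throttle: give the database room to breathe between batches.
        sleep(pause_seconds)
```

Inside a real command's `handle` method, `fetch_batch` would presumably slice a queryset filtered to not-yet-migrated rows and `save_batch` could call `bulk_update`; those choices are assumptions here. Passing `sleep` as a parameter lets tests run without actually waiting.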
We need to document this process in the Sphinx documentation and spread the word about it to Openverse contributors. If possible, it would also be nice to add a linting check that verifies we are not introducing migrations that include data transformations.
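One possible shape for such a linting check, sketched under assumptions: it parses a migration file's source with the standard-library `ast` module and flags calls to `RunPython` or `RunSQL`, the two Django migration operations that typically carry data transformations. The function name `find_data_operations` is hypothetical, and the check is only a heuristic, since `RunSQL` can also be used for legitimate schema changes.

```python
import ast
from typing import List

# Django migration operations that usually indicate a data transformation.
DATA_OPERATIONS = {"RunPython", "RunSQL"}


def find_data_operations(source: str) -> List[str]:
    """Return one message per RunPython/RunSQL call found in the source.

    Hypothetical helper: works on the text of a migration file and never
    imports or executes it.
    """
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `migrations.RunPython(...)` and a bare `RunPython(...)`.
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", None)
        if name in DATA_OPERATIONS:
            found.append(f"line {node.lineno}: {name}")
    return found
```

A CI step could run this over every new file in the app's `migrations` directories and fail when the returned list is non-empty, with an opt-out mechanism for the rare cases where `RunSQL` is needed for schema work.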
Alternatives
Additional context
Please refer to https://github.com/WordPress/openverse-infrastructure/issues/176 for the original discussion motivating this change. The repository is private and if you do not have access but would like to see the issue, ping a core contributor, and they can share the discussion with you.
Implementation
- 🙋 I would be interested in implementing this feature.