lookit/lookit-api

Video deletion backlog/maintenance

becky-gilbert opened this issue · 0 comments

Summary

Due to problems with our system for video deletion in S3 (#1423, #1430), we have a backlog of videos in S3 that are not in our DB and need to be deleted. We may also want to consider adding a task to check the S3 videos against those in our DB, so that any lingering S3 videos that should be deleted are cleaned up as part of regular maintenance.

Description

We recently found a problem with our S3 video deletion process, and as a result we will need to address the backlog of video files (~300) in S3 that should've been deleted. We can do this by:

  1. getting the file names from the "Video.DoesNotExist" Sentry error that is generated when a file could not be deleted, and/or
  2. comparing the video file names from S3 with those in our DB and removing any from S3 that do not exist in our DB.
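
Option 2 above could be sketched roughly as follows. This is only an illustration: `find_orphaned_keys` is a hypothetical helper, and it assumes the S3 object keys and the DB video file names are directly comparable strings, which would need to be confirmed against how we actually name the files.

```python
from typing import Iterable, List


def find_orphaned_keys(s3_keys: Iterable[str], db_keys: Iterable[str]) -> List[str]:
    """Return the S3 keys that have no matching video record in the DB.

    Hypothetical helper: both inputs are assumed to be plain file names
    in the same format.
    """
    known = set(db_keys)  # set membership keeps each lookup O(1)
    return sorted(key for key in s3_keys if key not in known)


# Usage with made-up file names.
orphans = find_orphaned_keys(
    ["a.mp4", "b.mp4", "c.mp4"],  # names listed from S3
    ["b.mp4"],                    # names found in the DB
)
```

The resulting list could be reviewed by a dev before anything is deleted.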

One question is whether to do this "manually" (i.e., triggered/monitored by a dev, though it could be partially automated with a script that generates a list of files and then deletes them via the AWS CLI), or via a fully-automated Celery task. If we were to do this via a Celery task, we would need to put some safeguards in place to ensure that we never accidentally delete videos (e.g. if there were a database connection problem).
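
One possible safeguard, sketched under assumptions: before deleting anything, check that the numbers look plausible and abort otherwise. The function name and the 5% threshold are hypothetical and would need tuning against our real data.

```python
def deletion_is_plausible(
    orphan_count: int,
    s3_count: int,
    db_count: int,
    max_orphan_fraction: float = 0.05,  # assumed threshold, not a measured value
) -> bool:
    """Return True only if the orphan count looks like a genuine backlog.

    An empty DB result or a very large share of "orphans" more likely means
    a failed or partial DB query than a real set of files to delete.
    """
    if db_count == 0 or s3_count == 0:
        return False
    return orphan_count / s3_count <= max_orphan_fraction
```

A task could call this after building its list and skip the deletion step (and log an error) whenever it returns False.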

Proposal

I suggest we make this a fully-automated Celery task that does the following for all video storage buckets:

  • Get list of all video file names from bucket (perhaps filter by date created, and only grab those older than e.g. 1 year)
  • Get list of all videos that are already queued for cloud deletion (as part of our 7-day soft deletion, e.g. for deleted preview data)
  • For each S3 video, if it does not exist in the database and is not already queued for deletion, then delete it immediately.
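
The three steps above could look roughly like this. It is a sketch rather than the actual task: `sweep_bucket` and its parameters are hypothetical, and the S3 listing, DB lookup, and delete call are passed in as plain callables so the logic can be tested without AWS or Django. S3 listings report a last-modified timestamp, so that is used here as a stand-in for the creation date.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Set, Tuple


def sweep_bucket(
    objects: Iterable[Tuple[str, datetime]],  # (key, last_modified) pairs from S3
    exists_in_db: Callable[[str], bool],      # hypothetical per-file DB lookup
    queued_for_deletion: Set[str],            # keys already in the soft-delete queue
    delete: Callable[[str], None],            # hypothetical S3 delete call
    now: datetime,
    min_age: timedelta = timedelta(days=365),  # assumed "older than 1 year" filter
) -> List[str]:
    """Delete old S3 videos that are neither in the DB nor queued for deletion."""
    cutoff = now - min_age
    deleted = []
    for key, last_modified in objects:  # one object at a time, so memory stays low
        if last_modified > cutoff:
            continue  # too recent, leave it alone
        if key in queued_for_deletion or exists_in_db(key):
            continue  # still tracked somewhere
        delete(key)
        deleted.append(key)
    return deleted


# Usage with made-up data.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
old, recent = now - timedelta(days=400), now - timedelta(days=10)
removed: List[str] = []
result = sweep_bucket(
    [("old_orphan.mp4", old), ("new_orphan.mp4", recent),
     ("old_known.mp4", old), ("old_queued.mp4", old)],
    exists_in_db=lambda key: key == "old_known.mp4",
    queued_for_deletion={"old_queued.mp4"},
    delete=removed.append,
    now=now,
)
```

Returning the list of deleted keys would make it easy to log what each run removed.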

Implementation notes:

  • We could get the full sets of all S3 videos and all DB videos and compare them in one go, but that would be memory-intensive. Hence the much slower but memory-light approach of just getting the S3 video list, iterating through it, and checking each video against the DB.
  • We could queue each of these videos for soft deletion, by adding it to the delete_video_from_cloud queue, but there's probably no point in doing that. These cases differ from those where videos are deleted via user actions (deleting preview responses, checking the 'withdraw' box in the exit survey), in which case it is possible for users to realize that they made an error and get in touch with us about trying to recover data.