cds-snc/notification-planning

Jobs are marked as complete too early [bug-bogue]

Describe the bug

Jobs are being marked as complete before the notifications for that job are actually added to the database.

By the time we get to L191 of notifications_dao.py, the job is already marked as finished in the job table in the db.

This is a bug, since no notifications have been added to the db at this point. Because the job has already been marked as finished, the query in check_job_status() will return no results, and process_incomplete_job() will not be run on this job.

This bug is caused by process_job and process_incomplete_job in tasks.py. There we have the following flow:

  1. Break the rows for the job into chunks and call process_rows on each chunk
  2. Mark the job as complete with job_complete

This probably worked correctly when process_rows was synchronous. Now, however, process_rows is asynchronous because it calls save_emails.apply_async or save_smss.apply_async, so those tasks can still be in progress when we mark the job as complete.
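A minimal sketch of the problematic ordering, using hypothetical helper names and a placeholder chunk size rather than the actual code in tasks.py:

```python
# Sketch of the current flow (hypothetical names, simplified from tasks.py).
from celery import shared_task


@shared_task
def save_emails(rows, job_id):
    # The real task writes notification rows to the db, possibly seconds later.
    ...


def process_rows(rows, job_id):
    # Async dispatch: this returns immediately; the db writes happen later.
    save_emails.apply_async(args=(rows, job_id))


def job_complete(job_id):
    # The real code sets job_status to 'finished' in the job table.
    ...


def process_job(job_id, all_rows, chunk_size=5000):
    for i in range(0, len(all_rows), chunk_size):
        process_rows(all_rows[i:i + chunk_size], job_id)
    # BUG: nothing above waits for the dispatched save_emails tasks, so the job
    # is marked finished before any notifications exist in the db.
    job_complete(job_id)
```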

Bug Severity

SEV-2 Major

To Reproduce

  1. Check out main
  2. Prepare a bulk send via the API, giving the job a name like "my test job"
  3. Set a breakpoint on L191 of notifications_dao.py
  4. Start the celery debugger
  5. Send the bulk job
  6. When the breakpoint is hit, go into the database and find the job with the name "my test job"
  7. observe that the job is already marked as complete, even though no notifications from the job have been added to the db yet

Expected behaviour

The job should not be marked as complete until all notifications from the job have been added to the db.

Impact

This prevents us from fixing the "duplicate notifications from long running jobs" bug: https://app.zenhub.com/workspaces/notify-planning-614b3ad91bc2030015ed22f5/issues/gh/cds-snc/notification-planning/79

Additional context

I found this while testing a fix for the duplicate notifications bug.

QA Steps

  1. Log into staging
  2. Open Blazer and connect to the staging db
  3. Query for jobs associated with your service id
  4. Send a bulk job of roughly 10 emails, via either CSV upload or the API
  5. Re-run your jobs SQL query and observe that a new row has appeared for the job you just sent
  6. Observe that the job_status column is set to either pending or in progress
  7. Wait 1 minute
  8. Re-run the SQL query and observe that job_status now has a value of finished

Celery provides callbacks for async task calls that we may be able to leverage to call job_complete only after the task successfully completes.
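For instance, apply_async accepts a link= signature that Celery runs only after the task succeeds. A hedged sketch with hypothetical task names (not code from the repo); a per-chunk callback like this would still need some way to tell when the whole job is done:

```python
# Sketch: Celery's `link` callback fires only after the linked task succeeds.
from celery import shared_task


@shared_task
def save_emails(rows, job_id):
    ...


@shared_task
def record_chunk_saved(result, job_id):
    # Receives save_emails' return value; could decrement a per-job counter and
    # call job_complete when it reaches zero (counter logic not shown).
    ...


def process_rows(rows, job_id):
    save_emails.apply_async(
        args=(rows, job_id),
        link=record_chunk_saved.s(job_id),  # runs after save_emails completes successfully
    )
```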

Implemented an initial potential solution leveraging Celery Chains.
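Roughly, a chain runs the next signature only after the previous task succeeds. A sketch of the idea with hypothetical task names (the actual PR may look different):

```python
# Sketch: chain save -> mark complete so the status update cannot run first.
from celery import chain, shared_task


@shared_task
def save_emails(rows, job_id):
    ...


@shared_task
def mark_job_complete(job_id):
    ...


def process_rows(rows, job_id):
    # .si() makes the second signature immutable, so it ignores save_emails' result.
    # With multiple chunks, only the final chunk's chain should mark the job complete
    # (or the callback needs to check that all chunks have been saved).
    chain(
        save_emails.si(rows, job_id),
        mark_job_complete.si(job_id),
    ).apply_async()
```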

I'm helping out with this too. We tried another potential solution using Celery Groups, but it turns out we can't go this route unless we set up a Result Backend.
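For context, the group-based idea amounts to a chord (a group plus a callback), and Celery chords need a result backend to know when every task in the group has finished. A sketch with hypothetical names:

```python
# Sketch: a chord marks the job complete once every chunk task has finished,
# but chords require a configured result backend, which we don't have yet.
from celery import chord, group, shared_task


@shared_task
def save_emails(rows, job_id):
    ...


@shared_task
def mark_job_complete(results, job_id):
    # Receives the list of results from the group once all tasks are done.
    ...


def process_job(job_id, chunks):
    header = group(save_emails.si(chunk, job_id) for chunk in chunks)
    chord(header)(mark_job_complete.s(job_id))  # errors at runtime without a result backend
```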

We may want to do that eventually, but we should probably use a simpler solution for this bug. We can set up a recurring task to update the job status. Will and I can work on a PR for that today.
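Something along these lines, scheduled with celery beat; the helper names and the count comparison here are hypothetical, not the eventual PR:

```python
# Sketch: a recurring task that finishes a job only once the expected number of
# notifications has actually been saved (hypothetical dao helpers).
from celery import shared_task


def dao_get_in_progress_jobs():
    # Hypothetical: jobs with job_status in ('pending', 'in progress').
    raise NotImplementedError


def dao_count_notifications_for_job(job_id):
    # Hypothetical: count of notification rows saved for this job.
    raise NotImplementedError


def dao_update_job_status(job_id, status):
    # Hypothetical: sets job_status on the job row.
    raise NotImplementedError


@shared_task
def update_in_progress_job_status():
    # Intended to run on a short celery beat schedule (e.g. every minute).
    for job in dao_get_in_progress_jobs():
        if dao_count_notifications_for_job(job.id) >= job.notification_count:
            dao_update_job_status(job.id, "finished")
```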

I created a card to investigate whether it is viable for us to leverage a Result Backend with Celery.

PR is up, with comments from Jimmy; Will will review.

Reviewed by @whabanks, merged and deployed to Staging, ready for QA.

Will to give it a once-over today

QA'd on staging, looks good!