propublica/django-collaborative

Memory leak?

tommeagher opened this issue · 17 comments

Every day, Heroku throws hundreds of Error R14 (Memory quota exceeded) errors, within an hour or so of rebooting. I'm not sure how to start to diagnose this, but wanted to file an issue in case other users are seeing this or if you have advice to resolve it.

Hey there,

For starters, I'd need to know a little bit about your deployment and also what kinds of sources you have loaded and how many.

Do you have the background updater running?

Hi. Thanks for the quick response, @brandonrobertz. We have the basic setup on Heroku. I deployed it in maybe October and haven't touched the backend since. I don't have the background updater running. I must have missed that in the docs at the time. We have six data sources right now with between 25 and 500 responses.

So do you have the latest code? I made some memory improvements in late February. If you don't, I suggest updating, because I've made lots of bug fixes since Oct 2019.

I'm guessing not. I didn't see any official releases, so I didn't think there was a new version yet. I can pull from master and try to set up the background updater with the Heroku scheduler. Will let you know if that resolves our issues.

The Collaborate background updater is implemented as a Heroku custom clock command. By default, it gets set up for you when you use the Heroku deploy button, but I don't know if it was working/set up in Oct, though. Check to see if it's already there before you do a bunch of work setting one up yourself.

Here's some basic info on Heroku clocks, JIC: https://devcenter.heroku.com/articles/scheduled-jobs-custom-clock-processes
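
For reference, the clock is just another process type declared in the Procfile next to the web dyno. An illustrative sketch (the web command and module names here are guesses; your actual Procfile may differ):

web: gunicorn collaborative.wsgi --log-file -
clock: python clock.py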

I'd suggest making a backup of your DB before updating so you can easily roll back if need be.

Let me know how it goes.

Ahhh, ok. So it looks like I do have a clock process in the Procfile after all.
Do you have documentation on the best way to upgrade? When I used the Heroku button to set it up originally, I don't think it created a local copy of the repo.
So I could clone it down, pull from your repo, merge the updates and push it back to Heroku and run python manage.py migrate in the console. Or is there a better way to do this?
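
Roughly, I was picturing something like this (the app name is a placeholder, and I'd take the DB backup you suggested first):

# grab the Heroku-deployed copy of the app
heroku git:clone -a my-collaborate-app
cd my-collaborate-app

# pull in the latest changes from this repo
git remote add upstream https://github.com/propublica/django-collaborative.git
git fetch upstream
git merge upstream/master

# push back to Heroku and run migrations
git push heroku master
heroku run python manage.py migrate -a my-collaborate-app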

There's an option to rebuild from latest master in the Heroku application management UI. I believe it's the last settings screen, but I haven't done this in a while.

@brandonrobertz I'm not seeing that option anywhere in the Heroku app dashboard. Will try it on the CLI and report back.

So I managed to figure out how to bring my deployed instance up to match master here. About two hours after deployment, I'm getting the same error messages.

heroku/clock.1: Process running mem=557M(108.8%)
heroku/clock.1: Error R14 (Memory quota exceeded)

It looks like the clock.py process is driving memory usage up and up.

[Screenshot, 2020-07-20: Heroku metrics showing the clock dyno's memory climbing past the quota]

I need more details about your data sources. What type are they: Screendoor, Google with shared auth, Google Sheet via service account, CSV, etc.? I'd also check your clock logs to see if it's outputting errors leading up to the memory quota issue. Unless you can attach a memory profiler to your instance, I'm going to have to replicate this locally to understand the issue, so more details are better.

@brandonrobertz I'm happy to send over lots of details. Can I email you directly?

Yeah you can email the logs directly. brandon at bxroberts dot org. Let's try and keep as much discussion here as possible, though.

Cool. I just emailed you some logs to take a look at, @brandonrobertz. As I said, we have 6 callouts right now with 25-500 records. They're all coming from Typeform, which we pump into private Google Sheets and then connect to Collaborate from there.

I was able to reproduce the issue and I pushed up a new version, v1.1.0.

The problem, as I understand it, lies in the apscheduler library's use of a processing pool to execute clock commands. The pool leaves initialized processes around, which eat up memory. The 1.1.0 fix limits the total number of these workers to 1, lowering total memory usage and keeping it constant.
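
In rough terms, the change amounts to something like this (a simplified sketch, not the literal clock.py code):

# simplified sketch: cap apscheduler's worker pool at a single worker so
# idle pool processes don't pile up; the real clock.py may wire this up differently
from apscheduler.executors.pool import ProcessPoolExecutor
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler(
    executors={"default": ProcessPoolExecutor(max_workers=1)},
)

# hypothetical job name; in Collaborate this would be the data source refresh
@scheduler.scheduled_job("interval", minutes=15)
def refresh_data_sources():
    pass

scheduler.start()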

Let me know if the latest version fixes this problem for you or causes any issues.

Brandon, amazing! Thanks so much for putting this together so quickly. It's only been a few hours, but so far the memory issues seem to be resolved.
[Screenshot, 2020-07-21: Heroku metrics showing memory usage holding steady after the v1.1.0 update]

Now there is a different error popping up. It seems every time clock.py runs, this group of errors appears:

2020-07-21T17:24:45.350717+00:00 app[clock.1]: Success!
2020-07-21T17:24:56.836144+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 12 messages since 2020-07-21T17:23:56.946367+00:00.
2020-07-21T17:25:04.713330+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 11 messages since 2020-07-21T17:24:04.817594+00:00.
2020-07-21T17:25:14.031779+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 12 messages since 2020-07-21T17:24:14.136013+00:00.
2020-07-21T17:25:16.126704+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 19 messages since 2020-07-21T17:24:16.229637+00:00.
2020-07-21T17:25:56.836207+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:24:56.937582+00:00.
2020-07-21T17:26:04.713431+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:25:04.818142+00:00.
2020-07-21T17:26:14.031881+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:25:14.136505+00:00.
2020-07-21T17:26:14.717907+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:26:04.819265+00:00.
2020-07-21T17:26:16.126805+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:25:16.232164+00:00.
2020-07-21T17:26:16.840387+00:00 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.db4d25b8-fd4c-4f59-8637-20be633602fb' dropped 1 messages since 2020-07-21T17:25:56.940648+00:00.

There only seem to be about 10 of them at a time, every 15 minutes. I'm not sure how much of a concern this should be, but wanted to see what you thought.

That's just Heroku telling you the log volume exceeded the drain's buffer, so some messages were dropped: https://devcenter.heroku.com/articles/error-codes#l10-drain-buffer-overflow

You could change the log level to something less verbose than DEBUG, or just ignore it; either way you won't see those dropped log lines.
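
If you'd rather quiet it down, bumping the log level in your Django settings is enough. A rough sketch (your settings module may already define LOGGING differently):

# illustrative Django LOGGING config: raise the console level to INFO so
# DEBUG chatter never reaches Heroku's logplex drain
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "root": {
        "handlers": ["console"],
        "level": "INFO",
    },
}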

I'm closing this since the memory issue seems to be fixed!

Feel free to open a different issue if you notice anything else.