parcelvoy/platform

Memory and Redis issues on worker instance

Opened this issue · 9 comments

Hey there! I'm experiencing two weird errors and I don't know what else to do anymore.

The first one looks like some sort of memory leak:
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory


I already tried increasing the instance's memory (it's at 4GB now) and changing Node's max old space size via the NODE_OPTIONS="--max-old-space-size=3072" env var, but it's still happening. Not sure what else I could do to mitigate this.

The second one is related to Redis: I often receive ReplyError: READONLY You can't write against a read only replica. My Redis instance is separate from the worker, but I'm not sure what's causing this. This one happens frequently.


Any thoughts on how to handle these?

I just ran these commands on Redis; hopefully this will prevent the second error mentioned:

SLAVEOF NO ONE
REPLICAOF NO ONE
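Worth noting: REPLICAOF NO ONE only changes the running state, so if something keeps demoting the instance (a leftover replicaof directive in the config, or the container restarting with old settings), the error will come back. As a defensive client-side option, assuming the worker connects through ioredis (common for Node queue workers; this is a sketch, not Parcelvoy's actual configuration), ioredis's documented reconnectOnError hook can recover from READONLY replies:

```javascript
// Sketch: ioredis connection options that reconnect when a node answers
// with a READONLY error (i.e. the client is still pointed at a replica).
const redisOptions = {
  reconnectOnError(err) {
    // Returning 2 tells ioredis to reconnect AND resend the failed command;
    // returning false leaves all other errors untouched.
    return err.message.includes('READONLY') ? 2 : false;
  },
};

// Usage (assumes ioredis is installed):
// const Redis = require('ioredis');
// const redis = new Redis(process.env.REDIS_URL, redisOptions);
```

This only masks the symptom, though; the root cause is whatever is putting the instance back into replica mode.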

I have not seen either of those before! How do you have Redis set up? Are you in cluster mode? Using ElastiCache or running your own instance?

@pushchris I'm running on AWS ECS Fargate; I have one task with the worker and a separate task with Redis.
The "slave" error went away with the commands I mentioned earlier. However, the first issue is the real problem here: I often get this heap memory error, and it causes deadlock issues on databases, etc.

Update: the Redis issue was not related to the memory issue.

I just updated my deploy to an EC2 "monolith" (ui / api / worker / redis) with AWS RDS.

It's running more smoothly than on ECS; however, I just experienced the same error:

Screenshot 2024-09-04 at 13 01 02

@pushchris We have like 10 projects and each of them has a lot of lists. I've observed that all lists get updated at the same time once I open the project in the UI (is this the normal behavior?). Also, every time I get this heap memory error, I have like 2-4 campaigns running, sending 1k-2k emails each.

Another bit of info that might be useful: not sure if this is OK, but when I check the worker logs, I get a huge number of queue:job:started entries, really a lot. The screenshot is just illustrative because it prints far more than this.

Screenshot 2024-09-04 at 13 09 06

Is this normal? Or is something wrong here?

By the way, I just checked this answer on Stack Overflow:
https://stackoverflow.com/questions/55613789/how-to-fix-fatal-error-ineffective-mark-compacts-near-heap-limit-allocation-fa

I just updated the env var for the worker like this: NODE_OPTIONS=--max_old_space_size=7580

Let's see if this helps.

Well, it keeps happening; however, I'm not experiencing deadlock issues or stuck campaigns anymore. I guess those were happening only on ECS Fargate.

Screenshot 2024-09-05 at 13 32 47

It's throwing errors, but at least it's working. Not sure what else I could do to fix this.

Unfortunately, heap errors like that are really hard to debug without actually taking a memory dump. Typically they are caused by some sort of memory leak that builds up over time.

To address your other question, what do you have your log level set to? All of that being printed happens at the info level; if you drop down to warn you'll get less console logging, which should help somewhat with memory (though I don't expect a substantial difference) and mostly just help with speed.

I would be very curious to know which job specifically is causing that memory issue, though, and how large the data being passed around is. Do you have a way of monitoring memory on those instances over time? If it's a big spike that causes the crash, that would be different from a slow, gradual increase, and it would be helpful to know.
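A minimal way to get that signal without extra tooling is to log heap usage on an interval; a sketch using Node's built-in process.memoryUsage() (the function name here is hypothetical, not part of Parcelvoy):

```javascript
// Log memory periodically so you can tell whether the crash follows a slow
// climb (classic leak) or a sudden spike (e.g. one huge job payload).
function sampleMemory() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  const mb = (n) => Math.round(n / (1024 * 1024));
  console.log(`[mem] rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB heapTotal=${mb(heapTotal)}MB`);
  return { rssMb: mb(rss), heapUsedMb: mb(heapUsed) };
}

sampleMemory();
// In a long-running worker you would sample on an interval, e.g.:
// setInterval(sampleMemory, 30_000).unref();
```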

Yeah, I was at the debug level to actually see what's going on. I set it back to the error level, but it's still happening.

I'll drop the screenshot here; it often causes a deadlock error on the journey_process job.

Same errors, different days (actually it's happening every day). However, it seems to be working just fine, aside from the errors.

Screenshot 2024-09-06 at 12 45 32
Screenshot 2024-09-09 at 13 00 05

More info: the worker instance's memory keeps increasing, which is why things start to fail at a certain point:

Screenshot 2024-09-12 at 14 06 41

It starts at around 50MB... 70MB, and then keeps increasing... 300MB... 700MB... 1GB... until it hits the maximum available.

I just changed docker compose to cap the worker at 2GB of memory, and I'm spawning 5 workers using docker compose up worker --scale worker=5 -d. Hopefully that will contain the problem.
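One thing worth double-checking with that setup: the earlier --max_old_space_size=7580 tells V8 it may grow the heap to roughly 7.5 GB before forcing aggressive garbage collection, so inside a 2 GB container the kernel will OOM-kill the process before V8 ever feels memory pressure. Keeping V8's cap below the container limit avoids that. A hypothetical compose fragment (service name and numbers are illustrative, not Parcelvoy's shipped config):

```yaml
services:
  worker:
    deploy:
      resources:
        limits:
          memory: 2g
    environment:
      # Keep V8's heap cap comfortably below the 2 GB container limit so V8
      # runs full GCs instead of the kernel OOM-killing the container.
      NODE_OPTIONS: "--max-old-space-size=1536"
```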

However, I don't know exactly where this memory leak is happening.