lithops-cloud/lithops

Periodic problems with a job that does not start

Opened this issue · 2 comments

Hello!

From time to time (10-15% of cases) I see a situation when EC2 starts in consume mode, job submitted on the work queue, but the worker does not see this job.
Here is an example of logs of this situation for master:

$ tail -f -n 100 /tmp/lithops-*/master-service.log
2024-10-31 11:04:42,186	[DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:42,974	[INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:42,975	[DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:42,976	[DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for sm-lithops-prod-032-new (172.31.40.50), instance ID: i-0dab07dbaf9476b7d
2024-10-31 11:04:42,976	[DEBUG] lithops.standalone.keeper:55 -- Delete sm-lithops-prod-032-new on dismantle: False
2024-10-31 11:04:42,976	[DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-10-31 11:04:42,976	[DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 30s, Hard Timeout: 7200s
2024-10-31 11:04:42,976	[DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7198 seconds
2024-10-31 11:04:42,976	[INFO] lithops.standalone.master:569 -- Starting job monitoring thread
2024-10-31 11:04:43,979	[DEBUG] lithops.standalone.master:592 -- ExecutorID: 8f3b98-11 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:43,979	[DEBUG] lithops.standalone.master:592 -- ExecutorID: 4538f6-6 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:44,635	[DEBUG] lithops.standalone.master:541 -- Received job 4538f6-11-M000
2024-10-31 11:04:44,636	[DEBUG] lithops.standalone.master:371 -- Going to setup 1 workers
2024-10-31 11:04:44,636	[DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:44,643	[DEBUG] lithops.standalone.master:522 -- Job 4538f6-11-M000 correctly submitted to work queue 'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1'
2024-10-31 11:04:45,077	[INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:45,078	[DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:45,079	[DEBUG] lithops.standalone.master:409 -- 1 of 1 workers started for work queue: wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1
2024-10-31 11:05:44,260	[DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7140 seconds

for worker:

$ tail -f /tmp/lithops-root/worker-service.log 
2024-10-31 11:04:44,095	[INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:04:44,096	[INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:04:44,096	[INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:04:44,096	[INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:04:44,097	[INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:04:45,097	[INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-10-31 11:04:45,097	[INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097	[INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098	[INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099	[DEBUG] lithops.standalone.worker:289 -- Worker service finished

information from Redis:

>>> import redis
>>> client = redis.Redis()
>>> client.keys()
[b'job:8f3b98-11-M000', b'tasksdone:8f3b98-11-M000', b'tasksdone:4538f6-6-M000', b'worker:sm-lithops-prod-032-new', b'job:4538f6-11-M000', b'job:4538f6-6-M000', b'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1']
>>> client.hget('job:4538f6-11-M000', 'status')
b'submitted'

If I restarted the worker sudo systemctl restart lithops-worker.service, I immediately see the beginning of processing the job:

$ tail -f /tmp/lithops-root/worker-service.log 
2024-10-31 11:04:45,097	[INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097	[INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098	[INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099	[DEBUG] lithops.standalone.worker:289 -- Worker service finished
2024-10-31 11:17:54,146	[INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:17:54,146	[INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:17:54,146	[INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:17:54,147	[INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:17:54,147	[DEBUG] lithops.standalone.worker:183 -- ExecutorID 4538f6-11 | JobID M000 - Running CallID 00000 in the local worker (consumer 0)
2024-10-31 11:17:54,147	[INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:19:03,524	[DEBUG] lithops.standalone.worker:213 -- ExecutorID 4538f6-11 | JobID M000 - CallID 00000 execution finished

I don't know how to reproduce this problem, but I see it from time to time, so I can add additional logging if needed.

Hi @sergii-mamedov, thanks for the detailed description of the issue. I reviewed the code and identified a potential problem. I’ll work on fixing the logic, test it, and then submit a PR.

@sergii-mamedov I added a fix for this