Periodic problems with a job that does not start
Opened this issue · 2 comments
Hello!
From time to time (10-15% of cases) I see a situation when EC2 starts in consume mode
, job submitted on the work queue, but the worker does not see this job.
Here is an example of logs of this situation for master:
$ tail -f -n 100 /tmp/lithops-*/master-service.log
2024-10-31 11:04:42,186 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:42,974 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:42,975 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:53 -- Starting BudgetKeeper for sm-lithops-prod-032-new (172.31.40.50), instance ID: i-0dab07dbaf9476b7d
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:55 -- Delete sm-lithops-prod-032-new on dismantle: False
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:72 -- BudgetKeeper started
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:75 -- Auto dismantle activated - Soft timeout: 30s, Hard Timeout: 7200s
2024-10-31 11:04:42,976 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7198 seconds
2024-10-31 11:04:42,976 [INFO] lithops.standalone.master:569 -- Starting job monitoring thread
2024-10-31 11:04:43,979 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 8f3b98-11 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:43,979 [DEBUG] lithops.standalone.master:592 -- ExecutorID: 4538f6-6 | JObID: M000 - Tasks done: 1/1 - Completed!
2024-10-31 11:04:44,635 [DEBUG] lithops.standalone.master:541 -- Received job 4538f6-11-M000
2024-10-31 11:04:44,636 [DEBUG] lithops.standalone.master:371 -- Going to setup 1 workers
2024-10-31 11:04:44,636 [DEBUG] lithops.standalone.backends.aws_ec2.aws_ec2:58 -- Creating AWS EC2 client
2024-10-31 11:04:44,643 [DEBUG] lithops.standalone.master:522 -- Job 4538f6-11-M000 correctly submitted to work queue 'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1'
2024-10-31 11:04:45,077 [INFO] lithops.standalone.backends.aws_ec2.aws_ec2:101 -- AWS EC2 client created - Region: eu-west-1
2024-10-31 11:04:45,078 [DEBUG] lithops.standalone.standalone:69 -- Standalone handler created successfully
2024-10-31 11:04:45,079 [DEBUG] lithops.standalone.master:409 -- 1 of 1 workers started for work queue: wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1
2024-10-31 11:05:44,260 [DEBUG] lithops.standalone.keeper:108 -- Time to dismantle: 7140 seconds
for worker:
$ tail -f /tmp/lithops-root/worker-service.log
2024-10-31 11:04:44,095 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:04:44,096 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:04:44,097 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 0 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099 [DEBUG] lithops.standalone.worker:289 -- Worker service finished
information from Redis:
>>> import redis
>>> client = redis.Redis()
>>> client.keys()
[b'job:8f3b98-11-M000', b'tasksdone:8f3b98-11-M000', b'tasksdone:4538f6-6-M000', b'worker:sm-lithops-prod-032-new', b'job:4538f6-11-M000', b'job:4538f6-6-M000', b'wq:localhost:metaspace2020-metaspace-aws-ec2:3.5.1']
>>> client.hget('job:4538f6-11-M000', 'status')
b'submitted'
If I restarted the worker sudo systemctl restart lithops-worker.service
, I immediately see the beginning of processing the job:
$ tail -f /tmp/lithops-root/worker-service.log
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 1 finished
2024-10-31 11:04:45,097 [INFO] lithops.standalone.worker:219 -- Redis consumer process 2 finished
2024-10-31 11:04:45,098 [INFO] lithops.standalone.worker:219 -- Redis consumer process 3 finished
2024-10-31 11:04:45,099 [DEBUG] lithops.standalone.worker:289 -- Worker service finished
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:263 -- Starting Worker - Instace type: r6a.xlarge - Runtime name: metaspace2020/metaspace-aws-ec2:3.5.1 - Worker processes: 4
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:154 -- Redis consumer process 0 started
2024-10-31 11:17:54,146 [INFO] lithops.standalone.worker:154 -- Redis consumer process 1 started
2024-10-31 11:17:54,147 [INFO] lithops.standalone.worker:154 -- Redis consumer process 2 started
2024-10-31 11:17:54,147 [DEBUG] lithops.standalone.worker:183 -- ExecutorID 4538f6-11 | JobID M000 - Running CallID 00000 in the local worker (consumer 0)
2024-10-31 11:17:54,147 [INFO] lithops.standalone.worker:154 -- Redis consumer process 3 started
2024-10-31 11:19:03,524 [DEBUG] lithops.standalone.worker:213 -- ExecutorID 4538f6-11 | JobID M000 - CallID 00000 execution finished
I don't know how to reproduce this problem, but I see it from time to time, so I can add additional logging if needed.
Hi @sergii-mamedov, thanks for the detailed description of the issue. I reviewed the code and identified a potential problem. I’ll work on fixing the logic, test it, and then submit a PR.
@sergii-mamedov I added a fix for this