Substra/substra-backend

Can't launch all celery workers/schedulers

Closed this issue · 15 comments

I'm trying to start all the workers/schedulers for the 2 orgs setup.
If I run the command:

DJANGO_SETTINGS_MODULE=backend.settings.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E -A backend worker -l info -B -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler

the logs get attached to my terminal and I can't run the commands to start the following instances. Thus I decided to pass the --detach argument, which works, but when trying to launch a new celery worker I get an error saying:

ERROR: Pidfile (celeryd.pid) already exists. Seems we're already running? (pid: 4815)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

I could run the command with the --pidfile= (with no path) flag, but is that the right way to go?
Thanks!

Hello @chrisalexandrepena,
Did you try with different terminal windows?

Do you get the same error?

I tried running them in separate screen sessions but I always get this error. What is strange is that the first time I entered all the lines specified in the repository README, without the --detach flag:

  • It displayed the logs of my first worker
  • I stopped the logs with a simple ctrl+C
  • I could see through celery flower that my worker was still up
  • If I powered off my running worker, all of a sudden celery launched the second one
  • If I powered off that new worker celery launched the 3rd one, and so on

I haven't been able to replicate that behaviour since.

Quick note: you may need to change DJANGO_SETTINGS_MODULE=backend.settings.dev to DJANGO_SETTINGS_MODULE=backend.settings.celery.dev, but this implies having a ledger running :)

So you still have the issue?
Are you on Linux, macOS, Windows, or something else?
Maybe --pidfile is the solution for your environment as described here: https://stackoverflow.com/questions/53521959/dockercelery-error-pidfile-celerybeat-pid-already-exists
But I don't think you need to use it at all.
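That said, if you do go the pidfile route, a safer variant than an empty --pidfile is to give each detached node its own file. A minimal sketch, assuming /tmp paths are acceptable in your setup (the start_worker helper is hypothetical, not part of the repo):

```shell
# Hypothetical helper: derive a distinct pidfile from each node's hostname,
# so detached workers never collide on the shared default ./celeryd.pid.
start_worker() {
  node="$1"; shift
  celery -E --detach -A backend worker -l info -B \
    --hostname "$node" --pidfile "/tmp/celery-${node}.pid" "$@"
}

# Usage (first owkin worker from the README), with the same env vars as before:
#   DJANGO_SETTINGS_MODULE=backend.settings.dev BACKEND_ORG=owkin \
#   BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 \
#   start_worker owkin.scheduler -Q owkin,scheduler,celery
```

Celery can also expand %n in --pidfile to the node name, if you'd rather not build the path yourself.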

My ledger is running so that shouldn't be a problem :)
I've just tried it, but DJANGO_SETTINGS_MODULE=backend.settings.celery.dev still gives me the error. I had already tried adding --pidfile="" before, and it does work; I just wasn't sure whether it would cause any problems with the rest of the app.

I'm running the system in a Linux VirtualBox VM (Ubuntu Server 18.04).

It should not cause any problems with the rest of the app, but I suggest you check all the celery processes running on your machine.
You may need to kill some: ps aux | grep celery
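For example, a hedged sketch (the pkill pattern may need adjusting for your machine):

```shell
# List any celery processes still alive; the [c] trick keeps the grep
# itself out of the results.
ps aux | grep '[c]elery' || true

# Stop leftovers from earlier --detach runs, then remove the stale
# pidfile that the next launch would otherwise trip over.
pkill -TERM -f 'bin/celery' || true
rm -f celeryd.pid
```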

If I kill all previous celery worker instances and then try to launch all the tasks in a row, using the --detach and --pidfile="" flags, I get a strange error:

Traceback (most recent call last):
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/kombu/utils/objects.py", line 42, in __get__
    return obj.__dict__[self.__name__]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/hfc/fabric/user.py", line 257, in _restore_state
    enrollment = state_dict['enrollment']
KeyError: 'enrollment'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/dev/substra-backend/.venv/bin/celery", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/__main__.py", line 16, in main
    _main()
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/celery.py", line 322, in main
    cmd.execute_from_commandline(argv)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/celery.py", line 496, in execute_from_commandline
    super(CeleryCommand, self).execute_from_commandline(argv)))
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/base.py", line 275, in execute_from_commandline
    return self.handle_argv(self.prog_name, argv[1:])
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/celery.py", line 488, in handle_argv
    return self.execute(command, argv)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/celery.py", line 420, in execute
    ).run_from_argv(self.prog_name, argv[1:], command=argv[0])
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/worker.py", line 221, in run_from_argv
    *self.parse_options(prog_name, argv, command))
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/base.py", line 398, in parse_options
    self.parser = self.create_parser(prog_name, command)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/base.py", line 414, in create_parser
    self.add_arguments(parser)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/bin/worker.py", line 277, in add_arguments
    default=conf.worker_state_db,
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/utils/collections.py", line 126, in __getattr__
    return self[k]
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/utils/collections.py", line 429, in __getitem__
    return getitem(k)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/utils/collections.py", line 278, in __getitem__
    return mapping[_key]
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/collections/__init__.py", line 987, in __getitem__
    if key in self.data:
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/kombu/utils/objects.py", line 44, in __get__
    value = obj.__dict__[self.__name__] = self.__get(obj)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/app/base.py", line 141, in data
    return self.callback()
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/app/base.py", line 924, in _finalize_pending_conf
    conf = self._conf = self._load_config()
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/app/base.py", line 934, in _load_config
    self.loader.config_from_object(self._config_source)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/loaders/base.py", line 131, in config_from_object
    self._conf = force_mapping(obj)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/celery/utils/collections.py", line 46, in force_mapping
    if isinstance(m, (LazyObject, LazySettings)):
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/django/utils/functional.py", line 213, in inner
    self._setup()
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/django/conf/__init__.py", line 44, in _setup
    self._wrapped = Settings(settings_module)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/django/conf/__init__.py", line 107, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ubuntu/dev/substra-backend/backend/backend/settings/celery/dev.py", line 1, in <module>
    from ..deps.ledger import *
  File "/home/ubuntu/dev/substra-backend/backend/backend/settings/deps/ledger.py", line 32, in <module>
    cert_path=LEDGER['client']['cert_path']
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/hfc/fabric/user.py", line 348, in create_user
    user = User(name, org, state_store)
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/hfc/fabric/user.py", line 58, in __init__
    self._restore_state()
  File "/home/ubuntu/dev/substra-backend/.venv/lib/python3.6/site-packages/hfc/fabric/user.py", line 272, in _restore_state
    raise IOError("Cannot deserialize the user", e)
OSError: [Errno Cannot deserialize the user] 'enrollment'

When I check after getting the error, a random number of the workers have been successfully spawned. If I launch it again, I get the same error but can see new workers spawning. And if I remove all active workers again and relaunch my command, I end up with the same error message and a different number of successfully spawned workers...
Using DJANGO_SETTINGS_MODULE=backend.settings.celery.dev doesn't affect the outcome... :(

What tasks are you trying to launch?
Where does this error come from? From a celery worker?

Is a docker instance running a fabric CA?
Can you show the result of:

$> docker ps -a

Thanks,

The command I'm trying to launch is:

DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler --pidfile="" && \
DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,owkin.worker,celery --hostname owkin.worker --pidfile=""  && \
DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=chu-nantes BACKEND_DEFAULT_PORT=8001 BACKEND_PEER_PORT_EXTERNAL=7051 celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,scheduler,celery --hostname chu-nantes.scheduler --pidfile="" && \
DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=chu-nantes BACKEND_DEFAULT_PORT=8001 BACKEND_PEER_PORT_EXTERNAL=7051 celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,chu-nantes.worker,celery --hostname chu-nantes.worker --pidfile="" && \
DJANGO_SETTINGS_MODULE=backend.settings.common celery --detach --pidfile="" -A backend beat -l info
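I wonder whether, since each command detaches immediately, the workers race while restoring the fabric user state. A sketch of a staggered launch I could try (launch_with_pause is a hypothetical helper, and the 5-second pause is a guess):

```shell
# Hypothetical helper: run one launch command, then pause so the detached
# worker can finish restoring its fabric user state before the next starts.
launch_with_pause() {
  "$@"
  sleep 5  # pause between launches; adjust as needed
}

# e.g. (same env vars as the first command above):
#   launch_with_pause celery -E --detach -A backend worker -l info -B \
#     -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler --pidfile=""
```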

Here are all my docker containers:
[Screenshot from 2019-12-10 16:22:53 showing the running docker containers]

I've just tested your commands and it seems to work:

$> DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler --pidfile="" && \
> DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,owkin.worker,celery --hostname owkin.worker --pidfile=""  && \
> DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=chu-nantes BACKEND_DEFAULT_PORT=8001 BACKEND_PEER_PORT_EXTERNAL=7051 celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,scheduler,celery --hostname chu-nantes.scheduler --pidfile="" && \
> DJANGO_SETTINGS_MODULE=backend.settings.celery.dev BACKEND_ORG=chu-nantes BACKEND_DEFAULT_PORT=8001 BACKEND_PEER_PORT_EXTERNAL=7051 celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,chu-nantes.worker,celery --hostname chu-nantes.worker --pidfile="" && \
> DJANGO_SETTINGS_MODULE=backend.settings.common celery --detach --pidfile="" -A backend beat -l info

$> ps aux | grep celery
guillau+ 30691  0.0  0.2 667912 70660 pts/10   Sl   16:55   0:00 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler --pidfile=
guillau+ 30698  101  0.3 629560 102384 ?       S    16:55   0:08 [celeryd: celery@owkin.scheduler:MainProcess] -active- (worker -l info -B -Q owkin,scheduler,celery -E -A backend --hostname=owkin.scheduler)
guillau+ 30807  0.0  0.2 667916 71052 pts/10   Sl   16:55   0:00 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n owkin -Q owkin,owkin.worker,celery --hostname owkin.worker --pidfile=
guillau+ 30813 97.1  0.1 147064 53300 ?        R    16:55   0:06 /home/guillaume/.venv/substrabac/bin/python -m celery worker -l info -B -Q owkin,owkin.worker,celery -E -A backend --hostname=owkin.worker
guillau+ 30847  0.0  0.2 667904 70904 pts/10   Sl   16:55   0:00 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,scheduler,celery --hostname chu-nantes.scheduler --pidfile=
guillau+ 30855  108  0.2 537088 68528 ?        R    16:55   0:05 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,scheduler,celery --hostname chu-nantes.scheduler --pidfile=
guillau+ 30890  0.0  0.2 667916 70952 pts/10   Sl   16:55   0:00 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,chu-nantes.worker,celery --hostname chu-nantes.worker --pidfile=
guillau+ 30896  100  0.2 537100 69128 ?        R    16:55   0:04 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery -E --detach -A backend worker -l info -B -n chunantes -Q chu-nantes,chu-nantes.worker,celery --hostname chu-nantes.worker --pidfile=
guillau+ 30928  0.0  0.1 617796 58628 pts/10   Sl   16:56   0:00 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery --detach --pidfile= -A backend beat -l info
guillau+ 30937  132  0.1 487144 55244 ?        R    16:56   0:02 /home/guillaume/.venv/substrabac/bin/python /home/guillaume/.venv/substrabac/bin/celery --detach --pidfile= -A backend beat -l info
guillau+ 30989  0.0  0.2 667928 71004 ?        Sl   16:56   0:00 /home/guillaume/.venv/substrabac/bin/python -m celery worker -l info -B -Q owkin,scheduler,celery -E -A backend --hostname=owkin.scheduler
guillau+ 31007  0.0  0.2 628484 87104 ?        R    16:56   0:00 [celeryd: celery@owkin.scheduler:MainProcess] -active- (worker -l info -B -Q owkin,scheduler,celery -E -A backend --hostname=owkin.scheduler)
guillau+ 31010  0.0  0.2 628740 87244 ?        S    16:56   0:00 [celeryd: celery@owkin.scheduler:ForkPoolWorker-2]
guillau+ 31014  0.0  0.0  14784  1000 pts/10   S+   16:56   0:00 grep --color=auto celery

All is running correctly.
I've also not been able to reproduce your error with the pidfile.

Maybe there is something wrong with your python virtualenv.

Can you give us the output of:
$> docker logs -f run-owkin

Of course:

external_orgs:  ['chu-nantes']
Sign update proposal on chu-nantes ...
Send update proposal with org: chu-nantes...
Wait For Peers to join channel
Join channel substrachannel with peers ['peer1-owkin', 'peer2-owkin'] ...
Peers ['peer1-owkin', 'peer2-owkin'] successfully joined channel substrachannel
Installing chaincode on ['peer1-owkin', 'peer2-owkin'] ...
Installing chaincode on ['peer1-chu-nantes', 'peer2-chu-nantes'] ...
policy:  OR('owkinMSP.member', 'chu-nantesMSP.member')
Upgraded chaincode with policy: {'identities': [{'role': {'name': 'member', 'mspId': 'owkinMSP'}}, {'role': {'name': 'member', 'mspId': 'chu-nantesMSP'}}], 'policy': {'1-of': [{'signed-by': 0}, {'signed-by': 1}]}} and result: "{'name': 'substracc', 'version': '2.0', 'escc': 'escc', 'vscc': 'vscc', 'policy': {'version': 0, 'rule': {'n_out_of': {'n': 1, 'rules': [{'signed_by': 0}, {'signed_by': 1}]}}, 'identities': [{'principal_classification': 'ROLE', 'principal': {'msp_identifier': 'owkinMSP', 'role': 'MEMBER'}}, {'principal_classification': 'ROLE', 'principal': {'msp_identifier': 'chu-nantesMSP', 'role': 'MEMBER'}}]}, 'data': {'hash': b'"\x0c% \xa2R\x81*\x89#\x18\x0fl\xaf8\xd9\x95{O\x8bN\xbc\xad\x15\xe8\x8b)\xebrz=8', 'metadatahash': b'\x0fz\x15\xa7\x95\x01*\xca\xe2\x88P \xae3\xf5\x07\xba\xee\xcd\xfb\xa85g\xa6\x7f\xafqW\x0f\x0f\xad\xc8'}, 'id': b'\xec\x06\xcf\xd7{\x16 \x13\x83H\x0b8\xe2J\xfa\xee\x91\\/\x1eLg\x85\x1d\xcdj\xdb_HO\x02\xf5', 'instantiation_policy': {'version': 0, 'rule': {'n_out_of': {'n': 1, 'rules': [{'signed_by': 0}, {'signed_by': 1}]}}, 'identities': [{'principal_classification': 'ROLE', 'principal': {'msp_identifier': 'chu-nantesMSP', 'role': 'ADMIN'}}, {'principal_classification': 'ROLE', 'principal': {'msp_identifier': 'owkinMSP', 'role': 'ADMIN'}}]}}"
Removing chaincode docker containers ...
1daa3b149306
570f179f8116
Try to query chaincode from peer ['peer1-owkin', 'peer2-owkin'] on org owkin
Queried chaincode, result: []
Congratulations! Ledger has been correctly initialized.

Everything is running great on the ledger side; I don't see how you could be getting these errors.
Did you try with a docker setup for substra-backend?

I hadn't tried the docker setup; it does indeed work very well, thanks :)

Can we close this issue then?

I guess you can :)

Thanks,