Function executor container crashes
kevinric opened this issue · 7 comments
I am trying to run Cloudburst in cluster mode on AWS following the Getting Started Guide (in mesh networking mode without a domain), but one function container seems to be caught in a crash-loop.
The logs of the function-1 container show that the address is already in use:
Copying flow.egg-info to /usr/local/lib/python3.6/dist-packages/flow-0.1.0-py3.6.egg-info
running install_scripts
Traceback (most recent call last):
File "cloudburst/server/executor/server.py", line 497, in <module>
int(exec_conf['thread_id']))
File "cloudburst/server/executor/server.py", line 59, in executor
pin_socket.bind(sutils.BIND_ADDR_TEMPLATE % (sutils.PIN_PORT + thread_id))
File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
I can run functions on the remaining function executors without any problem, but when I try to register a DAG (the one from the example), the scheduler container crashes (see the error below). This seems to happen because there are no candidates, which could be caused by the function container that never started.
Traceback (most recent call last):
File "cloudburst/server/scheduler/server.py", line 346, in <module>
scheduler(conf['ip'], conf['mgmt_ip'], sched_conf['routing_address'])
File "cloudburst/server/scheduler/server.py", line 181, in scheduler
call_frequency)
File "/hydro/cloudburst/cloudburst/server/scheduler/create.py", line 86, in create_dag
success = policy.pin_function(dag.name, fref, colocated)
File "/hydro/cloudburst/cloudburst/server/scheduler/policy/default_policy.py", line 249, in pin_function
node, tid = sys_random.sample(candidates, 1)[0]
File "/usr/lib/python3.6/random.py", line 320, in sample
raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative
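For what it's worth, the failing call reproduces in isolation. This is just standard library behavior, not Cloudburst's scheduler code; the empty candidates list is my stand-in for whatever the policy sees when no executor threads are available:

import random

sys_random = random.SystemRandom()
candidates = []  # no pinnable (node, thread) pairs available

# Asking sample() for one item from an empty population raises the same
# ValueError that appears in the scheduler traceback above.
node, tid = sys_random.sample(candidates, 1)[0]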
What am I doing wrong?
I changed the environment variable value in function-ds.yml (the kops cluster config)
env:
[...]
- name: THREAD_ID
value: "0"
to the arbitrary value 10:
env:
[...]
- name: THREAD_ID
value: "10"
The container no longer crashes and I can register DAGs again.
Is this a general conflict that should be fixed in the default configuration, or does it only occur on my setup?
(I also noticed that I opened this issue in the wrong repo; it should have been opened in the cluster repo. Sorry for that!)
@kevinric Hello! I suspect your second error is a consequence of the first: because a function pod is failing, the scheduler does not have enough available function executors to schedule the DAG onto.
As for your first issue, we didn't run into this problem in our experiments, and I just spun up a cluster myself but couldn't reproduce your crash loop. Judging from the error message alone, it looks like another socket is already listening on the same port. When you changed the thread ID, the error went away because the bind port is derived from the thread ID, so the conflict disappeared.
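To make the port arithmetic concrete, here is a minimal pyzmq sketch of the conflict. The template and base port are illustrative stand-ins (the real constants live in Cloudburst's sutils module), and the socket type is arbitrary:

import zmq

# Illustrative stand-ins for how the executor derives its bind address.
BIND_ADDR_TEMPLATE = 'tcp://*:%d'
PIN_PORT = 4000
thread_id = 0

ctx = zmq.Context()
first = ctx.socket(zmq.PULL)
first.bind(BIND_ADDR_TEMPLATE % (PIN_PORT + thread_id))  # succeeds if the port is free

second = ctx.socket(zmq.PULL)
# If anything else already owns the port, this bind fails with
# zmq.error.ZMQError: Address already in use -- the crash in your logs.
second.bind(BIND_ADDR_TEMPLATE % (PIN_PORT + thread_id))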
Is there a follow-up here on improved error handling and messaging? If we can get a test case around this, along with a clearer error, that would be a positive outcome.
Thanks for your replies! It's strange that this only happens to me, since I use the standard scripts to create the nodes, but as long as the workaround works I'm happy :).
Since the error only occurs in cluster mode and seems to be caused by my configuration, I'm not sure how to create a useful test case. Local mode works without problems. Do you have any idea how such a test could be written?
Hi @kevinric -- could you SSH into the instance (you can find its public IP on your EC2 console) and use netstat, lsof, or some similar tool to see what process is listening on port 4000? That would help us figure out why this is happening. Thanks!
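If neither tool is handy on the node, a quick Python check along these lines would also do. It relies on psutil, which is an assumption on my part (it is not part of Cloudburst), and needs root to see other users' processes:

import psutil

# Rough equivalent of `lsof -i :4000`: list processes with a socket
# listening on port 4000 (requires `pip install psutil`).
for conn in psutil.net_connections(kind='tcp'):
    if conn.status == psutil.CONN_LISTEN and conn.laddr and conn.laddr.port == 4000:
        name = psutil.Process(conn.pid).name() if conn.pid else '<unknown>'
        print(conn.laddr, conn.pid, name)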
Hi @vsreekanti! Thanks! protokube is listening on port 4000
tcp6 0 0 :::4000 :::* LISTEN 3086/protokube
I used Kubernetes without a domain by setting HYDRO_CLUSTER_NAME to [clustername].k8s.local. The networking of Kubernetes in this no-domain mode seems to cause the port conflict; using a domain for the cluster fixed it.
Makes sense. Would you be up for adding a note about this to the cluster creation script docs in the cluster repo? Thanks again for sticking with this!