stanford-mast/pocket

Error Adding Routes / Connectivity Issues

Closed this issue · 5 comments

Hello.

When I execute the ./add_ip_routes.sh script, I get the following errors:

ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known
ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known

If I manually run the commands, replacing the hostname returned by kubectl get nodes --show-labels | grep metadata | awk '{print $1}' with the numerical IP address (e.g., ssh -t admin@XX.X.XXX.XX "sudo ip route add default via 10.1.0.1 dev eth1 tab 2"), the commands execute, but I get a new error.
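
As a sketch of that workaround (I actually substituted the IPs by hand; the kubectl jsonpath lookup below is my own addition, not part of add_ip_routes.sh):

    # Resolve each metadata node's external IP instead of its internal hostname,
    # since the *.compute.internal names only resolve inside the VPC.
    for node in $(kubectl get nodes --show-labels | grep metadata | awk '{print $1}'); do
      ip=$(kubectl get node "$node" -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
      ssh -t admin@"$ip" "sudo ip route add default via 10.1.0.1 dev eth1 tab 2"
    done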

For the first command "sudo ip route add default via 10.1.0.1 dev eth1 tab 2", I get the following error: RTNETLINK answers: Network is unreachable.
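
In case it helps narrow things down, these are the basic checks I can run on the metadata node (assuming 10.1.0.1 is meant to be the gateway reachable over eth1); I'm not sure yet what the expected output should look like:

    ip link show eth1        # is the second interface present and UP?
    ip addr show eth1        # does it have an address on the same subnet as 10.1.0.1?
    ip route get 10.1.0.1    # can the kernel find any route to the gateway?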

The Lambda functions are also not able to connect to the namenode server. When attempting to connect, I obtain the following:

START RequestId: ... Version: $LATEST
Attempting to connect...
Connecting to metadata server failed!
put buffer failed: tmp-0: Exception
Traceback (most recent call last):
  File "/var/task/latency.py", line 67, in lambda_handler
    pocket_write_buffer(p, jobid, iter, text, datasize)
  File "/var/task/latency.py", line 33, in pocket_write_buffer
    raise Exception("put buffer failed: "+ dst_filename)
Exception: put buffer failed: tmp-0

END RequestId: ...
REPORT RequestId: ...	Duration: 1.47 ms	Billed Duration: 100 ms	Memory Size: 3008 MB	Max Memory Used: 28 MB

I figure the two errors are related. I'm just not sure how to proceed. As far as I can tell, I've followed the setup instructions exactly as they're written. Do you have any idea what might be going wrong? Just pointing me in the direction of what to look at to address these issues would be helpful.
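
Since I can't easily get a shell inside the Lambda itself, my plan is to test reachability from an EC2 instance attached to the same private subnet the Lambdas use. NAMENODE_IP and PORT below are placeholders, not the actual Pocket values:

    # Quick TCP reachability check toward the metadata server from the Lambda's subnet.
    nc -vz -w 5 NAMENODE_IP PORT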

When I run the microbenchmark, I seem to be able to connect to the metadata server, but I still get the same put buffer error. This is while the controller.py script is running.

Edit: Resolved this issue by using bigger machines (it turns out all my storage nodes were stuck pending, which was why the Lambdas couldn't connect to them).


I came across the same issue. Did you solve this problem?

Can you first double check that the master and all nodes have status Ready = True in the output of the kops validate cluster command?
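
For example, something along these lines (point --state at your own kops state store):

    kops validate cluster --state=s3://<your-kops-state-store>
    kubectl get nodes    # every node should report STATUS Ready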

I have not seen this particular issue before, but the connectivity issue you are describing seems like a problem with the VPC setup. The VPC setup that Pocket uses is similar to the example in the Amazon documentation here. Please check that your route tables and VPC setup match the example. You can try manually creating the VPC, route tables, etc. based on this example. The reason for using a NAT in the setup is to enable lambdas to talk to both Pocket (which is running in a VPC) and public internet services (such as S3).

This page has some advice for testing a NAT, in case it is helpful.
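
If I remember the gist of that page correctly, the check is simply that an instance in the private subnet can still reach the public internet through the NAT, for example:

    # Run from an instance in the private subnet (which should egress via the NAT):
    ping -c 3 example.com                         # any public host that answers ICMP
    curl -sI https://www.example.com | head -n 1  # TCP alternative if ICMP is blocked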


I found that it is because the eth1 interface added on the metadata server is not up. I fixed it with the following steps:

  1. I edited all the security groups created by kops so that all types of inbound and outbound traffic are allowed.
  2. I sshed to the metadata server and ran sudo ifup eth1, and the problem was fixed (a quick way to verify is sketched below).
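
To double check, something like this should show the interface up and, after re-running the route command from add_ip_routes.sh, the entry in routing table 2:

    ip addr show eth1        # eth1 should now be UP with an address assigned
    ip route show table 2    # the default route via 10.1.0.1 should appear here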

Thanks for your kind reply!

@ukulililixl I am having a similar issue. What command did you use to ssh to the metadata server?
It seems that kops creates the nodes with a different key pair than the one I created, "kubernetes.pocketcluster.k8s.local-xxxxxxxx".