emrspecialistsamer/aws-glue-workshop

SageMaker notebook fails to connect to Glue dev endpoint

Opened this issue · 4 comments

Using the workshop CloudFormation template, I successfully created the environment, but when I open the SageMaker Jupyter notebook and run the first step, I get:
The code failed because of a fatal error: Error sending http request and maximum retry encountered..
Some things to try:
a. Make sure Spark has enough available resources for Jupyter to create a Spark context.
b. Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c. Restart the kernel.

I checked the security group and the role, restarted the kernel, and even created a new workshop environment in another region, all with the same result, so I'm not sure why others are not seeing this.
Finally, for comparison, I created another SageMaker notebook for the same endpoint, this time from the Glue dev endpoints console. That notebook is created with a lifecycle startup script that connects it to the Glue dev endpoint. I added the same lifecycle startup script to my workshop notebook, restarted it, and now it works. Does the CloudFormation template need to add a similar lifecycle startup script?

I had exactly the same issue when I tried to run this workshop, and I couldn't continue.

unirt commented

+1
I ran python3 /home/ec2-user/glue/dev_endpoint_connection_checker.py from a terminal in Jupyter, and the output was "Livy connection failed, sleeping for 5 seconds.", but I didn't find the root cause.
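For anyone who wants to narrow this down further, a plain TCP check against the Livy port can at least separate "nothing is listening or routable" from a problem higher up in sparkmagic. This is only a minimal sketch and not the workshop's checker script: the function name `livy_reachable` is hypothetical, and port 8998 is assumed because it is Livy's default, which may not match how the endpoint is exposed to your notebook.

```python
import socket


def livy_reachable(host: str, port: int = 8998, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Assumption: Livy listens on its default port 8998. Pass a different
    port if your setup forwards it elsewhere.
    """
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

Calling `livy_reachable("<dev endpoint private IP>")` from the notebook terminal and getting `False` would point at networking or the endpoint address rather than at the Jupyter kernel.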

Yes, the problem is that the lifecycle configuration set up by the workshop does not reliably connect to the Glue dev endpoint. I compared the workshop lifecycle configuration with the "aws-glue-test-LLConfig" that is created when you create a SageMaker notebook from Glue. Both check permissions and schedule a cron job to reconnect, but the latter also checks the connection before deactivating miniconda. I'll test adding that check to the workshop lifecycle configuration and open a PR if it works reliably.
In the meantime, you can work around it by stopping your notebook, reconfiguring it to use the same "aws-glue-test-LLConfig", and restarting it.
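If you would rather script that workaround than click through the console, something along these lines should do it. This is a sketch under assumptions: the function name `swap_lifecycle_config` is hypothetical, the notebook is assumed to be running when you start, and the `sagemaker` argument is expected to be a boto3 SageMaker client (for example `boto3.client("sagemaker")`), which is passed in so the snippet itself has no AWS dependency.

```python
def swap_lifecycle_config(sagemaker, notebook_name: str, lifecycle_config_name: str) -> None:
    """Stop a notebook instance, attach another lifecycle config, and restart it.

    Assumptions: `sagemaker` is a boto3 SageMaker client, and the instance
    is currently InService (stopping an already stopped instance fails).
    """
    target = {"NotebookInstanceName": notebook_name}

    # The lifecycle configuration can only be changed while the instance is stopped.
    sagemaker.stop_notebook_instance(**target)
    sagemaker.get_waiter("notebook_instance_stopped").wait(**target)

    sagemaker.update_notebook_instance(
        NotebookInstanceName=notebook_name,
        LifecycleConfigName=lifecycle_config_name,
    )
    # The update is asynchronous; wait until the instance reports Stopped again.
    sagemaker.get_waiter("notebook_instance_stopped").wait(**target)

    # Starting the instance runs the newly attached lifecycle script.
    sagemaker.start_notebook_instance(**target)
    sagemaker.get_waiter("notebook_instance_in_service").wait(**target)


# Example call (needs boto3 and AWS credentials; the notebook name is a placeholder):
#   import boto3
#   swap_lifecycle_config(boto3.client("sagemaker"),
#                         "my-workshop-notebook", "aws-glue-test-LLConfig")
```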

The problem is that sparkmagic was not connected to the dev endpoint. You need to update the ~/.sparkmagic/config.json file with the private IP of the dev endpoint.
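A rough sketch of that edit in Python, in case it helps. The helper names `point_sparkmagic_at` and `rewrite_config` are hypothetical; the sketch assumes the Livy URL lives under the `kernel_python_credentials` and `kernel_scala_credentials` entries of config.json (as in sparkmagic's example config) and that Livy answers on its default port 8998 at the endpoint's private IP. Check your own config.json before relying on either assumption.

```python
import copy
import json
from pathlib import Path

# Assumption: these are the sections of config.json that hold the Livy URL.
CREDENTIAL_KEYS = ("kernel_python_credentials", "kernel_scala_credentials")


def point_sparkmagic_at(config: dict, private_ip: str, port: int = 8998) -> dict:
    """Return a copy of `config` whose kernel URLs point at private_ip:port."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    url = f"http://{private_ip}:{port}"
    for key in CREDENTIAL_KEYS:
        # Create the section if it is missing; keep any other fields in it.
        updated.setdefault(key, {})["url"] = url
    return updated


def rewrite_config(path: Path, private_ip: str, port: int = 8998) -> None:
    """Read the sparkmagic config at `path`, update the URLs, and write it back."""
    config = json.loads(path.read_text())
    path.write_text(json.dumps(point_sparkmagic_at(config, private_ip, port), indent=2))


# Example call (the IP address is a placeholder):
#   rewrite_config(Path.home() / ".sparkmagic" / "config.json", "10.0.0.12")
```

Restart the notebook kernel afterwards so sparkmagic rereads the file.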