nchammas/flintrock

SPARK_PUBLIC_DNS is incorrectly set when launching into a private VPC

maxpoulain opened this issue · 8 comments

Hi,

We are having issues with the Spark UI when using Flintrock.
For some more context on our use of Flintrock:

  • Flintrock version: 2.0.0
  • Python version: 3.8.9
  • OS: ubuntu:20.04
  • We are using Flintrock in a private VPC, with the authorize-access-from option set in the config file.
  • We download the Spark and HDFS packages from our own S3 bucket.

When we go to the Spark UI on port 8080, the page displays correctly, but the links to the other pages are broken.
Here is an extract of the page's HTML for a link to a worker:

<a href="http://<?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot;?>
<!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot;
		 &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;>
<html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; xml:lang=&quot;en&quot; lang=&quot;en&quot;>
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>:8081">
              worker-20211015082848-<MASKED_IP>-42451
            </a>

I get a similar error message when I launch spark-shell, for example:

Spark context Web UI available at http://<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
		 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>:4040

I have the impression that the error comes from a problem resolving the IP address, or something related. Maybe we made a mistake in our configuration, or maybe it's not related to Flintrock at all.

So, if you have any clues or pointers that could help us solve this problem, that would be a great help for us.

Thank you in advance for your help,

Maxime

That looks pretty weird. So instead of the link pointing to an IP address or host name, it literally points to a block of HTML?

Can you share your Flintrock config? Do you see the same behavior if you launch a cluster on a public VPC?

Yes, it's pretty weird, and yes, the link points to a block of HTML instead of an IP address or host name.

Here is our Flintrock config:

services:
  spark:
    version: 2.2.0
    download-source: s3://our-bucket/spark-related-packages/
  hdfs:
    version: 2.7.3
    download-source: s3://our-bucket/spark-related-packages/

provider: ec2

providers:
  ec2:
    key-name: key
    identity-file: key.pem
    instance-type: m5.2xlarge
    region: eu-west-1
    availability-zone: eu-west-1c
    ami: our-custom-ami # Based on Amazon Linux 2 AMI
    user: ec2-user
    spot-price: 0.4
    vpc-id: our-vpc-id
    subnet-id: our-subnet-id
    instance-profile-name: our-role
    tags:
        - TEAM,DATA
    min-root-ebs-size-gb: 120
    tenancy: default 
    ebs-optimized: no
    instance-initiated-shutdown-behavior: terminate
    authorize-access-from:
      - X.X.X.X/8
      - Y.Y.Y.Y/8

launch:
  num-slaves: 3
  install-hdfs: True
  install-spark: True
  java-version: 8

I just tried to launch a cluster into a public VPC and it works well, without any errors! So the problem seems to be related to the private VPC.

Is it just the UI that's broken? I would expect something to be wrong with the cluster too.

Can you post the full contents of the files under spark/conf on the cluster master (in the case where the UI is broken)?

I think I just found the problem, inside spark/conf/spark-env.sh!
There is a curl call there that sets SPARK_PUBLIC_DNS, but it returns:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
		 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

Here is the spark/conf/spark-env.sh file:

#!/usr/bin/env bash

export SPARK_LOCAL_DIRS="/media/root/spark"

# Standalone cluster options
export SPARK_EXECUTOR_INSTANCES="1"
export SPARK_EXECUTOR_CORES="$(($(nproc) / 1))"
export SPARK_WORKER_CORES="$(nproc)"

export SPARK_MASTER_HOST="<masked_master_hostname>"

# TODO: Make this dependent on HDFS install.
export HADOOP_CONF_DIR="$HOME/hadoop/conf"

# TODO: Make this non-EC2-specific.
# Bind Spark's web UIs to this machine's public EC2 hostname
export SPARK_PUBLIC_DNS="$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)"

# TODO: Set a high ulimit for large shuffles
# Need to find a way to do this, since "sudo ulimit..." doesn't fly.
# Probably need to edit some Linux config file.
# ulimit -n 1000000

# Should this be made part of a Python service somehow?
export PYSPARK_PYTHON="python3"

It seems that http://169.254.169.254/latest/meta-data/public-hostname is not available, right? Because when I run curl http://169.254.169.254/latest/meta-data/ I get:

ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
events/
hostname
iam/
identity-credentials/
instance-action
instance-id
instance-life-cycle
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-keys/
reservation-id
security-groups

There is no public-hostname entry!
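
For anyone hitting the same thing: since the instance has no public DNS name, the metadata endpoint returns a 404, and because curl --silent still prints the error body, the command substitution in spark-env.sh captures the whole HTML page. A quick way to confirm the status code (a diagnostic sketch to run on the instance, not output from our actual cluster):

# Print only the HTTP status code for the public-hostname endpoint.
# On an instance without a public DNS name this prints 404.
curl --silent --output /dev/null --write-out '%{http_code}\n' \
    http://169.254.169.254/latest/meta-data/public-hostname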

OK, it sounds like we need to understand how to set SPARK_PUBLIC_DNS when launching into a private VPC. Do things work if it's just left unset?

I just tried launching a new cluster into the private VPC after commenting out the SPARK_PUBLIC_DNS line, like this:

# export SPARK_PUBLIC_DNS="$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)"

And it seems to work perfectly! There are no errors, and the previous problem with the UI links is gone!

OK great. Maybe we don't need this config at all anymore, or maybe we only need it when launching into a public VPC.
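
One option might be to only export the variable when the metadata endpoint actually returns a hostname. A minimal sketch of what that could look like in spark-env.sh (untested; the public_hostname variable name is just illustrative, and it relies on curl --fail exiting non-zero, with no output, on HTTP errors):

# Only export SPARK_PUBLIC_DNS if the instance has a public hostname.
# `--fail` makes curl exit non-zero on a 404 instead of printing the error page.
if public_hostname="$(curl --fail --silent http://169.254.169.254/latest/meta-data/public-hostname)"; then
    export SPARK_PUBLIC_DNS="$public_hostname"
fi

That way the setting would keep working in a public VPC while being harmlessly skipped in a private one.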