nchammas/flintrock

i4i instance type cluster fails to restart

a-cesari opened this issue · 8 comments

Hi,
I'm having issues when stopping and restarting the cluster.
Stopping works fine (i.e. flintrock stop my-cluster).
However, when I try to start it again (flintrock start my-cluster), the instances fail 1 of the 2 sanity checks, cannot be reached even with console SSH login, and the cluster won't start.
I'm guessing it is related to the ephemeral storage because (as you can see from the system log below) the instance goes into a "recovery mode" after an error about an ext4 filesystem not being found:

Mounting /media/ephemeral0...

[    4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem

Do you have any idea what could be causing this?
Thanks for your kind help.
Andrea

Here is a more complete log file. After it you can also find my Flintrock config.

         Starting Apply Kernel Variables...
[  OK  ] Started Apply Kernel Variables.
[  OK  ] Created slice system-ec2net\x2difup.slice.
         Starting Relabel kernel modules early in the boot, if needed...
[  OK  ] Started Relabel kernel modules early in the boot, if needed.
[  OK  ] Found device Elastic Network Adapter (ENA).
[  OK  ] Started Monitoring of LVM2 mirrors,...ng dmeventd or progress polling.
[  OK  ] Reached target Local File Systems (Pre).
         Mounting /media/ephemeral0...
[    4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem
[FAILED] Failed to mount /media/ephemeral0.
See 'systemctl status media-ephemeral0.mount' for details.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Migrate local... structure to the new structure.
[DEPEND] Dependency failed for Relabel all filesystems, if necessary.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
         Starting Preprocess NFS configuration...
[  OK  ] Reached target Timers.
[  OK  ] Reached target Network (Pre).
[  OK  ] Reached target Cloud-init target.
[  OK  ] Reached target Network.
         Starting Initial cloud-init job (metadata service crawler)...
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Paths.
[  OK  ] Reached target Sockets.
         Starting Create Volatile Files and Directories...
         Starting Tell Plymouth To Write Out Runtime Data...
[  OK  ] Started Emergency Shell.
[  OK  ] Reached target Emergency Mode.
[  OK  ] Started Preprocess NFS configuration.
[  OK  ] Started Create Volatile Files and Directories.
         Starting RPC bind service...
         Mounting RPC Pipe File System...
         Starting Security Auditing Service...
[    5.025955] RPC: Registered named UNIX socket transport module.
[    5.025956] RPC: Registered udp transport module.
[    5.025957] RPC: Registered tcp transport module.
[    5.025957] RPC: Registered tcp NFSv4.1 backchannel transport module.
[  OK  ] Started RPC bind service.
[  OK  ] Mounted RPC Pipe File System.
[  OK  ] Started Security Auditing Service.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Reached target rpc_pipefs.target.
[  OK  ] Reached target NFS client services.
[  OK  ] Reached target Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems.
[  OK  ] Started Update UTMP about System Boot/Shutdown.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.
[  OK  ] Started Tell Plymouth To Write Out Runtime Data.
[  OK  ] Started udev Wait for Complete Device Initialization.
         Starting Activation of DM RAID sets...
[    5.305390] device-mapper: uevent: version 1.0.3
[    5.309580] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[  OK  ] Started Activation of DM RAID sets.
[  OK  ] Reached target Local Encrypted Volumes.
[    4.977500] cloud-init[2346]: Cloud-init v. 19.3-46.amzn2.0.1 running 'init' at Thu, 02 May 2024 18:43:55 +0000. Up 4.95 seconds.
[    4.993484] cloud-init[2346]: ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
[    4.997062] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[    4.997895] cloud-init[2346]: ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
[    4.997985] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[    4.999620] cloud-init[2346]: ci-info: |  eth0  | False |     .     |     .     |   .   | (masked by me) |
[    5.013097] cloud-init[2346]: ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
[    5.016240] cloud-init[2346]: ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
[    5.017904] cloud-init[2346]: ci-info: +--------+-------+-----------+-----------+-------+-------------------+
[    5.018004] cloud-init[2346]: ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
[    5.021742] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[    5.021831] cloud-init[2346]: ci-info: | Route | Destination | Gateway | Interface | Flags |
[    5.023449] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[    5.044822] cloud-init[2346]: ci-info: +-------+-------------+---------+-----------+-------+
[  OK  ] Started Initial cloud-init job (metadata service crawler).
[  OK  ] Reached target Cloud-config availability.
[  OK  ] Reached target Network is Online.
         Starting Notify NFS peers of a restart...
[  OK  ] Started Notify NFS peers of a restart.
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or ^D to
try again to boot into default mode.
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.
Press Enter to continue.

services:
  spark:
    version: 3.5.1
    download-source: "s3://xxxx/flintrock/spark/spark-{v}/"
    # executor-instances: 1
  hdfs:
    version: 3.3.6
    download-source: "s3://xxxx/flintrock/hadoop/hadoop-{v}/"
provider: ec2

providers:
  ec2:
    key-name: xxx
    identity-file: /home/xxx/spark/xxx.pem
    instance-type: i4i.xlarge
    #instance-type: m5d.large
    region: eu-central-1
    # availability-zone: <name>
    ami: ami-0a946522147cbcbcc  # Amazon Linux 2, us-east-1
    user: ec2-user
    # spot-price: <price>
    vpc-id: *masked*
    subnet-id: *masked*
    # placement-group: <name>
    security-groups:
     - sg_xxx
    #   - group-name2
    instance-profile-name: role_xx
    tags:
      - owner,spark_cluster
    #   - key2, value2  # leading/trailing spaces are trimmed
    #   - key3,  # value will be empty
    # min-root-ebs-size-gb: <size-gb>
    tenancy: default  # default | dedicated
    ebs-optimized: no  # yes | no
      #min-root-ebs-size-gb: 120
    instance-initiated-shutdown-behavior: terminate  # terminate | stop
    user-data: /home/ec2-user/spark/user-data.sh
    # authorize-access-from:
    #   - 10.0.0.42/32
    #   - sg-xyz4654564xyz

launch:
  num-slaves: 1
  install-hdfs: True
  install-spark: True
  # java-version: 8

debug: true
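
As a rough aid for reading boot logs like the one above, the sketch below pulls out the mount points that systemd reports as failed. It assumes the console output has already been saved to a string (for example, copied from the EC2 system log). The function name `failed_mounts` and both regular expressions are illustrative assumptions, not part of Flintrock.

```python
import re
from typing import List

# Colour codes in pasted console logs: the ESC byte may survive, may have been
# replaced by U+FFFD during copy/paste, or may be missing entirely.
ANSI_RE = re.compile(r"[\x1b\ufffd]?\[[0-9;]*m")
# systemd prints "Failed to mount <path>." when a mount unit fails.
FAILED_MOUNT_RE = re.compile(r"Failed to mount (\S+?)\.?$")


def failed_mounts(console_log: str) -> List[str]:
    """Return the mount points reported as failed, in log order."""
    mounts = []
    for raw_line in console_log.splitlines():
        # Drop colour codes first so the status tag reads as plain "[FAILED]".
        line = ANSI_RE.sub("", raw_line).strip()
        match = FAILED_MOUNT_RE.search(line)
        if match:
            mounts.append(match.group(1))
    return mounts


if __name__ == "__main__":
    sample = "\n".join([
        "         Mounting /media/ephemeral0...",
        "[    4.953751] EXT4-fs (nvme1n1): VFS: Can't find ext4 filesystem",
        "[\x1b[1;31mFAILED\x1b[0m] Failed to mount /media/ephemeral0.",
    ])
    print(failed_mounts(sample))
```

A failed mount followed by "Dependency failed for Local File Systems" is the pattern that precedes emergency mode in this log.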

What is ami-0a946522147cbcbcc? Is it one of the default Amazon Linux AMIs provided by Amazon? If not, could you try one of those, please?

Hi @nchammas, yes, it's an official Amazon Linux 2 image.

If you have a known working combination of instance type and AMI, I can try it to check whether the problem is related to the AMI or to the instance type.

> Hi @nchammas, yes, it's an official Amazon Linux 2 image.

Can you show me where exactly you are seeing that? I am not able to find mention of this AMI in the official listing from Amazon.

I just tried to launch, stop, and then start a cluster using ami-0588935a949f9ff17 and it worked fine for me.

I can only use AMIs in eu-central-1, and I can't find the one you mention in that region.
I have now tried with this one (it was probably updated in the last few days), but I still get the same problem:

image

I'm not sure where ami-0578f46b79ca9e3e7 is coming from, either. Please try an AMI returned by this list:

aws ec2 describe-images \
    --region eu-central-1 \
    --owners amazon \
    --filters \
        "Name=name,Values=amzn2-ami-hvm-*-gp2" \
        "Name=root-device-type,Values=ebs" \
        "Name=virtualization-type,Values=hvm" \
        "Name=architecture,Values=x86_64" \
    --query \
        'reverse(sort_by(Images, &CreationDate))[:100].{CreationDate:CreationDate,ImageId:ImageId,Name:Name,Description:Description}'

Please also try a different instance type, like m6i.large. Different instance types have different storage configurations. Flintrock is tested against a very small set of the possible storage configurations.
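
To compare the storage configuration of instance types, one option is the EC2 `DescribeInstanceTypes` API, which reports whether a type has instance store volumes. The sketch below assumes boto3 is installed and AWS credentials are configured. The helper names `summarize_instance_storage` and `fetch_instance_types` are made up for illustration and are not part of Flintrock.

```python
from typing import Any, Dict, List


def summarize_instance_storage(response: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Condense a DescribeInstanceTypes response to its instance store details."""
    summary = {}
    for item in response.get("InstanceTypes", []):
        # InstanceStorageInfo is absent for types without instance store volumes.
        info = item.get("InstanceStorageInfo", {})
        summary[item["InstanceType"]] = {
            "instance_store": item.get("InstanceStorageSupported", False),
            "nvme": info.get("NvmeSupport"),
            # One (count, size in GB, disk type) tuple per disk group.
            "disks": [
                (d.get("Count"), d.get("SizeInGB"), d.get("Type"))
                for d in info.get("Disks", [])
            ],
        }
    return summary


def fetch_instance_types(instance_types: List[str], region: str) -> Dict[str, Any]:
    """Call the EC2 API; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the summary helper works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_instance_types(InstanceTypes=instance_types)


if __name__ == "__main__":
    response = fetch_instance_types(["i4i.xlarge", "m6i.large"], "eu-central-1")
    for name, details in summarize_instance_storage(response).items():
        print(name, details)
```

Types without instance store (EBS-only types) would show `instance_store` as `False` and an empty disk list.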

Hi, thanks for the suggestion. Indeed, the problem depends on the instance type.
The following combos are now working in my case:

| instance_type | ami | launch | destroy | restart (stop + start) |
|---|---|---|---|---|
| m6i.large | ami-0121de3d416d6f6a2 | yes | yes | yes |
| m6i.large | ami-0578f46b79ca9e3e7 | yes | yes | yes |
| m5.large | ami-0578f46b79ca9e3e7 | yes | yes | yes |
| i4i.xlarge | ami-0578f46b79ca9e3e7 | yes | yes | NO |

It would be nice to understand what is different about the storage configuration of the i4i. However, it's not a big issue for me; I can use other instance types.
Thanks a lot for the support.
Feel free to close the issue if you wish.

Andrea

I will leave the issue open and re-title it to focus on this storage-related problem. Flintrock should handle it more gracefully, even if we don't support it.
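
One possible mitigation, offered only as a sketch: data on EC2 instance store volumes does not survive a stop and start, so a filesystem created at launch is gone on the next boot, and a required mount of it can send the instance into emergency mode as in the log above. If the mount is defined in /etc/fstab (an assumption; this thread does not show how the entry is created), adding the `nofail` option lets the boot continue when the mount fails. The helper `add_nofail` below is hypothetical and is not Flintrock code.

```python
def add_nofail(fstab_text: str, mount_point: str) -> str:
    """Return fstab_text with 'nofail' added to the options of mount_point's entry."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # An fstab entry has at least: device, mount point, fs type, options.
        is_entry = len(fields) >= 4 and not line.lstrip().startswith("#")
        if is_entry and fields[1] == mount_point:
            options = fields[3].split(",")
            if "nofail" not in options:
                options.append("nofail")
            fields[3] = ",".join(options)
            # The matched entry is rewritten with tab separators.
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical entry for illustration; the real entry on the instance may differ.
    example = "/dev/nvme1n1 /media/ephemeral0 ext4 defaults,noatime 0 2\n"
    print(add_nofail(example, "/media/ephemeral0"), end="")
```

Even with `nofail`, the blank volume would still need to be formatted and mounted again before Spark or HDFS could use it.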