[question] changes in fully qualified domain names
vsoch opened this issue · 9 comments
Hiya! I have a few questions about changes that I've seen in this module over time. I can't pinpoint the date exactly, but I tried (mostly) the same deployment a few months apart and saw the following differences:
Fully Qualified Domain Names
By default, the hostname that came up used to be of the format gffw-compute-a-001, and now it has a suffix: gffw-compute-a-001.c.llnl-flux.internal. Could that be a setting here?
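If an application expects the short form, one workaround is to strip the domain suffix itself instead of relying on what the image reports. This is a minimal sketch, not something taken from the module: the helper name short_hostname is hypothetical, and it assumes the short name is everything before the first dot.

```python
import socket


def short_hostname(name: str) -> str:
    """Return the part of a hostname before the first dot (hypothetical helper)."""
    return name.split(".", 1)[0]


if __name__ == "__main__":
    # socket.gethostname() returns whatever the OS reports, which may be
    # short (gffw-compute-a-001) or fully qualified, depending on the image.
    print(short_hostname(socket.gethostname()))
```

With this, both gffw-compute-a-001 and gffw-compute-a-001.c.llnl-flux.internal reduce to the same short name.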
Network name
Before I used to have my workers ping port 8050 on eth0, but now the network interface seems to be called ens4. Is that linked to a change here?
I'm concerned because I don't see eth0 here:
$ sudo ifconfig
ens4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1460
inet 10.10.0.4 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::4001:aff:fe0a:4 prefixlen 64 scopeid 0x20<link>
ether 42:01:0a:0a:00:04 txqueuelen 1000 (Ethernet)
RX packets 1732 bytes 266656 (260.4 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1747 bytes 311514 (304.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 90 bytes 6400 (6.2 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 90 bytes 6400 (6.2 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
And reading about ens4, I found this description:
ens4 is an inactive device with no NetworkManager connection profile defined.
And in practice I can start my lead broker on it, but nothing can connect to it. I think this is a bug, or possibly some configuration that is wonky so the networking is not working as it used to.
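Since the interface name differs between the two images (eth0 on one, ens4 on the other), a startup script could look the name up at runtime rather than hardcoding it. The following is only a sketch under assumptions: the helper names are hypothetical, it assumes the loopback interface is named lo as in the ifconfig output above, and it does not explain why nothing can connect to the broker.

```python
import socket


def non_loopback_interfaces(names):
    """Filter out the loopback interface, assumed here to be named 'lo'."""
    return [name for name in names if name != "lo"]


def list_interface_names():
    """Return the names of all network interfaces known to the OS."""
    # socket.if_nameindex() returns a list of (index, name) pairs.
    return [name for _, name in socket.if_nameindex()]


if __name__ == "__main__":
    # On the two machines in this issue this would be expected to show
    # eth0 or ens4 respectively, but that is an assumption, not a tested result.
    print(non_loopback_interfaces(list_interface_names()))
```

The result could then be passed to whatever expects an interface name, so the same configuration works on both images.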
Thanks for your help! Apologies, I'm not very experienced with networking but was curious. The main change I made (which could possibly have led to the above too) is switching from a Rocky 8 to a Debian bookworm base.
And for comparison (with the previously cached modules) I can see an eth0, the hostname is shorter, and my application works!
[sochat1_llnl_gov@gffw-compute-a-001 ~]$ flux resource list
STATE NNODES NCORES NODELIST
free 3 12 gffw-compute-a-[001-003]
allocated 0 0
down 0 0
[sochat1_llnl_gov@gffw-compute-a-001 ~]$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1460
inet 10.10.0.5 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::753f:8614:d6ea:fea2 prefixlen 64 scopeid 0x20<link>
ether 42:01:0a:0a:00:05 txqueuelen 1000 (Ethernet)
RX packets 25785 bytes 170143603 (162.2 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 18977 bytes 1363020 (1.2 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 2 bytes 140 (140.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2 bytes 140 (140.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[sochat1_llnl_gov@gffw-compute-a-001 ~]$ hostname
gffw-compute-a-001
The /etc/hosts also looks very different - here is the working setup:
$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.5 gffw-compute-a-001.c.llnl-flux.internal gffw-compute-a-001 # Added by Google
169.254.169.254 metadata.google.internal # Added by Google
And the broken one:
$ cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
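The difference between the two files can be checked mechanically. Below is a small sketch that parses /etc/hosts-style text and reports whether the instance's own name has an entry. The function name parse_hosts and the two constants are hypothetical, and the file contents are copied from the listings above.

```python
def parse_hosts(text):
    """Map each hostname or alias in /etc/hosts-style text to its address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address seen for a name.
            entries.setdefault(name, address)
    return entries


WORKING = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.0.5 gffw-compute-a-001.c.llnl-flux.internal gffw-compute-a-001 # Added by Google
169.254.169.254 metadata.google.internal # Added by Google
"""

BROKEN = """\
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
"""

if __name__ == "__main__":
    for label, text in (("working", WORKING), ("broken", BROKEN)):
        address = parse_hosts(text).get("gffw-compute-a-001")
        print(label, address)
```

In the working file the short name maps to 10.10.0.5, while the broken file has no entry for the instance at all, neither short nor fully qualified.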
okay I think bookworm might be missing network interface firmware! 😱 Testing out if I know how to install it...
okay - I've now tried bullseye (debian-11) and that fixed the weird-looking DNS names and the /etc/hosts, and the same is true on Ubuntu, but there is absolutely no eth0 device. I don't even know how to debug this :(
This is what I'm seeing: https://twitter.com/vsoch/status/1687610567765438464
I hope you can help; I'm out of ideas.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days
It's not stale - nobody has responded to my original issue. :(
@vsoch please create a ticket with Google Cloud Support. This issue is not related to this module.
Thanks
I agree - there's nothing directly at the VPC network or subnet level that would control the VM hostname.