Concurrency issue when nodes are brought up in parallel
electrofelix opened this issue · 4 comments
It appears that when machines are brought up in parallel, landrush may attempt to start the DNS server multiple times. As a consequence the wrong PID is recorded: the last started server's pid is written to the pidfile instead of the pid of the server that is actually running.
This was confirmed by adding a simple

```ruby
puts "pid = #{pid}"
```

before the `write_pid()` call in `lib/landrush/server.rb`.
The resulting output was:

```
[landrush] Using eth0 (172.17.0.5)
==> srv-1.ncs.net: [landrush] starting dns server
srv-1.ncs.net: [landrush] 'ruby /home/baileybd/.vagrant.d/gems/gems/landrush-1.2.0/lib/landrush/start_server.rb 10053 /home/baileybd/.vagrant.d/data/landrush /home/baileybd/.vagrant.d/gems/gems'
==> srv-4.ncs.net: [landrush] adding machine entry: srv-4.ncs.net => 172.17.0.5#<Process::Waiter:0x007f5c881dd2d8>
pid = 19564
==> srv-5.ncs.net: [landrush] starting dns server
[landrush] Using eth0 (172.17.0.6)
srv-1.ncs.net: [landrush] 'ruby /home/baileybd/.vagrant.d/gems/gems/landrush-1.2.0/lib/landrush/start_server.rb 10053 /home/baileybd/.vagrant.d/data/landrush /home/baileybd/.vagrant.d/gems/gems'
pid = 19571
==> srv-2.ncs.net: [landrush] adding machine entry: srv-2.ncs.net => 172.17.0.6
[landrush] Using eth0 (172.17.0.4)
```
Checking the processes for `start_server.rb`:

```shell
ps auxww | grep landrush/start_server.rb | awk '{print $2}'
19564
```
However the process pid file contains:

```shell
cat ~/.vagrant.d/data/landrush/run/landrush.pid
19571
```
Therefore, with parallel machine ups, this code can be executed more than once, and the pid of the last attempted start of the DNS server (which may fail because the port is already in use) ends up being written to the pidfile instead of the pid of the running server.
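One way a later machine thread could avoid forking a second server and clobbering the pidfile is to probe whether the DNS port is already bound. A minimal sketch (illustrative only, not landrush's actual code; the helper name is made up):

```ruby
require 'socket'

# Hypothetical guard: returns true if nothing is bound to the given
# UDP port yet, i.e. the DNS server has not been started by another
# machine thread. A failed bind raises Errno::EADDRINUSE.
def dns_port_free?(port)
  probe = UDPSocket.new
  probe.bind('127.0.0.1', port)
  probe.close
  true
rescue Errno::EADDRINUSE
  false
end
```

A caller would then only fork the server and write the pidfile when `dns_port_free?` returns true, though note this check alone is still racy between the probe and the actual bind; a lock (as proposed below in the thread) is the robust fix.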
The main area responsible for this is the `post_boot_setup` action (`landrush/lib/landrush/action/setup.rb`, lines 35 to 43 in 1077e14):
```ruby
def post_boot_setup
  record_dependent_vm
  configure_server
  record_machine_dns_entry
  setup_static_dns
  start_server

  return unless machine.config.landrush.host_redirect_dns?

  env[:host].capability(:configure_visibility_on_host, host_ip_address, config.tld)
end
```
The calls to `configure_server`, `setup_static_dns` and `start_server` should each be executed only once, by whichever machine thread reaches this code first. This would also solve the problem of failing to find the DNS server process on destroy.
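The once-only requirement above could be enforced with an exclusive file lock, so that parallel machine threads serialize on the setup steps. A sketch under assumed names (`lock_path` and the usage below are illustrative, not landrush's actual API):

```ruby
require 'fileutils'

# Serialize a block across parallel `vagrant up` machine threads by
# taking an exclusive lock on a shared lock file. Each caller opens
# its own file descriptor, so flock works across threads and processes.
def with_setup_lock(lock_path)
  FileUtils.mkdir_p(File.dirname(lock_path))
  File.open(lock_path, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)   # blocks until the current holder finishes
    yield
  end                        # lock released when the file is closed
end

# Hypothetical usage inside post_boot_setup:
#   with_setup_lock(File.join(data_dir, 'run', 'setup.lock')) do
#     start_server unless server_running?
#   end
```

With this in place only the first thread to take the lock would start the server; later threads would see it already running and skip both the fork and the pidfile write.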
Thanks for the bug report and pull request. Looking into it.
> It appears that when using parallel up of machines that landrush may attempt to start the DNS server multiple times,

So you mean `vagrant up --parallel`?
@hferentschik exactly:
Looks like something has changed that prevents multiple instances of the server from being started even without this patch; I'm using Vagrant 1.8.7.
However, in certain cases I am seeing entries failing to be added to `~/.vagrant.d/data/landrush/hosts.json` and entries missing from `vagrant landrush list`. So although the server now starts only once, there are other problems with how the configuration is written out that would be solved by this change.
Vagrantfile:

```ruby
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "debian/jessie64"

  config.landrush.enabled = true
  config.landrush.tld = 'ncs.net'
  config.landrush.host_redirect_dns = false
  # use my host resolvconf dnsmasq instance, probably not required
  # for systems that can reach the google dns servers instead
  config.landrush.upstream '127.0.0.1'

  config.vm.provider :libvirt do |libvirt, override|
    override.vm.synced_folder ".", "/vagrant", type: "9p"
    libvirt.memory = 4096
    libvirt.cpus = 2
    libvirt.disk_bus = "scsi"
    libvirt.volume_cache = "unsafe"
    libvirt.nested = true
    libvirt.cpu_mode = "host-passthrough"
    libvirt.driver = "kvm"
    libvirt.host = "localhost"
    libvirt.uri = "qemu:///system"
  end

  (1..10).each do |n|
    config.vm.define "srv-#{n}" do |node|
      node.vm.hostname = "srv-#{n}.ncs.net"
    end
  end
end
```
```shell
vagrant landrush stop
rm -rf ~/.vagrant.d/data/landrush/
vagrant up --parallel --provider libvirt
# ... <lots of output>
vagrant landrush list
```

This produced the following output:

```
193.121.168.192.in-addr.arpa srv-3.test.net
srv-2.test.net 192.168.121.124
124.121.168.192.in-addr.arpa srv-2.test.net
srv-9.test.net 192.168.121.188
188.121.168.192.in-addr.arpa srv-9.test.net
srv-6.test.net 192.168.121.186
186.121.168.192.in-addr.arpa srv-6.test.net
srv-7.test.net 192.168.121.240
240.121.168.192.in-addr.arpa srv-7.test.net
srv-5.test.net 192.168.121.46
46.121.168.192.in-addr.arpa srv-5.test.net
```
Bringing the set up and down without removing the landrush data dir results in more entries getting populated, but it shows that there is a race in how the configuration is written out.
And I also wouldn't be too confident that the multiple attempts to start the daemon won't return unless the reason it now behaves itself can be tracked down; it's likely that timing alone is making things work better at the moment.
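The missing `hosts.json` entries above are consistent with an unsynchronized read-modify-write cycle: two machine threads read the same snapshot, each adds its own entry, and the second write clobbers the first. A sketch of making the update atomic with an exclusive lock (illustrative function name, not landrush's actual code):

```ruby
require 'json'

# Hypothetical helper: hold an exclusive flock for the entire
# read-modify-write cycle so parallel machines cannot clobber each
# other's entries in the shared JSON store.
def add_host_entry(path, name, ip)
  File.open(path, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)
    data = f.size.zero? ? {} : JSON.parse(f.read)
    data[name] = ip
    f.rewind
    f.write(JSON.pretty_generate(data))
    f.flush
    f.truncate(f.pos)        # drop any leftover bytes from a longer old file
  end
end
```

Locking for the whole cycle (rather than just the write) is the key point: a lock around the write alone would still let two threads read the same stale snapshot.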
Merged via pull request #328