lnxbil/docker-machine-driver-proxmox-ve

Rancher OS and Proxmox VE 5.2-12

pietrushnic opened this issue · 33 comments

Hi @lnxbil,
many thanks for this driver.

I'm struggling a little with setting up infrastructure using it. After some initial problems with running it on my Proxmox VE server, I was able to create VMs reliably. My setup-vm.sh script looks like this:

#!/bin/bash -x

export PATH=${PATH}:${HOME}/bin

PVE_NODE="pve"
PVE_HOST="<my_ip>"
PVE_MEMORY=${PVE_MEMORY:-1}
PVE_REALM="pve"
PVE_POOL="docker-machine"
PVE_STORAGE="local-lvm"
PVE_STORAGE_TYPE="RAW"
PVE_IMAGE_FILE="local:iso/rancheros.iso"

docker-machine rm --force $VM_NAME >/dev/null 2>&1 || true

docker-machine --debug \
    create \
    --driver proxmox-ve \
    --proxmox-host $PVE_HOST \
    --proxmox-user $PVE_USER \
    --proxmox-realm $PVE_REALM \
    --proxmox-password $PVE_PASSWD \
    --proxmox-node $PVE_NODE \
    --proxmox-memory-gb $PVE_MEMORY \
    --proxmox-image-file "$PVE_IMAGE_FILE" \
    --proxmox-storage $PVE_STORAGE \
    --proxmox-pool $PVE_POOL \
    --proxmox-storage-type $PVE_STORAGE_TYPE \
    --proxmox-driver-debug \
    $* \
    $VM_NAME

eval $(docker-machine env ${VM_NAME})

docker ps

Then I figured out that boot2docker doesn't persist data as it should - I described that on the forum. So I wanted to give Rancher OS a try.

The first problem is that the default image v1.5.0 doesn't have qemu-guest-agent enabled by default, which causes an infinite loop in the Proxmox driver. I managed to enable and run that service manually:

(my-vm) {"time":"2018-12-31T17:24:11.88153714+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"64","message":"status code was '500' and error is\n500 QEMU guest agent is not running"}
(my-vm) {"time":"2018-12-31T17:24:11.88159905+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"waiting for VM to become active"}
(my-vm) {"time":"2018-12-31T17:24:16.641505327+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"VM is active waiting more"}
(my-vm) {"time":"2018-12-31T17:24:19.305080141+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Creating directory '/home/docker/.ssh'"}
Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain
notifying bugsnag: [Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain]
++ docker-machine env my-vm
Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "192.168.3.218:2376": dial tcp 192.168.3.218:2376: connect: connection refused
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.

+ eval
+ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

But as you can see, I ran into another error.

It would be great to improve README.md a little so anyone can start debugging, developing and bugfixing - e.g. I'm not a Go expert and had some problems figuring out how to iterate through the development cycle. Anyway, this is a great project and it definitely has the potential to be very useful for the community - maybe at some point it could be included as a default component of docker-machine.

Hi Piotr,

the main problem is that after the creation of the VM, the QEMU agent is crucial, and the driver cannot continue without it. The agent's main purpose is to return the current IP so that the VM can be provisioned via SSH. If that fails, there is nothing you can do, because you cannot reach the VM. If the provisioning step fails, you cannot continue, or at least I do not know how to do it. A Docker machine driver is mainly about setting up the surroundings, and the magic happens behind the scenes.
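To illustrate the IP lookup described above, here is a minimal Go sketch that extracts the first usable IPv4 address from a guest agent reply. It is not the driver's actual code. The JSON shape is an assumption based on the agent's network-get-interfaces command as wrapped by the Proxmox VE API, and the names agentInterfaces and firstGuestIPv4 are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net"
)

// agentInterfaces mirrors the ASSUMED JSON shape of a
// network-get-interfaces reply as returned through the Proxmox VE API.
type agentInterfaces struct {
	Data struct {
		Result []struct {
			Name        string `json:"name"`
			IPAddresses []struct {
				Address string `json:"ip-address"`
				Type    string `json:"ip-address-type"`
			} `json:"ip-addresses"`
		} `json:"result"`
	} `json:"data"`
}

// firstGuestIPv4 (hypothetical helper) returns the first non-loopback
// IPv4 address found in the reply, or an error if there is none.
func firstGuestIPv4(body []byte) (string, error) {
	var reply agentInterfaces
	if err := json.Unmarshal(body, &reply); err != nil {
		return "", err
	}
	for _, iface := range reply.Data.Result {
		for _, addr := range iface.IPAddresses {
			ip := net.ParseIP(addr.Address)
			if addr.Type == "ipv4" && ip != nil && !ip.IsLoopback() {
				return addr.Address, nil
			}
		}
	}
	return "", errors.New("guest agent reported no usable IPv4 address")
}

func main() {
	// Hand-written sample reply, not captured from a real server.
	sample := []byte(`{"data":{"result":[
		{"name":"lo","ip-addresses":[{"ip-address":"127.0.0.1","ip-address-type":"ipv4"}]},
		{"name":"eth0","ip-addresses":[{"ip-address":"192.168.3.218","ip-address-type":"ipv4"}]}
	]}}`)
	ip, err := firstGuestIPv4(sample)
	fmt.Println(ip, err)
}
```

While the agent is not running, there is no such reply to parse, which matches the "500 QEMU guest agent is not running" messages in the log.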

If you want to have support for Rancher, you need to build Rancher yourself with the guest agent enabled - same as with boot2docker in #6 or with cloudinit in #7.

Best,
Andreas

@lnxbil,
this is a Rancher OS image that contains a working QEMU agent on boot - it relies on the patch posted here, and I pushed that to the 3mdeb repo.

So I have the ISO, but there is still the same issue with the driver because the SSH handshake fails:

(ros-test) {"time":"2019-01-03T13:39:34.629453383+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"64","message":"status code was '500' and error is\n500 QEMU guest agent is not running"}   
(ros-test) {"time":"2019-01-03T13:39:34.6294881+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"waiting for VM to become active"}                                             
(ros-test) {"time":"2019-01-03T13:39:38.587210201+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"VM is active waiting more"}                                                 
(ros-test) {"time":"2019-01-03T13:39:40.909649067+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Creating directory '/home/docker/.ssh'"}                                    
Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain                               
notifying bugsnag: [Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain]          
++ docker-machine env ros-test
Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "192.168.3.193:2376": dial tcp 192.168.3.193:2376: connect: connection refused   
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.

+ eval
+ docker ps

I assume the login method used for SSH provisioning is incorrect for Rancher OS. Maybe the driver should use cloud-init from Proxmox VE to provision Rancher with keys?

@pietrushnic It looks like the qemu-agent service in RancherOS is missing something.
Please refer to the hyperv and vmware tools services:

https://github.com/rancher/os-services/blob/master/o/open-vm-tools.yml#L12-L15
https://github.com/rancher/os-services/blob/master/h/hyperv-vm-tools.yml#L12-L15

https://github.com/rancher/os-services/blob/master/images/10-openvmtools/Dockerfile#L30-L41
https://github.com/rancher/os-services/blob/master/images/10-hypervvmtools/Dockerfile#L40-L51

Can you try to rebuild qemu-guest-agent service and build a new ISO?

@niusmallnan sure, but I don't know how to include custom os-services. Any pointers to that?

@niusmallnan I managed to apply those changes to qemuguestagent; you can check them in the 3mdeb repo. To include those changes I modified Dockerfile.dapper:

diff --git a/Dockerfile.dapper b/Dockerfile.dapper
index cf73a34d34ed..4cf9b5959c3d 100644
--- a/Dockerfile.dapper
+++ b/Dockerfile.dapper
@@ -69,7 +69,7 @@ ARG BUILD_DOCKER_URL_arm64=https://github.com/rancher/docker/releases/download/v
 
 ARG OS_RELEASES_YML=https://releases.rancher.com/os
 
-ARG OS_SERVICES_REPO=https://raw.githubusercontent.com/${OS_REPO}/os-services
+ARG OS_SERVICES_REPO=https://raw.githubusercontent.com/3mdeb/os-services
 ARG IMAGE_NAME=${OS_REPO}/os
 
 ARG OS_CONSOLE=default

Unfortunately nothing changed. I also see another problem - I'm not sure where to report it. There is a cloud-init interface in Proxmox VE 5.2, as in the screenshot, but it is unavailable for the VM with Rancher OS; probably some requirements were not fulfilled.
[screenshot: cloud-init]

Just add a cloudinit drive in the hardware tab and then it'll show up.

@lnxbil using Rancher OS? I added it and rebooted from the ISO; the effect is the same as before.

@lnxbil it looks like I misunderstood how cloud-init works in this case.

@lnxbil @niusmallnan what is the correct way of deploying SSH keys? This seems to be problematic when using docker-machine + Proxmox VE driver + Rancher OS. In this configuration (with the recent ISO from @niusmallnan's repo) the agent starts on boot and docker-machine sees that the VM is up, but when it tries to deploy keys it gets the same error as mentioned above:

(ros-test) {"time":"2019-01-05T23:26:22.104290317+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"waiting for VM to become active"}                                           
(ros-test) {"time":"2019-01-05T23:26:25.022949704+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"VM is active waiting more"}                                                 
(ros-test) {"time":"2019-01-05T23:26:27.33223727+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Creating directory '/home/docker/.ssh'"}                                     
Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain                               
notifying bugsnag: [Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain]          
++ docker-machine env ros-test
Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "192.168.3.222:2376": dial tcp 192.168.3.222:2376: connect: connection refused   
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.

What are the default credentials in the VM? The driver needs to login via password.

@lnxbil I'm not sure, but from the information I can find there is only a password for the rancher user, and it is rancher. Will test that in a sec.

@lnxbil rancher:rancher doesn't seem to work, and neither do the other combinations that I tried based on grepping @niusmallnan's code. I'm trying to modify Rancher OS to have a known password for SSH. I assume a password for the rancher user is enough?

@lnxbil @niusmallnan ok, it looks like I have progress. Using 2 small changes:

diff --git a/proxmoxdriver.go b/proxmoxdriver.go
index 3498cb502b98..dfe496b3dbfb 100644
--- a/proxmoxdriver.go
+++ b/proxmoxdriver.go
@@ -258,7 +258,7 @@ func (d *Driver) GetSSHPort() (int, error) {

 func (d *Driver) GetSSHUsername() string {
        if d.SSHUser == "" {
-               d.SSHUser = "docker"
+               d.SSHUser = "rancher"
        }

        return d.SSHUser

and

diff --git a/scripts/global.cfg b/scripts/global.cfg
index c7db32f346c3..e89646921cec 100755
--- a/scripts/global.cfg
+++ b/scripts/global.cfg
@@ -1 +1 @@
-APPEND rancher.autologin=tty1 rancher.autologin=ttyS0 rancher.autologin=ttyS1 rancher.autologin=ttyS1 console=tty1 console=ttyS0 console=ttyS1 printk.devkmsg=on panic=10 ${APPEND}
\ No newline at end of file
+APPEND rancher.password=rancher rancher.autologin=tty1 rancher.autologin=ttyS0 rancher.autologin=ttyS1 rancher.autologin=ttyS1 console=tty1 console=ttyS0 console=ttyS1 printk.devkmsg=on panic=10 ${APPEND}

I was able to provision ssh keys. Unfortunately docker-machine fails further:

Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded                                                                      
notifying bugsnag: [Error creating machine: Error running provisioning: Unable to verify the Docker daemon is listening: Maximum number of retries (10) exceeded]                                                 
++ docker-machine env ros-test
Error checking TLS connection: Error checking and/or regenerating the certs: There was an error validating certificates for host "192.168.3.221:2376": dial tcp 192.168.3.221:2376: connect: connection refused   
You can attempt to regenerate them using 'docker-machine regenerate-certs [name]'.
Be advised that this will trigger a Docker daemon restart which might stop running containers.

+ eval
+ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES   

I'm not sure why this happens, because using:

eval $(docker-machine env ros-test)
docker info

gives the correct result, as do other commands. The VM is also visible in docker-machine ls:

NAME       ACTIVE   DRIVER       STATE     URL                        SWARM   DOCKER        ERRORS
ros-test   *        proxmox-ve   Running   tcp://192.168.3.221:2376           v18.06.1-ce  

That is deep in the docker-machine driver, not in the pve one. All I did was to try to please the driver until it worked.
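To narrow down whether the daemon port is reachable at all, a plain TCP probe with retries can help. The following is a minimal Go sketch and not docker-machine's own check; waitForTCP is a hypothetical helper, and the demo uses a local listener as a stand-in for the daemon so that it runs anywhere.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP (hypothetical helper) dials addr until a connection
// succeeds or the given number of attempts is used up.
func waitForTCP(addr string, attempts int, delay time.Duration) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay) // pause before the next attempt
		}
	}
	return fmt.Errorf("%s not reachable after %d attempts: %v", addr, attempts, lastErr)
}

func main() {
	// Stand-in for the Docker daemon; in practice pass the machine
	// address instead, e.g. "192.168.3.221:2376" from the log above.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	if err := waitForTCP(ln.Addr().String(), 10, time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("port is accepting connections")
}
```

A successful probe only shows that something accepts connections on the port; it says nothing about TLS certificates, which docker-machine validates separately.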

Any news?

We did some work on RancherOS, and it basically works now.
The problem now is that the root disk cannot be formatted automatically, and we also saw the same problem on boot2docker.
cc @kingsd041

Yes. I also did not find a good workflow for everything.

I played around with a PXE install based on Debian with cloudinit, but that takes just too long even with a local mirror.

@macpijan did you try anything more on this topic?

@pietrushnic RancherOS can work on Proxmox VE and supports automatic formatting of the root disk, we are about to release the RC version.
You can also build rancheros based on the master branch and start rancheros using docker-machine

# git clone https://github.com/rancher/os.git
# cd os
# OS_AUTOFORMAT=true make proxmoxve

Then upload dist/artifacts/rancheros.iso to ProxmoxVE

@pietrushnic I have just tested the latest RancherOS build as suggested by @kingsd041.
It works out of the box with docker-machine + Proxmox, with no modifications required to docker-machine-driver-proxmox-ve or to RancherOS. So now we just need to run docker-machine create ..., then the provisioning happens, and then we can use docker-machine ssh .. to access it.

@kingsd041 thanks!

@macpijan: Perfect.

I still can't get it to work. I always end up with the same error that my host kernel must be greater than the RancherOS kernel. I don't know why it has to be greater, but the versions are equal. I'm running it on RancherOS with kernel 4.14.85-rancher.
Will there be specialized ISO images for this on the RancherOS site, or how would we get a Proxmox VE enabled RancherOS to everyone interested?

Strange, if I override the kernel version like this, it works:

KERNEL_VERSION=4.14.84-rancher OS_AUTOFORMAT=true make proxmoxve

What did you use as credentials for the RancherOS?

@lnxbil I do not really know.. All that had to be done was to upload the rancheros.iso built from master branch to Proxmox and then run docker-machine create. Since then I am able to docker-machine ssh. I'm not sure what was used for the initial provisioning process.

@macpijan If you have not recompiled the driver and changed the login user name (defaults to docker) - as @pietrushnic suggested - you won't be able to login into rancher at all.

@lnxbil Then why am I able to?

IIUC, the first login appears somewhere at the {"time":"2019-01-30T18:53:15.964690723+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Creating directory '/home/docker/.ssh'"} ?

Can I somehow enable more logging to see what kind of authentication was used?

(mp-rancheros-test4) {"time":"2019-01-30T18:53:10.967083107+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"64","message":"status code was '500' and error is\n500 QEMU guest agent is not running"}
(mp-rancheros-test4) {"time":"2019-01-30T18:53:10.967121516+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"waiting for VM to become active"}
(mp-rancheros-test4) {"time":"2019-01-30T18:53:13.531870797+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"VM is active waiting more"}
(mp-rancheros-test4) {"time":"2019-01-30T18:53:15.964690723+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Creating directory '/home/docker/.ssh'"}
(mp-rancheros-test4) {"time":"2019-01-30T18:53:16.67549536+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"192.168.3.183 -> "}
(mp-rancheros-test4) {"time":"2019-01-30T18:53:16.67553157+01:00","level":"INFO","prefix":"-","file":"proxmoxdriver.go","line":"58","message":"Trying to copy to 192.168.3.183:22:/home/docker/.ssh"}
(mp-rancheros-test4) Calling .GetConfigRaw
(mp-rancheros-test4) Calling .DriverName
(mp-rancheros-test4) Calling .DriverName
Waiting for machine to be running, this may take a few minutes...
(mp-rancheros-test4) Calling .GetState
Detecting operating system of created instance...
Waiting for SSH to be available...
Getting to WaitForSSH function...
(mp-rancheros-test4) Calling .GetSSHHostname
(mp-rancheros-test4) Calling .GetSSHPort
(mp-rancheros-test4) Calling .GetSSHKeyPath
(mp-rancheros-test4) Calling .GetSSHKeyPath
(mp-rancheros-test4) Calling .GetSSHUsername
Using SSH client type: external
Using SSH private key: /home/maciej/.docker/machine/machines/mp-rancheros-test4/id_rsa (-rw-------)

There is only password authentication implemented:

sshConfig := &ssh.ClientConfig{
	User: d.GetSSHUsername(),
	Auth: []ssh.AuthMethod{
		ssh.Password(d.GuestPassword),
	},
	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
}

That's why I asked. You normally need to pass the proxmox-guest-password as a parameter to get it to work. Did you pass that variable to the RancherOS? The username is also hard coded to docker (if it is empty):

func (d *Driver) GetSSHUsername() string {
	if d.SSHUser == "" {
		d.SSHUser = "docker"
	}
	return d.SSHUser
}

BTW: There is a recent change in the PVE API that allows executing commands via the QEMU agent and also reading/writing files directly, which could and should be used instead of this strange password stuff. At the time I wrote the plugin, this API was not present yet.

This should be investigated further but "limits" the driver to (probably) PVE >= 5.3
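As a rough illustration of that idea, the Go sketch below only builds (and does not send) a request that would write a file inside the guest through the agent. Everything here is an assumption to be checked against the PVE API documentation: the agent/file-write endpoint path, its file and content parameters, and the ticket cookie plus CSRF header authentication. The helper name newFileWriteRequest and the host, ticket and key values are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newFileWriteRequest (hypothetical helper) prepares a POST to the
// ASSUMED agent file-write endpoint of the Proxmox VE API.
func newFileWriteRequest(host, node string, vmid int, ticket, csrf, path, content string) (*http.Request, error) {
	endpoint := fmt.Sprintf("https://%s:8006/api2/json/nodes/%s/qemu/%d/agent/file-write", host, node, vmid)

	form := url.Values{}
	form.Set("file", path)       // target path inside the guest
	form.Set("content", content) // data to write

	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	// Ticket-based authentication: cookie plus CSRF token on write requests.
	req.Header.Set("CSRFPreventionToken", csrf)
	req.AddCookie(&http.Cookie{Name: "PVEAuthCookie", Value: ticket})
	return req, nil
}

func main() {
	// All values are placeholders; nothing is sent over the network.
	req, err := newFileWriteRequest("pve.example", "pve", 100, "TICKET", "CSRF",
		"/home/docker/.ssh/authorized_keys", "ssh-rsa PLACEHOLDER user@host\n")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(req.Method, req.URL)
}
```

Writing authorized_keys this way would remove the need for a known guest password, at the cost of requiring a PVE version that offers the endpoint.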

@lnxbil

All I have done is:

OS_AUTOFORMAT=true make proxmoxve
scp dist/artifacts/rancheros.iso  promoxuser@proxmoxip:/var/lib/vz/template/iso/rancheros.iso
./docker-machine/setup-vm.sh (the script from 1st post, no guest pass params in it)

A few minutes later I have provisioned machine and can docker-machine ssh to it.

RancherOS iso: https://cloud.3mdeb.com/index.php/s/Yx8KFBKgGyPA6kt
(Built from e226c543c0ec commit)

@macpijan Thanks for providing the ISO, I'll try it.

As suspected, I could not login due to password problems:

Error creating machine: Error in driver during machine creation: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain

I just released a new version and tested it exclusively with the RancherOS Proxmox VE ISO, and it works. Could you please test and see if it works in your setup as well? If so, we can close this issue.

@lnxbil thanks for fixing that. It will take me some time to get back to this issue, but if you can wait a week I will try to come back with results. Otherwise please feel free to close it, and I will reopen if any problem remains.

So far most things work, with 2 small additional issues that I found, which do not block further development. I have to say this is a very interesting project and I appreciate that you are investing time in it. I assume there is no better approach to Docker + Proxmox VE.

I consider this issue solved.