"socket not connected" inside docker container after recover
Closed this issue · 7 comments
Setup:
- Rancher
- 5 Node etcd Cluster from catalog. Heartbeat 5ms, Timeout 75ms
- etcd-proxy on each host "load balancing" for the etcd cluster
- Portworx on 3 Nodes from catalog
Problem:
Despite my best efforts the etcd cluster from the rancher catalog still becomes unavailable for a few milliseconds from time to time.
This causes portworx to restart and recover, which works fine and reports all 3 Nodes as healthy after a few moments.
Execing into a container using a pxd Volume and trying to access said volume still shows the error:
ls: /path/to/file Socket not connected
This is only fixed after restarting the container.
Expected:
It would be best if the storage would simply be accessible again after portworx has recovered. If that is not possible triggering containers which were affected to restart would at least cause the error to self-heal all the way through
The storage should be accessible when Portworx restarts. Just so that I understand the problem, in your environment, if you restart Portworx container, the user containers lose connectivity? IOs should continue seamlessly even if Portworx container restarts.
Let us know the steps and we'll troubleshoot..
Yes, the storage is no longer accessable inside the container. Example output inside an affected container:
...
root@b65e3b6b14d3:/# ls -l
ls: Zugriff auf data nicht möglich: Der Socket ist nicht verbunden
insgesamt 68
drwxr-xr-x 2 root root 4096 Mai 16 14:42 bin
drwxr-xr-x 2 root root 4096 Dez 28 17:42 boot
d????????? ? ? ? ? ? data
drwxr-xr-x 5 root root 360 Mai 22 12:05 dev
drwxr-xr-x 1 root root 4096 Mai 22 08:43 etc
drwxr-xr-x 2 root root 4096 Dez 28 17:42 home
drwxr-xr-x 9 root root 4096 Mai 16 14:42 lib
drwxr-xr-x 2 root root 4096 Mai 16 14:42 lib64
drwxr-xr-x 2 root root 4096 Jan 16 18:02 media
drwxr-xr-x 2 root root 4096 Jan 16 18:02 mnt
drwxr-xr-x 3 root root 4096 Mai 16 14:42 opt
dr-xr-xr-x 850 root root 0 Mai 22 12:05 proc
drwx------ 1 root root 4096 Mai 22 09:49 root
drwxr-xr-x 1 root root 4096 Mai 22 08:43 run
drwxr-xr-x 2 root root 4096 Mai 16 14:42 sbin
drwxr-xr-x 2 root root 4096 Jan 16 18:02 srv
-rw-r--r-- 10 root root 3996 Apr 13 11:27 start.sh
dr-xr-xr-x 13 root root 0 Mai 22 12:05 sys
drwxrwxrwt 1 root root 4096 Mai 22 12:05 tmp
drwxr-xr-x 10 root root 4096 Mai 16 14:42 usr
drwxr-xr-x 1 root root 4096 Mai 16 14:42 var
root@b65e3b6b14d3:/# ls data
ls: Zugriff auf data nicht möglich: Der Socket ist nicht verbunden
...
I have tried a variaty of setups which cause this problem. The easiest setup to reproduce this should be:
- Install etcd from rancher catalog, scale 3
- expose Port 2379 to host
- install portworx with etcd://127.0.0.1:2379 as etcd address
- Create a service with named volume and volume drive pxd
- the etcd containers will restart from time to time
- portworx error: "etcd unavailable or missconfigured". Alternativly: "Etcd misconfigured or has no leader leader" - not the literal error message, I don't have that at hand right now
- portworx will come back online as the etcd becomes stable again
In my current setup I've taken to use an etcd in proxy mode as portworx' target, increased its scale to 5 and tweaked the timing settings a bit. This increases the time it takes for an error to happen drastically but it does not stop it from happening.
@svensp sorry for the delay - If you are still experiencing this problem, I have a theory on what may be going on -
can you give us output of /opt/pwx/bin/pxctl -j volume list
?
We're curently still experiencing this. Our installation is still running as the same version as before which is 1.2.2, I'll try updating to 1.2.8 and see if that helps.
The following is the output from 1.2.2, with abz_abz-storage_dc932
currently in the socket not connected
state inside a container.
[{
"id": "113703864416429536",
"readonly": false,
"locator": {
"name": "sentry-psql"
},
"ctime": "2017-07-03T14:57:39Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "1",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": false,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "33947648",
"last_scan": "2017-07-03T14:57:39Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"device_path": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {-1 0} {-1 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0]",
"ReplicaSetCurr": "[0]",
"ReplicaSetNext": "[0]",
"ResyncBlocks": "[{0 0} {-1 0} {-1 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "1",
"WriteSet": "[0]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "213313833312877066",
"readonly": false,
"locator": {
"name": "ahgz_fileproxy_data"
},
"ctime": "2017-06-16T04:19:45Z",
"spec": {
"ephemeral": false,
"size": "5368709120",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "69554176",
"last_scan": "2017-06-16T04:19:45Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"device_path": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "218619615890089249",
"readonly": false,
"locator": {
"name": "koca_test_fileproxy_data"
},
"ctime": "2017-05-22T14:33:53Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "515907584",
"last_scan": "2017-05-22T14:33:53Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "ef1468b8-70d5-4f2a-8959-8f5e0db1ad48",
"device_path": "ef1468b8-70d5-4f2a-8959-8f5e0db1ad48",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 40 40 0 0]",
"TimestampBlocksTotal": "40",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "342538426071651372",
"readonly": false,
"locator": {
"name": "pxd-test"
},
"ctime": "2017-07-18T06:47:41Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "33947648",
"last_scan": "2017-07-18T06:47:41Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"device_path": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {-1 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[1]",
"ResyncBlocks": "[{0 0} {1 0} {-1 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "510246567112056213",
"readonly": false,
"locator": {
"name": "abz_abz-storage_dc932"
},
"ctime": "2017-07-04T09:12:58Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "682528768",
"last_scan": "2017-07-04T09:12:58Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"device_path": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "537734476473597616",
"readonly": false,
"locator": {
"name": "koca_fileproxy_data"
},
"ctime": "2017-06-30T00:10:38Z",
"spec": {
"ephemeral": false,
"size": "10737418240",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "1587208192",
"last_scan": "2017-06-30T00:10:38Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "ef1468b8-70d5-4f2a-8959-8f5e0db1ad48",
"device_path": "ef1468b8-70d5-4f2a-8959-8f5e0db1ad48",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 37 37 0 0]",
"TimestampBlocksTotal": "37",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "882926935111313361",
"readonly": false,
"locator": {
"name": "traefik_certs"
},
"ctime": "2017-07-05T13:04:25Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "36446208",
"last_scan": "2017-07-05T13:04:25Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"device_path": "fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
},{
"id": "950597714658235465",
"readonly": false,
"locator": {
"name": "abz-staging_abz-staging-storage_845db"
},
"ctime": "2017-07-06T07:53:12Z",
"spec": {
"ephemeral": false,
"size": "1073741824",
"format": "ext4",
"block_size": "65536",
"ha_level": "3",
"cos": "low",
"io_profile": "sequential",
"dedupe": false,
"snapshot_interval": 0,
"shared": true,
"aggregation_level": 1,
"encrypted": false,
"passphrase": "",
"snapshot_schedule": "",
"scale": 0,
"sticky": false,
"max_backups": 0,
"backup_schedule": ""
},
"usage": "35745792",
"last_scan": "2017-07-06T07:53:12Z",
"format": "ext4",
"status": "up",
"state": "attached",
"attached_on": "af36978b-c009-4bcd-8865-100d7ea5d9a8",
"device_path": "/dev/pxd/pxd950597714658235465",
"attach_path": [
"178.63.85.76"
],
"replica_sets": [
{
"nodes": [
"af36978b-c009-4bcd-8865-100d7ea5d9a8",
"fd211e6d-e48c-4218-ab2d-ab78a94b8380",
"ef1468b8-70d5-4f2a-8959-8f5e0db1ad48"
]
}
],
"error": "",
"runtime_state": [
{
"runtime_state": {
"FullResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"ID": "0",
"ReadQuorum": "1",
"ReadSet": "[0 1 2]",
"ReplicaSetCurr": "[0 1 2]",
"ReplicaSetNext": "[0 1 2]",
"ResyncBlocks": "[{0 0} {1 0} {2 0} {-1 0} {-1 0}]",
"RuntimeState": "clean",
"TimestampBlocksPerNode": "[0 0 0 0 0]",
"TimestampBlocksTotal": "0",
"WriteQuorum": "2",
"WriteSet": "[0 1 2]"
}
}
],
"secure_device_path": "",
"background_processing": false
}]
@svensp can you share your use case for abz_abz-storage_dc932 ? If you don't require this volume to be shared, then you can update the setting and the volume will survive restarts.
That said, you should get to the bottom of etcd flap..
Use case:
The volume shares user uploaded data for a php web application.
We have a total of 3 machines and the app is scheduled in Rancher to run on 2 of them at any given time. If an outage of 1 Server occurs the other active instance is still served while redundancy is restored by starting it on the third one.
Etcd stability has increased tremendously since we moved from the Catalog entry using 2.3.7 to 3.2 using 3 different services so they can be configured individually and doing backups via etcd snapshot.
Currently the biggest reason for etcd going down actually seems the hosts hanging because the portworx process hangs and the server is configured to restart on hanging processes.
I'll open a proper issue for that once I have a kernel hang panic log but for now the following kernel messages seem related:
[241056.128134] blk_update_request: I/O error, dev pxd/pxd708560278161888801, sector 2097024
[241056.128188] blk_update_request: I/O error, dev pxd/pxd708560278161888801, sector 2097024
[241056.128229] Buffer I/O error on dev pxd/pxd708560278161888801, logical block 262128, async page read
[241056.294427] EXT4-fs (pxd!pxd708560278161888801): mounted filesystem with ordered data mode. Opts: discard
It looks like a hardware failure except its happening on a pxd volume. Neither smartctl nor mdadm report any hardware errors however.