moby/moby

Docker 1.9.1 hanging at build step "Setting up ca-certificates-java"

jredl-va opened this issue ยท 258 comments

A few of us within the office upgraded to the latest version of docker toolbox backed by Docker 1.9.1 and builds are hanging as per the below build output.

docker version:

 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      darwin/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

docker info:

Containers: 10
Images: 57
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 77
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.13-boot2docker
Operating System: Boot2Docker 1.9.1 (TCL 6.4.1); master : cef800b - Fri Nov 20 19:33:59 UTC 2015
CPUs: 1
Total Memory: 1.956 GiB
Name: vbootstrap-vm
ID: LLM6:CASZ:KOD3:646A:XPRK:PIVB:VGJ5:JSDB:ZKAN:OUC4:E2AK:FFTC
Debug mode (server): true
 File Descriptors: 13
 Goroutines: 18
 System Time: 2015-11-24T02:03:35.597772191Z
 EventsListeners: 0
 Init SHA1: 
 Init Path: /usr/local/bin/docker
 Docker Root Dir: /mnt/sda1/var/lib/docker
Labels:
 provider=virtualbox

uname -a:

Darwin JRedl-MB-Pro.local 15.0.0 Darwin Kernel Version 15.0.0: Sat Sep 19 15:53:46 PDT 2015; root:xnu-3247.10.11~1/RELEASE_X86_64 x86_64

Here is a snippet from the docker build uppet that hangs on the Setting up ca-certificates-java line. Something to do with the latest version of docker and openjdk?

update-alternatives: using /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/tnameserv to provide /usr/bin/tnameserv (tnameserv) in auto mode
update-alternatives: using /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jexec to provide /usr/bin/jexec (jexec) in auto mode
Setting up ca-certificates-java (20140324) ...

Docker file example:

FROM gcr.io/google_appengine/base

# Prepare the image.
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl && apt-get clean

I can confirm that this is not an issue with Docker 1.9.0 or Docker Toolbox 1.9.0d. Let me know if I can provide any further information but this feels like a regression of some sort within the new version.

bsao commented

I am facing same problem. I am investigating.

We're facing the same problem as well.

bsao commented

Yep, it is a problem em docker 1.9. I had downgraded to 1.8.3 and all problems solved. Now i am investigating a workarround. will post here! Tks

I'm having the same issue with docker 1.9.1a

bean5 commented

I have docker 1.8.3, so maybe the process of installing a different version of docker remedies the situation. @bsao.

having this same issue with docker version 1.9.1, build a34a1d5

Are you only seeing this on boot2docker?

I cannot repo on a stock ubuntu with aufs or on my machine. let me try with boot2docker to see if I can repo there.

+1 in Docker 1.9.1 for ubuntu:14.10 using OSX

bean5 commented

This is an issue that started appearing after I turned on VPN for work. Even after I turned off VPN and restarted the docker machine on OSX it continued to have this problem. I re-installed Docker 1.9.1 and then 1.8.3, still seeing the issue. Blocks me from using most if not all of my dockers on the Mac.

+1 in Docker 1.9.1 for ubuntu 12.04 using OS X 10.11

Developers in my office came across this by accident too.

This version/build worked: Docker version 1.9.0, build 76d6bc9

This version/build hung:Docker version 1.9.1, build a34a1d5

@crosbymichael I unfortunately have not tried it on any other environment than Boot2Docker.

bean5 commented

Someone with the know-how of git-bisecting and docker could use the build IDs provided by @chico1198!

I experienced the same problem with 1.9.1 on OSX El Capitan, downgrading to 1.9.0 didn't help.

Same issue here on OSX 10.9.3 with:
Docker version 1.9.1, build a34a1d5
Docker version 1.9.0, build 76d6bc9

@crosbymichael I logged in boot2docker and ran ps auxf, this is what I saw:

root      1290  0.4  1.8 1346656 75692 ?       Sl   Nov27   4:53 /usr/local/bin/docker daemon -D -g /var/lib/docker -H unix:// -H tcp://0.0.0.0:2376 [...]
root      8556  0.0  0.0      0     0 ?        Ss   05:12   0:00  \_ [sh]
root     24221 99.8  0.0      0     0 ?        Zl   05:33  64:17  |   \_ [java] <defunct>
root     24657  0.0  0.0      0     0 ?        Ss   06:07   0:00  \_ [sh]
root      6174 79.6  0.0      0     0 ?        Zl   06:22  12:33      \_ [java] <defunct>
root      7295 49.3  0.0      0     0 ?        Zl   06:32   2:49      \_ [java] <defunct>

+1 with docker 1.9.1 on OSX 10.11 with attempting to build image from ubuntu 14.04

junxy commented

+1
use DockerToolbox-1.9.1a.pkg

docker version                                                                                      2 master?
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      darwin/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64

Downgrading to Docker 1.8.3 is my temporary workaround. Here's the target I use in my Makefile.

downgrade-docker:
  docker-machine ssh $(DOCKER_MACHINE_NAME) sudo /etc/init.d/docker stop
  docker-machine ssh $(DOCKER_MACHINE_NAME) "while sudo /etc/init.d/docker status ; do sleep 1; done"
  docker-machine ssh $(DOCKER_MACHINE_NAME) "sudo curl 'https://get.docker.com/builds/Linux/x86_64/docker-1.8.3' -o /usr/local/bin/docker-1.8.3"
  docker-machine ssh $(DOCKER_MACHINE_NAME) "sudo ln -sf /usr/local/bin/docker-1.8.3 /usr/local/bin/docker"
  # FIXME: Starting machine is not enough; always fails with message like "Need TLS certs for 127.0.0.1,10.0.2.15,192.168.99.100"
  #docker-machine ssh $(DOCKER_MACHINE_NAME) sudo /etc/init.d/docker start
  docker-machine stop $(DOCKER_MACHINE_NAME) 
  docker-machine start $(DOCKER_MACHINE_NAME) 

I couldn't reproduce this. Does it always hang at "setting up certificates" ? Did you try sending a ^D to close some pipe? Can you also try sending a SIGUSR1 to the daemon and paste the stack trace here when it's stuck?

+1 with docker 1.9.1 on OS X 10.10

I tried downgrading to 1.8.3 using @osterman 's Makefile and also had troubles with the SSH key:

ip-10-100-0-211:docker-dev leaf$ docker-machine start default
(default) OUT | Starting VM...
Too many retries waiting for SSH to be available.  Last error: Maximum number of retries (60) exceeded

Tested it by doing different openjdk installs inside debian:jessie and ubuntu
OSX 10.11.1, boot2docker 1.9.1: hangs
OSX 10.11.1, boot2docker 1.9.0: works
Ubuntu 14.04 with docker 1.9.1: works

The boot2docker vms were created with:
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/boot2docker/boot2docker/releases/download/v1.9.0/boot2docker.iso
and
docker-machine create -d virtualbox --virtualbox-boot2docker-url=https://github.com/boot2docker/boot2docker/releases/download/v1.9.1/boot2docker.iso

On Ubuntu 14.04 docker was installed following the documentation on https://docs.docker.com/engine/installation/ubuntulinux/

+1, running docker 1.9.1 build a34a1d5 on OSX Yosemite 10.10.5.

I can't reproduce this.

patsa commented

Same issue here.
Is there any way to downgrade to an earlier version on Windows?

+1, docker 1.9.1 @ El Capitan

+1, Docker 1.9.1 on OS X 10.11.1

+1, Docker 1.9.1a, OS X 10.10.5

+1, Docker 1.9.1 build a34a1d5, Windows 10

+1, Docker 1.9.1 build a34a1d5, OS X 10.11.1, Docker-Machine 0.5.1 build 7e8e38e

+1

Same on Docker-machine on OSX 10.11.1
Docker version 1.9.1, build a34a1d5
docker-machine version 0.5.1 (HEAD)

I'm able to reproduce this on docker-machine, OS X 10.10.5, so this may be something related to boot2docker. docker top also gives me <defunct> for a java process;

docker top dreamy_sammet                                                                  Tue Dec  1 15:58:47 2015
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                2538                1023                0                   14:44               ?                   00:00:00            /bin/sh -c apt-get update && apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl && apt-get clean
root                2566                2538                1                   14:44               ?                   00:00:16            apt-get install -y -qq --no-install-recommends build-essential wget curl unzip python python-dev php5-mysql php5-cli php5-cgi openjdk-7-jre-headless openssh-client python-openssl
root                4830                2566                0                   14:46               pts/0               00:00:00            /usr/bin/dpkg --status-fd 14 --configure libgdbm3:amd64 libjson-c2:amd64 libbsd0:amd64 libedit2:amd64 libkeyutils1:amd64 libkrb5support0:amd64 libk5crypto3:amd64 libkrb5-3:amd64 libgssapi-krb5-2:amd64 libidn11:amd64 libsasl2-modules-db:amd64 libsasl2-2:amd64 libldap-2.4-2:amd64 libmagic1:amd64 libsqlite3-0:amd64 libwrap0:amd64 libxml2:amd64 perl-modules:all perl:amd64 mime-support:all libexpat1:amd64 libpython2.7-stdlib:amd64 python2.7:amd64 libpython-stdlib:amd64 python:amd64 libasan1:amd64 libasyncns0:amd64 libatomic1:amd64 libavahi-common-data:amd64 libavahi-common3:amd64 libdbus-1-3:amd64 libavahi-client3:amd64 libcilkrts5:amd64 libisl10:amd64 libcloog-isl4:amd64 libcups2:amd64 librtmp1:amd64 libssh2-1:amd64 libcurl3:amd64 libogg0:amd64 libflac8:amd64 libpng12-0:amd64 libfreetype6:amd64 ucf:all fonts-dejavu-core:all fontconfig-config:all libfontconfig1:amd64 libglib2.0-0:amd64 libgomp1:amd64 x11-common:all libice6:amd64 libicu52:amd64 libitm1:amd64 liblcms2-2:amd64 liblsan0:amd64 libmpfr4:amd64 mysql-common:all libmysqlclient18:amd64 libnspr4:amd64 libnss3:amd64 libonig2:amd64 libpcsclite1:amd64 libsm6:amd64 libvorbis0a:amd64 libvorbisenc2:amd64 libsndfile1:amd64 libxau6:amd64 libxdmcp6:amd64 libxcb1:amd64 libx11-data:all libx11-6:amd64 libx11-xcb1:amd64 libxext6:amd64 libxi6:amd64 libxtst6:amd64 libpulse0:amd64 libpython2.7:amd64 libc-dev-bin:amd64 linux-libc-dev:amd64 libc6-dev:amd64 libexpat1-dev:amd64 libpython2.7-dev:amd64 libquadmath0:amd64 libsctp1:amd64 libtsan0:amd64 libubsan0:amd64 tzdata-java:all java-common:all libjpeg62-turbo:amd64 ca-certificates-java:all openjdk-7-jre-headless:amd64 libmpc3:amd64 libpsl0:amd64 wget:amd64 bzip2:amd64 libperl4-corelibs-perl:all lsof:amd64 openssh-client:amd64 patch:amd64 xz-utils:amd64 binutils:amd64 cpp-4.9:amd64 cpp:amd64 libgcc-4.9-dev:amd64 gcc-4.9:amd64 gcc:amd64 libstdc++-4.9-dev:amd64 g++-4.9:amd64 g++:amd64 make:amd64 libtimedate-perl:all libdpkg-perl:all dpkg-dev:all build-essential:amd64 curl:amd64 libpython-dev:amd64 libqdbm14:amd64 psmisc:amd64 php5-common:amd64 php5-json:amd64 php5-cli:amd64 php5-cgi:amd64 php5-mysql:amd64 python-ply:all python-pycparser:all python-cffi:amd64 python-pkg-resources:all python-six:all python-cryptography:amd64 python2.7-dev:amd64 python-dev:amd64 python-openssl:all unzip:amd64
root                6711                4830                0                   14:46               pts/0               00:00:00            /bin/bash /var/lib/dpkg/info/ca-certificates-java.postinst configure
root                6725                6711                97                  14:46               pts/0               00:12:25            [java] <defunct>

/cc @tianon @nathanleclaire @JeffDM perhaps any of you has an idea where to look, or what to debug, I couldn't really find something

Looks like memory is not the problem, however the <defunct> process does consume 100% CPU;

CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O               BLOCK I/O
d263da116bfd        99.51%              689.3 MB / 2.1 GB   32.82%              157.9 MB / 2.754 MB   25.15 MB / 130.4 MB

The container seems to be stuck as well, and I had to reboot the vm to get it killed

+1 Docker version 1.9.1, build a34a1d5, Win 7.

I've run into similar problems that turned out to be OOM, even though the stats command shows memory available to the container. The problem happened soon after task manager showed 0 free physical memory, while stats continued to show <100%.

Weird thing is, that the process kept running, so it was not killed. I can retry with a -m, however, it's strange that this happens on 1.9.x, but (following this discussion) not on 1.8. Also, running the same on a 1GB DigitalOcean droplet (also 1.9.1) succeeded. Perhaps that one uses swap, should check that

It actually kept happening to me after I uninstalled 1.9.1 and installed 1.8.3. Looked like the uninstall wasn't very thorough though on Mac because firing up the shell was without delay on 1.8.3, unlike a normal first run where it sets up ssh keys and stuff.

USER POLL

The best way to get notified when there are changes in this discussion is by clicking the Subscribe button in the top right.

The people listed below have appreciated your meaningfull discussion with a random +1:

@mattes

bean5 commented

31 participants on this issue and counting.

@bean5 please keep your comments constructive

bean5 commented

@thaJeztah I didn't mean to offend nor be deconstructive. I mean to draw attention to the fact that github shows the number of people participating...and I gathered that @GordonTheTurtle wanted to construct a list of people who have done +1. Maybe I was confused by what he meant. In any case, I watch this issue with great anticipation since it has affected me on more than one occasion in the past weeks. I am glad we have information from various users.

I am able to duplicate this issue on my setup (using Docker Machine on Mac).

Here are my findings so far.

As noted by other posters, the simplest way to duplicate this has been to use the boot2docker 1.9.1 ISO with AUFS. This Dockerfile should minimally reproduce the problem fairly quickly:

FROM debian:jessie

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-7-jre-headless

Looking at dmesg, I see some AUFS errors after attempting such a build, but I am not 100% sure they are related:

docker@default:~$ dmesg | tail
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
aufs au_opts_verify:1597:docker[14186]: dirperm1 breaks the protection by the permission bits on the lower branch
device veth955cc15 entered promiscuous mode
IPv6: ADDRCONF(NETDEV_UP): veth955cc15: link is not ready
eth0: renamed from vethc63e038
IPv6: ADDRCONF(NETDEV_CHANGE): veth955cc15: link becomes ready
docker0: port 2(veth955cc15) entered forwarding state
docker0: port 2(veth955cc15) entered forwarding state
docker0: port 2(veth955cc15) entered forwarding state

If I create a Docker 1.9.1 machine which uses overlay as the storage driver:

$ docker-machine create -d virtualbox --engine-storage-driver overlay overlay

The process does NOT hang and this line runs successfully! Looks like AUFS and/or kernel is the problem.

boot2docker/boot2docker did bump both kernel versions and AUFS commit for the 1.9.1 release, so those are both factors which need to be ruled out or investigated further:

Currently trying 1.9.0 ISO with a 1.9.1 binary to see if the surface area of potential bug area can be reduced further.

The Dockerfile will build fine and not hang on a boot2docker 1.9.0 ISO with a Docker 1.9.1 binary. The issue seems not to lie with Docker 1.9.1, but rather the environment in which it is being run.

I am using the 1.9.1 release with no issue on aufs, but have significantly more cpu/ram/storage than the default machine config.

I just tried raising the memory to 4GB for my VM, but still able to reproduce

@cpuguy83 AUFS on boot2docker 1.9.1?

As noted above, b2d bundles a very specific version of AUFS.

Yep

Containers: 13
Images: 191
Server Version: 1.9.1
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 221
 Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.1.13-boot2docker
Operating System: Boot2Docker 1.9.1 (TCL 6.4.1); master : cef800b - Fri Nov 20 19:33:59 UTC 2015
CPUs: 1
Total Memory: 3.859 GiB
Name: default
ID: XMQH:4YAW:ZDSA:OWC7:GAPC:US5P:YQ4M:SVMQ:VXNL:RRZC:YNHT:ZBHE
Debug mode (server): true
 File Descriptors: 12
 Goroutines: 19
 System Time: 2015-12-01T23:05:28.760107918Z
 EventsListeners: 0
 Init SHA1:
 Init Path: /usr/local/bin/docker
 Docker Root Dir: /mnt/sda1/var/lib/docker
Labels:
 provider=virtualbox

I also see some java processes becoming defunct in a container. I am able to reproduce this issue with the following steps
run the container:

docker run --rm -it myJavaContainerFromCentos7 bash

create Foo.java with the following:

class Foo {
    public static void main (String[] a) {
        System.out.println("hello world");
    }
}

compile and run it results in a defunct java process, with 1 core using 100%cpu:
javac Foo.java && java Foo

however... if a System.exit(0); is added after the println everything is ok:

class Foo {
    public static void main (String[] a) {
        System.out.println("hello world");
        System.exit(0);  // clean exit, no hang
    }
}

version info:
osx 10.10.3
docker 1.9.1
boot2docker version 1.9.1 uname -a is "linux ci 4.1.13-boot2docker"
numproc = 1

strace output with System.exit(0);

open("/usr/java/jdk1.7.0_75/jre/lib/amd64/jvm.cfg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=677, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f27b1dab000
read(3, "# Copyright (c) 2003, Oracle and"..., 4096) = 677
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f27b1dab000, 4096)            = 0
stat("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
futex(0x7f27b17580d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\245\36\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
mmap(NULL, 15167976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f27b031c000
mprotect(0x7f27b0e8f000, 2097152, PROT_NONE) = 0
mmap(0x7f27b108f000, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb73000) = 0x7f27b108f000
mmap(0x7f27b1153000, 262632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f27b1153000
close(3)                                = 0
open("/usr/java/jdk1.7.0_75/bin/../lib/amd64/jli/libm.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=11922, ...}) = 0
mmap(NULL, 11922, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f27b1da9000
close(3)                                = 0
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1141552, ...}) = 0
mmap(NULL, 3150168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f27b001a000
mprotect(0x7f27b011b000, 2093056, PROT_NONE) = 0
mmap(0x7f27b031a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x100000) = 0x7f27b031a000
close(3)                                = 0
mprotect(0x7f27b031a000, 4096, PROT_READ) = 0
munmap(0x7f27b1da9000, 11922)           = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f27b1ca4000
mprotect(0x7f27b1ca4000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f27b1da3fb0,                                                                                                    flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,  parent_tidptr=0x7f27b1da49d0, tls=0x7f27b1da4700, child_tidptr=0x7f27b1da49d0) = 118
futex(0x7f27b1da49d0, FUTEX_WAIT, 118, NULLhellowerld
 <unfinished ...>
 +++ exited with 0 +++

strace output without System.exit(0);

open("/usr/java/jdk1.7.0_75/jre/lib/amd64/jvm.cfg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=677, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fac9a490000
read(3, "# Copyright (c) 2003, Oracle and"..., 4096) = 677
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7fac9a490000, 4096)            = 0
stat("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
futex(0x7fac99e3d0d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/usr/java/jdk1.7.0_75/jre/lib/amd64/server/libjvm.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\245\36\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=15224066, ...}) = 0
mmap(NULL, 15167976, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac98a01000
mprotect(0x7fac99574000, 2097152, PROT_NONE) = 0
mmap(0x7fac99774000, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb73000) = 0x7fac99774000
mmap(0x7fac99838000, 262632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fac99838000
close(3)                                = 0
open("/usr/java/jdk1.7.0_75/bin/../lib/amd64/jli/libm.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=11922, ...}) = 0
mmap(NULL, 11922, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fac9a48e000
close(3)                                = 0
open("/lib64/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1141552, ...}) = 0
mmap(NULL, 3150168, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fac986ff000
mprotect(0x7fac98800000, 2093056, PROT_NONE) = 0
mmap(0x7fac989ff000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x100000) = 0x7fac989ff000
close(3)                                = 0
mprotect(0x7fac989ff000, 4096, PROT_READ) = 0
munmap(0x7fac9a48e000, 11922)           = 0
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fac9a389000
mprotect(0x7fac9a389000, 4096, PROT_NONE) = 0
clone(child_stack=0x7fac9a488fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fac9a4899d0, tls=0x7fac9a489700, child_tidptr=0x7fac9a4899d0) = 142
futex(0x7fac9a4899d0, FUTEX_WAIT, 142, NULLhellowerld
) = 0
exit_group(0)                           = ?

the process is now hung but you can enter the container:

docker exec -it myContainer bash

and see the following:

ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 23:47 ?        00:00:00 bash
root       138     1  0 23:51 ?        00:00:00 strace java Foo
root       141   138 24 23:51 ?        00:01:21 [java] <defunct>
root       151     0  1 23:57 ?        00:00:00 bash
root       167   151  0 23:57 ?        00:00:00 ps -ef

quick look at stats:

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O               BLOCK I/O
myContainer                  24.72%              64.18 MB / 8.365 GB   0.77%               11.09 MB / 202.6 kB   8.192 kB / 14.99

Everything works fine in 1.8.3.

lerox commented

+1, Docker version 1.9.1, build a34a1d5, OS X

+1, Docker version 1.9.1, build a34a1d5, OS X 10.10.5, Docker Machine Version: 0.5.1 (HEAD)

+1

Docker version 1.9.1, build a34a1d5, OS X 10.11.1 (15B42)

+1

Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

This issue really is quite bizarre. If I strace the failing apt-get command, the end of the output is:

stat("/etc/apt/sources.list", {st_mode=S_IFREG|0644, st_size=161, ...}) = 0
open("/etc/apt/sources.list", O_RDONLY) = 5
read(5, "deb http://httpredir.debian.org/"..., 8191) = 161
pipe([6, 7])                            = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc6fc88aa10) = 14
close(7)                                = 0
fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)
fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc6fc892000
lseek(6, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
read(6, Process 14 attached
 <unfinished ...>
[pid    14] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, 8) = 0
[pid    14] rt_sigaction(SIGQUIT, {SIG_DFL, [QUIT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGWINCH, {SIG_DFL, [WINCH], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {0x7fc6fc0e5750, [WINCH], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, 8) = 0
[pid    14] rt_sigaction(SIGCONT, {SIG_DFL, [CONT], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] rt_sigaction(SIGTSTP, {SIG_DFL, [TSTP], SA_RESTORER|SA_RESTART, 0x7fc6fb531180}, {SIG_DFL, [], 0}, 8) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(3, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(4, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(5, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(6, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(7, F_SETFD, FD_CLOEXEC) = 0
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(8, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(9, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(10, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)
[pid    14] getrlimit(RLIMIT_NOFILE, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
[pid    14] fcntl(11, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)

Where those (Bad file descriptor) errors continue to loop indefinitely.

RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor
              number that can be opened by this process.  Attempts (open(2),
              pipe(2), dup(2), etc.)  to exceed this limit yield the error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on
              BSD.)

SIGPIPE is failing? this might correspond to my previous post where I saw java "hello world" causing zombie processes without an explicit "System.exit(0);" -- or maybe thats a completely different issue. if so sorry for the noise.

what happens to your cpu while looping indefinitely?

java "hello world" causing zombie processes without an explicit "System.exit(0);"

That certainly sounds similar to the problem encountered here.

I can definitely confirm the b2d issue (even did the bisect to track it most positively to the 4.1.13 kernel bump). I can also reproduce on 4.2.6 with b2d.

As an additional kink, my Gentoo host is currently on 4.1.13 + AUFS patches also, and I'm seeing the same exact problem, so we've definitely ruled out anything b2d-specific.

I think it might be worth trawling through commits between 4.1.12 and 4.1.13 to see if anything that might be related jumps out.

(ie, https://www.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.1.13)

Yup, something breaks from kernel 4.1.12 => 4.1.13. I can confirm that baking a boot2docker ISO for the former doesn't trip this bug but the former does.

So, it's not specifically related to boot2docker, but seems to be related to the kernel version interacting with AUFS.

i've got a hair brained theory...

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.1.13&id=6c0da28df5dac10672efe955eb89051a850008eb

the commit above makes a change to filemap.c to generic_perform_write(struct file *file, struct iov_iter *i, loff_t pos)

below is the chunk of code i personally want to test because the comment describes both deadlock and livelock race conditions and i see the cpu pegged at 100%. but thats just me and my jump-to-conclusions mat.

4.1.13 mm/filemap.c#l_2448

...
 2448 again:
 2449       /*
 2450        * Bring in the user page that we will copy from _first_.
 2451        * Otherwise there's a nasty deadlock on copying from the
 2452        * same page as we're writing to, without it being marked
 2453        * up-to-date.
 2454        *
 2455        * Not only is this an optimisation, but it is also required
 2456        * to check that the address is actually valid, when atomic
 2457        * usercopies are used, below.
 2458        */
 2459       if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
 2460           status = -EFAULT;
 2461           break;
 2462       }
 2463 
 2464       if (fatal_signal_pending(current)) {
 2465           status = -EINTR;
 2466           break;
 2467       }
 2468 
 2469       status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 2470                       &page, &fsdata);
 2471       if (unlikely(status < 0))
 2472           break;
 2473 
 2474       if (mapping_writably_mapped(mapping))
 2475           flush_dcache_page(page);
 2476 
 2477       copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 2478       flush_dcache_page(page);
 2479 
 2480       status = a_ops->write_end(file, mapping, pos, bytes, copied,
 2481                       page, fsdata);
 2482       if (unlikely(status < 0))
 2483           break;
 2484       copied = status;
 2485 
 2486       cond_resched();
 2487 
 2488       iov_iter_advance(i, copied);
 2489       if (unlikely(copied == 0)) {
 2490           /*
 2491            * If we were unable to copy any data at all, we must
 2492            * fall back to a single segment length write.
 2493            *
 2494            * If we didn't fallback here, we could livelock
 2495            * because not all segments in the iov can be copied at
 2496            * once without a pagefault.
 2497            */
 2498           bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
 2499                       iov_iter_single_seg_count(i));
 2500           goto again;
 2501       }
 2502       pos += copied;
 2503       written += copied;
 2504 
 2505       balance_dirty_pages_ratelimited(mapping);
 2506   } while (iov_iter_count(i));
bean5 commented

@andrewgdavis one could use that commit during git bisect as a specific testing point!

Seeing a similar hang when shutting down mongodb. Definitely present in 1.9.x. Not present in 1.8.x.

I've been able to solve this issue for myself by increasing the docker-machine VM's memory from 1024 to 2048 MB and assigning 2 CPUs instead of 1.

Works:

VM: Ubuntu 14.04 (2gb ram)
Docker Engine: 1.9.1
Docker base image: ubuntu:latest

Does not work:

VM: Ubuntu 15.10 (2 gb ram)
Docker Engine: 1.9.1,1.9.0,1.8.3
Docker base image: ubuntu:latest, ubuntu:14.04

@marsinvasion If possible, can you print the output of uname -a on both tested systems?

VM: Ubuntu 14.04
Linux ubuntu 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

VM: Ubuntu 15.10
Linux ubuntu 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

+1
Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

Encountered on OS X 10.9.5 with docker 1.9.1.

Inspired by @marsinvasion, I got a successful workaround by giving my docker-machine 2 CPUs and 4096Mb RAM.

Oops, spoke to soon. It stopped working upon changing a Dockerfile I'm working on and re-running build.

Also seeing this hellacious bug (docker-machine boot2docker 1.9.1 on OS X), from a previously building ubuntu:15.04 image. It seems to require restarting my docker server to get those zombie containers to go away.

I thought docker-library/openjdk#19 was related but maybe not, here we're getting a hang, there they got an error about not finding "java".

Switched my server to overlay as a workaround. Before that it created a bunch of zombie containers as well.

Docker version 1.9.1, build a34a1d5 on OS X 10.11.1

Anyone know what's involved in migrating an existing boot2docker.iso system to https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ or is it easier to do a full rebuild? That page has ominous warnings about CentOS image builds -- what are the "yum" workarounds, is it related to #10180?

Definitely not fixed by Docker Toolbox 1.9.1a. Suffering from this bug with that version. Looking back through the comments, it looks like I'm not the only one.

Snapu commented

nope still not building

I had to delete the VM in virtualbox and start from scratch for it to work.

Also, tried deleting and creating a new VM several times to no avail.

Installed 1.9.1a, did docker-machine rm default and used Docker Quickstart Terminal to regenerate default machine. Rebuilt images (that derive from java:7-jre) and ran, still does not work. Continues to work just fine with overlay machine built as suggested above:

$ docker-machine create -d virtualbox --engine-storage-driver overlay overlay
Snapu commented

^thanks! I can confirm the overlay machine is working.

Using overlay as the engine storage driver also worked for fixing the MongoDB shutdown hang.

You can workaround the Dockerfile build failure by installing Oracle java instead of OpenJDK:

# Oracle java is bulkier but avoids boot2docker/aufs docker issue 18180
RUN apt-get install -y software-properties-common python-software-properties && add-apt-repository -y ppa:webupd8team/java && apt-get update
RUN echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections
RUN apt-get install -y oracle-java8-installer && apt-get install -y oracle-java8-set-default

But I was underestimating the scope of the problem, boot2docker 1.9.1 leads to zombie java processes, even on CentOS containers where openjdk installs fine.
root 322 11.1 0.0 0 0 ? Zsl 18:43 29:48 [java] <defunct>

I'm unable to configure my docker server with --engine-storage-driver overlay because I build CentOS-based images, and overlayfs is not compatible with yum (#10180).

I'm sure Docker folks would not recommend this, but the way I moved past this blocking issue is by building a boot2docker.iso that uses docker 1.9.1 with a slightly older AUFS. Instructions in boot2docker/boot2docker#1099 (comment).

tried oracle jdk1.7.0_75 and jdk1.8.0_65; both hang and create a defunct java process.

FROM : #10589
@neverfox exactly the same problem here, with the same image +1

~ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.5.1
 Git commit:   a34a1d5
 Built:        Sat Nov 21 00:49:19 UTC 2015
 OS/Arch:      darwin/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.3
 Git commit:   a34a1d5
 Built:        Fri Nov 20 17:56:04 UTC 2015
 OS/Arch:      linux/amd64


~ docker-machine inspect default
{
    "ConfigVersion": 3,
    "Driver": {
        "Driver": {
            "VBoxManager": {},
            "IPAddress": "192.168.99.100",
            "MachineName": "default",
            "SSHUser": "docker",
            "SSHPort": 61012,
            "SSHKeyPath": "/Users/myuser/.docker/machine/machines/default/id_rsa",
            "StorePath": "/Users/myuser/.docker/machine",
            "SwarmMaster": false,
            "SwarmHost": "tcp://0.0.0.0:3376",
            "SwarmDiscovery": "",
            "CPU": 1,
            "Memory": 4096,
            "DiskSize": 20000,
            "Boot2DockerURL": "",
            "Boot2DockerImportVM": "",
            "HostOnlyCIDR": "192.168.99.1/24",
            "HostOnlyNicType": "82540EM",
            "HostOnlyPromiscMode": "deny",
            "NoShare": false
        },
        "Locker": {}
    },
    "DriverName": "virtualbox",
    "HostOptions": {
        "Driver": "",
        "Memory": 0,
        "Disk": 0,
        "EngineOptions": {
            "ArbitraryFlags": [],
            "Dns": null,
            "GraphDir": "",
            "Env": [],
            "Ipv6": false,
            "InsecureRegistry": [],
            "Labels": [],
            "LogLevel": "",
            "StorageDriver": "",
            "SelinuxEnabled": false,
            "TlsVerify": true,
            "RegistryMirror": [],
            "InstallURL": "https://get.docker.com"
        },
        "SwarmOptions": {
            "IsSwarm": false,
            "Address": "",
            "Discovery": "",
            "Master": false,
            "Host": "tcp://0.0.0.0:3376",
            "Image": "swarm:latest",
            "Strategy": "spread",
            "Heartbeat": 0,
            "Overcommit": 0,
            "ArbitraryFlags": [],
            "Env": null
        },
        "AuthOptions": {
            "CertDir": "/Users/myuser/.docker/machine/certs",
            "CaCertPath": "/Users/myuser/.docker/machine/certs/ca.pem",
            "CaPrivateKeyPath": "/Users/myuser/.docker/machine/certs/ca-key.pem",
            "CaCertRemotePath": "",
            "ServerCertPath": "/Users/myuser/.docker/machine/machines/default/server.pem",
            "ServerKeyPath": "/Users/myuser/.docker/machine/machines/default/server-key.pem",
            "ClientKeyPath": "/Users/myuser/.docker/machine/certs/key.pem",
            "ServerCertRemotePath": "",
            "ServerKeyRemotePath": "",
            "ClientCertPath": "/Users/myuser/.docker/machine/certs/cert.pem",
            "StorePath": "/Users/myuser/.docker/machine/machines/default"
        }
    },
    "Name": "default",
    "RawDriver": "eyJWQm94TWFuYWdlciI6e30sIklQQWRkcmVzcyI6IjE5Mi4xNjguOTkuMTAwIiwiTWFjaGluZU5hbWUiOiJkZWZhdWx0IiwiU1NIVXNlciI6ImRvY2tlciIsIlNTSFBvcnQiOjYxMDEyLCJTU0hLZXlQYXRoIjoiL1VzZXJzL2RhdmlkZnJhbmNvZXVyLy5kb2NrZXIvbWFjaGluZS9tYWNoaW5lcy9kZWZhdWx0L2lkX3JzYSIsIlN0b3JlUGF0aCI6Ii9Vc2Vycy9kYXZpZGZyYW5jb2V1ci8uZG9ja2VyL21hY2hpbmUiLCJTd2FybU1hc3RlciI6ZmFsc2UsIlN3YXJtSG9zdCI6InRjcDovLzAuMC4wLjA6MzM3NiIsIlN3YXJtRGlzY292ZXJ5IjoiIiwiQ1BVIjoxLCJNZW1vcnkiOjQwOTYsIkRpc2tTaXplIjoyMDAwMCwiQm9vdDJEb2NrZXJVUkwiOiIiLCJCb290MkRvY2tlckltcG9ydFZNIjoiIiwiSG9zdE9ubHlDSURSIjoiMTkyLjE2OC45OS4xLzI0IiwiSG9zdE9ubHlOaWNUeXBlIjoiODI1NDBFTSIsIkhvc3RPbmx5UHJvbWlzY01vZGUiOiJkZW55IiwiTm9TaGFyZSI6ZmFsc2V9"
}
โžœ  ~  docker inspect 74
[
{
    "Id": "7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04",
    "Created": "2015-11-27T13:23:11.515987776Z",
    "Path": "/docker-entrypoint.sh",
    "Args": [
        "cassandra",
        "-f"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 1263,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2015-11-27T13:23:11.612899257Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "338a92b912e4d5a84c4f399a9475a1476f8226eff85c2592c8e80ba58b13d225",
    "ResolvConfPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/resolv.conf",
    "HostnamePath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/hostname",
    "HostsPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/hosts",
    "LogPath": "/mnt/sda1/var/lib/docker/containers/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04/7471b734d7e7e47270511453a04d903c974cba77a2a0d259255355a653f95e04-json.log",
    "Name": "/pensive_kalam",
    "RestartCount": 0,
    "Driver": "aufs",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": null,
        "ContainerIDFile": "",
        "LxcConf": [],
        "Memory": 0,
        "MemoryReservation": 0,
        "MemorySwap": 0,
        "KernelMemory": 0,
        "CpuShares": 0,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": -1,
        "Privileged": false,
        "PortBindings": {},
        "Links": null,
        "PublishAllPorts": false,
        "Dns": [],
        "DnsOptions": [],
        "DnsSearch": [],
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": [],
        "NetworkMode": "default",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "no",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "aufs",
        "Data": null
    },
    "Mounts": [
        {
            "Name": "2249b03f9a598e5ac3f306983877292baa299c4499c9db77eb9bfcb88fd2f541",
            "Source": "/mnt/sda1/var/lib/docker/volumes/2249b03f9a598e5ac3f306983877292baa299c4499c9db77eb9bfcb88fd2f541/_data",
            "Destination": "/var/lib/cassandra",
            "Driver": "local",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "7471b734d7e7",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": true,
        "AttachStderr": true,
        "ExposedPorts": {
            "7000/tcp": {},
            "7001/tcp": {},
            "7199/tcp": {},
            "9042/tcp": {},
            "9160/tcp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "CASSANDRA_VERSION=2.1.11",
            "CASSANDRA_CONFIG=/etc/cassandra"
        ],
        "Cmd": [
            "cassandra",
            "-f"
        ],
        "Image": "cassandra:2.1.11",
        "Volumes": {
            "/var/lib/cassandra": {}
        },
        "WorkingDir": "",
        "Entrypoint": [
            "/docker-entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {},
        "StopSignal": "SIGTERM"
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "e2f074e4b10e67cd7ac22d6e73d50304fc3f0a68d67c7fee6d7f8d647c9eb9b1",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {
            "7000/tcp": null,
            "7001/tcp": null,
            "7199/tcp": null,
            "9042/tcp": null,
            "9160/tcp": null
        },
        "SandboxKey": "/var/run/docker/netns/e2f074e4b10e",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "63596aa5ec20516d477921fec4197d086b4dd4f1ad25014b5ddf027b82891966",
        "Gateway": "172.17.0.1",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "172.17.0.2",
        "IPPrefixLen": 16,
        "IPv6Gateway": "",
        "MacAddress": "02:42:ac:11:00:02",
        "Networks": {
            "bridge": {
                "EndpointID": "63596aa5ec20516d477921fec4197d086b4dd4f1ad25014b5ddf027b82891966",
                "Gateway": "172.17.0.1",
                "IPAddress": "172.17.0.2",
                "IPPrefixLen": 16,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": "02:42:ac:11:00:02"
            }
        }
    }
}
]

I simply ran docker run -it cassandra:2.1.11 and your terminal will be stuck, no way to stop the container. You have to stop the whole VM.

+1

Was able to duplicate issue earlier today on Docker 1.9.1 running Mac OSX 10.11.1 (15B42)

Was able to get around it by installing Docker 1.9.0

_Apologies for lack of information was on my work machine earlier during the day - will provide updated information at later time_

๐Ÿ‘

Same here with Docker 1.9.1 and OS X 10.11.

For people having this issue

We've so far narrowed this down to not being a docker bug but a kernel issue in combination with AUFS in the kernel that is used by the current boot2docker version; see #18180 (comment)

  • If you want to stay informed on progress, use the subscribe button on this page. do not comment if you don't have new information that may help to resolve this issue.
  • if you want to help resolving this, performing a git-bissect of the kernel may help #18180 (comment)
  • remember that each comment will send out more than 2000 e-mails to subscribers, and countless puppies will die ๐Ÿ˜„

Just tested Storage Driver: devicemapper (with Server Version: 1.9.1 and kernel 4.2.6), and the bug does not reproduce, so we're still in "strange interaction between some change in the newer kernel and the AUFS patches" land. ๐Ÿ˜ž

Tested, and bug is still present on the fresh 4.1.14 kernel, so we're still sitting on some commit that was backported to 4.1.13 interacting weird with the AUFS patches (and didn't get lucky with it being already fixed in the interim).

I decided to give it the old college try and cloned the boot2docker repo; then modified the aufs commit in the dockerfile to the previous version. So docker 1.9.1 kernel 4.1.13 + previous AUFS version that was shipped before 1.9.1. Compilation is slow on my machine ... is there a docker swarm setup that I can run in conjunction with a git bisect and aggregate the results? that would be sweet.

any way, I will post my results shortly if it works...

update:
4.1.13 + this AUFS commit still exhibit the problem.
ENV AUFS_COMMIT 1724fe65683d126a92c6baeea0b3c7d0306c63ef

I'm not aware of any easy setup to aggregate the results, although one could conceivably be built.

FWIW, https://sources.debian.net/src/ca-certificates-java/jessie/debian/postinst.in/ is the exact script that's running in that package, and https://sources.debian.net/src/ca-certificates-java/jessie/src/main/java/org/debian/security/UpdateCertificates.java/ is the exact Java source that's being executed when we get the hang + defunct + pegged CPU.

Got into related issue (java process hangs) today.

Host environment: Linux lenovo 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Distro: Ubuntu 15.10
Docker Engine: 1.9.1
Docker Machine: 0.5.0 (04cfa58)

I am following the network multi-host tutorial. Only difference is that I am playing with the oracle/nosql image. That image is based on Oracle Linux and uses OpenJDK.

@brunoborges yes, that could be the same issue, see #18500 (comment)

@brunoborges just check your boot2docker.iso version โ€“ if it 1.9.1 you could try downgrade to 1.9.0 and recreate your machine and pull images once again.
If you go this way, could you write a short report here?

So i got to wondering why this only happens on java, and not any other languages. In one of my previous posts I was able to detail the most basic of reproductions by simply compiling and running

class Foo {
public static void main(String[] a) {
  System.out.println("hellowerld");
  }
}

for the failure case which resulted in defunct java process
and then

class Foo {
public static void main(String[] a) {
  System.out.println("hellowerld");
  System.exit(0);
  }
}

for the expected (non defunct) case.

I then tried to reproduce something similar using python. I was unsuccessful-- but I tried. For those interested I was trying to exhibit the last strace output exit_group(0) = ? that was seen from the zombie java process. (This link provided me with a lot of info about python threading / seccomp / etc http://stackoverflow.com/questions/25678139/how-do-you-cleanly-exit-after-enabling-seccomp-in-python )

So off to kernel land: After rebuilding the boot2docker iso, messing with the aufs verions and kernel versions (nothing of which really made a difference) I got fed up with how slow the compilation process was using numproc=1. So I changed it to 6. ==> note no longer 1 cpu (who only has 1 cpu now a days?). Suddenly the failure case

class Foo {
public static void main(String[] a) {
  System.out.println("hellowerld");
  }
}

started working.

Obviously, the next thing to try was to bump it back down to 1 cpu. ==> FAIL. back to a defunct java process.

So then I wanted to explore more about how java shuts down. It's not well defined. but with only 1 cpu this java process was able to be run successfully: (please don't make fun of my horrible java.)

import java.util.Iterator;
import java.util.Set;

class Foo {

static public final Object a = new Object();
static {
  final Object aa = a;
  Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
                System.out.println("added one");
                if (aa == null)
                        { System.out.println("out"); }
        }
  });
  System.out.println("exit");
  Set<Thread> threadSet = Thread.getAllStackTraces().keySet();

  Thread[] threadArray = threadSet.toArray(new Thread[threadSet.size()]);
  for(Thread xxx : threadArray)
  {
    System.out.println(xxx.toString());
  }
////  System.exit(0);
}
static public void main(String[] a) {}

Can anyone else please confirm this behavior? << question is now moot

Update: Even with more than one core, a defunct java process can occur. (I was running cassandra-cli and it happened.)

docker-machine ssh myVM

ps -ef:
docker    6606  5863  0 Dec11 ?        00:00:00 /bin/sh /cassandra/bin/cassandra-cli -f /home/foo/my.cli -h 172.17.0.2
docker    6651  6606 99 Dec11 ?        00:41:29 [java] <defunct>
cat /proc/6606/stack
[<ffffffff8106e491>] do_wait+0x1ab/0x23f
[<ffffffff8106e5bc>] SYSC_wait4+0x97/0xb0
[<ffffffff8106d66b>] child_wait_callback+0x0/0x43
[<ffffffff8155466e>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff

cat /proc/6651/stack
[<ffffffff8106f06c>] do_exit+0x88f/0x8cc
[<ffffffff81075f8d>] signal_wake_up_state+0x23/0x36
[<ffffffff8106f104>] do_group_exit+0x36/0xa6
[<ffffffff8106f180>] __wake_up_parent+0x0/0x1d
[<ffffffff8155466e>] system_call_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff