alertlogic/erllambda

Runtime issues: pthread and awk error writing to stdout

rmpalomino opened this issue · 1 comments

I've noticed pthread errors off and on for a long time, but they haven't been a big problem. In the past week I've started to see a new failure scenario where, according to the logged error, the Lambda temporary disk space appears to have filled up.

I'm not sure what to make of it, but it causes ~2 hours of failed processing every time it happens. In the last occurrence, the errors were limited to a single log group of a single Lambda function that has two Kinesis event sources. I assume that the ~2 hours is the lifetime of the failed container, since logs eventually stop completely for that log group.

My first thought was that the pthread error might be causing a crash dump to be stored despite the bootstrap script trying to disable them, and that enough of these dumps eventually exhaust the disk space. I thought this because, in these occurrences, the execution just before the first awk error has consistently been the pthread error.

I tried to bump the configured memory for the Lambda function, despite the error mentioning disk space instead of memory, and that didn't appear to help at all.

Erlang runtime details:
OTP version: 21.3.8.4
OpenSSL version: 1.0.2k-fips
erllambda version: 2.1.3

An example of the error is below. The sequence of executions can be summarized as:

  1. Execution success
  2. pthread error
  3. awk error
    ... awk error repeats for all executions until log stream ends...
Invoke Success path 1571007997408 http://127.0.0.1:9001/2018-06-01/runtime/invocation/ffcddaa8-63e1-4a12-b6e7-5b9cbd3dac39/response
Invoke Next path 1571007997410 http://127.0.0.1:9001/2018-06-01/runtime/invocation/next
END RequestId: ffcddaa8-63e1-4a12-b6e7-5b9cbd3dac39
REPORT RequestId: ffcddaa8-63e1-4a12-b6e7-5b9cbd3dac39	Duration: 134.89 ms	Billed Duration: 200 ms	Memory Size: 768 MB	Max Memory Used: 386 MB	

START RequestId: cec77f66-f012-4052-8c31-afebbcc079ae Version: $LATEST
pthread/ethr_event.c:164: Fatal error in wait__(): Operation not permitted (1)
END RequestId: cec77f66-f012-4052-8c31-afebbcc079ae
REPORT RequestId: cec77f66-f012-4052-8c31-afebbcc079ae	Duration: 664.46 ms	Billed Duration: 700 ms	Memory Size: 768 MB	Max Memory Used: 386 MB	
RequestId: cec77f66-f012-4052-8c31-afebbcc079ae Error: Runtime exited with error: signal: aborted (core dumped)
Runtime.ExitError
creating necessary erllambda run dirs
OpenSSL is OpenSSL 1.0.2k-fips 26 Jan 2017
starting ErlangVM
awk: cmd. line:5: (FILENAME=- FNR=7) warning: error writing standard output (No space left on device)

START RequestId: cec77f66-f012-4052-8c31-afebbcc079ae Version: $LATEST
creating necessary erllambda run dirs
OpenSSL is OpenSSL 1.0.2k-fips 26 Jan 2017
starting ErlangVM
awk: cmd. line:5: (FILENAME=- FNR=7) warning: error writing standard output (No space left on device)
END RequestId: cec77f66-f012-4052-8c31-afebbcc079ae
REPORT RequestId: cec77f66-f012-4052-8c31-afebbcc079ae	Duration: 86.19 ms	Billed Duration: 100 ms	Memory Size: 768 MB	Max Memory Used: 12 MB	
RequestId: cec77f66-f012-4052-8c31-afebbcc079ae Error: Runtime exited with error: exit status 1
Runtime.ExitError

START RequestId: b8c66eb1-e0ea-4967-a1a0-5530e002f343 Version: $LATEST
creating necessary erllambda run dirs
OpenSSL is OpenSSL 1.0.2k-fips 26 Jan 2017
starting ErlangVM
awk: cmd. line:5: (FILENAME=- FNR=7) warning: error writing standard output (No space left on device)
END RequestId: b8c66eb1-e0ea-4967-a1a0-5530e002f343
REPORT RequestId: b8c66eb1-e0ea-4967-a1a0-5530e002f343	Duration: 64.65 ms	Billed Duration: 100 ms	Memory Size: 768 MB	Max Memory Used: 15 MB	
RequestId: b8c66eb1-e0ea-4967-a1a0-5530e002f343 Error: Runtime exited with error: exit status 1
Runtime.ExitError

A GitHub search reveals a different pthread-related crash: leo-project/leofs#843 (comment)

Sep 20 18:55:53 bodies-master leo_manager[406]: pthread/ethr_event.c:164: Fatal error in wait__(): Invalid argument (22)

From my initial investigation, it seems like the problem is actually Linux core dumps being generated in /tmp after the pthread error occurs. The /tmp/erllambda_rundir disk usage is not changing, and core.beam.smp.* files are only present on the container that has suffered pthread issues.

ls -lash /tmp
4.0K drwx------ 3 sbx_user1098 448 4.0K Oct 16 03:16 .
0 dr-xr-xr-x 21 root root 285 Sep 24 18:18 ..
44M -rw------- 1 sbx_user1098 448 107M Oct 16 02:58 core.beam.smp.7
44M -rw------- 1 sbx_user1098 448 106M Oct 16 03:16 core.beam.smp.8
42M -rw------- 1 sbx_user1098 448 104M Oct 16 02:29 core.beam.smp.9
4.0K drwxrwxr-x 6 sbx_user1098 448 4.0K Oct 16 02:28 erllambda_rundir
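
To put a number on how much of /tmp the core files are taking, something like the following could be run inside the container. This is only a rough sketch: the function name core_usage_kb is made up for illustration, and it assumes a find that supports -maxdepth (GNU find does). It sums the on-disk size reported by du, which corresponds to the first column of the ls -lash output above rather than the larger apparent size.

```shell
#!/bin/sh
# Hypothetical helper (not part of erllambda): print the total KiB of disk
# used by core.* files that sit directly inside a directory (default /tmp).
core_usage_kb() {
    dir="${1:-/tmp}"
    # -maxdepth 1 keeps the search out of subdirectories such as erllambda_rundir.
    # du -k reports allocated disk blocks in KiB; awk adds them up and
    # prints 0 when no core files were found.
    find "$dir" -maxdepth 1 -type f -name 'core.*' -exec du -k {} + |
        awk '{ total += $1 } END { print total + 0 }'
}
```

Calling core_usage_kb /tmp before and after a pthread crash should show whether each crash adds a new dump.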

I'm going to see if I can disable them using ulimit -c 0.
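
A minimal sketch of what I have in mind, assuming it would be added to the bootstrap script before the line that starts the Erlang VM. The function names are hypothetical and not part of erllambda, and I have not yet confirmed that the Lambda sandbox honours the limit. The cleanup step is an extra precaution for containers that already have dumps sitting in /tmp.

```shell
#!/bin/sh
# Hypothetical helpers for the bootstrap script (names are illustrative).

# Set the core file size limit to 0 for this shell and every process it
# starts afterwards, so a crashing beam.smp should not write a core file.
disable_core_dumps() {
    ulimit -c 0
}

# Delete core.* files left directly inside a directory (default /tmp) by
# earlier crashes. Assumes a find that supports -maxdepth, so that
# subdirectories such as erllambda_rundir are left alone.
cleanup_cores() {
    dir="${1:-/tmp}"
    find "$dir" -maxdepth 1 -type f -name 'core.*' -exec rm -f {} +
}
```

Both calls would need to happen in the same shell that launches the VM, since ulimit only affects the current shell and its children: disable_core_dumps, then cleanup_cores /tmp, then the existing start-up command.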