adobe/helix-static

error while fetching content Error: getaddrinfo EMFILE raw.githubusercontent.com

Closed this issue · 15 comments

rofe commented

Description
helix-static regularly throws EMFILE (too many open files) errors, as seen in the helix-pages application in Coralogix. I think we should try to understand why they are happening and avoid them:

{
  "level": "error",
  "ow": {
    "activationId": "[ID]",
    "actionName": "/helix/helix-services-private/static@1.5.42",
    "transactionId": "[ID]"
  },
  "message": "error while fetching content Error: connect EMFILE raw.githubusercontent.com raw.githubusercontent.com:443",
  "timestamp": "2020-01-19T10:28:25.927313699Z"
}

(I suspect these subsequently cause helix-log to log another EMFILE type error of its own)
See https://helix.coralogix.com/#/query/logs?id=2604zpGWgVh for samples.

Expected behavior
Ideally no such errors should be thrown.

Each invocation can consume up to three connections:

  • one to GitHub
  • one to Coralogix
  • one to Epsagon

If one of them is leaking sockets, we can easily exhaust the available file handles: with a maximum parallelism of 200, we could have up to 200 × n outgoing connections.

After adding some status logging, here's a sequence of invocations that looks weird:

$ wsk activation logs b10d632e4fa74a708d632e4fa71a707c
2020-01-20T07:25:39.737Z       stderr: action-status { status: { numInvocations: 115, memInfo: 137277440, age: 1772006, numFileHandles: 21, uuid: 'db8f71d4c5dad81aff1b5ef9d50a1b11', concurrency: 1 } }
2020-01-20T07:25:39.738Z       stderr: instrumenting epsagon.
2020-01-20T07:25:39.743Z       stderr: deliverStatic with adobe/ferrumjsorg/54d751f37633fa777ce0816390b3bdbe515d0295 path=/index.html file=/index.html allow=undefined deny=undefined root= esi=false
2020-01-20T07:25:39.744Z       stderr: deliverPlain: url=https://raw.githubusercontent.com/adobe/ferrumjsorg/54d751f37633fa777ce0816390b3bdbe515d0295//index.html
2020-01-20T07:25:39.787Z       stderr: got response. size=37438, type=text/html
2020-01-20T07:25:39.789Z       stderr: delivering file /index.html type text/html binary: false
$ wsk activation logs 8602bfe05d58416082bfe05d58c160fa
2020-01-20T07:25:39.749Z       stderr: action-status { status: { numInvocations: 116, memInfo: 137158656, age: 1772019, numFileHandles: 25, uuid: 'db8f71d4c5dad81aff1b5ef9d50a1b11', concurrency: 2 } }
2020-01-20T07:25:39.751Z       stderr: instrumenting epsagon.
2020-01-20T07:25:39.755Z       stderr: deliverStatic with adobe/helix-pages/600562e9639232a54c0e9e587ad3cd1f903c6df8 path=/index.html file=/index.html allow=undefined deny=undefined root=/htdocs esi=false
2020-01-20T07:25:39.756Z       stderr: deliverPlain: url=https://raw.githubusercontent.com/adobe/helix-pages/600562e9639232a54c0e9e587ad3cd1f903c6df8/htdocs/index.html
2020-01-20T07:25:39.799Z       stderr: delivering error /index.html 404
$ wsk activation logs bfbe7d778bc24878be7d778bc2687885
2020-01-20T07:25:39.758Z       stderr: action-status { status: { numInvocations: 117, memInfo: 137797632, age: 1772027, numFileHandles: 25, uuid: 'db8f71d4c5dad81aff1b5ef9d50a1b11', concurrency: 3 } }
2020-01-20T07:25:39.758Z       stderr: instrumenting epsagon.
2020-01-20T07:25:39.763Z       stderr: deliverStatic with adobe/ferrumjsorg/54d751f37633fa777ce0816390b3bdbe515d0295 path=/404.html file=/404.html allow=undefined deny=undefined root= esi=false
2020-01-20T07:25:39.764Z       stderr: deliverPlain: url=https://raw.githubusercontent.com/adobe/ferrumjsorg/54d751f37633fa777ce0816390b3bdbe515d0295//404.html
2020-01-20T07:25:39.784Z       stderr: delivering error /404.html 404
$ wsk activation logs c4a8d8ddf3084720a8d8ddf3086720ea
2020-01-20T07:25:39.769Z       stderr: action-status { status: { numInvocations: 118, memInfo: 138031104, age: 1772039, numFileHandles: 38, uuid: 'db8f71d4c5dad81aff1b5ef9d50a1b11', concurrency: 4 } }
2020-01-20T07:25:39.770Z       stderr: instrumenting epsagon.
2020-01-20T07:25:39.775Z       stderr: deliverStatic with adobe/helix-pages/600562e9639232a54c0e9e587ad3cd1f903c6df8 path=/404.html file=/404.html allow=undefined deny=undefined root=/htdocs esi=false
2020-01-20T07:25:39.776Z       stderr: deliverPlain: url=https://raw.githubusercontent.com/adobe/helix-pages/600562e9639232a54c0e9e587ad3cd1f903c6df8/htdocs/404.html
2020-01-20T07:25:39.795Z       stderr: delivering error /404.html 404
$ wsk activation logs d6372a0387a54fcab72a0387a56fcada
2020-01-20T07:26:09.138Z       stderr: action-status { status: { numInvocations: 119, memInfo: 155811840, age: 1801407, numFileHandles: 280, uuid: 'db8f71d4c5dad81aff1b5ef9d50a1b11', concurrency: 1 } }
2020-01-20T07:26:09.139Z       stderr: instrumenting epsagon.
2020-01-20T07:26:09.147Z       stderr: deliverStatic with adobe/project-helix.io/master path=/styles.css file=/styles.css allow= deny= root=/htdocs esi=false
2020-01-20T07:26:09.148Z       stderr: deliverPlain: url=https://raw.githubusercontent.com/adobe/project-helix.io/master/htdocs/styles.css
2020-01-20T07:26:09.246Z       stderr: got response. size=6246, type=text/css
2020-01-20T07:26:09.247Z       stderr: delivering file htdocs/styles.css type text/css binary: false

Activation #119 suddenly has 280 open file handles.
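
For context, the numFileHandles value in those action-status lines can be obtained on Linux by counting the entries in /proc/self/fd; a minimal sketch of such a status probe (the actual helper in helix-static may collect these numbers differently):

const fs = require('fs');

// Count the open file descriptors of the current process.
// Works on Linux, where /proc/self/fd has one entry per open handle.
function numFileHandles() {
  try {
    return fs.readdirSync('/proc/self/fd').length;
  } catch (e) {
    return -1; // /proc is not available (e.g. on macOS)
  }
}

console.log('action-status', {
  numFileHandles: numFileHandles(),
  memInfo: process.memoryUsage().rss, // resident set size in bytes
});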

In #165, I've added a test/stress.test.js file that will (try to) make 2000 requests over a time span of 16 seconds. Depending on the availability of the CORALOGIX_API_KEY and EPSAGON_TOKEN environment variables, it will also enable tracing and logging.
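
For reference, the core of such a driver could look like the sketch below; the actual test/stress.test.js in #165 may be structured differently, and `fn` here is just a placeholder for the action entry point. Whether CORALOGIX_API_KEY and EPSAGON_TOKEN are passed along then decides if logging and tracing are exercised.

// Fire `total` calls of an async function, spread roughly evenly over
// `spanMs` milliseconds, and wait for all of them to settle.
async function stress(fn, total = 2000, spanMs = 16000) {
  const pending = [];
  for (let i = 0; i < total; i += 1) {
    pending.push(fn(i));
    await new Promise((resolve) => setTimeout(resolve, spanMs / total));
  }
  return Promise.allSettled(pending);
}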

Here are the results I've been seeing when running this inside a Docker container (no ulimits applied)

Environment    | Result                        | Max. number of file handles
---------------|-------------------------------|----------------------------
Plain          | successful execution          | 52
Only Epsagon   | JavaScript heap out of memory | 23
Only Coralogix | aborted due to slowness       | 993
Both           | JavaScript heap out of memory | 994

Based on that, I'd say we have

  1. a connection leak in helix-log
  2. a memory leak in epsagon-node

Let me know if you want me to share the log files.

  1. a connection leak in helix-log

I don't think it's a connection leak…it just doesn't re-use the connection. The problem is that the helix-log coralogix client sends each log entry via a new connection. So if your action has, for example, 5 log entries and you issue 2000 actions, you end up with 10k connections :-)
AFAICS, phin doesn't use keep-alive connections by default:
https://github.com/adobe/helix-log/blob/96fa9498e4b2477fdd16af7e395045472e7441d9/src/coralogix.js#L129
Enabling keep-alive would probably already solve the problem. I don't know if phin supports connection pools; otherwise we should either switch to the official coralogix-logger or use request with a connection pool (see the sketch after the quote below):

pool - an object describing which agents to use for the request. If this option is omitted the request will use the global agent (as long as your options allow for it). Otherwise, request will search the pool for your custom agent. If no custom agent is found, a new agent will be created and added to the pool. Note: pool is used only when the agent option is not specified.
A maxSockets property can also be provided on the pool object to set the max number of sockets for all agents created (ex: pool: {maxSockets: Infinity}).
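
For illustration, the same effect can be had with Node's built-in https module: a keep-alive agent with a socket cap, handed to whichever HTTP client ends up being used. This is a sketch, not the actual helix-log change:

const https = require('https');

// Reuse sockets across log requests instead of opening a new TLS
// connection per message, and cap the number of parallel sockets.
const agent = new https.Agent({
  keepAlive: true,
  keepAliveMsecs: 10000,
  maxSockets: 32,
});

// The agent is then passed in the request options of the HTTP client,
// e.g. https.request({ hostname: 'api.coralogix.com', agent, ... }, cb).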

  1. a memory leak in epsagon-node

epsagon tracks the traces based on the activation_id. If your test doesn't set process.env.__OW_ACTIVATION_ID, it will internally overwrite the tracer map, so I think this is not a good test.
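
If a test drives the action in-process, it could mimic what the OpenWhisk runtime does and give every simulated invocation its own activation id, so epsagon keys its traces per activation instead of overwriting one shared entry. A sketch (withActivationId is a hypothetical test helper):

const crypto = require('crypto');

// Wrap an action so each invocation gets a fresh activation id, like the
// OpenWhisk runtime would provide. Note that environment variables are
// process-global, so this only works for sequential invocations.
function withActivationId(fn) {
  return async (params) => {
    process.env.__OW_ACTIVATION_ID = crypto.randomBytes(16).toString('hex');
    return fn(params);
  };
}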

koraa commented

Let's first isolate the issue, before we decide on what to do.

It currently does open a connection for every request; the reason for this is the difficulty of isolating errors in bulk requests (an error in one message might invalidate the entire batch of messages). Unfortunately, the native coralogix client handles this issue by simply retransmitting the entire batch…indefinitely…which could lead to an endless loop.
Coralogix has promised to solve this issue in the future; they plan to provide error isolation in their API and a syslog-based API…

Now, what should currently happen is that the coralogix logger starts a request: it opens a connection, transmits the data, and closes the connection again.

I've tried to reproduce the issue with the script below; what I am seeing is what I described above: the file descriptors max out at around 60, about 20 more than at very low load…

The strace output also looks pretty much as expected: connections being opened, data being transmitted, connections being closed…

@trieloff Would you mind deploying the stress testing script in runtime to see if CoralogixLogger is causing the errors?

Stress Test Source

const { CoralogixLogger, makeLogMessage } = require('./CoralogixLogger');
const process = require('process');

const cl = new CoralogixLogger(
  "<REDACTED>", "stress-test", "stress-test"
);

// high-resolution wall clock in seconds (hrtime returns [seconds, nanoseconds])
const hitime = () => {
  const [s, ns] = process.hrtime();
  return s + ns / 1e9;
}

const main = async () => {
  const t0 = hitime();
  let idx = 0;
  for (; idx < 20000; idx++) {
    if ((idx % 1000) === 0) console.log("Wrote ", idx);
    cl.log(makeLogMessage({
      message: ["Stress test!"],
    }));
    // yield to the event loop so the pending requests can make progress
    await new Promise((res) => setTimeout(res, 0));
  }
  const t1 = hitime();
  const message = ["Sent ", idx, " messages to coralogix in ", (t1 - t0), " seconds."];
  cl.log(makeLogMessage({ message }));
  console.log(...message);
}

main();

Strace output

write(31, "\27\3\3\1\313LG\3\232\20\233M_q\337\340\24\260\327\350x\2$\222N\243,d\355]\301'"..., 464) = 464
read(35, "\26\3\3\0W\2\0\0S\3\3\214\243\374\233d\203F\300\31x\0265\16O\6#\316\10AB\375"..., 65536) = 143
write(35, "\24\3\3\0\1\1\26\3\3\0(\212\277\215d\247\264P\263\nl\34\222SGY\266\241\1\277h\360"..., 515) = 515
read(32, "\26\3\3\0r\4\0\0n\0\2\243\0\0hSSK-E00438749\0\0\0e"..., 65536) = 170
write(32, "\27\3\3\1\312\4\364\265\25\250~u\225\211\16\251\247\31\3636\244%\326\207*\2000N\373\326\t\0"..., 463) = 463
getsockopt(25, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
write(25, "\26\3\1\1\326\1\0\1\322\3\3\30\304\343\3643{\350{\317\250\367\33y\304\204j\270z\340\204\204"..., 475) = 475
read(36, "\26\3\3\0W\2\0\0S\3\3W\340\333\207\34\300\1777\204\200]\1776\17\377\343d)\336\264["..., 65536) = 143
write(36, "\24\3\3\0\1\1\26\3\3\0(\230\16\333c\251\246m\231F\16\377\220\333f{\304\263/\0\343\373"..., 515) = 515
read(16, "\1\0\0\0\0\0\0\0", 1024)      = 8
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 40
connect(40, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("52.19.211.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
read(16, "\1\0\0\0\0\0\0\0", 1024)      = 8
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 42
connect(42, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("52.214.88.252")}, 16) = -1 EINPROGRESS (Operation now in progress)
read(33, "\26\3\3\0r\4\0\0n\0\2\243\0\0hSSK-E00438749\0\0\0\202"..., 65536) = 170
mprotect(0x1b0ee8282000, 249856, PROT_READ|PROT_WRITE) = 0
mprotect(0x1b0ee8282000, 249856, PROT_READ|PROT_EXEC) = 0
madvise(0x7f01a4108000, 188416, MADV_DONTNEED) = 0
write(33, "\27\3\3\1\312zp4\344<\305fVO\366\302\32\255\317'\35\n\33'\277\351\367\215\272\371\353f"..., 463) = 463
read(28, "\27\3\3\1\30\0\0\0\0\0\0\0\1B\204\361a\33\364\n\211\354\324\374b\f\315.\304\33m\211"..., 65536) = 316
write(28, "\25\3\3\0\32\35n\3\226B=\315\266]\331)\312\340\v`\205\f\271\247\353\257;\346\t\267A", 31) = 31
close(28)                               = 0
read(37, "\26\3\3\0W\2\0\0S\3\3\244\353\31]+\f]\345\252\35\nl\360\206I\250\3570\t;\370"..., 65536) = 143
write(37, "\24\3\3\0\1\1\26\3\3\0(L\336\216|K\300\205\344\360\205\251\261=\266\202\37\377\7g#u"..., 514) = 514
getsockopt(24, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
write(24, "\26\3\1\1\326\1\0\1\322\3\3\256#&Z\204\343_~(\202\16\221\275\340\302\32;\372\3230F"..., 475) = 475
read(29, "\27\3\3\1\30\0\0\0\0\0\0\0\1\375}\3\202\336 \221\345r\305\356\317YKy\312\342$m"..., 65536) = 316
write(29, "\25\3\3\0\32\257M\"\232-G(\263\200\n\243\315\373:\4//c?\267\312X\367g\37{", 31) = 31
close(29)                               = 0
read(41, "\26\3\3\0r\4\0\0n\0\2\243\0\0hSSK-E00438742\0\0\0\245"..., 65536) = 170
write(41, "\27\3\3\1\313M\f\275\216!\310\246\302\233\320\2267>Q=\334!\261\306M\307\336\306\201\253\242\242"..., 464) = 464
getsockopt(39, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
write(39, "\26\3\1\1\326\1\0\1\322\3\3Z\177t\312P\206\341\265\224\351$^\346?0\277\214\271\202\247\355"..., 475) = 475
read(38, "\26\3\3\0W\2\0\0S\3\3\252}\205\306\377jd\313'\346\272\276\326\24B\33\273\6\370\32\1"..., 65536) = 98
read(38, "\26\3\3\0(\0\0\0\0\0\0\0\0\220\36\f]\217\303w\224\fQ._\267<\320\303\244\310q"..., 65536) = 45
write(38, "\24\3\3\0\1\1\26\3\3\0(\347\351\327\336G\243oBW\343\234\333%%\204\242G\373\304\341k"..., 514) = 514
read(16, "\1\0\0\0\0\0\0\0", 1024)      = 8
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 28
connect(28, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("52.214.88.252")}, 16) = -1 EINPROGRESS (Operation now in progress)
read(16, "\1\0\0\0\0\0\0\0", 1024)      = 8
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 29
connect(29, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("52.19.211.175")}, 16) = -1 EINPROGRESS (Operation now in progress)
read(23, "\26\3\3\0W\2\0\0S\3\3\351\31\300l\355\343I\364\3$\30uq\214\0\236\17C\5\204\352"..., 65536) = 143
write(23, "\24\3\3\0\1\1\26\3\3\0(a\267\21*pA\307\343\5Y\211\267\320UV\270P\267\22\355\335"..., 515) = 515
getsockopt(26, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
write(26, "\26\3\1\1\326\1\0\1\322\3\3V\32\322\342n$\336\2604$\243r\211\261r\347\0(h\\\21"..., 475) = 475
read(30, "\27\3\3\1\30\0\0\0\0\0\0\0\1:v^uLZ\320F\271\t!\243;\247\211Lm\363\272"..., 65536) = 316
write(30, "\25\3\3\0\0321\215\342\276\271\345-VF8\300*\351\"\v\362{\356\377r\274\233\265\235\10u", 31) = 31
close(30)                               = 0
getsockopt(27, SOL_SOCKET, SO_ERROR, [0], [4]) = 0

koraa commented

@tripodsan Phin supports connection pooling; it's just a thin wrapper around the Node HTTP API, which supports the feature. But let's establish a baseline before we try that…

koraa commented

Could we also check whether the log levels are really well adjusted? I do think "instrumenting epsagon" might be better at trace log level…

koraa commented

Hmm, interesting. So I have been able to trigger opening a lot of file handles with coralogix…by CPU hogging. I simply send 20k log messages while blocking all other coroutines, so all the requests have to be started in parallel, which gets me to just under 1000 file handles…but helix-static is not hogging CPU, is it? The only other scenario I could think of would be extremely high latency, so that requests take a long time, but that shouldn't happen either, should it?

Could we also check whether the log levels are really well adjusted? I do think "instrumenting epsagon" might be better at trace log level…

Well, I think whatever the log level is, the connections should not linger too long. It should be possible to log 100 log entries per action :-)

I think using keep-alive would already solve the problem.

I simply send 20k log messages while blocking all other coroutines, so all the requests have to be started in parallel, which gets me to just under 1000 file handles

In reality, we're talking about 4 concurrent requests every minute, logging maybe 4 log messages (1k of data, maybe)…and still the file handles get lost over time…

see https://helix.coralogix.com/app/kibana#/discover/078e9700-3b4c-11ea-aeea-873cfee05359?_g=(filters%3A!())

I did some more tests with:

  • a tweaked helix-log coralogix logger that limits the number of sockets to 32
  • epsagon configured to exclude api.coralogix.com

In the chart below, the continuous lines represent the memory consumption of individual action instances (right y-axis); the stepped lines represent the number of open file handles (left y-axis).

Around 03:00 I redeployed helix-static (1.7.2). Since then, the memory line of the red action has grown continuously, and there are spikes in the number of open file handles. Other actions, like the green one, show the same behaviour.

Around 04:00 I redeployed helix-static (1.7.3), this time with epsagon completely disabled. The 1.7.2 actions are still used for a while, but around 04:15 they die, and from then on only 1.7.3 actions are created.

It seems that the actions with epsagon disabled don't show memory leaks or file handle spikes.

[chart: memory consumption per action instance (continuous lines, right y-axis) and number of open file handles (stepped lines, left y-axis) over time]

Theory

My theory is that the epsagon client collects the traces and tries to send them to its server periodically. It does that by sending each trace individually, or at least in chunks. Due to a problem (not yet explained), the client thinks that transmitting the traces failed. With the next retry, the same traces plus the new ones are transmitted again. This could explain the growing memory consumption and the spikes in open file handles.
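
Purely as an illustration of that theory (not epsagon's actual code): a periodic flusher that keeps its buffer whenever it thinks a transmission failed would show exactly this pattern.

// Illustrative sketch only. Memory grows with the buffer, and every
// flush opens one connection per buffered trace.
const buffer = [];

const addTrace = (trace) => buffer.push(trace);

async function flush(sendOne) {
  const results = await Promise.allSettled(buffer.map(sendOne));
  if (results.every((r) => r.status === 'fulfilled')) {
    buffer.length = 0; // cleared only if *everything* succeeded
  }
  // otherwise the old traces stay and are re-sent on the next flush,
  // together with whatever has been added since: growing memory and
  // spikes of parallel connections.
}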

Impressive sleuthing, Columbo.

I did more tests with a patched epsagon:

$ curl -sD - https://adobeioruntime.net/api/v1/web/helix/helix-services-private/static@test
x-openwhisk-activation-id: 37b91699015141f6b91699015171f645

$ wsk activation logs 37b91699015141f6b91699015171f645
2020-01-22T06:01:51.361Z       stderr: action-status { status: { numInvocations: 2, memInfo: 95674368, age: 5501, numFileHandles: 21, uuid: '54ba7ff16dff262da68738708a9c4c1d', concurrency: 1 } }
2020-01-22T06:01:51.373Z       stderr: instrumenting epsagon.
2020-01-22T06:01:51.377Z       stderr: using experimental-epsagon 37b91699015141f6b91699015171f645

$ curl -sD - https://adobeioruntime.net/api/v1/web/helix/helix-services-private/static@test
x-openwhisk-activation-id: a35dcdd123d44b2f9dcdd123d4bb2f87

$ wsk activation logs a35dcdd123d44b2f9dcdd123d4bb2f87
2020-01-22T06:02:07.009Z       stderr: action-status { status: { numInvocations: 3, memInfo: 84291584, age: 21149, numFileHandles: 22, uuid: '54ba7ff16dff262da68738708a9c4c1d', concurrency: 1 } }
2020-01-22T06:02:07.012Z       stderr: instrumenting epsagon.
2020-01-22T06:02:07.013Z       stderr: using experimental-epsagon a35dcdd123d44b2f9dcdd123d4bb2f87
2020-01-22T06:02:07.014Z       stderr: using experimental-epsagon a35dcdd123d44b2f9dcdd123d4bb2f87

$ curl -sD - https://adobeioruntime.net/api/v1/web/helix/helix-services-private/static@test | fgrep activation-id
x-openwhisk-activation-id: f739acfe12654688b9acfe1265c6885a
$ wsk activation logs f739acfe12654688b9acfe1265c6885a
2020-01-22T06:02:31.236Z       stderr: action-status { status: { numInvocations: 4, memInfo: 84901888, age: 45376, numFileHandles: 22, uuid: '54ba7ff16dff262da68738708a9c4c1d', concurrency: 1 } }
2020-01-22T06:02:31.238Z       stderr: instrumenting epsagon.
2020-01-22T06:02:31.239Z       stderr: using experimental-epsagon f739acfe12654688b9acfe1265c6885a
2020-01-22T06:02:31.239Z       stderr: using experimental-epsagon f739acfe12654688b9acfe1265c6885a
2020-01-22T06:02:31.239Z       stderr: using experimental-epsagon f739acfe12654688b9acfe1265c6885a

It seems that for every activation, the previous epsagon wrappers are executed as well…maybe something is wrong with our wrapper?

The problem is in how we initialize the epsagon wrapper:

helix-static/src/index.js

Lines 517 to 536 in 909657d

function epsagon(action) {
  return async (params) => {
    if (params && params.EPSAGON_TOKEN) {
      // ensure that epsagon is only required, if a token is present.
      // This is to avoid invoking their patchers otherwise.
      // eslint-disable-next-line global-require
      const { openWhiskWrapper } = require('epsagon');
      log.info('instrumenting epsagon.');
      // eslint-disable-next-line no-param-reassign
      action = openWhiskWrapper(action, {
        token_param: 'EPSAGON_TOKEN',
        appName: 'Helix Services',
        metadataOnly: false, // Optional, send more trace data
        ignoredKeys: [/[A-Z0-9_]+/],
        urlPatternsToIgnore: ['api.coralogix.com'],
      });
    }
    return action(params);
  };
}

This was a tricky one 🙂 but eslint would have told us:

// eslint-disable-next-line no-param-reassign

Because the action is captured by the closure, we wrap, re-wrap, and re-wrap the action. I'll add the same protection that exists in the lambda-wrapper to epsagon.
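
A sketch of one way to avoid the re-wrapping (the actual protection ported from the lambda wrapper may look different): wrap once into a separate variable instead of reassigning the captured parameter.

function epsagon(action) {
  let wrapped = null;
  return async (params) => {
    if (params && params.EPSAGON_TOKEN) {
      if (!wrapped) {
        // wrap exactly once instead of re-wrapping `action` on every invocation
        // eslint-disable-next-line global-require
        const { openWhiskWrapper } = require('epsagon');
        log.info('instrumenting epsagon.');
        wrapped = openWhiskWrapper(action, {
          token_param: 'EPSAGON_TOKEN',
          appName: 'Helix Services',
          metadataOnly: false,
          ignoredKeys: [/[A-Z0-9_]+/],
          urlPatternsToIgnore: ['api.coralogix.com'],
        });
      }
      return wrapped(params);
    }
    return action(params);
  };
}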