Memory leak exists, should use GC_malloc_ignore_off_page when alloc large object.
Closed this issue · 18 comments
Hi @anuraaga, thanks for your project, we are using this nottinygc in our higress project which based on Envoy's wasm extension mechanism.
We discovered a memory leak during use, and you can find demo code to reproduce the memory leak here:
https://github.com/alibaba/higress/tree/main/plugins/wasm-go/extensions/gc-test
Note here that we replace nottinygc with github.com/higress-group/nottinygc
, if we comment out this replacement like this:
...
// replace github.com/wasilibs/nottinygc v0.5.1 => github.com/higress-group/nottinygc v0.0.0-20231019105920-c4d985d443e1
require (
github.com/alibaba/higress/plugins/wasm-go v0.0.0
github.com/tetratelabs/proxy-wasm-go-sdk v0.22.0
github.com/tidwall/gjson v1.14.3
)
...
Then build the wasm file and start envoy with the following configuration:
node:
cluster: test-cluster
id: test-idn
admin:
address:
socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
listeners:
- name: listener_0
address:
socket_address: { address: 127.0.0.1, port_value: 10000 }
per_connection_buffer_limit_bytes: 1024000000
filter_chains:
- filters:
- name: envoy.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
stat_prefix: ingress_http
access_log:
- name: envoy.access_loggers.stdout
typed_config:
"@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
http_filters:
- name: gctest
typed_config:
"@type": type.googleapis.com/udpa.type.v1.TypedStruct
type_url: type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
value:
config:
name: basic_auth
vm_config:
runtime: envoy.wasm.runtime.v8
code:
local:
filename: ./mem.wasm
allow_precompiled: true
fail_open: true
configuration:
"@type": type.googleapis.com/google.protobuf.StringValue
value: '{"bytes": 10485760}'
- name: envoy.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
route_config:
name: local_route
virtual_hosts:
- name: local_service
domains:
- "*"
routes:
- match:
prefix: /
direct_response:
status: 200
body:
inline_string: "hello world"
Use the following command to perform a stress test:
wrk -R 50 -t 10 -c 10 -d60s -s post.lua http://localhost:10000/
After a few seconds, we can see the following log:
[2023-10-23 15:42:30.858][1902490][info][wasm] [source/extensions/common/wasm/context.cc:1204] wasm log basic_auth: [gc-test] {"Sys": 1063452672,"HeapSys": 1058013184,"HeapIdle": 145313792,"HeapInuse": 0,"HeapReleased": 0}
[2023-10-23 15:42:30.861][1902490][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:41] Function: proxy_on_request_headers failed: Uncaught RuntimeError: unreachable\nProxy-Wasm plugin in-VM backtrace:\n 0: 0x1d4fc - runtime._panic\n 1: 0x1ce84 - runtime.alloc\n 2: 0x26758 - main.onHttpRequestHeaders\n 3: 0x3761a - proxy_on_request_headers
Thanks for the report @johnlanni - to confirm, are you saying you don't see the issue with the patch in your branch of nottinygc (https://github.com/higress-group/nottinygc/pull/1/files) or you do see it with that as well? I'm currently running locally first with the code in that branch, using higress-group/nottinygc
and see the issue as well so just want to confirm that.
We resolved the issue in github.com/higress-group/nottinygc
with these changes: https://github.com/higress-group/nottinygc/pull/1/files.
There are two keypoints:
- Use
GC_malloc_ignore_off_page
instead ofGC_malloc
when alloc large objects which follow the recommend of Bdwgc:
- Do not export the
malloc
symbol ininit.go
. Because proxy-wasm will find this symbol when load the wasm, and envoy will use this function to alloc memory in wasm vm:
https://github.com/proxy-wasm/proxy-wasm-cpp-host/blob/b7e690703c7f26707438a2f1ebd7c197bc8f0296/src/wasm.cc#L169-L172
https://github.com/proxy-wasm/proxy-wasm-cpp-host/blob/b7e690703c7f26707438a2f1ebd7c197bc8f0296/include/proxy-wasm/wasm.h#L403
This symbol is bound tomi_malloc
, bypassing GC_malloc in bdwgc, so GC will not work.
If the "malloc" symbol is not exported, proxy wasm will use the "proxy_on_memory_allocate" symbol. The symbol is correctly bound to the function when using the tetrate sdk: https://github.com/tetratelabs/proxy-wasm-go-sdk/blob/1b9daaf70730bd6197ca8a34b2a8af713bbce7ad/proxywasm/internal/abi_callback_alloc.go#L20
Thanks for the pointers, and sorry I had an issue with my build command, I'm able to verify the fix. For me, only ignore_page seems to be enough to not see an issue, I will start with it and thinnk more about malloc. AFAIK, Envoy uses malloc to allocate static memory so it should not need to be GC'd (languages without a GC would basically not be able to implement it otherwise), in fact it could cause worse performance for the GC otherwise.
For the first change I sent #32, perhaps you can try using commit 667d0d4 to see if you also see the same behavior as me, that it is enough to fix it in your repro? Thank you.
@anuraaga Envoy needs to use malloc in the wasm module when copying memory from the host to the wasm vm, such as copying the request body to the memory section of the wasm module. This memory should be managed by the wasm vm, not Envoy itself. So the second change is needed, otherwise, this copid memory will leak.
Thanks @johnlanni - I also found https://prog.world/webassembly-will-bring-them-all-together/ which mentions it. That's a surprising model 😅 In that case it does seem like the only option is to stop exporting malloc, I will do some more checking on it and then make the change.
@anuraaga Yes, this is a bit counterintuitive. But because envoy doesn't know when the wasm module is not using the copied buffer. So it can only be released by the wasm module itself.
I have another question that is a bit confusing. Why do we need to introduce mimalloc after using bdwgc? Bdwgc doesn't really hand back the memory to OS, right? So the memory pool in mimalloc is not used?
A bit unsure what the difference is, but it is recommended by bdwgc and seems to make a big
difference in some apps.
My understanding is that when using GC_malloc, bdwgc will consider that any section in the allocated memory will be refrenced by pointers, so the larger the allocated memory, the greater the possibility of being mistaken for a potential pointer to generate a reference.
I have another question that is a bit confusing. Why do we need to introduce mimalloc after using bdwgc?
This is inherited from original exploration where it improved performance in apps that use malloc a lot like coraza-proxy-wasm. Later we added bdwgc and I think it was the far more significant improvement so I have been wondering if dlmalloc will perform fine and want to compare it. Mimalloc may still have general improvements besides pooling though.
My understanding is that when using GC_malloc, bdwgc will consider that any section in the allocated memory will be refrenced by pointers,
Since 0.5, we switched from GC_malloc to _typed, which indicates which parts may be pointers, and most importantly that there aren't pointers in byte arrays. Notably, there is a version of _typed to ignore off page, so it appears that they are orthogonal issues. I'll try to look more at the code in bdwgc to see if I can understand the difference.
Thanks @johnlanni! I also realized that the manual coraza-proxy-wasm ftw test I always use should be in CI in this repo. Any other tests will also be welcome here, or there as well of course depending on the scope. I'll ping after adding the coraza one to see what makes sense structure-wise, thank you
@johnlanni I have added an e2e test based on the higress gc-test
https://github.com/wasilibs/nottinygc/tree/main/e2e/higress-gc-test
https://github.com/wasilibs/nottinygc/blob/main/magefiles/e2e.go#L115
Let me know if you have any suggestions for it. Thanks.
Hi,
We have a similar issue with wasm plugin built by tinygo. The envoy instance that holds the plugin consumes hundreds percent of CPU with increasing value of memory. It seems that nottinygc couldn't release the memory. We tried to implement the simplest test to reproduce this case here.
Does anyone have any idea?
Hi @behnamnikbakht - are you seeing perpetually increasing memory usage over time or is it that memory is not reduced when idle? Unfortunately Wasm has no means to return memory to the os, this is fundamental to the spec and nothing GC can do about it. So res memory will never go down even when idle
By the way, your test case appears to be the same as this e2e
https://github.com/wasilibs/nottinygc/blob/main/e2e/higress-gc-test/main.go#L64
AFAICT it is behaving as expected.
Wasm has no means to return memory to the os, this is fundamental to the spec and nothing GC can do about it. So res memory will never go down even when idle
In our production environment, the WASM plugin processes 2000 requests per minute. The plugin heavily uses make for allocation. Our performance environment simulated 2000 RPM using Jmeter and ran the test for almost 2 days continuously. RES RES Memory growth of the envoy went from 150 MB to 900MB (see the below picture). If the RES memory will never go down, restarting the envoy every day is the only option. Any suggestions, please?
Thanks for the graph @cheranm - I realized that a lot of the focus in our e2e tests is on the GC allocation and not process memory. When running the higress e2e and looking at docker stats
, I do see the same growing physical memory, despite our reported GC heap size being constant at around 20MB. I instrumented the binary with additional debugging, including logging mimalloc, and could confirm all the numbers double check to 20MB.
I don't know what to make of this - the numbers being reported by bdwgc and mimalloc could be wrong for some reason, but I don't think the chance of this is high. But in that case, it means the memory increase is outside what is allocated by this library, and I have no clues on how to debug that.