Question: Does the GC execute before the `proxywasm.DispatchHttpCall` callback executes?
sandeeppnaik opened this issue · 20 comments
Hi @anuraaga !
I am running into strange behaviour when my wasm plugin executes a `DispatchHttpCall`. In the majority of runs, the call to the service (the dispatch call) is never made, which makes me wonder whether the GC clears the context before the dispatch call executes.
The flow works as expected when I don't run the custom GC:
Req -> parseHeaders -> dispatchCall(ServiceA) -> callback -> getResponseHeaders -> update headers -> ReqGoesUpstream
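For reference, the flow above can be sketched roughly as follows with the proxy-wasm-go-sdk. This is a simplified, hypothetical sketch, not the actual plugin: the cluster name "service_a", the paths, and the header handling are assumptions, and the VM/plugin context wiring is omitted. It only runs when compiled with TinyGo inside a proxy-wasm host such as Envoy.

```go
package main

import (
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
)

// httpContext handles one HTTP stream; the VM/plugin context wiring
// that creates it is omitted for brevity.
type httpContext struct {
	types.DefaultHttpContext
}

// OnHttpRequestHeaders dispatches a call to ServiceA and pauses the
// request until the callback resumes it.
func (ctx *httpContext) OnHttpRequestHeaders(numHeaders int, endOfStream bool) types.Action {
	headers := [][2]string{
		{":method", "GET"},
		{":path", "/check"},       // hypothetical endpoint
		{":authority", "service-a"},
	}
	// "service_a" is an assumed Envoy cluster name; 5000 is the timeout in ms.
	if _, err := proxywasm.DispatchHttpCall("service_a", headers, nil, nil, 5000, ctx.onServiceAResponse); err != nil {
		proxywasm.LogCriticalf("dispatch failed: %v", err)
		return types.ActionContinue
	}
	return types.ActionPause
}

// onServiceAResponse reads ServiceA's response headers, updates the
// paused request, and lets it continue upstream.
func (ctx *httpContext) onServiceAResponse(numHeaders, bodySize, numTrailers int) {
	respHeaders, err := proxywasm.GetHttpCallResponseHeaders()
	if err != nil {
		proxywasm.LogCriticalf("failed to get call response headers: %v", err)
	}
	for _, h := range respHeaders {
		// Illustrative only: copy each response header onto the request.
		proxywasm.AddHttpRequestHeader(h[0], h[1])
	}
	proxywasm.ResumeHttpRequest()
}
```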
The metrics below measure `execution_count` (executions of the plugin) and `success_count` (completions of the callback). When I don't run the custom GC, execution_count = success_count. Whereas when I run the custom GC, only a portion of requests end up making a call upstream (I verified that requests don't reach ServiceA).
My main.go has these imports:
import (
"crypto/rand"
"fmt"
"math/big"
"os"
"strings"
"time"
"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
"github.com/tidwall/gjson"
_ "github.com/wasilibs/nottinygc"
"golang.org/x/exp/slices"
)
And the file ends with this stub after all the functions:
// required for nottinygc
//
//export sched_yield
func sched_yield() int32 {
return 0
}
Appreciate your thoughts on this behaviour, and thank you for this package :)
Hi @sandeeppnaik - would it be possible to get a reproduction showing the issue? If there's a repo with a command to build the plugin and a docker compose command to start two envoys to show the problem, that would be very helpful. If not, I'll try to set it up but it may take some more time.
Nothing immediately comes to mind as a potential problem so will need to debug a repro. For the question in the title specifically, GC can happen with any allocation, and there are indeed allocations happening during that call. Though that should be ok in principle.
Hey @anuraaga, I have added the plugin code on two branches:
- compile with customgc - processes the dispatch only for a portion of requests
- compile without custom gc - processes dispatch for all requests
You will need a dummy server with an endpoint for the plugin to return success.
and there are indeed allocations happening during that call
-> Does the location of the stub method in the code matter at all?
Thanks for the report and repro @sandeeppnaik, I found an issue with the recent addition of precise GC. v0.6.0 includes the fix and trying it with your repro, it seems to be working well. Let me know if you still have any issues.
Thanks for looking into this promptly! I am running v0.6.0 now. I can definitely see that all the dispatch calls are working as expected. A question about the garbage collection:
Here, the first bump beginning at 15:40 is without the custom GC vs. the second bump at 16:02 with the custom GC.
The ramp-up of memory when running the custom GC is better than without it. However, I would expect the second bump to go down over time and free up the memory, but I'm not seeing that. Any idea why @anuraaga?
@sandeeppnaik IIUC, this is a graph of real memory usage of the Envoy process. WebAssembly currently does not define any way to return memory to the OS, so any memory that has been allocated will stay allocated forever. Environment variables like GC_PRINT_STATS
can provide information about how much of the allocated memory is actually in use by objects, if that is what you are interested in:
https://github.com/ivmai/bdwgc#installation-and-portability
The hope is that when the process stabilizes, it will efficiently reuse the memory it has already allocated from the OS. But unfortunately, due to this fundamental limitation of wasm, there is no way or workaround to ever release unused memory completely.
this is a graph of real memory usage of the Envoy process
-> Yep, this was Envoy's memory usage. Ah, I understand, so the plugin never returns memory back to the proxy but reuses whatever has been allocated. If that is the case, the memory usage of the plugin should be consistent under a consistent load pattern (sustained throughput of x rps, with periods of no load between two consecutive load tests), right? However, I see it ever-growing until it hits the proxy's memory limit.
![image](https://private-user-images.githubusercontent.com/22852913/277239097-ed07cc86-aa76-43b5-bc85-c55874b082fe.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgyNzE0MjMsIm5iZiI6MTcxODI3MTEyMywicGF0aCI6Ii8yMjg1MjkxMy8yNzcyMzkwOTctZWQwN2NjODYtYWE3Ni00M2I1LWJjODUtYzU1ODc0YjA4MmZlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjEzVDA5MzIwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZjMzFiMzQyZGU4YTM0NjhlOGZjODI0ZDQyY2IzMjNhNmEwNzM0M2JhZGIyMTJhMDdlMjdkZjJkOGRhNTkyZWImWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.B4FukrJTTwW9fN-lQC2aUfLzfWi8ygRqnLCXDJSOdbI)
{"level":"warn","time":"2023-10-23T05:47:12.025635Z","msg":"Envoy may have been out of memory killed. Check memory usage and limits."}
{"level":"error","time":"2023-10-23T05:47:12.025679Z","msg":"Envoy exited with error: signal: killed"}
{"level":"error","time":"2023-10-23T05:47:12.025715Z","msg":"error serving tap http server: http: Server closed"}
It would be good to confirm that it is the memory within a VM going up and not the number of VMs itself. Envoy has the metrics wasm.envoy.wasm.runtime.v8.active
and wasm.envoy.wasm.runtime.v8.created
to show the number of VMs. It could be an issue with GC, but we need to confirm the VM count first; unfortunately there isn't a per-VM memory usage stat yet.
I noticed both of the above metrics stay constant throughout the load test while the memory spikes
curl http://localhost:15090/stats/prometheus | grep v8
# TYPE envoy_wasm_envoy_wasm_runtime_v8_created counter
envoy_wasm_envoy_wasm_runtime_v8_created{} 10
# TYPE envoy_wasm_envoy_wasm_runtime_v8_active gauge
envoy_wasm_envoy_wasm_runtime_v8_active{} 12
Interestingly, with the previous version, nottinygc v0.5.1, the memory does not spike. But I'm not entirely sure whether that is because it prevents the plugin from making the dispatch call.
Thanks for confirming - I'm assuming the plugin you are testing against isn't the same as in https://github.com/sandeeppnaik/wasm-test/tree/main? FWIW, I can't reproduce increasing memory usage locally using that repo - it might be harder to debug issues without a closer reproduction in that case. Other things to try are running with the GC_PRINT_STATS
environment variable set, and seeing whether nottinygc v0.4.0
behaves any differently.
Hi @sandeeppnaik - I got another report with a potential fix.
Perhaps if you get a chance, you could try the 667d0d4 commit to see if it makes a difference. Thanks.
Unfortunately, still seeing the same behaviour of memory spiking with this @anuraaga. My plugin is pretty much the above example. May I know what throughput you are testing with? When I run at about ~600 rps, I can reproduce the memory spikes.
![image](https://private-user-images.githubusercontent.com/22852913/277318362-3c9d9329-3a2d-485c-b648-b422b7945d33.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTgyNzE0MjMsIm5iZiI6MTcxODI3MTEyMywicGF0aCI6Ii8yMjg1MjkxMy8yNzczMTgzNjItM2M5ZDkzMjktM2EyZC00ODVjLWI2NDgtYjQyMmI3OTQ1ZDMzLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MTMlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjEzVDA5MzIwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg3ZDAzODUxN2Q4OTVjYjM1OWQyNWZkZmFhMjEzOGIyNGJlMzllN2JkMDBmMmE4ZDMzMTQ2NGE3NTEzNWQyMDcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.hOSFL6oq11XNOhVdByJWD5Ul8iQiZbLtC37JG9PygDg)
Also tested with v0.4.0; it seems to have the same behaviour.
Tested this too: github.com/higress-group/nottinygc v0.0.0-20231019105920-c4d985d443e1
seems to prevent the memory leak!
Thanks for the confirmation - it seems like the issue with malloc affects callbacks more strongly than other apps. I will go ahead and incorporate that fix soon.
Getting a panic while compiling the binary with tinygo build -o main.wasm -gc=custom -tags=nottinygc_proxywasm -scheduler=none -target=wasi main.go:
panic: cannot convert pointer value to byte
goroutine 197 [running]:
github.com/tinygo-org/tinygo/interp.rawValue.bytes(...)
/Users/runner/work/tinygo/tinygo/interp/memory.go:743
github.com/tinygo-org/tinygo/interp.rawValue.Uint(...)
/Users/runner/work/tinygo/tinygo/interp/memory.go:751
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2c640, {0xc008e01690, 0x4, 0x0?}, 0xc009b68720, {0xc006a3b3c0, 0x1c})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:828 +0x21f6
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d4f0, {0xc008e00b10, 0x4, 0x0?}, 0xc009b68360, {0xc007114900, 0x18})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d4a0, {0xc005d70f90, 0x1, 0x0?}, 0xc009b55f20, {0xc0071148a0, 0x14})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d450, {0xc0047f8e10, 0x2, 0x0?}, 0xc009b55e90, {0xc006cd4ac0, 0x10})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d400, {0xc008d9b390, 0x2, 0x0?}, 0xc009b55b30, {0xc006cd47a0, 0xc})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d3b0, {0xc008d9a290, 0x1, 0x0?}, 0xc009b554d0, {0xc006c4b2d8, 0x8})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.(*runner).run(0xc00910dd60, 0xc009f2d360, {0x0, 0x0, 0x100066faa?}, 0x0, {0x104a4b0a0, 0x4})
/Users/runner/work/tinygo/tinygo/interp/interpreter.go:526 +0x7ccc
github.com/tinygo-org/tinygo/interp.Run({0x110f05240?}, 0x29e8d60800?, 0x0?)
/Users/runner/work/tinygo/tinygo/interp/interp.go:117 +0x67e
github.com/tinygo-org/tinygo/builder.optimizeProgram({0x110f04080?}, 0xc000024240)
/Users/runner/work/tinygo/tinygo/builder/build.go:1046 +0x35
github.com/tinygo-org/tinygo/builder.Build.func5(0xc00907d9e0)
/Users/runner/work/tinygo/tinygo/builder/build.go:576 +0x61d
github.com/tinygo-org/tinygo/builder.runJob(0xc00907da40, 0x0?)
/Users/runner/work/tinygo/tinygo/builder/jobs.go:222 +0x4d
created by github.com/tinygo-org/tinygo/builder.runJobs in goroutine 1
/Users/runner/work/tinygo/tinygo/builder/jobs.go:123 +0x5bd
@sandeeppnaik I'm not sure if it would cause that crash, but the invocation seems to be missing the custommalloc
build tag. Also good to confirm you've removed the sched_yield
function from your own code.
Thanks for writing back @anuraaga. My bad - I'm now building with tinygo build -o main.wasm -gc=custom -scheduler=none -target=wasi -tags="custommalloc nottinygc_proxywasm" main.go.
Works like a charm! Confirming there is no memory leak at high load, and I am no longer implementing the stub method.
@sandeeppnaik I have released a new version with the changes. A heads-up that I ended up naming the tag nottinygc_envoy
in the end. Hopefully this works fine now, and thanks for the patience.
Hey @anuraaga, noticing a behaviour where I increment a counter metric just prior to returning from the callback / ResumeHttpRequest,
but I see those metrics never incrementing. Is this a case of very eager garbage collection? I don't see the same behaviour with the higress version. Any thoughts?
We generally wouldn't expect GC to have an effect, since there are no allocations when incrementing a metric:
https://github.com/tetratelabs/proxy-wasm-go-sdk/blob/main/proxywasm/hostcall.go#L759
But you never know - we'd need a repro to debug it.
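For what it's worth, the pattern being described (incrementing a counter right before resuming the paused request) would look roughly like the sketch below with the SDK's metrics API. The metric name and the httpContext type are hypothetical, and like any proxy-wasm plugin code this only runs inside a proxy-wasm host, not standalone.

```go
// Defined once, e.g. at plugin start; the name is illustrative.
var successCount = proxywasm.DefineCounterMetric("wasm_plugin_success_count")

// Inside the dispatch callback, increment just before resuming the
// paused request, as described in the report above.
func (ctx *httpContext) dispatchCallback(numHeaders, bodySize, numTrailers int) {
	successCount.Increment(1)
	proxywasm.ResumeHttpRequest()
}
```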
I have added an e2e test that is based on your wasm-test repo:
https://github.com/wasilibs/nottinygc/tree/main/e2e/envoy-dispatch-call
https://github.com/wasilibs/nottinygc/blob/main/magefiles/e2e.go#L68
Checking the counters after a minute of running, they seem to be stable, but perhaps you're seeing issues after a longer runtime, or maybe there are some differences with the test case? If you have any suggestions for that e2e test, a PR would be great. Thanks.