Regression: Coordinator crashes with OOM in 468
The coordinator in 468 consumes much more memory than in 467, which leads to an OS-level OOM kill. What could be the reason for such a change in memory consumption?
Listed below are the free memory stats and GC logs captured during execution on both coordinator versions.
467
===
** after 15 minutes
[2024-12-24T11:10:18.671+0000][7691][info][gc,start ] GC(60) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:18.672+0000][7691][info][gc,task ] GC(60) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases ] GC(60) Pre Evacuate Collection Set: 0.87ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases ] GC(60) Merge Heap Roots: 0.44ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases ] GC(60) Evacuate Collection Set: 159.86ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases ] GC(60) Post Evacuate Collection Set: 1.28ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases ] GC(60) Other: 0.12ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap ] GC(60) Eden regions: 100->0(89)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap ] GC(60) Survivor regions: 8->9(14)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap ] GC(60) Old regions: 34->34
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap ] GC(60) Humongous regions: 0->0
[2024-12-24T11:10:18.834+0000][7691][info][gc,metaspace ] GC(60) Metaspace: 291741K(295488K)->291741K(295488K) NonClass: 254345K(256448K)->254345K(256448K) Class: 37395K(39040K)->37395K(39040K)
[2024-12-24T11:10:18.834+0000][7691][info][gc ] GC(60) Pause Young (Normal) (G1 Evacuation Pause) 4449M->1279M(16384M) 162.817ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,cpu ] GC(60) User=0.43s Sys=0.04s Real=0.17s
total used free shared buff/cache available
Mem: 30 7 19 0 4 23
Swap: 0 0 0
Tue Dec 24 11:10:26 UTC 2024
** after 1 hour 15 minutes
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking ] GC(173) Concurrent Mark From Roots 2367.862ms
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking ] GC(173) Concurrent Preclean
[2024-12-24T12:09:59.681+0000][7686][info][gc,marking ] GC(173) Concurrent Preclean 1.161ms
[2024-12-24T12:09:59.681+0000][7691][info][gc,start ] GC(173) Pause Remark
[2024-12-24T12:09:59.724+0000][7691][info][gc ] GC(173) Pause Remark 5988M->5691M(16384M) 42.204ms
[2024-12-24T12:09:59.724+0000][7691][info][gc,cpu ] GC(173) User=0.15s Sys=0.00s Real=0.04s
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking ] GC(173) Concurrent Mark 2413.972ms
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions 483.491ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,start ] GC(173) Pause Cleanup
[2024-12-24T12:10:00.210+0000][7691][info][gc ] GC(173) Pause Cleanup 5725M->5725M(16384M) 0.054ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,cpu ] GC(173) User=0.00s Sys=0.00s Real=0.00s
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking ] GC(173) Concurrent Clear Claimed Marks
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking ] GC(173) Concurrent Clear Claimed Marks 0.427ms
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking ] GC(173) Concurrent Cleanup for Next Mark
[2024-12-24T12:10:00.228+0000][7686][info][gc,marking ] GC(173) Concurrent Cleanup for Next Mark 17.547ms
[2024-12-24T12:10:00.228+0000][7686][info][gc ] GC(173) Concurrent Mark Cycle 3285.446ms
total used free shared buff/cache available
Mem: 30 16 10 0 4 14
Swap: 0 0 0
Tue Dec 24 12:10:41 UTC 2024
===
468
===
** after 15 minutes
[2024-12-24T11:10:04.019+0000][7679][info][gc,start ] GC(62) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:04.019+0000][7679][info][gc,task ] GC(62) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases ] GC(62) Pre Evacuate Collection Set: 0.32ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases ] GC(62) Merge Heap Roots: 0.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases ] GC(62) Evacuate Collection Set: 139.19ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases ] GC(62) Post Evacuate Collection Set: 2.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases ] GC(62) Other: 0.08ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap ] GC(62) Eden regions: 118->0(104)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap ] GC(62) Survivor regions: 8->10(16)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap ] GC(62) Old regions: 33->33
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap ] GC(62) Humongous regions: 0->0
[2024-12-24T11:10:04.161+0000][7679][info][gc,metaspace ] GC(62) Metaspace: 292209K(296000K)->292209K(296000K) NonClass: 254676K(256768K)->254676K(256768K) Class: 37533K(39232K)->37533K(39232K)
[2024-12-24T11:10:04.161+0000][7679][info][gc ] GC(62) Pause Young (Normal) (G1 Evacuation Pause) 5009M->1277M(16384M) 142.330ms
[2024-12-24T11:10:04.162+0000][7679][info][gc,cpu ] GC(62) User=0.44s Sys=0.04s Real=0.14s
total used free shared buff/cache available
Mem: 30 16 9 0 4 13
Swap: 0 0 0
Tue Dec 24 11:10:27 UTC 2024
** after 1 hour 15 minutes
[2024-12-24T12:10:12.713+0000][7679][info][gc,start ] GC(179) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T12:10:12.714+0000][7679][info][gc,task ] GC(179) Using 4 workers of 4 for evacuation
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases ] GC(179) Pre Evacuate Collection Set: 0.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases ] GC(179) Merge Heap Roots: 0.24ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases ] GC(179) Evacuate Collection Set: 77.50ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases ] GC(179) Post Evacuate Collection Set: 2.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases ] GC(179) Other: 0.09ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap ] GC(179) Eden regions: 303->0(302)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap ] GC(179) Survivor regions: 4->5(39)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap ] GC(179) Old regions: 113->113
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap ] GC(179) Humongous regions: 1->1
[2024-12-24T12:10:12.795+0000][7679][info][gc,metaspace ] GC(179) Metaspace: 296387K(300352K)->296387K(300352K) NonClass: 258630K(260800K)->258630K(260800K) Class: 37756K(39552K)->37756K(39552K)
[2024-12-24T12:10:12.795+0000][7679][info][gc ] GC(179) Pause Young (Normal) (G1 Evacuation Pause) 13361M->3699M(16384M) 82.112ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,cpu ] GC(179) User=0.26s Sys=0.01s Real=0.08s
total used free shared buff/cache available
Mem: 30 29 1 0 0 0
Swap: 0 0 0
Tue Dec 24 12:10:25 UTC 2024
*** syslog OOM output
grep -i oom /var/log/syslog
2024-12-24T12:14:35.481240+00:00 ip-172-30-20-32 kernel: http-worker-691 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T12:14:35.496433+00:00 ip-172-30-20-32 kernel: oom_kill_process+0x116/0x270
2024-12-24T12:14:35.496434+00:00 ip-172-30-20-32 kernel: ? oom_evaluate_task+0x143/0x1e0
2024-12-24T12:14:35.496435+00:00 ip-172-30-20-32 kernel: __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T12:14:35.511859+00:00 ip-172-30-20-32 kernel: [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7634,uid=0
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: Out of memory: Killed process 7634 (trino-server) total-vm:36015048kB, anon-rss:29973656kB, file-rss:2948kB, shmem-rss:0kB, UID:0 pgtables:61856kB oom_score_adj:0
2024-12-24T12:14:35.792898+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: A process of this unit has been killed by the OOM killer.
2024-12-24T12:14:35.803968+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
2024-12-24T12:14:35.806272+00:00 ip-172-30-20-32 systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.
The Trino coordinator is running on an EC2 r6a.xlarge instance with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11).
jvm.config is:
-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-Djava.io.tmpdir=/data/tmpdir
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Dfile.encoding=UTF-8
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Reduce starvation of threads by GCLocker; recommended to be set to about the number of CPU cores (JDK-8192647)
-XX:+UnlockDiagnosticVMOptions
-XX:G1NumCollectionsKeepPinned=10000000
-XX:GCLockerRetryAllocationCount=32
-Xlog:gc*:/data/trino/var/log/gc.log:time,level,tags,tid:filecount=10,filesize=100M
Executing the same load on an r6a.2xlarge with double the memory, and it looks like an OOM will happen soon.
Memory consumption after 1.5 hours:
===
467
===
$ date && free -g && (top -n 20| grep trino) && date && free -g
Tue Dec 24 15:00:31 UTC 2024
total used free shared buff/cache available
Mem: 61 24 32 0 4 36
Swap: 0 0 0
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7766 root 20 0 44.6g 23.0g 30676 S 18.2 37.4 35:48.76 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 20.3 37.4 35:49.37 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 18.6 37.4 35:49.93 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 16.3 37.4 35:50.42 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 15.0 37.4 35:50.87 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 16.6 37.4 35:51.37 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 12.3 37.4 35:51.74 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 11.3 37.4 35:52.08 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 13.7 37.4 35:52.49 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 12.3 37.4 35:52.86 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 11.0 37.4 35:53.19 trino-s+
7766 root 20 0 44.6g 23.0g 30676 S 7.6 37.4 35:53.42 trino-s+
7766 root 20 0 44.5g 23.0g 30676 S 7.3 37.4 35:53.64 trino-s+
7766 root 20 0 44.5g 23.0g 30676 S 6.0 37.4 35:53.82 trino-s+
7766 root 20 0 44.5g 23.0g 30676 S 6.6 37.4 35:54.02 trino-s+
7766 root 20 0 44.5g 23.0g 30676 S 6.6 37.4 35:54.22 trino-s+
7766 root 20 0 44.4g 23.0g 30676 S 8.0 37.4 35:54.46 trino-s+
7766 root 20 0 44.4g 23.0g 30676 S 6.3 37.4 35:54.65 trino-s+
7766 root 20 0 44.4g 23.0g 30676 S 7.0 37.4 35:54.86 trino-s+
7766 root 20 0 44.6g 23.1g 30676 S 278.7 37.6 36:03.25 trino-s+
Tue Dec 24 15:01:29 UTC 2024
===
468
===
$ date && free -g && (top -n 20| grep trino) && date && free -g
Tue Dec 24 15:00:30 UTC 2024
total used free shared buff/cache available
Mem: 61 43 13 0 4 17
Swap: 0 0 0
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7784 root 20 0 61.7g 42.1g 30820 S 10.0 68.6 39:12.02 trino-s+
7784 root 20 0 61.7g 42.2g 30820 S 8.3 68.6 39:12.27 trino-s+
7784 root 20 0 61.8g 42.2g 30820 S 7.0 68.7 39:12.48 trino-s+
7784 root 20 0 61.8g 42.2g 30820 S 10.3 68.7 39:12.79 trino-s+
7784 root 20 0 61.9g 42.2g 30820 S 7.3 68.7 39:13.01 trino-s+
7784 root 20 0 61.9g 42.3g 30820 S 5.6 68.8 39:13.18 trino-s+
7784 root 20 0 61.9g 42.3g 30820 S 5.7 68.8 39:13.35 trino-s+
7784 root 20 0 61.8g 42.3g 30820 S 5.0 68.9 39:13.50 trino-s+
7784 root 20 0 61.8g 42.3g 30820 S 3.0 68.9 39:13.59 trino-s+
7784 root 20 0 61.8g 42.3g 30820 S 2.0 68.9 39:13.65 trino-s+
7784 root 20 0 61.8g 42.3g 30820 S 2.3 68.8 39:13.72 trino-s+
7784 root 20 0 61.8g 42.3g 30820 S 1.7 68.8 39:13.77 trino-s+
7784 root 20 0 61.8g 42.2g 30820 S 2.0 68.7 39:13.83 trino-s+
7784 root 20 0 61.8g 42.2g 30820 S 2.0 68.7 39:13.89 trino-s+
7784 root 20 0 62.0g 42.4g 30820 S 181.4 69.0 39:19.35 trino-s+
7784 root 20 0 62.9g 43.0g 30820 S 220.7 70.0 39:25.97 trino-s+
7784 root 20 0 63.4g 43.5g 30820 S 41.2 70.8 39:27.21 trino-s+
7784 root 20 0 63.8g 44.0g 30820 S 43.9 71.6 39:28.53 trino-s+
7784 root 20 0 64.6g 44.7g 30820 S 155.0 72.7 39:33.18 trino-s+
7784 root 20 0 64.9g 45.1g 30820 S 34.6 73.3 39:34.22 trino-s+
Tue Dec 24 15:01:27 UTC 2024
total used free shared buff/cache available
Mem: 61 46 10 0 4 14
Swap: 0 0 0
The OS OOM kill happens again:
$ grep oom /var/log/syslog
2024-12-24T17:14:34.706917+00:00 ip-172-30-21-101 kernel: query-execution invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T17:14:34.744709+00:00 ip-172-30-21-101 kernel: oom_kill_process+0x116/0x270
2024-12-24T17:14:34.744711+00:00 ip-172-30-21-101 kernel: ? oom_evaluate_task+0x143/0x1e0
2024-12-24T17:14:34.744712+00:00 ip-172-30-21-101 kernel: __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T17:14:34.792485+00:00 ip-172-30-21-101 kernel: [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T17:14:34.792510+00:00 ip-172-30-21-101 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7784,uid=0
2024-12-24T17:14:34.792511+00:00 ip-172-30-21-101 kernel: Out of memory: Killed process 7784 (trino-server) total-vm:74788372kB, anon-rss:63005676kB, file-rss:2412kB, shmem-rss:0kB, UID:0 pgtables:128700kB oom_score_adj:0
2024-12-24T17:14:37.446266+00:00 ip-172-30-21-101 kernel: oom_reaper: reaped process 7784 (trino-server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2024-12-24T17:14:37.759101+00:00 ip-172-30-21-101 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
It looks like we need to be looking for something off-heap (the Java heap seems fine).
As for Python: this code is potentially leaky. We are not closing Instance explicitly - just assuming GC will do the job - and IDK whether it holds some native memory. Also, it looks like we do not always have allocate/deallocate pairs, e.g. for argTypeAddress and returnTypeAddress in PythonEngine.doSetup. IDK if that is a problem or not. @electrum PTAL.
But this is probably not what @guyco33 is observing, as this should not be called if Python functions are not in use.
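For reference on the off-heap gap: the syslog above reports anon-rss of 29973656 kB (~28.6 GiB) against a fully committed 16 GiB heap (-Xms16G/-Xmx16G), so roughly 12-13 GiB sits outside the Java heap, and metaspace (~290 MB per the GC logs) plus the 512 MB reserved code cache explain only a small part of it. JVM Native Memory Tracking (-XX:NativeMemoryTracking=summary in jvm.config, then jcmd <pid> VM.native_memory summary) is the usual way to break the rest down. The sketch below is a complementary, illustrative check only (the class name OffHeapSnapshot is made up): it reads the standard heap, non-heap, and NIO buffer-pool MXBeans, and would need to run inside the target JVM or query the same MBeans over remote JMX.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Illustrative only: logs heap, non-heap, and NIO buffer-pool usage of the JVM it runs in
public class OffHeapSnapshot
{
    public static void main(String[] args)
    {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        // Heap is capped by -Xmx; non-heap covers metaspace, code cache, and similar JVM areas
        System.out.printf("heap used/committed:     %d / %d MiB%n", toMiB(heap.getUsed()), toMiB(heap.getCommitted()));
        System.out.printf("non-heap used/committed: %d / %d MiB%n", toMiB(nonHeap.getUsed()), toMiB(nonHeap.getCommitted()));

        // The "direct" and "mapped" pools track NIO buffer memory, a common off-heap consumer
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %-7s count=%d used=%d MiB capacity=%d MiB%n",
                    pool.getName(), pool.getCount(), toMiB(pool.getMemoryUsed()), toMiB(pool.getTotalCapacity()));
        }
    }

    private static long toMiB(long bytes)
    {
        return bytes >> 20;
    }
}
```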
What connector/table format are you using?
I'm using postgres, mysql, and hive tables with ORC and gzip-compressed CSV file formats.
@losipiuk There is no close for Instance. It doesn't hold any native memory: the WASM memory is allocated on-heap as a single contiguous array. The argType and returnType pointers passed to setup() are saved and held onto for the lifetime of the WASM engine.
There is a potential leak of PythonEngine, which creates a Jimfs instance that may get registered with the JDK file system machinery. We create a ScalarFunctionImplementation with an instanceFactory so that the engine can create one per thread as needed. If this is a problem, we'll need to change the engine to support Closeable function instances, use java.lang.ref.Cleaner, etc. We may also need memory tracking for the function instances, as language functions will typically use more memory than other types of functions.
That said, this seems unrelated to this issue, as there is no evidence they are being used here.
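A minimal sketch of the Cleaner idea mentioned above, under the assumption that the engine keeps its Jimfs file system in a small holder object (PythonEngineResources is a made-up name; the real PythonEngine wiring will differ): the file system is closed explicitly when the function instance is closed, with the Cleaner acting only as a safety net if the instance is dropped and later collected.

```java
import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;

import java.io.IOException;
import java.lang.ref.Cleaner;
import java.nio.file.FileSystem;

// Hypothetical holder illustrating the Cleaner pattern; not the actual PythonEngine code
public final class PythonEngineResources
        implements AutoCloseable
{
    private static final Cleaner CLEANER = Cleaner.create();

    private final FileSystem fileSystem;
    private final Cleaner.Cleanable cleanable;

    public PythonEngineResources()
    {
        FileSystem fileSystem = Jimfs.newFileSystem(Configuration.unix());
        this.fileSystem = fileSystem;
        // The cleanup action must not capture 'this', otherwise the holder never becomes unreachable
        this.cleanable = CLEANER.register(this, () -> closeQuietly(fileSystem));
    }

    public FileSystem fileSystem()
    {
        return fileSystem;
    }

    @Override
    public void close()
    {
        // Explicit close for Closeable function instances; the Cleaner is only a fallback
        cleanable.clean();
    }

    private static void closeQuietly(FileSystem fileSystem)
    {
        try {
            fileSystem.close();
        }
        catch (IOException ignored) {
            // best-effort cleanup when invoked from the Cleaner thread
        }
    }
}
```

Memory tracking for language function instances, as mentioned above, would be a separate change; this sketch only covers releasing the Jimfs resources.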