trinodb/trino

Regression: Coordinator crashes with OOM in 468


The coordinator on 468 consumes much more memory than on 467, which leads to an OS-level OOM kill.
What could be the reason for such a change in memory consumption?

Below is the output of free memory stats and GC logs captured during execution on both coordinator versions.

467
===
** after 15 minutes 

[2024-12-24T11:10:18.671+0000][7691][info][gc,start       ] GC(60) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:18.672+0000][7691][info][gc,task        ] GC(60) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Pre Evacuate Collection Set: 0.87ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Merge Heap Roots: 0.44ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Evacuate Collection Set: 159.86ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Post Evacuate Collection Set: 1.28ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Other: 0.12ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Eden regions: 100->0(89)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Survivor regions: 8->9(14)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Old regions: 34->34
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Humongous regions: 0->0
[2024-12-24T11:10:18.834+0000][7691][info][gc,metaspace   ] GC(60) Metaspace: 291741K(295488K)->291741K(295488K) NonClass: 254345K(256448K)->254345K(256448K) Class: 37395K(39040K)->37395K(39040K)
[2024-12-24T11:10:18.834+0000][7691][info][gc             ] GC(60) Pause Young (Normal) (G1 Evacuation Pause) 4449M->1279M(16384M) 162.817ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,cpu         ] GC(60) User=0.43s Sys=0.04s Real=0.17s
               total        used        free      shared  buff/cache   available
Mem:              30           7          19           0           4          23
Swap:              0           0           0
Tue Dec 24 11:10:26 UTC 2024


** after 1 hour 15 minutes

[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark From Roots 2367.862ms
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean
[2024-12-24T12:09:59.681+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean 1.161ms
[2024-12-24T12:09:59.681+0000][7691][info][gc,start       ] GC(173) Pause Remark
[2024-12-24T12:09:59.724+0000][7691][info][gc             ] GC(173) Pause Remark 5988M->5691M(16384M) 42.204ms
[2024-12-24T12:09:59.724+0000][7691][info][gc,cpu         ] GC(173) User=0.15s Sys=0.00s Real=0.04s
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark 2413.972ms
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions 483.491ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,start       ] GC(173) Pause Cleanup
[2024-12-24T12:10:00.210+0000][7691][info][gc             ] GC(173) Pause Cleanup 5725M->5725M(16384M) 0.054ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,cpu         ] GC(173) User=0.00s Sys=0.00s Real=0.00s
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks 0.427ms
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark
[2024-12-24T12:10:00.228+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark 17.547ms
[2024-12-24T12:10:00.228+0000][7686][info][gc             ] GC(173) Concurrent Mark Cycle 3285.446ms
               total        used        free      shared  buff/cache   available
Mem:              30          16          10           0           4          14
Swap:              0           0           0
Tue Dec 24 12:10:41 UTC 2024

===
468
===
** after 15 minutes 

[2024-12-24T11:10:04.019+0000][7679][info][gc,start       ] GC(62) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:04.019+0000][7679][info][gc,task        ] GC(62) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Pre Evacuate Collection Set: 0.32ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Merge Heap Roots: 0.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Evacuate Collection Set: 139.19ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Post Evacuate Collection Set: 2.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Other: 0.08ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Eden regions: 118->0(104)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Survivor regions: 8->10(16)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Old regions: 33->33
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Humongous regions: 0->0
[2024-12-24T11:10:04.161+0000][7679][info][gc,metaspace   ] GC(62) Metaspace: 292209K(296000K)->292209K(296000K) NonClass: 254676K(256768K)->254676K(256768K) Class: 37533K(39232K)->37533K(39232K)
[2024-12-24T11:10:04.161+0000][7679][info][gc             ] GC(62) Pause Young (Normal) (G1 Evacuation Pause) 5009M->1277M(16384M) 142.330ms
[2024-12-24T11:10:04.162+0000][7679][info][gc,cpu         ] GC(62) User=0.44s Sys=0.04s Real=0.14s
               total        used        free      shared  buff/cache   available
Mem:              30          16           9           0           4          13
Swap:              0           0           0
Tue Dec 24 11:10:27 UTC 2024


** after 1 hour 15 minutes

[2024-12-24T12:10:12.713+0000][7679][info][gc,start       ] GC(179) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T12:10:12.714+0000][7679][info][gc,task        ] GC(179) Using 4 workers of 4 for evacuation
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Pre Evacuate Collection Set: 0.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Merge Heap Roots: 0.24ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Evacuate Collection Set: 77.50ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Post Evacuate Collection Set: 2.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Other: 0.09ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Eden regions: 303->0(302)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Survivor regions: 4->5(39)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Old regions: 113->113
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Humongous regions: 1->1
[2024-12-24T12:10:12.795+0000][7679][info][gc,metaspace   ] GC(179) Metaspace: 296387K(300352K)->296387K(300352K) NonClass: 258630K(260800K)->258630K(260800K) Class: 37756K(39552K)->37756K(39552K)
[2024-12-24T12:10:12.795+0000][7679][info][gc             ] GC(179) Pause Young (Normal) (G1 Evacuation Pause) 13361M->3699M(16384M) 82.112ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,cpu         ] GC(179) User=0.26s Sys=0.01s Real=0.08s
               total        used        free      shared  buff/cache   available
Mem:              30          29           1           0           0           0
Swap:              0           0           0
Tue Dec 24 12:10:25 UTC 2024


*** syslog OOM output 
grep -i oom /var/log/syslog

2024-12-24T12:14:35.481240+00:00 ip-172-30-20-32 kernel: http-worker-691 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T12:14:35.496433+00:00 ip-172-30-20-32 kernel:  oom_kill_process+0x116/0x270
2024-12-24T12:14:35.496434+00:00 ip-172-30-20-32 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T12:14:35.496435+00:00 ip-172-30-20-32 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T12:14:35.511859+00:00 ip-172-30-20-32 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7634,uid=0
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: Out of memory: Killed process 7634 (trino-server) total-vm:36015048kB, anon-rss:29973656kB, file-rss:2948kB, shmem-rss:0kB, UID:0 pgtables:61856kB oom_score_adj:0
2024-12-24T12:14:35.792898+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: A process of this unit has been killed by the OOM killer.
2024-12-24T12:14:35.803968+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
2024-12-24T12:14:35.806272+00:00 ip-172-30-20-32 systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.

The Trino coordinator is running on an EC2 r6a.xlarge instance with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11).

The jvm.config is:

-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-Djava.io.tmpdir=/data/tmpdir
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Dfile.encoding=UTF-8
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:+UnlockDiagnosticVMOptions
-XX:G1NumCollectionsKeepPinned=10000000
-XX:GCLockerRetryAllocationCount=32
-Xlog:gc*:/data/trino/var/log/gc.log:time,level,tags,tid:filecount=10,filesize=100M

I am now executing the same load on an r6a.2xlarge with double the memory, and it looks like an OOM will happen there soon as well.

Memory consumption after 1.5 hours:

===
467
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:31 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          24          32           0           4          36
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7766 root      20   0   44.6g  23.0g  30676 S  18.2  37.4  35:48.76 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  20.3  37.4  35:49.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  18.6  37.4  35:49.93 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.3  37.4  35:50.42 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  15.0  37.4  35:50.87 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.6  37.4  35:51.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:51.74 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.3  37.4  35:52.08 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  13.7  37.4  35:52.49 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:52.86 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.0  37.4  35:53.19 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S   7.6  37.4  35:53.42 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   7.3  37.4  35:53.64 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.0  37.4  35:53.82 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.02 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.22 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   8.0  37.4  35:54.46 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   6.3  37.4  35:54.65 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   7.0  37.4  35:54.86 trino-s+
   7766 root      20   0   44.6g  23.1g  30676 S 278.7  37.6  36:03.25 trino-s+
Tue Dec 24 15:01:29 UTC 2024
===
468
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:30 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          43          13           0           4          17
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7784 root      20   0   61.7g  42.1g  30820 S  10.0  68.6  39:12.02 trino-s+
   7784 root      20   0   61.7g  42.2g  30820 S   8.3  68.6  39:12.27 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   7.0  68.7  39:12.48 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S  10.3  68.7  39:12.79 trino-s+
   7784 root      20   0   61.9g  42.2g  30820 S   7.3  68.7  39:13.01 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.6  68.8  39:13.18 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.7  68.8  39:13.35 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   5.0  68.9  39:13.50 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   3.0  68.9  39:13.59 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.0  68.9  39:13.65 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.3  68.8  39:13.72 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   1.7  68.8  39:13.77 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.83 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.89 trino-s+
   7784 root      20   0   62.0g  42.4g  30820 S 181.4  69.0  39:19.35 trino-s+
   7784 root      20   0   62.9g  43.0g  30820 S 220.7  70.0  39:25.97 trino-s+
   7784 root      20   0   63.4g  43.5g  30820 S  41.2  70.8  39:27.21 trino-s+
   7784 root      20   0   63.8g  44.0g  30820 S  43.9  71.6  39:28.53 trino-s+
   7784 root      20   0   64.6g  44.7g  30820 S 155.0  72.7  39:33.18 trino-s+
   7784 root      20   0   64.9g  45.1g  30820 S  34.6  73.3  39:34.22 trino-s+
Tue Dec 24 15:01:27 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          46          10           0           4          14
Swap:              0           0           0

The OS OOM kill happens again:

$ grep oom /var/log/syslog

2024-12-24T17:14:34.706917+00:00 ip-172-30-21-101 kernel: query-execution invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T17:14:34.744709+00:00 ip-172-30-21-101 kernel:  oom_kill_process+0x116/0x270
2024-12-24T17:14:34.744711+00:00 ip-172-30-21-101 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T17:14:34.744712+00:00 ip-172-30-21-101 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T17:14:34.792485+00:00 ip-172-30-21-101 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T17:14:34.792510+00:00 ip-172-30-21-101 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7784,uid=0
2024-12-24T17:14:34.792511+00:00 ip-172-30-21-101 kernel: Out of memory: Killed process 7784 (trino-server) total-vm:74788372kB, anon-rss:63005676kB, file-rss:2412kB, shmem-rss:0kB, UID:0 pgtables:128700kB oom_score_adj:0
2024-12-24T17:14:37.446266+00:00 ip-172-30-21-101 kernel: oom_reaper: reaped process 7784 (trino-server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2024-12-24T17:14:37.759101+00:00 ip-172-30-21-101 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.

@guyco33 can you take a heap dump when GC pressure goes up?
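For reference, a heap dump can be captured from the shell with jcmd <pid> GC.heap_dump <path> (or jmap -dump:live,format=b,file=<path> <pid>), or triggered from inside the JVM through the standard HotSpotDiagnosticMXBean. A minimal in-process sketch using only the JDK API, nothing Trino-specific (the helper name and path handling are made up):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

final class HeapDumpHelper
{
    private HeapDumpHelper() {}

    // Writes an .hprof file for the JVM this code runs in; live=true keeps only reachable objects
    static void dumpHeap(String path)
            throws IOException
    {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap(path, true);
    }
}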

I ran the exact same test queries for about 1 hour on both 467 and 468 and created a trace dump comparison with VisualVM.
Maybe the increase in object counts explains the increase in overall memory, although the total heap size is less than 2GB.

[VisualVM trace dump comparison screenshots]

It looks like we need to look for something off-heap (the Java heap seems fine); a small diagnostic sketch follows the questions below.

  • @wendigo do you recall updating any libs which use off-heap memory in the 467..468 period? Anyway, this should rather impact workers, not the coordinator.
  • Does the Python functions support have the potential of using off-heap memory?
  • @guyco33 what connector/table format are you using?
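To narrow the off-heap suspicion down, two things that might help: enabling Native Memory Tracking on the coordinator (-XX:NativeMemoryTracking=summary in jvm.config, then jcmd <pid> VM.native_memory summary) to see which JVM subsystem grows, and comparing the JVM's own view of heap, non-heap, and direct buffer usage against the RSS reported by the OS. A minimal sketch of the latter using only standard JDK MXBeans (run inside the coordinator JVM, or read the same beans over JMX):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class OffHeapReport
{
    public static void main(String[] args)
    {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // Heap vs non-heap (metaspace, code cache, ...) as the JVM itself sees it
        System.out.println("heap used:     " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("non-heap used: " + memory.getNonHeapMemoryUsage().getUsed());

        // Direct and mapped NIO buffer pools are a common source of off-heap growth
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.println(pool.getName() + " buffers: count=" + pool.getCount() + " used=" + pool.getMemoryUsed());
        }
    }
}

If the JVM totals stay far below the roughly 30GB of anon RSS the kernel reported, the growth is likely outside normal JVM accounting (native allocations by a library, glibc arenas, etc.).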

@losipiuk not really

As for Python, this code

Instance instance = Instance.builder(PYTHON_MODULE)
        .withMachineFactory(PythonModule::create)
        .withImportValues(importValues)
        .build();
allocate = instance.export("allocate");
deallocate = instance.export("deallocate");
setup = instance.export("setup");
execute = instance.export("execute");
memory = instance.memory();

is potentially leaky. We are not closing the Instance explicitly - just assuming GC will do the job - and IDK whether it holds on to some native memory.
Also, it looks like we do not always have allocate/deallocate pairs, e.g. for argTypeAddress and returnTypeAddress in PythonEngine.doSetup. IDK if that is a problem or not. @electrum PTAL.
But this is probably not what @guyco33 is observing, as this should not be called if Python functions are not in use.
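
Regarding the allocate/deallocate pairing point above, for illustration only, enforcing the pairing could look like the following purely hypothetical sketch; the LongUnaryOperator/LongConsumer parameters stand in for the exported allocate/deallocate functions and are not the actual PythonEngine API:

import java.util.function.LongConsumer;
import java.util.function.LongUnaryOperator;

final class WasmAllocations
{
    private WasmAllocations() {}

    // Allocate a region, hand its address to the body, and always free it, even on exceptions
    static void withAllocation(LongUnaryOperator allocate, LongConsumer deallocate, long size, LongConsumer body)
    {
        long address = allocate.applyAsLong(size);
        try {
            body.accept(address);
        }
        finally {
            deallocate.accept(address);
        }
    }
}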

what connector/table format are you using?

I'm using PostgreSQL, MySQL, and Hive tables with ORC and CSV gzip file formats.

@losipiuk There is no close for Instance. It doesn't hold any native memory. The WASM memory is allocated on-heap as a single contiguous array. The argType and returnType pointers passed to setup() are saved and held onto for the lifetime of the WASM engine.

There is a potential leak of PythonEngine which creates a Jimfs instance that may get registered with the JDK file system machinery. We create a ScalarFunctionImplementation with an instanceFactory so that the engine can create one per-thread as needed. If this is a problem, we'll need to change the engine to support Closeable function instances, use java.lang.ref.Cleaner, etc. Also, we may need memory tracking for the function instances, as language functions will typically use more memory than other types of functions.
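
For what it's worth, the Cleaner variant could look roughly like the sketch below; this is only an illustration assuming the per-thread engine owns a Jimfs file system that needs closing, and PythonEngineSketch/closeQuietly are made-up names, not the actual code:

import com.google.common.jimfs.Configuration;
import com.google.common.jimfs.Jimfs;
import java.lang.ref.Cleaner;
import java.nio.file.FileSystem;

class PythonEngineSketch
{
    private static final Cleaner CLEANER = Cleaner.create();

    private final FileSystem fileSystem;

    PythonEngineSketch()
    {
        FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
        this.fileSystem = fs;
        // The cleanup action must capture only the resource, never the engine itself,
        // otherwise the engine stays reachable and is never cleaned
        CLEANER.register(this, () -> closeQuietly(fs));
    }

    private static void closeQuietly(FileSystem fs)
    {
        try {
            fs.close();
        }
        catch (Exception ignored) {
        }
    }
}

An explicit Closeable on the function instances would be more deterministic, but the Cleaner at least bounds how long an unreachable engine can keep its file system registered.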

That said, that seems unrelated to this issue, as there is no evidence they are being used here.