mikepenz/action-junit-report

Out of memory error on large project

Closed this issue · 14 comments

mumrah commented

I'm testing out this action for Apache Kafka. After running the unit tests with Gradle and then running this action, I get an error:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
   1: 0xb0a860 node::Abort() [/home/runner/runners/2.304.0/externals/node16/bin/node]
   2: 0xa1c193 node::FatalError(char const*, char const*) [/home/runner/runners/2.304.0/externals/node16/bin/node]
   3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/runner/runners/2.304.0/externals/node16/bin/node]
   4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/runner/runners/2.304.0/externals/node16/bin/node]
   5: 0xeb1685  [/home/runner/runners/2.304.0/externals/node16/bin/node]
   6: 0xeb2166  [/home/runner/runners/2.304.0/externals/node16/bin/node]
   7: 0xec068e  [/home/runner/runners/2.304.0/externals/node16/bin/node]
   8: 0xec10d0 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/runner/runners/2.304.0/externals/node16/bin/node]
   9: 0xec3fc5 v8::internal::Heap::HandleGCRequest() [/home/runner/runners/2.304.0/externals/node16/bin/node]
  10: 0xe51667 v8::internal::StackGuard::HandleInterrupts() [/home/runner/runners/2.304.0/externals/node16/bin/node]
  11: 0x11fda25 v8::internal::Runtime_StackGuard(int, unsigned long*, v8::internal::Isolate*) [/home/runner/runners/2.304.0/externals/node16/bin/node]
  12: 0x15f2d39  [/home/runner/runners/2.304.0/externals/node16/bin/node]

This is preceded by around 10k lines of output like:

 GroupCoordinatorAdapterTest:1 | testFetchOffsets()
  GroupCoordinatorAdapterTest:1 | [1] version=0
  GroupCoordinatorAdapterTest:1 | [2] version=1
  GroupCoordinatorAdapterTest:1 | [3] version=2
  GroupCoordinatorAdapterTest:1 | [4] version=3
  GroupCoordinatorAdapterTest:1 | [5] version=4
  GroupCoordinatorAdapterTest:1 | [6] version=5
  GroupCoordinatorAdapterTest:1 | [7] version=6
  GroupCoordinatorAdapterTest:1 | [8] version=7
  GroupCoordinatorAdapterTest:1 | [9] version=8

which seems to be the individual tests being processed from the JUnit test results. We have over 100k tests in our project.

Here's the workflow file I was using:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Gradle Build 
on: push
jobs:
  gradle-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: gradle/gradle-build-action@v2
        env:
          GRADLE_BUILD_ACTION_CACHE_DEBUG_ENABLED: true
        with:
          # Only write to the cache from trunk
          cache-read-only: ${{ github.ref != 'refs/heads/trunk' }}
          # Cache downloaded JDKs in addition to the default directories.
          gradle-home-cache-includes: |
              caches
              notifications
              jdks
          gradle-home-cache-cleanup: true
      - run: ./gradlew --continue --profile --build-cache unitTest
      - uses: mikepenz/action-junit-report@v3
        if: success() || failure() # always run even if the previous step fails
        with:
          report_paths: '**/build/test-results/**/TEST-*.xml'
          fail_on_failure: true

The job that produced this error: https://github.com/mumrah/kafka/actions/runs/5237578902/jobs/9455849520

mikepenz commented

Good day @mumrah. Thank you for the report.

Looking into the logged output: based on a brief scan, it seems some of the tests have quite long messages attached to them (KafkaLZ4Test). Overall, however, it's probably just a result of the sheer number of test cases, given that at this time all data is collected (also from passed tests).

Will look to adjust this behaviour.

mumrah commented

Thanks for taking a look @mikepenz. Would it be possible to optionally ignore the test output, or perhaps only capture the output from failed tests?

mikepenz commented

Yes, that's what I would be looking into. However, I haven't yet been able to spend time on that.

Would you please be so kind as to try out the following: #886
It would not keep as much data, which could have a major impact for the linked use case.

You can use it by referencing the commit hash:

mikepenz/action-junit-report@36988f6147e62eb1de69169a2589f02c081f59e8
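
For reference, pinning the report step from the workflow above to that commit would look roughly like this (a sketch; the inputs stay the same, only the ref changes):

      - uses: mikepenz/action-junit-report@36988f6147e62eb1de69169a2589f02c081f59e8
        if: success() || failure() # always run even if the previous step fails
        with:
          report_paths: '**/build/test-results/**/TEST-*.xml'
          fail_on_failure: true
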
mumrah commented

I tried it on this run and still see the error:

<--- Last few GCs --->
  
  [75263:0x4f4ee40]    41363 ms: Scavenge 2034.1 (2076.9) -> 2034.8 (2077.7) MB, 21.5 / 0.0 ms  (average mu = 0.201, current mu = 0.161) allocation failure 
  [75263:0x4f4ee40]    41750 ms: Scavenge (reduce) 2034.9 (2077.7) -> 2034.0 (2069.9) MB, 21.7 / 0.0 ms  (average mu = 0.201, current mu = 0.161) allocation failure 
  [75263:0x4f4ee40]    41830 ms: Scavenge (reduce) 2035.0 (2069.9) -> 2035.1 (2070.9) MB, 5.9 / 0.0 ms  (average mu = 0.201, current mu = 0.161) allocation failure 
  
  
  <--- JS stacktrace --->
  
  FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
   1: 0xb0a860 node::Abort() [/home/runner/runners/2.305.0/externals/node16/bin/node]
   2: 0xa1c193 node::FatalError(char const*, char const*) [/home/runner/runners/2.305.0/externals/node16/bin/node]
   3: 0xcf9a6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/home/runner/runners/2.305.0/externals/node16/bin/node]
   4: 0xcf9de7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/home/runner/runners/2.305.0/externals/node16/bin/node]
   5: 0xeb1685  [/home/runner/runners/2.305.0/externals/node16/bin/node]
   6: 0xeb2166  [/home/runner/runners/2.305.0/externals/node16/bin/node]
   7: 0xec068e  [/home/runner/runners/2.305.0/externals/node16/bin/node]
   8: 0xec10d0 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/home/runner/runners/2.305.0/externals/node16/bin/node]
   9: 0xec3fc5 v8::internal::Heap::HandleGCRequest() [/home/runner/runners/2.305.0/externals/node16/bin/node]
  10: 0xe51667 v8::internal::StackGuard::HandleInterrupts() [/home/runner/runners/2.305.0/externals/node16/bin/node]
  11: 0x11fda25 v8::internal::Runtime_StackGuard(int, unsigned long*, v8::internal::Isolate*) [/home/runner/runners/2.305.0/externals/node16/bin/node]
  12: 0x15f2d39  [/home/runner/runners/2.305.0/externals/node16/bin/node]

Are there debug logs for the action I can enable?

mumrah commented

It looks like it's failing at the same point:

 GroupCoordinatorAdapterTest:1 | [6] version=5
  GroupCoordinatorAdapterTest:1 | [7] version=6
  GroupCoordinatorAdapterTest:1 | [8] version=7
  GroupCoordinatorAdapterTest:1 | [9] version=8

Could it be the case that one of the XML test reports is too large to parse and transform to JSON?

If I can see which file is being processed, I might be able to reduce the verbosity of problematic tests.

mikepenz commented

Added more debug logs: https://github.com/mikepenz/action-junit-report/commits/fix/880_logs

mikepenz/action-junit-report@381e0e640679f1d3b412271c4c6804d6b710a736

Can you please re-run it (and then enable debug logs for it)?

(Screenshot: enabling debug logging when re-running the job.)
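
(For context: the ##[debug] lines quoted below come from GitHub's step-debug logging, which can also be enabled for every run by setting the repository secret or variable ACTIONS_STEP_DEBUG to true, in addition to the "Enable debug logging" option offered when re-running a job.)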

mumrah commented

Seems it is indeed a large XML file causing the issue:

2023-06-19T23:39:36.7515794Z ##[debug]Parsing report file: /home/runner/work/kafka/kafka/core/build/test-results/unitTest/TEST-kafka.coordinator.group.GroupCoordinatorConcurrencyTest.xml
2023-06-19T23:39:36.7516599Z ##[debug]Parsing file /home/runner/work/kafka/kafka/core/build/test-results/unitTest/TEST-kafka.coordinator.group.GroupCoordinatorConcurrencyTest.xml
2023-06-19T23:40:19.6146107Z 
2023-06-19T23:40:19.6147157Z <--- Last few GCs --->
2023-06-19T23:40:19.6147607Z 
2023-06-19T23:40:19.6148454Z [13892:0x53aee40]    55114 ms: Scavenge 2025.2 (2062.4) -> 2025.2 (2067.2) MB, 13.5 / 0.0 ms  (average mu = 0.183, current mu = 0.141) allocation failure 
2023-06-19T23:40:19.6149608Z [13892:0x53aee40]    55150 ms: Scavenge 2028.1 (2067.2) -> 2028.3 (2068.4) MB, 19.8 / 0.0 ms  (average mu = 0.183, current mu = 0.141) allocation failure 
2023-06-19T23:40:19.6344420Z [13892:0x53aee40]    55174 ms: Scavenge 2029.1 (2068.4) -> 2028.8 (2079.2) MB, 21.6 / 0.0 ms  (average mu = 0.183, current mu = 0.141) allocation failure 

This test, kafka.coordinator.group.GroupCoordinatorConcurrencyTest, produces a 68 MB XML file, which seems to come from an abundance of WARN logs. I'll try to reduce the logging output to ERROR for this test and see if the report generator can get past it.

An alternative would be a streaming XML parser that could handle big files like this, but that seems like a fairly complex improvement.

mikepenz commented

You may also try to increase the heap for Node with something like this:

env:
  # https://github.com/actions/runner-images/issues/70
  NODE_OPTIONS: "--max_old_space_size=4096"
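
As a sketch, applied at the job level of the workflow from the start of this thread, that could look like:

jobs:
  gradle-build:
    runs-on: ubuntu-latest
    env:
      # https://github.com/actions/runner-images/issues/70
      NODE_OPTIONS: "--max_old_space_size=4096"
    steps:
      - uses: actions/checkout@v3
      # ... build and test steps as above ...
      - uses: mikepenz/action-junit-report@v3
        if: success() || failure()
        with:
          report_paths: '**/build/test-results/**/TEST-*.xml'
          fail_on_failure: true
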
mumrah commented

@mikepenz increasing the heap size for Node allowed me to get a successful build:

https://github.com/mumrah/kafka/actions/runs/5322654319/jobs/9639897682

Thanks for helping me debug this! Feel free to close this issue.

mikepenz commented

Wow. 20k tests

Thank you for being available to answer all my questions and try different solutions.

Also you came up with the solution :D as you had the right hunch!

If you would be so kind, you could open a PR adding a note to the README on how to handle huge test suites by increasing the heap.

Thank you very much once again!