LinkedInAttic/JTune

Are negative CMSInitiatingOccupancyFraction values valid?

Closed this issue · 7 comments

I have been running JTune on a couple of my servers and applying the tuning suggestions. I re-run it and re-apply those new values. It appears I'm honing in on an optimal set of JVM flags for the server, but I get what looks like a really odd -XX:CMSInitiatingOccupancyFraction value:

    * Reading gc.log file... done. Scanned 3008 lines in 0.0016 seconds.

    Meta:
    ~~~~~
    Sample Time:    29m13s (1753 seconds)
    System Uptime:  52d49m
    CPU Uptime:     104d1h
    Proc Uptime:    1m50s
    Proc Usertime:  2m22s (0.00%)
    Proc Systime:   5s (0.00%)
    Proc RSS:       1.71G
    Proc VSize:     5.07G
    Proc # Threads: 93

    YG Allocation Rates*:
    ~~~~~~~~~~~~~~~~~~~~~
    per sec (min/mean/max):       1.42M/s     145.62M/s     470.29M/s
    per day (min/mean/max):     119.83G/d         12T/d      38.75T/d

    OG Promotion Rates:
    ~~~~~~~~~~~~~~~~~~~
    per sec (min/mean/max):          9K/s      47.39M/s     583.80M/s
    per hr (min/mean/max):       31.63M/h     166.60G/h          2T/h

    Survivor Death Rates:
    ~~~~~~~~~~~~~~~~~~~~~
    Lengths (min/mean/max): 0/1.9/12
    Death Rate Breakdown:
       Age 1:  0.0% / 32.9% / 100.0% / 67.1% (min/mean/max/cuml alive %)
       Age 2: -0.4% / 11.3% / 89.2% / 59.5% (min/mean/max/cuml alive %)
       Age 3: -0.2% /  1.5% / 50.8% / 58.7% (min/mean/max/cuml alive %)
       Age 4: -0.0% /  1.0% / 40.9% / 58.0% (min/mean/max/cuml alive %)
       Age 5: -0.2% /  0.6% / 54.4% / 57.7% (min/mean/max/cuml alive %)
       Age 6: -0.0% /  0.6% / 31.7% / 57.4% (min/mean/max/cuml alive %)
       Age 7: -0.0% /  0.5% / 48.7% / 57.1% (min/mean/max/cuml alive %)
       Age 8:  0.0% /  0.3% / 23.4% / 56.9% (min/mean/max/cuml alive %)
       Age 9: -0.2% /  0.1% /  5.7% / 56.9% (min/mean/max/cuml alive %)
       Age 10: -0.0% /  0.1% / 12.7% / 56.8% (min/mean/max/cuml alive %)
       Age 11: -0.2% /  0.1% / 15.6% / 56.8% (min/mean/max/cuml alive %)
       Age 12: -0.0% /  0.0% /  0.5% / 56.8% (min/mean/max/cuml alive %)
       Age 13: -0.0% /  0.0% /  8.0% / 56.8% (min/mean/max/cuml alive %)
       Age 14:  0.0% /  0.1% / 22.9% / 56.7% (min/mean/max/cuml alive %)

    GC Information:
    ~~~~~~~~~~~~~~~
    YGC/FGC Count: 430/12 (Rate: 14.72/min, 0.41/min)

    GC Load (since JVM start): 3.80%
    Sample Period GC Load:     3.20%

    CMS Sweep Times: 2.326s /  4.335s /  5.275s / 1.21 (min/mean/max/stdev)
    YGC Times:       0ms / 122ms / 570ms / 100.47 (min/mean/max/stdev)
    FGC Times:       0ms / 51ms / 112ms / 30.62 (min/mean/max/stdev)
    Agg. YGC Time:   55480ms
    Agg. FGC Time:   673ms

    Est. Time Between FGCs (min/mean/max):          4d6h       1m8s         5s
    Est. OG Size for 1 FGC/hr (min/mean/max):     31.63M    166.60G         2T

    Overall JVM Efficiency Score*: 96.797%

    Current JVM Configuration:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~
              NewSize: 172M
              OldSize: 5.19M
        SurvivorRatio: 1
     MinHeapFreeRatio: 40
     MaxHeapFreeRatio: 70
          MaxHeapSize: 3.34G
             PermSize: 240M
             NewRatio: 2

    Recommendation Summary:
    ~~~~~~~~~~~~~~~~~~~~~~~
    Warning: The process I'm doing the analysis on has been up for 1m50s,
    and may not be in a steady-state. It's best to let it be up for more
    than 5 minutes to get more realistic results.

    * Warning: The calculated recommended survivor ratio of 0.46 is less than 1.
    This is not possible, so I increased the size of newgen by 87.43M, and set the
    survivor ratio to 1. Try the tuning suggestions, and watch closely.

    - With a mean YGC time goal of 50ms, the suggested (optimized for a
    YGC rate of 33.55/min) size of NewGen (including adjusting for
    calculated max tenuring size) considering the above criteria should be
    163 MiB (currently: 172 MiB).
    - Because we're decreasing the size of NewGen, it can have an impact
    on system load due to increased memory management requirements.
    There's not an easy way to predict the impact to the application, so
    watch this after it's tuned.
    - It's recommended to have the PermGen size 1.2-1.5x (used 1.5x) the size of the
    live PermGen size. New recommended size is 241MiB (currently: 240MiB).
    - Looking at the worst (max) survivor percentages for all the ages, it looks
    like a TenuringThreshold of 5 is ideal.
    - The survivor size should be 2x the max size for tenuring threshold
    of 5 given above. Given this, the survivor size of 163M is ideal.
    - To ensure enough survivor space is allocated, a survivor ratio of 1 should be
    used.
    - It's recommended to have the max heap size 3-4x the size of the live data size
    (OldGen + PermGen), and adjusted to include the recommended survivor and newgen
    size. New recommended size is 4293MiB (currently: 3416MiB).
    - With a max 99th percentile OG promotion rate of 122.10M/s, and the max CMS
    sweep time of 5.275s, you should not have a occupancy fraction any higher than
    -12363.

    Java G1 Settings:
    ~~~~~~~~~~~~~~~~~~~
    - With a max ygc stdev of 46.95, and a 99th percentile ygc mean ms of 190ms,
    your config is probably not ready to move to the G1 garbage collector. Try
    tuning the JVM, and see if that improves things first.

    The JVM arguments from the above recommendations:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    -Xmx4293m -Xms4293m -Xmn163m -XX:SurvivorRatio=1 -XX:MaxTenuringThreshold=5
    -XX:CMSInitiatingOccupancyFraction=-12363 -XX:PermSize=241m -XX:MaxPermSize=241m
    ~~~

    * The allocation rate is the increase in usage before a GC is done. Growth rate
      is the increase in usage after a GC is done.

    * The JVM efficiency score is a convenient way to quantify how efficient the
      JVM is. The most efficient JVM is 100% (pretty much impossible to obtain).

    * A copy of the critical data used to generate this report is stored
      in /tmp/jpulse_data-eaihost.bin.bz2. Please copy this to your homedir if you
      want to save/analyze this further.

Huh. That's interesting. What do you have your -Xms and -Xmx set to? Can you email me your playback file (/tmp/jpulse_data-eaihost.bin.bz2) at ebullen@linkedin.com? That'll help me look at the data it used. This file only contains JVM data used for calculations, and contains no personally identifiable information. Thanks!

Sure thing. I am queueing up an email right now.

Also, I need to know what your -Xms -Xmx values are set to.

Currently, I'm at -Xmx4031m -Xms4031m, but that is a change since the files I last sent. Do you want me to send you the most recent jpulse files based on these heap settings?

Were they both set to the same value when you got the negative CMSInitiatingOccupancyFraction?

Yes.

Your JVM was only running for 1m50s, and the report warns you that if your JVM has been running for less than 5 minutes you may see weird results. In this case your OG growth rates are very high (due to a newly started JVM) and your CMS sweep times are around 5 seconds, so the calculations became unreliable. Try it again after waiting 5-10 minutes (the JVM needs to be in a steady state, AND needs to be under peak load, for JTune to work correctly).
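For what it's worth, the negative bound is consistent with simple headroom arithmetic: CMS has to start its sweep early enough that the old gen doesn't fill while the sweep runs, so the safe occupancy fraction is whatever is left after subtracting (promotion rate × sweep time) from the old-gen size. The sketch below is not JTune's actual code; the function name and the assumption that the tiny reported "OldSize: 5.19M" fed the calculation are guesses:

```python
def occupancy_fraction_bound(old_gen_mb, promotion_mb_per_s, sweep_time_s):
    """Highest safe occupancy fraction (as a percentage): the headroom left
    in the old gen after reserving room for promotions during one CMS sweep."""
    headroom_needed_mb = promotion_mb_per_s * sweep_time_s
    return 100.0 * (old_gen_mb - headroom_needed_mb) / old_gen_mb

# Numbers from the report: 122.10M/s 99th-pct promotion, 5.275s max sweep.
# With a healthy old gen of ~3 GiB, the bound is a sane positive percentage:
print(round(occupancy_fraction_bound(3170, 122.10, 5.275)))   # ~80

# But if the calculation picks up a tiny old-gen size (the report shows
# "OldSize: 5.19M" for a JVM only 1m50s old), the bound goes hugely negative,
# in the same ballpark as the -12363 in the report:
print(round(occupancy_fraction_bound(5.19, 122.10, 5.275)))   # ~-12310
```

This is why a freshly started JVM produces nonsense here: the inflated promotion rate and the not-yet-grown old gen both push the bound negative, and both settle down once the process reaches steady state.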

Please make sure that you read the warnings, and adjust accordingly.