clarin-eric/oai-harvest-manager

High CPU usage

twagoo opened this issue · 5 comments

The current version of the harvester (oai-harvest-manager-0.5.a06cc3) seems to use a lot more CPU than before in production (vlo.clarin.eu). I'm not sure whether this is due to configuration or to changes in the code. The Grafana screenshot below shows the difference (relatively low peaks on the left; the empty part in the middle is due to a Grafana disruption). To prevent problems in the future, I think it would be wise to look into this.

[screenshot: Grafana CPU usage graph, 2016-07-19 09:14]

top -c on the server:
[screenshot: top -c output, 2016-07-19 09:15]

My first suggestion would be to look into memory usage (garbage collection in particular) and other process details with jconsole, by enabling jmxremote on the harvester JVM.
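For reference, remote JMX could be enabled with JVM options along these lines (the port number, and whether it is acceptable to disable authentication/SSL, are assumptions that depend on the server setup):

    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

jconsole (or jvisualvm) can then connect to host:9010 and show heap usage and GC activity over time.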

I guess this was to be expected/intentional. The old runs used the ListIdentifiers scenario, i.e., a separate request to the endpoint was made for each record, which meant the harvester spent a lot of time waiting for responses. Now we use the ListRecords scenario, i.e., each request returns a batch of records, which means less IO and more processing can happen. The additional conversion to CMDI 1.2 also means that more processing needs to happen.
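To illustrate the difference between the two scenarios, the request patterns against an endpoint look roughly like this (the endpoint URL and metadataPrefix are placeholders):

    # ListIdentifiers scenario: one extra GetRecord request per record
    GET http://endpoint.example.org/oai?verb=ListIdentifiers&metadataPrefix=cmdi
    GET http://endpoint.example.org/oai?verb=GetRecord&identifier=oai:example:1&metadataPrefix=cmdi
    GET http://endpoint.example.org/oai?verb=GetRecord&identifier=oai:example:2&metadataPrefix=cmdi
    ...

    # ListRecords scenario: each response carries a batch of full records
    GET http://endpoint.example.org/oai?verb=ListRecords&metadataPrefix=cmdi
    GET http://endpoint.example.org/oai?verb=ListRecords&resumptionToken=...

With ListRecords the harvester spends much less time blocked on network round trips and much more time parsing and converting, so the workload shifts from IO-bound to CPU-bound, which would show up as higher CPU usage.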

-Xmx1G was added because of some very big CMDI records that cause a spike in memory demand, but on average the records are small and we use streams instead of DOMs most of the time, so memory needs should be low in general (and jconsole runs before the 1.2 release have confirmed this).
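As a side note on why the streaming approach keeps memory flat: a streaming (StAX) parser only holds the current parse event, whereas a DOM parse materializes the whole record in memory. A minimal sketch of the streaming style (not the harvester's actual code):

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;
    import java.io.InputStream;

    public final class StreamingRecordCounter {

        // Counts <record> elements without building a DOM tree, so memory use
        // stays roughly constant regardless of the size of the response.
        public static int countRecords(InputStream in) throws XMLStreamException {
            XMLStreamReader reader = XMLInputFactory.newFactory().createXMLStreamReader(in);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
            return count;
        }
    }

The occasional very big CMDI record is what the -Xmx1G headroom is meant to absorb.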

If the load is too high for the server, we could limit the number of concurrent harvests, which is 6 at the moment (https://github.com/clarin-eric/oai-harvest-manager/blob/master/src/main/resources/config-clarin.xml#L16). It's also possible to mark some of the bigger endpoints as exclusive so that they run alone, which can help in reducing the memory load (a bit).
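Purely as an illustration of those two knobs (the element and attribute names below are placeholders; the actual names are in config-clarin.xml):

    <!-- hypothetical names, for illustration only -->
    <settings>
        <!-- lower this to reduce sustained CPU load -->
        <max-concurrent-harvests>4</max-concurrent-harvests>
    </settings>
    ...
    <!-- let a heavy endpoint run on its own, not in parallel with others -->
    <provider url="http://big-endpoint.example.org/oai" exclusive="true"/>

Lowering the concurrency mainly trades total harvest duration for lower sustained load; marking endpoints as exclusive only helps while those particular harvests are running.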

Memory load is not an issue; we seem to have more than enough. The load does get a bit high for sustained periods of time, but it does not seem to be affecting the performance of the VLO. Maybe it does affect the performance of the OAI provider itself, and tweaking the number of concurrent harvests would probably be the best way to find out. I like the suggestion of marking certain endpoints as exclusive, but I'm not sure it's worth the added complexity.

For now, let's just keep an eye on it? And increase the max heap (-Xmx2G maybe?) just to be sure, since that seems entirely safe?

Some observations based on heap dumps and Grafana data

  • Memory usage
    • Start of harvest: used ~1.3g, cached ~0.9g
    • Peak: used ~2.2g, cached ~10.3g
    • Normal given the high data throughput, or a consequence of the large number of GCs?
  • GC behaviour
jstat  -gc -t 22109 1000
Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT   
        51286.5 5632.0 7680.0  96.0   0.0   1382400.0   0.0     495616.0   183035.0  32768.0 31917.4 3584.0 3415.5 297100  614.594 282523 45726.384 46340.978
        51287.5 7168.0 8704.0 7162.5  0.0   1380352.0   0.0     495616.0   187112.6  32768.0 31917.4 3584.0 3415.5 297104  614.610 282526 45727.147 46341.757
        51288.5 7168.0 7680.0  0.0   128.0  1382400.0   0.0     495616.0   170187.5  32768.0 31917.4 3584.0 3415.5 297109  614.618 282531 45728.336 46342.954
        51289.5 6144.0 6144.0  0.0    64.0  1382912.0   0.0     495616.0   169173.4  32768.0 31917.4 3584.0 3415.5 297113  614.625 282535 45729.191 46343.816
        51290.5 5120.0 5120.0  0.0    64.0  1382912.0   0.0     495616.0   170061.6  32768.0 31917.4 3584.0 3415.5 297117  614.632 282539 45730.122 46344.754
        51291.6 4096.0 4608.0 2628.1  0.0   1382400.0   0.0     495616.0   170071.4  32768.0 31917.4 3584.0 3415.5 297122  614.641 282544 45731.142 46345.783
        51292.6 4096.0 4096.0  0.0    0.0   1382912.0   0.0     495616.0   172526.2  32768.0 31917.4 3584.0 3415.5 297126  614.648 282548 45732.215 46346.864
        51293.6 5120.0 7680.0 5114.8  0.0   1380352.0   0.0     495616.0   181158.8  32768.0 31917.4 3584.0 3415.5 297130  614.666 282552 45733.037 46347.704
        51294.6 6656.0 7168.0 3584.0  0.0   1382400.0   0.0     495616.0   183683.7  32768.0 31917.4 3584.0 3415.5 297134  614.680 282556 45734.058 46348.737
        51295.6 6144.0 6144.0  0.0    0.0   1382912.0 48669.5   495616.0   186992.5  32768.0 31917.4 3584.0 3415.5 297138  614.685 282559 45735.048 46349.733
        51296.6 6656.0 6656.0  0.0   2364.8 1383424.0   0.0     495616.0   189515.9  32768.0 31917.4 3584.0 3415.5 297143  614.698 282564 45735.969 46350.667
  • Very high number of full GCs: roughly 4 per second (FGC rises from 282523 to 282564 over the ~10 s jstat sample above)
    • Explicitly requested: oai-harvest-manager/src/main/java/ORG/oclc/oai/harvester2/verb/HarvesterVerb.java:357
      // maybe we're lucky
        System.gc();
        logger.debug("Content-Length=" + con.getHeaderField("Content-Length"));
        ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        logger.debug("Heap="+ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
        logger.debug("NonHeap="+ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage());
        List<MemoryPoolMXBean> beans = ManagementFactory.getMemoryPoolMXBeans();
        for (MemoryPoolMXBean bean: beans) {
            logger.debug(bean.getName()+"="+bean.getUsage());
        }
        for (GarbageCollectorMXBean bean: ManagementFactory.getGarbageCollectorMXBeans()) {
            logger.debug(bean.getName()+"="+bean.getCollectionCount(), bean.getCollectionTime());
        }
  • Investigation steps
    • Make behaviour w.r.t. GC requests configurable (see the sketch below this list)
    • Investigate a potential memory leak (unlikely: the high CPU usage already starts while the heap is relatively small)
    • Try a higher -Xmx (8G?); easy to do, but no effect expected
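As a rough sketch of what making the GC requests configurable could look like (the class, method, and property names below are made up for illustration; alternatively, explicit calls can be neutralized JVM-wide with -XX:+DisableExplicitGC):

    // Minimal sketch; names are illustrative, not the harvester's actual API.
    public final class GcPolicy {

        // Off by default; enable with -Dharvester.explicitGC=true when needed.
        private static final boolean EXPLICIT_GC =
                Boolean.parseBoolean(System.getProperty("harvester.explicitGC", "false"));

        private GcPolicy() {
        }

        // Replaces a hardcoded System.gc() with an opt-in request.
        public static void maybeRequestGc() {
            if (EXPLICIT_GC) {
                System.gc();
            }
        }
    }

Since the full collections are explicitly requested (see the code quoted above), disabling the call, or removing it as was eventually done, should eliminate most of the full GCs observed in the jstat output.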

Commit 09fa093 removed the hardcoded System.gc() call.