High CPU usage
twagoo opened this issue · 5 comments
The current version of the harvester (oai-harvest-manager-0.5.a06cc3) seems to use a lot more CPU than before in production (vlo.clarin.eu). I'm not sure whether this is due to configuration or to changes in the code. The Grafana screenshot below shows the difference (relatively low peaks on the left; the empty part in the middle is due to a Grafana disruption). To prevent problems in the future, I think it would be wise to look into this.
My first suggestion would be to look into the memory usage (garbage collection in particular) and other process details using jconsole, by enabling jmxremote on the harvester's JVM.
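For reference, remote JMX can usually be switched on with the standard com.sun.management JVM flags; the port below is just an example, and without authentication/SSL access should be restricted to a trusted network:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

jconsole (or VisualVM) can then connect to host:9010 to watch heap usage and GC activity during a harvest run.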
I guess this was to be expected/intentional. The old runs used the ListIdentifiers scenario, i.e., a separate request was made to the endpoint for each record, so a lot of the time the harvester was just waiting for a response. Now we use the ListRecords scenario, i.e., each request returns a batch of records, which means less I/O and more processing can happen. The additional conversion to 1.2 also means that more processing needs to happen.
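To illustrate the difference in request patterns (endpoint URL and metadataPrefix are placeholders, not the actual configuration):

old scenario (ListIdentifiers): one extra GetRecord request per record
  http://example.org/oai?verb=ListIdentifiers&metadataPrefix=cmdi
  http://example.org/oai?verb=GetRecord&identifier=oai:example.org:rec1&metadataPrefix=cmdi
  http://example.org/oai?verb=GetRecord&identifier=oai:example.org:rec2&metadataPrefix=cmdi
  ...

new scenario (ListRecords): each response carries a batch of full records
  http://example.org/oai?verb=ListRecords&metadataPrefix=cmdi
  http://example.org/oai?verb=ListRecords&resumptionToken=...

So the per-record network wait largely disappears, and the same amount of content arrives in fewer, larger responses that all need to be parsed and converted, shifting the workload from I/O-bound towards CPU-bound.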
-Xmx1G was added because some very big CMDI records cause a spike in memory usage, but on average the records are small and we use streams instead of DOMs most of the time, so memory needs should be low in general (and jconsole runs before the 1.2 release have confirmed this).
If the load is too high for the server we could limit the number of concurrent harvests, which is 6 at the moment (https://github.com/clarin-eric/oai-harvest-manager/blob/master/src/main/resources/config-clarin.xml#L16). It's also possible to mark some of the bigger endpoints as exclusive so they'll run alone, which can help in reducing the memory load (a bit).
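Purely as an illustration of the kind of change this would be (the element and attribute names below are made up; the actual syntax is whatever config-clarin.xml, linked above, uses):

<!-- hypothetical sketch, names are illustrative only -->
<settings>
    <!-- fewer harvests running in parallel (currently 6) -->
    <max-concurrent-harvests>4</max-concurrent-harvests>
</settings>
<providers>
    <!-- run a known heavy endpoint on its own -->
    <provider url="http://big-endpoint.example.org/oai" exclusive="true"/>
</providers>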
Memory load is not an issue; we seem to have more than enough. Load gets a bit high for sustained periods of time, but it does not seem to be affecting the performance of the VLO. Maybe it does affect the performance of the OAI provider itself, and tweaking the number of concurrent harvests is probably the best way to find out. I like the suggestion of marking certain endpoints as exclusive, but I'm not sure it's worth the added complexity.
For now let's just keep an eye on it? And increase the max heap (-Xmx2G maybe?) just to be sure because that seems absolutely safe?
Some observations based on heap dumps and Grafana data
- Memory usage
- Start of harvest: used ~1.3g, cached ~0.9g
- Peak: used ~2.2g, cached ~10.3g
- Normal due to high data throughput, or a consequence of the large number of GCs?
- GC behaviour
jstat -gc -t 22109 1000
Timestamp S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
51286.5 5632.0 7680.0 96.0 0.0 1382400.0 0.0 495616.0 183035.0 32768.0 31917.4 3584.0 3415.5 297100 614.594 282523 45726.384 46340.978
51287.5 7168.0 8704.0 7162.5 0.0 1380352.0 0.0 495616.0 187112.6 32768.0 31917.4 3584.0 3415.5 297104 614.610 282526 45727.147 46341.757
51288.5 7168.0 7680.0 0.0 128.0 1382400.0 0.0 495616.0 170187.5 32768.0 31917.4 3584.0 3415.5 297109 614.618 282531 45728.336 46342.954
51289.5 6144.0 6144.0 0.0 64.0 1382912.0 0.0 495616.0 169173.4 32768.0 31917.4 3584.0 3415.5 297113 614.625 282535 45729.191 46343.816
51290.5 5120.0 5120.0 0.0 64.0 1382912.0 0.0 495616.0 170061.6 32768.0 31917.4 3584.0 3415.5 297117 614.632 282539 45730.122 46344.754
51291.6 4096.0 4608.0 2628.1 0.0 1382400.0 0.0 495616.0 170071.4 32768.0 31917.4 3584.0 3415.5 297122 614.641 282544 45731.142 46345.783
51292.6 4096.0 4096.0 0.0 0.0 1382912.0 0.0 495616.0 172526.2 32768.0 31917.4 3584.0 3415.5 297126 614.648 282548 45732.215 46346.864
51293.6 5120.0 7680.0 5114.8 0.0 1380352.0 0.0 495616.0 181158.8 32768.0 31917.4 3584.0 3415.5 297130 614.666 282552 45733.037 46347.704
51294.6 6656.0 7168.0 3584.0 0.0 1382400.0 0.0 495616.0 183683.7 32768.0 31917.4 3584.0 3415.5 297134 614.680 282556 45734.058 46348.737
51295.6 6144.0 6144.0 0.0 0.0 1382912.0 48669.5 495616.0 186992.5 32768.0 31917.4 3584.0 3415.5 297138 614.685 282559 45735.048 46349.733
51296.6 6656.0 6656.0 0.0 2364.8 1383424.0 0.0 495616.0 189515.9 32768.0 31917.4 3584.0 3415.5 297143 614.698 282564 45735.969 46350.667
- Roughly 4 full GCs per second: over the ~10 s sample above, FGC goes from 282523 to 282564 (41 full collections) and FGCT grows by ~9.6 s, so nearly the whole interval is spent in full GC (young GC time, YGCT, grows by only ~0.1 s)
- Very high cumulative number of full GCs (over 280k)
- Full GCs are explicitly requested: oai-harvest-manager/src/main/java/ORG/oclc/oai/harvester2/verb/HarvesterVerb.java:357
// maybe we're lucky
System.gc();
logger.debug("Content-Length=" + con.getHeaderField("Content-Length"));
ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
logger.debug("Heap="+ManagementFactory.getMemoryMXBean().getHeapMemoryUsage());
logger.debug("NonHeap="+ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage());
List<MemoryPoolMXBean> beans = ManagementFactory.getMemoryPoolMXBeans();
for (MemoryPoolMXBean bean: beans) {
    logger.debug(bean.getName()+"="+bean.getUsage());
}
for (GarbageCollectorMXBean bean: ManagementFactory.getGarbageCollectorMXBeans()) {
    logger.debug(bean.getName()+"="+bean.getCollectionCount(), bean.getCollectionTime());
}
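Side note: System.gc() is only a request, but with the default collectors it typically results in a full, stop-the-world collection per call, which would be consistent with the roughly 4 full GCs per second seen in the jstat output above. As a quick experiment that needs no code change, explicit GC requests can be turned into no-ops with a standard JVM flag:

-XX:+DisableExplicitGC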
- Investigation steps
- Make behaviour w.r.t. GC requests configurable (a possible sketch below this list)
- Investigate a potential memory leak (not likely; the high CPU already starts when the heap is still relatively small)
- Try a higher -Xmx (8G?); easy, but no effect expected
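As a minimal sketch of what configurable GC requests could look like (the system property name harvester.explicitGC is made up for illustration and is not an existing option):

// Sketch only: gate the explicit GC request behind a JVM system property,
// disabled by default, so production runs can skip it entirely.
public final class GcPolicy {

    // true only if the JVM was started with -Dharvester.explicitGC=true
    private static final boolean EXPLICIT_GC_ENABLED =
            Boolean.getBoolean("harvester.explicitGC");

    private GcPolicy() {
    }

    /** Requests a collection only when explicitly enabled. */
    public static void maybeRequestGc() {
        if (EXPLICIT_GC_ENABLED) {
            // maybe we're lucky
            System.gc();
        }
    }
}

HarvesterVerb could then call GcPolicy.maybeRequestGc() instead of System.gc() directly. With the hardcoded call removed in 09fa093 this may be moot, but it would keep the option of requesting a GC after very large responses.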
09fa093 removed the hardcoded System.gc()