SuffolkLITLab/EfileProxyServer

Don't load entire codes XML into memory


Have been running into heap overflows (out-of-memory crashes) when updating codes. Still trying to narrow down exactly which state triggers it, but it fails while executing the batch update in the Postgres driver.

One possible way to reduce memory pressure at this point is to stop unmarshalling the entire CodeListDocument at once and instead read each row individually. I think the best idea here is to do something like this: https://stackoverflow.com/a/16935069/11416267 at lines 164 and 175 of CodeDatabase. We only need the codes version and each individual row.

If there are still issues, we can look into running separate, non-batched Postgres updates, or simply making the batches smaller once a table has more than some number of rows.
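
A minimal sketch of the "smaller batches" fallback, assuming rows are bound as plain strings to a JDBC PreparedStatement. The class name `ChunkedBatchWriter`, the method `writeInChunks`, and the `chunkSize` parameter are hypothetical and not taken from CodeDatabase; the right chunk size would need to be measured.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper: flushes a JDBC batch every chunkSize rows. */
final class ChunkedBatchWriter {
  private ChunkedBatchWriter() {}

  /**
   * Binds each row to stmt and calls executeBatch() every chunkSize rows,
   * so the pending batch should never grow beyond chunkSize rows.
   * Returns the number of executeBatch() calls that were made.
   */
  static int writeInChunks(PreparedStatement stmt, Iterable<String[]> rows, int chunkSize)
      throws SQLException {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    int pending = 0;
    int flushes = 0;
    for (String[] row : rows) {
      for (int i = 0; i < row.length; i++) {
        stmt.setString(i + 1, row[i]); // JDBC parameters are 1-indexed
      }
      stmt.addBatch();
      pending++;
      if (pending == chunkSize) {
        stmt.executeBatch(); // send this chunk and reset the pending batch
        pending = 0;
        flushes++;
      }
    }
    if (pending > 0) {
      stmt.executeBatch(); // flush the final partial chunk
      flushes++;
    }
    return flushes;
  }
}
```

Combined with row-by-row XML reading, `rows` could be a lazy Iterable, so that neither the parsed document nor the full parameter list has to sit on the heap at once.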

TODO:

  • download a sample codes XML (OptionalServices are generally very large)
  • test out the XML event reader approach from the Stack Overflow link above
  • check whether it actually relieves memory pressure (not yet sure how to measure this)
  • integrate it back into the CodeDatabase class
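
For the event-reader item above, here is a rough sketch of reading a codes file one row at a time with plain StAX (`javax.xml.stream`) instead of unmarshalling the whole document. It assumes the usual genericode layout (a `Version` element under `Identification`, and `Row` elements containing `Value ColumnRef="..."` / `SimpleValue` pairs). The class and method names are hypothetical, and this is not the exact approach from the linked answer.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical streaming reader: only one Row is held in memory at a time. */
final class StreamingCodeListReader {
  private StreamingCodeListReader() {}

  /** Passes each Row to onRow as a ColumnRef-to-value map; returns the first Version text seen. */
  static String readRows(Reader xml, Consumer<Map<String, String>> onRow)
      throws XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newFactory();
    // Codes files should not need DTDs or external entities, so turn them off.
    factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
    factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
    XMLStreamReader reader = factory.createXMLStreamReader(xml);
    String version = null;
    try {
      while (reader.hasNext()) {
        if (reader.next() != XMLStreamConstants.START_ELEMENT) {
          continue;
        }
        String name = reader.getLocalName();
        if ("Version".equals(name) && version == null) {
          version = reader.getElementText().trim();
        } else if ("Row".equals(name)) {
          onRow.accept(readRow(reader));
        }
      }
    } finally {
      reader.close();
    }
    return version;
  }

  /** Reads Value/SimpleValue pairs until the closing Row tag. */
  private static Map<String, String> readRow(XMLStreamReader reader) throws XMLStreamException {
    Map<String, String> row = new LinkedHashMap<>();
    String columnRef = null;
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        String name = reader.getLocalName();
        if ("Value".equals(name)) {
          columnRef = reader.getAttributeValue(null, "ColumnRef");
        } else if ("SimpleValue".equals(name) && columnRef != null) {
          row.put(columnRef, reader.getElementText());
        }
      } else if (event == XMLStreamConstants.END_ELEMENT && "Row".equals(reader.getLocalName())) {
        return row;
      }
    }
    throw new XMLStreamException("Unexpected end of document inside Row");
  }
}
```

The `onRow` callback is where each row could be bound to the insert statement right away, so the rows never need to be collected into a list first.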

Independently:

  • add more logging during the update run to catch which specific tables are causing the heap overflow
  • see if adding -XX:+HeapDumpOnOutOfMemoryError works at all
  • use this info to help with implementing the above
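
One cheap way to get the extra logging could be to record heap usage right before each table update, using only `java.lang.Runtime`. This is a sketch; the `HeapLog` class and the label format are hypothetical and would need to be wired into whatever logger the update job already uses.

```java
/** Hypothetical helper: formats current heap usage for a log line. */
final class HeapLog {
  private HeapLog() {}

  /** Returns e.g. "<label>: heap used N MB of max M MB" for the running JVM. */
  static String describe(String label) {
    Runtime rt = Runtime.getRuntime();
    long mb = 1024L * 1024L;
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / mb; // heap currently in use
    long maxMb = rt.maxMemory() / mb; // upper bound the JVM will try to use
    return String.format("%s: heap used %d MB of max %d MB", label, usedMb, maxMb);
  }
}
```

Logging something like `HeapLog.describe("dallas:dcjv optionalservices")` before each table would show which one pushes usage toward the limit. When enabling -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath can point the dump at a directory with enough free space.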

Got a heap dump last night. Here's all the useful info I could get from it:

  • crashed during Dallas's updates, after ~240 other Texas courts: Doing updates for: dallas:dcjv, tables: [documenttype, partytype, filingcomponent, filing, optionalservices]. Optional services is probably the key there.
  • using the Eclipse Memory Analyzer (with its memory expanded to 4GB, otherwise it still crashes), we get this info:
    • 1GB of info on the heap at the time of the crash
    • top 2 objects taking ~70% of memory together (each 35%):
      • org.quartz.simpl.SimpleThreadPool$WorkerThread, the Code Update thread
        • its two biggest members: a local genericode._1.CodeListDocument and a postgresql.core.ParameterList. There are 192,000 entries in the ParameterList, so we could split the Postgres queries into batches smaller than ~50k items. This issue's main target is the CodeListDocument; details are above.
      • org.apache.cxf.bus.extension.ExtensionManagerBus, the WSDL junk
        • its biggest elements are WSDLManagerImpl::schemaCacheMap and definitionsMap. Not sure how we can shrink that usage; maybe if jurisdictions started sharing the NIEM XML files? That would help with repo size anyway.