MrMimic/MEDOC

Incomplete XML parsing for large files


When the XML file is quite large (more than 30 MB compressed), the lxml engine used by bs4 does not find all the elements.

I tried running the code on machines with more RAM (up to 64 GB) and with another engine as well (html.parser).

I think it may be the way bs4 uses lxml, but I did not want to dig deeper. You can replicate this issue by processing any large file, e.g. updatefiles/pubmed18n0935.xml.gz
For example, with the file mentioned above I can only find 2 elements out of about 30,000.
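
For reference, this is roughly how I reproduce it (a minimal sketch, not MEDOC's actual code; the local file name and the PubmedArticle tag are assumptions on my side):

```python
# Hypothetical reproduction sketch: parse a large PubMed update file with
# bs4's lxml engine and compare the parser's count to a raw text count.
import gzip
from bs4 import BeautifulSoup

with gzip.open("pubmed18n0935.xml.gz", "rt", encoding="utf-8") as handle:
    xml_text = handle.read()

soup = BeautifulSoup(xml_text, "lxml")        # HTML-mode lxml engine
found = soup.find_all("pubmedarticle")        # HTML mode lowercases tag names
expected = xml_text.count("<PubmedArticle")   # rough count straight from the text
print(f"Parser found {len(found)} of ~{expected} elements")
```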

As a temporary workaround, I extracted the XML nodes for 'Citation' with a regular expression, just to be sure everything gets imported.
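
The fallback looks roughly like this (a sketch; I match MedlineCitation here as the 'Citation' node, which is an assumption, as is the file name):

```python
import gzip
import re

# Sketch of the regex fallback: pull every citation node out as raw text so
# nothing is silently dropped by the underlying parser.
NODE_RE = re.compile(r"<MedlineCitation\b.*?</MedlineCitation>", re.DOTALL)

with gzip.open("pubmed18n0935.xml.gz", "rt", encoding="utf-8") as handle:
    xml_text = handle.read()

nodes = NODE_RE.findall(xml_text)
print(f"{len(nodes)} elements are now loaded.")
```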
So, the output now looks like this:

Elapsed time: 5.17 sec for module: get_file_list
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - DOWNLOADING FILE
Downloading updatefiles/pubmed18n0935.xml.gz ..
Elapsed time: 10.53 sec for module: download
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - FILE EXTRACTION
Elapsed time: 2.4 sec for module: extract
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - XML FILE PARSING
The underlying parser failed to retrieve all the elements. Found 2 out of 30000. 
Loading the others now...30000 elements are now loaded.
Elapsed time: 10.33 sec for module: parse

I thought you might want to know about this, or to investigate the issue more deeply.

Hey. It took me a while to figure out why this could happen.

The problem, it turns out, comes from this code block:

empty_element_tags = set(['br' , 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'])

This set lives in the bs4 tree builder that backs the lxml engine (rather than in lxml itself). Because 'link' is listed there, any link tag in the parsed file is treated as a self-closing (empty) element; the document structure breaks around it, which leads to the miscount.
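
You can see the effect in isolation (my own toy example, not MEDOC code):

```python
from bs4 import BeautifulSoup

# In HTML mode, bs4 treats <link> as a void (self-closing) element, so any
# children written inside it escape the tag and the surrounding structure is
# mangled; in XML mode the same tag nests normally.
doc = "<root><link><item>a</item></link><item>b</item></root>"

html_soup = BeautifulSoup(doc, "lxml")       # HTML rules: <link> is empty
xml_soup = BeautifulSoup(doc, "lxml-xml")    # XML rules: <link> is a normal tag

print(len(html_soup.find("link").contents))  # 0 -- the children escaped
print(len(xml_soup.find("link").contents))   # 1 -- <item>a</item> stays nested
```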

I added a simple XML indexing pass over the file (slower), and I re-parse every extracted article with lxml to reduce the chance of this happening.
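
The idea is roughly this (a sketch of the approach, not the exact committed code; the tag name and file name are placeholders):

```python
import gzip
from lxml import etree

# Sketch of the fix: stream through the file with a real XML parser to index
# the article nodes, then re-parse each node individually with lxml so that
# one problematic tag cannot corrupt the rest of the import.
with gzip.open("pubmed18n0935.xml.gz", "rb") as handle:
    for _event, element in etree.iterparse(handle, tag="PubmedArticle"):
        article = etree.fromstring(etree.tostring(element))  # per-article re-parse
        # ... extract fields from `article` and insert them into the database ...
        element.clear()  # free memory as the stream advances
```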

Thanks for reporting it, and sorry for the late reply.