titipata/pubmed_parser

Error: parse_medline_xml() is unable to parse the file even though the provided path is correct

srishti-git1110 opened this issue · 11 comments

Hi, I'm using parse_medline_xml() to parse an XML file, and I'm not sure where the error stems from.
I read a discussion of a similar issue raised in the past and cross-checked that the file path I'm providing is correct; it seems to be.
Is there any other reason the following error could occur?

Edit: I tried parsing a 2017 file using the exact same code and it worked fine. A similar discussion suggests installing the latest version of pubmed_parser, so I did that, but it's still not working. I'm trying to do this for a 2022 file.

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):

File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\utils.py", line 31, in read_xml
tree = etree.parse(path)

File "src\lxml\etree.pyx", line 3536, in lxml.etree.parse

File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseDocument

File "src\lxml\parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL

File "src\lxml\parser.pxi", line 1805, in lxml.etree._parseDocFromFile

File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile

File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult

File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError

File "file:/C:/Users/hp/Desktop/scratch/Rudraksh/food-disease-relx/data/baseline_test_sg/pubmed22n0002.xml.gz", line 577522
XMLSyntaxError: Specification mandates value for attribute CitedMed, line 577522, column 33

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\IPython\core\interactiveshell.py", line 3457, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "C:\Users\hp\AppData\Local\Temp/ipykernel_3596/2675774551.py", line 1, in <module>
parsed_file = pp.parse_medline_xml(r'C:\Users\hp\Desktop\scratch\Rudraksh\food-disease-relx\data\baseline_test_sg\pubmed22n0002.xml.gz')

File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\medline_parser.py", line 672, in parse_medline_xml
tree = read_xml(path)

File "C:\Users\hp\anaconda3\envs\test\lib\site-packages\pubmed_parser-0.3.1-py3.9.egg\pubmed_parser\utils.py", line 36, in read_xml
tree = etree.fromstring(path)

File "src\lxml\etree.pyx", line 3252, in lxml.etree.fromstring

File "src\lxml\parser.pxi", line 1913, in lxml.etree._parseMemoryDocument

File "src\lxml\parser.pxi", line 1793, in lxml.etree._parseDoc

File "src\lxml\parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult

File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError

File "<string>", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Yes, it seems like the file that you're putting in is not parsable by lxml.
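
One way to check this independently of pubmed_parser is to stream the file through the standard library's XML parser and see where it first fails. The following is only a sketch: `first_xml_error` is a made-up helper name, and `xml.etree.ElementTree` (expat) may not agree with lxml on every edge case.

```python
import gzip
import xml.etree.ElementTree as ET


def first_xml_error(path):
    """Return (line, column, message) for the first XML error, or None if the file parses."""
    # Pick the opener from the extension; both return a binary file object.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        try:
            # iterparse reads the file incrementally, so large files are fine.
            for _event, elem in ET.iterparse(handle, events=("end",)):
                elem.clear()  # drop element contents we no longer need
        except ET.ParseError as err:
            line, column = err.position
            return line, column, str(err)
    return None
```

Running it on pubmed22n0002.xml.gz would show whether a second parser also stops near line 577522, which helps separate a damaged file from a parser-specific problem.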

Thanks for taking time to answer.
So, you are saying the parser won't work for files from the year 2022? Or is there any other issue apart from the date of file?
Because it is working just fine for a 2017 file (downloaded from the exact same source) with the same extension .xml.gz

If the year is the only issue, do you have any idea up to which year/date the parser works?

Oh, if it works up to 2017, it might be a problem with the file format. I don't have much time to check the format, but there might be an issue there!

In the last year, I have used parse_medline_xml() on all of the PubMed XML files without error. In general, I use the xml.gz file format but I have tested the .xml file too. I recommend stepping through the code while parsing that file in a debugger and isolating the error.

Thanks @raypereda-gr.
Yes, I'm also using it with a .xml.gz file. It's a 2022 file.

I tried debugging - However, I'm unable to figure out the error. Can you please help?

> c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)

ERROR! Session/line number was not unique in database. History logging moved to new session 157
ipdb> w
c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\async_helpers.py(78)_pseudo_sync_runner()
76 """
77 try:
---> 78 coro.send(None)
79 except StopIteration as exc:
80 return exc.value

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3185)run_cell_async()
3183 interactivity = 'async'
3184
-> 3185 has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
3186 interactivity=interactivity, compiler=compiler, result=result)
3187

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(3396)run_ast_nodes()
3394 if result:
3395 result.error_before_exec = sys.exc_info()[1]
-> 3396 self.showtraceback()
3397 return True
3398

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2063)showtraceback()
2061 # Though this won't be called by syntax errors in the input
2062 # line, there may be SyntaxError cases with imported code.
-> 2063 self.showsyntaxerror(filename, running_compiled_code)
2064 elif etype is UsageError:
2065 self.show_usage_error(value)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\interactiveshell.py(2129)showsyntaxerror()
2127 # If the error occurred when executing compiled code, we should provide full stacktrace.
2128 elist = traceback.extract_tb(last_traceback) if running_compiled_code else []
-> 2129 stb = self.SyntaxTB.structured_traceback(etype, value, elist)
2130 self._showtraceback(etype, value, stb)
2131

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1407)structured_traceback()
1405 value.text = newtext
1406 self.last_syntax_error = value
-> 1407 return super(SyntaxTB, self).structured_traceback(etype, value, elist,
1408 tb_offset=tb_offset, context=context)
1409

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(629)structured_traceback()
627 chained_exceptions_tb_offset = 0
628 out_list = (
--> 629 self.structured_traceback(
630 etype, evalue, (etb, chained_exc_ids),
631 chained_exceptions_tb_offset, context)

c:\users\hp\anaconda3\envs\test\lib\site-packages\ipython\core\ultratb.py(1403)structured_traceback()
1401 and isinstance(value.lineno, int):
1402 linecache.checkcache(value.filename)
-> 1403 newtext = linecache.getline(value.filename, value.lineno)
1404 if newtext:
1405 value.text = newtext

c:\users\hp\anaconda3\envs\test\lib\linecache.py(30)getline()
28 Update the cache if it doesn't contain an entry for this file already."""
29
---> 30 lines = getlines(filename, module_globals)
31 if 1 <= lineno <= len(lines):
32 return lines[lineno - 1]

c:\users\hp\anaconda3\envs\test\lib\linecache.py(46)getlines()
44
45 try:
---> 46 return updatecache(filename, module_globals)
47 except MemoryError:
48 clearcache()

c:\users\hp\anaconda3\envs\test\lib\linecache.py(136)updatecache()
134 return []
135 try:
--> 136 with tokenize.open(fullname) as fp:
137 lines = fp.readlines()
138 except OSError:

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(394)open()
392 buffer = _builtin_open(filename, 'rb')
393 try:
--> 394 encoding, lines = detect_encoding(buffer.readline)
395 buffer.seek(0)
396 text = TextIOWrapper(buffer, encoding, line_buffering=True)

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(371)detect_encoding()
369 return default, []
370
--> 371 encoding = find_cookie(first)
372 if encoding:
373 return encoding, [first]

c:\users\hp\anaconda3\envs\test\lib\tokenize.py(335)find_cookie()
333 if filename is not None:
334 msg = '{} for {!r}'.format(msg, filename)
--> 335 raise SyntaxError(msg)
336
337 match = cookie_re.match(line_string)

@raypereda-gr can you also please let me know the source and code you are downloading the files from?

Here is my code. I'm afraid that incorrect files are being downloaded on my end, which would cause the errors.

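import sys
from subprocess import PIPE, Popen

save_loc = 'Desktop/scratch/'

def download_ftp_files(link, save_loc, verbose=True):
    """ Downloads all ftp files from the supplied link """
    # text=True makes stdout yield str lines, so the '' sentinel below
    # actually ends the loop at EOF (with bytes it never would).
    process = Popen(['wget', link + "*"],
                    stdout=PIPE, text=True, cwd=save_loc)

    if verbose:
        for line in iter(process.stdout.readline, ''):
            sys.stdout.write(line)
    process.wait()  # wait for wget to finish before returning

download_ftp_files('ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/', save_loc=save_loc + 'baseline/')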

I will look into it further in a couple of days. In the meantime, you can help me with two things. First, trim the file down to the smallest file that gives the same error. You will need to work with the XML, not the zipped file. Second, try downloading the file in various ways. Try manual downloads too. See if the file changes with different ways of downloading.
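
For the first step, a small helper can pull out just the lines around the position the parser complained about (line 577522 in the traceback above), as a starting point for a minimal failing file. This is a sketch under assumptions: `lines_around` is a hypothetical name, and the file is assumed to be UTF-8 text, gzipped or not.

```python
import gzip
from itertools import islice


def lines_around(path, lineno, radius=2):
    """Return (line_number, text) pairs within `radius` lines of `lineno` (1-based)."""
    start = max(1, lineno - radius)
    stop = lineno + radius
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        # islice skips ahead lazily, so the whole file is never held in memory.
        window = islice(handle, start - 1, stop)
        return [(start + i, text.rstrip("\n")) for i, text in enumerate(window)]
```

For example, `lines_around("pubmed22n0002.xml.gz", 577522)` would return the reported line plus two lines of context on each side.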

Thank you so much @raypereda-gr for helping out!

@raypereda-gr Thanks very much for offering to help.

As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with.
I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here. Sorry for bothering you too much. I'm very new to this hence the naivety.

As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

That is the same function that I use:

list_of_dictionary = pp.parse_medline_xml(pubmed_xml_filename, year_info_only=False)

That function will accept a .xml or .xml.gz file. You don't need to worry about unzipping explicitly; the function will handle that if needed.
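
For illustration only, here is one way such transparent handling can be done with the standard library. This is not pubmed_parser's actual implementation; `open_maybe_gzip` is a hypothetical helper that decides by the file's first two bytes rather than by its extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def open_maybe_gzip(path):
    """Open `path` for binary reading, decompressing if it looks like gzip."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")
```

Checking the content instead of the name also catches a file that carries a .gz extension but was saved uncompressed.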

Since you have been able to parse the .xml.gz file, we can be confident that the problem is with the .xml file. How exactly did you unzip it? Here's the ls output of the unzipped file that I created by unzipping on a Mac using the pre-installed unzip tool. I also counted the number of lines.

$ ls -l medline17n0116.xml
-rw-r--r--@ 1 raypereda  staff  188634668 Mar 19 16:41 medline17n0116.xml

$ wc *.xml
 4572705 10113718 188634668 medline17n0116.xml
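
The same two numbers (wc's newline count and byte count) can also be read straight from the .xml.gz, which sidesteps any doubt about the unzip tool. A sketch, with `decompressed_stats` as a made-up helper name:

```python
import gzip


def decompressed_stats(gz_path, chunk_size=1 << 20):
    """Return (newline_count, byte_count) of the decompressed contents of a .gz file."""
    newlines = 0
    size = 0
    with gzip.open(gz_path, "rb") as handle:
        # Read fixed-size chunks until read() returns the empty bytes sentinel.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            newlines += chunk.count(b"\n")
            size += len(chunk)
    return newlines, size
```

If the unzipped file is faithful, calling this on medline17n0116.xml.gz should give the first and third numbers of the wc output above.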

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Good. That means we can be confident that the problem is not with the download. I suspect something is off with the unzipping.
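
To confirm that two downloads really are byte-identical rather than just similarly named, their checksums can be compared. A minimal sketch; `file_sha256` and `same_download` are hypothetical helper names, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_download(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)
```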

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with.
I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here.

Ok, why can't you just parse the .xml.gz file? I would suggest not worrying about unzipping the files.

Thanks @raypereda-gr !

Yes, I was working with the zipped file only (.xml.gz); it still wasn't working.

I made a small change: passing path as a keyword argument when calling the function, like so:
pp.parse_medline_xml(path=pubmed_xml_filepath)

instead of calling it positionally, like so:
pp.parse_medline_xml(pubmed_xml_filepath)

and with that change it worked. Anyway, thanks a lot for helping so patiently, @titipata @raypereda-gr.
Best,
Srishti