
Error: parse_medline_xml() is unable to parse the file even though the provided path is correct

Hi, I'm using parse_medline_xml() to parse an XML file, and I'm not sure where the error comes from.
I read a discussion of a similar issue that was raised in the past and cross-checked the file path I'm providing; it seems right.
Is there any other reason the following error could occur?

Edit: I tried parsing a 2017 file using the exact same code and it worked fine. A similar discussion suggests installing the latest version of pubmed parser, so I did that, but it's still not working. I'm trying to do this for a 2022 file.

Error: it was not able to read a path, a file-like object, or a string as an XML

Yes, it seems the file you're passing in is not parsable by lxml.
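
One way to check this outside of the parser is to look at the file itself. The snippet below is only a rough sketch using the Python standard library, not anything pubmed parser does internally; the helper names sniff_xml_file and first_parse_error are made up for illustration. It checks whether the file really starts with the gzip magic bytes, shows the first decompressed bytes (a failed download is sometimes an HTML error page), and streams through the XML to report the first parse error, if any.

```python
import gzip
import xml.etree.ElementTree as ET
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_xml_file(path, n_bytes=200):
    """Return (is_gzip, head), where head is the first decompressed bytes."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as fh:
        head = fh.read(n_bytes)
    return is_gzip, head


def first_parse_error(path):
    """Stream through the file; return None if well-formed, else the error text."""
    is_gzip, _ = sniff_xml_file(path)
    opener = gzip.open if is_gzip else open
    try:
        with opener(path, "rb") as fh:
            for _event, elem in ET.iterparse(fh):
                elem.clear()  # keep memory flat on large files
    except (ET.ParseError, OSError, EOFError, zlib.error) as exc:
        # ParseError: malformed XML; the others: corrupt or truncated gzip
        return f"{type(exc).__name__}: {exc}"
    return None
```

If first_parse_error returns None for the 2022 file, the file itself is probably fine and the problem lies in how it is being passed to the parser.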

Thanks for taking time to answer.
So, are you saying the parser won't work for files from 2022? Or is there some issue other than the date of the file?
Because it works just fine for a 2017 file (downloaded from the exact same source) with the same .xml.gz extension.

If the year is the only issue, do you have any idea up to which year/date the parser works?

Oh, if it only works up to 2017, it might be a problem with the file format. I don't have much time to check the format, but there might be an issue there!

In the last year, I have used parse_medline_xml() on all of the PubMed XML files without error. In general, I use the xml.gz file format but I have tested the .xml file too. I recommend stepping through the code while parsing that file in a debugger and isolating the error.

Thanks @raypereda-gr.
Yes, I'm also using it with a .xml.gz file. It's a 2022 file.

I tried debugging, but I'm unable to figure out the error. Can you please help?

@raypereda-gr can you also please let me know the source you are downloading the files from, and the code you use?

Here is my code. I'm afraid incorrect files may be getting downloaded on my end, causing the errors.

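from subprocess import PIPE, STDOUT, Popen

save_loc = 'Desktop/scratch/'

def download_ftp_files(link, save_loc, verbose=True):
    """Download all FTP files from the supplied link."""
    # wget writes its progress log to stderr, so merge it into stdout
    process = Popen(['wget', link + "*"],
                    stdout=PIPE, stderr=STDOUT, cwd=save_loc)

    if verbose:
        # the pipe yields bytes, so the end-of-stream sentinel must be b''
        for line in iter(process.stdout.readline, b''):
            print(line.decode(errors='replace'), end='')
    process.wait()  # block until wget has finished

download_ftp_files('ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/', save_loc=save_loc + 'baseline/')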

I will look into it further in a couple of days. In the meantime, you can help me with two things. First, trim the file down to the smallest file that gives the same error. You will need to work with the XML, not the zipped file. Second, try downloading the file in various ways. Try manual downloads too. See if the file changes with different ways of downloading.
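
To see whether two downloads of the same file actually differ, comparing checksums is usually enough. A minimal sketch with the standard library follows; the helper names file_md5 and same_content are just for illustration. I believe the NCBI baseline directory also ships an .md5 file next to each archive, but treat that as an assumption and check the directory listing.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have identical bytes (judged by MD5)."""
    return file_md5(path_a) == file_md5(path_b)
```

If the wget copy and the manually downloaded copy give the same digest, the download method can be ruled out as the cause.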

Thank you so much @raypereda-gr for helping out!

@raypereda-gr Thanks very much for offering to help.

As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with.
I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here. Sorry for bothering you too much. I'm very new to this hence the naivety.

As you asked to work with the .xml and not .xml.gz (zipped) file, is it required for trimming the file down or to parse it? Asking because I was able to parse a .xml.gz file using parse_medline_xml().

That is the same function that I use:

list_of_dictionary = pp.parse_medline_xml(pubmed_xml_filename, year_info_only=False)

That function will accept a .xml or .xml.gz file. You don't need to worry about unzipping explicitly; the function will handle that if needed.

Since you have been able to parse the .xml.gz file, we can be confident that the problem is with the .xml file. How exactly did you unzip it? Here's the ls output of the unzipped file that I created by unzipping on a Mac using the pre-installed unzip tool. I also counted the number of lines.

$ ls -l medline17n0116.xml
-rw-r--r--@ 1 raypereda  staff  188634668 Mar 19 16:41 medline17n0116.xml

$ wc *.xml
 4572705 10113718 188634668 medline17n0116.xml
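
If you want to compare against these numbers without relying on any particular unzip tool, the same two figures (lines and bytes) can be computed straight from the .xml.gz. This is just a sketch with the standard library, and gz_line_and_byte_count is a made-up helper name.

```python
import gzip


def gz_line_and_byte_count(gz_path, chunk_size=1 << 20):
    """Return (newline_count, byte_count) of the decompressed content."""
    lines = 0
    size = 0
    with gzip.open(gz_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            lines += chunk.count(b"\n")  # wc -l also counts newline characters
            size += len(chunk)
    return lines, size
```

A mismatch with the ls and wc figures for the same file would point at the unzip step rather than at the download.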

To download manually as you suggested, I tried to navigate to the exact same directory (webpage) online where the files were getting downloaded from. These are the same as the ones that were getting downloaded using the code I attached in a previous comment.

Good. That means we can be confident that the problem is not with the download. I suspect something is off with the unzipping.

Further, as I wrote in the original issue, this is the file parse_medline_xml() works perfectly with.
I strongly feel the one I'm trying to parse now isn't in the format the parser is for. However, I might be wrong here.

Ok, why can't you just parse the .xml.gz file? I would suggest not worrying about unzipping the files.

Thanks @raypereda-gr !

Yes, I was working with the zipped file only (.xml.gz); it still wasn't working.

I made a small change: I passed the path as a keyword argument when calling the function, like so:
pp.parse_medline_xml(path = pubmed_xml_filepath)

instead of passing it positionally, and then it worked. Anyway, thanks a lot for helping patiently, @titipata @raypereda-gr.