High memory usage by pmidcite
aditya-sarkar441 opened this issue · 7 comments
@dvklopfenstein pmidcite uses a huge amount of memory while downloading data for PubMed IDs. I have a txt file that contains 90,000 PMIDs. When I run pmidcite, the job on the cluster is aborted due to the high memory footprint (about 16 GB). Can you please help me with this? I only require the headers and do not want the citation details.
Thank you for your interest in pmidcite and for taking your time to write. Would it be possible to send me your list of PMIDs and the script or command you ran at the command line to attempt the download?
@dvklopfenstein Command : /u/home/a/asarkar/.local/bin/icite.py -i citations.txt -H -a final_cite.txt --dir_icite_py /u/scratch/a/asarkar/Aditya-scratch/twitter/citations -R
I have attached the data below.
citations.txt
Thank you so much for this terrific request. I am working on it now...
Hello @aditya-sarkar441,
I have modified the code to give better control to the researcher regarding which NIH citation details shall be downloaded. This minimizes NIH downloads to only the researcher-specified PMIDs by default.
To download additional NIH citation data, use the new icite arguments:

- `-c` (download NIH citation details for the citations of a researcher-specified PMID)
- `-r` (download NIH citation details for the references of a researcher-specified PMID)

These new icite arguments augment the existing icite argument:

- `-v` (download NIH citation details for both the citations and the references of a researcher-specified PMID)

The `-R` icite argument (don't download NIH citation details for the references of a researcher-specified PMID) shall be removed in the future; it is now obsolete and has been effectively replaced by the new `-c` and `-r` arguments. Example commands are shown below.
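For example (reusing the file names from the command above and only the flags described in this thread), the default now downloads NIH data only for the PMIDs listed in the input file, while `-c`/`-r` opt in to the extra details:

```
# Default: download NIH citation data only for the PMIDs listed in citations.txt
icite.py -i citations.txt -H -a final_cite.txt

# Also download NIH details for the papers that cite each listed PMID
icite.py -i citations.txt -H -a final_cite.txt -c

# Also download NIH details for the references of each listed PMID
icite.py -i citations.txt -H -a final_cite.txt -r
```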
cc: @pnguyen-biotech: I have updated the following notebook to support this new flexibility:
https://github.com/dvklopfenstein/pmidcite/blob/main/notebooks/print_paper_sort_cites.ipynb.
WAS: `dnldr = NIHiCiteDownloader(force_download, api)`
NOW: `dnldr = NIHiCiteDownloader(force_download, api, details_cites_refs="citations")`
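For context, here is a minimal sketch of how this downloader argument is used; the import paths and the `NIHiCiteAPI` setup are assumptions on my part, not verbatim notebook code:

```python
# Sketch only: the import paths and API construction are assumptions
from pmidcite.icite.api import NIHiCiteAPI
from pmidcite.icite.downloader import NIHiCiteDownloader

api = NIHiCiteAPI()      # assumption: configure the NIH iCite API wrapper as in the notebook
force_download = False

# Default after this change: NIH data is downloaded only for the PMIDs you pass in (headers only)
dnldr = NIHiCiteDownloader(force_download, api)

# Opt in to details for the papers that cite your PMIDs, as in the updated notebook
dnldr_cites = NIHiCiteDownloader(force_download, api, details_cites_refs="citations")
```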
The change described in #3 (comment) allows NIH citation data to be downloaded for the headers only, when you do not want the details of the citing papers.
There is one more pending set of changes needed to address your issue...
The large memory consumption is due to loading all 90,000 papers (plus their citation data) into RAM before writing any output.
The next modification is to flush PMID and citation data to stdout or to a file incrementally, instead of loading all of the papers into RAM and then writing everything at once. I am working on this now; a sketch of the general idea follows.
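For illustration only, this is the general streaming pattern (not pmidcite's actual code): process the PMIDs in small chunks and flush each chunk's results before fetching the next, so RAM only ever holds one chunk. The chunk size and the `fetch_icite_lines` helper are hypothetical.

```python
def fetch_icite_lines(pmids):
    """Hypothetical stand-in for the call that downloads NIH data for a batch of PMIDs."""
    return [f"TOP {pmid} ..." for pmid in pmids]

def write_icite_incrementally(pmid_file, out_file, chunk_size=1000):
    """Read PMIDs, download data chunk by chunk, and flush each chunk to the output file."""
    with open(pmid_file) as fin, open(out_file, 'w') as fout:
        chunk = []
        for line in fin:
            pmid = line.strip()
            if not pmid:
                continue
            chunk.append(pmid)
            if len(chunk) == chunk_size:
                fout.writelines(f"{txt}\n" for txt in fetch_icite_lines(chunk))
                fout.flush()  # results hit disk now; only this chunk was held in RAM
                chunk = []
        if chunk:             # write the final partial chunk
            fout.writelines(f"{txt}\n" for txt in fetch_icite_lines(chunk))

# Example: write_icite_incrementally('citations.txt', 'final_cite.txt')
```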
Thank you very much for taking your time to report this issue. Your comments will certainly be helpful to others who wish to download NIH citation data for large numbers of PMIDs.
Thanks @dvklopfenstein. That will be very helpful.
I just improved the speed when downloading and loading NIH citation data for large numbers of PMIDs, as in your example.
Thank you so much for taking your time to open this terrific issue. Your keen observations will benefit many users.
Please give the new version a try and open a new issue if needed.