hugheylab/pmparser

Any solutions if the `modifyPubmedDb` process breaks?

Closed this issue · 3 comments

The function `modifyPubmedDb` is convenient because it is highly integrated. However, if a problem arises, such as the network disconnecting or the system being interrupted by a sudden update, is there a way to save the current progress and continue the update later?

Scenario 1: The citation data has been collected, but the remaining steps have not started yet.
Scenario 2: I have downloaded all the XML files, but parsing has not started yet.
Scenario 3: Parsing was interrupted, and the database contains only some of the XML files.
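
To illustrate the kind of resuming I have in mind for Scenario 3, here is a minimal sketch in Python. It is not pmparser's code or API: `process_all`, `load_done`, the `parse_file` callback, and the JSON manifest file are all hypothetical names I made up for illustration. The idea is simply to record each finished XML file in a small manifest, so that a rerun after an interruption skips the files that were already parsed.

```python
import json
from pathlib import Path


def load_done(manifest: Path) -> set:
    """Return the names of files recorded as finished in the manifest."""
    if not manifest.exists():
        return set()
    return set(json.loads(manifest.read_text()))


def process_all(files, manifest: Path, parse_file) -> list:
    """Parse every file not yet listed in the manifest.

    `parse_file` is a hypothetical callback that parses one XML file and
    may raise (for example, on a network failure). Returns the names that
    were processed during this run.
    """
    done = load_done(manifest)
    processed_now = []
    for name in files:
        if name in done:
            continue  # finished in an earlier run, so skip it
        parse_file(name)  # if this raises, the manifest keeps earlier progress
        done.add(name)
        # Record progress after each file so an interruption loses at most one.
        manifest.write_text(json.dumps(sorted(done)))
        processed_now.append(name)
    return processed_now
```

With something like this, calling `process_all` again with the same file list and manifest after a crash would continue from the first unfinished file instead of starting over.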

Thanks.

I have another idea: if everyone using `modifyPubmedDb` ends up with the same latest version of the data, why not process the XML files once and store the result somewhere? You could then distribute the processed data directly to users in bulk, perhaps as csv, rds, or fst files. This should save time and memory, and it could possibly become a project that helps PubMed maintain a structured bibliometric dataset.

I'm not sure whether this helps or is what you are looking for, but the processed data is available on a monthly basis via Zenodo and also via Google BigQuery. So, to save resources, you can access the latest (monthly) version of the database through BigQuery without parsing any XML files yourself, or download the database dump from Zenodo to set up your own database.

Thank you for the feedback. I think this is a very good solution. Thanks.