internetarchive/fatcat

Pubmed 2021 baseline XML updates


Every year Pubmed/MEDLINE does a new "baseline" XML dump, and then further daily updates are released against that.

We may need to update the pubmed harvesting code to know about the new "base" update file numbers for 2021. The schema itself seems not to have changed at all:

There will be no changes to the 2021 PubMed DTD; pubmed_190101.dtd will remain in place as the current PubMed DTD.

Details, and file numbers/index here: https://www.nlm.nih.gov/databases/download/pubmed_medline.html

miku commented

Currently, the pubmed harvesting code finds updates by "filedate", independent of any baseline dump. It generates a mapping from date to filenames and uses that in continuous mode.

In [6]: from fatcat_tools.harvest import pubmed            

In [7]: m = pubmed.generate_date_file_map()                
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1063.xml.gz
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1064.xml.gz
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1065.xml.gz
added entry for 2020-12-15: /pubmed/updatefiles/pubmed21n1066.xml.gz
added entry for 2020-12-15: /pubmed/updatefiles/pubmed21n1067.xml.gz
added entry for 2020-12-16: /pubmed/updatefiles/pubmed21n1068.xml.gz
added entry for 2020-12-17: /pubmed/updatefiles/pubmed21n1069.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1070.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1071.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1072.xml.gz
added entry for 2020-12-19: /pubmed/updatefiles/pubmed21n1073.xml.gz
added entry for 2020-12-20: /pubmed/updatefiles/pubmed21n1074.xml.gz
added entry for 2020-12-21: /pubmed/updatefiles/pubmed21n1075.xml.gz
added entry for 2020-12-22: /pubmed/updatefiles/pubmed21n1076.xml.gz
added entry for 2020-12-22: /pubmed/updatefiles/pubmed21n1077.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1078.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1079.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1080.xml.gz
generated date-file mapping for 10 dates

In [8]: import pprint                                      

In [9]: pprint.pprint(m)                                   
defaultdict(<class 'set'>,
            {'2020-12-14': {'/pubmed/updatefiles/pubmed21n1063.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1064.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1065.xml.gz'},
             '2020-12-15': {'/pubmed/updatefiles/pubmed21n1066.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1067.xml.gz'},
             '2020-12-16': {'/pubmed/updatefiles/pubmed21n1068.xml.gz'},
             '2020-12-17': {'/pubmed/updatefiles/pubmed21n1069.xml.gz'},
             '2020-12-18': {'/pubmed/updatefiles/pubmed21n1070.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1071.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1072.xml.gz'},
             '2020-12-19': {'/pubmed/updatefiles/pubmed21n1073.xml.gz'},
             '2020-12-20': {'/pubmed/updatefiles/pubmed21n1074.xml.gz'},
             '2020-12-21': {'/pubmed/updatefiles/pubmed21n1075.xml.gz'},
             '2020-12-22': {'/pubmed/updatefiles/pubmed21n1076.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1077.xml.gz'},
             '2020-12-28': {'/pubmed/updatefiles/pubmed21n1078.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1079.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1080.xml.gz'}})
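
For illustration, here is a minimal sketch of the kind of date-to-filenames mapping shown in the session output above. It is not the actual `fatcat_tools.harvest.pubmed` implementation: the function name `build_date_file_map` and the `(date, path)` input format are assumptions made for this example. The point is that lookups are keyed on the file date only, so the yearly prefix in the file name (`pubmed21n...`) plays no role.

```python
from collections import defaultdict


def build_date_file_map(entries):
    """Group update-file paths by file date (YYYY-MM-DD).

    Hypothetical helper for illustration only; `entries` is an iterable
    of (filedate, path) pairs.
    """
    mapping = defaultdict(set)
    for filedate, path in entries:
        mapping[filedate].add(path)
    return mapping


# A few (date, path) pairs copied from the session output above.
entries = [
    ("2020-12-14", "/pubmed/updatefiles/pubmed21n1063.xml.gz"),
    ("2020-12-14", "/pubmed/updatefiles/pubmed21n1064.xml.gz"),
    ("2020-12-15", "/pubmed/updatefiles/pubmed21n1066.xml.gz"),
]

date_file_map = build_date_file_map(entries)

# Look up the files for one day by date alone; the baseline year embedded
# in the file name is never inspected.
files_for_day = sorted(date_file_map["2020-12-14"])
```

Because the key is the date string, a new baseline year only changes the file names stored as values, not how they are looked up.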

From my understanding, we do not need to update the harvesting code.

Thanks for checking!