Pubmed 2021 baseline XML updates
Closed this issue · 2 comments
Every year PubMed/MEDLINE publishes a new "baseline" XML dump, and further daily updates are then released against that.
We may need to update the pubmed harvesting code to know about the new "base" update file numbers for 2021. The schema itself seems not to have changed at all:
There will be no changes to the 2021 PubMed DTD; pubmed_190101.dtd will remain in place as the current PubMed DTD.
Details, and file numbers/index here: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
Currently, the pubmed harvesting code will find updates by "filedate" - independent of any baseline dump. It generates a mapping from date to filenames and will use that in continuous mode.
In [6]: from fatcat_tools.harvest import pubmed
In [7]: m = pubmed.generate_date_file_map()
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1063.xml.gz
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1064.xml.gz
added entry for 2020-12-14: /pubmed/updatefiles/pubmed21n1065.xml.gz
added entry for 2020-12-15: /pubmed/updatefiles/pubmed21n1066.xml.gz
added entry for 2020-12-15: /pubmed/updatefiles/pubmed21n1067.xml.gz
added entry for 2020-12-16: /pubmed/updatefiles/pubmed21n1068.xml.gz
added entry for 2020-12-17: /pubmed/updatefiles/pubmed21n1069.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1070.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1071.xml.gz
added entry for 2020-12-18: /pubmed/updatefiles/pubmed21n1072.xml.gz
added entry for 2020-12-19: /pubmed/updatefiles/pubmed21n1073.xml.gz
added entry for 2020-12-20: /pubmed/updatefiles/pubmed21n1074.xml.gz
added entry for 2020-12-21: /pubmed/updatefiles/pubmed21n1075.xml.gz
added entry for 2020-12-22: /pubmed/updatefiles/pubmed21n1076.xml.gz
added entry for 2020-12-22: /pubmed/updatefiles/pubmed21n1077.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1078.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1079.xml.gz
added entry for 2020-12-28: /pubmed/updatefiles/pubmed21n1080.xml.gz
generated date-file mapping for 10 dates
In [8]: import pprint
In [9]: pprint.pprint(m)
defaultdict(<class 'set'>,
            {'2020-12-14': {'/pubmed/updatefiles/pubmed21n1063.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1064.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1065.xml.gz'},
             '2020-12-15': {'/pubmed/updatefiles/pubmed21n1066.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1067.xml.gz'},
             '2020-12-16': {'/pubmed/updatefiles/pubmed21n1068.xml.gz'},
             '2020-12-17': {'/pubmed/updatefiles/pubmed21n1069.xml.gz'},
             '2020-12-18': {'/pubmed/updatefiles/pubmed21n1070.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1071.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1072.xml.gz'},
             '2020-12-19': {'/pubmed/updatefiles/pubmed21n1073.xml.gz'},
             '2020-12-20': {'/pubmed/updatefiles/pubmed21n1074.xml.gz'},
             '2020-12-21': {'/pubmed/updatefiles/pubmed21n1075.xml.gz'},
             '2020-12-22': {'/pubmed/updatefiles/pubmed21n1076.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1077.xml.gz'},
             '2020-12-28': {'/pubmed/updatefiles/pubmed21n1078.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1079.xml.gz',
                            '/pubmed/updatefiles/pubmed21n1080.xml.gz'}})
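For illustration, here is a minimal sketch of how a date-to-files mapping like the one above can be built and queried. It is not the actual `fatcat_tools` implementation: `build_date_file_map`, `parse_file_number`, and the `ENTRIES` sample are hypothetical names, and the filename pattern is an assumption inferred from the paths in the session output. The point is that lookups are keyed on the date, so the file number (and the baseline year embedded in the filename) never has to be known in advance.

```python
import re
from collections import defaultdict

# Sample (date, path) pairs copied from the session output above.
# In the real harvester these come from the remote listing; here they
# are hard-coded for illustration only.
ENTRIES = [
    ("2020-12-14", "/pubmed/updatefiles/pubmed21n1063.xml.gz"),
    ("2020-12-14", "/pubmed/updatefiles/pubmed21n1064.xml.gz"),
    ("2020-12-15", "/pubmed/updatefiles/pubmed21n1066.xml.gz"),
    ("2020-12-28", "/pubmed/updatefiles/pubmed21n1080.xml.gz"),
]

# Assumed filename layout: "pubmed" + two-digit year + "n" + four-digit
# sequence number, inferred from the paths shown in this issue.
FILENAME_RE = re.compile(r"pubmed(\d{2})n(\d{4})\.xml\.gz$")


def build_date_file_map(entries):
    """Group file paths by date (hypothetical helper, not fatcat_tools API)."""
    mapping = defaultdict(set)
    for date, path in entries:
        mapping[date].add(path)
    return mapping


def parse_file_number(path):
    """Return (two-digit year, sequence number) parsed from a file path."""
    match = FILENAME_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected filename: {path}")
    return int(match.group(1)), int(match.group(2))


date_file_map = build_date_file_map(ENTRIES)

# A lookup only needs the date; the sequence number is incidental.
for path in sorted(date_file_map["2020-12-14"]):
    print(path, parse_file_number(path))
```

Because the key is the date, a new baseline that resets or shifts the sequence numbers would only change the values in the mapping, not how a given day's files are found.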
From my understanding, we do not need to update the harvesting code.
Thanks for checking!