pub_date column empty in article table
Closed this issue · 2 comments
Hello, thanks for creating and maintaining this code!
I would like to use the PubMed baseline locally in a SQL database (Postgres), so I downloaded the PMDB and restored it in Postgres following the instructions here: https://zenodo.org/record/6864161
When inspecting the data, I noticed that the "pub_date" column is empty (only contains "null") in the article table. Is this intentional or a parsing error?
In the pub_history table, the pub_date is present though, so it's possible to join the tables to get the dates, but then I wonder why the "pub_date" column exists in the article table if it's not being filled.
Thanks for your help!
Hello, please see the data dictionary: https://pmparser.hugheylab.org/articles/data_dictionary.html
Those pub_date fields in the different tables are parsed from different places in the XML files. Sometimes the article has a pub_date, sometimes it doesn't. I would generally recommend using the dates in the pub_history table.
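For illustration, here is a minimal sketch of pulling dates from pub_history alongside the article table. It runs against a toy in-memory SQLite database rather than the restored Postgres one, and the column names (pmid, pub_status, pub_date), the sample rows, and the choice of the earliest date per article are assumptions made for the example, not something confirmed by the data dictionary. If the real tables also carry a version column, the join would likely need to include it.

```python
import sqlite3

# Toy stand-in for the restored database; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (pmid INTEGER, title TEXT, pub_date TEXT);
CREATE TABLE pub_history (pmid INTEGER, pub_status TEXT, pub_date TEXT);
INSERT INTO article VALUES
    (1, 'has its own date', '2020-01-15'),
    (2, 'no date in article', NULL);
INSERT INTO pub_history VALUES
    (1, 'pubmed', '2020-01-20'),
    (2, 'received', '2019-03-01'),
    (2, 'pubmed', '2019-06-10');
""")

# One row per article: the (possibly null) article date next to the
# earliest date recorded for that article in pub_history.
QUERY = """
SELECT a.pmid,
       a.pub_date      AS article_date,
       MIN(h.pub_date) AS earliest_history_date
FROM article AS a
LEFT JOIN pub_history AS h ON h.pmid = a.pmid
GROUP BY a.pmid, a.pub_date
ORDER BY a.pmid
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The SELECT statement uses only standard SQL, so it should carry over to Postgres largely unchanged once the column names are checked against the actual schema.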
What do you get when you run the following two queries?
select count(*) from article;
select count(*) from article where pub_date is not null;
Thank you for the explanation and reference to the data dictionary, it was very helpful!
When running those two queries, I get the following results:
select count(*) from article;
- returns 34205444
select count(*) from article where pub_date is not null;
- returns 11583116
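To put the two counts in proportion, a short sketch of the arithmetic using the numbers reported above (the variable names are only illustrative):

```python
# Counts as reported by the two queries in this thread.
total_articles = 34205444
with_pub_date = 11583116

# Rows in article whose pub_date is null, and the share that is filled.
missing_pub_date = total_articles - with_pub_date
filled_share = with_pub_date / total_articles

print(f"null pub_date: {missing_pub_date}")
print(f"filled share:  {filled_share:.1%}")
```

So roughly a third of the rows in article have a pub_date, and the column is not entirely empty.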