hugheylab/pmparser

pub_date column empty in article table

Closed this issue · 2 comments

Hello, thanks for creating and maintaining this code!

I would like to use the PubMed baseline locally in a SQL database (Postgres). I downloaded the PMDB and restored it in Postgres following the instructions here: https://zenodo.org/record/6864161

When inspecting the data, I noticed that the "pub_date" column is empty (only contains "null") in the article table. Is this intentional or a parsing error?


In the pub_history table, the pub_date is present though, so it's possible to join the tables to get the dates. But then I wonder why the "pub_date" column exists in the article table if it's not being filled.

Thanks for your help!

Hello, please see the data dictionary: https://pmparser.hugheylab.org/articles/data_dictionary.html

Those pub_date fields in the different tables are parsed from different places in the XML files. Sometimes the article has a pub_date, sometimes it doesn't. I would generally recommend using the dates in the pub_history table.
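For readers who want to try the join mentioned above, here is a minimal sketch. It uses Python's built-in sqlite3 module with two toy tables as a stand-in for the restored Postgres database. The column layout (pmid, pub_status, pub_date) is an assumption loosely based on the data dictionary, the real tables have more columns, and taking the earliest pub_history date per article is just one possible choice, not necessarily what the maintainers intend.

```python
import sqlite3

# Toy stand-in for the PMDB tables. Column names and sample rows are
# assumptions for illustration only; the real schema has more columns.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table article (pmid integer, title text, pub_date text);
    create table pub_history (pmid integer, pub_status text, pub_date text);
    insert into article values
        (1, 'A', null),
        (2, 'B', '2020-05-01'),
        (3, 'C', null);
    insert into pub_history values
        (1, 'received', '2019-01-10'),
        (1, 'pubmed', '2019-03-02'),
        (2, 'pubmed', '2020-05-03');
""")

# Left join keeps articles without any pub_history rows; min() picks
# the earliest history date per article (an assumed choice).
query = """
    select a.pmid,
           a.pub_date as article_date,
           min(h.pub_date) as earliest_history_date
    from article a
    left join pub_history h on h.pmid = a.pmid
    group by a.pmid, a.pub_date
    order by a.pmid
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

The select statement itself uses only standard SQL, so it should carry over to Postgres largely unchanged, provided the table and column names match the restored database.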

What do you get when you run select count(*) from article; and select count(*) from article where pub_date is not null;?

Thank you for the explanation and reference to the data dictionary, it was very helpful!

When running those two queries, I get the following results:

select count(*) from article;

  • returns 34205444

select count(*) from article where pub_date is not null;

  • returns 11583116
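To put those two counts in proportion: 11583116 of 34205444 rows is roughly a third, so the column is sparsely filled rather than entirely empty. The sketch below checks that arithmetic and also shows how both counts could be fetched in one pass, relying on the fact that count(column) skips nulls in standard SQL. It runs against a toy sqlite3 table whose columns and rows are made up for illustration, not against the real PMDB.

```python
import sqlite3

# Toy article table (assumed columns, made-up rows) to demonstrate the query.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table article (pmid integer, pub_date text);
    insert into article values (1, null), (2, '2020-05-01'), (3, null);
""")

# count(*) counts all rows; count(pub_date) counts only non-null values.
total, non_null = con.execute(
    "select count(*), count(pub_date) from article"
).fetchone()
print(total, non_null)

# Share of non-null pub_date values, using the counts reported in this thread.
reported_total = 34205444
reported_non_null = 11583116
share = reported_non_null / reported_total
print(f"{share:.1%}")  # prints 33.9%
```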