greenelab/greenblack

Multiple bioRxiv preprints link to the same journal publication

Opened this issue · 4 comments

As part of this project, we're working with the Rxivist database of bioRxiv preprints. We've extracted a portion of this database and stored it in the file data/01.preprints.tsv.

Based on this dataset, 28 journal articles are each linked to by more than one bioRxiv preprint. In other words, bioRxiv sometimes records multiple preprints as having been published as the same journal article.

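The Python code below finds them:

import pandas
# Pinned to a specific commit so the result is reproducible
url = 'https://github.com/greenelab/greenblack/raw/8f4502d5f62064dd483fbec18d21e5e31c35dc03/data/01.preprints.tsv'
# Drop rows with any missing value (e.g. no journal_doi)
preprint_df = pandas.read_csv(url, sep='\t').dropna()
# bioRxiv preprints with duplicated journal publications (keep=False keeps every member of a group)
duplicate_df = (
    preprint_df[preprint_df.journal_doi.duplicated(keep=False)]
    .sort_values(['journal_doi', 'preprint_doi'])
)
duplicate_df.to_csv('duplicated.tsv', sep='\t', index=False)
# Number of journal DOIs shared by more than one preprint (28)
duplicate_df.journal_doi.nunique()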

Here's the output table:

rxivist_preprint_id preprint_date preprint_doi journal_date journal_doi
9880 2017-06-27 10.1101/156331 2017-11-24 10.1007/s00401-017-1789-4
9828 2017-07-18 10.1101/165373 2017-11-24 10.1007/s00401-017-1789-4
16018 2016-12-07 10.1101/092221 2017-10-13 10.1007/s12559-017-9518-9
4696 2017-01-02 10.1101/097675 2017-10-13 10.1007/s12559-017-9518-9
10702 2014-09-04 10.1101/008755 2015-04-17 10.1016/j.ajhg.2015.03.004
10428 2016-04-05 10.1101/046995 2015-04-17 10.1016/j.ajhg.2015.03.004
20822 2017-05-12 10.1101/137190 2018-05-21 10.1016/j.cognition.2018.04.017
20709 2017-12-11 10.1101/231837 2018-05-21 10.1016/j.cognition.2018.04.017
30027 2015-10-14 10.1101/029066 2017-10-30 10.1038/s41598-017-14523-5
29724 2017-05-17 10.1101/138784 2017-10-30 10.1038/s41598-017-14523-5
10577 2015-09-08 10.1101/026278 2017-02-23 10.1038/srep43054
26249 2015-12-08 10.1101/033944 2017-02-23 10.1038/srep43054
10256 2016-09-30 10.1101/078360 2017-05-05 10.1073/pnas.1704442114
10029 2017-04-03 10.1101/122218 2017-05-05 10.1073/pnas.1704442114
20878 2016-10-07 10.1101/079699 2017-12-04 10.1080/01480545.2017.1405971
20857 2017-01-12 10.1101/099952 2017-12-04 10.1080/01480545.2017.1405971
28398 2016-11-23 10.1101/085795 2017-01-06 10.1093/jxb/erw488
28406 2016-11-16 10.1101/088153 2017-01-06 10.1093/jxb/erw488
26406 2015-07-07 10.1101/022061 2016-04-27 10.1098/rsob.160009
8583 2015-09-18 10.1101/027151 2016-04-27 10.1098/rsob.160009
9475 2017-03-20 10.1101/118729 2018-02-15 10.1101/gr.230433.117
10042 2017-03-21 10.1101/119016 2018-02-15 10.1101/gr.230433.117
26141 2016-03-24 10.1101/045369 2016-09-19 10.1111/jeb.12972
25961 2016-08-18 10.1101/070318 2016-09-19 10.1111/jeb.12972
17987 2018-03-28 10.1101/290866 2018-05-17 10.1152/japplphysiol.00012.2018
17938 2018-05-16 10.1101/324020 2018-05-17 10.1152/japplphysiol.00012.2018
25942 2016-08-30 10.1101/070003 2017-03-29 10.1371/journal.pcbi.1005375
25731 2017-01-27 10.1101/103739 2017-03-29 10.1371/journal.pcbi.1005375
20939 2014-11-29 10.1101/011908 2015-03-25 10.1371/journal.pone.0119337
18271 2014-12-26 10.1101/013268 2015-03-25 10.1371/journal.pone.0119337
3791 2017-10-10 10.1101/201251 2018-04-26 10.1371/journal.pone.0196135
3785 2017-10-19 10.1101/205542 2018-04-26 10.1371/journal.pone.0196135
9091 2018-05-09 10.1101/317891 2018-07-31 10.1371/journal.pone.0197699
8894 2018-07-09 10.1101/365130 2018-07-31 10.1371/journal.pone.0197699
12107 2017-02-14 10.1101/108639 2017-11-30 10.1523/jneurosci.1724-17.2017
14821 2017-06-30 10.1101/157628 2017-11-30 10.1523/jneurosci.1724-17.2017
10615 2015-06-07 10.1101/020529 2015-09-10 10.1534/g3.115.021659
26423 2015-06-12 10.1101/020826 2015-09-10 10.1534/g3.115.021659
10483 2016-02-03 10.1101/038729 2016-07-29 10.1534/genetics.116.187369
10155 2017-01-09 10.1101/098095 2016-07-29 10.1534/genetics.116.187369
26064 2016-05-26 10.1101/055517 2017-03-17 10.1534/genetics.116.196303
25909 2016-09-28 10.1101/078279 2017-03-17 10.1534/genetics.116.196303
10215 2016-11-17 10.1101/088260 2017-05-26 10.1534/genetics.116.198424
10213 2016-11-17 10.1101/088385 2017-05-26 10.1534/genetics.116.198424
7850 2017-04-03 10.1101/123554 2017-11-02 10.3390/e19110584
7427 2017-05-22 10.1101/140913 2017-11-02 10.3390/e19110584
22979 2017-03-31 10.1101/122580 2017-10-18 10.7554/elife.27356
23938 2017-04-08 10.1101/125765 2017-10-18 10.7554/elife.27356
22956 2017-04-19 10.1101/128595 2017-12-14 10.7554/elife.27827
19859 2017-11-01 10.1101/212274 2017-12-14 10.7554/elife.27827
9875 2017-04-24 10.1101/130054 2017-07-17 10.7554/elife.28069
9997 2017-04-28 10.1101/131995 2017-07-17 10.7554/elife.28069
16093 2016-11-03 10.1101/085548 2018-01-08 10.7554/elife.28927
4204 2017-06-28 10.1101/157263 2018-01-08 10.7554/elife.28927
11378 2017-04-07 10.1101/125419 2018-10-16 10.7554/elife.34870
11082 2018-01-25 10.1101/253872 2018-10-16 10.7554/elife.34870
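
For a quick sanity check on a table like this one, a per-article summary can help. The sketch below is illustrative only: it parses six rows copied from the table above (whitespace-separated) rather than the full file, and the names sample_rows, sample_df, and summary_df are made up for the example. It counts preprints per journal_doi and measures the days between the earliest and latest preprint, which may hint at whether a pair is a revision or two separate submissions.

```python
import io
import pandas

# Six rows copied from the output table above; not the full dataset.
sample_rows = """rxivist_preprint_id preprint_date preprint_doi journal_date journal_doi
10702 2014-09-04 10.1101/008755 2015-04-17 10.1016/j.ajhg.2015.03.004
10428 2016-04-05 10.1101/046995 2015-04-17 10.1016/j.ajhg.2015.03.004
10483 2016-02-03 10.1101/038729 2016-07-29 10.1534/genetics.116.187369
10155 2017-01-09 10.1101/098095 2016-07-29 10.1534/genetics.116.187369
10215 2016-11-17 10.1101/088260 2017-05-26 10.1534/genetics.116.198424
10213 2016-11-17 10.1101/088385 2017-05-26 10.1534/genetics.116.198424
"""

# No field contains spaces, so splitting on whitespace is safe here.
sample_df = pandas.read_csv(
    io.StringIO(sample_rows),
    sep=r'\s+',
    parse_dates=['preprint_date', 'journal_date'],
)

# One row per journal article: how many preprints point to it,
# and when the first and last of those preprints were posted.
summary_df = sample_df.groupby('journal_doi').agg(
    n_preprints=('preprint_doi', 'nunique'),
    first_preprint=('preprint_date', 'min'),
    last_preprint=('preprint_date', 'max'),
)
summary_df['days_between'] = (
    summary_df['last_preprint'] - summary_df['first_preprint']
).dt.days

print(summary_df[['n_preprints', 'days_between']])
```

The same groupby should work on the full duplicate_df, provided the two date columns are first converted with pandas.to_datetime.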

I spot-checked a couple:

10483	2016-02-03	10.1101/038729	2016-07-29	10.1534/genetics.116.187369
10155	2017-01-09	10.1101/098095	2016-07-29	10.1534/genetics.116.187369
26064	2016-05-26	10.1101/055517	2017-03-17	10.1534/genetics.116.196303
25909	2016-09-28	10.1101/078279	2017-03-17	10.1534/genetics.116.196303

Looks like it might be folks uploading revisions...

I can envision a situation where multiple preprints end up being published in the same journal article. For example, perhaps two preprints were combined into a single work that was then published in a journal. Or as @cgreene mentions:

Looks like it might be folks uploading revisions...

I've gone through a few more:

I initially thought I had found one where the two preprints did not seem like they should both be associated with the same journal publication... although, given that many of these don't seem to be errors, I am not too worried.

This one is really odd. The number of authors dropped with the later preprint but went back up for the journal (if it's a revision):
https://doi.org/10.1101/008755 and https://doi.org/10.1101/046995 both published in https://doi.org/10.1016/j.ajhg.2015.03.004. I am not sure about this one.
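
One thing that might help flag odd cases like this: in the table above, 10.1101/046995 has a preprint_date (2016-04-05) later than its journal_date (2015-04-17). Here is a minimal sketch of that check, run on a handful of rows copied from the table rather than the full dataset (check_rows, check_df, and late_df are names invented for this example):

```python
import io
import pandas

# A few rows copied from the output table above; not the full dataset.
check_rows = """rxivist_preprint_id preprint_date preprint_doi journal_date journal_doi
10702 2014-09-04 10.1101/008755 2015-04-17 10.1016/j.ajhg.2015.03.004
10428 2016-04-05 10.1101/046995 2015-04-17 10.1016/j.ajhg.2015.03.004
10483 2016-02-03 10.1101/038729 2016-07-29 10.1534/genetics.116.187369
10155 2017-01-09 10.1101/098095 2016-07-29 10.1534/genetics.116.187369
9875 2017-04-24 10.1101/130054 2017-07-17 10.7554/elife.28069
9997 2017-04-28 10.1101/131995 2017-07-17 10.7554/elife.28069
"""

check_df = pandas.read_csv(
    io.StringIO(check_rows),
    sep=r'\s+',
    parse_dates=['preprint_date', 'journal_date'],
)

# Preprints whose posting date is after the linked journal publication date.
late_df = check_df[check_df['preprint_date'] > check_df['journal_date']]

print(late_df[['preprint_doi', 'preprint_date', 'journal_date']].to_string(index=False))
```

This only shows that the dates are out of order; it doesn't say whether the link itself is wrong.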

In any case, I agree with you that it doesn't seem like a big deal.

Tweeted:

Playing with #Rxivist database and noticed instances where multiple #preprints were marked as published by the same journal article.
Many seem like revisions posted as new preprints or other corner cases. 
@biorxivpreprint, how do publications get matched?