Multiple bioRxiv preprints link to the same journal publication
As part of this project, we're working with the Rxivist database of bioRxiv preprints. We've extracted a portion of this database and stored it in the file data/01.preprints.tsv.
Based on this dataset, 28 journal articles are each linked to by multiple bioRxiv preprints. In other words, bioRxiv sometimes records multiple preprints as having been published as the same journal article.
The Python code used to find them:
import pandas
url = 'https://github.com/greenelab/greenblack/raw/8f4502d5f62064dd483fbec18d21e5e31c35dc03/data/01.preprints.tsv'
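# Read the table and drop rows with any missing value
preprint_df = pandas.read_csv(url, sep='\t').dropna()
# bioRxiv preprints with duplicated journal publications
# (keep=False marks every member of a duplicated group, not just the repeats)
duplicate_df = preprint_df[preprint_df.journal_doi.duplicated(keep=False)].sort_values(['journal_doi', 'preprint_doi'])
duplicate_df.to_csv('duplicated.tsv', sep='\t', index=False)
# Number of journal DOIs shared by more than one preprint
duplicate_df.journal_doi.nunique()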
Here's the output table:
rxivist_preprint_id | preprint_date | preprint_doi | journal_date | journal_doi |
---|---|---|---|---|
9880 | 2017-06-27 | 10.1101/156331 | 2017-11-24 | 10.1007/s00401-017-1789-4 |
9828 | 2017-07-18 | 10.1101/165373 | 2017-11-24 | 10.1007/s00401-017-1789-4 |
16018 | 2016-12-07 | 10.1101/092221 | 2017-10-13 | 10.1007/s12559-017-9518-9 |
4696 | 2017-01-02 | 10.1101/097675 | 2017-10-13 | 10.1007/s12559-017-9518-9 |
10702 | 2014-09-04 | 10.1101/008755 | 2015-04-17 | 10.1016/j.ajhg.2015.03.004 |
10428 | 2016-04-05 | 10.1101/046995 | 2015-04-17 | 10.1016/j.ajhg.2015.03.004 |
20822 | 2017-05-12 | 10.1101/137190 | 2018-05-21 | 10.1016/j.cognition.2018.04.017 |
20709 | 2017-12-11 | 10.1101/231837 | 2018-05-21 | 10.1016/j.cognition.2018.04.017 |
30027 | 2015-10-14 | 10.1101/029066 | 2017-10-30 | 10.1038/s41598-017-14523-5 |
29724 | 2017-05-17 | 10.1101/138784 | 2017-10-30 | 10.1038/s41598-017-14523-5 |
10577 | 2015-09-08 | 10.1101/026278 | 2017-02-23 | 10.1038/srep43054 |
26249 | 2015-12-08 | 10.1101/033944 | 2017-02-23 | 10.1038/srep43054 |
10256 | 2016-09-30 | 10.1101/078360 | 2017-05-05 | 10.1073/pnas.1704442114 |
10029 | 2017-04-03 | 10.1101/122218 | 2017-05-05 | 10.1073/pnas.1704442114 |
20878 | 2016-10-07 | 10.1101/079699 | 2017-12-04 | 10.1080/01480545.2017.1405971 |
20857 | 2017-01-12 | 10.1101/099952 | 2017-12-04 | 10.1080/01480545.2017.1405971 |
28398 | 2016-11-23 | 10.1101/085795 | 2017-01-06 | 10.1093/jxb/erw488 |
28406 | 2016-11-16 | 10.1101/088153 | 2017-01-06 | 10.1093/jxb/erw488 |
26406 | 2015-07-07 | 10.1101/022061 | 2016-04-27 | 10.1098/rsob.160009 |
8583 | 2015-09-18 | 10.1101/027151 | 2016-04-27 | 10.1098/rsob.160009 |
9475 | 2017-03-20 | 10.1101/118729 | 2018-02-15 | 10.1101/gr.230433.117 |
10042 | 2017-03-21 | 10.1101/119016 | 2018-02-15 | 10.1101/gr.230433.117 |
26141 | 2016-03-24 | 10.1101/045369 | 2016-09-19 | 10.1111/jeb.12972 |
25961 | 2016-08-18 | 10.1101/070318 | 2016-09-19 | 10.1111/jeb.12972 |
17987 | 2018-03-28 | 10.1101/290866 | 2018-05-17 | 10.1152/japplphysiol.00012.2018 |
17938 | 2018-05-16 | 10.1101/324020 | 2018-05-17 | 10.1152/japplphysiol.00012.2018 |
25942 | 2016-08-30 | 10.1101/070003 | 2017-03-29 | 10.1371/journal.pcbi.1005375 |
25731 | 2017-01-27 | 10.1101/103739 | 2017-03-29 | 10.1371/journal.pcbi.1005375 |
20939 | 2014-11-29 | 10.1101/011908 | 2015-03-25 | 10.1371/journal.pone.0119337 |
18271 | 2014-12-26 | 10.1101/013268 | 2015-03-25 | 10.1371/journal.pone.0119337 |
3791 | 2017-10-10 | 10.1101/201251 | 2018-04-26 | 10.1371/journal.pone.0196135 |
3785 | 2017-10-19 | 10.1101/205542 | 2018-04-26 | 10.1371/journal.pone.0196135 |
9091 | 2018-05-09 | 10.1101/317891 | 2018-07-31 | 10.1371/journal.pone.0197699 |
8894 | 2018-07-09 | 10.1101/365130 | 2018-07-31 | 10.1371/journal.pone.0197699 |
12107 | 2017-02-14 | 10.1101/108639 | 2017-11-30 | 10.1523/jneurosci.1724-17.2017 |
14821 | 2017-06-30 | 10.1101/157628 | 2017-11-30 | 10.1523/jneurosci.1724-17.2017 |
10615 | 2015-06-07 | 10.1101/020529 | 2015-09-10 | 10.1534/g3.115.021659 |
26423 | 2015-06-12 | 10.1101/020826 | 2015-09-10 | 10.1534/g3.115.021659 |
10483 | 2016-02-03 | 10.1101/038729 | 2016-07-29 | 10.1534/genetics.116.187369 |
10155 | 2017-01-09 | 10.1101/098095 | 2016-07-29 | 10.1534/genetics.116.187369 |
26064 | 2016-05-26 | 10.1101/055517 | 2017-03-17 | 10.1534/genetics.116.196303 |
25909 | 2016-09-28 | 10.1101/078279 | 2017-03-17 | 10.1534/genetics.116.196303 |
10215 | 2016-11-17 | 10.1101/088260 | 2017-05-26 | 10.1534/genetics.116.198424 |
10213 | 2016-11-17 | 10.1101/088385 | 2017-05-26 | 10.1534/genetics.116.198424 |
7850 | 2017-04-03 | 10.1101/123554 | 2017-11-02 | 10.3390/e19110584 |
7427 | 2017-05-22 | 10.1101/140913 | 2017-11-02 | 10.3390/e19110584 |
22979 | 2017-03-31 | 10.1101/122580 | 2017-10-18 | 10.7554/elife.27356 |
23938 | 2017-04-08 | 10.1101/125765 | 2017-10-18 | 10.7554/elife.27356 |
22956 | 2017-04-19 | 10.1101/128595 | 2017-12-14 | 10.7554/elife.27827 |
19859 | 2017-11-01 | 10.1101/212274 | 2017-12-14 | 10.7554/elife.27827 |
9875 | 2017-04-24 | 10.1101/130054 | 2017-07-17 | 10.7554/elife.28069 |
9997 | 2017-04-28 | 10.1101/131995 | 2017-07-17 | 10.7554/elife.28069 |
16093 | 2016-11-03 | 10.1101/085548 | 2018-01-08 | 10.7554/elife.28927 |
4204 | 2017-06-28 | 10.1101/157263 | 2018-01-08 | 10.7554/elife.28927 |
11378 | 2017-04-07 | 10.1101/125419 | 2018-10-16 | 10.7554/elife.34870 |
11082 | 2018-01-25 | 10.1101/253872 | 2018-10-16 | 10.7554/elife.34870 |
I spot-checked a couple:
10483 2016-02-03 10.1101/038729 2016-07-29 10.1534/genetics.116.187369
10155 2017-01-09 10.1101/098095 2016-07-29 10.1534/genetics.116.187369
26064 2016-05-26 10.1101/055517 2017-03-17 10.1534/genetics.116.196303
25909 2016-09-28 10.1101/078279 2017-03-17 10.1534/genetics.116.196303
Looks like it might be folks uploading revisions...
I can envision a situation where multiple preprints end up being published in the same journal article. For example, perhaps two preprints were combined into a single work that was then published in a journal. Or as @cgreene mentions:
> Looks like it might be folks uploading revisions...

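
One rough way to triage the 28 groups before checking them by hand is to look at how far apart the preprints in each group were posted. The sketch below is only an illustration: `summarize_groups` and `example_df` are hypothetical names, the four rows are copied from the output table, and the gap in days is a prompt for manual inspection rather than evidence of either a revision or a merged manuscript.

```python
import pandas as pd

# Four rows copied from the output table; in practice this would be duplicate_df.
rows = [
    (10483, "2016-02-03", "10.1101/038729", "2016-07-29", "10.1534/genetics.116.187369"),
    (10155, "2017-01-09", "10.1101/098095", "2016-07-29", "10.1534/genetics.116.187369"),
    (10215, "2016-11-17", "10.1101/088260", "2017-05-26", "10.1534/genetics.116.198424"),
    (10213, "2016-11-17", "10.1101/088385", "2017-05-26", "10.1534/genetics.116.198424"),
]
columns = ["rxivist_preprint_id", "preprint_date", "preprint_doi", "journal_date", "journal_doi"]
example_df = pd.DataFrame(rows, columns=columns)


def summarize_groups(df):
    """Hypothetical helper: one row per journal DOI with preprint count and posting gap."""
    # Parse the date strings so that differences can be computed in days.
    df = df.assign(preprint_date=pd.to_datetime(df["preprint_date"]))
    grouped = df.groupby("journal_doi")["preprint_date"]
    summary = grouped.agg(n_preprints="count", first_posted="min", last_posted="max")
    summary["days_between"] = (summary["last_posted"] - summary["first_posted"]).dt.days
    return summary.sort_values("days_between")


summary_df = summarize_groups(example_df)
print(summary_df)
```

Calling `summarize_groups(duplicate_df)` on the full table would give one row per duplicated journal DOI, which could then be sorted or filtered to decide which groups to open first.
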
I've gone through a few more:
- https://doi.org/10.1101/156331 and https://doi.org/10.1101/165373 are both marked as now published in https://doi.org/10.1007/s00401-017-1789-4. I am not sure about this one.
- https://doi.org/10.1101/008755 and https://doi.org/10.1101/046995 are both marked as published in https://doi.org/10.1016/j.ajhg.2015.03.004. I am not sure about this one.
- https://doi.org/10.1101/125419 and https://doi.org/10.1101/253872 are both marked as published in https://doi.org/10.7554/elife.34870. Looks like revisions.
- https://doi.org/10.1101/122580 and https://doi.org/10.1101/125765 are both marked as published in https://doi.org/10.7554/elife.27356. Looks like revisions.
I initially thought I had found one where the two preprints should not both be associated with the same journal publication. However, given that many of these don't seem to be errors, I am not too worried.
This one is really odd. The number of authors dropped with the later preprint but went back up for the journal (if it's a revision):
> https://doi.org/10.1101/008755 and https://doi.org/10.1101/046995 both published in https://doi.org/10.1016/j.ajhg.2015.03.004. I am not sure about this one.

In any case, I agree with you that it doesn't seem like a big deal.
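
For what it's worth, the dates in the output table show one more oddity in the 10.1016/j.ajhg.2015.03.004 pair: the later preprint (10.1101/046995, posted 2016-04-05) is dated after the journal article (2015-04-17). A minimal sketch for listing such rows follows; `posted_after_publication` and `example_df` are hypothetical names, the rows are copied from the table, and the check says nothing about why a preprint is dated after its linked publication.

```python
import pandas as pd

# Four rows copied from the output table; in practice this would be duplicate_df.
rows = [
    (10702, "2014-09-04", "10.1101/008755", "2015-04-17", "10.1016/j.ajhg.2015.03.004"),
    (10428, "2016-04-05", "10.1101/046995", "2015-04-17", "10.1016/j.ajhg.2015.03.004"),
    (11378, "2017-04-07", "10.1101/125419", "2018-10-16", "10.7554/elife.34870"),
    (11082, "2018-01-25", "10.1101/253872", "2018-10-16", "10.7554/elife.34870"),
]
columns = ["rxivist_preprint_id", "preprint_date", "preprint_doi", "journal_date", "journal_doi"]
example_df = pd.DataFrame(rows, columns=columns)


def posted_after_publication(df):
    """Hypothetical helper: rows whose preprint date is later than the journal date."""
    preprint = pd.to_datetime(df["preprint_date"])
    journal = pd.to_datetime(df["journal_date"])
    return df[preprint > journal]


late_df = posted_after_publication(example_df)
print(late_df["preprint_doi"].tolist())
```
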