/trouble_w_triples

Docker image for reproducible BiSciCol paper

Primary LanguageRebol

This README replicates processes occurring on a local server during 2013-2014 within this Docker container.

Natural history museums and other biodiversity collection repositories/aggregators most commonly identify specimens with a "Darwin Core Triplet" which is an 'institution code' a 'collection code' and a 'catalog number' separated by colons. This exercise is to explore how these identifiers work as a means to bring records for the same specimen in different repositories back together.

To replicate this process but with current data from the three repositories instead of the canned snapshots found in the directory 'rawdata' please see the Fetch_repositories.txt. (You will need more disk storage space than this Docker image has.)

There are many excellent tutorials on starting up a Docker container which will amount to installing Docker then issuing a variant of:

   docker pull tomc/trouble_w_triples
   docker run -i -t tomc/trouble_w_triples:initial /bin/bash

The remainder of this text assumes you are at the command prompt within the Docker container.

By only keeping the fields from the repositories actually needed to replicate the studies in the triplets paper, the raw data is about 160M uncompressed and about 26M gzipped:

ls -1 rawdata/
IC_knownfilter.list						# (P.I.) curated list of institution codes
VN_vouchers.unl.gz						# well formed DwCT assembled from VertNet's ic cc & cn
locus_voucher.tab.gz					# the locus and specimen_voucher field from vertebrate GenBank divisions
sampleid_catalognum_bold_gbacc.tab.gz	# four fields from BOLD chordate records 2 might have DwCTs

The gzipped files are write protected so we always have a pristine copy to fall back on if our experiments become unintentionally destructive.

The easiest way to begin processing this data is with 'zcat', as in:

zcat rawdata/VN_vouchers.unl.gz | wc -l  
8216424  
zcat rawdata/locus_voucher.tab.gz  | wc -l  
595744  
zcat rawdata/sampleid_catalognum_bold_gbacc.tab.gz | wc -l  
216809

or since we may want to work with uncompressed copies:

zcat rawdata/VN_vouchers.unl.gz > data/VN_vouchers.unl  
zcat rawdata/locus_voucher.tab.gz >  data/locus_voucher.tab  
zcat rawdata/sampleid_catalognum_bold_gbacc.tab.gz > data/sampleid_catalognum_bold_gbacc.tab  

GenBank

When we are only looking at alternatively represented DwCT, the canonical DwCT can be counted then filtered out. Here I am using an '_x' to in the file names to indicate canonical DwCT have been removed

Filter out canonical vouchers:

grep -Ev  "[A-Z]{2,6}\:[A-Z][a-z]+\:.*[0-9]+.*" data/locus_voucher.tab > data/locus_voucher_x.tab   
wc -l data/locus_voucher_x.tab  
585257 data/locus_voucher_x.tab  

check if the locus are a Primary Key (unique):

cut -f1 data/locus_voucher_x.tab | sort -u | wc -l  
585159

LOCUS is not quite a primary key in this case. About 100 duplicated locus IDs, but no worries, we are just checking out of curiosity. (99.98% unique)

Classify:

bin/classify-dwct.reb --args "-i data/locus_voucher_x.tab" > data/locus_voucher_x_classed.tab 2> data/locus_voucher_x_classed.err
wc -l data/locus_voucher_x_classed.tab
433782

The error file will have attempts which could not be parsed:

cat data/locus_voucher_x_classed.err
::R12074 129 ::cn 

just this one without a viable institution code (IC).

The classification process allows for multiple DwCT per record, so the number of duplicate locus IDs can go up:

cut -f1 data/locus_voucher_x_classed.tab | sort -u | wc -l
423829  

We do not know how many of the original ~100 duplications made it in to this set of duplications but there are now 9,953 duplications on this parsed side. (down to 97.7% unique)

Filter for known Institution codes:

Principal investigators on this project manually vetted a list of thousands of Institution Codes from the "Global Registry of Biorepositories" to isolate the subset they believed could house vertebrate collections hence the "blessed" designation.

bin/filter_known_ic.awk -v"FILTER=rawdata/IC_knownfilter.list" < data/locus_voucher_x_classed.tab > data/locus_voucher_x_classed_blessed.tab
wc -l  data/locus_voucher_x_classed_blessed.tab
282,056 data/locus_voucher_classed_blessed.tab

cut -f1 data/locus_voucher_classed_blessed.tab | sort -u | wc -l
279,473

leaving 2,583 locus with alternative (or duplicate) DwCT (back up to 99.08% unique).

Report

The classifier reports on the types of issues it comes across changing a string into a DwCT but the main classes of issues can be seen in the error flag returned.

  • a non zero error flag means the classifier found something wrong
  • an even error flag means semantic issues exist
  • an odd error flag means syntactic issues exist
  • an odd error flag greater than one means both type of issues exist

An error flag of 0 would be no errors, but we filtered for canonical when we began so there should not be any the zeros here turn out to be cases where they only gave one of the two colons and I took a foolish shortcut and figured if they were using colons in one part then they were using colons in both parts: (TODO: fix & rerun)

cat data/locus_voucher_x_classed_blessed.tab | wc -l
282,056

The next command returns how many non-canonical DwCT were assigned which classification classifications are ANDed (summed) powers of two

cat data/locus_voucher_x_classed_blessed.tab | cut -f3 | sort | uniq -c | sort -nr
 
 167399 17
  93903 16			# even
  10365 19
   6724 265
   1349 1
    783 41
    635 0			# these should not be here    
    444 135
    378 133
     34 264			# even
     24 267
     10 3
      7 18			# even
      1 63

--------------------------------
	188,112		93,944

188,112 syntactic
280,075 semantic
186,128 both		 

cut -f2 data/locus_voucher_x_classed_blessed.tab | sort -u | wc -l
116,909 

Hmm, 282,056 / 116,909 = 2.412 appears to a fair amount of duplication, on average every non-canonical DwCT seems to shows up more than twice, lets see if that is true.

Remember this next command produces a distribution of counts for duplications, so the first number is how many non-canonical DwCT appeared, the second number of times:

cut -f2 data/locus_voucher_x_classed_blessed.tab |sort | uniq -c | awk '{print $1}' | sort -n | uniq -c | sort -k1,1nr -k2,2n
73958 1  (unique non-canonical DwCT)
18999 2
 8075 3
 4982 4
 3374 5
 2078 6
 1338 7
 1210 8
  784 10
  583 9
  219 12
  172 13
  141 11
  130 14
   99 17
   97 15
   93 16
   53 21
   49 18
   42 20
   36 22
   34 35
   33 19
   32 28
   28 23
   27 27
   26 26
   25 25
   22 24
   22 29
   13 30
   10 43
   8 33
   8 42
   7 34
   7 45
   6 36
   6 46
   5 31
   5 32
   5 38
   5 41
   4 40
   4 44
   4 48
   3 37
   3 39
   3 49
   3 50
  1 53
  1 62
  1 67
  1 76
  1 85
  1 91
  1 120
  1 327
  1 533
  1 593
  1 602
  1 649
  1 653
  1 670
  1 694
  1 731
  1 753
  1 756
  1 959
  1 1117
  1 1130
  1 1257
  1 1296
  1 1337
  1 1351
  1 1425
  1 1450
  1 1464
  1 1476
  1 1486
  1 1487
  1 1492
  1 1518
  1 1550
  1 1571
  1 1574
  1 1580
  1 1585
  1 1587

Since a couple dozen non-canonical DwCT appeared around a thousand times each and a majority of the non-canonical DwCT appeared only once (73,958 of the 116,909 or 63.26%) I think it is safe to assume we have fundamentally different populations mixed together here, a slight majority where there is a sequence per specimen and a minority where there are many sequences per specimen with a few stragglers in between.

####We are also interested in all DwCT (without filtering out canonical):

bin/classify-dwct.reb --args "-i data/locus_voucher.tab" > data/locus_voucher_classed_all.tab 2> data/locus_voucher_classed_all.err
bin/filter_known_ic.awk -v"FILTER=rawdata/IC_knownfilter.list" <  data/locus_voucher_classed_all.tab >  data/locus_voucher_classed_blessed_all.tab
grep "::" data/locus_voucher_classed_blessed_all.tab | sort -k2,2 -t $'\t' > data/locus_voucher_doublets_all.tab
grep -v "::" data/locus_voucher_classed_blessed_all.tab | sort -k2,2 -t $'\t' > data/locus_voucher_triplets_all.tab

wc -l data/locus_voucher_classed_all.tab data/locus_voucher_classed_blessed_all.tab data/locus_voucher_doublets_all.tab data/locus_voucher_triplets_all.tab
  444191 locus_voucher_classed_all.tab
  292453 locus_voucher_classed_blessed_all.tab
  279279 locus_voucher_doublets_all.tab
   13174 locus_voucher_triplets_all.tab
 1029097 total

========= #VN GB comparisons

Initial datasource:

	wc -l data/VN_vouchers.unl
	8216424 data/VN_vouchers.unl

	grep -v "::" data/VN_vouchers.unl | sort > data/VN_triplets.unl
	grep "::" data/VN_vouchers.unl  | sort > data/VN_doublets.unl

T-T: (triple to triple comparison)

join -11 -22 data/VN_triplets.unl data/locus_voucher_triplets_all.tab  | wc -l
4571

T-T!: (unique! triple to triple comparison)

join -11 -22 data/VN_triplets.unl data/locus_voucher_triplets_all.tab  | cut -f1 |sort -u | wc -l
4571

D-D: (doublet to doublet comparison)

join -11 -22 data/VN_doublets.unl data/locus_voucher_doublets_all.tab  | wc -l
0

not surprising, VN only has a handful of doublets:

cut -f1 -d \:  data/VN_doublets.unl | uniq
TTRS
grep "TTRS::"  data/locus_voucher_doublets_all.tab

VN doublets confirmed for not existing in GB doublets.

####Treat VN triplets as doublets: (omit the collection code)
sed 's|:[^:]*:|::|g' data/VN_triplets.unl | sort > data/VN_triplets_gutted.unl

head data/VN_triplets_gutted.unl
CAS::1
CAS::1
CAS::1
CAS::1
CAS::1
CAS::1
CAS::10
CAS::10
CAS::10
CAS::10

sort -u data/VN_triplets_gutted.unl > data/VN_triplets_gutted_distinct.unl

wc -l data/VN_triplets_gutted.unl data/VN_triplets_gutted_distinct.unl
  8214727 data/VN_triplets_gutted.unl
  5682825 data/VN_triplets_gutted_distinct.unl

join -11 -22 data/VN_triplets_gutted.unl data/locus_voucher_doublets_all.tab  | wc -l
104027

join -11 -22 data/VN_triplets_gutted_distinct.unl data/locus_voucher_doublets_all.tab  | cut -f1 | sort -u | wc -l
58116

grep "TTRS:"  data/locus_voucher_triplets_all.tab
#just checking if the institution with VN doublets appears in GB triples. it does not.  

#BOLD GB comparisons:

cat data/ID_sampleid_classified_blessed_only.tab data/ID_catalognum_classified_blessed_only.tab data/ID_agree_classified_blessed.tab | grep -v "::" | sort -k2,2 -> data/ID_all_triplets.tab

cat data/ID_sampleid_classified_blessed_only.tab data/ID_catalognum_classified_blessed_only.tab data/ID_agree_classified_blessed.tab | grep "::" | sort -k2,2 > data/ID_all_doublets.tab  

T-T:

join -j2 -t '\\t' data/ID_all_triplets.tab data/locus_voucher_triplets_all.tab  | wc -l
67

T-T!:

join -j2 data/ID_all_triplets.tab data/locus_voucher_triplets_all.tab | cut -f1 -d ' ' | sort -u |wc -l
60

T-D:

join -j2 data/ID_all_triplets_gutted.tab data/locus_voucher_doublets_all.tab | wc -l
69

T-D!:

join -j2 data/ID_all_triplets_gutted.tab data/locus_voucher_doublets_all.tab | cut -f1  -d ' '| sort -u | wc -l
30

D-D:

join -j2 data/ID_all_doublets.tab data/locus_voucher_doublets_all.tab  | wc -l
283,875  #... that is a scary big number. 

What causes so many doublet matches here?

  • check the individual components
  • check the union
	cut -f2  data/ID_all_doublets.tab  | sort | uniq -c | sort -nr | head  
	    288 SAIAB::ES08
	    115 SAIAB::ES07
	     76 ECOCH::7192
	     73 ECOCH::7009
	     56 MNHN::2009
	     52 ECOCH::6416
	     42 ECOCH::6109
	     41 ECOCH::5911
	     35 SAIAB::ES06
	     27 UAIC::14963.01

	cut -f2  data/locus_voucher_doublets_all.tab  | sort | uniq -c | sort -nr | head  
	   1587 LSUMZ::B927
	   1585 LSUMZ::B1980
	   1580 LSUMZ::B28330
	   1574 LSUMZ::B37197
	   1571 LSUMZ::B5354
	   1550 LSUMZ::B7513
	   1518 LSUMZ::B103926
	   1492 LSUMZ::B7923
	   1487 LSUMZ::B36554
	   1486 LSUMZ::B37257
	
	join -j2 data/ID_all_doublets.tab data/locus_voucher_doublets_all.tab | cut -f1 -d ' '| sort |uniq -c | sort -nr | head  
	 221778 INIDEP::T
	  16426 ZMMU::SVK
	    858 NME::A
	    670 ZMMU::RAN
	    450 ZMMU::RYA
	    364 NMP::6V
	    364 BPBM::FR
	    256 PIN::RVV
	    256 MCZ::335
	    192 SIO::93-298

ahh the old spaces within a catalog number trick ...
INIDEP::T 0224 (and it's ilk ) need to be seen as "INIDEP::T 0224" not "INIDEP::T" mumble mumble

join -j2 -t'	' data/ID_all_doublets.tab data/locus_voucher_doublets_all.tab  | wc -l
42,975			# this is much more believable than 283,875
				# note there is a  Ctrl-v<TAB>  within the tics of -t '' 
				# which will not paste into a command line. 

D-D!:

join -j2 data/ID_all_doublets.tab data/locus_voucher_doublets_all.tab  | cut -f1  -d ' '| sort -u | wc -l
27,936

sed 's|:[^:]*:|::|g' data/locus_voucher_triplets_all.tab  | sort > data/locus_voucher_triplets_all_gutted.tab

D-T:

join -j2 data/reclassify/ID_all_.tab data/locus_voucher_doublets_all_gutted.tab | wc -l
0

========

#VN BOLD comparisons

T-T:

join -12 -21  data/ID_all_triplets.tab  data/VN_triplets.unl | wc -l
103

T-D:

join -12 -21  data/ID_all_doublets.tab  data/VN_triplets_gutted.unl |wc -l
30,161 

T-D!:

join -12 -21  data/ID_all_doublets.tab  data/VN_triplets_gutted.unl | cut -f1 | sort -u |wc -l
18,766

D-T:

join -12 -21  data/ID_all_triplets_gutted.tab  data/VN_doublets.unl |wc -l
0     # not checking for unique as it ain't getting no smaller

D-D:

join -12 -21  data/ID_all_doublets.tab data/VN_doublets.unl |wc -l
0

All Matches

For the area proportional ellipse diagrams I really just want "all matches between sources" irrespective of whether they are canonical/sloppy, triplet/doublet, exact/inexact, natural/coerced or whatever.

We will classify BOLD's catalognum & sampleid columns, filter on IC then merge (only keep one copy where the columns agree):

bin/classify-dwct.reb --args "-i data/ID_sampleid.tab" > data/ID_sampleid_classified_all.tab 2> data/ID_sampleid_classified_all.err 
bin/classify-dwct.reb --args "-i data/ID_catalognum.tab" > data/ID_catalognum_classified_all.tab 2> data/ID_catalognum_classified_all.err

Filter for blessed IC:

bin/filter_known_ic.awk -v "FILTER=rawdata/IC_knownfilter.list" <  data/ID_sampleid_classified_all.tab >  data/ID_sampleid_classified_blessed_all.tab
bin/filter_known_ic.awk -v "FILTER=rawdata/GenBankII/IC_knownfilter.list" <  data/ID_catalognum_classified_all.tab >  data/ID_catalognum_classified_blessed_all.tab

comm -12 data/ID_catalognum_classified_blessed_all.tab data/ID_sampleid_classified_blessed_all.tab | wc -l
2,248 (including ID and classification which don't matter)

cut -f2 data/ID_catalognum_classified_blessed_all.tab | sort >  data/catalognum_alldone.list
cut -f2 data/ID_sampleid_classified_blessed_all.tab | sort >   data/sampleid_alldone.list

comm -12  data/catalognum_alldone.list data/sampleid_alldone.list | wc -l
2,280	# meh.  underwhelming number of DwCT given in two columns of the same specimen record agree (>90% don't).

comm -12  data/catalognum_alldone.list data/sampleid_alldone.list > data/shared_alldone.list
comm -13  data/catalognum_alldone.list data/sampleid_alldone.list > data/sampleid_only_alldone.list
comm -23  data/catalognum_alldone.list data/sampleid_alldone.list > data/catalognum_only_alldone.list

wc -l data/shared_alldone.list data/sampleid_only_alldone.list data/catalognum_only_alldone.list
  2,280 data/shared_alldone.list
 37,701 data/sampleid_only_alldone.list
 25,299 data/catalognum_only_alldone.list
 65,280 total                          that is up about 600

cat data/shared_alldone.list data/sampleid_only_alldone.list data/catalognum_only_alldone.list |sort > data/bold_dwct_all.list
sort -u  data/bold_dwct_all.list | wc -l
57225                                    unique BOLD DwCT available

BOLD's list of 65,280 effective DwCT
data/bold_dwct_all.list

Find ALL effective GenBank specimen_voucher DwCT:

bin/classify-dwct.reb  --args "-i data/locus_voucher.tab" > data/locus_voucher_classed_all.tab  2> data/locus_voucher_classed_all.err
bin/filter_known_ic.awk -v"FILTER=rawdata/IC_knownfilter.list" <  data/locus_voucher_classed_all.tab  >  data/locus_voucher_classed_blessed_all.tab


cut -f2 data/locus_voucher_classed_blessed_all.tab | sort > data/genbank_dwct_all.list

wc -l data/genbank_dwct_all.list
292,453 
sort -u data/genbank_dwct_all.list | wc -l 
123,111

GenBank's list of 292,453 effective DwCT:

data/genbank_dwct_all.list

VN should be easy:

sort -c data/VN_triplets.unl

yep.


Exact (triplet or doublet)

comm  -12 data/bold_dwct_all.list data/genbank_dwct_all.list  > data/bold_genbank_exact.list
comm  -12 data/bold_dwct_all.list data/VN_triplets.unl  > data/bold_vn_exact.list
comm  -12 data/genbank_dwct_all.list data/VN_triplets.unl  > data/genbank_vertnet_exact.list

wc -l  data/bold_genbank_exact.list data/bold_vn_exact.list data/genbank_vertnet_exact.list

32242 data/bold_genbank_exact.list 103 data/bold_vn_exact.list 2219 data/genbank_vertnet_exact.list 34564 total

comm -12  data/bold_genbank_exact.list data/bold_vn_exact.list > data/bold_genbank_vn_exact.list
wc -l data/bold_genbank_vn_exact.list
45  --> all UAM:Mamm:xxx

Inexact (only one or the other being a natural doublet)

comm -23 data/bold_dwct_all.list data/genbank_dwct_all.list  > data/b-g.list
comm -23 data/bold_dwct_all.list data/VN_triplets.unl  > data/b-v.list
comm -23 genbank_dwct_all.list data/VN_triplets.unl  > data/g-v.list

comm -13 data/bold_dwct_all.list data/genbank_dwct_all.list  > data/g-b.list
comm -13 data/bold_dwct_all.list data/VN_triplets.unl  > data/v-b.list
comm -13 data/genbank_dwct_all.list data/VN_triplets.unl  > data/v-g.list


wc -l data/b-g.list  data/b-v.list  data/g-b.list  data/v-b.list data/g-v.list data/v-g.list
  33038 data/b-g.list
  65177 data/b-v.list
 260211 data/g-b.list
8214624 data/v-b.list
 290234 data/g-v.list
8212508 data/v-g.list

Produce a new set of doublets from the triplets which the existing natural doublets can try to match. The (natural) ones that matched the first time are already merged so can't be recounted now.:

cat data/bold_genbank_exact.list data/g-b.list | sed 's/:[^:]*:/::/g'| sort > data/bgUg-b_doublet.list
cat data/bold_vn_exact.list data/v-b.list | sed 's/:[^:]*:/::/g' | sort > data/bvUv-b_doublet.list
cat data/genbank_vertnet_exact.list data/v-g.list | sed 's/:[^:]*:/::/g' | sort > data/gvUv-g_doublet.list

head data/bgUg-b_doublet.list data/bvUv-b_doublet.list data/gvUv-g_doublet.list

wc -l data/bgUg-b_doublet.list data/bvUv-b_doublet.list data/gvUv-g_doublet.list
   292453 data/bgUg-b_doublet.list
  8214727 data/bvUv-b_doublet.list
  8214727 data/gvUv-g_doublet.list  

comm -12 data/b-g.list data/bgUg-b_doublet.list > data/bold_gb_inexact.list
comm -12 data/b-v.list data/bvUv-b_doublet.list > data/bold_vn_inexact.list
comm -12 data/g-v.list data/gvUv-g_doublet.list > data/gb_vn_inexact.list

wc -l data/bold_gb_inexact.list  data/bold_vn_inexact.list data/gb_vn_inexact.list
   466 data/bold_gb_inexact.list
 18200 data/bold_vn_inexact.list
 39755 data/gb_vn_inexact.list

cat data/bold_genbank_exact.list b-g.list | sed 's/:[^:]*:/::/g'| sort > data/bgUb-g_doublet.list
cat data/bold_vn_exact.list b-v.list      | sed 's/:[^:]*:/::/g'| sort > data/bvUb-v_doublet.list
cat data/genbank_vertnet_exact.list g-v.list | sed 's/:[^:]*:/::/g' | sort > data/gvUg-v_doublet.list

wc -l data/bgUb-g_doublet.list data/bvUb-v_doublet.list  data/gvUg-v_doublet.list
    65280 data/bgUb-g_doublet.list
    65280 data/bvUb-v_doublet.list
   292453 data/gvUg-v_doublet.list

comm -12 g-b.list data/bgUb-g_doublet.list > data/bold_gb_inexact2.list
comm -12 v-b.list data/bvUb-v_doublet.list > data/bold_vn_inexact2.list
comm -12 v-g.list data/gvUg-v_doublet.list > data/gb_vn_inexact2.list

wc -l data/bold_gb_inexact2.list data/bold_vn_inexact2.list  data/gb_vn_inexact2.list
   2688 data/bold_gb_inexact2.list
      0 data/bold_vn_inexact2.list
      0 data/gb_vn_inexact2.list

comm -12  data/bold_gb_inexact.list data/bold_gb_inexact2.list
# nothing in common is good here

cat data/bold_gb_inexact.list data/bold_gb_inexact2.list data/bold_genbank_exact.list | sort > data/bold_genbank_match_II_all.list
cat data/bold_vn_inexact.list data/bold_vn_inexact2.list data/bold_vn_exact.list | sort > data/bold_vernet_match_II_all.list

cat gb_vn_inexact.list gb_vn_inexact2.list genbank_vertnet_exact.list | sort > data/genbank_vernet_match_II_all.list

wc -l data/bold_genbank_match_II_all.list data/bold_vernet_match_II_all.list data/genbank_vernet_match_II_all.list
   35396 data/bold_genbank_match_II_all.list                ********************
   18303 data/bold_vernet_match_II_all.list                 ********************
   41974 data/genbank_vernet_match_II_all.list              ********************
   95673 total

Find all matches in common

comm -12  data/bold_genbank_match_II_all.list data/bold_vernet_match_II_all.list > data/bold_genbank_vertnet_match_II_all
wc -l data/bold_genbank_vertnet_match_II_all
16,048   													********************  

Only unique matches:

sort -u  data/genbank_vernet_match_II_all.list | wc -l
35,280
sort -u  data/bold_vernet_match_II_all.list | wc -l
17,737
sort -u  data/bold_genbank_match_II_all.list | wc -l
3,1139		
sort -u  data/bold_genbank_vertnet_match_II_all | wc -l
15,847