ERROR: [read_dist] invalid distance between adjacent sites! with multiple contigs
beausoleilmo opened this issue · 10 comments
I found another issue with a pos file being not sorted.
ngsLD --geno proj.beagle.gz --n_ind 219 --n_sites 210717 -posH beag.pos.gz --out testngsld
==> Input Arguments:
geno: proj.beagle.gz
probs: false
log_scale: false
n_ind: 219
n_sites: 210717
pos: beag.pos.gz (WITH header)
max_kb_dist (kb): 100
max_snp_dist: 0
min_maf: 0.001000
ignore_miss_data: false
call_geno: false
N_thresh: 0.000000
call_thresh: 0.000000
rnd_sample: 1.000000
seed: 1663963270
extend_out: false
out: testngsld (WITHOUT header)
n_threads: 1
verbose: 1
version: 1.1.1 (Sep 22 2022 @ 15:06:57)
==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates
=====
ERROR: [read_dist] invalid distance between adjacent sites!
=====
But I don't get why the file would need to be in order... my contigs are in 'order' and the sites within each contig is in order. What else do I need?
The pos file looks like this:
chr pos
NC_044571.1 32435
NC_044571.1 32446
NC_044571.1 32448
NC_044571.1 32450
NC_044571.1 32460
NC_044571.1 32462
[...]
Can you upload a small test set so I can reproduce the error?
I think the software assumes that the chromosomes are in different files right?
No, you can have different chromosomes in the same beagle file.
OK, but when I use the whole beagle file it doesn't work, but when only one chromosome is selected, it works... I sent you the file to reproduce the problem.
The order of the chromosomes is irrelevant, but the sites do need to be sorted inside each chromosome. As such, it seems there are some issue in your files, at least:
[...]
NW_022147719.1 7982
NW_022147720.1 9458
NW_022147720.1 137
[...]
And also:
[...]
NW_022148950.1 11965
NW_022148950.1 11968
NW_022148950.1 1069
NW_022148950.1 1084
[...]
Thanks! How then would I need to sort the positions? I'm not sure how to group by chromosome and then sort the position. What would you suggest to sort the beagle file?
How did you create the beagle file? angsd
gives the sites properly sorted.
If it was manually, then I'd just sort the beagle file based on chr and position (splitting on the last underscoreon column 1), and create the pos file from the beagle again.
It was done with atlas
https://bitbucket.org/wegmannlab/atlas/wiki/Home.
So I'd need to sort only the marker
column in the beagle file and then run the R script. Right?
Something like that:
zcat proj.beagle.gz | sort -k 1 -V | gzip > proj.sorted.beagle.gz
?
--> Note that the -V is important here since it will sort mixes of strings and numbers (numbers at the end of the string for the position of the marker. -n or -g will not work, at least for me)
I made the pos file with zcat proj.sorted.beagle | awk '{print $1}' | awk 'BEGIN{FS=OFS="_"}{last=$NF;NF--;print $0"\t"last}' | awk NR\>1 | gzip > beagle.pos.gz
zcat beagle.pos.gz | wc -l
gives 210717
zcat beagle.pos.gz | head
Gives:
NC_044571.1 100029426
NC_044571.1 100029427
NC_044571.1 100042283
NC_044571.1 100042284
NC_044571.1 100042288
NC_044571.1 100042290
NC_044571.1 100042292
NC_044571.1 100042294
NC_044571.1 100042295
NC_044571.1 100042296
I just did that:
==> GZIP input file (not BINARY)
> Reading data from file...
=====
ERROR: [read_geno] GENO file not at EOF. Check GENO file and number of sites!
=====
: Numerical result out of range
With sort -V
the header is placed at the end of the file; just make sure the header is on top.
To get the pos file:
zcat proj.sorted.beagle | cut -f 1 | perl -p -e 's/_([^_]+$)/\t$1/' | gzip > beag.pos.gz
OK! Didn't know that behaviour!
So I got this:
zcat proj.beagle.gz | sort -k 1 -V | gzip > proj.sorted.beagle.gz
zcat proj.sorted.beagle.gz | grep -v marker | gzip > proj.sorted.beagle_nohead.gz
zcat proj.sorted.beagle_nohead.gz | cut -f 1 | perl -p -e 's/_([^_]+$)/\t$1/' | grep -v marker | gzip > beagle.pos.gz
nind=$(zcat proj.sorted.beagle_nohead.gz | head -n 1 | awk '{$1=$2=$3=""; print $0}' | awk '{ for (i=1;i<=NF;i+=3) print $i }' | wc -l)
nsites=$(zcat beagle.pos.gz | wc -l)
ngsLD --geno proj.sorted.beagle_nohead.gz --n_ind $nind --n_sites $nsites --pos beagle.pos.gz --out ./ngsld/testngsld.ld
And now it work! Thank you so much for your help!