fgvieira/ngsLD

ERROR: [read_dist] invalid distance between adjacent sites! with multiple contigs

beausoleilmo opened this issue · 10 comments

I found another issue with a pos file being not sorted.

ngsLD --geno proj.beagle.gz --n_ind 219 --n_sites 210717 -posH beag.pos.gz --out testngsld

==> Input Arguments:
        geno: proj.beagle.gz
        probs: false
        log_scale: false
        n_ind: 219
        n_sites: 210717
        pos: beag.pos.gz (WITH header)
        max_kb_dist (kb): 100
        max_snp_dist: 0
        min_maf: 0.001000
        ignore_miss_data: false
        call_geno: false
        N_thresh: 0.000000
        call_thresh: 0.000000
        rnd_sample: 1.000000
        seed: 1663963270
        extend_out: false
        out: testngsld (WITHOUT header)
        n_threads: 1
        verbose: 1
        version: 1.1.1 (Sep 22 2022 @ 15:06:57)

==> GZIP input file (not BINARY)
> Reading data from file...
> Header found! Skipping line...
==> Calculating MAF for all sites...
==> Getting sites coordinates

=====
ERROR: [read_dist] invalid distance between adjacent sites!
=====

But I don't get why the file would need to be in order... my contigs are in 'order' and the sites within each contig is in order. What else do I need?

The pos file looks like this:

chr	pos
NC_044571.1	32435
NC_044571.1	32446
NC_044571.1	32448
NC_044571.1	32450
NC_044571.1	32460
NC_044571.1	32462
[...]

Can you upload a small test set so I can reproduce the error?

I think the software assumes that the chromosomes are in different files right?

No, you can have different chromosomes in the same beagle file.

OK, but when I use the whole beagle file it doesn't work, but when only one chromosome is selected, it works... I sent you the file to reproduce the problem.

The order of the chromosomes is irrelevant, but the sites do need to be sorted inside each chromosome. As such, it seems there are some issue in your files, at least:

[...]
NW_022147719.1  7982
NW_022147720.1  9458
NW_022147720.1  137
[...]

And also:

[...]
NW_022148950.1  11965
NW_022148950.1  11968
NW_022148950.1  1069
NW_022148950.1  1084
[...]

Thanks! How then would I need to sort the positions? I'm not sure how to group by chromosome and then sort the position. What would you suggest to sort the beagle file?

How did you create the beagle file? angsd gives the sites properly sorted.
If it was manually, then I'd just sort the beagle file based on chr and position (splitting on the last underscoreon column 1), and create the pos file from the beagle again.

It was done with atlas https://bitbucket.org/wegmannlab/atlas/wiki/Home.

So I'd need to sort only the marker column in the beagle file and then run the R script. Right?

Something like that:
zcat proj.beagle.gz | sort -k 1 -V | gzip > proj.sorted.beagle.gz ?
--> Note that the -V is important here since it will sort mixes of strings and numbers (numbers at the end of the string for the position of the marker. -n or -g will not work, at least for me)

I made the pos file with zcat proj.sorted.beagle | awk '{print $1}' | awk 'BEGIN{FS=OFS="_"}{last=$NF;NF--;print $0"\t"last}' | awk NR\>1 | gzip > beagle.pos.gz

zcat beagle.pos.gz | wc -l gives 210717

zcat beagle.pos.gz | head

Gives:

NC_044571.1     100029426
NC_044571.1     100029427
NC_044571.1     100042283
NC_044571.1     100042284
NC_044571.1     100042288
NC_044571.1     100042290
NC_044571.1     100042292
NC_044571.1     100042294
NC_044571.1     100042295
NC_044571.1     100042296

I just did that:

==> GZIP input file (not BINARY)
> Reading data from file...

=====
ERROR: [read_geno] GENO file not at EOF. Check GENO file and number of sites!
=====

        : Numerical result out of range

With sort -V the header is placed at the end of the file; just make sure the header is on top.
To get the pos file:

zcat proj.sorted.beagle | cut -f 1 | perl -p -e 's/_([^_]+$)/\t$1/' | gzip > beag.pos.gz

OK! Didn't know that behaviour!

So I got this:

zcat proj.beagle.gz | sort -k 1 -V | gzip > proj.sorted.beagle.gz
zcat proj.sorted.beagle.gz | grep -v marker | gzip > proj.sorted.beagle_nohead.gz
zcat proj.sorted.beagle_nohead.gz | cut -f 1 | perl -p -e 's/_([^_]+$)/\t$1/' | grep -v marker | gzip > beagle.pos.gz

nind=$(zcat proj.sorted.beagle_nohead.gz | head -n 1 | awk '{$1=$2=$3=""; print $0}' | awk '{ for (i=1;i<=NF;i+=3) print $i }' | wc -l)
nsites=$(zcat beagle.pos.gz | wc -l)

ngsLD --geno proj.sorted.beagle_nohead.gz --n_ind $nind --n_sites $nsites --pos beagle.pos.gz --out ./ngsld/testngsld.ld

And now it work! Thank you so much for your help!