fgvieira/ngsF

[main] ERROR: wrong number of sites or invalid/corrupt file!

RCWilliams opened this issue · 9 comments

Is there a way to calculate the number of sites within ngsF? I am currently getting "[main] ERROR: wrong number of sites or invalid/corrupt file!" and I have no idea why. I'm trying to find a way around it.

Hi Williams,

Sorry for the late reply, but I've been away on holidays. Did you manage to fix it?
You can check the expected size of the input file as described in the README file.

Thanks for your response!
I can see where I would use the number of sites to calculate the size, but is there a way to see the number of sites? (I'm sorry if this should be very obvious!)

I'm not quite sure I understand your question... You are trying to run ngsF with a certain number of sites, but it gives an error. That is usually because the number of sites/individuals does not match the size of the file, either because the file is truncated or because the number of sites/individuals is wrong. You can get an idea of the file size ngsF expects from the formula in the README file. Briefly: total_size_bytes = 3 * 8 * n_ind * n_sites

What is the command line you are using and how big (in bytes and uncompressed) is your file?
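To answer the "how do I see the number of sites" question, the formula above can be inverted: n_sites = total_size_bytes / (3 * 8 * n_ind). Below is a minimal Python sketch of that arithmetic. It is not part of ngsF; the function names are hypothetical, and it assumes the size is taken from the uncompressed file and that the formula quoted above describes the whole file (no header or padding).

```python
import os

# Constants taken from the formula quoted in this thread:
# total_size_bytes = 3 * 8 * n_ind * n_sites
VALUES_PER_SITE = 3   # the "3" in the formula
BYTES_PER_VALUE = 8   # the "8" in the formula


def infer_n_sites(file_size_bytes: int, n_ind: int) -> int:
    """Invert the size formula to get n_sites (hypothetical helper)."""
    if n_ind <= 0:
        raise ValueError("n_ind must be a positive integer")
    bytes_per_site = VALUES_PER_SITE * BYTES_PER_VALUE * n_ind
    n_sites, remainder = divmod(file_size_bytes, bytes_per_site)
    if remainder != 0:
        # A size that is not a whole number of sites suggests a truncated
        # file or a wrong n_ind.
        raise ValueError(
            f"{file_size_bytes} bytes is not a multiple of {bytes_per_site} "
            f"bytes per site; file truncated or n_ind wrong?"
        )
    return n_sites


def n_sites_from_file(path: str, n_ind: int) -> int:
    """Same calculation, reading the size of an uncompressed file on disk."""
    return infer_n_sites(os.path.getsize(path), n_ind)
```

For example, `n_sites_from_file("input.glf", n_ind=1)` (with `input.glf` standing in for your own uncompressed file) would return the value to pass as --n_sites, or raise an error if the size does not divide evenly.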

I just remade the glf (because of what you said about it potentially being truncated) and this has fixed the n_sites problem! The size wasn't matching, so thank you!

My command line is:

ngsF --n_ind 1 --n_sites 29464184 --glf S1_June17.glf --out het_S1

and I am now receiving:

[check_interv] ERROR: value is NaN!

Is this again user error at my end?
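As a worked check of the size formula quoted earlier in the thread against the command line above, here is a small Python sketch. The helper names are hypothetical and not part of ngsF; it assumes the formula covers the whole uncompressed file.

```python
def expected_glf_size(n_ind: int, n_sites: int) -> int:
    """File size in bytes implied by total_size_bytes = 3 * 8 * n_ind * n_sites."""
    return 3 * 8 * n_ind * n_sites


def size_difference(actual_size_bytes: int, n_ind: int, n_sites: int) -> int:
    """Actual minus expected size; 0 means the file matches the arguments."""
    return actual_size_bytes - expected_glf_size(n_ind, n_sites)


# Values from the command line above: --n_ind 1 --n_sites 29464184
print(expected_glf_size(1, 29464184))  # 707140416
```

So with these arguments the uncompressed file should be 707,140,416 bytes. A negative `size_difference` would point to a truncated file, and a positive one to an --n_sites or --n_ind value that is too small.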

I'm not sure what you are trying to do, but why are you using only one individual?
ngsF needs allele frequencies, so either you have several samples from the same population (ideally around 20) or you need to provide the allele frequencies yourself (--init_values and --freq_fixed).

That said, 29464184 sounds like quite a large number of SNPs for just one sample. Did you call SNPs first?

cheers,

That must be why I'm running into issues then, because I don't have population-level data. I have four high-coverage individuals from four species, each ~10 million years diverged from the reference genome (which I think is why I have a large number of SNPs).

I generated my glf from a bam using pileup in samtools-hybrid, is this not correct?

Thanks so much for the time you’ve spent on this!

Do you have allele frequency data from other sources (reference panels, other datasets, etc.)?
If not, then I'm afraid you can't use ngsF...

No, I don't, which is the problem that I keep running into. Thank you so much for your time on this. I think I will keep this analysis on the sidelines until I have more data available. Thanks again!

No problem, and let me know if you have any other questions.