mgalardini/pyseer

lineage specific unitigs

Closed this issue ยท 10 comments

Hello, thank you for this excellent tool.

I'd like to investigate genetic differences among the populations defined by population structure analysis.
(pop1 n=67, pop2 n=250)
But I got an error while creating qqplot.

At first, to count unitigs from all strains sequences, I used unitig-caller with following code, then obtained out_unitig.pyseer.
unitig-caller --call --refs input.txt --out out_unitig

Then I performed pyseer with --no-distances option
pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --no-distances --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-out.txt
This pyseer run finished within 1 hour.

pyseer-out.txt included a lot of bad-chisq.

Then, I tried to create qqplot,
python qq_plot.py pyseer-out.txt
but I got an error and qqplot was something strange.

pyseer/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: invalid value encountered in log10
result = getattr(ufunc, method)(*inputs, **kwargs)

qq_plot

Could you please how should I interpret?

Best regard,

Could you please paste here your pyseer-out.txt table? from the looks of it you have negative p-values that have been used to make the qq-plot, which may indicate some problems with the table

Thank you for your reply.
Here is the part of pyseer-out.txt.
pyseer-out2.txt

In addition, actually I have 3 populations, so when I tried to investigate pop1 specific unitigs, I specified pop1 "0", pop2&3 "1" in trait file. Does it affect the results? And the number of samples was enough to analysis(n=67 vs n=250)?

Best regards,

Ah I see, this is a bug on formatting the output table when using --no-distances (I think). For now you can solve the issue by changing the header of the output table manually so that there is an additional column name (e.g. X) between intercept and notes. I'll folow up later with a fix

Found the error, and will merge as soon as tests are done. Will likely make a new release after that so the fix is available through conda. For now the manual fix should work.

Thank you so much.
I tried to fix manually using following code.
cat pyseer-out.txt | awk 'BEGIN{ OFS="\t" }{print $1,$2,$3,$4,$5,$6,$7,"X",$8}' > pyseer-out2.txt
qq_plot2
Then I got this qqplot. This means p-values were inflated?

I think in this case, there are a lot of false negative unitings.
Are there any suggestions to improve?

Thanks,

I believe that the relatively low sample size and the absence of population structure correction might be to blame. Are at least the lowest p-values given to things you were expecting?

Thank you for reply.

I tried to correct population structure using two ways.

When running the fixed effects mode, this qqplot was obtained.(fixed effects model)
pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --distances mash.tsv --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-u-mash.txt
qq_plot-mash2

When running FaST-LMM, this qqplot was obtained.
pyseer --phenotypes trait.txt --kmers out_unitig.pyseer --uncompressed --lmm --similarity genmotype_kinship.tsv --output-patterns unitigs_pattern.txt --min-af 0.05 --max-af 0.99 --cpu 15 > pyseer-u-llm.txt
qq_plot-lmm3

I think the latter setting is better for my dataset, could you please give me your opinion?

Then I checked the gene which includes significant unitigs, the gene was one of interested genes detected by another analysis(not gwas).

Best regards,

They both look fine to me (with the caveat that I'm not familiar with the phenotypic data you have), and the second one looks like a textbook example of a qq-plot, and so you are possibly right.

Closing the issue for now but do let me know if anything else looks unclear.

Thank you for your kindness.