pca sklearn failure in latest development version
chapmanb opened this issue · 10 comments
Brent;
We've been trying to use the latest development in bcbio (to get the initial hg38 support) and running into a failure during pca. Is there a minimum number of input samples needed to run pca, or am I doing something else wrong?
2018-05-30 23:07:51 r3467 peddy.pca[22570] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.8 seconds
Traceback (most recent call last):
File "/g/data3/gx8/local/development/bcbio/anaconda/bin/peddy", line 11, in <module>
load_entry_point('peddy==0.4.0', 'console_scripts', 'peddy')()
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 204, in peddy
in ("ped_check", "het_check", "sex_check")]):
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
prefix=prefix, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 871, in het_check
pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/pca.py", line 61, in pca
clf.fit(genos1kg, background_target)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 248, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 213, in _fit
**fit_params_steps[name])
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 348, in fit_transform
U, S, V = self._fit(X)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 394, in _fit
return self._fit_truncated(X, n_components, svd_solver)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 468, in _fit_truncated
% (n_components, n_features, svd_solver))
ValueError: n_components=4 must be between 1 and n_features=1 with svd_solver='randomized'
Ideally we'd love to be able to get ancestry prediction for single sample inputs. Thanks for any pointers/suggestions.
hmm. maybe when there's a single sample we need to use X[:, np.newaxis]
or something as the number of features should be ~ 20K.
I'll try to carve out some time on friday to work on peddy and get these resolved. thanks for reporting.
I can run with 1 sample on master. It seems like maybe you're only finding a single snp out of the full set.
E.g. your log info says (shape: (2504, 1))
. Are you using the right genome-build for your test-case?
I see the problem that's causing this. Will release 0.4.2 shortly.
actually, I don't think that should affect what you are reporting here. You can try updating to cyvcf2 v0.9.0 to see if that resolves it, but I don't think it will.
Brent;
Thanks for the help with this. I agree it's probably also not the parsing. It's possible there was 1 overlapping SNP; bcbio ends up throwing a lot at peddy: panels, somatic calls (when there is no germline, hoping for germline leakage). Is it possible for peddy to skip pca here if there are not enough sites? It's a bit hard to tell up from what and will not work, so I've been just doing garbage in/garbage out and handling outputs that aren't useful. Would skipping here be possible? Thanks again for taking a look at all this so quickly.
It could, but it uses those sites for the heterozygosity stuff too, so the only thing left would be the sex plots.
That would be helpful for me. I know the output is not especially useful, but in bcbio we often get not good enough data that goes in and would be great to have peddy exit cleanly just without the analysis we can't do. Right now I'm catching exceptions and doing other ugly things when it fails, so would love to improve on that. Thanks for considering this.
I'm wary of writing software that catches all possible problems and gives a 0 exit code.
I'm not completely against what you're saying, but I'm not sure how to implement. My though would be that if you send peddy--which has a list of sites that it uses, a VCF that contains none (or 1) of those sites, it's reasonable to raise an error.
I suppose the alternative is to output an informative error message and still exit 0.
Brent;
I definitely hear what you're saying about design. Raising a consistent error is also cool with me, it's just hard to tell up front if a dataset will have very few overlaps with the sites (without calculating it myself) so would be awesome if peddy told me this. Right now we essentially catch the symptom (index errors):
which I'd like to improve on. Thanks again for this discussion.