gbradburd/conStruct

Error freqs using RADseq data

Closed this issue · 7 comments

Dear Gideon,

I'm trying to run conStruct with SNPs obtained using RADseq. I have 180 samples with 1249 loci, and I got the following error:

"you have specified an invalid data partition "data" element that is not positive definite"

I tried the solutions that you recommended in previous issues, but I still get the same error. Do you have any other recommendations? I've attached my freqs table in case you have time to take a look.

Thanks a lot,

Fabian

freqs.txt

Hm - I'm not able to replicate that problem: if I do

library(conStruct)
freqs <- read.table("freqs.txt", header=TRUE, row.names=1)
fd <- conStruct:::process.freq.data(freqs)

I do not get any errors, and these are all positive:

eigen(conStruct:::calc.covariance(freqs))$values

Is your conStruct up to date? If you run that code above, what do you get?

@petrelharp Fabian got this error trying to run a cross-validation analysis (I can tell from the error message, which references a "data partition"). Although the sample allelic covariance for the entire dataset is positive definite, I think the problem here is that the sample covariance calculated for one or more randomly subsetted partitions of the data is not.
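A rough way to see this in R -- this just mimics the random subsetting of loci (it is not the package's internal partitioning code), using an arbitrary training proportion and number of replicates; the real values would come from the x.validation call:

	library(conStruct)

	# freqs read in as above: samples in rows, loci in columns
	freqs <- as.matrix(read.table("freqs.txt", header = TRUE, row.names = 1))

	# draw several random subsets of loci and check whether each subset's
	# sample allelic covariance stays positive definite
	set.seed(1)
	min.eigs <- sapply(1:8, function(i) {
	    train.loci <- sample(ncol(freqs), size = round(0.9 * ncol(freqs)))
	    part.cov <- conStruct:::calc.covariance(freqs[, train.loci])
	    if (any(is.na(part.cov))) return(NA)  # NA = some sample pair shares no loci in this subset
	    min(eigen(part.cov)$values)
	})
	min.eigs  # any value <= 0 (or NA) flags a partition whose covariance is not positive definite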

Fabian, can you send us the code you were using to call x.validation? You have a few samples with a really high proportion of missing data, so it's possible they're the issue. But also, check to make sure that you've installed the most recent version of conStruct (instructions below), as we've recently updated how the sample covariance is calculated.

To install the most recent version of conStruct:

        library(devtools)
        install_github("gbradburd/conStruct", build_vignettes = TRUE)

Thank you so much for your answers,

I just checked, and I have conStruct_1.0.5; is this the newest version?

The line that I am running is the following:
my.xvals <- x.validation(train.prop = 0.6, n.reps = 8, K = 1:10,
                         freqs = as.matrix(freqs), data.partitions = NULL,
                         geoDist = dis_geo, coords = coords,
                         prefix = "GreatTitsConStruct", n.iter = 1e3,
                         make.figs = TRUE, save.files = TRUE,
                         parallel = TRUE, n.nodes = 2)

I followed the vignette example. I just checked, and I get the same error even with lower values of the training proportion.

It looks like it could be a missing data issue. If I drop all individuals w/ greater than 25% missing data, it runs fine for me. I haven't explored how sensitive results are to that cutoff - it's possible you could get away w/ a higher proportion of missing data.
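A minimal sketch of that filtering step, reusing the call from above -- the 0.25 cutoff is the one mentioned here, the *.trim names are just illustrative, and the important bit is subsetting freqs, geoDist, and coords by the same samples so their rows stay aligned:

	# keep only samples with <= 25% missing data
	keep <- rowMeans(is.na(freqs)) <= 0.25
	freqs.trim   <- freqs[keep, , drop = FALSE]
	dis_geo.trim <- dis_geo[keep, keep]
	coords.trim  <- coords[keep, , drop = FALSE]

	# rerun the cross-validation on the trimmed dataset
	my.xvals <- x.validation(train.prop = 0.6, n.reps = 8, K = 1:10,
	                         freqs = as.matrix(freqs.trim),
	                         data.partitions = NULL,
	                         geoDist = dis_geo.trim, coords = coords.trim,
	                         prefix = "GreatTitsConStruct", n.iter = 1e3,
	                         make.figs = TRUE, save.files = TRUE,
	                         parallel = TRUE, n.nodes = 2)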

W/r/t the package version, v.1.0.5 should be correct, but you can confirm by checking the output of conStruct:::calc.covariance, which should look like the following:

conStruct:::calc.covariance
function (freqs) 
{
    x <- t(freqs)
    allelic.covariance <- (1 - 1/nrow(freqs)) * stats::cov(x, 
        use = "pairwise.complete.obs") - (1/2) * outer(colMeans(x, 
        na.rm = TRUE), 1 - colMeans(x, na.rm = TRUE), "*") - 
        (1/2) * outer(1 - colMeans(x, na.rm = TRUE), colMeans(x, 
            na.rm = TRUE), "*") + 1/4
    diag(allelic.covariance) <- 0.25
    return(allelic.covariance)
}
<bytecode: 0x7fd84936f580>
<environment: namespace:conStruct>

If your output looks different, just reinstall from the git repo as referenced in my previous reply. Please let me know if you have further questions on this issue, or if this is resolved. Thanks!

whoops; missed that. =)

BTW, a way around this would be to insert an optional call to Matrix::nearPD() into calc.covariance(). I'm a bit loath to do so because it might hide major problems in the data (e.g., some samples being 99% missing data), but if data really are missing at random then it should do the right thing. To check, I'd want to (a) make sure there isn't strong covariance between missingness and variables of interest, and (b) make sure that there's enough data being used everywhere (e.g., report the minimum number of overlapping loci between pairs of populations). Then maybe a call to conStruct(..., make.positive.definite=TRUE) would turn on this behavior.
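For concreteness, a sketch of those two checks plus the proposed projection. Note that make.positive.definite is only a proposal here, not an existing conStruct argument, so this just shows the underlying calls (Matrix::nearPD() from the Matrix package, and the same freqs/coords objects as above):

	library(Matrix)

	# per-sample missingness, as above
	miss <- rowMeans(is.na(freqs))

	# (a) is missingness structured? e.g., correlated with sampling location
	#     (assuming coords holds numeric lon/lat columns)
	cor(miss, coords)

	# (b) is there enough pairwise data everywhere? count loci genotyped in
	#     both members of each sample pair, and look at the smallest overlap
	obs <- !is.na(freqs)
	overlap <- obs %*% t(obs)
	min(overlap[upper.tri(overlap)])

	# the proposed optional step: project the sample allelic covariance onto
	# the nearest positive definite matrix (not current conStruct behavior)
	allelic.cov <- conStruct:::calc.covariance(freqs)
	allelic.cov.pd <- as.matrix(Matrix::nearPD(allelic.cov)$mat)
	min(eigen(allelic.cov.pd)$values)  # should now be positive (up to tolerance)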

It worked just fine; it's running now. Thanks a lot for your help.

@petrelharp yeah, I'd be nervous about it because I think this problem arises most often when data aren't missing at random (e.g., because of allelic dropout). But positive-definiteness issues are also the thing I get the most emails about, so maybe the right move is a vignette on what to do when the sample allelic covariance isn't positive definite, including strategies for diagnosing problems and possible solutions.