thierrygosselin/grur

imputations_accuracy() lists all markers as not in common and drops all from analysis

pbpearman opened this issue · 2 comments

Hi Thierry,
I have filtered a .vcf dataset so to create a version with low missingness of sites and samples. I then run the following lines to impute genotypes and output .rad and .tped files, with and without rf imputation:

test2 <- genomic_converter(data = "UMBELLA_Erumb1_samples_gt_90pct_snps_gt_40pct_covered.recode.vcf",output="plink",filename = "test_2",strata="strata_eu_40.tsv", imputation.method = "rf")

The run finishes with the following results section:

############################### RESULTS ###############################
Data format of input: vcf.file
Biallelic data
Number of common markers: 156
Number of chromosome/contig/scaffold: 113
Number of individuals 551
Number of populations 56

I then check the imputation accuracy, but all 156 markers are dropped, evidently for not being shared:

> imp_acc <- imputations_accuracy("test_2.rad","test_2_imputed.rad")

Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 156 markers not in common between datasets
$Information dropped from the analysis
$Information dropped from the analysis$markers.dropped
[1] "Erumb1_s01095803__85__621" "Erumb1_s00047668__13__4936"
[3] "Erumb1_s00648045__68__686" "Erumb1_s01308175__88__428"
[5] "Erumb1_s02281967__111__51" "Erumb1_s02289709__112__65" .....

Misclassification Error: by populations
A tibble: 0 x 2
... with 2 variables: POP_ID , ME

Misclassification Error: by individuals
A tibble: 0 x 2
... with 2 variables: INDIVIDUALS , ME

Misclassification Error: by markers
A tibble: 0 x 2
... with 2 variables: MARKERS , ME

Misclassification Error: overall
[1] NaN

I think this is a bug. Do you? If you want, I can provide you with the datasets. Just let me know.
Best,
Peter

Hi again Thierry,
I ran the same procedure on the stickleback data from the tutorial. I imputed the missing genotypes again in genomic converter, obtaining the saved imputed and unimputed datasets. I then used them both with imputation_accuracy().

stickle <- genomic_converter("example_vcf2dadi_ferchaud_2015.vcf",
filename = "stickle1", 
strata="strata.stickleback.tsv", 
imputation.method = "rf")

I got:

#######################################################################
###################### grur: imputations_accuracy #########################
#######################################################################
Data provided still contains missing genotypes,
accuracy will be mesured on common non-missing genotypes
Removing 31790 markers not in common between datasets
Information dropped from the analysis
Information dropped from the analysis$markers.dropped
[1] "groupI__28362__23825290" "groupI__28366__24653314" "groupI__28366__24653352"
[4] "groupI__28366__24653353" "groupI__28382__25273727" "groupI__28450__28141141"
[7] "groupI__28450__28141145" "groupI__34525__6537935" "groupI__34525__6537939"
[10] "groupI__34525__6537953" "groupI__34525__6537980" "groupI__5674__12268369"
[13] "groupI__5674__12268370" "groupI__6034__17836877" "groupII__1632__17623945" ...

The original number of markers was 31802. I checked the values in the resulting object, 'stickle', and all classification errors for individuals, populations etc, were all listed as zero. I also tried using the imputed and unimputed datasets that are in the object produced by genomic_converter(). Again, about 99% of the markers were dropped. I examined the marker names of the two datasets. They are identical, as far as I can see.

Is this right? Why are almost all the markers showing as not in common among the two datasets? This is not the behavior I expected.
Best, Peter

Hi Peter, definitely a bug. This function was not tested with lots of datasets and it's in the treatment of markers names before and after imputations the problem.