kose-y/SparseKmeansFeatureRanking.jl

Float64 Bug in SKFR code

Closed this issue · 2 comments

I have found an issue in the code when running the SKFR algorithm on the hapmap3 data. I will detail it below.

Here is where I downloaded the hapmap3 data from:
http://dalexander.github.io/admixture/download.html

First, when I ran the first 37 lines of code in the runtests.jl file, I ran into no problems. The algorithm worked as expected and successfully identified the most informative features in the sample data created by those lines of code.

Here is the code that I executed in a Jupyter notebook, and the results:

The following code is from the runtests.jl file in SKFR.jl/test
```julia
using Random, Statistics  # Random.seed! and mean
# initclass and ref_sparsekmeans1 come from the reference code under SKFR.jl/test/ref

Random.seed!(16962)
(features, cases) = (100, 300);
(classes, sparsity) = (3, 33);
X = randn(features, cases);

# Shift the first m features across the case groups so they become the informative features
(m, n) = (div(features, 3), 2 * div(features, 3));
(r, s) = (div(cases, 3) + 1, 2 * div(cases, 3));
X[1:m, r:s] = X[1:m, r:s] .+ 1.0;
X[1:m, s + 1:end] = X[1:m, s + 1:end] .+ 2.0;

Random.seed!(16962)
class = initclass(X, classes);
X1 = copy(X);
X2 = copy(X);

class1 = copy(class);
class2 = copy(class);

(classout1_, center1_, selectedvec1_, WSSval1_, TSSval1_) = ref_sparsekmeans1(X1, class1, classes, sparsity);
println("\nChosen Features: ", selectedvec1_, '\n')
println("Class Assignments: ", classout1_, '\n')
println("Mean Class Assignment: ", mean(classout1_), " (Samples are assigned to classes as expected)\n")
println("Center for First Chosen Feature: ", center1_[selectedvec1_[1], :], " (No NaN or 0.0 values)\n")
```

Here is the image of the results from the code above:

[Image: Successful SKFR run]

When I tried executing the same code on the hapmap3 data set, I ran into a bug. I first imported the hapmap3.bed file and converted it to `Float64` (I have not had problems executing SKFR on `Float64` data before). I then transposed the data so that it is features by cases, following the pattern on line 18 of the runtests.jl file. Also, since the data contains `NaN` values, I set `impute=true` to replace them. When I executed SKFR on this data, it produced uniform class assignments: every sample was assigned to class 1.0. In addition, the rows of `center1` corresponding to the selected features contain `NaN` and 0.0 values. The code and results are below:

The following code is adapted from the runtests.jl file in SKFR.jl/test, with the hapmap3 data substituted in:
```julia
using Random, Statistics
using SnpArrays  # SnpArray and its convert method
# initclass and ref_sparsekmeans1 come from the reference code under SKFR.jl/test/ref

# Extract the hapmap3 data, impute missing genotypes, and transpose to features x cases
X = Matrix(transpose(convert(Matrix{Float64}, SnpArray("hapmap3.bed"), impute=true)));
(features, cases) = size(X);

(classes, sparsity) = (3, 33);
Random.seed!(16962)
class = initclass(X, classes);
X1 = copy(X);
X2 = copy(X);

class1 = copy(class);
class2 = copy(class);

(classout1, center1, selectedvec1, WSSval1, TSSval1) = ref_sparsekmeans1(X1, class1, classes, sparsity);
println("\nChosen Features: ", selectedvec1, '\n')
println("Class Assignments: ", classout1, '\n')
println("Mean Class Assignment: ", mean(classout1), " (Every sample is assigned to the same class - 1.0)\n")
println("Center for First Chosen Feature: ", center1[selectedvec1[1], :], " (There are NaN values and the rest are 0.0)\n")
```

Here is the image of the results from the code above:

[Image: SKFR bug, uniform class assignments and NaN/0.0 centers]

I also noted that the SKFR algorithm works correctly on the same hapmap3 data when using the `ImputedMatrix` approach, with the following code from lines 160-166. Only with the code above do I run into the bug described.

```julia
using Random
using SnpArrays  # SnpArray, convert, ADDITIVE_MODEL
using SKFR

X = SnpArray("Julia_Files/hapmap3.bed")
Xtrue = convert(Matrix{Float64}, X, model=ADDITIVE_MODEL, center=false, scale=false)
Random.seed!(16962)
ISM = SKFR.ImputedSnpMatrix{Float64}(X, 3)
Random.seed!(16962)
IM = SKFR.ImputedMatrix{Float64}(Xtrue, 3)
(classout1, center1, selectedvec1, WSSval1, TSSval1) = SKFR.sparsekmeans1(IM, 30);
```

Thank you for reporting the issue. The code at the bottom is the intended way of using this package. You should not use the `impute=true` keyword argument when converting a SnpArray to a `Float64` matrix in this project, because that fills in the `NaN` values with the column means; this package uses a different imputation algorithm. I will investigate what is happening with the second code block.
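
For illustration, here is a minimal sketch (not part of the package) contrasting the two conversion paths; it only uses the SnpArrays.jl `convert` keywords that already appear in this thread and counts the missing entries each path leaves behind:

```julia
using SnpArrays

s = SnpArray("hapmap3.bed")

# With impute=true, convert fills missing genotypes (column means),
# so nothing is left for the package's own imputation to handle.
Xmean = convert(Matrix{Float64}, s, impute=true)

# Without impute=true, missing genotypes stay as NaN and can be handled
# by SKFR.ImputedMatrix / SKFR.ImputedSnpMatrix instead.
Xraw = convert(Matrix{Float64}, s)

println("NaN entries with impute=true:    ", count(isnan, Xmean))  # expected: 0
println("NaN entries without impute=true: ", count(isnan, Xraw))   # > 0 for hapmap3
```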

Actually, the second code block only uses the reference code included for the tests (test/ref). Code in that directory is not considered part of the main package; it is merely there to check that the code in the main package works correctly. Specifically, that reference code is intended for data with no missing values. Getting it to run on data with missing values might be meaningful as a learning exercise, but it does not need to run for our project.
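
To make that constraint concrete, here is a hypothetical guard (not something in the package) that could sit in front of the reference routines so the no-missing-values assumption is checked explicitly:

```julia
# Hypothetical pre-check: the test/ref routines assume a complete matrix,
# while the main package handles missing values through SKFR.ImputedMatrix.
if any(isnan, X1)
    error("ref_sparsekmeans1 expects data without missing values; " *
          "use SKFR.ImputedMatrix with SKFR.sparsekmeans1 instead")
end
```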