madeleineudell/LowRankModels.jl

Working on randomly generated data returns 0 element Arrays for labels and glrm.ry

Closed this issue · 6 comments

Hi again,

I was trying to debug my analysis by working with a randomly generated dataset:

A = rand(100,2)*rand(2,100)
A = convert(DataFrame, A)
glrm, labels = GLRM(A, 2)

However the corresponding labels and glrm.ry objects are empty:

julia> labels
0-element Array{Symbol,1}

julia> glrm.ry
0-element Array{Regularizer,1}

It would be great if you could help me understand why this is happening.

Also, could you suggest a base dataset that could be used to explore the different aspects of the LowRankModels code, like the loss and regularizer options? I tried the "psych" dataset referred to in the README as well as this randomly generated matrix, and ran into issues with both.

Thanks a lot,
Nandana

The reason this behavior is happening is that none of the columns was recognized by the GLRM(::DataFrame, ::Int) method as having a valid data type. I've fixed that, so the above code should now work just fine for generating a sample data set.
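For reference, a quick check of the expected behavior after the fix (using the same DataFrames conversion as above; with all-numeric columns, every column should be recognized):

using LowRankModels, DataFrames

A = rand(100,2)*rand(2,100)    # random rank-2 matrix
df = convert(DataFrame, A)     # all columns are Float64
glrm, labels = GLRM(df, 2)

length(labels)    # expect 100: one label per recognized column
length(glrm.ry)   # expect 100: one regularizer per recognized column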

In general it might be easier to experiment with different losses and regularizers using the standard GLRM call rather than the DataFrame-specific one, as in the README example:

using LowRankModels
m,n,k = 100,100,5
A = randn(m,k)*randn(k,n) # random rank-k data matrix
losses = QuadLoss() # minimize squared distance to cluster centroids
rx = UnitOneSparseConstraint() # each row is assigned to exactly one cluster
ry = ZeroReg() # no regularization on the cluster centroids
glrm = GLRM(A,losses,rx,ry,k)

And using a randomly generated low rank matrix like A = randn(100,3)*randn(3,100) should give the right intuition.
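For completeness, here is a minimal sketch of actually fitting that model: fit! returns the factors X and Y along with a convergence history, and X'Y is the fitted low rank approximation.

using LowRankModels

m,n,k = 100,100,5
A = randn(m,k)*randn(k,n)   # random rank-k data

glrm = GLRM(A, QuadLoss(), UnitOneSparseConstraint(), ZeroReg(), k)
X,Y,ch = fit!(glrm)         # alternating minimization; ch records convergence information

A_hat = X'*Y                # the fitted low rank approximation of A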

Thanks, but I think I only explained part of the problem.

GLRM(A, losses, rx, ry, k) runs irrespective of whether A is a random Array or DataFrame. However the "quick" version GLRM(A, k) runs only with a DataFrame. Running it using an Array gives the error:

ERROR: GLRM{L<:Loss,R<:Regularizer} has no method matching GLRM{L<:Loss,R<:Regularizer}(::Array{Float64,2}, ::Int64)

which is why I converted A to a DataFrame before running GLRM in the original example.

I think it's reasonable for GLRM(A::Array, k::Int) to raise an error. It's important for users to think about what kind of model they're fitting. If you just want a model quickly, you can use one of the simple GLRMs in simple_glrms.jl, like pca(A, k) or nnmf(A, k).
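For example (both constructors live in simple_glrms.jl and are fit with fit! like any other GLRM):

using LowRankModels

A = randn(100,3)*randn(3,100)   # random rank-3 matrix

glrm_pca = pca(A, 3)            # quadratic loss, no regularization
X,Y,ch = fit!(glrm_pca)

B = rand(100,3)*rand(3,100)     # nonnegative data
glrm_nnmf = nnmf(B, 3)          # nonnegative matrix factorization
X2,Y2,ch2 = fit!(glrm_nnmf)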

OK, but in a large dataset with many columns it's useful to have GLRM set a reasonable loss function that depends on the column type. I can then use the glrm object as a base and modify the model based on my own assessment.

In other words, I might want the loss functions to be set automatically by GLRM and then change the regularizers according to my needs -- and the GLRM(A,k) functionality lets me do that. (That's the process I've been following throughout my analysis.)
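Concretely, that workflow looks something like this (a sketch: I'm assuming QuadReg as the quadratic regularizer and the glrm field layout shown above):

using LowRankModels, DataFrames

df = convert(DataFrame, rand(100,2)*rand(2,100))
glrm, labels = GLRM(df, 2)   # losses picked automatically from the column types

# keep the automatically chosen losses, but use my own regularization on Y
glrm.ry = Regularizer[QuadReg(0.1) for i in 1:length(glrm.ry)]

X,Y,ch = fit!(glrm)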

Certainly, and I'm glad you've found the short version useful!

But the qualifier "that depends on the column type" applies to DataFrames, not to Arrays, which have a single element type for all entries. That makes it much harder for LowRankModels to automatically infer appropriate information about the various columns if they do differ, since there's no type information that distinguishes one column from another.
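To illustrate, here is a sketch with a mixed-type table (assuming the DataFrame method recognizes Boolean and integer columns as it does real-valued ones; exactly which loss gets attached to each type depends on the package version):

using LowRankModels, DataFrames

# a DataFrame keeps a type per column, so GLRM(df, k) can pick a loss per column
df = DataFrame(income   = randn(100),        # real-valued column
               employed = rand(Bool, 100),   # Boolean column
               nkids    = rand(0:4, 100))    # small-integer column
glrm, labels = GLRM(df, 2)   # per-column losses inferred from the column types

# a plain Float64 Array of the same values has a single element type,
# so there is no per-column information left for GLRM to use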

Aah, got it -- that totally makes sense (thinking in types and formats is something I'm having to do much more as a Julia user, and that's a good thing!). Thanks for the clarification.