cardmagic/classifier

LSI classifier raises exception on some input strings

wstrinz opened this issue · 4 comments

I've noticed the LSI classifier fails on certain input strings. It appears to be related to repeating words, but I haven't figured out the exact pattern. As an example:

require 'classifier'
lsi = Classifier::LSI.new
strings = [
  ["This text deals with dogs. dogs.", "dog"],
  ["This text involves dogs too. Dogs! ", "dog"],
  ["Remind me to get milk on monday", "reminder"]
]

strings.each { |text, category| lsi.add_item text, category }

puts lsi.classify "Remind me to Remind"

results in

/Users/willstrinz/.rvm/gems/ruby-2.0.0-p353/gems/classifier-1.3.4/lib/classifier/lsi.rb:190:in `sort_by': comparison of Float with Float failed (ArgumentError)
    from /Users/willstrinz/.rvm/gems/ruby-2.0.0-p353/gems/classifier-1.3.4/lib/classifier/lsi.rb:190:in `proximity_array_for_content'
    from /Users/willstrinz/.rvm/gems/ruby-2.0.0-p353/gems/classifier-1.3.4/lib/classifier/lsi.rb:255:in `classify'
    from lsi.rb:11:in `<main>'

but if I change the repeated "Remind", it works fine:

puts lsi.classify "Remind me to zzRemind"
 #=> remind

It looks like vectorization is somehow producing NaNs during ContentNode creation, which is the direct source of the error, but I haven't been able to track down what's causing that yet.
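For anyone following along, here is a minimal standalone sketch of why a NaN score leads to this exact error: NaN is unordered, so `<=>` returns nil and `sort_by` gives up. The `rank_by_score` helper and the sample scores are made up for illustration; this is not the gem's code.

```ruby
# Hypothetical helper: sorts [label, score] pairs by descending score.
def rank_by_score(pairs)
  pairs.sort_by { |_, score| -score }
end

# NaN cannot be ordered against any Float, so <=> returns nil.
p(Float::NAN <=> 1.0)

begin
  rank_by_score([["dog", 0.9], ["reminder", Float::NAN], ["other", 0.1]])
rescue ArgumentError => e
  # Sorting needs a result from every comparison, so a single NaN is enough to fail.
  puts "#{e.class}: #{e.message}"
end
```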

Update: I tracked this back to weighted_total in content_node.rb#raw_vector_with being 0. In my case, this happens when the vectorized value of one word equals total_words and every other entry in the vec array is 0. This causes a divide by zero in Math.log( val + 1 ) / -weighted_total. I'm not quite sure how to go about fixing it, but I'll keep looking.
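The arithmetic can be reproduced outside the gem. This is a rough sketch that assumes weighted_total is accumulated as the sum of (term / total_words) * log(term / total_words) over the nonzero entries; `entropy_weighted` is a hypothetical name, not a method in the library.

```ruby
# Hypothetical stand-in for the weighting step; assumes the accumulation described above.
def entropy_weighted(vec)
  total_words = vec.sum.to_f
  weighted_total = 0.0
  vec.each do |term|
    next unless term > 0
    ratio = term / total_words
    weighted_total += ratio * Math.log(ratio)
  end
  # Float division by zero does not raise in Ruby; it yields Infinity or NaN.
  vec.map { |val| Math.log(val + 1) / -weighted_total }
end

# One word holds the whole count: ratio is 1, log(1) is 0, so weighted_total is 0.
p entropy_weighted([2, 0, 0])

# Two distinct words: weighted_total is negative and every entry stays finite.
p entropy_weighted([1, 1, 0])
```

In the first call the zero entries become 0.0 / -0.0, which is NaN, and the nonzero entry becomes an infinity.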

It looks like this will happen if and only if all values except one in the vec array are 0.
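One possible guard, sketched under the same assumption about how weighted_total is computed. Both helper names are hypothetical, and this is not necessarily what the PR does: it simply skips the scaling when weighted_total comes out as zero.

```ruby
# Hypothetical helper: the sum that becomes the divisor.
def weighted_total(vec)
  total = vec.sum.to_f
  vec.select { |term| term > 0 }
     .sum(0.0) { |term| (term / total) * Math.log(term / total) }
end

# Hypothetical guard: fall back to the unscaled log weights when the divisor is zero.
def safe_weights(vec)
  wt = weighted_total(vec)
  return vec.map { |val| Math.log(val + 1) } if wt.zero?
  vec.map { |val| Math.log(val + 1) / -wt }
end

p weighted_total([3, 0, 0]).zero?  # exactly one nonzero entry
p weighted_total([2, 1, 0]).zero?  # more than one nonzero entry
p safe_weights([3, 0, 0])
```

Whether unscaled weights are the right fallback for classification quality is a separate question; the sketch only shows that the NaNs can be avoided at this step.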

Submitted a PR with my attempt at a fix: #19

Current code works with GSL.

Fixed in version 1.4.4