tra38/ZombieWriter

Program crashes for larger quantities of articles

Opened this issue · 6 comments

I've been using ZombieWriter and finding that it hits the same crash in Classifier-Reborn when I have a larger quantity of rows in the CSV file:

Jacks-MacBook-Pro:Projects johncambou$ ruby review-generator.rb
/Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/content_node.rb:30:in `transposed_search_vector': undefined method `col' for nil:NilClass (NoMethodError)
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:190:in `block in proximity_array_for_content'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:188:in `collect'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:188:in `proximity_array_for_content'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `block in highest_relative_content'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `each_key'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:166:in `highest_relative_content'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/summarizer.rb:29:in `perform_lsi'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi/summarizer.rb:10:in `summary'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:21:in `header'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:69:in `block in generate_articles'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:57:in `map'
	from /Users/johncambou/.rbenv/versions/2.4.2/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:57:in `generate_articles'
	from review-generator.rb:12:in

What's really strange to me is that this only happens for larger quantities of articles. When I have only ~40 or fewer rows in the CSV, it runs fine, but as I get to ~50+, the program will always hit the crash.

What's even stranger is that this doesn't seem to be consistent - sometimes it will crash at only 35 CSV lines, or sometimes it runs successfully at 56. Sometimes it will crash on the exact same CSV file that it was correctly processing earlier.

I've very meticulously tested if this is being caused by the specific content of my articles, but the program runs fine for any subset of my articles - it only crashes when I get above this certain general limit in quantity.

At this point I have tried:

  • Ensuring that every line has 2 sentences
  • Giving each line only the content, and also giving it the full sourcetext and sourceURL
  • Swapping out different article content

I'm completely lost. Ideally I'd like to run the program with 300+ paragraphs, so that I can really get crazy with the output, but it's disappointing to be capped at so few. If you have any suggestions on how to fix this it'd be greatly appreciated.

tra38 commented

Sorry for not seeing your comment earlier. Basically, lsi.rb is really buggy and slow, and nobody should really be using it. Instead, you'll need to install GSL and rb-gsl on your machine so that classifier-reborn doesn't use lsi.rb and instead uses the GSL library to perform the matrix multiplication that the machine learning requires. For more information about this issue, please read my comment on this topic.

Ideally I'd like to run the program with 300+ paragraphs

This sounds super-cool. Keep me posted on what happens next. I'm curious whether the output of ZombieWriter improves significantly when given enough paragraphs.

EDIT: That being said, the stack trace also seems to mention summarizer.rb (which I use for generating the title of the 'clusters'), so it's possible that there might be another problem here. If you're still getting this error after installing GSL and rb-gsl, I would appreciate it if you could give me the corpus so that I can try to replicate the issue myself and figure out what I can do to fix it.

tra38 commented

Er, looking at the error trace again, I don't think the problem is with matrix multiplication (since that's the buggy/slow part of lsi.rb that GSL/rb-gsl moves you away from). I think it's really a more generalized form of this issue ("Program crashes when attempting to provide a title for a one-sentence article") (with me reporting this issue to classifier-reborn here), so I'm definitely interested in seeing how this error occurs and getting it resolved. I would appreciate it if you can give me a sample corpus so that I can replicate the error, and see what I can do to handle it.

In case I can't handle it, I could either try to make 'titles' for article-clusters optional...or fall back to an empty title if one can't be generated.

That would be great!
I've attached a Zip file of the 2 CSV files I created for this. In the first version, the majority of the content rows have only a single sentence. In the second version, I added a second sentence ("Test.") to every line. Both of these fail in exactly the same way for me.

TestArticles.zip

Just a warning, the content in these has some explicit words - currently, my use case is to have the program write its own reviews for videogames. So with this data set, I grabbed about 300 lines of text from real user reviews of the Call of Duty videogame, which happen to be mostly negative. Shouldn't be relevant to the code, but just wanted to explain why the content itself might seem so weird or poorly written :)

Let me know if there's any other info that would help you!

tra38 commented

Working on the issue right now. As a side-note, I am using a new version of Ruby and forgot to install rb-gsl, so the program thought I didn't have GSL installed. So, I got this separate error with your dataset.

/Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/2.4.0/matrix.rb:2088:in `normalize': Zero vectors can not be normalized (Vector::ZeroVectorError)
	from /Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:147:in `block in build_index'
	from /Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:145:in `times'
	from /Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:145:in `build_index'
	from /Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/classifier-reborn-2.2.0/lib/classifier-reborn/lsi.rb:77:in `add_item'
	from /Users/tariqali/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/zombie_writer-0.2.0/lib/zombie_writer.rb:50:in `add_string'
	from zombie.rb:12:in `block in <main>'
	from zombie.rb:11:in `each'
	from zombie.rb:11:in `<main>'

Installing rb-gsl resolved this issue and allowed me to reproduce the issue you're facing (though it also suggests to me that I really do need to make rb-gsl mandatory).
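A quick way to see which situation a given Ruby install is in is to try loading the native library directly. This is only a minimal sketch: it assumes rb-gsl is loaded with require "gsl", and gsl_available? is a hypothetical helper name that is not part of ZombieWriter or Classifier-Reborn.

```ruby
# Hypothetical helper: reports whether the GSL bindings can be loaded
# in the current Ruby installation.
# Assumption: the rb-gsl gem is loaded via `require "gsl"`.
def gsl_available?
  require "gsl"
  true
rescue LoadError
  # The gem (or the underlying GSL library) is not installed for this Ruby.
  false
end

puts gsl_available? ? "GSL bindings found" : "GSL bindings missing"
```

Running this under each rbenv version makes it easier to spot a freshly installed Ruby that is missing the gem.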


After reading the docs of ClassifierReborn::Summarizer, I think I know what might be causing most of the two-or-more-sentence summarizations to fail. I'll demonstrate with the following article.

FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered

I am not a fan of the Battlefield series but it blows this COD away.

First, Classifier-Reborn takes that article and splits it into sentences using this regex: /(\.|\!|\?)/ (basically, it splits the text wherever one of these punctuation marks appears). That turns it into an array containing three "sentences":

["FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\nI am not a fan of the Battlefield series but it blows this COD away", ".", "\n"]
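The splitting step can be tried in plain Ruby without Classifier-Reborn installed. This is a minimal sketch that assumes the summarizer does something equivalent to String#split with the regex quoted above; split_into_chunks and the constant names are hypothetical, and the trailing newline on the article is assumed.

```ruby
# Regex quoted above. The capture group makes String#split keep the
# punctuation marks as separate elements of the result.
SENTENCE_BOUNDARY = /(\.|\!|\?)/

# Hypothetical helper name, for illustration only.
def split_into_chunks(text)
  text.split(SENTENCE_BOUNDARY)
end

# The example article; the trailing newline is an assumption.
ARTICLE = "FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\n" \
          "I am not a fan of the Battlefield series but it blows this COD away.\n"

chunks = split_into_chunks(ARTICLE)
puts chunks.size  # 3
p chunks
```

The first line of the article has no ".", "!" or "?", so nothing separates it from the second line.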

Classifier-Reborn then creates a new LSI dedicated only to summarization (lsi = ClassifierReborn::LSI.new auto_rebuild: false) and adds sentences to that LSI based on this script (where chunks is that array above):

chunks.each { |chunk| lsi << chunk unless chunk.strip.empty? || chunk.strip.split.size == 1 }

Essentially, we are throwing away any chunks that are one-word sentences (such as ".") or that are empty once we strip away whitespace. So the only chunk that we add to the new LSI is...

"FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\nI am not a fan of the Battlefield series but it blows this COD away"

...i.e., a single sentence according to LSI.
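The same filter can be applied to the three chunks on their own to confirm that only one survives. A small sketch, where chunks_added_to_lsi is a hypothetical name and the condition is copied from the line quoted above:

```ruby
# Hypothetical helper that mirrors the filter quoted above: it returns the
# chunks that would be added to the LSI instead of adding them.
def chunks_added_to_lsi(chunks)
  chunks.reject { |chunk| chunk.strip.empty? || chunk.strip.split.size == 1 }
end

EXAMPLE_CHUNKS = [
  "FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n\n\n" \
  "I am not a fan of the Battlefield series but it blows this COD away",
  ".",
  "\n"
].freeze

kept = chunks_added_to_lsi(EXAMPLE_CHUNKS)
puts kept.size  # 1
```

"." is rejected because it is a single token, and "\n" is rejected because it is empty after strip.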

After Classifier-Reborn runs lsi.build_index, the LSI is as follows:

=> #<ClassifierReborn::LSI:0x007fed63338760
 @auto_rebuild=false,
 @built_at_version=-1,
 @cache_node_vectors=nil,
 @items=
  {"FWIW my top COD multiplayers: MW2 Ghost (all the hate ruined the series IMO) MW3 MW Remastered\n" +
   "\n" +
   "\n" +
   "I am not a fan of the Battlefield series but it blows this COD away"=>
    #<ClassifierReborn::ContentNode:0x007fed630b16e0
     @categories=[],
     @lsi_norm=nil,
     @lsi_vector=nil,
     @word_hash=
      {:fwiw=>1,
       :top=>1,
       :cod=>2,
       :multiplay=>1,
       :mw2=>1,
       :ghost=>1,
       :hate=>1,
       :ruin=>1,
       :seri=>2,
       :imo=>1,
       :mw3=>1,
       :remast=>1,
       :fan=>1,
       :battlefield=>1,
       :blow=>1,
       :awai=>1}>},
 @language="en",
 @version=1,
 @word_list=#<ClassifierReborn::WordList:0x007fed63338738 @location_table={}>>

Since there is only one sentence, it is impossible for the LSI to summarize the content, and so an error is thrown. The @lsi_vector=nil in the dump above is consistent with the "undefined method `col' for nil:NilClass" error in your stack trace.

This, by the way, is why you were unable to prevent this error from occurring by adding the sentence "Test." to the end of every comment. Classifier-Reborn is programmed to reject single-word sentences when generating summaries, so it threw away that sentence. When I replaced all instances of "Test." with "Test Sentence." in the second CSV file, I was able to generate articles and summaries without issues (though a lot of the summaries were simply "Test Sentence").
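To see the difference between the two padding sentences, the split and the filter can be combined into one count of usable chunks. This is a sketch under the assumptions described above, not Classifier-Reborn's own API; usable_chunk_count and the constant name are hypothetical.

```ruby
# Same regex and filter as described above, combined into one hypothetical helper.
CHUNK_BOUNDARY = /(\.|\!|\?)/

def usable_chunk_count(text)
  text.split(CHUNK_BOUNDARY).count do |chunk|
    stripped = chunk.strip
    # Keep only chunks that are non-empty and have more than one word.
    !stripped.empty? && stripped.split.size > 1
  end
end

base = "I am not a fan of the Battlefield series but it blows this COD away."

puts usable_chunk_count("#{base} Test.")           # 1, "Test" is a one-word chunk
puts usable_chunk_count("#{base} Test Sentence.")  # 2, "Test Sentence" has two words
```

A row whose count is below 2 is a candidate for the crash, so a check like this could be used to screen CSV rows before handing them to ZombieWriter.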

(Note: Also, as I began removing problematic clusters from your original CSV, I still sometimes saw articles with only one legitimate sentence (such as "Multiplayer/pvp hmmm.\n"), which of course causes summarization to fail. So merely handling the case where summarization fails even though you have two or more sentences won't really help in the long term. We need a more general solution to this problem, like adding "Test Sentence.".)

Since the root cause of this issue is headline generation, I'm tempted to either find a better way of summarizing articles or just downplay/remove that feature...since I'm not sure whether the headlines actually add anything to the article.

For now though, a good, general hotfix would be to generate an empty title in case an error is thrown by ClassifierReborn. I'll work on doing that right now.
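The general shape of such a fallback is a rescue around the title call. The sketch below is illustrative only and is not the code that ships in ZombieWriter: title_or_empty is a hypothetical helper, and the summarizer is passed in as any callable so the example runs without Classifier-Reborn.

```ruby
# Hypothetical helper: returns the generated title, or an empty string if the
# summarizer raises. NoMethodError and Vector::ZeroVectorError both descend
# from StandardError, so both errors seen in this thread would be caught.
def title_or_empty(text, summarizer)
  summarizer.call(text)
rescue StandardError
  ""
end

# Stand-ins for the real summarizer (assumptions, for demonstration only).
working = ->(text) { text.split(".").first }
failing = ->(_text) { nil.col }  # raises NoMethodError, as in the stack trace

p title_or_empty("Great maps. Bad servers.", working)  # "Great maps"
p title_or_empty("Multiplayer/pvp hmmm.", failing)     # ""
```

Rescuing StandardError rather than Exception leaves interrupts and out-of-memory errors alone, which seems like the safer default for a hotfix of this kind.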

tra38 commented

Just pushed up ZombieWriter version 0.3.0 to rubygems.org. Now, if we encounter an error with Classifier-Reborn when generating titles, we generate an empty title instead. Let me know if this fixes the problem.

Just tried with the updated version, and it all works now, including the 1 sentence versions. Been able to get some awesome output as a result, excited to try it with other data sets.

Definitely appreciate the fast response and update!