louismullie/treat

Strange results from #to_s on parsed sentences

ojak opened this issue · 4 comments

ojak commented

I'm seeing strange results from #to_s on parsed sentences. For example:

> text = "Let's see, what do commas do?"

# A tokenized sentence works as expected
> Treat::Entities::Sentence.new(text).apply(:tokenize).to_s
=> "Let's see, what do commas do?"

# A tokenized + parsed sentence goes nuts
> Treat::Entities::Sentence.new(text).apply(:tokenize, :parse).to_s
=> "Let's see, (WHNP (WP what))"

Any ideas what's happening here?

ojak commented

BTW, this does not occur with all sentences. This works fine:

> Treat::Entities::Sentence.new("Just another boring sentence.").apply(:tokenize, :parse).to_s
=> "Just another boring sentence."

ojak commented

Here's a bit more information. It does not seem to be related to the comma:

# Removed the comma
> Treat::Entities::Sentence.new("Let's see what do commas do?").apply(:tokenize, :parse)
=> Sentence (70279355141920)  --- "Let's see (WHNP (WP what))"  ---  {:tag_set=>:penn}   --- []

For some reason, part of the sentence is being detected as a Symbol:

> Treat::Entities::Sentence.new("Let's see, why do commas do?").apply(:tokenize, :parse).print_tree
+ Sentence (70279337517780)  --- "Let's see, (WHADVP (WRB why))"  ---  {:tag_set=>:penn}   --- []
|
+--+ Phrase (70279338096540)  --- "Let's"  ---  {:tag=>"NP"}   --- []
   |
   +--> Word (70279342692080)  --- "Let"  ---  {:tag=>"NNP"}   --- []
   +--> Enclitic (70279343099460)  --- "'s"  ---  {:tag=>"POS"}   --- []
+--+ Phrase (70279352231280)  --- "see, (WHADVP (WRB why))"  ---  {:tag=>"VP"}   --- []
   |
   +--> Word (70279352831460)  --- "see"  ---  {:tag=>"VBP"}   --- []
   +--> Punctuation (70279353160900)  --- ","  ---  {:tag=>","}   --- []
   +--> Symbol (70279353516500)  --- "(WHADVP (WRB why))"  ---  {:tag=>"SBARQ"}   --- []
=> nil
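
A dump like the one above can also be scanned mechanically for leaf nodes that still carry bracketed parser output. The following is only a rough sketch: it works on the printed text, not on Treat's objects, and the names `NODE_LINE` and `suspicious_nodes` are made up for illustration. It assumes the leaf-line layout matches the `+-->` lines shown above.

```ruby
# Hypothetical pattern for one leaf line of the print_tree dump, e.g.
#   +--> Symbol (70279353516500)  --- "(WHADVP (WRB why))"  ---  {:tag=>"SBARQ"}   --- []
NODE_LINE = /\+--> (\w+) \(\d+\)\s+--- "(.*?)"\s+---/

# Return [type, value] pairs for leaf nodes whose value still looks like
# Penn-style bracketed output, such as "(WHADVP (WRB why))".
def suspicious_nodes(dump)
  dump.each_line.map do |line|
    m = NODE_LINE.match(line)
    next unless m                      # skip non-leaf and separator lines
    type, value = m[1], m[2]
    [type, value] if value.match?(/\([A-Z$]+ /)
  end.compact
end
```

Fed the tree printed above, this should single out the `Symbol` node and leave the `Word`, `Enclitic`, and `Punctuation` leaves alone.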

ojak commented

Perhaps something to do with punctuation? For example:

# With a trailing question mark, only the letters `a`, `e`, `i`, and `u` work; every other letter fails
> 'a'.upto('z').each { |letter| puts Treat::Entities::Sentence.new("Lets see what do commas d"+letter+"?").apply(:tokenize, :parse).to_s }
Lets see what do commas da?
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see what do commas de?
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see what do commas di?
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see what do commas du?
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))
Lets see (WHNP (WP what))

# Without a question mark at the end, these all work
> 'a'.upto('z').each { |letter| puts Treat::Entities::Sentence.new("Lets see what do commas d"+letter).apply(:tokenize, :parse).to_s }
Lets see what do commas da
Lets see what do commas db
... all other iterations work ...
Lets see what do commas dz
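
Rather than eyeballing 26 lines of output, the letters could be split into passing and failing groups automatically. This is a minimal sketch under a few assumptions: `LEAKED_PARSE`, `leaked_parse?`, and `partition_letters` are hypothetical helpers, not part of Treat, and "failure" is taken to mean that bracketed markup such as `(WHNP (WP what))` shows up in the rendered string.

```ruby
# Hypothetical pattern for leaked Penn-style output such as "(WHNP (WP what))".
LEAKED_PARSE = /\([A-Z$]+ \([A-Z$]+ [^()]+\)\)/

# True when the rendered sentence still contains bracketed parser output.
def leaked_parse?(str)
  LEAKED_PARSE.match?(str)
end

# Split letters into [clean, leaked]. The block receives a letter and is
# expected to return the rendered sentence for it.
def partition_letters(letters)
  letters.partition { |letter| !leaked_parse?(yield(letter)) }
end

# Usage with the calls from the transcript above (needs the treat gem loaded):
# clean, leaked = partition_letters('a'..'z') do |letter|
#   Treat::Entities::Sentence.new("Lets see what do commas d#{letter}?")
#     .apply(:tokenize, :parse).to_s
# end
```

Passing the renderer as a block keeps the helper independent of Treat, so the same check can be reused for the variant without the question mark.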

Maybe a dynamic send or a method name ending in a particular way is mucking things up somewhere in the parse trace, or perhaps it's a bug in the underlying libraries? Removing the trailing punctuation does not always help either:

> Treat::Entities::Sentence.new("Let's see, what do commas do").apply(:tokenize, :parse).to_s
=> "Let's see, (WHNP (WP what))"

I'm a bit stumped on this one.

ojak commented

The PR fixes the issue.