Word counting code does not account for & being a special HTML symbol
GoogleCodeExporter opened this issue · 2 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Make the method de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.isWord public.
2. In UnicodeTokenizer.java, add a static import of that method.
3. Add the following main method to UnicodeTokenizer.java:
public static void main(String[] args) {
    // Text as it appears in the page source, with apostrophes written as &rsquo;
    String html = "A few years later, in 1823, another Knickerbocker, Clement C. Moore, offered his own riff on Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly popular poem “A Visit from Saint Nicholas” introduced the slightly cloying, but instantly and sensationally popular, symbol of the season—a “chubby and plump...right jolly old elf.” (There are those who contend that an author named Henry Livingston Jr. penned the poem, but that&rsquo;s another story altogether.)";
    final String[] tokens = UnicodeTokenizer.tokenize(html);
    for (String s : tokens) {
        if (isWord(s)) {
            System.out.println("isWord: " + s);
        } else {
            System.out.println("!isWord: " + s);
        }
    }
}
What is the expected output? What do you see instead?
That HTML is from
http://www.smithsonianmag.com/arts-culture/A-Mischevious-St-Nick-from-the-American-Art-Museum.html
It uses &rsquo; for apostrophes, as in "Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s
instantly". The logic used by Boilerpipe does not account for that, and the
program above outputs:
isWord: Irving
!isWord: &
isWord: rsquo;s
isWord: version
isWord: of
isWord: St.
isWord: Nicholas.
isWord: Moore
!isWord: &
isWord: rsquo;s
isWord: instantly
which shows that "Irving's" and "Moore's" are each being split into two words
where there should be one.
Original issue reported on code.google.com by massey1...@gmail.com
on 22 Jan 2012 at 10:36
GoogleCodeExporter commented
Adding '&' to the PAT_NOT_WORD_BOUNDARY pattern of UnicodeTokenizer gives better
output. Of course, it is then not really a UnicodeTokenizer but an HtmlTokenizer.
It might be better to combine the regexp of that tokenizer class and the isWord
method into a new class, HtmlWordCounter, with a public static method that
does the word counting so that other projects can easily reuse it.
Original comment by massey1...@gmail.com
on 22 Jan 2012 at 10:42
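A rough sketch of what such an HtmlWordCounter might look like is given here. The class name comes from the comment above; the countWords method, the regular expressions, and the rule that a word must contain at least one letter or digit are assumptions made for illustration, not Boilerpipe's actual tokenizer or isWord logic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: the name is taken from the suggestion above, everything else is assumed.
final class HtmlWordCounter {
    // A token is a run of non-whitespace characters. An entity reference such as
    // &rsquo; contains no whitespace, so it stays inside the surrounding token.
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    // Named, decimal, and hexadecimal entity references, e.g. &rsquo; &#8217; &#x2019;
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    // Assumed word rule: at least one Unicode letter or decimal digit.
    private static final Pattern LETTER_OR_DIGIT = Pattern.compile("[\\p{L}\\p{Nd}]");

    static int countWords(String htmlText) {
        int count = 0;
        Matcher tokens = TOKEN.matcher(htmlText);
        while (tokens.find()) {
            // Remove entity references first so their names (e.g. "rsquo")
            // are not mistaken for word characters.
            String withoutEntities = ENTITY.matcher(tokens.group()).replaceAll("");
            if (LETTER_OR_DIGIT.matcher(withoutEntities).find()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "Irving&rsquo;s version of St. Nicholas. Moore&rsquo;s instantly";
        System.out.println(countWords(html)); // prints 7
    }
}
```

With this approach "Irving&rsquo;s" counts as one word, and a token consisting only of an entity (for example a lone &mdash;) counts as none.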
GoogleCodeExporter commented
The input to UnicodeTokenizer is Unicode text, not HTML-escaped text. If you
want to use UnicodeTokenizer, you have to prepare the input appropriately.
As you have pointed out, you want a HtmlTokenizer. Boilerpipe takes care of
HTML entity resolution via SAX parsing, so there is no need to replicate that
functionality here.
Marking as WontFix.
Original comment by ckkohl79
on 22 Jan 2012 at 10:51
- Changed state: WontFix
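For anyone calling UnicodeTokenizer directly on HTML-escaped text, "preparing the input appropriately" could mean decoding entity references to Unicode first. The following is a minimal sketch under stated assumptions: EntityDecoder and its decode method are hypothetical names, and the table covers only a handful of named entities rather than the full HTML set. It is not how Boilerpipe itself resolves entities, which happens during SAX parsing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Boilerpipe.
final class EntityDecoder {
    // Deliberately small table of named entities; a real decoder would need the full list.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("lsquo", "\u2018");
        NAMED.put("rsquo", "\u2019");
        NAMED.put("ldquo", "\u201C");
        NAMED.put("rdquo", "\u201D");
        NAMED.put("mdash", "\u2014");
    }

    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names are left exactly as written.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Converts a numeric reference to text, or returns the original reference if it is invalid.
    private static String fromCodePoint(String digits, int radix, String fallback) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // The decoded string is what would then be handed to UnicodeTokenizer.tokenize.
        System.out.println(decode("Irving&rsquo;s version of St. Nicholas."));
    }
}
```

A maintained library decoder is likely a better choice in practice; the point of the sketch is only that the tokenizer receives Unicode text rather than escaped HTML.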