DocumentTitleMatchClassifier should include the « and • characters
GoogleCodeExporter opened this issue · 0 comments
I have run across a few news articles that use these characters.
The following articles use the « character (\u00AB):
http://philadelphia.cbslocal.com/2012/02/06/report-1-in-5-children-exposed-to-secondhand-smoke-in-cars/
http://blog.mediaglobal.org/?p=448
I haven't seen many pages like these, but in the ones I have seen, the first
part is always the title. It might be safe to assume that parts[0] is the
title after performing the split.
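
To illustrate the idea, here is a minimal standalone sketch of splitting a page title on these two characters and keeping the first part. It is not the classifier's actual code: the class name TitleSplitSketch, the method firstPart, the separator pattern, and the sample titles are all made up for this example, and it simply applies the assumption above that parts[0] is the title.

```java
import java.util.regex.Pattern;

public class TitleSplitSketch {
    // Hypothetical separator set: the two characters from this report,
    // each with optional surrounding whitespace.
    private static final Pattern SEPARATORS =
            Pattern.compile("\\s*[\u00AB\u2022]\\s*");

    /** Returns the text before the first separator, or the trimmed input if none occurs. */
    public static String firstPart(String title) {
        String[] parts = SEPARATORS.split(title.trim());
        // Assumption from the report: parts[0] is the real title.
        // split() can return an empty array (e.g. the input is only a separator).
        return parts.length > 0 ? parts[0] : "";
    }

    public static void main(String[] args) {
        // Made-up sample titles, not taken from the linked articles.
        System.out.println(firstPart("Some Article Title \u00AB Example Site"));
        System.out.println(firstPart("Another Headline \u2022 Example News"));
    }
}
```

A title without either character comes back unchanged (apart from trimming), so this would not affect pages that use other separators.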
The following article uses the • character (\u2022):
http://ictsd.org/i/news/biores/128000/
Original issue reported on code.google.com by tucker...@gmail.com
on 22 Mar 2012 at 6:05