KMCS-NII/planetext

Annotations should not contain plain text for some objects

Closed this issue · 0 comments

By default, planetext writes the plain text of annotations to the .ann files. For example, if we define MathML elements as objects for replacement, the corresponding annotation in the .ann file consists only of the UTF-8 characters found in the MathML data, e.g.

#3	AnnotatorNotes T3	α 𝛼 italic-alpha \alpha

rather than the MathML data itself.

For now I have worked around this by manually adding an if/else branch in

planetext/lib/extract.rb

Lines 240 to 244 in d1247bb

    if ["math", "abbr", "cite", "ul"].include? name
      # keep the full markup for these elements
      text = node.to_s.gsub(/\n/, ' ')
    else
      # everything else is reduced to its plain text
      text = node.text.gsub(/\n/, ' ')
    end

It would be better to handle this with an extra option in the config YAML instead of a hard-coded list, for example :objectescapexml: or something like that.
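
To make the proposal concrete, here is a minimal sketch of how such an option might drive the branch above. It is not planetext's actual code: the list-valued shape of :objectescapexml:, the helper names `load_raw_xml_names` and `annotation_text`, and the `FakeNode` stand-in (used so the example runs without an XML parser) are all assumptions for illustration.

```ruby
require 'yaml'

# Hypothetical config fragment. The key name comes from the suggestion
# above; making it a list of element names is an assumption.
CONFIG_YAML = <<~YAML
  :objectescapexml:
    - math
    - abbr
    - cite
    - ul
YAML

# Stand-in for a parsed XML node: it only mimics the three methods the
# snippet from extract.rb relies on (name, to_s, text).
FakeNode = Struct.new(:name, :markup, :text) do
  def to_s
    markup
  end
end

# Read the configured element names; a missing key yields an empty list,
# which means every annotation falls back to plain text.
def load_raw_xml_names(yaml_source)
  config = YAML.load(yaml_source) || {}
  Array(config[:objectescapexml]).map(&:to_s)
end

# Same logic as the manual if/else, with the hard-coded list replaced
# by the names taken from the config.
def annotation_text(node, raw_xml_names)
  source = raw_xml_names.include?(node.name) ? node.to_s : node.text
  source.gsub(/\n/, ' ')
end

names = load_raw_xml_names(CONFIG_YAML)
math  = FakeNode.new('math', "<math>\n<mi>x</mi>\n</math>", "\nx\n")
para  = FakeNode.new('p', "<p>hello\nworld</p>", "hello\nworld")

puts annotation_text(math, names)  # markup is kept
puts annotation_text(para, names)  # reduced to plain text
```

With an empty or absent :objectescapexml: entry the behaviour would stay as it is today, so the option would be backwards compatible.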