wooorm/parse-latin

Mistakenly categorises :email: as SymbolNode + WordNode

Closed issue · 1 comment

Hello! Thank you for all of your work on this package.

Here is a short example, using v4.3.0:

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')
var tree = new Latin().parse("You've got \u2709\uFE0F!")
                          // "You've got ✉️!"
console.log(inspect(tree))
RootNode[1] (1:1-1:15, 0-14)
└─ ParagraphNode[1] (1:1-1:15, 0-14)
   └─ SentenceNode[7] (1:1-1:15, 0-14)
      ├─ WordNode[3] (1:1-1:7, 0-6)
      │  ├─ TextNode: "You" (1:1-1:4, 0-3)
      │  ├─ PunctuationNode: "'" (1:4-1:5, 3-4)
      │  └─ TextNode: "ve" (1:5-1:7, 4-6)
      ├─ WhiteSpaceNode: " " (1:7-1:8, 6-7)
      ├─ WordNode[1] (1:8-1:11, 7-10)
      │  └─ TextNode: "got" (1:8-1:11, 7-10)
      ├─ WhiteSpaceNode: " " (1:11-1:12, 10-11)
      ├─ SymbolNode: "✉" (1:12-1:13, 11-12)           <------- This is a U+2709
      ├─ WordNode[1] (1:13-1:14, 12-13)               <------- 😢 
      │  └─ TextNode: "️" (1:13-1:14, 12-13)           <------- This is a U+FE0F
      └─ PunctuationNode: "!" (1:14-1:15, 13-14)

I traced this down from a bug I hit in https://github.com/tbroadley/spellchecker-cli when spellchecking markdown that uses the :email: shortcode. The shortcode is flagged as a spelling mistake because of the extra U+FE0F, which ends up in a WordNode of its own. Other emoji based on older symbols, such as ✂️ and ✈️, are affected too.
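
To make the extra code point visible, here is a minimal sketch in plain JavaScript. It is not taken from parse-latin: `codePoints` and `isVariationSelector` are hypothetical helpers written only for this illustration. It shows that each affected emoji is an older symbol followed by U+FE0F.

```javascript
// Hypothetical helper (not part of parse-latin): list a string's code points.
function codePoints(value) {
  return Array.from(value, function (character) {
    var hex = character.codePointAt(0).toString(16).toUpperCase()
    return 'U+' + hex.padStart(4, '0')
  })
}

// Hypothetical helper: true for the variation selectors U+FE00 to U+FE0F.
function isVariationSelector(character) {
  var code = character.codePointAt(0)
  return code >= 0xfe00 && code <= 0xfe0f
}

var email = '\u2709\uFE0F'
var scissors = '\u2702\uFE0F'
var airplane = '\u2708\uFE0F'

console.log(codePoints(email)) // [ 'U+2709', 'U+FE0F' ]
console.log(codePoints(scissors)) // [ 'U+2702', 'U+FE0F' ]
console.log(codePoints(airplane)) // [ 'U+2708', 'U+FE0F' ]
console.log(isVariationSelector(email.charAt(1))) // true
```

In the tree above, the first code point becomes the SymbolNode and the trailing U+FE0F is what lands in the stray WordNode.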

I had a bit of a go at fixing this but didn't get very far. I would be very grateful if you could point me in the right direction so I can submit a PR, though if you would prefer to handle it yourself I will be equally grateful!

Edit: In particular, I got stuck trying to figure out which, if any, of the modules in lib/plugin ought to be amended to correct this behaviour.

That project uses retext to wrap this project. retext can use one of its plugins to add support for emoji (https://github.com/retextjs/retext-emoji)!