spencermountain/compromise

Inconsistent metadata for languages

Closed this issue · 1 comments

import nlp from "compromise/three"
console.dir(
  nlp("English, German, French").json(),
  { depth: null }
)

yields

{
    text: "English, German, French",
    terms: [
      {
        text: "English",
        pre: "",
        post: ", ",
        tags: [ "Adjective" ],
        normal: "english",
        index: [ 0, 0 ],
        id: "english|08E00000H",
        dirty: true,
        confidence: 0.6,
        chunk: "Noun",
      }, {
        text: "German",
        pre: "",
        post: ", ",
        tags: [ "Noun", "ProperNoun", "Demonym" ],
        normal: "german",
        index: [ 0, 1 ],
        id: "german|08F000015",
        chunk: "Noun",
        dirty: true,
      }, {
        text: "French",
        pre: "",
        post: "",
        tags: [ "Noun", "ProperNoun", "Demonym" ],
        normal: "french",
        index: [ 0, 2 ],
        id: "french|08G00002H",
        chunk: "Noun",
        dirty: true,
      }
    ],
  }

Why does English not have the same tags as the rest?

hey Adrian - cool, I didn't know about the depth parameter!

It could be for a few reasons - perhaps the word-order, like 'the english patient' or the statistics about how it's used 'english is hard to learn', or 'the french love wine'.

You can always co-erce it by supplying a lexicon like this:

nlp('english, french', {english:'Demonym'}).debug()

cheers