ankane/mitie-ruby

How is `score` computed?

Closed this issue · 5 comments

Hey all, this gem looks great. I'm using IBM Watson for NER right now and looking to move to an offline model, so this looks really promising. I had a quick question on score: what is it supposed to convey?

I thought it was something of a confidence rating on each detected entity, but the fact that it isn't bound 0-1 makes me wonder if that's the case. Plus, I don't see a lot of variance between scores in obviously-good and obviously-bad entity results (where "obvious" is up to opinion, at least!). For example, compare scores of the first 10 entities the model detected in The Great Gatsby:

[2] pry(Analysis::EntityService)> entities
=> [
{:text=>"F. Scott Fitzgerald", :tag=>"PERSON", :score=>1.0307061901891843, :offset=>70, :token_index=>10, :token_length=>3},
 {:text=>"Australia", :tag=>"LOCATION", :score=>1.0842781929716498, :offset=>115, :token_index=>18, :token_length=>1},
 {:text=>"Colin Choat Project Gutenberg", :tag=>"PERSON", :score=>0.9315695637550808, :offset=>278, :token_index=>48, :token_length=>4},
 {:text=>"Australia", :tag=>"LOCATION", :score=>0.9394158659426314, :offset=>311, :token_index=>53, :token_length=>1},
 {:text=>"Australia", :tag=>"LOCATION", :score=>1.6034960157756928, :offset=>396, :token_index=>67, :token_length=>1},
 {:text=>"Australia License", :tag=>"LOCATION", :score=>0.28120706180452426, :offset=>839, :token_index=>150, :token_length=>2},
 {:text=>"Australia", :tag=>"LOCATION", :score=>1.0073693250694458, :offset=>959, :token_index=>172, :token_length=>1},
 {:text=>"F. Scott Fitzgerald", :tag=>"PERSON", :score=>1.0565178964506419, :offset=>1042, :token_index=>189, :token_length=>3},
 {:text=>"East", :tag=>"LOCATION", :score=>0.7550257470613084, :offset=>2976, :token_index=>571, :token_length=>1},
 {:text=>"No--Gatsby", :tag=>"PERSON", :score=>0.5100084557142209, :offset=>3885, :token_index=>737, :token_length=>1},

compared to results a few dozen entities later:

 {:text=>"isy o", :tag=>"PERSON", :score=>0.8582236384684863, :offset=>10406, :token_index=>2026, :token_length=>1},
 {:text=>"isy's", :tag=>"PERSON", :score=>0.7578108850438561, :offset=>10477, :token_index=>2041, :token_length=>1},
 {:text=>"m w", :tag=>"PERSON", :score=>0.9629121897934299, :offset=>10507, :token_index=>2048, :token_length=>1},
 {:text=>"orgian Colonial m", :tag=>"MISC", :score=>0.3880779572976068, :offset=>10829, :token_index=>2110, :token_length=>2},
 {:text=>"ench w", :tag=>"MISC", :score=>1.0132125240891383, :offset=>11166, :token_index=>2174, :token_length=>1},
 {:text=>"m Buchanan i", :tag=>"PERSON", :score=>1.6505164986721017, :offset=>11262, :token_index=>2193, :token_length=>2},
 {:text=>"w Haven y", :tag=>"LOCATION", :score=>0.3916722230086166, :offset=>11371, :token_index=>2214, :token_length=>2},
 {:text=>"w Haven w", :tag=>"LOCATION", :score=>1.2104987034791914, :offset=>12138, :token_index=>2357, :token_length=>2},
 {:text=>"alian g", :tag=>"MISC", :score=>1.253780902325822, :offset=>12738, :token_index=>2493, :token_length=>1},
 {:text=>"ench w", :tag=>"MISC", :score=>1.1081784733871476, :offset=>13061, :token_index=>2557, :token_length=>1},
 {:text=>"m Buchanan s", :tag=>"PERSON", :score=>1.4611116626888367, :offset=>13909, :token_index=>2726, :token_length=>2},
 {:text=>"isy, ", :tag=>"PERSON", :score=>0.7700472945063558, :offset=>14481, :token_index=>2843, :token_length=>1},
 {:text=>"ker. ", :tag=>"PERSON", :score=>0.8807115826870865, :offset=>14986, :token_index=>2947, :token_length=>1},
 {:text=>"isy's", :tag=>"PERSON", :score=>0.8695349712636645, :offset=>15018, :token_index=>2956, :token_length=>1},

or even further down:

 {:text=>"deker violent", :tag=>"PERSON", :score=>0.4505072217509498, :offset=>160412, :token_index=>33586, :token_length=>2},
 {:text=>"d wat", :tag=>"PERSON", :score=>0.8038305588283357, :offset=>160563, :token_index=>33617, :token_length=>1},
 {:text=>"I thi", :tag=>"PERSON", :score=>0.992799150389002, :offset=>160952, :token_index=>33696, :token_length=>1},
 {:text=>", this u", :tag=>"LOCATION", :score=>0.4847694871466971, :offset=>161092, :token_index=>33726, :token_length=>2},
 {:text=>"and fishing", :tag=>"LOCATION", :score=>0.7249554379805961, :offset=>161163, :token_index=>33740, :token_length=>2},
 {:text=>"enl", :tag=>"PERSON", :score=>0.864206387154379, :offset=>161832, :token_index=>33865, :token_length=>1},
 {:text=>"by,\" I sai", :tag=>"PERSON", :score=>0.4089383097133973, :offset=>162018, :token_index=>33909, :token_length=>2},
 {:text=>"fur c", :tag=>"PERSON", :score=>0.3831703266997306, :offset=>162241, :token_index=>33956, :token_length=>1},
 {:text=>"hed", :tag=>"PERSON", :score=>0.908687840976366, :offset=>162399, :token_index=>33995, :token_length=>1},
 {:text=>"face ", :tag=>"PERSON", :score=>0.7779609258554205, :offset=>162445, :token_index=>34006, :token_length=>1},
 {:text=>"gan t", :tag=>"PERSON", :score=>0.5923163184277154, :offset=>162518, :token_index=>34022, :token_length=>1},
 {:text=>"aid ", :tag=>"PERSON", :score=>0.8959720764178499, :offset=>163308, :token_index=>34203, :token_length=>1},
 {:text=>"er gl", :tag=>"PERSON", :score=>0.8831318846967953, :offset=>163320, :token_index=>34207, :token_length=>1},

Are the entities reported in any sort of order (confidence, maybe)? Or, if not, is there a good way to differentiate low-likelihood results (like "alian g", "ench w", "w Haven w", and "m Buchanan i", all with scores > 1) from more traditional (or more likely-correct for my use) PERSON results?
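For what it's worth, if `score` does turn out to be usable as a confidence signal, a simple post-filter would be something like the sketch below. This is a hypothetical helper, not part of mitie-ruby, and the `0.8` cutoff is a made-up assumption; MITIE scores aren't probabilities, so any threshold would need tuning on a labeled sample of your own data:

```ruby
# Hypothetical helper (not part of mitie-ruby): keep only entities whose
# score clears a cutoff, then sort the survivors by score descending.
# The 0.8 default is an assumption, not a recommended value.
def confident_entities(entities, threshold: 0.8)
  entities.select { |e| e[:score] >= threshold }
          .sort_by { |e| -e[:score] }
end

sample = [
  { text: "F. Scott Fitzgerald", tag: "PERSON",   score: 1.0307 },
  { text: "Australia License",   tag: "LOCATION", score: 0.2812 }
]
confident_entities(sample)
# => keeps only the Fitzgerald entity
```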

It also seems odd that a large majority of my entity results are broken mid-word on one or both sides. Is that expected, or might there be something wrong with how I'm using the model? If tokens are meant to abide by word boundaries, it looks like the text might not be tokenizing properly.

I'm using ruby 2.7.2 and mitie 0.1.4, and the model is the "English" link in the readme: https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2.tar.bz2

For a little more context on a different data set, here are the first few dozen entities detected in the first few thousand words of Alice in Wonderland:

[
 {:text=>"bbit-Hole Alice w", :tag=>"PERSON", :score=>0.3323887007288693, :offset=>23, :token_index=>4, :token_length=>2},
 {:text=>"n", :tag=>"PERSON", :score=>0.25143917074036054, :offset=>254, :token_index=>52, :token_length=>1},
 {:text=>"bbit with pi", :tag=>"PERSON", :score=>0.263884330272437, :offset=>590, :token_index=>122, :token_length=>2},
 {:text=>"ink i", :tag=>"PERSON", :score=>0.533400777559435, :offset=>692, :token_index=>142, :token_length=>1},
 {:text=>"tch o", :tag=>"PERSON", :score=>0.36489281876638907, :offset=>985, :token_index=>209, :token_length=>1},
 {:text=>"ted t", :tag=>"PERSON", :score=>0.7858687512308972, :offset=>1068, :token_index=>227, :token_length=>1},
 {:text=>"r it,", :tag=>"PERSON", :score=>0.5719355511725237, :offset=>1408, :token_index=>297, :token_length=>1},
 {:text=>"not a", :tag=>"PERSON", :score=>0.7842862873460058, :offset=>1599, :token_index=>336, :token_length=>1},
 {:text=>", “af", :tag=>"PERSON", :score=>0.702844725192319, :offset=>2450, :token_index=>515, :token_length=>1},
 {:text=>"l", :tag=>"PERSON", :score=>0.352647747504045, :offset=>2468, :token_index=>519, :token_length=>1},
 {:text=>"d th", :tag=>"LOCATION", :score=>0.33827636477088835, :offset=>2703, :token_index=>575, :token_length=>1},
 {:text=>"of th", :tag=>"PERSON", :score=>1.0161954483478632, :offset=>2976, :token_index=>641, :token_length=>1},
 {:text=>"ngitu", :tag=>"PERSON", :score=>0.8325828964514884, :offset=>3328, :token_index=>713, :token_length=>1},
 {:text=>"ed to", :tag=>"PERSON", :score=>0.7026636987381064, :offset=>3829, :token_index=>823, :token_length=>2},
 {:text=>"she spoke—f", :tag=>"LOCATION", :score=>0.9862560196220408, :offset=>3846, :token_index=>828, :token_length=>2},
 {:text=>"talk", :tag=>"LOCATION", :score=>0.5070578221141001, :offset=>4151, :token_index=>892, :token_length=>1},
 {:text=>", I s", :tag=>"PERSON", :score=>0.3744207841429434, :offset=>4202, :token_index=>906, :token_length=>1},
 {:text=>"&. Do", :tag=>"PERSON", :score=>0.36569129980932186, :offset=>4080, :token_index=>913, :token_length=>1},
 {:text=>"time.", :tag=>"PERSON", :score=>0.6456767140390555, :offset=>4296, :token_index=>926, :token_length=>1},
 {:text=>"t of ", :tag=>"PERSON", :score=>0.6862893970546272, :offset=>4585, :token_index=>999, :token_length=>1},
 {:text=>"t a b", :tag=>"PERSON", :score=>0.42661307200580356, :offset=>4959, :token_index=>1086, :token_length=>1},
 {:text=>" came", :tag=>"PERSON", :score=>0.3916296374167344, :offset=>5008, :token_index=>1098, :token_length=>1},
 {:text=>" it w", :tag=>"PERSON", :score=>0.8968432314675724, :offset=>5165, :token_index=>1137, :token_length=>1},
 {:text=>" went Alice ", :tag=>"PERSON", :score=>0.20613228064101705, :offset=>5327, :token_index=>1175, :token_length=>2},

And if I add up all the scores from repeated entities (multiple entities with the same :text) and sort by highest scores, I get something like this:

{
 nil=>2.6569319136391565,
 "e Gry"=>2.4443541396378206,
 " with"=>2.230409941017787,
 "Alice"=>2.0789947989901396,
 "bout "=>1.9249066111647068,
 "ou’d "=>1.8222443565269217,
 "r"=>1.7490414079092622,
 "e sil"=>1.6764748484814689,
 "i"=>1.671628082176038,
 "n"=>1.6136421751760688,
 "ing t"=>1.5078949819692,
 "&=&! "=>1.5044488291095792,
 "er. "=>1.4505895217782652,
...
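(The aggregation above was roughly the following — a hypothetical reconstruction with a made-up sample, not code from my actual service:)

```ruby
# Sum scores per entity text, then sort by total descending.
entities = [
  { text: "Alice", score: 1.2 },
  { text: "Alice", score: 0.9 },
  { text: "n",     score: 0.25 }
]

totals = entities
  .group_by { |e| e[:text] }
  .transform_values { |es| es.sum { |e| e[:score] } }
  .sort_by { |_text, total| -total }
# "Alice" comes first with the largest combined score
```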

I'm using Plain Text UTF-8 downloads from Project Gutenberg (example). Besides a little cleanup by hand (removing most things like the title and table of contents before the actual text starts), the only modification I'm making is a simple replacement of newlines with spaces:

  model = Mitie::NER.new(Rails.root.to_s + MODEL_PATH + "ner_model.dat")
  doc   = model.doc(full_text.gsub(/[\r\n]+/, ' '))

  entities = doc.entities

The tokens look fine (splitting on word boundaries) upon manual inspection:

[6] pry(Analysis::EntityService)> doc.tokens
=> ["<U+FEFF>CHAPTER",
 "I.",
 "Down",
 "the",
 "Rabbit-Hole",
 "Alice",
 "was",
 "beginning",
 "to",
 "get",
 "very",
 "tired",
 "of",
 "sitting",
 "by",
 "her",
 "sister",
 "on",
 "the",
 "bank",
 ",",
 "and",
 "of",
 "having",
 "nothing",
...

Hey @drusepth, thanks for reporting. I pushed a fix for multibyte characters. The score comes directly from the MITIE library. Here's a bit more info about it: mit-nlp/MITIE#46
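For anyone else hitting the same mid-word entities, here's a minimal illustration of this class of bug, assuming it was a byte-offset vs. character-offset mix-up (this is not the gem's actual code). A UTF-8 BOM like the one at the start of the Gutenberg file is one character but three bytes, so using byte offsets as character indices shifts every later slice to the right:

```ruby
# A BOM is 1 character but 3 bytes in UTF-8. If a byte offset is used
# as a character index, every slice after the BOM drifts rightward.
text = "\u{FEFF}Daisy was there"   # leading BOM, as in the Gutenberg file

char_index  = text.index("Daisy")            # => 1
byte_offset = text[0...char_index].bytesize  # => 3 (the BOM's byte width)

text[byte_offset, 5]  # => "isy w" -- the same mid-word drift seen above
```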

I also recommend trying out NER from Informers.

Ah, thanks a ton for the quick fix! I can't get the gem to initialize properly when I point my Gemfile directly at this repo (Fiddle::DLError (.../gems/mitie-f08467d6a52c/vendor/libmitie.so: cannot open shared object file: No such file or directory)), but I'm sure that's just something I'm doing wrong, so I'll just wait for the next release to use it.

Thanks for the links as well; I'll check out Informers too! If this is resolved, feel free to close the issue at your convenience. :)

Yeah, the shared libraries aren't checked in on GitHub, so it's a bit more complicated to use the gem from there. Just pushed a new release.

Can confirm things look great on the latest release. Thanks again!