Custom Feature generation impossible via 'make-name-finder'
jimpil opened this issue · 7 comments
Hi there,
It seems that 'make-name-finder' does not take into account the several constructors in the NameFinderME.java ...More specifically, there is no way to use the constructor that accepts a custom feature generator... I propose this, which is not a breaking change:
(defmethod make-name-finder TokenNameFinderModel
[model & {:keys [feature-generator]}] ;;optional arg - defaults to nil
(fn name-finder
[tokens & contexts]
{:pre [(seq tokens)
(every? #(= (class %) String) tokens)]}
(let [finder (NameFinderME. model feature-generator *beam-size*) ;can be nil - no problem
matches (.find finder (into-array String tokens))
probs (seq (.probs finder))]
(with-meta
(distinct (Span/spansToStrings matches (into-array String tokens)))
{:probabilities probs}))))
Jim
Actually this won't work because the dispath-fn expects a single argument...
We need variadic args in the dispatch-fn and the 2 defmethods like so:
(defmulti make-name-finder
"Return a function for finding names from tokens based on a given
model file."
(fn [model & args] (class model))) ;;the new dispatch-fn
(defmethod make-name-finder :default
[modelfile & args] ;; args are ignored here
(with-open [model-stream (input-stream modelfile)]
(make-name-finder (TokenNameFinderModel. model-stream))))
(defmethod make-name-finder TokenNameFinderModel
[model & {:keys [feature-generator]}] ;;but not ignored here
(fn name-finder
[tokens & contexts]
{:pre [(seq tokens)
(every? #(= (class %) String) tokens)]}
(let [finder (NameFinderME. model feature-generator *beam-size*) ;;can be nil - no problem
matches (.find finder (into-array String tokens))
probs (seq (.probs finder))]
(with-meta
(distinct (Span/spansToStrings matches (into-array String tokens)))
{:probabilities probs}))))
What do you think?
Jim
and to finish things off, the NER trainer must be slightly changed to take into account the custom feature generator:
(defn ^TokenNameFinderModel train-name-finder
"Returns a name finder based on a given training file"
([in] (train-name-finder "en" in))
([lang in] (train-name-finder lang in 100 5))
([lang entity in feature-gen iter cut]
(NameFinderME/train
lang
(if (nil? entity) "default" entity)
(->> (reader in)
(PlainTextByLineStream.)
(NameSampleDataStream.))
feature-gen
{}
iter
cut)))
Well, obviously the above will break a couple of things first being the f2 overloads that delegate to it, so again I suggest optional arguments like this:
(defn ^TokenNameFinderModel train-name-finder
"Returns a name finder based on a given training file"
([in] (train-name-finder "en" in))
([lang in] (train-name-finder lang in 100 5))
([lang in iter cut & {:keys [entity-type feature-gen]
:or {entity-type "default"}}]
(NameFinderME/train
lang
entity-type
(->> (reader in)
(PlainTextByLineStream.)
(NameSampleDataStream.))
feature-gen
{}
iter
cut)))
this doesn't break anything - whatever callers used 4 arguments they can keep doing so. Custom feature generation is very important for real use...As a last point, it would be nice to be able to train a perceptron as well, so we can compare with maXent. The only thing difference here is that we need to call a different NameFinderME.train() method. The one that takes a TrainingParameters object. I cannot squeeze it in the existing implementation via an optional arg without involving a big ugly 'if' so the simplest thing that comes to me as non-breaking is pulling out the real work done into 3 separate functions. the current top level 'train-name-finder' would take another optional argument :classifier (could default to maXent) and will delegate the real work accordingly to a 'train-name-finder-maxent' or a 'train-name-finder-perceptron' . It makes no sense to paste the whole thing here but if you're interested let me know.
Jim
I am interested, I'll take a look at pulling it into a better method. In the meantime, I'm more than happy to take a pull request if you have specific ideas about implementation.
Yeah sure... I've already forked it but I thought I should wait cos basically the same thing can be applied to all the models...I've only touched the name-finder and his training.
Jim
This is in clojure-opennlp 0.2.0, just released.