dakrone/clojure-opennlp

Custom Feature generation impossible via 'make-name-finder'

jimpil opened this issue · 7 comments

Hi there,

It seems that 'make-name-finder' does not take into account the several constructors of NameFinderME.java. More specifically, there is no way to use the constructor that accepts a custom feature generator. I propose the following, which is not a breaking change:

(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator]}] ;; optional arg - defaults to nil
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator *beam-size*) ;; can be nil - no problem
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs})))) 

Jim

Actually this won't work, because the dispatch-fn expects a single argument...

We need variadic args in the dispatch-fn and in the two defmethods, like so:

(defmulti make-name-finder 
  "Return a function for finding names from tokens based on a given
   model file."
  (fn [model & args] (class model))) ;; the new dispatch-fn

(defmethod make-name-finder :default
  [modelfile & args]   ;; args are ignored here
  (with-open [model-stream (input-stream modelfile)]
    (make-name-finder (TokenNameFinderModel. model-stream))))

(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator]}] ;; but not ignored here
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator *beam-size*) ;; can be nil - no problem
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs}))))
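For illustration, usage could then look something like the sketch below (the model path and the my-feature-gen / person-model names are made up; the generator set just mimics the stock generators in opennlp.tools.util.featuregen). Note that with the code above the model has to be loaded explicitly, because the :default method ignores the extra args:

(require '[clojure.java.io :as io])
(import '(opennlp.tools.namefind TokenNameFinderModel)
        '(opennlp.tools.util.featuregen AdaptiveFeatureGenerator
                                         CachedFeatureGenerator
                                         WindowFeatureGenerator
                                         TokenFeatureGenerator
                                         TokenClassFeatureGenerator
                                         PreviousMapFeatureGenerator
                                         BigramNameFeatureGenerator))

;; a custom generator set, roughly mirroring what NameFinderME builds by default;
;; ideally it should match the generators the model was trained with
(def my-feature-gen
  (CachedFeatureGenerator.
   (into-array AdaptiveFeatureGenerator
               [(WindowFeatureGenerator. (TokenFeatureGenerator.) 2 2)
                (WindowFeatureGenerator. (TokenClassFeatureGenerator. true) 2 2)
                (PreviousMapFeatureGenerator.)
                (BigramNameFeatureGenerator.)])))

;; load the model ourselves so the TokenNameFinderModel defmethod is hit
(def person-model
  (with-open [is (io/input-stream "models/en-ner-person.bin")]
    (TokenNameFinderModel. is)))

(def name-find
  (make-name-finder person-model :feature-generator my-feature-gen))

(name-find ["Mr" "." "Smith" "flew" "to" "London" "yesterday" "."])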

What do you think?

Jim

And to finish things off, the NER trainer must be changed slightly to take the custom feature generator into account:

(defn ^TokenNameFinderModel train-name-finder
  "Returns a name finder based on a given training file"
  ([in] (train-name-finder "en" in))
  ([lang in] (train-name-finder lang in 100 5))
  ([lang entity in feature-gen iter cut]
     (NameFinderME/train
      lang
      (if (nil? entity) "default" entity)
      (->> (reader in)
           (PlainTextByLineStream.)
           (NameSampleDataStream.))
      feature-gen
      {}
      iter
      cut)))

Well, obviously the above will break a couple of things, the first being the lower-arity overloads that delegate to it, so again I suggest optional arguments like this:

(defn ^TokenNameFinderModel train-name-finder
  "Returns a name finder based on a given training file"
  ([in] (train-name-finder "en" in))
  ([lang in] (train-name-finder lang in 100 5))
  ([lang in iter cut & {:keys [entity-type feature-gen]  
                        :or  {entity-type "default"}}]                    
     (NameFinderME/train
      lang
      entity-type
      (->> (reader in)
           (PlainTextByLineStream.)
           (NameSampleDataStream.))
      feature-gen     
      {}
      iter
      cut)))
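Call sites would then look like this (the training file path and my-feature-gen are placeholders):

;; existing 4-argument callers are unaffected
(train-name-finder "en" "training/en-ner-person.train" 100 5)

;; new callers opt in via keyword arguments
(train-name-finder "en" "training/en-ner-person.train" 100 5
                   :entity-type "person"
                   :feature-gen my-feature-gen)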

This doesn't break anything - callers that used 4 arguments can keep doing so. Custom feature generation is very important for real use.

As a last point, it would be nice to be able to train a perceptron as well, so we can compare with maxent. The only difference is that we need to call a different NameFinderME.train() method - the one that takes a TrainingParameters object. I cannot squeeze it into the existing implementation via an optional arg without a big ugly 'if', so the simplest non-breaking thing that comes to mind is pulling the real work out into 3 separate functions: the current top-level 'train-name-finder' would take another optional argument :classifier (defaulting to maxent) and delegate accordingly to a 'train-name-finder-maxent' or a 'train-name-finder-perceptron'. I won't paste the whole thing here, but a rough sketch of the idea follows; if you're interested let me know.
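For concreteness, a minimal sketch of that split could look like the following. It assumes the NameFinderME.train overload that takes a TrainingParameters object (available in recent OpenNLP 1.5 releases); the train-name-finder* helper and the literal "Algorithm"/"Iterations"/"Cutoff" keys are spelled out here for illustration, and the lower arities of the top-level fn are omitted:

;; in addition to the imports already used above:
(import 'opennlp.tools.util.TrainingParameters)

(defn- ^TokenNameFinderModel train-name-finder*
  "Shared training logic for maxent and perceptron training."
  [algorithm lang in iter cut entity-type feature-gen]
  (let [params (doto (TrainingParameters.)
                 (.put "Algorithm"  algorithm)
                 (.put "Iterations" (str iter))
                 (.put "Cutoff"     (str cut)))]
    (NameFinderME/train
     lang
     entity-type
     (->> (reader in)
          (PlainTextByLineStream.)
          (NameSampleDataStream.))
     params
     feature-gen ;; may be nil - OpenNLP then falls back to its defaults
     {})))

(defn ^TokenNameFinderModel train-name-finder-maxent
  [lang in iter cut entity-type feature-gen]
  (train-name-finder* "MAXENT" lang in iter cut entity-type feature-gen))

(defn ^TokenNameFinderModel train-name-finder-perceptron
  [lang in iter cut entity-type feature-gen]
  (train-name-finder* "PERCEPTRON" lang in iter cut entity-type feature-gen))

;; the top-level fn just dispatches on the new :classifier key
(defn ^TokenNameFinderModel train-name-finder
  [lang in iter cut & {:keys [entity-type feature-gen classifier]
                       :or   {entity-type "default" classifier :maxent}}]
  (case classifier
    :maxent     (train-name-finder-maxent     lang in iter cut entity-type feature-gen)
    :perceptron (train-name-finder-perceptron lang in iter cut entity-type feature-gen)))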

Jim

I am interested; I'll take a look at pulling it into a better method. In the meantime, I'm more than happy to take a pull request if you have specific ideas about the implementation.

Yeah sure... I've already forked it, but I thought I should wait because basically the same thing can be applied to all the models... I've only touched the name-finder and its training.

Jim

This is in clojure-opennlp 0.2.0, just released.