Generative Model Analysis
Closed this issue · 6 comments
Hi,
Hope you are doing well! We were able to run the generative model on our corpus and had some interesting results.
Prior to training the generative model, our majority voting results were:
| TP | FP | FN | P | R | F1 |
|----|----|----|--------|--------|--------|
| 92 | 60 | 546 | 0.6053 | 0.1442 | 0.2329 |
After training, the model generated the following:
(0.5957, 0.1317, 0.2157) for the best dev output
Then, testing on our test corpus, we received the following:
| TP | FP | FN | P | R | F1 |
|-----|-----|------|--------|--------|--------|
| 340 | 390 | 2726 | 0.4658 | 0.1109 | 0.1791 |
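For reference, the P/R/F1 values above follow directly from the TP/FP/FN counts; here is a quick standalone check:

```python
# Quick check: derive precision, recall, and F1 from the raw counts above.
def prf1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return round(p, 4), round(r, 4), round(2 * p * r / (p + r), 4)

print(prf1(92, 60, 546))     # majority voting  -> (0.6053, 0.1442, 0.2329)
print(prf1(340, 390, 2726))  # test corpus run  -> (0.4658, 0.1109, 0.1791)
```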
How would you interpret these results? Is the generative model not generating any new improvements on top of our tagging and linking rules? Or is the model not improving due to overfitting on the train and dev corpora?
Please let me know if you have any questions or further clarifications that may help you answer our conceptual questions above.
Cheers,
Andrew
Hello Andrew,
Thank you for posting this issue. The link-HMM model generates improvements by leveraging the linking rules, as compared to the Naive Bayes (NB) and hidden Markov Model (HMM) generative models.
The fact that the link-HMM is not improving performance compared to the Majority Vote could indicate any of the following:
- Linking rules are ineffective (e.g. low accuracy or coverage).
- Suboptimal generative model hyperparameters
- Lack of negative supervision (e.g. need to add more tagging rules that vote on 'O' tags)
To help you debug this issue, can you please attach a screenshot of the output scores of all the tagging and linking rules?
Best,
Esteban
Esteban,
Thank you for such a quick response.
To give you a sense of scope - we are trying to identify businesses with supply chain manufacturing capabilities in a corpus of business text.
That is a great list of hypotheses for potential issues:
- The only linking rules we have are Sequential Capitalization, Common Symbols, and ELMo Linking. I would have thought this would be enough for business identification. Have you encountered any other use cases with more effective linking rules?
- Hyperparameter tuning still needs some adjustment.
- We don't have much in the way of negative supervision.
Output scores for all tagging/linking rules, respectively:
I hope this helps.
Andrew
Hello Andrew,
Those scores are looking really good! Most tagging rules have an accuracy of 85%+, which is what we generally recommend when developing these types of pipelines. The ConsecutiveCapitals linking rule has very high accuracy and coverage, which is great for the link-HMM.
I generally recommend debugging the pipeline using the generative model. That is, use the generative model scores as a measure of performance (similar to notebook 2). The scores that you reported are from the best dev output, i.e., discriminative scores. Can you please share the generative scores too (the ones obtained from running model.evaluate(test_data))? Can you also please share the values of the three generative hyperparameters used in your model?
Furthermore, can you please give me a bit of context about your rules? For instance, are you employing the tagging rules to identify a few select tokens as business names, and using ConsecutiveCapitals to extract the entire entities?
To isolate and fix the issue, I would recommend doing the following:
- Adding a few (1-3) negative-supervision tagging rules with high coverage and accuracy. For instance, you could tag most punctuation signs and lowercased verbs/adjectives as non-business entities (a sketch follows this list).
- Adding 1 or 2 additional linking rules. The better the linking rules, the better the link-HMM scores!
- Training the naive Bayes or HMM generative model to have a benchmark comparison against the link-HMM. You can simply add `from labelmodels import NaiveBayes, HMM` to your code and replace the model name where necessary. Since these two models don't employ linking rules, if you observe that the naive Bayes generative score is way below the Majority Vote, that means the tagging rules are being improperly weighted. Try reducing the regularization strength (acc_prior, balance_prior).
- If training a discriminative neural network (like in tutorial 3) harms the performance of the generative model, it is likely a regularization problem. In this case, you should also try reducing the regularization strength (acc_prior, balance_prior).
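To make the first suggestion concrete, here is a rough sketch of a negative-supervision tagging rule, assuming the TaggingRule interface from the WISER tutorials (subclass it and return one vote per token, with 'ABS' meaning abstain). Treat the class name and token filters as placeholders to adapt to your corpus:

```python
from wiser.rules import TaggingRule

class NonEntityCommonTokens(TaggingRule):
    """Sketch of a negative-supervision rule: votes 'O' on tokens that are
    unlikely to belong to a business name (punctuation and fully lowercased
    alphabetic words). If the lowercase check is too aggressive, restrict it
    to verbs/adjectives, e.g. via POS tags."""

    punctuation = {'.', ',', ';', ':', '!', '?', '(', ')', '"', "'", '-'}

    def apply_instance(self, instance):
        # One vote per token; 'ABS' means this rule abstains on that token.
        labels = ['ABS'] * len(instance['tokens'])
        for i, token in enumerate(instance['tokens']):
            text = token.text
            if text in self.punctuation or (text.isalpha() and text.islower()):
                labels[i] = 'O'
        return labels

# Apply it to each split like any other tagging rule, e.g.:
# rule = NonEntityCommonTokens()
# rule.apply(train_data)
```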
Let me know if any of these methods work and if you have any questions!
Esteban
Esteban,
Your notes and suggestions have been amazing!
I have followed your advice and it has yielded significant improvement.
Context:
The goal of the project is to extract user-defined, domain-specific organizations. You are right that ConsecutiveCapitals is trying to extract whole company names.
On to some modeling results...
Modeling Results:
Dev Tagging Rules Accuracy:
In addition, this has improved both the Majority Voting and the Generative Model results:
Dev and Test Majority Voting: (Before)
Test Generative Evaluation: (After)
The generative hyperparameters used were init_acc=0.95, acc_prior=50, balance_prior=100.
The F1 score has jumped from 0.2934 to 0.3903, and more importantly, our precision has jumped from 0.5585 to 0.7349. We employed a series of high-coverage negative and positive tagging and linking rules as well. The most drastic/novel addition to our linking rules is the use of automatic phrasing generated from the scope at focus (business and supply chain related).
We have yet to benchmark against either Naive Bayes or HMM. (Stay tuned for updates. Also, how many epochs does it run for? It's been running for hours at the moment.)
This latest breakthrough and the vision of the pipeline ahead lead to some interesting questions. I was hoping to get your perspective on the following:
- High coverage vs. high accuracy: What is a viable coverage percentage of the total dataset for a rule? (I know there is documentation concerning token accuracy.) Would it be more advantageous to have a few high-coverage, lower-accuracy rules vs. small-coverage, higher-accuracy rules? I hypothesize the former would be more advantageous, as it allows for more programmatic voting on the same token.
- Accuracy prior and balance prior: What are the recommended minimums and maximums for each parameter? As of now, the balance prior feeds into the entropy term, which ultimately affects the accuracy penalty weighting; so how do I know whether to search over [0, 10, 20, 100] or [0, 50, 100, 200]? I know this question opens a can of worms about grid search, random search, or Bayesian optimization, but I wanted to understand your hypothesis.
- Macro vs. micro F score and the optimization criterion: Currently the code suggests we are using a macro F score to decide the best P/R/F values against the dev data. What are your thoughts on using a macro vs. micro F score? The micro F score is intended for imbalance in the data; given that entity recognition labels tend to always be a minority of the corpus, would a micro F score be more appropriate? However, I have seen that we are using the balance_prior parameter, which seeks to normalize the distribution.
Side note on improving usability:
- Division by zero: I noticed a potential bug in score_linking_rules(data): if a linking rule did not identify any links, the count function that aggregates the totals of entity links, non-entity links, etc. errors with a division by zero. I would put in an if/then statement that checks whether any of the counts have been updated; if not, report [0, 0, 0] (a rough sketch follows the example code below). The code is not complicated; however, if you would like some help, I'd be more than obliged.
- Recommended rule-deletion function: There is a lot of trial and error with the linking and tagging rules, so when the researcher reaches the accuracy evaluation, it might not be immediately clear that they can delete a rule once they deem it to have low coverage/low accuracy. When the operations are short and simple, it's easy to rerun the functions; however, functions like spaCy, Phrase Extraction, or ElmoSimilarity can cause some latency. I have seen in other code a for loop that deletes a given rule from all of the dev, test, and train sets.
Example code:

```python
# Remove a tagging rule, so users can easily pick and choose which rules to
# drop once they see a low coverage/low accuracy combination.
def remove_tagging_rule(data, name):
    for instance in data:
        del instance['WISER_LABELS'][name]

# The same applies to linking rules (assuming their votes are stored under
# the 'WISER_LINKS' field).
def remove_linking_rule(data, name):
    for instance in data:
        del instance['WISER_LINKS'][name]
```
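For the division-by-zero point, the guard could be as simple as something like this (just a sketch; the counter names are placeholders, not the actual variables inside score_linking_rules):

```python
def safe_link_accuracy(correct_links, incorrect_links):
    # If the rule never fired, report 0 instead of raising ZeroDivisionError.
    total = correct_links + incorrect_links
    if total == 0:
        return 0.0
    return correct_links / total
```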
I can help with a tutorial code addition as well, if that would help. Please let me know.
Thank you for taking the time to read this; I look forward to hearing back from you soon.
Cheers,
Andrew
Hello Andrew,
Thank you so much for your suggestions, and for catching that bug! In the next few days, we will be releasing a code update with a rule deletion function, and an evaluation method that handles divisions by zero. We really appreciate your feedback and enthusiasm for this project!
We will also be releasing a FAQ of common questions and debugging tips, so hopefully more people can use these tricks to improve their models.
A precision of 0.7349 looks really good! With that score, I would expect the neural tagger to improve recall by a few points.
The HMM should be a tad faster than the link-HMM, and the naive Bayes should be the fastest among the three. The generative training time is directly dependent on the size of the dataset, the number of tagging/linking rules, and the number of epochs (which you can configure using the LearningConfig class). Adding tagging rules like the ones you just described will increase the search space, leading to an increased runtime.
Adding training logging for the generative model has been on our TODO list for a while now. However, training a generative model for ~2 epochs can give you a rough approximation of its performance when using 5 epochs (I wouldn't expect the F1 score to increase by more than a few points).
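For reference, capping the number of epochs for a quick run looks roughly like this, assuming the LearningConfig usage from the tutorials (where epochs is a settable attribute that you then pass to your training call):

```python
from labelmodels import LearningConfig

# Sketch: train for ~2 epochs to get a rough estimate of performance.
config = LearningConfig()
config.epochs = 2
```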
Here are the answers to your questions:
- Accuracy vs. coverage: High accuracy is generally associated with high precision, whereas high coverage is associated with high recall. While you should aim to strike a balance between the two types of rules, we recommend prioritizing precision over recall. This is because the discriminative neural network improves the generalization of the pipeline and mainly provides increases in recall. If a given set of rules has high coverage but low accuracy, the quality of the probabilistic labels used to train the neural tagger will be low (low precision, high recall). Hence, the discriminative model will learn from a large number of partially incorrect labels and will likely fail to generalize much beyond the information given. In such cases, it is preferable to have a smaller number of correct labels but increased accuracy (high precision, low recall). I personally recommend writing tagging and linking rules to get a precision of 70%+ and a recall of 50%+, but these values really depend on your end goal and task.
- Accuracy prior vs. balance prior: Good question! The typical search space for these two hyperparameters is [0, 500]. For our paper, we ran a grid search using the following values: {0.5, 1, 5, 10, 50, 100, 500}, but your model need not be limited to those (a rough sketch follows this list). We generally think of the accuracy and balance priors as intensities, meaning that the exact numerical values of the regularizers don't matter as much as their proximity to 0 or 500. In our next updates, we will add a small note indicating the range of the generative hyperparameters.
- Macro vs. micro F1: You're definitely correct in that entity recognition labels tend to always be a minority of the corpus. However, most of the sequence tagging papers we've based our work on report their scores using macro F1, so we have adopted this practice. Please feel free to share some sequence tagging papers and sample tasks that typically report scores using micro F1.
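As a rough illustration of that grid search (a sketch only; evaluate_on_dev is a hypothetical stand-in for however you train and score the generative model on your dev set):

```python
from itertools import product

def evaluate_on_dev(init_acc, acc_prior, balance_prior):
    """Hypothetical stand-in: train the generative model with these
    hyperparameters and return its dev F1. Replace with your own code."""
    raise NotImplementedError

grid = [0.5, 1, 5, 10, 50, 100, 500]

best = None
for acc_prior, balance_prior in product(grid, grid):
    score = evaluate_on_dev(init_acc=0.95, acc_prior=acc_prior,
                            balance_prior=balance_prior)
    if best is None or score > best[0]:
        best = (score, acc_prior, balance_prior)

print("Best dev F1 {:.4f} with acc_prior={}, balance_prior={}".format(*best))
```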
Let me know if you have any further questions!
Esteban
Closing this issue