bart-large-mnli multi_class does not agree with Python version
jimidle opened this issue · 7 comments
If you convert facebook/bart-large-mnli
and use it to evaluate the demo text at huggingface and compare against a local Python setup for verification, we find that:
- the online demo card and the local Python agree on the label score
- the label probabilities given back are vastly different
- the Python version takes roughly 16 seconds on my local machine, but the Spago version takes 37 seconds - this is a MAC and there is no GPU available
Python code is
text = "A new model offers an explanation for how the Galilean satellites formed around the solar system’s " \
"largest world. Konstantin Batygin did not set out to solve one of the solar system’s most puzzling " \
"mysteries when he went for a run up a hill in Nice, France. Dr. Batygin, a Caltech researcher, " \
"best known for his contributions to the search for the solar system’s missing “Planet Nine,” spotted a " \
"beer bottle. At a steep, 20 degree grade, he wondered why it wasn’t rolling down the hill. He realized " \
"there was a breeze at his back holding the bottle in place. Then he had a thought that would only pop " \
"into the mind of a theoretical astrophysicist: “Oh! This is how Europa formed.” Europa is one of " \
"Jupiter’s four large Galilean moons. And in a paper published Monday in the Astrophysical Journal, " \
"Dr. Batygin and a co-author, Alessandro Morbidelli, a planetary scientist at the Côte d’Azur Observatory " \
"in France, present a theory explaining how some moons form around gas giants like Jupiter and Saturn, " \
"suggesting that millimeter-sized grains of hail produced during the solar system’s formation became " \
"trapped around these massive worlds, taking shape one at a time into the potentially habitable moons we " \
"know today. "
cc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
labels = ['space & cosmos', 'scientific discovery', 'microbiology', 'robots', 'archeology']
r = cc(text, labels, multi_class=True)
Go code, with same text and classes, is:
bartModel, err = zsc.LoadModel(bartDir)
// ... check err
result, err := bartModel.Classify(c.InputText, "", classes, true)
Similarly using the model valhalla/distilbart-mnli-12-3
also gives wildly different results to the online huggingface demo, using the same text and label set as above.
So, is there something else I need to do, or is the zsc code not working? My go code is essentially just like the zsc demo code.
I didn't read this issue properly and might have opened a duplicate - #103 .
Have you found any further information regarding this since you've opened the issue?
Hello both of you @jimidle and @abishekmuthian. I apologize for taking so long to respond!
I somehow missed the GitHub notifications, and @marco-nicola was sure I had already dealt with your request.
In fact, this is not a desired behavior. Here we probably have either a bug or a serious loss of accuracy during the various steps of the forward.
Need to investigate, probably comparing the intermediate values of the python implementation. Would any of you be able to assist?
Thanks for getting back @matteo-grella , GitHub seems to remove notifications from the site if it sends the email and your emails might not be reaching your inbox.
It does indeed seem to be a serious bug. I can test the project if there's any new patches, But I'm afraid I cannot aid in active debugging immediately due to time constraint.
Hey @abishekmuthian,
It was a simple True flag that was supposed to be a False flag that confused us both!
Hence, with spaGO, output is forced to be a distribution (sum must be 1), whereas with Python, the output is free.
In fact, the scores reported from you here do not sum to 1.
Let me know
Matteo
I see, Glad that the issue was simple to identify and fix. I will conduct some more tests this weekend and get back.
Got back to testing Spago.
If I understand your last comment correctly, multiClass flag in Spago is !multi_label flag of Python i.e. It needs to be false to enable it.
Making it false does distribute the confidence values to ~1 but still doesn't match the python #103.
Model: spago/valhalla/distilbart-mnli-12-3
Text: "PalmOS on Raspberry Pi"
Spago multiClass:false
0. tech [0.86]
1. startup [0.05]
2. legal [0.05]
3. business [0.03]
Spago multiClass:true
0. tech [0.89]
1. startup [0.02]
2. legal [0.01]
3. business [0.00]
Python multi_label:true
0. tech [0.99]
1. legal [0.77]
2. startup [0.05]
3. business [0.00]