A question about the code.
XiaoFengbing opened this issue · 6 comments
I want to reproduce the results of your great paper, Selective-Context. I have a sentence such as 'Members of Ukraine's Armed Forces 80th Separate Air Assault Brigade at their position near the frontline city of Bakhmut, eastern Ukraine, last week'.
First, I use the huggy_llama_7b model and tokenizer to get tokens and self_info in the get_self_information function from context_manager.py.
tokens is ['M', 'embers', 'of', 'Ukraine', "'", 's', 'Ar', 'med', 'Forces', '', '8', '0', 'th', 'Se', 'par', 'ate', 'Air', 'Ass', 'ault', 'Brigade', 'at', 'their', 'position', 'near', 'the', 'front', 'line', 'city', 'of', 'B', 'akh', 'mut', ',', 'eastern', 'Ukraine', ',', 'last', 'week']
self_info is [-8.699746131896973, -12.731630325317383, ..., -20.620922088623047]
Second, I get noun_phrases and noun_phrases_info in the _calculate_lexical_unit function, with self.nlp = spacy.load("en_core_web_sm", disable=["ner"]).
noun_phrases is ["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine', ',', 'lastweek']
noun_phrases_info is [-17.46139931678772, -17.359699249267578, -21.828365325927734, -17.94999122619629, -20.457746505737305]
The spaces are missing because of sent = ''.join(tokens) in the _calculate_lexical_unit function.
Finally, 'easternUkraine' and 'lastweek' are deleted, and the compressed context is "MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut,,".
I think it's a strange result. Do you think there's anything wrong in this process?
Thanks for your help.
The input sentence you used in the phrase tokenization seems to be wrong.
Make sure you send the right sentence to self.nlp.
["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine',',','lastweek']
I suspect spaces are missing in your input.
Hi, I found the reason for the bad result. The example sentence is: Boris Johnson has submitted evidence to MPs investigating whether he misled Parliament over Covid rule-breaking parties in Downing Street.
When I use huggy_llama_7b to tokenize the sentence, I get ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's', 'investig', 'ating', 'whether', 'he', 'mis', 'led', 'Parliament', 'over', 'Cov', 'id', 'rule', '-', 'bre', 'aking', 'parties', 'in', 'Down', 'ing', 'Street', '.'].
When I use gpt2 to tokenize the sentence (gpt2 is the default setting in selective_context.py), I get ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence', ' to', ' MPs', ' investigating', ' whether', ' he', ' misled', ' Parliament', ' over', ' Cov', 'id', ' rule', '-', 'breaking', ' parties', ' in', ' Downing', ' Street', '.']
Because the gpt2 tokenizer keeps the leading whitespace in tokens such as ' Johnson', the sentence is restored correctly when tokens goes through sent = ''.join(tokens) in the _calculate_lexical_unit function. With huggy_llama_7b, the result is 'BorisJohnsonhassubmittedevidencetoMPsinvestigatingwhetherhemisledParliamentoverCovidrule-breakingpartiesinDowningStreet.'
But selective_context.py does not have a huggy_llama_7b setting, and I do not know how to fix it.
Can you help me? Thanks for your response!
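The difference described above can be reproduced with plain string operations. This is a minimal sketch that uses only the first few token strings quoted in this comment; it does not load either tokenizer.

```python
# Token strings copied from the comment above (first six tokens of each list).
gpt2_tokens = ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence']
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence']

# Same operation as sent = ''.join(tokens) in _calculate_lexical_unit.
gpt2_sent = ''.join(gpt2_tokens)
llama_sent = ''.join(llama_tokens)

print(gpt2_sent)   # Boris Johnson has submitted evidence
print(llama_sent)  # BorisJohnsonhassubmittedevidence
```

The gpt2 tokens carry their own leading spaces, so an empty-string join restores the sentence. The huggy_llama_7b tokens shown here carry no word-boundary information, so the same join fuses every word.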
Just replace the gpt2 tokenizer with yours in self._prepare_model
No, my description above is the result of replacing the tokenizer, so I want to know how you achieve that in self._prepare_model.
Just replace sent = ''.join(tokens) with sent = ' '.join(tokens) in _calculate_lexical_unit.
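As a rough illustration of what this change does, the sketch below applies a single-space join to the first few huggy_llama_7b tokens quoted earlier in the thread. It only shows the string result and makes no claim about how spaCy parses it.

```python
# Token strings copied from the huggy_llama_7b output quoted earlier in the thread.
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's']

# Proposed change: join with a single space instead of the empty string.
sent = ' '.join(llama_tokens)

print(sent)  # Bor is Johnson has submitted evidence to MP s
```

Word boundaries come back, but subword pieces of one word are separated too ('Bor is', 'MP s'), so this works as a quick workaround more than an exact reconstruction of the input sentence.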
@XiaoFengbing I just added llama2 support for self-information computing; check here.
Remember to keep using sent = ''.join(tokens) in your main code.
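One way an empty-string join can keep working with a LLaMA-style tokenizer is to turn SentencePiece's word-boundary marker '▁' (U+2581) into a leading space, so the tokens look like gpt2 tokens. The sketch below is an assumption about the general approach, not the code from the linked change: restore_spaces is a hypothetical helper, and raw_pieces is a hand-written example of raw pieces (such as Hugging Face's convert_ids_to_tokens returns), not actual tokenizer output.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece marks the start of a word with this character


def restore_spaces(pieces):
    """Hypothetical helper: map raw SentencePiece pieces to tokens with leading spaces."""
    return [piece.replace(SPIECE_UNDERLINE, " ") for piece in pieces]


# Hand-written illustrative pieces (assumed, not produced by running a tokenizer).
raw_pieces = ["\u2581Bor", "is", "\u2581Johnson", "\u2581has", "\u2581submitted", "\u2581evidence"]

tokens = restore_spaces(raw_pieces)
sent = "".join(tokens).lstrip()  # drop the space in front of the first word

print(tokens)  # [' Bor', 'is', ' Johnson', ' has', ' submitted', ' evidence']
print(sent)    # Boris Johnson has submitted evidence
```

With tokens in this form, sent = ''.join(tokens) yields a normally spaced sentence, and subword pieces such as 'Bor' and 'is' stay joined.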