
A question about the code.

XiaoFengbing opened this issue · 6 comments

I want to reproduce the results of your great paper, Selective-Context. I have a sentence such as 'Members of Ukraine's Armed Forces 80th Separate Air Assault Brigade at their position near the frontline city of Bakhmut, eastern Ukraine, last week'.

First, I use the huggy_llama_7b model and tokenizer to get tokens and self_info from the get_self_information function. tokens is ['M', 'embers', 'of', 'Ukraine', "'", 's', 'Ar', 'med', 'Forces', '', '8', '0', 'th', 'Se', 'par', 'ate', 'Air', 'Ass', 'ault', 'Brigade', 'at', 'their', 'position', 'near', 'the', 'front', 'line', 'city', 'of', 'B', 'akh', 'mut', ',', 'eastern', 'Ukraine', ',', 'last', 'week'], and self_info is [-8.699746131896973, -12.731630325317383, ..., -20.620922088623047].

Second, I get noun_phrases and noun_phrases_info in the _calculate_lexical_unit function, with self.nlp = spacy.load("en_core_web_sm", disable=["ner"]). noun_phrases is ["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine', ',', 'lastweek'], and noun_phrases_info is [-17.46139931678772, -17.359699249267578, -21.828365325927734, -17.94999122619629, -20.457746505737305]. The words are glued together because of sent = ''.join(tokens) in the _calculate_lexical_unit function.

Finally, 'easternUkraine' and 'lastweek' are deleted, and the compressed context is "MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut,,".

I think it's a strange result. Do you think there's anything wrong in this process?
Thanks for your help.

The input sentence you used in the phrase tokenization seems to be wrong.

Make sure you send the right sentence to self.nlp.

["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine',',','lastweek']

I suspect spaces are missing in your input.


Hi, I found the reason why I get the terrible result. The example sentence is: Boris Johnson has submitted evidence to MPs investigating whether he misled Parliament over Covid rule-breaking parties in Downing Street.

When I use huggy_llama_7b to tokenize the sentence, I get ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's', 'investig', 'ating', 'whether', 'he', 'mis', 'led', 'Parliament', 'over', 'Cov', 'id', 'rule', '-', 'bre', 'aking', 'parties', 'in', 'Down', 'ing', 'Street', '.'].

When I use gpt2 to tokenize the sentence (gpt2 is the default setting), I get ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence', ' to', ' MPs', ' investigating', ' whether', ' he', ' misled', ' Parliament', ' over', ' Cov', 'id', ' rule', '-', 'breaking', ' parties', ' in', ' Downing', ' Street', '.']

Because the gpt2 tokenizer keeps the leading whitespace in tokens such as ' Johnson', the sentence is restored correctly when the tokens go through sent = ''.join(tokens) in the _calculate_lexical_unit function. With huggy_llama_7b, the joined sentence is 'BorisJohnsonhassubmittedevidencetoMPsinvestigatingwhetherhemisledParliamentoverCovidrule-breakingpartiesinDowningStreet.'
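
The difference can be reproduced without loading either model, using the first few tokens of the two lists quoted above. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Token lists copied from the comments above (first six tokens only).
gpt2_tokens = ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence']
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence']

# gpt2 tokens carry their leading space, so a plain join restores the text.
gpt2_sent = ''.join(gpt2_tokens)

# The llama tokens, as printed above, carry no whitespace information,
# so the same join glues every word together.
llama_sent = ''.join(llama_tokens)

print(repr(gpt2_sent))   # 'Boris Johnson has submitted evidence'
print(repr(llama_sent))  # 'BorisJohnsonhassubmittedevidence'
```

spaCy then sees the second string as a single long word, which is why the noun phrases come out glued together.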

But there is no huggy_llama_7b setting, and I do not know how to fix it.

Can you help me? Thanks for your response!

Just replace the gpt2 tokenizer with yours in self._prepare_model

No, my description above is the result of replacing the tokenizer, so I want to know how you achieve that in self._prepare_model.

Just replace sent = ''.join(tokens) with sent = ' '.join(tokens) here.
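
A quick check of what the space-joined version produces for the llama tokens quoted earlier in the thread (first nine tokens only). This is a sketch for illustration, not code from the repository.

```python
# First nine llama tokens from the example above.
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's']

# Joining with a space separates the words, but it also puts a space
# inside words that the tokenizer split into several pieces.
spaced_sent = ' '.join(llama_tokens)

print(spaced_sent)  # Bor is Johnson has submitted evidence to MP s
```

The words are separated again, but subword pieces such as 'Bor' + 'is' and 'MP' + 's' are separated too, so the phrase tokenizer receives text that differs from the original sentence.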

@XiaoFengbing I just added llama2 support for self-information computing; check here.

Remember to keep using sent = ''.join(tokens) in your main code.
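
For readers wondering how ''.join(tokens) can work with a LLaMA-style tokenizer at all, here is a minimal sketch under stated assumptions. SentencePiece-based tokenizers such as LLaMA's mark the start of a word with '▁' (U+2581) in their string tokens. Turning that marker into a leading space gives gpt2-like tokens. The helper name to_space_prefixed and the raw token list are hypothetical, and the llama2 support in the repository may handle this differently.

```python
SP_MARKER = '\u2581'  # the '▁' word-boundary marker used by SentencePiece

def to_space_prefixed(tokens):
    """Replace the SentencePiece marker with a plain space in each token."""
    return [tok.replace(SP_MARKER, ' ') for tok in tokens]

# Assumed raw form of the first llama tokens from the example above,
# before the marker was stripped.
raw_tokens = ['\u2581Bor', 'is', '\u2581Johnson', '\u2581has',
              '\u2581submitted', '\u2581evidence']

tokens = to_space_prefixed(raw_tokens)

# The plain join now restores the sentence; lstrip drops the space
# that comes from the marker on the very first token.
sent = ''.join(tokens).lstrip()

print(tokens)
print(sent)  # Boris Johnson has submitted evidence
```

With tokens in this form, the number of tokens still matches the number of self-information values, and the joined sentence is the one spaCy should receive.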
