A question about the code.

Question

XiaoFengbing opened this issue 8 months ago · 6 comments

I am trying to reproduce your great paper, Selective-Context. As a test, I use the sentence 'Members of Ukraine's Armed Forces 80th Separate Air Assault Brigade at their position near the frontline city of Bakhmut, eastern Ukraine, last week'.

First, I use the huggy_llama_7b model and tokenizer to get tokens and self_info from the get_self_information function in context_manager.py. tokens is ['M', 'embers', 'of', 'Ukraine', "'", 's', 'Ar', 'med', 'Forces', '', '8', '0', 'th', 'Se', 'par', 'ate', 'Air', 'Ass', 'ault', 'Brigade', 'at', 'their', 'position', 'near', 'the', 'front', 'line', 'city', 'of', 'B', 'akh', 'mut', ',', 'eastern', 'Ukraine', ',', 'last', 'week'], and self_info is [-8.699746131896973, -12.731630325317383, ..., -20.620922088623047].

Second, I get noun_phrases and noun_phrases_info in the _calculate_lexical_unit function, with self.nlp = spacy.load("en_core_web_sm", disable=["ner"]). noun_phrases is ["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine', ',', 'lastweek'], and noun_phrases_info is [-17.46139931678772, -17.359699249267578, -21.828365325927734, -17.94999122619629, -20.457746505737305]. The words are fused because _calculate_lexical_unit builds the sentence with sent = ''.join(tokens).

Finally, 'easternUkraine' and 'lastweek' are deleted, and the compressed context is "MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut,,".

This result looks strange to me. Is there anything wrong in this process?
Thanks for your help.

Answer 1 · 2024-01-23T08:17:54.000Z

The input sentence you used in the phrase tokenization seems to be wrong.

Make sure you send the right sentence to self.nlp.

["MembersofUkraine'sArmedForces80thSeparateAirAssaultBrigadeattheirpositionnearthefrontlinecityofBakhmut", ',', 'easternUkraine',',','lastweek']

I suspect spaces are missing in your input.

Answer 2 · 2024-01-23T12:37:25.000Z

@liyucheng09

Hi, I found the reason for the bad result. My example sentence is: Boris Johnson has submitted evidence to MPs investigating whether he misled Parliament over Covid rule-breaking parties in Downing Street.

When I use huggy_llama_7b to tokenize the sentence, I get ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to', 'MP', 's', 'investig', 'ating', 'whether', 'he', 'mis', 'led', 'Parliament', 'over', 'Cov', 'id', 'rule', '-', 'bre', 'aking', 'parties', 'in', 'Down', 'ing', 'Street', '.'].

When I use gpt2 to tokenize the sentence (gpt2 is the default setting in selective_context.py), I get ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence', ' to', ' MPs', ' investigating', ' whether', ' he', ' misled', ' Parliament', ' over', ' Cov', 'id', ' rule', '-', 'breaking', ' parties', ' in', ' Downing', ' Street', '.']

Because the gpt2 tokenizer keeps the leading whitespace in tokens such as ' Johnson', the sentence is restored correctly when the tokens go through sent = ''.join(tokens) in the _calculate_lexical_unit function. The huggy_llama_7b tokens carry no whitespace, so the joined result is 'BorisJohnsonhassubmittedevidencetoMPsinvestigatingwhetherhemisledParliamentoverCovidrule-breakingpartiesinDowningStreet.'

But selective_context.py does not have a huggy_llama_7b setting, and I do not know how to fix this.
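The difference can be reproduced in plain Python from the two token lists quoted above, without loading any tokenizer. The variable names (original, gpt2_sent, llama_sent) are only illustrative.

```python
# Token lists copied from the gpt2 and huggy_llama_7b outputs quoted above.
gpt2_tokens = ['B', 'oris', ' Johnson', ' has', ' submitted', ' evidence',
               ' to', ' MPs', ' investigating', ' whether', ' he', ' misled',
               ' Parliament', ' over', ' Cov', 'id', ' rule', '-', 'breaking',
               ' parties', ' in', ' Downing', ' Street', '.']
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to',
                'MP', 's', 'investig', 'ating', 'whether', 'he', 'mis', 'led',
                'Parliament', 'over', 'Cov', 'id', 'rule', '-', 'bre', 'aking',
                'parties', 'in', 'Down', 'ing', 'Street', '.']

original = ("Boris Johnson has submitted evidence to MPs investigating "
            "whether he misled Parliament over Covid rule-breaking parties "
            "in Downing Street.")

# Same join as in _calculate_lexical_unit: no separator between tokens.
gpt2_sent = ''.join(gpt2_tokens)
llama_sent = ''.join(llama_tokens)

# gpt2 tokens carry their own leading spaces, so the sentence round-trips.
print(gpt2_sent == original)
# The llama tokens listed here have no whitespace, so every word is fused.
print(' ' in llama_sent)
```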

Can you help me? Thanks for your response!

Answer 3 · 2024-01-23T18:10:30.000Z

Just replace the gpt2 tokenizer with yours in self._prepare_model

Answer 4 · 2024-01-24T02:42:09.000Z

"Just replace the gpt2 tokenizer with yours in self._prepare_model"

No, what I described above is already the result of replacing the tokenizer, so I would like to know how you handle this in self._prepare_model.

Answer 5 · 2024-01-24T20:23:50.000Z

Just replace sent = ''.join(tokens) with sent = ' '.join(tokens) here.
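One caveat that may be worth checking before relying on this: a space join also puts spaces inside words that the tokenizer split into several pieces. A minimal sketch using the first few huggy_llama_7b tokens quoted earlier in this thread (the names are illustrative):

```python
# First tokens of the huggy_llama_7b output quoted earlier in the thread.
llama_tokens = ['Bor', 'is', 'Johnson', 'has', 'submitted', 'evidence', 'to',
                'MP', 's', 'investig', 'ating']

# Joining with a space separates words, but also splits subword pieces.
spaced = ' '.join(llama_tokens)
print(spaced)

# 'Boris' and 'MPs' no longer appear as whole words in the joined string.
print('Boris' in spaced, 'MPs' in spaced)
```

So the phrases seen by self.nlp would be separated, but words such as 'Boris' or 'investigating' would be broken apart.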

Answer 6 · 2024-01-25T13:52:09.000Z

@XiaoFengbing I just added llama2 support for self-information computation; check here.

Remember to keep using sent = ''.join(tokens) in your main code.
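As a rough illustration of why the empty-string join works once tokens carry their own whitespace: LLaMA's SentencePiece tokenizer marks word starts with '▁' (U+2581), and mapping that marker to a space gives gpt2-style tokens. This is only a sketch under that assumption, not the change made in the repository; raw_tokens and restore_whitespace are hypothetical names and data.

```python
SPIECE_UNDERLINE = "\u2581"  # SentencePiece word-boundary marker

def restore_whitespace(raw_tokens):
    """Turn the SentencePiece marker into a literal space (hypothetical helper)."""
    return [tok.replace(SPIECE_UNDERLINE, " ") for tok in raw_tokens]

# Hypothetical raw tokens, assuming the marker is present on word-initial pieces.
raw_tokens = ["\u2581Bor", "is", "\u2581Johnson", "\u2581has", "\u2581submitted"]

tokens = restore_whitespace(raw_tokens)
# With whitespace inside the tokens, the empty-string join is enough.
# lstrip() drops the space that comes from the first word's marker.
sent = "".join(tokens).lstrip()
print(sent)
```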

#19