nlpyang/BertSum

Pre-processing new data set for BertSum

nimahassanpour opened this issue · 8 comments

Can you please let me know how I can pre-process a new data set, such as a list of paper abstracts, as input for BertSum?

Have you read the README yet?
There is an option for processing your own dataset. You need to download the CNN/DM stories to see their format; then you can process your data into the same format as these stories.

Basically, each story has two parts: the source and the target. Lines are separated by blank lines, and there is an @highlight line before each target line. Anyway, just follow the steps in the README and look at the data and code; you will figure it out yourself.
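For illustration, a story file with one source paragraph and two target sentences could look like this (a made-up example; the downloaded CNN/DM stories follow the same layout):

```
The quick brown fox jumped over the lazy dog .
It then ran into the forest and was never seen again .

@highlight

A fox jumped over a dog

@highlight

The fox escaped into the forest
```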

According to your explanation, for prediction samples that do not have a target, I should not have @highlight lines in my .story files. Is that right?

Sure. If you have already trained a model, then in the prediction step you don't need to provide labels. Target lines are the ground truth (or, as I called them, "labels") that you expect the model's output to be as similar to as possible. In short, yes, you don't need to provide target lines (@highlight) when it comes to prediction.
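So a prediction-only .story file would contain just the source sentences, with no @highlight blocks at all (again, a made-up example):

```
This is the abstract you want to summarize .
It has no @highlight section because there is no ground-truth summary .
```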

@binhna Thank you for your reply. I am having another strange problem. When I follow Option 2 for pre-processing the data, I can tokenize the .story files successfully and generate the .story.json files. But when I run step 5, I always get three empty square brackets:

[nhassanp@uc1f-bioinfocloud-assembly-base src]$ /data/conda_envs/20200204/miniconda3/bin/python preprocess.py -mode format_to_bert -raw_path ../merged_stories_tokenized/ -save_path ../bert_cnn/ -oracle_mode greedy -log_file ../logs/preprocess.log
[]
[]
[]

(Since I don't have URLs for my data, I skipped step 4.)

You should check the format_to_bert function in data_builder.py.
Empty square brackets mean that it didn't find any JSON files from step 4.
You should check glob.glob(os.path.join('../merged_stories_tokenized/', f'*{corpus_type}.*.json')).
Note that the names of the JSON files from step 4 should look something like abcd.test.0.json, i.e. {name}.{corpus_type}.{number, I suppose}.json.
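As a quick sanity check, you can run the same glob yourself and see whether it picks up your files (a sketch assuming the pattern above; an empty list here reproduces the empty brackets you saw):

```python
import glob
import os

# List the JSON files that format_to_bert would find for each split.
for corpus_type in ['train', 'valid', 'test']:
    files = glob.glob(os.path.join('../merged_stories_tokenized/', f'*{corpus_type}.*.json'))
    print(corpus_type, files)
```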

@binhna Thank you! I appreciate your help.
I modified the saving path and now it works. But another issue came up!
In the _format_to_bert function there is a line which I do not understand:
name = re.search('Files.(.*).test.json', json_file).group(1)
In my case json_file = "/data/examples/nhassanp/PreSumm-master/bert_cnn/train.6.bert.pt",
and instead of a name I get "AttributeError: 'NoneType' object has no attribute 'group'". Because of that, I get the following error when I run step 5 of Option 2:

[screenshot: traceback from step 5]

Also, the pre-processed data should have a "tgt" key to be compatible with training, but the dictionary produced by _format_to_bert does not have a "tgt" key and value:

[screenshot: keys of the pre-processed dictionary]

First of all, AttributeError: 'NoneType' object has no attribute 'group' means that re.search('Files.(.*).test.json', json_file) is supposed to always find a match in json_file, but it doesn't, so it returns None, and then you can't call group() on NoneType. You should change the code to be compatible with your filenames, though.
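To make it concrete (a sketch; the basename trick at the end is just one hypothetical workaround, not the author's code):

```python
import os
import re

json_file = "/data/examples/nhassanp/PreSumm-master/bert_cnn/train.6.bert.pt"

# The pattern expects something like '...Files.<name>.test.json',
# which this path does not contain, so re.search returns None.
match = re.search('Files.(.*).test.json', json_file)
print(match)  # None -> match.group(1) raises AttributeError

# Hypothetical workaround: derive the name from the file's basename instead.
name = os.path.basename(json_file).split('.')[0]
print(name)  # 'train'
```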

For the second question: at the line you showed me there is no tgt, because those values come from b_data; maybe he changed the name of tgt to something else. Look at the first line in the for loop: source and target are read from 'src' and 'tgt'. So it makes sense.
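If you want to verify which keys the saved shards actually contain, you can load one and inspect it (a sketch; the key names in the comment are an assumption based on this thread, not confirmed output):

```python
import torch

# Load one pre-processed shard and inspect the keys of the first example.
data = torch.load('../bert_cnn/train.6.bert.pt')
print(data[0].keys())
# Expect something like 'src', 'labels', 'segs', 'clss', 'src_txt', 'tgt_txt'
# rather than a 'tgt' key.
```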

PS: Mate, you really need to analyze the source code 😄

@binhna You are right. It seems that he updated his code but did not update his pre-processed data sets.