thunlp/RE-Context-or-Names

chemprot dataset

wjczf123 opened this issue · 5 comments

It seems that the data format of chemprot is different from the other datasets, so I cannot run your code on chemprot. Can you provide a suitable version so that I can run your code on the chemprot dataset?

The processing code for chemprot is different from the other datasets. We rewrote the code in the open-source repo for the sake of generality. If you need to use chemprot, you can first convert the chemprot data to the format described in finetune.
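
For reference, here is a minimal conversion sketch, not the authors' script: it assumes each chemprot line is a JSON object with a `text` field that wraps the head entity in << >> and the tail entity in [[ ]] plus a `label` field, and that the finetune format is the token/h/t JSON used by the other datasets (with `pos` as [start, end) token offsets); the file paths are placeholders.

import json
import re

# assumption: chemprot lines look like
# {"text": "... << head >> ... [[ tail ]] ...", "label": "..."}
H_PAT = re.compile(r"<<(.*?)>>")
T_PAT = re.compile(r"\[\[(.*?)\]\]")

def convert_line(line):
    item = json.loads(line)
    text = item["text"]
    h_match = H_PAT.search(text)
    t_match = T_PAT.search(text)
    h_name = h_match[1].strip()
    t_name = t_match[1].strip()

    # Walk the sentence in span order, replacing each marked span with the
    # bare mention and recording its token offsets.
    spans = sorted([(h_match.start(), h_match.end(), h_name, "h"),
                    (t_match.start(), t_match.end(), t_name, "t")])
    tokens, pos, cursor = [], {}, 0
    for start, end, name, which in spans:
        tokens.extend(text[cursor:start].split())
        ent_tokens = name.split()
        pos[which] = [len(tokens), len(tokens) + len(ent_tokens)]
        tokens.extend(ent_tokens)
        cursor = end
    tokens.extend(text[cursor:].split())

    return {"token": tokens,
            "h": {"name": h_name, "pos": pos["h"]},
            "t": {"name": t_name, "pos": pos["t"]},
            "relation": item["label"]}

if __name__ == "__main__":
    # placeholder paths: point these at your chemprot split files
    with open("chemprot/train.txt") as fin, \
         open("chemprot/train_converted.txt", "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert_line(line)) + "\n")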

You can also add the following method to the EntityMarker class in utils/utils.py:

    def tokenize_chemprot(self, raw_text, h_blank=False, t_blank=False):
        ####
        # convert head entity mention to "* h *"
        # convert tail entity mention to "^ t ^"
        ####
        h_mention = re.search(r"<<(.*)>>", raw_text)[1].strip()
        t_mention = re.search(r"\[\[(.*)\]\]", raw_text)[1].strip()
        text = self.h.sub("* h *", raw_text)
        text = self.t.sub("^ t ^", text)

        # tokenize
        tokenized_text = self.tokenizer.tokenize(text)
        tokenized_head = self.tokenizer.tokenize(h_mention)
        tokenized_tail = self.tokenizer.tokenize(t_mention)

        p_text = " ".join(tokenized_text)
        p_head = " ".join(tokenized_head)
        p_tail = " ".join(tokenized_tail)

        if h_blank:
            # replace the head mention itself with a blank token
            p_text = self.h_pattern.sub("[unused0] [unused4] [unused1]", p_text)
        else:
            p_text = self.h_pattern.sub("[unused0] " + p_head + " [unused1]", p_text)
        if t_blank:
            # replace the tail mention itself with a blank token
            p_text = self.t_pattern.sub("[unused2] [unused5] [unused3]", p_text)
        else:
            p_text = self.t_pattern.sub("[unused2] " + p_tail + " [unused3]", p_text)
    
        f_text = ("[CLS] " + p_text).split()
        try:
            h_pos = f_text.index("[unused0]")
        except:
            self.err += 1
            h_pos = 0
        try:
            t_pos = f_text.index("[unused2]") 
        except:
            self.err += 1
            t_pos = 0

        # pdb.set_trace()
        tokenized_input = self.tokenizer.convert_tokens_to_ids(f_text)
        
        return tokenized_input, h_pos, t_pos

I ran this code and found that self.h is not defined, and I also don't know its meaning.
Can you provide the complete code? Thanks.

The following is the old version.

import re

from transformers import BertTokenizer  # assumption: adjust to your installed BERT library

class EntityMarker():
    """Mark entity position
    """
    def __init__(self, args=None):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        # patterns that locate the substituted placeholders "* h *" and "^ t ^"
        self.h_pattern = re.compile(r"\* h \*")
        self.t_pattern = re.compile(r"\^ t \^")
        self.err = 0  # counts sentences where a marker token went missing
        self.args = args
        # chemprot-style raw markers: head wrapped in << >>, tail in [[ ]]
        self.h = re.compile(r"<<.*>>")
        self.t = re.compile(r"\[\[.*\]\]")

    
    def tokenize_chemprot(self, raw_text, h_blank=False, t_blank=False):
        ####
        # convert head entity mention to "* h *"
        # convert tail entity mention to "^ t ^"
        ####
        h_mention = re.search(r"<<(.*)>>", raw_text)[1].strip()
        t_mention = re.search(r"\[\[(.*)\]\]", raw_text)[1].strip()
        text = self.h.sub("* h *", raw_text)
        text = self.t.sub("^ t ^", text)

        # tokenize
        tokenized_text = self.tokenizer.tokenize(text)
        tokenized_head = self.tokenizer.tokenize(h_mention)
        tokenized_tail = self.tokenizer.tokenize(t_mention)

        p_text = " ".join(tokenized_text)
        p_head = " ".join(tokenized_head)
        p_tail = " ".join(tokenized_tail)

        if h_blank:
            # replace the head mention itself with a blank token
            p_text = self.h_pattern.sub("[unused0] [unused4] [unused1]", p_text)
        else:
            p_text = self.h_pattern.sub("[unused0] " + p_head + " [unused1]", p_text)
        if t_blank:
            # replace the tail mention itself with a blank token
            p_text = self.t_pattern.sub("[unused2] [unused5] [unused3]", p_text)
        else:
            p_text = self.t_pattern.sub("[unused2] " + p_tail + " [unused3]", p_text)
    
        f_text = ("[CLS] " + p_text).split()
        try:
            h_pos = f_text.index("[unused0]")
        except:
            self.err += 1
            h_pos = 0
        try:
            t_pos = f_text.index("[unused2]") 
        except:
            self.err += 1
            t_pos = 0

        # pdb.set_trace()
        tokenized_input = self.tokenizer.convert_tokens_to_ids(f_text)
        
        return tokenized_input, h_pos, t_pos
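
For anyone else reading: a quick usage sketch (the sentence below is a made-up example, not real chemprot data):

# usage sketch: the sentence is a made-up example, not real chemprot data
marker = EntityMarker()
raw = "Treatment with [[ aspirin ]] reduced << cyclooxygenase >> activity."
input_ids, h_pos, t_pos = marker.tokenize_chemprot(raw)
print(h_pos, t_pos)   # positions of [unused0] and [unused2] in the token list
print(input_ids[:8])  # BERT vocabulary ids for the first few tokens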

Thank you a lot!
I tried the code and reproduced the results.
Thanks again, and I hope you have a nice day!