thunlp/RE-Context-or-Names

chemprot dataset

wjczf123 opened this issue · 5 comments

It seems that the data format of chemprot is different from the other datasets, so I cannot run your code on chemprot. Can you provide a suitable version so that I can run your code on the chemprot dataset?

The processing code for chemprot is different from the other datasets. We rewrote the code in the open-source repo for the sake of generality. If you need to use chemprot, you can first convert the chemprot data to the format described in finetune.
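
For reference, here is a minimal conversion sketch, not the authors' script: it assumes each chemprot line is a JSON object with a `text` field that wraps the head entity in << >> and the tail entity in [[ ]] plus a `label` field, and that the finetune format is the token/h/t JSON used by the other datasets (with `pos` as [start, end) token offsets); the file paths are placeholders.

import json
import re

# assumption: chemprot lines look like
# {"text": "... << head >> ... [[ tail ]] ...", "label": "..."}
H_PAT = re.compile(r"<<(.*?)>>")
T_PAT = re.compile(r"\[\[(.*?)\]\]")

def convert_line(line):
    item = json.loads(line)
    text = item["text"]
    h_match = H_PAT.search(text)
    t_match = T_PAT.search(text)
    h_name = h_match[1].strip()
    t_name = t_match[1].strip()

    # Walk the sentence in span order, replacing each marked span with the
    # bare mention and recording its token offsets.
    spans = sorted([(h_match.start(), h_match.end(), h_name, "h"),
                    (t_match.start(), t_match.end(), t_name, "t")])
    tokens, pos, cursor = [], {}, 0
    for start, end, name, which in spans:
        tokens.extend(text[cursor:start].split())
        ent_tokens = name.split()
        pos[which] = [len(tokens), len(tokens) + len(ent_tokens)]
        tokens.extend(ent_tokens)
        cursor = end
    tokens.extend(text[cursor:].split())

    return {"token": tokens,
            "h": {"name": h_name, "pos": pos["h"]},
            "t": {"name": t_name, "pos": pos["t"]},
            "relation": item["label"]}

if __name__ == "__main__":
    # placeholder paths: point these at your chemprot split files
    with open("chemprot/train.txt") as fin, \
         open("chemprot/train_converted.txt", "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert_line(line)) + "\n")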

You can also add the following method to the EntityMarker class in utils/utils.py:

    def tokenize_chemprot(self, raw_text, h_blank=False, t_blank=False):
        ####
        # convert head entity mention to "* h *"
        # convert tail entity mention to "^ t ^"
        ####
        h_mention = re.search(r"<<(.*)>>", raw_text)[1].strip()
        t_mention = re.search(r"\[\[(.*)\]\]", raw_text)[1].strip()
        text = self.h.sub("* h *", raw_text)
        text = self.t.sub("^ t ^", text)

        # tokenize
        tokenized_text = self.tokenizer.tokenize(text)
        tokenized_head = self.tokenizer.tokenize(h_mention)
        tokenized_tail = self.tokenizer.tokenize(t_mention)

        p_text = " ".join(tokenized_text)
        p_head = " ".join(tokenized_head)
        p_tail = " ".join(tokenized_tail)

        if h_blank:
            # replace the head mention itself with a blank token
            p_text = self.h_pattern.sub("[unused0] [unused4] [unused1]", p_text)
        else:
            p_text = self.h_pattern.sub("[unused0] " + p_head + " [unused1]", p_text)
        if t_blank:
            # replace the tail mention itself with a blank token
            p_text = self.t_pattern.sub("[unused2] [unused5] [unused3]", p_text)
        else:
            p_text = self.t_pattern.sub("[unused2] " + p_tail + " [unused3]", p_text)
    
        f_text = ("[CLS] " + p_text).split()
        try:
            h_pos = f_text.index("[unused0]")
        except:
            self.err += 1
            h_pos = 0
        try:
            t_pos = f_text.index("[unused2]") 
        except:
            self.err += 1
            t_pos = 0

        # pdb.set_trace()
        tokenized_input = self.tokenizer.convert_tokens_to_ids(f_text)
        
        return tokenized_input, h_pos, t_pos

I ran this code and found that self.h is not defined, and I also don't know its meaning.
Can you provide the complete code? Thanks.

The following is the old version.

import re

from transformers import BertTokenizer  # assumption: adjust to your installed BERT library

class EntityMarker():
    """Mark entity position
    """
    def __init__(self, args=None):
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        # patterns that locate the substituted placeholders "* h *" and "^ t ^"
        self.h_pattern = re.compile(r"\* h \*")
        self.t_pattern = re.compile(r"\^ t \^")
        self.err = 0  # counts sentences where a marker token went missing
        self.args = args
        # chemprot-style raw markers: head wrapped in << >>, tail in [[ ]]
        self.h = re.compile(r"<<.*>>")
        self.t = re.compile(r"\[\[.*\]\]")

    
    def tokenize_chemprot(self, raw_text, h_blank=False, t_blank=False):
        ####
        # convert head entity mention to "* h *"
        # convert tail entity mention to "^ t ^"
        ####
        h_mention = re.search(r"<<(.*)>>", raw_text)[1].strip()
        t_mention = re.search(r"\[\[(.*)\]\]", raw_text)[1].strip()
        text = self.h.sub("* h *", raw_text)
        text = self.t.sub("^ t ^", text)

        # tokenize
        tokenized_text = self.tokenizer.tokenize(text)
        tokenized_head = self.tokenizer.tokenize(h_mention)
        tokenized_tail = self.tokenizer.tokenize(t_mention)

        p_text = " ".join(tokenized_text)
        p_head = " ".join(tokenized_head)
        p_tail = " ".join(tokenized_tail)

        if h_blank:
            # replace the head mention itself with a blank token
            p_text = self.h_pattern.sub("[unused0] [unused4] [unused1]", p_text)
        else:
            p_text = self.h_pattern.sub("[unused0] " + p_head + " [unused1]", p_text)
        if t_blank:
            # replace the tail mention itself with a blank token
            p_text = self.t_pattern.sub("[unused2] [unused5] [unused3]", p_text)
        else:
            p_text = self.t_pattern.sub("[unused2] " + p_tail + " [unused3]", p_text)
    
        f_text = ("[CLS] " + p_text).split()
        try:
            h_pos = f_text.index("[unused0]")
        except:
            self.err += 1
            h_pos = 0
        try:
            t_pos = f_text.index("[unused2]") 
        except:
            self.err += 1
            t_pos = 0

        # pdb.set_trace()
        tokenized_input = self.tokenizer.convert_tokens_to_ids(f_text)
        
        return tokenized_input, h_pos, t_pos
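
For anyone else reading: a quick usage sketch (the sentence below is a made-up example, not real chemprot data):

# usage sketch: the sentence is a made-up example, not real chemprot data
marker = EntityMarker()
raw = "Treatment with [[ aspirin ]] reduced << cyclooxygenase >> activity."
input_ids, h_pos, t_pos = marker.tokenize_chemprot(raw)
print(h_pos, t_pos)   # positions of [unused0] and [unused2] in the token list
print(input_ids[:8])  # BERT vocabulary ids for the first few tokens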

Thank you a lot!
I tried the code and reproduced the results.
Thanks again, and I hope you have a nice day!