shangjingbo1226/AutoNER

about the c++ codes

Closed this issue · 3 comments

Hi, I am reading the C++ code in the repo. I am not an expert in C++, so I only have a rough sense that the code annotates the raw text. However, the output format of the C++ code (annotation.ck) differs slightly from truth_dev.ck and truth_test.ck: annotation.ck has a fourth column.

<s> O None S
( I None S
Sch O None D
) O None D
was I None S
administered I None S
i.v I None S
. I None S
<eof> I None S

Is the fourth column used for another model (fuzzy-lstm-crf) in your paper? Is it using IOBES tagging?

I am going to translate the C++ code into Python in order to prepare data for AutoNER, so I don't need the fourth column, right? In addition, could you please give more insight into the C++ code? What algorithm do you use (e.g. a trie)? What exactly does the code do?

To be more specific, in the example I posted above, does the line Sch O None D indicate that Sch is tied to the previous token (? Similarly, does the next line mean that ) is tied to the previous token Sch? In summary, is ( Sch ) an entity detected with type None?

Moreover, why isn't there any Unknown tag in the annotation? According to your paper, if at least one of the tokens belongs to an unknown-typed high-quality phrase, the tokens would be tagged as Unknown.

I have the same question. Have you already solved it?

Hi, thanks for asking, and sorry for the late reply.

S and D indicate whether the boundary label is reliable.
Specifically, as in the paper, we have two distant supervision mechanisms: one is the core set (from the dictionary, with high precision), and the other is the phrase set (from AutoPhrase, with high recall). Here, ( Sch ) is a phrase but does not show up in the core set. Therefore, we are not sure whether it is an entity or not, and we use D to indicate that these labels may not be reliable.
This information is also used in the fuzzy-crf model, which converts the unreliable labels to fuzzy labels.

As for AutoNER, these labels are converted to UNKNOWN later in the pipeline.
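
For anyone porting this to Python, here is a minimal sketch of reading the four-column annotation.ck format and masking the boundary label wherever the fourth column is D. It is only an illustration of the explanation above, not the repo's actual pipeline code: the names `Row`, `parse_annotation`, and `mask_unreliable`, the column names, and the literal placeholder string "Unknown" are all assumptions.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    token: str
    boundary: str      # second column in annotation.ck, e.g. "I" or "O"
    entity_type: str   # third column, e.g. "None"
    reliability: str   # fourth column: "S" (reliable) or "D" (may be unreliable)


def parse_annotation(text: str) -> List[Row]:
    """Parse whitespace-separated four-column lines; skip anything else."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4:
            rows.append(Row(*parts))
    return rows


def mask_unreliable(rows: List[Row], placeholder: str = "Unknown") -> List[Row]:
    """Replace the boundary label of every D-flagged row with a placeholder.

    The placeholder value is hypothetical; the real pipeline may encode
    unknown boundaries differently.
    """
    return [
        row._replace(boundary=placeholder) if row.reliability == "D" else row
        for row in rows
    ]


sample = """<s> O None S
( I None S
Sch O None D
) O None D
was I None S"""

rows = parse_annotation(sample)
masked = mask_unreliable(rows)
for row in masked:
    print(row.token, row.boundary, row.entity_type, row.reliability)
```

In this sketch the S-flagged rows pass through unchanged, so a port that only needs reliable boundaries could instead filter on `row.reliability == "S"`.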

Many thanks for your reply. My understanding is as follows; can you tell me whether it is right?

If a phrase shows up in neither the core set nor the phrase set, or if it is an entity, it will be marked as S.

And if a phrase shows up in the phrase set but not in the core set, it will be marked as D. So the labels of phrases marked D will be converted to fuzzy labels in fuzzy-lstm-crf and to UNKNOWN in AutoNER.