revdotcom/speech-datasets

How to understand multiple entity tags?

huangruizhe opened this issue · 3 comments

In this example:

token|speaker|ts|endTs|punctuation|case|tags|wer_tags
Good|0||||UC|[]|[]
morning|0||||LC|['5:TIME']|['5']
and|0||||LC|[]|[]
welcome|0||||LC|[]|[]
to|0||||LC|[]|[]
the|0||||LC|['6:DATE']|['6']
first|0||||LC|['6:DATE']|['6']
quarter|0||||LC|['6:DATE']|['6']
2020|0||||CA|['0:YEAR']|['0', '1', '6']
NexGEn|0||||MC|['7:ORG']|['7']

How should I understand that the word "2020" has three entity tags, ['0', '1', '6']?
Thanks.

Hi there!

Thanks for reaching out with your question -- you can work out what the three tags mean by looking them up in the corresponding wer_tag.json file.

In the examples we provided, the corresponding wer_tag.json appears just below the example you shared.

{
  "0":{
    "entity_type" : "YEAR"
  },
  "1":{
    "entity_type" : "CARDINAL"
  },
  "5":{
    "entity_type" : "TIME"
  },
  "6":{
    "entity_type" : "DATE"
  },
  "7":{
    "entity_type" : "ORG"
  }
}

So for 2020, tag 0 corresponds to a YEAR entity, 1 corresponds to a CARDINAL entity, and 6 corresponds to a DATE entity. Our reasoning is that the token 2020 on its own is a year and a cardinal number, but in the context of "the first quarter 2020" it's part of a date! As a result, we apply all three entity tags to it and include the IDs of those entities in the wer_tags column.
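
For anyone who wants to do this lookup programmatically, here is a minimal Python sketch. It assumes the row format and the wer_tag.json contents shown above; the function name `entity_types_for_row` is made up for illustration and is not part of the dataset tooling. It also assumes no token contains a `|` character.

```python
import ast
import json

# Header and one row copied from the example in this thread.
HEADER = "token|speaker|ts|endTs|punctuation|case|tags|wer_tags"
ROW = "2020|0||||CA|['0:YEAR']|['0', '1', '6']"

# Contents of the wer_tag.json shown above, inlined so the sketch is self-contained.
WER_TAGS_JSON = """
{
  "0": {"entity_type": "YEAR"},
  "1": {"entity_type": "CARDINAL"},
  "5": {"entity_type": "TIME"},
  "6": {"entity_type": "DATE"},
  "7": {"entity_type": "ORG"}
}
"""


def entity_types_for_row(row, header, wer_tag_map):
    """Return (token, entity types) for one pipe-delimited row.

    Hypothetical helper: splits naively on '|' and parses the wer_tags
    column, which is written as a Python-style list of string IDs.
    """
    fields = dict(zip(header.split("|"), row.split("|")))
    tag_ids = ast.literal_eval(fields["wer_tags"])
    return fields["token"], [wer_tag_map[i]["entity_type"] for i in tag_ids]


wer_tag_map = json.loads(WER_TAGS_JSON)
token, types = entity_types_for_row(ROW, HEADER, wer_tag_map)
print(token, types)  # 2020 ['YEAR', 'CARDINAL', 'DATE']
```

Note that the tags column only lists `0:YEAR` for this token, so the wer_tags column is the one to read if you want every entity the token belongs to.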

Hope this helps but let me know if there's anything else still confusing you!

Thanks,
Miguel

So, in this case, 2020 is tagged as three types of entities. Thanks for the nice explanation!
And this word will also be counted in the three entity-type-specific WER computations, right?

Sorry for the late reply -- yes! The token will be counted in all three WER categories!
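
To make the counting concrete, here is a rough Python sketch that tallies how many reference tokens fall into each entity category for the example rows above. It is only an illustration of the "one token, several categories" bookkeeping, not the actual scoring code, and the function name `reference_tokens_per_entity` is hypothetical. It only counts reference tokens (the denominator of a per-entity WER); it does not align against a hypothesis or count errors.

```python
import ast
from collections import Counter

# Rows copied from the example in this thread (header omitted).
ROWS = """
Good|0||||UC|[]|[]
morning|0||||LC|['5:TIME']|['5']
and|0||||LC|[]|[]
welcome|0||||LC|[]|[]
to|0||||LC|[]|[]
the|0||||LC|['6:DATE']|['6']
first|0||||LC|['6:DATE']|['6']
quarter|0||||LC|['6:DATE']|['6']
2020|0||||CA|['0:YEAR']|['0', '1', '6']
NexGEn|0||||MC|['7:ORG']|['7']
"""

# Same mapping as the wer_tag.json shown earlier in the thread.
WER_TAG_MAP = {
    "0": {"entity_type": "YEAR"},
    "1": {"entity_type": "CARDINAL"},
    "5": {"entity_type": "TIME"},
    "6": {"entity_type": "DATE"},
    "7": {"entity_type": "ORG"},
}


def reference_tokens_per_entity(rows, wer_tag_map):
    """Count reference tokens per entity type (hypothetical helper).

    A token whose wer_tags list holds several IDs is counted once in
    each of the corresponding categories.
    """
    counts = Counter()
    for line in rows.strip().splitlines():
        # wer_tags is the last pipe-delimited column.
        tag_ids = ast.literal_eval(line.rsplit("|", 1)[-1])
        for tag_id in tag_ids:
            counts[wer_tag_map[tag_id]["entity_type"]] += 1
    return counts


counts = reference_tokens_per_entity(ROWS, WER_TAG_MAP)
print(dict(counts))
```

Here DATE ends up with four tokens ("the", "first", "quarter", "2020"), while YEAR and CARDINAL each get one, all from the same "2020" token.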