data required for training

Question

Siddhijain16 opened this issue 2 years ago · 1 comments

Hi,
I have a dataset in a two-column format: column 1 holds the text and column 2 holds the IOB tag + NER tag, like this

How can I train a model with this dataset? Is there any way to train a model with the data structure mentioned above?
Please help!

Answer 1 · 2022-09-17T19:12:34.000Z

Hi @Siddhijain16,

To prepare a CoNLL dataset, you usually need to follow the steps below:

Read and inspect the data, which may be in csv, tsv, txt, conll, etc. format (txt will be the easiest to work with).
The data should consist of 4 columns:
- The first column contains the token (word)
- The second column contains the part-of-speech tag (pos_tag)
- The third column contains the chunk_tag
- The last column is the label column and contains the NER_tag
If your data has only the token and label (word and NER_tag) columns, you will need to add the other two columns yourself. These two columns can be filled with placeholder values such as -NN- and -O-.
Here is some important information about the CoNLL format:
- the first line has "-DOCSTART- -X- -X- O" as the header
- the second line is blank
- there is a blank line between every two sentences
You can save the CoNLL data as a txt file and read it via CoNLL().readDataset().
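
As a rough illustration of the steps above, here is a minimal sketch in plain Python that turns two-column data into the 4-column layout. It makes several assumptions that are not in the original question: the input has one `token tag` pair per line separated by whitespace, sentences are separated by blank lines, and the sample tokens and tags are made up. The helper names `parse_two_column` and `to_conll` are hypothetical, not part of any library.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]

# Header line described above for the CoNLL format.
DOCSTART = "-DOCSTART- -X- -X- O"


def parse_two_column(text: str) -> List[Sentence]:
    """Parse 'token tag' lines into sentences.

    Assumption: blank lines separate sentences in the input.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last whitespace run: everything before it is the token.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def to_conll(sentences: List[Sentence],
             pos_filler: str = "-NN-",
             chunk_filler: str = "-O-") -> str:
    """Build 4-column CoNLL text, filling pos_tag and chunk_tag with placeholders."""
    lines = [DOCSTART, ""]  # header line, then a blank line
    for sentence in sentences:
        for token, ner_tag in sentence:
            lines.append(f"{token} {pos_filler} {chunk_filler} {ner_tag}")
        lines.append("")  # blank line after each sentence
    return "\n".join(lines)


# Made-up sample data, for illustration only.
sample = "John B-PER\nlives O\nin O\nParis B-LOC\n\nHe O\nsmiled O\n"
conll_text = to_conll(parse_two_column(sample))
print(conll_text)
```

Writing `conll_text` to a txt file gives you something to pass to `CoNLL().readDataset()`. In the Spark NLP Python API that call should take the Spark session and the file path, but please check the signature against the version you have installed.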