ThilinaRajapakse/BERT_binary_text_classification

KeyError: 'label' when processing examples

RuiPChaves opened this issue · 16 comments

When running this part of the code:

from multiprocessing import Pool, cpu_count
from tqdm import tqdm_notebook

import convert_examples_to_features

# train_examples_len and train_examples_for_processing are built earlier in the notebook
process_count = cpu_count() - 1
if __name__ == '__main__':
    print(f'Preparing to convert {train_examples_len} examples..')
    print(f'Spawning {process_count} processes..')
    with Pool(process_count) as p:
        train_features = list(tqdm_notebook(p.imap(convert_examples_to_features.convert_example_to_feature, train_examples_for_processing), total=train_examples_len))

I get an error, seen below:

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /home/rpc/.pytorch_pretrained_bert/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
Preparing to convert 8 examples..
Spawning 7 processes..
HBox(children=(IntProgress(value=0, max=8), HTML(value='')))
Traceback (most recent call last):

  File "<ipython-input-1-d14e922553ff>", line 103, in <module>
    train_features = list(tqdm_notebook(p.imap(convert_examples_to_features.convert_example_to_feature, train_examples_for_processing), total=train_examples_len))

  File "/home/rpc/.local/lib/python3.6/site-packages/tqdm/_tqdm_notebook.py", line 223, in __iter__
    for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):

  File "/home/rpc/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1032, in __iter__
    for obj in iterable:

  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value

KeyError: 'label'

My dataset is not Yelp, but rather something that looks like this:

id label alpha text
0 0 a Tom saw a cat. Fred likes felines.
1 1 a Mary opened a book. It was a good story.
2 1 a Mia called Sue. It was around noon.
3 0 a John ate a burger. The burger John ate was tasty.

and so on. I'm currently running only 8 items, but I have a larger dataset.

I obtain the same error with 'converter.py' instead. I thought that maybe the data weren't being loaded properly, but as far as I can tell, they are:

print(train_examples)
[<tools.InputExample object at 0x7f1ff524c908>, <tools.InputExample object at 0x7f1ff524c898>, <tools.InputExample object at 0x7f1ff524cac8>, <tools.InputExample object at 0x7f1ff524cb38>, <tools.InputExample object at 0x7f1ff524cba8>, <tools.InputExample object at 0x7f1ff524cc18>, <tools.InputExample object at 0x7f1ff524cc88>, <tools.InputExample object at 0x7f1ff524ccf8>]

print(train_examples_for_processing)
[(<tools.InputExample object at 0x7f1ff524c908>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524c898>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524cac8>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524cb38>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524cba8>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524cc18>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524cc88>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification'), (<tools.InputExample object at 0x7f1ff524ccf8>, {'0': 0, '1': 1}, 70, <pytorch_pretrained_bert.tokenization.BertTokenizer object at 0x7f1ff524ceb8>, 'classification')]

Maybe I'm missing a package or perhaps convert_examples_to_features isn't working?

My full code is at https://github.com/RuiPChaves/BERTSenClass/blob/master/bert%20_sen_class.py

Many thanks in advance for any help you can provide.

PS: I've replaced the int values in the "label" column with '0' and '1' in the dataset, but the error persists, unfortunately.

PPS: Weirdly, removing the list() call and the total=train_examples_len argument makes the error go away:

train_features = tqdm_notebook(p.imap(convert_examples_to_features.convert_example_to_feature, train_examples_for_processing))
HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

print(train_features)
0/|/| 0/? [04:15<?, ?it/s]

As far as I can see, the value of train_examples_len is correct. I don't know if train_features looks right or not.
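
I suspect this only hides the problem rather than fixing it: Pool.imap hands back a lazy iterator, so an exception raised in a worker is only re-raised in the parent once the corresponding result is actually retrieved with next() or list(). A minimal sketch of the same behaviour, independent of BERT:

from multiprocessing import Pool

def fail(x):
    # stand-in for convert_example_to_feature choking on a bad row
    raise KeyError('label')

if __name__ == '__main__':
    with Pool(2) as p:
        features = p.imap(fail, range(8))  # no error surfaces here
        # list(features)                   # consuming the iterator re-raises KeyError: 'label'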

Can you change the convert_examples_to_features.py file to this?

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def convert_example_to_feature(example_row):
    # return example_row
    example, label_map, max_seq_length, tokenizer, output_mode = example_row

    tokens_a = tokenizer.tokenize(example.text_a)

    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[:(max_seq_length - 2)]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)

    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return InputFeatures(input_ids=input_ids,
                         input_mask=input_mask,
                         segment_ids=segment_ids,
                         label_id=example.label)

Keep your label values as integers. I've dropped the code that used label_map to map the labels from strings to integers, since you already have them as integers.
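
For reference, the only functional change is the label_id passed to InputFeatures at the end of convert_example_to_feature(). Assuming your copy matched the upstream script, the swap is just:

# before (assuming the original file): the string label was looked up in the map
#     label_id = label_map[example.label]
# after: the integer label is passed through unchanged
#     label_id = example.label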

Fantastic! Thanks, problem solved:

process_count = cpu_count() - 1
if __name__ ==  '__main__':
    print(f'Preparing to convert {train_examples_len} examples..')
    print(f'Spawning {process_count} processes..')
    with Pool(process_count) as p:
        train_features = list(tqdm_notebook(p.imap(convert_examples_to_features.convert_example_to_feature, train_examples_for_processing), total=train_examples_len))
Preparing to convert 8 examples..
Spawning 7 processes..
HBox(children=(IntProgress(value=0, max=8), HTML(value='')))

Great!

If you want to double-check your converted features, you can print out an example like this:

def print_input_feature(input_feature):
    print(f'input_ids: \n{input_feature.input_ids}')
    print(f'input_mask: \n{input_feature.input_mask}')
    print(f'segment_ids: \n{input_feature.segment_ids}')
    print(f'label_id: \n{input_feature.label_id}')
    
print_input_feature(train_features[0])

Your output should look something like this:

input_ids: 
[101, 1135, 787, 188, 1103, 1436, 2949, 1128, 1209, 1518, 1329, 1106, 16775, 3245, 1105, 171, 23223, 1158, 119, 1192, 1169, 1321, 18700, 7937, 1204, 1107, 21749, 1532, 1137, 6058, 1122, 1113, 1103, 2241, 1106, 16775, 172, 4515, 2624, 1105, 2489, 119, 1188, 2949, 1144, 1145, 2602, 1632, 2629, 1107, 1343, 6705, 1114, 3245, 10606, 5800, 119, 1192, 1169, 1145, 1329, 3621, 16931, 1306, 1105, 175, 26042, 1233, 2949, 119, 27868, 8967, 1110, 13108, 117, 1105, 1144, 2602, 1632, 2912, 1106, 11778, 11902, 2556, 1988, 119, 1135, 1144, 2012, 2848, 14703, 12253, 1233, 1105, 2848, 2822, 25857, 2916, 2629, 1115, 6618, 1128, 2239, 1114, 4249, 3507, 19790, 1116, 119, 1337, 787, 188, 1725, 175, 26042, 1233, 2949, 1110, 6315, 1106, 1343, 6705, 1114, 3245, 10606, 5800, 2416, 1118, 2213, 10548, 119, 1188, 2949, 1169, 1145, 3843, 1105, 7299, 1884, 8031, 1105, 13306, 1884, 12888, 1548, 119, 24930, 1181, 170, 2337, 8949, 1104, 175, 26042, 1233, 2949, 1106, 170, 2525, 1104, 1447, 1137, 1240, 22245, 1348, 5679, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_mask: 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
segment_ids: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_id: 
1

That is indeed the kind of output I get. But I think this fix created a small issue down the line, during training, namely at:

all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)

Since [f.label_id for f in train_features] yields ['label', '0', '1', '1', '0', '0', '1', '1'], we get a type clash with torch.long:

ValueError: too many dimensions 'str'

I'm surprised the features are again strings... FWIW, I reverted back to the original:

label_map = {label: i for i, label in enumerate(label_list)}
#label_map = {int(label): i for i, label in enumerate(label_list)} 

Can you check the datatype of train_features[0].label_id?

E.g. by adding this line to the print_input_feature() function:
print(f'label_id_type: \n{type(input_feature.label_id)}')

Oh! I think I see the issue. Your data has a header line that is being sent through to the tokenizer. Can you try removing the header in your data file?
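
Alternatively, you can strip the header when preparing the file instead of editing it by hand. A sketch with pandas (the paths here are placeholders for your own files):

import pandas as pd

# read the file, treating its first line as the header...
df = pd.read_csv('data/train_raw.tsv', sep='\t')
# ...then write it back out without one
df.to_csv('data/train.tsv', sep='\t', header=False, index=False)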

Ah, my bad. I removed the header line, and checked the data type:

train_features[0].label_id
'0'

type(train_features[0].label_id)
str

But I still have this nagging error.

  File "<ipython-input-3-88d617073292>", line 2, in <module>
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
ValueError: too many dimensions 'str'

label_id should be ints. Was the last run done with the original code or the modified one with the mapping removed inside the convert_example_to_feature() function?
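
If the label_ids do come back as strings, a quick workaround on the training side is to cast them while building the tensor (a sketch):

import torch

# int() each label before torch.tensor sees it; a torch.long tensor cannot be built from strings
all_label_ids = torch.tensor([int(f.label_id) for f in train_features], dtype=torch.long)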

This last run was done with the modified code, mapping removed.

print_input_feature(train_features[0])
input_ids: 
[101, 2545, 1486, 170, 5855, 119, 138, 5855, 1108, 1562, 1118, 2545, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
input_mask: 
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
segment_ids: 
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
label_id: 
0

At the same time:

[f.label_id for f in train_features]
Out[5]: ['0', '1', '1', '0', '0', '1', '1']

For completeness:

label_map
Out[4]: {'0': 0, '1': 1}

and

label_list
['0', '1']

Can you try it with the original code? Basically, everything the same except for removing the header line from the data.

The problem was the header line all along, you are correct.

I've tried the original convert_examples_to_features.py code. On one run I made the values of the label column ints; on another run I made them strings. In either case, the train and test files no longer have header lines.

With str valued label columns I get:

  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
KeyError: '"0"'

If I replace the double quotes with single quotes, I get a similar error, only with '0' instead.

But if I keep the values as int in the datasets, then all these errors go away.

My next hurdle is now much further down the line. Maybe my multiple runs on the GPU aren't clearing memory? I can probably just go with the CPUs instead:

RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 3.95 GiB total capacity; 2.83 GiB already allocated; 56.88 MiB free; 106.34 MiB cached)
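
If it is stale allocations from earlier runs, I suppose deleting the model and emptying the cache between runs should help. A sketch, assuming the network from the previous run lives in a variable called model:

import gc
import torch

del model                 # drop the reference from the previous run
gc.collect()              # make sure Python actually frees it
torch.cuda.empty_cache()  # hand the cached blocks back to the GPU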

Yeah, looks like BERT is just too large. Works fine on CPUs alone.

Many many thanks..!

Great! Sorry about the confusion with labels as strings vs ints. I didn't have the Yelp data with me to check.

BERT is quite big, and it would take quite a while to fine-tune on a CPU, especially if your dataset is large. I noticed you were using bert-large in your repo; you could try bert-base, which is smaller. If even that is too large for your GPU, you could try Google Colab instead. It would probably be much faster than running on the CPU, and it should be fairly straightforward to set up.
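
Swapping models is a one-line change wherever the pretrained weights are loaded. Roughly, with pytorch_pretrained_bert:

from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

# bert-base-cased has ~110M parameters vs ~340M for bert-large-cased,
# so it fits far more comfortably in 4 GiB of GPU memory
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
model = BertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)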

Apologies for not realizing the headers must be stripped off, of course.

Good point. If memory serves, bert-large yields diminishing returns anyway, compared with bert-base.

Again, much appreciated for all your help and patience..!

It happens!

Yes, it's not worth it for most cases.

No problem, happy to help!