Alibaba-NLP/ACE

Test PTB Dependency Parsing Model

woshiyyya opened this issue · 9 comments

Hi there!

I am trying to test your pretrained dependency parsing model. However, I cannot find your processed PTB dataset. Could you share a link to it?

Also, I am wondering how to run inference on my own data. For example, how can I feed in one sentence and get its tagging result?

I have just uploaded the PTB dataset to OneDrive.

For inference, you can create a file like this (with dummy tags in the 7th, 8th, and 9th columns) and follow the instructions:

1\tBut\t_\t_\t_\t_\t_\t0\troot\t0:root
2\tI\t_\t_\t_\t_\t_\t0\troot\t0:root
3\tfound\t_\t_\t_\t_\t_\t0\troot\t0:root
4\tthe\t_\t_\t_\t_\t_\t0\troot\t0:root
5\tlocation\t_\t_\t_\t_\t_\t0\troot\t0:root
6\twonderful\t_\t_\t_\t_\t_\t0\troot\t0:root
7\tand\t_\t_\t_\t_\t_\t0\troot\t0:root
7.1\tfound\t_\t_\t_\t_\t_\t0\troot\t0:root
8\tthe\t_\t_\t_\t_\t_\t0\troot\t0:root
9\tneighbors\t_\t_\t_\t_\t_\t0\troot\t0:root
10\tvery\t_\t_\t_\t_\t_\t0\troot\t0:root
11\tkind\t_\t_\t_\t_\t_\t0\troot\t0:root
12\t.\t_\t_\t_\t_\t_\t0\troot\t0:root
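
As an illustration only (this helper is not part of the ACE repo), a few lines of Python can generate such a file from a whitespace-tokenized sentence; see the column-index discussion further down the thread before relying on the exact layout:

def write_dummy_conllu(tokens, path):
    # Mirror the example above: id, form, underscores, then the dummy tags.
    with open(path, "w", encoding="utf-8") as f:
        for i, token in enumerate(tokens, start=1):
            fields = [str(i), token, "_", "_", "_", "_", "_", "0", "root", "0:root"]
            f.write("\t".join(fields) + "\n")
        f.write("\n")  # blank line ends the sentence

write_dummy_conllu(
    "But I found the location wonderful and found the neighbors very kind .".split(),
    "data/train.tsv",
)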

Hi Xinyu,

Thanks for uploading the data!

I created a folder named data and put a train.tsv file in it with the demo case you provided.

Run:
CUDA_VISIBLE_DEVICES=0 python train.py --config config/ptb_parsing_model.yaml --parse --target_dir data --keep_order

But I still got an error:

2022-09-07 02:59:16,391 Reading data from /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified
2022-09-07 02:59:16,391 Train: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/train_modified.conllu
2022-09-07 02:59:16,391 Test: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/test.conllu
2022-09-07 02:59:16,391 Dev: /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/dev.conllu
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    config = ConfigParser(config,all=args.all,zero_shot=args.zeroshot,other_shot=args.other,predict=args.predict)
  File "/projects/clio1/probing/ACE/flair/config_parser.py", line 63, in __init__
    self.corpus: ListCorpus=self.get_corpus
  File "/projects/clio1/probing/ACE/flair/config_parser.py", line 329, in get_corpus
    current_dataset=getattr(datasets,corpus)(tag_to_bioes=self.target)
  File "/projects/clio1/probing/ACE/flair/datasets.py", line 360, in __init__
    train = UniversalDependenciesDataset(data_folder/'train_modified.conllu', in_memory=in_memory, add_root=True)
  File "/projects/clio1/probing/ACE/flair/datasets.py", line 1006, in __init__
    assert path_to_conll_file.exists()
AssertionError

Do you know how to fix that?

Have you checked whether the dataset is in the correct place?

Hi Xinyu,
Is there something wrong with the data format provided?
I just found that the line token = Token(fields[1], head_id=int(fields[6])) gives me ValueError: invalid literal for int() with base 10: '_'.

So I guess the 0-th column is the token id,
the 1-st column is the token,
the 2nd to 5th columns are "_",
the 6-th column is 0 (dummy tag),
the 7-th column is "_",
the 8-th column is "root" (dummy tag),
the 9-th column is "0:root" (dummy tag).

Is that right?
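
For reference, a small sanity check against that line (a hypothetical helper; the fields[6] index comes from the code quoted above) makes the mismatch visible, since in the example rows fields[6] is '_' rather than an integer head:

def check_heads(path):
    # Report every token line whose 7th field (fields[6]) is not an integer head,
    # which is exactly what int(fields[6]) in the reader chokes on.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            fields = line.split("\t")
            try:
                int(fields[6])
            except ValueError:
                print(f"line {lineno}: fields[6] is {fields[6]!r}, not an integer head")

check_heads("data/train.tsv")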

After I changed the data format, I also face the same problem.
Have you resolved it?

Have you ensured that the path /home/yunxuan2/.flair/datasets/ptb_3.3.0_modified/train_modified.conllu exists? If not, you may download the data above and put it at this path.
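
For example, a quick check could look like this (the paths are copied from the log output above, and the loader simply asserts that each file exists, so adjust them to your machine):

from pathlib import Path

# Verify that the corpus files named in the log are actually in place.
corpus_dir = Path.home() / ".flair" / "datasets" / "ptb_3.3.0_modified"
for name in ("train_modified.conllu", "dev.conllu", "test.conllu"):
    path = corpus_dir / name
    print(path, "exists" if path.exists() else "MISSING")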

Yes, I have done that, and I solved this problem: the target_dir also needs to contain dev/test datasets.
But now I can only parse the dataset on the CPU (very slow); it fails when I try to run it on the GPU.

It shows me:

Traceback (most recent call last):
  File "train.py", line 378, in <module>
    train_eval_result, train_loss = student.evaluate(loader,out_path=Path('outputs/train.'+'.'+tar_file_name+'.conllu'),embeddings_storage_mode="none",prediction_mode=True)
  File "/DM_parser/ACE/flair/models/dependency_model.py", line 1174, in evaluate
    arc_scores, rel_scores = self.forward(batch, prediction_mode=prediction_mode)
  File "/DM_parser/ACE/flair/models/dependency_model.py", line 597, in forward
    self.embeddings.embed(sentences,embedding_mask=self.selection)
  File "/DM_parser/ACE/flair/embeddings.py", line 185, in embed
    embedding.embed(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 2960, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/DM_parser/ACE/flair/embeddings.py", line 3155, in _add_embeddings_to_sentences
    sequence_output, pooled_output, hidden_states = self.model(input_ids, attention_mask=mask, inputs_embeds = inputs_embeds)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 68, in forward
    input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/transformers/modeling_bert.py", line 178, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/anaconda3/envs/ACE_parser/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_index_select

I tried changing
sequence_output, pooled_output, hidden_states = self.model(input_ids, attention_mask=mask, inputs_embeds = inputs_embeds)

into

sequence_output, pooled_output, hidden_states = self.model(input_ids.cuda(), attention_mask=mask.cuda(), inputs_embeds = inputs_embeds)

but it still shows me the same error.

T T,

You may try to uncomment these lines

ACE/train.py

Lines 226 to 238 in 7033e91

# if student.selection[idx] == 1:
#     embedding.to(flair.device)
#     if 'elmo' in embedding.name:
#         # embedding.reset_elmo()
#         # continue
#         # pdb.set_trace()
#         embedding.ee.elmo_bilm.cuda(device=embedding.ee.cuda_device)
#         states=[x.to(flair.device) for x in embedding.ee.elmo_bilm._elmo_lstm._states]
#         embedding.ee.elmo_bilm._elmo_lstm._states = states
#         for idx in range(len(embedding.ee.elmo_bilm._elmo_lstm._states)):
#             embedding.ee.elmo_bilm._elmo_lstm._states[idx]=embedding.ee.elmo_bilm._elmo_lstm._states[idx].to(flair.device)
# else:
embedding.to('cpu')
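
For context on why moving only input_ids with .cuda() does not help: in the traceback above, the tensor that is still on the CPU is the embedding weight itself (argument 'self' of torch.embedding), so the embedding model has to be moved, not just its inputs. Uncommenting the lines above moves each embedding picked by the selection mask (student.selection[idx] == 1) to flair.device instead of leaving everything on the CPU. Roughly, the non-ELMo branch amounts to the following pattern (the helper below is only an illustration, not the exact file contents):

import flair

def place_embeddings(embeddings, selection):
    # Hypothetical helper: embeddings chosen by the ACE selection mask go to
    # flair.device (the GPU); everything else stays on the CPU to save memory.
    for idx, embedding in enumerate(embeddings):
        if selection[idx] == 1:
            embedding.to(flair.device)  # must be on the same device as its input ids
        else:
            embedding.to('cpu')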

Hi Xinyu, I have resolved the problem and applied ACE to parse my data successfully. Thanks for your help!