Format of the ranking candidates file for GrailQA
Hi. I am interested in the ranker part of this project. I am currently setting up the environment, but it looks like the previous steps could be time-consuming. Can I get some quick information on the format of the output files for:
python enumerate_candidates.py --split train # we use gt entities for training (so no need for entity prediction on the training split)
python enumerate_candidates.py --split dev --pred_file misc/grail_dev_entity_linking.json
Thanks!
Sure. It will be a JSON Lines file, where each line is a JSON object with the candidate information for one question:
{
"qid": "2100278008000",
"s_expression": "(AND cvg.game_version (JOIN cvg.game_version.producer m.0ds98f))", # the ground truth logical form
"candidates": [ # logical form candidates
{
"logical_form": "(AND cvg.game_version (JOIN cvg.game_version.publisher m.0ds98f))",
"ex": false # whether the logical form is equivalent to the ground truth
},
{
"logical_form": "(COUNT (AND cvg.game_version (JOIN cvg.game_version.publisher m.0ds98f)))",
"ex": false
},
...
]
}
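For reference, here is a minimal sketch of reading that file (the path below is a placeholder; the actual file name depends on the split and output directory):

import json

# Placeholder path; substitute the actual output of enumerate_candidates.py.
path = "outputs/grail_train_candidates.jsonl"

with open(path) as f:
    for line in f:
        item = json.loads(line)              # one question per line
        qid = item["qid"]                    # question id
        gt_lf = item["s_expression"]         # ground-truth logical form
        for cand in item["candidates"]:
            lf = cand["logical_form"]        # candidate s-expression
            is_equivalent = cand["ex"]       # equivalent to the ground truth?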
Thanks!
Another quick question: what is 'ex' here? How will it be used in ranker training?
ex: true means this logical form is equivalent to the ground truth logical form.
If ex is true, we will not use this logical form as a negative candidate (you don't want to penalize a logical form that is equivalent to the ground truth).
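To illustrate, a minimal sketch of drawing negatives under that rule (the helper is hypothetical, not the repo's actual sampler):

import random

def sample_negatives(candidates, k=5):
    # Skip candidates whose 'ex' flag is true: they are equivalent to the
    # ground truth and should not be penalized as negatives.
    pool = [c["logical_form"] for c in candidates if not c["ex"]]
    return random.sample(pool, min(k, len(pool)))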
Got it! Thank you for the detailed explanation!
Sorry, one more question.
From the codebase, where can we see that the roberta/bert model uses a contrastive loss for the ranker? Thanks.
Hi there, I just want to follow up on this thread as I am doing something similar.
I am curious what the input batch looks like before preprocessing when it is passed through the RobertaRanker.py file. Specifically, I would like to know the input text (not the input_ids) that is eventually passed to the model's forward function.
Thank you in advance for the information!
The code for processing the logical form can be found in the codebase.
Basically, we tokenize the expression, replace "_" in relations with " ", and replace entities "m.xxxx" with their labels.
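A minimal sketch of that preprocessing (entity_labels is an assumed MID-to-label mapping; the label below is made up for illustration):

import re

def linearize(s_expression, entity_labels):
    # Tokenize by treating parentheses as separate tokens.
    tokens = s_expression.replace("(", " ( ").replace(")", " ) ").split()
    out = []
    for tok in tokens:
        if re.fullmatch(r"m\.[0-9a-z_]+", tok):
            out.append(entity_labels.get(tok, tok))   # entity MID -> label
        else:
            out.append(tok.replace("_", " "))         # relations: "_" -> " "
    return " ".join(out)

print(linearize(
    "(AND cvg.game_version (JOIN cvg.game_version.producer m.0ds98f))",
    {"m.0ds98f": "some producer"},  # made-up label for illustration
))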
Thank you for the answer. I am trying to understand the forward function of RobertaRanker.py, and I am confused about the tensor dimensions when you compute the loss. I'm referring to this part:
Why do you reshape the logits to [batch_size, sample_size] when computing the loss, but then call view(-1) on the labels? Won't that leave the labels with shape [batch_size * sample_size] and cause a shape mismatch error?
Edit: follow-up question: for re-ranking 5 predictions, is your label a one-hot vector in the form of [1, 0, 0, 0, 0], or the index of the correct class, such as [0]?
My intuition says that both the logits and labels should have shape [batch_size * sample_size], so maybe I am misunderstanding something. If my question is not clear, please ask and I can clarify. Thank you again for your answers.
logits: [batch_size, sample_size]; labels: [batch_size].
The label vector holds the index of the correct sample.
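To make the shapes concrete, here is a minimal sketch of how they line up with torch.nn.CrossEntropyLoss (the shapes and values are illustrative, not the repo's actual code):

import torch
import torch.nn as nn

batch_size, sample_size = 2, 5          # 2 questions, 5 candidates each

# One score per (question, candidate) pair, reshaped so each row holds the
# candidate scores for one question.
logits = torch.randn(batch_size, sample_size)   # [batch_size, sample_size]

# The label is the index of the correct candidate, not a one-hot vector, so
# labels.view(-1) yields [batch_size], not [batch_size * sample_size].
labels = torch.zeros(batch_size, dtype=torch.long)

loss = nn.CrossEntropyLoss()(logits, labels.view(-1))

Framed this way, the cross-entropy over each candidate set is effectively the contrastive objective asked about earlier in the thread: it pushes the score of the correct logical form up relative to the sampled negatives.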