eth-sri/SynthPAI

Get results and synthetic texts

Closed this issue · 4 comments

Hi,

Thanks for your efforts; this is really interesting work!

I am a little confused by the various results provided. If I only want the synthetic texts for each user and the LLM and human inferences so that I can compute accuracy as in the paper, what should I do?

For example, if I want the results from GPT-4, I think I should look into data/thread/eval/gpt-4/gpt4_evaluated.jsonl, is that right? What's the difference between gpt4_evaluated.jsonl, gpt4_gt_evaluated.jsonl and gpt4_revised_human_evaluated.jsonl?

And for each entry in these files, there are results for "human", "human_evaluated", and "evaluations"; what's the difference between them? Which one should I use directly to get the evaluation results in the paper?

I may be missing something important about how to read the results. Thank you so much for your help!

Hi, thanks for your interest in our work!

This is a great question; I will add a README with more details later.

For example, if I want the results from GPT-4, I think I should look into data/thread/eval/gpt-4/gpt4_evaluated.jsonl, is that right? What's the difference between gpt4_evaluated.jsonl, gpt4_gt_evaluated.jsonl and gpt4_revised_human_evaluated.jsonl?

Yes, that is correct: the results presented in the paper have the filename format [model_name]_evaluated.jsonl. The other files are older experiments which we kept to show the difference between inference evaluations: *gt_evaluated.jsonl contains model inference accuracy evaluated against the ground truth (synthetic profile attributes), while *revised_human_evaluated.jsonl contains the first version of model inference accuracy evaluated against revised human tags (manually checked by hand, not ground truth). Both of those files were produced in decider='model' mode, where evaluation is done fully by GPT-4 and regex checks. You can observe that we have them only for some of the evaluated models, and the logs are saved in the results/eval folder. All final evaluations were therefore done in decider='model_human' mode, which introduces a human judge into the loop: first the guess is checked by the LLM, and then, if it is not scored as fully correct, a human is given the option to score the guess themselves.
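As a side note, here is a minimal sketch of picking out only those final result files, assuming the other model folders follow the same layout as data/thread/eval/gpt-4/:

```python
import glob
import os

# Collect only the final result files ([model_name]_evaluated.jsonl), skipping
# the older *_gt_evaluated.jsonl and *_revised_human_evaluated.jsonl experiments.
final_files = [
    path
    for path in glob.glob("data/thread/eval/*/*_evaluated.jsonl")
    if not path.endswith(("_gt_evaluated.jsonl", "_revised_human_evaluated.jsonl"))
]

for path in final_files:
    model_dir = os.path.basename(os.path.dirname(path))
    print(model_dir, "->", path)
```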

The main logic for this was the following:

  1. To preserve the format of the original Beyond Memorization paper, we evaluated model guesses against human tags/labels. As in the original paper, those labels were manually checked by a human one by one.
  2. Some private attributes have matching logic that is too complex to be fairly evaluated by either GPT-4 or manual string checks, e.g., occupation, location, and income level. For example, here the model scored a guess as "Matched 1: Lower-middle" income against the label "low", which is not correct by our definition of income levels (lower-middle is not the low level); a toy sketch of this ordered-scale view follows the list.
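To make point 2 concrete, here is a toy sketch of treating income level as an ordered category rather than doing a fuzzy string match. The level names here are my assumption, not the repository's actual list:

```python
# Toy illustration only (not the repository's evaluation code): the exact set of
# income levels is an assumption, but the point is that they form an ordered
# scale, so "Lower-middle" should not be scored as a match for "low".
INCOME_LEVELS = ["low", "lower-middle", "middle", "upper-middle", "high"]

def income_guess_correct(guess: str, label: str) -> bool:
    # A guess is correct only if it names the exact same level as the label.
    return INCOME_LEVELS.index(guess.lower()) == INCOME_LEVELS.index(label.lower())

print(income_guess_correct("Lower-middle", "low"))  # False
```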

If I only want the synthetic texts for each user and the LLM and human inferences so that I can compute accuracy as in the paper, what should I do?

To reproduce the results shown in the paper, you should look at the [model_name]_evaluated.jsonl files. If you would like to run the experiments yourself, the config files for the evaluation process were uploaded in the same format we used:

```yaml
human_label_type: "revised"
eval: True
decider: "model_human"
label_type: "human"
```

The code was written so that the label and evaluation type can be changed just in the config: you can experiment with evaluating against the ground truth, the original human labels (which themselves have >80% accuracy), or the revised human labels (from which we removed unnecessary tags). The revised human labels are the ones used for the final dataset.

We recommend evaluating model guesses against the revised human labels in model_human decider mode to reproduce the results in the paper; this mode takes on average 10-15 minutes per model evaluation. If you wish to save time, you can use the model decider mode instead.
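For example (reusing the config snippet above and changing only the decider value; the other fields stay as shown there), the faster automatic mode would look like:

```yaml
human_label_type: "revised"
eval: True
decider: "model"   # fully automatic: GPT-4 + regex checks, no human judge in the loop
label_type: "human"
```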

And for each entry in these files, there are results for "human", "human_evaluated", and "evaluations"; what's the difference between them? Which one should I use directly to get the evaluation results in the paper?

To see the accuracy of the guessed attributes presented in the paper, take a look at the "evaluations": {"model_name": {"human_evaluated": ...}} dictionary, which contains the accuracy of every guess for every profile feature that had a human tag (meaning it was possible to infer it from the texts). "human" refers to the original, non-revised human tags, which have lower accuracy.
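As a rough sketch of reading those scores (assuming each attribute under "evaluations" -> model -> "human_evaluated" maps to a score or a list of scores, and counting a guess as correct when its score is truthy; adapt this to the file's actual value format):

```python
import json
from collections import defaultdict

def attribute_accuracy(path: str, model: str = "gpt-4") -> dict:
    """Aggregate per-attribute accuracy from a [model_name]_evaluated.jsonl file."""
    totals, correct = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            scores = entry.get("evaluations", {}).get(model, {}).get("human_evaluated", {})
            for attribute, value in scores.items():
                # Value format is assumed: a single score or a list of scores per guess.
                for v in value if isinstance(value, list) else [value]:
                    totals[attribute] += 1
                    correct[attribute] += int(bool(v))
    return {attr: correct[attr] / totals[attr] for attr in totals}

print(attribute_accuracy("data/thread/eval/gpt-4/gpt4_evaluated.jsonl"))
```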

I hope my answer provides enough clarity; if you have more questions, I would be happy to answer them :)

Hi @ayukh,

Thanks for your quick reply! I understand most parts but still have some small questions:

Take the results in data/thread/eval/gpt-4/gpt4_evaluated.jsonl as an example. For the first user, SpiralSphinx, there are human tags ("reviews": {"human": {}, "human_evaluated": {}}), model predictions ("predictions": {"gpt-4": {}}), and the final evaluation results ("evaluations": {"gpt-4": {"human_evaluated": {}}}). As you suggested, I successfully got the accuracy by using the evaluation part "evaluations": {"gpt-4": {"human_evaluated": {}}}.

  1. If I want to get the ground truth for this user on these comments, it should be in "reviews": {"human_evaluated": {}}, is that right?
  2. I find that the evaluations part has fewer attributes than the human tags and model predictions parts. For example, age and income_level are not in the evaluations. Why is that the case, and how do I determine which attributes should be evaluated? Should I use the excluded ones to compute the false negative rate?

Thanks so much for your help!

Hi again @zealscott,

  1. Ground truth values are the attribute values in the synthetic profiles (the 'profile' dict). This is easiest to see in the HuggingFace dataset, where you can find the profile dictionary; those are the ground truth values (see the sketch after this list). In our code we retrieve them from this file, which contains the generated synthetic profiles with the same 'profile' dictionary. This is done so that if new synthetic profiles are generated or updated, this file is overwritten and used going forward.
  2. If I understood your question correctly, age and income_level had an unstable prediction format depending on the model, and sometimes it was not possible to extract a list of precise answers. Models other than GPT-4 and Llama3 output their answers in a format that made it hard to automatically extract a list of answers, even after reparsing the outputs. In rare cases, when a list of predictions was present, the accuracy was not written to the file if the format was incorrect. From what I have seen such cases are not common; you can ignore those attributes and only use the evaluations, as we did in the paper.
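For point 1, here is a minimal sketch of building a username-to-ground-truth lookup, assuming the synthetic profiles are stored as JSONL with a 'profile' dict per user; the file path and the username field name are hypothetical:

```python
import json

def load_ground_truth(path: str) -> dict:
    """Map each synthetic user to the 'profile' dict holding the ground-truth attributes."""
    ground_truth = {}
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            # "username" is an assumed field name; the 'profile' dict is the ground truth.
            ground_truth[entry.get("username")] = entry.get("profile", {})
    return ground_truth

profiles = load_ground_truth("data/profiles/synthetic_profiles.jsonl")  # hypothetical path
print(profiles.get("SpiralSphinx"))
```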

Hope this helps!

I read the evaluation code and am now clearer about the process. Thank you!