nrimsky/CAA

Are the datasets all the same version?

dtch1997 opened this issue · 2 comments

The examples in preprocessed_data/generate_dataset.json seem different from those in llm_generated_data/agree_disagree_dataset.json.

For example, the first example in generate_dataset.json starts with this:

Hello, my name is Linda Johnson and I am a researcher in the field of artificial intelligence with a specialization in natural language processing...

However, the name Linda Johnson doesn't appear at all in agree_disagree_dataset.json, which leads me to believe these might be artifacts from two different versions of the dataset.
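For anyone who wants to reproduce this check, here is a minimal sketch. It assumes each file holds a top-level JSON array of entries; the helper names `count_items_containing` and `count_in_file` are hypothetical and not part of the repo.

```python
import json
from pathlib import Path


def count_items_containing(items, needle):
    """Count entries whose JSON serialization contains `needle`."""
    # Serializing each entry lets us search every field without
    # assuming specific key names (the schema is not known here).
    return sum(needle in json.dumps(item, ensure_ascii=False) for item in items)


def count_in_file(path, needle):
    """Load a JSON array from `path` and count entries mentioning `needle`."""
    # Assumption: the file is a single JSON array, not JSON Lines.
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return count_items_containing(items, needle)


if __name__ == "__main__":
    # Paths as named in this issue, relative to the repo root.
    for path in (
        "preprocessed_data/generate_dataset.json",
        "llm_generated_data/agree_disagree_dataset.json",
    ):
        if Path(path).exists():
            print(path, count_in_file(path, "Linda Johnson"))
```

A count of zero for one file and a nonzero count for the other would match the mismatch described above.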

The sycophancy data used for generating the CAA vectors is a mixture of llm_generated_data/agree_disagree_dataset.json (which I generated myself using GPT-4) and data downloaded from Anthropic's model-written evals sycophancy dataset. See this script which mixes them.
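As a rough illustration of what mixing two sources can look like, here is a minimal sketch. It is not the script linked above: the function name `mix_datasets`, the shuffle, and the fixed seed are all assumptions made for the example, and it treats both sources as plain lists of entries.

```python
import random


def mix_datasets(generated, downloaded, seed=0):
    """Concatenate two lists of examples and shuffle them reproducibly.

    Hypothetical helper: the actual mixing logic in the repo may differ
    (e.g. it may subsample or reformat entries before combining them).
    """
    mixed = list(generated) + list(downloaded)
    # A dedicated Random instance keeps the shuffle deterministic
    # without touching the global random state.
    random.Random(seed).shuffle(mixed)
    return mixed
```

A mix like this would explain why an example in the preprocessed file has no counterpart in agree_disagree_dataset.json: it may simply come from the other source.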

Please note that I am working on a new version of this repo (see branch v2), which should be ready by the end of the week. It will include more behaviors, a cleaner architecture, and fixes for some experimental flaws. I plan to update our arXiv paper accordingly.

Got it, thank you! That sounds amazing, and thanks once again for being so responsive.