This is the official implementation repository for the paper [Towards General Conceptual Model Editing via Adversarial Representation Engineering](https://arxiv.org/abs/2404.13752). See details below.
By Yihao Zhang, Zeming Wei, Jun Sun, and Meng Sun.
This minimal-scale demo is still in the testing phase. It provides the implementation for Section 5.1, *Alignment: To Generate (Harmful Responses) or Not to Generate*, and Section 5.2, *Hallucination: To Hallucinate or Not to Hallucinate*.
Parameters are currently hardcoded in `main.py`. If you wish to modify them, please edit `main.py` directly. We will add `argparse` support soon (a rough sketch of such an interface is shown below).
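A minimal sketch of what such a command-line interface might look like; the argument names and defaults (`--model_path`, `--epochs`, `--lr`, the Llama-2 path) are illustrative assumptions, not the interface that will eventually ship in `main.py`:

```python
# Hypothetical argparse wrapper; argument names and defaults are assumptions.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Adversarial Representation Engineering demo (sketch)"
    )
    parser.add_argument("--model_path", type=str,
                        default="meta-llama/Llama-2-7b-chat-hf",
                        help="Hugging Face model name or local path")
    parser.add_argument("--epochs", type=int, default=10,
                        help="Number of editing epochs")
    parser.add_argument("--lr", type=float, default=1e-4,
                        help="Learning rate for the PEFT update")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)
```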
Currently, you can run the program by executing:

```
python main.py
```
You can change the model by modifying `model_path` in `main.py`. Please note that the default parameters may not be suitable for larger models; adjust them as needed for your setup.
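For reference, a hedged sketch of how a different model path is typically plugged in when loading a causal LM with `transformers`; the variable name `model_path` follows the README, everything else (including the example model) is illustrative and the actual loading code in `main.py` may differ:

```python
# Illustrative only: swapping in a different model by changing model_path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # replace with your model name or local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
```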
A demo for decreasing hallucination is provided in `hallucination.ipynb`.
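As a rough illustration of the data side of the hallucination demo, TruthfulQA can be loaded with the `datasets` library as below; whether `hallucination.ipynb` actually loads it this way (rather than from a local file or a different config) is an assumption:

```python
# Sketch: loading TruthfulQA with Hugging Face datasets.
from datasets import load_dataset

truthful_qa = load_dataset("truthful_qa", "generation", split="validation")
print(truthful_qa[0]["question"])
print(truthful_qa[0]["best_answer"])
```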
Install the necessary libraries (a pip one-liner is given after this list), including:

- `transformers`
- `torch>=2.0`
- `numpy`
- `datasets`
- `peft`
- `pandas`
- `tqdm`
- `scikit-learn` (imported as `sklearn`)
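For convenience, the dependencies can be installed with pip; the package names below assume the standard PyPI names (in particular, `sklearn` is published as `scikit-learn`):

```
pip install transformers "torch>=2.0" numpy datasets peft pandas tqdm scikit-learn
```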
More code and details will be available upon publication of our paper. Code for processing the TruthfulQA dataset is partly borrowed from this repo.
If you find this work useful, please cite:

```
@article{zhang2024towards,
  title={Towards General Conceptual Model Editing via Adversarial Representation Engineering},
  author={Zhang, Yihao and Wei, Zeming and Sun, Jun and Sun, Meng},
  journal={arXiv preprint arXiv:2404.13752},
  year={2024}
}
```