Revisiting Automated Topic Model Evaluation with Large Language Models

This repo contains code and data for our EMNLP 2023 paper about assessing topic model output with Large Language Models.

@inproceedings{stammbach-etal-2023-revisiting,
    title = "Revisiting Automated Topic Model Evaluation with Large Language Models",
    author = "Stammbach, Dominik  and
      Zouhar, Vil{\'e}m  and
      Hoyle, Alexander  and
      Sachan, Mrinmaya  and
      Ash, Elliott",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.581",
    doi = "10.18653/v1/2023.emnlp-main.581",
    pages = "9348--9357"
}

Prerequisites

pip install --upgrade openai pandas

Large Language Models and Topics with Human Annotations

Download the topic words and human annotations from the paper "Is Automated Topic Model Evaluation Broken?" (Hoyle et al., 2021) from the authors' GitHub repository.

Intruder Detection Test

Following Hoyle et al. (2021), we randomly sample intruder words that do not appear among the top 50 words of each topic.

python src-human-correlations/generate_intruder_words_dataset.py 
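
The sampling logic looks roughly like this (a minimal sketch; the variable names and the candidate pool are illustrative, see generate_intruder_words_dataset.py for the exact procedure):

import random

def sample_intruder(topic_words, candidate_words, top_k=50, seed=None):
    # the intruder must not be among the topic's top-k words;
    # candidate_words could be, e.g., the pooled top words of the other topics
    rng = random.Random(seed)
    excluded = set(topic_words[:top_k])
    candidates = [w for w in candidate_words if w not in excluded]
    return rng.choice(candidates)

# toy example: show the topic's top words together with one shuffled-in intruder
topic_words = ["court", "judge", "law", "appeal", "trial", "ruling"]
candidate_words = ["guitar", "senate", "tax", "court", "album"]
intruder = sample_intruder(topic_words, candidate_words, seed=0)
display = topic_words[:5] + [intruder]
random.shuffle(display)
print(display, "-> intruder:", intruder)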

We can then call an LLM to automatically annotate the intruder words for each topic.

python src-human-correlations/chatGPT_evaluate_intruders.py --API_KEY a_valid_openAI_api_key
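
Under the hood this amounts to prompting a chat model to pick the word that does not belong, roughly as below (a minimal sketch against the current openai>=1.0 Python client; the model name and prompt wording are illustrative and may differ from the script):

from openai import OpenAI

client = OpenAI(api_key="a_valid_openAI_api_key")

def detect_intruder(words):
    prompt = (
        "All of the following words except one belong to the same topic. "
        "Reply with the single word that does not belong.\n\n"
        "Words: " + ", ".join(words)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(detect_intruder(["court", "judge", "law", "appeal", "trial", "guitar"]))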

For the ratings task, simply call the script that rates topic word sets (there is no need to generate a dataset first):

python src-human-correlations/chatGPT_evaluate_topic_ratings.py --API_KEY a_valid_openAI_api_key
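
The ratings prompt asks the model how related a topic's top words are, e.g. on a small numeric scale as judged by the human annotators; a hedged sketch of this call (the exact prompt and scale are defined in chatGPT_evaluate_topic_ratings.py):

from openai import OpenAI

client = OpenAI(api_key="a_valid_openAI_api_key")

def rate_topic(words):
    prompt = (
        "On a scale from 1 to 3, how related are the following words to each other? "
        "1 = not very related, 2 = moderately related, 3 = very related. "
        "Reply with a single number.\n\n"
        "Words: " + ", ".join(words)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip()[0])

print(rate_topic(["court", "judge", "law", "appeal", "trial", "ruling"]))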

(In case the OpenAI API fails, we save all outputs to a JSON file and simply restart the script, skipping all datapoints that have already been annotated.)
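
Concretely, the resume logic amounts to something like the following (an illustrative sketch; the scripts write one JSON record per annotated datapoint and skip IDs already present in the output file):

import json
import os

# stand-ins for the real datapoints and the LLM call (see the sketches above)
datapoints = [{"id": 0, "words": ["court", "judge", "law"]},
              {"id": 1, "words": ["guitar", "band", "album"]}]
annotate = lambda words: "3"

outfile = "ratings_outfile.jsonl"

# collect the IDs that were already annotated in a previous (possibly interrupted) run
done = set()
if os.path.exists(outfile):
    with open(outfile) as f:
        done = {json.loads(line)["id"] for line in f}

with open(outfile, "a") as f:
    for dp in datapoints:
        if dp["id"] in done:
            continue  # already annotated, skip
        record = {"id": dp["id"], "annotation": annotate(dp["words"])}
        f.write(json.dumps(record) + "\n")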

Evaluation LLMs and Human Correlations

We evaluate using a bootstrap approach: for each datapoint, we sample human annotations and LLM annotations with replacement. We then average these sampled annotations and compute Spearman's rho for each bootstrapped sample. We report the mean Spearman's rho over all 1,000 bootstrapped samples.

python src-human-correlations/human_correlations_bootstrap.py --filename coherence-outputs-section-2/ratings_outfile_with_dataset_description.jsonl --task ratings
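
Conceptually, each bootstrap iteration resamples the available annotations per topic, averages them, and correlates the human and LLM score vectors; a minimal numpy/scipy sketch of this idea (the array shapes and toy data are illustrative):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def bootstrap_spearman(human, llm, n_boot=1000):
    # human: (n_topics, n_human_annotations), llm: (n_topics, n_llm_annotations)
    rhos = []
    rows_h = np.arange(human.shape[0])[:, None]
    rows_l = np.arange(llm.shape[0])[:, None]
    for _ in range(n_boot):
        h = human[rows_h, rng.integers(0, human.shape[1], size=human.shape)]  # resample per topic
        l = llm[rows_l, rng.integers(0, llm.shape[1], size=llm.shape)]
        rho, _ = spearmanr(h.mean(axis=1), l.mean(axis=1))
        rhos.append(rho)
    return float(np.mean(rhos))

# toy example: 5 topics, 3 human annotations and 2 LLM annotations each
human = np.array([[1, 1, 2], [2, 2, 3], [3, 3, 3], [1, 2, 1], [3, 2, 3]])
llm = np.array([[1, 1], [2, 3], [3, 3], [1, 1], [2, 3]])
print(bootstrap_spearman(human, llm, n_boot=100))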

Evaluating Topic Models with Different Numbers of Topics

Download the fitted topic models and metadata for the two datasets (bills and wikitext) here and unzip them.

Rating Topic Word Sets

To run LLM ratings of topic word sets on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run:

python src-number-of-topics/chatGPT_ratings_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad
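
The per-topic ratings are then aggregated per fitted model: average the LLM scores over all topics of a model and prefer the number of topics with the highest average score (a sketch of this aggregation over hypothetical (num_topics, rating) records):

from collections import defaultdict

# hypothetical records: one LLM rating per topic, for models fitted with different numbers of topics
records = [(25, 3), (25, 2), (50, 3), (50, 3), (100, 1), (100, 2)]

ratings_per_k = defaultdict(list)
for num_topics, rating in records:
    ratings_per_k[num_topics].append(rating)

average_per_k = {k: sum(v) / len(v) for k, v in ratings_per_k.items()}
best_k = max(average_per_k, key=average_per_k.get)
print(average_per_k, "-> preferred number of topics:", best_k)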

Purity of Document Collections

We also use an LLM to assign a document label to the top documents belonging to each topic, following Doogan and Buntine (2021). We then compute the purity of each topic's document collection and average these purities; the number of topics with the highest average purity is the one preferred by this procedure.
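
Purity here is the fraction of a topic's top documents that share the most frequent assigned label; a minimal sketch of the computation (the data layout is illustrative):

from collections import Counter

def topic_purity(doc_labels):
    # doc_labels: the labels the LLM assigned to the top documents of one topic
    majority_count = Counter(doc_labels).most_common(1)[0][1]
    return majority_count / len(doc_labels)

# toy example: assigned labels for the top documents of three topics
label_sets = [
    ["law", "law", "law", "sports"],
    ["health", "health", "law", "education"],
    ["sports", "sports", "sports", "sports"],
]
purities = [topic_purity(labels) for labels in label_sets]
print(purities, "-> average purity:", sum(purities) / len(purities))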

To run LLM label assignments on a dataset (wikitext or bills) with broad or specific ground-truth example topics, simply run:

python src-number-of-topics/chatGPT_document_label_assignment.py --API_KEY a_valid_openAI_api_key --dataset wikitext --label_categories broad

Plot resulting scores

python src-number-of-topics/LLM_scores_and_ARI.py --method label_assignment --dataset wikitext --label_categories broad --filename number-of-topics-section-4/document_label_assignment_wikitext_broad.jsonl
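
For reference, the adjusted Rand index can be computed with scikit-learn, e.g. between document-to-topic assignments and ground-truth document labels (a hedged sketch; see LLM_scores_and_ARI.py for what is actually compared):

from sklearn.metrics import adjusted_rand_score

# hypothetical example: the topic assigned to each document vs. its ground-truth label
doc_topic_assignments = [0, 0, 1, 1, 2, 2]
ground_truth_labels = ["law", "law", "health", "health", "health", "sports"]
print(adjusted_rand_score(ground_truth_labels, doc_topic_assignments))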

Questions

Please contact Dominik Stammbach regarding any questions.

Paper video presentation