llm-detection

This project proposes using an author-attribution system for LLM detection. The naive system mean-pools the sentence embeddings of LLM-generated text during training, then computes the cosine similarity between that pooled representation and each test sentence embedding to decide whether a test text is LLM-generated. Three failure-case analyses in llm_detection_contribution1.ipynb show that this system achieves over 90% accuracy when no prompt engineering is involved, but drops to random-guessing performance when prompt engineering is used, because prompt engineering shifts the LLM's own style.

The project then extends the system (llm_detection_contribution2.ipynb) by adding a mean-pooled human representation, projecting all embeddings to two dimensions with PCA, and using Euclidean distance as the similarity measure. The extended system achieves 74% accuracy even when prompt engineering is involved.
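The naive system described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: it assumes sentence embeddings have already been computed by some embedding model, uses toy 2-D vectors in their place, and the helper names (`mean_pool`, `is_llm_text`) and the `threshold` value are hypothetical.

```python
import numpy as np


def mean_pool(embeddings: np.ndarray) -> np.ndarray:
    """Average a (n_sentences, dim) matrix into a single (dim,) vector."""
    return embeddings.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_llm_text(test_embedding: np.ndarray,
                llm_centroid: np.ndarray,
                threshold: float = 0.5) -> bool:
    """Flag a test sentence as LLM-written when it is close to the LLM centroid.

    The threshold is a hypothetical value, not one taken from the notebooks.
    """
    return cosine_similarity(test_embedding, llm_centroid) >= threshold


# Toy 2-D "embeddings" standing in for real sentence-embedding vectors.
llm_train = np.array([[1.0, 0.0], [0.8, 0.2]])
llm_centroid = mean_pool(llm_train)  # training: one pooled LLM style vector

similar = np.array([1.0, 0.1])      # points in nearly the same direction
dissimilar = np.array([-1.0, 0.0])  # points in the opposite direction
print(is_llm_text(similar, llm_centroid))
print(is_llm_text(dissimilar, llm_centroid))
```

Because the decision depends on a single pooled LLM vector, any change in the LLM's style (for example through prompt engineering) moves test embeddings away from that vector, which is consistent with the failure mode reported above.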

Figure 1: Naive Proposed LLM-Detection System

Figure 2: Extended LLM-Detection System to Generalize to Prompt-Engineered Text
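A rough sketch of the extended system, under stated assumptions: embeddings are precomputed (toy 3-D vectors stand in for them here), PCA is fitted on the combined LLM and human training embeddings via an SVD in NumPy, and a test point is assigned to whichever 2-D centroid is nearer in Euclidean distance. The order of pooling and projection, the nearest-centroid rule, and the function names are assumptions made for illustration; the notebook may differ.

```python
import numpy as np


def fit_pca(X: np.ndarray, n_components: int = 2):
    """Return the data mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project embeddings onto the fitted principal directions."""
    return (X - mean) @ components.T


def classify(test_embedding, mean, components, llm_2d, human_2d) -> str:
    """Assign the label of the nearer 2-D centroid (Euclidean distance)."""
    point = project(test_embedding, mean, components)
    d_llm = np.linalg.norm(point - llm_2d)
    d_human = np.linalg.norm(point - human_2d)
    return "llm" if d_llm < d_human else "human"


# Toy 3-D "embeddings" standing in for real sentence-embedding vectors.
llm_train = np.array([[1.0, 1.0, 0.1], [1.2, 0.9, 0.0], [0.9, 1.1, -0.1]])
human_train = np.array([[-1.0, -1.0, 0.1], [-1.1, -0.9, 0.0], [-0.9, -1.2, -0.1]])

# Fit PCA on all training embeddings, then mean-pool each class in 2-D.
pca_mean, components = fit_pca(np.vstack([llm_train, human_train]))
llm_2d = project(llm_train, pca_mean, components).mean(axis=0)
human_2d = project(human_train, pca_mean, components).mean(axis=0)

print(classify(np.array([1.0, 1.0, 0.0]), pca_mean, components, llm_2d, human_2d))
print(classify(np.array([-1.0, -1.0, 0.0]), pca_mean, components, llm_2d, human_2d))
```

Comparing against both a human and an LLM reference means the decision is relative rather than tied to a fixed similarity cutoff, which is one plausible reason the extended system holds up better on prompt-engineered text.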

jjz5463/llm-detection
