Ideas to explore
- Adapting formal operationalizations from computer vision to the LLM domain
- HackAPrompt submission
- Symbolic interfaces to LLM reasoning and formalization of safety in software systems
- Model out the highest-risk interfaces to LLMs
- White-box-informed fuzzing of large language models
- Mechanistic anomaly detection: detecting anomalies in the neural network using mechanistic interpretability
- If we do black-box fuzzing of the language model using prompts, we get a dataset of prompts paired with weirdness levels of the output.
- If we also save the model activations for all weirdness levels, we might be able to classify differences in activation propagation between the different states and dive into concrete examples (see the sketch after this list).
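A minimal sketch of the activation-saving idea, assuming a HuggingFace GPT-2 as a stand-in for the fuzzed model; the hook setup and the `run_and_record` helper are illustrative, not a fixed design:

```python
# Minimal sketch, assuming a HuggingFace GPT-2 stands in for the fuzzed
# model; hook and record names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

activations = {}  # layer name -> hidden states from the latest forward pass

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output[0].detach().cpu()  # GPT-2 blocks return a tuple
    return hook

for i, block in enumerate(model.transformer.h):
    block.register_forward_hook(make_hook(f"block_{i}"))

def run_and_record(prompt: str) -> dict:
    """Generate a completion and snapshot per-layer activations for it."""
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=40)
    return {
        "prompt": prompt,
        "output": tokenizer.decode(out[0], skip_special_tokens=True),
        # The weirdness level gets attached later by the classifier pass.
        "activations": {k: v.clone() for k, v in activations.items()},
    }
```

Note that the hooks fire on every decoding step, so this keeps only the final step's activations; a real run would have to decide whether to accumulate them instead.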
Hypotheses
- The activation graphs of weird and non-weird outputs will be significantly different.
- The weird activations will be more localized to specific neurons; the activation distribution over neurons will be more long-tailed (see the sketch after this list).
- A language model is able to classify most of its own weird outputs as weird, except ones resembling SolidGoldMagikarp-style glitch tokens; for those, you would need another model for monitoring.
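One way to operationalize "more long-tailed" for the second hypothesis is a tail statistic over per-neuron activation magnitudes; kurtosis and top-1% mass are illustrative choices, not settled ones, and `acts` is a placeholder for an array from the fuzzing dataset:

```python
# Sketch of one operationalization of "long-tailed"; kurtosis and top-1%
# mass are illustrative choices.
import numpy as np
from scipy.stats import kurtosis

def tail_stats(acts: np.ndarray) -> dict:
    """acts: 1-D array of per-neuron activation magnitudes for one input."""
    mags = np.abs(acts)
    k = max(1, int(0.01 * mags.size))  # the top 1% most active neurons
    top_mass = float(np.sort(mags)[-k:].sum() / mags.sum())
    return {"kurtosis": float(kurtosis(mags)), "top1pct_mass": top_mass}

# The hypothesis predicts both statistics are higher for weird outputs.
```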
Methodology
1. Have access to two language models of a reasonable size, minimizing the $\dfrac{\text{inference time}}{\text{performance need}}$ ratio.
2. Run one model $M_1$ with the instruction "You are an expert tasked with fuzzing a language model. Find the weirdest and most fringe inputs to the model" or something similar, possibly with multiple examples of very weird inputs (history $H_1$).
3. Send the $H_1$ output to the fuzzing target model $M_2$ and record its output (history $H_2$).
4. Use either the first inference history $H_1$ or a new inference history $H_3$ with the instruction "You are tasked with classifying conversation answers in 5 levels of weirdness or expectation. Is this output """{$H_2$.output}""" expected given the input """{$H_2$.input}"""?".
5. Get the 5-level classification and save it in a dataset. Also save the activations of $M_2$ for $H_2$ and connect them with the 5-level weirdness classification, the input prompt, and whatever other metadata makes sense. (Steps 2-5 are sketched in code after this list.)
6. Manually investigate whether the dataset outputs make sense, i.e. whether the levels are coherent with the weirdness of the outputs. This is a sanity check.
7. If not, rerun (2-6) with better parameters or redesign the methodology (1-6).
8. Otherwise, identify and classify the patterns of weirdness manually and try to develop hypotheses for why they exist. This can run in parallel with (10-13).
9. See if it makes sense to rerun (2-6) with parameters other than weirdness that would inform our future linear mixed-effects model.
10. Create an operationalization of some normalization of activation patterns:
11. Convert the neural network to a weighted graph with an activation vertex property and a weight-adjusted activation edge property (see the graph sketch after this list).
12. Get inspired by network summary statistics (which might be useless) or other DAG-based summary statistics, e.g. "Nonlinear Weighted Directed Acyclic Graph and A Priori Estimates for Neural Networks", "Characterizing dissimilarity of weighted networks", "Statistical Analysis of Weighted Networks".
13. Run a mixed-effects linear model over some normalization of the activation patterns. The model might be something like $\text{activation operationalization} \sim \text{weirdness} + (1 \mid \text{ID})$ (see the model sketch after this list).
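A rough sketch of steps 2-5, with small HuggingFace models standing in for $M_1$ and $M_2$ (a real run would use stronger models and proper chat formatting); the regex parse of the judge's 1-5 answer is a placeholder heuristic:

```python
# Rough sketch of steps 2-5; gpt2/distilgpt2 are stand-ins for M1/M2.
import json
import re
from transformers import pipeline

m1 = pipeline("text-generation", model="gpt2")        # fuzzer + judge (M1)
m2 = pipeline("text-generation", model="distilgpt2")  # fuzzing target (M2)

FUZZ_INSTRUCTION = (
    "You are an expert tasked with fuzzing a language model. "
    "Find the weirdest and most fringe inputs to the model.\n"
)
JUDGE_TEMPLATE = (
    "You are tasked with classifying conversation answers in 5 levels of "
    'weirdness or expectation. Is this output """{output}""" expected given '
    'the input """{input}"""? Answer with a number from 1 to 5.\n'
)

dataset = []
for _ in range(10):
    # Step 2: M1 proposes a weird input (history H1).
    h1 = m1(FUZZ_INSTRUCTION, max_new_tokens=40)[0]["generated_text"]
    fuzz_input = h1[len(FUZZ_INSTRUCTION):]
    # Step 3: send it to the target M2 and record the output (history H2).
    h2 = m2(fuzz_input, max_new_tokens=40)[0]["generated_text"]
    # Step 4: ask the judge for a 1-5 weirdness level.
    judge_prompt = JUDGE_TEMPLATE.format(output=h2, input=fuzz_input)
    verdict = m1(judge_prompt, max_new_tokens=5)[0]["generated_text"]
    match = re.search(r"[1-5]", verdict[len(judge_prompt):])
    # Step 5: store the triple; activations would be captured on M2's pass.
    dataset.append({"input": fuzz_input, "output": h2,
                    "weirdness": int(match.group()) if match else None})

with open("fuzz_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```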
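A sketch of the graph conversion in step 11 on a toy ReLU MLP, assuming `networkx`; the layer widths, random weights, and the summary statistic at the end are illustrative:

```python
# Sketch of step 11: one forward pass of a toy MLP becomes a weighted DAG
# with an `activation` vertex property and a `weighted_activation` edge
# property (edge weight times the source neuron's activation).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
layers = [4, 3, 2]  # toy layer widths: input, hidden, output
weights = [rng.normal(size=(m, n)) for m, n in zip(layers[:-1], layers[1:])]

def forward_to_graph(x: np.ndarray) -> nx.DiGraph:
    """One forward pass -> weighted DAG with the two properties from step 11."""
    acts = [x]
    for w in weights:
        acts.append(np.maximum(acts[-1] @ w, 0.0))  # ReLU layer
    g = nx.DiGraph()
    for l, layer_acts in enumerate(acts):
        for i, a in enumerate(layer_acts):
            g.add_node((l, i), activation=float(a))  # vertex property
    for l, w in enumerate(weights):
        for i in range(w.shape[0]):
            for j in range(w.shape[1]):
                g.add_edge((l, i), (l + 1, j),
                           weight=float(w[i, j]),
                           # edge property: weight-adjusted activation
                           weighted_activation=float(w[i, j] * acts[l][i]))
    return g

g = forward_to_graph(rng.normal(size=layers[0]))
# Step 12: any weighted-network summary statistic can now be computed, e.g.
# the weighted out-strength per neuron as a crude localization measure.
strength = dict(g.out_degree(weight="weighted_activation"))
print(max(strength.items(), key=lambda kv: kv[1]))
```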
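A sketch of step 13 using `statsmodels`, whose `groups` argument gives the random intercept per ID implied by the formula; the toy dataframe and column names are assumptions:

```python
# Sketch of step 13: activation operationalization ~ weirdness + (1 | ID).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "activation_op": [0.12, 0.33, 0.41, 0.55, 0.21, 0.62, 0.18, 0.47],
    "weirdness":     [1, 2, 3, 4, 1, 5, 2, 3],
    "ID":            ["a", "a", "b", "b", "c", "c", "d", "d"],  # prompt/run ID
})

# groups=df["ID"] adds a random intercept per ID, i.e. the (1 | ID) term.
model = smf.mixedlm("activation_op ~ weirdness", df, groups=df["ID"])
print(model.fit().summary())
```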
- Risk classifications on The Pile