This repository explores the capabilities of language model (LM)-based agents, focusing on their evolution, environments, and evaluation designs. We view data as outputs of computational programs, including physical systems, where intelligence emerges from compressing these outputs via prediction (e.g., next-token prediction in LLMs). This leads to agentic behaviors through scaling paradigms, including test-time compute, reasoning traces, and agent-environment execution traces. The repo emphasizes Software Engineering (SWE) contexts, decomposing agent scaling and running experiments to test generalization, self-improvement, and economic value.
Inspired by Solomonoff Induction, we treat data as artifacts generated by underlying programs. Neural networks approximate these programs via circuit search (backpropagation), and the resulting compression equates to understanding. Scaling laws (e.g., from GPT-3 to GPT-4) predict emergent capabilities: smooth perplexity reductions yield discrete abilities such as in-context learning. In agents, this manifests as reasoning traces (e.g., chain-of-thought) and execution rollouts, converting compute into high-quality data beyond what human-generated corpora can supply.
- Data as Compute Artifacts: Observations are program outputs; agents learn by predicting/compressing them.
- Compression → Intelligence: Better prediction implies internal world models, as in LLMs absorbing grammar, facts, and reasoning; the prediction-compression link is made concrete in the sketch after this list.
- Emergence from Scaling: Power-law data distributions lead to power-law learning curves, with abilities emerging once key eigenfeatures are captured.
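The prediction-compression link can be made concrete: via arithmetic coding, a model's average next-token log-loss is exactly a code length, so lower loss means tighter compression of the observed data. A minimal sketch in Python (the numbers are hypothetical):

```python
import math

def bits_per_byte(nll_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert an LM's average next-token NLL (nats/token) into a compression
    rate in bits per byte: the code length an arithmetic coder driven by the
    model's predictions would achieve on the same text."""
    total_bits = nll_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Hypothetical run: 1.2 nats/token over 1,000 tokens spanning 4,200 raw bytes
# -> ~0.41 bits/byte, i.e. ~19x compression relative to 8-bit bytes.
print(bits_per_byte(1.2, 1_000, 4_200))
```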
Agent scaling decomposes into pre-training, reinforcement learning (RL), and test-time/inference compute, shifting from human data to self-generated experiences. We focus on SWE, where agents handle code generation, debugging, and optimization in verifiable environments (e.g., bash terminals, compilers).
- Pre-Training Phase: Builds general priors via next-token prediction on vast datasets. In SWE: Curates code corpora (e.g., Seed-Coder: Let the Code Model Curate Data for Itself), emphasizing structure over noise.
- RL Phase (Low to High Compute):
- Low Compute: Elicits pre-trained capabilities (e.g., InstructGPT/ChatGPT via RLHF for instruction following).
- High Compute: Generates new data/experiences (e.g., self-play in AlphaZero, GRPO in DeepSeek-R1). In SWE: Uses verifiers like LeanDojo/Coq for proof-assisted coding or Proton for kernel tuning (a minimal verifier-reward sketch follows this list).
- Test-Time Compute: Allocates variable FLOPs per token based on problem complexity (e.g., long chain-of-thought for hard tasks like a Riemann-hypothesis proof vs. a single step for 1+1). In SWE: Agentic search (e.g., Claude Code with tools like grep/sed; CWM: An Open-Weights LLM for Research on Code Generation with World Models) over retrieval-augmented generation (e.g., RETRO), enabling long-horizon tasks with function calling (see the agent-loop sketch below).
- Agentic Path in SWE: Reasoning → Generalization → Agentic Capabilities. Decomposes into: tool use (e.g., Codex for CLI integration), memory (effectively unbounded context via context engineering), and multimodality (e.g., VLMs for code visualization).
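To make the high-compute RL phase concrete, the sketch below pairs a programmatic verifier (a hypothetical unit-test runner, not any specific framework's API) with GRPO-style group-relative advantages: sample several completions per prompt, score each with the verifier, and normalize rewards within the group so no learned value function is needed. It assumes the candidate and its tests are plain Python and that execution is trusted (a real setup would sandbox it).

```python
import statistics
import subprocess
import tempfile
from pathlib import Path

def verify(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Binary verifier reward: 1.0 if the candidate passes its tests, else 0.0.
    Hypothetical harness; real pipelines sandbox and resource-limit execution."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate_test.py"
        path.write_text(candidate_code + "\n\n" + test_code)
        result = subprocess.run(["python", str(path)], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each completion's reward by the mean
    and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```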
This decomposition addresses bottlenecks like data scarcity by converting compute into traces (e.g., reasoning rollouts, self-play games), aiming for ASI-level extrapolation beyond the support of the pre-training distribution.
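A minimal agent loop for the SWE setting, sketched below: the model is queried, any tool call it emits (e.g., a bash command using grep/sed) is executed, and the observation is appended to the context; the accumulated message list is exactly the execution trace discussed above. The query_model function and the message/action schema are placeholders, not any provider's actual API.

```python
import json
import subprocess

def query_model(messages: list[dict]) -> dict:
    """Placeholder LLM call returning either {"answer": ...} or a tool call
    such as {"tool": "bash", "args": {"cmd": "grep -rn TODO src/"}}."""
    raise NotImplementedError

def run_bash(cmd: str) -> str:
    out = subprocess.run(["bash", "-lc", cmd], capture_output=True, text=True, timeout=60)
    return (out.stdout + out.stderr)[-4000:]  # truncate long observations

def agent_loop(task: str, max_steps: int = 20) -> list[dict]:
    """Iterate model <-> environment; the returned message list is the trace."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = query_model(messages)
        messages.append({"role": "assistant", "content": json.dumps(action)})
        if "answer" in action:                # model declares the task done
            break
        if action.get("tool") == "bash":      # execute the requested tool call
            messages.append({"role": "tool", "content": run_bash(action["args"]["cmd"])})
    return messages
```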
We run experiments to probe agent capabilities in controlled settings, generating and analyzing traces for self-improvement.
- SWE-Focused Experiments:
- Code generation and optimization: Implement NSA Triton kernels from scratch using LM pairs (e.g., GPT-5/Opus 4.1), tracing rollouts for feedback; a trivial Triton kernel stand-in appears below.
- Self-Play and Bootstrapping: Use nanoGPT as a benchmark for recursive self-improvement; test SPIRAL for multi-agent reasoning in zero-sum coding tasks.
- Environment Interactions: Deploy agents in bash/terminal (PrimeIntellect environment hub) or math/coding verifiers (e.g., DeepSeek-Prover-V2 for subgoal decomposition).
- General Experiments:
- Web Agents: Information retrieval via WebAgent (e.g., WebShaper for synthetic data generation) for Deep Research-style searches; related systems include WebDancer and ByteDance's deer-flow.
- Multimodal: Test VLM+tools (e.g., InternVL3.5) for image-based reasoning in games/coding.
- Trajectory Recording: Capture LLM interactions (e.g., via ByteDance's trae-agent; see TRAJECTORY_RECORDING) for auditing rollouts; a minimal recorder sketch follows this list.
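A minimal trajectory recorder, appending each step of a rollout to JSONL for later auditing. The schema here is an assumption for illustration; trae-agent and similar tools define their own formats.

```python
import json
import time
from pathlib import Path

class TrajectoryRecorder:
    """Append-only JSONL log of agent steps (role, content, optional metadata)."""

    def __init__(self, path: str = "trajectory.jsonl"):
        self.path = Path(path)

    def record(self, step: int, role: str, content: str, **meta) -> None:
        entry = {"ts": time.time(), "step": step, "role": role,
                 "content": content, **meta}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

# Usage: rec = TrajectoryRecorder(); rec.record(0, "assistant", "ls -la", tool="bash")
```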
Experiments emphasize verifiable domains (math, code) to scale RL without human feedback.
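An NSA kernel is too long to inline here; as a stand-in for the kind of artifact these experiments target, the sketch below is the canonical minimal Triton kernel (element-wise add) that an LM pair would be asked to grow toward block-sparse attention. It follows the standard Triton tutorial pattern and assumes a CUDA-capable device.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```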
Evaluations aim to measure intelligence as a smooth function of scale, reformulating metrics that would otherwise show abrupt jumps, and focus on traces for reproducibility.
- SWE Evaluations:
- Coding Traces: Analyze rollouts (e.g., AlgoTuner logs, CC-bench datasets) for efficiency in kernel design or repo ingestion (Gitingest); see the pass@k sketch after this list.
- NSA Eval: Benchmark LM pairs implementing novel algorithms with minimal prior art.
- General Evaluations:
- Game-as-Eval: Use GamingAgent for Atari/Tetris/2048 to test agent generalization.
- Self-Improvement Benchmarks: Automated LLM speedrunning on nanoGPT improvements.
- Verifier-Based: willccbb/verifiers for RL loops; measure data efficiency in post-training.
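For scoring coding rollouts, a standard metric is pass@k; the unbiased estimator (Chen et al., 2021) needs only n sampled completions per task and the count c that pass the verifier.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n completions (c of them correct) passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples with 12 passing -> pass@1 = 0.12, pass@10 ≈ 0.74
print(pass_at_k(100, 12, 1), pass_at_k(100, 12, 10))
```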
All evals prioritize economic value (e.g., productivity in SWE) and scalability (e.g., parallel test-time compute like Deep Think → Grok 4 Heavy); a best-of-N sketch follows.
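In its simplest form, parallel test-time compute is best-of-N selection: draw N candidates concurrently and keep the one a verifier or reward model ranks highest. A sketch with placeholder sample/score functions (both are assumptions, to be backed by a real model API and verifier):

```python
from concurrent.futures import ThreadPoolExecutor

def sample_candidate(prompt: str, seed: int) -> str:
    """Placeholder for one independent model sample."""
    raise NotImplementedError

def score(prompt: str, candidate: str) -> float:
    """Placeholder verifier / reward model; higher is better."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 16) -> str:
    # Draw N samples in parallel, then return the highest-scoring candidate.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: sample_candidate(prompt, s), range(n)))
    return max(candidates, key=lambda c: score(prompt, c))
```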