Agent Experiments

Overview

This repository contains experiments, methodologies, and results focused on improving our use of LLM agents, particularly Claude via Claude Code. We systematically collect challenges, develop hypotheses, and evaluate solutions to enhance AI-assisted development workflows.

Repository Structure

.
├── challenges/          # Specific problems and test cases
│   ├── direction-following/
│   ├── error-patterns/
│   ├── complex-refactoring/
│   └── ...
├── testbeds/            # Reusable codebases for experiments
│   ├── web-app/
│   ├── cli-tool/
│   ├── data-pipeline/
│   └── ...
├── methods/             # Approaches and techniques
│   ├── prompting-strategies/
│   ├── context-management/
│   ├── iterative-refinement/
│   └── ...
├── experiments/         # Experimental runs and results
│   ├── 2025-07-05-direction-following/
│   ├── 2025-07-06-error-recovery/
│   └── ...
├── evaluation/          # Evaluation scripts and criteria
│   ├── metrics/
│   ├── scripts/
│   └── rubrics/
└── reports/             # Analysis and findings
    ├── weekly/
    └── insights/

Quick Start

For Contributors

Check open issues for ongoing discussions and experiments
Review the contribution guidelines
Pick a challenge or propose a new one
Document your experiments following our experiment template

Running an Experiment

Select or create a challenge in challenges/
Choose an appropriate testbed from testbeds/ or create a new one
Apply a method from methods/ or develop a new approach
Document your experiment in experiments/YYYY-MM-DD-descriptive-name/ (sessions are saved automatically by Claude Code)
Evaluate results using tools in evaluation/
Share findings via PR and/or issue discussion

Key Concepts

Challenges

Specific, reproducible problems we’ve encountered with Claude Code. Each challenge includes:

Problem description (README.org)
Success criteria
Known failure modes
Related issues/discussions
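
A minimal shell sketch of scaffolding a challenge with the four parts listed above. The function name new_challenge, the Org headings, and the #+TITLE line are assumptions made for illustration; the repository may use a different template.

```shell
# Hypothetical helper: create challenges/<name>/README.org with one Org
# heading per item a challenge is expected to include.
new_challenge() {
  local name="$1"
  local root="${2:-.}"          # repository root; defaults to the current directory
  local dir="$root/challenges/$name"

  mkdir -p "$dir"
  # Write the skeleton only if no README.org exists, so reruns never overwrite notes.
  if [ ! -e "$dir/README.org" ]; then
    cat > "$dir/README.org" <<EOF
#+TITLE: Challenge: $name

* Problem description
* Success criteria
* Known failure modes
* Related issues/discussions
EOF
  fi
  printf '%s\n' "$dir/README.org"
}
```

Calling `new_challenge my-challenge` from the repository root would create challenges/my-challenge/README.org and print its path; my-challenge is a placeholder name.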

Testbeds

Reusable codebases that serve as consistent environments for experiments. Testbeds are kept separate from challenges so that multiple approaches can be tested against the same codebase.

Methods

Documented approaches for improving Claude’s performance. Methods can be:

Prompting strategies
Context management techniques
Workflow patterns
Tool configurations

Experiments

Individual experimental runs combining a challenge, testbed, and method. Each experiment directory contains:

setup.org - Configuration and parameters
sessions/ - References to Claude Code session UUIDs (sessions are automatically saved to ~/.claude/projects/)
results.org - Observations and metrics
artifacts/ - Generated code or outputs
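
The layout above can be scaffolded with a small shell function. This is a sketch under stated assumptions, not a script that ships with the repository: the name new_experiment and the placeholder #+TITLE lines are hypothetical, and only the directory and file names come from the list above.

```shell
# Hypothetical helper: create experiments/YYYY-MM-DD-<name>/ with the
# setup.org, sessions/, results.org, and artifacts/ entries described above.
new_experiment() {
  local name="$1"
  local root="${2:-.}"          # repository root; defaults to the current directory
  local day
  day="$(date +%Y-%m-%d)"       # today's date in the YYYY-MM-DD form used by experiments/
  local dir="$root/experiments/$day-$name"

  mkdir -p "$dir/sessions" "$dir/artifacts"
  # Create the Org files only if they are missing, so reruns never clobber notes.
  [ -e "$dir/setup.org" ]   || printf '#+TITLE: Setup: %s\n' "$name" > "$dir/setup.org"
  [ -e "$dir/results.org" ] || printf '#+TITLE: Results: %s\n' "$name" > "$dir/results.org"
  printf '%s\n' "$dir"
}
```

Running `new_experiment error-recovery` from the repository root would create a dated directory under experiments/ and print its path.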

Workflow

Issue-Driven Development

We use GitHub Issues for:

**Experiment Proposals** (label: experiment)
**Challenge Documentation** (label: challenge)
**Method Discussions** (label: method)
**Results & Insights** (label: findings)

Issues remain open during active experimentation and link to relevant PRs and commits.

Branching Strategy

Given our research-oriented workflow with 4-5 contributors, we use three kinds of branches:

main - Stable, reviewed content
experiments/<username>/<description> - Individual experiment branches
develop - Integration branch for collaborative work

Example flow:

git checkout -b experiments/alice/context-window-optimization
# ... work on experiment ...
git push origin experiments/alice/context-window-optimization
# Open a PR against develop for review
# After team review, merge to main
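
The experiments/<username>/<description> convention can be checked before pushing. The function below is an illustrative sketch; is_experiment_branch is a hypothetical name, and treating exactly three path segments as valid is an assumption about how strictly the convention is meant.

```shell
# Hypothetical check: succeed only for names of the form
# experiments/<username>/<description> with non-empty segments.
is_experiment_branch() {
  case "$1" in
    experiments/*/*/*) return 1 ;;  # more than three segments, or a trailing slash
    experiments/?*/?*) return 0 ;;  # non-empty username and description
    *)                 return 1 ;;  # anything else, including main and develop
  esac
}
```

One way to use it is `is_experiment_branch "$(git branch --show-current)" || echo "unexpected branch name"`, which relies on git 2.22 or later for --show-current.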

Contributing

See CONTRIBUTING.org for detailed guidelines. Key points:

Start discussions in issues before major experiments
Use consistent naming conventions
Document both successes and failures
Include session references (Claude Code session UUIDs) for reproducibility
Tag relevant team members for review
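
To make the session point concrete, here is a hedged sketch that collects session identifiers so they can be recorded in an experiment's sessions/ directory. It assumes each session is stored as a file named <uuid>.jsonl somewhere under ~/.claude/projects/; this README only states the directory, so check the file naming against your local installation. list_session_ids is a hypothetical helper name.

```shell
# Hypothetical helper: print one session identifier per line, sorted.
# Assumption: sessions are files named <uuid>.jsonl below the projects directory.
list_session_ids() {
  local projects_dir="${1:-$HOME/.claude/projects}"
  find "$projects_dir" -type f -name '*.jsonl' 2>/dev/null |
    while IFS= read -r path; do
      basename "$path" .jsonl     # strip the directory and the .jsonl suffix
    done | sort
}
```

The output can be redirected into a file inside the experiment's sessions/ directory; the file name to use there is left to the contributor.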

Current Focus Areas

**Direction Following** - Improving Claude’s adherence to specific instructions
**Error Recovery** - Handling and learning from Claude’s mistakes
**Context Management** - Optimizing information provided to Claude
**Complex Refactoring** - Multi-file, architectural changes

sirmalloc/claude-experiments