/claude-experiments

A repository for collecting LLM agent challenges, hypotheses / methods, and experimental results. For science!

Primary LanguageJavaScript

Agent Experiments

Overview

This repository contains experiments, methodologies, and results focused on improving our use of LLM Agents, particularly Claude via Claude Code. We systematically collect challenges, develop hypotheses, and evaluate solutions to enhance AI-assisted development workflows.

Repository Structure

.
├── challenges/           # Specific problems and test cases
│   ├── direction-following/
│   ├── error-patterns/
│   ├── complex-refactoring/
│   └── ...
├── testbeds/            # Reusable code bases for experiments
│   ├── web-app/
│   ├── cli-tool/
│   ├── data-pipeline/
│   └── ...
├── methods/             # Approaches and techniques
│   ├── prompting-strategies/
│   ├── context-management/
│   ├── iterative-refinement/
│   └── ...
├── experiments/         # Experimental runs and results
│   ├── 2025-07-05-direction-following/
│   ├── 2025-07-06-error-recovery/
│   └── ...
├── evaluation/          # Evaluation scripts and criteria
│   ├── metrics/
│   ├── scripts/
│   └── rubrics/
└── reports/            # Analysis and findings
    ├── weekly/
    └── insights/

Quick Start

For Contributors

  1. Check open issues for ongoing discussions and experiments
  2. Review the contribution guidelines
  3. Pick a challenge or propose a new one
  4. Document your experiments following our experiment template

Running an Experiment

  1. Select or create a challenge in challenges/
  2. Choose an appropriate testbed from testbeds/ or create a new one
  3. Apply a method from methods/ or develop a new approach
  4. Document your experiment in experiments/YYYY-MM-DD-descriptive-name/ (sessions are saved automatically by Claude Code)
  5. Evaluate results using tools in evaluation/
  6. Share findings via PR and/or issue discussion

Key Concepts

Challenges

Specific, reproducible problems we’ve encountered with Claude Code. Each challenge includes:

  • Problem description (README.org)
  • Success criteria
  • Known failure modes
  • Related issues/discussions

Testbeds

Reusable codebases that serve as consistent environments for experiments. These are intentionally kept separate from challenges to enable testing multiple approaches on the same codebase.

Methods

Documented approaches for improving Claude’s performance. Methods can be:

  • Prompting strategies
  • Context management techniques
  • Workflow patterns
  • Tool configurations

Experiments

Individual experimental runs combining a challenge, testbed, and method. Each experiment directory contains:

  • setup.org - Configuration and parameters
  • sessions/ - References to Claude Code session UUIDs (sessions are automatically saved to ~/.claude/projects/)
  • results.org - Observations and metrics
  • artifacts/ - Generated code or outputs

Workflow

Issue-Driven Development

We use GitHub Issues for:

  • **Experiment Proposals** (label: experiment)
  • **Challenge Documentation** (label: challenge)
  • **Method Discussions** (label: method)
  • **Results & Insights** (label: findings)

Issues remain open during active experimentation and link to relevant PRs and commits.

Branching Strategy

Given our research-oriented workflow with 4-5 contributors:

  • main - Stable, reviewed content
  • experiments/<username>/<description> - Individual experiment branches
  • develop - Integration branch for collaborative work

Example flow:

git checkout -b experiments/alice/context-window-optimization
# ... work on experiment ...
git push origin experiments/alice/context-window-optimization
# Create PR to develop for review
# After team review, merge to main

Contributing

See CONTRIBUTING.org for detailed guidelines. Key points:

  • Start discussions in issues before major experiments
  • Use consistent naming conventions
  • Document both successes and failures
  • Include session files for reproducibility
  • Tag relevant team members for review

Current Focus Areas

  1. **Direction Following** - Improving Claude’s adherence to specific instructions
  2. **Error Recovery** - Handling and learning from Claude’s mistakes
  3. **Context Management** - Optimizing information provided to Claude
  4. **Complex Refactoring** - Multi-file, architectural changes

Resources