Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
- 🔄 Iterate on a system prompt to find the most effective instructions for your project.
- ⚖️ Compare the quality of code produced by different models.
- 📈 Monitor generated code quality over time as models and agents evolve.
Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on web code and relies primarily on well-established measures of code quality.
- ⚙️ Configure your evaluations with different models, frameworks, and tools.
- ✍️ Specify system instructions and add MCP servers.
- 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and coding best practices. (More built-in checks coming soon!)
- 🔧 Automatically attempt to repair issues detected during code generation.
- 📊 View and compare results with an intuitive report viewer UI.
- Install the package:

  ```bash
  npm install -g web-codegen-scorer
  ```

- Set up your API keys:

  In order to run an eval, you have to specify API keys for the relevant providers as environment variables:

  ```bash
  export GEMINI_API_KEY="YOUR_API_KEY_HERE"    # If you're using Gemini models
  export OPENAI_API_KEY="YOUR_API_KEY_HERE"    # If you're using OpenAI models
  export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
  ```

- Run an eval:

  You can run your first eval using our Angular example with the following command:

  ```bash
  web-codegen-scorer eval --env=angular-example
  ```

- (Optional) Set up your own eval:

  If you want to set up a custom eval instead of using our built-in examples, you can run the following command, which will guide you through the process:

  ```bash
  web-codegen-scorer init
  ```
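Putting these steps together, a first session might look like the sketch below. It assumes that `web-codegen-scorer init` leaves you with an environment config whose path you then pass to `--env`:

```bash
# One-time setup
npm install -g web-codegen-scorer
export GEMINI_API_KEY="YOUR_API_KEY_HERE"

# Run the built-in Angular example
web-codegen-scorer eval --env=angular-example

# Scaffold a custom eval, then run it against the config you created
web-codegen-scorer init
web-codegen-scorer eval --env=<path to your environment config>
```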
You can customize the `web-codegen-scorer eval` script with the following flags:
- `--env=<path>` (alias: `--environment`): (Required) Specifies the path from which to load the environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`
- `--runner=<name>`: Specifies the runner used to execute the eval. Supported runners are `genkit` (default) and `gemini-cli`.
- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it attempts to read the initial code from a corresponding file in the `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation (see the combined example after this list).
  - Note: You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.web-codegen-scorer/llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`
- `--output-directory=<name>` (alias: `--output-dir`): Specifies the directory under which to output the generated code, which is useful for debugging. By default, the code is generated in a temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`
- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`
- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`
- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`
- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random sample (controlled by `--limit`) is taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`
- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`
- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`
- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=<config path>`
- `--help`: Prints out usage information about the script.
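To make the `--local` workflow concrete, here is a sketch of how the flags above can be combined. The environment path, model, and report names are placeholder values, not defaults shipped with the tool:

```bash
# First run: calls the LLM and generates the initial files under .web-codegen-scorer/llm-output
web-codegen-scorer eval \
  --env=foo/bar/my-env.mjs \
  --model=gemini-2.5-flash \
  --limit=10 \
  --report-name=first-pass

# Subsequent runs: reuse the stored LLM output instead of calling the LLM again,
# e.g. while debugging the build/repair process
web-codegen-scorer eval \
  --env=foo/bar/my-env.mjs \
  --local \
  --report-name=local-rerun
```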
If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
Once they're installed, you can run the following commands:

- `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
- `pnpm run eval` - Runs an eval from source.
- `pnpm run report` - Runs the report app from source.
- `pnpm run init` - Runs the init script from source.
- `pnpm run format` - Formats the source code using Prettier.
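As a sketch, a typical contributor loop with these commands (assuming you are working from a clone of this repo) could look like:

```bash
pnpm install      # install dependencies once
pnpm run eval     # run an eval from source
pnpm run report   # inspect the results in the report viewer
pnpm run format   # format the source code before sending a change
```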
This tool is built by the Angular team at Google.
No! You can use this tool with any web library or framework (or none at all) as well as any model.
As more and more developers reach for LLM-based tools to create and modify code, we wanted to be able to empirically measure the effect of different factors on the quality of generated code. While many LLM coding benchmarks exist, we found that they were often too broad and didn't measure the specific quality metrics we cared about.
In the absence of such a tool, we found that many developers based their judgments about codegen with different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web Codegen Scorer gives us a platform to measure codegen across different configurations with consistency and repeatability.
Yes! We plan to both expand the number of built-in checks and the variety of codegen scenarios.
Our roadmap includes:
- Including interaction testing in the rating, to ensure the generated code performs any requested behaviors.
- Measuring Core Web Vitals.
- Measuring the effectiveness of LLM-driven edits on an existing codebase.