Web Codegen Scorer is a tool for evaluating the quality of web code generated by Large Language Models (LLMs).
You can use this tool to make evidence-based decisions relating to AI-generated code. For example:
- 🔄 Iterate on a system prompt to find the most effective instructions for your project.
- ⚖️ Compare the quality of code produced by different models.
- 📈 Monitor generated code quality over time as models and agents evolve.
Web Codegen Scorer is different from other code benchmarks in that it focuses specifically on web code and relies primarily on well-established measures of code quality.
- ⚙️ Configure your evaluations with different models, frameworks, and tools.
- ✍️ Specify system instructions and add MCP servers.
- 📋 Use built-in checks for build success, runtime errors, accessibility, security, LLM rating, and coding best practices. (More built-in checks coming soon!)
- 🔧 Automatically attempt to repair issues detected during code generation.
- 📊 View and compare results with an intuitive report viewer UI.
- Install the package:

  ```bash
  npm install -g web-codegen-scorer
  ```

- Set up your API keys:

  In order to run an eval, you have to specify API keys for the relevant providers as environment variables:

  ```bash
  export GEMINI_API_KEY="YOUR_API_KEY_HERE"    # If you're using Gemini models
  export OPENAI_API_KEY="YOUR_API_KEY_HERE"    # If you're using OpenAI models
  export ANTHROPIC_API_KEY="YOUR_API_KEY_HERE" # If you're using Anthropic models
  ```

- Run an eval:

  You can run your first eval using our Angular example with the following command:

  ```bash
  web-codegen-scorer eval --env=angular-example
  ```

- (Optional) Set up your own eval:

  If you want to set up a custom eval instead of using our built-in examples, you can run the following command, which will guide you through the process:

  ```bash
  web-codegen-scorer init
  ```
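Putting these steps together, a first session might look like the sketch below. It assumes that `web-codegen-scorer init` leaves you with an environment config whose path you then pass to `--env`:

```bash
# One-time setup
npm install -g web-codegen-scorer
export GEMINI_API_KEY="YOUR_API_KEY_HERE"

# Run the built-in Angular example
web-codegen-scorer eval --env=angular-example

# Scaffold a custom eval, then run it against the config you created
web-codegen-scorer init
web-codegen-scorer eval --env=<path to your environment config>
```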
You can customize the `web-codegen-scorer eval` script with the following flags:
- `--env=<path>` (alias: `--environment`): (Required) Specifies the path from which to load the environment config.
  - Example: `web-codegen-scorer eval --env=foo/bar/my-env.mjs`
- `--model=<name>`: Specifies the model to use when generating code. Defaults to the value of `DEFAULT_MODEL_NAME`.
  - Example: `web-codegen-scorer eval --model=gemini-2.5-flash --env=<config path>`
- `--runner=<name>`: Specifies the runner used to execute the eval. Supported runners are `genkit` (default) and `gemini-cli`.
- `--local`: Runs the script in local mode for the initial code generation request. Instead of calling the LLM, it attempts to read the initial code from a corresponding file in the `.web-codegen-scorer/llm-output` directory (e.g., `.web-codegen-scorer/llm-output/todo-app.ts`). This is useful for re-running assessments or debugging the build/repair process without incurring LLM costs for the initial generation (see the combined example after this list).
  - Note: You typically need to run `web-codegen-scorer eval` once without `--local` to generate the initial files in `.web-codegen-scorer/llm-output`.
  - The `web-codegen-scorer eval:local` script is a shortcut for `web-codegen-scorer eval --local`.
- `--limit=<number>`: Specifies the number of application prompts to process. Defaults to `5`.
  - Example: `web-codegen-scorer eval --limit=10 --env=<config path>`
- `--output-directory=<name>` (alias: `--output-dir`): Specifies the directory under which to output the generated code, which is useful for debugging. By default, the code is generated in a temporary directory.
  - Example: `web-codegen-scorer eval --output-dir=test-output --env=<config path>`
- `--concurrency=<number>`: Sets the maximum number of concurrent AI API requests. Defaults to `5` (as defined by `DEFAULT_CONCURRENCY` in `src/config.ts`).
  - Example: `web-codegen-scorer eval --concurrency=3 --env=<config path>`
- `--report-name=<name>`: Sets the name for the generated report directory. Defaults to a timestamp (e.g., `2023-10-27T10-30-00-000Z`). The name will be sanitized (non-alphanumeric characters replaced with hyphens).
  - Example: `web-codegen-scorer eval --report-name=my-custom-report --env=<config path>`
- `--rag-endpoint=<url>`: Specifies a custom RAG (Retrieval-Augmented Generation) endpoint URL. The URL must contain a `PROMPT` substring, which will be replaced with the user prompt.
  - Example: `web-codegen-scorer eval --rag-endpoint="http://localhost:8080/my-rag-endpoint?query=PROMPT" --env=<config path>`
- `--prompt-filter=<name>`: String used to filter which prompts should be run. By default, a random sample (controlled by `--limit`) is taken from the prompts in the current environment. Setting this can be useful for debugging a specific prompt.
  - Example: `web-codegen-scorer eval --prompt-filter=tic-tac-toe --env=<config path>`
- `--skip-screenshots`: Whether to skip taking screenshots of the generated app. Defaults to `false`.
  - Example: `web-codegen-scorer eval --skip-screenshots --env=<config path>`
- `--labels=<label1> <label2>`: Metadata labels that will be attached to the run.
  - Example: `web-codegen-scorer eval --labels my-label another-label --env=<config path>`
- `--mcp`: Whether to start an MCP server for the evaluation. Defaults to `false`.
  - Example: `web-codegen-scorer eval --mcp --env=<config path>`
- `--help`: Prints out usage information about the script.
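To make the `--local` workflow concrete, here is a sketch of how the flags above can be combined. The environment path, model, and report names are placeholder values, not defaults shipped with the tool:

```bash
# First run: calls the LLM and generates the initial files under .web-codegen-scorer/llm-output
web-codegen-scorer eval \
  --env=foo/bar/my-env.mjs \
  --model=gemini-2.5-flash \
  --limit=10 \
  --report-name=first-pass

# Subsequent runs: reuse the stored LLM output instead of calling the LLM again,
# e.g. while debugging the build/repair process
web-codegen-scorer eval \
  --env=foo/bar/my-env.mjs \
  --local \
  --report-name=local-rerun
```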
If you've cloned this repo and want to work on the tool, you have to install its dependencies by running `pnpm install`.
Once they're installed, you can run the following commands:

- `pnpm run release-build` - Builds the package in the `dist` directory for publishing to npm.
- `pnpm run eval` - Runs an eval from source.
- `pnpm run report` - Runs the report app from source.
- `pnpm run init` - Runs the init script from source.
- `pnpm run format` - Formats the source code using Prettier.
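As a sketch, a typical contributor loop with these commands (assuming you are working from a clone of this repo) could look like:

```bash
pnpm install      # install dependencies once
pnpm run eval     # run an eval from source
pnpm run report   # inspect the results in the report viewer
pnpm run format   # format the source code before sending a change
```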
This tool is built by the Angular team at Google.
No! You can use this tool with any web library or framework (or none at all) as well as any model.
As more and more developers reach for LLM-based tools to create and modify code, we wanted to be able to empirically measure the effect of different factors on the quality of generated code. While many LLM coding benchmarks exist, we found that they were often too broad and didn't measure the specific quality metrics we cared about.
In the absence of such a tool, we found that many developers based their judgments about codegen with different models, frameworks, and tools on loosely structured trial-and-error. In contrast, Web Codegen Scorer gives us a platform to measure codegen across different configurations with consistency and repeatability.
Yes! We plan to both expand the number of built-in checks and the variety of codegen scenarios.
Our roadmap includes:
- Including interaction testing in the rating, to ensure the generated code performs any requested behaviors.
- Measuring Core Web Vitals.
- Measuring the effectiveness of LLM-driven edits on an existing codebase.