context-eval

context-eval Project Documentation

context-eval is a local-first context A/B testing framework for coding agents. It compares context variants under controlled local conditions and records the resulting artifacts for engineering review.

Use these docs when you want to understand the product boundary, run the fixture demo, inspect the architecture, or prepare the repository documentation site.

Why Context A/B Testing Exists

Agent-facing context changes are easy to ship and hard to evaluate. A new AGENTS.md, local documentation bundle, DeepWiki export, skill, or rule set can look useful while still making real coding-agent tasks slower, less stable, or less correct.

context-eval makes that evaluation local and inspectable: it holds the repository, task, command template, trial count, and validation commands steady, changes only the context variant, and records artifacts you can review afterward.
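
The controlled comparison described above can be pictured as a small run matrix. The sketch below is illustrative only: `PlannedRun`, `plan_matrix`, and every field name are hypothetical and are not taken from context-eval's actual API or config schema. It assumes the repository, command template, and validation commands are fixed elsewhere, so the context variant is the only factor that changes.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PlannedRun:
    """One cell of the comparison matrix (hypothetical structure)."""
    task: str
    variant: str
    trial: int


def plan_matrix(tasks, variants, trials):
    """Cross every task with every context variant, repeated `trials` times.

    Everything not listed here (repository, command template, validation
    commands) is assumed to be held constant by the caller.
    """
    return [
        PlannedRun(task=t, variant=v, trial=n)
        for t, v, n in product(tasks, variants, range(1, trials + 1))
    ]


# Hypothetical task and variant names, for illustration only.
runs = plan_matrix(
    tasks=["fix-failing-test"],
    variants=["baseline", "agents-md"],
    trials=2,
)
for run in runs:
    print(run)
```

Repeating each task and variant pair across several trials is what lets a reviewer separate a real difference from run-to-run noise.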

Minimal Local Workflow

  1. Install context-eval in editable mode.
  2. Initialize or choose an evaluation workspace.
  3. Define local tasks, context variants, agent command templates, and validation commands.
  4. Run config validation and a dry-run matrix preview.
  5. Run a small local evaluation.
  6. Inspect report.md, results.jsonl, run_manifest.json, logs, patches, exports, and optional UI output before drawing conclusions.
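
As a rough illustration of the inspection step, a results file in JSON Lines form can be summarized with a few lines of Python. This is a sketch under stated assumptions: the record keys `variant` and `validation_passed` are hypothetical, and the real results.jsonl schema may differ, so check a recorded file before relying on any field name.

```python
import json
from collections import defaultdict
from pathlib import Path


def pass_rate_by_variant(results_path):
    """Return {variant: fraction of records whose validation passed}.

    Assumes one JSON object per line with the hypothetical keys
    "variant" (str) and "validation_passed" (bool).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    with Path(results_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            total[record["variant"]] += 1
            passed[record["variant"]] += bool(record["validation_passed"])
    return {variant: passed[variant] / total[variant] for variant in total}
```

A summary like this is only a starting point. Logs and patches still need to be read before drawing conclusions.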

Documentation Map

Read the docs in this order for a first pass:

  1. Demo workflow to see the smallest deterministic path.
  2. Evaluation methodology to understand the comparison model.
  3. Architecture to understand how runs are planned and recorded.
  4. Artifact model to inspect outputs from a completed run.
  5. FAQ to check scope, non-goals, and mode boundaries.

Maintainers preparing a GitHub Pages project site should also read Pages setup.

Project Boundaries

context-eval is local-first. It compares context variants such as AGENTS.md, local docs, DeepWiki exports, skills, and rules against explicit local tasks and validation commands.

The outputs are local observations, not absolute model rankings. Confidence in a result comes from the project's validation commands and human review, not from patch size or an LLM judge alone. Reporting is artifact-only: completed reports, exports, terminal summaries, and the static UI all read from recorded local artifacts.
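
To make the role of validation commands concrete, here is a minimal sketch of deriving a pass/fail signal from a command's exit status. The function `run_validation` and the shape of its result dictionary are hypothetical and do not describe how context-eval itself executes or records validation. The example uses the current Python interpreter as a stand-in for a real project command such as a test runner.

```python
import subprocess
import sys


def run_validation(command, cwd=None, timeout=600):
    """Run one validation command; it passes only if it exits with status 0."""
    try:
        completed = subprocess.run(
            command,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A command that never finishes is treated as a failure.
        return {"command": command, "passed": False,
                "exit_code": None, "timed_out": True}
    return {"command": command, "passed": completed.returncode == 0,
            "exit_code": completed.returncode, "timed_out": False}


# Stand-in commands: one that succeeds and one that exits with status 3.
ok = run_validation([sys.executable, "-c", "raise SystemExit(0)"])
bad = run_validation([sys.executable, "-c", "raise SystemExit(3)"])
print(ok["passed"], bad["passed"], bad["exit_code"])
```

An exit status is an objective signal, but it is only as strong as the command behind it, which is why human review remains part of the confidence boundary.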

context-eval is not a leaderboard, hosted service, provider billing tool, credential manager, automatic agent installer, or automatic target-repository commit workflow. The static UI is offline and export-only. The local app is an explicit loopback mode that runs on the user’s machine.