context-eval

Development Plan

This plan defines how context-eval should evolve using Spec-Driven Development (SDD) and Test-Driven Development (TDD) while avoiding a PR cadence that is too fine-grained for maintainer review.

The project has moved past its initial MVP foundation. Future work should be planned as capability PRs: each PR owns one coherent user-facing capability, can contain several Ralph stories, and must merge with a complete acceptance package.

Product Boundaries

context_eval/ is the runtime Python package.
.agents/, .codex/skills/, openspec/, and scripts/ are maintainer capability library files, not runtime package modules.
Deleted general-purpose skill-hub documents must not be restored.
Active .codex/config.toml must not be committed; maintainers can copy .codex/config.example.toml locally.
context-eval remains a local engineering tool for comparing context variants, not an agent leaderboard.
Multi-agent comparison must be based only on local results.jsonl and run_metadata.json artifacts, and must not claim absolute agent benchmark results.
Report, export, compare, and static UI aggregation must read local run artifacts. They must not rerun agents, scrape missing telemetry from logs, or call hosted services.
The active roadmap now includes an explicit local app mode for visual configuration, preflight, run orchestration, and result review. This is a local loopback app, not a hosted or multi-user dashboard.
The next active capability is coco-visual-hybrid-evaluation: a Coco-first local app workflow for authoring tasks, expected outcomes, deterministic hard checks, optional soft evaluation payloads, run execution, and result review.
The active roadmap does not include an LLM judge, hosted or multi-user web dashboard, issue miner, real network isolation, automatic agent installation, provider account management, or automatic commits.

Development Cadence Policy

A Ralph story is not a pull request. A Ralph story is the smallest autonomous unit inside a capability PR. A capability PR should contain 3-6 related Ralph stories. Those stories must share one capability boundary and can be reviewed as one product change.

Each story still follows SDD + TDD:

Spec: update the relevant document in docs/ with the contract, edge cases, and non-goals.
Tests: add or update tests that encode that contract before the implementation is treated as complete.
Implementation: make the smallest code change needed for the story.
Docs: update README, examples, or workflow docs when user behavior changes.
Verification: run the agreed quality gates and inspect generated artifacts when the story affects reports, exports, prompts, workspaces, or UI.

Do not open one PR per story. Open or update the PR only when the capability has a coherent merge package: spec, tests, implementation, docs, verification. A story can still be committed separately inside the PR so review history remains clear.

Capability PRs should be split only when one of these is true:

the stories have different users or different operational boundaries;
verification would require unrelated artifact sets;
the PR would mix runtime behavior with maintainer tooling without a shared acceptance contract;
the diff would grow too large to review coherently, or the PR would introduce unrelated risk.

Capability Audit: PR #1-#17

PR #1-#4 established broad capability slices: runner/config maturity, local UI config editing, agent telemetry, and local multi-agent comparison. Those PRs were easier to reason about because the spec, tests, implementation, docs, and verification all pointed at the same capability.

PR #5-#17 delivered useful work, but the cadence became too fine-grained. The work added Markdown agent summaries, compact JSON metadata, bounded jobs, cleanup policies, run manifests, prompt templates, package build checks, license metadata, platform docs, artifact inspection, release-state checks, development-plan reconciliation, and validation timeout defaults. Several of these were good stories, but they were often promoted to separate PRs before the surrounding capability was complete.

The strongest signal is release readiness, which was split across build, license, platform, artifact inspection, and release-state PRs. Those changes should have been reviewed as one release automation and packaging capability with separate commits and shared acceptance gates.

Future planning should batch related stories into coherent capability PRs. The goal is fewer PRs, stronger review context, less manual intervention, and no loss of SDD/TDD discipline.

Recommended PR Order

PR A: Config Diagnostics And Strict Validation Hardening.
PR B: Local UI Persistence And Server-Mode Decision.
PR C: Reporting Polish For Multi-Task, Multi-Variant, Multi-Agent Runs.
PR D: Release Automation And Packaging Workflow Polish.
PR E: Optional Adapter And Telemetry Expansion, only if justified by stable local artifact formats or repeated command-template friction.
PR F: Local E2E CI Smoke And Test Taxonomy, before later feature work that depends on stronger workflow-level regression confidence.
PR G: Release Candidate Install Smoke And Changelog Finalization, before tagging or publishing the first 0.1.0 release candidate.
PR H: Agent Profiles And Noninteractive Agent Matrix, before full Web UI work. This unblocks Codex CLI, Claude Code, traecli, Coco, and custom commands as first-class local profiles.
PR I: Frontend Build/Test/Acceptance Foundation, before local app server and full Web UI work depend on browser tooling.
PR J: Local App Server And Run Orchestration, after agent profiles are stable. This creates the explicit local server mode behind the visual app.
PR K: Full Web UI Workflow For Non-Technical Users, after the server API is stable enough to avoid duplicating runner logic in the frontend. The first focused slice is a Chinese config/tasks editor with save-reload proof, desktop/narrow browser acceptance, and a minimal harness-readiness reference.
PR L: No-CLI Launcher And Packaging, after the local app workflow is stable and browser-verified.

Current active capability: coco-visual-hybrid-evaluation, after the local app server and frontend workflow are in place. This adds docs/coco-visual-hybrid-evaluation.md, kind: "coco", structured task authoring, expected outcomes, deterministic hard checks, optional soft evaluation payload generation, and local artifact review.

Capability Epic A: Config Diagnostics And Strict Validation Hardening

Goal

Make configuration failures actionable before users create workspaces or spend agent time. Users should see field-specific errors, path context, and strict validation failures that explain exactly what to fix.

Scope

Improve validate-config diagnostics for malformed config/task YAML, duplicate task context, missing files, missing prompt templates, unsafe overlay targets, and strict Git ref checks. The diagnostics contract lives in docs/config-diagnostics.md.
Harden strict validation around config-relative paths, overlay targets, filename-safe task IDs, and task-level repo_ref checks.
Keep validation side-effect-free: no workspaces, agents, validation commands, dependency installation, or network calls.
Update docs/configuration.md, docs/task-format.md, README snippets, and examples when behavior changes.
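
The following is a minimal sketch of side-effect-free diagnostics for two of the checks above (filename-safe task IDs and duplicate task IDs), assuming tasks have already been parsed into plain dictionaries. The Diagnostic dataclass, the check_tasks function, the id key, and the filename-safe pattern are hypothetical illustrations; the actual contract is whatever docs/config-diagnostics.md defines.

```python
import re
from dataclasses import dataclass

# Assumed definition of "filename-safe"; the real rule belongs to the spec.
SAFE_TASK_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical diagnostic record naming the file and field to fix."""

    file: str
    field: str
    message: str


def check_tasks(file: str, tasks: list[dict]) -> list[Diagnostic]:
    """Collect diagnostics without touching the filesystem or running commands."""
    diagnostics: list[Diagnostic] = []
    seen: set[str] = set()
    for index, task in enumerate(tasks):
        field = f"tasks[{index}].id"
        task_id = task.get("id")
        if not isinstance(task_id, str) or not SAFE_TASK_ID.match(task_id):
            diagnostics.append(
                Diagnostic(file, field, f"task ID {task_id!r} is not filename-safe")
            )
            continue
        if task_id in seen:
            diagnostics.append(Diagnostic(file, field, f"duplicate task ID {task_id!r}"))
        seen.add(task_id)
    return diagnostics
```

Returning a list of diagnostics rather than raising on the first problem lets the CLI report every actionable error in one pass.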

Non-Goals

Do not implement real network isolation.
Do not run target repository validation commands during config validation.
Do not add remote repo cloning, issue mining, or hosted validation services.
Do not change the runtime package boundary or restore skill-hub docs.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
Invalid config and task files fail before any workspace is created.
Errors name the relevant file, field, task ID, variant, or overlay when that context is available.
Strict mode remains local and side-effect-free.
Existing valid examples still pass context-eval validate-config.

Test Strategy

Spec tests for the diagnostics contract.
Pydantic and loader tests for invalid fields, duplicate task IDs, unsafe overlay targets, and prompt template paths.
CLI tests for strict and non-strict validation output.
Regression tests proving validation does not create workspaces or run validation commands.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

The stories all change the same user workflow: getting from YAML to a trusted preflight result. Splitting them into separate PRs would force reviewers to rebuild the same error model repeatedly and would risk docs/tests describing a partial validation contract.

Capability Epic B: Local UI Persistence And Server-Mode Decision

Goal

Decide and implement the next local UI persistence step without weakening the static UI safety contract. Users should know whether they are exporting YAML, using a browser file capability, or running an explicit local server mode.

Scope

Write a spec that chooses the persistence model: keep static export-only, add browser file save, or add explicit local server mode.
If server mode is chosen, define allowed local endpoints, destination paths, validation before write, and no agent execution from the UI.
Preserve static mode as offline, self-contained HTML that can inspect config and run artifacts without remote dependencies.
Keep all result display based on existing local run artifacts.
Update docs/local-ui-config-editor.md, README UI usage, and tests for the chosen persistence behavior.

Non-Goals

Do not add a hosted service, remote database, multi-user dashboard, or background run orchestration.
Do not let the static UI run agents, validation commands, package installs, or network checks.
Do not silently overwrite config files.
Do not make UI persistence a prerequisite for artifact inspection.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
The selected persistence mode is explicit in docs and UI wording.
Save/export behavior validates generated YAML before writing or downloading.
Static mode remains safe and does not open sockets or write local files.
UI tests cover config editing, matrix preview, validation feedback, and persistence/export controls.
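
If a save mode is chosen, the validate-before-write and no-silent-overwrite rules could look roughly like the sketch below. The save_config_text function, its overwrite flag, and the validate callback (which returns a list of error strings) are assumptions made for illustration, not the project's API.

```python
from pathlib import Path
from typing import Callable


def save_config_text(
    destination: Path,
    text: str,
    validate: Callable[[str], list[str]],
    overwrite: bool = False,
) -> None:
    """Write config text only after validation passes and overwrite is explicit."""
    errors = validate(text)
    if errors:
        # Nothing is written when the generated YAML is invalid.
        raise ValueError("; ".join(errors))
    if destination.exists() and not overwrite:
        # Replacing an existing file requires an explicit decision by the caller.
        raise FileExistsError(f"{destination} exists; pass overwrite=True to replace it")
    destination.write_text(text, encoding="utf-8")
```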

Test Strategy

Spec tests for static mode, save mode, and non-goals.
Unit tests for editable model export and validation before persistence.
CLI/UI HTML tests for visible controls and local-only text.
Browser or Playwright checks when interactive controls change.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Persistence is a product decision plus implementation. Splitting the decision, UI controls, server behavior, and docs across many PRs would create ambiguous intermediate states where users cannot tell whether the UI is export-only or safe to save.

Capability Epic C: Reporting Polish For Multi-Task, Multi-Variant, Multi-Agent Runs

Goal

Make reports, exports, terminal summaries, and the local UI easier to read for larger local run matrices while preserving the artifact-only and non-benchmark contract.

Scope

Improve Markdown report templates for multi-task, multi-variant, multi-agent, and repeated-trial runs.
Use docs/multi-agent-comparison.md as the source spec for local multi-agent comparison language and artifact boundaries.
Keep variant-level analysis primary for context comparison, and show agent-level summaries only when more than one agent_name exists.
Improve low-confidence, failed, timeout, and missing-telemetry presentation.
Keep CSV and compact JSON deterministic and script-friendly.
Keep compact JSON metadata stable, including controlled export timestamp testing and local source file accounting.
Keep all aggregation sourced only from results.jsonl and optional run_metadata.json.
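
As an illustration of the single-agent suppression rule, the sketch below aggregates per-agent counts from results.jsonl text and returns nothing when only one agent_name is present. The agent_name key comes from this plan; the passed key and the agent_summaries function are hypothetical and stand in for whatever the real result schema records.

```python
import json
from collections import defaultdict


def agent_summaries(results_jsonl: str) -> dict[str, dict[str, int]]:
    """Return per-agent counts, or an empty dict when only one agent_name exists."""
    counts: dict[str, dict[str, int]] = defaultdict(lambda: {"cases": 0, "passed": 0})
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        entry = counts[row["agent_name"]]
        entry["cases"] += 1
        # "passed" is an assumed field name; a missing value counts as not passed.
        entry["passed"] += 1 if row.get("passed") else 0
    if len(counts) <= 1:
        return {}
    # Sort by agent name so report sections come out in a deterministic order.
    return {name: counts[name] for name in sorted(counts)}
```

The function only reads the recorded rows, so it cannot rerun agents or fill gaps from logs.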

Non-Goals

Do not rerun agents to fill report gaps.
Do not infer token or tool-call data from logs.
Do not publish an absolute coding-agent capability ranking.
Do not add a hosted dashboard or remote sharing workflow.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
Markdown, inspect, compare, export, and UI output describe local observations rather than benchmark claims.
Multi-task and multi-variant report sections remain readable with synthetic fixture data.
Multi-agent sections appear only when more than one agent_name is present.
Exports remain deterministic and compatible with existing compact JSON/CSV contracts.

Test Strategy

Spec tests for artifact-only aggregation and non-benchmark language.
Report snapshot-style tests with synthetic JSONL fixtures.
CLI tests for inspect/compare output and single-agent suppression.
Export tests for deterministic CSV and compact JSON compatibility.
UI content tests, with browser verification if layout or interaction changes.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

The reporting surfaces share one aggregation contract. Reviewing them together prevents terminal output, Markdown, exports, and UI from drifting into slightly different interpretations of the same run artifacts.

Capability Epic D: Release Automation And Packaging Workflow Polish

Goal

Turn the current manual release checklist into a reproducible packaging workflow that catches local blockers and verifies artifacts without including maintainer capability library files in the runtime package.

Scope

Consolidate release-state checks, package builds, artifact inspection, changelog checks, and tag/publish preparation into a clear release workflow.
Keep CI and local release commands aligned.
Preserve packaging scope: include context_eval/ and report templates; reject .agents/, .codex/skills/, openspec/, scripts/, run artifacts, and active .codex/config.toml.
Document supported Python versions and platform gates.
Add automation only after the manual path remains stable in tests.
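
The packaging scope above can be checked against built artifacts with standard-library archive readers. This is a sketch under stated assumptions: archive_members and forbidden_members are hypothetical helper names, the forbidden prefixes are copied from the scope list, and run-artifact patterns are left out because this plan does not enumerate them.

```python
import tarfile
import zipfile
from pathlib import Path, PurePosixPath

# Prefixes and files taken from the packaging scope; run artifacts are not covered here.
FORBIDDEN_PREFIXES = (".agents/", ".codex/skills/", "openspec/", "scripts/")
FORBIDDEN_FILES = (".codex/config.toml",)


def archive_members(path: Path) -> list[str]:
    """List member names of a wheel (zip) or an sdist (tar archive)."""
    if path.suffix == ".whl":
        with zipfile.ZipFile(path) as wheel:
            return wheel.namelist()
    with tarfile.open(path) as sdist:
        # sdist members sit under a top-level "<name>-<version>/" directory; drop it.
        return ["/".join(PurePosixPath(name).parts[1:]) for name in sdist.getnames()]


def forbidden_members(names: list[str]) -> list[str]:
    """Return the member names that must not ship in the runtime package."""
    return sorted(
        name
        for name in names
        if name in FORBIDDEN_FILES or name.startswith(FORBIDDEN_PREFIXES)
    )
```

A release gate would fail when forbidden_members(archive_members(artifact)) is non-empty for any built wheel or sdist.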

Non-Goals

Do not publish packages automatically without an explicit release step.
Do not make macOS release-blocking unless a later spec changes platform support.
Do not include maintainer library files in the runtime package.
Do not commit local run artifacts or active Codex config.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
Release commands are documented in one checklist and mirrored by tests or CI where practical.
Built wheel and sdist artifact inspection rejects forbidden paths.
Release-state checks catch hidden local blockers before build/publish steps.
CHANGELOG requirements are clear for future releases.

Test Strategy

Script tests for release-state and artifact inspection behavior.
Packaging tests for wheel and sdist contents.
CI workflow checks on Windows and Linux for Python 3.11 and 3.12.
Docs tests for release checklist command coverage.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Release work is only useful as an end-to-end gate. Splitting build checks, metadata, artifact inspection, changelog rules, and docs into isolated PRs increases manual coordination and can leave the release path half-automated.

Capability Epic E: Optional Adapter And Telemetry Expansion

Goal

Expand adapter or telemetry support only when there is evidence that the command-template adapter or generic JSON collector is causing repeated local workflow friction.

Scope

Capability E keeps the command-template adapter as the only adapter for this PR. The accepted expansion is optional telemetry from stable local artifacts, not a new adapter family.
Accepted local artifact format: the generic JSON collector described in docs/agent-telemetry.md, including documented status, source, error, duration, token, and tool-call fields.
Reassess whether a thin Python entrypoint adapter solves real repeated command-template friction.
Use docs/agent-telemetry.md as the source spec for runner-guaranteed and hook-provided metrics.
Consider agent-specific telemetry collectors only for stable local artifact formats that can be covered with fixtures.
Keep the no-op collector and generic JSON collector as the default supported collector baseline unless evidence justifies another local format.
Preserve backwards-compatible CaseResult parsing and missing-telemetry semantics.
Document every new collector format before implementation.
Keep comparisons scoped to local observations from recorded run artifacts.

Non-Goals

Do not add hosted API calls, provider billing reconciliation, remote cost accounting, or brittle log scraping.
Do not require every agent to expose token/tool telemetry.
Do not install or manage coding agents automatically.
Do not turn telemetry expansion into an absolute agent benchmark.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
The PR starts with evidence that the expansion is justified.
New adapters or collectors have fixture-based tests and clear failure modes.
Missing telemetry remains null/empty rather than guessed.
Existing command-template configs continue to work unchanged.

Test Strategy

Spec tests for collector/adapter boundaries and non-goals.
Model tests for backwards-compatible result parsing.
Adapter tests for command-template compatibility and any new entrypoint.
Collector fixture tests for collected, partial, unavailable, and error states.
Report/export tests proving new telemetry remains local-artifact based.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Adapters and telemetry collectors affect schema, runner behavior, reports, and docs together. Shipping them as isolated micro-PRs would make it hard to verify that a new local artifact format is documented, parsed, reported, and exported consistently.

Capability Epic F: Local E2E CI Smoke And Test Taxonomy

Goal

Add a clearly named local-e2e smoke layer before later feature work so CI proves that the installed CLI can complete the main local artifact workflow, not only unit-level behavior.

Scope

Use docs/local-e2e-ci.md as the source spec for the local-e2e smoke workflow and test taxonomy.
Keep the smoke local artifact-based: fixture repository, fake local agent, local config files, local run artifacts, and no hosted services.
Use no real external coding agent; the smoke must use a fake local agent controlled by the test fixture.
Exercise the installed CLI through context-eval run, context-eval report, context-eval export, and context-eval ui.
Verify generated results.jsonl, run_manifest.json, report.md, summary.csv, summary.json, and context-eval-ui.html.
Run the smoke as a separate local-e2e CI job using the local_e2e pytest marker; keep that marker excluded from the default pytest matrix.
Keep existing unit, integration, skill validation, and package-build gates.
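
The artifact verification step might be expressed as a helper like the one below. The artifact names are the ones listed in the scope; the missing_artifacts function and the assumption that all artifacts land in a single output directory are illustrative only.

```python
from pathlib import Path

# Artifact names listed in the scope above.
EXPECTED_ARTIFACTS = (
    "results.jsonl",
    "run_manifest.json",
    "report.md",
    "summary.csv",
    "summary.json",
    "context-eval-ui.html",
)


def missing_artifacts(output_dir: Path) -> list[str]:
    """Return expected artifact names that are absent or empty under output_dir."""
    missing = []
    for name in EXPECTED_ARTIFACTS:
        path = output_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A smoke test could assert that missing_artifacts(output_dir) is an empty list after the four CLI commands have run.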

Non-Goals

Do not run a real external coding agent.
Do not install agents automatically.
Do not call network services or hosted services.
Do not add an LLM judge.
Do not turn the smoke into a benchmark or leaderboard.
Do not make Playwright browser automation required in the first local-e2e PR; keep it as an optional follow-up for UI-heavy changes.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
The PR starts from docs/local-e2e-ci.md and failing contract tests.
CI exposes a clearly named local-e2e smoke path.
The smoke validates the installed CLI against a fixture repository and fake local agent.
The smoke records and inspects local artifacts only.
Existing CI jobs remain green on Windows and Linux.

Test Strategy

Spec tests for docs/local-e2e-ci.md and this development-plan epic.
Subprocess CLI smoke tests using a temporary fixture repository.
Artifact assertions for results, manifest, report, export, and static UI.
CI workflow tests proving the local-e2e smoke is named and wired.
Default pytest excludes local_e2e; run python -m pytest tests/test_local_e2e_smoke.py -m local_e2e for the installed CLI smoke path.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

This change cuts across test taxonomy, CLI workflow, generated artifacts, CI configuration, and developer documentation. Keeping those pieces together makes the new gate reviewable and prevents a half-wired smoke test from becoming a silent maintenance burden.

Capability Epic G: Release Candidate Install Smoke And Changelog Finalization

Goal

Prove the 0.1.0 release candidate can be installed from built package artifacts and used through the installed CLI before maintainers tag or publish anything.

Scope

Use docs/release-candidate-install-smoke.md as the source spec for the release candidate install smoke.
Run release validation from a clean archive or clean checkout when live local artifacts such as .context-eval/, dist/, or context_eval.egg-info/ would block release-state checks.
Build local wheel and sdist artifacts, inspect their runtime package scope, install the built wheel into a temporary Python environment, and run one local fixture repository workflow with a fake local agent.
Keep the smoke local artifact-based: built package artifacts, local fixture repository, local config files, generated local run artifacts, and no hosted service calls.
Finalize README, release checklist, development-plan, and changelog language for the 0.1.0 release candidate path.
Stop at the manual publish checkpoint after CI and local smoke pass.

Non-Goals

Do not create Git tags automatically.
Do not publish or upload packages automatically.
Do not run against a live user repository.
Do not install or run a real external coding agent.
Do not call hosted services from the smoke workflow.
Do not add an LLM judge or benchmark/leaderboard language.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
prepare-release.py builds artifacts, inspects them, runs the release candidate install smoke, and then prints the manual publish checkpoint.
CI package-build runs the same consolidated preparation path.
The smoke installs the built package artifacts and exercises the installed CLI against a local fixture repository and fake local agent.
Generated smoke artifacts are local, parseable, and self-contained.
CHANGELOG and release docs name the release candidate path without implying automatic tag or publish behavior.

Test Strategy

Spec tests for docs/release-candidate-install-smoke.md and this development plan epic.
Script tests for dry-run planning, missing wheel failures, and prepare-release wiring.
CI workflow tests proving package-build uses the consolidated release preparation entrypoint and prepares runtime dependencies for the install smoke without editable-installing the project first.
Manual clean-archive verification before release: run python scripts/prepare-release.py --dist-dir <empty-dist> from the clean tree and confirm the manual publish checkpoint.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

This is the final release-candidate gate. Splitting docs, package build, artifact inspection, installed-CLI smoke, changelog finalization, and CI wiring would make it too easy to tag a commit that has only partial release evidence.

Capability Epic H: Agent Profiles And Noninteractive Agent Matrix

Goal

Make Codex CLI, Claude Code, traecli, Coco, and custom local commands first-class noninteractive agent profiles before building the full visual workflow.

Scope

Use docs/agent-profiles.md and the OpenSpec agent-profiles-local-app change as the source specs.
Preserve the existing single agent config as a backwards-compatible implicit profile.
Add a new agents profile map for codex-cli, claude-code, traecli, coco, and custom profile kinds.
Validate command template variables and provide rendered command previews before an agent process starts.
Add optional side-effect-free executable availability checks for configured agent commands.
Expand run planning to agent x task x variant x trial and keep row ordering deterministic.
Record selected profile names in agent_name and keep artifacts case-local.
Keep reporting language scoped to local observations, not an absolute coding agent benchmark.
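
The agent x task x variant x trial expansion can be sketched with itertools.product. PlannedCase and plan_cases are hypothetical names, and sorting each axis is only one way to obtain a deterministic order; the runner may instead preserve config order.

```python
from itertools import product
from typing import NamedTuple


class PlannedCase(NamedTuple):
    """One row of the hypothetical run plan."""

    agent_name: str
    task_id: str
    variant: str
    trial: int


def plan_cases(
    agents: list[str], tasks: list[str], variants: list[str], trials: int
) -> list[PlannedCase]:
    """Expand agent x task x variant x trial in an order independent of input order."""
    return [
        PlannedCase(agent, task, variant, trial)
        for agent, task, variant, trial in product(
            sorted(agents), sorted(tasks), sorted(variants), range(1, trials + 1)
        )
    ]
```

Because the rightmost axis varies fastest, all trials of one agent, task, and variant combination sit next to each other in the plan.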

Non-Goals

Do not install Codex CLI, Claude Code, traecli, Coco, or any other coding agent.
Do not manage provider accounts, credentials, billing, or hosted APIs.
Do not add a local app server or frontend in this PR.
Do not add an LLM judge, automatic commits, or leaderboard language.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
Existing single-agent configs continue to work unchanged.
Mixed agent and agents config shapes fail with a clear diagnostic.
traecli commands such as traecli -p "{prompt}", Coco commands such as coco -y --query-timeout 10m --bash-tool-timeout 5m -p "{prompt}", and custom commands are supported through the command-template adapter.
validate-config --check-agents catches missing local agent executables without running agents, installers, validation commands, or workspaces.
Multi-agent runs produce deterministic manifests, result rows, logs, patches, and reports.
Fixture-backed local-e2e coverage proves a profile-map config can run through the installed CLI with a fake local agent.
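
Command template checks of this kind need nothing beyond the standard library. In the sketch below, the allow-list containing only prompt and the three function names are assumptions; the supported variables are defined by docs/agent-profiles.md. Nothing here starts an agent process.

```python
import shlex
import shutil
from string import Formatter

# Hypothetical allow-list; the real set of variables is defined by the adapter spec.
ALLOWED_VARIABLES = {"prompt"}


def unknown_variables(template: str) -> list[str]:
    """Return placeholder names in the template that are not in the allow-list."""
    names = {field for _, field, _, _ in Formatter().parse(template) if field}
    return sorted(names - ALLOWED_VARIABLES)


def render_preview(template: str, prompt: str) -> str:
    """Render the command for display only; nothing is executed."""
    return template.format(prompt=prompt)


def executable_available(template: str) -> bool:
    """Check that the first command token resolves on PATH without running it."""
    executable = shlex.split(template)[0]
    return shutil.which(executable) is not None
```

For example, unknown_variables applied to a template containing a misspelled placeholder returns that placeholder name, which a diagnostic can then report before any run starts.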

Test Strategy

Spec tests for docs/agent-profiles.md and OpenSpec requirements.
Model tests for agent compatibility, agents validation, and profile kind handling.
Adapter tests for supported and unknown command template variables.
Runner integration tests with fake local agents for multi-profile matrices.
CLI tests for optional agent executable checks.
Local-e2e smoke for init --agent-profiles plus a selected fake Coco profile.
Report/export/static UI tests proving agent summaries appear only when more than one agent_name exists.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Agent profiles affect config loading, adapter validation, runner planning, artifact naming, reporting, exports, and UI data. Reviewing those changes together keeps the multi-agent contract coherent and gives the later Web UI a stable backend model.

Capability Epic I: Frontend Build/Test/Acceptance Foundation

Goal

Add the frontend engineering foundation required for the future local app before server endpoints and full Web UI workflows are implemented. The repository should have a clear browser app build, test, and acceptance path that can be run locally and in CI.

Scope

Use docs/frontend-workflow.md and the OpenSpec frontend-tooling-foundation change as the source specs.
Add a frontend/ package using React + Vite + TypeScript for the planned local app shell.
Add Vitest unit/component tests for fixture-backed UI behavior.
Add Playwright browser acceptance for desktop and narrow viewports.
Add scripts/validate-frontend.py --install --install-browsers as the root validation command for frontend work.
Add a dedicated frontend validation CI job that runs the combined frontend gate without weakening existing Python, local-e2e, skill validation, or package-build gates.
Keep build output under frontend/dist until the later local app server change explicitly consumes it.

Non-Goals

Do not add local app server endpoints.
Do not implement the complete non-technical Web UI workflow.
Do not replace static HTML export mode.
Do not make Node or npm a runtime requirement for existing CLI users.
Do not add a launcher, hosted dashboard, remote database, LLM judge, automatic agent installation, automatic commits, or leaderboard language.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, and verification.
frontend/ has typecheck, test, build, e2e, and validate scripts.
Browser acceptance verifies a deterministic local app shell at desktop and narrow viewports.
CI exposes a clearly named frontend validation job.
Python package metadata does not include frontend build assets until a later server change consumes them.

Test Strategy

Spec tests for docs/frontend-workflow.md and the OpenSpec change.
Pytest contract tests for frontend package scripts, CI job, validation wrapper, .gitignore, and package-data boundaries.
Vitest unit/component tests for UI-visible fixture behavior.
Playwright browser checks for desktop and narrow viewports.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Frontend build, tests, acceptance, docs, and CI are one enablement layer. Landing them together gives the later local app server and full UI work a stable quality gate instead of creating a half-wired test harness during product implementation.

Capability Epic J: Local App Server And Run Orchestration

Goal

Add an explicit local app mode that can save local evaluation files, run side-effect-free preflight checks, start local evaluations, stream progress, and inspect artifacts without requiring direct CLI use for the main workflow.

Scope

Use docs/local-app-workflow.md and the OpenSpec agent-profiles-local-app change as the source specs.
Add an explicit local app/server command separate from static HTML export.
Bind the local server to loopback by default and keep data file-based.
Expose API surfaces for config save/load, preflight, run planning, run lifecycle, log streaming, artifact reads, and exports.
The API contract must name the run lifecycle explicitly, including planning, start, active status, stop, completion, and artifact inspection.
Reuse existing config validation, runner, reporting, export, and artifact modules instead of duplicating execution logic.
Preserve static UI mode as offline, self-contained, and unable to run agents or write local files.

Non-Goals

Do not add the final polished frontend in this PR.
Do not add hosted services, remote databases, shared accounts, or remote run orchestration.
Do not install coding agents or project dependencies automatically.
Do not run validation commands during side-effect-free preflight.
Do not create commits, tags, or package publishes.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, verification.
Static UI and local app mode have clear separate behavior in docs and code.
API endpoints validate local paths and reject traversal outside allowed evaluation or artifact roots.
Preflight catches config, path, Git ref, overlay, prompt template, command variable, optional executable, and output writability problems without starting agents.
Run lifecycle endpoints write the same local artifacts as CLI runs.
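
Traversal rejection for API paths can be sketched with pathlib. The resolve_within function is a hypothetical helper, and a single allowed root is assumed here for brevity; the real server may allow separate evaluation and artifact roots.

```python
from pathlib import Path


def resolve_within(root: Path, requested: str) -> Path:
    """Resolve a requested path and reject anything outside the allowed root."""
    root = root.resolve()
    # Resolving first collapses ".." segments and follows symlinks before the check.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"{requested!r} escapes {root}")
    return candidate
```

An absolute requested path replaces the root when joined, so it is rejected by the same check unless it already points inside the root.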

Test Strategy

Spec tests for docs/local-app-workflow.md.
Unit tests for path safety, config round-tripping, and run-plan generation.
API tests for every local app endpoint and failure path.
Local-e2e smoke using fake local agents and local artifacts only.
Browser verification only for any UI included in this PR.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

Local app mode is a new execution boundary. Config writes, preflight, orchestration, log streaming, and artifact reads need to be reviewed as one local safety model rather than scattered across unrelated PRs.

Capability Epic K: Full Web UI Workflow For Non-Technical Users

Goal

Build a complete browser workflow for non-technical users across installation handoff, startup, repo setup, task configuration, evaluation criteria, preflight, run control, validation review, and result exploration.

Scope

Use the local app API from Capability J as the only execution surface.
Provide first-run setup for evaluation workspace and target repo selection.
Provide visual editors for tasks, variants, overlays, agent profiles, validation commands, timeouts, trials, jobs, cleanup policy, and output path.
Present the first full workflow in Chinese, including visible controls, status text, errors, empty states, preflight labels, run labels, result labels, and export labels.
Save context-eval.yaml and tasks.yaml through the server API, then reload from disk so browser state cannot mask failed writes; this is the save-reload proof for the first focused slice.
Show matrix preview before runs and require explicit confirmation before local agent execution.
Show run progress, active case identity, log tails, stop controls, timeout status, and failure state.
Show results from local artifacts: matrix overview, variant summaries, agent summaries, risk signals, validation output, patches, touched paths, and exports.
Keep static UI export available for offline sharing of completed run views.
Document the minimal harness-readiness reference from Skill Hub without copying its repository structure or installing its assets.
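The matrix preview in the scope above could be computed without starting any agent. The sketch below assumes the matrix is the cross product of tasks, variants, agents, and trials; PlannedCase and preview_matrix are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PlannedCase:
    """One hypothetical cell of the run matrix."""
    task: str
    variant: str
    agent: str
    trial: int


def preview_matrix(tasks, variants, agents, trials):
    """Expand the run matrix for display; nothing is executed here."""
    return [
        PlannedCase(task, variant, agent, trial)
        for task, variant, agent, trial in product(
            tasks, variants, agents, range(1, trials + 1)
        )
    ]
```

For example, two tasks, two variants, one agent, and three trials would preview 12 cases, which the UI can show before asking for explicit confirmation.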

Non-Goals

Do not add a hosted service, multi-user collaboration, remote sharing, or remote database.
Do not add real external coding-agent CI smoke in the first full UI PR.
Do not hide validation uncertainty or imply correctness without validation commands and human review.
Do not add decorative marketing pages instead of the actual app workflow.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, and verification.
A non-technical user can complete the workflow visually with the fixture repo and fake local agent.
Text and controls fit across desktop and narrow browser viewports.
Saving config and task YAML preserves unknown fields and reloads the parsed disk state.
UI copy distinguishes local observations from benchmark claims.
Browser verification covers setup, config editing, preflight, run start/stop, results, and exports.
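The save-and-reload criterion can be illustrated with a small sketch. To stay self-contained it uses JSON from the standard library as a stand-in for the YAML files; the real files are context-eval.yaml and tasks.yaml, and the YAML library the project uses is not assumed here. The function name save_and_reload is hypothetical, and the merge is shallow (top-level keys only).

```python
import json
from pathlib import Path


def save_and_reload(path: Path, edits: dict) -> dict:
    """Merge edits into the document on disk, write it, and return the reparsed state."""
    on_disk = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Overwrite only the edited top-level keys so fields the UI does not
    # know about survive the round trip.
    merged = {**on_disk, **edits}
    path.write_text(json.dumps(merged, indent=2), encoding="utf-8")
    # Return what the parser sees on disk, not the in-memory object, so a
    # failed or partial write cannot be masked by browser or server state.
    return json.loads(path.read_text(encoding="utf-8"))
```

The key design point is the final read: the caller only ever sees state that was actually parsed from disk after the write.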

Test Strategy

UI contract tests for view routing, controls, disabled states, and copy.
API-backed integration tests using local fixture data.
Browser tests across desktop and narrow viewports.
Save/reload tests proving config and task YAML are reparsed from disk while preserving unknown fields.
Pixel/screenshot checks for layout regressions where practical.
Local-e2e smoke with fake local agent through the visual flow.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

The user-facing Web UI is one coherent workflow. Splitting setup, config, preflight, execution, and results into separate merge packages would create intermediate states where non-technical users still cannot complete the job.

Capability Epic L: No-CLI Launcher And Packaging

Goal

Make the stable local app workflow startable after installation without requiring users to type a command in a terminal.

Scope

Decide the launcher and packaging approach after Capabilities H-K are stable.
Start the local app server and open the browser automatically for the user.
Use the installed context-eval-app script as the shortcut target for the first packaged startup path.
Show startup diagnostics and log location when launch fails.
Document installation, startup, upgrade, logs, and recovery for non-technical users.
Keep release automation stopped at the existing manual tag and publish boundary unless a later release spec changes that.
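The startup and diagnostics items above could be sketched roughly as follows. This is an assumption-laden illustration: wait_for_server and open_when_ready are hypothetical names, the server is assumed to listen on a loopback TCP port, and the log path is whatever the launcher chooses. Only standard-library calls are used.

```python
import socket
import time
import webbrowser


def wait_for_server(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)
    return False


def open_when_ready(host, port, log_path, timeout=10.0, open_browser=webbrowser.open):
    """Open the browser once the server answers; otherwise return a diagnostic message."""
    if wait_for_server(host, port, timeout):
        open_browser(f"http://{host}:{port}/")
        return None
    return (
        f"Local app did not start on {host}:{port} within {timeout:g}s. "
        f"See log: {log_path}"
    )
```

Passing open_browser as a parameter keeps the sketch testable without launching a real browser; a launcher would show the returned message to the user when it is not None.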

Non-Goals

Do not add auto-update infrastructure in the first launcher PR.
Do not package external coding agents.
Do not manage provider credentials or local agent login state.
Do not publish packages, create tags, or push releases automatically.

Merge Acceptance Criteria

The capability PR includes spec, tests, implementation, docs, and verification.
A user can launch the local app from a packaged shortcut target without typing a command after installation.
Startup failures are visible with actionable local diagnostics.
Existing CLI and local app server entrypoints continue to work for technical users.
Release docs make the manual publish boundary explicit.

Test Strategy

Spec tests for launcher boundaries and non-goals.
Scripted smoke tests for startup success and visible startup failure where practical.
Manual verification for OS-specific launcher behavior until CI coverage is available.
Local app browser verification after launcher startup.
Full verification commands after each completed story and before the PR is marked ready.

Why One Capability PR

The launcher is a product packaging layer over the local app. It should land only after the app workflow is stable, and it should be reviewed with its install, startup, diagnostics, and release-boundary docs together.

Cross-Epic Quality Gates

After every completed story, run the local gates requested for this repository:

.\.venv\Scripts\python -m ruff check .
.\.venv\Scripts\python -m pytest --basetemp C:\tmp\context-eval-pytest
.\.venv\Scripts\context-eval validate-config --config examples/basic/context-eval.yaml
powershell -ExecutionPolicy Bypass -File scripts\validate-skills.ps1 -SkipExternal
.\.venv\Scripts\python scripts\validate-frontend.py --install --install-browsers
.\.venv\Scripts\python -m pytest tests\test_local_e2e_smoke.py -m local_e2e -q --basetemp C:\tmp\context-eval-local-e2e-pytest
git diff --check
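If the gates above are run from a script rather than by hand, a fail-fast runner might look like the sketch below. The function run_gates is hypothetical and not part of the repository; each gate is assumed to be a (name, argv) pair, with argv holding one of the commands listed above split into arguments.

```python
import subprocess


def run_gates(gates):
    """Run each (name, argv) gate in order and stop at the first failure.

    Returns the name of the failing gate, or None if every gate passed.
    """
    for name, argv in gates:
        completed = subprocess.run(argv)
        if completed.returncode != 0:
            return name
    return None
```

Stopping at the first failure keeps the output focused on the gate that needs fixing before the remaining checks are rerun.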

Before a capability PR is marked ready, confirm CI status and fix failing checks before asking for review.

Current Replanning Stories

US-069: Audit current development cadence and define larger capability PR policy.
US-070: Replan docs/development-plan.md into capability epics with acceptance criteria.
US-071: Document the new Ralph/SDD/TDD batching policy and changelog handoff.
US-072: Specify agent profiles and noninteractive agent matrix planning.
US-073: Specify local app server mode and full Web UI workflow.
US-074: Replan post-release product expansion around non-CLI usage.
