context-eval

Development Plan

This plan defines how context-eval should evolve using Spec-Driven Development (SDD) and Test-Driven Development (TDD) while avoiding a PR cadence that is too fine-grained for effective maintainer review.

The project has moved past its initial MVP foundation. Future work should be planned as capability PRs: each PR owns one coherent user-facing capability, can contain several Ralph stories, and must merge with a complete acceptance package.

Product Boundaries

Development Cadence Policy

A Ralph story is not a pull request. It is the smallest autonomous unit of work inside a capability PR. A capability PR should contain 3-6 related Ralph stories that share one capability boundary and can be reviewed as one product change.

Each story still follows SDD + TDD:

  1. Spec: update the relevant document in docs/ with the contract, edge cases, and non-goals.
  2. Tests: add or update tests that encode that contract before the implementation is treated as complete.
  3. Implementation: make the smallest code change needed for the story.
  4. Docs: update README, examples, or workflow docs when user behavior changes.
  5. Verification: run the agreed quality gates and inspect generated artifacts when the story affects reports, exports, prompts, workspaces, or UI.

Do not open one PR per story. Open or update the PR only when the capability has a coherent merge package: spec, tests, implementation, docs, verification. A story can still be committed separately inside the PR so review history remains clear.
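The cadence rule above can be sketched as a small readiness check. This is a minimal illustration only: CapabilityPR, REQUIRED_PARTS, and readiness_problems are hypothetical names invented for this sketch and are not part of context-eval. It assumes the 3-6 story guideline is treated as a hard bound, which the policy states only as a "should".

```python
from dataclasses import dataclass, field

# The five parts of a coherent merge package named in this plan.
REQUIRED_PARTS = ("spec", "tests", "implementation", "docs", "verification")


@dataclass
class CapabilityPR:
    """Hypothetical planning record; not a context-eval type."""

    name: str
    stories: list[str] = field(default_factory=list)
    completed_parts: set[str] = field(default_factory=set)


def readiness_problems(pr: CapabilityPR) -> list[str]:
    """Return reasons the PR should not be opened yet (empty list means ready)."""
    problems = []
    # Assumption: the 3-6 story guideline is enforced as a strict range here.
    if not 3 <= len(pr.stories) <= 6:
        problems.append(f"expected 3-6 stories, found {len(pr.stories)}")
    # Every part of the merge package must be present before opening the PR.
    for part in REQUIRED_PARTS:
        if part not in pr.completed_parts:
            problems.append(f"missing merge package part: {part}")
    return problems
```

A PR with a single story and only a spec would report six problems (one for the story count and one for each of the five missing or incomplete parts except the spec, so five in total), while a PR with three stories and all five parts would report none.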

Capability PRs should be split only when one of these is true:

Capability Audit: PR #1-#17

PR #1-#4 established broad capability slices: runner/config maturity, local UI config editing, agent telemetry, and local multi-agent comparison. Those PRs were easier to reason about because the spec, tests, implementation, docs, and verification all pointed at the same capability.

PR #5-#17 delivered useful work, but the cadence became too fine-grained. The work added Markdown agent summaries, compact JSON metadata, bounded jobs, cleanup policies, run manifests, prompt templates, package build checks, license metadata, platform docs, artifact inspection, release-state checks, development-plan reconciliation, and validation timeout defaults. Several of these were good stories, but they were often promoted to separate PRs before the surrounding capability was complete.

The strongest signal is release readiness, which was split across build, license, platform, artifact inspection, and release-state PRs. Those changes should have been reviewed as one release automation and packaging capability with separate commits and shared acceptance gates.

Future planning should batch related stories into coherent capability PRs. The goal is fewer PRs, stronger review context, less manual intervention, and no loss of SDD/TDD discipline.

  1. PR A: Config Diagnostics And Strict Validation Hardening.
  2. PR B: Local UI Persistence And Server-Mode Decision.
  3. PR C: Reporting Polish For Multi-Task, Multi-Variant, Multi-Agent Runs.
  4. PR D: Release Automation And Packaging Workflow Polish.
  5. PR E: Optional Adapter And Telemetry Expansion, only if justified by stable local artifact formats or repeated command-template friction.
  6. PR F: Local E2E CI Smoke And Test Taxonomy, before later feature work that depends on stronger workflow-level regression confidence.
  7. PR G: Release Candidate Install Smoke And Changelog Finalization, before tagging or publishing the first 0.1.0 release candidate.
  8. PR H: Agent Profiles And Noninteractive Agent Matrix, before full Web UI work. This unblocks Codex CLI, Claude Code, traecli, Coco, and custom commands as first-class local profiles.
  9. PR I: Frontend Build/Test/Acceptance Foundation, before local app server and full Web UI work depend on browser tooling.
  10. PR J: Local App Server And Run Orchestration, after agent profiles are stable. This creates the explicit local server mode behind the visual app.
  11. PR K: Full Web UI Workflow For Non-Technical Users, after the server API is stable enough to avoid duplicating runner logic in the frontend. The first focused slice is a Chinese config/tasks editor with save-reload proof, desktop/narrow browser acceptance, and a minimal harness-readiness reference.
  12. PR L: No-CLI Launcher And Packaging, after the local app workflow is stable and browser-verified.

Current active capability: coco-visual-hybrid-evaluation, after the local app server and frontend workflow are in place. This adds docs/coco-visual-hybrid-evaluation.md, kind: "coco", structured task authoring, expected outcomes, deterministic hard checks, optional soft evaluation payload generation, and local artifact review.

Capability Epic A: Config Diagnostics And Strict Validation Hardening

Goal

Make configuration failures actionable before users create workspaces or spend agent time. Users should see field-specific errors, path context, and strict validation failures that explain exactly what to fix.
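One way to picture a field-specific error with path context is a small diagnostic record. The sketch below is an assumption about shape, not the project's actual error model: ConfigDiagnostic, its attributes, and the example field path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigDiagnostic:
    """Hypothetical shape for one validation failure; not the real error model."""

    file: str        # config file the value was read from
    field_path: str  # dotted path to the offending field
    message: str     # what is wrong with the value
    hint: str        # what the user should change

    def render(self) -> str:
        """Format the diagnostic as a single actionable line."""
        return f"{self.file}: {self.field_path}: {self.message} (fix: {self.hint})"
```

For example, ConfigDiagnostic("context-eval.yaml", "agents[0].command", "value is empty", "set a command template").render() produces one line that names the file, the field, the problem, and the fix. The field path "agents[0].command" is illustrative and may not match the real schema.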

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

The stories all change the same user workflow: getting from YAML to a trusted preflight result. Splitting them into separate PRs would force reviewers to rebuild the same error model repeatedly and would risk docs/tests describing a partial validation contract.

Capability Epic B: Local UI Persistence And Server-Mode Decision

Goal

Decide and implement the next local UI persistence step without weakening the static UI safety contract. Users should know whether they are exporting YAML, using a browser file capability, or running an explicit local server mode.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Persistence is a product decision plus implementation. Splitting the decision, UI controls, server behavior, and docs across many PRs would create ambiguous intermediate states where users cannot tell whether the UI is export-only or safe to save.

Capability Epic C: Reporting Polish For Multi-Task, Multi-Variant, Multi-Agent Runs

Goal

Make reports, exports, terminal summaries, and the local UI easier to read for larger local run matrices while preserving the artifact-only and non-benchmark contract.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

The reporting surfaces share one aggregation contract. Reviewing them together prevents terminal output, Markdown, exports, and UI from drifting into slightly different interpretations of the same run artifacts.

Capability Epic D: Release Automation And Packaging Workflow Polish

Goal

Turn the current manual release checklist into a reproducible packaging workflow that catches local blockers and verifies artifacts without including maintainer capability library files in the runtime package.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Release work is only useful as an end-to-end gate. Splitting build checks, metadata, artifact inspection, changelog rules, and docs into isolated PRs increases manual coordination and can leave the release path half-automated.

Capability Epic E: Optional Adapter And Telemetry Expansion

Goal

Expand adapter or telemetry support only when there is evidence that the command-template adapter or generic JSON collector is causing repeated local workflow friction.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Adapters and telemetry collectors affect schema, runner behavior, reports, and docs together. Shipping them as isolated micro-PRs would make it hard to verify that a new local artifact format is documented, parsed, reported, and exported consistently.

Capability Epic F: Local E2E CI Smoke And Test Taxonomy

Goal

Add a clearly named local-e2e smoke layer before later feature work so CI proves that the installed CLI can complete the main local artifact workflow, not only unit-level behavior.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

This change cuts across test taxonomy, CLI workflow, generated artifacts, CI configuration, and developer documentation. Keeping those pieces together makes the new gate reviewable and prevents a half-wired smoke test from becoming a silent maintenance burden.

Capability Epic G: Release Candidate Install Smoke And Changelog Finalization

Goal

Prove the 0.1.0 release candidate can be installed from built package artifacts and used through the installed CLI before maintainers tag or publish anything.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

This is the final release-candidate gate. Splitting docs, package build, artifact inspection, installed-CLI smoke, changelog finalization, and CI wiring would make it too easy to tag a commit that has only partial release evidence.

Capability Epic H: Agent Profiles And Noninteractive Agent Matrix

Goal

Make Codex CLI, Claude Code, traecli, Coco, and custom local commands first-class noninteractive agent profiles before building the full visual workflow.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Agent profiles affect config loading, adapter validation, runner planning, artifact naming, reporting, exports, and UI data. Reviewing those changes together keeps the multi-agent contract coherent and gives the later Web UI a stable backend model.

Capability Epic I: Frontend Build/Test/Acceptance Foundation

Goal

Add the frontend engineering foundation required for the future local app before server endpoints and full Web UI workflows are implemented. The repository should have a clear browser app build, test, and acceptance path that can be run locally and in CI.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Frontend build, tests, acceptance, docs, and CI are one enablement layer. Landing them together gives the later local app server and full UI work a stable quality gate instead of creating a half-wired test harness during product implementation.

Capability Epic J: Local App Server And Run Orchestration

Goal

Add an explicit local app mode that can save local evaluation files, run side-effect-free preflight checks, start local evaluations, stream progress, and inspect artifacts without requiring direct CLI use for the main workflow.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

Local app mode is a new execution boundary. Config writes, preflight, orchestration, log streaming, and artifact reads need to be reviewed as one local safety model rather than scattered across unrelated PRs.

Capability Epic K: Full Web UI Workflow For Non-Technical Users

Goal

Build a complete browser workflow for non-technical users across installation handoff, startup, repo setup, task configuration, evaluation criteria, preflight, run control, validation review, and result exploration.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

The user-facing Web UI is one coherent workflow. Splitting setup, config, preflight, execution, and results into separate merge packages would create intermediate states where non-technical users still cannot complete the job.

Capability Epic L: No-CLI Launcher And Packaging

Goal

Make the stable local app workflow startable after installation without requiring users to type a command in a terminal.

Scope

Non-Goals

Merge Acceptance Criteria

Suggested Ralph Stories

Test Strategy

Why One Capability PR

The launcher is a product packaging layer over the local app. It should land only after the app workflow is stable, and it should be reviewed with its install, startup, diagnostics, and release-boundary docs together.

Cross-Epic Quality Gates

For every completed story, run the local gates requested for this repository:

.\.venv\Scripts\python -m ruff check .
.\.venv\Scripts\python -m pytest --basetemp C:\tmp\context-eval-pytest
.\.venv\Scripts\context-eval validate-config --config examples/basic/context-eval.yaml
powershell -ExecutionPolicy Bypass -File scripts\validate-skills.ps1 -SkipExternal
.\.venv\Scripts\python scripts\validate-frontend.py --install --install-browsers
.\.venv\Scripts\python -m pytest tests\test_local_e2e_smoke.py -m local_e2e -q --basetemp C:\tmp\context-eval-local-e2e-pytest
git diff --check

Before marking a capability PR ready for review, confirm CI status and fix any failing checks.
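The local gates listed above could be driven from one script so that every failure is reported in a single pass. The following is a minimal sketch, not an existing repository script: GATES and run_gates are hypothetical names, the commands are copied from the gate list, and the Windows-style paths assume the same checkout layout.

```python
import subprocess
from typing import Callable, Sequence

PY = r".\.venv\Scripts\python"

# Commands copied from the gate list in this plan, one argument list per gate.
GATES: list[list[str]] = [
    [PY, "-m", "ruff", "check", "."],
    [PY, "-m", "pytest", "--basetemp", r"C:\tmp\context-eval-pytest"],
    [r".\.venv\Scripts\context-eval", "validate-config",
     "--config", "examples/basic/context-eval.yaml"],
    ["powershell", "-ExecutionPolicy", "Bypass",
     "-File", r"scripts\validate-skills.ps1", "-SkipExternal"],
    [PY, r"scripts\validate-frontend.py", "--install", "--install-browsers"],
    [PY, "-m", "pytest", r"tests\test_local_e2e_smoke.py", "-m", "local_e2e",
     "-q", "--basetemp", r"C:\tmp\context-eval-local-e2e-pytest"],
    ["git", "diff", "--check"],
]


def run_gates(
    gates: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> list[str]:
    """Run every gate in order and return the command lines that failed."""
    failed = []
    for cmd in gates:
        # A non-zero exit code marks the gate as failed; later gates still run.
        if run(cmd) != 0:
            failed.append(" ".join(cmd))
    return failed
```

Calling run_gates(GATES) from the repository root returns an empty list when every gate passes. The sketch deliberately continues after a failure so one run surfaces all broken gates; stopping at the first failure would be an equally reasonable choice. The run parameter exists only so the function can be exercised without executing real commands.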

Current Replanning Stories