This spec defines the planned local web application mode for users who should not need to operate context-eval through the command line.
The existing context-eval ui command generates a static, self-contained HTML
page. That mode remains useful for offline inspection and export-only config
editing. The new local app mode is a separate explicit mode that can save local
configuration, run preflight checks, start evaluations, stream progress, and
show results from local artifacts.
A non-technical user can open context-eval as a local app and complete the full workflow visually:
The app remains local-only. It runs on the user’s machine, reads and writes only explicit local project files and run artifacts, and does not create hosted dashboards, shared accounts, remote databases, or automatic commits.
context-eval should expose two UI modes:
Mode boundaries must be visible in UI copy and docs. Static UI must stay safe for offline sharing. Local app mode must make local writes and agent execution explicit before they happen.
The development implementation starts with:
context-eval app
The packaged startup entry point is:
context-eval-app --workspace my-eval --config context-eval.yaml
context-eval-app is the shortcut target for installers or a pinned desktop
shortcut. It starts the local server, opens the browser automatically, and
writes a local app launcher log under
my-eval/.context-eval/logs/local-app-launcher.log. The launcher must not hide
errors that prevent the server from starting; it should show startup diagnostics
and the log location.
The frontend build, test, and browser acceptance workflow for this app is
documented in docs/frontend-workflow.md. Maintainers should run
python scripts\validate-frontend.py --install --install-browsers when working
on the local app frontend.
The first launcher packaging step stays inside the existing Python package. It
adds the context-eval-app console script as the target that a future Windows
shortcut, Start Menu entry, or lightweight installer can call. The launcher is
not a hosted dashboard and not a separate desktop runtime.
Startup behavior:
context-eval app;
--no-browser is supplied.
Recovery:
.context-eval/logs/local-app-launcher.log inside the selected workspace;
--no-browser if the browser handoff is the only failing step;
--port 0 if the default port is already in use.
Installed package smoke tests use the same launcher entry point with
--check-startup:
context-eval-app --workspace my-eval --config context-eval.yaml --no-browser --port 0 --check-startup
This startup preflight validates launcher inputs, writes the local launcher log, and exits without opening a browser, starting the blocking server loop, running agents, or running validation commands.
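The launcher flags named above can be sketched as an argparse parser. This is a minimal illustration, not the shipped launcher: build_launcher_parser, plan_startup, the default port 8765, and treating --workspace as required are assumptions made for the example; only the flag names come from this spec.

```python
import argparse


def build_launcher_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the launcher flags named in this spec."""
    parser = argparse.ArgumentParser(prog="context-eval-app")
    parser.add_argument("--workspace", required=True, help="evaluation workspace root")
    parser.add_argument("--config", default="context-eval.yaml", help="config file to load")
    parser.add_argument("--no-browser", action="store_true", help="skip the browser handoff")
    # 8765 is an assumed default; 0 asks the operating system for a free port.
    parser.add_argument("--port", type=int, default=8765)
    parser.add_argument("--check-startup", action="store_true",
                        help="validate inputs, write the launcher log, then exit")
    return parser


def plan_startup(argv: list[str]) -> dict:
    """Turn parsed flags into the decisions the launcher has to make."""
    args = build_launcher_parser().parse_args(argv)
    return {
        "workspace": args.workspace,
        "config": args.config,
        "port": args.port,
        # --check-startup never opens a browser or enters the server loop.
        "open_browser": not (args.no_browser or args.check_startup),
        "start_server": not args.check_startup,
    }
```

Keeping the decision step separate from argument parsing makes the --check-startup path easy to exercise in an installed-package smoke test without starting anything.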
Packaging boundaries:
The first no-manual-CLI Windows package is a portable zip built from the release wheel and frontend build output:
python scripts/build-windows-portable.py --dist-dir C:\tmp\context-eval-dist --frontend-dist frontend\dist --output-dir C:\tmp\context-eval-dist
The archive is named context-eval-windows-x64-<version>.zip and contains
Start Context Eval.cmd, scripts/start-context-eval.ps1, a bundled
wheelhouse, frontend/dist, a package-local workspace, and a README. The
double-click path creates or reuses a private .venv, installs from the
bundled wheelhouse with no package index, starts context-eval-app with
--frontend-dist, opens the browser, and keeps all generated files under the
portable package directory. Release builds bundle Windows dependency wheels for
Python 3.11, 3.12, and 3.13 by default so the private .venv install remains
offline for those supported runtimes.
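The offline install step can be illustrated with the two commands it implies. The real package drives this from scripts/start-context-eval.ps1; the Python sketch below only shows the shape of the commands. The function name, the wheelhouse directory name, and the distribution name context-eval are assumptions. The pip options --no-index and --find-links are standard pip flags.

```python
import sys
from pathlib import Path


def offline_install_commands(package_dir: Path, python: str = sys.executable) -> list[list[str]]:
    """Build (but do not run) the commands for a private, offline .venv install."""
    venv_dir = package_dir / ".venv"
    # On Windows, venv places the interpreter under Scripts\python.exe.
    venv_python = venv_dir / "Scripts" / "python.exe"
    wheelhouse = package_dir / "wheelhouse"  # assumed directory name
    return [
        [python, "-m", "venv", str(venv_dir)],
        [
            str(venv_python), "-m", "pip", "install",
            "--no-index",                     # never contact a package index
            "--find-links", str(wheelhouse),  # resolve only from bundled wheels
            "context-eval",                   # assumed distribution name
        ],
    ]
```

Because --no-index disables every index lookup, a missing wheel fails the install immediately instead of silently reaching the network.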
On first launch, an empty portable workspace must not pretend that
./fixture-repo exists. The local app starts in a first-run setup state with
two explicit choices:
context-eval.yaml, tasks.yaml, context overlays, a fake
local agent, validation command, hard-evaluation checks, and JSON telemetry.
The demo produces a two-variant matrix where baseline and experiment can be
compared without requiring Coco or another external coding agent.
The local app must support:
context-eval.yaml;
context-eval.yaml and tasks.yaml with validation before write and
then reloading through the server API to prove the disk state changed;
repo.path, repo.base_ref, tasks, variants, overlays, agent
profiles, trials, jobs, cleanup policy, and output directory;
All writes must show destination paths and must not silently overwrite unrelated files.
The first full Web configuration editor is Chinese-first. Headings, buttons, status text, errors, empty states, preflight labels, run labels, result labels, and export labels should be visible in Chinese. Code identifiers, file names, YAML keys, artifact names, and API fields can remain English.
The current visual case editor lets users select a task and edit the common
case-authoring fields without touching YAML: task ID, title, prompt, category,
difficulty, expected-outcome summary, acceptance points, expected files,
validation commands, command-based hard checks, soft review rubric, and visible
context variant associations. Saving this form writes tasks.yaml, reloads the
server-side config, and refreshes the run plan before the user starts a run.
The raw YAML view stays available as an advanced folded view for fields that do
not yet have visual controls.
The next configuration-editor slice adds structured controls for the rest of the
basic evaluation workspace setup. Users can edit context variant names,
descriptions, and overlay source/target pairs; edit agent profile name, kind,
command, timeout, and network; add, copy, or delete variants and agents with a
confirmation before deletion; and choose the exact task, variant, and agent
scope for the next run. Saving the structured form writes both
context-eval.yaml and tasks.yaml, preserves unknown config fields such as
agent telemetry, reloads through the local server, and refreshes the run plan so
the selected scope case count is visible before agent execution. Raw YAML
remains folded under the advanced view for escape-hatch edits, not as the main
configuration path.
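One way to preserve unknown config fields is to overlay the edited values on the mapping that was loaded from disk rather than rebuilding it from the form. The sketch below is an assumption about how that could work, not the app's actual save path; merge_editable and the example field values are hypothetical.

```python
import copy


def merge_editable(existing: dict, edited: dict) -> dict:
    """Overlay edited keys on the loaded mapping and keep keys the form does not know."""
    merged = copy.deepcopy(existing)
    for key, value in edited.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so nested unknown keys survive as well.
            merged[key] = merge_editable(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# An agent profile as loaded from disk, including a field the form does not edit.
loaded_agent = {"name": "coco", "command": "old-command", "telemetry": {"format": "json"}}
# The values submitted by the structured form.
form_agent = {"name": "coco", "command": "new-command", "timeout": 600}

saved_agent = merge_editable(loaded_agent, form_agent)
```

In this sketch lists are replaced wholesale and keys removed in the form are not deleted, so list-valued sections such as variants or agents would need a merge keyed by name, and deletions would need to be handled explicitly.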
The Coco-first visual authoring slice is specified in
docs/coco-visual-hybrid-evaluation.md. It extends this workflow with Project,
Coco Agent, Context Variants, Tasks, Expected Outcome, Hard Evaluation, Soft
Evaluation, Run Plan, Run Execution, and Results sections. The app must keep
these controls local-only: structured authoring may save context-eval.yaml
and tasks.yaml, but agent execution still requires explicit run confirmation.
Task editing may start as a safe tasks.yaml editor if a full structured task
form would make the PR too large. In that mode, users can edit IDs, titles,
prompts, categories, difficulty, ordering, additions, deletions, and unknown
task fields directly in YAML. The server must validate and reparse the saved
task file after the write; it must not silently drop unknown fields.
The local app may add subtle motion for hover, focus, active, loading, progress,
and log-update states. Motion must not affect readability or narrow layouts,
and CSS must honor prefers-reduced-motion.
Users must be able to configure acceptance checks visually:
The UI should treat validation commands as the user’s project-specific acceptance criteria. It must not imply that context-eval proves correctness without validation commands or human review.
Preflight is separate from run execution. It should check:
Preflight must not run agent commands, install dependencies, run validation commands, call hosted services, or create commits.
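Two of the preflight checks can be sketched without any side effects. The example below is illustrative only: the function name and the result dictionary shape are assumptions. It looks up the agent executable on PATH without running it and probes the nearest existing ancestor of the output directory so that nothing is created.

```python
import os
import shutil
from pathlib import Path


def preflight(agent_executable: str, output_dir: Path) -> list[dict]:
    """Run read-only checks; never execute the agent or create directories."""
    checks = []

    # shutil.which only searches PATH; it does not start the program.
    found = shutil.which(agent_executable)
    checks.append({
        "check": "agent_executable",
        "ok": found is not None,
        "detail": found or "not found on PATH",
    })

    # Walk up to the nearest existing ancestor instead of creating output_dir.
    probe = output_dir
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    writable = probe.is_dir() and os.access(probe, os.W_OK)
    checks.append({"check": "output_dir_writable", "ok": writable, "detail": str(probe)})

    return checks
```

os.access is only a hint on some platforms, notably Windows, so a real implementation may need a more careful writability test that still avoids leaving files behind.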
The local app can start a run only after the user reviews the run plan and confirms local agent execution. While a run is active, the UI should show:
The runner should continue writing the same local artifacts used by the CLI:
run_metadata.json, run_manifest.json, results.jsonl, logs, prompts,
patches, workspaces, and reports.
The results view should read local run artifacts and show:
agent_name exists;
All results are local observations for the selected repo, tasks, context variants, agents, and validation commands. The UI must not present them as an absolute coding-agent leaderboard.
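Reading results.jsonl for the results view could look like the sketch below. It assumes one JSON object per line and treats agent_name as optional; the function names and the placeholder label for rows without agent_name are hypothetical, and no other result fields are assumed.

```python
import json
from collections import Counter
from pathlib import Path


def load_results(run_dir: Path) -> list[dict]:
    """Read results.jsonl from a run directory, one JSON object per non-empty line."""
    rows = []
    with (run_dir / "results.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def count_by_agent(rows: list[dict]) -> Counter:
    """Count cases per agent, grouping rows without agent_name under a placeholder."""
    return Counter(row.get("agent_name", "(no agent_name)") for row in rows)
```

The counts describe one local run only, which matches the rule that results are local observations and not a leaderboard.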
The local server API should be private to the local app and bind to loopback by
default. The first implementation exposes JSON endpoints under /api/ and
serves the built frontend from frontend/dist when it is available. The server
has one evaluation workspace root. Config writes, output directories, run
artifact roots, and artifact-relative reads must resolve inside that workspace
root and must reject path traversal such as a ".." segment.
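A loopback-only JSON endpoint can be sketched with the standard library. This is not a statement about how the local server is implemented: the handler class, make_server, and the payload field names are assumptions for illustration. The point it shows is binding to 127.0.0.1 instead of all interfaces.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers GET /api/health with a small JSON body."""

    def do_GET(self):
        if self.path != "/api/health":
            self.send_error(404)
            return
        body = json.dumps({
            "loopback": True,
            "workspace_root": self.server.workspace_root,
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real server would log locally


def make_server(workspace_root: str, port: int = 0) -> ThreadingHTTPServer:
    """Bind to the loopback interface only; port 0 lets the OS choose a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    server.workspace_root = workspace_root  # one workspace root per server
    return server
```

After make_server returns, server.server_address reports the port that was actually bound, which is what a launcher would use to build the browser URL when started with port 0.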
Edited config_path, tasks_path, output_dir, overlay source, and overlay
target values are part of the write-safety boundary. config_path,
tasks_path, output_dir, and overlay source must remain inside the local
app workspace. Overlay target must be a safe relative target path and must not
be absolute or contain traversal.
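The write-safety boundary can be expressed as two small checks. The sketch below is one possible approach under stated assumptions, not the server's actual code; resolve_inside and check_overlay_target are hypothetical names. Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath


def resolve_inside(workspace_root: Path, candidate: str) -> Path:
    """Resolve candidate against the workspace root and reject anything that escapes it."""
    root = workspace_root.resolve()
    # Joining an absolute candidate discards root, so it fails the check below.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise ValueError(f"path escapes the workspace root: {candidate}")
    return resolved


def check_overlay_target(target: str) -> str:
    """Accept only a relative target without traversal; return it with forward slashes."""
    posix = PurePosixPath(target.replace("\\", "/"))
    if not posix.parts:
        raise ValueError("overlay target is empty")
    if posix.is_absolute() or PureWindowsPath(target).drive:
        raise ValueError(f"overlay target must be relative: {target}")
    if ".." in posix.parts:
        raise ValueError(f"overlay target contains traversal: {target}")
    return posix.as_posix()
```

Because resolve() follows symbolic links, a link inside the workspace that points outside it is rejected as well. The overlay target check is purely lexical, since the target is a path inside a run workspace that may not exist yet.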
The local app API endpoints are:
GET /api/health: report loopback mode, workspace root, frontend mode, and
  optional startup config path.
POST /api/config/load: load context-eval.yaml plus its task file and
  return raw YAML, editable model data, resolved paths, and destination paths.
POST /api/config/save: validate and save context-eval.yaml and
  tasks.yaml to explicit destinations inside the workspace root while
  preserving raw YAML fields that the UI does not edit.
POST /api/config/save-editable: validate the browser editable model,
  regenerate and save tasks.yaml, preserve the existing context-eval.yaml
  text, reload the config, and return the refreshed editable model plus raw
  YAML. This endpoint is for the visual case editor path.
POST /api/preflight: run side-effect-free validation for schema, task IDs,
  Git refs, overlay paths, prompt templates, command variables, optional agent
  executable availability, and output directory writability.
POST /api/run-plan: return the selected agent x task x variant x trial
  matrix, command previews, cleanup policy, jobs, and output directory before
  any agent command can run.
POST /api/runs: start a local run only when the request includes explicit
  confirmation.
GET /api/runs/{id}: return run lifecycle state, run directory, progress
  counts, failure details, and result summary when available.
POST /api/runs/{id}/stop: record an explicit stop request and report the
  current cleanup behavior.
GET /api/runs/{id}/logs: return local console output and available stdout
  or stderr log tails for the run.
GET /api/results?run_dir=...: read local results.jsonl and
  run_metadata.json and return matrix overview, risk signals, cases, and
  summaries.
GET /api/artifacts?run_dir=...&path=...: read a safe relative artifact path
  under the selected run directory.
GET /api/exports?run_dir=...&format=csv|json|markdown|html: produce exports
  from local run artifacts without rerunning agents.
Endpoints must avoid running shell commands except through strict config preflight checks and the existing runner after the user confirms execution. Preflight must not create run directories, install dependencies, run validation commands, run agent commands, or create commits.
Config save must also be execution-free. Saving YAML must not run agent commands, install dependencies, run validation commands, create commits, or create run workspaces.
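The run-plan matrix and the confirmation gate for POST /api/runs can be sketched together. The function names, the dictionary shapes, and the request field confirm_local_execution are assumptions for illustration; the spec only requires that the matrix covers agent x task x variant x trial and that a run starts only with explicit confirmation.

```python
from itertools import product


def build_run_plan(agents: list[str], tasks: list[str],
                   variants: list[str], trials: int) -> list[dict]:
    """Expand the selected scope into one case per agent, task, variant, and trial."""
    return [
        {"agent": agent, "task": task, "variant": variant, "trial": trial}
        for agent, task, variant, trial in product(
            agents, tasks, variants, range(1, trials + 1)
        )
    ]


def start_run(request: dict, plan: list[dict]) -> dict:
    """Accept a run only when the request carries an explicit boolean confirmation."""
    # confirm_local_execution is a hypothetical field name.
    if request.get("confirm_local_execution") is not True:
        return {
            "status": "rejected",
            "reason": "explicit confirmation required",
            "cases": len(plan),
        }
    # A real handler would hand the plan to the existing runner here.
    return {"status": "accepted", "cases": len(plan)}
```

Checking for the literal boolean True, instead of any truthy value, keeps a stray string such as "false" from being read as consent.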
This phase uses https://github.com/JasonxzWen/skill-hub as a selective
reference and maintainer-skill source, not as runtime app code. The inspected
reference commit for this PR is
42c3065378e1d1d2851ca0e387e915a2841b885e.
Useful patterns to borrow:
build, test, validate, and release-validation gates;
Out of scope for this repository phase:
context_eval runtime;
This capability does not add a hosted service, multi-user dashboard, remote database, remote sharing, provider billing, automatic agent installation, automatic commits, issue mining, real network isolation, or an LLM judge.
docs/frontend-workflow.md and
scripts\validate-frontend.py --install --install-browsers before server and
full Web UI changes depend on frontend tooling.