context-eval

Local App Workflow And Full Web UI

This spec defines the planned local web application mode for users who should not need to operate context-eval through the command line.

The existing context-eval ui command generates a static, self-contained HTML page. That mode remains useful for offline inspection and export-only config editing. The new local app mode is a separate explicit mode that can save local configuration, run preflight checks, start evaluations, stream progress, and show results from local artifacts.

User Contract

A non-technical user can open context-eval as a local app and complete the full workflow visually:

  1. Install or launch the app.
  2. Choose a target repository and evaluation workspace.
  3. Configure tasks, context variants, agent profiles, and evaluation criteria.
  4. Run preflight checks before spending agent time.
  5. Start, monitor, stop, or retry local evaluation runs.
  6. Review validation results, patches, logs, risk signals, and reports.
  7. Export CSV, compact JSON, Markdown, or static HTML summaries.

The app remains local-only. It runs on the user’s machine, reads and writes only explicit local project files and run artifacts, and does not create hosted dashboards, shared accounts, remote databases, or automatic commits.

Modes

context-eval should expose two UI modes: the static UI generated by context-eval ui, and the local app mode started by context-eval app.

Mode boundaries must be visible in UI copy and docs. Static UI must stay safe for offline sharing. Local app mode must make local writes and agent execution explicit before they happen.

Installation And Startup

The development implementation starts with:

context-eval app

The packaged startup entry point is:

context-eval-app --workspace my-eval --config context-eval.yaml

context-eval-app is the shortcut target for installers or a pinned desktop shortcut. It starts the local server, opens the browser automatically, and writes a launcher log inside the workspace; for the my-eval workspace above, that is my-eval/.context-eval/logs/local-app-launcher.log. The launcher must not hide errors that prevent the server from starting; it should show startup diagnostics and the log location.
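The log location and the "do not hide startup errors" rule can be sketched as follows. This is a minimal illustration, not the launcher's actual code: the function names and the logger name are hypothetical, and only the .context-eval/logs/local-app-launcher.log path comes from this spec.

```python
import logging
from pathlib import Path


def launcher_log_path(workspace: Path) -> Path:
    """Return the launcher log location named in this spec."""
    return workspace / ".context-eval" / "logs" / "local-app-launcher.log"


def open_launcher_log(workspace: Path) -> logging.Logger:
    """Create the log directory and return a logger that appends to the log."""
    log_path = launcher_log_path(workspace)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical logger name, chosen only for this sketch.
    logger = logging.getLogger("context_eval.launcher")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def report_startup_failure(logger: logging.Logger, workspace: Path, error: Exception) -> str:
    """Log the failure and build a user-facing message that names the log file."""
    logger.error("server failed to start: %s", error)
    return f"Startup failed: {error}\nLog: {launcher_log_path(workspace)}"
```

The message returned by report_startup_failure is what a launcher window or console would display, so the user always sees both the diagnostic and where the full log lives.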

The frontend build, test, and browser acceptance workflow for this app is documented in docs/frontend-workflow.md. Maintainers should run python scripts\validate-frontend.py --install --install-browsers when working on the local app frontend.

Launcher Packaging

The first launcher packaging step stays inside the existing Python package. It adds the context-eval-app console script as the target that a future Windows shortcut, Start Menu entry, or lightweight installer can call. The launcher is not a hosted dashboard and not a separate desktop runtime.

Startup behavior:

Recovery:

Installed package smoke tests use the same launcher entry point with --check-startup:

context-eval-app --workspace my-eval --config context-eval.yaml --no-browser --port 0 --check-startup

This startup preflight validates launcher inputs, writes the local launcher log, and exits without opening a browser, starting the blocking server loop, running agents, or running validation commands.
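A rough sketch of how such a startup preflight could be wired with argparse is shown here. The flag names come from the command above; the defaults, the specific checks, and the function names are assumptions made for illustration and may differ from the real launcher.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser for the flags shown in this spec; defaults are assumed."""
    parser = argparse.ArgumentParser(prog="context-eval-app")
    parser.add_argument("--workspace", type=Path, required=True)
    parser.add_argument("--config", type=Path, required=True)
    parser.add_argument("--no-browser", action="store_true")
    parser.add_argument("--port", type=int, default=0)  # assumed default
    parser.add_argument("--check-startup", action="store_true")
    return parser


def check_startup(args: argparse.Namespace) -> list[str]:
    """Validate launcher inputs only; never start a server or run commands."""
    problems = []
    if not args.workspace.is_dir():
        problems.append(f"workspace directory not found: {args.workspace}")
    if not args.config.is_file():
        problems.append(f"config file not found: {args.config}")
    if not 0 <= args.port <= 65535:
        problems.append(f"port out of range: {args.port}")
    return problems


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    problems = check_startup(args)
    for problem in problems:
        print(problem)
    if args.check_startup:
        # Exit before any server loop, browser launch, or agent execution.
        return 1 if problems else 0
    raise NotImplementedError("server startup is outside this sketch")
```

Returning a non-zero exit code when a check fails lets an installed-package smoke test fail fast without parsing log output.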

Packaging boundaries:

The first no-manual-CLI Windows package is a portable zip built from the release wheel and frontend build output:

python scripts/build-windows-portable.py --dist-dir C:\tmp\context-eval-dist --frontend-dist frontend\dist --output-dir C:\tmp\context-eval-dist

The archive is named context-eval-windows-x64-<version>.zip and contains Start Context Eval.cmd, scripts/start-context-eval.ps1, a bundled wheelhouse, frontend/dist, a package-local workspace, and a README. The double-click path creates or reuses a private .venv, installs from the bundled wheelhouse with no package index, starts context-eval-app with --frontend-dist, opens the browser, and keeps all generated files under the portable package directory. Release builds bundle Windows dependency wheels for Python 3.11, 3.12, and 3.13 by default so the private .venv install remains offline for those supported runtimes.
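The offline install step can be illustrated by building the pip command a launcher might run. This is only a sketch: the real double-click path is the bundled PowerShell script, the helper names are hypothetical, and the distribution name context-eval is assumed. The pip options --no-index and --find-links are standard pip flags for installing from a local directory without contacting a package index.

```python
from pathlib import Path


def venv_python_path(package_dir: Path) -> Path:
    """Location of the interpreter inside a Windows virtual environment."""
    return package_dir / ".venv" / "Scripts" / "python.exe"


def offline_install_command(
    package_dir: Path,
    wheelhouse: Path,
    distribution: str = "context-eval",  # assumed distribution name
) -> list[str]:
    """Build a pip command that installs only from the bundled wheelhouse."""
    return [
        str(venv_python_path(package_dir)),
        "-m",
        "pip",
        "install",
        "--no-index",  # never contact a package index
        "--find-links",
        str(wheelhouse),  # resolve every dependency from local wheels
        distribution,
    ]
```

Passing the command as an argument list (for example to subprocess.run) avoids shell quoting problems when the portable directory path contains spaces.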

On first launch, an empty portable workspace must not pretend that ./fixture-repo exists. The local app starts in a first-run setup state with two explicit choices:

Project And Configuration Workflow

The local app must support:

All writes must show destination paths and must not silently overwrite unrelated files.

The first full Web configuration editor is Chinese-first. Headings, buttons, status text, errors, empty states, preflight labels, run labels, result labels, and export labels should appear in Chinese. Code identifiers, file names, YAML keys, artifact names, and API fields can remain in English.

The current visual case editor lets users select a task and edit the common case-authoring fields without touching YAML: task ID, title, prompt, category, difficulty, expected-outcome summary, acceptance points, expected files, validation commands, command-based hard checks, soft review rubric, and visible context variant associations. Saving this form writes tasks.yaml, reloads the server-side config, and refreshes the run plan before the user starts a run. The raw YAML view stays available as an advanced folded view for fields that do not yet have visual controls.

The next configuration-editor slice adds structured controls for the rest of the basic evaluation workspace setup. Users can edit context variant names, descriptions, and overlay source/target pairs; edit agent profile name, kind, command, timeout, and network; add, copy, or delete variants and agents with a confirmation before deletion; and choose the exact task, variant, and agent scope for the next run. Saving the structured form writes both context-eval.yaml and tasks.yaml, preserves unknown config fields such as agent telemetry, reloads through the local server, and refreshes the run plan so the selected scope case count is visible before agent execution. Raw YAML remains folded under the advanced view for escape-hatch edits, not as the main configuration path.
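One way to preserve unknown fields when saving a structured form is to merge edits into a copy of the existing record instead of rebuilding it from the form. The sketch below assumes the agent profile is a plain mapping and that the editable keys are spelled name, kind, command, timeout, and network; the key spellings, the telemetry key, and the function name are illustrative assumptions.

```python
from copy import deepcopy

# Assumed key names for the fields the structured form can edit.
EDITABLE_AGENT_KEYS = ("name", "kind", "command", "timeout", "network")


def apply_agent_edits(existing: dict, edits: dict) -> dict:
    """Return a new profile with form edits applied and unknown fields kept."""
    unexpected = set(edits) - set(EDITABLE_AGENT_KEYS)
    if unexpected:
        # The form should never send keys it does not own.
        raise ValueError(f"not editable from the form: {sorted(unexpected)}")
    merged = deepcopy(existing)  # keeps fields such as telemetry untouched
    merged.update(edits)
    return merged
```

For example, applying {"timeout": 900} to a profile that also carries a telemetry block changes only the timeout; the telemetry block is written back unchanged.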

The Coco-first visual authoring slice is specified in docs/coco-visual-hybrid-evaluation.md. It extends this workflow with Project, Coco Agent, Context Variants, Tasks, Expected Outcome, Hard Evaluation, Soft Evaluation, Run Plan, Run Execution, and Results sections. The app must keep these controls local-only: structured authoring may save context-eval.yaml and tasks.yaml, but agent execution still requires explicit run confirmation.

Task editing may start as a safe tasks.yaml editor if a full structured task form would make the PR too large. In that mode, users can edit IDs, titles, prompts, categories, difficulty, ordering, additions, deletions, and unknown task fields directly in YAML. The server must validate and reparse the saved task file after the write; it must not silently drop unknown fields.

The local app may add subtle motion for hover, focus, active, loading, progress, and log-update states. Motion must not affect readability or narrow layouts, and CSS must honor prefers-reduced-motion.

Evaluation Criteria Workflow

Users must be able to configure acceptance checks visually:

The UI should treat validation commands as the user’s project-specific acceptance criteria. It must not imply that context-eval proves correctness without validation commands or human review.

Preflight Workflow

Preflight is separate from run execution. It should check:

Preflight must not run agent commands, install dependencies, run validation commands, call hosted services, or create commits.

Run Orchestration Workflow

The local app can start a run only after the user reviews the run plan and confirms local agent execution. While a run is active, the UI should show:

The runner should continue writing the same local artifacts used by the CLI: run_metadata.json, run_manifest.json, results.jsonl, logs, prompts, patches, workspaces, and reports.

Results Workflow

The results view should read local run artifacts and show:

All results are local observations for the selected repo, tasks, context variants, agents, and validation commands. The UI must not present them as an absolute coding-agent leaderboard.
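Reading results.jsonl for the results view could look roughly like this. The record schema is not defined in this section, so the sketch treats each line as an opaque JSON object; the function names and any key passed to count_by are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_results(path: Path) -> list[dict]:
    """Parse a JSON Lines file, reporting the line number of any bad record."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({exc.msg})") from exc
    return records


def count_by(records: list[dict], key: str) -> Counter:
    """Count records per value of a caller-chosen key (schema is assumed)."""
    return Counter(str(record.get(key, "<missing>")) for record in records)
```

Because the file is only read, a view built this way stays consistent with the artifacts the CLI writes and never modifies a run.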

API Boundary

The local server API should be private to the local app and bind to loopback by default. The first implementation exposes JSON endpoints under /api/ and serves the built frontend from frontend/dist when it is available. The server has one evaluation workspace root. Config writes, output directories, run artifact roots, and artifact-relative reads must resolve inside that workspace root and must reject path traversal, such as a ".." segment.
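The workspace-root rule can be sketched as a single resolution helper that every write and artifact read passes through. The function and exception names are hypothetical; the approach shown (resolve, then check containment) is one common way to reject traversal and is not necessarily how the server implements it.

```python
from pathlib import Path


class PathOutsideWorkspace(ValueError):
    """Raised when a requested path escapes the workspace root."""


def resolve_inside_workspace(workspace_root: Path, candidate: str) -> Path:
    """Resolve candidate against the root and refuse anything outside it."""
    root = workspace_root.resolve()
    # Joining an absolute candidate discards the root, so the containment
    # check below also rejects absolute paths outside the workspace.
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):
        raise PathOutsideWorkspace(f"{candidate!r} resolves outside {root}")
    return resolved
```

Resolving before checking matters: a lexical test for ".." alone would miss symlinks that point outside the workspace, while resolve() follows them first.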

Edited config_path, tasks_path, output_dir, overlay source, and overlay target values are part of the write-safety boundary. config_path, tasks_path, output_dir, and overlay source must remain inside the local app workspace. Overlay target must be a safe relative target path and must not be absolute or contain traversal.
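The overlay target rule is purely lexical, since the target names a location inside a run workspace that may not exist yet. A minimal sketch, with a hypothetical function name, checks the value under both POSIX and Windows path rules so a target is rejected regardless of which separator style it uses:

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_safe_overlay_target(target: str) -> bool:
    """True only for a non-empty relative path with no traversal segments."""
    if not target:
        return False
    windows = PureWindowsPath(target)
    # Reject POSIX-absolute paths, drive letters, UNC shares, and rooted paths.
    if PurePosixPath(target).is_absolute() or windows.drive or windows.root:
        return False
    # Treat both separator styles as segment boundaries before looking for "..".
    segments = target.replace("\\", "/").split("/")
    return ".." not in segments
```

Under this sketch, docs/AGENTS.md is accepted, while /etc/passwd, C:\Windows\x, and a/../b are all rejected.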

The local app API endpoints are:

Endpoints must avoid running shell commands except through strict config preflight checks and the existing runner after the user confirms execution. Preflight must not create run directories, install dependencies, run validation commands, run agent commands, or create commits.

Config save must also be execution-free. Saving YAML must not run agent commands, install dependencies, run validation commands, create commits, or create run workspaces.

Harness Readiness Reference

This phase uses https://github.com/JasonxzWen/skill-hub as a selective reference and maintainer-skill source, not as runtime app code. The inspected reference commit for this PR is 42c3065378e1d1d2851ca0e387e915a2841b885e.

Useful patterns to borrow:

Out of scope for this repository phase:

Non-Goals

This capability does not add a hosted service, multi-user dashboard, remote database, remote sharing, provider billing, automatic agent installation, automatic commits, issue mining, real network isolation, or an LLM judge.

Test Plan