No. context-eval records local observations for a selected repository, task set, context variants, agent command or profile, validation commands, trials, and machine state. It should not be presented as a global agent benchmark or leaderboard.
The current scope keeps correctness grounded in project validation commands and human engineering review. An LLM judge would add another model-dependent signal that may not match the repository’s tests, invariants, or acceptance criteria.
It holds the task, repo ref, agent command or selected profile, validation commands, trials, and cleanup policy constant and changes only the context variant overlay. Each case gets an isolated workspace so variant files do not leak across cases.
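The isolation idea can be sketched in a few lines of Python. This is an illustrative sketch only, not context-eval's actual implementation: the function name `prepare_case_workspace` and the directory layout are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_case_workspace(repo_dir: Path, overlay_dir: Path) -> Path:
    """Hypothetical helper: build one isolated workspace for one case."""
    # A fresh temporary directory per case keeps variants apart.
    workspace = Path(tempfile.mkdtemp(prefix="case-"))
    # Copy the baseline checkout so every case starts from identical files.
    shutil.copytree(repo_dir, workspace, dirs_exist_ok=True)
    # Lay the variant's context files on top; they exist only in this workspace.
    shutil.copytree(overlay_dir, workspace, dirs_exist_ok=True)
    return workspace
```

Because each call creates a new directory, a file added by one variant overlay never appears in the workspace of another case.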
The case is recorded with low confidence. Patches, logs, prompts, and diff stats remain useful for review, but the result should not be treated as proof that the task was solved.
Trials show repeated observations under the recorded local setup. They help surface nondeterminism, flaky validation, and variant stability. They do not support absolute claims about general agent capability.
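As a rough illustration of how repeated trials can surface instability, the sketch below groups outcomes by variant and flags variants whose trials disagree. The record keys `variant` and `passed` are hypothetical and chosen for the example; they are not a documented context-eval schema.

```python
from collections import defaultdict


def summarize_trials(records):
    """Group pass/fail outcomes by variant and flag mixed results."""
    outcomes = defaultdict(list)
    for rec in records:
        # "variant" and "passed" are assumed keys, not a documented schema.
        outcomes[rec["variant"]].append(bool(rec["passed"]))
    summary = {}
    for variant, results in outcomes.items():
        summary[variant] = {
            "trials": len(results),
            "passes": sum(results),
            # Mixed outcomes hint at nondeterminism or flaky validation.
            "mixed": len(set(results)) > 1,
        }
    return summary
```

A variant flagged as mixed is a prompt for closer review of its logs and patches, not a verdict about the agent.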
Yes, through named local agent profiles or aligned configs. Agent-level
summaries appear only when more than one agent_name is present. The summaries
remain local observations, not absolute rankings.
No. context-eval can validate configured command executables when requested, but it does not install coding agents, log in to providers, or manage provider credentials.
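Checking that a configured command resolves to an executable can be done without installing or launching anything. The following is a minimal sketch of that kind of check using the Python standard library; `missing_executables` is a hypothetical name, and the way context-eval performs its own validation may differ.

```python
import shlex
import shutil


def missing_executables(commands):
    """Return the commands whose first token cannot be resolved on PATH."""
    missing = []
    for command in commands:
        # Split like a shell would, then look only at the program name.
        tokens = shlex.split(command)
        if not tokens or shutil.which(tokens[0]) is None:
            missing.append(command)
    return missing
```

The check only looks up the program name. It does not run the command, contact a provider, or touch credentials.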
No. The runner, reports, exports, static UI, and local app mode are local-first. The local app binds to loopback and uses local files and local run artifacts.
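For readers unfamiliar with loopback binding, the sketch below shows what it means at the socket level: a listener bound to 127.0.0.1 accepts connections only from the same machine. This is a generic illustration, not the local app's server code, and `open_loopback_listener` is a hypothetical name.

```python
import socket


def open_loopback_listener(port: int = 0) -> socket.socket:
    """Open a TCP listener that is reachable only from this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # 127.0.0.1 is the IPv4 loopback address; port 0 lets the OS pick a free port.
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock
```

Binding to 0.0.0.0 instead would expose the listener on every network interface, which is the behavior a loopback-only mode avoids.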
No. The static UI is an offline, self-contained HTML export. It can inspect config and run artifacts and generate downloadable YAML, but it does not save files, run validation commands, or start agents.
The local app is an explicit loopback mode. It can save selected local files, run side-effect-free preflight checks, start local evaluations after explicit confirmation, stream logs, inspect artifacts, and produce exports. It is still local and does not become a hosted service.
Start with report.md, context-eval compare, and context-eval inspect-run
for the run overview, variant summaries, risk signals, and confidence notes.
Then inspect results.jsonl, run_manifest.json, validation logs, patches,
stdout and stderr logs, and retained workspaces for cases that need deeper
review. This artifact-only reporting model keeps conclusions grounded in what
the local run recorded.
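For deeper review, results.jsonl can be read with any JSON Lines reader. The sketch below assumes one JSON object per line and a hypothetical `confidence` field with the value `low`; the actual field names should be checked against a real run before relying on them.

```python
import json
from pathlib import Path


def load_results(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def cases_needing_review(records, confidence_key="confidence"):
    """Select records marked low confidence (the key name is an assumption)."""
    return [rec for rec in records if rec.get(confidence_key) == "low"]
```

A typical use would be `cases_needing_review(load_results("results.jsonl"))`, followed by opening the matching validation logs and patches for each returned case.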