Every context-eval run writes local artifacts under .context-eval/runs/<run-id>
unless the config selects another output directory. These files are the source
of truth for review, debugging, reproducibility checks, reports, and exports.

- results.jsonl: one JSON row per recorded case. Rows include task, variant,
  trial, status, validation status, confidence, hashes, diff stats, paths, and
  telemetry fields when available.
- run_metadata.json: run-level labels and configuration summaries used by
  reports and exports when present.
- run_manifest.json: the planned run matrix, including selected tasks,
  selected variants, trials, ordered cases, config_hash, task_hash, and
  variant_hash values.
- report.md: Markdown report generated from local run artifacts.

Each case can write:
Retained workspaces are useful when a failed or surprising case needs manual inspection. Cleanup policy controls whether workspaces are kept or removed after result capture.
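
As a rough illustration of reviewing a run from its artifacts, the sketch below reads results.jsonl and tallies rows by status. It assumes the status value is stored under a key named `status`; the helper names `load_rows` and `count_by_status` are hypothetical and not part of context-eval.

```python
import json
from collections import Counter
from pathlib import Path


def load_rows(results_path):
    """Read a results.jsonl file: one JSON object per non-empty line."""
    rows = []
    with Path(results_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def count_by_status(rows):
    """Tally rows by status; rows without the (assumed) key count as 'unknown'."""
    return Counter(row.get("status", "unknown") for row in rows)
```

Calling `count_by_status(load_rows(path))` on a run's results.jsonl gives a quick overview before opening individual logs or retained workspaces.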
The Coco-first hybrid evaluation workflow adds case-local sidecars:

- artifacts/<case_id>/hard_evaluation.json
- artifacts/<case_id>/soft_evaluation_payload.json

hard_evaluation.json records deterministic check results, score, max score,
pass/fail status, evidence, and summary. Checks can cover validation success,
required files, forbidden files, changed-file limits, expected snippets,
forbidden snippets, diff-stat bounds, and agent completion.
soft_evaluation_payload.json records review input for later human or local
judge use. It is not a hosted API call and does not make soft scores mandatory
for pass/fail.
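
To make the idea of deterministic hard checks concrete, here is a minimal sketch that covers three of the check kinds named above: required files, forbidden files, and a changed-file limit. The function name, the result keys, and the one-point-per-check scoring are all assumptions made for illustration; the real sidecar layout and scoring rules may differ.

```python
def hard_evaluate(present_files, changed_files, required_files=(),
                  forbidden_files=(), max_changed_files=None):
    """Run simple deterministic checks and summarise them (hypothetical layout)."""
    present = set(present_files)
    changed = list(changed_files)
    checks = []

    # A required file passes when it exists in the workspace.
    for path in required_files:
        checks.append({"name": f"required_file:{path}", "passed": path in present})

    # A forbidden file passes when it is absent from the workspace.
    for path in forbidden_files:
        checks.append({"name": f"forbidden_file:{path}", "passed": path not in present})

    # The changed-file limit is only checked when a limit is configured.
    if max_changed_files is not None:
        checks.append({
            "name": "changed_file_limit",
            "passed": len(changed) <= max_changed_files,
        })

    # Assumed scoring: one point per passing check, pass only if all checks pass.
    score = sum(1 for check in checks if check["passed"])
    return {
        "score": score,
        "max_score": len(checks),
        "passed": score == len(checks),
        "checks": checks,
    }
```

Because every input is a plain list of paths, the same workspace state always yields the same result, which is the property that makes hard checks suitable as the pass/fail signal.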
results.jsonl keeps compact summary fields for stable exports and UI review:

- hard_evaluation_status
- hard_evaluation_score
- hard_evaluation_max_score
- hard_evaluation_passed_checks
- hard_evaluation_failed_checks
- hard_evaluation_path
- soft_evaluation_status
- soft_evaluation_payload_path
- soft_evaluation_result_path

Older rows that lack these fields remain valid and render unavailable defaults.
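
A downstream script can treat older rows the same way by filling in the evaluation summary fields before use. The sketch below is one way to do that; the helper name is hypothetical, and using `None` as the placeholder is an assumption, since the exact default the tool renders is not specified here.

```python
EVALUATION_FIELDS = (
    "hard_evaluation_status",
    "hard_evaluation_score",
    "hard_evaluation_max_score",
    "hard_evaluation_passed_checks",
    "hard_evaluation_failed_checks",
    "hard_evaluation_path",
    "soft_evaluation_status",
    "soft_evaluation_payload_path",
    "soft_evaluation_result_path",
)


def with_evaluation_defaults(row, default=None):
    """Return a copy of row in which every evaluation summary field is present."""
    filled = dict(row)  # copy so the caller's row is left untouched
    for name in EVALUATION_FIELDS:
        filled.setdefault(name, default)
    return filled
```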
context-eval export produces deterministic CSV or compact JSON from existing
run artifacts:
Exports do not rerun agents, run validation commands, call hosted services, or infer missing data from logs.
Result rows and manifests include fields that help downstream tools group and compare local observations:

- schema_version: result row schema version.
- context_eval_version: runtime version that wrote the row.
- config_hash: deterministic summary of the effective config, excluding the
  output directory so moving artifacts does not change experiment identity.
- task_hash: deterministic summary of the selected task.
- variant_hash: deterministic summary of the selected context variant.

These hashes support reproducibility and grouping. They are not security signatures and do not turn the run into a benchmark.
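
The following sketch shows one common way to build a deterministic summary that ignores the output directory: drop that key, serialise the rest as canonical JSON, and hash the bytes. It is not context-eval's actual algorithm. The key name `output_dir`, the use of SHA-256, and the function name are assumptions chosen for the example.

```python
import hashlib
import json


def config_hash(config, excluded_keys=("output_dir",)):
    """Hash a config dict deterministically, ignoring the excluded keys."""
    stable = {key: value for key, value in config.items() if key not in excluded_keys}
    # Sorted keys and fixed separators give the same bytes for equal configs,
    # regardless of the order in which keys were written.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two configs that differ only in where artifacts are written produce the same digest under this scheme, so rows from a relocated run still group with the original experiment.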
Telemetry is optional and local-artifact based. Rows may include:

- telemetry_status: unavailable, collected, partial, or error.
- telemetry_source: collector label such as none or json-file.
- telemetry_error: concise local collection error when collection fails.
- agent_duration_seconds
- prompt_tokens
- completion_tokens
- total_tokens
- reasoning_tokens
- tool_call_count
- tool_calls_by_name

Missing telemetry is explicit. CSV exports leave missing scalar telemetry empty.
Compact JSON uses null. context-eval does not guess token counts, tool-call
counts, or billing data from logs.
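
The difference between the two missing-value conventions can be sketched as follows. This is an illustration of the convention only, not the exporter itself: the column selection and the function names are hypothetical, and a real export contains more columns than the scalar telemetry fields shown.

```python
import csv
import io
import json

SCALAR_TELEMETRY = (
    "agent_duration_seconds",
    "prompt_tokens",
    "completion_tokens",
    "total_tokens",
    "reasoning_tokens",
    "tool_call_count",
)


def telemetry_to_csv(rows):
    """Write scalar telemetry as CSV, leaving missing values as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SCALAR_TELEMETRY, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            name: "" if row.get(name) is None else row[name]
            for name in SCALAR_TELEMETRY
        })
    return buffer.getvalue()


def telemetry_to_compact_json(rows):
    """Write scalar telemetry as compact JSON, using null for missing values."""
    records = [{name: row.get(name) for name in SCALAR_TELEMETRY} for row in rows]
    return json.dumps(records, separators=(",", ":"))
```

In both forms a missing value stays visibly missing, so a consumer can tell "not collected" apart from a measured zero.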
Reproducibility comes from recorded config summaries, task and variant hashes, the manifest, prompt files, deterministic result rows, and evaluation sidecars.
Debugging comes from stdout, stderr, validation logs, patches, status fields, timeouts, retained workspaces, hard-check evidence, and cleanup metadata.
Downstream analysis comes from CSV and compact JSON exports. Those exports stay
grounded in results.jsonl and optional run_metadata.json, so scripts can
process completed runs without rerunning local cases.
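
As an example of such downstream processing, the sketch below averages total_tokens per variant_hash over already-loaded result rows. Rows without a token count are skipped, not treated as zero, in keeping with the rule that missing telemetry is never guessed. The function name is hypothetical, and it assumes rows are plain dictionaries keyed by the field names listed earlier.

```python
from collections import defaultdict
from statistics import mean


def mean_total_tokens_by_variant(rows):
    """Average total_tokens per variant_hash, ignoring rows with no value."""
    observed = defaultdict(list)
    for row in rows:
        tokens = row.get("total_tokens")
        if tokens is not None:  # skip missing telemetry instead of guessing
            observed[row.get("variant_hash")].append(tokens)
    # Variants with no observed token counts are omitted entirely.
    return {variant: mean(values) for variant, values in observed.items()}
```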