context-eval

Artifact Model


Every context-eval run writes local artifacts under .context-eval/runs/<run-id> unless the config selects another output directory. These files are the source of truth for review, debugging, reproducibility checks, reports, and exports.

Core Run Files

Case-Local Files

Each case can write:

Retained workspaces are useful when a failed or surprising case needs manual inspection. Cleanup policy controls whether workspaces are kept or removed after result capture.

Expected Outcome And Evaluation Sidecars

The Coco-first hybrid evaluation workflow adds case-local sidecars:

hard_evaluation.json records deterministic check results, score, max score, pass/fail status, evidence, and summary. Checks can cover validation success, required files, forbidden files, changed-file limits, expected snippets, forbidden snippets, diff-stat bounds, and agent completion.
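As a rough illustration of what a deterministic check of this kind computes, the sketch below scores required-file and forbidden-file checks and collects evidence strings. The HardEvaluation class, the evaluate_files function, and the rule that a case passes only when every check passes are assumptions made for this example. They are not context-eval's implementation or the schema of hard_evaluation.json.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class HardEvaluation:
    """Hypothetical container mirroring the kinds of values described above."""
    score: int
    max_score: int
    passed: bool
    evidence: List[str] = field(default_factory=list)


def evaluate_files(
    present: Iterable[str],
    required: Iterable[str],
    forbidden: Iterable[str],
) -> HardEvaluation:
    """Score one point per satisfied required-file or forbidden-file check."""
    present_set = set(present)
    score = 0
    max_score = 0
    evidence: List[str] = []

    for path in required:
        max_score += 1
        if path in present_set:
            score += 1
            evidence.append(f"required file present: {path}")
        else:
            evidence.append(f"required file missing: {path}")

    for path in forbidden:
        max_score += 1
        if path not in present_set:
            score += 1
            evidence.append(f"forbidden file absent: {path}")
        else:
            evidence.append(f"forbidden file present: {path}")

    # Assumption for this sketch: the case passes only if every check passes.
    return HardEvaluation(score, max_score, score == max_score, evidence)


result = evaluate_files(
    present=["src/app.py", "secret.env"],
    required=["src/app.py", "README.md"],
    forbidden=["secret.env"],
)
print(result.score, result.max_score, result.passed)
```

Because the checks depend only on the file lists passed in, the same inputs always yield the same score and evidence, which is the property that makes a hard evaluation reproducible.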

soft_evaluation_payload.json records review input for later use by a human or a local judge. Writing it does not call a hosted API, and soft scores are not required for pass/fail.

results.jsonl keeps compact summary fields for stable exports and UI review:

Older rows that lack these fields remain valid; the missing values render as unavailable defaults.
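A reader of results.jsonl can tolerate older rows in a similar way. The following is a minimal sketch, assuming rows are plain JSON objects. The field names hard_score and case_id and the "unavailable" placeholder are hypothetical and only stand in for whatever summary fields a given run actually records.

```python
from typing import Any, Dict, Iterable

UNAVAILABLE = "unavailable"  # placeholder label chosen for this sketch


def summarize(row: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Return the requested fields, substituting a default for any gaps."""
    # A missing key and an explicit null are treated the same way here.
    return {
        name: UNAVAILABLE if row.get(name) is None else row[name]
        for name in fields
    }


# Hypothetical rows: the first predates the summary field, the second has it.
old_row = {"case_id": "case-1"}
new_row = {"case_id": "case-2", "hard_score": 3}

print(summarize(old_row, ["case_id", "hard_score"]))
print(summarize(new_row, ["case_id", "hard_score"]))
```

Testing for None rather than truthiness keeps legitimate values such as a score of 0 from being mistaken for missing data.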

Exports

context-eval export produces deterministic CSV or compact JSON from existing run artifacts:

Exports do not rerun agents, run validation commands, call hosted services, or infer missing data from logs.

Hash And Schema Fields

Result rows and manifests include fields that help downstream tools group and compare local observations:

These hashes support reproducibility and grouping. They are not security signatures and do not turn the run into a benchmark.
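To illustrate why a content hash helps with grouping but is not a signature, here is a minimal sketch that hashes a JSON-serializable value in a key-order-independent way. The content_hash function, the choice of SHA-256, and the example task dictionaries are assumptions for this illustration. The algorithm and inputs context-eval actually uses are not specified here.

```python
import hashlib
import json
from typing import Any


def content_hash(value: Any) -> str:
    """Hash a JSON-serializable value so that key order does not matter."""
    # Sorted keys and fixed separators give one canonical byte string.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical task definitions used only for this example.
task_a = {"prompt": "fix the bug", "timeout": 60}
task_b = {"timeout": 60, "prompt": "fix the bug"}  # same content, new order
task_c = {"prompt": "fix the bug", "timeout": 90}  # different content

print(content_hash(task_a) == content_hash(task_b))  # True
print(content_hash(task_a) == content_hash(task_c))  # False
```

Anyone can recompute such a hash from the same content, since no secret key is involved. That makes it suitable for matching rows that share a task or variant, and unsuitable as proof of who produced them.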

Telemetry Fields

Telemetry is optional and local-artifact based. Rows may include:

Missing telemetry is explicit. CSV exports leave missing scalar telemetry empty. Compact JSON uses null. context-eval does not guess token counts, tool-call counts, or billing data from logs.
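The two missing-value conventions can be sketched with the Python standard library. This is an illustration of the convention only, not the exporter itself, and the column names case_id and input_tokens are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, List


def to_csv(rows: List[Dict[str, Any]], fields: List[str]) -> str:
    """Render rows as CSV; the csv module writes None as an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name) for name in fields})
    return buffer.getvalue()


def to_compact_json(rows: List[Dict[str, Any]], fields: List[str]) -> str:
    """Render rows as compact JSON; None is serialized as null."""
    projected = [{name: row.get(name) for name in fields} for row in rows]
    return json.dumps(projected, sort_keys=True, separators=(",", ":"))


# Hypothetical rows: the second has no token count recorded.
rows = [
    {"case_id": "c1", "input_tokens": 120},
    {"case_id": "c2", "input_tokens": None},
]
columns = ["case_id", "input_tokens"]

print(to_csv(rows, columns))
print(to_compact_json(rows, columns))
```

In both outputs the gap stays visible as a gap. Nothing is replaced by 0 or an estimate, so downstream scripts can distinguish "not recorded" from a real measurement.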

How Artifacts Support Review

Reproducibility comes from recorded config summaries, task and variant hashes, the manifest, prompt files, deterministic result rows, and evaluation sidecars.

Debugging comes from stdout, stderr, validation logs, patches, status fields, timeouts, retained workspaces, hard-check evidence, and cleanup metadata.

Downstream analysis comes from CSV and compact JSON exports. Those exports stay grounded in results.jsonl and optional run_metadata.json, so scripts can process completed runs without rerunning local cases.
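A downstream script might start from results.jsonl directly. The sketch below assumes one JSON object per line and tallies rows by a field. The helper names load_rows and count_by and the status field with its values are hypothetical, chosen only to show the pattern.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, List


def load_rows(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse JSON Lines input, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


def count_by(rows: List[Dict[str, Any]], field: str) -> Counter:
    """Count rows per value of a field; rows without it are counted under None."""
    return Counter(row.get(field) for row in rows)


# Hypothetical content standing in for a results.jsonl file.
sample = [
    '{"case_id": "a", "status": "passed"}',
    "",
    '{"case_id": "b", "status": "failed"}',
    '{"case_id": "c"}',
]

rows = load_rows(sample)
print(count_by(rows, "status"))
```

For a real run, the lines would come from opening results.jsonl inside the run directory and passing the file handle to load_rows. The script only reads what was recorded, so it never needs to rerun a case.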