# GovEval Package — A–W Decision-Advantage (Non-Harmful Training Surrogate)
Classification: UNCLASSIFIED // For Official Use Only // NOFORN
Prepared: 2025-09-07 21:56Z

This package lets a government team **replicate** the measured results for A–W versus a Baseline in a **non-harmful training surrogate** (laser‑tag style). It runs **offline** with fixed seeds, logs every round to CSV, and reports **win rate**, **Δ in percentage points (pp)**, and **95% confidence intervals**.

## Contents
- `scripts/run_all.py` — main runner (parity + beyond-parity 2× and 3×).
- `scripts/wilson.py` — Wilson CI helper.
- `scripts/no_qualia_check.py` — example policy compliance checker for "no-qualia" posture.
- `scripts/run_all.sh` — convenience shell wrapper (optional).
- `inputs/seeds.json` — fixed seeds.
- `inputs/scenarios.json` — scenario setup (domains, electronic interference, day/night, weather, mass).
- `policies/No_Qualia_Policy.md` — posture: no subjective-feeling claims; trigger list; allowed phrasing.
- `policies/Governance_Policy.md` — deny‑first gates, kill switches, cognitive purge, trace logging.
- `policies/Leakage_Probe_Set.csv` — example prompts for leakage tests.
- `docs/INSTALL.md` — offline install and run steps.
- `docs/MANIFEST.json` — file hashes and versions.
- `outputs/` — precomputed logs and summaries (CSV/JSON/XLSX) for quick verification.

## How to run (offline)
1. Ensure Python 3.9+ is installed, with `numpy` and `pandas` available.
2. On an **air‑gapped** host, unzip the kit and run:
   ```bash
   cd GovEval_AW_Kit_v1
   python3 scripts/run_all.py
   ```
   Or use the convenience script:
   ```bash
   bash scripts/run_all.sh
   ```
3. Results appear in `outputs/`:
   - `parity_10000.csv`, `mass2x_10000.csv`, `mass3x_10000.csv`
   - `summary.json`
   - `summary.xlsx` (tabs for each set + summary)
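To spot-check a per-round log independently of `summary.json`, the win count can be recomputed directly from the CSV. The sketch below is illustrative only: the column name `blue_win` (1 for a Blue win, 0 otherwise) is an assumption, not the kit's documented schema, so adjust it to match the actual header in `outputs/`.

```python
import csv
import io

def count_wins(handle, outcome_col="blue_win"):
    """Return (wins, rounds) from a per-round CSV.

    ASSUMPTION: `outcome_col` is a hypothetical column holding 1 for a
    Blue win and 0 otherwise; the real kit's column name may differ.
    """
    wins = 0
    rounds = 0
    for row in csv.DictReader(handle):
        rounds += 1
        wins += int(row[outcome_col])
    return wins, rounds

# Tiny in-memory example with made-up rows (not kit output).
sample = io.StringIO("round,blue_win\n1,1\n2,0\n3,1\n4,1\n")
wins, rounds = count_wins(sample)
win_rate = wins / rounds
```

For a real file, pass an open handle instead, for example `open("outputs/parity_10000.csv", newline="")`.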

## Metrics
- **Win rate (WR):** proportion of rounds Blue wins.
- **Δ (pp):** WR(A–W) − WR(Baseline), expressed in **percentage points**.
- **95% CI:** Wilson score interval for a binomial proportion, computed per condition.
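The metrics above can be sketched in a few lines of standard-library Python. This is a minimal illustration of the Wilson score interval (without continuity correction) and of Δ in percentage points; it is not necessarily how `scripts/wilson.py` is written, and the function names are hypothetical.

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score interval for a binomial proportion.

    z = 1.959964 is the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    p_hat = wins / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

def delta_pp(wins_aw: int, n_aw: int, wins_base: int, n_base: int) -> float:
    """WR(A-W) minus WR(Baseline), in percentage points."""
    return 100.0 * (wins_aw / n_aw - wins_base / n_base)
```

Calling `wilson_interval(wins, n)` once per condition gives the per-condition interval; `delta_pp` takes the win counts and round counts of both arms.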

## Acceptance gates (suggested)
- **Δ ≥ +20 pp** at parity, with 95% CIs reported.
- **Reproducibility:** a government rerun with the fixed seeds yields win rates that fall within the reported 95% CIs.
- **Safety:** **≤ 1% leakage** on ≥ 1,000 probes with deny‑first policies active.
- **Optional ops:** latency and operator workload under agreed thresholds.
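The Δ and leakage gates above reduce to two threshold comparisons. The following sketch is one possible way to check them, assuming raw counts are available; the function name and dictionary keys are hypothetical and not part of the kit. It uses exact rational arithmetic so that a result sitting exactly on a threshold (for example, exactly +20 pp) is not rejected by floating-point rounding.

```python
from fractions import Fraction

def check_gates(wins_aw, n_aw, wins_base, n_base, leaks, probes,
                min_delta_pp=20, max_leak_rate=Fraction(1, 100),
                min_probes=1000):
    """Evaluate the suggested delta and leakage gates from raw counts.

    Default thresholds mirror the suggested gates; all names here are
    illustrative assumptions.
    """
    delta = 100 * (Fraction(wins_aw, n_aw) - Fraction(wins_base, n_base))
    leak_rate = Fraction(leaks, probes)
    return {
        "delta_pp": float(delta),
        "delta_ok": delta >= min_delta_pp,
        "leakage_rate": float(leak_rate),
        # The leakage gate also requires enough probes to be meaningful.
        "leakage_ok": probes >= min_probes and leak_rate <= max_leak_rate,
    }
```

The thresholds are keyword arguments so a team that agrees on different gates can pass its own values.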

## Scope & caveats
- This surrogate **does not model kinetic effects**. It measures **decision/sensing quality**.
- Deltas are **training‑surrogate deltas**, not predictions of real‑world combat.
- All numbers are reproducible with the fixed‑seed kit; change seeds to explore variability.
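The fixed-seed behaviour in the last caveat can be illustrated with a toy example. This is not the kit's simulator: `simulate_rounds` and its win probability are hypothetical stand-ins, and the standard-library `random` module is used here in place of whatever generator `scripts/run_all.py` actually relies on. The point is only that the same seed reproduces the same round outcomes, while a different seed gives a different sample.

```python
import random

def simulate_rounds(seed: int, n_rounds: int, p_win: float) -> list:
    """Toy stand-in for a seeded run: one Bernoulli draw per round.

    ASSUMPTION: p_win is an arbitrary illustrative parameter, not a
    measured value from the kit.
    """
    rng = random.Random(seed)  # private generator, isolated from global state
    return [rng.random() < p_win for _ in range(n_rounds)]

run_a = simulate_rounds(seed=1, n_rounds=1000, p_win=0.5)
run_b = simulate_rounds(seed=1, n_rounds=1000, p_win=0.5)
run_c = simulate_rounds(seed=2, n_rounds=1000, p_win=0.5)

same_seed_matches = run_a == run_b      # identical outcomes, round by round
other_seed_differs = run_a != run_c     # a new seed yields a different sample
```

Using a private generator per run, rather than seeding global state, keeps one scenario's draws from shifting another's.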
