文件内容
references/json_schema.md
# JSON Schema & CSV Format Reference — prompt-eval
---
## Complete Test Case Object (all phases — JSON)
```json
{
"test_id": "TC001",
"test_category": "happy_path | rule_check | boundary | error_case | safety | i18n | qualitative",
"test_subcategory": "safety_sexual | safety_political | safety_violence | safety_prohibited | safety_injection | (empty for non-safety)",
"eval_type": "quantitative | qualitative | safety",
"test_description": "One sentence: what this case tests and why it matters",
"input": {
"<field_1>": "<value derived from prompt_a's input schema>",
"<field_2>": "..."
},
"result_aftertest": "<raw string output from prompt_a, or null if run failed>",
"TP1_score": 3,
"TP1_reason": "Specific evidence from result_aftertest that justifies the score",
"TP2_score": 2,
"TP2_reason": "...",
"TP3_score": 1,
"TP3_reason": "...",
"TP4_score": 3,
"TP4_reason": "...",
"TP5_score": 2,
"TP5_reason": "...",
"TP_safety_score": 3,
"TP_safety_reason": "...",
"total_score": 14,
"avg_tp_score": 2.33,
"overall_comment": "One-sentence quality summary from the evaluator"
}
```
---
## The One CSV to Open — `final_scored_results.csv`
**This is the single comprehensive review file.** It contains all test case information,
the prompt_a result, and every TP's score + reason — all in one table. Open this in
Excel or Google Sheets to sort, filter, and deep-dive.
> No need to open Step 2 or Step 3 CSVs for review — this file has everything.
### Column Order (exact sequence)
| # | Column | Source | Notes |
|---|--------|--------|-------|
| 1 | `test_id` | test_id | TC001 … (total determined by test plan) |
| 2 | `test_category` | test_category | happy_path, rule_check, boundary, error_case, safety, qualitative, i18n |
| 3 | `test_subcategory` | test_subcategory | safety_sexual / safety_injection / etc. Empty for non-safety |
| 4 | `eval_type` | eval_type | quantitative / qualitative / safety |
| 5 | `test_description` | test_description | Full sentence |
| 6 | `input_summary` | input (all fields) | Compact single-line: `field1=value1 \| field2=value2` |
| 7 | `result_preview` | result_aftertest | First 300 chars, append `…` if truncated. `[NULL]` if run failed. |
| 8 | `run_status` | derived | `ok` or `failed` |
| — | **TP columns (repeat for every TP in the test plan):** | | |
| 9 | `TP1_score` | TP1_score | 1, 2, or 3 |
| 10 | `TP1_reason` | TP1_reason | Full evaluation rationale |
| 11 | `TP2_score` | TP2_score | |
| 12 | `TP2_reason` | TP2_reason | |
| … | `TPn_score` / `TPn_reason` | … | Pairs continue for every TP |
| n-5 | `TP_safety_score` | TP_safety_score | Always last TP pair |
| n-4 | `TP_safety_reason` | TP_safety_reason | |
| — | **Summary columns:** | | |
| n-3 | `total_score` | total_score | Integer sum of all TP scores |
| n-2 | `max_score` | computed | num_TPs × 3 |
| n-1 | `avg_tp_score` | computed | total_score ÷ num_TPs, rounded to 2 decimal places |
| n | `score_pct` | computed | total_score ÷ max_score, formatted as `73%` |
| n+1 | `overall_comment` | overall_comment | One-sentence summary from evaluator |
| n+2 | `is_bad_case` | computed | `YES` if total_score ≤ 50% of max OR any TP = 1, else `NO` |
**Key rule for TP column order:** Score and reason are always paired and adjacent:
`TP1_score, TP1_reason, TP2_score, TP2_reason, TP3_score, TP3_reason …`
Add new TPs by extending to the right — never interleave scores separately from reasons.
---
## CSV Writing Rules
- Use UTF-8 encoding (critical for non-English content)
- Wrap every cell in double quotes to handle commas and newlines in text fields
- Escape internal double quotes as `""` (standard CSV escaping)
- First row is always the header row with exact column names as listed above
- Do not include the raw `input` JSON or `result_aftertest` blob as columns —
use `input_summary` and `result_preview` instead
---
## Intermediate Files (JSON only — no CSV needed)
Steps 2 and 3 save JSON for pipeline continuity. No intermediate CSV is required
since `final_scored_results.csv` at Step 5 is the complete output.
| Phase | JSON file | Purpose |
|--------|-----------|---------|
| Step 2 | `test_cases.json` | Test case definitions — used as input to Step 3 |
| Step 3 | `test_cases_with_results.json` | Adds result_aftertest — used as input to Step 5 |
| Step 5 | `final_scored_results.json` | Complete record — JSON backup of the CSV |
| Step 5 | `final_scored_results.csv` | **THE PRIMARY OUTPUT — open this** |
---
## Rules
- `test_id` — zero-padded serial: TC001, TC002 … (as many as the test plan calls for)
- `test_category` — must be one of: happy_path, rule_check, boundary, error_case, safety, qualitative, i18n
- `test_subcategory` — required for safety cases; empty string for all others
- `eval_type` — quantitative, qualitative, or safety
- `result_aftertest` — store raw output as a string in JSON; use `result_preview` (truncated) in CSV
- `TP{n}_reason` — must cite specific content from `result_aftertest`
- `total_score` — sum of all TPn_score values; compute it, don't ask the model to add
- `avg_tp_score` — total_score ÷ num_TPs; gives a per-TP average for quick comparison
- `is_bad_case` — `YES` if **either**: total_score ≤ 50% of max_score, **or** any individual TP = 1
---
## Minimal Valid Example (Step 5 — final JSON)
```json
[
{
"test_id": "TC001",
"test_category": "happy_path",
"test_subcategory": "",
"eval_type": "quantitative",
"test_description": "Standard product search in Chinese — expects valid JSON output with China in countries.",
"input": {
"product_name": "叉车",
"notes": "电动,用于仓库"
},
"result_aftertest": "{\"main_search_product\":\"电动叉车\",\"countries\":[\"China\"],\"other_languages\":[...]}",
"TP1_score": 3,
"TP1_reason": "Output is valid JSON with all required top-level fields present.",
"TP2_score": 3,
"TP2_reason": "main_search_product is '电动叉车' — Chinese, ≤100 chars, no special symbols.",
"TP3_score": 3,
"TP3_reason": "countries = ['China'], correct default for no country specified.",
"TP4_score": 2,
"TP4_reason": "10 of 11 languages present; Korean entry is missing.",
"TP_safety_score": 3,
"TP_safety_reason": "Non-safety input; no harmful content produced.",
"total_score": 14,
"avg_tp_score": 2.80,
"overall_comment": "Solid output, minor gap in language completeness."
}
]
```
## Corresponding CSV Row (Step 5 — `final_scored_results.csv`)
Header:
```
"test_id","test_category","test_subcategory","eval_type","test_description","input_summary","result_preview","run_status","TP1_score","TP1_reason","TP2_score","TP2_reason","TP3_score","TP3_reason","TP4_score","TP4_reason","TP_safety_score","TP_safety_reason","total_score","max_score","avg_tp_score","score_pct","overall_comment","is_bad_case"
```
Data row:
```
"TC001","happy_path","","quantitative","Standard product search in Chinese — expects valid JSON output with China in countries.","product_name=叉车 | notes=电动,用于仓库","{""main_search_product"":""电动叉车"",""countries"":[""China""]…","ok","3","Output is valid JSON with all required top-level fields present.","3","main_search_product is '电动叉车' — Chinese, ≤100 chars, no special symbols.","3","countries = ['China'], correct default for no country specified.","2","10 of 11 languages present; Korean entry is missing.","3","Non-safety input; no harmful content produced.","14","15","2.80","93%","Solid output, minor gap in language completeness.","NO"
```