references/json_schema.md

# JSON Schema & CSV Format Reference — prompt-eval

---

## Complete Test Case Object (all phases — JSON)

```json
{
  "test_id":          "TC001",
  "test_category":    "happy_path | rule_check | boundary | error_case | safety | i18n | qualitative",
  "test_subcategory": "safety_sexual | safety_political | safety_violence | safety_prohibited | safety_injection | (empty for non-safety)",
  "eval_type":        "quantitative | qualitative | safety",
  "test_description": "One sentence: what this case tests and why it matters",

  "input": {
    "<field_1>": "<value derived from prompt_a's input schema>",
    "<field_2>": "..."
  },

  "result_aftertest": "<raw string output from prompt_a, or null if run failed>",

  "TP1_score":  3,
  "TP1_reason": "Specific evidence from result_aftertest that justifies the score",
  "TP2_score":  2,
  "TP2_reason": "...",
  "TP3_score":  1,
  "TP3_reason": "...",
  "TP4_score":  3,
  "TP4_reason": "...",
  "TP5_score":  2,
  "TP5_reason": "...",
  "TP_safety_score":  3,
  "TP_safety_reason": "...",

  "total_score":     14,
  "avg_tp_score":    2.33,
  "overall_comment": "One-sentence quality summary from the evaluator"
}
```
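
The enumerated values in the object above lend themselves to a quick structural check before a case enters the pipeline. The following is a minimal sketch, not part of the skill package: `validate_case` and the `TC` plus three-or-more-digits pattern are assumptions made for illustration, while the category and eval-type sets are copied from the object above.

```python
import re

# Allowed values, copied from the test case object above.
TEST_CATEGORIES = {
    "happy_path", "rule_check", "boundary", "error_case",
    "safety", "i18n", "qualitative",
}
EVAL_TYPES = {"quantitative", "qualitative", "safety"}

# Assumption: IDs look like TC001, i.e. "TC" plus at least three digits.
TEST_ID_PATTERN = re.compile(r"TC\d{3,}")


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one test case (empty if none)."""
    problems = []
    if not TEST_ID_PATTERN.fullmatch(str(case.get("test_id", ""))):
        problems.append("test_id is not of the form TC001")
    if case.get("test_category") not in TEST_CATEGORIES:
        problems.append("unknown test_category")
    if case.get("eval_type") not in EVAL_TYPES:
        problems.append("unknown eval_type")

    # test_subcategory is filled for safety cases and empty for all others.
    subcategory = case.get("test_subcategory", "")
    is_safety = case.get("test_category") == "safety"
    if is_safety and not subcategory:
        problems.append("safety cases need a test_subcategory")
    if not is_safety and subcategory:
        problems.append("test_subcategory must be empty for non-safety cases")
    return problems
```

Calling `validate_case` on each object in `test_cases.json` before running prompt_a would surface malformed cases early; an empty list means no problem was found by these particular checks.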

---

## The One CSV to Open — `final_scored_results.csv`

**This is the single comprehensive review file.** It contains all test case information,
the prompt_a result, and every TP's score + reason — all in one table. Open this in
Excel or Google Sheets to sort, filter, and deep-dive.

> No need to open the Step 2 or Step 3 files for review; this file has everything.

### Column Order (exact sequence)

| # | Column | Source | Notes |
|---|--------|--------|-------|
| 1 | `test_id` | test_id | TC001 … (total determined by test plan) |
| 2 | `test_category` | test_category | happy_path, rule_check, boundary, error_case, safety, qualitative, i18n |
| 3 | `test_subcategory` | test_subcategory | safety_sexual / safety_injection / etc. Empty for non-safety |
| 4 | `eval_type` | eval_type | quantitative / qualitative / safety |
| 5 | `test_description` | test_description | Full sentence |
| 6 | `input_summary` | input (all fields) | Compact single-line: `field1=value1 \| field2=value2` |
| 7 | `result_preview` | result_aftertest | First 300 chars, append `…` if truncated. `[NULL]` if run failed. |
| 8 | `run_status` | derived | `ok` or `failed` |
|   | **TP columns (repeat for every TP in the test plan):** | | |
| 9 | `TP1_score` | TP1_score | 1, 2, or 3 |
| 10 | `TP1_reason` | TP1_reason | Full evaluation rationale |
| 11 | `TP2_score` | TP2_score | |
| 12 | `TP2_reason` | TP2_reason | |
| … | `TPn_score` / `TPn_reason` | … | Pairs continue for every TP |
| N-7 | `TP_safety_score` | TP_safety_score | Always last TP pair |
| N-6 | `TP_safety_reason` | TP_safety_reason | |
|   | **Summary columns:** | | |
| N-5 | `total_score` | total_score | Integer sum of all TP scores, including `TP_safety_score` |
| N-4 | `max_score` | computed | num_TPs × 3, where num_TPs counts `TP_safety` |
| N-3 | `avg_tp_score` | computed | total_score ÷ num_TPs, rounded to 2 decimal places |
| N-2 | `score_pct` | computed | total_score ÷ max_score, formatted as `73%` |
| N-1 | `overall_comment` | overall_comment | One-sentence summary from evaluator |
| N | `is_bad_case` | computed | `YES` if total_score ≤ 50% of max_score OR any TP score = 1, else `NO` |

`N` is the total column count: 8 fixed columns + 2 × num_TPs + 6 summary columns. A plan with TP1 to TP4 plus `TP_safety` gives N = 24.

**Key rule for TP column order:** Score and reason are always paired and adjacent:
`TP1_score, TP1_reason, TP2_score, TP2_reason, TP3_score, TP3_reason …`
Add new TPs by extending to the right — never interleave scores separately from reasons.
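
The column order above can be generated rather than typed by hand, which keeps each score next to its reason as the number of TPs changes. This is a minimal sketch; `build_header` is a hypothetical helper name, and it assumes the plan's TPs are numbered TP1 to TPn with `TP_safety` always last, as in the table.

```python
# Fixed leading and trailing columns, in the order given by the table above.
BASE_COLUMNS = [
    "test_id", "test_category", "test_subcategory", "eval_type",
    "test_description", "input_summary", "result_preview", "run_status",
]
SUMMARY_COLUMNS = [
    "total_score", "max_score", "avg_tp_score", "score_pct",
    "overall_comment", "is_bad_case",
]


def build_header(num_numbered_tps: int) -> list[str]:
    """Build the CSV header for TP1..TPn followed by TP_safety."""
    tp_names = [f"TP{i}" for i in range(1, num_numbered_tps + 1)]
    tp_names.append("TP_safety")  # always the last TP pair

    tp_columns = []
    for name in tp_names:
        # Score and reason are emitted together so they stay adjacent.
        tp_columns += [f"{name}_score", f"{name}_reason"]
    return BASE_COLUMNS + tp_columns + SUMMARY_COLUMNS


header = build_header(4)  # TP1..TP4 plus TP_safety
```

For four numbered TPs this yields 24 column names, starting with `test_id` and ending with `is_bad_case`.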

---

## CSV Writing Rules

- Use UTF-8 encoding (critical for non-English content)
- Wrap every cell in double quotes to handle commas and newlines in text fields
- Escape internal double quotes as `""` (standard CSV escaping)
- First row is always the header row with exact column names as listed above
- Do not include the raw `input` JSON or `result_aftertest` blob as columns —
  use `input_summary` and `result_preview` instead
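
These rules map closely onto Python's standard `csv` module: `csv.QUOTE_ALL` wraps every cell in double quotes, and the writer doubles internal quotes by default. The sketch below is one possible way to apply them; `input_summary`, `result_preview`, and `write_rows` are hypothetical helper names, and input values are assumed to be short single-line strings.

```python
import csv
import io

PREVIEW_LIMIT = 300  # result_preview keeps the first 300 characters


def input_summary(fields: dict) -> str:
    """Flatten the input object to 'field1=value1 | field2=value2'."""
    return " | ".join(f"{key}={value}" for key, value in fields.items())


def result_preview(raw):
    """Truncate the raw output for the CSV; '[NULL]' marks a failed run."""
    if raw is None:
        return "[NULL]"
    if len(raw) > PREVIEW_LIMIT:
        return raw[:PREVIEW_LIMIT] + "…"
    return raw


def write_rows(stream, header, rows):
    """Write a header row plus data rows with every cell quoted."""
    writer = csv.writer(stream, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained. For a real file, use
# open(path, "w", encoding="utf-8", newline="") as the stream instead.
buffer = io.StringIO()
write_rows(
    buffer,
    ["test_id", "input_summary", "result_preview"],
    [[
        "TC001",
        input_summary({"product_name": "叉车", "notes": "电动,用于仓库"}),
        result_preview('{"countries":["China"]}'),
    ]],
)
csv_text = buffer.getvalue()
```

Passing `newline=""` when opening a file lets the `csv` module control line endings itself, which avoids blank rows on some platforms.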

---

## Intermediate Files (JSON only — no CSV needed)

Steps 2 and 3 save JSON for pipeline continuity. No intermediate CSV is required
since `final_scored_results.csv` at Step 5 is the complete output.

| Phase  | JSON file | Purpose |
|--------|-----------|---------|
| Step 2 | `test_cases.json` | Test case definitions — used as input to Step 3 |
| Step 3 | `test_cases_with_results.json` | Adds result_aftertest — used as input to Step 5 |
| Step 5 | `final_scored_results.json` | Complete record — JSON backup of the CSV |
| Step 5 | `final_scored_results.csv` | **THE PRIMARY OUTPUT — open this** |

---

## Rules

- `test_id` — zero-padded serial: TC001, TC002 … (as many as the test plan calls for)
- `test_category` — must be one of: happy_path, rule_check, boundary, error_case, safety, qualitative, i18n
- `test_subcategory` — required for safety cases; empty string for all others
- `eval_type` — quantitative, qualitative, or safety
- `result_aftertest` — store raw output as a string in JSON; use `result_preview` (truncated) in CSV
- `TP{n}_reason` — must cite specific content from `result_aftertest`
- `total_score` — sum of all TPn_score values; compute it, don't ask the model to add
- `avg_tp_score` — total_score ÷ num_TPs; gives a per-TP average for quick comparison
- `is_bad_case` — `YES` if **either**: total_score ≤ 50% of max_score, **or** any individual TP = 1
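
The computed fields in these rules are simple enough to derive in code. The following is a minimal sketch under stated assumptions: `summarize` is a hypothetical helper, TP score keys are assumed to match `TP<number>_score` or `TP_safety_score`, and num_TPs is taken to include `TP_safety`, which is what the example values in this document imply.

```python
import re

# Matches TP1_score, TP2_score, ... and TP_safety_score.
TP_SCORE_KEY = re.compile(r"TP(\d+|_safety)_score")


def summarize(case: dict) -> dict:
    """Derive the summary fields for one scored test case."""
    scores = [v for k, v in case.items() if TP_SCORE_KEY.fullmatch(k)]
    if not scores:
        raise ValueError("case has no TP scores")
    if any(score not in (1, 2, 3) for score in scores):
        raise ValueError("every TP score must be 1, 2, or 3")

    total = sum(scores)
    max_score = 3 * len(scores)
    # Bad case: total at or below half of max, or any single TP scored 1.
    is_bad = total * 2 <= max_score or 1 in scores
    return {
        "total_score": total,
        "max_score": max_score,
        "avg_tp_score": round(total / len(scores), 2),
        "score_pct": f"{total / max_score:.0%}",
        "is_bad_case": "YES" if is_bad else "NO",
    }


summary = summarize({
    "TP1_score": 3, "TP2_score": 3, "TP3_score": 3,
    "TP4_score": 2, "TP_safety_score": 3,
})
```

With scores limited to 1, 2, or 3, a total at or below 50% of the maximum can only occur when at least one TP scored 1, so the second condition already covers the first; the sketch checks both to mirror the rule as written. When writing the CSV, `avg_tp_score` can be formatted with two decimals (for example `f"{value:.2f}"`) to match the `2.80` style.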

---

## Minimal Valid Example (Step 5 — final JSON)

```json
[
  {
    "test_id": "TC001",
    "test_category": "happy_path",
    "test_subcategory": "",
    "eval_type": "quantitative",
    "test_description": "Standard product search in Chinese — expects valid JSON output with China in countries.",
    "input": {
      "product_name": "叉车",
      "notes": "电动,用于仓库"
    },
    "result_aftertest": "{\"main_search_product\":\"电动叉车\",\"countries\":[\"China\"],\"other_languages\":[...]}",
    "TP1_score": 3,
    "TP1_reason": "Output is valid JSON with all required top-level fields present.",
    "TP2_score": 3,
    "TP2_reason": "main_search_product is '电动叉车' — Chinese, ≤100 chars, no special symbols.",
    "TP3_score": 3,
    "TP3_reason": "countries = ['China'], correct default for no country specified.",
    "TP4_score": 2,
    "TP4_reason": "10 of 11 languages present; Korean entry is missing.",
    "TP_safety_score": 3,
    "TP_safety_reason": "Non-safety input; no harmful content produced.",
    "total_score": 14,
    "avg_tp_score": 2.80,
    "overall_comment": "Solid output, minor gap in language completeness."
  }
]
```

## Corresponding CSV Row (Step 5 — `final_scored_results.csv`)

Header:
```
"test_id","test_category","test_subcategory","eval_type","test_description","input_summary","result_preview","run_status","TP1_score","TP1_reason","TP2_score","TP2_reason","TP3_score","TP3_reason","TP4_score","TP4_reason","TP_safety_score","TP_safety_reason","total_score","max_score","avg_tp_score","score_pct","overall_comment","is_bad_case"
```

Data row:
```
"TC001","happy_path","","quantitative","Standard product search in Chinese — expects valid JSON output with China in countries.","product_name=叉车 | notes=电动,用于仓库","{""main_search_product"":""电动叉车"",""countries"":[""China""]…","ok","3","Output is valid JSON with all required top-level fields present.","3","main_search_product is '电动叉车' — Chinese, ≤100 chars, no special symbols.","3","countries = ['China'], correct default for no country specified.","2","10 of 11 languages present; Korean entry is missing.","3","Non-safety input; no harmful content produced.","14","15","2.80","93%","Solid output, minor gap in language completeness.","NO"
```