# makesPDF model benchmark

> Downloadable digest of the live leaderboard at https://makespdf.com/ai#benchmarks.
> Last refresh: 2026-05-13. The skill version snapshotted in this dataset is at
> https://makespdf.com/skills/pdf-template-author.md. Re-running the same models
> against a different skill version can move every score.
>
> The intent of this file: paste it into your AI assistant and ask
> questions ("why did model X fail test Y?", "which model has the best
> price/quality trade-off for short chat edits?"). All models are routed
> through OpenRouter.

## What was tested

Each model received the canonical `pdf-template-author` skill file as its
system prompt, then ran the six tests below. Tests 1–5 are single-prompt
authoring exercises with up to 5 catalog-feedback cycles per test. Test 6
is a four-turn conversational edit chain: each user request is its own
inner loop (up to 5 cycles) before moving to the next turn.
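
The per-test loop described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `runTest`, `Author`, `Validate`, and the `CatalogReport` shape are hypothetical names, the calls are synchronous only to keep the sketch short, and the stopping rule (stop once the catalog reports no errors and no warnings) is an assumption drawn from the "DSL clean" wording in this section.

```typescript
// Hypothetical shapes; the real benchmark types are not shown in this digest.
interface CatalogReport {
  errors: string[];
  warnings: string[];
}

// Produces DSL from the prompt plus the previous cycle's feedback (null on the first cycle).
type Author = (prompt: string, feedback: CatalogReport | null) => string;
// Validates DSL against the catalog.
type Validate = (dsl: string) => CatalogReport;

interface LoopResult {
  dsl: string;
  cycles: number;
  report: CatalogReport;
}

const MAX_CYCLES = 5; // "up to 5 catalog-feedback cycles per test"

function runTest(prompt: string, author: Author, validate: Validate): LoopResult {
  let feedback: CatalogReport | null = null;
  let dsl = "";
  let report: CatalogReport = { errors: [], warnings: [] };
  let cycles = 0;
  while (cycles < MAX_CYCLES) {
    cycles += 1;
    dsl = author(prompt, feedback);
    report = validate(dsl);
    // Assumed stopping rule: nothing left for the model to fix.
    if (report.errors.length === 0 && report.warnings.length === 0) break;
    feedback = report;
  }
  return { dsl, cycles, report };
}
```

Under these assumptions, the Cycles column in the per-model tables corresponds to `cycles`: 1 means the first attempt was already clean, 5 means the cap was reached.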

A test is "DSL clean" when the model's final DSL parses, validates against
the catalog with no errors, and the cycle loop ran to a clean state (no
remaining warnings). "Rendered" means the PDF pipeline succeeded
end-to-end; the two columns are counted separately, and several
leaderboard rows have more tests rendered than DSL clean. The chain score
(Test 6 only) measures whether every prior user intent still holds on the
final template: a model can clean all four turns at the catalog level and
still score below 1.00 if a later edit silently undid an earlier one.
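
To make the chain-score idea concrete, here is a small sketch. The digest does not state the formula, so the one used here (share of intent checks that still hold on the final template) is an assumption, and `IntentCheck`, `chainScore`, and the two sample intents are made up for illustration.

```typescript
// Hypothetical: each check inspects the final template text for one earlier request.
interface IntentCheck {
  turn: number;
  description: string;
  holds: (finalTemplate: string) => boolean;
}

// Assumed formula: fraction of intent checks that still hold after the last turn.
function chainScore(finalTemplate: string, checks: IntentCheck[]): number {
  if (checks.length === 0) return 1;
  const kept = checks.filter((c) => c.holds(finalTemplate)).length;
  return kept / checks.length;
}

// Made-up intents from two turns of an edit chain.
const sampleChecks: IntentCheck[] = [
  { turn: 1, description: "add a Notes section", holds: (t) => t.includes("Notes") },
  { turn: 2, description: "rename Total to Amount due", holds: (t) => t.includes("Amount due") },
];

// The turn-2 edit dropped the Notes section, so only one of two intents survives.
const sampleScore = chainScore("Invoice | Amount due", sampleChecks);
```

The point of the sketch is that the score is evaluated once, on the final template, so a regression introduced in a later turn lowers it even when every individual turn validated cleanly.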

Full prompts, sample data, and the broken-DSL source for Test 5 are checked
in at `apps/web/src/data/benchmark-tests.ts` in the makesPDF repo.

## Leaderboard

| Model | Rendered | DSL clean | Total time | Chain score | Run cost |
|---|---|---|---|---|---|
| deepseek-v4-flash | 6/6 | 6/6 | 8m 46s | 1.00 | $0.05 |
| grok-4-fast | 6/6 | 6/6 | 7m 58s | 1.00 | $0.08 |
| gpt-5.4-nano | 6/6 | 6/6 | 4m 17s | 1.00 | $0.10 |
| gpt-5-mini | 6/6 | 6/6 | 4m 14s | 1.00 | $0.12 |
| glm-4.6 | 6/6 | 6/6 | 28m 42s | 1.00 | $0.20 |
| qwen3.5-122b-a10b | 6/6 | 6/6 | 5m 39s | 1.00 | $0.24 |
| kimi-k2.6 | 6/6 | 6/6 | 85m 33s | 1.00 | $0.66 |
| gpt-5 | 6/6 | 6/6 | 12m 28s | 1.00 | $0.71 |
| claude-haiku-4.5 | 6/6 | 6/6 | 10m 30s | 1.00 | $0.85 |
| gpt-5.4 | 6/6 | 6/6 | 9m 28s | 1.00 | $1.13 |
| gemini-2.5-pro | 6/6 | 6/6 | 18m 1s | 1.00 | $1.79 |
| claude-sonnet-4.6 | 6/6 | 6/6 | 12m 58s | 1.00 | $1.83 |
| gpt-5.5 | 6/6 | 6/6 | 5m 45s | 1.00 | $2.04 |
| grok-4 | 6/6 | 6/6 | 42m 46s | 1.00 | $2.42 |
| claude-opus-4.7 | 6/6 | 6/6 | 3m 55s | 1.00 | $2.85 |
| deepseek-v4-pro | 6/6 | 5/6 | 26m 36s | 1.00 | $0.20 |
| gpt-5.1-codex-mini | 6/6 | 5/6 | 8m 50s | 1.00 | $0.25 |
| glm-4.7 | 6/6 | 5/6 | 61m 3s | 1.00 | $0.34 |
| grok-4.20 | 6/6 | 4/6 | 18m 14s | 1.00 | $0.75 |
| qwen3.6-max-preview | 6/6 | 4/6 | 57m 34s | 1.00 | $0.98 |
| llama-4-maverick | 5/6 | 5/6 | 7m | 1.00 | $0.05 |
| gpt-5.4-mini | 5/6 | 5/6 | 6m 11s | 1.00 | $0.43 |
| mistral-medium-3-5 | 5/6 | 5/6 | 2m 15s | 1.00 | $0.73 |
| qwen3-32b | 4/6 | 3/6 | 26m 10s | 0.88 | $0.05 |
| kimi-k2 | 3/6 | 3/6 | 6m 37s | 1.00 | $0.25 |

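Because the table above is plain Markdown, it can also be queried outside a chat assistant. The sketch below parses leaderboard rows and picks the cheapest and the fastest model with a perfect record. `Row`, `parseDuration`, and `parseRow` are illustrative names, not part of makesPDF; the three sample rows are copied from the table.

```typescript
// Illustrative shape for one leaderboard row; not a makesPDF type.
interface Row {
  model: string;
  rendered: number;   // tests rendered, out of 6
  dslClean: number;   // tests DSL clean, out of 6
  seconds: number;    // total wall time in seconds
  chainScore: number;
  costUsd: number;
}

// Converts "8m 46s" to 526; a bare "7m" is also accepted.
function parseDuration(text: string): number {
  const match = /^(\d+)m(?:\s+(\d+)s)?$/.exec(text.trim());
  if (!match) throw new Error(`Unrecognised duration: ${text}`);
  return Number(match[1]) * 60 + Number(match[2] ?? "0");
}

// Splits one Markdown table row into typed fields.
function parseRow(line: string): Row {
  const cells = line.split("|").map((c) => c.trim()).filter((c) => c !== "");
  if (cells.length !== 6) throw new Error(`Expected 6 cells: ${line}`);
  const [model, rendered, dslClean, time, chain, cost] = cells;
  return {
    model,
    rendered: Number(rendered.split("/")[0]),
    dslClean: Number(dslClean.split("/")[0]),
    seconds: parseDuration(time),
    chainScore: Number(chain),
    costUsd: Number(cost.replace("$", "")),
  };
}

const rows = [
  "| deepseek-v4-flash | 6/6 | 6/6 | 8m 46s | 1.00 | $0.05 |",
  "| claude-opus-4.7 | 6/6 | 6/6 | 3m 55s | 1.00 | $2.85 |",
  "| qwen3-32b | 4/6 | 3/6 | 26m 10s | 0.88 | $0.05 |",
].map(parseRow);

// Keep only rows with 6/6 rendered, 6/6 DSL clean, and chain score 1.00.
const perfect = rows.filter((r) => r.rendered === 6 && r.dslClean === 6 && r.chainScore === 1);
const cheapest = [...perfect].sort((a, b) => a.costUsd - b.costUsd)[0];
const fastest = [...perfect].sort((a, b) => a.seconds - b.seconds)[0];
console.log(cheapest.model, fastest.model); // deepseek-v4-flash claude-opus-4.7
```
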
## Per-model results

### deepseek-v4-flash (`deepseek/deepseek-v4-flash`)

- **OpenRouter provider key:** `openrouter-deepseek-v4-flash`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.05 · **Total wall time:** 8m 46s · **Avg cycles:** 2.00 · **Skill:** 0.99
> Perfect pass rate, but this model rarely nails layout on the first attempt — it burns multiple cycles on adaptation, CV generation, and chat-editing tasks (up to 4), suggesting it over-generates or mis-sequences structural decisions before self-correcting. The single content imperfection is a T2 column-width mismatch: it labelled a header as stretch but failed to assign a `1fr`/`auto-stretch` column to absorb surplus space. Template variable substitution is flawless throughout.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 1 | 0err/0warn |  |
| adapt-invoice | ✓ | 3 | 0err/2warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 3 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### grok-4-fast (`x-ai/grok-4-fast`)

- **OpenRouter provider key:** `openrouter-grok-4-fast`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.08 · **Total wall time:** 7m 58s · **Avg cycles:** 1.83 · **Skill:** 0.98
> Perfect pass rate with near-flawless first attempts, but table layout trips it up consistently: it declares stretch column headers (T2) without a matching `1fr`/`auto-stretch` column to absorb surplus width, and occasionally emits rows with fewer cells than the table has columns rather than using `colspan` (T4). The `fix-broken` task needed three cycles and still left a warning; `chat-edit` took four cycles to converge — both suggesting iterative correction rather than clean-shot repair on complex editing tasks.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/1warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 3 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5.4-nano (`openai/gpt-5.4-nano`)

- **OpenRouter provider key:** `openrouter-gpt5-4-nano`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.10 · **Total wall time:** 4m 17s · **Avg cycles:** 1.83 · **Skill:** 0.99
> Strong overall performer — all six tests pass with no errors, and template variable handling is spotless. The one recurring slip is a table column-width mismatch: it labels a header as stretch (e.g. "Metric") without assigning a corresponding `1fr`/`auto-stretch` column, suggesting it handles header semantics and column sizing as independent concerns rather than a paired rule. Needs up to three cycles to settle the CV layout, but converges cleanly without escalating errors.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/1warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 3 | 0err/1warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5-mini (`openai/gpt-5-mini`)

- **OpenRouter provider key:** `openrouter-gpt5-mini`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.12 · **Total wall time:** 4m 14s · **Avg cycles:** 1.67 · **Skill:** 0.93
> Strong across the board but shows a recurring pattern of needing a correction cycle to clean up self-introduced issues — four of six tests required two attempts. The characteristic slip is in column-width semantics: it labels a header as a stretch column without assigning a corresponding `1fr`/`auto-stretch` column type. It also hallucinated a `{{notes}}` template variable not present in the sample data, suggesting it fills gaps from intuition rather than strict schema inspection.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/1warn |  |
| adapt-invoice | ✓ | 2 | 0err/1warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/1warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### glm-4.6 (`z-ai/glm-4.6`)

- **OpenRouter provider key:** `openrouter-glm-46`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.20 · **Total wall time:** 28m 42s · **Avg cycles:** 2.50 · **Skill:** 0.99
> Passes everything cleanly, but convergence is uneven: CV generation needed 5 cycles and invoice adaptation 4, suggesting the model overshoots on layout complexity and self-corrects rather than landing right first. The one recurring table slip is declaring a stretch column header (e.g. "Metric") without assigning a matching `1fr`/`auto-stretch` column to absorb surplus width. Template variable handling is flawless and simpler documents close in one shot.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 1 | 0err/0warn |  |
| adapt-invoice | ✓ | 4 | 0err/1warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 5 | 0err/1warn |  |
| fix-broken | ✓ | 2 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### deepseek-v4-pro (`deepseek/deepseek-v4-pro`)

- **OpenRouter provider key:** `openrouter-deepseek-v4`
- **Rendered:** 6/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.20 · **Total wall time:** 26m 36s · **Avg cycles:** 2.50 · **Skill:** 0.99
> Strong across invoice tasks, with template variables handled perfectly throughout. CV generation is the notable weak spot: it used all 5 cycles and still failed with 3 catalog errors, suggesting persistent trouble authoring that document structure rather than a one-off slip. The single table warning reflects a consistent pattern of declaring stretch column headers without providing a matching `1fr`/`auto-stretch` column to absorb surplus width.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 3 | 0err/0warn |  |
| adapt-invoice | ✓ | 1 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✗ | 5 | 3err/0warn | 3 catalog error(s) |
| fix-broken | ✓ | 2 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 2 | 0err/0warn |  |

---

### qwen3.5-122b-a10b (`qwen/qwen3.5-122b-a10b`)

- **OpenRouter provider key:** `openrouter-qwen3`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.24 · **Total wall time:** 5m 39s · **Avg cycles:** 2.50 · **Skill:** 0.99
> Perfect pass rate with near-flawless first-attempt skill, but convergence is uneven — the CV took 5 cycles while simpler tasks landed in 1. The one recurring content slip is a T2 column-width mismatch: labelling a header as a stretch column (e.g. "Metric") without declaring a corresponding `1fr`/`auto-stretch` column, leaving surplus space unabsorbed. Otherwise unremarkable in its failure modes.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 3 | 0err/1warn |  |
| adapt-invoice | ✓ | 3 | 0err/0warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 5 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5.1-codex-mini (`openai/gpt-5.1-codex-mini`)

- **OpenRouter provider key:** `openrouter-gpt5-1-codex-mini`
- **Rendered:** 6/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.25 · **Total wall time:** 8m 50s · **Avg cycles:** 2.50 · **Skill:** 1.00
> Handles table authoring and template variables cleanly across all tested formats, but stumbles hard on CV generation — burning through 5 cycles and accumulating 10 content errors without resolving them, suggesting a systematic problem with that document schema rather than a one-off miss. The remaining tests trend toward multi-cycle convergence (pagination and fix-broken each needed 3 passes), so first-attempt reliability is lower than the pass rate implies.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 1 | 0err/0warn |  |
| paginate-invoice | ✓ | 3 | 0err/0warn |  |
| generate-cv | ✗ | 5 | 10err/0warn | 10 catalog error(s) |
| fix-broken | ✓ | 3 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/2warn |  |

---

### glm-4.7 (`z-ai/glm-4.7`)

- **OpenRouter provider key:** `openrouter-glm-47`
- **Rendered:** 6/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.34 · **Total wall time:** 61m 3s · **Avg cycles:** 3.00 · **Skill:** 0.99
> Strong semantic accuracy throughout (skill scores stay near perfect even where it fails), but it burns cycles iterating toward the finish line, averaging 3.0 per test and spending 5 cycles on the one outright failure. That content failure in `adapt-invoice` traces to a T2 violation: it declared a stretch-labelled column header ("Metric") without assigning a corresponding `1fr`/`auto-stretch` column definition to absorb surplus width, a structural column-layout mistake it never self-corrected.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✗ | 5 | 0err/1warn |  |
| paginate-invoice | ✓ | 4 | 0err/0warn |  |
| generate-cv | ✓ | 3 | 0err/0warn |  |
| fix-broken | ✓ | 3 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### kimi-k2.6 (`moonshotai/kimi-k2.6`)

- **OpenRouter provider key:** `openrouter-kimi-k2-6`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.66 · **Total wall time:** 85m 33s · **Avg cycles:** 2.00 · **Skill:** 1.00
> Perfect score across all tasks with no errors or warnings. The fix-broken case needed 3 cycles to converge, suggesting the model required a couple of correction rounds before producing a clean repair, but every other task settled in 1-2. Unremarkable in terms of failure patterns — there are none to report.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 3 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5 (`openai/gpt-5`)

- **OpenRouter provider key:** `openrouter-gpt5`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.71 · **Total wall time:** 12m 28s · **Avg cycles:** 1.83 · **Skill:** 1.00
> Perfect score across all six tests with no errors or warnings. The fix-broken task needed 3 cycles to resolve, suggesting it doesn't always self-correct broken DSL immediately, but it gets there cleanly. Otherwise unremarkable in failure patterns — there are none.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 3 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### grok-4.20 (`x-ai/grok-4.20`)

- **OpenRouter provider key:** `openrouter-grok-4-20`
- **Rendered:** 6/6 · **DSL clean:** 4/6 · **Chain score:** 1.00
- **Run cost:** $0.75 · **Total wall time:** 18m 14s · **Avg cycles:** 2.83 · **Skill:** 0.99
> Both failures share the same root cause: the model repeatedly declares a stretch-labelled column header ("Metric") without assigning the corresponding column a `1fr`/`auto`/`auto-stretch` width — and it never self-corrects across five cycles on each invoice task. Everything renders cleanly and template variable handling is flawless, so the drag is entirely this one persistent T2 misconfiguration in table layout.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✗ | 5 | 0err/1warn |  |
| adapt-invoice | ✗ | 5 | 0err/2warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### claude-haiku-4.5 (`anthropic/claude-haiku-4.5`)

- **OpenRouter provider key:** `openrouter-claude-haiku`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $0.85 · **Total wall time:** 10m 30s · **Avg cycles:** 2.00 · **Skill:** 0.99
> Clean sweep across all six tests with near-perfect skill scores and no errors. The only notable slip is a stretch-column header misconfiguration in a table: declaring a "Metric" header without a corresponding `1fr`/`auto-stretch` column to absorb surplus width. Otherwise unremarkable in the best sense: every test resolves within 3 cycles, and it handles template variables, layout adaptation, and pagination without fumbling.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 3 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### qwen3.6-max-preview (`qwen/qwen3.6-max-preview`)

- **OpenRouter provider key:** `openrouter-qwen3-6-max`
- **Rendered:** 6/6 · **DSL clean:** 4/6 · **Chain score:** 1.00
- **Run cost:** $0.98 · **Total wall time:** 57m 34s · **Avg cycles:** 2.83 · **Skill:** 0.99
> Template variable handling is flawless and overall DSL knowledge is high, but the model failed to close on a T2 column-width mismatch in invoice generation — it burned 5 cycles without ever adding the required `1fr`/`auto` column to back a stretch header, ultimately failing the test despite zero errors. The same warning surfaced in the adapt run but didn't block passage there. One CV request failed in transit and should not be attributed to the model.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✗ | 5 | 0err/1warn |  |
| adapt-invoice | ✓ | 3 | 0err/1warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✗ | 5 | 1err/0warn | 1 catalog error(s) |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5.4 (`openai/gpt-5.4`)

- **OpenRouter provider key:** `openrouter-gpt5-4`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $1.13 · **Total wall time:** 9m 28s · **Avg cycles:** 1.67 · **Skill:** 0.99
> Near-perfect across all six tests, with flawless template variable substitution and clean renders every time. The one recurring slip is a T2 column-width mismatch — declaring a stretch column header ("Metric") without a corresponding `1fr`/`auto-stretch` column to absorb surplus space. The chat-edit task burned 4 cycles to converge, the only sign of iterative struggle in an otherwise efficient run.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gemini-2.5-pro (`google/gemini-2.5-pro`)

- **OpenRouter provider key:** `openrouter-gemini-pro`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $1.79 · **Total wall time:** 18m 1s · **Avg cycles:** 2.83 · **Skill:** 0.99
> Perfect pass rate, but convergence is uneven: pagination burns 5 cycles before landing, and three other tests need 3. The one concrete content slip is declaring a stretch column header (`Metric`) without a corresponding `1fr`/`auto-stretch` column to absorb surplus space, a T2 layout rule it reliably trips on. Otherwise output is clean across template vars and table authoring with no hard errors.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 3 | 0err/0warn |  |
| adapt-invoice | ✓ | 3 | 0err/2warn |  |
| paginate-invoice | ✓ | 5 | 0err/0warn |  |
| generate-cv | ✓ | 3 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### claude-sonnet-4.6 (`anthropic/claude-sonnet-4.6`)

- **OpenRouter provider key:** `openrouter-claude-sonnet`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $1.83 · **Total wall time:** 12m 58s · **Avg cycles:** 1.67 · **Skill:** 0.99
> Clean sweep with no errors and only one warning across all tests. The sole slip is a T2 layout rule: it declared a stretch column header ("Metric") without pairing it with a `1fr`/`auto-stretch` column to absorb surplus width — a recurring DSL gotcha that caught it once. Otherwise, template variable handling and table authoring are essentially flawless; the 1.67 average cycles reflects minor self-correction, not structural confusion.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5.5 (`openai/gpt-5.5`)

- **OpenRouter provider key:** `openrouter-gpt5-5`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $2.04 · **Total wall time:** 5m 45s · **Avg cycles:** 1.83 · **Skill:** 0.99
> Near-flawless across all tests, with the only content blemish being a T2 warning where a "Metric" stretch column header was declared without a corresponding `1fr`/`auto` column to absorb surplus space. The chat-edit task burned 4 cycles to converge — double the typical run — suggesting it over-iterates on interactive edits. Template variable handling is spotless.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 2 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### grok-4 (`x-ai/grok-4`)

- **OpenRouter provider key:** `openrouter-grok`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $2.42 · **Total wall time:** 42m 46s · **Avg cycles:** 3.00 · **Skill:** 1.00
> Perfect score with no content errors, but this model rarely converges in one shot — it averaged 3 cycles across all tasks, hitting 4 on both the CV and chat-edit runs. No observable failure patterns in table authoring or template variable handling; every pass was clean. Reliable but iterative: expect it to self-correct rather than nail complex layouts on the first attempt.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 3 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✓ | 4 | 0err/0warn |  |
| fix-broken | ✓ | 3 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 4 | 0err/0warn |  |

---

### claude-opus-4.7 (`anthropic/claude-opus-4.7`)

- **OpenRouter provider key:** `openrouter-claude-opus`
- **Rendered:** 6/6 · **DSL clean:** 6/6 · **Chain score:** 1.00
- **Run cost:** $2.85 · **Total wall time:** 3m 55s · **Avg cycles:** 1.83 · **Skill:** 0.99
> Perfect pass rate with near-flawless first-attempt skill, but it consistently needs extra cycles to converge — chat-edit burned 4 passes, CV authoring 3. The one recurring content slip is a T2 table misconfiguration: it declared a stretch column header without assigning a `1fr`/`auto-stretch` column to actually absorb the surplus width. Otherwise unremarkable in its failure modes.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/0warn |  |
| paginate-invoice | ✓ | 1 | 0err/0warn |  |
| generate-cv | ✓ | 3 | 0err/0warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### llama-4-maverick (`meta-llama/llama-4-maverick`)

- **OpenRouter provider key:** `openrouter-llama4-maverick`
- **Rendered:** 5/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.05 · **Total wall time:** 7m · **Avg cycles:** 2.00 · **Skill:** 0.99
> Strong on template variable handling and invoice tasks, but burns extra cycles converging — the chat-edit test took 4 cycles and two invoice tests needed 3 each. The sole content failure is on CV generation, where output was wrong enough to not score at all. Recurring table issue: declaring a stretch column header (e.g. "Metric") without a corresponding `1fr`/`auto-stretch` column to actually absorb the surplus space.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 1 | 0err/0warn |  |
| adapt-invoice | ✓ | 3 | 0err/1warn |  |
| paginate-invoice | ✓ | 3 | 0err/1warn |  |
| generate-cv | ✗ | 2 | 0err/0warn | DSL execution failed: "undefined" is not defined (1:13) |
| fix-broken | ✓ | 2 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### gpt-5.4-mini (`openai/gpt-5.4-mini`)

- **OpenRouter provider key:** `openrouter-gpt5-4-mini`
- **Rendered:** 5/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.43 · **Total wall time:** 6m 11s · **Avg cycles:** 2.00 · **Skill:** 0.99
> Strong across generation and editing tasks, with near-perfect template variable handling and clean first attempts. The one content failure is on pagination, where it produced incorrect output despite no API issues. Table authoring shows a recurring tendency to declare stretch column headers without a corresponding `1fr`/`auto-stretch` column to absorb the surplus.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 3 | 0err/0warn |  |
| paginate-invoice | ✗ | 2 | 0err/0warn | DSL execution failed: Cannot evaluate ForStatement |
| generate-cv | ✓ | 2 | 0err/1warn |  |
| fix-broken | ✓ | 2 | 0err/0warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### mistral-medium-3-5 (`mistralai/mistral-medium-3-5`)

- **OpenRouter provider key:** `openrouter-mistral-medium`
- **Rendered:** 5/6 · **DSL clean:** 5/6 · **Chain score:** 1.00
- **Run cost:** $0.73 · **Total wall time:** 2m 15s · **Avg cycles:** 2.50 · **Skill:** 1.00
> Handles invoicing tasks reliably, though it rarely lands cleanly on the first pass — pagination and chat-editing both needed 3–4 cycles to converge. The CV generation is the standout failure: it burned all 5 cycles and still couldn't produce a passing document, suggesting it gets trapped in a correction loop on more structurally complex layouts without making meaningful forward progress.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 1 | 0err/0warn |  |
| adapt-invoice | ✓ | 2 | 0err/1warn |  |
| paginate-invoice | ✓ | 3 | 0err/1warn |  |
| generate-cv | ✗ | 5 | 1err/1warn | 1 catalog error(s) |
| fix-broken | ✓ | 3 | 0err/1warn |  |
| chat-edit-invoice | ✓ | 1 | 0err/0warn |  |

---

### qwen3-32b (`qwen/qwen3-32b`)

- **OpenRouter provider key:** `openrouter-qwen3-32b`
- **Rendered:** 4/6 · **DSL clean:** 3/6 · **Chain score:** 0.88
- **Run cost:** $0.05 · **Total wall time:** 26m 10s · **Avg cycles:** 2.83 · **Skill:** 0.93
> Strong initial instincts (mean first-attempt skill 0.93, perfect table authoring) but fails to close: both generate-invoice and generate-cv score skill=1 yet still fail after 3 and 5 cycles respectively, suggesting the model produces structurally valid DSL that quietly violates catalog or render constraints it doesn't catch itself. The V1 template-var failure, emitting `{{notes}}` against sample data that has no such key, is characteristic: it hallucinates plausible-sounding variable names rather than grounding against the actual schema. fix-broken returned a null skill score, indicating it couldn't diagnose the existing breakage at all.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✗ | 3 | 0err/0warn | DSL execution failed: Unterminated string constant (74:8) |
| adapt-invoice | ✓ | 4 | 0err/0warn |  |
| paginate-invoice | ✓ | 2 | 0err/0warn |  |
| generate-cv | ✗ | 5 | 1err/0warn | 1 catalog error(s) |
| fix-broken | ✗ | 2 | 0err/0warn | DSL execution failed: object is not iterable (cannot read property Symbol(Symbol.iterator)) |
| chat-edit-invoice | ✓ | 1 | 0err/1warn |  |

---

### kimi-k2 (`moonshotai/kimi-k2`)

- **OpenRouter provider key:** `openrouter-kimi-k2`
- **Rendered:** 3/6 · **DSL clean:** 3/6 · **Chain score:** 1.00
- **Run cost:** $0.25 · **Total wall time:** 6m 37s · **Avg cycles:** 2.67 · **Skill:** 1.00
> Handles routine invoice generation and editing reliably, but fails content-side on pagination, CV layout, and broken-document repair — suggesting it loses the thread when structural complexity increases beyond a flat table. The adapt task needed 4 cycles to converge (with a warning), and the CV likewise burned 4 cycles before ultimately failing. No transport issues; all three failures are the model's own output.

| Test | Result | Cycles | Catalog | Notes |
|---|---|---|---|---|
| generate-invoice | ✓ | 2 | 0err/0warn |  |
| adapt-invoice | ✓ | 4 | 0err/1warn |  |
| paginate-invoice | ✗ | 2 | 0err/0warn | DSL execution failed: Unexpected token (59:42) |
| generate-cv | ✗ | 4 | 0err/0warn | DSL execution failed: Unexpected token (110:17) |
| fix-broken | ✗ | 2 | 0err/0warn | DSL execution failed: "undefined" is not defined (11:26) |
| chat-edit-invoice | ✓ | 2 | 0err/0warn |  |

## Reproducing

The bench script lives at `scripts/tools/benchmark-models.ts` in the
`makesPDF` repo. To re-run a single model against the current skill:

```bash
npx tsx scripts/tools/benchmark-models.ts \
  --providers <openrouter-key> --merge --iterate 5
```

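Set `OPENROUTER_API_KEY` first. `<openrouter-key>` stands for a provider
key from the per-model sections above (for example `openrouter-gpt5-mini`),
not the API key. Use `--missing-only` to fill gaps without re-spending on
existing rows. Use `--tests chat-edit-invoice` (or any other test name) to
scope to one test.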
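
To re-run several models in sequence, the invocation above can be generated per provider key. A hedged sketch: `benchArgs` and `BenchOptions` are hypothetical helpers, only flags named in this section are used, and two things are assumptions to check against the script itself: that `--iterate` is the cycle cap, and that `--missing-only` and `--tests` combine with `--merge` as shown.

```typescript
// Hypothetical options wrapper around the flags named in this section.
interface BenchOptions {
  provider: string;      // an "OpenRouter provider key" from the per-model sections
  iterate?: number;      // assumed to be the cycle cap; the command above uses 5
  missingOnly?: boolean; // adds --missing-only
  test?: string;         // adds --tests <name>
}

// Builds the argument list to pass to `npx`.
function benchArgs(opts: BenchOptions): string[] {
  const args = [
    "tsx", "scripts/tools/benchmark-models.ts",
    "--providers", opts.provider,
    "--merge",
    "--iterate", String(opts.iterate ?? 5),
  ];
  if (opts.missingOnly) args.push("--missing-only");
  if (opts.test) args.push("--tests", opts.test);
  return args;
}

// Print one command per model instead of running it, so nothing is spent by accident.
for (const provider of ["openrouter-gpt5-mini", "openrouter-claude-haiku"]) {
  console.log(["npx", ...benchArgs({ provider, test: "chat-edit-invoice" })].join(" "));
}
```

Printing the commands first makes it easy to review the per-model cost implied by the leaderboard before launching anything.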