kind: claim
status: needs_revalidation
visibility: private
license: internal
summary: Stage 6 OpenAkashicBench single-run result (2026-04-18, claude-haiku-4-5, 12 tasks) suggests openakashic(vault+core-api) outperformed parametric baseline and standard tool-use on this insu-internal task set, but the claim requires rerun because k=1, evaluator/rubric noise, and standard-tool anomaly remain.
tags:
- claim
- benchmark
- slm
- openakashic
- rag
- capsules
related:
- personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md
- personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md
- personal_vault/projects/personal/openakashic/reference/bench-v0-7-rubric-calibration-and-memory-contract-rewrite-2026-04-25.md
Claim: Single-Run OpenAkashicBench Suggests Capsule Retrieval Benefit for Insu-Internal SLM Tasks
Revised Claim
Stage 6 OpenAkashicBench (2026-04-18, claude-haiku-4-5, 12 insu-internal tasks) produced a favorable single-run result for openakashic(vault+core-api) over both baseline(parametric) and standard(notes_list/read+web_search):
| Condition | Pass@1 | Hit rate | Traps |
|---|---|---|---|
| baseline(parametric) | 8/12 | 0.79 | 5 |
| standard(notes_list/read+web_search) | 5/12 | 0.50 | 0 |
| openakashic(vault+core-api) | 10/12 | 0.86 | 1 |
This supports a cautious interpretation: for this internal benchmark slice, validated capsule retrieval plus fallback appeared more reliable than either parametric-only answering or naive tool access.
Evidence
- Primary vault evidence: personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md.
- Related distilled capsule: personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md.
- Public/core claim record: accepted public claim 8fec6b32-f4f3-4379-9def-74b2691378b9, sourced from this claim.
Caveats / Revalidation Required
- This is k=1 single-run evidence on 12 tasks, not a statistically stable result.
- The original claim's strong wording should not be read as a general SLM/RAG result.
- The benchmark itself records a persistent standard < baseline anomaly: when standard tools return empty results for insu-internal information, the model tends to give up instead of falling back parametrically.
- Some baseline answers had JSON/protocol format violations that were not fully reflected in the original summary.
- domain_jlpt_gen regressed for openakashic in Stage 6, and multihop_synthesis still trapped once.
- Later v0.7 work calibrated judge/rubric behavior and performed focused mini re-judges, but does not yet replace the Stage 6 k=1 result with a full rerun.
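To make the k=1 caveat concrete, the sketch below puts a 95% Wilson score interval around each Pass@1 count from the Stage 6 table. The choice of the Wilson interval, the z value of 1.96, and the treatment of the 12 tasks as independent trials are assumptions made here for illustration; none of them is part of the benchmark itself. The names `wilson_interval` and `STAGE6_PASS_AT_1` are likewise hypothetical.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for `passes` successes out of `n` trials."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Pass@1 counts copied from the Stage 6 table (single run, 12 tasks).
STAGE6_PASS_AT_1 = {
    "baseline(parametric)": (8, 12),
    "standard(notes_list/read+web_search)": (5, 12),
    "openakashic(vault+core-api)": (10, 12),
}

if __name__ == "__main__":
    for condition, (passes, n) in STAGE6_PASS_AT_1.items():
        lo, hi = wilson_interval(passes, n)
        print(f"{condition}: {passes}/{n}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the baseline and openakashic intervals overlap substantially, which is consistent with treating the Stage 6 ordering as suggestive rather than established.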
Maintenance Note
2026-05-26 Sagwan review disputed the stronger efficacy formulation because k=1 and format-violation handling make the claim insufficiently validated. Keep the note as an internal benchmark claim, but treat it as needs_revalidation until at least k=3 reruns with the updated judge/rubric and explicit empty-tool fallback handling are available.
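One way to picture "explicit empty-tool fallback handling" is a wrapper that answers parametrically whenever retrieval returns nothing usable, instead of giving up. This is a minimal sketch and not the benchmark harness: `answer_with_fallback` and the three callables it takes are hypothetical names introduced here, and the actual standard-condition tool loop may be structured differently.

```python
from typing import Callable, Sequence


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], Sequence[str]],                   # hypothetical tool call
    answer_from_context: Callable[[str, Sequence[str]], str],   # hypothetical grounded answerer
    answer_parametric: Callable[[str], str],                    # hypothetical parametric answerer
) -> dict:
    """Answer from retrieved context when any exists, else fall back parametrically."""
    # Treat blank or whitespace-only hits as empty results.
    hits = [h for h in retrieve(question) if h and h.strip()]
    if hits:
        return {"answer": answer_from_context(question, hits), "source": "tools"}
    # Empty tool results: record the fallback explicitly so a judge can score it.
    return {"answer": answer_parametric(question), "source": "parametric_fallback"}


if __name__ == "__main__":
    result = answer_with_fallback(
        "example question",
        retrieve=lambda q: [],
        answer_from_context=lambda q, hits: "grounded: " + hits[0],
        answer_parametric=lambda q: "best parametric guess",
    )
    print(result)
```

Recording the `source` field keeps fallback answers distinguishable from tool-grounded ones, so a rerun could report how often the standard condition hit empty results.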
Sagwan Revalidation 2026-05-27T03:18:39Z
- verdict: ok
- note: The single-run limitation and the need for a rerun are stated explicitly, so the note remains reusable.
Sagwan Revalidation 2026-05-28T03:46:09Z
- verdict: ok
- note: The single-run limitation and the need for a rerun are already stated, so the note remains reusable.
Sagwan Revalidation 2026-05-29T04:16:20Z
- verdict: refresh
- note: The evidence is a single k=1 run, so the definitive claim in the title needs updating.
Sagwan Revalidation 2026-05-30T04:18:26Z
- verdict: ok
- note: The single-run figures and the rerun-required limitation are stated, so the note remains reusable.
Sagwan Revalidation 2026-05-31T04:25:38Z
- verdict: ok
- note: The single-experiment limitation and the need for a rerun are stated, so the note remains reusable with caution.
Sagwan Revalidation 2026-06-01T08:30:58Z
- verdict: refresh
- note: The single-run limitation and the anomaly remain, so a draft that incorporates rerun results is needed.
Sagwan Revalidation 2026-06-02T09:32:45Z
- verdict: ok
- note: No new evidence has appeared since yesterday's revalidation, and the single-experiment limitation is stated.
Sagwan Revalidation 2026-06-03T10:20:32Z
- verdict: ok
- note: The single-experiment limitation and the need for a rerun are already stated, so the note remains reusable.
Sagwan Revalidation 2026-06-04T10:48:19Z
- verdict: ok
- note: The single-run limitation and the need for revalidation are stated, so the note remains cautiously valid.
Sagwan Revalidation 2026-06-05T11:09:59Z
- verdict: ok
- note: The single-experiment limitation is stated, so the note remains reusable as a conservative claim.