kind: claim
status: needs_revalidation
visibility: private
license: internal
summary: Stage 6 OpenAkashicBench single-run result (2026-04-18, claude-haiku-4-5, 12 tasks) suggests openakashic(vault+core-api) outperformed parametric baseline and standard tool-use on this insu-internal task set, but the claim requires rerun because k=1, evaluator/rubric noise, and standard-tool anomaly remain.
tags:
- claim
- benchmark
- slm
- openakashic
- rag
- capsules
related:
- personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md
- personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md
- personal_vault/projects/personal/openakashic/reference/bench-v0-7-rubric-calibration-and-memory-contract-rewrite-2026-04-25.md

Claim: Single-Run OpenAkashicBench Suggests Capsule Retrieval Benefit for Insu-Internal SLM Tasks

Revised Claim

Stage 6 OpenAkashicBench (2026-04-18, claude-haiku-4-5, 12 insu-internal tasks) produced a favorable single-run result for openakashic(vault+core-api) over both baseline(parametric) and standard(notes_list/read+web_search):

Condition	Pass@1	Hit rate	Traps
baseline(parametric)	8/12	0.79	5
standard(notes_list/read+web_search)	5/12	0.50	0
openakashic(vault+core-api)	10/12	0.86	1

This supports a cautious interpretation: for this internal benchmark slice, validated capsule retrieval plus fallback appeared more reliable than either parametric-only answering or naive tool access.

Evidence

Primary vault evidence: personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md.
Related distilled capsule: personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md.
Public/core claim record: accepted public claim 8fec6b32-f4f3-4379-9def-74b2691378b9, sourced from this claim.

Caveats / Revalidation Required

This is k=1 single-run evidence on 12 tasks, not a statistically stable result.
The original claim's strong wording should not be read as a general SLM/RAG result.
The benchmark itself records a persistent standard < baseline anomaly: when standard tools return empty results for insu-internal information, the model tends to give up instead of falling back parametrically.
Some baseline answers had JSON/protocol format violations that were not fully reflected in the original summary.
domain_jlpt_gen regressed for openakashic in Stage 6, and multihop_synthesis still trapped once.
Later v0.7 work calibrated judge/rubric behavior and performed focused mini re-judges, but does not yet replace the Stage 6 k=1 result with a full rerun.
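
To make the k=1 caveat concrete, the sketch below computes 95% Wilson score intervals for the three Pass@1 counts in the table above. The function name `wilson_interval` and the choice of the Wilson interval are illustrative assumptions, not part of the benchmark tooling, and treating the 12 tasks as independent trials is itself a simplification.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a pass count (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need 0 <= passes <= total and total > 0")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Pass@1 counts copied from the Stage 6 table (single run, 12 tasks).
STAGE6_PASS_AT_1 = {
    "baseline(parametric)": (8, 12),
    "standard(notes_list/read+web_search)": (5, 12),
    "openakashic(vault+core-api)": (10, 12),
}

for condition, (passes, total) in STAGE6_PASS_AT_1.items():
    low, high = wilson_interval(passes, total)
    print(f"{condition}: {passes}/{total}, 95% interval [{low:.2f}, {high:.2f}]")
```

Under these assumptions the three intervals overlap pairwise, which is consistent with reading the ranking as suggestive rather than established.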

Maintenance Note

2026-05-26 Sagwan review disputed the stronger efficacy formulation because k=1 and format-violation handling make the claim insufficiently validated. Keep the note as an internal benchmark claim, but treat it as needs_revalidation until at least k=3 reruns with the updated judge/rubric and explicit empty-tool fallback handling are available.
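
The two conditions named above, an explicit empty-tool fallback and at least k=3 reruns, could be sketched roughly as follows. Every name here (`select_answer`, `summarize_reruns`, `min_k`) and the example outcomes are hypothetical and made up for illustration; this is not the OpenAkashicBench harness or its data.

```python
from statistics import mean


def select_answer(tool_results: list[str], parametric_answer: str) -> tuple[str, str]:
    """Hypothetical policy: use tool output if any is non-empty, else fall back."""
    non_empty = [r for r in tool_results if r.strip()]
    if non_empty:
        return "tool", "\n".join(non_empty)
    # Empty tool results: answer parametrically instead of giving up.
    return "parametric_fallback", parametric_answer


def summarize_reruns(runs: list[list[bool]], min_k: int = 3) -> dict:
    """Summarize k reruns; each run is a list of per-task pass/fail flags."""
    if not runs:
        raise ValueError("at least one run is required")
    n_tasks = len(runs[0])
    if n_tasks == 0 or any(len(run) != n_tasks for run in runs):
        raise ValueError("every run must cover the same non-empty task list")
    per_run_rate = [sum(run) / n_tasks for run in runs]
    # A task is unstable if its outcome differs between reruns.
    unstable = [i for i in range(n_tasks) if len({run[i] for run in runs}) > 1]
    return {
        "k": len(runs),
        "mean_pass_rate": mean(per_run_rate),
        "unstable_tasks": unstable,
        "meets_min_k": len(runs) >= min_k,
    }


if __name__ == "__main__":
    # Made-up outcomes for 4 tasks over 3 reruns, for illustration only.
    example_runs = [
        [True, True, False, True],
        [True, False, False, True],
        [True, True, False, True],
    ]
    print(select_answer(["", "  "], "answer from model memory"))
    print(summarize_reruns(example_runs))
```

Reporting the unstable tasks alongside the mean pass rate is a design choice in this sketch: it separates run-to-run noise from consistent failures, which a single run cannot distinguish.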

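Sagwan Revalidation 2026-05-27T03:18:39Z

verdict: ok
note: States the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-05-28T03:46:09Z

verdict: ok
note: Already states the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-05-29T04:16:20Z

verdict: refresh
note: The evidence is a single k=1 run, so the title's definitive claim needs updating.

Sagwan Revalidation 2026-05-30T04:18:26Z

verdict: ok
note: The single-run figures and the rerun-required limitation are stated, so it remains reusable.

Sagwan Revalidation 2026-05-31T04:25:38Z

verdict: ok
note: States the single-experiment limitation and the need for a rerun, so it remains reusable with caution.

Sagwan Revalidation 2026-06-01T08:30:58Z

verdict: refresh
note: The single run and the anomaly remain, so a draft that incorporates a rerun is needed.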

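Sagwan Revalidation 2026-06-02T09:32:45Z

verdict: ok
note: No new evidence since yesterday's revalidation, and the single-experiment limitation is stated.

Sagwan Revalidation 2026-06-03T10:20:32Z

verdict: ok
note: Already states the single-experiment limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-06-04T10:48:19Z

verdict: ok
note: States the single-run limitation and the need for revalidation, so it remains cautiously valid.

Sagwan Revalidation 2026-06-05T11:09:59Z

verdict: ok
note: States the single-experiment limitation, so it remains reusable as a conservative claim.

Sagwan Revalidation 2026-06-06T11:59:06Z

verdict: ok
note: The single-run evidence and the rerun-required caveat are stated, so it remains reusable.

Sagwan Revalidation 2026-06-07T12:48:02Z

verdict: ok
note: The single-run limitation and the need for a rerun are stated, so it remains reusable.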

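Sagwan Revalidation 2026-06-08T13:09:13Z

verdict: ok
note: States the single-run limitation and the need for a re-experiment, so it remains reusable.

Sagwan Revalidation 2026-06-09T13:39:31Z

verdict: ok
note: States the single-experiment limitation and the need for revalidation, so it remains reusable.

Sagwan Revalidation 2026-06-10T16:12:57Z

verdict: ok
note: Already states the single-experiment limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-06-11T18:16:45Z

verdict: ok
note: No new evidence since yesterday's validation, and the single-run limitation is stated in the body.

Sagwan Revalidation 2026-06-12T18:34:37Z

verdict: ok
note: States the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-06-13T20:24:44Z

verdict: ok
note: States the single-experiment limitation and the need for a rerun, so it remains valid as a cautious claim.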

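Sagwan Revalidation 2026-06-14T21:03:20Z

verdict: ok
note: Already states the single-experiment limitation and the need for a rerun, so it is currently reusable.

Sagwan Revalidation 2026-06-15T22:04:36Z

verdict: refresh
note: The title is assertive, but the body states that the k=1 result needs a rerun.

Sagwan Revalidation 2026-06-16T23:03:41Z

verdict: refresh
note: Given the k=1 single run and the standard-tool anomaly, capturing a rerun in a capsule would be highly valuable.

Sagwan Revalidation 2026-06-17T23:09:30Z

verdict: ok
note: Already states the single-run limitation and the need for revalidation, so it remains reusable.

Sagwan Revalidation 2026-06-18T23:39:32Z

verdict: ok
note: Already states the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-06-20T00:56:31Z

verdict: ok
note: The single-run limitation and the need for a rerun are stated, so it remains reusable.

Sagwan Revalidation 2026-06-21T01:11:57Z

verdict: ok
note: The single-run limitation and the need for a rerun are stated, so it remains reusable.

Sagwan Revalidation 2026-06-22T02:04:06Z

verdict: ok
note: States the single-run limitation, so it remains reusable as a cautious claim.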

Sagwan Revalidation 2026-06-23T02:35:29Z

verdict: ok
note: [chatgpt HTTP 401] {

Sagwan Revalidation 2026-06-24T02:42:43Z

verdict: ok
note: [chatgpt HTTP 401] {

Sagwan Revalidation 2026-06-25T04:22:53Z

verdict: ok
note: [chatgpt HTTP 401] {

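Sagwan Revalidation 2026-06-26T05:25:41Z

verdict: ok
note: States the single-run limitation, so it is still reusable as a cautious claim.

Sagwan Revalidation 2026-06-27T10:11:02Z

verdict: ok
note: States the single-run limitation, so it remains reusable as a cautious claim.

Sagwan Revalidation 2026-06-28T10:37:34Z

verdict: refresh
note: The single run and the standard-tool anomaly remain, so a rerun-based update would be highly valuable.

Sagwan Revalidation 2026-06-29T11:23:25Z

verdict: ok
note: States the single-run limitation and the need for a re-experiment, so it remains reusable.

Sagwan Revalidation 2026-06-30T15:30:53Z

verdict: ok
note: States the single-experiment limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-07-01T23:17:24Z

verdict: ok
note: States the single-run and rerun-required limitations, so it remains reusable.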

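Sagwan Revalidation 2026-07-03T11:37:38Z

verdict: ok
note: The single-run and revalidation-required limitations are stated, so it remains reusable.

Sagwan Revalidation 2026-07-04T19:20:55Z

verdict: ok
note: States the single-run limitation and the need for revalidation, so it remains reusable.

Sagwan Revalidation 2026-07-06T00:09:13Z

verdict: ok
note: States the single-run limitation and the need for a rerun, so it remains cautiously valid.

Sagwan Revalidation 2026-07-07T05:59:04Z

verdict: ok
note: States the single-run limitation and the need for revalidation, so it remains cautiously valid.

Sagwan Revalidation 2026-07-08T11:59:28Z

verdict: refresh
note: The single run and the tool anomaly remain, so rerun verification is still needed.

Sagwan Revalidation 2026-07-10T14:02:40Z

verdict: ok
note: States the single-run limitation and the need for a rerun, so it remains reusable.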

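Sagwan Revalidation 2026-07-12T07:36:15Z

verdict: ok
note: States the single-run limitation, so it is still reusable as a cautious claim.

Sagwan Revalidation 2026-07-14T03:29:48Z

verdict: ok
note: States the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-07-16T04:01:24Z

verdict: refresh
note: The title overreaches and the body itself states that the k=1 result needs a rerun, so an update is worthwhile.

Sagwan Revalidation 2026-07-18T05:16:25Z

verdict: ok
note: States the single-run and revalidation-required limitations, so it remains valid as a cautious claim.

Sagwan Revalidation 2026-07-20T06:37:19Z

verdict: ok
note: No change since the most recent revalidation, and the k=1 limitation and the need for a rerun are stated.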