///

Claim: Validated Capsules Outperform Parametric + Standard Tools for Insu-Internal SLM Tasks

---

///

kind: claim
status: needs_revalidation
visibility: private
license: internal
summary: Stage 6 OpenAkashicBench single-run result (2026-04-18, claude-haiku-4-5, 12 tasks) suggests openakashic(vault+core-api) outperformed parametric baseline and standard tool-use on this insu-internal task set, but the claim requires rerun because k=1, evaluator/rubric noise, and standard-tool anomaly remain.
tags:
  - claim
  - benchmark
  - slm
  - openakashic
  - rag
  - capsules
related:
  - personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md
  - personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md
  - personal_vault/projects/personal/openakashic/reference/bench-v0-7-rubric-calibration-and-memory-contract-rewrite-2026-04-25.md


Claim: Single-Run OpenAkashicBench Suggests Capsule Retrieval Benefit for Insu-Internal SLM Tasks

Revised Claim

Stage 6 OpenAkashicBench (2026-04-18, claude-haiku-4-5, 12 insu-internal tasks) produced a favorable single-run result for openakashic(vault+core-api) over both baseline(parametric) and standard(notes_list/read+web_search):

Condition                               Pass@1   Hit rate   Traps
baseline(parametric)                    8/12     0.79       5
standard(notes_list/read+web_search)    5/12     0.50       0
openakashic(vault+core-api)             10/12    0.86       1

This supports a cautious interpretation: for this internal benchmark slice, validated capsule retrieval plus fallback appeared more reliable than either parametric-only answering or naive tool access.
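To put the Pass@1 counts in perspective, the following is a minimal sketch that computes a 95% Wilson score interval for each condition. It assumes the 12 tasks can be treated as independent pass/fail trials, which the benchmark does not claim; the function name `wilson_interval` and the dictionary layout are illustrative, not part of OpenAkashicBench.

```python
from math import sqrt

# Pass@1 counts copied from the Stage 6 table (12 tasks, k=1).
STAGE6_PASSES = {
    "baseline(parametric)": 8,
    "standard(notes_list/read+web_search)": 5,
    "openakashic(vault+core-api)": 10,
}
N_TASKS = 12


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    Assumption: each task is an independent Bernoulli trial.
    """
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


for condition, passes in STAGE6_PASSES.items():
    low, high = wilson_interval(passes, N_TASKS)
    rate = passes / N_TASKS
    print(f"{condition}: {passes}/{N_TASKS} = {rate:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions the openakashic and baseline intervals overlap widely (roughly 0.55 to 0.95 versus 0.39 to 0.86), which is consistent with reading the Stage 6 result as suggestive rather than statistically stable.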

Evidence

  • Primary vault evidence: personal_vault/projects/personal/openakashic/bench-v05-stage6-2026-04-18.md.
  • Related distilled capsule: personal_vault/projects/ops/librarian/capsules/OpenAkashicBench Capsule Efficacy Capsule.md.
  • Public/core claim record: accepted public claim 8fec6b32-f4f3-4379-9def-74b2691378b9, sourced from this claim.

Caveats / Revalidation Required

  • This is k=1 single-run evidence on 12 tasks, not a statistically stable result.
  • The original claim's strong wording should not be read as a general SLM/RAG result.
  • The benchmark itself records a persistent standard < baseline anomaly: when standard tools return empty results for insu-internal information, the model tends to give up instead of falling back parametrically.
  • Some baseline answers had JSON/protocol format violations that were not fully reflected in the original summary.
  • domain_jlpt_gen regressed for openakashic in Stage 6, and multihop_synthesis still trapped once.
  • Later v0.7 work calibrated judge/rubric behavior and ran focused mini re-judges, but it has not yet replaced the Stage 6 k=1 result with a full rerun.
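The standard < baseline anomaly noted in the caveats suggests making the empty-tool fallback explicit in the harness. The sketch below shows one possible shape for that logic. Every name in it (`answer_with_fallback`, `answer_from_context`, `answer_parametric`, the tool callables) is hypothetical and does not describe the actual benchmark code.

```python
from typing import Callable, Sequence


def answer_with_fallback(
    question: str,
    tools: Sequence[Callable[[str], list[str]]],
    answer_from_context: Callable[[str, list[str]], str],
    answer_parametric: Callable[[str], str],
) -> tuple[str, str]:
    """Try each tool in order; fall back to parametric answering if all are empty.

    Returns (answer, source), where source is "tools" or "parametric_fallback".
    """
    for tool in tools:
        hits = tool(question)
        if hits:  # non-empty retrieval result: answer from retrieved context
            return answer_from_context(question, hits), "tools"
    # Every tool came back empty: answer from model knowledge instead of giving up.
    return answer_parametric(question), "parametric_fallback"


# Stub wiring, purely illustrative: a tool that finds nothing for an internal question.
answer, source = answer_with_fallback(
    "example internal question",
    tools=[lambda q: []],
    answer_from_context=lambda q, hits: f"answer grounded in {len(hits)} hits",
    answer_parametric=lambda q: "best-effort parametric answer",
)
print(source)  # parametric_fallback
```

Recording the `source` label alongside each answer would also let a rerun report how often the standard condition actually fell back, rather than inferring it from trap counts.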

Maintenance Note

2026-05-26 Sagwan review disputed the stronger efficacy formulation because k=1 and format-violation handling make the claim insufficiently validated. Keep the note as an internal benchmark claim, but treat it as needs_revalidation until at least k=3 reruns with the updated judge/rubric and explicit empty-tool fallback handling are available.
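The k=3 condition can be expressed as a small gate over per-run Pass@1 counts. This is a sketch only: the function name, the returned fields, and the `ready_for_review` label are assumptions, and the gate checks run count alone, not whether the updated judge/rubric or the empty-tool fallback were actually used.

```python
from typing import Optional


def revalidation_status(passes_per_run: list[int], n_tasks: int, min_runs: int = 3) -> dict:
    """Summarize reruns and report whether the run-count requirement is met."""
    k = len(passes_per_run)
    rates = [passes / n_tasks for passes in passes_per_run]
    mean_rate: Optional[float] = sum(rates) / k if k else None
    return {
        "k": k,
        "mean_pass_rate": mean_rate,
        "min_pass_rate": min(rates) if rates else None,
        "max_pass_rate": max(rates) if rates else None,
        # "ready_for_review" is a hypothetical label; the note's real status
        # vocabulary only shows needs_revalidation.
        "status": "ready_for_review" if k >= min_runs else "needs_revalidation",
    }


# Only the Stage 6 openakashic run (10/12) exists so far, so the gate stays closed.
stage6 = revalidation_status([10], n_tasks=12)
print(stage6["k"], stage6["status"])  # 1 needs_revalidation
```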

Sagwan Revalidation 2026-05-27T03:18:39Z

  • verdict: ok
  • note: States the single-run limitation and the need for a rerun, so it remains reusable as is.

Sagwan Revalidation 2026-05-28T03:46:09Z

  • verdict: ok
  • note: Already states the single-run limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-05-29T04:16:20Z

  • verdict: refresh
  • note: The evidence is a single run (k=1), so the definitive claim in the title needs to be updated.

Sagwan Revalidation 2026-05-30T04:18:26Z

  • verdict: ok
  • note: The single-run figures and the need-for-rerun limitation are stated, so it remains reusable.

Sagwan Revalidation 2026-05-31T04:25:38Z

  • verdict: ok
  • note: States the single-experiment limitation and the need for a rerun, so it remains reusable with caution.

Sagwan Revalidation 2026-06-01T08:30:58Z

  • verdict: refresh
  • note: The single-run evidence and the anomaly remain, so a draft that incorporates a rerun is needed.

Sagwan Revalidation 2026-06-02T09:32:45Z

  • verdict: ok
  • note: No new evidence since yesterday's revalidation, and the single-experiment limitation is stated.

Sagwan Revalidation 2026-06-03T10:20:32Z

  • verdict: ok
  • note: Already states the single-experiment limitation and the need for a rerun, so it remains reusable.

Sagwan Revalidation 2026-06-04T10:48:19Z

  • verdict: ok
  • note: States the single-run limitation and the need for revalidation, so it remains cautiously valid.

Sagwan Revalidation 2026-06-05T11:09:59Z

  • verdict: ok
  • note: States the single-experiment limitation, so it remains reusable as a conservative claim.
