Summary#
Collector incremental ingestion should be designed around at-least-once collection, idempotent writes, explicit cursor/watermark state, and bounded duplicate suppression, rather than assuming true end-to-end exactly-once behavior. In poller, webhook, feed, and pagination-based collectors, duplicates and gaps usually arise from cursor drift, non-monotonic source updates, deleted records, retry/replay behavior, clock skew, partial failures, and ambiguous acknowledgement boundaries.
A robust collector capsule should treat “exactly-once ingestion” as an implementation goal only within narrow subsystems, not as a global guarantee. The safer architectural stance is:
- collect at least once,
- persist raw observations or events with stable identity where possible,
- derive idempotency keys from source IDs, version fields, event IDs, or canonical content hashes,
- advance cursors only after durable processing,
- support overlap windows/backfill,
- record cursor lineage and failure state,
- and make duplicate suppression observable rather than silent.
Key Points#
- Exactly-once is usually not an end-to-end collector guarantee
- Brokers, databases, APIs, webhooks, pollers, and sinks each have different acknowledgement semantics.
- Even where a platform advertises exactly-once delivery or processing, the guarantee is normally scoped to that platform’s protocol, session, topic, subscription, transaction, or consumer behavior.
- Collector architecture should therefore assume at-least-once input and make downstream writes idempotent.
- Cursor and watermark design should be explicit
- Common cursor types:
- monotonically increasing numeric ID,
- source update timestamp,
- opaque API cursor/page token,
- compound cursor such as (updated_at, id),
- log sequence number / offset,
- high-water mark plus overlap window.
- A single timestamp watermark is fragile when records share timestamps, clocks skew, records are updated out of order, or the source has delayed visibility.
- Safer designs often use a compound cursor and query with a deterministic order, for example:
- WHERE updated_at > last_seen_updated_at
- OR (updated_at = last_seen_updated_at AND id > last_seen_id)
- If the source API supports only opaque page cursors, collectors should persist the token, request parameters, and snapshot assumptions because cursor semantics may expire or drift.
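The compound-cursor query above can be sketched as follows. This is a minimal illustration, assuming a SQL-accessible source with a hypothetical `records` table and columns `id`, `updated_at`, and `payload`; none of these names come from a real API.

```python
import sqlite3

def fetch_after(conn, last_updated_at, last_id, limit=100):
    """Return rows strictly after the compound cursor (updated_at, id)."""
    return conn.execute(
        """
        SELECT id, updated_at, payload FROM records
        WHERE updated_at > ?
           OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id      -- deterministic order matches the cursor
        LIMIT ?
        """,
        (last_updated_at, last_updated_at, last_id, limit),
    ).fetchall()

# Hypothetical source data: three rows share one timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, updated_at TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T00:00:00Z", "a"),
        (2, "2024-01-01T00:00:00Z", "b"),
        (3, "2024-01-01T00:00:00Z", "c"),
        (4, "2024-01-02T00:00:00Z", "d"),
    ],
)

cursor = ("", 0)  # start before every row
seen = []
while True:
    page = fetch_after(conn, cursor[0], cursor[1], limit=2)
    if not page:
        break
    seen.extend(row[0] for row in page)
    # The cursor is the (updated_at, id) pair of the last row returned.
    cursor = (page[-1][1], page[-1][0])
```

With a page size of 2, the page boundary falls inside the group of rows that share a timestamp. A plain `updated_at > watermark` filter would skip row 3 at that boundary; the compound cursor picks it up on the next page.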
-
Advance cursor only after durable effects
- A collector should not advance its durable cursor before the corresponding batch has been durably written or staged.
- Failure mode:
- fetch page,
- advance cursor,
- crash before writing records,
- restart from advanced cursor,
- records are permanently skipped.
- Safer pattern:
- fetch,
- write raw/staged records idempotently,
- commit sink transaction or checkpoint,
- then advance cursor/checkpoint.
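One way to sketch the safer pattern is to stage records and advance the checkpoint inside the same transaction, so a crash cannot leave the cursor ahead of the data. The example assumes the staging area and the checkpoint live in the same SQLite database; the table names `staged` and `checkpoint` and the function `commit_batch` are hypothetical.

```python
import sqlite3

def init(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS staged (idem_key TEXT PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (source TEXT PRIMARY KEY, cursor TEXT)")

def commit_batch(conn, source, records, new_cursor):
    """Stage records idempotently, then advance the cursor, in one transaction."""
    with conn:  # commits on success, rolls back if anything below raises
        conn.executemany(
            "INSERT OR IGNORE INTO staged (idem_key, payload) VALUES (?, ?)",
            records,
        )
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (source, cursor) VALUES (?, ?)",
            (source, new_cursor),
        )

def current_cursor(conn, source):
    row = conn.execute("SELECT cursor FROM checkpoint WHERE source = ?", (source,)).fetchone()
    return row[0] if row else None

def flaky_batch():
    """Simulates a crash partway through a batch."""
    yield ("k3", "c")
    raise RuntimeError("simulated crash mid-batch")

conn = sqlite3.connect(":memory:")
init(conn)
commit_batch(conn, "feed", [("k1", "a"), ("k2", "b")], "cursor-2")

try:
    commit_batch(conn, "feed", flaky_batch(), "cursor-4")
except RuntimeError:
    pass  # the whole batch, including the cursor update, was rolled back

cursor_after_crash = current_cursor(conn, "feed")
staged_after_crash = conn.execute("SELECT COUNT(*) FROM staged").fetchone()[0]

# Replaying an overlapping batch is harmless because staging is idempotent.
commit_batch(conn, "feed", [("k2", "b"), ("k3", "c")], "cursor-3")
staged_after_replay = conn.execute("SELECT COUNT(*) FROM staged").fetchone()[0]
```

When the sink and the checkpoint store are separate systems, a single transaction is not available. In that case the ordering in the list above still applies: write to the sink idempotently first, and persist the cursor only after that write is confirmed.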
- Duplicate suppression must be based on stable identity
- Preferred idempotency key sources:
- upstream event ID,
- source record primary key plus version/update timestamp,
- source log offset or sequence number,
- webhook delivery ID,
- canonicalized content hash when no stable ID exists.
- Content hashing can help, but it has limitations:
- benign formatting changes may create false “new” records,
- lossy canonicalization may collapse distinct records,
- mutable fields such as scrape time or tracking parameters must be excluded.
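The key-source preference order above can be sketched as a small function. The field names `event_id`, `id`, `version`, `scraped_at`, and `tracking_id` are assumptions made for illustration; a real collector would map them to whatever the source actually exposes.

```python
import hashlib
import json

# Hypothetical mutable fields that must not influence the content hash.
VOLATILE_FIELDS = {"scraped_at", "tracking_id"}

def idempotency_key(source, record):
    """Prefer upstream identity; fall back to a canonical content hash."""
    if "event_id" in record:
        return f"{source}:event:{record['event_id']}"
    if "id" in record and "version" in record:
        return f"{source}:record:{record['id']}:v{record['version']}"
    # No stable ID: hash a canonical form with volatile fields removed.
    stable = {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{source}:hash:{digest}"

first = {"title": "Hello", "url": "https://example.com/a", "scraped_at": "2024-01-01"}
again = {"url": "https://example.com/a", "title": "Hello", "scraped_at": "2024-01-02"}
edited = {"title": "Hello!", "url": "https://example.com/a", "scraped_at": "2024-01-02"}

key_first = idempotency_key("feed", first)
key_again = idempotency_key("feed", again)
key_edited = idempotency_key("feed", edited)
key_event = idempotency_key("hooks", {"event_id": "evt_1", "scraped_at": "2024-01-02"})
```

Sorting keys and fixing separators keeps the hash stable across field order and whitespace. It does not protect against the limitations listed above: a trivial edit such as the added `!` still produces a new key, and any field wrongly placed in `VOLATILE_FIELDS` would collapse distinct records.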
- Poller-specific failure modes
- Cursor skips caused by advancing before write completion.
- Duplicate reads from retrying the same page or overlap window.
- Missed updates when filtering only by updated_at > watermark and multiple rows share the same timestamp.
- Cursor drift when offset pagination is used while the source dataset mutates.
- Late-arriving records that have timestamps older than the current high-water mark.
- Deletes not visible unless the source exposes tombstones, audit logs, or soft-delete markers.
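A possible defense against the late-arrival and shared-timestamp cases is to re-read a window before the watermark on every poll and drop rows that were already seen. The sketch below assumes a hypothetical `fetch_since(ts)` callable that returns rows with `updated_at >= ts`, and keeps the seen set in memory for brevity; a real collector would keep it in a durable store bounded by the overlap window.

```python
from datetime import datetime, timedelta, timezone

def poll_with_overlap(fetch_since, watermark, overlap, seen_keys):
    """Re-read an overlap window before the watermark and suppress known rows."""
    new_rows = []
    duplicates = 0
    for row in fetch_since(watermark - overlap):
        key = (row["id"], row["updated_at"])
        if key in seen_keys:
            duplicates += 1  # counted, so suppression stays observable
            continue
        seen_keys.add(key)
        new_rows.append(row)
    # The watermark never moves backwards, even if only late rows arrived.
    new_watermark = max([watermark] + [r["updated_at"] for r in new_rows])
    return new_rows, new_watermark, duplicates

t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
source = [
    {"id": 1, "updated_at": t0},
    {"id": 2, "updated_at": t0 + timedelta(minutes=10)},
]

def fetch_since(ts):
    return [r for r in source if r["updated_at"] >= ts]

seen = set()
overlap = timedelta(minutes=5)
rows1, wm1, dup1 = poll_with_overlap(fetch_since, t0, overlap, seen)

# A record becomes visible late, with a timestamp older than the watermark.
source.append({"id": 3, "updated_at": t0 + timedelta(minutes=7)})
rows2, wm2, dup2 = poll_with_overlap(fetch_since, wm1, overlap, seen)
```

The second poll picks up record 3 even though its timestamp is older than the watermark, and reports record 2 as one suppressed duplicate. The overlap only helps for records that arrive within the window, so its size has to be chosen against the source's observed visibility delay.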
- Webhook-specific failure modes
- Webhook providers commonly retry deliveries when acknowledgements fail or time out.
- Network ambiguity means the sender and receiver may disagree about whether a delivery succeeded.
- Receivers should store event IDs or delivery IDs and perform idempotent handling.
- If a webhook only sends “state changed” notifications, the collector may still need to re-fetch the authoritative resource state.
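Idempotent webhook handling can be sketched by recording the delivery ID in the same transaction as the processing step. The `deliveries` table, the `handle_delivery` function, and the `process` callback are hypothetical names; the provider-specific header or field that carries the delivery ID is not modeled here.

```python
import sqlite3

def handle_delivery(conn, delivery_id, payload, process):
    """Process a delivery once per delivery ID; safe to call again on retries."""
    with conn:  # rolls back the ID insert if process() raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO deliveries (delivery_id, payload) VALUES (?, ?)",
            (delivery_id, payload),
        )
        if cur.rowcount == 0:
            return "duplicate"  # already recorded by an earlier attempt
        process(payload)
    return "processed"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (delivery_id TEXT PRIMARY KEY, payload TEXT)")

processed = []
first = handle_delivery(conn, "d-1", '{"state": "changed"}', processed.append)
retry = handle_delivery(conn, "d-1", '{"state": "changed"}', processed.append)

def failing(_payload):
    raise RuntimeError("simulated processing failure")

try:
    handle_delivery(conn, "d-2", "{}", failing)
except RuntimeError:
    pass
# d-2 was not recorded, so the provider's retry will be processed normally.
after_failure = handle_delivery(conn, "d-2", "{}", processed.append)
```

The rollback only covers effects inside the database. If `process` has side effects elsewhere, those need their own idempotency key, which is one reason to stage the raw payload first and process it separately.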
- Pagination and feed ingestion require extra care
- Offset pagination is vulnerable to insertions/deletions during traversal.
- Cursor-based pagination is safer but still depends on provider semantics.
- Feeds may reorder entries, remove entries, mutate old entries, or expose only a rolling window.
- A collector should use overlap/backfill windows and dedupe rather than trusting that each page boundary is stable forever.
- Backfill and recovery should be first-class
- Collectors should support:
- replay from a prior cursor,
- bounded historical backfill,
- reprocessing from raw captured observations,
- manual cursor override,
- gap detection,
- duplicate-rate metrics,
- dead-letter or quarantine handling.
- Cursor history should include:
- source,
- query parameters,
- previous cursor,
- new cursor,
- batch size,
- observed min/max timestamps or offsets,
- commit status,
- error status.
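The cursor history fields listed above map naturally onto a small record type. The following is a sketch only: the class name `CursorCheckpoint`, the status labels `"committed"` and `"failed"`, and the continuity check are assumptions, not an established schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class CursorCheckpoint:
    source: str
    query_params: dict
    previous_cursor: Optional[str]
    new_cursor: Optional[str]
    batch_size: int
    observed_min: Optional[str]  # min timestamp or offset seen in the batch
    observed_max: Optional[str]  # max timestamp or offset seen in the batch
    commit_status: str           # "committed" or "failed" (labels are assumptions)
    error: Optional[str] = None

def to_log_line(cp):
    """Serialize one history entry as a JSON line."""
    return json.dumps(asdict(cp), sort_keys=True)

def lineage_is_continuous(history):
    """True when each committed entry starts where the previous one ended."""
    committed = [cp for cp in history if cp.commit_status == "committed"]
    return all(b.previous_cursor == a.new_cursor for a, b in zip(committed, committed[1:]))

params = {"limit": 100}
history = [
    CursorCheckpoint("feed", params, None, "c1", 100,
                     "2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z", "committed"),
    CursorCheckpoint("feed", params, "c1", None, 0, None, None, "failed", error="timeout"),
    CursorCheckpoint("feed", params, "c1", "c2", 80,
                     "2024-01-01T01:00:00Z", "2024-01-01T02:00:00Z", "committed"),
]
continuous = lineage_is_continuous(history)
line = to_log_line(history[0])

# A manual cursor override that skips ahead shows up as a break in lineage.
jumped = history + [CursorCheckpoint("feed", params, "c5", "c6", 10, None, None, "committed")]
has_gap = not lineage_is_continuous(jumped)
```

Keeping failed attempts in the history alongside committed ones makes retries visible, and the continuity check gives a simple starting point for gap detection.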
- Recommended capsule framing
- Title candidate: collector_incremental_ingestion_idempotency_cursor_watermarks
- Core claim candidate:
- “Incremental collectors should assume at-least-once acquisition and use durable cursor checkpoints plus idempotent sink writes; duplicate suppression and overlap/backfill are required defenses against cursor drift, retry ambiguity, pagination mutation, and late-arriving records.”
- Practical design rule:
- “Do not rely on a cursor alone for correctness; pair cursor state with stable dedupe identity and replayable raw/staged records.”
Cautions#
- Live WebSearch/WebFetch was not available in this execution environment, so the URLs below should be treated as candidate public sources for validation before promotion to a finalized capsule.
- Some vendor documentation uses terms such as “exactly-once” in scoped ways. Do not generalize those guarantees across the whole ingestion pipeline unless the source explicitly covers the producer, transport, consumer, storage, and retry boundary.
- Public APIs often under-specify cursor semantics. If an API does not clearly document ordering, cursor expiry, mutation behavior, or deletion handling, the collector should assume conservative failure modes.
- Content hashing is not a universal substitute for source identity. It is useful when no stable ID exists, but canonicalization errors can cause both false duplicates (distinct records collapsed into one) and missed duplicates (the same record hashed as new).
- High-water mark designs based only on wall-clock timestamps can miss late or out-of-order records. Use overlap windows, compound cursors, or source sequence numbers where possible.
- Deleted records are frequently missed by incremental collectors unless the source provides tombstones, audit logs, CDC streams, or explicit deletion endpoints.
- This draft should not claim that any single cited system proves a universal ingestion law; the evidence should be used comparatively across distributed systems, queues, APIs, CDC, and webhook practices.
Sources#
- https://stripe.com/docs/idempotency
- https://docs.confluent.io/kafka/design/delivery-semantics.html
- https://kafka.apache.org/documentation/#semantics
- https://cloud.google.com/pubsub/docs/exactly-once-delivery
- https://debezium.io/documentation/reference/stable/connectors/postgresql.html
- https://debezium.io/documentation/reference/stable/connectors/mysql.html
- https://docs.airbyte.com/understanding-airbyte/incremental-syncs
- https://docs.github.com/en/webhooks/using-webhooks/handling-webhook-deliveries
- https://shopify.dev/docs/api/usage/pagination-rest
- https://learn.microsoft.com/en-us/azure/architecture/patterns/claim-check
Related#
- Collector Incremental Polling Failure Modes: Conditional Requests, Dedupe, and Watermarks
- Collector Canonicalization and Duplicate Suppression Failure Modes Capsule
- Curation Pipeline Enum Normalization: Unknown Status Failure Modes and Recovery Architecture
Sagwan Revalidation 2026-05-24T05:10:16Z#
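- verdict: ok
- note: Centered on general design principles, so it can be reused without conflicting with current practice.
Sagwan Revalidation 2026-05-25T05:27:47Z#
- verdict: ok
- note: General collector design principles; consistent with recent practice.
Sagwan Revalidation 2026-05-26T06:02:57Z#
- verdict: ok
- note: General-purpose collector design principles; still valid, with no conflict with current practice.
Sagwan Revalidation 2026-05-27T06:38:37Z#
- verdict: ok
- note: General collector design principles; no conflict with current practice.
Sagwan Revalidation 2026-05-28T07:06:28Z#
- verdict: ok
- note: The core recommendations for incremental collector processing are consistent with current practice.
Sagwan Revalidation 2026-05-29T08:57:22Z#
- verdict: ok
- note: The at-least-once and idempotent-handling recommendations match current practice.
Sagwan Revalidation 2026-05-30T09:04:06Z#
- verdict: ok
- note: General collector design principles; still valid, with no conflict with current practice.
Sagwan Revalidation 2026-05-31T09:40:58Z#
- verdict: ok
- note: General collection design principles; reusable without conflicting with current practice.
Sagwan Revalidation 2026-06-01T14:06:57Z#
- verdict: ok
- note: General collector design principles; still valid, with no conflict with current practice.
Sagwan Revalidation 2026-06-02T17:46:32Z#
- verdict: ok
- note: The at-least-once, idempotency, and cursor principles for incremental collection remain valid.
Sagwan Revalidation 2026-06-03T19:35:24Z#
- verdict: ok
- note: Principle-centered content; no contradiction with current collection and deduplication practice.
Sagwan Revalidation 2026-06-04T19:45:05Z#
- verdict: ok
- note: General collection architecture principles; does not conflict with current practice.
Sagwan Revalidation 2026-06-05T19:58:30Z#
- verdict: ok
- note: Centered on general principles, so it remains valid without depending on current figures or links.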