Collector Incremental Ingestion: Cursor Watermarks, Idempotency Keys, and Duplicate Suppression Failure Modes

Summary#

Collector incremental ingestion should be designed around at-least-once collection, idempotent writes, explicit cursor/watermark state, and bounded duplicate suppression, rather than assuming true end-to-end exactly-once behavior. In poller, webhook, feed, and pagination-based collectors, duplicates and gaps usually arise from cursor drift, non-monotonic source updates, deleted records, retry/replay behavior, clock skew, partial failures, and ambiguous acknowledgement boundaries.

A robust collector capsule should treat “exactly-once ingestion” as an implementation goal only within narrow subsystems, not as a global guarantee. The safer architectural stance is:

- collect at least once,
- persist raw observations or events with stable identity where possible,
- derive idempotency keys from source IDs, version fields, event IDs, or canonical content hashes,
- advance cursors only after durable processing,
- support overlap windows/backfill,
- record cursor lineage and failure state,
- and make duplicate suppression observable rather than silent.

Key Points#

Exactly-once is usually not an end-to-end collector guarantee
Brokers, databases, APIs, webhooks, pollers, and sinks each have different acknowledgement semantics.
Even where a platform advertises exactly-once delivery or processing, the guarantee is normally scoped to that platform’s protocol, session, topic, subscription, transaction, or consumer behavior.
Collector architecture should therefore assume at-least-once input and make downstream writes idempotent.
Cursor and watermark design should be explicit
Common cursor types:
- monotonically increasing numeric ID,
- source update timestamp,
- opaque API cursor/page token,
- compound cursor such as (updated_at, id),
- log sequence number / offset,
- high-water mark plus overlap window.
A single timestamp watermark is fragile when records share timestamps, clocks skew, records are updated out of order, or the source has delayed visibility.
Safer designs often use a compound cursor and query with a deterministic order, so that rows sharing the last seen timestamp are neither skipped nor re-read, for example:
- WHERE updated_at > last_seen_updated_at OR (updated_at = last_seen_updated_at AND id > last_seen_id)
- ORDER BY updated_at, id
If the source API supports only opaque page cursors, collectors should persist the token, request parameters, and snapshot assumptions because cursor semantics may expire or drift.
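The compound-cursor query described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the `records` table, its columns, and the `fetch_after` helper are assumptions made for the example, not part of any particular source API.

```python
import sqlite3

def fetch_after(conn, last_updated_at, last_id, limit=100):
    """Return rows strictly after the (updated_at, id) cursor, in cursor order."""
    return conn.execute(
        """
        SELECT id, updated_at FROM records
        WHERE updated_at > ?
           OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        LIMIT ?
        """,
        (last_updated_at, last_updated_at, last_id, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        (1, "2024-01-01T00:00:00Z"),
        (2, "2024-01-01T00:00:00Z"),
        (3, "2024-01-01T00:00:00Z"),  # shares its timestamp with rows 1 and 2
        (4, "2024-01-02T00:00:00Z"),
    ],
)

# First page starts from an empty cursor; the page boundary falls inside a timestamp tie.
page1 = fetch_after(conn, "", 0, limit=2)
cursor = (page1[-1][1], page1[-1][0])  # (updated_at, id) of the last row seen
page2 = fetch_after(conn, cursor[0], cursor[1], limit=2)

# A timestamp-only watermark resumed from the same point would skip row 3.
naive = conn.execute(
    "SELECT id FROM records WHERE updated_at > ? ORDER BY updated_at, id",
    (cursor[0],),
).fetchall()
```

The tie on `updated_at` is what makes the second half of the predicate necessary: the compound cursor resumes at row 3, while the timestamp-only query resumes at row 4.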
Advance cursor only after durable effects
A collector should not advance its durable cursor before the corresponding batch has been durably written or staged.
Failure mode:
- fetch page,
- advance cursor,
- crash before writing records,
- restart from advanced cursor,
- records are permanently skipped.
Safer pattern:
- fetch,
- write raw/staged records idempotently,
- commit sink transaction or checkpoint,
- then advance cursor/checkpoint.
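The safer pattern above can be sketched as follows, assuming the staging table and the cursor store live in the same SQLite database so that both can be committed in one transaction. The table names, the `idem_key` column, and the helper functions are hypothetical. When the sink and the cursor store are separate systems, the same ordering applies (write first, then advance), and idempotent writes absorb the replay after a crash between the two steps.

```python
import sqlite3

def ingest_batch(conn, source, records, new_cursor):
    """Stage records idempotently, then advance the cursor, in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT OR IGNORE INTO staged (idem_key, payload) VALUES (?, ?)",
            [(r["idem_key"], r["payload"]) for r in records],
        )
        conn.execute(
            "INSERT OR REPLACE INTO cursors (source, cursor_value) VALUES (?, ?)",
            (source, new_cursor),
        )

def load_cursor(conn, source):
    row = conn.execute(
        "SELECT cursor_value FROM cursors WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staged (idem_key TEXT PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE cursors (source TEXT PRIMARY KEY, cursor_value TEXT)")

batch = [
    {"idem_key": "a:1", "payload": "{}"},
    {"idem_key": "b:1", "payload": "{}"},
]
ingest_batch(conn, "demo", batch, "page-2")
ingest_batch(conn, "demo", batch, "page-2")  # replay after an ambiguous failure

staged_count = conn.execute("SELECT COUNT(*) FROM staged").fetchone()[0]
```

Replaying the same batch leaves the staged row count unchanged, which is the property that makes at-least-once fetching safe.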
Duplicate suppression must be based on stable identity
Preferred idempotency key sources:
- upstream event ID,
- source record primary key plus version/update timestamp,
- source log offset or sequence number,
- webhook delivery ID,
- canonicalized content hash when no stable ID exists.
Content hashing can help, but it has limitations:
- benign formatting changes may create false “new” records,
- lossy canonicalization may collapse distinct records,
- mutable fields such as scrape time or tracking parameters must be excluded.
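One way to derive an idempotency key in the preference order listed above is sketched below. The field names (`event_id`, `id`, `version`, `scraped_at`, `tracking_id`) are hypothetical; a real collector would substitute whatever the source actually exposes, and the set of excluded volatile fields has to be chosen per source.

```python
import hashlib
import json

# Hypothetical volatile fields that must not influence the content hash.
VOLATILE_FIELDS = {"scraped_at", "tracking_id"}

def canonical_hash(record):
    """Hash a record after dropping volatile top-level fields and fixing key order."""
    stable = {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(
        stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def idempotency_key(source, record):
    """Prefer upstream identity; fall back to a content hash only when none exists."""
    if record.get("event_id"):
        return f"{source}:event:{record['event_id']}"
    if record.get("id") is not None and record.get("version") is not None:
        return f"{source}:record:{record['id']}:{record['version']}"
    return f"{source}:hash:{canonical_hash(record)}"

a = {"title": "Hello", "body": "x", "scraped_at": "2024-01-01T00:00:00Z"}
b = {"body": "x", "title": "Hello", "scraped_at": "2024-01-02T09:30:00Z"}
c = {"title": "Hello", "body": "y"}
```

Records `a` and `b` differ only in key order and scrape time, so they hash to the same key, while `c` differs in content and does not.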
Poller-specific failure modes
Cursor skips caused by advancing before write completion.
Duplicate reads from retrying the same page or overlap window.
Missed updates when filtering only by updated_at > watermark and multiple rows share the same timestamp.
Cursor drift when offset pagination is used while the source dataset mutates.
Late-arriving records that have timestamps older than the current high-water mark.
Deletes not visible unless the source exposes tombstones, audit logs, or soft-delete markers.
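The late-arrival and duplicate-read cases above can be illustrated with a small in-memory sketch of an overlap window. It assumes rows carry an `id` and a timezone-aware `updated_at`, and that a set of already-seen `(id, updated_at)` keys is available; the function name and the row shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def poll_with_overlap(source_rows, high_water_mark, overlap, seen_keys):
    """Re-read from (high_water_mark - overlap) and suppress already-seen rows.

    Returns (new_rows, new_high_water_mark, duplicate_count).
    """
    window_start = high_water_mark - overlap
    new_rows, duplicates = [], 0
    for row in sorted(source_rows, key=lambda r: (r["updated_at"], r["id"])):
        if row["updated_at"] < window_start:
            continue
        key = (row["id"], row["updated_at"])  # identity: id plus version timestamp
        if key in seen_keys:
            duplicates += 1  # counted so suppression stays observable
            continue
        seen_keys.add(key)
        new_rows.append(row)
    new_mark = max([high_water_mark] + [r["updated_at"] for r in new_rows])
    return new_rows, new_mark, duplicates

def at(minute_offset):
    base = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    return base + timedelta(minutes=minute_offset)

source_rows = [
    {"id": 1, "updated_at": at(-5)},  # already ingested by the previous poll
    {"id": 2, "updated_at": at(-2)},  # late arrival: older than the mark, visible only now
    {"id": 3, "updated_at": at(5)},   # ordinary new row
]
seen = {(1, at(-5))}
new_rows, new_mark, duplicates = poll_with_overlap(
    source_rows, at(0), timedelta(minutes=10), seen
)
```

With a ten-minute overlap the late row 2 is picked up and the re-read of row 1 is counted as a duplicate; with a zero overlap, row 2 would be missed. The overlap size is a trade-off between re-read cost and how late the source can make a record visible.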
Webhook-specific failure modes
Webhook providers commonly retry deliveries when acknowledgements fail or time out.
Network ambiguity means the sender and receiver may disagree about whether a delivery succeeded.
Receivers should store event IDs or delivery IDs and perform idempotent handling.
If a webhook only sends “state changed” notifications, the collector may still need to re-fetch the authoritative resource state.
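A receiver that stores delivery IDs and handles retries idempotently might look like the sketch below. It assumes the provider supplies a stable delivery ID with each request; the `deliveries` table and the `handle_delivery` function are hypothetical. Because the ID is recorded in the same transaction as the processing step, a failure in processing rolls the record back and lets the provider's retry go through.

```python
import sqlite3

def handle_delivery(conn, delivery_id, payload, process):
    """Process a delivery at most once per delivery_id.

    Returns True if processed now, False if it was a duplicate delivery.
    """
    with conn:  # rolls back the delivery record if process() raises
        inserted = conn.execute(
            "INSERT OR IGNORE INTO deliveries (delivery_id, payload) VALUES (?, ?)",
            (delivery_id, payload),
        ).rowcount
        if inserted == 0:
            return False  # already recorded: acknowledge without reprocessing
        process(payload)
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (delivery_id TEXT PRIMARY KEY, payload TEXT)")

processed = []
first = handle_delivery(conn, "d-1", '{"state": "changed"}', processed.append)
retry = handle_delivery(conn, "d-1", '{"state": "changed"}', processed.append)

def failing(_payload):
    raise RuntimeError("downstream unavailable")

try:
    handle_delivery(conn, "d-2", "{}", failing)
except RuntimeError:
    pass  # the receiver would return a non-2xx status here so the sender retries
after_failure = handle_delivery(conn, "d-2", "{}", processed.append)
```

If `process` has side effects outside this database, it still needs to be idempotent on its own, since a crash after the side effect but before the commit leads to a second attempt.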
Pagination and feed ingestion require extra care
Offset pagination is vulnerable to insertions/deletions during traversal.
Cursor-based pagination is safer but still depends on provider semantics.
Feeds may reorder entries, remove entries, mutate old entries, or expose only a rolling window.
A collector should use overlap/backfill windows and dedupe rather than trusting that each page boundary is stable forever.
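The offset-pagination hazard can be reproduced with a toy in-memory dataset. The two page functions below are illustrative stand-ins for a source API, not a real client, and the keyset variant assumes records are ordered by an ID that never changes.

```python
def offset_page(rows, offset, limit):
    """Offset pagination: position-based, so it shifts when rows are removed."""
    return sorted(rows)[offset:offset + limit]

def keyset_page(rows, after_id, limit):
    """Keyset pagination: resumes after the last ID actually seen."""
    return [r for r in sorted(rows) if r > after_id][:limit]

# Offset traversal with a deletion between page requests.
rows = [1, 2, 3, 4, 5, 6]
first = offset_page(rows, 0, 3)
rows.remove(2)                      # the source deletes a record mid-traversal
second = offset_page(rows, 3, 3)    # positions have shifted left by one
offset_seen = first + second        # record 4 is never returned

# The same mutation under keyset traversal.
rows = [1, 2, 3, 4, 5, 6]
k_first = keyset_page(rows, 0, 3)
rows.remove(2)
k_second = keyset_page(rows, k_first[-1], 3)
keyset_seen = k_first + k_second
```

An insertion ahead of the current offset has the opposite effect and produces a duplicate read, which is why the overlap-and-dedupe defence is recommended even when keyset or cursor pagination is available.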
Backfill and recovery should be first-class
Collectors should support:
- replay from a prior cursor,
- bounded historical backfill,
- reprocessing from raw captured observations,
- manual cursor override,
- gap detection,
- duplicate-rate metrics,
- dead-letter or quarantine handling.
Cursor history should include:
- source,
- query parameters,
- previous cursor,
- new cursor,
- batch size,
- observed min/max timestamps or offsets,
- commit status,
- error status.
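The cursor history fields listed above could be captured in a record like the one sketched here, together with a simple lineage check for gap detection. The class name, the status strings "committed" and "failed", and the `lineage_gaps` helper are assumptions for illustration only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class CursorCheckpoint:
    source: str
    query_params: dict
    previous_cursor: Optional[str]
    new_cursor: Optional[str]
    batch_size: int
    observed_min: Optional[str]   # min timestamp or offset seen in the batch
    observed_max: Optional[str]   # max timestamp or offset seen in the batch
    commit_status: str            # hypothetical values: "committed" or "failed"
    error_status: Optional[str] = None

def to_log_line(checkpoint):
    """Serialize a checkpoint as one JSON line for an append-only history log."""
    return json.dumps(asdict(checkpoint), sort_keys=True)

def lineage_gaps(history):
    """Return (expected, actual) pairs where a committed run did not start
    from the cursor that the previous committed run ended on."""
    committed = [c for c in history if c.commit_status == "committed"]
    return [
        (a.new_cursor, b.previous_cursor)
        for a, b in zip(committed, committed[1:])
        if a.new_cursor != b.previous_cursor
    ]

history = [
    CursorCheckpoint("demo", {"limit": 100}, None, "c1", 100, "t0", "t1", "committed"),
    CursorCheckpoint("demo", {"limit": 100}, "c1", "c2", 0, None, None, "failed", "timeout"),
    CursorCheckpoint("demo", {"limit": 100}, "c2", "c3", 80, "t2", "t3", "committed"),
]
gaps = lineage_gaps(history)
```

In this history the third run started from "c2" even though the run that would have produced "c2" failed, so the check reports one break in the lineage that a backfill from "c1" could repair.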
Recommended capsule framing
Title candidate: collector_incremental_ingestion_idempotency_cursor_watermarks
Core claim candidate:
- “Incremental collectors should assume at-least-once acquisition and use durable cursor checkpoints plus idempotent sink writes; duplicate suppression and overlap/backfill are required defenses against cursor drift, retry ambiguity, pagination mutation, and late-arriving records.”
Practical design rule:
- “Do not rely on a cursor alone for correctness; pair cursor state with stable dedupe identity and replayable raw/staged records.”

Cautions#

Live WebSearch/WebFetch was not available in this execution environment, so the URLs below should be treated as candidate public sources for validation before promotion to a finalized capsule.
Some vendor documentation uses terms such as “exactly-once” in scoped ways. Do not generalize those guarantees across the whole ingestion pipeline unless the source explicitly covers the producer, transport, consumer, storage, and retry boundary.
Public APIs often under-specify cursor semantics. If an API does not clearly document ordering, cursor expiry, mutation behavior, or deletion handling, the collector should assume conservative failure modes.
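Content hashing is not a universal substitute for source identity. It is useful when no stable ID exists, but canonicalization errors can cut both ways: lossy canonicalization produces false duplicates by collapsing distinct records, and overly strict canonicalization produces false "new" records from benign formatting changes.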
High-water mark designs based only on wall-clock timestamps can miss late or out-of-order records. Use overlap windows, compound cursors, or source sequence numbers where possible.
Deleted records are frequently missed by incremental collectors unless the source provides tombstones, audit logs, CDC streams, or explicit deletion endpoints.
This draft should not claim that any single cited system proves a universal ingestion law; the evidence should be used comparatively across distributed systems, queues, APIs, CDC, and webhook practices.

Sources#

https://stripe.com/docs/idempotency
https://docs.confluent.io/kafka/design/delivery-semantics.html
https://kafka.apache.org/documentation/#semantics
https://cloud.google.com/pubsub/docs/exactly-once-delivery
https://debezium.io/documentation/reference/stable/connectors/postgresql.html
https://debezium.io/documentation/reference/stable/connectors/mysql.html
https://docs.airbyte.com/understanding-airbyte/incremental-syncs
https://docs.github.com/en/webhooks/using-webhooks/handling-webhook-deliveries
https://shopify.dev/docs/api/usage/pagination-rest
https://learn.microsoft.com/en-us/azure/architecture/patterns/claim-check

Sagwan Revalidation 2026-05-24T05:10:16Z#

verdict: ok
note: Focused on general design principles, so it can be reused without conflicting with current practice.

Sagwan Revalidation 2026-05-25T05:27:47Z#

verdict: ok
note: General collector design principles; also consistent with recent practice.

Sagwan Revalidation 2026-05-26T06:02:57Z#

verdict: ok
note: General-purpose collector design principles; still valid, with no conflict with current practice.

Sagwan Revalidation 2026-05-27T06:38:37Z#

verdict: ok
note: General collector design principles; no conflict with current practice.

Sagwan Revalidation 2026-05-28T07:06:28Z#

verdict: ok
note: The core recommendations for incremental collector processing are consistent with current practice.

Sagwan Revalidation 2026-05-29T08:57:22Z#

verdict: ok
note: The at-least-once and idempotent-handling recommendations also match current practice.

Sagwan Revalidation 2026-05-30T09:04:06Z#

verdict: ok
note: General collector design principles; still valid, with no conflict with current practice.

Sagwan Revalidation 2026-05-31T09:40:58Z#

verdict: ok
note: General ingestion design principles; reusable without conflict with current practice.

Sagwan Revalidation 2026-06-01T14:06:57Z#

verdict: ok
note: General collector design principles; valid, with no conflict with current practice.

Sagwan Revalidation 2026-06-02T17:46:32Z#

verdict: ok
note: The at-least-once, idempotency, and cursor principles for incremental ingestion remain valid.

Sagwan Revalidation 2026-06-03T19:35:24Z#

verdict: ok
note: Principle-focused content; no contradiction with current ingestion/deduplication practice.

Sagwan Revalidation 2026-06-04T19:45:05Z#

verdict: ok
note: General ingestion architecture principles; does not conflict with current practice.

Sagwan Revalidation 2026-06-05T19:58:30Z#

verdict: ok
note: Focused on general principles, so it remains valid without depending on current figures or links.

Sagwan Revalidation 2026-06-06T20:16:56Z#

verdict: ok
note: Focused on general principles, so it remains consistent with current ingestion/duplicate-suppression practice.

Sagwan Revalidation 2026-06-07T20:29:52Z#

verdict: ok
note: The at-least-once, idempotency, and watermark principles for incremental ingestion remain valid.

Sagwan Revalidation 2026-06-08T20:31:36Z#

verdict: ok
note: Focused on general principles, so there is no conflict with current ingestion/deduplication practice.

Sagwan Revalidation 2026-06-09T20:50:33Z#

verdict: ok
note: General ingestion architecture principles; reusable without conflict with current practice.

Sagwan Revalidation 2026-06-11T01:33:26Z#

verdict: ok
note: General collector design principles; still valid, with no conflict with current practice.

Sagwan Revalidation 2026-06-12T03:09:03Z#

verdict: ok
note: [chatgpt HTTP 401] {

Sagwan Revalidation 2026-06-13T04:02:47Z#

verdict: ok
note: Focused on general principles, so there is no conflict with recent practice and it can be reused.

Sagwan Revalidation 2026-06-14T04:54:34Z#

verdict: ok
note: The at-least-once, idempotency, and cursor principles for incremental ingestion are still valid today.

Sagwan Revalidation 2026-06-15T05:12:36Z#

verdict: ok
note: General collector design principles; no conflict with current practice.

Sagwan Revalidation 2026-06-16T06:00:55Z#

verdict: ok
note: The idempotency and watermark principles for incremental ingestion are also consistent with current practice.

Sagwan Revalidation 2026-06-17T07:33:06Z#

verdict: ok
note: General collector design principles; no conflict with current practice.

Sagwan Revalidation 2026-06-18T07:39:25Z#

verdict: ok
note: General collector design principles; still valid, and no update is needed.

Sagwan Revalidation 2026-06-19T09:20:23Z#

verdict: ok
note: The at-least-once, idempotency, and watermark recommendations remain consistent with current practice.

Sagwan Revalidation 2026-06-20T09:20:59Z#

verdict: ok
note: Focused on general principles; no conflict with current practice, and it can be reused.

Sagwan Revalidation 2026-06-21T09:26:31Z#

verdict: ok
note: Focused on general principles, so it remains valid without depending on current figures or links.

Sagwan Revalidation 2026-06-22T10:05:30Z#

verdict: ok
note: Focused on general principles; does not conflict with current collector design practice.

Sagwan Revalidation 2026-06-23T11:26:33Z#

verdict: ok
note: [chatgpt HTTP 401] {

Sagwan Revalidation 2026-06-24T11:37:41Z#

verdict: ok
note: [chatgpt HTTP 401] {

Sagwan Revalidation 2026-06-25T13:43:16Z#

verdict: ok
note: Focused on general principles, so there is no conflict with current practice and it can be reused.

Sagwan Revalidation 2026-06-26T15:27:04Z#

verdict: ok
note: General ingestion architecture principles; no conflict with current practice.

Sagwan Revalidation 2026-06-27T18:50:49Z#

verdict: ok
note: Focused on general principles, so there is no conflict with current practice and it can be reused.

Sagwan Revalidation 2026-06-28T19:24:35Z#

verdict: ok
note: Focused on general principles; no conflict with current practice, and it can be reused.

Sagwan Revalidation 2026-06-29T20:16:33Z#

verdict: ok
note: The general principles and recommendations match current practice; no change is needed.

Sagwan Revalidation 2026-07-01T01:55:32Z#

verdict: ok
note: Consistent with current ingestion pipeline practice; no issue with reuse.

Sagwan Revalidation 2026-07-02T10:36:42Z#

verdict: ok
note: The idempotency and watermark recommendations for incremental ingestion remain consistent with current practice.

Sagwan Revalidation 2026-07-03T23:43:23Z#

verdict: ok
note: General ingestion design principles; does not conflict with current practice.

Sagwan Revalidation 2026-07-05T02:58:31Z#

verdict: ok
note: General ingestion architecture principles; no conflict with current practice.

Sagwan Revalidation 2026-07-06T09:40:50Z#

verdict: ok
note: The at-least-once, idempotency, and watermark recommendations for incremental ingestion remain valid.

Sagwan Revalidation 2026-07-07T15:51:57Z#

verdict: ok
note: General ingestion architecture principles; no conflict with current practice.

Sagwan Revalidation 2026-07-08T23:02:17Z#

verdict: ok
note: Focused on general principles with no dependence on current figures or links, so it remains valid.

Sagwan Revalidation 2026-07-11T03:35:16Z#

verdict: ok
note: General collector design principles; also consistent with current practice.

Sagwan Revalidation 2026-07-12T21:41:26Z#

verdict: ok
note: General ingestion design principles; no conflict with current practice.

Sagwan Revalidation 2026-07-14T18:54:34Z#

verdict: ok
note: Consistent with current collector design practice; little need for an update.

Sagwan Revalidation 2026-07-16T19:52:57Z#

verdict: ok
note: General ingestion architecture principles; valid, with no conflict with current practice.

Sagwan Revalidation 2026-07-18T21:13:20Z#

verdict: ok
note: General collector design principles; does not conflict with current practice.

Sagwan Revalidation 2026-07-20T22:19:51Z#

verdict: ok
note: General design principles; no contradiction with current practice, so it can be reused.

Sagwan Revalidation 2026-07-23T00:11:55Z#

verdict: ok
note: General collector design principles; reusable without conflict with current practice.

Sagwan Revalidation 2026-07-25T02:30:26Z#

verdict: ok
note: Focused on general principles; remains valid without depending on recent figures or links.

Sagwan Revalidation 2026-07-27T06:35:09Z#

verdict: ok
note: Focused on general principles, so it can be reused without conflict with recent practice.

Sagwan Revalidation 2026-07-29T11:41:18Z#

verdict: ok
note: General collector design principles; valid, with no conflict with current practice.

Sagwan Revalidation 2026-07-31T21:05:16Z#

verdict: ok
note: The core recommendations for incremental collector processing are also consistent with current practice.