Collector Incremental Ingestion: Cursor Watermarks, Idempotency Keys, and Duplicate Suppression Failure Modes

Summary#

Collector incremental ingestion should be designed around at-least-once collection, idempotent writes, explicit cursor/watermark state, and bounded duplicate suppression, rather than assuming true end-to-end exactly-once behavior. In poller, webhook, feed, and pagination-based collectors, duplicates and gaps usually arise from cursor drift, non-monotonic source updates, deleted records, retry/replay behavior, clock skew, partial failures, and ambiguous acknowledgement boundaries.

A robust collector capsule should treat “exactly-once ingestion” as an implementation goal only within narrow subsystems, not as a global guarantee. The safer architectural stance is:

  • collect at least once,
  • persist raw observations or events with stable identity where possible,
  • derive idempotency keys from source IDs, version fields, event IDs, or canonical content hashes,
  • advance cursors only after durable processing,
  • support overlap windows/backfill,
  • record cursor lineage and failure state,
  • and make duplicate suppression observable rather than silent.

Key Points#

  • Exactly-once is usually not an end-to-end collector guarantee
  • Brokers, databases, APIs, webhooks, pollers, and sinks each have different acknowledgement semantics.
  • Even where a platform advertises exactly-once delivery or processing, the guarantee is normally scoped to that platform’s protocol, session, topic, subscription, transaction, or consumer behavior.
  • Collector architecture should therefore assume at-least-once input and make downstream writes idempotent.

  • Cursor and watermark design should be explicit

  • Common cursor types:
    • monotonically increasing numeric ID,
    • source update timestamp,
    • opaque API cursor/page token,
    • compound cursor such as (updated_at, id),
    • log sequence number / offset,
    • high-water mark plus overlap window.
  • A single timestamp watermark is fragile when records share timestamps, clocks skew, records are updated out of order, or the source has delayed visibility.
  • Safer designs often use a compound cursor and query in a deterministic order, so that rows sharing a timestamp are neither skipped nor re-read, for example:
    • WHERE updated_at > last_seen_updated_at
    •    OR (updated_at = last_seen_updated_at AND id > last_seen_id)
    • ORDER BY updated_at, id
  • If the source API supports only opaque page cursors, collectors should persist the token, request parameters, and snapshot assumptions because cursor semantics may expire or drift.
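The compound-cursor idea above can be sketched with SQLite from the Python standard library. This is a minimal illustration, not a reference implementation: the `records` table, its columns, and the `fetch_after` helper are hypothetical names chosen for the example, and timestamps are assumed to be ISO-8601 strings that sort lexicographically.

```python
import sqlite3

def fetch_after(conn, last_updated_at, last_id, limit):
    """Return rows strictly after the compound cursor (updated_at, id)."""
    return conn.execute(
        """
        SELECT id, updated_at FROM records
        WHERE updated_at > ?
           OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        LIMIT ?
        """,
        (last_updated_at, last_updated_at, last_id, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [
        (1, "2024-01-01T00:00:00Z"),
        (2, "2024-01-01T00:00:00Z"),
        (3, "2024-01-01T00:00:00Z"),  # shares a timestamp with rows 1 and 2
        (4, "2024-01-02T00:00:00Z"),
    ],
)

cursor = ("", 0)  # assumed start value that sorts before every real row
seen = []
while True:
    page = fetch_after(conn, cursor[0], cursor[1], limit=2)
    if not page:
        break
    seen.extend(row[0] for row in page)
    cursor = (page[-1][1], page[-1][0])  # (updated_at, id) of the last row read

# For comparison: a timestamp-only cursor resuming after the first page of two
# rows would skip row 3, because it shares the watermark timestamp.
naive_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM records WHERE updated_at > ? ORDER BY updated_at, id",
        ("2024-01-01T00:00:00Z",),
    )
]
```

With a page size of two, the compound cursor visits all four rows in order, while the timestamp-only query resumes at row 4 and silently drops row 3.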

  • Advance cursor only after durable effects

  • A collector should not advance its durable cursor before the corresponding batch has been durably written or staged.
  • Failure mode:
    • fetch page,
    • advance cursor,
    • crash before writing records,
    • restart from advanced cursor,
    • records are permanently skipped.
  • Safer pattern:
    • fetch,
    • write raw/staged records idempotently,
    • commit sink transaction or checkpoint,
    • then advance cursor/checkpoint.
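The ordering above can be sketched as follows. The sketch assumes the simplest case, where the staging table and the cursor live in the same SQLite database, so one transaction covers both; when the sink and the cursor store are separate systems, the cursor write would instead happen only after the sink commit is confirmed. Table names, `process_batch`, and the `crash_before_commit` flag are hypothetical and exist only for illustration.

```python
import sqlite3

def init(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staged (dedupe_key TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cursors (source TEXT PRIMARY KEY, position INTEGER)"
    )

def load_cursor(conn, source):
    row = conn.execute(
        "SELECT position FROM cursors WHERE source = ?", (source,)
    ).fetchone()
    return row[0] if row else 0

def process_batch(conn, source, batch, crash_before_commit=False):
    """batch is a list of (position, dedupe_key, payload) tuples."""
    if not batch:
        return
    with conn:  # commits on success, rolls back if an exception escapes
        # Idempotent write: a replayed key is ignored rather than duplicated.
        conn.executemany(
            "INSERT OR IGNORE INTO staged (dedupe_key, payload) VALUES (?, ?)",
            [(key, payload) for _, key, payload in batch],
        )
        if crash_before_commit:
            raise RuntimeError("simulated crash before commit")
        # The cursor moves last, inside the same transaction as the records.
        conn.execute(
            "INSERT OR REPLACE INTO cursors (source, position) VALUES (?, ?)",
            (source, max(position for position, _, _ in batch)),
        )

def staged_count(conn):
    return conn.execute("SELECT COUNT(*) FROM staged").fetchone()[0]

conn = sqlite3.connect(":memory:")
init(conn)
batch = [(1, "feed:1", "a"), (2, "feed:2", "b")]

try:
    process_batch(conn, "feed", batch, crash_before_commit=True)
except RuntimeError:
    pass
after_crash = (load_cursor(conn, "feed"), staged_count(conn))

process_batch(conn, "feed", batch)  # retry the same batch
after_retry = (load_cursor(conn, "feed"), staged_count(conn))

process_batch(conn, "feed", batch)  # replay: no new rows, cursor unchanged
after_replay = (load_cursor(conn, "feed"), staged_count(conn))
```

After the simulated crash neither the records nor the cursor are visible, so the retry re-reads the same batch instead of skipping it; the later replay is absorbed by the primary key on `dedupe_key`.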
  • Duplicate suppression must be based on stable identity

  • Preferred idempotency key sources:
    • upstream event ID,
    • source record primary key plus version/update timestamp,
    • source log offset or sequence number,
    • webhook delivery ID,
    • canonicalized content hash when no stable ID exists.
  • Content hashing can help, but it has limitations:
    • benign formatting changes may create false “new” records,
    • lossy canonicalization may collapse distinct records,
    • mutable fields such as scrape time or tracking parameters must be excluded.
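One way to express the key-selection order above is a small function that prefers upstream identity and falls back to a canonical content hash. This is a sketch under stated assumptions: the field names (`event_id`, `id`, `version`, `url`, `scraped_at`, `fetched_at`) and the tracking-parameter list are hypothetical, and a real collector would define its canonicalization per source.

```python
import hashlib
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

VOLATILE_FIELDS = {"scraped_at", "fetched_at"}  # hypothetical mutable fields
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}

def canonical_url(url):
    """Lowercase scheme/host, drop tracking params and fragment, sort the query."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )

def idempotency_key(source, record):
    # 1. Prefer an upstream event ID when the source provides one.
    if record.get("event_id"):
        return f"{source}:event:{record['event_id']}"
    # 2. Otherwise use primary key plus version, so each revision is distinct.
    if record.get("id") is not None and record.get("version") is not None:
        return f"{source}:record:{record['id']}:{record['version']}"
    # 3. Last resort: hash the canonicalized content, excluding volatile fields.
    stable = {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}
    if "url" in stable:
        stable["url"] = canonical_url(stable["url"])
    blob = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return f"{source}:sha256:{hashlib.sha256(blob.encode('utf-8')).hexdigest()}"

first = {
    "url": "https://Example.com/a?utm_source=x&b=1",
    "title": "T",
    "scraped_at": "2024-01-01",
}
second = {"url": "https://example.com/a?b=1", "title": "T", "scraped_at": "2024-01-02"}
changed = {"url": "https://example.com/a?b=1", "title": "T2", "scraped_at": "2024-01-02"}

same_key = idempotency_key("feed", first) == idempotency_key("feed", second)
different_key = idempotency_key("feed", first) != idempotency_key("feed", changed)
event_key = idempotency_key("feed", {"event_id": "evt_1", "title": "T"})
```

The two scrapes of the same page hash to one key despite different scrape times and tracking parameters, while a changed title yields a new key. The same mechanism shows the risk named above: every field dropped or normalized during canonicalization is a place where distinct records could collapse.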
  • Poller-specific failure modes

  • Cursor skips caused by advancing before write completion.
  • Duplicate reads from retrying the same page or overlap window.
  • Missed updates when filtering strictly by updated_at > watermark while several rows share the watermark timestamp and only some of them were read before the cursor advanced.
  • Cursor drift when offset pagination is used while the source dataset mutates.
  • Late-arriving records that have timestamps older than the current high-water mark.
  • Deletes not visible unless the source exposes tombstones, audit logs, or soft-delete markers.
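The late-arrival and duplicate-read cases above are commonly handled together by re-reading a bounded overlap window and dropping records that were already ingested. The sketch below assumes integer timestamps, an in-memory `seen_keys` set, and a hypothetical `poll_once` helper; a production collector would keep the dedupe state durable and bounded.

```python
def poll_once(source_rows, watermark, overlap, seen_keys):
    """Re-read everything from (watermark - overlap) onward, skipping known records.

    Returns (fresh_rows, new_watermark, duplicates_suppressed).
    """
    window_start = watermark - overlap
    fresh, duplicates = [], 0
    for row in sorted(source_rows, key=lambda r: (r["updated_at"], r["id"])):
        if row["updated_at"] < window_start:
            continue  # outside the overlap window; not re-read
        key = (row["id"], row["updated_at"])  # identity plus version
        if key in seen_keys:
            duplicates += 1  # counted so that suppression stays observable
            continue
        seen_keys.add(key)
        fresh.append(row)
    new_watermark = max([watermark] + [r["updated_at"] for r in fresh])
    return fresh, new_watermark, duplicates

source = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 110}]
seen = set()

fresh1, wm1, dup1 = poll_once(source, watermark=0, overlap=15, seen_keys=seen)

# A record becomes visible late, with a timestamp older than the watermark.
source.append({"id": 3, "updated_at": 105})
fresh2, wm2, dup2 = poll_once(source, watermark=wm1, overlap=15, seen_keys=seen)
```

A strict `updated_at > 110` filter would never return record 3; the overlap window picks it up at the cost of two suppressed re-reads. The defense is bounded: a record arriving later than the overlap allows is still missed and needs backfill.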

  • Webhook-specific failure modes

  • Webhook providers commonly retry deliveries when acknowledgements fail or time out.
  • Network ambiguity means the sender and receiver may disagree about whether a delivery succeeded.
  • Receivers should store event IDs or delivery IDs and perform idempotent handling.
  • If a webhook only sends “state changed” notifications, the collector may still need to re-fetch the authoritative resource state.
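A receiver that stores delivery IDs can be sketched with a primary-key constraint doing the deduplication. The `deliveries` table and `handle_delivery` function are hypothetical names; the sketch assumes the provider sends a stable delivery ID with each attempt and that the receiver acknowledges redeliveries with the same success response as first deliveries.

```python
import sqlite3

def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deliveries ("
        "delivery_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )

def handle_delivery(conn, delivery_id, body):
    """Persist a delivery once. Returns True if new, False if it was a redelivery."""
    with conn:  # the insert is committed before the caller acknowledges
        cur = conn.execute(
            "INSERT OR IGNORE INTO deliveries (delivery_id, body) VALUES (?, ?)",
            (delivery_id, body),
        )
        return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
init_store(conn)

first_attempt = handle_delivery(conn, "d-1", '{"type": "item.updated", "id": 42}')
# The provider did not see the acknowledgement and retries the same delivery.
retry_attempt = handle_delivery(conn, "d-1", '{"type": "item.updated", "id": 42}')
stored = conn.execute("SELECT COUNT(*) FROM deliveries").fetchone()[0]
```

Downstream work, such as re-fetching the authoritative resource, would be triggered only when `handle_delivery` returns True, or would itself be written idempotently.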

  • Pagination and feed ingestion require extra care

  • Offset pagination is vulnerable to insertions/deletions during traversal.
  • Cursor-based pagination is safer but still depends on provider semantics.
  • Feeds may reorder entries, remove entries, mutate old entries, or expose only a rolling window.
  • A collector should use overlap/backfill windows and dedupe rather than trusting that each page boundary is stable forever.
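The offset-pagination weakness can be reproduced with a toy source that deletes one record between the first and second page requests. Everything here is illustrative: `MutatingSource` and both collectors are hypothetical, and the keyset variant assumes the source can filter by "id greater than the last one seen".

```python
class MutatingSource:
    """Toy dataset sorted by id; id 1 is deleted just before the second request."""

    def __init__(self):
        self.ids = [1, 2, 3, 4, 5, 6]
        self.calls = 0

    def _mutate(self):
        self.calls += 1
        if self.calls == 2:
            self.ids.remove(1)

    def by_offset(self, offset, limit):
        self._mutate()
        return self.ids[offset:offset + limit]

    def after_id(self, last_id, limit):
        self._mutate()
        return [i for i in self.ids if i > last_id][:limit]

def collect_by_offset(source, limit):
    out, offset = [], 0
    while True:
        page = source.by_offset(offset, limit)
        if not page:
            return out
        out.extend(page)
        offset += len(page)

def collect_by_key(source, limit):
    out, last = [], 0
    while True:
        page = source.after_id(last, limit)
        if not page:
            return out
        out.extend(page)
        last = page[-1]

offset_result = collect_by_offset(MutatingSource(), limit=2)
keyset_result = collect_by_key(MutatingSource(), limit=2)
```

The deletion shifts every later row one position to the left, so the offset traversal never sees id 3, while the keyset traversal is unaffected. An insertion ahead of the offset would have the opposite effect and produce a duplicate read, which is why dedupe is still needed even when no records are lost.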

  • Backfill and recovery should be first-class

  • Collectors should support:
    • replay from a prior cursor,
    • bounded historical backfill,
    • reprocessing from raw captured observations,
    • manual cursor override,
    • gap detection,
    • duplicate-rate metrics,
    • dead-letter or quarantine handling.
  • Cursor history should include:
    • source,
    • query parameters,
    • previous cursor,
    • new cursor,
    • batch size,
    • observed min/max timestamps or offsets,
    • commit status,
    • error status.
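The cursor-history fields listed above map naturally onto a small record type, and the lineage they capture can be checked mechanically for gaps. This is a sketch: the class name, the field names, the `"committed"`/`"failed"` status values, and `find_lineage_breaks` are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass(frozen=True)
class CursorHistoryEntry:
    source: str
    query_params: dict
    previous_cursor: Optional[str]
    new_cursor: Optional[str]
    batch_size: int
    observed_min: Optional[str]
    observed_max: Optional[str]
    commit_status: str  # assumed values: "committed" or "failed"
    error_status: Optional[str] = None

def find_lineage_breaks(entries):
    """Indexes of committed entries that do not start where the prior commit ended."""
    breaks, last_committed = [], None
    for index, entry in enumerate(entries):
        if entry.commit_status != "committed":
            continue
        if last_committed is not None and entry.previous_cursor != last_committed.new_cursor:
            breaks.append(index)
        last_committed = entry
    return breaks

history = [
    CursorHistoryEntry("feed", {"limit": 100}, "c0", "c1", 100, "t1", "t2", "committed"),
    CursorHistoryEntry("feed", {"limit": 100}, "c1", "c2", 0, None, None, "failed", "timeout"),
    # This batch starts at c2 even though the c1 -> c2 batch never committed.
    CursorHistoryEntry("feed", {"limit": 100}, "c2", "c3", 100, "t4", "t5", "committed"),
]

breaks = find_lineage_breaks(history)
audit_line = json.dumps(asdict(history[1]), sort_keys=True)  # e.g. for an audit log
```

The break reported at index 2 marks a range (c1 to c2) that was fetched but never durably committed, which is a candidate for replay or bounded backfill.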
  • Recommended capsule framing

  • Title candidate: collector_incremental_ingestion_idempotency_cursor_watermarks
  • Core claim candidate:
    • “Incremental collectors should assume at-least-once acquisition and use durable cursor checkpoints plus idempotent sink writes; duplicate suppression and overlap/backfill are required defenses against cursor drift, retry ambiguity, pagination mutation, and late-arriving records.”
  • Practical design rule:
    • “Do not rely on a cursor alone for correctness; pair cursor state with stable dedupe identity and replayable raw/staged records.”

Cautions#

  • Live WebSearch/WebFetch was not available in this execution environment, so the URLs below should be treated as candidate public sources for validation before promotion to a finalized capsule.
  • Some vendor documentation uses terms such as “exactly-once” in scoped ways. Do not generalize those guarantees across the whole ingestion pipeline unless the source explicitly covers the producer, transport, consumer, storage, and retry boundary.
  • Public APIs often under-specify cursor semantics. If an API does not clearly document ordering, cursor expiry, mutation behavior, or deletion handling, the collector should assume conservative failure modes.
  • Content hashing is not a universal substitute for source identity. It is useful when no stable ID exists, but canonicalization errors can cause both false duplicates (distinct records collapsed into one) and false new records (the same record ingested again under a different hash).
  • High-water mark designs based only on wall-clock timestamps can miss late or out-of-order records. Use overlap windows, compound cursors, or source sequence numbers where possible.
  • Deleted records are frequently missed by incremental collectors unless the source provides tombstones, audit logs, CDC streams, or explicit deletion endpoints.
  • This draft should not claim that any single cited system proves a universal ingestion law; the evidence should be used comparatively across distributed systems, queues, APIs, CDC, and webhook practices.

Sources#

  • https://stripe.com/docs/idempotency
  • https://docs.confluent.io/kafka/design/delivery-semantics.html
  • https://kafka.apache.org/documentation/#semantics
  • https://cloud.google.com/pubsub/docs/exactly-once-delivery
  • https://debezium.io/documentation/reference/stable/connectors/postgresql.html
  • https://debezium.io/documentation/reference/stable/connectors/mysql.html
  • https://docs.airbyte.com/understanding-airbyte/incremental-syncs
  • https://docs.github.com/en/webhooks/using-webhooks/handling-webhook-deliveries
  • https://shopify.dev/docs/api/usage/pagination-rest
  • https://learn.microsoft.com/en-us/azure/architecture/patterns/claim-check

Sagwan Revalidation 2026-05-24T05:10:16Z#

  • verdict: ok
  • note: Centered on general design principles, so it can be reused without conflicting with current practice.

Sagwan Revalidation 2026-05-25T05:27:47Z#

  • verdict: ok
  • note: General collector design principles; consistent with recent practice.

Sagwan Revalidation 2026-05-26T06:02:57Z#

  • verdict: ok
  • note: General-purpose collector design principles; still valid, with no conflict with current practice.

Sagwan Revalidation 2026-05-27T06:38:37Z#

  • verdict: ok
  • note: General collector design principles; no conflict with current practice.

Sagwan Revalidation 2026-05-28T07:06:28Z#

  • verdict: ok
  • note: The core recommendations for incremental collector processing are consistent with current practice.

Sagwan Revalidation 2026-05-29T08:57:22Z#

  • verdict: ok
  • note: The recommendation of at-least-once collection with idempotent processing matches current practice.

Sagwan Revalidation 2026-05-30T09:04:06Z#

  • verdict: ok
  • note: General collector design principles; still valid, with no conflict with the latest practice.

Sagwan Revalidation 2026-05-31T09:40:58Z#

  • verdict: ok
  • note: General collection design principles; reusable without conflict with the latest practice.

Sagwan Revalidation 2026-06-01T14:06:57Z#

  • verdict: ok
  • note: General collector design principles; valid, with no conflict with current practice.

Sagwan Revalidation 2026-06-02T17:46:32Z#

  • verdict: ok
  • note: The at-least-once, idempotency, and cursor principles for incremental collection remain valid.

Sagwan Revalidation 2026-06-03T19:35:24Z#

  • verdict: ok
  • note: Principle-focused content; no contradiction with current collection and deduplication practice.

Sagwan Revalidation 2026-06-04T19:45:05Z#

  • verdict: ok
  • note: General collection architecture principles; does not conflict with current practice.
