Skip to content

feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342

Open
aballman wants to merge 1 commit into
mainfrom
aballman/fs-1988-min-attempts
Open

feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342
aballman wants to merge 1 commit into
mainfrom
aballman/fs-1988-min-attempts

Conversation

@aballman
Copy link
Copy Markdown

Summary

Closes the SDK retry-budget short-circuit documented in FS-1988.

retry_with_backoff_async currently checks now - start > max_elapsed_time before sleeping. With RetryConfig.max_elapsed_time set to 5 min (per platform-plugins#1170) and per-chunk httpx client timeouts of 30 min (unstructured-api) / 60 min (vlm), any chunk whose first attempt takes longer than the retry budget blows the budget on attempt 1 — zero retries fire on subsequent transient errors. Two customer-visible regressions in the platform partition path traced back to this.

What changes

New optional fields on BackoffStrategy (default values preserve existing behavior):

  • min_attempts: int = 0 — minimum retry attempts that must fire before max_elapsed_time is honored. Counts retries, not the initial attempt. min_attempts=2 permits 1 initial + at least 2 retries (3 total attempts) before the soft cap can cut the loop.
  • absolute_max_elapsed_time_ms: int | None = None — cap on when a new retry can START. Does NOT interrupt an in-flight func() call. Worst-case wall-clock under this cap is absolute_max_elapsed_time_ms + per_attempt_timeout.

Loop changes in both sync (retry_with_backoff) and async (retry_with_backoff_async) paths:

  1. Post-attempt cap check. Soft cap honored only when retries >= min_attempts; hard cap unconditional.
  2. Pre-sleep hard-cap check. Refuses to sleep into a retry whose projected start would cross the hard cap.
  3. Post-sleep verification. Belt-and-suspenders against late wakeups and projection rounding.
  4. Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap) dedupes the soft/hard cap logic between sync and async.

Validation in BackoffStrategy.__init__ rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0, and hard cap below soft cap.

.genignore updated to preserve these fields across future Speakeasy regens, with a documented merge procedure matching the general.py and users.py precedent. Pushing the change upstream to Speakeasy templates is a separate follow-up.

Design choices

  • Field names retained from FS-1988 (no rename to min_retries) to keep the public BackoffStrategy interface stable. The docstring is the contract — see the comment block in retries.py for unambiguous semantics.
  • Hard cap is a retry-start cap, not a wall-clock bound. In-flight func() cannot be interrupted from the retry loop. Documented in the field docstring. For the partitioner consumer, pair with a sensible per-attempt timeout (CLIENT_TIMEOUT_MS) to keep the worst case bounded.
  • Defaults preserve existing behavior so this change is non-breaking for every existing consumer. Consumers opt in by setting min_attempts > 0 and/or absolute_max_elapsed_time_ms.

Tests

49 tests pass (46 unit + 3 split-PDF retry integration). New coverage:

  • T1–T14 in _test_unstructured_client/unit/test_retries.py: fake-clock harness monkeypatching time.time / time.sleep / asyncio.sleep / random.uniform. Covers the FS-1988 reproducer (slow first attempt + min_attempts floor), floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep truncation, TemporaryError early-return through both caps, PermanentError short-circuit immunity, and BackoffStrategy.__init__ validation.
  • test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe in integration/test_decorators.py: pins the body-replay invariant for chunk requests built from open file objects (the split_pdf_cache_tmp_data=True path). Iterates request.stream twice directly — bypasses request.read() caching — so a future Speakeasy template change that produced a single-consumption stream would fail this test.

Adversarial review

This PR was iterated through four rounds of codex review (v1 → v2 → v3 → v3.1 → final2) before opening. The most consequential findings:

  • v1 / v2: original design used min_attempts alone, recreating the partitioner's ~92-min slot-pinning failure. Resolution: add the hard cap.
  • v2: request.read()-based regression test would pass even with a non-replayable stream. Resolution: rewrite to iterate request.stream twice directly.
  • v3: .genignore and tests were planned but not actually committed. Resolution: commit them in the same PR.

Final review: "No blocking issues found."

Test plan

  • uv run pytest _test_unstructured_client/unit/test_retries.py — 46 pass
  • uv run pytest _test_unstructured_client/integration/test_decorators.py::test_split_pdf_*retry* _test_unstructured_client/integration/test_decorators.py::test_split_pdf_cache_tmp_data_* — 3 pass
  • CI green
  • Platform-plugins follow-up PR opts in via min_attempts=2 + absolute_max_elapsed_time_ms=15_min; SND validation before staging/prod cutover

Follow-ups (separate tickets)

  • Push min_attempts + absolute_max_elapsed_time_ms upstream to Speakeasy templates so .genignore can eventually be removed.
  • platform-plugins PR to opt the partitioner into the new fields, behind a PARTITIONER_RETRY_MIN_ATTEMPTS env-var feature flag.

…ackoffStrategy

Closes a short-circuit in retry_with_backoff{,_async} where a single slow
first attempt can exhaust max_elapsed_time before any retry fires. With
the partitioner config (max_elapsed_time=5min, CLIENT_TIMEOUT_MS=30min),
any chunk whose first attempt exceeds 5 minutes blew the retry budget
on attempt 1 -- zero retries on subsequent transient errors. Documented
in two customer-visible regressions in the platform partition path.

New fields on BackoffStrategy:

* min_attempts (default 0) -- minimum number of retry attempts that must
  fire before max_elapsed_time is honored. Counts retries (not the
  initial attempt). Default 0 preserves existing behavior.
* absolute_max_elapsed_time_ms (default None) -- cap on when a new retry
  can START. Does NOT interrupt an in-flight func() call. Worst-case
  wall-clock under this cap is absolute_max_elapsed_time_ms +
  per_attempt_timeout. Default None preserves existing behavior.

Loop changes in both retry_with_backoff and retry_with_backoff_async:

* Post-attempt cap check honors min_attempts as a floor on the soft cap
  and treats the hard cap as unconditional.
* Pre-sleep hard-cap check refuses to sleep into a doomed retry whose
  projected start would exceed the hard cap.
* Post-sleep verification (belt-and-suspenders against late wakeups and
  rounding drift in the projection).
* Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap)
  dedupes logic between sync and async paths.

Validation rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0,
and absolute_max_elapsed_time_ms below max_elapsed_time.

.genignore: ignore src/unstructured_client/utils/retries.py to preserve
these fields across Speakeasy regens. Documented merge procedure for
future template updates. Pushing these fields upstream to Speakeasy
templates is filed as a follow-up.

Tests:

* T1-T14 in unit/test_retries.py with a fake-clock harness that
  monkeypatches time.time / time.sleep / asyncio.sleep / random.uniform.
  Covers the v1 reproducer (slow first attempt + min_attempts floor),
  floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep
  truncation, TemporaryError early-return through both caps, and
  PermanentError short-circuit immunity.
* Validation tests for the new __init__ guards.
* New integration test
  test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe
  pins the body-replay invariant: chunks built from open file objects
  (cache_tmp_data=True) must produce replay-safe httpx requests so SDK
  retries actually deliver the original multipart payload. Iterates
  request.stream twice directly to bypass request.read() caching.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant