feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy by aballman · Pull Request #342 · Unstructured-IO/unstructured-python-client

aballman · 2026-05-21T15:48:44Z

Summary

Closes the SDK retry-budget short-circuit documented in FS-1988.

retry_with_backoff_async currently checks now - start > max_elapsed_time before sleeping. With RetryConfig.max_elapsed_time set to 5 min (per platform-plugins#1170) and per-chunk httpx client timeouts of 30 min (unstructured-api) / 60 min (vlm), any chunk whose first attempt takes longer than the retry budget blows the budget on attempt 1 — zero retries fire on subsequent transient errors. Two customer-visible regressions in the platform partition path traced back to this.

What changes

New optional fields on BackoffStrategy (default values preserve existing behavior):

min_attempts: int = 0 — minimum retry attempts that must fire before max_elapsed_time is honored. Counts retries, not the initial attempt. min_attempts=2 permits 1 initial + at least 2 retries (3 total attempts) before the soft cap can cut the loop.
absolute_max_elapsed_time_ms: int | None = None — cap on when a new retry can START. Does NOT interrupt an in-flight func() call. Worst-case wall-clock under this cap is absolute_max_elapsed_time_ms + per_attempt_timeout.

Loop changes in both sync (retry_with_backoff) and async (retry_with_backoff_async) paths:

Post-attempt cap check. Soft cap honored only when retries >= min_attempts; hard cap unconditional.
Pre-sleep hard-cap check. Refuses to sleep into a retry whose projected start would cross the hard cap.
Post-sleep verification. Belt-and-suspenders against late wakeups and projection rounding.
Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap) dedupes the soft/hard cap logic between sync and async.

Validation in BackoffStrategy.__init__ rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0, and hard cap below soft cap.

.genignore updated to preserve these fields across future Speakeasy regens, with a documented merge procedure matching the general.py and users.py precedent. Pushing the change upstream to Speakeasy templates is a separate follow-up.

Design choices

Field names retained from FS-1988 (no rename to min_retries) to keep the public BackoffStrategy interface stable. The docstring is the contract — see the comment block in retries.py for unambiguous semantics.
Hard cap is a retry-start cap, not a wall-clock bound. In-flight func() cannot be interrupted from the retry loop. Documented in the field docstring. For the partitioner consumer, pair with a sensible per-attempt timeout (CLIENT_TIMEOUT_MS) to keep the worst case bounded.
Defaults preserve existing behavior so this change is non-breaking for every existing consumer. Consumers opt in by setting min_attempts > 0 and/or absolute_max_elapsed_time_ms.

Tests

49 tests pass (46 unit + 3 split-PDF retry integration). New coverage:

T1–T14 in _test_unstructured_client/unit/test_retries.py: fake-clock harness monkeypatching time.time / time.sleep / asyncio.sleep / random.uniform. Covers the FS-1988 reproducer (slow first attempt + min_attempts floor), floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep truncation, TemporaryError early-return through both caps, PermanentError short-circuit immunity, and BackoffStrategy.__init__ validation.
test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe in integration/test_decorators.py: pins the body-replay invariant for chunk requests built from open file objects (the split_pdf_cache_tmp_data=True path). Iterates request.stream twice directly — bypasses request.read() caching — so a future Speakeasy template change that produced a single-consumption stream would fail this test.

Adversarial review

This PR was iterated through four rounds of codex review (v1 → v2 → v3 → v3.1 → final2) before opening. The most consequential findings:

v1 / v2: original design used min_attempts alone, recreating the partitioner's ~92-min slot-pinning failure. Resolution: add the hard cap.
v2: request.read()-based regression test would pass even with a non-replayable stream. Resolution: rewrite to iterate request.stream twice directly.
v3: .genignore and tests were planned but not actually committed. Resolution: commit them in the same PR.

Final review: "No blocking issues found."

Test plan

uv run pytest _test_unstructured_client/unit/test_retries.py — 46 pass
uv run pytest _test_unstructured_client/integration/test_decorators.py::test_split_pdf_*retry* _test_unstructured_client/integration/test_decorators.py::test_split_pdf_cache_tmp_data_* — 3 pass
CI green
Platform-plugins follow-up PR opts in via min_attempts=2 + absolute_max_elapsed_time_ms=15_min; SND validation before staging/prod cutover

Follow-ups (separate tickets)

Push min_attempts + absolute_max_elapsed_time_ms upstream to Speakeasy templates so .genignore can eventually be removed.
platform-plugins PR to opt the partitioner into the new fields, behind a PARTITIONER_RETRY_MIN_ATTEMPTS env-var feature flag.

…ackoffStrategy Closes a short-circuit in retry_with_backoff{,_async} where a single slow first attempt can exhaust max_elapsed_time before any retry fires. With the partitioner config (max_elapsed_time=5min, CLIENT_TIMEOUT_MS=30min), any chunk whose first attempt exceeds 5 minutes blew the retry budget on attempt 1 -- zero retries on subsequent transient errors. Documented in two customer-visible regressions in the platform partition path. New fields on BackoffStrategy: * min_attempts (default 0) -- minimum number of retry attempts that must fire before max_elapsed_time is honored. Counts retries (not the initial attempt). Default 0 preserves existing behavior. * absolute_max_elapsed_time_ms (default None) -- cap on when a new retry can START. Does NOT interrupt an in-flight func() call. Worst-case wall-clock under this cap is absolute_max_elapsed_time_ms + per_attempt_timeout. Default None preserves existing behavior. Loop changes in both retry_with_backoff and retry_with_backoff_async: * Post-attempt cap check honors min_attempts as a floor on the soft cap and treats the hard cap as unconditional. * Pre-sleep hard-cap check refuses to sleep into a doomed retry whose projected start would exceed the hard cap. * Post-sleep verification (belt-and-suspenders against late wakeups and rounding drift in the projection). * Helper extraction (_cap_hit_after_attempt, _raise_or_return_after_cap) dedupes logic between sync and async paths. Validation rejects min_attempts < 0, absolute_max_elapsed_time_ms <= 0, and absolute_max_elapsed_time_ms below max_elapsed_time. .genignore: ignore src/unstructured_client/utils/retries.py to preserve these fields across Speakeasy regens. Documented merge procedure for future template updates. Pushing these fields upstream to Speakeasy templates is filed as a follow-up. Tests: * T1-T14 in unit/test_retries.py with a fake-clock harness that monkeypatches time.time / time.sleep / asyncio.sleep / random.uniform. Covers the v1 reproducer (slow first attempt + min_attempts floor), floor-is-not-a-ceiling semantics, hard cap overrides floor, sleep truncation, TemporaryError early-return through both caps, and PermanentError short-circuit immunity. * Validation tests for the new __init__ guards. * New integration test test_split_pdf_cache_tmp_data_chunk_request_stream_is_replay_safe pins the body-replay invariant: chunks built from open file objects (cache_tmp_data=True) must produce replay-safe httpx requests so SDK retries actually deliver the original multipart payload. Iterates request.stream twice directly to bypass request.read() caching.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342

feat(FS-1988): add min_attempts and absolute_max_elapsed_time_ms to BackoffStrategy#342
aballman wants to merge 1 commit into
mainfrom
aballman/fs-1988-min-attempts

aballman commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aballman commented May 21, 2026

Summary

What changes

Design choices

Tests

Adversarial review

Test plan

Follow-ups (separate tickets)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant