
ct: Local retention for tiered_v2 (pt3) #30545

Open
Lazin wants to merge 14 commits into redpanda-data:dev from Lazin:ct/local-retention-for-tiered-cloud-space-manager

@Lazin Lazin commented May 20, 2026

This is the last PR of the series.

It adds more tests and integration with space management.

The first PR: #30543
The previous PR: #30544

Summary

Add a single optional hint, allowed_local_start_offset, replicated into
ctp_stm state. The reconciler is the sole writer of the hint. On each
tick it inspects the partition's topic config and the data it has reconciled,
computes a retention offset from retention.local.target.{ms,bytes} exactly
as disk_log_impl::housekeeping would, and publishes it. The computation takes
into account topic config, cluster-level config, and retention_local_strict.

ctp_stm's truncate loop then uses
prefix_truncate_target = min(LRLO, log_offset(hint)) instead of just LRLO.
That is the only consumer of the hint.

For cloud mode (or compacted topics), the reconciler publishes nullopt and
behaviour is unchanged — aggressive eviction up to LRO.
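
The target computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: offsets are plain integers rather than the strong offset types, the hint is assumed to be already translated to a log offset, and hinted_truncate_target is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the real strong offset type.
using log_offset_t = std::int64_t;

// lrlo:     last reconciled log offset, the upper bound for any truncation.
// hint_log: allowed_local_start_offset already translated to a log offset,
//           or nullopt when the reconciler has published no hint.
inline log_offset_t hinted_truncate_target(
  log_offset_t lrlo, std::optional<log_offset_t> hint_log) {
    if (!hint_log) {
        // cloud mode or compacted topic: evict everything reconciled.
        return lrlo;
    }
    // tiered_cloud: hold local data back, but never go past LRLO.
    return std::min(lrlo, *hint_log);
}
```

With lrlo = 100, a hint of 40 yields 40, a hint of 150 is clamped to 100, and no hint yields 100.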

Housekeeping in ctp_stm already tracks active L0 readers, so races with
local eviction are not possible.

Local retention rules

storage.mode   compaction   cached value   Action
tiered_cloud   off          nullopt        re-evaluate → publish Some(offset)
tiered_cloud   off          Some(_)        re-evaluate at normal cadence
tiered_cloud   on           Some(_)        publish nullopt
cloud          any          Some(_)        publish nullopt
cloud          any          nullopt        no-op
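
The rules above can be read as a small decision function. The sketch below is illustrative only: the enum and function names are hypothetical, and the combination the rules do not list (tiered_cloud with compaction on and no cached hint) is assumed to be a no-op.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class storage_mode { tiered_cloud, cloud };
enum class hint_action { reevaluate, publish_nullopt, noop };

// Decide what the reconciler does with the hint for one partition.
inline hint_action decide_hint_action(
  storage_mode mode,
  bool compaction_enabled,
  const std::optional<std::int64_t>& cached_hint) {
    const bool retains_locally
      = mode == storage_mode::tiered_cloud && !compaction_enabled;
    if (retains_locally) {
        // Compute (or refresh) the retention offset and publish it.
        return hint_action::reevaluate;
    }
    // cloud mode or compacted topic: clear a stale hint once, then idle.
    // Assumption: with no cached hint there is nothing to clear.
    return cached_hint ? hint_action::publish_nullopt : hint_action::noop;
}
```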

Space manager: no changes needed

The space manager already publishes a cloud_gc offset on the log when disk
pressure rises. In classic tiered, disk_log_impl::housekeeping reads and
clears it. For cloud topics:

  • ctp_stm::prefix_truncate_below_lro already consults cloud_gc when
    computing its truncation target, so the space manager can push the local
    footprint below the reconciler's hint under pressure.
  • cloud_gc is cleared by ctp_stm (since disk_log_impl::housekeeping
    doesn't run).
  • The only space-manager change is allowing it to run against cloud topic
    partitions; the GC mechanism is identical to tiered.
  • max_removable_local_log_offset() is intentionally unchanged (returns
    LRLO), so the reclaim path is not gated by the reconciler's hint.

Net result: the reconciler steers the steady-state local footprint, and the
space manager retains the ability to reclaim further on demand. The retention
logic (how much to keep) is the same as in classic tiered storage; only the
executor differs.

Component interactions

                  topic config                disk pressure
                       │                            │
                       ▼                            ▼
              ┌─────────────────┐         ┌────────────────────┐
              │   reconciler    │         │   space_manager    │
              │ (evaluates per  │         │ (resource_mgmt/    │
              │  source / tick) │         │   storage.cc)      │
              └────────┬────────┘         └─────────┬──────────┘
                       │                            │
   set_allowed_local_  │                            │ set cloud_gc offset
   start_offset_cmd    │                            │ (on the log object)
                       ▼                            ▼
              ┌─────────────────┐         ┌────────────────────┐
              │     ctp_stm     │◄────────│  log (cloud_gc)    │
              │  (hint cached   │         └────────────────────┘
              │   in state)     │
              └────────┬────────┘
                       │ truncate target =
                       │   min(LRLO,
                       │       max(log_offset(hint),
                       │           cloud_gc))
                       ▼
            prefix_truncate_below_lro
                       │
                       ▼
              ┌─────────────────┐
              │ local segments  │
              │   on disk       │
              └─────────────────┘

  Contrast — classic tiered storage:

           space_manager ──cloud_gc──► disk_log_impl::housekeeping ──► evict
                                       (also evaluates retention.local.target.*
                                        and clears cloud_gc after GC)

Summary

                          Classic tiered                               tiered_cloud
Retention executor        disk_log_impl::housekeeping                  ctp_stm::prefix_truncate_below_lro
Who computes retention?   housekeeping itself                          reconciler (publishes hint via STM cmd)
Space-manager mechanism   cloud_gc on log                              cloud_gc on log (same)
Who clears cloud_gc?      housekeeping                                 ctp_stm
Retention logic           retention.local.target.{ms,bytes} → offset   identical

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Features

  • Local storage retention in tiered_cloud storage mode now matches classic tiered storage

Lazin added 14 commits May 20, 2026 05:36
Optional kafka::offset cached on ctp_stm_state, produced later by the
reconciler. Defaults to nullopt; bumps serde version to 1 while keeping
compat_version=0 so older snapshots still load.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

New replicated command carrying an optional kafka::offset. nullopt clears
the hint; Some(offset) sets it. Single command type covers advance,
lower, set, and clear. Apply path is a stub returning no-op; the real
implementation lands in a follow-up change.

Dispatch the new command in do_apply, store the value in state, and
signal the prefix-truncate loop so it can pick up the new target on
the next iteration.

Introduce prefix_truncate_target() that returns LRLO when the hint is
unset, or the log offset of the hint when set (clamped to LRLO).
max_removable_local_log_offset() continues to return LRLO so the storage
layer's reclaim path is unaffected.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Public entry point used by the reconciler to replicate the new command.
Idempotently no-ops when the cached value already matches.

Bytes-since-last-eval counter, last-eval timestamp, last-published value.
Used by the upcoming evaluate_local_retention path.

Compute the allowed_local_start_offset target from storage.mode,
cleanup.policy, and storage::log::retention_offset against effective
local-retention targets. Publish via ctp_stm_api when changed.
Idempotent when the value matches last-published.

The source class gains two virtuals (compute_local_retention_target,
publish_local_retention_target) so partition/ctp_stm-specific logic
lives in l0_source while the reconciler orchestrates decision and
bookkeeping. fake_source implements the virtuals as test hooks.

Bytes (>= segment_size_bytes), time (60s), and config-shape mismatch
triggers. Wired into per-source post-reconcile path and the idle tick.

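
A rough sketch of how the three triggers above could combine into a single "evaluation due" predicate. The struct, its field names, and the function name are hypothetical, and "config-shape mismatch" is interpreted here as a single boolean supplied by the caller; the PR's local_retention_eval_due may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical per-source bookkeeping; the field names are illustrative.
struct eval_bookkeeping {
    std::size_t bytes_since_last_eval{0};
    std::chrono::steady_clock::time_point last_eval{};
    // Assumption: false when the cached hint no longer fits the current
    // storage.mode / cleanup.policy combination.
    bool config_shape_matches{true};
};

inline bool eval_due_sketch(
  const eval_bookkeeping& s,
  std::size_t segment_size_bytes,
  std::chrono::steady_clock::time_point now,
  std::chrono::seconds interval = std::chrono::seconds{60}) {
    if (!s.config_shape_matches) {
        return true; // config-shape trigger
    }
    if (s.bytes_since_last_eval >= segment_size_bytes) {
        return true; // bytes trigger
    }
    return now - s.last_eval >= interval; // time trigger
}
```

Any one trigger is sufficient, so a quiet partition is still re-evaluated once per interval while a busy one is re-evaluated roughly once per segment of reconciled data.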
Local footprint converges near retention.local.target.bytes;
flip to cloud and enabling compaction both evict aggressively.
In `tiered_cloud` mode the local log is prefix-truncated only up to
`min(LRO, allowed_local_start_offset)`, so it may still cover offsets
below LRO. Route those reads to the L0 (local) reader instead of L1
when the local log's start offset is at or below the requested start.

In `cloud` mode the local log is prefix-truncated to LRO, so the new
check is a no-op and the existing read-routing behavior is preserved.
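
The routing rule can be sketched as below. This is an assumption-laden illustration: all three arguments are taken to be in the same offset space, reads above LRO are assumed to have gone to L0 already, and route_read and read_path are hypothetical names rather than the frontend's API.

```cpp
#include <cassert>
#include <cstdint>

enum class read_path { l0_local, l1_cloud };

inline read_path route_read(
  std::int64_t requested_start,
  std::int64_t lro,
  std::int64_t local_log_start) {
    if (requested_start > lro) {
        // Assumed pre-existing behavior: unreconciled data is only local.
        return read_path::l0_local;
    }
    if (local_log_start <= requested_start) {
        // New check: the local log still covers the requested offset.
        return read_path::l0_local;
    }
    return read_path::l1_cloud;
}
```

In cloud mode the local log starts past LRO, so the second branch never fires for reads at or below LRO and routing is unchanged.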

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Add storage::log::cloud_gc_offset() getter (implemented on disk_log_impl
and failure_injectable_log) so the ctp_stm's truncate loop can observe
the value the space manager publishes. The actual fix to
ctp_stm::prefix_truncate_target lands in a follow-up so this commit
demonstrates the bug:

* prefix_truncate_target_respects_cloud_gc_above_hint — set hint=20 and
  cloud_gc=60, expect target == cloud_gc. FAILS today (returns hint).
* prefix_truncate_target_clamps_cloud_gc_to_lrlo — cloud_gc above LRLO
  must clamp. Passes today by accident; locks the post-fix behavior.
* prefix_truncate_target_uses_hint_when_cloud_gc_below — cloud_gc below
  hint must not relax truncation. Passes today by accident; locks the
  post-fix behavior.

A small test-only accessor on disk_log_impl bypasses the
is_cloud_retention_active() gate in set_cloud_gc_offset so the cloud
topics fixture (with no ntp_config overrides) can drive the value
directly. Generalizing that gate so the space manager can drive
cloud_topics partitions in production is a separate concern.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

In tiered_cloud mode the space manager publishes a cloud_gc offset per
log when disk pressure builds. disk_log_impl::do_gc consumes the value
for normal tiered storage, but cloud_topics partitions are exempt from
that housekeeping and rely on ctp_stm::prefix_truncate_below_lro
instead. Read cloud_gc through the new storage::log::cloud_gc_offset()
getter and raise the truncate target when it would drive more
aggressive eviction than the local-retention hint, capped at LRLO so we
never truncate unreconciled data.
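
A minimal sketch of the combined target, consistent with the three test cases named in the previous commit message. Plain integers stand in for the real offset types, the hint is assumed to be already translated to a log offset, and truncate_target_sketch is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

using log_offset_t = std::int64_t; // stand-in for the real offset type

inline log_offset_t truncate_target_sketch(
  log_offset_t lrlo,
  std::optional<log_offset_t> hint_log,
  std::optional<log_offset_t> cloud_gc) {
    // Start from the reconciler's hint, or LRLO when there is none.
    log_offset_t target = hint_log ? std::min(lrlo, *hint_log) : lrlo;
    if (cloud_gc) {
        // Disk pressure may only raise the target, never lower it.
        target = std::max(target, *cloud_gc);
    }
    // Never truncate unreconciled data.
    return std::min(target, lrlo);
}
```

With lrlo = 100: hint 20 and cloud_gc 60 give 60; cloud_gc 150 is clamped to 100; hint 20 and cloud_gc 10 give 20.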

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
…ions

The space manager (resource_mgmt/storage.cc) previously filtered out
cloud_topics partitions because they lack a remote_partition handle,
and disk_log_impl::{set_cloud_gc_offset,get_reclaimable_offsets}
explicitly rejected them via the is_cloud_retention_active() gate.
Combined with disk_log_impl::do_gc consuming any pending cloud_gc
unconditionally, the result was that the space manager could not
influence local retention on tiered_cloud partitions at all even
though their local log is exactly what disk pressure should reclaim.

Plumb the path end-to-end:

* Space manager: include partitions where cloud_topic_enabled() is
  true in the eviction schedule.
* disk_log_impl::set_cloud_gc_offset and get_reclaimable_offsets:
  accept cloud_topics partitions alongside tiered storage ones.
* disk_log_impl::do_gc: skip its cloud_gc consumption (and skip the
  reset) when the partition is cloud_topics so the value is left for
  the partition's own truncation loop.
* storage::log: add reset_cloud_gc_offset() so consumers can clear
  the value after acting on it (mirrors do_gc's clear-after-use
  pattern; ctp_stm needs it because it bypasses do_gc).
* ctp_stm::prefix_truncate_below_lro: after a successful
  snapshot_and_truncate_log, clear cloud_gc so the next space-mgmt
  round publishes a fresh decision rather than re-applying a stale
  one.
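
The clear-after-use step can be illustrated with a toy log. Everything here is a stand-in: toy_log and truncate_iteration are hypothetical, and only the method names cloud_gc_offset, reset_cloud_gc_offset, and snapshot_and_truncate_log are borrowed from the description above. Leaving cloud_gc in place on failure is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy stand-in for the log; not the storage::log interface.
struct toy_log {
    std::optional<std::int64_t> cloud_gc;
    std::int64_t truncated_through{-1};
    bool fail_next_truncate{false};

    std::optional<std::int64_t> cloud_gc_offset() const { return cloud_gc; }
    void reset_cloud_gc_offset() { cloud_gc.reset(); }
    bool snapshot_and_truncate_log(std::int64_t target) {
        if (fail_next_truncate) {
            return false;
        }
        truncated_through = target;
        return true;
    }
};

// One iteration of a truncate loop: clear cloud_gc only after success.
inline bool truncate_iteration(toy_log& log, std::int64_t target) {
    if (!log.snapshot_and_truncate_log(target)) {
        return false; // assumed: keep cloud_gc so a retry still sees it
    }
    log.reset_cloud_gc_offset();
    return true;
}
```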

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>

Adds test_space_manager_reclaim_under_pressure to the local-retention
suite. With tight retention_local_target_capacity_bytes and
retention_local_strict enabled, the space manager must shrink local
footprint below the reconciler-published hint, confirming that
max_removable_local_log_offset() is not gated by the hint.

Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Copilot AI review requested due to automatic review settings May 20, 2026 09:43

Copilot AI left a comment


Pull request overview

This PR extends the cloud-topics (“ct”) pipeline to support local retention in tiered_cloud mode by having the reconciler compute/publish an allowed_local_start_offset hint, and by wiring space-management driven GC (cloud_gc_offset) through to cloud-topics prefix truncation. It also adds unit + end-to-end coverage around the new behavior.

Changes:

  • Add reconciler-side evaluation/publishing of allowed_local_start_offset for tiered_cloud + delete topics and associated bookkeeping/triggering.
  • Extend cloud-topics ctp_stm prefix truncation to respect allowed_local_start_offset (hold back) and cloud_gc_offset (evict more aggressively under pressure).
  • Update space-management and storage surfaces to treat cloud-topics partitions as “cloud-backed” for reclaim decisions; add unit/e2e tests.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.

File Description
tests/rptest/tests/tiered_cloud_local_retention_test.py New ducktape e2e coverage validating local footprint behavior + mode/policy flips + disk pressure reclaim.
src/v/storage/log.h Adds cloud_gc_offset() / reset_cloud_gc_offset() virtual surfaces to storage::log.
src/v/storage/disk_log_impl.h Implements new cloud_gc_offset() surfaces; adds is_cloud_backed(); test accessor friendship.
src/v/storage/disk_log_impl.cc Introduces is_cloud_backed() and uses it to include cloud-topics in reclaimable/GC gating.
src/v/resource_mgmt/storage.cc Includes cloud-topics partitions in eviction-policy reclaim collection (tiered storage + cloud topics).
src/v/raft/tests/failure_injectable_log.h Extends test log wrapper interface for new cloud_gc_offset surfaces.
src/v/raft/tests/failure_injectable_log.cc Delegates new cloud_gc_offset methods to underlying log.
src/v/cloud_topics/reconciler/tests/test_utils.h Adds fake-source hooks for local-retention evaluator tests (compute target, publish capture, shape flags, segment size).
src/v/cloud_topics/reconciler/tests/reconciliation_source_test.cc New unit tests for reconciliation_source local-retention bookkeeping defaults/mutators.
src/v/cloud_topics/reconciler/tests/reconciler_test.cc Adds evaluator + “eval due” predicate tests, and ensures evaluator is invoked from reconcile().
src/v/cloud_topics/reconciler/tests/BUILD Wires new reconciliation_source gtest target.
src/v/cloud_topics/reconciler/reconciliation_source.h Extends source interface with local-retention compute/publish + shape/segment-size hooks and evaluator bookkeeping state.
src/v/cloud_topics/reconciler/reconciliation_source.cc Implements local-retention hint compute/publish logic for L0 sources.
src/v/cloud_topics/reconciler/reconciler.h Adds local_retention_eval_due and evaluate_local_retention_hint entry points.
src/v/cloud_topics/reconciler/reconciler.cc Runs per-tick evaluation pass; tracks per-source bytes; implements due predicate + evaluator.
src/v/cloud_topics/reconciler/BUILD Adds deps needed for topic config/properties, storage, tristate.
src/v/cloud_topics/level_zero/stm/types.h Adds new ctp_stm_key::set_allowed_local_start_offset.
src/v/cloud_topics/level_zero/stm/types.cc Adds formatter string for the new STM key.
src/v/cloud_topics/level_zero/stm/tests/ctp_stm_test.cc Adds test-only accessor for cloud_gc_offset + many tests for allowed_local_start_offset and prefix-truncate targeting logic.
src/v/cloud_topics/level_zero/stm/tests/ctp_stm_state_test.cc Adds state tests for allowed_local_start_offset defaults/set/get/serde round-trip.
src/v/cloud_topics/level_zero/stm/tests/BUILD Adds serde + storage deps needed by new tests.
src/v/cloud_topics/level_zero/stm/ctp_stm.h Adds apply handler + prefix_truncate_target() API.
src/v/cloud_topics/level_zero/stm/ctp_stm.cc Uses prefix_truncate_target() in background truncation loop; consumes/clears cloud_gc_offset; applies new STM command; implements targeting logic.
src/v/cloud_topics/level_zero/stm/ctp_stm_state.h Bumps serde version; adds allowed_local_start_offset field + accessors.
src/v/cloud_topics/level_zero/stm/ctp_stm_state.cc Implements allowed_local_start_offset accessors.
src/v/cloud_topics/level_zero/stm/ctp_stm_commands.h Adds set_allowed_local_start_offset_cmd.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.h Adds API to replicate allowed_local_start_offset command.
src/v/cloud_topics/level_zero/stm/ctp_stm_api.cc Implements idempotent replication for allowed_local_start_offset.
src/v/cloud_topics/frontend/frontend.cc Adjusts read path to serve from local log below LRO in tiered_cloud mode when safe (non-compacted).

Comment on lines +542 to +548
// Translate the kafka::offset hint to a log offset. to_log_offset
// may return a sentinel for offsets outside the translator's known
// range (e.g. a stale hint from a previous epoch); fall back to the
// cap in that case rather than feeding garbage into std::min.
auto hint_log = _raft->log()->to_log_offset(kafka::offset_cast(*hint));
if (hint_log != model::offset{} && hint_log != model::offset::min()) {
target = std::min(cap, hint_log);
Comment on lines +57 to +61
const bool has_local_limit
= (!props.retention_local_target_bytes.is_disabled()
&& props.retention_local_target_bytes.has_optional_value())
|| (!props.retention_local_target_ms.is_disabled()
&& props.retention_local_target_ms.has_optional_value());
lg.warn,
"{}: failed to publish allowed_local_start_offset: {}",
src->ntp(),
res.error());
@vbotbuildovich

CI test results

test results on build#84717
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FAIL MasterTestSuite quota_manager_fetch_throttling unit https://buildkite.com/redpanda/redpanda/builds/84717#019e44c5-1c98-4f9e-95db-29571f2c71c2 0/1
FAIL src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test unit https://buildkite.com/redpanda/redpanda/builds/84717#019e44c5-1c98-4f9e-95db-29571f2c71c2 0/1


3 participants