BigQuery offline store loses array data when pushing features with list types #5845

@max36067

Summary

When using store.push() with PushMode.OFFLINE or PushMode.ONLINE_AND_OFFLINE, array/list type columns (e.g., STRING_LIST) are written as empty arrays [] to BigQuery, even though the data is correct in the DataFrame and PyArrow table.

Root Cause

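The BigQuery LoadJobConfig in offline_write_batch() is missing parquet_options.enable_list_inference = True. Without this option, BigQuery's Parquet loader does not apply schema inference to the Parquet LIST logical type that PyArrow writes, so the nested list structure is not mapped onto the array column and the values are lost.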

Related issue: googleapis/google-cloud-python#15705 (comment)

Steps to Reproduce

  from feast import FeatureStore
  from feast.data_source import PushMode
  import pandas as pd
  from datetime import datetime, timezone

  # Assuming feature view with STRING_LIST field is configured
  data = {
      "entity_id": "test_123",
      "tags": ["category_a", "category_b"],  # STRING_LIST type
      "event_time": datetime.now(timezone.utc),
  }

  df = pd.DataFrame([data])
  store = FeatureStore(repo_path=".")
  store.push("my_push_source", df, to=PushMode.ONLINE_AND_OFFLINE)

  # Result in BigQuery: tags = [] (empty array)
  # Expected: tags = ["category_a", "category_b"]

Expected Behavior

Array data should be correctly written to BigQuery with values preserved.

Actual Behavior

Array columns are written as empty arrays [] in BigQuery, while the online store receives correct data.

Proposed Fix

In feast/infra/offline_stores/bigquery.py, update offline_write_batch() (~line 428):

  @staticmethod
  def offline_write_batch(
      config: RepoConfig,
      feature_view: FeatureView,
      table: pyarrow.Table,
      progress: Optional[Callable[[int], Any]],
  ):
      # ... existing code ...

      parquet_options = bigquery.ParquetOptions()
      parquet_options.enable_list_inference = True

      job_config = bigquery.LoadJobConfig(
          source_format=bigquery.SourceFormat.PARQUET,
          schema=arrow_schema_to_bq_schema(pa_schema),
          create_disposition=config.offline_store.table_create_disposition,
          write_disposition="WRITE_APPEND",
          parquet_options=parquet_options,  # Add this line
      )
      # ... rest of code ...
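The change above can also be isolated into a small helper, which makes it easy to check without BigQuery credentials. This is only a sketch: the name `build_parquet_load_config` is hypothetical and not part of Feast, and the `bigquery` argument is assumed to be the imported `google.cloud.bigquery` module (passed in so that a stub can stand in for it in a unit test).

```python
from typing import Any


def build_parquet_load_config(bigquery: Any, **extra_config: Any) -> Any:
    """Build a Parquet LoadJobConfig with list inference enabled.

    `bigquery` is expected to be the `google.cloud.bigquery` module.
    `extra_config` is forwarded to LoadJobConfig (e.g. schema,
    create_disposition), mirroring the arguments in the snippet above.
    """
    # Ask the Parquet loader to interpret the LIST logical type,
    # which is how PyArrow encodes list columns.
    parquet_options = bigquery.ParquetOptions()
    parquet_options.enable_list_inference = True

    return bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition="WRITE_APPEND",
        parquet_options=parquet_options,
        **extra_config,
    )
```

With the real client library this would be called as `build_parquet_load_config(bigquery)` after `from google.cloud import bigquery`, and the returned object passed as `job_config` to the load call.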

Environment

  • Feast version: 0.58.0
  • Python version: 3.12
  • BigQuery client version: (latest)

Additional Context

  • Online store (PostgreSQL) receives array data correctly
  • The PyArrow table contains correct array data before parquet write
  • Parquet file contains correct data when read locally
  • Only BigQuery load loses the array content
  • Using load_table_from_json instead of the Parquet load path works correctly
  • Adding enable_list_inference=True to ParquetOptions fixes the issue
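The local Parquet check mentioned in the list above can be reproduced with a short round trip. This is a minimal sketch, assuming `pyarrow` is installed; the function name `roundtrip_list_column` is hypothetical, and the column names simply mirror the reproduction steps. It also returns the physical Parquet schema, which shows the nested LIST group that the BigQuery loader has to interpret.

```python
import io


def roundtrip_list_column():
    """Write a one-row table with a list<string> column to Parquet and read it back."""
    # Imported lazily so the sketch only needs pyarrow when it is called.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table(
        {
            "entity_id": ["test_123"],
            "tags": [["category_a", "category_b"]],
        }
    )

    # Write to an in-memory buffer instead of a temporary file.
    buffer = io.BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)

    parquet_file = pq.ParquetFile(buffer)
    # Physical schema as text; the "tags" column appears as a nested LIST group.
    physical_schema = str(parquet_file.schema)
    # Values read back from the Parquet data.
    tags = parquet_file.read().column("tags").to_pylist()
    return physical_schema, tags
```

Printing the returned schema next to the returned values shows that the data survives the Parquet write and that the column is stored as a nested group rather than a flat repeated field, which is consistent with the loss happening only at BigQuery load time.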
