Description
Steps to reproduce the behavior (Required)
- Create an Iceberg table in StarRocks connected to a REST catalog (Polaris):
CREATE TABLE dwd_rest.test_migration_l0.test_events_local (
partition_date DATE,
partition_hour INT,
id BIGINT,
revenue BIGINT
)
PARTITION BY (partition_date, partition_hour);
- Load initial data for January (dates >= 15, all hours):
INSERT INTO dwd_rest.test_migration_l0.test_events_local
SELECT * FROM FILES(...);
- Create a materialized view on the Iceberg table:
CREATE MATERIALIZED VIEW default_catalog.native_test.iceberg_test_mv
COMMENT "MATERIALIZED_VIEW polaris rest iceberg"
PARTITION BY (partition_date)
DISTRIBUTED BY RANDOM
REFRESH ASYNC EVERY(INTERVAL 5 MINUTE)
PROPERTIES (
"replicated_storage" = "true",
"replication_num" = "1",
"datacache.enable" = "true",
"enable_async_write_back" = "false",
"storage_volume" = "builtin_storage_volume",
"warehouse" = "default_warehouse"
)
AS
SELECT
partition_date,
partition_hour,
id AS store_id,
COUNT(*) AS event_count,
SUM(revenue) AS total_revenue_micros,
SUM(revenue) / 1000000.0 AS total_revenue_dollars
FROM dwd_rest.test_migration_l0.test_events_local
GROUP BY partition_date, partition_hour, id;
- Verify that the initial refresh completes for all partitions using:
SELECT CREATE_TIME, STATE,
JSON_QUERY(EXTRA_MESSAGE, '$.mvPartitionsToRefresh') AS partitions_refreshed
FROM information_schema.task_runs
WHERE TASK_NAME LIKE 'mv-12345678'
AND CREATE_TIME > NOW() - INTERVAL 1 DAY
ORDER BY CREATE_TIME DESC
LIMIT 1000;
- Load incremental data for February 1-4 (one day at a time) and verify each refresh only updates the new partition.
- Verify partition metadata before snapshot expiration:
SELECT partition_value, last_updated_at
FROM dwd_rest.test_migration_l0.test_events_local$partitions;
All partitions show valid last_updated_at timestamps.
- Expire the first snapshot (containing January data) using pyiceberg:
from pyiceberg.catalog import load_catalog
catalog = load_catalog('polaris')
tbl = catalog.load_table("test_migration_l0.test_events_local")
print("expiring snapshot 1234")
tbl.maintenance.expire_snapshots().by_ids([1234]).commit()
print("snapshot expired")
- Check partition metadata after snapshot expiration:
SELECT partition_value, last_updated_at
FROM dwd_rest.test_migration_l0.test_events_local$partitions;
The January partitions now have last_updated_at set to NULL, which hits this path in the StarRocks Iceberg code:
starrocks/fe/fe-core/src/main/java/com/starrocks/connector/iceberg/IcebergCatalog.java
Lines 376 to 385 in 10bb268
        } catch (Exception e) {
            logger.error("Failed to get last_updated_at for partition [{}] of table [{}] " +
                    "under snapshot [{}]", partitionName, nativeTable.name(), snapshotId, e);
        }
    }
    if (lastUpdated == -1) {
        // Fallback to current snapshot's timestamp if last_updated_at is null due to snapshot expiration.
        lastUpdated = getTableLastestSnapshotTime(icebergTable, logger);
        logger.warn("The table [{}] last_updated_at is null (snapshot [{}] may have been expired), " +
                "using current snapshot timestamp: {}", nativeTable.name(), snapshotId, lastUpdated);
- Load new data (February 5th) and observe the next materialized view refresh.
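To make the before/after comparison of the $partitions output less manual, a small helper along these lines could be used. It is only a sketch: the function name, the dictionary shape (partition value mapped to last_updated_at, with None standing in for SQL NULL), and the sample values are all hypothetical and not taken from the actual table.

```python
from typing import Dict, List, Optional

# Hypothetical shape: partition_value -> last_updated_at (None models SQL NULL).
PartitionTimes = Dict[str, Optional[int]]


def lost_timestamps(before: PartitionTimes, after: PartitionTimes) -> List[str]:
    """Return partitions that had a timestamp before but report NULL after."""
    return sorted(
        name
        for name, ts in before.items()
        if ts is not None and name in after and after[name] is None
    )


# Illustrative values only; real rows would come from the two $partitions queries.
before = {"jan-15": 1000, "feb-01": 2000}
after = {"jan-15": None, "feb-01": 2000}
print(lost_timestamps(before, after))
```

Feeding it the rows captured before and after the expire_snapshots call should list exactly the partitions whose last_updated_at was lost.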
Expected behavior (Required)
Partition change tracking for Iceberg should not lose track of partitions whose snapshot was expired.
After expiring the first snapshot, the materialized view should continue to perform incremental refreshes. When loading February 5th data, only the new partition should be refreshed.
The last_updated_at timestamp in the $partitions metadata table should remain intact for all partitions after snapshot expiration.
Real behavior (Required)
After expiring the first snapshot:
- All January partitions (from the expired snapshot) lose their last_updated_at timestamp in the $partitions metadata table
- The next materialized view refresh refreshes ALL January partitions plus the new partition (February 5th)
- All subsequent refreshes continue to unnecessarily refresh all January partitions
This results in significantly degraded performance as the materialized view performs full partition refreshes instead of incremental refreshes.
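The repeated refreshes can be illustrated with a simplified model. This is not StarRocks's actual refresh logic: the only part taken from the code quoted above is the fallback of a NULL last_updated_at to the latest snapshot timestamp. The comparison rule (refresh a partition when its effective timestamp is newer than the one recorded at the last refresh), the function names, and the sample timestamps are assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def effective_last_updated(last_updated_at: Optional[int], latest_snapshot_ts: int) -> int:
    # Mirrors the quoted fallback: NULL last_updated_at -> latest snapshot timestamp.
    return latest_snapshot_ts if last_updated_at is None else last_updated_at


def partitions_to_refresh(
    base: Dict[str, Optional[int]],
    refreshed_at: Dict[str, int],
    latest_snapshot_ts: int,
) -> Set[str]:
    # Assumed rule: refresh when a partition is new to the MV or looks newer
    # than the timestamp recorded at its last refresh.
    stale = set()
    for name, ts in base.items():
        eff = effective_last_updated(ts, latest_snapshot_ts)
        if name not in refreshed_at or eff > refreshed_at[name]:
            stale.add(name)
    return stale


# "jan-15" lost its timestamp after snapshot expiration; "feb-05" is newly loaded.
base = {"jan-15": None, "feb-04": 400, "feb-05": 500}
refreshed_at = {"jan-15": 100, "feb-04": 400}
first = partitions_to_refresh(base, refreshed_at, latest_snapshot_ts=500)

# Record the refresh, then load one more day: the fallback moves forward again.
for name in first:
    refreshed_at[name] = effective_last_updated(base[name], 500)
base["feb-06"] = 600
second = partitions_to_refresh(base, refreshed_at, latest_snapshot_ts=600)
print(sorted(first), sorted(second))
```

Under these assumptions, every new commit advances the fallback timestamp, so a partition with a NULL last_updated_at always appears newer than its last recorded refresh, which matches the behavior reported above.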
StarRocks version (Required)
4.0.4-591a874