
Core: Move decision about remote scan planning from Spark to Core #15184

Open

nastra wants to merge 3 commits into apache:main from nastra:requires-remote-planning-detection

Conversation


@nastra (Contributor) commented Jan 30, 2026:

Previously we introduced the RequiresRemoteScanPlanning marker interface so that Spark could properly detect whether a table requires remote scan planning and, if so, skip all of the distributed planning in SparkDistributedDataScan.
After talking to a few folks, it's probably better to move this decision out of Spark and into Core, hence I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning; this is only overridden for RESTTable.
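For illustration, a minimal sketch of what the renamed interface and the REST override could look like. Only the names SupportsDistributedScanPlanning, allowDistributedPlanning(), and RESTTable come from this PR; the javadoc, the default method, and the illustrative class are assumptions:

/** Interface to indicate whether a Table supports distributed scan planning. */
public interface SupportsDistributedScanPlanning {
  /** Returns true if scan planning may be distributed across the cluster. */
  default boolean allowDistributedPlanning() {
    return true; // assumed default: tables support distributed planning
  }
}

// Hypothetical stand-in for the RESTTable override described above: a table
// whose scans are planned remotely by the REST catalog opts out.
class RemotelyPlannedTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    return false; // planning happens on the REST server instead
  }
}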

@nastra force-pushed the requires-remote-planning-detection branch from 1bc1dd7 to a5076de on January 30, 2026 08:45
protected CloseableIterable<ScanTask> doPlanFiles() {
  if (table() instanceof SupportsDistributedScanPlanning
      && !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
    return table().newBatchScan().planFiles();
@nastra (author) commented:

Actually, I don't think this is going to work like this, because the scan object here doesn't carry over any filter/projection/asOfTime/ref settings.
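For illustration, re-applying the caller's settings to a fresh scan would look roughly like the sketch below. The Scan API methods (filter, project, useSnapshot, asOfTime, useRef) exist in Iceberg, but the field names (filterExpression, expectedSchema, snapshotId, asOfTimestamp, branch) are hypothetical stand-ins for whatever the scan actually holds:

// Hypothetical sketch of what a correct local fallback would need to do:
BatchScan scan = table().newBatchScan()
    .filter(filterExpression)   // pushed-down row filter (assumed field)
    .project(expectedSchema);   // column projection (assumed field)
if (snapshotId != null) {
  scan = scan.useSnapshot(snapshotId);  // time travel by snapshot id
} else if (asOfTimestamp != null) {
  scan = scan.asOfTime(asOfTimestamp);  // time travel by timestamp
} else if (branch != null) {
  scan = scan.useRef(branch);           // branch/tag reads
}
return scan.planFiles();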

A contributor replied:

I agree, it's better to just have a marker to selectively disable distributed planning.


- /** Marker interface to indicate whether a Table requires remote scan planning */
- public interface RequiresRemoteScanPlanning {}
+ /** Marker interface to indicate whether a Table supports distributed scan planning */
A contributor commented:

minor: logically it's no longer a marker interface, in the sense that a Table implementing it isn't sufficient on its own; we also need to inspect the API below. How about the doc explains what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then the API below describes what true and false imply? I'm mostly thinking from the POV of how

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownLimit.java is modeled
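For reference, that Spark interface pairs a class-level explanation with per-method semantics; from memory it is shaped roughly like this (the linked source is authoritative, and the javadoc here is paraphrased):

// Approximate shape of Spark's SupportsPushDownLimit:
public interface SupportsPushDownLimit extends ScanBuilder {
  /** Pushes down LIMIT to the data source; returns true if it was applied. */
  boolean pushLimit(int limit);

  /** Whether the pushed limit is partial, i.e. Spark must still re-apply it. */
  default boolean isPartiallyPushed() {
    return true;
  }
}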

@danielcweeks commented Feb 2, 2026:

Part of the move away from "RemoteScanPlanning" as the marker is that remote planning is a REST-specific concept, while distributed planning is a general concept. I'd prefer that we don't leak REST concepts throughout the API surface area.

Also, I don't think there's an issue with providing an interface that exposes multiple options. We don't want the interface to necessarily force a behavior (e.g. Requires something) because then you can't support multiple options. Some markers just indicate that a thing can be done, but in this case we want the implementation to be able to make a determination.
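To make the distinction concrete, compare the two shapes (both interface names and the method come from this thread; the comments are editorial):

// Marker style: presence of the interface alone forces one fixed behavior.
public interface RequiresRemoteScanPlanning {}

// Decision style: the implementation makes the determination at runtime,
// leaving room to expose more options later.
public interface SupportsDistributedScanPlanning {
  boolean allowDistributedPlanning();
}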

Comment on lines 767 to 768
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
  return new SparkDistributedDataScan(spark, table, readConf);
A contributor commented:

We would need to additionally check this as well:

((SupportsDistributedScanPlanning) table).allowDistributedPlanning()

Maybe we can restructure this whole if/else logic.

readConf.distributedPlanningEnabled()

We might need to update the docs for this as well, since it is now no longer a sufficient condition to enforce distributed planning? I'm mostly thinking from a custom BaseTable POV.
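A hedged sketch of the restructuring being suggested, assuming the surrounding method keeps its current shape and SparkDistributedDataScan is constructed the same way (branch ordering and the fallback are placeholders, not the actual code):

// Sketch only: combine the user-facing config with the table's own
// capability before choosing distributed planning.
if (table instanceof BaseTable
    && readConf.distributedPlanningEnabled()
    && table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()) {
  return new SparkDistributedDataScan(spark, table, readConf);
}
// otherwise fall through to the local-planning scan (details elided)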

@steveloughran left a comment:

Minor Java 17 language comments: these would seem a good place to use guarded switch statements.
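For illustration, the guarded-switch style would look roughly like this. Note that pattern matching for switch is only a preview feature in Java 17 (JEP 406) and was finalized in Java 21, where the guard keyword is when; doPlanFilesDistributed() is a hypothetical stand-in for the existing distributed-planning path:

protected CloseableIterable<ScanTask> doPlanFiles() {
  return switch (table()) {
    // guarded pattern: only fall back to a plain batch scan when the table
    // explicitly disallows distributed planning
    case SupportsDistributedScanPlanning t when !t.allowDistributedPlanning() ->
        table().newBatchScan().planFiles();
    default -> doPlanFilesDistributed(); // hypothetical helper
  };
}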

@singhpk234 singhpk234 added this to the Iceberg 1.11.0 milestone Feb 2, 2026
@singhpk234 commented:

Proactively added this to 1.11 since this is a public interface change, and if we want to do this, it's better to do it in 1.11, before RequiresRemoteScanPlanning is released. Please feel free to remove it if you folks think otherwise.

@nastra force-pushed the requires-remote-planning-detection branch from 5cf43ea to ca1bce9 on February 3, 2026 10:12
@nastra force-pushed the requires-remote-planning-detection branch from ca1bce9 to 1543911 on February 3, 2026 11:26
@singhpk234 left a comment:

Overall LGTM thanks @nastra !

Comment on lines +295 to +298
public boolean distributedPlanningDisallowed() {
  return table instanceof SupportsDistributedScanPlanning distributed
      && !distributed.allowDistributedPlanning();
}
A contributor commented:

minor: It might be contradictory that distributedPlanningDisallowed() is not an exact negation of distributedPlanningEnabled()?

How about we name this underlyingTableSupportsDistributedPlanning()?

Comment on lines +301 to +303
return table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()
    && (dataPlanningMode() != LOCAL || deletePlanningMode() != LOCAL);
A contributor commented:

[not in scope of this change] Should we update the Spark docs for this as well? https://iceberg.apache.org/docs/nightly/spark-configuration/#spark-sql-options
