Core: Move decision about remote scan planning from Spark to Core #15184
nastra wants to merge 3 commits into apache:main
Conversation
Force-pushed from 1bc1dd7 to a5076de
protected CloseableIterable<ScanTask> doPlanFiles() {
  if (table() instanceof SupportsDistributedScanPlanning
      && !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
    return table().newBatchScan().planFiles();
Actually, I don't think this is going to work like this, because the new scan object here doesn't carry over any of the filter/projection/asOfTime/ref settings from the original scan.
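For illustration, here's a minimal sketch of what delegating to a fresh scan would need to re-apply; the accessors used (filter(), schema(), isCaseSensitive()) are the usual Scan accessors, and whether all of them are available in this context is an assumption:

// Hypothetical: a fresh BatchScan starts from default settings, so the
// configuration of the current scan would have to be copied over explicitly.
BatchScan localScan =
    table()
        .newBatchScan()
        .filter(filter())                  // row filter configured on this scan
        .project(schema())                 // projection configured on this scan
        .caseSensitive(isCaseSensitive());
// ...and similarly for asOfTime/ref settings (useSnapshot()/useRef()) where set.
return localScan.planFiles();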
I agree, it's better to just have a marker to selectively disable distributed planning.
/** Marker interface to indicate whether a Table requires remote scan planning */
public interface RequiresRemoteScanPlanning {}
/** Marker interface to indicate whether a Table supports distributed scan planning */
minor: logically it's no longer a marker interface, in the sense that a Table implementing it is not sufficient; we also need to inspect the API below. How about the doc explains what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then the API below describes what true and false imply? I'm mostly thinking from the POV of how an implementer would read it.
Part of the move away from "RemoteScanPlanning" as the marker is that remote planning is a REST-specific concept, while distributed planning is a general concept. I'd prefer that we don't leak REST concepts throughout the API surface area.
Also, I don't think there's an issue with providing an interface that exposes multiple options. We don't want the interface to necessarily force a behavior (e.g. Requires something) because then you can't support multiple options. Some markers just indicate that a thing can be done, but in this case we want the implementation to be able to make a determination.
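To make that concrete, here's a sketch of how the javadoc could draw the distinction singhpk234 asks for; the wording and the default implementation are assumptions, not the exact code from this PR:

/**
 * Mixin for Table implementations to indicate whether an engine may plan this
 * table's scans in a distributed fashion, e.g. by fanning manifest reads out
 * over Spark executors. Distributed planning is an engine-side concept and is
 * distinct from remote planning, where a service such as a REST catalog plans
 * the scan on behalf of the client.
 */
public interface SupportsDistributedScanPlanning {
  /**
   * Returns true if the engine may plan scans for this table in a distributed
   * fashion, or false if planning must stay with the table implementation, for
   * example because a REST catalog plans scans server-side.
   */
  default boolean allowDistributedPlanning() {
    return true;
  }
}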
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
  return new SparkDistributedDataScan(spark, table, readConf);
We would need to additionally check ((SupportsDistributedScanPlanning) table).allowDistributedPlanning() as well; maybe we can restructure this whole if/else logic. We might also need to update the docs for readConf.distributedPlanningEnabled(), since it is no longer a sufficient condition to enforce distributed planning. I'm mostly thinking from the POV of a custom BaseTable. See the sketch below.
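A sketch of the combined check being suggested, assuming tables that don't implement the interface keep the current behavior (the helper's exact placement is an assumption):

// Distributed planning now requires both the Spark read config and the table
// implementation to allow it.
boolean tableAllowsDistributedPlanning =
    !(table instanceof SupportsDistributedScanPlanning distributed)
        || distributed.allowDistributedPlanning();

if (table instanceof BaseTable
    && readConf.distributedPlanningEnabled()
    && tableAllowsDistributedPlanning) {
  return new SparkDistributedDataScan(spark, table, readConf);
}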
steveloughran left a comment
minor Java 17 language comments: these would seem a good place to use guarded switch statements.
(in spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java)
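For reference, the kind of guarded pattern switch being alluded to; this is an illustration rather than code from the PR, and guarded patterns in switch are only final in Java 21 (preview in Java 17):

return switch (table()) {
  case SupportsDistributedScanPlanning t when !t.allowDistributedPlanning() ->
      table().newBatchScan().planFiles();
  default -> planFilesDistributed(); // hypothetical name for the existing distributed path
};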
Proactively added this to 1.11, since this is a public interface change; if we want to do this, it's better to do it in 1.11, before RequiresRemoteScanPlanning is released. Please feel free to remove it if you folks think otherwise.
Force-pushed from 5cf43ea to ca1bce9
Force-pushed from ca1bce9 to 1543911
singhpk234 left a comment
Overall LGTM, thanks @nastra!
public boolean distributedPlanningDisallowed() {
  return table instanceof SupportsDistributedScanPlanning distributed
      && !distributed.allowDistributedPlanning();
}
minor: it might seem contradictory that distributedPlanningDisallowed() is not an exact negation of distributedPlanningEnabled(). How about we name this underlyingTableSupportsDistributedPlanning()?
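A sketch of the suggested rename, assuming tables that don't implement the interface are treated as supporting distributed planning:

public boolean underlyingTableSupportsDistributedPlanning() {
  return !(table instanceof SupportsDistributedScanPlanning distributed)
      || distributed.allowDistributedPlanning();
}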
return table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()
    && (dataPlanningMode() != LOCAL || deletePlanningMode() != LOCAL);
[not in scope of this change] should we update the Spark docs for this? https://iceberg.apache.org/docs/nightly/spark-configuration/#spark-sql-options
Previously we introduced the RequiresRemoteScanPlanning marker interface for Spark to properly detect whether a table requires remote planning and thus skip all of the distributed planning in SparkDistributedDataScan. After talking to a few folks, it's probably better to move this decision out of Spark and into Core, hence I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning, which is then only overridden for RESTTable.
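As a sketch of that override, assuming RESTTable simply opts out through the new interface (the exact class shape here is an assumption):

public class RESTTable extends BaseTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    // A REST catalog can plan scans server-side, so engine-side distributed
    // planning is disabled for REST tables.
    return false;
  }
}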