Core: Move decision about remote scan planning from Spark to Core #15184
nastra wants to merge 3 commits into apache:main
Conversation
Force-pushed from 1bc1dd7 to a5076de
protected CloseableIterable<ScanTask> doPlanFiles() {
  if (table() instanceof SupportsDistributedScanPlanning
      && !((SupportsDistributedScanPlanning) table()).allowDistributedPlanning()) {
    return table().newBatchScan().planFiles();
Actually, I don't think this is going to work like this, because the new scan object here doesn't carry over any of the filter/projection/asOfTime/ref settings from the original scan.
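For illustration, here's a minimal sketch of what delegating to a fresh scan would need to re-apply; the accessors used (filter(), schema(), isCaseSensitive()) are the usual Scan accessors, and whether all of them are available in this context is an assumption:

// Hypothetical: a fresh BatchScan starts from default settings, so the
// configuration of the current scan would have to be copied over explicitly.
BatchScan localScan =
    table()
        .newBatchScan()
        .filter(filter())                  // row filter configured on this scan
        .project(schema())                 // projection configured on this scan
        .caseSensitive(isCaseSensitive());
// ...and similarly for asOfTime/ref settings (useSnapshot()/useRef()) where set.
return localScan.planFiles();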
I agree, it's better to just have a marker to selectively disable distributed planning.
/** Marker interface to indicate whether a Table requires remote scan planning */
public interface RequiresRemoteScanPlanning {}
/** Marker interface to indicate whether a Table supports distributed scan planning */
minor: logically it's no longer a marker interface, in the sense that a Table implementing it is not sufficient; we also need to inspect the API below. How about the doc explains what distributed planning means (I believe it's easy to confuse distributed planning with remote planning), and then the API below describes what true and false imply? I'm mostly thinking from the POV of how an implementer would read it.
Part of the move away from "RemoteScanPlanning" as the marker is that remote planning is a REST-specific concept, while distributed planning is a general concept. I'd prefer that we don't leak REST concepts throughout the API surface area.
Also, I don't think there's an issue with providing an interface that exposes multiple options. We don't want the interface to necessarily force a behavior (e.g. Requires something) because then you can't support multiple options. Some markers just indicate that a thing can be done, but in this case we want the implementation to be able to make a determination.
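To make that concrete, here's a sketch of how the javadoc could draw the distinction singhpk234 asks for; the wording and the default implementation are assumptions, not the exact code from this PR:

/**
 * Mixin for Table implementations to indicate whether an engine may plan this
 * table's scans in a distributed fashion, e.g. by fanning manifest reads out
 * over Spark executors. Distributed planning is an engine-side concept and is
 * distinct from remote planning, where a service such as a REST catalog plans
 * the scan on behalf of the client.
 */
public interface SupportsDistributedScanPlanning {
  /**
   * Returns true if the engine may plan scans for this table in a distributed
   * fashion, or false if planning must stay with the table implementation, for
   * example because a REST catalog plans scans server-side.
   */
  default boolean allowDistributedPlanning() {
    return true;
  }
}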
} else if (table instanceof BaseTable && readConf.distributedPlanningEnabled()) {
  return new SparkDistributedDataScan(spark, table, readConf);
We would need to additionally check ((SupportsDistributedScanPlanning) table).allowDistributedPlanning() as well; maybe we can restructure this whole if/else logic. We might also need to update the docs for readConf.distributedPlanningEnabled(), since it is no longer a sufficient condition to enforce distributed planning. I'm mostly thinking from the POV of a custom BaseTable. See the sketch below.
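A sketch of the combined check being suggested, assuming tables that don't implement the interface keep the current behavior (the helper's exact placement is an assumption):

// Distributed planning now requires both the Spark read config and the table
// implementation to allow it.
boolean tableAllowsDistributedPlanning =
    !(table instanceof SupportsDistributedScanPlanning distributed)
        || distributed.allowDistributedPlanning();

if (table instanceof BaseTable
    && readConf.distributedPlanningEnabled()
    && tableAllowsDistributedPlanning) {
  return new SparkDistributedDataScan(spark, table, readConf);
}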
steveloughran left a comment
minor Java 17 language comments: these would seem a good place to use guarded switch statements.
(in spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java)
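For reference, the kind of guarded pattern switch being alluded to; this is an illustration rather than code from the PR, and guarded patterns in switch are only final in Java 21 (preview in Java 17):

return switch (table()) {
  case SupportsDistributedScanPlanning t when !t.allowDistributedPlanning() ->
      table().newBatchScan().planFiles();
  default -> planFilesDistributed(); // hypothetical name for the existing distributed path
};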
Proactively added this to 1.11, since this is a public interface change; if we want to do this, it's better to do it in 1.11, before RequiresRemoteScanPlanning is released. Please feel free to remove it if you folks think otherwise.
Force-pushed from 5cf43ea to ca1bce9
Force-pushed from ca1bce9 to 1543911
singhpk234 left a comment
Overall LGTM, thanks @nastra!
public boolean distributedPlanningDisallowed() {
  return table instanceof SupportsDistributedScanPlanning distributed
      && !distributed.allowDistributedPlanning();
}
minor: it might seem contradictory that distributedPlanningDisallowed() is not an exact negation of distributedPlanningEnabled(). How about we name this underlyingTableSupportsDistributedPlanning()?
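A sketch of the suggested rename, assuming tables that don't implement the interface are treated as supporting distributed planning:

public boolean underlyingTableSupportsDistributedPlanning() {
  return !(table instanceof SupportsDistributedScanPlanning distributed)
      || distributed.allowDistributedPlanning();
}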
return table instanceof SupportsDistributedScanPlanning distributed
    && distributed.allowDistributedPlanning()
    && (dataPlanningMode() != LOCAL || deletePlanningMode() != LOCAL);
[not in scope of this change] should we update the Spark docs for this? https://iceberg.apache.org/docs/nightly/spark-configuration/#spark-sql-options
Previously we introduced the RequiresRemoteScanPlanning marker interface for Spark to properly detect whether a table requires remote planning and thus skip all of the distributed planning in SparkDistributedDataScan. After talking to a few folks, it's probably better to move this decision out of Spark and into Core, hence I've renamed the marker interface to SupportsDistributedScanPlanning. By default, tables support distributed planning, which is then only overridden for RESTTable.
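As a sketch of that override, assuming RESTTable simply opts out through the new interface (the exact class shape here is an assumption):

public class RESTTable extends BaseTable implements SupportsDistributedScanPlanning {
  @Override
  public boolean allowDistributedPlanning() {
    // A REST catalog can plan scans server-side, so engine-side distributed
    // planning is disabled for REST tables.
    return false;
  }
}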