Current TaskEstimator API challenges

The `TaskEstimator` API was introduced in:
- https://github.com/datafusion-contrib/datafusion-distributed/pull/259

It allows finer control over task assignment than static config numbers alone. The approach this API follows is:
- Nodes can decide how many tasks are appropriate for them
- The number of tasks appropriate for each individual node is calculated bottom-up, taking into account how many tasks each node decided on
- Upon reaching a network boundary, all the nodes in each stage reach a consensus on the appropriate final task count
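
To make the three steps above concrete, here is a minimal sketch of the idea for a single stage. It is not the real `TaskEstimator` trait: the `Node` struct, both function names, and the "largest request wins" consensus rule are assumptions made purely for illustration.

```rust
/// Hypothetical, simplified stand-in for a plan node inside one stage.
/// This is NOT DataFusion's `ExecutionPlan` nor the real `TaskEstimator`.
struct Node {
    /// Task count this node considers appropriate, if it has an opinion.
    desired_tasks: Option<usize>,
    /// Input nodes that belong to the same stage.
    children: Vec<Node>,
}

/// Walks the stage bottom-up (children first) and collects every opinion.
fn collect_estimates(node: &Node, out: &mut Vec<usize>) {
    for child in &node.children {
        collect_estimates(child, out);
    }
    if let Some(n) = node.desired_tasks {
        out.push(n);
    }
}

/// Called when a network boundary is reached above `stage_root`.
/// Assumed consensus rule: the largest request wins, defaulting to 1 task.
fn stage_task_count(stage_root: &Node) -> usize {
    let mut estimates = Vec::new();
    collect_estimates(stage_root, &mut estimates);
    estimates.into_iter().max().unwrap_or(1)
}
```

With two leaves asking for 2 and 4 tasks under a parent that has no opinion, `stage_task_count` returns 4 under this assumed rule. Note that every estimate here comes from the nodes themselves, which is where the limitation described next comes from.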

The current API has a big limitation:

The current `TaskEstimator` API lets nodes decide how many tasks are appropriate for them based on how much data they are going to return, but it does not take into account the amount of compute that will happen during the query.

For example:

### Plan that currently gets distributed but it shouldn't

```
                           ┌───────────────────┐                           
                           │  ProjectionExec   │                           
                           └─────────▲─────────┘                           
                                     │                                     
                           ┌─────────┴─────────┐                           
                           │     UnionExec     │                           
                           └─────▲─▲───▲─▲─────┘                           
         ┌──────────────────┬────┴─┘   └─┴─────┬──────────────────┐        
         │                  │                  │                  │        
┌────────────────┐ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│   DataSource   │ │   DataSource   │ │   DataSource   │ │   DataSource   │
└────────────────┘ └────────────────┘ └────────────────┘ └────────────────┘
```

This plan can pull a lot of data up, but computationally it is very inexpensive, as it's almost a pure IO query. It should not be distributed; running it on a single node is fine.

Another example:

### Plan that currently does not get distributed but it should

```
      ┌───────────────────┐      
      │  ProjectionExec   │      
      └─────────▲─────────┘      
                │                
┌───────────────┴───────────────┐
│ ExtremelyExpensiveComputeExec │
└───────────────▲───────────────┘
                │                
   ┌────────────┴───────────┐    
   │    SmallDataSource     │    
   └────────────────────────┘    
```

The current `TaskEstimator` API might decide that, since not much data flows through `SmallDataSource`, the plan does not need to be distributed. That ignores the fact that `ExtremelyExpensiveComputeExec` might be doing something computationally very heavy that would benefit from being distributed.

---

Right now the number of tasks is determined by looking at the data size alone, but it should take into account the product of "data size" * "compute", so that pure IO plans are not distributed unnecessarily and compute-heavy plans are not left undistributed by mistake.
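
One possible shape for such a heuristic is sketched below. Everything in it is hypothetical: `StageCost`, its fields, `WORK_PER_TASK`, and `suggested_tasks` do not exist in the current API, and the constant is a made-up number that would need tuning.

```rust
/// Hypothetical per-stage summary; neither field exists in the current API.
struct StageCost {
    /// Estimated bytes flowing through the stage.
    data_bytes: u64,
    /// Unitless guess of CPU work per byte: near 0 for pure IO,
    /// large for compute-heavy operators.
    compute_per_byte: f64,
}

/// Amount of work one task is assumed to handle comfortably (made-up value).
const WORK_PER_TASK: f64 = 1e9;

/// Suggests a task count from the product "data size" * "compute",
/// clamped to the range 1..=max_tasks.
fn suggested_tasks(cost: &StageCost, max_tasks: usize) -> usize {
    let work = cost.data_bytes as f64 * cost.compute_per_byte;
    let tasks = (work / WORK_PER_TASK).ceil() as usize;
    // `max_tasks.max(1)` keeps the clamp bounds valid even if 0 is passed.
    tasks.clamp(1, max_tasks.max(1))
}
```

Under these assumptions, the union plan above (say 10 GB of data with a per-byte compute weight of 0.05) stays at a single task, while the small-source plan (say 100 MB with a weight of 200) asks for as many tasks as the cap allows. The hard part, which this sketch does not address, is how each node would report a meaningful compute weight.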


