Library that brings distributed execution capabilities to Apache DataFusion.
Note
This is project is not part of Apache DataFusion
This crate is a toolkit that extends Apache DataFusion with distributed capabilities, providing a developer experience as close as possible to vanilla DataFusion while being unopinionated about the networking stack used for hosting the different workers involved in a query.
Users of this library can expect to take their existing single-node DataFusion-based systems and add distributed capabilities with minimal changes.
- Be as close as possible to vanilla DataFusion, providing a seamless integration with existing DataFusion systems and a familiar API for building applications.
- Unopinionated about networking. This crate does not take any opinion about the networking stack, and users are expected to leverage their own infrastructure for hosting DataFusion nodes.
- No coordinator-worker architecture. To keep infrastructure simple, any node can act as a coordinator or a worker.
The benchmarking code is public an open for anyone to easily reproduce. It uses AWS CDK for automating the creation of the benchmarking cluster so that anyone can reproduce the same results in their own AWS account. The code can be found in the benchmarks/cdk directory.
The user and contributor guide can be found here:
https://datafusion-contrib.github.io/datafusion-distributed
There are some runnable examples showcasing how to provide a localhost implementation for Distributed DataFusion in examples/:
- localhost_worker.rs: code that spawns a Worker listening for physical plans over the network.
- localhost_run.rs: code that distributes a query across the spawned Workers and executes it.
The integration tests also provide an idea about how to use the library and what can be achieved with it:
- tpch_validation_test.rs: executes all TPCH queries and performs assertions over the distributed plans.
- custom_config_extension.rs: showcases how to propagate custom DataFusion config extensions.
- custom_extension_codec.rs: showcases how to propagate custom physical extension codecs.
- distributed_aggregation.rs: showcases how to manually place
ArrowFlightReadExecnodes in a plan and build a distributed query out of it.

