Refactoring the transport layer with a consistent, schema-driven RPC interface #1288
Conversation
Related PRs & IssuesEarly planCoordinator & Daemon design
Coordinator memory/availability issue
CLI Refactor |
Just want to say that this looks like an extensive PR. Although I agree that some of the current architecture is not great, it would be good not to attempt a refactoring so big that it also doesn't fix a use-case feature. I think we should do small refactorings when possible; otherwise the risk of introducing bugs, as we can currently see with the CI, is pretty big.
I think tarpc could indeed help the dora daemon, coordinator, and CLI.
I agree that tarpc could be beneficial for the dora daemon, coordinator, and CLI. However, as noted in the PR description, tarpc doesn't support streaming responses. The challenge here is that some interactions, such as log subscription, need a streaming reply rather than a single request-reply exchange.
Oh sorry, I misread; I somehow understood that tarpc could do streaming. But I think that streaming also has its own issues: if the internet connection breaks for whatever reason, it could be very difficult to recover.
This PR focuses on the CLI <=> coordinator RPC/transport refactor (schema-driven RPC, slimmer call sites). The rest (daemon, node, detailed tests & docs) will come in follow-up PRs to keep this one reviewable. @phil-opp Mind taking a look at this part first? Would like to get the foundational part (macro &
In the past, when I used dora start I used to get stdout output, but now I don't anymore:
      result: Ok(node_config),
-   } => Self::init(node_config),
+   } => Self::init(
+       serde_json::from_str(&node_config).context("failed to deserialize node config")?,
Why is DaemonReply::NodeConfig returning a string?
    NextEvents(Vec<Timestamped<NodeEvent>>),
    NextDropEvents(Vec<Timestamped<NodeDropEvent>>),
-   NodeConfig { result: Result<NodeConfig, String> },
+   NodeConfig { result: Result<String, String> },
Why is this necessary?
binaries/cli/src/lib.rs (Outdated)

    mod template;

    pub use command::build;
+   #[allow(deprecated)]
This PR is already a bit big; could we move everything related to clippy into a different PR?
        .to_string()
        .bold()
-       .color(word_to_color(&node_id.to_string()));
+       .color(word_to_color(node_id.as_ref()));
I expect this is related to clippy?
    #[cfg(feature = "postcard")]
    mod postcard;
Do we really need this?
    #[cfg(feature = "postcard")]
    mod postcard;
    #[cfg(feature = "yaml")]
    mod yaml;
Do we really need to serialize and deserialize yaml?
-   impl<T, U> ShmemServer<T, U> {
-       pub unsafe fn new(memory: Shmem) -> eyre::Result<Self> {
+   #[allow(clippy::missing_safety_doc)]
+   impl ShmemChannel {
Does ShmemChannel impl some specific trait?
Thanks a lot for working on this! I agree that the current implementation is messy and error-prone. Given the size of this PR and the number of changes, I have only taken a quick look so far. One concern that I have is the big
Could you clarify why it was necessary to create a custom RPC framework instead of using one of the many existing ones? You mentioned tarpc, but what about e.g. tonic?
@phil-opp thanks! The main reason for a custom (thin) RPC layer is our constraints:
Could you clarify how the custom RPC macro supports streaming? It looks like it also supports only the request/reply pattern. Or am I missing something? Unfortunately, this PR is not reviewable in the current state. It does too many things at once and is way too big. You claim that this is just about the CLI<->Coordinator refactor, but this PR also does completely unrelated things, such as refactoring the shared memory communication (which is not used by the CLI) or adding support for new serialization formats.
You're right: the current implementation only supports request-reply. To keep the scope of the PR manageable, and because streaming is only used in one specific scenario (log subscription), the RPC macro supports only request-reply for now. Streaming is handled separately, just as it was before the refactor.
I understand the concern. We did consider splitting out the CLI-coordinator protocol as a standalone PR, but the current protocol doesn't follow a clean request-reply pattern — some requests expect no response, some share a response type with other requests, and some can return multiple different response types. This ad-hoc design makes it hard to swap in a structured RPC layer without also refactoring how the coordinator handles these messages. Nevertheless, we will try to break it down into smaller PRs. We'll start by isolating CLI-coordinator protocol changes.
Above you said that streaming support was the main reason for not choosing tarpc:
Is there any other reason why you still chose to implement a custom macro?
Let me rephrase. There are several major concerns regarding using existing RPC frameworks:
Given these concerns, we leaned toward writing a custom proc macro for RPC. However, we are fully aware that this now causes difficulties for reviewing. Please let us know your opinion on this.
I took a closer look at tarpc and it looked quite promising, so I went ahead and redesigned the CLI<->coordinator connection to use it. Sorry for creating an alternative implementation, but I think #1345 is the better approach because it:
Limitation:
I would appreciate your review on #1345 if you have time! For future changes, I would recommend discussing the general design with us first before doing the implementation. Your work here is definitely valuable and appreciated, even if it might not be merged.
Motivation & Background
- Unify CLI / coordinator / daemon / node communication with a schema-driven RPC layer instead of hand-written matches and ad hoc message handling.
- Support both shared memory and stream-based (Read/Write, AsyncRead/AsyncWrite, e.g., TcpStream) transports.
- Improve extensibility (adding new messages), error handling, and testability.
- Address the long-standing action item from "Redesign: Create a dora-daemon as a communication broker" #162: "Use RPC framework instead of custom TCP socket for coordinator <-> daemon <-> node communication," which has remained undone.

Client call-sites shrink from hand-written request/parse/match code to a single typed method call. Example:
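A minimal sketch of the before/after shape; the identifiers below (ControlRequest, ControlReply, CoordinatorClient, list_dataflows) are illustrative placeholders, not the exact names used in this PR:

```rust
// Illustrative placeholders only; the real request/reply enums and generated
// client in this PR may be named and shaped differently.
use eyre::Context;
use serde::{Deserialize, Serialize};
use std::io::{Read, Write};
use std::net::TcpStream;

#[derive(Serialize, Deserialize, Debug)]
enum ControlRequest {
    ListDataflows,
}

#[derive(Serialize, Deserialize, Debug)]
enum ControlReply {
    DataflowList { names: Vec<String> },
    Error(String),
}

// Before: every call site frames, encodes, sends, receives, decodes, and
// matches on the full reply enum by hand.
fn list_dataflows_manual(connection: &mut TcpStream) -> eyre::Result<Vec<String>> {
    let request = serde_json::to_vec(&ControlRequest::ListDataflows)?;
    connection.write_all(&(request.len() as u64).to_le_bytes())?;
    connection.write_all(&request)?;

    let mut len_buf = [0u8; 8];
    connection.read_exact(&mut len_buf)?;
    let mut reply_buf = vec![0u8; u64::from_le_bytes(len_buf) as usize];
    connection.read_exact(&mut reply_buf)?;

    match serde_json::from_slice(&reply_buf).context("failed to decode reply")? {
        ControlReply::DataflowList { names } => Ok(names),
        ControlReply::Error(err) => eyre::bail!("coordinator error: {err}"),
    }
}

// After: the schema-generated client exposes one typed method per request and
// hides framing, encoding, and reply dispatch behind it.
struct CoordinatorClient {
    connection: TcpStream,
}

impl CoordinatorClient {
    fn list_dataflows(&mut self) -> eyre::Result<Vec<String>> {
        // Stand-in body so the sketch is self-contained; the generated client
        // would go through the typed transport instead.
        list_dataflows_manual(&mut self.connection)
    }
}

fn list_dataflows_typed(client: &mut CoordinatorClient) -> eyre::Result<Vec<String>> {
    client.list_dataflows()
}
```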
The schema-driven client hides serialization, transport, and reply dispatch, reducing boilerplate and error surface.
Heads-up on conflicts: This refactor touches RPC schemas/transport and coordinator/daemon event-loop plumbing, so it may conflict with other branches in those areas. I will rebase regularly; if your work overlaps, please flag so we can coordinate.
Design decisions (settled)
The transport layer is intentionally minimal and message-agnostic. We expose two traits, one sync and one async, each defined only by send and receive; hitting EOF returns Ok(None) so callers can stop cleanly. Both traits offer with_encoding to lift a byte transport into a typed one, and a framed variant sits above any shared-memory channel or Read/Write/AsyncRead/AsyncWrite stream (e.g., TcpStream) to provide length-delimited frames without coupling to an encoding.

- Transport<Req, Resp> with send(&Req) and receive() -> Option<Resp>
- AsyncTransport<Req, Resp> with async counterparts of the same methods

Encoding is composed via EncodedTransport<T, Encoding, Req, Resp>, which owns a scratch buffer, performs encode/decode, and delegates I/O to the inner transport. Four encodings are feature-gated today: json and yaml for readability (json is the current default; yaml is heavier and expected to be rare), and two binary codecs, postcard and bincode. Postcard is the preferred lean, allocation-light option going forward; bincode remains for compatibility, but its maintenance status is uncertain.

On top of transport and encoding sits a schema-first macro layer. A simple request-response DSL generates both sync and async typed clients. Dispatch glue is already implemented; the open question is the exact server-side handler trait shape. In async contexts, a self-consuming handler looks attractive to avoid borrowing pitfalls, and we will finalize ergonomics before wiring services to it. The design is influenced by google/tarpc, but tarpc lacks stream responses (see google/tarpc#397); for now we surface raw message enums so clients can parse subscribed events manually. Subscription APIs are few, so this trade-off is acceptable until we design a higher-level streaming abstraction.
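To make these shapes concrete, here is a minimal sketch under the description above: the send/receive methods and the Ok(None)-on-EOF convention come from this design, while everything else (trait bounds, the Encoding trait, field layout) is illustrative rather than the exact API in this PR.

```rust
use serde::{de::DeserializeOwned, Serialize};
use std::marker::PhantomData;

/// Sync transport: send a request value, receive an optional response.
/// A clean EOF surfaces as Ok(None) so callers can stop without an error.
pub trait Transport<Req, Resp> {
    fn send(&mut self, request: &Req) -> eyre::Result<()>;
    fn receive(&mut self) -> eyre::Result<Option<Resp>>;
}

/// Async counterpart with the same two operations (native async fn in traits,
/// Rust >= 1.75).
pub trait AsyncTransport<Req, Resp> {
    async fn send(&mut self, request: &Req) -> eyre::Result<()>;
    async fn receive(&mut self) -> eyre::Result<Option<Resp>>;
}

/// Illustrative encoding abstraction; json/yaml/postcard/bincode would each
/// implement it behind their feature gates.
pub trait Encoding {
    fn encode<T: Serialize>(&self, value: &T, buffer: &mut Vec<u8>) -> eyre::Result<()>;
    fn decode<T: DeserializeOwned>(&self, bytes: &[u8]) -> eyre::Result<T>;
}

/// Lifts a framed byte transport (shared memory or a Read/Write stream) into a
/// typed transport by composing an encoding over it.
pub struct EncodedTransport<T, E, Req, Resp> {
    inner: T,
    encoding: E,
    scratch: Vec<u8>, // reused encode buffer
    _marker: PhantomData<(Req, Resp)>,
}

impl<T, E, Req, Resp> Transport<Req, Resp> for EncodedTransport<T, E, Req, Resp>
where
    T: Transport<Vec<u8>, Vec<u8>>, // inner transport moves raw frames
    E: Encoding,
    Req: Serialize,
    Resp: DeserializeOwned,
{
    fn send(&mut self, request: &Req) -> eyre::Result<()> {
        self.scratch.clear();
        self.encoding.encode(request, &mut self.scratch)?;
        self.inner.send(&self.scratch)
    }

    fn receive(&mut self) -> eyre::Result<Option<Resp>> {
        match self.inner.receive()? {
            Some(frame) => Ok(Some(self.encoding.decode(&frame)?)),
            None => Ok(None), // EOF propagates as a clean stop
        }
    }
}
```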
Finally, we keep parallel clients for runtime safety: the node stays on the sync client because it is embedded via pyo3 and should not spin an internal Tokio runtime (risking nested-runtime panics in rust nodes), while CLI and other Rust async callers can use the async client against the same schemas and encodings.
Current progress
- Read/Write/AsyncRead/AsyncWrite support

Coordinator & Daemon refactor
Both coordinator and daemon suffer from giant, single-function match dispatch that makes control flow and state hard to reason about. The coordinator threads state through local variables and borrow-checked lifetimes; the daemon does have a state struct, but it is still driven by one large match, so the cognitive load is similar.

Plan for both:
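The plan itself is not spelled out here; purely as a hedged illustration of the direction described (the match becomes a thin dispatcher delegating to small, testable methods on the state struct), with made-up event variants and method names that are not taken from this PR:

```rust
// Hedged illustration only: DaemonEvent, DaemonState, and the handler methods
// are hypothetical names, not the PR's actual plan or identifiers.

enum DaemonEvent {
    NodeReady { node_id: String },
    OutputData { node_id: String, bytes: Vec<u8> },
    Stop,
}

struct DaemonState {
    running_nodes: Vec<String>,
}

impl DaemonState {
    // Before: one event loop with a single giant match, where every arm
    // mutates shared state inline.
    fn run_monolithic(&mut self, events: impl Iterator<Item = DaemonEvent>) {
        for event in events {
            match event {
                DaemonEvent::NodeReady { node_id } => {
                    self.running_nodes.push(node_id);
                    // ...more inline logic...
                }
                DaemonEvent::OutputData { node_id, bytes } => {
                    let _ = (node_id, bytes);
                    // ...more inline logic...
                }
                DaemonEvent::Stop => break,
            }
        }
    }

    // After: the match only dispatches; each arm delegates to a small method
    // on the state struct that can be tested in isolation.
    fn run_dispatching(&mut self, events: impl Iterator<Item = DaemonEvent>) {
        for event in events {
            match event {
                DaemonEvent::NodeReady { node_id } => self.handle_node_ready(node_id),
                DaemonEvent::OutputData { node_id, bytes } => self.handle_output(node_id, bytes),
                DaemonEvent::Stop => break,
            }
        }
    }

    fn handle_node_ready(&mut self, node_id: String) {
        self.running_nodes.push(node_id);
    }

    fn handle_output(&mut self, _node_id: String, _bytes: Vec<u8>) {
        // process the output...
    }
}
```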