Nexus background task for distributing attached subnets#9780
Conversation
- Add APIs to the sled agent for attaching and detaching either a single subnet on an instance, or setting / clearing the entire set for an instance. - Add list of attached subnets in the instance-creation request body, and fill that in from Nexus with the (currently-empty) set of attached subnets for the target instnace. - Plumb attachment requests all the way through the sled-agent internals to the new APIs in OPTE. - Add mapping of attached subnets per-instance to the simulated sled agent for testing. - Fixes #9702
- Adds a subnet attach and detach saga in Nexus, modeled after the existing floating IP attachment sagas. - Update the common code for sagas to include passing or removing the attached subnets to Dendrite and / or OPTE. This is to pick up those changes during the existing instance sagas, e.g. instance update. - Fixes #9685
|
Stacked on #9779 |
2867c38 to
8e62a71
Compare
rcgoodfellow
left a comment
There was a problem hiding this comment.
Meta comment for this PR in combo with #9779, should uplink movement be a trigger for this RPW? That can come as a follow later on if so.
| &desired_attachments, | ||
| ); | ||
|
|
||
| // Remove any attachments Dendrite has that we no longer want. |
There was a problem hiding this comment.
If nexus instance 1 is told to add subnet X, while nexus instance 2 is operating on old DB data but new dendrite data, nexus-2 may delete X. Is the theory that subsequent rounds of the reconciler will pick this up and things will become eventually consistent?
There was a problem hiding this comment.
This would presumably be a much larger change, but FWIW, we solve the analogous problem for deploying sled configs from blueprints by putting a generation number on each sled config as a whole, which allows sled-agent to detect (and reject) stale data coming from an out-of-date Nexus. I don't think we ever came up with any reasonable proposals to address this problem from Nexus only; we really want the recipient to be able to correctly handle (ie: ignore) stale requests.
There was a problem hiding this comment.
Yeah, that sounds useful but a big change. The sled-agent could be made to work that way, if we have it maintain the whole-sled set of attachments instead of a per-VMM set. But Dendrite would also need to change to make full use of that.
There was a problem hiding this comment.
I've logged the following for longer term follow ups.
|
Alright, I've fixed up conflicts with |
| use nexus_types::external_api::params; | ||
| use nexus_types::external_api::views; | ||
| use nexus_types::identity::Resource; | ||
| use nexus_types::identity::Resource as _; |
There was a problem hiding this comment.
Hello there. We meet again.
There was a problem hiding this comment.
Hmm, maybe I fucked up the last merge with main before I integrated #9779. This is on main, in any case.
There was a problem hiding this comment.
What's going on with these asserts? Is this a crashworthy situation?
There was a problem hiding this comment.
Ok, GitHub's UI is such trash that I actually can't tell exactly what you're referring to. Your comment appears attached to a line building some tuple, with no assertions in sight. So that's cool.
But if it's this:
omicron/illumos-utils/src/opte/port_manager.rs
Line 1054 in 08c0d9b
then
- that's already on
maintoo, and - I think it's crash-worthy. It means we fucked up the code badly enough that a previous attempt to attach / detach a subnet failed, but we didn't notice. The sled-agent and OPTE disagree about what's attached and I'd be suspicious of any future operations.
omdb.