[sim] Optionally enable health monitor#9628
[sim] Optionally enable health monitor#9628karencfv wants to merge 11 commits intooxidecomputer:mainfrom
Conversation
|
Cool. Does this cause the simulated sled agent to look at the actual SMF state wherever it's running? Wouldn't it be more useful to allow the reported state to be customized directly? |
Yes, but that was the use case I've been having 😄.
That would be really useful too. I wonder if there is a possibility to have one or the other. Do you think it' would be relatively straightforward to do that? |
|
@davepacheco, I've changed the approach. It's now possible to inject fake data via a config file. Let me know what you think! I updated the PR's description to show the new way this would work |
|
Heya @davepacheco! Just a tiny ping to see if I could get some eyes on this? Having this feature available will make the ongoing health monitor work a lot easier! |
| [dependencies] | ||
| anyhow.workspace = true | ||
| camino.workspace = true | ||
| chrono.workspace = true |
There was a problem hiding this comment.
I think maybe most of the stuff added here isn't used?
There was a problem hiding this comment.
Ah! yeah, I think this was for a previous iteration of this work, I'll remove
| /// | ||
| /// Note: you should probably use the `extra_sled_agents` macro parameter on | ||
| /// `nexus_test` instead! | ||
| #[allow(clippy::too_many_arguments)] |
There was a problem hiding this comment.
What about defining a struct for these arguments instead?
There was a problem hiding this comment.
I went with just adding the function to have too many arguments because of the comment above this line:
/// Note: you should probably use the
extra_sled_agentsmacro parameter on
///nexus_testinstead!
Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?
| Some(&camino::Utf8Path::new("/an/unused/update/directory")), | ||
| omicron_sled_agent::sim::ZpoolConfig::None, | ||
| sled_agent_types::inventory::SledCpuFamily::AmdTurin, | ||
| omicron_sled_agent::sim::ConfigHealthMonitor { |
There was a problem hiding this comment.
This block is repeated quite a lot -- maybe there should be a ConfigHealthMonitor::disabled() that constructs this form?
| /// Returns a `HealthMonitorHandle` that doesn't monitor health and always | ||
| /// reports no problems unless a `ConfigSimHealthMonitor` with simulated | ||
| /// data is passed. | ||
| pub fn spawn_sim( | ||
| sim_health_checks: Option<HealthMonitorInventory>, | ||
| ) -> Self { |
There was a problem hiding this comment.
I would be inclined to keep two separate functions here. Is there some reason to combine them like this?
It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.
There was a problem hiding this comment.
This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?
It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.
The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?
| [dependencies] | ||
| anyhow.workspace = true | ||
| async-trait.workspace = true | ||
| chrono.workspace = true |
| use std::net::SocketAddr; | ||
| use std::net::SocketAddrV6; | ||
|
|
||
| pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!( |
There was a problem hiding this comment.
Does the correctness of this depend on where you run it from? I usually would run cargo run --bin=sled-agent-sim from the top level of Omicron (and I think that's what the written instructions have people do), but this seems to assume it will be run from sled-agent?
There was a problem hiding this comment.
I honestly don't remember why I did this this way, 😅 let me have a look again
| /// Configuration for the simulated health monitor. | ||
| #[derive(Clone, Debug, PartialEq, Deserialize, Serialize)] | ||
| pub struct ConfigHealthMonitor { | ||
| /// Whether the real health monitor is running or not. |
There was a problem hiding this comment.
It looks like this struct allows expressing invalid state (like "enabled" with a non-None sim_health_checks. What about having this be a tagged enum with two variants? FromLiveSystem and FixtureData(HealthMonitorInventory)?
There was a problem hiding this comment.
Ah yes, that would make sense 😄
| ) | ||
| } | ||
|
|
||
| #[allow(clippy::too_many_arguments)] |
|
I'm really sorry for the delay here. This looks like an improvement because the behavior is customizable and hopefully tests that use this will be less flaky than if they were manipulating the local SMF state (and won't require privileges, etc.). What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests: though the client is a little janky: omicron/clients/sled-agent-client/src/lib.rs Lines 348 to 398 in d196e0c (see #8900) Doing it this way is not a blocker here! But it would allow us to write tests that exercise the Nexus behavior here. It might even be less code, since a lot of this PR is plumbing config through. What do you think? |
karencfv
left a comment
There was a problem hiding this comment.
Thanks for taking a look @davepacheco !
What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests
Hmmmm... that sounds really interesting! Let me take a look
| [dependencies] | ||
| anyhow.workspace = true | ||
| camino.workspace = true | ||
| chrono.workspace = true |
There was a problem hiding this comment.
Ah! yeah, I think this was for a previous iteration of this work, I'll remove
| /// | ||
| /// Note: you should probably use the `extra_sled_agents` macro parameter on | ||
| /// `nexus_test` instead! | ||
| #[allow(clippy::too_many_arguments)] |
There was a problem hiding this comment.
I went with just adding the function to have too many arguments because of the comment above this line:
/// Note: you should probably use the
extra_sled_agentsmacro parameter on
///nexus_testinstead!
Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?
| /// Returns a `HealthMonitorHandle` that doesn't monitor health and always | ||
| /// reports no problems unless a `ConfigSimHealthMonitor` with simulated | ||
| /// data is passed. | ||
| pub fn spawn_sim( | ||
| sim_health_checks: Option<HealthMonitorInventory>, | ||
| ) -> Self { |
There was a problem hiding this comment.
This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?
It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.
The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?
| use std::net::SocketAddr; | ||
| use std::net::SocketAddrV6; | ||
|
|
||
| pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!( |
There was a problem hiding this comment.
I honestly don't remember why I did this this way, 😅 let me have a look again
| /// Configuration for the simulated health monitor. | ||
| #[derive(Clone, Debug, PartialEq, Deserialize, Serialize)] | ||
| pub struct ConfigHealthMonitor { | ||
| /// Whether the real health monitor is running or not. |
There was a problem hiding this comment.
Ah yes, that would make sense 😄
Adds the ability to enable the sled agent health monitor on simulated systems. This is and will be very useful for various types of testing.
Disabled:
With fake health monitor results
Enabled
Closes: #9517