Skip to content

[sim] Optionally enable health monitor#9628

Open
karencfv wants to merge 11 commits intooxidecomputer:mainfrom
karencfv:sim-enable-health-monitor
Open

[sim] Optionally enable health monitor#9628
karencfv wants to merge 11 commits intooxidecomputer:mainfrom
karencfv:sim-enable-health-monitor

Conversation

@karencfv
Copy link
Contributor

@karencfv karencfv commented Jan 13, 2026

Adds the ability to enable the sled agent health monitor on simulated systems. This is and will be very useful for various types of testing.

Disabled:

# Configuration toml file
enabled = false
$ cargo xtask omicron-dev run-all
<...>
omicron-dev: sled agent API:         http://[::1]:56577
<...>
$ curl -H "api-version: 14.0.0"  http://[::1]:56577/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21274  100 21274    0     0  7654k      0 --:--:-- --:--:-- --:--:-- 10.1M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [],
      "errors": [],
      "time_of_status": null
    }
  }
}

With fake health monitor results

# Configuration toml file

enabled = false

[sim_health_checks.smf_services_in_maintenance.ok]
services = [
    { fmri = "svc:/system/fake-service-1:default", zone = "oxz_fake_zone_1" },
    { fmri = "svc:/network/fake-service-2:default", zone = "oxz_fake_zone_2" },
    { fmri = "svc:/application/fake-service-3:default", zone = "global" }
]

errors = []

time_of_status = "2026-04-12T23:20:50.52Z"
$ cargo xtask omicron-dev run-all --health-monitor-config sled-agent/tests/configs/health_monitor_sim_unhealthy.toml
<...>
omicron-dev: sled agent API:         http://[::1]:64707
<...>
$ curl -H "api-version: 14.0.0"  http://[::1]:64707/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21505  100 21505    0     0  8932k      0 --:--:-- --:--:-- --:--:-- 10.2M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [
        {
          "fmri": "svc:/system/fake-service-1:default",
          "zone": "oxz_fake_zone_1"
        },
        {
          "fmri": "svc:/network/fake-service-2:default",
          "zone": "oxz_fake_zone_2"
        },
        {
          "fmri": "svc:/application/fake-service-3:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-04-12T23:20:50.520Z"
    }
  }
}

Enabled

# Configuration toml file
enabled = true
$ cargo xtask omicron-dev run-all --health-monitor-config sled-agent/tests/configs/health_monitor_sim_enabled.toml
<...>
omicron-dev: sled agent API:         http://[::1]:59351
<...>
$ curl -H "api-version: 14.0.0"  http://[::1]:59351/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21418  100 21418    0     0  8900k      0 --:--:-- --:--:-- --:--:-- 10.2M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [
        {
          "fmri": "svc:/site/fake-service2:default",
          "zone": "global"
        },
        {
          "fmri": "svc:/site/fake-service:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-01-22T06:41:03.279150883Z"
    }
  }
}

Closes: #9517

@davepacheco
Copy link
Collaborator

Cool. Does this cause the simulated sled agent to look at the actual SMF state wherever it's running? Wouldn't it be more useful to allow the reported state to be customized directly?

@karencfv
Copy link
Contributor Author

Does this cause the simulated sled agent to look at the actual SMF state wherever it's running?

Yes, but that was the use case I've been having 😄.

Wouldn't it be more useful to allow the reported state to be customized directly?

That would be really useful too. I wonder if there is a possibility to have one or the other. Do you think it' would be relatively straightforward to do that?

@karencfv
Copy link
Contributor Author

karencfv commented Jan 22, 2026

@davepacheco, I've changed the approach. It's now possible to inject fake data via a config file. Let me know what you think!

I updated the PR's description to show the new way this would work

@karencfv
Copy link
Contributor Author

karencfv commented Feb 2, 2026

Heya @davepacheco! Just a tiny ping to see if I could get some eyes on this? Having this feature available will make the ongoing health monitor work a lot easier!

[dependencies]
anyhow.workspace = true
camino.workspace = true
chrono.workspace = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe most of the stuff added here isn't used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! yeah, I think this was for a previous iteration of this work, I'll remove

///
/// Note: you should probably use the `extra_sled_agents` macro parameter on
/// `nexus_test` instead!
#[allow(clippy::too_many_arguments)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about defining a struct for these arguments instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with just adding the function to have too many arguments because of the comment above this line:

/// Note: you should probably use the extra_sled_agents macro parameter on
/// nexus_test instead!

Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?

Some(&camino::Utf8Path::new("/an/unused/update/directory")),
omicron_sled_agent::sim::ZpoolConfig::None,
sled_agent_types::inventory::SledCpuFamily::AmdTurin,
omicron_sled_agent::sim::ConfigHealthMonitor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block is repeated quite a lot -- maybe there should be a ConfigHealthMonitor::disabled() that constructs this form?

Comment on lines +52 to +57
/// Returns a `HealthMonitorHandle` that doesn't monitor health and always
/// reports no problems unless a `ConfigSimHealthMonitor` with simulated
/// data is passed.
pub fn spawn_sim(
sim_health_checks: Option<HealthMonitorInventory>,
) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be inclined to keep two separate functions here. Is there some reason to combine them like this?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?

[dependencies]
anyhow.workspace = true
async-trait.workspace = true
chrono.workspace = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

use std::net::SocketAddr;
use std::net::SocketAddrV6;

pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the correctness of this depend on where you run it from? I usually would run cargo run --bin=sled-agent-sim from the top level of Omicron (and I think that's what the written instructions have people do), but this seems to assume it will be run from sled-agent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't remember why I did this this way, 😅 let me have a look again

/// Configuration for the simulated health monitor.
#[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
pub struct ConfigHealthMonitor {
/// Whether the real health monitor is running or not.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this struct allows expressing invalid state (like "enabled" with a non-None sim_health_checks. What about having this be a tagged enum with two variants? FromLiveSystem and FixtureData(HealthMonitorInventory)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that would make sense 😄

)
}

#[allow(clippy::too_many_arguments)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct?

@davepacheco
Copy link
Collaborator

I'm really sorry for the delay here.

This looks like an improvement because the behavior is customizable and hopefully tests that use this will be less flaky than if they were manipulating the local SMF state (and won't require privileges, etc.). What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests:
https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sim/http_entrypoints.rs#L100-L103

though the client is a little janky:

#[async_trait]
impl TestInterfaces for Client {
async fn vmm_single_step(&self, id: PropolisUuid) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{}/vmms/{}/poke-single-step", baseurl, id);
client
.post(url)
.send()
.await
.expect("instance_single_step() failed unexpectedly");
}
async fn vmm_finish_transition(&self, id: PropolisUuid) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{}/vmms/{}/poke", baseurl, id);
client
.post(url)
.send()
.await
.expect("instance_finish_transition() failed unexpectedly");
}
async fn disk_finish_transition(&self, id: Uuid) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{}/disks/{}/poke", baseurl, id);
client
.post(url)
.send()
.await
.expect("disk_finish_transition() failed unexpectedly");
}
async fn vmm_simulate_migration_source(
&self,
id: PropolisUuid,
params: SimulateMigrationSource,
) {
let baseurl = self.baseurl();
let client = self.client();
let url = format!("{baseurl}/vmms/{id}/sim-migration-source");
client
.post(url)
.json(&params)
.send()
.await
.expect("instance_simulate_migration_source() failed unexpectedly");
}
}

(see #8900)

Doing it this way is not a blocker here! But it would allow us to write tests that exercise the Nexus behavior here. It might even be less code, since a lot of this PR is plumbing config through. What do you think?

Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look @davepacheco !

What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests

Hmmmm... that sounds really interesting! Let me take a look

[dependencies]
anyhow.workspace = true
camino.workspace = true
chrono.workspace = true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! yeah, I think this was for a previous iteration of this work, I'll remove

///
/// Note: you should probably use the `extra_sled_agents` macro parameter on
/// `nexus_test` instead!
#[allow(clippy::too_many_arguments)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with just adding the function to have too many arguments because of the comment above this line:

/// Note: you should probably use the extra_sled_agents macro parameter on
/// nexus_test instead!

Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?

Comment on lines +52 to +57
/// Returns a `HealthMonitorHandle` that doesn't monitor health and always
/// reports no problems unless a `ConfigSimHealthMonitor` with simulated
/// data is passed.
pub fn spawn_sim(
sim_health_checks: Option<HealthMonitorInventory>,
) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?

use std::net::SocketAddr;
use std::net::SocketAddrV6;

pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't remember why I did this this way, 😅 let me have a look again

/// Configuration for the simulated health monitor.
#[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
pub struct ConfigHealthMonitor {
/// Whether the real health monitor is running or not.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that would make sense 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Include health monitor information for testing

2 participants