[sim] Optionally enable health monitor by karencfv · Pull Request #9628 · oxidecomputer/omicron

karencfv · 2026-01-13T03:16:44Z

Adds the ability to enable the sled agent health monitor on simulated systems. This is and will be very useful for various types of testing.

Disabled:

# Configuration toml file
enabled = false

$ cargo xtask omicron-dev run-all
<...>
omicron-dev: sled agent API:         http://[::1]:56577
<...>

$ curl -H "api-version: 14.0.0"  http://[::1]:56577/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21274  100 21274    0     0  7654k      0 --:--:-- --:--:-- --:--:-- 10.1M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [],
      "errors": [],
      "time_of_status": null
    }
  }
}

With fake health monitor results

# Configuration toml file

enabled = false

[sim_health_checks.smf_services_in_maintenance.ok]
services = [
    { fmri = "svc:/system/fake-service-1:default", zone = "oxz_fake_zone_1" },
    { fmri = "svc:/network/fake-service-2:default", zone = "oxz_fake_zone_2" },
    { fmri = "svc:/application/fake-service-3:default", zone = "global" }
]

errors = []

time_of_status = "2026-04-12T23:20:50.52Z"

$ cargo xtask omicron-dev run-all --health-monitor-config sled-agent/tests/configs/health_monitor_sim_unhealthy.toml
<...>
omicron-dev: sled agent API:         http://[::1]:64707
<...>

$ curl -H "api-version: 14.0.0"  http://[::1]:64707/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21505  100 21505    0     0  8932k      0 --:--:-- --:--:-- --:--:-- 10.2M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [
        {
          "fmri": "svc:/system/fake-service-1:default",
          "zone": "oxz_fake_zone_1"
        },
        {
          "fmri": "svc:/network/fake-service-2:default",
          "zone": "oxz_fake_zone_2"
        },
        {
          "fmri": "svc:/application/fake-service-3:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-04-12T23:20:50.520Z"
    }
  }
}

Enabled

# Configuration toml file
enabled = true

$ cargo xtask omicron-dev run-all --health-monitor-config sled-agent/tests/configs/health_monitor_sim_enabled.toml
<...>
omicron-dev: sled agent API:         http://[::1]:59351
<...>

$ curl -H "api-version: 14.0.0"  http://[::1]:59351/inventory | jq .health_monitor
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21418  100 21418    0     0  8900k      0 --:--:-- --:--:-- --:--:-- 10.2M
{
  "smf_services_in_maintenance": {
    "ok": {
      "services": [
        {
          "fmri": "svc:/site/fake-service2:default",
          "zone": "global"
        },
        {
          "fmri": "svc:/site/fake-service:default",
          "zone": "global"
        }
      ],
      "errors": [],
      "time_of_status": "2026-01-22T06:41:03.279150883Z"
    }
  }
}

Closes: #9517

davepacheco · 2026-01-13T20:59:04Z

Cool. Does this cause the simulated sled agent to look at the actual SMF state wherever it's running? Wouldn't it be more useful to allow the reported state to be customized directly?

karencfv · 2026-01-13T21:47:27Z

Does this cause the simulated sled agent to look at the actual SMF state wherever it's running?

Yes, but that was the use case I've been having 😄.

Wouldn't it be more useful to allow the reported state to be customized directly?

That would be really useful too. I wonder if it's possible to support both. Do you think it would be relatively straightforward to do that?

karencfv · 2026-01-22T06:58:12Z

@davepacheco, I've changed the approach. It's now possible to inject fake data via a config file. Let me know what you think!

I updated the PR's description to show the new way this would work

karencfv · 2026-02-02T03:14:19Z

Heya @davepacheco! Just a tiny ping to see if I could get some eyes on this? Having this feature available will make the ongoing health monitor work a lot easier!

davepacheco · 2026-02-03T22:51:01Z

dev-tools/omicron-dev/Cargo.toml

 [dependencies]
 anyhow.workspace = true
 camino.workspace = true
+chrono.workspace = true


I think maybe most of the stuff added here isn't used?

Ah! yeah, I think this was for a previous iteration of this work, I'll remove

davepacheco · 2026-02-03T22:53:37Z

nexus/test-utils/src/starter.rs

 ///
 /// Note: you should probably use the `extra_sled_agents` macro parameter on
 /// `nexus_test` instead!
+#[allow(clippy::too_many_arguments)]


What about defining a struct for these arguments instead?

I went with just allowing the function to have too many arguments because of the comment above this line:

/// Note: you should probably use the extra_sled_agents macro parameter on
/// nexus_test instead!

Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?
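
For comparison, the struct-based alternative raised in this thread could look roughly like the sketch below. It is not code from this PR: `SimSledAgentArgs`, its fields, and `start_sled_agent_sim` are hypothetical names used only to show the shape, and the real function takes different types.

```rust
/// Hypothetical bundle of arguments. The field names and types are
/// assumptions for illustration, not the real starter function's parameters.
#[derive(Debug, Clone)]
pub struct SimSledAgentArgs {
    pub sled_index: u16,
    pub update_directory: Option<String>,
    pub health_monitor_enabled: bool,
}

/// Hypothetical stand-in for the starter function. A single struct parameter
/// replaces the long positional list, so no clippy `allow` is needed.
pub fn start_sled_agent_sim(args: &SimSledAgentArgs) -> String {
    format!(
        "sled {} (update dir: {}, health monitor: {})",
        args.sled_index,
        args.update_directory.as_deref().unwrap_or("none"),
        if args.health_monitor_enabled { "enabled" } else { "disabled" },
    )
}
```

The trade-off is the one already discussed: call sites become self-describing, at the cost of a new type for a function that callers are discouraged from using.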

davepacheco · 2026-02-03T22:54:10Z

nexus/tests/integration_tests/instances.rs

        Some(&camino::Utf8Path::new("/an/unused/update/directory")),
        omicron_sled_agent::sim::ZpoolConfig::None,
        sled_agent_types::inventory::SledCpuFamily::AmdTurin,
+        omicron_sled_agent::sim::ConfigHealthMonitor {


This block is repeated quite a lot -- maybe there should be a ConfigHealthMonitor::disabled() that constructs this form?
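
A minimal sketch of such a constructor, assuming the struct has an `enabled` flag and an optional `sim_health_checks` value (the names follow the TOML keys in the description, but the types here, including the stand-in `HealthMonitorInventory`, are assumptions):

```rust
/// Stand-in for the real `HealthMonitorInventory`; its field is assumed.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct HealthMonitorInventory {
    pub services_in_maintenance: Vec<String>,
}

/// Simplified version of the config struct. Field names follow the TOML keys
/// shown in the PR description; the types are assumptions.
#[derive(Clone, Debug, PartialEq)]
pub struct ConfigHealthMonitor {
    pub enabled: bool,
    pub sim_health_checks: Option<HealthMonitorInventory>,
}

impl ConfigHealthMonitor {
    /// Health monitor off and no simulated data, so call sites can write
    /// `ConfigHealthMonitor::disabled()` instead of repeating the literal.
    pub fn disabled() -> Self {
        Self { enabled: false, sim_health_checks: None }
    }
}
```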

davepacheco · 2026-02-03T22:56:18Z

sled-agent/health-monitor/src/handle.rs

+    /// Returns a `HealthMonitorHandle` that doesn't monitor health and always
+    /// reports no problems unless a `ConfigSimHealthMonitor` with simulated
+    /// data is passed.
+    pub fn spawn_sim(
+        sim_health_checks: Option<HealthMonitorInventory>,
+    ) -> Self {


I would be inclined to keep two separate functions here. Is there some reason to combine them like this?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?
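
To make the two positions concrete, here is a rough sketch of the "two separate functions" shape. The types are simplified stand-ins, `to_inventory` and `handle_for` are hypothetical names, and nothing is actually spawned here.

```rust
/// Stand-in for the real `HealthMonitorInventory`; its field is assumed.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct HealthMonitorInventory {
    pub services_in_maintenance: Vec<String>,
}

/// Simplified handle that only stores what it will report.
pub struct HealthMonitorHandle {
    inventory: HealthMonitorInventory,
}

impl HealthMonitorHandle {
    /// Never monitors anything and always reports no problems.
    pub fn stub() -> Self {
        Self { inventory: HealthMonitorInventory::default() }
    }

    /// Always reports exactly the simulated data it was given (non-Option).
    pub fn spawn_sim(sim_health_checks: HealthMonitorInventory) -> Self {
        Self { inventory: sim_health_checks }
    }

    /// Hypothetical accessor for what the handle would report.
    pub fn to_inventory(&self) -> HealthMonitorInventory {
        self.inventory.clone()
    }
}

/// The caller-side decision that the combined `Option` version hides.
pub fn handle_for(sim: Option<HealthMonitorInventory>) -> HealthMonitorHandle {
    match sim {
        Some(data) => HealthMonitorHandle::spawn_sim(data),
        None => HealthMonitorHandle::stub(),
    }
}
```

With this split the `match` moves to the caller, which is the extra responsibility the reply above is worried about.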

davepacheco · 2026-02-03T22:57:02Z

sled-agent/health-monitor/Cargo.toml

 [dependencies]
 anyhow.workspace = true
 async-trait.workspace = true
+chrono.workspace = true


Is this used?

davepacheco · 2026-02-03T22:57:59Z

sled-agent/src/bin/sled-agent-sim.rs

 use std::net::SocketAddr;
 use std::net::SocketAddrV6;

+pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!(


Does the correctness of this depend on where you run it from? I usually would run cargo run --bin=sled-agent-sim from the top level of Omicron (and I think that's what the written instructions have people do), but this seems to assume it will be run from sled-agent?

I honestly don't remember why I did this this way, 😅 let me have a look again
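
The concern can be illustrated with a small sketch, assuming the default is a relative path (the diff above is truncated, so how the constant is actually built is not shown here). A relative path resolves against the process's working directory, so the same string names different files depending on where the binary is launched. The directories and file name below are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Resolve `config` the way a process started in `cwd` would see it:
/// absolute paths are used as-is, relative ones are joined onto `cwd`.
pub fn resolve_from(cwd: &Path, config: &Path) -> PathBuf {
    if config.is_absolute() {
        config.to_path_buf()
    } else {
        cwd.join(config)
    }
}

/// Hypothetical example: the same relative path seen from two directories.
pub fn demo() -> (PathBuf, PathBuf) {
    let relative = Path::new("tests/configs/example.toml");
    let from_top_level = resolve_from(Path::new("/work/omicron"), relative);
    let from_crate_dir = resolve_from(Path::new("/work/omicron/sled-agent"), relative);
    (from_top_level, from_crate_dir)
}
```

One common way to avoid the dependency on the working directory is to build an absolute path at compile time from `env!("CARGO_MANIFEST_DIR")`, though that ties the binary to the build machine's source tree.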

davepacheco · 2026-02-03T22:59:21Z

sled-agent/src/sim/config.rs

+/// Configuration for the simulated health monitor.
+#[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
+pub struct ConfigHealthMonitor {
+    /// Whether the real health monitor is running or not.


It looks like this struct allows expressing invalid state (like "enabled" with a non-None sim_health_checks). What about having this be a tagged enum with two variants: FromLiveSystem and FixtureData(HealthMonitorInventory)?

Ah yes, that would make sense 😄
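
A minimal sketch of the enum shape, using the two variant names from the suggestion. `HealthMonitorInventory` is a simplified stand-in, the `fixture` helper is hypothetical, and the serde derives and tagging attribute are left out to keep the example dependency-free.

```rust
/// Stand-in for the real `HealthMonitorInventory`; its field is assumed.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct HealthMonitorInventory {
    pub services_in_maintenance: Vec<String>,
}

/// With an enum, "live system plus fixture data" cannot be expressed at all.
#[derive(Clone, Debug, PartialEq)]
pub enum ConfigHealthMonitor {
    /// Run the real health monitor against the host.
    FromLiveSystem,
    /// Report exactly this data and monitor nothing.
    FixtureData(HealthMonitorInventory),
}

impl ConfigHealthMonitor {
    /// Hypothetical helper: the fixture data, if this variant carries any.
    pub fn fixture(&self) -> Option<&HealthMonitorInventory> {
        match self {
            Self::FromLiveSystem => None,
            Self::FixtureData(inventory) => Some(inventory),
        }
    }
}
```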

davepacheco · 2026-02-03T22:59:35Z

sled-agent/src/sim/config.rs

        )
    }

+    #[allow(clippy::too_many_arguments)]


davepacheco · 2026-02-03T23:01:07Z

I'm really sorry for the delay here.

This looks like an improvement because the behavior is customizable and hopefully tests that use this will be less flaky than if they were manipulating the local SMF state (and won't require privileges, etc.). What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests:
https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/sim/http_entrypoints.rs#L100-L103

though the client is a little janky:

omicron/clients/sled-agent-client/src/lib.rs

Lines 348 to 398 in d196e0c

    
    #[async_trait]
    impl TestInterfaces for Client {
        async fn vmm_single_step(&self, id: PropolisUuid) {
            let baseurl = self.baseurl();
            let client = self.client();
            let url = format!("{}/vmms/{}/poke-single-step", baseurl, id);
            client
                .post(url)
                .send()
                .await
                .expect("instance_single_step() failed unexpectedly");
        }

        async fn vmm_finish_transition(&self, id: PropolisUuid) {
            let baseurl = self.baseurl();
            let client = self.client();
            let url = format!("{}/vmms/{}/poke", baseurl, id);
            client
                .post(url)
                .send()
                .await
                .expect("instance_finish_transition() failed unexpectedly");
        }

        async fn disk_finish_transition(&self, id: Uuid) {
            let baseurl = self.baseurl();
            let client = self.client();
            let url = format!("{}/disks/{}/poke", baseurl, id);
            client
                .post(url)
                .send()
                .await
                .expect("disk_finish_transition() failed unexpectedly");
        }

        async fn vmm_simulate_migration_source(
            &self,
            id: PropolisUuid,
            params: SimulateMigrationSource,
        ) {
            let baseurl = self.baseurl();
            let client = self.client();
            let url = format!("{baseurl}/vmms/{id}/sim-migration-source");
            client
                .post(url)
                .json(&params)
                .send()
                .await
                .expect("instance_simulate_migration_source() failed unexpectedly");
        }
    }

(see #8900)

Doing it this way is not a blocker here! But it would allow us to write tests that exercise the Nexus behavior here. It might even be less code, since a lot of this PR is plumbing config through. What do you think?
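
As a rough illustration of the dynamic approach, the simulated sled agent could hold the reported health data in shared state that a test-only endpoint overwrites at runtime. The sketch below shows only that state holder. `SimHealthState` and its methods are hypothetical, `HealthMonitorInventory` is a simplified stand-in, and the HTTP plumbing is omitted.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the real `HealthMonitorInventory`; its field is assumed.
#[derive(Clone, Debug, PartialEq, Default)]
pub struct HealthMonitorInventory {
    pub services_in_maintenance: Vec<String>,
}

/// Hypothetical shared state. Clones point at the same underlying data.
#[derive(Clone, Default)]
pub struct SimHealthState {
    inner: Arc<Mutex<HealthMonitorInventory>>,
}

impl SimHealthState {
    /// What a hypothetical test-only endpoint handler would call.
    pub fn set(&self, inventory: HealthMonitorInventory) {
        *self.inner.lock().unwrap() = inventory;
    }

    /// What the inventory endpoint would read when building its response.
    pub fn get(&self) -> HealthMonitorInventory {
        self.inner.lock().unwrap().clone()
    }
}
```

A test could then flip a sled from healthy to unhealthy mid-run and observe how Nexus reacts, without restarting the simulated sled agent with a different config file.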

karencfv

Thanks for taking a look @davepacheco !

What I had been wondering about though was about adding a sim sled agent API to control this dynamically. The simulated sled agent has a few endpoints for things that need to be manipulated by tests

Hmmmm... that sounds really interesting! Let me take a look

karencfv · 2026-02-04T02:43:32Z

dev-tools/omicron-dev/Cargo.toml

 [dependencies]
 anyhow.workspace = true
 camino.workspace = true
+chrono.workspace = true


Ah! yeah, I think this was for a previous iteration of this work, I'll remove

karencfv · 2026-02-04T02:46:06Z

nexus/test-utils/src/starter.rs

 ///
 /// Note: you should probably use the `extra_sled_agents` macro parameter on
 /// `nexus_test` instead!
+#[allow(clippy::too_many_arguments)]


I went with just allowing the function to have too many arguments because of the comment above this line:

/// Note: you should probably use the extra_sled_agents macro parameter on
/// nexus_test instead!

Since this is somewhat of a semi-deprecated function, I didn't want to create a struct for it, WDYT?

karencfv · 2026-02-04T02:52:04Z

sled-agent/health-monitor/src/handle.rs

+    /// Returns a `HealthMonitorHandle` that doesn't monitor health and always
+    /// reports no problems unless a `ConfigSimHealthMonitor` with simulated
+    /// data is passed.
+    pub fn spawn_sim(
+        sim_health_checks: Option<HealthMonitorInventory>,
+    ) -> Self {


This function (and previously stub()) is only ever used for the simulated omicron. Otherwise, there really is no need for the "stub()" part of this function. I'd be passing on the responsibility to decide whether to fake a config or not to the caller and that felt trickier to handle in the long run?

It seems clearer to keep stub() as it was and have a separate spawn_sim that accepts a non-Option and always reports the simulated data.

The way I see it stub() also reports simulated data, it reports that all health checks returned healthy no?

karencfv · 2026-02-04T02:53:34Z

sled-agent/src/bin/sled-agent-sim.rs

 use std::net::SocketAddr;
 use std::net::SocketAddrV6;

+pub const DEFAULT_HEALTH_MONITOR_CONFIG: &str = concat!(


I honestly don't remember why I did this this way, 😅 let me have a look again

karencfv · 2026-02-04T02:54:26Z

sled-agent/src/sim/config.rs

+/// Configuration for the simulated health monitor.
+#[derive(Clone, Debug, PartialEq, Deserialize, Serialize)]
+pub struct ConfigHealthMonitor {
+    /// Whether the real health monitor is running or not.


Ah yes, that would make sense 😄

karencfv added 3 commits January 13, 2026 15:45

[sim] Optionally enable health monitor

667bec2

clippy

893b850

clean up

b40ee44

karencfv requested review from davepacheco and paudmir January 13, 2026 03:16

karencfv added 7 commits January 21, 2026 19:43

Merge branch 'main' into sim-enable-health-monitor

3b892ed

get this working with fake data

bcdc133

plumb the config through

003d246

use the config from the CLI

f4772d5

enable health monitor for sled-agent sim

ff8553c

clean up

a93af40

clean up

aa49383

make linter happy

53bfd84

davepacheco reviewed Feb 3, 2026

karencfv commented Feb 4, 2026
