FEATURE: add paravirtualized clock support for guest time access #1173
simongdavies wants to merge 1 commit into hyperlight-dev:main
Hyperlight guests can now read time without expensive VM exits by using a paravirtualized clock shared between host and guest. This enables high-frequency timing operations like benchmarking, rate limiting, and timestamping with minimal overhead.

Paravirtualized clocks work by having the hypervisor populate a shared memory page with clock calibration data. The guest reads this data along with the CPU's TSC to compute the current time entirely in userspace, avoiding the cost of a VM exit.

Reference: https://docs.kernel.org/virt/kvm/x86/msr.html#pvclock

The implementation uses the native mechanism for each hypervisor:
- KVM: pvclock (MSR 0x4b564d01)
- MSHV: Hyper-V Reference TSC page
- WHP: Hyper-V Reference TSC page

Guests have access to:
- Monotonic time: nanoseconds since sandbox creation, guaranteed to never go backwards
- Wall-clock time: UTC nanoseconds since Unix epoch
- Local time: wall-clock adjusted for host timezone captured at sandbox creation

Rust API (hyperlight_guest_bin::time):
- SystemTime/Instant types mirroring std::time
- DateTime type for human-readable date/time formatting
- Weekday/Month enums with name() and short_name() methods

C API (hyperlight_guest_capi):
- POSIX-compatible: clock_gettime, gettimeofday, time
- Broken-down time: gmtime_r, localtime_r, mktime, timegm
- Formatting: strftime with common format specifiers

The feature is gated behind `guest_time` (enabled by default) and documented in docs/guest-time.md.

Note: The timezone offset is a snapshot from sandbox creation and does not update for DST transitions during the sandbox lifetime.

Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
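For readers unfamiliar with pvclock, the guest-side calculation on KVM can be sketched roughly as below. This follows the formula in the kernel documentation linked above; the function name `pvclock_nanos` and its flat parameter list are illustrative assumptions, not the API added by this PR, and the version/sequence check that a real reader needs is left out.

```rust
/// Illustrative pvclock conversion (not the PR's code): turns a TSC reading
/// into nanoseconds using the calibration values the hypervisor publishes.
pub fn pvclock_nanos(
    tsc: u64,
    tsc_timestamp: u64,
    system_time: u64,
    tsc_to_system_mul: u32,
    tsc_shift: i8,
) -> u64 {
    // TSC ticks elapsed since the hypervisor last updated the page.
    let delta = tsc.wrapping_sub(tsc_timestamp);

    // A positive shift scales the delta up, a negative one scales it down.
    let shift = u32::from(tsc_shift.unsigned_abs());
    let shifted = if tsc_shift >= 0 {
        delta.checked_shl(shift).unwrap_or(0)
    } else {
        delta.checked_shr(shift).unwrap_or(0)
    };

    // The multiplier is a 32.32 fixed-point fraction, so widen to u128
    // for the multiply and drop the low 32 bits afterwards.
    let scaled = ((u128::from(shifted) * u128::from(tsc_to_system_mul)) >> 32) as u64;

    system_time.wrapping_add(scaled)
}
```

With a multiplier of `1 << 31` (0.5 in 32.32 fixed point) and a shift of 1, each TSC tick counts as one nanosecond, which makes the arithmetic easy to check by hand.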
ludfjig left a comment
First round of reviews, looks very good to me! Question: maybe we can split out all the time-related math + formatting into a separate crate?
I haven't looked at everything in detail yet.
    #[cfg(test)]
    mod tests {
        use core::mem::size_of;

        use super::*;

        #[test]
        fn test_kvm_pvclock_size() {
            // KVM pvclock struct must be exactly 32 bytes
            assert_eq!(size_of::<KvmPvclockVcpuTimeInfo>(), 32);
        }

        #[test]
        fn test_hv_reference_tsc_size() {
            // Hyper-V reference TSC page must be exactly 4KB
            assert_eq!(size_of::<HvReferenceTscPage>(), 4096);
        }

        #[test]
        fn test_guest_clock_region_size() {
            // GuestClockRegion should be 32 bytes (4 x u64 equivalent: 3 x u64 + i32 + u32)
            assert_eq!(size_of::<GuestClockRegion>(), 32);
        }

        #[test]
        fn test_clock_type_conversion() {
            assert_eq!(ClockType::from(0u64), ClockType::None);
            assert_eq!(ClockType::from(1u64), ClockType::KvmPvclock);
            assert_eq!(ClockType::from(2u64), ClockType::HyperVReferenceTsc);
            assert_eq!(ClockType::from(99u64), ClockType::None);
        }

        #[test]
        fn test_guest_clock_region_default() {
            let region = GuestClockRegion::default();
            assert!(!region.is_available());
            assert_eq!(region.get_clock_type(), ClockType::None);
        }
    }
I think most of these can be compile-time (const) assertions instead of tests.
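A minimal sketch of what such a compile-time size check could look like. The struct `PvclockLayout` is a hypothetical stand-in whose fields follow the pvclock layout described in the kernel documentation; it is not the PR's `KvmPvclockVcpuTimeInfo`, whose definition is not shown here.

```rust
use core::mem::size_of;

/// Hypothetical stand-in for a pvclock-style structure (not the PR's type).
#[repr(C)]
pub struct PvclockLayout {
    pub version: u32,
    pub pad0: u32,
    pub tsc_timestamp: u64,
    pub system_time: u64,
    pub tsc_to_system_mul: u32,
    pub tsc_shift: i8,
    pub flags: u8,
    pub pad: [u8; 2],
}

// Evaluated by the compiler: if the size ever drifts from 32 bytes the
// crate fails to build, so no test run is needed to catch the regression.
const _: () = assert!(size_of::<PvclockLayout>() == 32);
```

The behavioural checks in the quoted module (for example the `ClockType::from` conversions) would still need ordinary tests unless the conversion is made a `const fn`.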
    #[inline]
    fn rdtsc() -> u64 {
        #[cfg(target_arch = "x86_64")]
        {
            let lo: u32;
            let hi: u32;
            // SAFETY: RDTSC is always available on x86_64
            unsafe {
                core::arch::asm!(
                    "rdtsc",
                    out("eax") lo,
                    out("edx") hi,
                    options(nostack, nomem, preserves_flags)
                );
            }
            ((hi as u64) << 32) | (lo as u64)
        }
        #[cfg(not(target_arch = "x86_64"))]
        {
            0 // TSC not available on non-x86_64 architectures
        }
    }
I think you can replace this with https://doc.rust-lang.org/core/arch/x86/fn._rdtsc.html
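A rough sketch of that replacement, using the `_rdtsc` intrinsic from `core::arch::x86_64`. Returning `Option<u64>` instead of 0 on other architectures is an assumption made for this illustration, not something the PR does; the name `read_tsc` is likewise hypothetical.

```rust
/// Reads the time-stamp counter through the intrinsic instead of inline asm.
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn read_tsc() -> Option<u64> {
    // SAFETY: RDTSC is part of the x86_64 baseline; this assumes the
    // environment does not trap the instruction.
    Some(unsafe { core::arch::x86_64::_rdtsc() })
}

/// On other architectures there is no TSC, so report that explicitly
/// rather than returning a value that looks like a valid reading.
#[cfg(not(target_arch = "x86_64"))]
#[inline]
pub fn read_tsc() -> Option<u64> {
    None
}
```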
        Err(crate::new_error!(
            "Paravirtualized clock setup not implemented for this hypervisor",
        ))
    }
nit: should this default implementation be removed?
    /// Returns true if a clock is configured.
    pub fn is_available(&self) -> bool {
        self.clock_page_ptr != 0 && self.clock_type != ClockType::None as u64
It's my understanding that this is always available, so do we really need is_available and is_clock_available? Correct me if I am wrong.
Do we definitely want to enable this by default? I think a lot of guests will not need high precision time in production, and providing it by default is a significant semantic constraint: e.g. it makes snapshotting observable, makes timing side-channel attacks easier to internalise, etc. When we first enabled rdtsc for the performance traces, I believe we explicitly had a discussion across all the maintainers about this and said we very much did not want to enable any extra time sources (and especially not wall-clock/referenced to an epoch ones) anywhere.
    let tsc_scale = tsc_page.tsc_scale;
    let tsc_offset = tsc_page.tsc_offset;

    compiler_fence(Ordering::Acquire);
Did you mean Ordering::Release here? There's also a similar pattern in the KVM handler.
A compiler barrier is not enough though; you need an rmb() here and before reading the data on L147. These pair with the hypervisor's wmb() when it updates the page. Admittedly it is going to be a no-op on amd64, but it may be better to take care of this now if the arm64 port is planned.
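One way the read side could look with hardware fences rather than compiler-only fences is sketched below. `SharedClockPage` and `read_once` are hypothetical names with a simplified layout, not the PR's `HvReferenceTscPage`, and using `fence(Ordering::Acquire)` as the Rust counterpart of `rmb()` is this sketch's assumption.

```rust
use core::sync::atomic::{fence, AtomicU32, AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for the shared clock page.
pub struct SharedClockPage {
    pub sequence: AtomicU32,
    pub scale: AtomicU64,
    pub offset: AtomicU64,
}

/// One read attempt. Returns `None` if the sequence changed while the
/// calibration values were being read.
pub fn read_once(page: &SharedClockPage) -> Option<(u64, u64)> {
    let seq1 = page.sequence.load(Ordering::Relaxed);
    // Read barrier: the data loads below must not be hoisted above the
    // first sequence load, by the compiler or by the CPU.
    fence(Ordering::Acquire);

    let scale = page.scale.load(Ordering::Relaxed);
    let offset = page.offset.load(Ordering::Relaxed);

    // Read barrier: the data loads must complete before the re-check.
    fence(Ordering::Acquire);
    let seq2 = page.sequence.load(Ordering::Relaxed);

    if seq1 == seq2 {
        Some((scale, offset))
    } else {
        None
    }
}
```

On x86_64 these fences compile to no extra instructions, while on weakly ordered architectures they emit a real barrier, which matches the point about a future arm64 port.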
    if time_100ns < 0 {
        return None; // Invalid time
    }
You are making scaled a signed integer to catch overflows, but by doing so you also lose precision (it's just 1 bit of precision, but still). That would report an overflow in cases where the u64 value would not in fact have overflowed (e.g. i64::MAX + 1).
I know that tsc_offset is signed, but you could let scaled be unsigned and use u64::overflowing_add_signed instead, so you don't lose precision and still detect over/under-flows. On amd64 it uses just a normal add instruction and saves the overflow flag.
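The suggestion might look roughly like this. The function name `reference_time_100ns` is illustrative, and the formula (128-bit multiply, take the high 64 bits, add the signed offset) is the Hyper-V reference TSC calculation as this sketch understands it, not code taken from the PR.

```rust
/// Illustrative reference-time calculation that keeps `scaled` unsigned.
/// Returns `None` only when adding the signed offset genuinely leaves
/// the `u64` range in either direction.
pub fn reference_time_100ns(tsc: u64, tsc_scale: u64, tsc_offset: i64) -> Option<u64> {
    // High 64 bits of the 128-bit product: all 64 bits stay usable.
    let scaled = ((u128::from(tsc) * u128::from(tsc_scale)) >> 64) as u64;

    // Add the signed offset to the unsigned value and capture the
    // overflow flag instead of reinterpreting the result as i64.
    let (time, overflowed) = scaled.overflowing_add_signed(tsc_offset);
    if overflowed {
        None
    } else {
        Some(time)
    }
}
```

`u64::checked_add_signed` would express the same thing in one call; `overflowing_add_signed` is used here because it is the method named in the comment.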
    // Check sequence again
    let seq2 = unsafe { core::ptr::read_volatile(&tsc_page.tsc_sequence) };
    if seq1 != seq2 {
        return None; // Data changed during read, retry later
I wonder how the caller is supposed to differentiate between "use MSR fallback" and this case here? The sequence mismatch is a transient problem caused by the hypervisor modifying the page contents. It is very likely to go away if you retry immediately, and retrying is still going to be much faster than taking an MSR exit.
Therefore you would typically loop when reading from the TSC page until the sequences match, because we (the Hyper-V hypervisor) do not expect this problem to persist. All client-side implementations I've seen just retry immediately, perhaps with an upper bound on the retry count. Falling back on MSR traps would be much less efficient in P99.9 cases.
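A bounded retry wrapper along those lines could be sketched as follows. `read_with_retry` and its closure parameter are hypothetical; the closure stands in for a single read attempt that returns `None` on a sequence mismatch.

```rust
/// Retries a single-attempt read up to `max_attempts` times.
/// `None` from this function means the mismatch persisted, which is the
/// point where a caller might consider a slower fallback.
pub fn read_with_retry<T>(
    max_attempts: u32,
    mut attempt: impl FnMut() -> Option<T>,
) -> Option<T> {
    for _ in 0..max_attempts {
        if let Some(value) = attempt() {
            return Some(value);
        }
        // Hint to the CPU that we are spinning briefly while the
        // hypervisor finishes its update.
        core::hint::spin_loop();
    }
    None
}
```

Keeping the transient case inside the loop leaves `None` with a single meaning for the caller, which addresses the ambiguity raised above.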