
Adds a HIP for efficiently mapping files into sandboxes #1207

Open

simongdavies wants to merge 1 commit into hyperlight-dev:main from simongdavies:HIP0002

Conversation

@simongdavies (Contributor) commented Feb 3, 2026

This covers both the guest binary for the sandbox and arbitrary user-mapped files. It proposes a capability that uses minimal amounts of memory whilst retaining low overhead, integrates with sandbox snapshot and restore, and paves the way for future snapshot persistence.

Rendered view of proposal.


Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
@simongdavies added the kind/enhancement label (For PRs adding features, improving functionality, docs, tests, etc.) on Feb 3, 2026
@syntactically (Member)

Thanks for writing this up! I strongly believe that this is moving in the correct direction, but I have a number of reservations about the precise suggestions here. Since my comments are mostly fairly high-level, I'm just going to include them here, rather than try to disperse them throughout inline comments.

Lazy page table population

I have fairly serious reservations about whether this is worth it. Is the motivation for this change mostly about saving the memory used for page table entries for unused file pages? If so, I am not convinced: I expect the used pages to be fairly dense among the unused ones, and if even one page in an aligned 2MB region of virtual memory is paged in, the whole page table must be realised and takes up the same amount of space whether one page or all of them are mapped. A much bigger saving on the page tables themselves is possible using hugepage mappings, I think, and that is a fairly straightforward change to the mapping code (the only particularly complicated part is that one needs to be a little careful with TLB discipline when converting a huge-page mapping to a non-huge-page mapping or vice versa; probably there is a reasonable way to turn this into invalidation hooks in TableOps).
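The 2MB claim above can be checked with a quick back-of-envelope calculation (x86-64, 4-level paging with 8-byte entries; the numbers here are architectural constants, not Hyperlight specifics):

```rust
// A leaf page table is 512 entries x 8 bytes = 4 KiB, and it maps an
// aligned 2 MiB region of virtual memory. Paging in even one 4 KiB page
// in that region forces the whole 4 KiB table to exist, so lazy PTE
// population saves nothing for a region that is touched at all. A 2 MiB
// hugepage mapping eliminates the leaf table entirely.
fn main() {
    let entry_size = 8u64;          // bytes per page table entry
    let entries_per_table = 512u64; // entries in one leaf table
    let page = 4096u64;             // bytes per 4 KiB page

    let leaf_table_bytes = entry_size * entries_per_table;
    let region_covered = entries_per_table * page;
    assert_eq!(leaf_table_bytes, 4096);            // one 4 KiB table...
    assert_eq!(region_covered, 2 * 1024 * 1024);   // ...covers 2 MiB

    // Leaf-table overhead for a 64 MiB mapping: 32 tables = 128 KiB
    // with 4 KiB pages, versus zero leaf tables with 2 MiB hugepages.
    let mapping = 64 * 1024 * 1024u64;
    let leaf_tables = mapping / region_covered;
    assert_eq!(leaf_tables, 32);
    println!(
        "4 KiB pages: {} KiB of leaf tables; 2 MiB hugepages: 0 KiB",
        leaf_tables * leaf_table_bytes / 1024
    );
}
```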

I also think that the performance motivation here would need to be very large indeed to justify the complexity. The bootstrap code mapping would need to include the exception handler, as well as all code that it calls. I think this would mean compiling it separately from the main guest library and including it in the host? That is a complexity increase on its own, and especially not ideal when one needs to add custom behaviour to it in a guest, as often happens (e.g. hyperlight-wasm does). Adding hooks to call into user code isn't straightforward, since that code could itself be unmapped, causing double faults, which would also necessitate significant extra complexity in exception stack management.

The extra structure of memory region permissions passed between the host and the guest also adds complexity, and in particular introduces another large point of coordination between the host and the guest, which I think we have generally been trying to eliminate. If I understand correctly, it is needed only to preserve the default RW/RX behaviour on different regions of the binary, and only because of the lazy page table creation.

Proactive cache management

I also think that the binary cache management is too opinionated for an infrastructure library like Hyperlight. It involves a lot of assumptions about the host (e.g. the existence of writable storage) and is difficult to interpose on from another library. I think it is more useful to factor out the basic things needed to enable something like this, but avoid building all this specialised API and cache management code, which is really a concern of the service. I think we have been burned before by being too opinionated about managing things that are really about integration with the host environment, like the seccomp handler feature we had for a while, and this feels similarly opinionated to me.

I think the core functionalities here are (1) getting at the bytes that are going into the sandbox after relocations are applied and (2) spinning up a new sandbox from those bytes on disk. They can both be addressed with a lot less case-specific code by focusing on snapshot persistence instead (i.e. the benefits of the relocated binary caching code here are essentially matched by (1) taking a snapshot after loading the binary once and (2) making all future uses of the binary descend from that snapshot). Building those features makes it easy for a user to have the same caching strategy described in this proposal, or any other strategy they wish.

File mapping modes

This is a less important comment, but I also do not like the introduction of multiple modes for file mappings. The user mappings already require cooperation in the guest to get them mapped into GVAs, and the stage 2 host->guest mapping should always be RO, so there is no need to have more information in the host. Again, this just feels a little too opinionated for Hyperlight as an infrastructure library---it should be, as much as possible, up to the guests how to handle their own memory.

Sketch of an alternate proposal

I have been thinking about a similar requirement for a while: as soon as neither the host nor the guest writes to the snapshot region, it should be fairly trivially possible to map the snapshot directly off disk, read-only, into the guest, which requires some changes to the snapshot structures to remove the Vec&lt;u8&gt; backing. Developing that further, I think one could achieve similar benefits by:

1. Generalise snapshot backing regions

Currently, a snapshot is always backed by a Vec&lt;u8&gt; in-memory-only representation, which is far from ideal: it must be copied into a SharedMemory in order to actually be mapped into the guest, and it cannot easily be persisted to disk. Instead, a snapshot's backing store could be generalised to any number of backing objects, each of which could be a Vec (needing to be copied whenever a sandbox uses this snapshot), anonymous memory, or a file mapping (of some subset of a file). These would all be mapped by the host into the low half of guest physical address space, and, as already happens, the guest's view of virtual memory would be preserved by being careful about creating new page tables. Loading a file mapping would, of course, involve the same considerations around potentially locking it, etc.

The same benefit of rationalising the approach to other files mapped in later applies here, since those file mappings can turn directly into file-backed snapshot backing objects.

This is a pretty small, scoped change: just making some fields in the snapshot struct a bit richer, and changing some of the memory manager restore logic to allow mapping in arbitrary numbers of sandbox memories instead of just one.
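Concretely, such a generalisation might look something like the following sketch (all type and field names here are illustrative, not Hyperlight's actual API):

```rust
// Hypothetical sketch: a snapshot's backing store generalised from a
// single Vec<u8> to an ordered list of backing objects.
use std::ops::Range;
use std::path::PathBuf;

/// One region of guest physical memory backing a snapshot.
enum BackingObject {
    /// Bytes held in host memory; must be copied into the sandbox's
    /// memory whenever a sandbox is created from this snapshot.
    Copied(Vec<u8>),
    /// Zero-filled anonymous memory of the given length.
    Anonymous { len: usize },
    /// A read-only mapping of `range` within the file at `path`.
    FileMapping { path: PathBuf, range: Range<u64> },
}

/// A snapshot as a list of backing objects, each mapped by the host at
/// a guest physical address in the low half of the address space.
struct Snapshot {
    regions: Vec<(u64 /* guest phys addr */, BackingObject)>,
}

fn main() {
    let snap = Snapshot {
        regions: vec![
            (0x0, BackingObject::Copied(vec![0u8; 4096])),
            (0x1000, BackingObject::Anonymous { len: 0x10000 }),
            (0x11000, BackingObject::FileMapping {
                path: PathBuf::from("guest.bin.relocated"),
                range: 0..0x200000,
            }),
        ],
    };
    assert_eq!(snap.regions.len(), 3);
}
```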

2. Allow realising a backing object to disk

Add some API to a snapshot, e.g. .persist(), which allows choosing some backing objects that are currently backed by memory or anonymous mappings that should be converted to be file-backed.

This is also a very small and tightly-scoped change.
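Given backing objects along the lines of the sketch in step 1, `.persist()` could amount to little more than writing a region's bytes out and swapping its backing object in place. A minimal sketch, again with illustrative names only:

```rust
// Hypothetical sketch of a .persist() API: convert a memory-backed
// snapshot region into a file-backed one.
use std::fs;
use std::path::PathBuf;

enum BackingObject {
    Copied(Vec<u8>),
    FileMapping { path: PathBuf, len: u64 },
}

struct Snapshot {
    regions: Vec<BackingObject>,
}

impl Snapshot {
    /// Persist region `idx` to `path`, replacing its in-memory backing
    /// with a file mapping of the written bytes. Already-file-backed
    /// regions are left untouched.
    fn persist(&mut self, idx: usize, path: PathBuf) -> std::io::Result<()> {
        let bytes = match &self.regions[idx] {
            BackingObject::Copied(b) => b.clone(),
            BackingObject::FileMapping { .. } => return Ok(()),
        };
        fs::write(&path, &bytes)?;
        let len = bytes.len() as u64;
        self.regions[idx] = BackingObject::FileMapping { path, len };
        Ok(())
    }
}

fn main() -> std::io::Result<()> {
    let mut snap = Snapshot {
        regions: vec![BackingObject::Copied(vec![7u8; 64])],
    };
    let path = std::env::temp_dir().join("snap-region-0.bin");
    snap.persist(0, path.clone())?;
    assert!(matches!(snap.regions[0], BackingObject::FileMapping { .. }));
    assert_eq!(fs::read(&path)?.len(), 64);
    fs::remove_file(path)?;
    Ok(())
}
```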

3. Make it easy to retain a backing region from an earlier snapshot

When taking a new snapshot, allow user code to pass in a predicate that decides, for a given page (or region of pages?), whether to copy it into a new snapshot backing object (as is always done at present) or to add a reference to the original backing object.

Again, this is a pretty small and scoped change contained to snapshot.rs that only touches the code that copies a page around.
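The predicate-driven copy-or-reference decision can be sketched as follows (a toy model with illustrative names; `Rc` stands in for whatever reference the real snapshot code would hold on the ancestor's backing object):

```rust
// Hypothetical sketch: when taking a child snapshot, a user-supplied
// predicate decides per page whether to copy the page or to retain a
// reference to the parent snapshot's backing.
use std::rc::Rc;

const PAGE: usize = 4096;

enum PageBacking {
    /// The page's bytes were copied into the new snapshot.
    Copied(Vec<u8>),
    /// The page references the same bytes as an ancestor snapshot.
    Shared(Rc<Vec<u8>>),
}

struct Snapshot {
    pages: Vec<PageBacking>,
}

/// Take a child snapshot; `keep_shared(page_index)` returning true means
/// "retain the parent's backing" instead of copying the page.
fn snapshot_with_predicate(
    parent: &[Rc<Vec<u8>>],
    keep_shared: impl Fn(usize) -> bool,
) -> Snapshot {
    let pages = parent
        .iter()
        .enumerate()
        .map(|(i, p)| {
            if keep_shared(i) {
                PageBacking::Shared(Rc::clone(p))
            } else {
                PageBacking::Copied(p.as_ref().clone())
            }
        })
        .collect();
    Snapshot { pages }
}

fn main() {
    let parent: Vec<Rc<Vec<u8>>> =
        (0..4).map(|i| Rc::new(vec![i as u8; PAGE])).collect();
    // Share the code pages (0 and 1), copy the data pages (2 and 3).
    let child = snapshot_with_predicate(&parent, |i| i < 2);
    assert!(matches!(child.pages[0], PageBacking::Shared(_)));
    assert!(matches!(child.pages[3], PageBacking::Copied(_)));
}
```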

User code flow for sharing guest binary objects

  1. Create (offline) a snapshot object from a raw binary. Snapshot::from_file() is enough to do the relocation etc., but there is little downside to also doing vm.initialise() here and taking a snapshot after.
  2. Convert that snapshot's memory backing into a file backing (this will produce a file with morally similar contents to the cached relocated binary that you proposed, but located at the discretion of user code).
  3. When creating future sandboxes, do so from this snapshot. This already achieves code sharing for those direct sandboxes. If you take further snapshots, pass in a predicate saying that, for these code pages, snapshotting should preserve the reference to the original source object; this ensures that even snapshots further down a tree share the pages, much as the original proposal allows.

I think something along these lines would end up allowing the same benefits as the original proposal (apart from the lazy PTE population, which, as I have said above, I am quite sceptical of), but it would be less opinionated and more flexible, would be a significantly smaller change set less likely to conflict with other ongoing work, and would integrate more nicely with (and be directly useful for) a final version of snapshot persistence.
