Adds a HIP for efficiently mapping files into sandboxes #1207

simongdavies wants to merge 1 commit into hyperlight-dev:main from …
Conversation
This covers both the guest binary for the sandbox and arbitrary user-mapped files. It proposes a capability that uses minimal amounts of memory whilst retaining low overhead, integrates with sandbox snapshot and restore, and paves the way for future snapshot persistence. Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
Thanks for writing this up! I strongly believe that this is moving in the correct direction, but I have a number of reservations about the precise suggestions here. Since my comments are mostly fairly high-level, I'm just going to include them here, rather than try to disperse them throughout inline comments.

## Lazy page table population

I have fairly serious reservations about whether this is worth it. Is the motivation for this change mostly about saving the memory used for page table entries for unused file pages? If so, I am not convinced, because I think that the used pages will be fairly dense among the unused ones, and if even one page in an aligned 2MB region of virtual memory is paged in, the whole page table will need to be realised and take up the same amount of space whether one page or all of them are mapped. A much bigger memory saving on the page tables themselves is possible using hugepage mappings, I think, and that is a fairly straightforward change to the mapping code (the only particularly complicated thing there is that one needs to be a little bit careful with TLB discipline when converting a huge-page mapping to a non-huge-page one or vice versa; probably there is a fairly reasonable way to turn this into certain invalidation hooks in …).
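To make the arithmetic behind that concrete, here is a small illustrative sketch (plain Rust; the constants and function are invented for illustration, not Hyperlight code). On x86-64 a last-level page table is 512 entries of 8 bytes, i.e. 4 KiB covering 2 MiB of virtual address space, so a single resident 4 KiB page realises the whole table, while a 2 MiB huge page is mapped directly by a page-directory entry and needs no last-level table at all:

```rust
/// 2 MiB: the span of one page-directory entry (a huge page when the PS bit
/// is set), which is also the span covered by one last-level page table.
const HUGE_PAGE_SIZE: u64 = 2 * 1024 * 1024;
/// 4 KiB: the size of one last-level page table (512 entries x 8 bytes).
const PT_SIZE: u64 = 4 * 1024;

/// Bytes of last-level page tables needed to map `len` bytes at `gva`,
/// assuming every fully-covered, 2 MiB-aligned chunk becomes a huge page
/// and only the ragged edges fall back to 4 KiB pages.
fn last_level_table_bytes(gva: u64, len: u64) -> u64 {
    let (mut addr, end, mut tables) = (gva, gva + len, 0);
    while addr < end {
        let chunk_end = ((addr / HUGE_PAGE_SIZE + 1) * HUGE_PAGE_SIZE).min(end);
        let whole_aligned_chunk =
            addr % HUGE_PAGE_SIZE == 0 && chunk_end - addr == HUGE_PAGE_SIZE;
        if !whole_aligned_chunk {
            // A partial chunk realises a full page table, even if only one
            // 4 KiB page in it is ever actually touched.
            tables += PT_SIZE;
        }
        addr = chunk_end;
    }
    tables
}

fn main() {
    // 64 MiB mapped at a 2 MiB-aligned address: 32 huge-page entries and no
    // last-level tables at all.
    assert_eq!(last_level_table_bytes(64 << 20, 64 << 20), 0);
    // The same mapping done entirely with 4 KiB pages realises one table per
    // 2 MiB region: 32 tables, 128 KiB, regardless of how sparsely the
    // region is actually used.
    assert_eq!((64 << 20) / HUGE_PAGE_SIZE * PT_SIZE, 128 * 1024);
}
```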
I also think that the performance motivation here would need to be very large indeed in order to make it worth the complexity. The bootstrap code mapping here would need to include the exception handler, as well as all code that it calls. I think this would mean needing to compile it separately from the main guest library and include it in the host? That is a complexity increase on its own, and especially not ideal when one needs to add custom behaviour to it in a guest, as often happens (e.g. hyperlight-wasm does). Adding hooks to call into user code isn't straightforward, since that code could itself be unmapped, causing double faults, which would also necessitate adding significant complexity to exception stack management.

The whole extra structure of memory region permissions passed between the host and the guest is also some extra complexity, and in particular adds another large point of coordination between the host and the guest, which I think we have generally been trying to get rid of. If I understand correctly, it is pretty much only necessary to preserve the default behaviour of RW/RX on different regions of the binary, and only because of the lazy page table creation.

## Proactive cache management

I also think that the binary cache management is too opinionated for an infrastructure library like Hyperlight. It involves a lot of assumptions about the host (e.g. the existence of writable storage) and is difficult to interpose on from another library. I think it is more useful to factor out the basic things that are needed to enable something like this, but avoid building all this specialised API and cache management code, which is really more of a concern for the service. I think we have been burned before on being too opinionated trying to manage stuff that really is about the integration with the host environment, like the seccomp handler feature we had for a while, and this feels a similar level of opinionated to me.

I think the core functionalities here are (1) getting at the bytes that are going into the sandbox after relocations are applied and (2) spinning up a new sandbox from those bytes on disk. They can both be addressed with a lot less case-specific code by focusing on snapshot persistence instead (i.e. the benefits of the relocated binary caching code here are essentially matched by (1) taking a snapshot after loading the binary once and (2) making all future uses of the binary descend from that snapshot). Building those features makes it easy for a user to have the same caching strategy described in this proposal, or any other strategy they wish.

## File mapping modes

This is a less important comment, but I also do not like the introduction of multiple modes for file mappings. The user mappings already require cooperation in the guest to get them mapped into GVAs, and the stage-2 host->guest mapping should always be RO, so there is no need to have more information in the host. Again, this just feels a little too opinionated for Hyperlight as an infrastructure library; it should be, as much as possible, up to the guests how to handle their own memory.

## Sketch of an alternate proposal

I have been thinking about similar requirements for a bit because, as soon as the host and guest both no longer try to write to the snapshot region, it should be fairly trivially possible to map the snapshot directly off disk, read-only, into the guest, which necessitates some changes to the snapshot structures to remove the ….

### 1. Generalise snapshot backing regions

Currently, a snapshot is always backed by a …. The same benefit of rationalising the approach applies to the other mapped-in-later files, since those file mappings can directly turn into file-backed snapshot backing objects. This is a pretty small, scoped change: just making some fields in the snapshot struct a bit more rich and changing some of the memory manager restore logic to allow mapping in arbitrary numbers of sandbox memories instead of just one.

### 2. Allow realising a backing object to disk

Add some API to a snapshot, e.g. …, that realises its backing object to a file on disk. This is also a very small and tightly-scoped change.

### 3. Make it easy to retain a backing region from an earlier snapshot

When taking a new snapshot, allow user code to pass in a predicate that decides, for a given page (or region of pages?), whether to copy it into a new snapshot backing object (as is presently always done), or to add a reference to the original backing object. Again, this is a pretty small and scoped change contained to snapshot.rs that only touches the code that copies a page around.

### User code flow for sharing guest binary objects
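As one possible illustration of how the three changes above could compose (a sketch only: the `Sandbox` and `Snapshot` stubs and every method on them below are invented stand-ins, not Hyperlight's actual or proposed API surface), the user code flow might look something like this:

```rust
// Hypothetical sketch only: stub types stand in for the proposed snapshot
// APIs so the flow is runnable; none of these names exist in Hyperlight.
use std::io;
use std::path::Path;

struct Sandbox; // stand-in for an initialised sandbox
struct Snapshot; // stand-in for a snapshot with generalised backing regions

impl Sandbox {
    /// Take a snapshot; the predicate decides, per page index, whether to
    /// copy the page into a fresh backing object or keep a reference to the
    /// original backing object (change 3).
    fn snapshot_with(&mut self, _copy_page: impl Fn(u64) -> bool) -> Snapshot {
        Snapshot
    }
}

impl Snapshot {
    /// Realise this snapshot's backing object to a file on disk (change 2).
    fn write_to(&self, _path: &Path) -> io::Result<()> {
        Ok(())
    }
    /// Open a snapshot whose backing object lives in a file (change 1).
    fn from_file(_path: &Path) -> io::Result<Snapshot> {
        Ok(Snapshot)
    }
    /// Create a sandbox descending from this snapshot; a file-backed region
    /// is mapped read-only straight off disk and shared via the page cache.
    fn restore(&self) -> io::Result<Sandbox> {
        Ok(Sandbox)
    }
}

fn main() -> io::Result<()> {
    let cache = Path::new("guest.snap"); // hypothetical cache location

    // First run: load and relocate the guest binary once, then persist the
    // resulting snapshot so its pages become a file-backed backing object.
    let mut first = Sandbox; // imagine: sandbox created from the guest binary
    first.snapshot_with(|_page| true).write_to(cache)?;

    // Later runs: every sandbox descends from the persisted snapshot, so the
    // binary is never re-parsed and its pages are shared between sandboxes.
    let mut sandbox = Snapshot::from_file(cache)?.restore()?;

    // Re-snapshotting is cheap: copy only pages the guest has dirtied and
    // keep references to the on-disk backing object for everything else.
    let _child = sandbox.snapshot_with(|_page| false); // e.g. consult a dirty-page bitmap
    Ok(())
}
```

With something like this, the cache layout, invalidation policy, and storage assumptions live entirely in user code rather than in Hyperlight.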
I think something along these lines would end up allowing the same benefits as the original proposal (apart from the lazy PTE population, which, as I have said above, I am quite sceptical of), but would be less opinionated, more flexible, and a significantly smaller change set that is less likely to conflict with other ongoing work, and it would integrate more nicely with (and be directly useful for) a final version of snapshot persistence.
Rendered view of proposal.