[Feature] buffer- and stack-based allocator strategies (#251)
Conversation
To answer the question about the buffer size choice: in an ideal case, what I would really want is for this to be page-aligned and grown in multiples of the page size, but I'm not sure how easy it is to achieve that in a portable/generic manner. A secondary point is that the OS will actually prevent us from doing anything really stupid, since this buffer initially only takes up virtual memory and is accessed from the start, so only the pages that are actually touched get committed.
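(For what it's worth, a minimal sketch of one portable option, assuming the `Mmap` stdlib is acceptable as a dependency: `Mmap.PAGESIZE` exposes the OS page size, and an anonymous `Mmap.mmap` hands back page-aligned memory whose pages are only committed on first touch.)

```julia
using Mmap

# Round a requested buffer size up to a whole number of pages.
pages(nbytes) = cld(nbytes, Mmap.PAGESIZE) * Mmap.PAGESIZE

# Anonymous mmap: page-aligned, and backed by virtual memory only
# until individual pages are first written to.
buf = Mmap.mmap(Vector{UInt8}, pages(10^6))
```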
I think this looks good, but I am very tired, so I'll take a fresh look tomorrow.
docs/src/man/backends.md
> By default, the `DefaultAllocator` is used, which uses Julia's built-in memory management system.
> Optionally, it can be useful to use the `ManualAllocator`, as the manual memory management reduces the pressure on the garbage collector.
> In particular in multi-threaded applications, this can sometimes lead to a significant performance improvement.
> On the other hand, for often-repeated but thread-safe `@tensor` calls, the `BufferAllocator` is a lightweight slab allocator that pre-allocates a buffer for temporaries, falling back to Julia's default if needed.
Can we be specific about how often is often enough? At least an order of magnitude?
I'm not really sure I can: I think that anything more than a single repetition already reduces the number of allocations, but whether or not that matters depends a whole lot on the context. It's much the same for the manual allocator: it can help with reducing GC pressure, but doesn't necessarily make anything faster.
Maybe we can turn it around: when would it make sense to think "I should use the buffer allocator!" -- when there's a lot of GC pressure? When I'm swapping?
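(To make that concrete, a hedged sketch of the "hot repeated contraction" case where the buffer allocator should pay off, assuming `@tensor`'s `allocator` keyword and a `BufferAllocator()` constructor as introduced in this PR:)

```julia
using TensorOperations

A, B = randn(64, 64), randn(64, 64)
buf = BufferAllocator()   # assumed constructor from this PR

# The same contraction evaluated many times in a row: after a first
# warm-up pass, the intermediate of this three-factor contraction can
# be served from `buf` instead of going through the GC every iteration.
for _ in 1:1000
    @tensor allocator = buf C[i, j] := A[i, k] * B[k, l] * A[l, j]
end
```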
Ok, this is very cool: a new allocator out of the blue. How do you use it in practice? I assume you still have to store a buffer in global state for a given block of contractions. Is this using Bumper.jl under the hood?
In principle I could indeed copy some of the Bumper functionality for that, but then we might also consider just using Bumper as a hard dependency and adding the resize functionality on top of that? I was imagining this to be more of a manual thing though, leaving it up to users and library developers. The use case I have in mind is MPSKit, for which I created this: I simply create a buffer at the beginning of a Krylov loop and reuse it throughout the eigensolver.
I am in for merging the PR in its current state (up to the language corrections suggested by Katharine), and we can always add more functionality to make it easier to use at a later stage.
Let me do some final tweaks to the language, and rewrite the docs section accordingly.
Maybe I'm missing something here, but why not just use Bumper.jl's default allocator (the `AllocBuffer`)? You can think of an `AllocBuffer` as a bump allocator over a pre-allocated block of memory.
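(For reference, since it comes up a few times below: the Bumper.jl pattern in question, using its documented `AllocBuffer`, `@no_escape`, and `@alloc` API.)

```julia
using Bumper

buf = AllocBuffer(2^20)              # bump allocator over a ~1 MiB slab
result = @no_escape buf begin
    tmp = @alloc(Float64, 100, 100)  # bump-allocated from `buf`, freed at block end
    tmp .= 1.0
    sum(tmp)                         # only non-aliasing results may escape the block
end
```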
I might be completely misunderstanding the `AllocBuffer` here. Sketching a use-case: I want to repeatedly perform the same tensor contraction (for context, this shows up e.g. in Krylov-based methods for diagonalizing that operator):

```julia
f(in1, in2, in3, in4) = @tensor out[...] := in1[...] * in2[...] * in3[...] * in4[...]
```

In this example, 2 intermediate objects are needed: the pairwise contraction first combines `in1` and `in2`, and then combines that result with `in3`. The total workflow would therefore be similar to the following, writing out the buffer manipulations in pseudocode:

```julia
buffer = # create buffer
for i in 1:maxiter
    checkpoint = create_checkpoint(buffer)
    out = f(in1, in2, in3, in4)
    checkpoint_restore!(buffer, checkpoint)
end
```

Typically, the intermediate objects are large, and it is not impossible for them to exceed GB sizes. Again, I might be misunderstanding how the `AllocBuffer` works here.
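(A hedged sketch of how that pseudocode could map onto the hooks introduced in this PR; the `BufferAllocator` constructor and the exact signatures of `allocator_checkpoint!` and `allocator_reset!` are assumptions made here for illustration.)

```julia
using TensorOperations

buffer = BufferAllocator()   # assumed constructor from this PR

# Same shape as `f` above, now routing temporaries through `buffer`
# (assuming @tensor's allocator keyword accepts it):
f(a, b, c, d) = @tensor allocator = buffer out[i, j] := a[i, k] * b[k, l] * c[l, m] * d[m, j]

in1, in2, in3, in4 = (randn(2, 2) for _ in 1:4)
for _ in 1:100
    cp = allocator_checkpoint!(buffer)   # assumed: capture the current stack position
    out = f(in1, in2, in3, in4)          # the two intermediates come from `buffer`
    allocator_reset!(buffer, cp)         # assumed: drop everything past `cp`
end
```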
I see, that makes sense. I believe similar functionality has actually been requested before by machine learning people as well, since they have a similar use-case. Would you be interested in upstreaming this buffer implementation to Bumper.jl? No problem if it's not something you have bandwidth for, but I just want to raise the possibility, since it would likely be useful to more packages than just TensorOperations.jl.
I'll try to make some time either this week or the next; happy to contribute (and also happy to get another set of eyes on the implementation; it never hurts when dealing with this pointer magic).
This is a new set of features that expands on the allocator functionality.
For repeated tensor contractions, the best approach really is to allocate a buffer for the intermediate tensors once and reuse it.
While the Bumper approach has been working for this, the main shortcoming there is that it is hard to know a priori what buffer size has to be provided.
I tried to tackle this problem by making two main changes:
1. `allocator_checkpoint!` and `allocator_reset!` functions. Their main purpose is to natively support capturing and resetting stack-based allocation strategies. By default, every `@tensor` call that specifies a backend will now include an `allocator_checkpoint!` call at the beginning and an `allocator_reset!` call at the end, which by default are no-ops.
2. A `BufferAllocator` implementation which functions in a way that is similar to how Bumper's `AllocBuffer` would work. However, the main difference here is that whenever the buffer is full, it simply falls back on regular Julia-allocated objects, while keeping track of the maximal size it would have needed to accommodate all intermediate tensors. When the buffer is fully empty, it will use this information to allocate appropriately-sized buffers, so subsequent usages of the same buffer will avoid repeated allocations, without needing to know the buffer size a priori.

Feedback and comments are very welcome on features, names, and design choices (and anything else, really).
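To make the `BufferAllocator` strategy concrete, here is a minimal self-contained sketch of the grow-on-demand idea from point 2. All names and details are illustrative stand-ins, not the PR's actual implementation:

```julia
# Illustrative sketch of the grow-on-demand slab strategy (not the PR's code).
mutable struct SketchBuffer
    data::Vector{UInt8}   # the pre-allocated slab
    offset::Int           # bytes handed out since the last reset
    maxneeded::Int        # high-water mark, including fallback allocations
end
SketchBuffer(n::Integer) = SketchBuffer(Vector{UInt8}(undef, n), 0, 0)

function alloc!(buf::SketchBuffer, nbytes::Integer)
    start = buf.offset
    buf.offset += nbytes  # advance even when falling back, to record the true need
    buf.maxneeded = max(buf.maxneeded, buf.offset)
    if buf.offset <= length(buf.data)
        return view(buf.data, (start + 1):(start + nbytes))  # serve from the slab
    end
    return Vector{UInt8}(undef, nbytes)  # fall back to GC-managed memory
end

function reset!(buf::SketchBuffer)
    buf.offset = 0
    # Grow only when the slab is fully empty: `resize!` may move `data`,
    # which would invalidate any views still alive into the old slab.
    buf.maxneeded > length(buf.data) && resize!(buf.data, buf.maxneeded)
    return buf
end
```

After a first pass through a workload, `maxneeded` records the total temporary footprint, so a `reset!` grows the slab once and every subsequent pass runs entirely inside it, without the user ever specifying a size up front.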