
[Feature] buffer- and stack-based allocator strategies #251

Merged
lkdvos merged 16 commits into master from ld-allocator on Jan 28, 2026

Conversation

@lkdvos
Member

@lkdvos lkdvos commented Jan 27, 2026

This is a new set of features that expands on the allocator functionality.
For repeated tensor contractions, the best approach really is to use and reuse a buffer for the intermediate tensors.
While the Bumper approach has been working for this, its main shortcoming is that it is hard to know a priori what size of buffer has to be provided.

I tried to tackle this problem by making two main changes:

  1. I've expanded the allocator interface by additionally including allocator_checkpoint! and allocator_reset! functions. Their main purpose is to natively support capturing and resetting stack-based allocation strategies. By default, every @tensor call that specifies a backend will now include an allocator_checkpoint! call at the beginning and an allocator_reset! call at the end; both are no-ops by default.
  2. I've added a native BufferAllocator implementation which functions similarly to Bumper's AllocBuffer. The main difference is that whenever the buffer is full, it simply falls back on regular Julia-allocated objects, while keeping track of the maximal size it would have needed to accommodate all intermediate tensors. When the buffer is completely emptied, it uses this information to allocate an appropriately sized buffer, so subsequent uses of the same buffer avoid repeated allocations without needing to know the buffer size a priori.
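
As a rough illustration of the intended semantics (the names allocator_checkpoint! and allocator_reset! are taken from point 1 above, but the signatures, the no-op fallbacks, and the toy allocator are my own sketch, not the actual implementation):

```julia
# Illustrative sketch only, not the TensorOperations.jl implementation.
# Allocators without internal state have nothing to capture or restore,
# so the interface functions default to no-ops:
allocator_checkpoint!(allocator) = nothing
allocator_reset!(allocator, checkpoint) = nothing

# A stack/bump-style allocator can override them to record and restore
# its current position in the buffer:
mutable struct ToyBufferAllocator   # hypothetical name
    buffer::Vector{UInt8}
    offset::Int   # current top of the "stack"
end
allocator_checkpoint!(a::ToyBufferAllocator) = a.offset
allocator_reset!(a::ToyBufferAllocator, checkpoint) = (a.offset = checkpoint; a)
```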

Feedback and comments are very welcome on features, names, and design choices (and anything else, really).

@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 18.96552% with 47 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/implementation/allocator.jl | 0.00% | 42 Missing ⚠️ |
| ext/TensorOperationsBumperExt.jl | 0.00% | 4 Missing ⚠️ |
| src/implementation/blascontract.jl | 80.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/TensorOperations.jl | 100.00% <ø> (ø) |
| src/indexnotation/postprocessors.jl | 93.33% <100.00%> (+0.65%) ⬆️ |
| src/indexnotation/tensormacros.jl | 46.66% <100.00%> (+0.59%) ⬆️ |
| src/interface.jl | 48.21% <100.00%> (+1.91%) ⬆️ |
| src/implementation/blascontract.jl | 89.10% <80.00%> (+0.33%) ⬆️ |
| ext/TensorOperationsBumperExt.jl | 0.00% <0.00%> (ø) |
| src/implementation/allocator.jl | 34.06% <0.00%> (-29.20%) ⬇️ |

@lkdvos lkdvos force-pushed the ld-allocator branch 2 times, most recently from 05a80ce to 8df8edf Compare January 27, 2026 20:12
@lkdvos lkdvos requested review from Jutho and kshyatt January 27, 2026 20:12
@lkdvos
Member Author

lkdvos commented Jan 27, 2026

To answer the question about the buffer size choice:

I think in an ideal case what I would really want is this to be page-aligned, and increased in multiples of the page size, but I'm not sure how easy it is to achieve that in a portable/generic manner.
Here I chose to mimic the behavior of Base for dictionaries, which uses _tablesz to determine the sizes, i.e. rounding up to the next power of 2.
https://github.com/JuliaLang/julia/blob/fdce8f69aa65d8aceea9922e564fc6bdb3d563b0/base/abstractdict.jl#L580

A secondary point is that the OS will actually prevent us from doing anything really stupid, since this buffer initially only takes up virtual memory and is not touched from the start.
For example, it is not even that much of an issue to simply allocate a buffer of 100 GB; the OS will only back it with physical pages as they are actually accessed.
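
For reference, a minimal sketch of that power-of-two growth rule (using the public `nextpow` from Base rather than the internal `_tablesz`, so purely illustrative):

```julia
# Illustrative only: round a requested buffer size up to the next power of two,
# similar in spirit to how Base._tablesz sizes Dict storage.
grow_to_pow2(n::Integer) = nextpow(2, max(n, 1))

grow_to_pow2(1000)  # 1024
grow_to_pow2(4096)  # 4096 (already a power of two)
```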

lkdvos and others added 2 commits January 27, 2026 16:31
@Jutho
Member

Jutho commented Jan 27, 2026

I think this looks good, but I am very tired, so I will take a fresh look tomorrow.

lkdvos and others added 2 commits January 27, 2026 16:58
Co-authored-by: Jutho <Jutho@users.noreply.github.com>
By default, the `DefaultAllocator` is used, which uses Julia's built-in memory management system.
Alternatively, it can be useful to use the `ManualAllocator`, as manual memory management reduces the pressure on the garbage collector.
In particular in multi-threaded applications, this can sometimes lead to a significant performance improvement.
On the other hand, for often-repeated but thread-safe `@tensor` calls, the `BufferAllocator` is a lightweight slab allocator that pre-allocates a buffer for temporaries, falling back to Julia's default if needed.
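
A hypothetical usage sketch to make this concrete (the `BufferAllocator()` constructor shown here and the `allocator = ...` keyword to `@tensor` are assumptions on my part; check the package documentation for the actual API):

```julia
using TensorOperations

A = randn(10, 10, 10)
B = randn(10, 10, 10)

# Assumed constructor for the buffer-backed allocator introduced in this PR.
alloc = TensorOperations.BufferAllocator()

for _ in 1:100
    # Temporaries are served from the allocator's buffer when they fit,
    # and fall back to regular Julia allocations otherwise.
    @tensor allocator = alloc C[a, d] := A[a, b, c] * B[c, b, d]
end
```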
Member

Can we be specific about how often is often enough? At least an order of magnitude?

Member Author

I'm not really sure I can: I think anything larger than 1 already gives a reduction in the number of allocations, but whether or not that matters depends a whole lot on the context. It's kind of the same for the manual allocator: it can help with reducing GC pressure, but doesn't necessarily make anything faster.

Member

Maybe we can turn it around: when would it make sense to think "I should use the buffer allocator!" -- when there's a lot of GC pressure? When I'm swapping?

@Jutho
Member

Jutho commented Jan 28, 2026

Ok this is very cool. A new allocator out of the blue. How do you use it in practice? I assume you still have to store a buffer in a global state for a given block of contractions. Is this using task_local_storage then? Should there be functionality for this in the same way as provided by Bumper.jl, i.e. have some default_buffer_allocator() function or something similar?

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

In principle I could indeed copy some of the Bumper functionality for that, but then we might also consider just using Bumper as a hard dependency and adding the resize functionality on top of that? I was imagining this to be more of a manual thing though, and leaving that up to users and library developers.

The use case I have is that I created this for MPSKit, where I simply create a buffer at the beginning of a Krylov loop and reuse that throughout the eigensolver.
So yes, there is some state that is managed by the user, either through a global or even just locally around some code.
I don't think I would really recommend using it as a global buffer though, and rather just allocate a new buffer for iterative procedures and let it be freed after.
The philosophy being that allocating is not that bad, as long as you are not doing it in a loop.

@Jutho
Member

Jutho commented Jan 28, 2026

I am in for merging the PR in its current state (up to the language corrections suggested by Katharine), and we can always add more functionality to make it easier to use at a later stage.

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

Let me do some final tweaks to the language and rewrite the sizehint as resize; I'll request another review when I'm done.

@lkdvos lkdvos requested review from Jutho and kshyatt January 28, 2026 14:11
@lkdvos lkdvos enabled auto-merge (squash) January 28, 2026 14:20
@lkdvos lkdvos merged commit ff410ee into master Jan 28, 2026
9 of 10 checks passed
@lkdvos lkdvos deleted the ld-allocator branch January 28, 2026 15:10
@MasonProtter
Contributor

Maybe I'm missing something here, but why not just use Bumper.jl's default allocator (the SlabBuffer)? That allocator is growable, and will work with arbitrary sizes without needing to fall back on GC.

You can think of a SlabBuffer as a collection of AllocBuffers. Each time one of the slabs is full, we just allocate a new one, and make that the active buffer. If an allocation is bigger than the slab-size, it makes a custom-sized slab for your specific allocation.

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

I might be completely misunderstanding the SlabBuffer implementation here, so please do correct me if I'm wrong.
Somehow I convinced myself that there was a subtle difference in the following sense:

When the SlabBuffer is reset, it frees all of the additional slabs that had been allocated after the checkpoint, including the custom-sized ones.
However, this is not really what I want: I would rather, at that point, allocate a larger buffer, such that if I were to repeat the exact same action, I no longer have to add or remove slabs at all.

Sketching a use-case here: I want to repeatedly perform the same tensor contraction (for context, this shows up e.g. in Krylov-based methods, where the contraction plays the role of the linear operator being diagonalized).
To do that, each single application looks like:

f(in1, in2, in3, in4) = @tensor out[...] := in1[...] * in2[...] * in3[...] * in4[...]

In this example, 2 intermediate objects are needed: in12 = (in1 * in2) and in123 = (in12 * in3).

Writing out the buffer manipulations in pseudocode, the total workflow would therefore be similar to:

buffer = # create buffer
for i in 1:maxiter
    checkpoint = create_checkpoint(buffer)
    out = f(in1, in2, in3, in4)
    checkpoint_restore!(buffer, checkpoint)
end

Typically, the intermediate objects are large, and it is not impossible for them to exceed GB sizes.
What I want to avoid is for the first slab to be filled up by in12, and then for a new slab to be allocated for in123, and then afterwards to free in123 again, only to have it be reallocated in the next iteration.

Again, I might be misunderstanding how the SlabBuffer works, in which case this allocator indeed wasn't necessary and only the interface changes were needed.
If not, I'd also be happy to think about how to contribute a thing like this to Bumper.jl, if you would be interested in that.
It was mostly just easier to get it working and start playing around with it here.
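
To make the difference concrete, here is a minimal, purely illustrative sketch of the "record the high-water mark and grow on reset" idea (the names and fields are hypothetical, not the actual BufferAllocator code):

```julia
# Illustrative sketch of grow-on-reset, not the actual implementation.
mutable struct GrowingBuffer          # hypothetical name
    data::Vector{UInt8}
    offset::Int      # bytes currently handed out from `data`
    highwater::Int   # bytes that would have been needed this round
end
GrowingBuffer(n::Integer = 2^20) = GrowingBuffer(Vector{UInt8}(undef, n), 0, 0)

function request!(buf::GrowingBuffer, nbytes::Integer)
    buf.highwater += nbytes
    if buf.offset + nbytes <= length(buf.data)
        buf.offset += nbytes
        return :from_buffer        # slice of the pre-allocated buffer
    else
        return :julia_fallback     # regular GC-managed allocation
    end
end

function reset!(buf::GrowingBuffer)
    # Once everything has been released, grow the buffer to the size that
    # would have covered all temporaries, so the next iteration never hits
    # the fallback path and no slabs are ever added or removed.
    if buf.highwater > length(buf.data)
        resize!(buf.data, nextpow(2, buf.highwater))
    end
    buf.offset = 0
    buf.highwater = 0
    return buf
end
```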

@MasonProtter
Contributor

I see, that makes sense. I believe similar functionality has actually been requested before by machine learning people as well, since they have a similar use-case.

Would you be interested in upstreaming this buffer implementation to Bumper.jl? No problem if it's not something you have bandwidth for, but I just want to raise the possibility since it would likely be useful to more packages than just TensorOperations.jl

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

I'll try to make some time either this week or the next; happy to contribute (and also happy to get another set of eyes on the implementation, it never hurts when dealing with this pointer magic).
