
[Feature] buffer- and stack-based allocator strategies #251

Merged
lkdvos merged 16 commits into master from ld-allocator on Jan 28, 2026

Conversation

@lkdvos
Member

@lkdvos lkdvos commented Jan 27, 2026

This is a new set of features that expands on the allocator functionality.
For repeated tensor contractions, the best approach really is to use and reuse a buffer for the intermediate tensors.
While the Bumper approach has been working for this, its main shortcoming is that it is hard to know a priori what size of buffer has to be provided.

I tried to tackle this problem by making two main changes:

  1. I've expanded the allocator interface by additionally including allocator_checkpoint! and allocator_reset! functions. Their main purpose is to natively support capturing and resetting stack-based allocation strategies. By default, every @tensor call that specifies a backend will now include an allocator_checkpoint! call at the beginning and an allocator_reset! call at the end; both are no-ops by default.
  2. I've added a native BufferAllocator implementation which functions similarly to Bumper's AllocBuffer. The main difference is that whenever the buffer is full, it simply falls back on regular Julia-allocated objects, while keeping track of the maximal size it would have needed to accommodate all intermediate tensors. When the buffer is completely emptied, it uses this information to allocate an appropriately sized buffer, so subsequent uses of the same buffer avoid repeated allocations without needing to know the buffer size a priori.
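
As a rough illustration of the intended semantics (the names allocator_checkpoint! and allocator_reset! are taken from point 1 above, but the signatures, the no-op fallbacks, and the toy allocator are my own sketch, not the actual implementation):

```julia
# Illustrative sketch only, not the TensorOperations.jl implementation.
# Allocators without internal state have nothing to capture or restore,
# so the interface functions default to no-ops:
allocator_checkpoint!(allocator) = nothing
allocator_reset!(allocator, checkpoint) = nothing

# A stack/bump-style allocator can override them to record and restore
# its current position in the buffer:
mutable struct ToyBufferAllocator   # hypothetical name
    buffer::Vector{UInt8}
    offset::Int   # current top of the "stack"
end
allocator_checkpoint!(a::ToyBufferAllocator) = a.offset
allocator_reset!(a::ToyBufferAllocator, checkpoint) = (a.offset = checkpoint; a)
```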

Feedback and comments are very welcome on features, names, and design choices (and anything else, really).

@codecov

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 18.96552% with 47 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/implementation/allocator.jl | 0.00% | 42 Missing ⚠️ |
| ext/TensorOperationsBumperExt.jl | 0.00% | 4 Missing ⚠️ |
| src/implementation/blascontract.jl | 80.00% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/TensorOperations.jl | 100.00% <ø> (ø) |
| src/indexnotation/postprocessors.jl | 93.33% <100.00%> (+0.65%) ⬆️ |
| src/indexnotation/tensormacros.jl | 46.66% <100.00%> (+0.59%) ⬆️ |
| src/interface.jl | 48.21% <100.00%> (+1.91%) ⬆️ |
| src/implementation/blascontract.jl | 89.10% <80.00%> (+0.33%) ⬆️ |
| ext/TensorOperationsBumperExt.jl | 0.00% <0.00%> (ø) |
| src/implementation/allocator.jl | 34.06% <0.00%> (-29.20%) ⬇️ |

@lkdvos lkdvos force-pushed the ld-allocator branch 2 times, most recently from 05a80ce to 8df8edf Compare January 27, 2026 20:12
@lkdvos lkdvos requested review from Jutho and kshyatt January 27, 2026 20:12
@lkdvos
Member Author

lkdvos commented Jan 27, 2026

To answer the question about the buffer size choice:

I think in an ideal case what I would really want is this to be page-aligned, and increased in multiples of the page size, but I'm not sure how easy it is to achieve that in a portable/generic manner.
Here I chose to mimic the behavior of Base for dictionaries, which uses _tablesz to determine the sizes, i.e. rounding up to the next power of 2.
https://github.com/JuliaLang/julia/blob/fdce8f69aa65d8aceea9922e564fc6bdb3d563b0/base/abstractdict.jl#L580

A secondary point is that the OS will actually prevent us from doing anything really stupid, since this buffer initially only takes up virtual memory and is not touched from the start.
For example, it is not even that much of an issue to simply allocate a buffer of 100 GB; the OS will only back it with physical pages as they are actually accessed.
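
For reference, a minimal sketch of that power-of-two growth rule (using the public `nextpow` from Base rather than the internal `_tablesz`, so purely illustrative):

```julia
# Illustrative only: round a requested buffer size up to the next power of two,
# similar in spirit to how Base._tablesz sizes Dict storage.
grow_to_pow2(n::Integer) = nextpow(2, max(n, 1))

grow_to_pow2(1000)  # 1024
grow_to_pow2(4096)  # 4096 (already a power of two)
```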

lkdvos and others added 2 commits January 27, 2026 16:31
@Jutho
Member

Jutho commented Jan 27, 2026

I think this looks good, but I am very tired, so I will take a fresh look tomorrow.

lkdvos and others added 2 commits January 27, 2026 16:58
Co-authored-by: Jutho <Jutho@users.noreply.github.com>
By default, the `DefaultAllocator` is used, which uses Julia's built-in memory management system.
Alternatively, it can be useful to use the `ManualAllocator`, as manual memory management reduces the pressure on the garbage collector.
In particular in multi-threaded applications, this can sometimes lead to a significant performance improvement.
On the other hand, for often-repeated but thread-safe `@tensor` calls, the `BufferAllocator` is a lightweight slab allocator that pre-allocates a buffer for temporaries, falling back to Julia's default if needed.
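
A hypothetical usage sketch to make this concrete (the `BufferAllocator()` constructor shown here and the `allocator = ...` keyword to `@tensor` are assumptions on my part; check the package documentation for the actual API):

```julia
using TensorOperations

A = randn(10, 10, 10)
B = randn(10, 10, 10)

# Assumed constructor for the buffer-backed allocator introduced in this PR.
alloc = TensorOperations.BufferAllocator()

for _ in 1:100
    # Temporaries are served from the allocator's buffer when they fit,
    # and fall back to regular Julia allocations otherwise.
    @tensor allocator = alloc C[a, d] := A[a, b, c] * B[c, b, d]
end
```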
Member

Can we be specific about how often is often enough? At least an order of magnitude?

Member Author

I'm not really sure I can: I think anything larger than 1 already gives a reduction in the number of allocations, but whether or not that matters depends a whole lot on the context. It's kind of the same for the manual allocator: it can help with reducing GC pressure, but doesn't necessarily make anything faster.

Member

Maybe we can turn it around: when would it make sense to think "I should use the buffer allocator!" -- when there's a lot of GC pressure? When I'm swapping?

@Jutho
Member

Jutho commented Jan 28, 2026

Ok this is very cool. A new allocator out of the blue. How do you use it in practice? I assume you still have to store a buffer in a global state for a given block of contractions. Is this using task_local_storage then? Should there be functionality for this in the same way as provided by Bumper.jl, i.e. have some default_buffer_allocator() function or something similar?

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

In principle I could indeed copy some of the Bumper functionality for that, but then we might also consider just using Bumper as a hard dependency and adding the resize functionality on top of that? I was imagining this to be more of a manual thing though, and leaving that up to users and library developers.

The use case I have is that I created this for MPSKit, where I simply create a buffer at the beginning of a Krylov loop and reuse that throughout the eigensolver.
So yes, there is some state that is managed by the user, either through a global or even just locally around some code.
I don't think I would really recommend using it as a global buffer though, and rather just allocate a new buffer for iterative procedures and let it be freed after.
The philosophy being that allocating is not that bad, as long as you are not doing it in a loop.

@Jutho
Member

Jutho commented Jan 28, 2026

I am in for merging the PR in its current state (up to the language corrections suggested by Katharine), and we can always add more functionality to make it easier to use at a later stage.

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

Let me do some final tweaks to the language and rewrite the sizehint as resize; I'll request another review when I'm done.

@lkdvos lkdvos requested review from Jutho and kshyatt January 28, 2026 14:11
@lkdvos lkdvos enabled auto-merge (squash) January 28, 2026 14:20
@lkdvos lkdvos merged commit ff410ee into master Jan 28, 2026
9 of 10 checks passed
@lkdvos lkdvos deleted the ld-allocator branch January 28, 2026 15:10
@MasonProtter
Contributor

Maybe I'm missing something here, but why not just use Bumper.jl's default allocator (the SlabBuffer)? That allocator is growable, and will work with arbitrary sizes without needing to fall back on GC.

You can think of a SlabBuffer as a collection of AllocBuffers. Each time one of the slabs is full, we just allocate a new one, and make that the active buffer. If an allocation is bigger than the slab-size, it makes a custom-sized slab for your specific allocation.

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

I might be completely misunderstanding the SlabBuffer implementation here, so please do correct me if I'm wrong.
Somehow I convinced myself that there was a subtle difference in the following sense:

When the SlabBuffer is reset, it frees all of the additional slabs that had been allocated after the checkpoint, including the custom-sized ones.
However, this is not really what I want: I would rather, at that point, allocate a larger buffer, such that if I were to repeat the exact same action, I no longer have to add or remove slabs at all.

Sketching a use-case here: I want to repeatedly perform the same tensor contraction (for context, this shows up e.g. in Krylov-based methods, where the contraction plays the role of the linear operator being diagonalized).
To do that, each single application looks like:

f(in1, in2, in3, in4) = @tensor out[...] := in1[...] * in2[...] * in3[...] * in4[...]

In this example, 2 intermediate objects are needed: in12 = (in1 * in2) and in123 = (in12 * in3).

Writing out the buffer manipulations in pseudocode, the total workflow would therefore be similar to:

buffer = # create buffer
for i in 1:maxiter
    checkpoint = create_checkpoint(buffer)
    out = f(in1, in2, in3, in4)
    checkpoint_restore!(buffer, checkpoint)
end

Typically, the intermediate objects are large, and it is not impossible for them to exceed GB sizes.
What I want to avoid is for the first slab to be filled up by in12, and then for a new slab to be allocated for in123, and then afterwards to free in123 again, only to have it be reallocated in the next iteration.

Again, I might be misunderstanding how the SlabBuffer works, in which case this allocator indeed wasn't necessary and only the interface changes were needed.
If not, I'd also be happy to think about how to contribute a thing like this to Bumper.jl, if you would be interested in that.
It was mostly just easier to get it working and start playing around with it here.
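
To make the difference concrete, here is a minimal, purely illustrative sketch of the "record the high-water mark and grow on reset" idea (the names and fields are hypothetical, not the actual BufferAllocator code):

```julia
# Illustrative sketch of grow-on-reset, not the actual implementation.
mutable struct GrowingBuffer          # hypothetical name
    data::Vector{UInt8}
    offset::Int      # bytes currently handed out from `data`
    highwater::Int   # bytes that would have been needed this round
end
GrowingBuffer(n::Integer = 2^20) = GrowingBuffer(Vector{UInt8}(undef, n), 0, 0)

function request!(buf::GrowingBuffer, nbytes::Integer)
    buf.highwater += nbytes
    if buf.offset + nbytes <= length(buf.data)
        buf.offset += nbytes
        return :from_buffer        # slice of the pre-allocated buffer
    else
        return :julia_fallback     # regular GC-managed allocation
    end
end

function reset!(buf::GrowingBuffer)
    # Once everything has been released, grow the buffer to the size that
    # would have covered all temporaries, so the next iteration never hits
    # the fallback path and no slabs are ever added or removed.
    if buf.highwater > length(buf.data)
        resize!(buf.data, nextpow(2, buf.highwater))
    end
    buf.offset = 0
    buf.highwater = 0
    return buf
end
```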

@MasonProtter
Contributor

I see, that makes sense. I believe similar functionality has actually been requested before by machine learning people as well, since they have a similar use-case.

Would you be interested in upstreaming this buffer implementation to Bumper.jl? No problem if it's not something you have bandwidth for, but I just want to raise the possibility since it would likely be useful to more packages than just TensorOperations.jl

@lkdvos
Member Author

lkdvos commented Jan 28, 2026

I'll try to make some time either this week or the next; happy to contribute (and also happy to get another set of eyes on the implementation, it never hurts when dealing with this pointer magic).
