New circular buffer using FAA instead of CAS to optimize performance #3644
ZENOTME wants to merge 1 commit into open-telemetry:main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #3644      +/-   ##
==========================================
+ Coverage   90.03%   90.08%   +0.06%
==========================================
  Files         220      220
  Lines        7065     7111      +46
==========================================
+ Hits         6360     6405      +45
- Misses        705      706       +1
```
```cpp
CircularBufferRange<AtomicUniquePtr<T>> PeekImpl() noexcept
{
  uint64_t current_count = count_.load(std::memory_order_relaxed);
  // ... (rest of the function elided in the review excerpt)
```
The only synchronization is this load. Is it possible for more than one thread to get the same CircularBufferRange back and cause a data race?
Sorry, I don't quite follow; could you describe a more specific scenario?
If two threads call this function and both complete before any state changes, they may return the same values, and `callback(range);` in Consume will then be called with the same range.
Yes, this circular buffer implementation is MPSC and assumes only one consumer thread calls this method. That is how it is used in the batch span processor.
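For context, here is a rough sketch of that single-consumer contract, modeled on how the batch span processor drains the buffer (signatures paraphrased from the existing `circular_buffer.h`; the `Item` and `DrainN` names are only for illustration and may differ from this PR):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

#include "opentelemetry/sdk/common/circular_buffer.h"

using opentelemetry::sdk::common::AtomicUniquePtr;
using opentelemetry::sdk::common::CircularBuffer;
using opentelemetry::sdk::common::CircularBufferRange;

// Hypothetical payload type; the batch span processor stores Recordable here.
struct Item
{
  int value;
};

// Only the single consumer (worker) thread calls this. Because no other
// thread ever consumes, the relaxed load of count_ inside PeekImpl() can
// only race with producers, which the FAA-based Add path already handles.
void DrainN(CircularBuffer<Item> &buffer,
            std::size_t n,
            std::vector<std::unique_ptr<Item>> &out)
{
  buffer.Consume(
      n, [&out](CircularBufferRange<AtomicUniquePtr<Item>> range) noexcept {
        range.ForEach([&out](AtomicUniquePtr<Item> &ptr) {
          std::unique_ptr<Item> item;
          ptr.Swap(item);  // take ownership of the slot's element
          out.push_back(std::move(item));
          return true;     // keep iterating over the consumed range
        });
      });
}
```

Producers call `Add` concurrently from many threads; only this one worker thread ever calls `Consume`, which is the invariant the relaxed load relies on.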
Looks like PeekImpl() could be made const, so the const_cast could be removed too
On a Mac M1 I get a similar result to SpinLockMutex: the higher the contention, the less efficient the atomics are, and the better std::mutex performs.
A compile fix for libc++ (macOS) is also required, hence `uint64_t max_check = std::min<uint64_t>(current_count, capacity_);`
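For clarity, the macOS/libc++ failure is most likely the usual `std::min` type-deduction mismatch: there `uint64_t` is `unsigned long long` while `std::size_t` is `unsigned long`, so if `capacity_` is declared as `size_t` (an assumption here), the two arguments have different types and deduction fails; the explicit template argument sidesteps that. A hypothetical reproduction:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// On macOS/libc++, uint64_t is `unsigned long long` while std::size_t is
// `unsigned long`; both are 64 bits wide but are distinct types, so the
// single-template-parameter std::min cannot deduce T from mixed arguments.
uint64_t MaxCheck(uint64_t current_count, std::size_t capacity)
{
  // return std::min(current_count, capacity);         // error on libc++: no matching function
  return std::min<uint64_t>(current_count, capacity);  // explicit T compiles on every platform
}
```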
Fixes #3645
Changes
I noticed that the original circular buffer used by the batch span processor is a lock-free MPSC queue based on CAS. Under intense multithreaded contention, compare-and-swap (CAS) scales worse than fetch-and-add (FAA). Following https://github.com/dbittman/waitfree-mpsc-queue, this PR attempts to implement a wait-free MPSC (Multiple Producer, Single Consumer) queue using FAA; a rough sketch of the idea is included below. Based on the original benchmark, this approach shows better performance scalability.
I would like to keep refining this PR if the approach is acceptable.
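For reference, below is a minimal, self-contained sketch of the FAA reservation idea, modeled on the linked waitfree-mpsc-queue rather than on the exact `CircularBuffer` code in this PR; `FaaMpscQueue`, `TryPush`, and `TryPop` are illustrative names only:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of an FAA-based wait-free MPSC ring, closely following
// https://github.com/dbittman/waitfree-mpsc-queue. Illustrative only; the
// PR adapts the same idea to the existing CircularBuffer interface.
template <class T>
class FaaMpscQueue
{
public:
  explicit FaaMpscQueue(std::size_t capacity) : slots_(capacity)
  {
    for (auto &slot : slots_)
    {
      slot.store(nullptr, std::memory_order_relaxed);
    }
  }

  // Any number of producer threads may call this concurrently. Each producer
  // claims a distinct slot with one fetch_add (FAA) instead of looping on
  // compare_exchange (CAS), so producers never retry against each other.
  bool TryPush(std::unique_ptr<T> value)
  {
    std::size_t count = count_.fetch_add(1, std::memory_order_acquire);
    if (count >= slots_.size())
    {
      count_.fetch_sub(1, std::memory_order_release);  // queue full: back out
      return false;
    }
    std::size_t index = head_.fetch_add(1, std::memory_order_acquire) % slots_.size();
    T *previous = slots_[index].exchange(value.release(), std::memory_order_release);
    assert(previous == nullptr);  // the claimed slot must have been empty
    (void)previous;
    return true;
  }

  // Only the single consumer thread may call this.
  std::unique_ptr<T> TryPop()
  {
    T *item = slots_[tail_].exchange(nullptr, std::memory_order_acquire);
    if (item == nullptr)
    {
      return nullptr;  // empty, or a producer reserved the slot but has not published yet
    }
    tail_ = (tail_ + 1) % slots_.size();
    count_.fetch_sub(1, std::memory_order_release);
    return std::unique_ptr<T>(item);
  }

private:
  std::vector<std::atomic<T *>> slots_;  // nullptr means "slot empty"
  std::atomic<std::size_t> count_{0};    // reserved-but-unconsumed slots
  std::atomic<std::size_t> head_{0};     // producer ticket counter
  std::size_t tail_{0};                  // touched only by the consumer, no atomic needed
};
```

The key difference from the CAS version is that a producer claims its slot with a single `fetch_add` ticket instead of retrying a `compare_exchange` loop, so heavy producer contention degrades far more gracefully.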