fix: handle null states in EVCacheBulkGetFuture and refactor#179
fix: handle null states in EVCacheBulkGetFuture and refactor#179
Conversation
Fixes NullPointerException in bulk get operations when some operations never signal completion through callbacks. The issue was introduced in commit e4a2a9b which optimized operation state checking by collecting state during callbacks, but didn't handle cases where callbacks don't fire. This fix adds a fallback mechanism that preserves the original behavior: - Use pre-collected state when available (maintains performance optimization) - Fall back to direct operation state checking when state is null - Ensures all operations are processed, matching pre-e4a2a9bd behavior The root cause was that operationStates[i] could remain null if signalSingleOpComplete() was never called for operations that timeout, fail, or are cancelled before their completion callbacks fire.
| assert operationStates != null; | ||
|
|
||
| // Note: The latch here is counterintuitive. Based on the implementation in EVCacheMemcachedClient.asyncGetBulk(), | ||
| // the latch count is set to 1 no matter the chunk size and only decrement when pendingChunks counts down to 0. |
There was a problem hiding this comment.
the latch count is set to 1 no matter the chunk size
isn't it also 0 when chunks.isEmpty() ?
There was a problem hiding this comment.
Right. I actually mean I was expecting the latch count equal to the chunk size (no matter it's empty or with some chunks).
| super(m, getOps, l, service); | ||
| rvMap = m; | ||
| ops = getOps; | ||
| opsArray = ops.toArray(new Operation[0]); |
There was a problem hiding this comment.
ops is an empty list that gets populated later.. See EVCacheMemcachedClient's
final Collection<Operation> ops = new ArrayList<Operation>(chunks.size());
final EVCacheBulkGetFuture<T> rv = new EVCacheBulkGetFuture<T>(m, ops, latch, executorService, client);
There was a problem hiding this comment.
Great catch.. agree, then we should convert to Array whenever we need them (was implemented that way in the original fix, but I tweak it the wrong way)
2ff2114 to
e93ef48
Compare
| } | ||
|
|
||
| if (!allCompleted && !hasZF && hadTimedoutOp) statusString = EVCacheMetricsFactory.TIMEOUT; | ||
| if (hadTimedoutOp && !hasZF) statusString = EVCacheMetricsFactory.TIMEOUT; |
There was a problem hiding this comment.
Is it intentional to omit && hadTimedoutOp in the conditional?
There was a problem hiding this comment.
I assume you mean omit !allCompleted? This is based on the fact that "!allCompleted" will always lead to hadTimedoutOp.
|
|
||
| if (state == null) { | ||
| Operation op = opsArray[i]; | ||
| op.timeOut(); |
There was a problem hiding this comment.
do we want to signal timeout here? Iiuc we're handling that one of the ops in the bulk had an exception, shouldn't we signal a cancellation instead?
There was a problem hiding this comment.
Agree. signal a cancellation to the op seems better if we enter here while state is still null
There was a problem hiding this comment.
hmm after second look at this method, I wonder what's the intention of this method. The caller caught an exception, but we are not doing anything with that "ex". Instead we throw based on whichever ops we found "first" that is either Cancelled or TimedOut or Errored...
I feel like the correct implementation should be something like
public void handleBulkException(Throwable ex) {
Operation[] opsArray = ops.toArray(new Operation[0]);
for (int i = 0; i < operationStates.length(); i++) {
SingleOperationState state = operationStates.get(i);
if (state == null) {
Operation op = opsArray[i];
op.cancel();
MemcachedConnection.opSucceeded(op);
} else {
// Use pre-collected state
if (state.timedOut) {
MemcachedConnection.opTimedOut(state.op);
} else {
MemcachedConnection.opSucceeded(state.op);
}
}
}
throw new RuntimeException(ex);
}
| } | ||
| } | ||
|
|
||
| throw new RuntimeException(t); |
There was a problem hiding this comment.
is it now possible to throw new RuntimeException(null)? may need to handle that
There was a problem hiding this comment.
Putting down the note here: As we discussed offline, the state==null condition is more like the scenario in the below else block, so we go with marking it as timedout. (here is the original code before state was introduced for reference)
This PR cherry-picked the commit from PR to fix the NullPointerException in bulk get operations. More details regarding the bug can be found in the original PR. On top of the change, refactoring some of the bulk get logic and adding some comments to make it more clear.