fix: race conditions in parallel agents #473

Open
jaxxjj wants to merge 3 commits into google:main from jaxxjj:fix/rc-parallel-agents

Conversation

@jaxxjj commented Jan 12, 2026

Fix race condition in parallel agents with tools

Problem

Parallel agents using tools exhibit a data race: concurrent goroutines access shared session state without proper synchronization. This causes intermittent failures where agents build LLM requests with incomplete session history, missing recently executed tool responses.

Root Cause

Architecture

Parallel agent spawns N goroutines for N sub-agents, each executing:

agent.Run() → yields events → continues next iteration

Runner goroutine processes these events:

receive event → append to session → yield to caller

Race Window

The race occurs in this sequence:

Agent goroutine:                Runner goroutine:
─────────────────              ─────────────────
1. Yield FunctionResponse  →   2. Receive event
                               3. Start session.Append()
4. Continue to next iter
5. Read session
   ❌ FunctionResponse          6. Append completes
      not visible yet              (too late)

Between steps 3 and 5, the agent reads the session while the runner is still writing. Under the Go memory model, a write in one goroutine is not guaranteed to be visible to a read in another goroutine without explicit synchronization.

Fix

Implement backpressure control using acknowledgment channels. Each agent goroutine waits for explicit confirmation that event processing is complete before proceeding to the next iteration.

// Agent: Wait for acknowledgment
<-ackChan  // Blocks until runner signals completion

// Runner: Signal completion
close(ackChan)  // Unblocks agent
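
The two fragments above can be put together into a minimal, self-contained sketch of the handshake. This is an illustration only, not the code from this PR: the event type, the runWithAck function, and the plain string history are hypothetical stand-ins for the real agent, runner, and session types.

```go
package main

import "fmt"

// event is a hypothetical stand-in for an agent event.
type event struct {
	text string
	ack  chan struct{} // closed by the runner once the event is stored
}

// runWithAck emits n events from an "agent" goroutine. After each event the
// agent blocks until the "runner" has appended it to history. seen records
// how many history entries the agent observed at the start of each iteration.
func runWithAck(n int) (history []string, seen []int) {
	events := make(chan event)
	done := make(chan struct{})

	go func() { // agent goroutine
		defer close(done)   // runs last: signals that seen is complete
		defer close(events) // runs first: lets the runner loop finish
		for i := 0; i < n; i++ {
			seen = append(seen, len(history)) // read shared state
			ack := make(chan struct{})
			events <- event{text: fmt.Sprintf("response-%d", i), ack: ack}
			<-ack // block until the runner signals completion
		}
	}()

	for ev := range events { // runner
		history = append(history, ev.text)
		close(ev.ack) // unblock the agent only after the append
	}
	<-done
	return history, seen
}

func main() {
	history, seen := runWithAck(3)
	fmt.Println(history, seen) // [response-0 response-1 response-2] [0 1 2]
}
```

Because the close of the ack channel happens before the agent's receive returns, every read of history in the agent is ordered after the previous append, so the agent never observes a stale length.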

Changes

1. Session State Protection

File: internal/sessioninternal/mutablesession.go

Issue: storedSession field accessed concurrently without synchronization.

Fix: Added sync.RWMutex to protect all storedSession field accesses
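
As a rough sketch of what this kind of protection looks like, assuming a simplified session that only holds a slice of strings (the guardedSession type and its methods are hypothetical, not the actual MutableSession API):

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSession is a hypothetical, simplified session whose event list is
// protected by a read-write mutex.
type guardedSession struct {
	mu     sync.RWMutex
	events []string
}

// Append takes the write lock, so it excludes all readers and other writers.
func (s *guardedSession) Append(e string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
}

// Events takes the read lock and returns a copy, so callers never share the
// backing array with a concurrent Append.
func (s *guardedSession) Events() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.events...)
}

// appendConcurrently starts several goroutines that append and read at once.
func appendConcurrently(s *guardedSession, writers, perWriter int) {
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWriter; i++ {
				s.Append(fmt.Sprintf("w%d-e%d", w, i))
				_ = s.Events() // exercise the concurrent read path
			}
		}(w)
	}
	wg.Wait()
}

func main() {
	s := &guardedSession{}
	appendConcurrently(s, 4, 100)
	fmt.Println(len(s.Events())) // 400
}
```

Note that a mutex like this removes the data race but does not by itself order an agent's read after the runner's append; that ordering is what the acknowledgment channel provides.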

2. Backpressure Control in Parallel Agent

File: agent/workflowagents/parallelagent/agent.go

Issue: Sub-agents continue iteration before runner completes event processing.

Fix: Implemented acknowledgment channel protocol:

  • Added ackChan chan struct{} to result struct
  • Sub-agent creates ack channel per event
  • Sub-agent blocks on <-ackChan after sending event
  • Parallel agent closes ack channel after yield() returns
  • yield() return implies runner has appended event to session
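
The protocol in the list above could look roughly like the following fan-in sketch. Only the result struct with its ackChan field comes from the description; runParallel, the stop channel, the string-based events, and the yield signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// result carries one event plus its acknowledgment channel. Fields other
// than ackChan are hypothetical.
type result struct {
	agent   string
	event   string
	ackChan chan struct{}
}

// runParallel starts one goroutine per sub-agent. Each sub-agent sends an
// event and blocks until the event has been yielded. If yield returns false,
// all sub-agents are released via stop.
func runParallel(agents []string, steps int, yield func(agent, event string) bool) {
	results := make(chan result)
	stop := make(chan struct{})
	var wg sync.WaitGroup

	for _, name := range agents {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			for i := 0; i < steps; i++ {
				ack := make(chan struct{}) // one ack channel per event
				select {
				case results <- result{agent: name, event: fmt.Sprintf("step-%d", i), ackChan: ack}:
				case <-stop:
					return
				}
				select {
				case <-ack: // yield() has returned for this event
				case <-stop:
					return
				}
			}
		}(name)
	}
	go func() { wg.Wait(); close(results) }()

	for r := range results {
		if !yield(r.agent, r.event) {
			close(stop) // release every sub-agent still waiting
			break
		}
		close(r.ackChan) // acknowledge only after yield() has returned
	}
	wg.Wait()
}

func main() {
	var history []string
	runParallel([]string{"a", "b"}, 3, func(agent, event string) bool {
		history = append(history, agent+":"+event)
		return true
	})
	fmt.Println(len(history)) // 6
}
```

Since yield is only ever called from the goroutine running the loop, the callback can update its own state without extra locking in this sketch.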

3. OpenTelemetry Tracer Protection

File: internal/telemetry/telemetry.go

Issue: localTracer global variable accessed concurrently without synchronization.

Fix: Added sync.RWMutex to protect tracer access
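
A minimal sketch of guarding a package-level variable this way, assuming a placeholder tracer type rather than the real OpenTelemetry interfaces (tracer, tracerMu, setTracer, and getTracer are hypothetical names; only localTracer appears in the description):

```go
package main

import (
	"fmt"
	"sync"
)

// tracer is a hypothetical placeholder for the real tracer type.
type tracer struct{ name string }

var (
	tracerMu    sync.RWMutex // guards localTracer
	localTracer *tracer
)

// setTracer replaces the global tracer under the write lock.
func setTracer(t *tracer) {
	tracerMu.Lock()
	defer tracerMu.Unlock()
	localTracer = t
}

// getTracer reads the global tracer under the read lock and falls back to a
// no-op placeholder when none has been set.
func getTracer() *tracer {
	tracerMu.RLock()
	defer tracerMu.RUnlock()
	if localTracer == nil {
		return &tracer{name: "noop"}
	}
	return localTracer
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			setTracer(&tracer{name: fmt.Sprintf("tracer-%d", i)})
			_ = getTracer().name // concurrent read alongside other writers
		}(i)
	}
	wg.Wait()
	fmt.Println(getTracer().name != "noop") // true
}
```

For a single pointer that is swapped as a whole, atomic.Pointer from sync/atomic would be a possible alternative; the mutex version is shown because it matches the fix described here.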

Testing

Race Detector Results

Before fix (data races in the session and OpenTelemetry code):

$ go test -race ./agent/workflowagents/parallelagent
==================
WARNING: DATA RACE
Read at 0x... by goroutine ...:
  google.golang.org/adk/internal/sessioninternal.(*MutableSession).Events()
      .../sessioninternal/mutablesession.go:65

Previous write at 0x... by goroutine ...:
  google.golang.org/adk/internal/sessioninternal.(*MutableSession).Events()
      .../sessioninternal/mutablesession.go:80
==================
Found 19 data race(s)
FAIL

After fix:

$ go test -race ./agent/workflowagents/parallelagent
ok  	google.golang.org/adk/agent/workflowagents/parallelagent	3.264s

Alignment with Python ADK

The Python ADK implementation uses the same synchronization pattern:

Python (asyncio.Event):

resume_signal = asyncio.Event()
await queue.put((event, resume_signal))
await resume_signal.wait()  # Wait for processing

Go (chan struct{}):

ackChan := make(chan struct{})
results <- result{event: event, ackChan: ackChan}
<-ackChan  // Wait for processing

@google-cla bot commented Jan 12, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @jaxxjj, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical race conditions within the parallel agent system, specifically impacting scenarios where agents interact with tools. The previous implementation allowed concurrent goroutines to access shared session state and global telemetry variables without proper synchronization, leading to data inconsistencies and intermittent failures. The solution introduces robust synchronization mechanisms, including read-write mutexes for session state and telemetry, and a backpressure system utilizing acknowledgment channels for parallel agents. This ensures that all events are fully processed and shared state is consistently updated before sub-agents proceed, significantly enhancing the stability and correctness of parallel agent execution.

Highlights

  • Race Condition Fix: Addressed and resolved data race conditions occurring in parallel agents, particularly when they utilize tools. These races led to intermittent failures where agents would build LLM requests with incomplete session history.
  • Backpressure Control Implementation: Introduced a backpressure mechanism using acknowledgment channels. Sub-agents now wait for explicit confirmation that event processing (including session history updates) is complete before proceeding to their next iteration, preventing premature reads of shared state.
  • Session State Synchronization: Implemented sync.RWMutex in internal/sessioninternal/mutablesession.go to protect concurrent accesses to the storedSession field, ensuring thread-safe operations on session data.
  • Telemetry Tracer Protection: Added sync.RWMutex to internal/telemetry/telemetry.go to safeguard the localTracer global variable, preventing race conditions during telemetry setup and access.
  • Race Detector Validation: The implemented fixes have been verified using Go's race detector, which now reports no data races in the agent/workflowagents/parallelagent package, confirming the effectiveness of the synchronization changes.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a critical race condition in parallel agents by introducing an acknowledgment channel (ackChan) for backpressure, ensuring that sub-agents wait for event processing to complete before proceeding. The changes are well-structured and include a targeted test case that validates the fix. Additionally, mutexes have been correctly added to MutableSession and the telemetry package to protect shared state from concurrent access, resolving other potential data races. My review includes a couple of suggestions for simplification and improved error visibility.

connyay added a commit to connyay/adk-go that referenced this pull request Jan 17, 2026
Adds backpressure mechanism to prevent sub-agents from reading stale
session state. Uses acknowledgment channels to ensure event processing
completes before the next iteration continues.

Co-Authored-By: jaxxjj <yc5082@nyu.edu>