Skip to content
15 changes: 13 additions & 2 deletions pkg/agent/loop.go
Original file line number Diff line number Diff line change
Expand Up @@ -574,11 +574,22 @@ func (al *AgentLoop) handleReasoning(ctx context.Context, reasoningContent, chan
return
}

al.bus.PublishOutbound(ctx, bus.OutboundMessage{
// Use a short timeout so the goroutine does not block indefinitely when
// the outbound bus is full. Reasoning output is best-effort; dropping it
// is acceptable to avoid goroutine accumulation.
pubCtx, pubCancel := context.WithTimeout(ctx, 5*time.Second)
defer pubCancel()

if err := al.bus.PublishOutbound(pubCtx, bus.OutboundMessage{
Channel: channelName,
ChatID: channelID,
Content: reasoningContent,
})
}); err != nil {
Comment on lines +578 to +588
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests for handleReasoning, but none cover the new β€œbus is full / publish times out” path. Add a test that fills the outbound buffer and uses a short-deadline parent context to assert handleReasoning returns promptly (and does not publish) when PublishOutbound can’t enqueue before the deadline.

Copilot uses AI. Check for mistakes.
logger.WarnCF("agent", "Failed to publish reasoning (best-effort)", map[string]any{
"channel": channelName,
"error": err.Error(),
})
}
Comment on lines 588 to 607
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleReasoning treats reasoning as best-effort, but logs a warning on every publish failure/timeout. Under load (when the outbound bus is full), this can become very noisy and add its own performance/ops overhead. Consider lowering the log level, sampling/rate-limiting, or only warning on unexpected errors (e.g., excluding context deadline/cancel).

Copilot uses AI. Check for mistakes.
}

// runLLMIteration executes the LLM call loop with tool handling.
Expand Down
59 changes: 46 additions & 13 deletions pkg/channels/whatsapp_native/whatsapp_native.go
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ type WhatsAppNativeChannel struct {
runCancel context.CancelFunc
reconnectMu sync.Mutex
reconnecting bool
wg sync.WaitGroup // tracks background goroutines (QR handler, reconnect)
}

// NewWhatsAppNativeChannel creates a WhatsApp channel that uses whatsmeow for connection.
Expand Down Expand Up @@ -112,6 +113,12 @@ func (c *WhatsAppNativeChannel) Start(ctx context.Context) error {
}

client := whatsmeow.NewClient(deviceStore, waLogger)

// Create runCtx/runCancel BEFORE registering event handler and starting
// goroutines so that Stop() can cancel them at any time, including during
// the QR-login flow.
c.runCtx, c.runCancel = context.WithCancel(ctx)

Comment on lines +127 to +131
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.stopping is set to true in Stop() but never reset in Start(). If this channel instance is ever restarted after a Stop(), the disconnect handler will permanently skip reconnection attempts because eventHandler returns early when stopping is true. Consider resetting stopping (and any related reconnect state) at the beginning of Start(), under the same lock used in Stop()/eventHandler, so lifecycle restarts behave correctly.

Copilot uses AI. Check for mistakes.
client.AddEventHandler(c.eventHandler)

c.mu.Lock()
Expand All @@ -122,33 +129,50 @@ func (c *WhatsAppNativeChannel) Start(ctx context.Context) error {
if client.Store.ID == nil {
qrChan, err := client.GetQRChannel(ctx)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client.GetQRChannel is still called with the parent ctx, but the QR reader goroutine exits on c.runCtx.Done() during Stop(). If GetQRChannel uses its context to stop its internal producer, canceling only runCtx can leave the producer running and blocked trying to send into qrChan after the reader exits. Use c.runCtx (or another context canceled by Stop()) when calling GetQRChannel so the producer is also canceled/closed when stopping.

Suggested change
qrChan, err := client.GetQRChannel(ctx)
qrChan, err := client.GetQRChannel(c.runCtx)

Copilot uses AI. Check for mistakes.
if err != nil {
c.runCancel()
_ = container.Close()
return fmt.Errorf("get QR channel: %w", err)
}
if err := client.Connect(); err != nil {
c.runCancel()
_ = container.Close()
return fmt.Errorf("connect: %w", err)
}
for evt := range qrChan {
if evt.Event == "code" {
logger.InfoCF("whatsapp", "Scan this QR code with WhatsApp (Linked Devices):", nil)
qrterminal.GenerateWithConfig(evt.Code, qrterminal.Config{
Level: qrterminal.L,
Writer: os.Stdout,
HalfBlocks: true,
})
} else {
logger.InfoCF("whatsapp", "WhatsApp login event", map[string]any{"event": evt.Event})
// Handle QR events in a background goroutine so Start() returns
// promptly. The goroutine is tracked via c.wg and respects
// c.runCtx for cancellation.
c.wg.Add(1)
go func() {
defer c.wg.Done()
Comment on lines 166 to 179
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c.wg.Add(1) for the QR handler goroutine is not coordinated with Stop() calling c.wg.Wait(). If Stop() can run concurrently with Start() (the comments imply it can, e.g. during QR-login), there is a real sync.WaitGroup misuse risk: Stop may enter Wait() while Start subsequently calls Add(1), which can panic. To fix, ensure all wg.Add calls are prevented once Stop begins (e.g., guard wg.Add with the same mutex/flag protocol used in eventHandler, or introduce a lifecycle mutex that serializes Start/Stop).

Suggested change
// promptly. The goroutine is tracked via c.wg and respects
// c.runCtx for cancellation.
c.wg.Add(1)
go func() {
defer c.wg.Done()
// promptly. The goroutine respects c.runCtx for cancellation.
go func() {

Copilot uses AI. Check for mistakes.
for {
select {
case <-c.runCtx.Done():
return
case evt, ok := <-qrChan:
if !ok {
return
}
if evt.Event == "code" {
logger.InfoCF("whatsapp", "Scan this QR code with WhatsApp (Linked Devices):", nil)
qrterminal.GenerateWithConfig(evt.Code, qrterminal.Config{
Level: qrterminal.L,
Writer: os.Stdout,
HalfBlocks: true,
})
} else {
logger.InfoCF("whatsapp", "WhatsApp login event", map[string]any{"event": evt.Event})
}
}
}
}
}()
} else {
if err := client.Connect(); err != nil {
c.runCancel()
_ = container.Close()
return fmt.Errorf("connect: %w", err)
}
}

c.runCtx, c.runCancel = context.WithCancel(ctx)
c.SetRunning(true)
logger.InfoC("whatsapp", "WhatsApp native channel connected")
Comment on lines 202 to 209
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the QR-login path (Store.ID == nil), Start() now returns immediately but still calls SetRunning(true). That means the channel manager will start outbound workers while the client may not be paired yet, and Send() currently only checks IsConnected() (not Store.ID / login state). Consider delaying SetRunning(true) until pairing completes, or have Send() explicitly detect the unpaired state (e.g., Store.ID == nil) and return a clear temporary/not-running error to avoid futile retries/log noise during QR setup.

Copilot uses AI. Check for mistakes.
return nil
Expand All @@ -159,6 +183,11 @@ func (c *WhatsAppNativeChannel) Stop(ctx context.Context) error {
if c.runCancel != nil {
c.runCancel()
}

// Wait for background goroutines (QR handler, reconnect) to finish so
// they don't reference the client/container after cleanup.
c.wg.Wait()

Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop() calls c.wg.Wait() while eventHandler can still call c.wg.Add(1) concurrently (e.g., a *events.Disconnected arriving during shutdown). This is a WaitGroup misuse and can panic with sync: WaitGroup misuse: Add called concurrently with Wait, or let reconnect goroutines run after cleanup. Add a shutdown/stopping guard (protected by a mutex/atomic) that prevents new goroutines from being added once stopping begins, and/or ensure the client can't dispatch events before Wait() is entered.

Copilot uses AI. Check for mistakes.
c.mu.Lock()
client := c.client
container := c.container
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stop(ctx) now waits unconditionally on c.wg.Wait(), but the tracked goroutines can block for an unbounded time (notably reconnectWithBackoff during client.Connect()), and Stop ignores its ctx parameter. Consider making the wait context-aware (e.g., wait in a separate goroutine and select on ctx.Done()), and/or disconnect the client before waiting so Connect() can be interrupted.

Suggested change
// Wait for background goroutines (QR handler, reconnect) to finish so
// they don't reference the client/container after cleanup.
c.wg.Wait()
c.mu.Lock()
client := c.client
container := c.container
// Grab current client/container so we can disconnect the client before waiting.
c.mu.Lock()
client := c.client
container := c.container
c.mu.Unlock()
// Disconnect the client first so any blocking Connect()/reconnect loops
// can be interrupted before we wait on the goroutines.
if client != nil {
client.Disconnect()
}
// Wait for background goroutines (QR handler, reconnect) to finish in a
// context-aware way so Stop can be bounded by ctx.
done := make(chan struct{})
go func() {
c.wg.Wait()
close(done)
}()
select {
case <-done:
// All goroutines have finished.
case <-ctx.Done():
// Context canceled or timed out; log and proceed with best-effort cleanup.
logger.WarnC("whatsapp", "Stop context canceled before all goroutines finished: %v", ctx.Err())
}
// Now it is safe to clear and close resources.
c.mu.Lock()
client = c.client
container = c.container

Copilot uses AI. Check for mistakes.
Expand Down Expand Up @@ -189,7 +218,11 @@ func (c *WhatsAppNativeChannel) eventHandler(evt any) {
}
c.reconnecting = true
c.reconnectMu.Unlock()
go c.reconnectWithBackoff()
c.wg.Add(1)
go func() {
defer c.wg.Done()
c.reconnectWithBackoff()
}()
}
}

Expand Down