Skip to content

feat: add unanswered Slack mention detection to heartbeat#186

Open
vltbaudbot wants to merge 2 commits intomodem-dev:mainfrom
vltbaudbot:feat/unanswered-slack-mentions
Open

feat: add unanswered Slack mention detection to heartbeat#186
vltbaudbot wants to merge 2 commits intomodem-dev:mainfrom
vltbaudbot:feat/unanswered-slack-mentions

Conversation

@vltbaudbot
Copy link

Summary

Adds periodic checking for Slack app_mention events that didn't receive a reply, helping catch messages lost during agent restarts.

Problem

During control-agent restarts, messages forwarded to dead sockets are lost forever because the Slack bridge acks them before confirming agent receipt. This creates a window where user messages can be silently dropped.

Example incident: On 2026-02-27, 3 messages were lost during a 45-minute restart window (05:03-05:48 UTC).

Solution

This PR adds an unanswered mention check to the heartbeat extension that:

  1. Scans bridge logs for recent app_mention events (last hour)
  2. Checks for replies by searching:
    • Bridge logs for POST /send or /reply with the thread_ts
    • Control-agent session logs for thread_ts references
  3. Reports unanswered mentions older than 5 minutes (grace period for processing)
  4. Runs automatically every 10 minutes via existing heartbeat mechanism
  5. Zero-token when healthy - only prompts LLM when issues are found

Changes

  • Add checkUnansweredMentions() function
  • Add hasRepliedToThread() helper for reply detection
  • Integrate into main heartbeat flow
  • Add HEARTBEAT_CHECK_UNANSWERED_MENTIONS env var (enabled by default)
  • Add configuration output for new settings

Testing

Tested locally during incident recovery - successfully detected the 3 lost messages from the restart.

Configuration

# Enabled by default, disable with:
HEARTBEAT_CHECK_UNANSWERED_MENTIONS=0

# Adjust grace period (default 5 min) by editing UNANSWERED_MENTION_THRESHOLD_MS constant

Follow-up Work

This is a detection/monitoring solution. Long-term reliability improvements would require changes to the bridge architecture (two-phase ack, persistent queue, etc) and potentially the broker service itself.

Adds periodic checking for Slack app_mention events that didn't receive a
reply, helping catch messages lost during agent restarts.

Changes:
- Add checkUnansweredMentions() function that scans bridge log for recent
  app_mention events (last hour, older than 5 min)
- Add hasRepliedToThread() helper that checks bridge logs and control-agent
  session logs for evidence of replies
- Integrate check into main heartbeat flow (runs every 10 min by default)
- Add HEARTBEAT_CHECK_UNANSWERED_MENTIONS env var (enabled by default)
- Add UNANSWERED_MENTION_THRESHOLD_MS constant (5 min grace period)
- Update heartbeat config output to show new settings

This addresses the issue where messages forwarded to dead sockets during
restarts are lost because the bridge acks them to Slack before confirming
agent receipt. The heartbeat will now alert the control-agent about
unanswered mentions so it can follow up.

Fixes dropped messages during agent restart windows.
The initial implementation checked for any occurrence of thread_ts in session
logs, which would match both inbound mentions AND outbound replies. This led
to false positives.

Improvements:
1. Add support for slack-reply-log.jsonl (most reliable source)
2. Check for thread_ts in outbound context specifically (JSON keys in curl
   commands or send_to_session calls)
3. Use proper session subdirectory path
4. Add multiple JSON formatting variations to catch escaped quotes
5. Add documentation for future reaction-based checking

This makes the check more precise - only counting actual replies, not just
seeing the mention.
@greptile-apps
Copy link

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Added periodic unanswered Slack mention detection to the heartbeat extension to catch messages lost during agent restarts. The implementation scans bridge logs for app_mention events and checks for replies by searching both bridge and session logs.

Key changes:

  • Scans last 500 lines of bridge log for mentions within the past hour
  • Checks for replies via POST /send or /reply in bridge logs, or thread_ts references in control-agent session logs
  • Reports mentions older than 5 minutes without replies
  • Configurable via HEARTBEAT_CHECK_UNANSWERED_MENTIONS env var (enabled by default)
  • Zero-token when healthy, only prompts LLM when unanswered mentions are found

Issues found:

  • Documentation: New env var HEARTBEAT_CHECK_UNANSWERED_MENTIONS is not documented in CONFIGURATION.md (violates guideline: "When behavior changes, update docs in the same PR")
  • Tests: No tests added for new functions (isUnansweredMentionsCheckEnabled(), checkUnansweredMentions(), hasRepliedToThread()) despite existing comprehensive test suite
  • Comment on line 18 incorrectly states the env var needs to be set to "1" to enable when it's actually enabled by default

Confidence Score: 4/5

  • Safe to merge with minor documentation and testing improvements recommended
  • Implementation is solid with proper error handling and follows existing patterns. Deducted one point for missing documentation in CONFIGURATION.md and lack of tests for new functionality. The code is safe and will work correctly, but doesn't meet the project's documentation and testing standards.
  • Check that CONFIGURATION.md is updated with the new env var before merging

Important Files Changed

Filename Overview
pi/extensions/heartbeat.ts Added unanswered Slack mention detection with log parsing and reply checking. Minor documentation and test coverage issues.

Last reviewed commit: ed2c48d

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

* HEARTBEAT_INTERVAL_MS — interval between heartbeats (default: 600000 = 10 min)
* HEARTBEAT_ENABLED — set to "0" or "false" to disable (default: enabled)
* HEARTBEAT_EXPECTED_SESSIONS — comma-separated session aliases to check (default: "sentry-agent")
* HEARTBEAT_CHECK_UNANSWERED_MENTIONS — set to "1" or "true" to enable (default: enabled)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is misleading - should say "enabled by default, set to 0/false/no to disable"

Suggested change
* HEARTBEAT_CHECK_UNANSWERED_MENTIONS set to "1" or "true" to enable (default: enabled)
* HEARTBEAT_CHECK_UNANSWERED_MENTIONS enabled by default, set to "0" or "false" to disable
Prompt To Fix With AI
This is a comment left during a code review.
Path: pi/extensions/heartbeat.ts
Line: 18

Comment:
Comment is misleading - should say "enabled by default, set to 0/false/no to disable"

```suggestion
 *   HEARTBEAT_CHECK_UNANSWERED_MENTIONS — enabled by default, set to "0" or "false" to disable
```

How can I resolve this? If you propose a fix, please make it concise.

try {
const content = readFileSync(join(controlAgentSessionDir, file), "utf-8");
// Look for evidence of a reply: the thread_ts appearing in a curl /send command
// or in a send_to_session message. The thread_ts in an outbound context
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple string search could have false positives if threadTs appears in error messages or logs without being an actual reply. Consider checking for more specific patterns like the thread being mentioned in a tool call or message send context.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: pi/extensions/heartbeat.ts
Line: 411

Comment:
Simple string search could have false positives if `threadTs` appears in error messages or logs without being an actual reply. Consider checking for more specific patterns like the thread being mentioned in a tool call or message send context.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 413 to 416
const worktreeResults = checkWorktrees();
const stuckTodoResults = checkStuckTodos();
const unansweredMentionResults = isUnansweredMentionsCheckEnabled()
? checkUnansweredMentions()
: [];

const allResults: CheckResult[] = [
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The hasRepliedToThread() function incorrectly identifies inbound messages as replies by using a broad string search for thread_ts in session logs, leading to false negatives.
Severity: HIGH

Suggested Fix

Refine the search logic in hasRepliedToThread() to differentiate between inbound and outbound messages. Instead of a simple string search, parse the session log entries and only check for thread_ts within the context of an outbound action, such as a send_to_session tool call or a curl command to the Slack API. This ensures that only actual replies are counted.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: pi/extensions/heartbeat.ts#L413-L416

Potential issue: The `hasRepliedToThread()` function checks for replies by searching for
the string `"thread_ts":"<threadTs>"` anywhere in the session log file. However, inbound
Slack messages are logged with their metadata, which includes the `thread_ts`. If the
agent processes an inbound mention and logs it but then crashes before sending a reply,
the function will find the `thread_ts` in the logged inbound message and incorrectly
conclude that a reply was sent. This defeats the purpose of the unanswered mention
detection feature, as lost messages will go undetected.

Did we get this right? 👍 / 👎 to inform future reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants