Skip to content

Conversation

@sawka
Copy link
Member

@sawka sawka commented Feb 4, 2026

  • Lots of work on ResyncController, issues with stale data, and connection switching
  • Add+Fix termination messages for shell controller
  • Prune done/detached jobs on a timer
  • Remove circular dependencies from wcore, use blockclose event in both jobcontroller and blockcontroller instead of direct calls
  • Fix concurrency bugs with job termination
  • Send object update events when modifying the block datastructure in jobcontroller

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 4, 2026

Walkthrough

This pull request introduces durability configuration tracking for block shells, automatic job lifecycle management, and refactors signal/process utilities into a centralized package. On the backend, it adds block controller initialization during server startup, implements block-close event handling and automatic job pruning, extends the durable shell controller with a connection name field and forced-restart capability, and consolidates Unix-specific signal and process utilities into a new unixutil package. On the frontend, it updates the durability status UI to distinguish between ineligible (null), configured (true/false), and standard sessions, adds a termConfigedDurable atom to track durability configuration, and refactors terminal restart semantics to accept a durability flag. A new CLI subcommand jobdebug blockattachment is added to query block-job attachments, and job controller cleanup is simplified by delegating lifecycle management to event handlers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.30% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'More Durable Shell Bug Fixes' is vague and generic, using non-descriptive terms that don't convey meaningful information about the specific changes in this comprehensive changeset. Consider a more specific title that highlights the main change, such as 'Add block controller lifecycle management and job pruning' or 'Refactor job termination with pruning and block-close events'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The description is related to the changeset and covers multiple key aspects: ResyncController improvements, termination messages, job pruning, circular dependency removal, concurrency fixes, and object update events.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sawka/jobs-4

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/app/view/term/term-model.ts (1)

1143-1163: ⚠️ Potential issue | 🟡 Minor

Missing fireAndForget wrapper for async menu click handlers.

The restartSessionWithDurability method is async, but the click handlers at lines 1150 and 1160 call it directly without wrapping in fireAndForget(). This is inconsistent with other async click handlers in this file (e.g., lines 316, 700, 1045) and could result in unhandled promise rejections.

🔧 Proposed fix
                     {
                         label: "Restart Session in Standard Mode",
-                        click: () => this.restartSessionWithDurability(false),
+                        click: () => fireAndForget(() => this.restartSessionWithDurability(false)),
                     },
                     {
                         label: "Restart Session in Durable Mode",
-                        click: () => this.restartSessionWithDurability(true),
+                        click: () => fireAndForget(() => this.restartSessionWithDurability(true)),
                     },
🧹 Nitpick comments (6)
pkg/util/utilfn/utilfn.go (1)

592-604: Clarify: duplicates from s2 are preserved in the result.

The function name implies set semantics, but if s2 contains duplicate elements, they will appear multiple times in the result. This differs from CombineStrArrays (which deduplicates) and may surprise callers expecting true set behavior.

If set semantics are intended, consider deduplicating:

♻️ Optional fix to deduplicate results
 func StrSetIntersection(s1 []string, s2 []string) []string {
 	set := make(map[string]bool)
 	for _, s := range s1 {
 		set[s] = true
 	}
 	var rtn []string
 	for _, s := range s2 {
-		if set[s] {
+		if set[s] {
+			delete(set, s) // ensure each element appears only once
 			rtn = append(rtn, s)
 		}
 	}
 	return rtn
 }

If preserving s2's duplicates is intentional (e.g., for counting or ordering purposes), consider renaming to clarify behavior (e.g., StrFilterBySet).

pkg/blockcontroller/shellcontroller.go (1)

600-615: Consider avoiding duplicate exit notifications for cmd blocks.
The pty read loop already appends “process finished …” for cmd blocks; this new muted “[command exited …]” can show twice. If that’s not desired, consider gating one of them.

pkg/blockcontroller/blockcontroller.go (2)

122-129: Log subscription failures to avoid silent BlockClose misses.
Right now, a subscription error would fail silently. Consider logging the error.

Proposed change
 func InitBlockController() {
 	rpcClient := wshclient.GetBareRpcClient()
 	rpcClient.EventListener.On(wps.Event_BlockClose, handleBlockCloseEvent)
-	wshclient.EventSubCommand(rpcClient, wps.SubscriptionRequest{
+	if err := wshclient.EventSubCommand(rpcClient, wps.SubscriptionRequest{
 		Event:     wps.Event_BlockClose,
 		AllScopes: true,
-	}, nil)
+	}, nil); err != nil {
+		log.Printf("[blockcontroller] error subscribing to blockclose: %v", err)
+	}
 }

153-167: Avoid destroying controllers when conn name is still unknown.
When ShellProcConnName is empty (e.g., before the proc starts), the current check can trigger unnecessary destroy/recreate cycles. Consider guarding on a non-empty runtime conn name.

Proposed change
 	if existing != nil {
 		existingStatus := existing.GetRuntimeStatus()
-		if existingStatus.ShellProcConnName != connName {
+		if existingStatus.ShellProcConnName != "" && existingStatus.ShellProcConnName != connName {
 			log.Printf("stopping blockcontroller %s due to conn change (from %q to %q)\n", blockId, existingStatus.ShellProcConnName, connName)
 			DestroyBlockController(blockId)
 			time.Sleep(100 * time.Millisecond)
 			existing = nil
 		}
 	}
pkg/jobcontroller/jobcontroller.go (2)

290-332: Re-validate job state before delete to avoid race window.
A job could reattach between scan and delete; a quick re-check avoids accidental deletes.

Proposed change
 	for _, jobId := range jobsToDelete {
-		err := DeleteJob(ctx, jobId)
+		job, err := wstore.DBGet[*waveobj.Job](ctx, jobId)
+		if err != nil || job == nil {
+			if err != nil {
+				log.Printf("[jobpruner] error reloading job %s: %v", jobId, err)
+			}
+			continue
+		}
+		if job.JobManagerStatus != JobManagerStatus_Done || job.AttachedBlockId != "" {
+			continue
+		}
+		err = DeleteJob(ctx, jobId)
 		if err != nil {
 			log.Printf("[jobpruner] error deleting job %s: %v", jobId, err)
 		}
 	}

394-419: Consider offloading BlockClose termination to avoid blocking the event loop.
Termination can be slow; running it asynchronously keeps the event handler responsive.

Proposed change
 func handleBlockCloseEvent(event *wps.WaveEvent) {
-	ctx, cancelFn := context.WithTimeout(context.Background(), 5*time.Second)
-	defer cancelFn()
 	blockId, ok := event.Data.(string)
 	if !ok {
 		log.Printf("[blockclose] invalid event data type")
 		return
 	}
 
-	jobIds, err := wstore.WithTxRtn(ctx, func(tx *wstore.TxWrap) ([]string, error) {
-		query := `SELECT oid FROM db_job WHERE json_extract(data, '$.attachedblockid') = ?`
-		jobIds := tx.SelectStrings(query, blockId)
-		return jobIds, nil
-	})
-	if err != nil {
-		log.Printf("[block:%s] error looking up jobids: %v", blockId, err)
-		return
-	}
-	if len(jobIds) == 0 {
-		return
-	}
-
-	for _, jobId := range jobIds {
-		TerminateAndDetachJob(ctx, jobId)
-	}
+	go func(blockId string) {
+		ctx, cancelFn := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancelFn()
+		jobIds, err := wstore.WithTxRtn(ctx, func(tx *wstore.TxWrap) ([]string, error) {
+			query := `SELECT oid FROM db_job WHERE json_extract(data, '$.attachedblockid') = ?`
+			jobIds := tx.SelectStrings(query, blockId)
+			return jobIds, nil
+		})
+		if err != nil {
+			log.Printf("[block:%s] error looking up jobids: %v", blockId, err)
+			return
+		}
+		for _, jobId := range jobIds {
+			TerminateAndDetachJob(ctx, jobId)
+		}
+	}(blockId)
 }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants