Add test runner tool to BI Copilot agent #1488

xlight05 merged 4 commits into wso2:release/bi-1.8.x
Conversation
📝 Walkthrough

Adds a TESTING task type and a new `runTests` tool that runs `bal test`.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Agent as AI Agent
    participant Registry as Tool Registry
    participant TestTool as Test Runner Tool
    participant CLI as Ballerina CLI
    participant EventHandler as Event Handler
    participant UI as UI Components
    Agent->>Registry: request runTests tool
    Registry->>TestTool: invoke createTestRunnerTool(...)
    TestTool->>EventHandler: emit tool_call (runTests, toolCallId)
    EventHandler->>UI: dispatch tool_call -> display "Running tests..."
    TestTool->>CLI: execute `bal test` in tempProjectPath
    CLI-->>TestTool: return combined stdout/stderr
    TestTool->>EventHandler: emit tool_result (output, toolCallId)
    EventHandler->>UI: dispatch tool_result -> replace with "Tests completed"
    TestTool-->>Agent: return TestRunResult(output)
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
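The tool_call / tool_result flow in the diagram can be sketched in isolation. This is an illustrative mock, not the extension's actual code: `createTestRunnerTool`, `TestRunResult`, and the event shapes mirror names from the PR, but the runner and event handler are stubbed here.

```typescript
// Minimal sketch of the tool_call / tool_result flow from the diagram.
// Names mirror the PR (TEST_RUNNER_TOOL_NAME, TestRunResult) but the
// runner and event handler are mocked; this is not the extension code.
type TestRunResult = { output: string };
type AgentEvent =
    | { type: "tool_call"; toolName: string; toolCallId: string }
    | { type: "tool_result"; toolName: string; toolCallId: string; toolOutput: TestRunResult };

const TEST_RUNNER_TOOL_NAME = "runTests";

function createTestRunnerTool(
    runTests: () => Promise<TestRunResult>,
    eventHandler: (e: AgentEvent) => void
) {
    return {
        execute: async (_input: Record<string, never>, context?: { toolCallId?: string }): Promise<TestRunResult> => {
            const toolCallId = context?.toolCallId || `fallback-${Date.now()}`;
            // UI reacts to this by showing "Running tests..."
            eventHandler({ type: "tool_call", toolName: TEST_RUNNER_TOOL_NAME, toolCallId });
            const result = await runTests();
            // UI replaces the indicator with "Tests completed"
            eventHandler({ type: "tool_result", toolName: TEST_RUNNER_TOOL_NAME, toolCallId, toolOutput: result });
            return result;
        },
    };
}

// Demo with a fake runner: logs "tool_call,tool_result | 1 passing"
const events: AgentEvent[] = [];
const tool = createTestRunnerTool(async () => ({ output: "1 passing" }), (e) => events.push(e));
tool.execute({}, { toolCallId: "call-1" }).then((r) => {
    console.log(events.map((e) => e.type).join(","), "|", r.output);
});
```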
Actionable comments posted: 2
🧹 Nitpick comments (4)
workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx (1)

131-143: Import `TEST_RUNNER_TOOL_NAME` instead of hardcoding `"runTests"`.

All constants for other tools (`"file_write"`, `"LibrarySearchTool"`, etc.) are hardcoded strings in this file, but `TEST_RUNNER_TOOL_NAME` is already exported from `test-runner.ts` and imported in `prompts.ts` and `tool-registry.ts`. Importing it here would make a tool-name rename safe, with no risk of a silent UI regression.

♻️ Proposed refactor

```diff
+import { TEST_RUNNER_TOOL_NAME } from "../../../../../../../ballerina-extension/src/features/ai/agent/tools/test-runner";

 const FILE_TOOLS = ["file_write", "file_edit", "file_batch_edit"];
 const LIBRARY_TOOLS = ["LibrarySearchTool", "LibraryGetTool", "HealthcareLibraryProviderTool"];

-    const hasTestRunner = names.includes("runTests");
+    const hasTestRunner = names.includes(TEST_RUNNER_TOOL_NAME);
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx` around lines 131-143: replace the hardcoded test-runner name in getGroupCategory with the canonical constant. Import TEST_RUNNER_TOOL_NAME from the module that exports it and use it instead of the literal "runTests" when computing hasTestRunner, so future renames stay safe and consistent with the other tool-name constants already centralized elsewhere.

workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts (2)
105-105: "Introduce a new subtask if needed" could trigger an unintended re-approval flow.

In the task management system (`task-writer.ts`), adding a new task to an in-progress plan is treated as `isPlanRemodification`, which re-triggers the full plan approval dialog (`needsPlanApproval = true`). The phrase may therefore cause the agent to interrupt execution with an unexpected approval request mid-task, or conversely make it hesitant to add useful diagnostic steps.

Consider replacing it with clearer guidance, such as: "Use ${DIAGNOSTICS_TOOL_NAME} iteratively until all errors are resolved before marking the task as completed."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts` at line 105: the prompt currently instructs "Introduce a new subtask if needed," which can unintentionally trigger a remodification flow in task-writer.ts (isPlanRemodification → needsPlanApproval). Update the text around DIAGNOSTICS_TOOL_NAME to instead instruct iterative use until no compilation errors remain (for example: "Use ${DIAGNOSTICS_TOOL_NAME} iteratively until all errors are resolved before marking the task as completed."), so the agent performs repeated diagnostics/fixes without adding new tasks that cause isPlanRemodification or re-open the plan approval dialog.
105-105: "Introduce a new subtask if needed" can unexpectedly trigger plan re-approval.

In `task-writer.ts`, any call that changes the task count compared to the existing plan is detected as `isPlanRemodification` (via `allTasks.length !== existingPlan.tasks.length`), which sets `needsPlanApproval = true` and re-presents the full plan approval dialog mid-execution. Instructing the agent to add subtasks here could interrupt the user's approval flow unexpectedly.

Consider replacing it with iterative guidance instead:

```diff
- - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
+ - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors. Fix any errors and re-run ${DIAGNOSTICS_TOOL_NAME} until compilation is clean.
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts` at line 105: update the prompt text that currently reads "Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed." to avoid instructing the agent to add subtasks (which triggers isPlanRemodification detection in task-writer.ts via allTasks.length !== existingPlan.tasks.length and forces needsPlanApproval). Instead, instruct iterative in-place fixes and repeated diagnostic checks (e.g., "run ${DIAGNOSTICS_TOOL_NAME}, fix errors iteratively, and re-run diagnostics until clean; if a large scope of work is required, propose a plan change to the user rather than auto-creating subtasks"). Reference the DIAGNOSTICS_TOOL_NAME token so behavior is unchanged except for removing the automatic subtask-creation guidance.

workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts (1)
61-62: `context.toolCallId` is supported by the AI SDK; the type annotation can be tightened.

As of AI SDK 4.1, the `execute` function's second parameter provides `toolCallId` for tracking specific executions, `messages` for the full conversation history, and `abortSignal` for canceling long-running operations. The usage of `context?.toolCallId` is correct and the fallback is good defensive coding, but the parameter can be typed more precisely to reflect the SDK's actual contract (non-optional second arg, non-optional `toolCallId`):

```diff
-execute: async (_input: Record<string, never>, context?: { toolCallId?: string }): Promise<TestRunResult> => {
+execute: async (_input: Record<string, never>, context: { toolCallId: string; abortSignal: AbortSignal }): Promise<TestRunResult> => {

-    const toolCallId = context?.toolCallId || `fallback-${Date.now()}`;
+    const toolCallId = context.toolCallId;
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts` around lines 61-62: tighten the execute function's second parameter to the AI SDK 4.1 shape instead of leaving it optional. Accept a non-optional context object (e.g., context: { toolCallId: string; messages?: any; abortSignal?: AbortSignal }) so toolCallId is non-optional, reference context.toolCallId (the existing fallback can be kept for extra safety), and ensure TestRunResult and return types remain unchanged; update the execute declaration and its usages inside the function accordingly.
```typescript
async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
    return new Promise((resolve) => {
        const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
        const command = `${balCmd} test`;

        console.log(`[TestRunner] Running: ${command} in ${cwd}`);

        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
            const output = [stdout, stderr].filter(Boolean).join('\n').trim();

            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}`);
            resolve({ output });
        });
    });
}
```
Missing timeout and maxBuffer on child_process.exec — risk of infinite hang and silent output truncation.
bal test can block indefinitely if a test awaits a network resource, has an infinite loop, or requires interactive input. Without a timeout option, the Promise never settles and the agent loop is frozen with no recovery path until the user stops the entire generation. Additionally, the default maxBuffer (1 MB) can be silently exceeded for verbose test suites, and any OS-level exec error (err.message, e.g. "stdout maxBuffer exceeded", command-not-found) is never surfaced to the agent because only stdout/stderr are joined into output.
🐛 Proposed fix — add timeout, explicit maxBuffer, and exec-error surfacing

```diff
 async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
     return new Promise((resolve) => {
         const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
         const command = `${balCmd} test`;
+        const TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes
+        const MAX_BUFFER = 10 * 1024 * 1024; // 10 MB

         console.log(`[TestRunner] Running: ${command} in ${cwd}`);

-        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
-            const output = [stdout, stderr].filter(Boolean).join('\n').trim();
-
-            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}`);
+        child_process.exec(command, { cwd, timeout: TIMEOUT_MS, maxBuffer: MAX_BUFFER }, (err, stdout, stderr) => {
+            const parts = [stdout, stderr].filter(Boolean);
+            if (err) {
+                if (err.killed) {
+                    parts.push(`\nError: 'bal test' timed out after ${TIMEOUT_MS / 1000} seconds.`);
+                } else if (!stdout && !stderr) {
+                    // OS-level failure (e.g. command not found, maxBuffer exceeded)
+                    parts.push(`\nError: ${err.message}`);
+                }
+            }
+            const output = parts.join('\n').trim();
+            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}, killed: ${err?.killed ?? false}`);
             resolve({ output });
         });
     });
 }
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test and benchmark the code to ensure it meets the requirements.
```typescript
async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
    return new Promise((resolve) => {
        const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
        const command = `${balCmd} test`;
        const TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes
        const MAX_BUFFER = 10 * 1024 * 1024; // 10 MB
        console.log(`[TestRunner] Running: ${command} in ${cwd}`);
        child_process.exec(command, { cwd, timeout: TIMEOUT_MS, maxBuffer: MAX_BUFFER }, (err, stdout, stderr) => {
            const parts = [stdout, stderr].filter(Boolean);
            if (err) {
                if (err.killed) {
                    parts.push(`\nError: 'bal test' timed out after ${TIMEOUT_MS / 1000} seconds.`);
                } else if (!stdout && !stderr) {
                    // OS-level failure (e.g. command not found, maxBuffer exceeded)
                    parts.push(`\nError: ${err.message}`);
                }
            }
            const output = parts.join('\n').trim();
            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}, killed: ${err?.killed ?? false}`);
            resolve({ output });
        });
    });
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 87 - 100, In runBallerinaTests, child_process.exec is called
without timeout/maxBuffer and ignores exec errors; update the exec call in
runBallerinaTests to pass an options object including a reasonable timeout
(e.g., milliseconds) and an increased maxBuffer, and change the Promise
resolution to always settle with either resolved output or a rejected/errored
result that includes exec error details (err.message/err.code) so command, cwd
and err are surfaced rather than silently hanging or truncating; ensure the
output returned still concatenates stdout/stderr but also appends or includes
err.message and err.code when err is present.
```typescript
        console.log(`[TestRunner] Running: ${command} in ${cwd}`);

        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
```

We need to see if we can utilize inbuilt VS Code tools, but we can do that in a separate PR.
```typescript
        return { running: "Generating connector...", done: "Connector ready" };
    }
    if (hasTestRunner) {
        return { running: "Running tests...", done: "Tests completed" };
```

Shouldn't we display test pass/failure status here?

For the initial implementation it's handled by the agent: it will give a summary of the test run and what was tested.
We can improve this further on the tool side in the next PR.
🧹 Nitpick comments (2)
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts (2)

61-62: Consider propagating errors from `runBallerinaTests` rather than always resolving.

The `execute` function unconditionally awaits `runBallerinaTests` and returns its result, but `runBallerinaTests` always resolves (never rejects), even on catastrophic failures like a missing `bal` command. If `extension.ballerinaExtInstance.getBallerinaCmd()` throws synchronously (e.g., extension not initialized), the promise constructor won't catch it and the rejection will be unhandled. Wrapping the body in a try/catch and emitting a meaningful `tool_result` on failure would improve resilience.

♻️ Proposed fix — add error handling in execute and runBallerinaTests

```diff
 execute: async (_input: Record<string, never>, context?: { toolCallId?: string }): Promise<TestRunResult> => {
     const toolCallId = context?.toolCallId || `fallback-${Date.now()}`;
     eventHandler({
         type: "tool_call",
         toolName: TEST_RUNNER_TOOL_NAME,
         toolCallId,
     });
-    const result = await runBallerinaTests(tempProjectPath);
+    let result: TestRunResult;
+    try {
+        result = await runBallerinaTests(tempProjectPath);
+    } catch (e) {
+        result = { output: `Error running tests: ${e instanceof Error ? e.message : String(e)}` };
+    }
     eventHandler({ type: "tool_result", toolName: TEST_RUNNER_TOOL_NAME, toolCallId, toolOutput: result });
     return result;
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts` around lines 61-62: wrap the body of execute in a try/catch so any synchronous throws (e.g., from extension.ballerinaExtInstance.getBallerinaCmd()) are caught, and on error return a proper TestRunResult/tool_result indicating failure (including the error message and toolCallId) rather than leaving the promise unhandled. Also update runBallerinaTests to reject/throw on catastrophic failures (missing bal, spawn errors) instead of always resolving, so callers like execute can catch and convert those into meaningful tool_result error responses.
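The failure mode can be reproduced in miniature. In this standalone sketch, `getBallerinaCmd` is a stand-in that throws synchronously (e.g., extension not initialized); the promise executor turns the throw into a rejection, and the try/catch in `execute` converts it into a result the agent can actually see. All names here are illustrative simplifications of the PR's code.

```typescript
// Stand-in for extension.ballerinaExtInstance.getBallerinaCmd() failing.
function getBallerinaCmd(): string {
    throw new Error("Ballerina extension not initialized");
}

function runBallerinaTests(): Promise<{ output: string }> {
    return new Promise((resolve) => {
        // Throws synchronously; the executor converts it into a rejection.
        const balCmd = getBallerinaCmd();
        resolve({ output: `${balCmd} test` });
    });
}

// Without the try/catch below, the rejection would escape to the agent
// loop unhandled; with it, the failure becomes a readable tool result.
async function execute(): Promise<{ output: string }> {
    try {
        return await runBallerinaTests();
    } catch (e) {
        return { output: `Error running tests: ${e instanceof Error ? e.message : String(e)}` };
    }
}

execute().then((r) => console.log(r.output)); // Error running tests: Ballerina extension not initialized
```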
84-86: Nit: the JSDoc says "parses the output", but the function returns raw output.

The function simply concatenates stdout and stderr without any parsing. The description should match the actual behavior.

📝 Proposed fix

```diff
 /**
- * Executes `bal test` in the given directory and parses the output.
+ * Executes `bal test` in the given directory and returns the raw output.
  */
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts` around lines 84-86: update the JSDoc to state that the function returns the raw combined stdout and stderr output (a single concatenated string) rather than parsing results, and optionally add a TODO noting that structured parsing could be implemented later (e.g., via a parseTestOutput helper), so callers know to handle raw text.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`:
- Around line 105-106: The test-run trigger in the agent prompt currently fires
whenever compilation is clean and tests exist, which causes tests to run after
any task; update the prompt text around the DIAGNOSTICS_TOOL_NAME guidance to
require the current task be the testing task (e.g., add an explicit condition
like "only if the current task's type is 'testing'") before instructing to run
tests in plan mode, and make the same explicit conditional change in the
edit-mode prompt block where the duplicate wording appears (refer to the prompt
strings containing DIAGNOSTICS_TOOL_NAME and the testing-related trigger to
locate and update both spots).
- Around line 119-124: Update the "## Test Runner" prompt block (the Test Runner
section that uses ${TEST_RUNNER_TOOL_NAME}) to include an explicit retry cap and
an escape hatch: add a maximum attempts parameter (e.g., "max_attempts: 3") and
change step 4 to say "if failures remain after max_attempts, stop retrying,
report remaining failing tests with one-line diagnostics, and recommend manual
investigation or filing an issue"; ensure the prompt instructs the agent to
increment an attempt counter each run and to stop once max_attempts is reached
while summarizing unresolved failures and next steps.
```
- Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
- Once compilation is clean and the project contains test cases, run the tests.
```
Test execution may fire prematurely during non-testing tasks.
The trigger condition "Once compilation is clean and the project contains test cases, run the tests" fires after every task in plan mode — not just after the testing task. If the user's project already contains test files (iterative workflow, existing test suite, partial agent run), this will invoke the test runner after connections_init or implementation tasks as well, running a potentially stale/incomplete test suite mid-implementation.
The same condition repeats on line 141 in edit mode, where there is no testing task type to act as a natural gate.
Consider restricting this trigger to the testing task type (plan mode) or making it explicitly conditional on the current task type, e.g.:
💡 Suggested wording

```diff
- - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
- - Once compilation is clean and the project contains test cases, run the tests.
+ - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
+ - Once compilation is clean and the current task is a 'testing' task, run the tests.
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test and benchmark the code to ensure it meets the requirements.

```
- Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
- Once compilation is clean and the current task is a 'testing' task, run the tests.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`
around lines 105 - 106, The test-run trigger in the agent prompt currently fires
whenever compilation is clean and tests exist, which causes tests to run after
any task; update the prompt text around the DIAGNOSTICS_TOOL_NAME guidance to
require the current task be the testing task (e.g., add an explicit condition
like "only if the current task's type is 'testing'") before instructing to run
tests in plan mode, and make the same explicit conditional change in the
edit-mode prompt block where the duplicate wording appears (refer to the prompt
strings containing DIAGNOSTICS_TOOL_NAME and the testing-related trigger to
locate and update both spots).
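The gating the comment proposes can be sketched in TypeScript; `TaskContext` and `shouldRunTests` are illustrative names for this sketch, not identifiers from the PR:

```typescript
// Illustrative sketch of gating test execution on the current task type.
// TaskContext and shouldRunTests are hypothetical names, not PR identifiers.
type TaskType = "connections_init" | "implementation" | "testing";

interface TaskContext {
    taskType: TaskType;
    compilationClean: boolean;
    projectHasTests: boolean;
}

// Run tests only when compilation is clean AND the current task is a
// 'testing' task, not merely because test files happen to exist.
function shouldRunTests(ctx: TaskContext): boolean {
    return ctx.compilationClean && ctx.taskType === "testing";
}
```

With a predicate like this, a pre-existing test suite no longer triggers a run after `connections_init` or implementation tasks.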
```
## Test Runner
When running tests, follow these steps:
1. Before running, briefly tell the user what is being tested.
2. Use ${TEST_RUNNER_TOOL_NAME} to run the test suite.
3. After the run, give a short summary: how many tests passed/failed.
4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.
```
Unbounded fix-and-rerun loop — add a retry cap.
Step 4 instructs the agent to "fix them, and re-run" with no iteration limit. For persistently failing tests (environment-dependent failures, flaky assertions, genuinely incorrect logic), the agent can loop indefinitely, burning time, tokens, and LLM cost with no exit path.
Add an explicit maximum-attempt guard and an escape hatch:
💡 Suggested wording

```diff
 ## Test Runner
 When running tests, follow these steps:
 1. Before running, briefly tell the user what is being tested.
 2. Use ${TEST_RUNNER_TOOL_NAME} to run the test suite.
 3. After the run, give a short summary: how many tests passed/failed.
-4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.
+4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.
+   Repeat at most 3 times. If tests still fail after 3 attempts, report the remaining failures to the user and stop.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`
around lines 119 - 124, Update the "## Test Runner" prompt block (the Test
Runner section that uses ${TEST_RUNNER_TOOL_NAME}) to include an explicit retry
cap and an escape hatch: add a maximum attempts parameter (e.g., "max_attempts:
3") and change step 4 to say "if failures remain after max_attempts, stop
retrying, report remaining failing tests with one-line diagnostics, and
recommend manual investigation or filing an issue"; ensure the prompt instructs
the agent to increment an attempt counter each run and to stop once max_attempts
is reached while summarizing unresolved failures and next steps.
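The bounded fix-and-rerun loop the review asks for can be sketched as follows; `runTestsOnce` and `fixFailures` stand in for the agent's real tool calls and are hypothetical names:

```typescript
// Illustrative sketch of a retry-capped test loop. runTestsOnce and
// fixFailures are stand-ins for the agent's actual tool invocations.
const MAX_ATTEMPTS = 3;

interface TestRun {
    failed: string[];
}

function runWithRetryCap(
    runTestsOnce: () => TestRun,
    fixFailures: (failing: string[]) => void,
): { attempts: number; remainingFailures: string[] } {
    let attempts = 0;
    let result: TestRun = { failed: [] };
    while (attempts < MAX_ATTEMPTS) {
        attempts++;
        result = runTestsOnce();
        if (result.failed.length === 0) {
            return { attempts, remainingFailures: [] };
        }
        if (attempts < MAX_ATTEMPTS) {
            fixFailures(result.failed); // attempt a fix before re-running
        }
    }
    // Escape hatch: stop retrying and surface the unresolved failures.
    return { attempts, remainingFailures: result.failed };
}
```

A persistently failing suite exits after exactly `MAX_ATTEMPTS` runs with the failures reported, while a suite fixed mid-loop exits early.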
Resolves: wso2/product-ballerina-integrator#2459
- Adds a `runTests` tool that executes `bal test` in the temp project directory and returns raw output to the agent
- Adds a `testing` task type to the `TaskTypes` enum and the `TaskWrite` tool schema so the agent can plan test writing as a dedicated task

Summary by CodeRabbit
New Features
Chores
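The `runTests` behavior described in this PR (execute `bal test` in the temp project path and return the raw output) can be sketched as below; `runTestsCommand` is an illustrative helper for this sketch, not the PR's actual implementation:

```typescript
// Minimal sketch of a runTests-style tool: spawn a test command in a
// project directory and return combined stdout/stderr as raw text.
// runTestsCommand is an illustrative helper, not the PR's implementation.
import { spawnSync } from "node:child_process";

function runTestsCommand(cmd: string, args: string[], cwd: string): string {
    const result = spawnSync(cmd, args, { cwd, encoding: "utf-8" });
    // Return the raw output so the agent can summarize pass/fail counts itself.
    return `${result.stdout ?? ""}${result.stderr ?? ""}`;
}

// For Ballerina, the tool would invoke `bal test` in the temp project path:
const runBallerinaTests = (tempProjectPath: string): string =>
    runTestsCommand("bal", ["test"], tempProjectPath);
```

Returning the unparsed CLI output keeps the tool simple and leaves interpretation (pass/fail summary, per-test diagnostics) to the agent, matching the prompt guidance above.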