
Add test runner tool to BI Copilot agent #1488

Merged
xlight05 merged 4 commits into wso2:release/bi-1.8.x from RNViththagan:copilot-agent-test-tool
Feb 19, 2026

Conversation


@RNViththagan RNViththagan commented Feb 18, 2026

Resolves: wso2/product-ballerina-integrator#2459

  • Add runTests tool that executes bal test in the temp project directory and returns raw output to the agent
  • Add testing task type to TaskTypes enum and TaskWrite tool schema so the agent can plan test writing as a dedicated task
  • Update plan mode to always include a testing task after implementation tasks unless user opts out
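The enum and plan-mode changes described above can be illustrated with a minimal TypeScript sketch. The TaskTypes name and the "testing" value come from this PR; buildPlan and its task shape are hypothetical stand-ins for the actual planner, not the extension's real code:

```typescript
// Illustrative sketch only: TaskTypes/"testing" are from the PR description,
// while buildPlan and its task shape are invented for this example.
enum TaskTypes {
    IMPLEMENTATION = "implementation",
    TESTING = "testing", // new: lets the agent plan test writing as a dedicated task
}

interface PlannedTask {
    type: TaskTypes;
    name: string;
}

// Plan-mode rule sketched here: append a testing task after the
// implementation tasks unless the user opts out.
function buildPlan(implTaskNames: string[], includeTests: boolean): PlannedTask[] {
    const tasks: PlannedTask[] = implTaskNames.map((name) => ({
        type: TaskTypes.IMPLEMENTATION,
        name,
    }));
    if (includeTests) {
        tasks.push({ type: TaskTypes.TESTING, name: "Write and run tests" });
    }
    return tasks;
}
```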

Summary by CodeRabbit

  • New Features

    • AI agent can run project tests as part of its workflow and will produce test run output.
    • In-chat status updates display when tests start and when they complete.
    • Task planning and validation include explicit testing steps and guidance; prompts now advise running and summarizing tests.
  • Chores

    • Test runner integrated into the agent's toolset and registry; configuration prompts for test setup streamlined.


coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

Adds a TESTING task type, a new runTests tool that runs bal test in a temp project, prompt updates to include testing steps, registers the tool, and UI changes to show "Running tests..." and "Tests completed" markers.

Changes

Cohort / File(s) Summary
Enum & Task Schema
workspaces/ballerina/ballerina-core/src/state-machine-types.ts, workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/task-writer.ts
Added TESTING/"testing" to TaskTypes enum and task input schema; updated examples and descriptions to document testing semantics.
Test Runner Tool
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts
New tool: createTestRunnerTool(tempProjectPath, eventHandler), TEST_RUNNER_TOOL_NAME, and TestRunResult; executes bal test, captures combined stdout/stderr, and emits tool_call/tool_result events.
Tool Registration & Prompts
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tool-registry.ts, workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts
Registered the test runner in the tool registry and updated agent prompts/plans to include testing steps, when to run tests, and how to report results.
UI: Chat & Grouping
workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/AIChat/index.tsx, workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx
On runTests tool_call insert "Running tests..." marker; on tool_result replace with "Tests completed"; added grouping labels for test-run lifecycle.
Config & Help Text
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/config-collector.ts
Reworked ConfigCollector description and examples to clarify COLLECT/CHECK modes and testing-related variable naming guidance.

Sequence Diagram

sequenceDiagram
    participant Agent as AI Agent
    participant Registry as Tool Registry
    participant TestTool as Test Runner Tool
    participant CLI as Ballerina CLI
    participant EventHandler as Event Handler
    participant UI as UI Components

    Agent->>Registry: request runTests tool
    Registry->>TestTool: invoke createTestRunnerTool(...)
    TestTool->>EventHandler: emit tool_call (runTests, toolCallId)
    EventHandler->>UI: dispatch tool_call -> display "Running tests..."
    TestTool->>CLI: execute `bal test` in tempProjectPath
    CLI-->>TestTool: return combined stdout/stderr
    TestTool->>EventHandler: emit tool_result (output, toolCallId)
    EventHandler->>UI: dispatch tool_result -> replace with "Tests completed"
    TestTool-->>Agent: return TestRunResult(output)
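The tool_call/tool_result lifecycle in the diagram can be sketched in TypeScript. The event names and the createTestRunnerTool/TestRunResult names mirror the walkthrough; the body below is an illustrative stand-in (a pluggable runTests callback instead of the real child_process call):

```typescript
// Sketch of the event lifecycle only; not the extension's implementation.
// The runTests parameter stands in for the real `bal test` execution.
type AgentEvent =
    | { type: "tool_call"; toolName: string; toolCallId: string }
    | { type: "tool_result"; toolName: string; toolCallId: string; toolOutput: { output: string } };

interface TestRunResult {
    output: string;
}

function createTestRunnerTool(
    runTests: () => Promise<string>,
    eventHandler: (e: AgentEvent) => void
) {
    return {
        name: "runTests",
        execute: async (): Promise<TestRunResult> => {
            const toolCallId = `call-${Date.now()}`;
            // Emitted before execution; the UI swaps this for "Running tests...".
            eventHandler({ type: "tool_call", toolName: "runTests", toolCallId });
            const output = await runTests();
            // Emitted after execution; the UI replaces the marker with "Tests completed".
            eventHandler({ type: "tool_result", toolName: "runTests", toolCallId, toolOutput: { output } });
            return { output };
        },
    };
}
```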

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped into code to start the quest,
Pressed "runTests" and left the rest,
"Running tests..." I softly hum,
Then "Tests completed" — victory drum! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name | Status | Explanation | Resolution
Description check | ⚠️ Warning | The description is missing most required sections from the template (Goals, Approach, UI/Icons, User Stories, Release Notes, Documentation, etc.) but includes the core information linking to the issue and summarizing the changes. | Complete the PR description template by adding sections for Goals, Approach, Release Notes, Documentation impact, Testing details, and other required sections.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check | ❓ Inconclusive | Changes to prompts, config-collector tool descriptions, and UI components are aligned with the testing workflow but slightly expand beyond the core runTests tool requirement. | Clarify whether updates to config-collector descriptions and prompt wording changes are essential to the runTests tool implementation or represent scope creep.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title directly and clearly describes the main change: adding a test runner tool to the BI Copilot agent.
Linked Issues check | ✅ Passed | All objectives from issue #2459 are met: a runTests tool executes bal test and returns results, the agent can identify failing tests, and task planning includes testing phases for iterative fixes.


@RNViththagan RNViththagan force-pushed the copilot-agent-test-tool branch from 61c1b60 to f2286cb on February 18, 2026 13:14
@RNViththagan RNViththagan marked this pull request as ready for review February 18, 2026 13:14
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx (1)

131-143: Import TEST_RUNNER_TOOL_NAME instead of hardcoding "runTests".

All constants for other tools ("file_write", "LibrarySearchTool", etc.) are hardcoded strings in this file, but TEST_RUNNER_TOOL_NAME is already exported from test-runner.ts and imported in prompts.ts and tool-registry.ts. Importing it here would make a tool name rename safe without a silent UI regression.

♻️ Proposed refactor
+import { TEST_RUNNER_TOOL_NAME } from "../../../../../../../ballerina-extension/src/features/ai/agent/tools/test-runner";
 
 const FILE_TOOLS = ["file_write", "file_edit", "file_batch_edit"];
 const LIBRARY_TOOLS = ["LibrarySearchTool", "LibraryGetTool", "HealthcareLibraryProviderTool"];
 
-    const hasTestRunner = names.includes("runTests");
+    const hasTestRunner = names.includes(TEST_RUNNER_TOOL_NAME);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx`
around lines 131 - 143, Replace the hardcoded test-runner name in
getGroupCategory with the canonical constant: import TEST_RUNNER_TOOL_NAME from
the module that exports it and use TEST_RUNNER_TOOL_NAME instead of the literal
"runTests" when computing hasTestRunner in ToolCallGroupSegment (function
getGroupCategory) so future renames are safe and consistent with other tool-name
constants already centralized elsewhere.
workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts (2)

105-105: "Introduce a new subtask if needed" could trigger an unintended re-approval flow.

In the task management system (task-writer.ts), adding a new task to an in-progress plan is treated as isPlanRemodification, which re-triggers the full plan approval dialog (needsPlanApproval = true). This means the phrase may cause the agent to interrupt execution with an unexpected approval request mid-task, or conversely make it hesitant to add useful diagnostic steps.

Consider replacing with clearer guidance, such as: "Use ${DIAGNOSTICS_TOOL_NAME} iteratively until all errors are resolved before marking the task as completed."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts` at
line 105, The prompt in prompts.ts currently instructs "Introduce a new subtask
if needed," which can unintentionally trigger a remodification flow in
task-writer.ts (isPlanRemodification → needsPlanApproval); update the text
around DIAGNOSTICS_TOOL_NAME to instead instruct iterative use until no
compilation errors remain (for example: "Use ${DIAGNOSTICS_TOOL_NAME}
iteratively until all errors are resolved before marking the task as
completed."), so the agent performs repeated diagnostics/fixes without adding
new tasks that cause isPlanRemodification or re-opening the plan approval
dialog.

105-105: "Introduce a new subtask if needed" can unexpectedly trigger plan re-approval.

In task-writer.ts, any call that changes the task count compared to the existing plan is detected as isPlanRemodification (via allTasks.length !== existingPlan.tasks.length), which causes needsPlanApproval = true and re-presents the full plan approval dialog mid-execution. Instructing the agent to add subtasks here could interrupt the user's approval flow unexpectedly.

Consider replacing with iterative guidance instead:

-   - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
+   - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors. Fix any errors and re-run ${DIAGNOSTICS_TOOL_NAME} until compilation is clean.
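The length-based remodification check this comment describes can be sketched as follows. The names isPlanRemodification and needsPlanApproval come from the review text's account of task-writer.ts; the surrounding types are invented for the illustration:

```typescript
// Hedged sketch of the detection logic described in the review comment;
// Plan/task shapes are assumptions, not task-writer.ts's actual types.
interface Plan {
    tasks: { name: string }[];
}

function needsPlanApproval(allTasks: { name: string }[], existingPlan?: Plan): boolean {
    if (!existingPlan) {
        return true; // the first plan always requires approval
    }
    // Any change in task count vs. the existing plan is treated as a
    // remodification, which re-triggers the full plan approval dialog.
    const isPlanRemodification = allTasks.length !== existingPlan.tasks.length;
    return isPlanRemodification;
}
```

This is why instructing the agent to add subtasks mid-execution can surface an unexpected approval dialog: even one extra diagnostic task changes the count.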
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts` at
line 105, Update the prompt text that currently reads "Before marking the task
as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and
fix them. Introduce a new subtask if needed." to avoid instructing the agent to
add subtasks (which triggers isPlanRemodification detection in task-writer.ts
via allTasks.length !== existingPlan.tasks.length and forces needsPlanApproval).
Instead, instruct iterative in-place fixes and repeated diagnostic checks (e.g.,
"run ${DIAGNOSTICS_TOOL_NAME}, fix errors iteratively, and re-run diagnostics
until clean; if a large scope of work is required, propose a plan change to the
user rather than auto-creating subtasks"). Reference the DIAGNOSTICS_TOOL_NAME
token in the updated text so behavior is unchanged except for removing automatic
subtask creation guidance.
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts (1)

61-62: context.toolCallId is supported by the AI SDK — type annotation can be tightened.

As of AI SDK 4.1, the execute function's second parameter provides toolCallId for tracking specific executions, messages for full conversation history, and abortSignal for canceling long-running operations. The usage of context?.toolCallId is correct and the fallback is good defensive coding, but the parameter can be typed more precisely to reflect the SDK's actual contract (non-optional second arg, non-optional toolCallId):

-execute: async (_input: Record<string, never>, context?: { toolCallId?: string }): Promise<TestRunResult> => {
+execute: async (_input: Record<string, never>, context: { toolCallId: string; abortSignal: AbortSignal }): Promise<TestRunResult> => {
-    const toolCallId = context?.toolCallId || `fallback-${Date.now()}`;
+    const toolCallId = context.toolCallId;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 61 - 62, The execute function's second parameter should be
tightened to the AI SDK 4.1 shape instead of being optional; update the
signature of execute to accept a non-optional context object (e.g., context: {
toolCallId: string; messages?: any; abortSignal?: AbortSignal }) so toolCallId
is non-optional, then reference context.toolCallId (you can keep the existing
fallback for extra safety) and ensure TestRunResult and return types remain
unchanged; update the execute declaration and its usages inside the function
accordingly (look for execute and the local toolCallId variable).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`:
- Around line 87-100: In runBallerinaTests, child_process.exec is called without
timeout/maxBuffer and ignores exec errors; update the exec call in
runBallerinaTests to pass an options object including a reasonable timeout
(e.g., milliseconds) and an increased maxBuffer, and change the Promise
resolution to always settle with either resolved output or a rejected/errored
result that includes exec error details (err.message/err.code) so command, cwd
and err are surfaced rather than silently hanging or truncating; ensure the
output returned still concatenates stdout/stderr but also appends or includes
err.message and err.code when err is present.
- Around line 87-100: The runBallerinaTests function currently calls
child_process.exec(command, { cwd }, ...) without timeout/maxBuffer and swallows
exec errors; update the exec invocation in runBallerinaTests to pass options {
cwd, timeout: <reasonable-ms>, maxBuffer: <larger-bytes> } (e.g., 60_000 ms and
e.g. 10*1024*1024 bytes) and include err details in the resolved TestRunResult
output (or reject on fatal exec errors) so that stdout/stderr truncation and
OS-level errors like command-not-found are surfaced; ensure you reference
balCmd/command, child_process.exec callback, and the TestRunResult object when
adding the timeout/maxBuffer and appending err.message/err.code into the
returned output.

---

Nitpick comments:
In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`:
- Line 105: The prompt in prompts.ts currently instructs "Introduce a new
subtask if needed," which can unintentionally trigger a remodification flow in
task-writer.ts (isPlanRemodification → needsPlanApproval); update the text
around DIAGNOSTICS_TOOL_NAME to instead instruct iterative use until no
compilation errors remain (for example: "Use ${DIAGNOSTICS_TOOL_NAME}
iteratively until all errors are resolved before marking the task as
completed."), so the agent performs repeated diagnostics/fixes without adding
new tasks that cause isPlanRemodification or re-opening the plan approval
dialog.
- Line 105: Update the prompt text that currently reads "Before marking the task
as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and
fix them. Introduce a new subtask if needed." to avoid instructing the agent to
add subtasks (which triggers isPlanRemodification detection in task-writer.ts
via allTasks.length !== existingPlan.tasks.length and forces needsPlanApproval).
Instead, instruct iterative in-place fixes and repeated diagnostic checks (e.g.,
"run ${DIAGNOSTICS_TOOL_NAME}, fix errors iteratively, and re-run diagnostics
until clean; if a large scope of work is required, propose a plan change to the
user rather than auto-creating subtasks"). Reference the DIAGNOSTICS_TOOL_NAME
token in the updated text so behavior is unchanged except for removing automatic
subtask creation guidance.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`:
- Around line 61-62: The execute function's second parameter should be tightened
to the AI SDK 4.1 shape instead of being optional; update the signature of
execute to accept a non-optional context object (e.g., context: { toolCallId:
string; messages?: any; abortSignal?: AbortSignal }) so toolCallId is
non-optional, then reference context.toolCallId (you can keep the existing
fallback for extra safety) and ensure TestRunResult and return types remain
unchanged; update the execute declaration and its usages inside the function
accordingly (look for execute and the local toolCallId variable).

In
`@workspaces/ballerina/ballerina-visualizer/src/views/AIPanel/components/ToolCallGroupSegment.tsx`:
- Around line 131-143: Replace the hardcoded test-runner name in
getGroupCategory with the canonical constant: import TEST_RUNNER_TOOL_NAME from
the module that exports it and use TEST_RUNNER_TOOL_NAME instead of the literal
"runTests" when computing hasTestRunner in ToolCallGroupSegment (function
getGroupCategory) so future renames are safe and consistent with other tool-name
constants already centralized elsewhere.

Comment on lines +87 to +100
async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
    return new Promise((resolve) => {
        const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
        const command = `${balCmd} test`;

        console.log(`[TestRunner] Running: ${command} in ${cwd}`);

        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
            const output = [stdout, stderr].filter(Boolean).join('\n').trim();

            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}`);
            resolve({ output });
        });
    });

⚠️ Potential issue | 🟠 Major

Missing timeout and maxBuffer on child_process.exec — risk of infinite hang and silent output truncation.

bal test can block indefinitely if a test awaits a network resource, has an infinite loop, or requires interactive input. Without a timeout option, the Promise never settles and the agent loop is frozen with no recovery path until the user stops the entire generation. Additionally, the default maxBuffer (1 MB) can be silently exceeded for verbose test suites, and any OS-level exec error (err.message, e.g. "stdout maxBuffer exceeded", command-not-found) is never surfaced to the agent because only stdout/stderr are joined into output.

🐛 Proposed fix — add timeout, explicit maxBuffer, and exec-error surfacing
 async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
     return new Promise((resolve) => {
         const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
         const command = `${balCmd} test`;
+        const TIMEOUT_MS = 5 * 60 * 1000;  // 5 minutes
+        const MAX_BUFFER  = 10 * 1024 * 1024; // 10 MB

         console.log(`[TestRunner] Running: ${command} in ${cwd}`);

-        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
-            const output = [stdout, stderr].filter(Boolean).join('\n').trim();
-
-            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}`);
+        child_process.exec(command, { cwd, timeout: TIMEOUT_MS, maxBuffer: MAX_BUFFER }, (err, stdout, stderr) => {
+            const parts = [stdout, stderr].filter(Boolean);
+            if (err) {
+                if (err.killed) {
+                    parts.push(`\nError: 'bal test' timed out after ${TIMEOUT_MS / 1000} seconds.`);
+                } else if (!stdout && !stderr) {
+                    // OS-level failure (e.g. command not found, maxBuffer exceeded)
+                    parts.push(`\nError: ${err.message}`);
+                }
+            }
+            const output = parts.join('\n').trim();
+            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}, killed: ${err?.killed ?? false}`);
             resolve({ output });
         });
     });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 87 - 100, In runBallerinaTests, child_process.exec is called
without timeout/maxBuffer and ignores exec errors; update the exec call in
runBallerinaTests to pass an options object including a reasonable timeout
(e.g., milliseconds) and an increased maxBuffer, and change the Promise
resolution to always settle with either resolved output or a rejected/errored
result that includes exec error details (err.message/err.code) so command, cwd
and err are surfaced rather than silently hanging or truncating; ensure the
output returned still concatenates stdout/stderr but also appends or includes
err.message and err.code when err is present.

⚠️ Potential issue | 🟠 Major

Missing timeout on child_process.exec — agent will block indefinitely if bal test hangs.

bal test can hang in real scenarios: a test awaiting a network service that never starts, an infinite loop in a test function, or a test waiting for interactive input. Without a timeout option, the returned Promise never resolves and the entire AI agent execution loop is blocked with no recovery path short of the user stopping the whole generation.

Additionally:

  • The default maxBuffer (1 MB) can be exceeded by large test suites with verbose output, producing a silent truncation whose error message (err.message = "stdout maxBuffer exceeded") is never surfaced to the agent because only stdout/stderr are joined into output.
  • Any OS-level exec failure (e.g., balCmd path resolution failure) is silently swallowed for the same reason.
🐛 Proposed fix — add timeout, larger maxBuffer, and exec-error surfacing
 async function runBallerinaTests(cwd: string): Promise<TestRunResult> {
     return new Promise((resolve) => {
         const balCmd = extension.ballerinaExtInstance.getBallerinaCmd();
         const command = `${balCmd} test`;
+        const TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes
+        const MAX_BUFFER = 10 * 1024 * 1024; // 10 MB

         console.log(`[TestRunner] Running: ${command} in ${cwd}`);

-        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
-            const output = [stdout, stderr].filter(Boolean).join('\n').trim();
-
-            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}`);
+        child_process.exec(command, { cwd, timeout: TIMEOUT_MS, maxBuffer: MAX_BUFFER }, (err, stdout, stderr) => {
+            const parts = [stdout, stderr].filter(Boolean);
+            if (err) {
+                // Surface OS-level errors (timeout, maxBuffer, command-not-found) that
+                // would otherwise be invisible to the agent.
+                if (err.killed) {
+                    parts.push(`\nError: 'bal test' timed out after ${TIMEOUT_MS / 1000} seconds.`);
+                } else if (!stdout && !stderr) {
+                    parts.push(`\nError: ${err.message}`);
+                }
+            }
+            const output = parts.join('\n').trim();
+
+            console.log(`[TestRunner] Completed. Exit code: ${err?.code ?? 0}, killed: ${err?.killed ?? false}`);
             resolve({ output });
         });
     });
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 87 - 100, The runBallerinaTests function currently calls
child_process.exec(command, { cwd }, ...) without timeout/maxBuffer and swallows
exec errors; update the exec invocation in runBallerinaTests to pass options {
cwd, timeout: <reasonable-ms>, maxBuffer: <larger-bytes> } (e.g., 60_000 ms and
e.g. 10*1024*1024 bytes) and include err details in the resolved TestRunResult
output (or reject on fatal exec errors) so that stdout/stderr truncation and
OS-level errors like command-not-found are surfaced; ensure you reference
balCmd/command, child_process.exec callback, and the TestRunResult object when
adding the timeout/maxBuffer and appending err.message/err.code into the
returned output.


        console.log(`[TestRunner] Running: ${command} in ${cwd}`);

        child_process.exec(command, { cwd }, (err, stdout, stderr) => {
Contributor

We need to see if we can utilize the inbuilt VS Code tools, but we can do that in a separate PR.

Member Author

Sure

        return { running: "Generating connector...", done: "Connector ready" };
    }
    if (hasTestRunner) {
        return { running: "Running tests...", done: "Tests completed" };
Contributor

shouldn't we display test pass/failure status here?

Member Author

For the initial implementation it's handled by the agent. It'll give a summary of the tests and what was tested.
We can improve further from the tool side in the next PR.

@RNViththagan RNViththagan force-pushed the copilot-agent-test-tool branch from f2286cb to 850dd8b on February 19, 2026 04:53
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts (2)

61-62: Consider propagating errors from runBallerinaTests rather than always resolving.

The execute function unconditionally awaits runBallerinaTests and returns its result, but runBallerinaTests always resolves (never rejects) — even on catastrophic failures like a missing bal command. If extension.ballerinaExtInstance.getBallerinaCmd() throws synchronously (e.g., extension not initialized), the promise constructor won't catch it and the rejection will be unhandled. Wrapping the body in a try/catch and emitting a meaningful tool_result on failure would improve resilience.

♻️ Proposed fix — add error handling in execute and runBallerinaTests
         execute: async (_input: Record<string, never>, context?: { toolCallId?: string }): Promise<TestRunResult> => {
             const toolCallId = context?.toolCallId || `fallback-${Date.now()}`;
 
             eventHandler({
                 type: "tool_call",
                 toolName: TEST_RUNNER_TOOL_NAME,
                 toolCallId,
             });
 
-            const result = await runBallerinaTests(tempProjectPath);
+            let result: TestRunResult;
+            try {
+                result = await runBallerinaTests(tempProjectPath);
+            } catch (e) {
+                result = { output: `Error running tests: ${e instanceof Error ? e.message : String(e)}` };
+            }
 
             eventHandler({
                 type: "tool_result",
                 toolName: TEST_RUNNER_TOOL_NAME,
                 toolCallId,
                 toolOutput: result
             });
 
             return result;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 61 - 62, The execute function should propagate and handle errors
instead of relying on runBallerinaTests to always resolve: wrap the body of
execute in a try/catch so any synchronous throws (e.g., from
extension.ballerinaExtInstance.getBallerinaCmd()) are caught, and on error
return a proper TestRunResult/tool_result indicating failure (including the
error message and toolCallId) rather than leaving the promise unhandled; also
update runBallerinaTests to reject/throw on catastrophic failures (missing bal,
spawn errors) instead of always resolving so callers like execute can catch and
convert those into meaningful tool_result error responses.

84-86: Nit: JSDoc says "parses the output" but the function returns raw output.

The function simply concatenates stdout and stderr without any parsing. The description should match the actual behavior.

📝 Proposed fix
 /**
- * Executes `bal test` in the given directory and parses the output.
+ * Executes `bal test` in the given directory and returns the raw output.
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`
around lines 84 - 86, Update the JSDoc for the function that "Executes `bal
test` in the given directory" to accurately state that it returns the raw
combined stdout and stderr output (a single concatenated string) rather than
parsing results; mention the exact return shape ("string" with combined
stdout/stderr) and, if desired, add a TODO or alternative note that parsing
could be implemented later or add a new parseTestOutput helper to perform
structured parsing (so callers know to handle raw text).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@workspaces/ballerina/ballerina-extension/src/features/ai/agent/tools/test-runner.ts`:
- Around line 87-100: In runBallerinaTests, when calling child_process.exec (the
invocation that uses variables command and cwd) add an options object with a
reasonable timeout and increased maxBuffer (e.g., timeout in ms and maxBuffer
bytes) so the promise cannot hang indefinitely or silently truncate output; also
include the exec callback's err information in the resolved TestRunResult.output
(append err.message and err.killed/err.signal/err.code details) so timeout,
command-not-found, and maxBuffer errors are visible, and ensure the Promise
always resolves with output even when err is present.
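A minimal sketch of that fix, with a placeholder command standing in for `bal test` (the timeout and buffer sizes are illustrative values, not project settings):

```typescript
import { exec } from "child_process";

interface TestRunResult {
    output: string;
}

// Run a command with a hard timeout and an enlarged output buffer, always
// resolving with combined stdout/stderr plus any error details.
function runCommandCapturingOutput(command: string, cwd: string): Promise<TestRunResult> {
    return new Promise((resolve) => {
        const options = {
            cwd,
            timeout: 5 * 60 * 1000,      // kill the process after 5 minutes
            maxBuffer: 10 * 1024 * 1024, // allow up to 10 MiB of output
        };
        exec(command, options, (err, stdout, stderr) => {
            let output = `${stdout}${stderr}`;
            if (err) {
                // Surface timeout/kill/exit details instead of swallowing them.
                output += `\n[exec error] ${err.message} (killed=${err.killed}, signal=${err.signal}, code=${err.code})`;
            }
            resolve({ output });
        });
    });
}
```

With these options a hung test run is killed instead of blocking the agent, and oversized output shows up as an explicit `maxBuffer` error in the result rather than a silent truncation.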



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2


Comment on lines +105 to +106
- Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
- Once compilation is clean and the project contains test cases, run the tests.

⚠️ Potential issue | 🟡 Minor

Test execution may fire prematurely during non-testing tasks.

The trigger condition "Once compilation is clean and the project contains test cases, run the tests" fires after every task in plan mode — not just after the testing task. If the user's project already contains test files (iterative workflow, existing test suite, partial agent run), this will invoke the test runner after connections_init or implementation tasks as well, running a potentially stale/incomplete test suite mid-implementation.

The same condition repeats on line 141 in edit mode, where there is no testing task type to act as a natural gate.

Consider restricting this trigger to the testing task type (plan mode) or making it explicitly conditional on the current task type, e.g.:

💡 Suggested wording
-   - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
-   - Once compilation is clean and the project contains test cases, run the tests.
+   - Before marking the task as completed, use ${DIAGNOSTICS_TOOL_NAME} to check for compilation errors and fix them. Introduce a new subtask if needed.
+   - Once compilation is clean and the current task is a 'testing' task, run the tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`
around lines 105 - 106, The test-run trigger in the agent prompt currently fires
whenever compilation is clean and tests exist, which causes tests to run after
any task; update the prompt text around the DIAGNOSTICS_TOOL_NAME guidance to
require the current task be the testing task (e.g., add an explicit condition
like "only if the current task's type is 'testing'") before instructing to run
tests in plan mode, and make the same explicit conditional change in the
edit-mode prompt block where the duplicate wording appears (refer to the prompt
strings containing DIAGNOSTICS_TOOL_NAME and the testing-related trigger to
locate and update both spots).

Comment on lines 119 to 124
## Test Runner
When running tests, follow these steps:
1. Before running, briefly tell the user what is being tested.
2. Use ${TEST_RUNNER_TOOL_NAME} to run the test suite.
3. After the run, give a short summary: how many tests passed/failed.
4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.

⚠️ Potential issue | 🟠 Major

Unbounded fix-and-rerun loop — add a retry cap.

Step 4 instructs the agent to "fix them, and re-run" with no iteration limit. For persistently failing tests (environment-dependent failures, flaky assertions, genuinely incorrect logic), the agent can loop indefinitely, consuming tokens and time and incurring LLM cost, with no exit path.

Add an explicit maximum-attempt guard and an escape hatch:

💡 Suggested wording
 ## Test Runner
 When running tests, follow these steps:
 1. Before running, briefly tell the user what is being tested.
 2. Use ${TEST_RUNNER_TOOL_NAME} to run the test suite.
 3. After the run, give a short summary: how many tests passed/failed.
-4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.
+4. If there are failures, mention which tests failed and why (one line each), fix them, and re-run.
+   Repeat at most 3 times. If tests still fail after 3 attempts, report the remaining failures to the user and stop.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@workspaces/ballerina/ballerina-extension/src/features/ai/agent/prompts.ts`
around lines 119 - 124, Update the "## Test Runner" prompt block (the Test
Runner section that uses ${TEST_RUNNER_TOOL_NAME}) to include an explicit retry
cap and an escape hatch: add a maximum attempts parameter (e.g., "max_attempts:
3") and change step 4 to say "if failures remain after max_attempts, stop
retrying, report remaining failing tests with one-line diagnostics, and
recommend manual investigation or filing an issue"; ensure the prompt instructs
the agent to increment an attempt counter each run and to stop once max_attempts
is reached while summarizing unresolved failures and next steps.
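The capped fix-and-rerun flow the prompt describes can be sketched in plain code; `run` and `fix` are hypothetical stand-ins for the agent's tool invocations:

```typescript
interface RunSummary {
    failed: number;
    details: string[];
}

// Run the suite, attempt one fix pass per failing run, and stop after
// maxAttempts total runs so the loop cannot spin indefinitely.
function runWithRetryCap(
    run: () => RunSummary,
    fix: (details: string[]) => void,
    maxAttempts = 3,
): RunSummary {
    let summary = run();
    for (let attempt = 1; attempt < maxAttempts && summary.failed > 0; attempt++) {
        fix(summary.details);
        summary = run();
    }
    if (summary.failed > 0) {
        // Escape hatch: report remaining failures instead of retrying forever.
        console.log(`Still failing after ${maxAttempts} attempts: ${summary.details.join(", ")}`);
    }
    return summary;
}
```

The initial run plus up to `maxAttempts - 1` retries keeps total runs bounded at `maxAttempts`, matching the "repeat at most 3 times" wording.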

@RNViththagan RNViththagan force-pushed the copilot-agent-test-tool branch from 9022d25 to 6c3a371 Compare February 19, 2026 09:00
@xlight05 xlight05 merged commit 101dfc6 into wso2:release/bi-1.8.x Feb 19, 2026
5 checks passed