Agent to use semantic search #159

dimavrem22 · 2026-02-11T21:40:48Z

BB agent uses routines/semantic-search endpoint to get plausible routines (only routine info if returned, not operations)
BB routine execute is hooked up to vectorly server (tool structure optimized to minimize issues with invalid params, e.g. since we get the parameters in routine info, we can validate parameters before making api call)
BB has a working directory with /raw (for routine execution results) and /output (for writing files)
BB agent has new tools: execute python, list files in work directory, and read file in work directory
system prompts updated

Other:

sandbox code execution allows for mounting a work directory to the docker image OR i/o operations only for that directory if using blocklist (w/ unit tests)
added copy buttons to TUI because still cannot figure out how to copy on the mac
removed debug url from BB (no need to execute routines locally)
if docker is not running, user is prompted with a warning and needs action to continue

dimavrem22 · 2026-02-11T22:16:00Z

@claude review this bad boy

claude · 2026-02-11T22:16:15Z

Claude finished @dimavrem22's task in 1m 31s —— View job

PR Review: Agent to use semantic search

I've reviewed all the major changes in this PR. Here's my comprehensive feedback:

🎯 Overall Assessment

This is a well-structured PR that adds significant new functionality:

Semantic search integration for routine discovery (replacing keyword-based search)
Sandboxed code execution with multiple isolation backends (Lambda, Docker, blocklist)
Workspace file management tools for the agent
Comprehensive test coverage (390 new test lines!)

The code quality is generally high, but there are several areas that need attention before merging.

🔴 Critical Issues

1. Missing Type Hints (CLAUDE.md Violation)

Location: bluebox/agents/bluebox_agent.py:148-165

The _cache_routines_from_response method is missing return type hint:

def _cache_routines_from_response(self, response: dict[str, Any] | list[Any]) -> None:

✅ This is actually correct! But please verify all other methods have type hints.

2. Security: Path Traversal in `_read_workspace_file`

Location: bluebox/agents/bluebox_agent.py:446-448

The path validation logic has a subtle bug:

if not str(resolved).startswith(str(workspace_resolved) + "/") and resolved != workspace_resolved:
    return {"error": f"Access denied: '{path}' is outside the workspace directory"}

Issue: On some systems, this could allow access to workspace_resolved itself when it shouldn't, or fail legitimate subdirectory access. The check should use os.path.commonpath or ensure the trailing slash is handled correctly.

Recommendation:

try:
    resolved.relative_to(workspace_resolved)
except ValueError:
    return {"error": f"Access denied: '{path}' is outside the workspace directory"}

3. Docker Container Naming Collision Risk

Location: bluebox/utils/code_execution_sandbox.py:203

Container name uses os.getpid() and timestamp:

container_name = f"bluebox-sandbox-{os.getpid()}-{int(time.time() * 1000)}"

Issue: In high-concurrency scenarios, this could still collide. Consider adding a random component or UUID suffix.

⚠️ Important Issues

4. Inconsistent Error Handling in Sandbox Execution

Location: bluebox/utils/code_execution_sandbox.py:175

When extra_globals fails JSON serialization, it returns an error dict, but the function signature suggests it should always return the same structure. This is inconsistent with other error paths.

Recommendation: Document this behavior clearly in the docstring, or standardize error handling across all code paths.

5. Race Condition in File Change Detection

Location: bluebox/agents/bluebox_agent.py:323-328

The code snapshots st_mtime before execution and compares after:

files_before: dict[str, float] = {}
for p in self._outputs_dir.iterdir():
    if p.is_file():
        files_before[str(p)] = p.stat().st_mtime

Issue: On filesystems with low mtime resolution (e.g., 1-second precision), if code executes very quickly, new files might not be detected. Consider using st_mtime_ns for nanosecond precision.

6. Potential Memory Issue with Large Files

Location: bluebox/agents/bluebox_agent.py:456-457

Reads entire file into memory:

try:
    lines = resolved.read_text().splitlines()

Issue: For very large files (e.g., multi-GB JSON dumps), this could cause OOM. The 200-line cap is good, but it happens after loading the entire file.

Recommendation: Use streaming/chunked reading for files larger than a threshold.

7. Missing Validation for `work_dir` Parameter

Location: bluebox/utils/code_execution_sandbox.py:404-476

The work_dir parameter is accepted but not validated. What happens if:

The directory doesn't exist?
The path is not absolute?
The path points to a sensitive directory like /etc?

Recommendation: Add validation at the top of execute_python_sandboxed:

if work_dir:
    work_dir = os.path.abspath(work_dir)
    # Add whitelist/blacklist checks for sensitive paths

💡 Suggestions & Style Issues

8. Hardcoded Magic Numbers

Location: bluebox/agents/bluebox_agent.py:470-475

if total_lines > 200:
    lines = lines[:200]

Consider making 200 a named constant: MAX_FILE_PREVIEW_LINES = 200

9. Inconsistent Docstring Style

Some methods have detailed Args/Returns sections, others don't. For example:

_search_routines ✅ Has Args
_cache_routines_from_response ❌ Missing Args/Returns

Recommendation: Follow Google or NumPy docstring style consistently.

10. Test Coverage: Missing Edge Cases

Location: tests/unit/test_code_execution_sandbox.py

Great test coverage overall! But missing tests for:

_read_workspace_file path traversal attempts
_run_python_code when routine_results is empty
Sandbox mode transitions (lambda → docker → blocklist fallback)

✅ What I Like

Excellent test coverage - 390 lines of comprehensive tests for the sandbox functionality
Security-conscious design - Multiple isolation layers (pattern blocking + backend isolation)
Clear separation of concerns - _execute_in_docker, _execute_blocklist_sandbox, execute_python_sandboxed are well-separated
Informative error messages - The sandbox returns helpful hints to guide the agent
File safety - The scoped open() implementation is a good approach
Documentation - The system prompt in BlueBoxAgent is excellent and very clear

📋 Minor Issues

Typo in comment (bluebox/utils/code_execution_sandbox.py:412): "servers' lambda_code_executor" should be "server's lambda_code_executor"
Unused import? Check if all imports in bluebox/agents/bluebox_agent.py are used (ThreadPoolExecutor, as_completed look good though)
Logging inconsistency: Some places use logger.debug, others logger.info for similar operations. Consider standardizing logging levels.

🎬 Summary

Recommendation: REQUEST CHANGES ⚠️

This PR introduces valuable functionality, but the security issues (#2, #3, #7) and race condition (#5) should be addressed before merging. The other issues are important but not blockers.

Must Fix Before Merge:

Fix path traversal validation in _read_workspace_file (Improve README and add __init__.py [EOM] #2)
Add work_dir validation and sanitization (Agent often unable to identify relevant network transaction #7)
Fix file change detection race condition (No pytests [EOM] #5)

Should Fix:

Add randomness to Docker container naming (addign stuff #3)
Document or standardize error handling (Update README and fix example routine [EOM] #4)
Handle large files gracefully (Explicitly write files as UTF-8 #6)

Nice to Have:

Extract magic numbers to constants (Include codeowners [EOM] #8)
Standardize docstrings (LLM sometimes hallucinates critical fields (e.g. transaction_id) #9)
Add edge case tests (Resolve some issues #10)

Great work on the semantic search integration and sandbox implementation! The code is well-structured and the tests are thorough. Just need to address the security and reliability issues above. 💪

|

dimavrem22 · 2026-02-11T23:38:25Z

@claude

claude · 2026-02-11T23:38:39Z

Claude finished @dimavrem22's task in 3m 13s —— View job

PR Review: Agent to use semantic search

I've completed a comprehensive review of all 12 changed files in this PR. Here's my detailed feedback.

🎯 Overall Assessment

This is a well-architected PR that introduces significant new functionality:

Semantic search integration for routine discovery (replacing keyword-based search)
Sandboxed code execution with multiple isolation backends (Lambda, Docker, blocklist)
Workspace file management tools for the BlueBox agent
Comprehensive test coverage (390+ new test lines, including excellent path traversal tests!)

The code quality is generally excellent, with strong security awareness and good engineering practices. However, I found several issues that should be addressed.

✅ Excellent Work

Security Improvements

Path traversal fix (bluebox_agent.py:454-457) - Now uses Path.relative_to() which is the correct approach! The previous review flagged this, and it's been properly fixed with comprehensive tests in test_read_workspace_file.py. Excellent!
Container naming (code_execution_sandbox.py:210) - Fixed! Now uses uuid.uuid4().hex[:8] instead of just timestamp, eliminating collision risk.
work_dir validation (code_execution_sandbox.py:441-444) - Fixed! Now validates against SENSITIVE_PATH_PREFIXES before any backend execution.

Test Coverage

145 lines of path traversal tests covering edge cases (symlinks, shared prefixes, absolute paths)
428 lines of sandbox execution tests
Tests for both security (blocklist checks) and functionality (file I/O, work_dir scoping)

Code Quality

Type hints are present on all new methods ✅
Error messages are informative and actionable
Separation of concerns is clear (Docker, blocklist, Lambda backends)

🔴 Critical Issues

1. Race Condition in File Change Detection

Location: bluebox/agents/bluebox_agent.py:330-333, 357-358

files_before: dict[str, float] = {}
for p in self._outputs_dir.iterdir():
    if p.is_file():
        files_before[str(p)] = p.stat().st_mtime

Problem: Uses st_mtime which has 1-second precision on many filesystems. If code executes quickly (< 1 second), newly created files might not be detected.

Fix: Use st_mtime_ns for nanosecond precision:

files_before[str(p)] = p.stat().st_mtime_ns
# ...later...
mtime = p.stat().st_mtime_ns
if path_str not in files_before or files_before[path_str] < mtime:
    files_created.append(path_str)

Fix this →

⚠️ Important Issues

2. Docker Container Cleanup on Timeout

Location: bluebox/utils/code_execution_sandbox.py:251-257

except subprocess.TimeoutExpired:
    # Kill the container to prevent resource leaks
    subprocess.run(
        ["docker", "kill", container_name],
        capture_output=True,
        timeout=5,
    )

Issue: If the docker kill command itself hangs or fails, there's no error handling. Consider:

try:
    subprocess.run(["docker", "kill", container_name], capture_output=True, timeout=5)
except Exception:
    logger.warning("Failed to kill container %s", container_name)

3. JSON Serialization Error Handling

Location: bluebox/utils/code_execution_sandbox.py:179-182

try:
    globals_json = json.dumps(extra_globals) if extra_globals else "{}"
except (TypeError, ValueError) as e:
    return {"error": f"extra_globals must contain only JSON-serializable values: {e}"}

Issue: This early return is only in _execute_in_docker, but not in _execute_blocklist_sandbox (line 388). If non-serializable values are passed to blocklist mode, they'll be injected directly without validation.

Recommendation: Move this check to execute_python_sandboxed() (before backend selection) to ensure consistency across all backends.

4. Inconsistent Error Handling for Empty `routine_results`

Location: bluebox/agents/bluebox_agent.py:335-342

# Load all JSON files from raw/ as routine_results
routine_results: list[dict[str, Any]] = []
for json_file in sorted(self._raw_dir.glob("*.json")):
    try:
        data = json.loads(json_file.read_text())
        routine_results.append(data)
    except (json.JSONDecodeError, OSError) as e:
        logger.warning("Failed to load %s: %s", json_file, e)

Issue: If all JSON files fail to load, routine_results will be empty, which might cause unexpected behavior in user code that assumes data exists.

Suggestion: Add a hint in the result when routine_results is empty:

sandbox_result = execute_python_sandboxed(...)
if not routine_results:
    sandbox_result["_warning"] = "No routine results were loaded from raw/ directory"

💡 Suggestions & Nice-to-Haves

5. Hardcoded Magic Numbers

Location: Multiple places

Examples:

bluebox/agents/bluebox_agent.py:470-475: 200 (max file preview lines)
bluebox/utils/infra_utils.py:103: 200 (max_lines default)

Suggestion: Extract to named constants:

MAX_FILE_PREVIEW_LINES = 200  # at module level

6. Docker Availability Check Caching

Location: bluebox/utils/code_execution_sandbox.py:121-151

The caching logic is good, but it never invalidates. If Docker becomes available after the first check (e.g., user starts Docker Desktop), it won't be detected.

Suggestion: Add a function to reset the cache:

def _reset_docker_availability_cache() -> None:
    """Reset Docker availability cache (useful for testing or runtime changes)."""
    global _docker_available
    _docker_available = None

Not critical, but could be helpful for long-running processes.

7. Tool Validation in `_execute_routines_in_parallel`

Location: bluebox/agents/bluebox_agent.py:233-241

Excellent validation logic! The pre-flight parameter validation against cached RoutineInfo is a great pattern. One minor suggestion:

if validation_errors:
    return {"error": "Parameter validation failed. Fix and retry.\n" + "\n".join(validation_errors)}

Consider adding "success": False to the error response for consistency with the success case (line 295).

8. Logging Inconsistency

Throughout the codebase:

Some places use logger.debug for execution steps
Others use logger.info for similar operations
Example: code_execution_sandbox.py:462 uses logger.debug, while bluebox_agent.py:259 uses logger.info

Recommendation: Standardize logging levels:

debug: Internal details (cache hits, validation skips)
info: User-visible operations (routine execution, file creation)
warning: Recoverable issues (failed JSON loads, Docker unavailable)
error: Failures that block execution

9. Error Message Consistency

Location: bluebox/agents/bluebox_agent.py:366-370, 379-382, 385-388

The _hint fields are excellent! They guide the agent to debug and retry. Consider:

Renaming _hint to hint (no need for underscore prefix)
Standardizing the format across all error paths

🐛 Minor Issues

10. Typo in Comment

Location: bluebox/utils/code_execution_sandbox.py:419

# "lambda": Use AWS Lambda for isolation (requires servers' lambda_code_executor)

Should be: server's (singular possessive) not servers'

11. Docker Warning UX

Location: bluebox/scripts/run_bluebox_agent.py:164-172

The Docker availability warning is excellent! One suggestion: Also mention that Lambda mode is available in cloud deployments (for completeness):

print_colored(
    "Warning: Docker is not available. Code execution will use the blocklist sandbox,\n"
    "which is less secure and has limited isolation. (Lambda mode is available in cloud deployments.)",
    YELLOW,
)

📊 Test Coverage Analysis

Excellent Coverage

✅ Path traversal edge cases (symlinks, shared prefixes, absolute paths)
✅ Docker sandbox execution (with work_dir mounting)
✅ Blocklist sandbox execution
✅ Safety checks (blocked patterns, modules, builtins)
✅ Scoped open() implementation
✅ File reading with line ranges

Missing Tests (Nice to Have)

_run_python_code when routine_results is empty
Sandbox mode selection logic (auto mode fallback: lambda → docker → blocklist)
_cache_routines_from_response with malformed data
_validate_routine_params with edge cases (missing optional params, extra params)
File change detection race condition (would require mocking or sleep)

🎬 Summary & Recommendation

Recommendation: APPROVE with Minor Fixes ✅

This PR is high quality and ready to merge with minor fixes. The critical issues from the previous review have been addressed:

✅ Path traversal validation fixed
✅ Docker container naming collision fixed
✅ work_dir validation added
✅ Type hints are present on all methods

Must Fix Before Merge

Repo setup #1: Fix file change detection race condition (use st_mtime_ns) - Critical for correctness
Improve README and add __init__.py [EOM] #2: Add error handling for docker kill timeout - Important for reliability

Should Fix (or create follow-up issues)

addign stuff #3: Move JSON serialization check to execute_python_sandboxed() for consistency
Update README and fix example routine [EOM] #4: Add warning when routine_results is empty
No pytests [EOM] #5: Extract magic numbers to named constants

Nice to Have

Explicitly write files as UTF-8 #6-Enable pylint [EOM] #11: Minor improvements (logging consistency, better error messages, typo fixes)

🎉 Great Work!

Highlights of this PR:

Security-conscious: Path traversal fix, sensitive path validation, multiple isolation layers
Excellent tests: 145 lines of path traversal tests with thorough edge case coverage
Clean architecture: Clear separation between Lambda/Docker/blocklist backends
User-friendly: Informative error messages with actionable hints
CLAUDE.md compliant: All type hints present, no lazy imports, proper error handling

The semantic search integration and workspace management tools are solid additions. The sandbox implementation is particularly well done with defense in depth (pattern blocking + backend isolation).

Just address the file timestamp race condition (#1) and Docker cleanup error handling (#2), and this is good to merge! 🚀

…torlyApp/bluebox-sdk into use_browser_agent

Connect bluebox agent and browser agent

bluebox/agents/bluebox_agent.py

Base automatically changed from beta to main February 11, 2026 22:32

connect bb agent and browser anet

01627ec

dimavrem22 changed the title ~~Agent to use semantic search [TMP PR- need to see Diff]~~ Agent to use semantic search Feb 11, 2026

Merge branch 'main' into use_browser_agent

5823163

dimavrem22 marked this pull request as ready for review February 11, 2026 23:03

dimavrem22 requested review from alex-w-99 and rayruizhiliao as code owners February 11, 2026 23:04

rayruizhiliao and others added 7 commits February 11, 2026 18:12

feat: client side sse

6b1d80a

wip

6ef3c4d

bluebox agent calls routines corretcly now

2a401a8

workspace related tools for bluebox agent

260ee31

added warning for docker

fa612a4

added unit tests for new code execution in sandbox

1d58fcc

claude issues addressed

ea6aa46

dimavrem22 force-pushed the agent-to-use-semantic-search branch from cfefe46 to ea6aa46 Compare February 11, 2026 23:38

rayruizhiliao added 11 commits February 11, 2026 21:09

Merge branch 'agent-to-use-semantic-search' of https://github.com/Vec…

f538825

…torlyApp/bluebox-sdk into use_browser_agent

feat: client side sse

e4948d3

hide params from agent

bf41f05

hide params from agent

11d8f90

add data models

c5478f3

rename endpoint

6ae51ea

update data models

3a754c6

feat: base and inherited events

1bf43ec

added a todo

3ef2ed5

update comments

75758e9

Merge pull request #160 from VectorlyApp/use_browser_agent

7cb110f

Connect bluebox agent and browser agent

alex-w-99 reviewed Feb 12, 2026

View reviewed changes

bluebox/agents/bluebox_agent.py Show resolved Hide resolved

alex-w-99 approved these changes Feb 12, 2026

View reviewed changes

dimavrem22 merged commit 24591bf into main Feb 12, 2026
13 checks passed

dimavrem22 deleted the agent-to-use-semantic-search branch February 12, 2026 04:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent to use semantic search #159

Agent to use semantic search #159

Uh oh!

dimavrem22 commented Feb 11, 2026 •

edited

Loading

Uh oh!

dimavrem22 commented Feb 11, 2026

Uh oh!

claude bot commented Feb 11, 2026 •

edited

Loading

Uh oh!

dimavrem22 commented Feb 11, 2026

Uh oh!

claude bot commented Feb 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Agent to use semantic search #159

Agent to use semantic search #159

Uh oh!

Conversation

dimavrem22 commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dimavrem22 commented Feb 11, 2026

Uh oh!

claude bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Agent to use semantic search

🎯 Overall Assessment

🔴 Critical Issues

1. Missing Type Hints (CLAUDE.md Violation)

2. Security: Path Traversal in _read_workspace_file

3. Docker Container Naming Collision Risk

⚠️ Important Issues

4. Inconsistent Error Handling in Sandbox Execution

5. Race Condition in File Change Detection

6. Potential Memory Issue with Large Files

7. Missing Validation for work_dir Parameter

💡 Suggestions & Style Issues

8. Hardcoded Magic Numbers

9. Inconsistent Docstring Style

10. Test Coverage: Missing Edge Cases

✅ What I Like

📋 Minor Issues

🎬 Summary

Must Fix Before Merge:

Should Fix:

Nice to Have:

Uh oh!

dimavrem22 commented Feb 11, 2026

Uh oh!

claude bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Agent to use semantic search

🎯 Overall Assessment

✅ Excellent Work

Security Improvements

Test Coverage

Code Quality

🔴 Critical Issues

1. Race Condition in File Change Detection

⚠️ Important Issues

2. Docker Container Cleanup on Timeout

3. JSON Serialization Error Handling

4. Inconsistent Error Handling for Empty routine_results

💡 Suggestions & Nice-to-Haves

5. Hardcoded Magic Numbers

6. Docker Availability Check Caching

7. Tool Validation in _execute_routines_in_parallel

8. Logging Inconsistency

9. Error Message Consistency

🐛 Minor Issues

10. Typo in Comment

11. Docker Warning UX

📊 Test Coverage Analysis

Excellent Coverage

Missing Tests (Nice to Have)

🎬 Summary & Recommendation

Must Fix Before Merge

Should Fix (or create follow-up issues)

Nice to Have

🎉 Great Work!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dimavrem22 commented Feb 11, 2026 •

edited

Loading

claude bot commented Feb 11, 2026 •

edited

Loading

2. Security: Path Traversal in `_read_workspace_file`

7. Missing Validation for `work_dir` Parameter

claude bot commented Feb 11, 2026 •

edited

Loading

4. Inconsistent Error Handling for Empty `routine_results`

7. Tool Validation in `_execute_routines_in_parallel`