63 commits
dfb11bd
add 2 test case and config files
vince-leaf Apr 16, 2025
2fd90a3
updated and fixed front-man effective response
vince-leaf Apr 26, 2025
6a653b2
Added 1st e2e infrastructure with 1 test case
vince-leaf Apr 26, 2025
87bb429
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf Apr 26, 2025
fe68a4e
clean-up
vince-leaf Apr 26, 2025
539cd33
clean-up
vince-leaf Apr 26, 2025
9119952
fixed broken flake8
vince-leaf Apr 26, 2025
ca4cdd0
flake8 reported no newline at end of file, but none
vince-leaf Apr 26, 2025
993cac0
fixed flake8
vince-leaf Apr 26, 2025
219a0b1
added my test dependencies to requirements-build.txt
vince-leaf Apr 28, 2025
e60fcad
add pytest command for path on e2e tests
vince-leaf Apr 28, 2025
a1bfefd
fixed typo
vince-leaf Apr 28, 2025
66fc08e
updated to make flake8 happy
vince-leaf Apr 28, 2025
c1273fc
Made flake8 happy
vince-leaf Apr 28, 2025
4294245
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf Apr 28, 2025
c6215e4
tweaked e2e pytest
vince-leaf Apr 28, 2025
f5f6d9d
edit e2e pytest
vince-leaf Apr 28, 2025
769c352
Update cost values
vince-leaf Apr 28, 2025
2b4a38a
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf Apr 28, 2025
5802313
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf Apr 29, 2025
f57ceec
removed extra
vince-leaf Apr 29, 2025
e58d261
combined fileterwarning to top pytest.ini
vince-leaf Apr 29, 2025
c138c5f
Added server service utility
vince-leaf Apr 30, 2025
5ae944e
added start and stop server service
vince-leaf Apr 30, 2025
d669421
added ignore warning
vince-leaf Apr 30, 2025
bdd9548
renamed to smoketest
vince-leaf Apr 30, 2025
a74094c
updated to run smoke test
vince-leaf Apr 30, 2025
abdcde8
made flake8 happy
vince-leaf May 1, 2025
e028a2e
ignore pytest warning
vince-leaf May 1, 2025
475a76d
debug
vince-leaf May 1, 2025
cfe348f
debug
vince-leaf May 1, 2025
2c2d694
debug failure
vince-leaf May 1, 2025
67649c1
debug
vince-leaf May 1, 2025
db5a2a1
fixed flake8
vince-leaf May 1, 2025
38c79c5
debug
vince-leaf May 1, 2025
c250fda
increased timeout on wait for prompt
vince-leaf May 1, 2025
700459f
added logging
vince-leaf May 1, 2025
db10cf7
make flake8 happy
vince-leaf May 1, 2025
fadcd3b
Made Flake8 happy
vince-leaf May 1, 2025
4526cbd
made flake8 happy
vince-leaf May 1, 2025
3832285
add condition
vince-leaf May 1, 2025
a84833d
a major refactor to support start&stop server service
vince-leaf May 6, 2025
19b5fd8
update trigger smoke-test
vince-leaf May 6, 2025
a37c0a7
made flake8 happy
vince-leaf May 6, 2025
aab8039
add test requirement
vince-leaf May 6, 2025
476d9c6
tweaked stop all servers script
vince-leaf May 6, 2025
469f982
more tweaks
vince-leaf May 6, 2025
3c09c55
fixed a minor info message
vince-leaf May 6, 2025
00782ef
Tweaked timeout
vince-leaf May 6, 2025
bfa15b8
Changes Smoke-test to run after Unit tests
vince-leaf May 6, 2025
739a198
updated readme
vince-leaf May 7, 2025
40de058
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf May 7, 2025
84168d4
renamed the files
vince-leaf May 7, 2025
c778f69
renamed hocon
vince-leaf May 7, 2025
a9e4aee
added more comment
vince-leaf May 7, 2025
ea1d37b
added comment
vince-leaf May 7, 2025
d983628
added comment
vince-leaf May 7, 2025
be264d4
added comment
vince-leaf May 7, 2025
7e0c486
add smoketest cron job
vince-leaf May 8, 2025
7c30777
Removed smoke test build test
vince-leaf May 8, 2025
b4ffbd8
updated test result text
vince-leaf May 8, 2025
417f680
Merge branch 'main' into un-3096_add_test_case_all_agent_cli_connections
vince-leaf May 8, 2025
4b952a7
removed requirement file
vince-leaf May 9, 2025
2 changes: 1 addition & 1 deletion neuro_san/coded_tools/music_nerd_pro/accounting.py
@@ -30,7 +30,7 @@ def invoke(self, args: Dict[str, Any], sly_data: Dict[str, Any]) -> Dict[str, Any]:
         running_cost: float = float(args.get("running_cost"))

         # Increment the running cost
-        updated_running_cost: float = running_cost + 1.0
+        updated_running_cost: float = running_cost + 3.0

         tool_response = {
             "running_cost": updated_running_cost
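The changed hunk can be exercised in isolation. Below is a minimal sketch, with the class context omitted and the signature taken from the hunk header, showing the new per-question increment of 3.0:

```python
from typing import Any, Dict


def invoke(args: Dict[str, Any], sly_data: Dict[str, Any]) -> Dict[str, Any]:
    """Standalone sketch of the diffed Accountant logic (class context omitted)."""
    running_cost: float = float(args.get("running_cost"))

    # Increment the running cost: each question now costs 3.0 (was 1.0)
    updated_running_cost: float = running_cost + 3.0

    tool_response = {
        "running_cost": updated_running_cost
    }
    return tool_response


# Two questions in a row: 0.0 -> 3.0 -> 6.0, matching the expected
# costs ("3.0", "6.0") in the e2e test data below
first = invoke({"running_cost": 0.0}, {})
second = invoke({"running_cost": first["running_cost"]}, {})
print(first["running_cost"], second["running_cost"])  # 3.0 6.0
```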
7 changes: 6 additions & 1 deletion neuro_san/registries/music_nerd_pro.hocon
@@ -50,7 +50,12 @@ You’re equal parts playlist curator, music historian, and pop culture mythbust
 This service comes for a fee. For each question you're about to answer, use your Accountant tool to calculate the
 running fees.

-Return your answer and the running cost in a JSON message.
+#Return your answer and the running cost in a JSON message.
+This service comes at a cost. For every question:
+1. Use your Accountant tool to calculate the updated running cost.
+2. Return your response in **two parts**:
+   - First, give your full music answer as plain text.
+   - Then, on a **separate line**, return a **valid JSON object** with the updated cost only.
 """,
 "tools": ["Accountant"]
 },
114 changes: 114 additions & 0 deletions tests/e2e/README.md
@@ -0,0 +1,114 @@
# πŸ§ͺ End-to-End Agent Testing Framework

This project provides an extensible, reusable **pytest**-based test system to validate AI agent behavior through real CLI interactions.

It supports:
- Running **multiple connections** (`grpc`, `http`, `direct`)
- **Parallel execution** with **pytest-xdist**
- Optional **thinking file capture** for agent internals
- Config-driven prompts using **HOCON** files

---

## πŸ“¦ Project Structure

```bash
e2e/
β”œβ”€β”€ README.md # This documentation
β”œβ”€β”€ configs/ # Static agent configuration
# Reviewer (Contributor Author): Added README file
# Reviewer (Collaborator): It's good that you have all your e2e stuff together under its own directory.
β”‚ └── config.hocon
β”œβ”€β”€ conftest.py # Pytest customizations (CLI args, test discovery)
β”œβ”€β”€ pytest.ini # Pytest settings
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ test_cases_data/ # Test data for each agent
β”‚ └── mnpt_data.hocon
β”œβ”€β”€ tests/ # Test case source files
β”‚ └── test_music_nerd_pro.py
└── utils/ # Helper modules (parsing, building commands, etc.)
β”œβ”€β”€ mnpt_hocon_loader.py
β”œβ”€β”€ mnpt_output_parser.py
β”œβ”€β”€ mnpt_test_runner.py
β”œβ”€β”€ thinking_file_builder.py
└── verifier.py
```

---

## πŸš€ Running Tests

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Basic Test Command

Run a test (default: **all connections**):

```bash
pytest tests/ --verbose
```

Run for specific connection only:

```bash
pytest tests/ --connection grpc --verbose
```

Run and enable thinking file output:

```bash
pytest tests/ --thinking-file --verbose
```

Enable parallel test execution:

```bash
pytest tests/ --connection grpc --repeat 5 --thinking-file -n auto --verbose
```

> πŸ’‘ When using `-n auto`, pytest-xdist distributes the repeated tests across all available CPU cores.

---

## βš™οΈ CLI Options

| Option | Description |
|:------------------|:------------|
| `--connection` | Run tests only for a specific connection (e.g., `grpc`, `http`, `direct`). |
| `--repeat` | Repeat each test multiple times. |
| `--thinking-file` | Save the agent's internal "thinking" to a temp directory during the test. |

---

# 🧠 Agent: MusicNerdPro Test (test_music_nerd_pro.py)

This suite tests the `music_nerd_pro` agent over all connection types.

### Test Logic

- Load prompt/expected outputs from **HOCON** config files
- Spawn a CLI agent process
- Send user questions
- Verify that:
- Correct keyword appears in the response
- Correct cost value is returned
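Given the two-part response format the agent is prompted to use (a plain-text answer, then a JSON line with the cost), the two verification bullets could be implemented with a check like the following. The helper name and the `running_cost` JSON key are illustrative here, not necessarily what `verifier.py` actually uses:

```python
import json


def verify_response(raw: str, keyword: str, expected_cost: float) -> bool:
    """Check that the keyword appears in the plain-text answer and that
    the trailing JSON line carries the expected running cost."""
    lines = [ln for ln in raw.strip().splitlines() if ln.strip()]
    answer, json_line = "\n".join(lines[:-1]), lines[-1]
    payload = json.loads(json_line)
    return keyword in answer and float(payload["running_cost"]) == expected_cost


sample = 'Yellow Submarine is by the Beatles.\n{"running_cost": 3.0}'
print(verify_response(sample, "Beatles", 3.0))  # True
```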

### Related Files

| File | Purpose |
|:-----|:--------|
| `tests/test_music_nerd_pro.py` | Main test case (pytest function) |
| `test_cases_data/mnpt_data.hocon` | Prompt/expected answer definitions |
| `configs/config.hocon` | Static agent config (connections list) |
| `utils/*.py` | Reusable helpers for all agent tests |

---

# πŸ“ Notes

- **Thinking files** are stored under `/private/tmp/agent_thinking/`
- If `-n auto` is used, **worker-specific** folders are created (e.g., `run_gw0_1`).
- **PEXPECT** is used to fully simulate CLI typing behavior.
- Future agents can be easily added following the same pattern as MusicNerdPro!
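The pexpect-driven session can be approximated with nothing but the standard library. The stand-in below spawns a toy echo "agent" instead of the real CLI (which the suite drives via pexpect), but the send-prompt / read-reply / send-quit loop has the same shape:

```python
import subprocess
import sys

# Toy "agent": echoes each prompt back until it reads "quit".
# Purely illustrative; the real suite talks to the agent CLI via pexpect.
AGENT = r'''
import sys
for line in sys.stdin:
    q = line.strip()
    if q == "quit":
        break
    print(f"echo: {q}", flush=True)
'''

proc = subprocess.Popen(
    [sys.executable, "-c", AGENT],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
replies = []
for prompt in ["Who did yellow submarine?", "quit"]:
    proc.stdin.write(prompt + "\n")
    proc.stdin.flush()
    if prompt != "quit":
        replies.append(proc.stdout.readline().strip())
proc.wait(timeout=10)
print(replies)
```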
8 changes: 8 additions & 0 deletions tests/e2e/configs/config.hocon
@@ -0,0 +1,8 @@
# config.hocon
# Agent config & connection setup

connection = ["direct", "grpc", "http"]
agent = [music_nerd_pro]

model_llm = ["gpt-4o", "llama3.1"]
> **Collaborator (@d1donlydfink, Apr 29, 2025):** LLMs should be a property of the agent, not the test. It would make more sense to have different (very simple) agents set up to test specific LLMs, I think.

> **Contributor Author:** I listed them here because I was thinking of performance test cases. For example: suppose we know, hypothetically, that llama3.1 should be faster than gpt-4o on the x hocon. The test case would run the agent with the first model, then rerun it with the second model, and compare the results of both.

> **Contributor Author:** Alternatively, we could utilize the existing infrastructure on the sly_data feature to perform the comparison.


105 changes: 105 additions & 0 deletions tests/e2e/conftest.py
@@ -0,0 +1,105 @@
# conftest.py
# ------------------------------------------------------------------------
# Pytest configuration for MusicNerdPro tests.
# Provides custom CLI flags, dynamic test generation, and environment setup.
# ------------------------------------------------------------------------

import pytest
import os
from pyhocon import ConfigFactory

# ------------------------------------------------------------------------------
# Constants
# ------------------------------------------------------------------------------

# Directory where agent CLI thinking files will be written (optional feature)
THINKING_FILE_PATH = "/private/tmp/agent_thinking"

# Static agent config (HOCON) loaded once for all tests
CONFIG_HOCON_PATH = os.path.join(os.path.dirname(__file__), "configs", "config.hocon")
config = ConfigFactory.parse_file(CONFIG_HOCON_PATH)
> **Contributor Author:** Parse the config hocon to get connections.


# ------------------------------------------------------------------------------
# Hooks
# ------------------------------------------------------------------------------

def pytest_configure(config):
    """
    Prints custom environment info when pytest starts.
    Helps verify environment settings.
    """
    print("\nCustom Environment Info")
    print(f"thinking-file path : {THINKING_FILE_PATH}")

def pytest_addoption(parser):
    """
    Adds custom command-line options for pytest to control the test suite:
      --connection    -> Filter tests by specific connection method (direct/grpc/http)
      --repeat        -> Repeat the same test multiple times (for stability/reliability)
      --thinking-file -> Enable writing out agent thinking_file logs during test
    """
    group = parser.getgroup("custom options")
    group.addoption(
        "--connection",
        action="store",
        default=None,
        help="Specify a connection name to test (e.g., direct, grpc, http). If omitted, all will be tested."
    )
    group.addoption(
        "--repeat",
        action="store",
        type=int,
        default=1,
        help="Number of times to repeat each test (for stress or reliability testing)."
    )
    group.addoption(
        "--thinking-file",
        action="store_true",
        default=False,
        help="If enabled, agent will write a thinking_file log per test case (grpc/http/direct)."
    )

def pytest_generate_tests(metafunc):
    """
    Dynamically parameterizes the tests based on the connection(s) and repetition requested.

    Example:
        --connection grpc --repeat 3
            β†’ Runs 3 tests against 'grpc' connection.

        --repeat 2 (with no connection)
            β†’ Runs 2 tests for each connection (direct, grpc, http).

    This auto-expands into (connection_name, repeat_index) fixture pairs.
    """
    if "connection_name" in metafunc.fixturenames:
        # By default, the connection is all three
        all_connections = load_connections()
        selected_connection = metafunc.config.getoption("connection")
        repeat = metafunc.config.getoption("repeat")

        # Filter if a specific connection is selected
        if selected_connection:
            if selected_connection not in all_connections:
                raise ValueError(f"Connection '{selected_connection}' not found in config: {all_connections}")
            all_connections = [selected_connection]

        # Generate the matrix of (connection_name, repeat_index) runners
        test_params = [
            pytest.param(conn, i, id=f"{conn}_run{i+1}")
            for conn in all_connections
            for i in range(repeat)
        ]

        # Parametrize the test function
        metafunc.parametrize("connection_name, repeat_index", test_params)
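Stripped of pytest specifics, the matrix that `pytest_generate_tests` expands into can be sketched as plain data; the IDs follow the same `conn_runN` pattern as the `pytest.param` calls above:

```python
def build_matrix(connections, repeat):
    """Expand (connection, repeat) into the same matrix pytest parametrizes,
    as (connection_name, repeat_index, test_id) tuples."""
    return [(conn, i, f"{conn}_run{i + 1}")
            for conn in connections
            for i in range(repeat)]


# All three connections, repeated twice, yields six test IDs
ids = [test_id for _, _, test_id in build_matrix(["direct", "grpc", "http"], 2)]
print(ids)
# ['direct_run1', 'direct_run2', 'grpc_run1', 'grpc_run2', 'http_run1', 'http_run2']
```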

# ------------------------------------------------------------------------------
# Utilities
# ------------------------------------------------------------------------------

def load_connections():
    """
    Loads the list of supported connection names from the static config file.
    """
    return config.get("connection")

5 changes: 5 additions & 0 deletions tests/e2e/pytest.ini
@@ -0,0 +1,5 @@
# pytest.ini
[pytest]
filterwarnings =
ignore:.*use of forkpty.*:DeprecationWarning:pty

6 changes: 6 additions & 0 deletions tests/e2e/requirements.txt
@@ -0,0 +1,6 @@
pexpect
pyhocon
pytest
pytest-xdist
pytest-timeout
pytest-timer
29 changes: 29 additions & 0 deletions tests/e2e/test_cases_data/mnpt_data.hocon
@@ -0,0 +1,29 @@
# mnpt_data.hocon
# Input/output test pairs
> **Collaborator:** Don't abbreviate, leaving people guessing as to what this file is for. No one will instantly know what "mnpt" actually means.


test = [
    {
        input_1: {
            user_text: "Who did yellow submarine?"
            answer: {
                type_match: "keyword"
                word: "Beatles"
                cost: "3.0"
> **Collaborator (@d1donlydfink, Apr 29, 2025):** That you have cost built in as a key likely means that this format is very tightly coupled to a particular test. It would be worth your while to deeply understand the hocon format used in the tests/fixtures area.

            }
        }
    },
    {
        input_2: {
            user_text: "Where were they from?"
            answer: {
                type_match: "keyword"
                word: "Liverpool"
                cost: "6.0"
            }
        }
    },
    {
        input_done: "quit"
    }
> **Collaborator:** You should be able to make this test using the existing infrastructure. See https://github.com/leaf-ai/neuro-san/blob/main/tests/fixtures/music_nerd/beatles_with_history.hocon as a basis for continuing infrastructure and https://github.com/leaf-ai/neuro-san/blob/main/tests/fixtures/math_guy/basic_sly_data.hocon for sly_data.

> **Contributor Author:** Yes, I would like to use the sly_data feature, but I haven't got to it yet. That's an excellent suggestion; I'll look into it.

]

21 changes: 21 additions & 0 deletions tests/e2e/tests/test_music_nerd_pro.py
@@ -0,0 +1,21 @@
# test_music_nerd_pro.py
# ---------------------------------------------------------
# Parametrized test case that drives CLI interaction test
# ---------------------------------------------------------

import pytest
from utils.mnpt_hocon_loader import extract_test_values
from utils.mnpt_test_runner import run_test


@pytest.mark.timeout(120)
def test_run_connection(connection_name, repeat_index, request):
    """
    Main test entry point for testing music_nerd_pro agent over various connections.
    """
    use_thinking_file = request.config.getoption("--thinking-file")

    # Only the connection name is passed; the loader resolves everything else
    result = extract_test_values(connection_name)

    run_test(*result, repeat_index, use_thinking_file)

69 changes: 69 additions & 0 deletions tests/e2e/utils/mnpt_hocon_loader.py
@@ -0,0 +1,69 @@
# ------------------------------------------------------------------------
# mnpt_hocon_loader.py
# ------------------------------------------------------------------------
# Utility functions for loading test prompt/response values from HOCON files.
# Separates test data loading from agent configuration loading.
# ------------------------------------------------------------------------

import os
from pyhocon import ConfigFactory

# ------------------------------------------------------------------------
# Path to the TEST DATA HOCON file
# - This file contains input prompts and expected agent outputs.
# - NOTE: Only test cases, no agent config.
# ------------------------------------------------------------------------

TEST_DATA_HOCON_PATH = os.path.join(
    os.path.dirname(__file__),               # This utils/ folder
    "../test_cases_data/mnpt_data.hocon"     # Relative path to test_cases_data/
)

# ------------------------------------------------------------------------
# Load the test data once at import time
# ------------------------------------------------------------------------
test_data = ConfigFactory.parse_file(os.path.abspath(TEST_DATA_HOCON_PATH))

# ------------------------------------------------------------------------
# Function: extract_test_values
# Description:
# - Loads the prompts and expected answer keywords/costs
# - Validates the connection name if needed
# - Returns extracted values for CLI interaction testing
# ------------------------------------------------------------------------
def extract_test_values(connection_name):
    """
    Loads test prompts and expected outputs for a given connection
    from the test data HOCON file.

    Args:
        connection_name (str): The type of connection to validate (e.g., "grpc", "http")

    Returns:
        tuple: (connection_name, prompt_1, prompt_2, word_1, word_2, cost_1, cost_2, input_done)
    """

    # If you want to validate connection types, you can add that here.
    # Example connection list: ["direct", "grpc", "http"]

    # Pull the list of test prompts and expected outputs
    test_entries = test_data.get("test")

    # Extract the first test input
    input_1 = next(item["input_1"] for item in test_entries if "input_1" in item)
    prompt_1 = input_1.get("user_text")
    word_1 = input_1.get("answer.word")
    cost_1 = input_1.get("answer.cost")

    # Extract the second test input
    input_2 = next(item["input_2"] for item in test_entries if "input_2" in item)
    prompt_2 = input_2.get("user_text")
    word_2 = input_2.get("answer.word")
    cost_2 = input_2.get("answer.cost")

    # Extract the input for termination (e.g., "quit")
    input_done = next((item.get("input_done") for item in test_entries if "input_done" in item), None)

    # Return all values required for the test runner
    return connection_name, prompt_1, prompt_2, word_1, word_2, cost_1, cost_2, input_done
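The loader leans on pyhocon's dotted-path `get` (e.g. `input_1.get("answer.word")`). On plain dicts the same lookup can be sketched as follows; `dotted_get` is a hypothetical helper for illustration, not part of the suite:

```python
def dotted_get(tree, path, default=None):
    """Mimic pyhocon's ConfigTree.get("a.b.c") dotted lookup on nested dicts."""
    node = tree
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Shaped like one input_1 entry from mnpt_data.hocon
entry = {"user_text": "Who did yellow submarine?",
         "answer": {"type_match": "keyword", "word": "Beatles", "cost": "3.0"}}
print(dotted_get(entry, "answer.word"), dotted_get(entry, "answer.cost"))
```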
