UN-3096 add 1st e2e test case #179

Changes from 19 commits
**CI workflow** — `@@ -35,8 +35,15 @@ jobs:`

```yaml
      - name: Run flake8
        run: flake8

      - name: Run pytest (excluding integration tests)
        run: pytest --verbose -m "not integration" --timer-top-n 10
      - name: Run E2E Tests (only tests/e2e/)
        run: pytest tests/e2e/tests/ --verbose --connection direct --thinking-file --repeat 1 -n auto
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          AGENT_TOOL_PATH: "./neuro_san/coded_tools"
          PYTHONPATH: ${{ env.PYTHONPATH }}:"."

      - name: Run pytest Run All Other Tests (excluding integration and e2e)
        run: pytest --verbose -m "not integration and not e2e" --ignore=tests/e2e/ --timer-top-n 10
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          AGENT_TOOL_PATH: "./neuro_san/coded_tools"
```

> **Contributor** (on the "Run pytest Run All Other Tests" step name): This name seems redundant. How about just
**Build/test requirements** — `@@ -8,6 +8,10 @@ timeout-decorator==0.5.0`

```text
coverage==7.6.1
pytest-cov==5.0.0
parameterized
pexpect
pyhocon
pytest-xdist
pytest-timeout

# Code quality
flake8==7.1.1
```

> **Author:** Added requirement for e2e tests.
>
> **Contributor:** Should these requirements go to
**New file: E2E README** — `@@ -0,0 +1,114 @@`

# 🧪 End-to-End Agent Testing Framework

This project provides an extensible, reusable **pytest**-based test system to validate AI agent behavior through real CLI interactions.

It supports:
- Running **multiple connections** (`grpc`, `http`, `direct`)
- **Parallel execution** with **pytest-xdist**
- Optional **thinking file capture** for agent internals
- Config-driven prompts using **HOCON** files

---

## 📦 Project Structure

```bash
e2e/
├── README.md                # This documentation
├── configs/                 # Static agent configuration
│   └── config.hocon
├── conftest.py              # Pytest customizations (CLI args, test discovery)
├── pytest.ini               # Pytest settings
├── requirements.txt         # Python dependencies
├── test_cases_data/         # Test data for each agent
│   └── mnpt_data.hocon
├── tests/                   # Test case source files
│   └── test_music_nerd_pro.py
└── utils/                   # Helper modules (parsing, building commands, etc.)
    ├── mnpt_hocon_loader.py
    ├── mnpt_output_parser.py
    ├── mnpt_test_runner.py
    ├── thinking_file_builder.py
    └── verifier.py
```

---
## 🚀 Running Tests

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Basic Test Command

Run a test (default: **all connections**):

```bash
pytest tests/ --verbose
```

Run for a specific connection only:

```bash
pytest tests/ --connection grpc --verbose
```

Run and enable thinking-file output:

```bash
pytest tests/ --thinking-file --verbose
```

Enable parallel test execution:

```bash
pytest tests/ --connection grpc --repeat 5 --thinking-file -n auto --verbose
```

> 💡 When using `-n auto`, the repeated runs are distributed across multiple CPU cores.

---

## ⚙️ CLI Options

| Option | Description |
|:------------------|:------------|
| `--connection` | Run tests only for a specific connection (e.g., `grpc`, `http`, `direct`). |
| `--repeat` | Repeat each test multiple times. |
| `--thinking-file` | Save the agent's internal "thinking" to a temp directory during the test. |

---
# 🎧 Agent: MusicNerdPro Test (`test_music_nerd_pro.py`)

This suite tests the `music_nerd_pro` agent over all connection types.

### Test Logic

- Load prompts and expected outputs from **HOCON** config files
- Spawn a CLI agent process
- Send user questions
- Verify that:
  - The correct keyword appears in the response
  - The correct cost value is returned
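The per-question verification step above can be sketched as a small helper. The function name `verify_exchange` and the way `ask` is injected are illustrative assumptions; the real logic lives in `utils/` (e.g. `verifier.py`), and the `word`/`cost` field names mirror the test data file.

```python
# A minimal sketch of the per-question verification loop described above.
# `ask` stands in for the pexpect-driven send/expect pair; the helper itself
# is illustrative -- not the project's actual verifier implementation.
from typing import Callable, Dict


def verify_exchange(ask: Callable[[str], str], user_text: str,
                    expected: Dict[str, str]) -> str:
    """Send one prompt to the agent CLI and check keyword + cost in the reply."""
    response = ask(user_text)
    assert expected["word"] in response, f"keyword {expected['word']!r} missing"
    assert expected["cost"] in response, f"cost {expected['cost']!r} missing"
    return response


# With pexpect, `ask` would wrap child.sendline(...) / child.expect(...);
# here a canned reply stands in for the real agent.
fake_agent = lambda text: "The Beatles wrote it. Running cost: 3.0"
verify_exchange(fake_agent, "Who did yellow submarine?",
                {"word": "Beatles", "cost": "3.0"})
```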
### Related Files

| File | Purpose |
|:-----|:--------|
| `tests/test_music_nerd_pro.py` | Main test case (pytest function) |
| `test_cases_data/mnpt_data.hocon` | Prompt/expected-answer definitions |
| `configs/config.hocon` | Static agent config (connections list) |
| `utils/*.py` | Reusable helpers for all agent tests |

---
# 📝 Notes

- **Thinking files** are stored under `/private/tmp/agent_thinking/`.
- If `-n auto` is used, **worker-specific** folders are created (e.g., `run_gw0_1`).
- **pexpect** is used to fully simulate CLI typing behavior.
- Future agents can easily be added following the same pattern as MusicNerdPro.
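The worker-specific folder naming can be derived from the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets inside each worker process. A hedged sketch — the exact scheme used by `thinking_file_builder.py` may differ:

```python
# Sketch of worker-specific thinking-file folders; pytest-xdist exports
# PYTEST_XDIST_WORKER (e.g. "gw0") in each worker process, and we fall back
# to "main" when running without -n.
import os


def thinking_dir(base: str, repeat_index: int) -> str:
    """Build a per-worker, per-repeat folder name such as run_gw0_1."""
    worker = os.environ.get("PYTEST_XDIST_WORKER", "main")
    return os.path.join(base, f"run_{worker}_{repeat_index + 1}")
```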
**`configs/config.hocon`** — `@@ -0,0 +1,8 @@`

```hocon
# config.hocon
# Agent config & connection setup

connection = ["direct", "grpc", "http"]
agent = [music_nerd_pro]

model_llm = ["gpt-4o", "llama3.1"]
```

> **Collaborator:** LLMs should be a property of the agent, not the test.
>
> **Author:** I listed it here because I was thinking of performance test case(s). For example:
>
> **Author:** Alternatively, we could utilize the existing infrastructure of the sly_data feature to perform the comparison.
**`conftest.py`** — `@@ -0,0 +1,108 @@`

```python
# conftest.py
# ------------------------------------------------------------------------
# Provides custom CLI flags, dynamic test generation, and environment setup.
# Shared pytest configuration for agent tests such as the MusicNerdPro test.
# ------------------------------------------------------------------------

import os

import pytest
from pyhocon import ConfigFactory

# ------------------------------------------------------------------------------
# Constants
# ------------------------------------------------------------------------------

# Directory where agent CLI thinking files will be written (optional feature)
THINKING_FILE_PATH = "/private/tmp/agent_thinking"

# Static agent config (HOCON) loaded once for all tests.
# Parse the config hocon to get the connections.
CONFIG_HOCON_PATH = os.path.join(os.path.dirname(__file__), "configs", "config.hocon")
config = ConfigFactory.parse_file(CONFIG_HOCON_PATH)

# ------------------------------------------------------------------------------
# Hooks
# ------------------------------------------------------------------------------


def pytest_configure(config):
    """
    Prints custom environment info when pytest starts.
    Helps verify environment settings.
    """
    print("\nCustom Environment Info")
    print(f"thinking-file path : {THINKING_FILE_PATH}")


def pytest_addoption(parser):
    """
    Adds custom command-line options for pytest to control the test suite:
      --connection    -> Filter tests by a specific connection method (direct/grpc/http)
      --repeat        -> Repeat the same test multiple times (for stability/reliability)
      --thinking-file -> Enable writing agent thinking_file logs during the test
    """
    group = parser.getgroup("custom options")
    group.addoption(
        "--connection",
        action="store",
        default=None,
        help="Specify a connection name to test (e.g., direct, grpc, http). If omitted, all will be tested."
    )
    group.addoption(
        "--repeat",
        action="store",
        type=int,
        default=1,
        help="Number of times to repeat each test (for stress or reliability testing)."
    )
    group.addoption(
        "--thinking-file",
        action="store_true",
        default=False,
        help="If enabled, the agent will write a thinking_file log per test case (grpc/http/direct)."
    )


def pytest_generate_tests(metafunc):
    """
    Dynamically parameterizes the tests based on the connection(s) and repetition requested.

    Example:
        --connection grpc --repeat 3
            -> Runs 3 tests against the 'grpc' connection.

        --repeat 2 (with no connection)
            -> Runs 2 tests for each connection (direct, grpc, http).

    This auto-expands into (connection_name, repeat_index) fixture pairs.
    """
    if "connection_name" in metafunc.fixturenames:
        # By default, all three connections are tested
        all_connections = load_connections()

        selected_connection = metafunc.config.getoption("connection")
        repeat = metafunc.config.getoption("repeat")

        # Filter if a specific connection is selected
        if selected_connection:
            if selected_connection not in all_connections:
                raise ValueError(f"Connection '{selected_connection}' not found in config: {all_connections}")
            all_connections = [selected_connection]

        # Generate the matrix of (connection_name, repeat_index) combinations
        test_params = [
            pytest.param(conn, i, id=f"{conn}_run{i+1}")
            for conn in all_connections
            for i in range(repeat)
        ]

        # Parametrize the test function
        metafunc.parametrize("connection_name, repeat_index", test_params)


# ------------------------------------------------------------------------------
# Utilities
# ------------------------------------------------------------------------------


def load_connections():
    """
    Loads the list of supported connection names from the HOCON config file.
    """
    return config.get("connection")
```
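In plain Python, the parametrization matrix that `pytest_generate_tests` expands looks like this (the names mirror the hook above; pytest internals are left out for illustration):

```python
# The (connection, repeat) matrix that pytest_generate_tests builds,
# reproduced without pytest internals.
connections = ["direct", "grpc", "http"]
repeat = 2

test_params = [(conn, i, f"{conn}_run{i + 1}")
               for conn in connections
               for i in range(repeat)]

# First entries: ("direct", 0, "direct_run1"), ("direct", 1, "direct_run2"), ...
```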
**`pytest.ini`** — `@@ -0,0 +1,5 @@`

```ini
# pytest.ini
[pytest]
filterwarnings =
    ignore:.*use of forkpty.*:DeprecationWarning:pty
```
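One possible follow-up (an assumption, not part of this diff): since the test module applies `@pytest.mark.e2e` and CI deselects with `-m "not e2e"`, the marker could also be registered here to avoid `PytestUnknownMarkWarning`:

```ini
[pytest]
markers =
    e2e: end-to-end tests that drive the agent CLI
filterwarnings =
    ignore:.*use of forkpty.*:DeprecationWarning:pty
```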
**E2E `requirements.txt`** — `@@ -0,0 +1,6 @@`

```text
pexpect
pyhocon
pytest
pytest-xdist
pytest-timeout
pytest-timer
```
**Test data (HOCON)** — `@@ -0,0 +1,29 @@`

```hocon
# test_data.hocon
# Input/output test pairs

test = [
    {
        input_1: {
            user_text: "Who did yellow submarine?"
            answer: {
                type_match: "keyword"
                word: "Beatles"
                cost: "3.0"
            }
        }
    },
    {
        input_2: {
            user_text: "Where were they from?"
            answer: {
                type_match: "keyword"
                word: "Liverpool"
                cost: "6.0"
            }
        }
    },
    {
        input_done: "quit"
    }
]
```

> **Collaborator:** Don't abbreviate, leaving people guessing as to what this file is for.
>
> **Collaborator:** That you have cost built in as a key likely means that this format is very tightly coupled to a particular test.
>
> **Collaborator:** You should be able to make this test using the existing infrastructure.
>
> **Author:** Yes, I would like to use the sly_data feature, but I haven't gotten to it yet. That's an excellent suggestion; I'll look into it.
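A runner might walk the pair list above as follows, sketched with an equivalent Python structure (pyhocon's `ConfigFactory` yields dict-like objects of the same shape; the handling of `input_done` as a session-ending sentinel is an assumption based on the data):

```python
# Walks the ordered test-pair list: each input_N entry carries a prompt and
# its expected answer; input_done holds the command that ends the CLI session.
test_data = [
    {"input_1": {"user_text": "Who did yellow submarine?",
                 "answer": {"type_match": "keyword", "word": "Beatles", "cost": "3.0"}}},
    {"input_2": {"user_text": "Where were they from?",
                 "answer": {"type_match": "keyword", "word": "Liverpool", "cost": "6.0"}}},
    {"input_done": "quit"},
]

exchanges = []
quit_command = None
for entry in test_data:
    key, value = next(iter(entry.items()))
    if key == "input_done":
        quit_command = value
    else:
        exchanges.append((value["user_text"], value["answer"]))
```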
**`tests/test_music_nerd_pro.py`** — `@@ -0,0 +1,33 @@`

```python
# test_music_nerd_pro.py
# ---------------------------------------------------------
# Parametrized E2E test case that drives CLI interaction tests
# ---------------------------------------------------------

import pytest

from utils.mnpt_hocon_loader import extract_test_values
from utils.mnpt_test_runner import run_test


@pytest.mark.e2e
@pytest.mark.timeout(120)
def test_run_connection(connection_name, repeat_index, request):
    """
    End-to-end test for the music_nerd_pro agent across different connections.

    This test:
    - Dynamically parametrizes across multiple connections (e.g., direct, grpc, http).
    - Supports repeated test runs via `repeat_index`.
    - Optionally uses a 'thinking file' if the --thinking-file pytest option is passed.
    """
    # Retrieve the custom CLI option for thinking-file usage
    use_thinking_file = request.config.getoption("--thinking-file")

    # Extract the required test values for the given connection
    result = extract_test_values(connection_name)

    # Defensive check (optional but good practice)
    assert result is not None, f"Failed to extract test values for connection: {connection_name}"

    # Execute the CLI-based test
    run_test(*result, repeat_index, use_thinking_file)
```
> **Reviewer:** Do we want/need to run the e2e tests every time? I'm not arguing that we should or should not; I'm just wondering. How solid are these tests against false failures?
>
> **Author:** The test is solid after I updated Music Nerd Pro. It has triggered over 100 runs without any false failures. This single test case takes about 18 seconds. It is an e2e smoke test; I should label it as a smoke test instead. This quick test should help determine what went wrong on the server side. Since it uses the agent_cli, it also helps determine whether the client is working or not. If it is too much, we could trigger it once a day.