
Add failing tests for #375: Race condition in LLM cost tracking#386

Draft
Serhan-Asad wants to merge 5 commits into promptdriven:main from Serhan-Asad:fix/issue-375

Conversation

@Serhan-Asad
Contributor

Summary

Adds failing tests that detect the race condition bug reported in #375, where concurrent jobs in server mode share the global _LAST_CALLBACK_DATA dictionary, causing cost and token data corruption.

Test Files

  • Unit test: tests/test_llm_invoke_concurrency.py

    • Directly tests _litellm_success_callback() function under concurrent execution
    • Uses 3 concurrent threads (matching the server's max_concurrent=3)
    • Demonstrates threads reading each other's callback data
  • E2E test: tests/test_e2e_issue_375_concurrent_cost_tracking.py

    • Simulates the full server mode scenario with concurrent jobs
    • Tests the complete write→read cycle that happens in production
    • Verifies that each job gets its own cost/token data

What This PR Contains

  • ✅ Failing unit test that reproduces the reported bug
  • ✅ Failing E2E test that verifies the bug at integration level
  • ✅ Tests are verified to fail on current code (demonstrating the bug exists)
  • ✅ Tests will pass once the bug is fixed with thread-local storage

Root Cause

The module-level global dictionary _LAST_CALLBACK_DATA in pdd/llm_invoke.py:690-695 is shared across all threads in server mode. When LiteLLM executes callbacks for concurrent jobs, each callback writes to this shared dictionary, and each llm_invoke() call reads from it. With no synchronization mechanism, one thread's callback can overwrite the shared entry before another thread reads its own result back, a classic lost-update race.
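The failure mode can be reproduced outside PDD with nothing but a shared dict; the names below are illustrative stand-ins for the real callback and reader, and the barrier simply forces the worst-case interleaving (all writes land before any read), a minimal sketch only:

```python
import threading

# Illustrative stand-in for the shared module-level dict in llm_invoke.py.
_LAST_CALLBACK_DATA = {}
results = {}
barrier = threading.Barrier(3)

def job(job_id, cost):
    # The "callback" writes this job's cost into the shared dict...
    _LAST_CALLBACK_DATA["cost"] = cost
    barrier.wait()  # force all writes to land before any job reads
    # ...then the caller reads it back, seeing whichever write came last.
    results[job_id] = _LAST_CALLBACK_DATA["cost"]

threads = [threading.Thread(target=job, args=(i, float(i))) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With a shared dict, only the last writer reads back its own cost.
corrupted = [i for i in range(3) if results[i] != float(i)]
print(len(corrupted))  # prints 2: two of the three jobs see another job's cost
```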

Impact:

  • Budget limits malfunction (jobs may stop early or exceed budgets)
  • Cost reports show incorrect values
  • Token tracking is corrupted
  • Affects all concurrent operations via pdd connect

Next Steps

  1. Implement the fix at pdd/llm_invoke.py using threading.local() storage
  2. Verify the unit test passes (pytest tests/test_llm_invoke_concurrency.py -v)
  3. Verify the E2E test passes (pytest tests/test_e2e_issue_375_concurrent_cost_tracking.py -v)
  4. Run full test suite to check for regressions
  5. Mark PR as ready for review

Fixes #375


Generated by PDD agentic bug workflow

@Serhan-Asad
Contributor Author

Summary

Fixed critical race condition in LLM cost tracking that caused data corruption when multiple jobs run concurrently in server mode. Replaced global _LAST_CALLBACK_DATA
dictionary with thread-local storage (threading.local()) to ensure each concurrent job tracks its own cost data independently.

Impact:

  • Server mode (pdd connect) with max_concurrent=3 was sharing cost data across threads
  • Jobs were reading wrong cost values from other concurrent jobs
  • Budget enforcement was failing (jobs stopped early or exceeded limits)
  • Cost reports showed incorrect amounts

Fix:

  • Replaced module-level global dict with threading.local() for thread-safe storage
  • Each thread gets isolated callback data (cost, tokens, finish_reason)
  • Read operations use getattr() with safe defaults
  • All 6 affected test files updated to reference new _CALLBACK_DATA variable
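A minimal sketch of the storage change described above; record_callback and read_callback are hypothetical helper names for illustration (the real write path is _litellm_success_callback), and the variable name follows the interim _CALLBACK_DATA rename:

```python
import threading

# Thread-local storage: each thread sees its own attribute values.
# (A later commit in this PR keeps the original _LAST_CALLBACK_DATA name.)
_CALLBACK_DATA = threading.local()

def record_callback(cost: float, tokens: int, finish_reason: str) -> None:
    # Hypothetical write path: what the LiteLLM success callback would store.
    _CALLBACK_DATA.cost = cost
    _CALLBACK_DATA.tokens = tokens
    _CALLBACK_DATA.finish_reason = finish_reason

def read_callback() -> dict:
    # Reads use getattr() with safe defaults, so a thread whose callback
    # never fired gets zeros instead of an AttributeError.
    return {
        "cost": getattr(_CALLBACK_DATA, "cost", 0.0),
        "tokens": getattr(_CALLBACK_DATA, "tokens", 0),
        "finish_reason": getattr(_CALLBACK_DATA, "finish_reason", None),
    }
```

Because threading.local() attributes are per-thread, a concurrent job can no longer observe another job's cost, and a thread that never ran a callback simply reads the defaults.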

Test Results

  • Unit tests: PASS
  • Regression tests: PASS
  • Sync regression: PASS
  • Test coverage: 70%

Specific test results:

  • test_concurrent_callbacks_corrupt_cost_data - verifies thread isolation
  • test_sequential_callbacks_work - ensures backward compatibility
  • test_concurrent_server_jobs_cost_tracking_e2e - full server mode simulation
  • test_sequential_server_jobs_still_work_e2e - regression prevention
  • All 7 related e2e tests (OpenAI schema, Vertex retry, etc.)

Overall test suite:

  • 77/78 tests pass in test_llm_invoke.py
  • 1 pre-existing failure unrelated to this fix (LiteLLM provider config issue)

Manual Testing

The automated tests simulate the real server mode scenario:

  • Created concurrent threads matching the server's max_concurrent=3 setting
  • Each thread writes different cost values to callback storage
  • Each thread reads back and verifies it gets its own data (not corrupted)
  • Tests stagger timing to maximize chance of race condition exposure
  • Verified both concurrent and sequential execution paths work correctly
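The isolation check above can be sketched as follows; this is a simplified shape, not the PR's actual test code, with staggered sleeps widening the window in which a shared dict would have been overwritten:

```python
import threading
import time

store = threading.local()  # the fixed storage; a plain shared dict fails this check
leaks = []

def job(job_id: int) -> None:
    store.cost = float(job_id)        # write this job's cost
    time.sleep(0.01 * (job_id + 1))   # stagger reads to expose any overwrites
    if getattr(store, "cost", None) != float(job_id):
        leaks.append(job_id)          # another job's data leaked in

# Three threads, matching the server's max_concurrent=3 setting.
threads = [threading.Thread(target=job, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert not leaks  # thread-local storage keeps each job's data isolated
```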

Checklist

  • All tests pass
  • Copilot comments addressed
  • No temp files committed
  • No merge conflicts
  • (If bug fix) Failing test added - 4 tests added that reproduce race condition
  • (If feature) A/B comparison included - N/A (bug fix)
  • (If prompts changed) Submitted to pdd_cap - No prompts changed in pdd repo

Fixes #375

@gltanaka gltanaka requested a review from Copilot January 24, 2026 20:48
Contributor

Copilot AI left a comment


Pull request overview

This PR adds comprehensive failing tests to reproduce and validate the fix for issue #375, where concurrent jobs in server mode experience race conditions in LLM cost tracking due to shared global state.

Changes:

  • Adds unit tests for concurrent callback execution with thread-local storage
  • Adds E2E tests simulating full server mode concurrent scenarios
  • Updates all existing test files to use the new _CALLBACK_DATA thread-local storage instead of the old _LAST_CALLBACK_DATA dictionary

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

  • pdd/llm_invoke.py - Implements thread-local storage (_CALLBACK_DATA) replacing the global dictionary to fix the race condition
  • tests/test_llm_invoke_concurrency.py - New unit tests verifying thread-safe callback data isolation
  • tests/test_e2e_issue_375_concurrent_cost_tracking.py - New E2E tests simulating concurrent server mode operations
  • tests/test_llm_invoke.py - Updated to mock thread-local storage instead of a dictionary
  • tests/test_llm_invoke_vertex_retry.py - Updated mock for thread-local storage compatibility
  • tests/test_e2e_openai_required_array.py - Updated mock for thread-local storage compatibility
  • tests/test_e2e_issue_295_openai_schema.py - Updated mock for thread-local storage compatibility


@gltanaka gltanaka marked this pull request as draft January 28, 2026 19:34
@gltanaka
Contributor

Making this a draft until the merge conflicts and review comments are addressed.

@gltanaka
Contributor

@Serhan-Asad

Serhan-Asad and others added 4 commits January 28, 2026 16:53
…ptdriven#375)

This commit adds two test files that reproduce the race condition reported
in issue promptdriven#375, where concurrent jobs share the global _LAST_CALLBACK_DATA
dictionary, causing cost and token data corruption.

Test files:
- tests/test_llm_invoke_concurrency.py: Unit test that directly tests the
  _litellm_success_callback() function under concurrent execution
- tests/test_e2e_issue_375_concurrent_cost_tracking.py: E2E test that
  simulates the full server mode scenario with 3 concurrent jobs

Both tests currently fail (as expected), demonstrating the bug. They will
pass once the fix is implemented using threading.local() storage.

Related to promptdriven#375

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated 6 test files to reference _CALLBACK_DATA (thread-local) instead of
_LAST_CALLBACK_DATA (global dict) following the fix for issue promptdriven#375.

- Replaced all _LAST_CALLBACK_DATA references with _CALLBACK_DATA
- Updated mock patches to use object with attributes instead of dict
- Fixed indentation for proper context manager nesting
- Fixed 6 tests added to main after PR creation that still referenced _LAST_CALLBACK_DATA
- Updated line number reference in E2E test from 690 to 689 (Copilot review)
- All tests now pass (83/84, 1 pre-existing LiteLLM provider config issue)

Addresses Copilot review comment on PR promptdriven#386
@Serhan-Asad
Contributor Author

Copilot review addressed

Fixed the line number reference in commit 4ec74b9:

  • Updated from line 690 to 689 (the actual line where _CALLBACK_DATA is defined in pdd/llm_invoke.py)

Also rebased on latest main (88a37d5) and resolved all conflicts. All tests passing (83/84, 1 pre-existing unrelated failure).

Instead of renaming to _CALLBACK_DATA, keep the original variable name
_LAST_CALLBACK_DATA for the thread-local storage. This ensures:
- All existing tests continue to work without modification
- Future tests will naturally use the correct name
- No breaking changes for other contributors

Only changes needed:
- pdd/llm_invoke.py: Rename _CALLBACK_DATA back to _LAST_CALLBACK_DATA
- New test files: Update imports to use _LAST_CALLBACK_DATA

All tests passing (16/17, 1 skipped)
@Serhan-Asad
Contributor Author

Simplified Fix ✅

Changed approach to maintain backward compatibility by keeping the original variable name _LAST_CALLBACK_DATA instead of renaming to _CALLBACK_DATA.

Benefits:

  • ✅ All existing tests work without modification
  • ✅ New tests added to main after this PR work automatically
  • ✅ Future tests will naturally use the correct name
  • ✅ No breaking changes for contributors

Changes:

  • Only 3 files modified (down from 7 previously)
  • pdd/llm_invoke.py: Thread-local storage using original name
  • New test files: Updated to use _LAST_CALLBACK_DATA

Test Results: 16/17 passing, 1 skipped (pre-existing)

@gltanaka
Contributor

gltanaka commented Feb 2, 2026

target 2/3



Development

Successfully merging this pull request may close these issues.

Race Condition in LLM Cost Tracking Causes Data Corruption
