Skip to content

Handle retries without returning errors for non-network errors#22

Open
paymanblu wants to merge 5 commits intoIBM:mainfrom
paymanblu:fix-error-logic-with-default-fallback-to-legacy
Open

Handle retries without returning errors for non-network errors#22
paymanblu wants to merge 5 commits intoIBM:mainfrom
paymanblu:fix-error-logic-with-default-fallback-to-legacy

Conversation

@paymanblu
Copy link
Contributor

@paymanblu paymanblu commented Jul 25, 2025

Fix Retry logic to avoid loss of request body on retry and return response body on non-200 Responses:

Issue #22

This PR addresses an issue in the watsonx-go client library where HTTP request bodies were being lost during retry attempts, causing 400 Bad Request errors from the server.

Description

  • Root Cause: When using bytes.NewBuffer() for HTTP request bodies, the buffer gets consumed on the first request attempt. On subsequent retries, the request body was empty, leading to 400 HTML formatted server errors instead of expected json http responses.
  • Impact: All retry attempts after the first request would fail with 400 Bad Request errors, making the retry mechanism behave unexpectedly, and lose context of error response from server.
  • Additional Issues: The existing retry logic, wraps non-200 status codes into an error and returns them as a non-nil error which causes the caller to no longer have access to the body of the http response which may contain critically useful troubleshooting information about the server error. Especially for 400 and 500 error (i.e. non-network errors), the server typically responds with helpful error, and status information. Currently, only the status number is returned in the err and nil is returned as the response. The current behavior of wrapping the status code integer as an error and returning it to the caller also seems to be inconsistent with traditional go http.Client behavior of returning nil error and the response for the caller to handle as they see fit.

Code Changes

1. Enhanced Retry Mechanism (pkg/models/retry.go)

  • Refactor of the retry logic to support request body handling and allowing for nil error and non-nil response for non-200 and non-network error cases.
  • Added req.GetBody support: Enables automatic request body recreation for retries (i.e. in prepareRequest)
  • New retry configuration options:
    • WithRetryIfV2(): Response-based retry conditions (vs legacy error-based)
    • WithOnRetryV2(): Enhanced retry callbacks with response access
    • WithReturnHTTPStatusAsErr(): Controls legacy behavior of converting HTTP status to errors
    • Retry(): Now expectes a request parameter for better control over the request (e.g. ability to prepare request body for retry)
  • Improved error handling: Proper separation of network errors vs HTTP response errors

2. Fixed Request Body Handling (pkg/models/generate.go, pkg/models/embedding.go)

  • Replaced bytes.NewBuffer() with bytes.NewReader(): Prevents buffer consumption issues
  • Enhanced error handling: Better HTTP status code detection and error reporting

3. Added Test Coverage (pkg/internal/tests/models/generate_test.go)

  • Added invalid parameter tests: Validate proper error handling for bad requests
  • Added invalid model tests: Test both new and legacy retry behaviors
  • Enhanced error validation: Parse and validate JSON error responses
  • Improved test organization: Better constants and helper functions

4. Development Tools Enhancement (Makefile)

  • Added fmt-check target: Check code formatting without modifying files
  • Enhanced build pipeline: Added proper dependencies between targets
  • Improved structure: Better organization and documentation

Migration Notes

  • No breaking changes: All existing code continues to work unchanged
  • Optional enhancements: New retry options available for advanced use cases
  • Legacy support: Old retry behavior maintained through configuration flags

Files Modified

  • pkg/models/retry.go - Core retry mechanism improvements
  • pkg/models/generate.go - Request body fix for text generation
  • pkg/models/embedding.go - Request body fix for embeddings
  • pkg/internal/tests/models/generate_test.go - Enhanced test coverage
  • Makefile - Added formatting check capabilities

Externally Visible Outcome

GenerateText err - Before fix

400

GenerateText err - After fix

{"errors":[{"code":"model_not_supported","message":"Model 'meta-llama/llama-3-70b-instruct-foo-test' is not supported","more_info":"https://cloud.ibm.com/apidocs/watsonx-ai#text-generation"}],"trace":"3a8e0587e089f3d55d8f5d16f4f90d7c","status_code":404}

Discussion Topics

  • It would simplify the retry.go logic if we could just make breaking changes and document them. It seems that although the Retry function and the RetryConfig are exported and visible, it is very likely that the consumers of the existing behavior will not have tried to alter the default behavior, and so we could possibly just update the code and document a migration path forward. The migration in such a case would be minimal in that the GenerateText and EmbedDocuments still return an error response as before, but the error response is just a richer response with more details.
  • The current PR is a bit more complex than it needs to be because it is trying to preserve backwards compatibility in the Retry logic by maintaining v1 and v2 logic and switch between them via the flag returnHTTPStatusAsErr.

@paymanblu
Copy link
Contributor Author

This initial draft is meant for discussion purposes. I would be happy to discuss and revise the strategy of using different naming for the feature flags in this PR, or any suggestions.

Copy link
Contributor Author

@paymanblu paymanblu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaving some Notes for the maintainers to make it easier to follow the thought process.

```go
result, _ := client.GenerateText(
"meta-llama/llama-3-1-8b-instruct",
"meta-llama/llama-3-3-70b-instruct",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the initial models seem to no longer be available by default

elapsedTime := endTime.Sub(startTime)
expectedMinimumTime := backoffTime * time.Duration(expectedRetries)

if err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: The new behavior is that the error will be nil and the response will be non-nil.

This will give the user the opportunity to react to the response and use the response contents for troubleshooting purposes by logging/printing them to the end user.


if !opts.retryIf(err) {
return nil, err
shouldRetry = retryConfig.retryIf(retryIfV1Err) || retryConfig.retryIfV2(resp, err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we call both the legacy and new retryIf for backwards compatibility purposes


lastErr = err
opts.onRetry(n+1, err)
retryConfig.onRetry(n+1, statusAsErr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we call both the legacy and new onRetry for backwards compatibility purposes

type HttpClient struct {
httpClient *http.Client
httpClient *http.Client
retryConfig *RetryConfig
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we add the RetryConfig to the struct as a way to influence the Client retry logic from the outside.

type RetryIfFunc func(error) bool

// RetryIfFuncV2 determines whether a retry should be attempted based on the response.
type RetryIfFuncV2 func(*http.Response, error) bool
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this RetryIfV2 variation gives the user the ability to decide what they want to do based on the original response and the error. We don't coerce the status into an error prematurely as to give the user more control over the original Response and error to make a better informed decision on when to Retry.

@christopherallis
Copy link
Collaborator

Thanks @paymanblu for this contribution. Could you help let the DCO pass? Just run git commit -s and push the changes.

@christopherallis christopherallis self-requested a review January 7, 2026 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants