WIP: Defer K8s operations in CreateListener to investigate timeout #768

Open

stormcat24 wants to merge 1 commit into main from fix/defer-k8s-operations-in-listener-creation

Conversation

@stormcat24 stormcat24 (Member) commented Oct 16, 2025

Summary

  • Investigated Terraform provider timeout (3m40s) during aws_lb_listener.http creation
  • Identified and removed blocking operations in CreateListener function
  • Critical finding: The issue persists even after removing ALL K8s operations, proving the root cause is NOT in CreateListener itself

Investigation

Problem

  • Terraform apply hangs for 3m40s when creating aws_lb_listener.http resource
  • Fails with "Plugin did not respond" error

Investigation Steps and Results

  1. ✅ Identified blocking operations in CreateListener:

    • updateTraefikConfigForListener() - Creates K8s Ingress resource
    • addK3dPortMapping() - Executes k3d CLI command
  2. ❌ Added timeout context wrapper → FAILED

    • Set 30-second timeout, but still timed out after 3m+
  3. ❌ Attempted async goroutine pattern → FAILED

    • Used background context with async execution, same hang occurred (attempts 2 and 3 are sketched after this list)
  4. ❌ Commented out Traefik/Ingress configuration → FAILED

    • Revealed k3d port mapping was also problematic
  5. ❌ Removed ALL K8s operations → STILL FAILS

    • Removed all K8s operations from CreateListener
    • 3m40s hang persists
    • Critical finding: Root cause is NOT in CreateListener function
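
The PR text describes attempts 2 and 3 but does not show their code, so the following is a hedged reconstruction of those two patterns. The helper names match the PR; their signatures, the stand-in bodies, and the surrounding function are assumptions for illustration only.

```go
// Hedged reconstruction of investigation attempts 2 and 3; the two helper
// signatures are assumptions, since the PR does not show the original code.
package sketch

import (
	"context"
	"fmt"
	"log"
	"time"
)

// Stand-in for updateTraefikConfigForListener: simulates a call that
// blocks far longer than the Terraform provider is willing to wait.
func updateTraefikConfigForListener(ctx context.Context, lbName string, port int32) error {
	select {
	case <-time.After(10 * time.Minute): // pretend the K8s API never answers
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Stand-in for addK3dPortMapping: would shell out to the k3d CLI.
func addK3dPortMapping(ctx context.Context, port int32) error { return nil }

func createListener(ctx context.Context, lbName string, port int32) error {
	// Attempt 2: wrap the K8s call in a 30-second timeout context.
	// In the investigation the request still hung for 3m+, an early hint
	// that the wait was happening somewhere other than this call.
	timeoutCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if err := updateTraefikConfigForListener(timeoutCtx, lbName, port); err != nil {
		return fmt.Errorf("update Traefik config: %w", err)
	}

	// Attempt 3: fire-and-forget goroutine with a background context so the
	// handler returns immediately. The hang persisted regardless.
	go func() {
		if err := addK3dPortMapping(context.Background(), port); err != nil {
			log.Printf("deferred k3d port mapping failed: %v", err)
		}
	}()
	return nil
}
```

Neither pattern can help when the hang occurs upstream of CreateListener, which is consistent with the finding in step 5.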

Changes Made

  • Removed updateTraefikConfigForListener() call (was at line 371)
  • Removed addK3dPortMapping() call (was at line 375)
  • Added debug logging explaining operations are deferred
  • Listener is now created in memory only without K8s integration

Next Steps

Areas Requiring Further Investigation

  1. Hot reload verification: Old code may still be running despite rebuild
  2. API layer investigation: elbv2_api_impl.go may have additional blocking operations
  3. Storage layer investigation: DuckDB operations may be slow
  4. LocalStack response time: Mock AWS service may have slow responses

Architectural Fix Required

  • Terraform provider expects synchronous API responses
  • Performing I/O operations (K8s, storage, external commands) in request path causes timeouts
  • Proper fix: Move all I/O operations to background reconciliation loop (sketched below)
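
A minimal sketch of that reconciliation pattern, assuming invented names and types (this is not the repo's actual API): the request path only records desired state and returns, while a background loop performs the slow K8s/k3d work and retries failures.

```go
// Minimal sketch of the proposed background reconciliation loop; all
// names and types here are assumptions, not this repo's actual API.
package reconcile

import (
	"context"
	"log"
	"time"
)

// PendingListener is the desired state recorded synchronously by the
// CreateListener request path.
type PendingListener struct {
	ARN  string
	Port int32
}

type Reconciler struct {
	queue chan PendingListener
}

func NewReconciler() *Reconciler {
	return &Reconciler{queue: make(chan PendingListener, 64)}
}

// Enqueue is called from the API handler; it records intent and returns
// immediately, so the Terraform provider gets a fast synchronous response.
func (r *Reconciler) Enqueue(l PendingListener) {
	r.queue <- l
}

// Run performs the slow I/O (K8s Ingress, k3d port mapping) off the
// request path, retrying failed items on the next tick.
func (r *Reconciler) Run(ctx context.Context, apply func(context.Context, PendingListener) error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	var pending []PendingListener
	for {
		select {
		case <-ctx.Done():
			return
		case l := <-r.queue:
			pending = append(pending, l)
		case <-ticker.C:
			var retry []PendingListener
			for _, l := range pending {
				if err := apply(ctx, l); err != nil {
					log.Printf("reconcile %s: %v (will retry)", l.ARN, err)
					retry = append(retry, l)
				}
			}
			pending = retry
		}
	}
}
```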

Test Results

  • ✅ All unit tests pass
  • ✅ Code compiles successfully
  • ❌ Terraform apply still hangs at 3m40s

🤖 Generated with Claude Code

Problem:
- Terraform provider times out after 3m40s during aws_lb_listener.http creation
- "Plugin did not respond" error causes terraform apply to fail

Investigation performed:
1. Identified blocking operations in CreateListener (integration_k8s.go):
   - updateTraefikConfigForListener() - creates K8s Ingress resource
   - addK3dPortMapping() - executes k3d CLI command
2. Attempted timeout context wrapper - FAILED
3. Attempted async goroutine pattern - FAILED
4. Commented out Traefik/Ingress configuration - FAILED
5. Removed ALL K8s operations including k3d port mapping - STILL FAILS

Current Status:
- Despite removing all K8s operations from CreateListener, the hang persists
- This definitively proves the root cause is NOT in CreateListener itself
- Listener is now created in memory only without K8s integration

Changes Made:
- Deferred updateTraefikConfigForListener() call (was line 371)
- Deferred addK3dPortMapping() call (was line 375)
- Added debug logging explaining operations are deferred

Next Steps Needed:
1. Verify hot reload worked correctly (old code may still be running)
2. Investigate API layer (elbv2_api_impl.go) for blocking operations
3. Investigate storage layer for slow operations
4. Check LocalStack response times
5. Consider architectural fix: move all I/O to background reconciliation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Greptile Review Summary
- Critical Issues: 0
- Logic Issues: 0
- Style Suggestions: 0

✅ No blocking issues found

See Greptile Workflow Guide for handling reviews.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This WIP PR investigates a 3m40s timeout issue when Terraform creates aws_lb_listener.http resources. The investigation systematically removed potentially blocking operations from CreateListener:

  • Removed updateTraefikConfigForListener() (K8s Ingress creation)
  • Removed target group name extraction logic
  • Removed addK3dPortMapping() (k3d CLI execution)

Key Finding: The timeout persists even after removing all K8s operations, proving the root cause is elsewhere in the call chain.

Likely culprits identified:

  • Database operations in elbv2_api_impl.go (lines 602, 643) - GetLoadBalancer or CreateListener DuckDB queries
  • Hot reload may not have deployed new code (verify pod restart timestamps)
  • HTTP middleware or context timeout configuration
  • Terraform provider timeout settings

Next steps for investigation:

  1. Add timing logs around DuckDB operations in elbv2_api_impl.go (see the sketch after this list)
  2. Verify hot reload actually restarted pods with new code
  3. Check database locks or slow transactions
  4. Consider moving all I/O to background reconciliation loop (architectural fix)
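
For step 1, a possible shape for that timing instrumentation — a hedged sketch in which the generic helper, logger, and call names are placeholders, not the repo's code (the repo's own logging.Debug, seen in the diff, could be used instead of slog):

```go
// Hypothetical timing wrapper for the suspect storage calls in
// elbv2_api_impl.go (lines 602 and 643); everything here is a placeholder.
package instrument

import (
	"context"
	"log/slog"
	"time"
)

// timed runs a storage call and logs its duration, so a multi-minute
// hang shows up in the logs next to a named operation.
func timed[T any](ctx context.Context, name string, call func(context.Context) (T, error)) (T, error) {
	start := time.Now()
	result, err := call(ctx)
	slog.DebugContext(ctx, "storage call finished",
		"call", name,
		"elapsed", time.Since(start).String(),
		"error", err)
	return result, err
}
```

A call site would then wrap each suspect query, e.g. `timed(ctx, "GetLoadBalancer", ...)`, so a 3m40s wait is attributed to a named call rather than to the request as a whole.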

Confidence Score: 3/5

  • This WIP PR is safe to merge as an investigation branch but does not resolve the timeout issue
  • Score of 3 reflects that this is investigative work that correctly identifies the K8s operations are not the bottleneck. The changes themselves are safe (deferring operations), but the PR doesn't solve the original problem. The timeout persists, indicating the root cause is upstream in the database layer or deployment process. All tests pass, no critical issues introduced, but production functionality is degraded (listeners created without K8s integration).
  • Focus investigation on controlplane/internal/controlplane/api/elbv2_api_impl.go (database operations at lines 602 and 643) and verify hot reload deployed new code

Important Files Changed

File Analysis

| Filename | Score | Overview |
|----------|-------|----------|
| controlplane/internal/integrations/elbv2/integration_k8s.go | 3/5 | Deferred K8s operations in CreateListener to investigate timeout, but root cause appears to be elsewhere in call chain (likely DB operations or hot reload issue) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TF as Terraform Provider
    participant HTTP as HTTP API (routing.go)
    participant API as ELBv2APIImpl
    participant DB as DuckDB Storage
    participant K8s as K8sIntegration
    participant Kube as Kubernetes API

    TF->>HTTP: POST /CreateListener
    Note over TF,HTTP: 3m40s timeout occurs here
    HTTP->>API: CreateListener(ctx, input)

    API->>DB: GetLoadBalancer(ctx, arn)
    Note over API,DB: Potential bottleneck #1<br/>Database read operation
    DB-->>API: LoadBalancer data

    API->>DB: CreateListener(ctx, dbListener)
    Note over API,DB: Potential bottleneck #2<br/>Database write operation
    DB-->>API: Success

    API->>K8s: CreateListener(ctx, lbArn, port, protocol, tgArn)
    Note over K8s: WIP: K8s operations deferred<br/>(Traefik config, k3d port mapping)
    K8s->>K8s: Generate ARN, create in-memory listener
    Note over K8s: Skipped: updateTraefikConfigForListener()<br/>Skipped: addK3dPortMapping()
    K8s-->>API: Listener created

    API-->>HTTP: CreateListenerOutput
    HTTP-->>TF: HTTP 200 Response

    Note over TF,Kube: Investigation Result:<br/>Timeout persists even after removing K8s ops<br/>Root cause likely in DB operations or hot reload
```

1 file reviewed, 2 comments


Comment on controlplane/internal/integrations/elbv2/integration_k8s.go, line 332:

```go
targetGroupName = tg.Name
}
}
// Note: Target group name extraction skipped for now since Traefik config is deferred
```

style: commented-out target group extraction logic still references removed code

The comment says "Target group name extraction skipped for now since Traefik config is deferred" but this creates technical debt. The entire code block was removed, leaving this orphaned comment.

Suggested change

```diff
- // Note: Target group name extraction skipped for now since Traefik config is deferred
+ i.mu.RUnlock()
```

Consider removing this comment or clarifying what needs to happen when K8s operations are re-enabled.


Comment on controlplane/internal/integrations/elbv2/integration_k8s.go, lines +354 to +360:
```go
// TODO: Update Traefik configuration with new listener
// For now, skip K8s integration to prevent timeout
// The listener is created in memory immediately without K8s ingress
logging.Debug("Listener created in memory, K8s configuration pending",
	"listener_arn", arn,
	"load_balancer", lbName,
	"note", "Traefik/Ingress/k3d port mapping creation deferred to avoid provider timeout")
```

style: TODOs defer K8s operations but the timeout likely originates upstream in the call chain

Based on the investigation, removing K8s ops from this function didn't fix the 3m40s timeout, which correctly indicates the bottleneck is elsewhere. The call chain is:

Terraform → HTTP API → elbv2_api_impl.go:654 → this function

Likely culprits to investigate:

  • Database operations at elbv2_api_impl.go:602 (GetLoadBalancer) or 643 (CreateListener)
  • Hot reload may not have deployed the new code
  • Some middleware or context timeout in the HTTP chain

@github-actions

Code Metrics Report

|                     | #767 (a2793cd) | #768 (0cd5007) | +/-   |
|---------------------|----------------|----------------|-------|
| Coverage            | 12.8%          | 12.7%          | -0.1% |
| Code to Test Ratio  | 1:0.2          | 1:0.2          | +0.0  |
| Test Execution Time | 3m23s          | 3m19s          | -4s   |
Details

```diff
  |                     | #767 (a2793cd) | #768 (0cd5007) |  +/-  |
  |---------------------|----------------|----------------|-------|
- | Coverage            |          12.8% |          12.7% | -0.1% |
  |   Files             |            269 |            269 |     0 |
  |   Lines             |          42056 |          42040 |   -16 |
- |   Covered           |           5397 |           5373 |   -24 |
+ | Code to Test Ratio  |          1:0.2 |          1:0.2 |  +0.0 |
  |   Code              |          71292 |          71263 |   -29 |
  |   Test              |          16352 |          16352 |     0 |
+ | Test Execution Time |          3m23s |          3m19s |   -4s |
```

Code coverage of files in pull request scope (27.8% → 23.9%)

| Files | Coverage | +/- | Status |
|-------|----------|-----|--------|
| controlplane/internal/integrations/elbv2/integration_k8s.go | 23.9% | -4.0% | modified |

Reported by octocov
