WIP: Defer K8s operations in CreateListener to investigate timeout #768
stormcat24 wants to merge 1 commit into main
Conversation
Problem:
- Terraform provider times out after 3m40s during aws_lb_listener.http creation
- "Plugin did not respond" error causes terraform apply to fail

Investigation performed:
1. Identified blocking operations in CreateListener (integration_k8s.go):
   - updateTraefikConfigForListener() - creates K8s Ingress resource
   - addK3dPortMapping() - executes k3d CLI command
2. Attempted timeout context wrapper - FAILED
3. Attempted async goroutine pattern - FAILED
4. Commented out Traefik/Ingress configuration - FAILED
5. Removed ALL K8s operations including k3d port mapping - STILL FAILS

Current Status:
- Despite removing all K8s operations from CreateListener, the hang persists
- This strongly indicates the root cause is NOT in CreateListener itself
- Listener is now created in memory only, without K8s integration

Changes Made:
- Deferred updateTraefikConfigForListener() call (was line 371)
- Deferred addK3dPortMapping() call (was line 375)
- Added debug logging explaining that the operations are deferred

Next Steps Needed:
1. Verify hot reload worked correctly (old code may still be running)
2. Investigate API layer (elbv2_api_impl.go) for blocking operations
3. Investigate storage layer for slow operations
4. Check LocalStack response times
5. Consider architectural fix: move all I/O to background reconciliation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
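For reference, a rough sketch of what the "timeout context wrapper" attempt (step 2 above) could look like. This is not the code that was actually tried; `runWithTimeout`, the closure shape, and the 100ms budget in the demo are illustrative only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout bounds a unit of work (here, a stand-in for the K8s/k3d calls)
// with a deadline so a hung kubectl or k3d invocation cannot block the
// CreateListener API response indefinitely.
func runWithTimeout(parent context.Context, d time.Duration, work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- work(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("deferred K8s configuration timed out: %w", ctx.Err())
	}
}

func main() {
	// Simulate a K8s call that never returns on its own (the suspected hang).
	err := runWithTimeout(context.Background(), 100*time.Millisecond, func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out as expected:", err)
	}
}
```

As the PR description notes, this pattern did not resolve the hang, which is what pointed the investigation away from CreateListener.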
Greptile Review Summary
- Critical Issues: 0
- Logic Issues: 0
- Style Suggestions: 0

✅ **No blocking issues found.** See Greptile Workflow Guide for handling reviews.
Greptile Overview
Summary
This WIP PR investigates a 3m40s timeout issue when Terraform creates aws_lb_listener.http resources. The investigation systematically removed potentially blocking operations from CreateListener:
- Removed updateTraefikConfigForListener() (K8s Ingress creation)
- Removed target group name extraction logic
- Removed addK3dPortMapping() (k3d CLI execution)
Key Finding: The timeout persists even after removing all K8s operations, proving the root cause is elsewhere in the call chain.
Likely culprits identified:
- Database operations in elbv2_api_impl.go (lines 602, 643) - GetLoadBalancer or CreateListener DuckDB queries
- Hot reload may not have deployed new code (verify pod restart timestamps)
- HTTP middleware or context timeout configuration (see the sketch after this list)
- Terraform provider timeout settings
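On the middleware/timeout item above: one way to make the request budget explicit and easy to inspect is at the HTTP server wiring. The sketch below is plain net/http and does not reflect the project's actual routing.go; the handler, port, and 30s budget are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the CreateListener handler; any deadline attached to
		// r.Context() would surface here instead of an open-ended hang.
		fmt.Fprintln(w, "ok")
	})

	srv := &http.Server{
		Addr: ":8080",
		// http.TimeoutHandler replies 503 once the budget is exceeded,
		// failing fast instead of letting the Terraform provider wait 3m40s.
		Handler:           http.TimeoutHandler(mux, 30*time.Second, "request timed out"),
		ReadHeaderTimeout: 10 * time.Second,
	}
	_ = srv.ListenAndServe()
}
```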
Next steps for investigation:
- Add timing logs around DuckDB operations in elbv2_api_impl.go (a minimal timing sketch follows after this list)
- Verify hot reload actually restarted pods with new code
- Check database locks or slow transactions
- Consider moving all I/O to background reconciliation loop (architectural fix)
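A minimal sketch of the timing-log idea, assuming nothing about the real storage interface: `timed` and the operation name are placeholders, and the actual calls live in elbv2_api_impl.go. It uses the standard library slog rather than the project's logging package.

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// timed wraps a single storage call and logs its duration, so a slow
// GetLoadBalancer or CreateListener query shows up immediately in the logs.
func timed(ctx context.Context, op string, fn func(context.Context) error) error {
	start := time.Now()
	err := fn(ctx)
	slog.Info("storage operation finished",
		"op", op,
		"duration", time.Since(start),
		"err", err)
	return err
}

func main() {
	_ = timed(context.Background(), "GetLoadBalancer", func(ctx context.Context) error {
		time.Sleep(50 * time.Millisecond) // stand-in for a DuckDB query
		return nil
	})
}
```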
Confidence Score: 3/5
- This WIP PR is safe to merge as an investigation branch but does not resolve the timeout issue
- Score of 3 reflects that this is investigative work that correctly identifies that the K8s operations are not the bottleneck. The changes themselves are safe (deferring operations), but the PR doesn't solve the original problem. The timeout persists, indicating the root cause is upstream in the database layer or deployment process. All tests pass and no critical issues are introduced, but production functionality is degraded (listeners are created without K8s integration).
- Focus investigation on controlplane/internal/controlplane/api/elbv2_api_impl.go (database operations at lines 602 and 643) and verify that hot reload deployed the new code
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| controlplane/internal/integrations/elbv2/integration_k8s.go | 3/5 | Deferred K8s operations in CreateListener to investigate timeout, but root cause appears to be elsewhere in call chain (likely DB operations or hot reload issue) |
Sequence Diagram
sequenceDiagram
participant TF as Terraform Provider
participant HTTP as HTTP API (routing.go)
participant API as ELBv2APIImpl
participant DB as DuckDB Storage
participant K8s as K8sIntegration
participant Kube as Kubernetes API
TF->>HTTP: POST /CreateListener
Note over TF,HTTP: 3m40s timeout occurs here
HTTP->>API: CreateListener(ctx, input)
API->>DB: GetLoadBalancer(ctx, arn)
Note over API,DB: Potential bottleneck #1<br/>Database read operation
DB-->>API: LoadBalancer data
API->>DB: CreateListener(ctx, dbListener)
Note over API,DB: Potential bottleneck #2<br/>Database write operation
DB-->>API: Success
API->>K8s: CreateListener(ctx, lbArn, port, protocol, tgArn)
Note over K8s: WIP: K8s operations deferred<br/>(Traefik config, k3d port mapping)
K8s->>K8s: Generate ARN, create in-memory listener
Note over K8s: Skipped: updateTraefikConfigForListener()<br/>Skipped: addK3dPortMapping()
K8s-->>API: Listener created
API-->>HTTP: CreateListenerOutput
HTTP-->>TF: HTTP 200 Response
Note over TF,Kube: Investigation Result:<br/>Timeout persists even after removing K8s ops<br/>Root cause likely in DB operations or hot reload
1 file reviewed, 2 comments
| targetGroupName = tg.Name | ||
| } | ||
| } | ||
| // Note: Target group name extraction skipped for now since Traefik config is deferred |
style: commented-out target group extraction logic still references removed code
The comment says "Target group name extraction skipped for now since Traefik config is deferred" but this creates technical debt. The entire code block was removed, leaving this orphaned comment.
| // Note: Target group name extraction skipped for now since Traefik config is deferred | |
| i.mu.RUnlock() |
Consider removing this comment or clarifying what needs to happen when K8s operations are re-enabled.
Prompt To Fix With AI
This is a comment left during a code review.
Path: controlplane/internal/integrations/elbv2/integration_k8s.go
Line: 332:332
Comment:
**style:** commented-out target group extraction logic still references removed code
The comment says "Target group name extraction skipped for now since Traefik config is deferred" but this creates technical debt. The entire code block was removed, leaving this orphaned comment.
```suggestion
i.mu.RUnlock()
```
Consider removing this comment or clarifying what needs to happen when K8s operations are re-enabled.
How can I resolve this? If you propose a fix, please make it concise.

| // TODO: Update Traefik configuration with new listener | ||
| // For now, skip K8s integration to prevent timeout | ||
| // The listener is created in memory immediately without K8s ingress | ||
| logging.Debug("Listener created in memory, K8s configuration pending", | ||
| "listener_arn", arn, | ||
| "load_balancer", lbName, | ||
| "note", "Traefik/Ingress/k3d port mapping creation deferred to avoid provider timeout") |
style: TODOs defer K8s operations but the timeout likely originates upstream in the call chain
Based on the investigation, removing K8s ops from this function didn't fix the 3m40s timeout, which correctly indicates the bottleneck is elsewhere. The call chain is:
Terraform → HTTP API → elbv2_api_impl.go:654 → this function
Likely culprits to investigate:
- Database operations at elbv2_api_impl.go:602 (GetLoadBalancer) or 643 (CreateListener)
- Hot reload may not have deployed the new code
- Some middleware or context timeout in the HTTP chain
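On the architectural fix floated in the overview (moving all I/O to a background reconciliation loop), a hedged sketch of that direction: record the desired state synchronously, and let a loop apply the K8s side effects asynchronously. `pendingListener`, the queue, and `applyK8s` are illustrative names, not code from the repository.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pendingListener is a placeholder for the record CreateListener would write
// synchronously; the K8s work (Traefik config, k3d port mapping) is applied
// later by the reconciler instead of inside the API call.
type pendingListener struct {
	ARN  string
	Port int32
}

// reconcile drains queued listeners and applies side effects with their own
// deadline, so API latency no longer depends on kubectl/k3d behavior.
func reconcile(ctx context.Context, queue <-chan pendingListener, applyK8s func(context.Context, pendingListener) error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case l := <-queue:
			opCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
			if err := applyK8s(opCtx, l); err != nil {
				fmt.Println("reconcile failed, will retry later:", l.ARN, err)
			}
			cancel()
		case <-ticker.C:
			// A real implementation would also re-scan stored listeners here
			// to retry earlier failures.
		}
	}
}

func main() {
	queue := make(chan pendingListener, 16)
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	go reconcile(ctx, queue, func(ctx context.Context, l pendingListener) error {
		fmt.Println("would configure Traefik/k3d for", l.ARN, "port", l.Port)
		return nil
	})

	// CreateListener would only enqueue and return immediately.
	queue <- pendingListener{ARN: "arn:aws:elasticloadbalancing:example:listener/demo", Port: 80}
	<-ctx.Done()
}
```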
Prompt To Fix With AI
This is a comment left during a code review.
Path: controlplane/internal/integrations/elbv2/integration_k8s.go
Line: 354:360
Comment:
**style:** TODOs defer K8s operations but the timeout likely originates upstream in the call chain
Based on the investigation, removing K8s ops from this function didn't fix the 3m40s timeout, which correctly indicates the bottleneck is elsewhere. The call chain is:
Terraform → HTTP API → `elbv2_api_impl.go:654` → this function
Likely culprits to investigate:
- Database operations at `elbv2_api_impl.go:602` (GetLoadBalancer) or `643` (CreateListener)
- Hot reload may not have deployed the new code
- Some middleware or context timeout in the HTTP chain
How can I resolve this? If you propose a fix, please make it concise.
Code Metrics Report
Details

|                     | #767 (a2793cd) | #768 (0cd5007) | +/-   |
|---------------------|----------------|----------------|-------|
| Coverage            | 12.8%          | 12.7%          | -0.1% |
| Files               | 269            | 269            | 0     |
| Lines               | 42056          | 42040          | -16   |
| Covered             | 5397           | 5373           | -24   |
| Code to Test Ratio  | 1:0.2          | 1:0.2          | +0.0  |
| Code                | 71292          | 71263          | -29   |
| Test                | 16352          | 16352          | 0     |
| Test Execution Time | 3m23s          | 3m19s          | -4s   |

Code coverage of files in pull request scope (27.8% → 23.9%)

Reported by octocov
Summary

Investigation

Problem
- Terraform provider times out after 3m40s when creating the aws_lb_listener.http resource
- "Plugin did not respond" error causes terraform apply to fail

Investigation Steps and Results
- ✅ Identified blocking operations in CreateListener:
  - updateTraefikConfigForListener() - Creates K8s Ingress resource
  - addK3dPortMapping() - Executes k3d CLI command
- ❌ Added timeout context wrapper → FAILED
- ❌ Attempted async goroutine pattern → FAILED
- ❌ Commented out Traefik/Ingress configuration → FAILED
- ❌ Removed ALL K8s operations → STILL FAILS

Changes Made
- Deferred updateTraefikConfigForListener() call (was at line 371)
- Deferred addK3dPortMapping() call (was at line 375)
- Added debug logging explaining that the operations are deferred

Next Steps

Areas Requiring Further Investigation
- API layer: elbv2_api_impl.go may have additional blocking operations
- Storage layer: check for slow operations
- LocalStack response times
- Verify hot reload deployed the new code (old code may still be running)

Architectural Fix Required
- Move all I/O to a background reconciliation loop

Test Results

🤖 Generated with Claude Code