WIP: Defer K8s operations in CreateListener to investigate timeout #768

Open

stormcat24 wants to merge 1 commit into main from fix/defer-k8s-operations-in-listener-creation

Conversation

@stormcat24 stormcat24 (Member) commented Oct 16, 2025

Summary

  • Investigated Terraform provider timeout (3m40s) during aws_lb_listener.http creation
  • Identified and removed blocking operations in CreateListener function
  • Critical finding: The issue persists even after removing ALL K8s operations, proving the root cause is NOT in CreateListener itself

Investigation

Problem

  • Terraform apply hangs for 3m40s when creating aws_lb_listener.http resource
  • Fails with "Plugin did not respond" error

Investigation Steps and Results

  1. ✅ Identified blocking operations in CreateListener:

    • updateTraefikConfigForListener() - Creates K8s Ingress resource
    • addK3dPortMapping() - Executes k3d CLI command
  2. ❌ Added timeout context wrapper → FAILED

    • Set 30-second timeout, but still timed out after 3m+
  3. ❌ Attempted async goroutine pattern → FAILED

    • Used background context with async execution, same hang occurred (attempts 2 and 3 are sketched after this list)
  4. ❌ Commented out Traefik/Ingress configuration → FAILED

    • Revealed k3d port mapping was also problematic
  5. ❌ Removed ALL K8s operations → STILL FAILS

    • Removed all K8s operations from CreateListener
    • 3m40s hang persists
    • Critical finding: Root cause is NOT in CreateListener function
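
The PR text describes attempts 2 and 3 but does not show their code, so the following is a hedged reconstruction of those two patterns. The helper names match the PR; their signatures, the stand-in bodies, and the surrounding function are assumptions for illustration only.

```go
// Hedged reconstruction of investigation attempts 2 and 3; the two helper
// signatures are assumptions, since the PR does not show the original code.
package sketch

import (
	"context"
	"fmt"
	"log"
	"time"
)

// Stand-in for updateTraefikConfigForListener: simulates a call that
// blocks far longer than the Terraform provider is willing to wait.
func updateTraefikConfigForListener(ctx context.Context, lbName string, port int32) error {
	select {
	case <-time.After(10 * time.Minute): // pretend the K8s API never answers
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Stand-in for addK3dPortMapping: would shell out to the k3d CLI.
func addK3dPortMapping(ctx context.Context, port int32) error { return nil }

func createListener(ctx context.Context, lbName string, port int32) error {
	// Attempt 2: wrap the K8s call in a 30-second timeout context.
	// In the investigation the request still hung for 3m+, an early hint
	// that the wait was happening somewhere other than this call.
	timeoutCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	if err := updateTraefikConfigForListener(timeoutCtx, lbName, port); err != nil {
		return fmt.Errorf("update Traefik config: %w", err)
	}

	// Attempt 3: fire-and-forget goroutine with a background context so the
	// handler returns immediately. The hang persisted regardless.
	go func() {
		if err := addK3dPortMapping(context.Background(), port); err != nil {
			log.Printf("deferred k3d port mapping failed: %v", err)
		}
	}()
	return nil
}
```

Neither pattern can help when the hang occurs upstream of CreateListener, which is consistent with the finding in step 5.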

Changes Made

  • Removed updateTraefikConfigForListener() call (was at line 371)
  • Removed addK3dPortMapping() call (was at line 375)
  • Added debug logging explaining operations are deferred
  • Listener is now created in memory only without K8s integration

Next Steps

Areas Requiring Further Investigation

  1. Hot reload verification: Old code may still be running despite rebuild
  2. API layer investigation: elbv2_api_impl.go may have additional blocking operations
  3. Storage layer investigation: DuckDB operations may be slow
  4. LocalStack response time: Mock AWS service may have slow responses

Architectural Fix Required

  • Terraform provider expects synchronous API responses
  • Performing I/O operations (K8s, storage, external commands) in request path causes timeouts
  • Proper fix: Move all I/O operations to background reconciliation loop (sketched below)
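
A minimal sketch of that reconciliation pattern, assuming invented names and types (this is not the repo's actual API): the request path only records desired state and returns, while a background loop performs the slow K8s/k3d work and retries failures.

```go
// Minimal sketch of the proposed background reconciliation loop; all
// names and types here are assumptions, not this repo's actual API.
package reconcile

import (
	"context"
	"log"
	"time"
)

// PendingListener is the desired state recorded synchronously by the
// CreateListener request path.
type PendingListener struct {
	ARN  string
	Port int32
}

type Reconciler struct {
	queue chan PendingListener
}

func NewReconciler() *Reconciler {
	return &Reconciler{queue: make(chan PendingListener, 64)}
}

// Enqueue is called from the API handler; it records intent and returns
// immediately, so the Terraform provider gets a fast synchronous response.
func (r *Reconciler) Enqueue(l PendingListener) {
	r.queue <- l
}

// Run performs the slow I/O (K8s Ingress, k3d port mapping) off the
// request path, retrying failed items on the next tick.
func (r *Reconciler) Run(ctx context.Context, apply func(context.Context, PendingListener) error) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	var pending []PendingListener
	for {
		select {
		case <-ctx.Done():
			return
		case l := <-r.queue:
			pending = append(pending, l)
		case <-ticker.C:
			var retry []PendingListener
			for _, l := range pending {
				if err := apply(ctx, l); err != nil {
					log.Printf("reconcile %s: %v (will retry)", l.ARN, err)
					retry = append(retry, l)
				}
			}
			pending = retry
		}
	}
}
```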

Test Results

  • ✅ All unit tests pass
  • ✅ Code compiles successfully
  • ❌ Terraform apply still hangs at 3m40s

🤖 Generated with Claude Code

Problem:
- Terraform provider times out after 3m40s during aws_lb_listener.http creation
- "Plugin did not respond" error causes terraform apply to fail

Investigation performed:
1. Identified blocking operations in CreateListener (integration_k8s.go):
   - updateTraefikConfigForListener() - creates K8s Ingress resource
   - addK3dPortMapping() - executes k3d CLI command
2. Attempted timeout context wrapper - FAILED
3. Attempted async goroutine pattern - FAILED
4. Commented out Traefik/Ingress configuration - FAILED
5. Removed ALL K8s operations including k3d port mapping - STILL FAILS

Current Status:
- Despite removing all K8s operations from CreateListener, the hang persists
- This definitively proves the root cause is NOT in CreateListener itself
- Listener is now created in memory only without K8s integration

Changes Made:
- Deferred updateTraefikConfigForListener() call (was line 371)
- Deferred addK3dPortMapping() call (was line 375)
- Added debug logging explaining operations are deferred

Next Steps Needed:
1. Verify hot reload worked correctly (old code may still be running)
2. Investigate API layer (elbv2_api_impl.go) for blocking operations
3. Investigate storage layer for slow operations
4. Check LocalStack response times
5. Consider architectural fix: move all I/O to background reconciliation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions

Greptile Review Summary
- Critical Issues: 0
- Logic Issues: 0
- Style Suggestions: 0

✅ No blocking issues found

See Greptile Workflow Guide for handling reviews.

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Summary

This WIP PR investigates a 3m40s timeout issue when Terraform creates aws_lb_listener.http resources. The investigation systematically removed potentially blocking operations from CreateListener:

  • Removed updateTraefikConfigForListener() (K8s Ingress creation)
  • Removed target group name extraction logic
  • Removed addK3dPortMapping() (k3d CLI execution)

Key Finding: The timeout persists even after removing all K8s operations, proving the root cause is elsewhere in the call chain.

Likely culprits identified:

  • Database operations in elbv2_api_impl.go (lines 602, 643) - GetLoadBalancer or CreateListener DuckDB queries
  • Hot reload may not have deployed new code (verify pod restart timestamps)
  • HTTP middleware or context timeout configuration
  • Terraform provider timeout settings

Next steps for investigation:

  1. Add timing logs around DuckDB operations in elbv2_api_impl.go (see the sketch after this list)
  2. Verify hot reload actually restarted pods with new code
  3. Check database locks or slow transactions
  4. Consider moving all I/O to background reconciliation loop (architectural fix)
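
For step 1, a possible shape for that timing instrumentation — a hedged sketch in which the generic helper, logger, and call names are placeholders, not the repo's code (the repo's own logging.Debug, seen in the diff, could be used instead of slog):

```go
// Hypothetical timing wrapper for the suspect storage calls in
// elbv2_api_impl.go (lines 602 and 643); everything here is a placeholder.
package instrument

import (
	"context"
	"log/slog"
	"time"
)

// timed runs a storage call and logs its duration, so a multi-minute
// hang shows up in the logs next to a named operation.
func timed[T any](ctx context.Context, name string, call func(context.Context) (T, error)) (T, error) {
	start := time.Now()
	result, err := call(ctx)
	slog.DebugContext(ctx, "storage call finished",
		"call", name,
		"elapsed", time.Since(start).String(),
		"error", err)
	return result, err
}
```

A call site would then wrap each suspect query, e.g. `timed(ctx, "GetLoadBalancer", ...)`, so a 3m40s wait is attributed to a named call rather than to the request as a whole.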

Confidence Score: 3/5

  • This WIP PR is safe to merge as an investigation branch but does not resolve the timeout issue
  • Score of 3 reflects that this is investigative work that correctly identifies the K8s operations are not the bottleneck. The changes themselves are safe (deferring operations), but the PR doesn't solve the original problem. The timeout persists, indicating the root cause is upstream in the database layer or deployment process. All tests pass, no critical issues introduced, but production functionality is degraded (listeners created without K8s integration).
  • Focus investigation on controlplane/internal/controlplane/api/elbv2_api_impl.go (database operations at lines 602 and 643) and verify hot reload deployed new code

Important Files Changed

File Analysis

| Filename | Score | Overview |
|----------|-------|----------|
| controlplane/internal/integrations/elbv2/integration_k8s.go | 3/5 | Deferred K8s operations in CreateListener to investigate timeout, but root cause appears to be elsewhere in call chain (likely DB operations or hot reload issue) |

Sequence Diagram

```mermaid
sequenceDiagram
    participant TF as Terraform Provider
    participant HTTP as HTTP API (routing.go)
    participant API as ELBv2APIImpl
    participant DB as DuckDB Storage
    participant K8s as K8sIntegration
    participant Kube as Kubernetes API

    TF->>HTTP: POST /CreateListener
    Note over TF,HTTP: 3m40s timeout occurs here
    HTTP->>API: CreateListener(ctx, input)

    API->>DB: GetLoadBalancer(ctx, arn)
    Note over API,DB: Potential bottleneck #1<br/>Database read operation
    DB-->>API: LoadBalancer data

    API->>DB: CreateListener(ctx, dbListener)
    Note over API,DB: Potential bottleneck #2<br/>Database write operation
    DB-->>API: Success

    API->>K8s: CreateListener(ctx, lbArn, port, protocol, tgArn)
    Note over K8s: WIP: K8s operations deferred<br/>(Traefik config, k3d port mapping)
    K8s->>K8s: Generate ARN, create in-memory listener
    Note over K8s: Skipped: updateTraefikConfigForListener()<br/>Skipped: addK3dPortMapping()
    K8s-->>API: Listener created

    API-->>HTTP: CreateListenerOutput
    HTTP-->>TF: HTTP 200 Response

    Note over TF,Kube: Investigation Result:<br/>Timeout persists even after removing K8s ops<br/>Root cause likely in DB operations or hot reload
```

1 file reviewed, 2 comments


Comment on controlplane/internal/integrations/elbv2/integration_k8s.go, line 332:

```go
targetGroupName = tg.Name
}
}
// Note: Target group name extraction skipped for now since Traefik config is deferred
```

style: commented-out target group extraction logic still references removed code

The comment says "Target group name extraction skipped for now since Traefik config is deferred" but this creates technical debt. The entire code block was removed, leaving this orphaned comment.

Suggested change

```diff
- // Note: Target group name extraction skipped for now since Traefik config is deferred
+ i.mu.RUnlock()
```

Consider removing this comment or clarifying what needs to happen when K8s operations are re-enabled.


Comment on controlplane/internal/integrations/elbv2/integration_k8s.go, lines +354 to +360:
```go
// TODO: Update Traefik configuration with new listener
// For now, skip K8s integration to prevent timeout
// The listener is created in memory immediately without K8s ingress
logging.Debug("Listener created in memory, K8s configuration pending",
	"listener_arn", arn,
	"load_balancer", lbName,
	"note", "Traefik/Ingress/k3d port mapping creation deferred to avoid provider timeout")
```

style: TODOs defer K8s operations but the timeout likely originates upstream in the call chain

Based on the investigation, removing K8s ops from this function didn't fix the 3m40s timeout, which correctly indicates the bottleneck is elsewhere. The call chain is:

Terraform → HTTP API → elbv2_api_impl.go:654 → this function

Likely culprits to investigate:

  • Database operations at elbv2_api_impl.go:602 (GetLoadBalancer) or 643 (CreateListener)
  • Hot reload may not have deployed the new code
  • Some middleware or context timeout in the HTTP chain

@github-actions

Code Metrics Report

|                     | #767 (a2793cd) | #768 (0cd5007) | +/-   |
|---------------------|----------------|----------------|-------|
| Coverage            | 12.8%          | 12.7%          | -0.1% |
| Code to Test Ratio  | 1:0.2          | 1:0.2          | +0.0  |
| Test Execution Time | 3m23s          | 3m19s          | -4s   |
Details

```diff
  |                     | #767 (a2793cd) | #768 (0cd5007) |  +/-  |
  |---------------------|----------------|----------------|-------|
- | Coverage            |          12.8% |          12.7% | -0.1% |
  |   Files             |            269 |            269 |     0 |
  |   Lines             |          42056 |          42040 |   -16 |
- |   Covered           |           5397 |           5373 |   -24 |
+ | Code to Test Ratio  |          1:0.2 |          1:0.2 |  +0.0 |
  |   Code              |          71292 |          71263 |   -29 |
  |   Test              |          16352 |          16352 |     0 |
+ | Test Execution Time |          3m23s |          3m19s |   -4s |
```

Code coverage of files in pull request scope (27.8% → 23.9%)

| Files | Coverage | +/- | Status |
|-------|----------|-----|--------|
| controlplane/internal/integrations/elbv2/integration_k8s.go | 23.9% | -4.0% | modified |

Reported by octocov
