-
Notifications
You must be signed in to change notification settings - Fork 2.2k
[INS-283] Support following symlinks in filesystem source #4742
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[INS-283] Support following symlinks in filesystem source #4742
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| } | ||
| // If symlink resolves to directory scan that directory | ||
| ctx.Logger().V(5).Info("Resolved symlink is a directory", "path", resolvedPath) | ||
| err = s.scanDir(ctx, resolvedPath, chunksChan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursive scanDir creates unbounded nested worker pools
Medium Severity
When a symlink to a directory is found during WalkDir, scanDir is called recursively at line 268 from within the walk callback. Each recursive scanDir creates its own errgroup.Group with SetLimit(s.concurrency). The outer worker pool may still have goroutines running (spawned for files found before the symlink), while the inner pool also spawns up to s.concurrency goroutines. With nested symlinked directories, the effective concurrency becomes depth * s.concurrency, bypassing the user-configured limit and potentially exhausting system resources.
Additional Locations (1)
| if s.maxSymlinkDepth > 0 { | ||
| s.followSymlinks = true | ||
| } | ||
| s.visitedSymlinkPaths = make(map[string]struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Visited symlink state leaks across independent ChunkUnit calls
Medium Severity
visitedSymlinkPaths is initialized once in Init and never cleared. markVisitedPath is called unconditionally at the start of every scanDir, including for non-symlink directories. When using the SourceUnitEnumChunker interface, where ChunkUnit is called per enumerated unit, the accumulated state from one unit's scan causes subsequent units to skip symlinks pointing to already-visited directories. This makes scan results depend on unit processing order and can silently miss content.


Description:
This PR re-enables symlink following in filesystem traversal and introduces configurable controls for symlink behavior, including:
Symlink resolution uses Go’s
os.Readlink, which resolves a single symlink per call. If the number of resolutions exceeds the configured max depth,resolveSymLinkreturns an error.Additionally, directory scanning is now robust against symlink loops (e.g.,
A/linkToB -> BandB/linkToA -> A). We maintain a map (visitedSymlinkPaths) of already-resolved directory symlinks to prevent infinite recursion while keeping traversal predictable and stableChecklist:
make test-community)?make lintthis requires golangci-lint)?