Avoid corrupt trees when submodules are present and a specially named subdirectory is a sibling#14347
Avoid corrupt trees when submodules are present and a specially named subdirectory is a sibling#14347newren wants to merge 3 commits intodependabot:mainfrom
Conversation
_clone_repo_contents has two code paths that skip the find_submodules() call, leaving @submodules as an empty array: 1. The cached-clone early-return path (line 870): if the clone directory already exists from a previous run, the method returns immediately without identifying submodule paths. 2. The submodule clone failure rescue (line 900-909): when --recurse-submodules fails, the main repo is already cloned but find_submodules() was never reached. When @submodules is empty, the in_submodule?() guard returns false for all paths, defeating the support_file protection that prevents files inside submodule directories from being included in commits. Fix both paths to call find_submodules() so that @submodules is always populated when a clone directory exists. The git ls-files --stage command used by find_submodules() works regardless of whether the submodule content was actually checked out. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When listing directory contents from a cloned repo, _cloned_repo_contents() uses Dir.exist?() to identify directories. Since `git clone --recurse-submodules` populates submodule directories, Dir.exist?() returns true for them and they are reported as type "dir". This causes ecosystem file fetchers (e.g., uv, python) that scan subdirectories for requirement files to descend into submodule directories and discover files inside them. When those files are later updated and sent to the GitHub Create Tree API with base_tree, the API can produce a corrupt tree with duplicate entries — e.g., `docs` appearing as both `160000 commit` (submodule) and `040000 tree` (directory). This happens when the base tree contains an entry that sorts between the non-directory entry and its nested counterpart (e.g., `docs-internal` between `docs` and `docs/...`), which defeats an adjacency-based D/F conflict check in git's verify_cache(). Fix this by checking `@submodules` (populated by find_submodules during clone) before falling through to Dir.exist?(). Entries whose paths match known submodules are now reported as type "submodule" rather than "dir", matching the behavior of the GitHub Contents API path (_github_repo_contents). This prevents file fetchers from descending into submodule directories, since they filter on `type == "dir"`. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the GitHub Create Tree API is called with base_tree and file_trees contains an entry whose path traverses a submodule or symlink directory (e.g., 'docs/requirements.txt' where 'docs' is a submodule), the API can produce a corrupt tree with duplicate entries — for example, 'docs' appearing as both '160000 commit' and '040000 tree'. This happens when the base tree contains an intervening entry (e.g., 'docs-internal') that sorts between the non-directory entry and its nested counterpart, defeating an adjacency-based D/F conflict check in git's verify_cache(). Add a filter_conflicting_entries method to both PullRequestCreator::Github and PullRequestUpdater::Github that fetches the base tree before calling create_tree and removes any file_trees entries whose first path component matches a submodule (type 'commit') or symlink (mode '120000') entry in the base tree. Excluded entries are logged as warnings. This is a defense-in-depth measure that catches cases the file fetcher protections miss, such as: - Child requirement file references (e.g., '-r docs/requirements.txt') that resolve through a symlink directory - Any future code path that inadvertently includes files from inside submodule or symlink directories The base tree fetch is wrapped in a rescue so that API errors do not block the main operation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes GitHub tree corruption scenarios (duplicate/conflicting tree entries) when Dependabot attempts to write updates beneath paths that are submodules (or symlinks), aligning with the issue report in #14346.
Changes:
- Add filtering in GitHub PR creation/updating to exclude tree entries that would conflict with submodule/symlink entries in the base tree.
- Enhance cloned repository directory listing to classify submodule paths as
"submodule"and ensure submodule metadata is populated when reusing an existing clone.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| common/lib/dependabot/pull_request_updater/github.rb | Filters conflicting create_tree entries against base-tree submodules/symlinks before updating PR branches. |
| common/lib/dependabot/pull_request_creator/github.rb | Filters conflicting create_tree entries against base-tree submodules/symlinks before creating PR commits. |
| common/lib/dependabot/file_fetchers/base.rb | Detects submodule paths in cloned repos and marks directory entries as "submodule" to avoid treating them as normal directories. |
Comments suppressed due to low confidence (6)
common/lib/dependabot/pull_request_creator/github.rb:369
- This logic silently drops updates under a submodule/symlink path and proceeds to create a commit anyway. That can result in a PR/branch update that claims success but is missing intended file changes (or produces a no-op commit with the same tree). Instead of only logging, consider failing fast with a dedicated error when any entries are excluded, so Dependabot surfaces a clear failure rather than generating an incomplete update.
file_trees.reject do |entry|
path = T.must(entry[:path])
first_component = T.must(path.split("/").first)
if path.include?("/") && special_paths.include?(first_component)
Dependabot.logger.warn(
"Excluding #{path} from create_tree — " \
"conflicts with submodule/symlink '#{first_component}' in base tree"
)
true
common/lib/dependabot/pull_request_creator/github.rb:356
- This change introduces new GitHub API calls (
git_commit+tree) and new filtering behavior, but the existing PR creator specs stub onlyPOST /git/treesand will start failing due to unexpected requests. Please update/add tests to stub the additional endpoints and to assert the conflict-filtering behavior (including the warning/error path).
def filter_conflicting_entries(file_trees)
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
base_tree.tree.each do |entry|
# Submodules have type "commit"; symlinks have mode "120000"
special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"
common/lib/dependabot/pull_request_creator/github.rb:353
filter_conflicting_entriesis duplicated (nearly identically) in both PullRequestCreator::Github and PullRequestUpdater::Github. To avoid the two implementations drifting (especially around edge cases like nested paths and error handling), consider extracting this into a shared helper/module used by both classes.
sig do
params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
.returns(T::Array[T::Hash[Symbol, T.untyped]])
end
def filter_conflicting_entries(file_trees)
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
common/lib/dependabot/pull_request_updater/github.rb:236
- New filtering introduces extra GitHub API calls and behavior changes, but updater specs currently stub only a subset of requests. Please update/add tests to stub
git_commit/treecalls forbase_commitand to validate the filtering behavior so regressions don’t slip in.
sig do
params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
.returns(T::Array[T::Hash[Symbol, T.untyped]])
end
def filter_conflicting_entries(file_trees)
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
base_tree.tree.each do |entry|
# Submodules have type "commit"; symlinks have mode "120000"
special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"
common/lib/dependabot/pull_request_updater/github.rb:249
- As in the PR creator, this method silently drops conflicting entries and continues, which can yield a branch update that 'succeeds' but doesn’t include all intended changes (or produces a no-op commit). Consider raising a specific error when entries are excluded so the update fails loudly and predictably.
file_trees.reject do |entry|
path = T.must(entry[:path])
first_component = T.must(path.split("/").first)
if path.include?("/") && special_paths.include?(first_component)
Dependabot.logger.warn(
"Excluding #{path} from create_tree — " \
"conflicts with submodule/symlink '#{first_component}' in base tree"
)
true
common/lib/dependabot/pull_request_updater/github.rb:233
filter_conflicting_entriesis duplicated in both the updater and creator GitHub implementations. Extracting it to a shared helper would reduce maintenance overhead and help ensure fixes (e.g., nested path handling) are applied consistently.
sig do
params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
.returns(T::Array[T::Hash[Symbol, T.untyped]])
end
def filter_conflicting_entries(file_trees)
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
| commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit) | ||
| base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha) | ||
| special_paths = T.let(Set.new, T::Set[String]) | ||
| base_tree.tree.each do |entry| | ||
| # Submodules have type "commit"; symlinks have mode "120000" | ||
| special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000" |
There was a problem hiding this comment.
filter_conflicting_entries only checks the first path component against special_paths, but special_paths is populated from a non-recursive root tree (tree(...) without recursive). This means conflicts with nested submodules/symlinks (e.g., a submodule at a/b) will not be detected, and updates under those paths can still produce invalid trees. Consider either fetching the tree recursively (and collecting full paths), or checking all path prefixes against special_paths derived by walking the tree hierarchy.
This issue also appears in the following locations of the same file:
- line 361
- line 350
- line 346
| commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit) | ||
| base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha) | ||
| special_paths = T.let(Set.new, T::Set[String]) | ||
| base_tree.tree.each do |entry| | ||
| # Submodules have type "commit"; symlinks have mode "120000" | ||
| special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000" |
There was a problem hiding this comment.
Same issue as in the PR creator: special_paths is built from a non-recursive root tree and the filter only matches the first path component, so nested submodules/symlinks won't be detected and updates under them can still generate conflicting entries. Consider using a recursive tree fetch or prefix checking across the full path.
This issue also appears in the following locations of the same file:
- line 226
- line 241
- line 226
What are you trying to accomplish?
Trying to fix #14346.
When a project has a submodule, updates to any manifest files within that submodule need to be made within the submodule, not as direct file entries within the outer project.
Sadly, the attempts to update the directory entries directly in the outer project were also resulting in corrupt trees (with duplicate entries), which fail fsck checks and trips assertion failures in git merge logic.
Anything you want to highlight for special attention from reviewers?
I am not familiar with dependabot-core code, and I don't actually know much Ruby. I used GitHub Copilot to come up with these patches and help me trace the problem from assertion failures in git-merge-tree, to dependabot, to this repo, to these fixes, to the GitHub API, to the backend code that API calls, to where that backend fails to reject these bad requests from dependabot, and have patches for multiple projects from this. Copilot didn't quite get the patches right for the other projects, so they may not be perfect here either, but the patches were at least helpful at highlighting the problem area for the other projects, so I hope they are similarly helpful here. The individual commit messages should at least help describe the background for each change.
How will you know you've accomplished your goal?
My main goal is fixing the assertion failures in git merging, and making the backend reject the mal-formed tree changes that dependabot is requesting, but those are being tracked in other projects. I'm sending these in so you at least have a heads up that some of your requests will start failing, and perhaps give you a heads up on the patches needed to fix them.
Checklist