
Avoid corrupt trees when submodules are present and a specially named subdirectory is a sibling#14347

Open
newren wants to merge 3 commits into dependabot:main from
newren:fix/prevent-corrupt-trees-from-submodules

Conversation

newren commented Mar 3, 2026

What are you trying to accomplish?

Trying to fix #14346.

When a project has a submodule, updates to any manifest files within that submodule need to be made within the submodule, not as direct file entries within the outer project.

Unfortunately, the attempts to update these files as direct entries in the outer project were also producing corrupt trees (with duplicate entries), which fail fsck checks and trip assertion failures in git's merge logic.

Anything you want to highlight for special attention from reviewers?

I am not familiar with the dependabot-core code, and I don't actually know much Ruby. I used GitHub Copilot to come up with these patches and to help me trace the problem: from assertion failures in git-merge-tree, to dependabot, to this repo and these fixes, to the GitHub API, to the backend code that API calls, and to where that backend fails to reject these bad requests from dependabot. I have patches for multiple projects as a result. Copilot didn't quite get the patches right for the other projects, so they may not be perfect here either, but they were at least helpful in highlighting the problem areas elsewhere, and I hope they are similarly helpful here. The individual commit messages should at least describe the background for each change.

How will you know you've accomplished your goal?

My main goal is fixing the assertion failures in git merging and making the backend reject the malformed tree changes that dependabot is requesting, but those fixes are being tracked in other projects. I'm sending these patches so you at least have advance warning that some of your requests will start failing, and perhaps a head start on the patches needed to fix them.

Checklist

  • I have run the complete test suite to ensure all tests and linters pass.
  • I have thoroughly tested my code changes to ensure they work as expected, including adding additional tests for new functionality.
  • I have written clear and descriptive commit messages.
  • I have provided a detailed description of the changes in the pull request, including the problem it addresses, how it fixes the problem, and any relevant details about the implementation.
  • I have ensured that the code is well-documented and easy to understand.

newren and others added 3 commits February 21, 2026 09:53
_clone_repo_contents has two code paths that skip the
find_submodules() call, leaving @submodules as an empty array:

1. The cached-clone early-return path (line 870): if the clone
   directory already exists from a previous run, the method returns
   immediately without identifying submodule paths.

2. The submodule clone failure rescue (lines 900-909): when
   --recurse-submodules fails, the main repo is already cloned but
   find_submodules() was never reached.

When @submodules is empty, the in_submodule?() guard returns false
for all paths, defeating the support_file protection that prevents
files inside submodule directories from being included in commits.

Fix both paths to call find_submodules() so that @submodules is
always populated when a clone directory exists. The git ls-files
--stage command used by find_submodules() works regardless of
whether the submodule content was actually checked out.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
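The key observation above is that gitlink entries are visible in the index even when the submodule content was never checked out. A minimal sketch of that idea (the helper name and parsing here are illustrative, not the actual dependabot-core code) over `git ls-files --stage` output:

```ruby
# `git ls-files --stage` prints "<mode> <object> <stage>\t<path>" per entry.
# Submodules (gitlinks) always appear with mode 160000, checked out or not,
# so submodule paths can be recovered from any clone directory.
def submodule_paths(ls_files_stage_output)
  ls_files_stage_output.each_line.filter_map do |line|
    mode, rest = line.split(" ", 2)
    next unless mode == "160000"

    rest.split("\t", 2).last&.chomp
  end
end
```

With paths recovered this way, an `in_submodule?` style guard has something to match against even on the cached-clone and rescue paths.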
When listing directory contents from a cloned repo,
_cloned_repo_contents() uses Dir.exist?() to identify directories.
Since `git clone --recurse-submodules` populates submodule
directories, Dir.exist?() returns true for them and they are
reported as type "dir".

This causes ecosystem file fetchers (e.g., uv, python) that scan
subdirectories for requirement files to descend into submodule
directories and discover files inside them. When those files are
later updated and sent to the GitHub Create Tree API with
base_tree, the API can produce a corrupt tree with duplicate
entries — e.g., `docs` appearing as both `160000 commit`
(submodule) and `040000 tree` (directory). This happens when the
base tree contains an entry that sorts between the non-directory
entry and its nested counterpart (e.g., `docs-internal` between
`docs` and `docs/...`), which defeats an adjacency-based D/F
conflict check in git's verify_cache().

Fix this by checking `@submodules` (populated by find_submodules
during clone) before falling through to Dir.exist?(). Entries
whose paths match known submodules are now reported as type
"submodule" rather than "dir", matching the behavior of the
GitHub Contents API path (_github_repo_contents). This prevents
file fetchers from descending into submodule directories, since
they filter on `type == "dir"`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
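The sorting wrinkle described above is worth seeing concretely. In a git tree, entries are ordered by name, but tree (directory) entries compare as if their name had a trailing "/", while a submodule (gitlink) entry compares without it. This illustrative sketch (not dependabot-core code) shows how an intervening sibling separates the two conflicting `docs` entries:

```ruby
# Git compares tree entries by name, appending "/" for directory entries.
def tree_sort_key(name, type)
  type == :tree ? "#{name}/" : name
end

entries = [
  { name: "docs", type: :tree },          # conflicting 040000 tree entry
  { name: "docs", type: :commit },        # 160000 submodule entry
  { name: "docs-internal", type: :tree }  # sibling that sorts between them
]

sorted = entries.sort_by { |e| tree_sort_key(e[:name], e[:type]) }
# "docs" < "docs-internal/" < "docs/" in this ordering, so the two "docs"
# entries are not adjacent and an adjacency-only duplicate check misses them.
```

This is exactly why verify_cache()'s adjacency-based D/F conflict check fails to flag the duplicate.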
When the GitHub Create Tree API is called with base_tree and
file_trees contains an entry whose path traverses a submodule or
symlink directory (e.g., 'docs/requirements.txt' where 'docs' is a
submodule), the API can produce a corrupt tree with duplicate
entries — for example, 'docs' appearing as both '160000 commit'
and '040000 tree'. This happens when the base tree contains an
intervening entry (e.g., 'docs-internal') that sorts between the
non-directory entry and its nested counterpart, defeating an
adjacency-based D/F conflict check in git's verify_cache().

Add a filter_conflicting_entries method to both
PullRequestCreator::Github and PullRequestUpdater::Github that
fetches the base tree before calling create_tree and removes any
file_trees entries whose first path component matches a submodule
(type 'commit') or symlink (mode '120000') entry in the base tree.
Excluded entries are logged as warnings.

This is a defense-in-depth measure that catches cases the file
fetcher protections miss, such as:

 - Child requirement file references (e.g., '-r docs/requirements.txt')
   that resolve through a symlink directory
 - Any future code path that inadvertently includes files from inside
   submodule or symlink directories

The base tree fetch is wrapped in a rescue so that API errors do not
block the main operation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@newren newren requested a review from a team as a code owner March 3, 2026 07:43
Copilot AI review requested due to automatic review settings March 3, 2026 07:43
Contributor

Copilot AI left a comment


Pull request overview

Fixes GitHub tree corruption scenarios (duplicate/conflicting tree entries) when Dependabot attempts to write updates beneath paths that are submodules (or symlinks), aligning with the issue report in #14346.

Changes:

  • Add filtering in GitHub PR creation/updating to exclude tree entries that would conflict with submodule/symlink entries in the base tree.
  • Enhance cloned repository directory listing to classify submodule paths as "submodule" and ensure submodule metadata is populated when reusing an existing clone.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files changed:

  • common/lib/dependabot/pull_request_updater/github.rb: Filters conflicting create_tree entries against base-tree submodules/symlinks before updating PR branches.
  • common/lib/dependabot/pull_request_creator/github.rb: Filters conflicting create_tree entries against base-tree submodules/symlinks before creating PR commits.
  • common/lib/dependabot/file_fetchers/base.rb: Detects submodule paths in cloned repos and marks directory entries as "submodule" to avoid treating them as normal directories.
Comments suppressed due to low confidence (6)

common/lib/dependabot/pull_request_creator/github.rb:369

  • This logic silently drops updates under a submodule/symlink path and proceeds to create a commit anyway. That can result in a PR/branch update that claims success but is missing intended file changes (or produces a no-op commit with the same tree). Instead of only logging, consider failing fast with a dedicated error when any entries are excluded, so Dependabot surfaces a clear failure rather than generating an incomplete update.
        file_trees.reject do |entry|
          path = T.must(entry[:path])
          first_component = T.must(path.split("/").first)
          if path.include?("/") && special_paths.include?(first_component)
            Dependabot.logger.warn(
              "Excluding #{path} from create_tree — " \
              "conflicts with submodule/symlink '#{first_component}' in base tree"
            )
            true

common/lib/dependabot/pull_request_creator/github.rb:356

  • This change introduces new GitHub API calls (git_commit + tree) and new filtering behavior, but the existing PR creator specs stub only POST /git/trees and will start failing due to unexpected requests. Please update/add tests to stub the additional endpoints and to assert the conflict-filtering behavior (including the warning/error path).
      def filter_conflicting_entries(file_trees)
        commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
        base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
        special_paths = T.let(Set.new, T::Set[String])
        base_tree.tree.each do |entry|
          # Submodules have type "commit"; symlinks have mode "120000"
          special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"

common/lib/dependabot/pull_request_creator/github.rb:353

  • filter_conflicting_entries is duplicated (nearly identically) in both PullRequestCreator::Github and PullRequestUpdater::Github. To avoid the two implementations drifting (especially around edge cases like nested paths and error handling), consider extracting this into a shared helper/module used by both classes.
      sig do
        params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
          .returns(T::Array[T::Hash[Symbol, T.untyped]])
      end
      def filter_conflicting_entries(file_trees)
        commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
        base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
        special_paths = T.let(Set.new, T::Set[String])

common/lib/dependabot/pull_request_updater/github.rb:236

  • New filtering introduces extra GitHub API calls and behavior changes, but updater specs currently stub only a subset of requests. Please update/add tests to stub git_commit/tree calls for base_commit and to validate the filtering behavior so regressions don’t slip in.
      sig do
        params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
          .returns(T::Array[T::Hash[Symbol, T.untyped]])
      end
      def filter_conflicting_entries(file_trees)
        commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
        base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
        special_paths = T.let(Set.new, T::Set[String])
        base_tree.tree.each do |entry|
          # Submodules have type "commit"; symlinks have mode "120000"
          special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"

common/lib/dependabot/pull_request_updater/github.rb:249

  • As in the PR creator, this method silently drops conflicting entries and continues, which can yield a branch update that 'succeeds' but doesn’t include all intended changes (or produces a no-op commit). Consider raising a specific error when entries are excluded so the update fails loudly and predictably.
        file_trees.reject do |entry|
          path = T.must(entry[:path])
          first_component = T.must(path.split("/").first)
          if path.include?("/") && special_paths.include?(first_component)
            Dependabot.logger.warn(
              "Excluding #{path} from create_tree — " \
              "conflicts with submodule/symlink '#{first_component}' in base tree"
            )
            true

common/lib/dependabot/pull_request_updater/github.rb:233

  • filter_conflicting_entries is duplicated in both the updater and creator GitHub implementations. Extracting it to a shared helper would reduce maintenance overhead and help ensure fixes (e.g., nested path handling) are applied consistently.
      sig do
        params(file_trees: T::Array[T::Hash[Symbol, T.untyped]])
          .returns(T::Array[T::Hash[Symbol, T.untyped]])
      end
      def filter_conflicting_entries(file_trees)
        commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
        base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
        special_paths = T.let(Set.new, T::Set[String])

Comment on lines +351 to +356
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
base_tree.tree.each do |entry|
# Submodules have type "commit"; symlinks have mode "120000"
special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"

Copilot AI Mar 3, 2026


filter_conflicting_entries only checks the first path component against special_paths, but special_paths is populated from a non-recursive root tree (tree(...) without recursive). This means conflicts with nested submodules/symlinks (e.g., a submodule at a/b) will not be detected, and updates under those paths can still produce invalid trees. Consider either fetching the tree recursively (and collecting full paths), or checking all path prefixes against special_paths derived by walking the tree hierarchy.

This issue also appears in the following locations of the same file:

  • line 361
  • line 350
  • line 346

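The fix this comment suggests can be sketched as a prefix check, assuming `special_paths` holds the full paths of submodule/symlink entries gathered from a recursive tree fetch. Names here are illustrative, not the dependabot-core API:

```ruby
require "set"

# Returns true when any proper prefix of `path` ("a", then "a/b", ...) is a
# known submodule or symlink path, so a submodule nested at "a/b" is caught,
# not just one at the repository root.
def conflicts_with_special_path?(path, special_paths)
  components = path.split("/")
  (1...components.length).any? do |depth|
    special_paths.include?(components.first(depth).join("/"))
  end
end
```

Note that the entry itself (e.g., updating the `docs` gitlink pointer) is not rejected; only paths that traverse into it are.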
Comment on lines +231 to +236
commit = T.unsafe(github_client_for_source).git_commit(source.repo, base_commit)
base_tree = T.unsafe(github_client_for_source).tree(source.repo, commit.tree.sha)
special_paths = T.let(Set.new, T::Set[String])
base_tree.tree.each do |entry|
# Submodules have type "commit"; symlinks have mode "120000"
special_paths.add(entry.path) if entry.type == "commit" || entry.mode == "120000"

Copilot AI Mar 3, 2026


Same issue as in the PR creator: special_paths is built from a non-recursive root tree and the filter only matches the first path component, so nested submodules/symlinks won't be detected and updates under them can still generate conflicting entries. Consider using a recursive tree fetch or prefix checking across the full path.

This issue also appears in the following locations of the same file:

  • line 226
  • line 241
  • line 226



Development

Successfully merging this pull request may close these issues.

dependabot creates corrupt trees (duplicate entries) which fail fsck checks & trigger merge assertions in git

2 participants