
Conversation

@adit-chandra
Contributor

@adit-chandra adit-chandra commented Aug 13, 2025

  • we've run into a couple of instances where concurrent pushes to our s3 remote result in two bundles being written to the ref. This is the expected and documented behavior, but it still bricks the remote ref and prevents us from pushing until one of the two bundles is removed.
  • proposal: add a per-ref lock built around an s3 lock object that the client must acquire before it can push its bundle to the remote.
    • use S3 conditional writes to resolve lock acquisition races
    • add a TTL (default 60s) for lock objects
    • add support for stale lock inspection to the git-remote-s3 doctor cli
  • added a unit test to verify the behavior; updated and new unit tests pass.
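For context, a rough sketch of the push-under-lock flow described in this proposal. acquire_lock, release_lock, and get_bundles_for_ref are names that appear in the diff; push_with_lock, push_bundle, and the exact messages are hypothetical placeholders used here for illustration, not the PR's actual code:

def push_with_lock(self, remote_ref, bundle_path):
    lock_key = None
    try:
        # Acquire the per-ref lock object; only one concurrent pusher succeeds.
        lock_key = self.acquire_lock(remote_ref)
        if lock_key is None:
            return f"error {remote_ref} lock held by another client, retry later\n"

        # Refuse to push if the ref is already in the multi-bundle state.
        if len(self.get_bundles_for_ref(remote_ref)) > 1:
            return f"error {remote_ref} multiple bundles exist on server\n"

        # Upload the new bundle and report success to git.
        self.push_bundle(remote_ref, bundle_path)
        return f"ok {remote_ref}\n"
    finally:
        if lock_key:
            self.release_lock(remote_ref, lock_key)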

@adit-chandra
Contributor Author

Hey @massi-ang! wanted to check if you had a chance to skim these changes or had any opinions on the proposal

please let me know if there's any additional context or detail i can provide -- goes without saying, but we really appreciate you taking the time to maintain the library!

Contributor

@massi-ang massi-ang left a comment

Thanks for submitting this PR and for the effort put into creating it.
The original design idea was to keep things as simple as possible and have as few failure scenarios as possible. Introducing a lock mechanism adds an additional failure scenario in case any of the locks fails to be deleted. As with the multiple-bundle scenario, this should be detected and you should provide an easy way to fix it.

I like your clever implementation, but I wonder if using conditional writes wouldn't be simpler. By using a conditional write with If-None-Match, you can have each client write a LOCK object under the /remote_ref/ prefix, but only one would succeed. It only requires updating the acquire_lock and release_lock functions. It might also have the advantage of reducing the failure scenarios, since only one additional object is written instead of the at least two in your implementation, hence one less object that might fail to be written or deleted.
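For illustration, a minimal sketch of what that conditional-write approach could look like with boto3, assuming self.s3 is a boto3 S3 client, self.bucket/self.prefix locate the remote, and one LOCK object per ref. The If-None-Match precondition and the 412 PreconditionFailed error code are standard S3 behavior; everything else here is a hypothetical sketch rather than the PR's actual code:

import botocore.exceptions

LOCK_NAME = "LOCK"

def acquire_lock(self, remote_ref):
    # Try to create <prefix>/<remote_ref>/LOCK; IfNoneMatch="*" makes the PUT
    # fail with 412 PreconditionFailed if the object already exists, so only
    # one client can win the race.
    key = f"{self.prefix}/{remote_ref}/{LOCK_NAME}"
    try:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=b"", IfNoneMatch="*")
        return key
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return None  # another client currently holds the lock
        raise

def release_lock(self, remote_ref, lock_key):
    # Deleting the LOCK object frees the ref for the next push.
    self.s3.delete_object(Bucket=self.bucket, Key=lock_key)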

# Otherwise, we proceed with pushing the new bundle
current_contents = self.get_bundles_for_ref(remote_ref)
if len(current_contents) > 1:
    return f'error {remote_ref} "multiple bundles exists on server. Run git-s3 doctor to fix."?\n'
Contributor

The error message could also suggest upgrading git-remote-s3 to the latest version, since this should be the only case where we get duplicate bundles on the remote. Correct?

Contributor Author

yep exactly, good point

    self.release_lock(remote_ref, lock_key)
except Exception:
    logger.info(f"failed to release lock {lock_key} for {remote_ref}")
# Clean up temp bundle that might've been written but never got pushed
Contributor

Remove this comment. This line is there to remove the local bundle even if it got pushed.

if lock_key:
    try:
        self.release_lock(remote_ref, lock_key)
    except Exception:
Contributor

Be more specific with the exception you are catching. We want to be sure the lock has been deleted; otherwise any further git push (by any client) will fail, since the lock that failed to be deleted will always be the first one lexicographically. Can you propose a way to recover from such a situation, or at least to notify the user in the eventuality this happens?
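For example, the except clause could be narrowed to the boto3 client error, assuming release_lock deletes the lock with a delete_object call; the warning text here is purely illustrative:

import botocore.exceptions

if lock_key:
    try:
        self.release_lock(remote_ref, lock_key)
    except botocore.exceptions.ClientError as e:
        # Surface the exact failure so the user knows the remote may be left
        # locked and can clean it up manually if needed.
        logger.warning(
            f"failed to release lock {lock_key} for {remote_ref}: "
            f"{e.response['Error']['Code']}; future pushes may be blocked"
        )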

Contributor Author

Can you propose a way to recover from such situation or at least to notify the user in the eventuality this happens?

This is an excellent call out. I was thinking of a couple of options here, but, again, I really want to respect the simplicity of the existing design:

  1. add lock TTL + takeover
  • on 412s (lock exists), read the lock's LastModified
  • if the lock is older than the TTL, try a conditional delete and immediately reacquire the lock
  • if the delete races and fails, back off and retry

2a. Add retries and notify user on repeated failures

  • if the client fails to acquire the lock after X retries or Y seconds, return a clear error containing the culprit lock key, its age, etc., and tell the user to run doctor

2b. add lock cleanup to git-remote-s3 doctor

  • have the command list locks older than the TTL, sorted by age (rough sketch below)
  • add --force-clear-locks to remove stale locks, or add an interactive option
  • use a conditional delete to avoid racing with an active lock

I'm slightly biased towards 2a + 2b as this still manages to keep the client implementation fairly simple.
I'd love to get your opinions here or any other ideas you might have!
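A rough sketch of the stale-lock listing mentioned in option 2b, assuming a boto3 client, a <prefix>/<ref>/LOCK key layout, and a hypothetical LOCK_TTL_SECONDS constant; this is illustrative only, not the PR's actual doctor code:

from datetime import datetime, timedelta, timezone

LOCK_TTL_SECONDS = 60  # assumed default TTL from the proposal

def find_stale_locks(s3, bucket, prefix):
    # List LOCK objects older than the TTL so doctor can report or remove them.
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=LOCK_TTL_SECONDS)
    stale = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/LOCK") and obj["LastModified"] < cutoff:
                stale.append((obj["Key"], obj["LastModified"]))
    # Sort oldest first so the most likely stale lock is reported first.
    return sorted(stale, key=lambda item: item[1])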

@adit-chandra
Contributor Author

Thank you for taking the time to give a thoughtful review @massi-ang!

I really wanted to minimize the amount of complexity I introduced with the locking design, especially since you've done a great job of keeping this client design simple and transparent. I think using S3 conditional writes is a great idea. You're definitely right: doing so significantly simplifies the locking design and reduces failure modes! I hadn't even considered leveraging built-in S3 features!

I agree that the design introduces a failure mode when the lock isn't deleted; I think this could also be framed as trading the multi-bundle failure mode for the left-over lock failure mode.
I believe the left-over lock is more amenable to automatic recovery (e.g. a lock TTL/lease) compared to the multi-bundle situation, so perhaps this is a reasonable design trade-off to make.

@adit-chandra
Contributor Author

adit-chandra commented Sep 22, 2025

Hey @massi-ang, wanted to double-check if you've had a chance to check out these changes I made to address your comments!

  • I ended up implementing the locking behavior using S3 conditional writes, and I definitely prefer this over the additional lock-object comparison.
  • added some defensive behaviors, namely a TTL on the lock object and logic in the doctor cli to remove stale locks, to mitigate unintentional lock persistence
  • also clarified the error messaging + readme to make these changes especially transparent to users

please let me know if you've got any other ideas or comments!

@massi-ang massi-ang merged commit 9d53e01 into awslabs:main Oct 17, 2025
1 check passed