
Conversation

@adit-chandra
Contributor

@adit-chandra adit-chandra commented Aug 13, 2025

  • we've run into a couple of instances where concurrent pushes to our s3 remote result in two bundles being written to the ref. This is the expected and documented behavior, but it still bricks the remote ref and prevents us from pushing until one of the two bundles is removed.
  • proposal: add a per-ref lock built around an s3 lock object that the client must acquire before it can push its bundle to the remote.
    • use S3 conditional writes to resolve lock acquisition races
    • add a TTL (default 60s) for lock objects
    • add support for stale lock inspection to the git-remote-s3 doctor cli
  • added a unit test to verify the behavior; updated and new unit tests pass.
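For context, a rough sketch of the push-under-lock flow described in this proposal. acquire_lock, release_lock, and get_bundles_for_ref are names that appear in the diff; push_with_lock, push_bundle, and the exact messages are hypothetical placeholders used here for illustration, not the PR's actual code:

def push_with_lock(self, remote_ref, bundle_path):
    lock_key = None
    try:
        # Acquire the per-ref lock object; only one concurrent pusher succeeds.
        lock_key = self.acquire_lock(remote_ref)
        if lock_key is None:
            return f"error {remote_ref} lock held by another client, retry later\n"

        # Refuse to push if the ref is already in the multi-bundle state.
        if len(self.get_bundles_for_ref(remote_ref)) > 1:
            return f"error {remote_ref} multiple bundles exist on server\n"

        # Upload the new bundle and report success to git.
        self.push_bundle(remote_ref, bundle_path)
        return f"ok {remote_ref}\n"
    finally:
        if lock_key:
            self.release_lock(remote_ref, lock_key)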

@adit-chandra
Contributor Author

Hey @massi-ang! wanted to check if you had a chance to skim these changes or had any opinions on the proposal

please let me know if there's any additional context or detail i can provide -- goes without saying, but we really appreciate you taking the time to maintain the library!

Contributor

@massi-ang massi-ang left a comment

Thanks for submitting this PR and for the effort put into creating it.
The original design idea was to keep things as simple as possible and have as few failure scenarios as possible. Introducing a lock mechanism adds an additional failure scenario in case any of the locks fails to be deleted. As with the multiple-bundle scenario, this should be detected and you should provide an easy way to fix it.

I like your clever implementation, but I wonder if using conditional writes wouldn't be simpler. By using a conditional write with If-None-Match, you can have each client write a LOCK object under the /remote_ref/ prefix, but only one would succeed. It only requires updating the acquire_lock and release_lock functions. It might also have the advantage of reducing the failure scenarios, since only one additional object is written instead of the at least two in your implementation, hence one less object that might fail to be written or deleted.
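For illustration, a minimal sketch of what that conditional-write approach could look like with boto3, assuming self.s3 is a boto3 S3 client, self.bucket/self.prefix locate the remote, and one LOCK object per ref. The If-None-Match precondition and the 412 PreconditionFailed error code are standard S3 behavior; everything else here is a hypothetical sketch rather than the PR's actual code:

import botocore.exceptions

LOCK_NAME = "LOCK"

def acquire_lock(self, remote_ref):
    # Try to create <prefix>/<remote_ref>/LOCK; IfNoneMatch="*" makes the PUT
    # fail with 412 PreconditionFailed if the object already exists, so only
    # one client can win the race.
    key = f"{self.prefix}/{remote_ref}/{LOCK_NAME}"
    try:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=b"", IfNoneMatch="*")
        return key
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return None  # another client currently holds the lock
        raise

def release_lock(self, remote_ref, lock_key):
    # Deleting the LOCK object frees the ref for the next push.
    self.s3.delete_object(Bucket=self.bucket, Key=lock_key)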

# Otherwise, we proceed with pushing the new bundle
current_contents = self.get_bundles_for_ref(remote_ref)
if len(current_contents) > 1:
    return f'error {remote_ref} "multiple bundles exists on server. Run git-s3 doctor to fix."?\n'
Contributor

The error message could also suggest upgrading git-remote-s3 to the latest version, since this should be the only case where we get duplicate bundles on the remote. Correct?

Contributor Author

yep exactly, good point

    self.release_lock(remote_ref, lock_key)
except Exception:
    logger.info(f"failed to release lock {lock_key} for {remote_ref}")
# Clean up temp bundle that might've been written but never got pushed
Contributor

Remove this comment. This line is there to remove the local bundle even if it got pushed.

if lock_key:
    try:
        self.release_lock(remote_ref, lock_key)
    except Exception:
Contributor

Be more specific with the exception you are catching. We want to be sure the lock has been deleted; otherwise any further git push (by any client) will fail, since the lock that failed to be deleted will always be the first one lexicographically. Can you propose a way to recover from such a situation, or at least to notify the user in the eventuality this happens?
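For example, the except clause could be narrowed to the boto3 client error, assuming release_lock deletes the lock with a delete_object call; the warning text here is purely illustrative:

import botocore.exceptions

if lock_key:
    try:
        self.release_lock(remote_ref, lock_key)
    except botocore.exceptions.ClientError as e:
        # Surface the exact failure so the user knows the remote may be left
        # locked and can clean it up manually if needed.
        logger.warning(
            f"failed to release lock {lock_key} for {remote_ref}: "
            f"{e.response['Error']['Code']}; future pushes may be blocked"
        )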

Contributor Author

Can you propose a way to recover from such situation or at least to notify the user in the eventuality this happens?

This is an excellent call out. I was thinking of a couple of options here, but, again, I really want to respect the simplicity of the existing design:

  1. add lock TTL + takeover
  • on 412s (lock exists), read the lock's LastModified
  • if the lock is older than the TTL, try a conditional delete and immediately reacquire the lock
  • if the delete races and fails, back off and retry

2a. Add retries and notify user on repeated failures

  • if the client fails to acquire the lock after X retries or Y seconds, return a clear error containing the culprit lock key, its age, etc., and tell the user to run doctor

2b. add lock cleanup to git-remote-s3 doctor

  • have the command list locks older than the TTL, sorted by age (rough sketch below)
  • add --force-clear-locks to remove stale locks, or add an interactive option
  • use a conditional delete to avoid racing with an active lock

I'm slightly biased towards 2a + 2b as this still manages to keep the client implementation fairly simple.
I'd love to get your opinions here or any other ideas you might have!
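A rough sketch of the stale-lock listing mentioned in option 2b, assuming a boto3 client, a <prefix>/<ref>/LOCK key layout, and a hypothetical LOCK_TTL_SECONDS constant; this is illustrative only, not the PR's actual doctor code:

from datetime import datetime, timedelta, timezone

LOCK_TTL_SECONDS = 60  # assumed default TTL from the proposal

def find_stale_locks(s3, bucket, prefix):
    # List LOCK objects older than the TTL so doctor can report or remove them.
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=LOCK_TTL_SECONDS)
    stale = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/LOCK") and obj["LastModified"] < cutoff:
                stale.append((obj["Key"], obj["LastModified"]))
    # Sort oldest first so the most likely stale lock is reported first.
    return sorted(stale, key=lambda item: item[1])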

@adit-chandra
Contributor Author

Thank you for taking the time to give a thoughtful review @massi-ang!

I really wanted to minimize the amount of complexity I introduced with the locking design, especially since you've done a great job of keeping this client design simple and transparent. I think using S3 conditional writes is a great idea. You're definitely right: doing so significantly simplifies the locking design and reduces failure modes! I hadn't even considered leveraging built-in S3 features!

I agree that the design introduces a failure mode when the lock isn't deleted; I think this could also be framed as trading the multi-bundle failure mode for the left-over lock failure mode.
I believe the left-over lock is more amenable to automatic recovery (e.g. a lock TTL/lease) compared to the multi-bundle situation, so perhaps this is a reasonable design trade-off to make.

@adit-chandra
Contributor Author

adit-chandra commented Sep 22, 2025

Hey @massi-ang, wanted to double-check if you've had a chance to check out these changes I made to address your comments!

  • I ended up implementing the locking behavior using S3 conditional writes, and I definitely prefer this over the additional lock-object comparison.
  • added some defensive behaviors, namely a TTL on the lock object and logic in the doctor cli to remove stale locks, to mitigate unintentional lock persistence
  • also clarified the error messaging + readme to make these changes especially transparent to users

please let me know if you've got any other ideas or comments!

@massi-ang massi-ang merged commit 9d53e01 into awslabs:main Oct 17, 2025
1 check passed