Skip to content

Use FdLock on all targets#16569

Open
ysbaddaden wants to merge 10 commits intocrystal-lang:masterfrom
ysbaddaden:feature/move-fd-lock-out-of-crystal-system
Open

Use FdLock on all targets#16569
ysbaddaden wants to merge 10 commits intocrystal-lang:masterfrom
ysbaddaden:feature/move-fd-lock-out-of-crystal-system

Conversation

@ysbaddaden
Copy link
Collaborator

Moves Crystal::FdLock out of the internal Crystal::System types right into the stdlib types: IO::Descriptor, File and Socket.

The change is mostly to bring the single reader/writer behavior to Windows. Still, making sure to clean everything before we close a file isn't a bad idea.

Contrary to what I thought, we don't need the single reader/writer lock to implement Crystal::EventLoop::IOCP#shutdown(IO::FileDescriptor) because Windows provides CancelIoEx(handle, NULL) that allows to cancel all pending IO operations on a file handle. This may be extracted into its own PR.

Also refactors #system_bind and #system_listen that currently take a block to yield errors... while the block usually returns the error. It's simpler to just return a nilable error.

follow-up to #16209

@ysbaddaden ysbaddaden self-assigned this Jan 15, 2026
Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes so much more sense to have the locks at this level 👍

@straight-shoota
Copy link
Member

straight-shoota commented Jan 21, 2026

I was wondering whether we could safely guarantee that all file operations are protected by a lock and maintain that in the future.

And there could be a simple way to provide a bit more confidence: We could combine access to the file descriptor handle with acquiring the lock. The locking methods yield the file descriptor. It's a bit more inconvenient because we have to pass the fd handle down into the system implementations, but I don't think that should be a stopper.
We could implement that directly in FdLock or add wrappers in the FileDescriptor and Socket.

  private def unbuffered_read(slice : Bytes) : Int32
    # FdLock#read yields the fd handle, and we pass it explicitly to `system_read`
    @fd_lock.read { |fd| system_read(fd, slice) }
  end

The public #fd method still needs to exist and provide lock-free access to the handle for backwards compatibility. So we cannot make this 100% safe. In the end, nothing prevents system_read from still calling #fd instead of using the argument (or perhaps we could turn them into class methods so they have no access to the object?). But it would still be worth it to have bit more safety, even if it's not 100% water proof.

I figure this should probably be addressed in a follow-up though.

@straight-shoota straight-shoota changed the title Use FdLock on every targets Use FdLock on all targets Jan 21, 2026
@ysbaddaden
Copy link
Collaborator Author

@straight-shoota Interesting. Though FdLock doesn't know the actual fd (or the IO object) for now.

@ysbaddaden ysbaddaden added this to the 1.20.0 milestone Jan 29, 2026
@ysbaddaden
Copy link
Collaborator Author

Damn. I didn't notice the CI failures on Windows.

@straight-shoota straight-shoota removed this from the 1.20.0 milestone Jan 29, 2026
@straight-shoota
Copy link
Member

straight-shoota commented Jan 29, 2026

The icon for failed checks is way too similar to skipped ones.

image

@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jan 30, 2026

There's an odd behavior on Windows where a fiber keeps calling Crystal::FdLock.read without actually locking, and hangs the process.

It feels like a recursive call but a stackoverflow is never reached, so it's not.

Actually, the fdlock has been closed and so #lock_slow raises, so there must be a loop that rescues the exception and keeps trying to read.

@ysbaddaden
Copy link
Collaborator Author

The issue is that HTTP::Server#listen rescues the exception and retries without checking if the IO is closed, which creates the infinite loop.

It's working on UNIX, so I thought there might be something different in the UNIX vs Win32 system implementations of #accept? but it doesn't look like it.

I think #accept? should return nil early when the socket is closed.

@ysbaddaden
Copy link
Collaborator Author

ysbaddaden commented Jan 30, 2026

Fixing the #accept? methods fixes the hang 👍

Now, I get a "Process terminated abnormally, the cause is unknown" when running HTTP::Client specs (will retry a broken socket) 😓

If I mark the above spec as pending, I'm down to a couple failures:

UNIXServer accept raises when server is closed
Failure/Error: exception.try(&.message).should eq("Closed stream")

  Expected: "Closed stream"
       got: "AcceptEx timed out"

# spec\std\socket\unix_server_spec.cr:119
UNIXServer accept? returns nil when server is closed
Failure/Error: ch.receive.should eq SpecChannelStatus::End

  Expected: SpecChannelStatus::End
       got: SpecChannelStatus::Timeout

# spec\std\socket\unix_server_spec.cr:157

There might be something about shutdown related to UNIX sockets on Windows?

@ysbaddaden
Copy link
Collaborator Author

The two failures are caused by Crystal::System::Socket#overlapped_accept that raises an IO::TimeoutError when the operation has been aborted, which only happened because of a manual cancel after timeout, but now we explicitly cancel with a shutdown before we close.

Let's check if the IO has been closed before raising, and... fixed 🎉

And it seems to have fixed the "Process terminated abnormally" of the HTTP::Client spec... which still fails, but with an explicit exception:

WSARecv (#TCPSocket:0x2b448b7a8c0): An established connection was aborted by the software in your host machine. (IO::Error)

Which sounds just about right. We did cancel it.

@ysbaddaden
Copy link
Collaborator Author

Allright, I think I got most of the issues cornered. I'm not so sure about the last commits for IOCP. They look okayish but I'd like another pair of 👀

@ysbaddaden
Copy link
Collaborator Author

Sigh, one last failure apparently:

  1) UDPSocket using IPv4 joins and transmits to multicast groups
     Failure/Error: expect_raises(IO::Error) { udp.receive }

       Expected IO::Error but nothing was raised

     # spec\std\socket\udp_socket_spec.cr:201

@ysbaddaden
Copy link
Collaborator Author

@ysbaddaden
Copy link
Collaborator Author

The issue is caused by 4b286bb#diff-d5a00d86dac51ee0e517a4d7b20d9f87168c36220dc1fc4cb23360b424108e61R440

It makes sense: both cases shall raise an IO::Error exception, not report an EOF.

But then the HTTP::Client spec failure from a few comments above comes back. We must fix it some other way. The spec shouldn't pass because we returned an EOF when the connection has been explicitly closed, especially when the spec is "HTTP::Client will retry a broken socket".

@ysbaddaden ysbaddaden force-pushed the feature/move-fd-lock-out-of-crystal-system branch from 4b286bb to e03b7b0 Compare February 2, 2026 12:55
@ysbaddaden
Copy link
Collaborator Author

Rebased to bring changes from master (especially the timer fixes from 1.19.1), and added a commit to properly fix/workaround spec failures related to WSA.

@ysbaddaden
Copy link
Collaborator Author

MinGW eventually fails on x86_64 with the unhelpful:

make: *** [Makefile:142: std_spec] Error 67
Error: Process completed with exit code 2.

While the spec output looks fine (no failures, no errors).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants