Try moving to LLVM toolchain on Windows
exec on Windows is broken
Windows has a fundamentally different process model from POSIX platforms.
Specifically, instead of
exec Windows only provides
CreateProcess, which spawns a new process with a fresh address space and no
relation to the creating process. This is particularly problematic when tools
expecting a POSIX environment (e.g.
gcc) are ported to Windows. The reason
for this is that
exec on Windows is emulated via
spawnve. That is, while on POSIX
exec replaces the calling process, on Windows it results in a new
process (killing the parent)[^1].
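To make the POSIX half of this concrete, here is a minimal Python sketch (not taken from GHC or process; the helper name pids_seen_across_exec is made up for illustration). It starts a child interpreter that prints its PID, calls os.execv, and prints its PID again from the new program image. On a POSIX system the two PIDs should match, since exec replaces the process in place. On Windows I would expect them to differ, and the second line may even be missing, since the original process exits as soon as the new one has been spawned.

```python
import subprocess
import sys

# Script run by the child: print our PID, then exec a fresh interpreter
# which prints its own PID. The flush matters because exec discards any
# unflushed stdio buffers along with the old program image.
CHILD = r"""
import os, sys
print(os.getpid(), flush=True)
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_seen_across_exec():
    """Return the PIDs printed before and after the exec call (hypothetical helper)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return [int(line) for line in result.stdout.split()]


if __name__ == "__main__":
    pids = pids_seen_across_exec()
    if len(pids) == 2 and pids[0] == pids[1]:
        print("exec replaced the process in place")
    else:
        print("exec produced a different process (or we returned too early):", pids)
```

Note that subprocess.run only waits for the process it started, which is exactly the kind of blocking wait that goes wrong when exec is emulated by spawning.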
This poses trouble for anyone (e.g. ghc) who blocks on a child process which
itself calls exec (e.g. to run as), since it will appear that the
child has finished when in reality it has simply continued execution as a
new process. Fixing this issue is Very Hard (as we will see, it is in fact
impossible). The usual workaround (which GHC started adopting a few releases ago)
is to use the Windows job object
facility. The rough idea is described in this
A job object represents a set of processes for which one can receive
notifications via an I/O completion port (IOCP).
The idea is that when we spawn a process that might exec we place it in a job
object and listen for when it creates children. When a child is created we also
assign it to the job object. Furthermore, we listen for when processes in the
job object terminate. When all processes in the job have terminated we can
conclude that the entire subtree has finished.
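The bookkeeping described above can be sketched as a small simulation. This is plain Python with no Windows API calls: the event names NEW_PROCESS and EXIT_PROCESS, the JobTracker class, and wait_for_job are all hypothetical stand-ins for the notifications one would read from the completion port, not the actual implementation in process.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple

# Hypothetical event names standing in for the job object's
# "process created" and "process exited" notifications.
NEW_PROCESS = "new"
EXIT_PROCESS = "exit"


@dataclass
class JobTracker:
    """Tracks which processes in a (simulated) job are still alive."""

    live: Set[int] = field(default_factory=set)

    def handle(self, event: str, pid: int) -> None:
        if event == NEW_PROCESS:
            self.live.add(pid)
        elif event == EXIT_PROCESS:
            self.live.discard(pid)
        else:
            raise ValueError(f"unknown event: {event!r}")

    def finished(self) -> bool:
        return not self.live


def wait_for_job(root_pid: int, events: Iterable[Tuple[str, int]]) -> bool:
    """Consume notifications until no process in the job remains.

    Returns True once the whole tree has exited, or False if the
    notifications run out while something is still believed to be alive.
    """
    tracker = JobTracker({root_pid})
    for event, pid in events:
        tracker.handle(event, pid)
        if tracker.finished():
            return True
    return False


if __name__ == "__main__":
    # gcc (pid 100) "execs" as: a new process 101 appears, then 100 exits.
    events = [(NEW_PROCESS, 101), (EXIT_PROCESS, 100), (EXIT_PROCESS, 101)]
    print(wait_for_job(100, events))
```

The point of the sketch is that the waiter only declares completion once the set of live processes is empty, so the early exit of the process that called exec does not end the wait. It also shows how much the scheme leans on every notification actually arriving.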
Of course, this being Windows, there are a few wrinkles here:
- The process which calls exec terminates with a successful exit code, despite the fact that the child process may exit with an unsuccessful exit code.
- It turns out that the IOCP notifications aren't guaranteed to be delivered [^2], resulting in child processes slipping through our fingers or hangs.
Regarding the latter, the documentation which suggests that IOCP notifications may be lost is frustratingly vague:
Note that, except for limits set with the JobObjectNotificationLimitInformation information class, messages are intended only as notifications and their delivery to the completion port is not guaranteed. The failure of a message to arrive at the completion port does not necessarily mean that the event did not occur. Notifications for limits set with JobObjectNotificationLimitInformation are guaranteed to arrive at the completion port.
I honestly don't know if this loss is something that could actually happen today or rather something that may be exploited in the future.
Despite the fact that
exec appears to be hopelessly broken on Windows, up
until now we have nevertheless tried to keep the boat afloat on Windows via the
job object workaround described above (having no good alternative to
binutils). Often this even happened to work.
However, with my recent attempts at solidifying our Windows support this has been tearing at the seams, with countless cases of CI flakiness. Many of these issues are manifestations of a few implementation bugs:
- Job support in process, which is supposed to offer us some assurance that programs using exec (e.g. gcc) are properly waited on, is completely broken. I explain the problem in the commit message of my fix:
  System.Process.waitForProcess failed to keep the ProcessHandle's MVar alive, potentially resulting in the finalizer being run while waitForJobCompletion is executing. This would cause the process handles to be closed, causing waitForJobCompletion to fail.
- Cabal (specifically Cabal.Distribution.Simple.Utils) makes no attempt at using jobs (this is the cause of #17691 (closed))
- GHC also doesn't use jobs reliably (this is the issue fixed in !2486 (closed))
However, as we saw above, even if these were fixed, we still can't rely on job objects to give us a correct indication of whether a process tree has terminated.
MAX_PATH is a constant headache
exec is not the only source of trouble for our
gcc toolchain. We have also
long fought with the Windows
MAX_PATH limitation, limiting file paths to 260
characters in length. Thankfully, all Windows releases supported by GHC
support paths in the
\\?\ namespace, which are not subject to this limitation.
Thanks to @Phyx, GHC itself has excellent support for long paths via this mechanism.
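For readers unfamiliar with the mechanism, here is a rough Python sketch of what rewriting a path into the \\?\ namespace involves. It is not GHC's implementation; to_extended_length is a hypothetical helper, and it deliberately ignores device paths such as \\.\. It uses ntpath so that it behaves the same on any host platform.

```python
import ntpath


def to_extended_length(path: str) -> str:
    r"""Return `path` rewritten into the \\?\ namespace (illustrative sketch only)."""
    if path.startswith("\\\\?\\"):
        return path  # already in the extended-length namespace

    # Windows passes \\?\ paths through without normalisation, so they must
    # be absolute, use backslashes, and contain no "." or ".." components.
    path = ntpath.normpath(path)
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute")

    if path.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return "\\\\?\\UNC\\" + path[2:]
    return "\\\\?\\" + path


if __name__ == "__main__":
    print(to_extended_length("C:/ghc/../ghc/lib/settings"))
    print(to_extended_length("\\\\server\\share\\build\\Main.o"))
```

The need to normalise and absolutise the path up front is part of why this is awkward to retrofit onto code that hands arbitrary relative paths to the C runtime.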
gcc, which relies on the C runtime implementations provided by
enjoys no such support. There are two possible ways this could be fixed:
- Modify gcc et al. to avoid using file I/O operations from the C runtime. Unfortunately, this is an incredibly daunting task and the patch that results would be a hard sell to upstream, meaning we would be stuck maintaining thousands of lines of patches.
- Fix the C runtime to support long paths.
A few releases ago, @Phyx introduced a tool and library implementing the latter. While it has worked reasonably well, it nevertheless imposes a non-trivial maintenance overhead and has been a source of bugs.
A way forward: Try LLVM?
Both of the issues described above ultimately stem from the fact that GCC is a
large legacy codebase which supports Windows as a second-class citizen by linking against
msvcrt. On the
other hand, LLVM's Windows support has been gradually improving over the years
and it now seems plausible that it is mature enough to use as our primary native toolchain.
This would likely bring a few benefits:
- a faster linker, lld (although we would still need to implement bigobj support on i386)
- less dependence on the legacy msvcrt implementation, fixing the above issues
- we could possibly support both GHC's NCG and LLVM backends with only one toolchain tarball
However, there is also one major disadvantage: @Phyx, our primary Windows contributor, is unable to contribute to LLD due to legal reasons. This is a very hard trade-off.
[^1]: One might ask why exec doesn't block on the process that it spawns, ensuring that the caller does not terminate before the spawned process does. To which I would reply: I have no idea. I can only guess that it was expected that this would leave too many blocked processes lying around.