Try moving to LLVM toolchain on Windows
`exec` on Windows is broken
Background: Windows has a fundamentally different process model from POSIX platforms.
Specifically, instead of `fork` and `exec`, Windows only provides `CreateProcess`, which spawns a new process with a fresh address space and no relation to the creating process. This is particularly problematic when tools expecting a POSIX environment (e.g. `gcc`) are ported to Windows. The reason for this is that `exec` on Windows is emulated via `spawnve`. That is, while on POSIX `exec` replaces the calling process, on Windows it results in a new process (killing the parent)[^1].
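The POSIX half of this claim can be checked with a small experiment. The sketch below is only an illustration (the helper name `pids_around_exec` is made up here): a child process prints its PID, calls `os.execv`, and the replacement program prints its PID again. On POSIX the two values should match, because `exec` replaces the process image in place. It is written with POSIX in mind; on Windows the emulated `exec` would be expected to report a different PID, and the output may not even arrive as two clean lines.

```python
import subprocess
import sys

# Script run by the child: report our PID, then exec a fresh interpreter
# that reports its own PID.
CHILD_SCRIPT = """
import os, sys
print(os.getpid(), flush=True)
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_around_exec():
    """Return (pid_before_exec, pid_after_exec) as reported by a child process."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        check=True,
    )
    before, after = (int(line) for line in result.stdout.split())
    return before, after


if __name__ == "__main__":
    before, after = pids_around_exec()
    print("same process" if before == after else "new process")
```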
This poses trouble for anyone (e.g. `ghc`) who blocks on a child process which uses `exec` (e.g. `gcc`, which `exec`s `as`), since it will appear that the child has finished when in reality it has simply continued execution as a new process. Fixing this issue is Very Hard (as we will see, it is in fact impossible). The usual workaround (that GHC started adopting a few releases ago) is to use the Windows Job facility. The rough idea is described in this post.
A job object represents a set of processes for which one can receive notifications via an I/O completion port.
The idea is that when we spawn a process that might `exec`, we place it in a job object and listen for when it creates children. When a child is created, we also assign it to the job object. Furthermore, we listen for when processes in the job object terminate. When all processes in the job have terminated we can conclude that the entire subtree has finished.
Of course, this being Windows, there are a few wrinkles here:
- The process which calls `exec` terminates with a successful exit code, despite the fact that the child process may exit with an unsuccessful exit code.
- It turns out that the IOCP notifications aren't guaranteed to be delivered [^2], resulting in child processes slipping through our fingers or hangs.
Regarding the latter, the documentation which suggests that IOCP notifications may be lost is frustratingly vague:
> Note that, except for limits set with the JobObjectNotificationLimitInformation information class, messages are intended only as notifications and their delivery to the completion port is not guaranteed. The failure of a message to arrive at the completion port does not necessarily mean that the event did not occur. Notifications for limits set with JobObjectNotificationLimitInformation are guaranteed to arrive at the completion port.
I honestly don't know if this loss is something that could actually happen today or rather something that may be exploited in the future.
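To make the job-object scheme and its two wrinkles concrete, here is a small platform-independent model of the bookkeeping. It is a sketch under assumptions, not the code in `process` or GHC: the names `Msg` and `wait_for_job` are hypothetical, notifications are plain tuples rather than real completion-port packets, and the exit code is passed along with the exit message purely for simplicity.

```python
from enum import Enum, auto


class Msg(Enum):
    """Hypothetical stand-ins for job notifications."""
    NEW_PROCESS = auto()   # a process joined the job
    EXIT_PROCESS = auto()  # a process in the job exited


def wait_for_job(root_pid, notifications):
    """Consume (msg, pid, exit_code) tuples for a job rooted at root_pid.

    Returns (finished, exit_code). finished is True once every tracked
    process has exited; exit_code is the most recent non-zero exit code
    seen in the tree, or 0 if every exit was successful.
    """
    live = {root_pid}
    exit_code = 0
    for msg, pid, code in notifications:
        if msg is Msg.NEW_PROCESS:
            live.add(pid)
        elif msg is Msg.EXIT_PROCESS:
            live.discard(pid)
            if code != 0:
                exit_code = code
        if not live:
            return True, exit_code
    # Ran out of notifications while processes are still tracked as alive.
    return False, exit_code


# gcc (pid 1) "execs" as (pid 2): gcc exits with 0, as later fails with 1.
events = [
    (Msg.NEW_PROCESS, 2, None),
    (Msg.EXIT_PROCESS, 1, 0),
    (Msg.EXIT_PROCESS, 2, 1),
]
print(wait_for_job(1, events))      # whole tree observed: failure is visible
print(wait_for_job(1, events[:2]))  # exit of pid 2 lost: never finishes
print(wait_for_job(1, events[1:]))  # creation of pid 2 lost: finishes too early
```

In this model, watching only the root's exit code would report success even though `as` failed, a lost exit message leaves the waiter blocked, and a lost creation message makes the waiter conclude that the tree has finished while a child is still running.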
Further bugs
Despite the fact that `exec` appears to be hopelessly broken on Windows, up until now we have nevertheless tried to keep the boat afloat on Windows via the job object workaround described above (having no good alternative to `gcc` and `binutils`). Often this even happened to work.
However, with my recent attempts at solidifying our Windows support, this has been tearing at the seams, with countless cases of CI flakiness. Many of these issues are manifestations of a few implementation bugs:
- Job support in `process`, which is supposed to offer us some assurance that programs using `exec` (e.g. `gcc`) are properly waited on, is completely broken. I explain the problem in the commit message of my fix:

  > System.Process.waitForProcess failed to keep the ProcessHandle's MVar alive, potentially resulting in the finalizer being run while waitForJobCompletion is executing. This would cause the process handles to be closed, causing in waitForJobCompletion to fail.
- `Cabal` (specifically in `Cabal.Distribution.Simple.Utils`) makes no attempt at using jobs (this is the cause of #17691 (closed))
- GHC also doesn't use jobs reliably (this is the issue fixed in !2486 (closed))
However, as we saw above, even if these were fixed, we still can't rely on job objects to give us a correct indication of whether a process tree has terminated.
`MAX_PATH` is a constant headache
Background: `exec` is not the only source of trouble for our `gcc` toolchain. We have also long fought with the Windows `MAX_PATH` limitation, limiting file paths to 260 characters in length. Thankfully, all Windows releases supported by GHC support paths in the `\\?\` namespace, which are not subject to this limitation.
Thanks to @Phyx, GHC itself has excellent support for long paths via this mechanism.
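As a rough illustration of what using the `\\?\` namespace involves, the sketch below rewrites an absolute Windows path into extended-length form. It is not GHC's implementation; the function name `to_extended_length` is hypothetical, and the code only manipulates strings (via the standard `ntpath` module), so it can be tried on any platform. The path is normalised first on the assumption that paths in this namespace are passed through largely as-is, so `..` components and forward slashes need to be resolved by the caller.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"        # the four characters \\?\
EXTENDED_UNC_PREFIX = "\\\\?\\UNC\\"


def to_extended_length(path: str) -> str:
    """Rewrite an absolute Windows path into the \\\\?\\ namespace."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended-length form
    if not ntpath.isabs(path):
        raise ValueError("extended-length paths must be absolute: " + path)
    # Resolve "." and ".." and turn forward slashes into backslashes.
    norm = ntpath.normpath(path)
    if norm.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_UNC_PREFIX + norm[2:]
    return EXTENDED_PREFIX + norm


print(to_extended_length("C:/build/dist/../obj/Main.o"))
print(to_extended_length("\\\\server\\share\\dir\\file.txt"))
```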
Unfortunately `gcc`, which relies on the C runtime implementations provided by `msvcrt`, enjoys no such support. There are two possible ways this could be fixed:
- Refactor `gcc` et al. to avoid using file I/O operations from the C runtime. Unfortunately, this is an incredibly daunting task and the patch that results would be a hard sell to upstream, meaning we would be stuck maintaining thousands of lines of `gcc` patches indefinitely.
- Fix the C runtime to support long paths.
A few releases ago, @Phyx introduced a tool and library implementing the latter. While it has worked reasonably well, it nevertheless imposes a non-trivial maintenance overhead and has been a source of bugs.
A way forward: Try LLVM?
Both of the issues described above ultimately stem from the fact that GCC is a large legacy codebase which supports Windows as a second-class citizen by linking against `msvcrt`. On the other hand, LLVM's Windows support has been gradually improving over the years and it now seems plausible that it is mature enough to use as our primary native toolchain.
This would likely bring a few benefits:
- a faster linker, `lld` (although we would still need to implement bigobj support on i386)
- less dependence on the legacy `msvcrt` implementation, fixing the above issues
- we could possibly support both GHC's NCG and LLVM backends with only one toolchain tarball
However, there is also one major disadvantage: @Phyx, our primary Windows contributor, is unable to contribute to LLD due to legal reasons. This is a very hard trade-off.
[^1]: One might ask why `exec` doesn't block on the process that it spawns, ensuring that the caller does not terminate before the spawned process does. To which I would reply: I have no idea. I can only guess that it was expected that this would leave too many blocked processes lying around.