
Opened Feb 01, 2020 by Ben Gamari@bgamari🐢
Try moving to LLVM toolchain on Windows

Background: exec on Windows is broken

Windows has a fundamentally different process model from POSIX platforms. Specifically, instead of fork and exec, Windows provides only CreateProcess, which spawns a new process with a fresh address space and no relation to the creating process. This is particularly problematic when tools expecting a POSIX environment (e.g. gcc) are ported to Windows, because exec on Windows is emulated via spawnve. That is, while on POSIX exec replaces the calling process, on Windows it results in a new process (killing the parent) [^1].

This poses trouble for anyone (e.g. ghc) who blocks on a child process which uses exec (e.g. gcc, which execs as), since it will appear that the child has finished when in reality it has simply continued execution as a new process. Fixing this issue is Very Hard (as we will see, it is in fact impossible). The usual workaround (which GHC started adopting a few releases ago) is to use the Windows Job facility. The rough idea is described in this post.
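To make the failure mode concrete, here is a minimal sketch that imitates it on any platform. It does not call the real exec emulation: the "child" script merely spawns a "grandchild" and exits at once, which is the behaviour described above. The names run_emulated_exec, CHILD and GRANDCHILD, the 2-second sleep and the exit code 3 are all made up for illustration.

```python
import subprocess
import sys
import time

# Grandchild: stands in for the exec'd program (e.g. `as`).
# It runs for a while and then fails with exit code 3.
GRANDCHILD = "import time, sys; time.sleep(2.0); sys.exit(3)"

# Child: stands in for a program whose exec is emulated as
# "spawn a new process, then terminate the caller successfully".
CHILD = (
    "import subprocess, sys;"
    f"subprocess.Popen([sys.executable, '-c', {GRANDCHILD!r}]);"
    "sys.exit(0)"
)


def run_emulated_exec():
    """Wait on the child the way a naive caller would.

    Returns the exit code the waiter observes and how long the wait took.
    """
    start = time.monotonic()
    child = subprocess.run([sys.executable, "-c", CHILD])
    elapsed = time.monotonic() - start
    return child.returncode, elapsed


if __name__ == "__main__":
    code, elapsed = run_emulated_exec()
    print(f"waiter saw exit code {code} after {elapsed:.2f}s")
```

The waiter is released with a successful exit code while the grandchild is still running, so neither the real completion time nor the real exit status is visible to it.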

A job object represents a set of processes for which one can receive notifications via an I/O completion port. The idea is that when we spawn a process that might exec, we place it in a job object and listen for when it creates children. When a child is created, we also assign it to the job object. Furthermore, we listen for when processes in the job object terminate. When all processes in the job have terminated, we can conclude that the entire subtree has finished.
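The bookkeeping this implies can be sketched as a small model. This is not the Windows API: the message stream is a hypothetical list of (message, pid) pairs standing in for what would be dequeued from the completion port, and JobMsg and job_finished are illustrative names. The two enum members are named after the JOB_OBJECT_MSG_NEW_PROCESS and JOB_OBJECT_MSG_EXIT_PROCESS notifications, but their values here are arbitrary.

```python
from enum import Enum, auto
from typing import Iterable, Tuple


class JobMsg(Enum):
    # Named after the Windows JOB_OBJECT_MSG_* notifications; the values
    # are arbitrary and are not the real Windows constants.
    NEW_PROCESS = auto()
    EXIT_PROCESS = auto()


def job_finished(root_pid: int, messages: Iterable[Tuple[JobMsg, int]]) -> bool:
    """Return True once every process seen in the job has exited.

    `messages` is a hypothetical stream of (message, pid) pairs.
    """
    alive = {root_pid}  # the process we spawned and placed in the job
    for msg, pid in messages:
        if msg is JobMsg.NEW_PROCESS:
            alive.add(pid)       # a child joined the job
        elif msg is JobMsg.EXIT_PROCESS:
            alive.discard(pid)   # a member of the job terminated
        if not alive:
            return True          # the whole subtree has terminated
    return False                 # stream ended with processes outstanding


if __name__ == "__main__":
    # gcc (pid 1) spawns as (pid 2), then gcc exits, then as exits.
    stream = [
        (JobMsg.NEW_PROCESS, 2),
        (JobMsg.EXIT_PROCESS, 1),
        (JobMsg.EXIT_PROCESS, 2),
    ]
    print(job_finished(1, stream))
```

If the stream ends while the set is still non-empty, the function reports that the tree has not finished, which in a real waiter would mean continuing to block.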

Of course, this being Windows, there are a few wrinkles here:

  • The process which calls exec terminates with a successful exit code, despite the fact that the child process may exit with an unsuccessful exit code.
  • It turns out that the IOCP notifications aren't guaranteed to be delivered [^2], resulting in child processes slipping through our fingers or hangs.

Regarding the latter, the documentation which suggests that IOCP notifications may be lost is frustratingly vague:

Note that, except for limits set with the JobObjectNotificationLimitInformation information class, messages are intended only as notifications and their delivery to the completion port is not guaranteed. The failure of a message to arrive at the completion port does not necessarily mean that the event did not occur. Notifications for limits set with JobObjectNotificationLimitInformation are guaranteed to arrive at the completion port.

I honestly don't know if this loss is something that could actually happen today or rather something that may be exploited in the future.

Further bugs

Despite the fact that exec appears to be hopelessly broken on Windows, up until now we have nevertheless tried to keep the boat afloat via the job object workaround described above (having no good alternative to gcc and binutils). Often this even happened to work.

However, with my recent attempts at solidifying our Windows support, this has been coming apart at the seams, with countless cases of CI flakiness. Many of these issues are manifestations of a few implementation bugs:

  • Job support in process, which is supposed to offer us some assurance that programs using exec (e.g. gcc) are properly waited on, is completely broken. I explain the problem in the commit message of my fix:

    System.Process.waitForProcess failed to keep the ProcessHandle's MVar alive, potentially resulting in the finalizer being run while waitForJobCompletion is executing. This would cause the process handles to be closed, causing waitForJobCompletion to fail.

  • Cabal (specifically in Distribution.Simple.Utils) makes no attempt at using jobs (this is the cause of #17691)

  • GHC also doesn't use jobs reliably (this is the issue fixed in !2486 (closed))

However, as we saw above, even if these were fixed, we still couldn't rely on job objects to give us a correct indication of whether a process tree has terminated.

Background: MAX_PATH is a constant headache

exec is not the only source of trouble for our gcc toolchain. We have also long fought with the Windows MAX_PATH limitation, which limits file paths to 260 characters in length. Thankfully, all Windows releases supported by GHC support paths in the \\?\ namespace, which are not subject to this limitation. Thanks to @Phyx, GHC itself has excellent support for long paths via this mechanism.
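As an illustration of the \\?\ mechanism (and not of GHC's actual implementation), the sketch below rewrites an absolute Windows path into that namespace. The helper name to_extended_length is hypothetical. It uses Python's ntpath module so that it behaves the same on any host, and it normalises the path first on the assumption that \\?\ paths are passed through without the usual separator and ".." handling.

```python
import ntpath

EXTENDED_PREFIX = "\\\\?\\"  # the literal characters \\?\


def to_extended_length(path: str) -> str:
    r"""Rewrite an absolute Windows path into the \\?\ namespace."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already in extended-length form
    drive, rest = ntpath.splitdrive(path)
    if not drive or not rest.startswith(("\\", "/")):
        raise ValueError("only fully qualified paths can be converted")
    # Resolve '/' separators and '..' components up front.
    path = ntpath.normpath(path)
    if path.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_PREFIX + "UNC\\" + path[2:]
    return EXTENDED_PREFIX + path


if __name__ == "__main__":
    print(to_extended_length("C:/build/../obj/Main.o"))
    print(to_extended_length(r"\\server\share\obj\Main.o"))
```

Relative paths are rejected rather than resolved, since resolving them would depend on the current directory of whichever process eventually opens the file.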

Unfortunately gcc, which relies on the C runtime implementations provided by msvcrt, enjoys no such support. There are two possible ways this could be fixed:

  • Refactor gcc et al. to avoid using file I/O operations from the C runtime. Unfortunately, this is an incredibly daunting task and the patch that results would be a hard sell to upstream, meaning we would be stuck maintaining thousands of lines of gcc patches indefinitely.

  • Fix the C runtime to support long paths.

A few releases ago, @Phyx introduced a tool and library implementing the latter. While it has worked reasonably well, it nevertheless imposes a non-trivial maintenance overhead and has been a source of bugs.

A way forward: Try LLVM?

Both of the issues described above ultimately stem from the fact that GCC is a large legacy codebase which supports Windows as a second-class citizen. On the other hand, LLVM's Windows support has been gradually improving over the years and it now seems plausible that it is mature enough to use as our primary native toolchain.

This would likely bring a few benefits:

  • a faster linker, lld (although we would still need to implement bigobj support on i386)
  • less dependence on the legacy msvcrt implementation, fixing the above issues
  • we could possibly support both GHC's NCG and LLVM backends with only one toolchain tarball

However, there is also one major disadvantage: @Phyx, our primary Windows contributor, is unable to contribute to LLD due to legal reasons. This is a very hard trade-off.

  1. One might ask why exec doesn't block on the process that it spawns, ensuring that the caller does not terminate before the spawned process does. To which I would reply: I have no idea. I can only guess that it was expected that this would leave too many blocked processes lying around.

Edited Feb 02, 2020 by Ben Gamari

Milestone: 8.12.1
Labels: task, Windows
Reference: ghc/ghc#17777