    Fix testsuite threading, timeout, encoding and performance issues on Windows · 0ce59be3
    Tamar Christina authored and Ben Gamari committed

    In a land far far away, a project called Cygwin was born.
    Cygwin used newlib as its standard C library implementation.
    
    But Cygwin wanted to emulate POSIX systems as closely as possible.
    So it implemented `execv` using the Windows function `spawnve`.
    
    Specifically
    
    ```
    spawnve (_P_OVERLAY, path, argv, cur_environ ())
    ```
    
    `_P_OVERLAY` is crucial, as it makes the function behave *sort of*
    like `execv` on Linux: the child process replaces the original process.
    
    There is one major difference, caused by the different process model
    on Windows: the original process exits, and that tells the caller
    that it is done.
    
    This is why the file is still locked. Control was returned to us
    because the parent process was destroyed, but the child is still
    running and still holds the file.
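
The effect can be reproduced without Cygwin. The following is a minimal, portable sketch of the same situation, not the Cygwin mechanism itself: a parent starts a child and exits at once, so waiting on the parent returns while the child is still working. The names `PARENT`, `GRANDCHILD` and `parent_returns_before_child` are made up for this illustration.

```python
import os
import subprocess
import sys
import time

# Child: keeps running for a while, then leaves a marker file behind.
GRANDCHILD = "import sys, time; time.sleep(1.0); open(sys.argv[1], 'w').close()"

# Parent: starts the child and exits straight away without waiting for it.
PARENT = (
    "import subprocess, sys; "
    "subprocess.Popen([sys.executable, '-c', sys.argv[1], sys.argv[2]])"
)


def parent_returns_before_child(marker):
    """Return True if waiting on the parent came back while the child still ran."""
    parent = subprocess.Popen([sys.executable, "-c", PARENT, GRANDCHILD, marker])
    parent.wait()  # only tells us that the *parent* is gone
    child_still_running = not os.path.exists(marker)

    # Give the child time to finish so the caller can clean up afterwards.
    deadline = time.time() + 30
    while not os.path.exists(marker) and time.time() < deadline:
        time.sleep(0.05)
    return child_still_running
```

Anything the caller does right after `parent.wait()`, such as deleting a file the child still has open, races against the child.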
    
    I think it is pure dumb luck that the older runtimes are slow
    enough to give the process time to terminate before we try to delete
    the file. That would also explain why a test or two (like T7307)
    fails sporadically even on older runtimes like 2.5.0.
    
    So this patch fixes a couple of things. I leverage the existing
    `timeout.exe` to implement a workaround for this issue.
    
    a) The old timeout used to start the process and then assign it to the
       job. This is slightly faulty, since child processes are only
       assigned to a job if their parent was assigned at the time they
       started, so this was a race condition. I now create the process
       suspended, assign it to the job and then resume it. This means all
       child processes are now running under the same job.
    
    b) The first change is to prevent dangling child processes. I mark the
       job with `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`, which ensures that
       all processes still in the job are killed when the last handle to
       the job is closed.
    
    c) Secondly, I change the way we wait for results. Instead of waiting
       for the parent process to terminate, I wait for the job itself to
       terminate.
    
       There is a slight subtlety here: we can't wait on the job itself.
       Instead we have to create an I/O completion port and wait for
       notifications on it.  See
       https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743
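
Steps (a) to (c) can be sketched as follows. This is a rough Python/ctypes illustration of the call sequence only, not the code in `timeout.exe`. It assumes Windows, and it borrows CPython's private `_winapi` module (the one `subprocess` uses internally) to get at the main thread handle. The names `run_in_job`, `BasicLimits`, `ExtendedLimits` and `AssociatePort` are made up here, and error handling is kept to a minimum.

```python
import ctypes

# Values from the Windows SDK headers.
CREATE_SUSPENDED = 0x00000004
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x00002000
JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO = 4
JOB_INFO_ASSOCIATE_PORT = 7  # JobObjectAssociateCompletionPortInformation
JOB_INFO_EXTENDED_LIMIT = 9  # JobObjectExtendedLimitInformation
INFINITE = 0xFFFFFFFF
HANDLE, DWORD, SIZE_T = ctypes.c_void_p, ctypes.c_uint32, ctypes.c_size_t


class BasicLimits(ctypes.Structure):  # JOBOBJECT_BASIC_LIMIT_INFORMATION
    _fields_ = [("PerProcessUserTimeLimit", ctypes.c_int64),
                ("PerJobUserTimeLimit", ctypes.c_int64),
                ("LimitFlags", DWORD),
                ("MinimumWorkingSetSize", SIZE_T),
                ("MaximumWorkingSetSize", SIZE_T),
                ("ActiveProcessLimit", DWORD),
                ("Affinity", SIZE_T),
                ("PriorityClass", DWORD),
                ("SchedulingClass", DWORD)]


class ExtendedLimits(ctypes.Structure):  # JOBOBJECT_EXTENDED_LIMIT_INFORMATION
    _fields_ = [("BasicLimitInformation", BasicLimits),
                ("IoInfo", ctypes.c_uint64 * 6),  # IO_COUNTERS
                ("ProcessMemoryLimit", SIZE_T),
                ("JobMemoryLimit", SIZE_T),
                ("PeakProcessMemoryUsed", SIZE_T),
                ("PeakJobMemoryUsed", SIZE_T)]


class AssociatePort(ctypes.Structure):  # JOBOBJECT_ASSOCIATE_COMPLETION_PORT
    _fields_ = [("CompletionKey", ctypes.c_void_p),
                ("CompletionPort", HANDLE)]


def run_in_job(command_line):
    """Windows only: run command_line, return its exit code once the job is empty."""
    import subprocess
    import _winapi  # private CPython module; an assumption of this sketch

    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.CreateJobObjectW.restype = HANDLE
    k32.CreateJobObjectW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p]
    k32.CreateIoCompletionPort.restype = HANDLE
    k32.CreateIoCompletionPort.argtypes = [HANDLE, HANDLE, SIZE_T, DWORD]
    k32.SetInformationJobObject.argtypes = [HANDLE, ctypes.c_int,
                                            ctypes.c_void_p, DWORD]
    k32.AssignProcessToJobObject.argtypes = [HANDLE, HANDLE]
    k32.ResumeThread.argtypes = [HANDLE]
    k32.CloseHandle.argtypes = [HANDLE]
    k32.GetQueuedCompletionStatus.argtypes = [
        HANDLE, ctypes.POINTER(DWORD), ctypes.POINTER(SIZE_T),
        ctypes.POINTER(ctypes.c_void_p), DWORD]

    def check(ok):
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
        return ok

    job = check(k32.CreateJobObjectW(None, None))
    # INVALID_HANDLE_VALUE (-1) asks for a fresh completion port.
    port = check(k32.CreateIoCompletionPort(HANDLE(-1), None, 0, 1))

    # (b) kill everything left in the job when its last handle is closed.
    limits = ExtendedLimits()
    limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
    check(k32.SetInformationJobObject(job, JOB_INFO_EXTENDED_LIMIT,
                                      ctypes.byref(limits), ctypes.sizeof(limits)))
    # (c) have the job post its notifications to the completion port.
    assoc = AssociatePort(job, port)
    check(k32.SetInformationJobObject(job, JOB_INFO_ASSOCIATE_PORT,
                                      ctypes.byref(assoc), ctypes.sizeof(assoc)))

    # (a) create suspended, assign to the job, and only then let it run.
    hp, ht, _pid, _tid = _winapi.CreateProcess(
        None, command_line, None, None, False, CREATE_SUSPENDED,
        None, None, subprocess.STARTUPINFO())
    try:
        if not k32.AssignProcessToJobObject(job, hp):
            err = ctypes.WinError(ctypes.get_last_error())
            _winapi.TerminateProcess(hp, 1)  # never resumed, so discard it
            raise err
        if k32.ResumeThread(ht) == -1:
            raise ctypes.WinError(ctypes.get_last_error())

        code, key, overlapped = DWORD(), SIZE_T(), ctypes.c_void_p()
        while True:  # wait until no process is left in the job
            check(k32.GetQueuedCompletionStatus(
                port, ctypes.byref(code), ctypes.byref(key),
                ctypes.byref(overlapped), INFINITE))
            if key.value == job and code.value == JOB_OBJECT_MSG_ACTIVE_PROCESS_ZERO:
                break
        return _winapi.GetExitCodeProcess(hp)
    finally:
        _winapi.CloseHandle(ht)
        _winapi.CloseHandle(hp)
        k32.CloseHandle(port)
        k32.CloseHandle(job)  # last handle: stragglers in the job are killed
```

The ordering is the important part: because the process is in the job before its first instruction runs, every child it spawns inherits the job, and the wait ends only when the active process count of the job drops to zero.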
    
    This fixes the issues on all runtimes for me and makes T7307 pass
    consistently.
    
    The threading was also simplified by hiding all the locking in a single
    semaphore and a completion class. Furthermore, some additional error
    reporting was added.
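
As an illustration of that shape, here is a minimal sketch of a completion class that keeps its locking private and uses one semaphore to bound parallelism. It is not the testsuite driver's actual code; the class name `Completion`, its methods and its fields are hypothetical.

```python
import threading


class Completion:
    """Runs test functions on threads; callers never touch a lock directly."""

    def __init__(self, max_parallel):
        self._slots = threading.Semaphore(max_parallel)  # bounds parallelism
        self._lock = threading.Lock()                    # guards the fields below
        self._pending = 0
        self._idle = threading.Event()
        self._idle.set()
        self.results = []
        self.errors = []

    def submit(self, name, fn):
        """Start fn on a new thread, blocking while max_parallel tests run."""
        self._slots.acquire()
        with self._lock:
            self._pending += 1
            self._idle.clear()
        threading.Thread(target=self._run, args=(name, fn)).start()

    def _run(self, name, fn):
        try:
            outcome = fn()
            with self._lock:
                self.results.append((name, outcome))
        except Exception as exc:
            # Report the failure instead of letting the thread die silently.
            with self._lock:
                self.errors.append((name, repr(exc)))
        finally:
            with self._lock:
                self._pending -= 1
                if self._pending == 0:
                    self._idle.set()
            self._slots.release()

    def wait(self):
        """Block until every submitted test has finished."""
        self._idle.wait()
```

A driver would call `submit` once per test from a single thread and then `wait` before printing the summary.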
    
    For encoding, the testsuite no longer passes a file handle to the
    subprocess, since on Windows `sh.exe` seems to acquire a lock on the
    file that is not released in a timely fashion.
    
    I suspect this is because Cygwin seems to emulate console handles by
    creating file handles and using those as std handles. So when we give
    it an existing file handle it just locks the file. I think what's
    happening is that it does not release the handle until all shared
    Cygwin processes are dead, which explains why it worked in
    single-threaded mode.
    
    So now instead we pass a pipe and do not interpret the resulting data.
    
    Any bytes written to stdin or read from stdout/stderr are handled in
    binary mode and we do not interpret the data. The reason is that we
    have encoding tests in GHC which pass invalid UTF-8. If we tried to
    handle the data as text, Python would throw an exception instead of
    the test comparison failing.
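
A small sketch of the idea, assuming Python 3 and a hypothetical helper named `run_binary`: the subprocess gets pipes rather than a file handle, and its output stays `bytes` from start to finish, so invalid UTF-8 shows up as an ordinary mismatch rather than a decoding exception.

```python
import subprocess
import sys


def run_binary(args, stdin_bytes=b""):
    """Run args with pipes for all std handles; return (code, stdout, stderr) as bytes."""
    proc = subprocess.Popen(args,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate(stdin_bytes)  # no text mode, no decoding
    return proc.returncode, out, err


# A child that writes bytes which are not valid UTF-8.
EMIT_INVALID = "import sys; sys.stdout.buffer.write(b'caf\\xe9\\xff')"

code, out, err = run_binary([sys.executable, "-c", EMIT_INVALID])

# Compare raw bytes against the expected raw bytes; nothing is decoded.
expected = b"caf\xe9\xff"
matches = (out == expected)
```

Had the output been decoded as UTF-8 first, the comparison would never have been reached.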
    
    Also, I have fixed the ability to override `PYTHON` when calling `make
    test`. This now works the same as with `.\validate`.
    
    Finally, after cleaning up the locks I was able to make the abort
    behavior work correctly, as I believe it was intended: when you press
    Ctrl+C and send an interrupt signal, the testsuite finishes the active
    tests and then exits gracefully, showing you a report of the progress
    it did make. So pressing Ctrl+C no longer makes it just *die* as it
    did before.
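
The behavior described can be sketched like this. It is a simplified, single-threaded illustration rather than the driver's real code: the interrupt handler only records the request, and the loop stops handing out new tests while letting the active one finish. `StopFlag` and `run_suite` are hypothetical names.

```python
import signal


class StopFlag:
    """SIGINT handler that only remembers that an interrupt was requested."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def run_suite(tests, run_one):
    """Run tests in order; after Ctrl+C, finish the active test and then stop."""
    stop = StopFlag()
    # signal.signal may only be called from the main thread.
    previous = signal.signal(signal.SIGINT, stop)
    results = []
    try:
        for name in tests:
            if stop.requested:
                break  # do not start new tests after an interrupt
            results.append((name, run_one(name)))
    finally:
        # Restore whatever handler was installed before we started.
        signal.signal(signal.SIGINT,
                      previous if previous is not None else signal.SIG_DFL)
    report = "ran %d of %d tests%s" % (
        len(results), len(tests), " (interrupted)" if stop.requested else "")
    return results, report
```

Because the handler never raises, the test that is running when Ctrl+C arrives completes normally and its result still appears in the report.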
    
    These changes lift the restrictions on which Python distribution
    (msys/mingw), which runtime and which Python version (2 or 3) you
    use.  All combinations should now be supported.
    
    Test Plan:
    PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
    PYTHON=/usr/bin/python THREADS=9 make test
    THREADS=9 make test
    PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
    PYTHON=/usr/bin/python ./validate --quiet --testsuite-only
    
    Reviewers: erikd, RyanGlScott, bgamari, austin
    
    Subscribers: jrtc27, mpickering, thomie, #ghc_windows_task_force
    
    Differential Revision: https://phabricator.haskell.org/D2684
    
    GHC Trac Issues: #12725, #12554, #12661, #12004