Many IO operations can block the entire program

Summary

Many IO operations are slow (on HDD, NFS, or even FUSE), and programmers nowadays need to run many IO operations and computations in parallel. For example, rsync reads data from disk and computes at the same time. However, it seems impossible to achieve this with GHC and simple code.

Steps to reproduce

  1. Save this file as t.py.
#!/usr/bin/env python3
import logging
import os
import time

from errno import EACCES
from os.path import realpath
from threading import Lock

from fuse import FUSE, FuseOSError, Operations, LoggingMixIn


class Loopback(LoggingMixIn, Operations):
    def __init__(self, root):
        self.root = realpath(root)
        self.rwlock = Lock()

    def __call__(self, op, path, *args):
        return super(Loopback, self).__call__(op, self.root + path, *args)

    def access(self, path, mode):
        if not os.access(path, mode):
            raise FuseOSError(EACCES)

    def getattr(self, path, fh=None):
        time.sleep(3)  # blocks stat()
        st = os.lstat(path)
        return dict((key, getattr(st, key)) for key in (
            'st_atime', 'st_ctime', 'st_gid', 'st_mode', 'st_mtime',
            'st_nlink', 'st_size', 'st_uid'))

    def link(self, target, source):
        return os.link(self.root + source, target)

    open = os.open

    def read(self, path, size, offset, fh):
        print('read block')
        time.sleep(3)  # blocks read()
        with self.rwlock:
            os.lseek(fh, offset, 0)
            return os.read(fh, size)

    def readdir(self, path, fh):
        return ['.', '..'] + os.listdir(path)

    readlink = os.readlink

    def release(self, path, fh):
        return os.close(fh)

    def statfs(self, path):
        stv = os.statvfs(path)
        return dict((key, getattr(stv, key)) for key in (
            'f_bavail', 'f_bfree', 'f_blocks', 'f_bsize', 'f_favail',
            'f_ffree', 'f_files', 'f_flag', 'f_frsize', 'f_namemax'))


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('root')
    parser.add_argument('mount')
    args = parser.parse_args()

    logging.basicConfig(level=logging.DEBUG)
    fuse = FUSE(
        Loopback(args.root), args.mount, foreground=True, allow_other=True)
  2. Save this file as main.hs:
import Control.Concurrent
import Control.Monad
import qualified Data.ByteString as B
import Data.Time.Clock
import System.IO

main :: IO ()
main = do
    getCurrentTime >>= print
    -- Opening the file goes through the slow getattr() in t.py.
    handle <- openFile "mnt/foo" ReadMode
    -- Read in a separate thread; the read itself is delayed by t.py.
    _ <- forkIO $ do
        threadDelay 2000000
        B.hGet handle 80 >>= print
    -- The main thread should keep printing the time every 0.5 s.
    forever $ do
        threadDelay 500000
        getCurrentTime >>= print
  3. chmod +x t.py
  4. python3 -m venv venv
  5. . venv/bin/activate
  6. pip install fusepy
  7. mkdir root mnt
  8. echo content >root/foo
  9. ghc -threaded -O2 main.hs
  10. Run ./t.py root mnt in one terminal window.
  11. Run ./main in another terminal window.

(Sorry about using such a complicated way to set up a slow FUSE filesystem. I am not familiar with other methods.)

Expected behavior

I expect to see the time printed every 0.5 s. However, despite my best efforts (using forkIO, forkOS, or forkOn; with or without -threaded; with or without +RTS -N4), IO operations always block the other threads.
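
The report does not pin down where the blocking happens. To observe the runtime's behaviour without the FUSE mount, a minimal sketch like the following may help. It assumes a POSIX system and uses sleep() from libc as a stand-in for a slow system call; c_sleep is a name introduced here for illustration. With a `safe` foreign import and a program built with -threaded, other Haskell threads are expected to keep running while the call blocks. Changing `safe` to `unsafe` should make the ticker stall for the duration of the call, which resembles the symptom described above. Whether that is what happens inside openFile or hGet is an assumption, not something this report establishes.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)
import Data.Time.Clock (getCurrentTime)
import Foreign.C.Types (CUInt (..))

-- Assumption: libc's sleep() stands in for a slow, blocking system call.
-- A 'safe' import lets the threaded RTS hand the capability to another
-- OS thread while the call is in progress.
foreign import ccall safe "unistd.h sleep"
    c_sleep :: CUInt -> IO CUInt

main :: IO ()
main = do
    done <- newEmptyMVar
    -- Worker thread: block in C for 3 seconds, then signal completion.
    _ <- forkIO $ do
        _ <- c_sleep 3
        putMVar done ()
    -- Ticker: print the time every 0.5 s while the worker is blocked.
    replicateM_ 8 $ do
        threadDelay 500000
        getCurrentTime >>= print
    -- Wait for the worker before exiting.
    takeMVar done
```

Build it the same way as main.hs (ghc -threaded -O2) and compare the timestamps between the `safe` and `unsafe` variants.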

I expect:

  1. One slow IO operation (caused by NFS, for example) should not block everything else. If I run a loop in a Haskell thread, it should keep running reliably.
  2. Many slow IO operations can run in parallel. For example, I wish to send 100 stat() calls to an NFS mount and wait for all of them, instead of sending and waiting one by one.
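
The second expectation can be sketched as follows, using only base. runAll and slowOp are hypothetical names introduced here; slowOp simulates a slow stat() with threadDelay so the example runs without an NFS or FUSE mount. This is the pattern I would like to work with real IO calls, not a claim that it currently does.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)

-- Start every action in its own Haskell thread, then collect the
-- results in the original order. Exceptions are returned as Left
-- so that one failing action cannot leave the caller waiting forever.
runAll :: [IO a] -> IO [Either SomeException a]
runAll actions = do
    slots <- mapM start actions
    mapM takeMVar slots
  where
    start act = do
        slot <- newEmptyMVar
        _ <- forkIO (try act >>= putMVar slot)
        return slot

-- Hypothetical stand-in for a slow stat(): waits 1 s, returns its argument.
slowOp :: Int -> IO Int
slowOp n = threadDelay 1000000 >> return n

main :: IO ()
main = do
    results <- runAll (map slowOp [1 .. 100])
    -- All 100 operations were started before any result was awaited.
    print (length results)
```

Because threadDelay is handled by the runtime, the 100 simulated calls overlap and the program finishes in roughly the time of one call. Replacing slowOp with a real stat() on the slow mount is where I would expect to see the difference described in this report.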

Environment

  • GHC version used: 8.8.4

Optional:

  • Operating System: Debian 10
  • System Architecture: x86_64