Many IO operations can block the entire program
Summary
Many IO operations are slow (on HDDs, NFS, or even FUSE), and programmers nowadays need to run many IO operations and computations in parallel. For example, rsync overlaps reading data from disk with computation. However, this seems impossible to achieve with GHC and simple code.
Steps to reproduce
- Save this file as t.py:
#!/usr/bin/env python3
import logging
import os
import time

from errno import EACCES
from os.path import realpath
from threading import Lock

from fuse import FUSE, FuseOSError, Operations, LoggingMixIn


class Loopback(LoggingMixIn, Operations):
    def __init__(self, root):
        self.root = realpath(root)
        self.rwlock = Lock()

    def __call__(self, op, path, *args):
        return super(Loopback, self).__call__(op, self.root + path, *args)

    def access(self, path, mode):
        if not os.access(path, mode):
            raise FuseOSError(EACCES)

    def getattr(self, path, fh=None):
        time.sleep(3)  # blocks stat()
        st = os.lstat(path)
        return dict((key, getattr(st, key)) for key in (
            'st_atime', 'st_ctime', 'st_gid', 'st_mode', 'st_mtime',
            'st_nlink', 'st_size', 'st_uid'))

    def link(self, target, source):
        return os.link(self.root + source, target)

    open = os.open

    def read(self, path, size, offset, fh):
        print('read block')  # blocks read()
        time.sleep(3)
        with self.rwlock:
            os.lseek(fh, offset, 0)
            return os.read(fh, size)

    def readdir(self, path, fh):
        return ['.', '..'] + os.listdir(path)

    readlink = os.readlink

    def release(self, path, fh):
        return os.close(fh)

    def statfs(self, path):
        stv = os.statvfs(path)
        return dict((key, getattr(stv, key)) for key in (
            'f_bavail', 'f_bfree', 'f_blocks', 'f_bsize', 'f_favail',
            'f_ffree', 'f_files', 'f_flag', 'f_frsize', 'f_namemax'))


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('root')
    parser.add_argument('mount')
    args = parser.parse_args()

    logging.basicConfig(level=logging.DEBUG)
    fuse = FUSE(
        Loopback(args.root), args.mount, foreground=True, allow_other=True)
- Save this file as main.hs:
import Control.Concurrent
import Control.Monad
import Data.ByteString
import Data.Time.Clock
import System.IO

main :: IO ()
main = do
  getCurrentTime >>= print
  handle <- openFile "mnt/foo" ReadMode
  forkIO $ do
    threadDelay 2000000
    hGet handle 80 >>= print
  forever $ do
    threadDelay 500000
    getCurrentTime >>= print
- Then run:
chmod +x t.py
python3 -m venv venv
. venv/bin/activate
pip install fusepy
mkdir root mnt
echo content >root/foo
ghc -threaded -O2 main.hs
- Run ./t.py root mnt in one terminal window.
- Run ./main in another terminal window.
(Sorry about using such a complicated way to set up a slow FUSE. I am not familiar with other methods.)
Expected behavior
I expect to see the time printed every 0.5s. However, despite my best efforts (using forkIO, forkOS, or forkOn; with or without -threaded; with or without +RTS -N4), IO operations always block other threads.
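To narrow down which call stalls the ticker, the reproduction could be instrumented roughly as follows. This is only a sketch: `timed` is a hypothetical helper written for this example (not part of any library), the file is opened inside the forked thread rather than in main, and the ticker is limited to 20 ticks so that the program terminates. It assumes the slow FUSE mount from the steps above is available at mnt/.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)
import qualified Data.ByteString as B
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.IO (IOMode (ReadMode), hClose, openFile)

-- Hypothetical helper: run an action and report how long it took,
-- so that a stalled call shows up in the output.
timed :: String -> IO a -> IO a
timed label action = do
  start <- getCurrentTime
  result <- action
  end <- getCurrentTime
  putStrLn (label ++ " took " ++ show (diffUTCTime end start))
  return result

main :: IO ()
main = do
  -- Reader thread: time the open and the read separately.
  _ <- forkIO $ do
    h <- timed "openFile" (openFile "mnt/foo" ReadMode)
    bytes <- timed "hGet" (B.hGet h 80)
    print bytes
    hClose h
  -- Ticker: 20 ticks at 0.5 s intervals; a gap between two
  -- consecutive timestamps indicates that this thread was stalled.
  replicateM_ 20 $ do
    threadDelay 500000
    getCurrentTime >>= print
```

If the ticker timestamps are evenly spaced while `openFile` or `hGet` reports about 3 s, the slow call did not block the other thread; a gap in the timestamps that lines up with one of the timed calls points at that call.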
I expect:
- One slow IO operation (caused by NFS, for example) does not block everything else. If I run a loop in a Haskell thread, it keeps running reliably.
- Many slow IO operations can run in parallel. For example, I wish to send 100 stat() calls to an NFS mount and wait for all of them, instead of sending and waiting one by one.
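The second expectation can be written down as a program shape. The sketch below is an illustration only: `statAll` is a hypothetical name, it assumes the `unix` package (`System.Posix.Files.getFileStatus`) is available, and it reuses mnt/foo from the reproduction for all 100 paths. It shows how one would like to issue the calls; it does not claim that they actually overlap under GHC 8.8.4, which is the question this report raises.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)
import System.Posix.Files (FileStatus, getFileStatus)

-- Hypothetical helper: start one Haskell thread per path. Each thread
-- performs a single stat() call and stores either the result or the
-- exception in its own MVar; the caller then waits for all of them.
statAll :: [FilePath] -> IO [Either SomeException FileStatus]
statAll paths = do
  vars <- mapM spawn paths
  mapM takeMVar vars
  where
    spawn :: FilePath -> IO (MVar (Either SomeException FileStatus))
    spawn path = do
      var <- newEmptyMVar
      _ <- forkIO (try (getFileStatus path) >>= putMVar var)
      return var

main :: IO ()
main = do
  -- Assumption: the same file is stat()ed 100 times for simplicity.
  results <- statAll (replicate 100 "mnt/foo")
  let ok = length [() | Right _ <- results]
  putStrLn (show ok ++ " of " ++ show (length results) ++ " stat() calls succeeded")
```

With the 3 s delay in getattr from t.py, the desired total runtime would be close to 3 s rather than 100 × 3 s, provided the FUSE server itself handles requests concurrently.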
Environment
- GHC version used: 8.8.4
Optional:
- Operating System: Debian 10
- System Architecture: x86_64