Memory corruption crash: Evaluated a CAF that was GC'd

Summary

In the test suite of my Haskell binding to the lz4 compression library, I discovered an apparent GHC bug that leads to memory corruption (segfaults and arbitrary other errors caused by wrong values in memory).

With the -debug runtime, I reliably get this error message:

lz4-frame-conduit-test: internal error: Evaluated a CAF (0xe966e0) that was GC'd!
    (GHC version 8.10.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

The printed memory address (0xe966e0 here) does not change across runs.

Steps to reproduce

  • Clone https://github.com/nh2/lz4-frame-conduit
  • git checkout 7dd7b90 (github link)
  • Install the lz4 binary, because the test suite uses it (e.g. package liblz4-tool on Debian/Ubuntu)
  • stack clean && stack test
  • while (.stack-work/dist/x86_64-linux/Cabal-3.2.0.0/build/lz4-frame-conduit-test/lz4-frame-conduit-test); do sleep 0.01; done

The bash while loop re-runs the binary until it exits with a failure, because the crash does not happen on every run.
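A slightly more informative variant of that loop, as a sketch: the helper name run_until_failure and the cap on the number of runs are made up for illustration and are not part of the repository. It discards the program's stdout, leaves stderr visible so the RTS error message still shows up, and reports which run failed and with what exit status. A shell reports a process killed by a signal as 128 plus the signal number, so a segfault typically appears as status 139.

```shell
#!/usr/bin/env bash
# run_until_failure MAX CMD [ARGS...]
# Hypothetical helper: runs CMD up to MAX times and stops at the first
# non-zero exit status. Prints which run failed and returns that status;
# returns 0 if none of the MAX runs failed.
run_until_failure() {
  local max
  local i
  local status
  max="$1"
  shift
  i=1
  while [ "$i" -le "$max" ]; do
    status=0
    # Discard stdout (the test report), keep stderr (the RTS error message).
    "$@" >/dev/null || status=$?
    if [ "$status" -ne 0 ]; then
      echo "run $i failed with exit status $status"
      return "$status"
    fi
    i=$((i + 1))
    sleep 0.01
  done
  echo "no failure in $max runs"
  return 0
}
```

It could be invoked as run_until_failure 10000 .stack-work/dist/x86_64-linux/Cabal-3.2.0.0/build/lz4-frame-conduit-test/lz4-frame-conduit-test, where 10000 is an arbitrary upper bound.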

On my machine the error reproduces within 1 minute. It reproduces within a second when I use my own source-built GHC from the ghc-8.10 branch with the make settings below. The exact commit is 35c7451, which has a cherry-pick to fix a build error (see here); that fix is currently being backported in !4032 (merged).

BuildFlavour = perf
GhcLibHcOpts += -g3
GhcRtsHcOpts += -g3

I have not yet managed to make the repro smaller, or independent of lz4. Help on that would be appreciated.

But I have some suspicions about what may be going on. My test suite does this:

    let prepare :: [BSL.ByteString] -> [ByteString]
        prepare strings = BSL.toChunks $ BSL.concat $ intersperse " " $ ["BEGIN"] ++ strings ++ ["END"]

    describe "reproducing memory error" $ do

      it "compresses 1000 strings" $ do
        let strings = prepare $ replicate 1000 "hello" -- cannot reproduce if `!`ing this
        actual <- runCompressToLZ4 (CL.sourceList strings)
        actual `shouldBe` (BS.concat strings)

    describe "more reproducing" $ do

      it "decompresses 10000 strings" $ do
        let strings = prepare $ replicate 10000 "hello"
        actual <- runLZ4ToDecompress (CL.sourceList strings)
        actual `shouldBe` (BS.concat strings)

Suspicions

I suspect that the CAF that's being GC'd here is related to the string constant "hello".

Key observations:

  • I use the same string literal "hello" in the 2 places shown above (replicate 1000 "hello" and replicate 10000 "hello"). If I change one of them (e.g. hello -> hella), the issue seems to disappear.
  • The crash is always in the second code block (decompressing), when accessing the input ByteString here.

Maybe GHC is doing something illegal with that String CAF?

Additional info

There's some more info useful for reproducing:

  • In stack.yaml I'm pinning the library conduit-extra-1.3.2 because the newer version 1.3.3 magically makes the bug disappear.

    But the differences between those two versions are very benign (I was involved in writing them). The only thing the newer version does is add some calls to hSetBuffering h NoBuffering.

    I believe the bug disappears simply because I use this library to communicate with the lz4 binary used in the test, and the change in buffering affects GC behaviour.

Environment

  • GHC version used: GHC 8.6 to 8.10.2

Optional:

  • Operating System: Ubuntu 18.04 and NixOS 20.03
  • System Architecture: x86_64
Edited by Niklas Hambüchen