
SMP primitives broken on power(pc)

I originally noticed this while working on the AIX port (32-bit powerpc) and recently saw it on Linux/powerpc64 as well, which led me to talk to Peter Trommler, who already had a suspicion:

Here is, for example, the CAS definition (in <stg/SMP.h>):

StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
    StgWord result;
    __asm__ __volatile__ (
        "1:     ldarx     %0, 0, %3\n"
        "       cmpd      %0, %1\n"
        "       bne       2f\n"
        "       stdcx.    %2, 0, %3\n"
        "       bne-      1b\n"
        "2:"
        :"=&r" (result)
        :"r" (o), "r" (n), "r" (p)
        :"cc", "memory"
    );
    return result;
}

The important detail is the lack of any barrier instructions, such as isync at the end. This results in infrequent heap corruption, which in turn causes all sorts of infrequent and hard-to-track-down runtime crashes (including in ghc --make -j), such as, for instance:

internal error: END_TSO_QUEUE object entered!
(GHC version 8.0.0.20160421 for powerpc64_unknown_linux)

Peter already has a patch in the works which simply replaces the atomic powerpc primitives with __sync_* intrinsics, which turn out to be more portable than inline asm. This would result in e.g.

StgWord
cas(StgVolatilePtr p, StgWord o, StgWord n)
{
    return __sync_val_compare_and_swap (p, o, n);
}

which then gets compiled as

000000000000004c <.cas>:
  4c:	7c 00 04 ac 	sync    
  50:	7d 20 18 a8 	ldarx   r9,0,r3
  54:	7c 29 20 00 	cmpd    r9,r4
  58:	40 c2 00 0c 	bne-    64 <.cas+0x18>
  5c:	7c a0 19 ad 	stdcx.  r5,0,r3
  60:	40 c2 ff f0 	bne-    50 <.cas+0x4>
  64:	4c 00 01 2c 	isync
  68:	7d 23 4b 78 	mr      r3,r9
  6c:	4e 80 00 20 	blr

I have already been testing the patch, and it seems to have made all the issues I experienced so far disappear; it also fixes the concprog01 test, which was failing infrequently as well.
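
To illustrate why the barriers matter, here is a minimal sketch (not the RTS code) of a lock built on the intrinsic-based cas(). The typedefs are stand-ins for the real RTS types, and the demo_* names are hypothetical helpers invented for this example. GCC documents the __sync_* builtins as full barriers (apart from the lock test-and-set/release pair), so accesses made after a successful demo_try_lock() should not be reordered ahead of the acquisition. That ordering is what the hand-written asm version does not provide.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-ins for the RTS types from the GHC headers. */
typedef uintptr_t StgWord;
typedef volatile StgWord *StgVolatilePtr;

/* Same shape as the patched cas(): returns the value found at *p.
 * The swap only happens if that value equals o. */
static StgWord cas(StgVolatilePtr p, StgWord o, StgWord n)
{
    return __sync_val_compare_and_swap(p, o, n);
}

/* Hypothetical lock word: 0 = free, 1 = held. */
typedef struct { volatile StgWord held; } demo_lock;

/* Returns 1 if the lock was taken, 0 if it is already held. */
static int demo_try_lock(demo_lock *l)
{
    return cas(&l->held, 0, 1) == 0;
}

/* Releases the lock; the builtin is a release barrier followed by
 * a store of 0 to the lock word. */
static void demo_unlock(demo_lock *l)
{
    assert(l->held == 1);
    __sync_lock_release(&l->held);
}

/* Hypothetical helper: increment via a CAS retry loop and return
 * the new value. */
static StgWord demo_inc(StgVolatilePtr p)
{
    StgWord old;
    do {
        old = *p;
    } while (cas(p, old, old + 1) != old);
    return old + 1;
}
```

A caller would bracket its accesses to shared data with demo_try_lock() and demo_unlock(). Without the acquire-side barrier, loads from the protected data can be satisfied before the lock word is actually owned, which is consistent with the kind of rare corruption described above.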

Edited by Herbert Valerio Riedel