Add a broader set of (C/CMM-based) atomic memory operations
This task is a precursor to ticket #7883 (closed). The goal is to standardize a broader set of atomic ops -- mainly CAS and fetch-and-add -- and then in #7883 (closed) we will optimize those primops to have native support in the LLVM backend and thus avoid the extra C calls.
Currently, the goal is only to support word-sized (4 or 8 byte) compare-and-swap and fetch-and-add operations. In the future, this goal could be expanded to support 16-byte CAS, implemented by CMPXCHG16B on x86.
Even supporting word sized CAS requires three different primops for
respectively. Here are signatures for those primOps:
gcptr stg_casMutVarzh ( gcptr mv, gcptr old, gcptr new ) /* MutVar# s a -> a -> a -> State# s -> (# State#, Int#, Any a #) */ gcptr stg_casArrayzh ( gcptr arr, W_ ind, gcptr old, gcptr new ); /* MutableArray# s a -> Int# -> a -> a -> State# s -> (# State# s, Int#, Any a #) */ W_ stg_casIntArrayzh( gcptr arr, W_ ind, W_ old, W_ new ); /* MutableByteArray# s -> Int# -> Int# -> Int# -> State# s -> (# State# s, Int# #) */
Fetch-and-add, on the other hand, only makes sense on a
W_ stg_fetchAddIntArrayzh( gcptr arr, W_ ind, W_ incr ); /* MutableByteArray# s -> Int# -> Int# -> State# s -> (# State# s, Int# #) */
Some systems, such as gcc and LLVM, provide other variants, such as "fetch-and-multiply". However, these do not have direct support on x86 and are usually just simulated with CAS. For example, see the LLVM documentation here. User-exposed Haskell libraries for atomic operations may employ the same approach using the CAS primOps above to simulate other fetch-and-X operations.
Finally, a design question which bears thinking about is how to
implement opaque "ticket" approach to CAS. The basic idea is that a
user does NOT present a regular pointer-based data value (e.g.
to CAS as the "old" value. Rather, observations of prior values are
kept in an opaque form (
Any type) to prevent the compiler changing
the pointer identity by, for example, unboxing and reboxing the value.
The way this affects the **primOp** design is as follows. The current implementation of casMutVar# and casArray# returns the NEW value rather than the old one if the CAS is successful. The Haskell code that imports the primOps treats this new value as the opaque ticket.
The user can in fact safely retrieve the real value from the
"ticket" with a
peekTicket :: Ticket a -> a operation. But
peekTicket MUST be marked NOINLINE to hide the
operation inside it from compiler optimizations. Otherwise, the
code path that carries the ticket from one CAS operation to the
next is subject to optimization . In particular,
becomes a type-level operation, not a function call with the usual
constraints on control and dataflow.
Returning to the issue of primOp design, if we wished to avoid the
somewhat non-standard design of returning the new value from CAS, we
//could// allow users to create
Ticket values themselves, rather
retrieving them as the result of previous read and CAS IO operations.
Yet this would entail an additional
NOINLINE (unsafeCoerce) function
call, and I believe it would be prone to error.
Further, if you look at the primOp implementation for stg_casMutVarzh, you will see that there is //already// a branch, which is necessary for updating GC-related meta-data. Thus there is no additional overhead to return the new rather than the old value from CAS operations in the successful case. Haskell-level CAS is already a compound operation, not merely a single machine code instruction.
By contrast, the C intrinsic
become only a single machine operation, and it returns the "old"
value irrespective of whether the CAS succeeded or failed. In fact,
an equality comparison on that old value is necessary to determine if
the operation succeeded. In contrast, the Haskell CAS primOps above
return an unboxed tuple with an explicit success bit. Thus returning
the old value in the success case is completely redundant, and
returning the new value instead allows us to acheive two goals
(the CAS and the opacifying coercion) with one
operation at no additional cost.
 P.S. As a concrete example of what happens when
unsafeCoerce is not
hidden, consider the following code as a starting point:
x = unsafeCoerce y z = f y
x would seem to have nothing to do with
f, and yet
later in the compiler, one can end up with Core like this:
z = f x
This exposes the type information and activates optimizations, which
was the source of one bug with the
lockfree-queue package that was fixed by adding NOINLINE to