copy_tag_nolock(): fix write ordering and add a write_barrier()

Fixes a rare crash in the parallel GC.

If we copy a closure non-atomically during GC, as we do for all
immutable values, then before writing the forwarding pointer we better
make sure that the closure itself is visible to other threads that
might follow the forwarding pointer.  I imagine this doesn't happen
very often, but I just found one case of it: in scavenge_stack, the
RET_FUN case, after evacuating ret_fun->fun we then follow it and look
up the info pointer.
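
The ordering described above can be sketched outside the RTS roughly as
follows. This is only an illustration using C11 atomics, not the RTS
code: copy_then_publish, follow and the word typedef are hypothetical
names, the release store stands in for write_barrier(), and the acquire
load on the reader side is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap word; word 0 of an object is its header. */
typedef uintptr_t word;

/* Copy `size` words from `from` to `to`, and only then publish `to`
 * through `*fwd`.  The release store plays the role of write_barrier():
 * every earlier store into `to` is visible before the pointer is. */
static void copy_then_publish(_Atomic(word *) *fwd, word *to,
                              const word *from, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        to[i] = from[i];
    }
    atomic_store_explicit(fwd, to, memory_order_release);
}

/* Reader side (assumed for this sketch): the acquire load pairs with the
 * release store, so a non-NULL result points at a fully written copy. */
static word *follow(_Atomic(word *) *fwd)
{
    return atomic_load_explicit(fwd, memory_order_acquire);
}
```

Writing the pointer before the loop, as the old code did, would let
another thread follow it and read words that have not been copied yet.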
@@ -140,14 +140,18 @@ copy_tag_nolock(StgClosure **p, const StgInfoTable *info,
to = alloc_for_copy(size,gen);
*p = TAG_CLOSURE(tag,(StgClosure*)to);
src->header.info = (const StgInfoTable *)MK_FORWARDING_PTR(to);
from = (StgPtr)src;
to[0] = (W_)info;
for (i = 1; i < size; i++) { // unroll for small i
to[i] = from[i];
}
// if somebody else reads the forwarding pointer, we better make
// sure there's a closure at the end of it.
write_barrier();
src->header.info = (const StgInfoTable *)MK_FORWARDING_PTR(to);
// if (to+size+2 < bd->start + BLOCK_SIZE_W) {
// __builtin_prefetch(to + size + 2, 1);
// }
