  • In #18919, one of the things Ben showed me how to do was look at block descriptors in gdb:

    (gdb) set $MBLOCK_MASK = 0x0fffff
    (gdb) set $BLOCK_MASK = 0xFFF
    (gdb) set $BLOCK_SHIFT = 12
    (gdb) set $BDESCR_SHIFT = 6
    (gdb) set $p = (int64_t)mvar
    (gdb) p *(bdescr *)((((($p) &  $MBLOCK_MASK & ~$BLOCK_MASK) >> ($BLOCK_SHIFT-$BDESCR_SHIFT)) | (($p) & ~$MBLOCK_MASK)))

    It may be instructive to learn some things about the block descriptors of the TSO and the invalid global_link. While the data there is already freed, the block descriptor may provide some information. From the stack trace the TSO is on the gen=1 mut_list. What is the gen# of the global_link pointer?

    I don't know how often you're spawning new threads, but if it is not frequent, it may be appropriate to trace all changes to the TSO global_link lists, showing the pointer and block descriptor detail, as well as moves of TSOs during GC. That sort of trace should identify the earliest point at which things get out of sync.

    That's the sort of thing that worked for me in the MVar case. It took a few days of detective work to finally nail it down...

  • Is this different to using the Bdescr function to get the bdescr * for a closure pointer?

  • I needed a version that would work with core files. If the process is still alive, it should be the same IIRC.
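
For anyone reading along, the arithmetic in the gdb expression above can be sketched in plain C. This is only an illustration using the constants from that gdb session (so a 64-bit build is assumed); `bdescr_addr` is a made-up helper name, not an RTS function, and it works on raw integer addresses rather than real closures.

```c
#include <stdint.h>

/* Constants copied from the gdb session above (64-bit build assumed). */
#define MBLOCK_MASK  UINT64_C(0xfffff)  /* offset within a 1 MiB megablock */
#define BLOCK_MASK   UINT64_C(0xfff)    /* offset within a 4 KiB block */
#define BLOCK_SHIFT  12
#define BDESCR_SHIFT 6                  /* descriptors are 2^6 = 64 bytes apart */

/* Hypothetical helper: address of the block descriptor covering heap
 * address p, computed the same way as the gdb expression. */
static uint64_t bdescr_addr(uint64_t p)
{
    /* Start of the megablock that contains p. */
    uint64_t mblock_base = p & ~MBLOCK_MASK;
    /* Byte offset of p's block within that megablock. */
    uint64_t block_off = p & MBLOCK_MASK & ~BLOCK_MASK;
    /* Turn the block offset into a descriptor offset (block index * 64). */
    return mblock_base | (block_off >> (BLOCK_SHIFT - BDESCR_SHIFT));
}
```

Feeding it the TSO address 0x4203a0a000 from this thread gives 0x4203a00280, which is the same value Bdescr reports in gdb. Since it only needs the address, the same arithmetic works against a core file.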

  • -- TSO
    (gdb) p Bdescr((P_) 0x4203a0a000)
    $1 = (bdescr *) 0x4203a00280          
    (gdb) p $1->gen->no
    $4 = 1
    -- global_link
    (gdb) p Bdescr((P_) 0x420940a788)
    $5 = (bdescr *) 0x4209400280
    (gdb) p $5->gen->no
    $6 = 0

    So the TSO is in generation 1, the global link pointer is in generation 0. Anything else you think would be interesting?

    As I commented on the ticket, shouldn't Scav.c recurse into the global_link field?

  • What's interesting here (though I may have misunderstood what little of the TSO global_link code I read) is that the global_link list is supposed to be per-generation. I don't know under what conditions it would have links from a TSO in an older generation to a TSO in a younger generation. Perhaps that's a clue, or perhaps (more likely) I have no idea what I'm talking about, but this may help Ben once he's back online...

    Edited by vdukhovni
  • @trac-vdukhovni There are thousands of threads, and sometimes it doesn't happen until 100,000 have been spawned, so adding traces might not get us very far, unfortunately. I am not blocked on this, so unless Ben has some ideas or thinks it's concerning, I will leave it.

  • typedef struct generation_ {            
        uint32_t       no;                  // generation number
        // ...
        StgTSO *       threads;             // threads in this gen
                                            // linked via global_link
        // ...                              
        StgTSO *     old_threads;
        StgWeak *    old_weak_ptr_list;
    } generation;
    
    typedef struct StgTSO_ {
        StgHeader               header;
        // ...
        struct StgTSO_*         global_link;    // Links threads on the
                                                // generation->threads lists
        // ...
        StgWord32               dirty;          /* non-zero => dirty */
        // ...
    } *StgTSOPtr; // StgTSO defined in rts/Types.h
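
If the threads list really is meant to be per-generation, one thing a trace could check is that every TSO reachable through global_link lives in the generation that owns the list. Below is a rough sketch of that check. The `MockTSO` and `MockGeneration` types and the `first_foreign_thread` helper are simplified stand-ins invented for illustration, not the real RTS definitions: `gen_no` stands in for what `Bdescr((P_)tso)->gen->no` would report, and the list here ends with NULL, whereas I believe the real RTS uses a sentinel closure.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the RTS types; NOT the real definitions. */
typedef struct MockTSO_ {
    struct MockTSO_ *global_link; /* next thread on the generation's list */
    uint32_t gen_no;              /* stand-in for Bdescr((P_)tso)->gen->no */
} MockTSO;

typedef struct {
    uint32_t no;       /* generation number */
    MockTSO *threads;  /* head of the list linked via global_link */
} MockGeneration;

/* Hypothetical check: return the first thread on gen->threads whose own
 * generation differs from gen->no, or NULL if none is found.
 * max_len bounds the walk in case the list is corrupt or cyclic. */
static const MockTSO *first_foreign_thread(const MockGeneration *gen,
                                           size_t max_len)
{
    const MockTSO *t = gen->threads;
    for (size_t i = 0; t != NULL && i < max_len; i++, t = t->global_link) {
        if (t->gen_no != gen->no) {
            return t;
        }
    }
    return NULL;
}
```

Running a check like this after each change to the lists would flag the situation seen in gdb here, where a gen 1 TSO's global_link points into gen 0. It says nothing about whether such a link is ever legitimately allowed mid-GC.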
  • FWIW, it looks concerning to me... :-( Unless the check is just ill-timed, and should be less strict... Perhaps there are times when the list is temporarily not consistent and not expected to be valid?

    In my code I could whittle down the number of MVars to just half a dozen, and configure so little memory for the nursery (IIRC something like 32K or 64K) that GCs happened left and right and the issue started happening frequently enough to get reasonably short problem traces...

  • Oh, and on the other hand I did run on multiple cores (~16 actual cores, not just HT) to make races as likely as possible. The problem was not reproducible under rr because it serialises execution, running threads one at a time, which is both slow and makes some races much less likely.

  • Don't know whether this issue is rr-friendly...
