parallelize getRootSummary computations in dep analysis downsweep
Fixes #20891.
I haven't benchmarked more than a trivial -M run, where it reduced the total time from 2.5s to 1.1s.
The implementation just reuses the machinery used for the upsweep part, creating an action per target. This results in many threads that block until the semaphore releases a slot. I'd assume the overhead to be negligible, but if someone has a different opinion we could also create bundles of NCPU targets.
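To make the trade-off concrete, here is a minimal sketch of the two strategies using only `base`. It is not the code from this MR: `summariseAll`, `summariseBundled`, `spawn`, `collect` and `chunksOf` are hypothetical names, and `QSem` stands in for whatever semaphore the upsweep machinery actually uses.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)
import Control.Exception (SomeException, bracket_, throwIO, try)

-- Fork one thread for a single unit of work. The thread blocks in
-- waitQSem until a slot is free, and always releases the slot afterwards.
spawn :: QSem -> (a -> IO b) -> a -> IO (MVar (Either SomeException b))
spawn sem work t = do
  var <- newEmptyMVar
  _ <- forkIO $
    try (bracket_ (waitQSem sem) (signalQSem sem) (work t)) >>= putMVar var
  pure var

-- Wait for a result, rethrowing any exception raised by the worker.
collect :: MVar (Either SomeException b) -> IO b
collect var = takeMVar var >>= either throwIO pure

-- Strategy 1 (assumed shape of the approach described above):
-- one thread per target, at most `slots` of them running at a time.
summariseAll :: Int -> (a -> IO b) -> [a] -> IO [b]
summariseAll slots work targets = do
  sem <- newQSem slots
  vars <- mapM (spawn sem work) targets
  mapM collect vars

-- Strategy 2 (the alternative mentioned above): one thread per bundle
-- of `size` targets, so fewer threads sit blocked on the semaphore.
summariseBundled :: Int -> Int -> (a -> IO b) -> [a] -> IO [b]
summariseBundled slots size work targets =
  concat <$> summariseAll slots (mapM work) (chunksOf size targets)

-- Split a list into chunks of at most n elements.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0 = [xs]
  | null xs = []
  | otherwise = let (h, t) = splitAt n xs in h : chunksOf n t

main :: IO ()
main = do
  n <- getNumCapabilities
  -- `length` is a placeholder for the real per-target computation.
  perTarget <- summariseAll n (pure . length) ["A.hs", "Lib.hs", "Main.hs"]
  bundled <- summariseBundled n 2 (pure . length) ["A.hs", "Lib.hs", "Main.hs"]
  print (perTarget, bundled)
```

Both variants return results in input order. Bundling only reduces the number of idle threads; it also serialises the work inside each bundle, so uneven target sizes could leave slots unused.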
Note: Without !12607 (closed), benchmarking -M produces severely distorted results.
Edited by Torsten Schmits