"Divergence is still very expensive" is quite compatible with "Divergence is less expensive than before".
Here's evidence (not proof) that Nvidia has removed the hard limit of warp divergence (or perhaps more precisely, the old guarantee that "a warp is always synchronous across its 32 threads, even under divergence"): https://developer.nvidia.com/blog/cooperative-groups/
I don't think it's misleading to talk about a "CUDA core" as a warp-wide processor, although it seems that Nvidia doubles the number (at least for gaming GPUs), presumably because of having both FP and INT pathways.
The SIMT section there is pretty telling: you can do it, if you explicitly account for it, and you're willing to leave the other threads in the dust, potentially forever. It's not quite the same thing as a JMP, and it only seems to cover the data-dependency case, not the non-streaming if/else case.
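To make "explicitly account for it" concrete, here's a minimal sketch in the style of the aggregated-atomics example from that cooperative groups post. `cg::coalesced_threads()` captures only the threads that actually took the current branch, so you communicate across exactly that subgroup instead of assuming the whole warp is in lockstep. (Kernel name and buffers are my own illustration; the `coalesced_group` API is from the post.)

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void count_positive(const int *in, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        // Only the threads that took this branch are members of 'active';
        // the divergence is handled explicitly, not by implicit lockstep.
        cg::coalesced_group active = cg::coalesced_threads();
        // One atomic per branch-group instead of one per thread.
        if (active.thread_rank() == 0)
            atomicAdd(count, (int)active.size());
    }
}
```

The point being that the programming model now asks you to name the set of converged threads you're relying on, rather than letting you assume it's always all 32.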
But I stand corrected — Volta and up have multiple PCs per warp.
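Which is exactly why code that leaned on implicit lockstep can break on Volta+. A minimal sketch of the canonical fix (assuming compute capability 7.0+, one warp per block; the kernel itself is my own illustration):

```cuda
__global__ void warp_exchange(float *buf) {
    int lane = threadIdx.x % 32;
    __shared__ float s[32];
    if (lane < 16) {
        s[lane] = buf[lane];       // writer half of the warp
    }
    // Pre-Volta, lockstep execution made a read in the other branch
    // "work" by accident; with per-thread PCs the two sides of the
    // branch can interleave arbitrarily, so reconverge explicitly.
    __syncwarp();
    if (lane >= 16)
        buf[lane] = s[lane - 16];  // readers now see the writes
}
```

So divergence is no longer a hard serialization barrier, but any intra-warp data exchange now needs an explicit `__syncwarp()` (or a cooperative-groups sync) where lockstep used to be assumed.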