"Divergence is still very expensive" is quite compatible with "Divergence is less expensive than before".
Here's evidence (not proof) that Nvidia has removed the hard limit of warp divergence (or perhaps more precisely, the old guarantee that "a warp is always synchronous across its 32 threads, even under divergence"): https://developer.nvidia.com/blog/cooperative-groups/
I don't think it's misleading to talk about a "CUDA core" as a warp-wide processor, although it seems that Nvidia doubles the number (at least for gaming GPUs), presumably because of having both FP and INT pathways.
The SIMT section there is pretty telling: you can do it, if you explicitly account for it, and you're willing to leave the other threads in the dust, potentially forever. It's not quite the same thing as a JMP, and it only seems to cover the data-dependency case, not the non-streaming if/else case.
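To make "explicitly account for it" concrete, here's a minimal sketch in the style of the aggregated-atomics example from that cooperative groups post. `cg::coalesced_threads()` captures only the threads that actually took the current branch, so you communicate across exactly that subgroup instead of assuming the whole warp is in lockstep. (Kernel name and buffers are my own illustration; the `coalesced_group` API is from the post.)

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void count_positive(const int *in, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        // Only the threads that took this branch are members of 'active';
        // the divergence is handled explicitly, not by implicit lockstep.
        cg::coalesced_group active = cg::coalesced_threads();
        // One atomic per branch-group instead of one per thread.
        if (active.thread_rank() == 0)
            atomicAdd(count, (int)active.size());
    }
}
```

The point being that the programming model now asks you to name the set of converged threads you're relying on, rather than letting you assume it's always all 32.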
But I stand corrected — Volta and up have multiple PCs per warp.
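Which is exactly why code that leaned on implicit lockstep can break on Volta+. A minimal sketch of the canonical fix (assuming compute capability 7.0+, one warp per block; the kernel itself is my own illustration):

```cuda
__global__ void warp_exchange(float *buf) {
    int lane = threadIdx.x % 32;
    __shared__ float s[32];
    if (lane < 16) {
        s[lane] = buf[lane];       // writer half of the warp
    }
    // Pre-Volta, lockstep execution made a read in the other branch
    // "work" by accident; with per-thread PCs the two sides of the
    // branch can interleave arbitrarily, so reconverge explicitly.
    __syncwarp();
    if (lane >= 16)
        buf[lane] = s[lane - 16];  // readers now see the writes
}
```

So divergence is no longer a hard serialization barrier, but any intra-warp data exchange now needs an explicit `__syncwarp()` (or a cooperative-groups sync) where lockstep used to be assumed.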