> GPUs can now basically do out-of-order execution using dynamic warp formation and driver-based prediction and optimization.
That's... not OOE.
If you have:
    add [eax], ebx
    add ecx, edx
A CPU can execute "add ecx, edx" before "add [eax], ebx" has finished, since the two instructions share no registers (especially if the [eax] memory read/write has to go far away: out to DDR4 or worse).
A GPU will ALWAYS execute the 1st line before the 2nd line. Dynamic warp formation means that the SIMD lanes in your warp can split up, but no part of the warp executes anything out of order.
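To make the difference concrete, here's a toy single-issue model of the two-instruction example above, split into micro-ops. Everything in it is an assumption for illustration: the 100-cycle load latency, the 1-cycle ALU/store latency, and the scheduler itself, which is far simpler than any real CPU or GPU. It only shows how an independent instruction can get stuck behind a slow one when issue is strictly in program order.

```python
# Each micro-op is (latency_in_cycles, indices_of_ops_it_depends_on).
# Latencies are made up for illustration, not measured on any hardware.
OPS = [
    (100, ()),    # 0: load  tmp <- [eax]       (slow, e.g. a DRAM access)
    (1,   (0,)),  # 1: add   tmp <- tmp + ebx
    (1,   (1,)),  # 2: store [eax] <- tmp
    (1,   ()),    # 3: add   ecx <- ecx + edx   (independent of 0..2)
]

def in_order(ops):
    """Issue strictly in program order; a stalled op blocks everything behind it."""
    done = {}
    cycle = 0
    for i, (latency, deps) in enumerate(ops):
        start = max([cycle] + [done[d] for d in deps])
        done[i] = start + latency
        cycle = start + 1  # next op cannot issue before this one has issued
    return done

def out_of_order(ops):
    """Each cycle, issue the oldest op whose inputs are already complete."""
    done = {}
    pending = list(range(len(ops)))
    cycle = 0
    while pending:
        for i in pending:
            latency, deps = ops[i]
            if all(d in done and done[d] <= cycle for d in deps):
                done[i] = cycle + latency
                pending.remove(i)
                break  # single-issue: at most one op per cycle
        cycle += 1
    return done

print("in-order:     add ecx, edx completes at cycle", in_order(OPS)[3])      # 103
print("out-of-order: add ecx, edx completes at cycle", out_of_order(OPS)[3])  # 2
```

In both models the memory add finishes at the same time; only the independent "add ecx, edx" moves. A GPU hides that kind of stall by switching to another warp, not by reordering instructions within one.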
> Branch prediction is useless in massively parallel workflows, because of linearization.
Stockfish NNUE argues otherwise. Only part of the network updates at any given time. It's a massively-parallel problem, but you need dynamic branching / branch prediction to see which parts of the neural net are actually updating.
The idea of CPU-designed networks (neural nets that PARTIALLY update, instead of a GPU-style network that updates entirely) is intriguing to me. There might be other largely parallel problems where branch prediction is helpful.
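A rough sketch of what "partially updates" means here. This is NOT Stockfish's actual code: the sizes, the random integer weights, and the names `full_refresh` / `incremental_update` are all hypothetical. The idea is that the first-layer accumulator is a sum of weight rows for the currently active input features, so when a move flips a few features you only subtract and add those rows instead of recomputing the whole layer.

```python
import random

NUM_FEATURES = 64   # hypothetical input feature count
HIDDEN = 8          # hypothetical first-layer width

random.seed(0)
# One integer weight row per input feature (made-up values).
WEIGHTS = [[random.randint(-5, 5) for _ in range(HIDDEN)]
           for _ in range(NUM_FEATURES)]

def full_refresh(active):
    """Recompute the accumulator from scratch: sum the rows of all active features."""
    acc = [0] * HIDDEN
    for f in active:
        for j in range(HIDDEN):
            acc[j] += WEIGHTS[f][j]
    return acc

def incremental_update(acc, removed, added):
    """Touch only the rows of features that changed; which rows is data-dependent."""
    new_acc = list(acc)
    for f in removed:
        for j in range(HIDDEN):
            new_acc[j] -= WEIGHTS[f][j]
    for f in added:
        for j in range(HIDDEN):
            new_acc[j] += WEIGHTS[f][j]
    return new_acc

before = {3, 17, 42}
after = {3, 20, 42}            # feature 17 turned off, feature 20 turned on
acc = full_refresh(before)
acc = incremental_update(acc, removed=[17], added=[20])
print(acc == full_refresh(after))
```

The inner loops over `HIDDEN` are the SIMD-friendly part. The outer "which features changed?" loops are the data-dependent part, where a CPU's branch prediction and OOE have something to work with.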
----------
> And in any case, OOE for SIMD instructions loses a lot of it's usefulness on many workflows.
More like SIMD offers enough parallelism to saturate the machine by itself. But NNUE really shows where OOE / branch prediction can suddenly become unexpectedly helpful.
Classic SIMD algorithms do not account for branch prediction / OOE benefits. The combination of high-speed SIMD + OOE + branch prediction is an unexplored part of computer science IMO.