> GPUs can now basically do out-of-order execution using dynamic warp formation and driver-based prediction and optimization.

That's... not OOE.

If you have:

    add [eax], ebx
    add ecx, edx

An out-of-order CPU can execute "add ecx, edx" before "add [eax], ebx" has finished, because the two instructions don't depend on each other (this matters most when the [eax] memory read/write is far away: out in DDR4 or worse).

Within a warp, a GPU will ALWAYS execute the 1st line before the 2nd line. Dynamic warp formation means that the SIMD lanes in your warp can split up and regroup, but no warp is executing its instructions out of program order.
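To put rough numbers on that, here's a toy timing model. Everything in it is an assumption for illustration: the 200-cycle memory latency is made up, `finish_times` is a hypothetical helper, and "in-order" is modeled as fully blocking (an instruction can't start until the previous one finishes). Real pipelines are far more subtle, and a real GPU hides the stall by switching to another warp rather than by reordering within one.

```python
def finish_times(program, out_of_order):
    """Return {instruction: finish_cycle} for a toy single-thread model.

    program: list of (name, latency_cycles, deps) in program order.
    In-order (blocking): an instruction starts only after the previous
    one has finished. Out-of-order: it starts as soon as its deps finish.
    """
    finish = {}
    prev_finish = 0
    for name, latency, deps in program:
        ready = max((finish[d] for d in deps), default=0)
        if not out_of_order:
            ready = max(ready, prev_finish)
        finish[name] = ready + latency
        prev_finish = finish[name]
    return finish


# Assumed latencies: 200 cycles for the memory read-modify-write, 1 for the register add.
program = [
    ("add [eax], ebx", 200, []),  # no dependency on the other instruction
    ("add ecx, edx", 1, []),      # independent of the memory op
]

in_order = finish_times(program, out_of_order=False)
ooo = finish_times(program, out_of_order=True)

print(in_order["add ecx, edx"])  # 201: stuck behind the memory op
print(ooo["add ecx, edx"])       # 1: runs while the memory op is still in flight
```

The memory op finishes at cycle 200 either way; the only difference is whether the independent add has to wait behind it.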

> Branch prediction is useless in massively parallel workflows, because of linearization.

Stockfish NNUE argues otherwise. Only part of the network updates at any given time. It's a massively parallel problem, but you need dynamic decisions / branch prediction to work out which parts of the neural net are actually updating.
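A minimal sketch of the "only part of the network updates" idea, in plain Python. This is NOT Stockfish's code: the sizes, the random weights, and the names `full_refresh` / `incremental_update` are all made up for illustration. The inner loop over `j` is the part you'd hand to SIMD; the outer loop over which features changed is the data-dependent part.

```python
import random

NUM_FEATURES = 64  # assumed toy size, nothing like the real network
HIDDEN = 8         # assumed toy width of the first layer

rng = random.Random(0)
# One weight row per input feature (integers, so the arithmetic is exact).
WEIGHTS = [[rng.randint(-5, 5) for _ in range(HIDDEN)] for _ in range(NUM_FEATURES)]


def full_refresh(active):
    """Recompute the first-layer accumulator from every active feature."""
    acc = [0] * HIDDEN
    for f in active:
        for j in range(HIDDEN):  # SIMD-friendly inner loop
            acc[j] += WEIGHTS[f][j]
    return acc


def incremental_update(acc, removed, added):
    """Touch only the weight rows of features that actually changed."""
    for f in removed:
        for j in range(HIDDEN):
            acc[j] -= WEIGHTS[f][j]
    for f in added:
        for j in range(HIDDEN):
            acc[j] += WEIGHTS[f][j]
    return acc


before = {3, 17, 42}  # hypothetical active features before a move
after = {3, 18, 42}   # one feature turned off, one turned on

acc = full_refresh(before)
acc = incremental_update(acc, removed=before - after, added=after - before)
print(acc == full_refresh(after))  # True
```

Here the incremental path touches 2 weight rows instead of 3; with thousands of features and only a handful changing per move, that gap is the whole point.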

The idea of CPU-designed networks (neural nets that PARTIALLY update, instead of a GPU-style network that updates entirely) is intriguing to me. There might be other largely parallel problems where branch prediction is helpful.

----------

> And in any case, OOE for SIMD instructions loses a lot of it's usefulness on many workflows.

It's more that SIMD already offers enough parallelism to saturate the machine, so OOE has less left to find. But NNUE really shows where OOE / branch prediction can suddenly become unexpectedly helpful.

Classic SIMD algorithms do not account for branch prediction / OOE benefits. The combination of high-speed SIMD + OOE + branch prediction is a largely unexplored part of computer science IMO.