Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.
With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.
So why does it happen only with hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Of is it because the regular pages get swapped out (which hugepages can't be)?
I don't fully know, but I suspect it's just that due to the minor faults and tlb misses there is terrible contention with the spinlock, regardless of the PREEMPT_LAZY when using 4k pages (that easily reproducible). Which is then made worse by preempting more with the lock held.