SHA-256 is designed such that the maximum amount of data that fits in a single block is 440 bits (55 bytes): the 64-byte block also has to hold at least one padding byte and the 8-byte length field.
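To make the 55-byte limit concrete, here is a small sketch of the padding rule in Python. The helper names `sha256_pad` and `blocks_needed` are made up for illustration, and it assumes byte-aligned messages only.

```python
import struct

def sha256_pad(msg: bytes) -> bytes:
    """Append SHA-256 padding: 0x80, zero bytes, then the 64-bit big-endian bit length."""
    zeros = (55 - len(msg)) % 64  # zero bytes needed to reach a multiple of 64
    return msg + b"\x80" + bytes(zeros) + struct.pack(">Q", len(msg) * 8)

def blocks_needed(msg_len: int) -> int:
    """Number of 64-byte blocks the compression function has to process."""
    return len(sha256_pad(bytes(msg_len))) // 64

print(blocks_needed(55))  # still one block
print(blocks_needed(56))  # spills into a second block
```

At 55 bytes the message, the 0x80 byte, and the length field fill the block exactly; one more byte doubles the work per hash.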
If you carefully place the nonce at the end and use all 55 bytes, you can pre-hash the first ~20/64 rounds of state and the first several rounds of W generation, then base every further iteration off that static value (this is known as a "midstate optimization").
> If you limit your variable portion to a base16 alphabet like A-P
The more nonce bits you decide to use, the less you can statically pre-hash.
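A rough Python sketch of the idea, assuming a 55-byte message whose last few bytes are the nonce. The function names (`midstate`, `finish`, `pad55`) are hypothetical, and this version is deliberately conservative: it only skips the rounds whose message words are entirely fixed and recomputes the full W schedule each time, so it does not reproduce the ~20-round figure or the W precomputation mentioned above.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def first_primes(n):
    ps, c = [], 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Constants derived as in the spec: fractional bits of square/cube roots of primes.
PRIMES = first_primes(64)
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [icbrt(p << 96) & MASK for p in PRIMES]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def schedule(block):
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w

def rounds(state, w, start, stop):
    """Run compression rounds start..stop-1 on the working variables."""
    a, b, c, d, e, f, g, h = state
    for t in range(start, stop):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def pad55(msg):
    assert len(msg) == 55
    return msg + b"\x80" + struct.pack(">Q", 440)

def midstate(prefix):
    """Precompute the rounds that only consume fully fixed message words."""
    fixed_words = len(prefix) // 4
    w = struct.unpack(">16I", pad55(prefix + bytes(55 - len(prefix))))
    return fixed_words, rounds(tuple(H0), w, 0, fixed_words)

def finish(prefix, nonce, fixed_words, mid):
    """Resume from the cached state for one nonce and produce the digest."""
    w = schedule(pad55(prefix + nonce))
    out = rounds(mid, w, fixed_words, 64)
    return struct.pack(">8I", *((x + y) & MASK for x, y in zip(H0, out)))

prefix = b"x" * 51                      # fixed part; the 4-byte nonce follows
fixed_words, mid = midstate(prefix)     # computed once
digest = finish(prefix, b"ABCD", fixed_words, mid)
assert digest == hashlib.sha256(prefix + b"ABCD").digest()
```

With a 4-byte nonce, 12 of the 16 message words are fixed, so 12 rounds are cached; a longer nonce lowers `fixed_words` and shrinks what can be reused.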
In FPGA, I am using 64-deep, 8-bit-wide memories to do the alphabet expansion. I am guessing in CUDA you could do something similar with `LOP3.LUT`.
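In software terms, the alphabet expansion is just a table lookup per nonce character. A minimal Python sketch, assuming the A-P base16 alphabet from the quote; `ALPHABET` and `expand_nonce` are illustrative names, and the table length must be a power of two (a 64-entry table would consume 6 bits per character instead of 4).

```python
ALPHABET = bytes(range(ord("A"), ord("A") + 16))  # assumed alphabet: A..P

def expand_nonce(counter: int, n_chars: int, table: bytes = ALPHABET) -> bytes:
    """Map a binary counter to n_chars alphabet bytes, most significant first."""
    assert len(table) & (len(table) - 1) == 0, "table length must be a power of two"
    bits = (len(table) - 1).bit_length()  # bits consumed per character
    mask = len(table) - 1
    out = bytearray(n_chars)
    for i in range(n_chars):
        out[n_chars - 1 - i] = table[(counter >> (bits * i)) & mask]
    return bytes(out)

print(expand_nonce(0x1F, 4))  # b'AABP'
```

Each output byte depends on only a few counter bits, which is why a small per-character memory (or a lookup instruction on a GPU) is enough.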
That's a beefy FPGA! I wish I had access to one to revive my 2009-era FPGA coursework knowledge, but it seems the dev boards start at five figures, which is too rich for a side project.