That's a frankly insane way to write parallel programs. Parallel evaluation is a...

dragontamer · on March 23, 2021

> That's a frankly insane way to write parallel programs.

Don't hate it until you try it!

    #pragma omp parallel for
    for(...) { ... } // Parallel portion

    // No longer parallel. Implicit join!
    foobar(); // execute sequentially

    #pragma omp parallel for
    for(...) { ... } // Parallel portion again

It really covers a lot of parallel situations. It's often better than no parallelism at all (especially with 4 cores / 8 threads on even the cheapest CPUs today), and it's really, really, really easy.
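To make that concrete, here's a minimal sketch of the same fork-join shape with real loop bodies. The function name `fork_join_demo` and the workload (square, prefix sum, total) are made up for illustration; only the `#pragma omp parallel for` lines and the `reduction` clause are actual OpenMP. It assumes a compiler with OpenMP enabled (e.g. `-fopenmp` on GCC/Clang); without that flag the pragmas are ignored and the code runs sequentially with the same result.

```cpp
#include <vector>

// Hypothetical example: parallel loop, sequential step, parallel loop again.
double fork_join_demo(std::vector<double>& data) {
    const long n = static_cast<long>(data.size());

    // Fork: every iteration is independent, so the loop can be split
    // across threads.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        data[i] = data[i] * data[i];
    }
    // Implicit join: all threads finish the loop before we continue.

    // Sequential portion: this prefix sum has a loop-carried dependency
    // (each element needs the previous one), so a plain parallel for
    // would be wrong here. Just run it on one thread.
    for (long i = 1; i < n; ++i) {
        data[i] += data[i - 1];
    }

    // Fork again: sum everything, with a reduction so each thread keeps
    // a private partial total that OpenMP combines at the join.
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The sequential middle step is the point: you don't have to find a clever parallel algorithm for every part of the program, only for the loops where it's easy.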

---------

Also consider how CUDA / OpenCL kernels execute.

    someCUDAKernel<<<ten-thousand CUDA-threads>>>(params);
    cudaMemcpy(...); // transfers data from GPU back to CPU.
    foobar(data); // Fully sequential portion when parallelism is non-obvious
    cudaMemcpy(...); // transfer data from CPU back to GPU
    someCUDAKernel2<<<ten-thousand CUDA-threads>>>(params); // Parallelism when it's easy

This is basically how most TensorFlow operations work today. (Bonus points: the "foobar" sequential portion is written in Python, something like 1000x slower than C/C++, but you don't care because 99.999% of the work is done inside those CUDA kernels.)

This "stupid" fork-join parallelism is really the core concept of modern heterogeneous processors. It's a really easy way to increase utilization of both CPU and GPU resources.