That's a frankly insane way to write parallel programs. Parallel evaluation is about maximum utilization of system resources balanced against the inherent overhead of synchronization/spawning/joining.
It has nothing to do with code organization.
> If you have an "innately parallel problem" but do NOT care about performance, just pthread_create() / pthread_join() or fork()/wait() and be done with it.
No, just do it in serial. It's easier to read and debug. If you do care about performance and the tasks vastly dominate the time of spawning/joining, then fork/join. It's slightly harder to read and much more difficult to debug, but an order of magnitude faster for a large number of problems.
It should also be noted that promises/futures are close cousins of this model, and their implementation is significantly more complex than pthread_create/pthread_join (but APIs are much nicer, imo, since they do the imperative/cooperative transformation for you).
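For concreteness, here is a minimal sketch of the explicit fork/join pattern with pthread_create/pthread_join. The names `parallel_sum`, `sum_slice`, `struct slice` and the `NUM_WORKERS` value are hypothetical, made up for illustration; summing an array is just a stand-in for tasks that are large enough to dominate the spawn/join cost. Build with `-pthread` on GCC/Clang.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4 /* hypothetical worker count, chosen for illustration */

/* Work description handed to each thread (hypothetical helper type). */
struct slice {
    const long *data;
    size_t begin, end; /* half-open range [begin, end) */
    long partial;      /* written by the worker, read after the join */
};

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    long acc = 0;
    for (size_t i = s->begin; i < s->end; i++) {
        acc += s->data[i];
    }
    s->partial = acc;
    return NULL;
}

/* Returns 0 and stores the sum in *out, or -1 if a thread could not be created. */
int parallel_sum(const long *data, size_t n, long *out) {
    assert(data != NULL && out != NULL);

    pthread_t tid[NUM_WORKERS];
    struct slice slices[NUM_WORKERS];
    size_t chunk = (n + NUM_WORKERS - 1) / NUM_WORKERS;
    int started = 0;
    int status = 0;

    /* Fork: one thread per slice of the input. */
    for (int w = 0; w < NUM_WORKERS; w++) {
        size_t begin = (size_t)w * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;
        if (end > n) end = n;
        slices[w] = (struct slice){ data, begin, end, 0 };
        if (pthread_create(&tid[w], NULL, sum_slice, &slices[w]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* Join: wait for every started thread, then combine the partial results. */
    long total = 0;
    for (int w = 0; w < started; w++) {
        pthread_join(tid[w], NULL);
        total += slices[w].partial;
    }

    if (status == 0) {
        *out = total;
    }
    return status;
}
```

Calling `parallel_sum(values, count, &result)` from otherwise serial code is the whole pattern: the caller blocks until the join, so everything after the call can be read as ordinary sequential code.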
> That's a frankly insane way to write parallel programs.
Don't hate it until you try it!
#pragma omp parallel for
for(...) { ... } // Parallel portion
// No longer parallel. Implicit join!
foobar(); // execute sequentially
#pragma omp parallel for
for(...) { ... } // Parallel portion again
It really covers a lot of parallel situations. It's often better than no parallelism at all (especially with 4 cores/8 threads on even the cheapest CPUs today), and it's really, really, really easy.
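A compilable version of that shape might look like the sketch below. The function `fork_join_demo` and the body of `foobar` are hypothetical stand-ins (squares, a bias value, a sum) picked only so the result is easy to check. Build with `-fopenmp` on GCC/Clang; without the flag the pragmas are ignored and the same code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequential step: stands in for whatever is hard to parallelize. */
static long foobar(const long *buf, long n) {
    return buf[n - 1];
}

/* Parallel loop, sequential step, parallel loop again. */
long fork_join_demo(long *buf, long n) {
    assert(buf != NULL);
    if (n <= 0) {
        return 0;
    }

    /* Parallel portion: iterations are independent, so they can be split across threads. */
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        buf[i] = (i + 1) * (i + 1);
    }
    /* Implicit join: every buf[i] has been written before the next line runs. */

    long bias = foobar(buf, n); /* executes sequentially */

    /* Parallel portion again; the reduction clause combines per-thread totals. */
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += buf[i] + bias;
    }
    return total;
}
```

The only parallel-specific lines are the two pragmas and the reduction clause; the rest reads like the serial program.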
---------
Also consider how CUDA / OpenCL kernels execute.
someCUDAKernel<<<ten-thousand CUDA-threads>>>(params);
cudaMemcpy(...); // transfers data from GPU back to CPU.
foobar(data); // Fully sequential portion when parallelism is non-obvious
cudaMemcpy(...); // transfer data from CPU back to GPU
someCUDAKernel2<<<ten-thousand CUDA-threads>>>(params); // Parallelism when it's easy
Which is basically how most TensorFlow operations work today. (Bonus points: the "foobar" sequential portion is written in Python, something like 1000x slower than C/C++, but you don't care because 99.999% of the work is done inside those CUDA kernels.)
This "stupid" fork-join parallelism is really the core concept of modern heterogeneous processors. It's a really easy way to increase utilization of both CPU and GPU resources.