The hardest part of parallel programming is tricking yourself into believing you need to worry about it at all.
9/10 times, if you use the right tools & techniques, a single x86 core (assuming you have access to a 'real' one) can easily chew through millions of business transactions per second and easily satisfy all technical requirements for whatever business system.
Latency is the ultimate devil. Getting your synchronous state to fit into L1/2 and processing transactions in (micro)batches of contiguous data is how you win the game. The closer your data is to your compute, the faster you can go.
The biggest problem most developers are struggling with right now is that they got tricked into scattering their business systems across many computers and potentially across multiple physical regions. The latency in L1 is measured in nanoseconds. The latency between 2 datacenters is measured in milliseconds. This is SIX orders of magnitude difference. Even if you squander 3 of those to some bad programming practices, you are still 1000x better off on a single box using a single thread than if you scattered your infra to the seven winds.
See: LMAX Disruptor for a case study in what the opposite end of the "infrastructure isnt my problem" spectrum looks like.
Parallel programming is not what you think it is. You're right that parallelism is not the answer to everything, but parallel programming is not even a paradigm people use to write basic business transaction logic. Distributed computing, yes, not parallel programming. Amdahl's law has more to do with diminishing returns in parallel processing moreso than physical proximity to memory.
9/10 times the hundred cycles you save from leveraging your caches vs memory accesses is eclipsed by millions/billions of extra cpu cycles you have to perform for certain computations. Not to mention not every application has predictable memory access patterns that would benefit from LRU/LFU policies.
Now that Moore's law is no longer the bible in computer architecture research, CPU designers have to resort to increasing core count. With the advent of machine learning and big data workloads, you'll spend thousands of years crunching numbers on a single threaded application.
Even in code like this just being aware of cache levels and there relative timings can sometimes produce some impressive speedups with simple things like sorting the initial data to make it cache friendly. A recent one I improved was along the lines of:
for t in transactions
t.client = clients[t.ClientId]
t.product = products[t.ProductId]
In this case sorting transactions by ClientId, ProductId was an orders of magnitude speedup because client and product were already in cache in L1/L2 cache most of the time. It could have been faster again with less SOLID but improvements like this are often easy and politically achievable because it keeps the sacrifice to the OO gods.
Treating Big O as the be all of performance has quite a lot to answer for in real world code, often constants aren't so constant.
Please, refrain from judging people here. I would rather prefer if we talked about this wonderful book that was released for free for our own enjoyment. I cannot even grasp the volume of work that went into this book.
Counterpoint: Parallel programming isn't very hard.
Seriously: try fork() + wait(), or spawn in bash / cmd line. make -j offers parallelism (and Ninja is "make -j by default").
OpenMP's "#pragma omp parallel for" is extremely easy to use.
Things get harder with thread libraries, but the "parallelism" part of pthreads is pretty easy. pthread_create(&thread_id, NULL, function, args);, at least if you're fine with the default settings.
C++ Threads are even easier than pthreads.
-----------
The hard part is:
1. Communication / Synchronization -- But really, condition variables are the most complicated thing most people need. fork() + wait() push the difficult communication part to the wait() or to pipes. But if you stick with wait() / waitpid() as your main communication mechanism, its pretty easy to think about.
2. High performance programming -- This is just hard, especially if you begin to reach for the highest performance "memory barrier / lock-free programming" styles. This is where "false sharing" comes in (your code correctly executes under false sharing, but slowly due to how the L1 cache handles multicore architectures).
---------
However, if you stick with a simple fork-join communication model (ex: pthread_join, or wait()/waitpid() based synchronization)... its really not too hard at all. Just try to stay away from the high-performance techniques unless you really need them.
Not to offend, but yeah, if you've already partitioned the work to be done, then of course it's not hard.
I'm pretty sure that anyone who's come into even vague contact with parallel programming understands that the hardest part is communication and synchronization.
I think you underestimate the number of real-world problems that can be solved by a bash script executing "./some_task &" in a for loop, and using "jobs", "fg", and "bg" as needed.
For many problems, that's literally good enough parallelism.
--------
Case in point: I once needed to render an animation in Blender, and I noticed that Blender's built-in parallelism was kind of crap (it was an older version of Blender at the time. My 32-thread / 16-core system barely reached 400% utilization... I know I could do better).
So I just spawned 16x Blenders in headless mode, each handling a different frame of the animation, and achieved far higher utilization than the innate Blender parallelism.
Yes, literally:
#!/bin/bash
for i in `seq 1 16`;
do
blender (params) & # Yeah, just run a bunch of blenders in headless + background mode
done
Bonus points: I tweaked the script to take advantage of NUMA-locality, by automatically pinning blender to one NUMA-node. All without ever touching a single line of C code.
-------
Communication / synchronization is hard. But parallelism (even high-performance NUMA-locality with thread affinity pinned to particular CPU nodes) doesn't have to be hard. Just think a bit smarter.
--------
The equivalent to doing "./some_job &" is fork() and wait() in POSIX C / C++. That's good enough for a ton of problems (hell, its how us older programmers achieved parallelism back in the CGI-BIN days).
Don't do complicated high-performance techniques if you only need a bit of parallelism. Don't overthink things, stick to simple parallelism when simple solutions solve simple problem.
Don't premature optimize into complicated producer/consumer patterns unless you actually need the performance. Definitely don't premature optimize into high-performance lockless programming techniques unless you really need to.
I mean in the example you gave, the work was already partitioned for you. That we have high level tools that make it easy doesn't change the fact that the work of rendering frames in an animation are very clear chunks of work that lend themselves naturally to parallelization.
We can quibble about the semantics of "parallel" programming all day, but I think that either too broad or too narrow a scope for the term misses the point: that data coherency is what makes parallel programming hard.
In the ultimate reductio ad absurdum, you can just say it falls into one of the two classes of hard computer science problems, that being "cache invalidation", where the "cache" is some form of shared data. In this sense, I agree with you: it's not the parallelism per se that's hard, but I think anyone talking about the difficulty of it quite clearly is referring to the shared data problems.
> I mean in the example you gave, the work was already partitioned for you.
In my experience, I keep running into problems *already partitioned*.
Make -j. 3d-animation frames. SAXY arrays. (Seriously, do you know how many for loops I've come across that can become #pragma omp parallel for and just automatically benefit from parallelism??). Multiple connections to a server (ex: Fork/join Apache model / CGI-bin / PHP / etc. etc.). Multiple connections to other servers (SMTP, IRC, etc. etc.)
A ton of problems are "easy parallelism" in my experience. Probably most problems. Yeah, there's some hard problems out there, but beginners should learn how to parallelize the easy stuff first.
The point is that these easy cases may just referred to as "work to do". Using more than one CPU core at the time to process through a backlog of work is just...the normal case. Doing work simultanously is not what people refer to when they say "parallel programming is hard"; and the pedantery is in insisting that people word themselves to exclude trivially parallel work when what they want to talk about are the cases where parallelism is not trivial.
The conversation above is like this:
"Solving cubic equations is hard"
"Actually, solving x ^ 3 = 27 is easy, just take the cubic root.."
"Well that is not what we meant when we said cubic equations, we were interested in the general case..."
"I see equations like x^3 = y all the time, it is not that often you have a linear term too..."
* "No its not. 90% of the time, its just changing your oil, refilling the fuel tank, washer fluid, and rotating tires".
* "But you have to also replace the transmission fluid and sometimes rebuild the engine"
* "True, but those are not a very common tasks".
-----------
That's how I see it. Most parallelism is "easy parallelism" in my experience. The goal isn't necessarily to train programmers into being able to do "engine rebuilds", but instead train them for the easy "oil change / rotate tires" part of taking care of cars (or programs).
For programming: that's fork / wait(), or pthread_create() / pthread_join(), or #pragma omp parallel for. Those constructs covers the vast, vast majority of parallel constructs I've run across.
Yes, mutex is hard. Condition variables are hard. Lock-free programming is hard. High performance (false-sharing / MESI) is hard. GPU ballots / __shared__ memory / butterfly permutes are very hard. Prefix-sum / scan and Mergepath algorithms are hard. Lock-free Producer-consumer is really, really, really hard (Probably PH.D level hard)
But... I really don't come across a lot of situations where I need to Mergepath, GPU-ballot, or need to manually program / debug a Lock-free algorithm. That's just the facts. I'm glad I learned these constructs (and I await the day when a high-performance programming problem is given to me such that I need to "pull out the big guns" that I've spent a lot of effort learning)... but I recognize that its not really needed in typical day-to-day programming parallel jobs.
Even in GPU-programming: the bulk of GPU-programming is just simple fork/join stuff. A GPU-kernel is a fork, and when the GPU-kernel finishes execution, its an innate join() (with synchronization guaranteed). Its really not that hard in most situations.
Schedule your chain of GPU-kernels to execute in order (a "CUDA-stream"), and they automatically join and synchronize with themselves each time the kernel completes. Its that easy in most cases.
"Oh, but for maximum performance you want to use dynamic / recursive parallelism, or block-off your data into __shared__ memory and think about L1 caches". Erm, yes, if you care to put that much work into performance its hard. But you're no longer talking about parallelism per se, but instead the very difficult field of high-performance compute at that point.
Pretty much every survival/simulation game with complex mechanics suffers from low performance because at some point the simulation becomes very CPU intensive but parallelizing the code is still extremely difficult because of extensive interdependencies.
If you go with the trivial "divide the problem into the smallest parallelizable components" strategy you will end up with huge interconnected graphs, sometimes there isn't even a second graph that can be put on a second CPU core. If you are insane enough to divide the problem at the agent level then you end up with a synchronization nightmare. Sure, given enough effort you can do these all these things but that only entrenches the perception of parallelism being a last resort instead of being the default solution for everything.
I think a lot of people don't know how easy it can and should be to handle the easy cases. C++ has some parallel constructs, but they are not as simple as OpenMP because they're meant to handle the harder stuff.
I think it's important to show and use the easy ways before getting into the complex things.
I've read v1 of the book, and the author does cover the fact that many parallel problems can be covered by using shell scripts or databases. He calls these problems you are talking about "embarrassingly parallel" and encourages the reader to take advantage of parallel tools whenever possible.
The rest of the book talks about doing data sharing efficiently (i.e. for not embarrassingly parallel problems), including some things I don't think I ever would have run across otherwise like RCU. All the while he explains the use cases for each of the techniques. If you are interested in making the most of your processors (and I suspect you are given this post), and are up for a technical read, I'd highly recommend this book.
EDIT: By which I mean to say that he covers all the stuff you posted, and explains exactly when it is worthwhile to use the high performance and communication techniques. If you want to know just the basic stuff, read the first 3 chapters.
- Do you really want to ignore ~90% of the compute throughput and ~90% of the memory bandwidth of your machine?
- Knock knock
- ...
----
An AMD EPYC 7742 (Milan, Zen 3, announced last week, high-end server model) has ~100 GB/s DRAM bandwidth and 1-2 TFLOPS double precision.
A single PCI-e A100 has ~2Tb/s DRAM bandwidth, and 10-20 TFLOPs double precision. You can pack 4x of these per CPU socket... so you end up at 8 TB/s and 40-80 TFLOPs... That makes the GPUs like 99% of both the compute and memory BW of a socket.
Well, CPUs vs GPUs is pretty interesting actually. The #1 supercomputer right now is CPU-only, the A64Fx by Fujitsu. But you're right in that GPU-parallelism is the obvious parallel architecture available to modern programmers. For anyone spending less than $500,000 on a computer, GPUs are probably the most compute for the least amount of cost.
---------
The issue is that there's two difference concepts here:
1. Parallelism -- When a problem you're working on has an undefined ordering to it explicitly. Protocols and communications for example have undefined orders. You could write a state-machine explicitly, but writing a thread is usually much simpler to understand. Ultimately: this is cofunctions, Python Yield / generators, async-code, etc. etc. Remember: Djikstra invented semaphores on a single-core processor. The concept of "single core parallelism" is hugely useful and can grossly simplify code if used correctly.
2. High performance -- This is a different concept entirely!! GPUs are within the realm of high-performance parallelism (and not just merely "conceptual parallelism" in #1).
GPUs are hard because anyone who uses a GPU expects high performance. Be it the video gamer who wants raytracing on his video game, or the machine learning enthusiast training a billion neurons.
Writing a simple "GPU Hello World" is easy as all heck actually. Its all the performance stuff that gets really hard.
The only GPU-like thing it has is on-board HBM2 RAM (which other CPUs could do, but haven't bothered to do for some reason. Except that one Xeon Phi using the HMC competitor to HBM).
The SIMD-model of A64Fx: the SVE instruction set of ARM, is coming to mainstream Neoverse V-series ARM chips within a year or two.
SVE is CPU-parallelism. Well, its SIMD, but its just 512-bit SIMD (equivalent to AVX512 in width). So that's smaller scale SIMD / CPU parallelism IMO.
Those A64Fx are going to be branch-predicting and executing those SIMD-instructions out-of-order. Its a CPU.
-------
I consider GPUs to be the simpler in-order but more parallel stuff (AMD, NVidia, Xilinx maybe). But you're right in that the line keeps getting blurred each year. (NVidia's dual-issue pipelines is a very CPU-like feature. AVX512 and SVE pushing 512-bits is getting closer-and-closer to GPU-like SIMD widths)
GPUs can now basically do out-of-order execution using dynamic warp formation and driver-based prediction and optimization.
Branch prediction is useless in massively parallel workflows, because of linearization.
GPU SIMD is also more complicated than full-SIMD. Nowadays with work-stealing and multiple workgroups, you can have very similar performance characteristics to parallel SIMD on CPUs, just with more perfromance.
And in any case, OOE for SIMD instructions loses a lot of it's usefulness on many workflows.
> GPUs can now basically do out-of-order execution using dynamic warp formation and driver-based prediction and optimization.
That's... not OOE.
If you have:
add [eax], ebx
add ecx, edx
A CPU can execute "add ecx, edx" before the "add [eax], ebx" (especially if the [eax] memory read/write is far away: like in DDR4 or worse).
A GPU will ALWAYS execute the 1st line before the 2nd line. Dynamic warp formation means that SIMD-lanes can split-up in your warp, but none of the warp is executing anything out-of-order.
> Branch prediction is useless in massively parallel workflows, because of linearization.
Stockfish NNUE argues otherwise. Only part of the network updates at any given time. Its a massively-parallel problem, but you need dynamic / branch prediction to see which parts of the Neural-Net are actually updating.
The idea of CPU-designed networks (Neural Nets that PARTIALLY update, instead of a GPU-network that updates entirely) is intriguing to me. There might be other situations where branch prediction is helpful in other largely parallel problems.
----------
> And in any case, OOE for SIMD instructions loses a lot of it's usefulness on many workflows.
More like the SIMD offers enough parallelism to saturate the machine. But NNUE really shows where OOE / branch prediction can suddenly become unexpectedly helpful.
Classic SIMD algorithms do not account for branch prediction / OOE benefits. The very existence of high-speed SIMD + OOE + branch prediction is an unexplored part of computer science IMO.
The A64fx also has hardware synchronization barriers to synchronize cores, which is a pretty GPU-like thing (at least it is very common on GPUs, and rare on CPUs).
False sharing isn't that hard to deal with. Soundness is much harder to deal with than performance, and lock/wait free programming isn't about raw speed - it's about prioritizing latency over throughput, or providing latency guarantees. A slightly less difficult problem (but still very difficult in practice) is determining if your application needs to prioritize latency or throughput, most assume the latter and use strategies that emphasize the former.
And if you can tolerate the overhead of spawning processes than none of this really matters, fork/join is fine.
> Soundness is much harder to deal with than performance
Soundness within a pure Fork/join model is easier than you might think.
If any data needs to be shared between threads, just pthread_join() / wait(). Then execute the shared-portion of the data in a single thread. Then fork() back off after the sequential portion is complete.
There's no race condition because all the other threads are dead after the pthread_join() or wait() calls. As such, you're fully sound.
> And if you can tolerate the overhead of spawning processes than none of this really matters, fork/join is fine.
Exactly.
The main issue at hand here is that "parallel models" are useful mechanisms to organize code. A lot of parallel programming isn't about speed of execution, but instead about representing an innately parallel problem in their code.
If you have an "innately parallel problem" but do NOT care about performance, just pthread_create() / pthread_join() or fork()/wait() and be done with it.
That's a frankly insane way to write parallel programs. Parallel evaluation is about maximum utilization of system resources balanced against the inherent overhead of synchronization/spawning/joining.
It has nothing to do with code organization.
> If you have an "innately parallel problem" but do NOT care about performance, just pthread_create() / pthread_join() or fork()/wait() and be done with it.
No, just do it in serial. It's easier to read and debug. If you do care about performance and the tasks vastly dominate the time of spawning/joining, then fork/join. It's slightly harder to read and much more difficult to debug, but an order of magnitude faster for a large number of problems.
It should also be noted that promises/futures are close cousins of this model, and their implementation is significantly more complex than pthread_create/pthread_join (but APIs are much nicer, imo, since they do the imperative/cooperative transformation for you).
> That's a frankly insane way to write parallel programs.
Don't hate it until you try it!
#pragma openmp parallel for
for(...) { ... } // Parallel portion
// No longer parallel. Implicit join!
foobar(); // execute sequentially
#pragma openmp parallel for
for(...) { ... } // Parallel portion again
It really covers a lot of parallel situations. Its often better than no parallelism at all (especially with 4-core/8-threads on even the cheapest CPUs today), and its really, really, really easy.
---------
Also consider how CUDA / OpenCL kernels execute.
someCUDAKernel<<<ten-thousand CUDA-threads>>>(params);
cudaMemcpy(...); // transfers data from GPU back to CPU.
foobar(data); // Fully sequential portion when parallelism is non-obvious
cudaMemcpy(...); // transfer data from CPU back to GPU
someCUDAKernel2<<<ten-thousand CUDA-threads>>>(params); // Parallelism when its easy
Which is basically how most Tensorflow operations work today. (Bonus points: the "foobar" sequential portion is written in Python, something like 1000x slower than C/C++, but you don't care because 99.999% of the work is done inside of those CUDA-kernels)
This "stupid" fork-join parallelism is really the core concept of modern heterogenous processors. A really easy way to increase utilization of both CPU-and-GPU resources.
One of the biggest issues I encounter with concurrency is whether the code I've written could be optimized to reduce cache misses.
I feel like a lot of resources including this wonderful book gloss over reducing cache misses. Is it really that trivial? What are the tips and tricks involved?
Where is this a problem for you? The broad goal to minimize cache misses is the same as concurrency in general - minimize synchronization. This means understanding dependencies and doing work on each core in chunks that are as large as possible for the amount of latency you can tolerate between synchronizing.
- "A high-performance implementation would of course
use padding or special alignment directives to avoid false sharing."
- "Cache alignment and padding often improves performance by reducing false sharing."
- "Software can use the alignment directives available
in many compilers to avoid false sharing, and adding such directives is a common step
in tuning parallel software."
There's just essentially two or three lines in the book that I could find that refer to optimizing cache use through padding/alignment, and that was after knowing what to search for. Although I wouldn't be surprised if I missed something.
In my experience, the discourse stops at the surface level, which makes the topic appear like it's obvious or trivial. But there are many follow-up questions that naturally arise for me:
- What are the trade-offs of cache-padding shared data? Why does it degrade performance for certain problems?
- What is a good rule-of-thumb for when to prioritize cache padding/alignment over cache locality?
- Are there other best practices like cache padding/alignment/locality that improve performance?
- What is an alignment directive and how does one use it?
I agree with what you've said, I was merely pointing out that I wish parallel programming resources delved more into this subject, as I feel that it's a practical and common issue. I'm sure it's not trivial and requires a fair bit of expertise, but that's why one would reach for a book like this after all.
False sharing is a bit of a niche optimization compared to the architectural optimization of synchronizing less and doing more work in larger chunks.
All the same, it is pretty easy. It has been a while, don't take this as gospel - A cache line is 64 bytes. Cache line boundaries will be every 64 bytes. Make sure your atomics don't fall on two cache lines.
If a pointer is 8 bytes and 4 bytes are on 60-64 with the other 4 bytes on 65-68, accessing that atomic will introduce false sharing because it will access two cache lines.
This is generally not an optimization that will need to be worried about. If you have a hotly contested atomic variable though, you might benefit from aligning it to not overlap two cache lines. You can do this by allocating extra memory and just putting at an address evenly divisible by 64 (a 64 bit atomic would then use the first 8 bytes of that cache line).
Optimizations fall in to two camps. The first is architecture, which comes from experience of what you will need up front. The second camp is what you find out from optimization after the fact.
> - Are there other best practices like cache padding/alignment/locality that improve performance?
This is pretty much it as far as I can remember. 128 bit atomics will only work when aligned to 128 bit boundaries interestingly, while 64 bit and below are fine, you just have the possibility of false sharing.
The simple recipe for locality is not really about concurrency - use arrays and loop through memory sequentially so that it can be prefetched.
> - What is an alignment directive and how does one use it?
There is an alignment keyword in C++ now. I have not used it yet but there are good explanations out there I'm sure. It will come down to automatically doing what I described in a manual way earlier.
False sharing has nothing to do with misaligned atomics.
It's about having multiple CPUs actively use the same cache line. This is what happens when you try to make your data independent to remove locks, and you end up with extremely high contention with others when accessing your data, because the cache line cannot be in exclusive state in each core, so you spend your time flushing 64 bytes at once for each 4-byte write.
The typical case is this:
uint32_t counter[MAX_THREADS];
And have each thread perform counter[thread_id]++;
Without noticing this is packing 16 threads on the same cache line. It can make your 16 threads work at roughly the same speed as a single one. And sometimes worse.
The solution here is to identify all the data that are changed together within a same thread, and pack them together in a struct which you arrange in arrays:
It sounds like a dumb tautology but parallel programming is easy for parallel problems. It is harder to say maintain several interlinked grid cells in a simulation in parallel than it is to partion a non-interacting set to N cores to run in parallel and let the main thread know when everything is done.
A question to parallel programming experts. Can we assume that a task that is well seperated to perform on single core is always better than multi-core?
I meant using multiple instances working on single core vs. single instance with multi threads.
Data sharing between cores is the bane of parallel programming at every scale. Consequently, high-performance parallel systems are not multithreaded in any meaningful sense -- each thread operates almost exclusively on private local data. See also "thread-per-core" software architectures for maximizing absolute throughput, which has its origins in supercomputing.
However, this creates a new problem for any workload that is not embarrassingly parallel. If all data is private to a single thread/core then the workload can become very unevenly distributed across cores depending on the data they own, destroying efficient parallelism in another way. Unlike multithreading, quick adaptive load shedding to smooth out these hotspots can scale surprisingly well across a very large number of cores/nodes with little overhead for many workloads. This is how many massively parallel codes are written today for workloads that are not embarrassingly parallel.
Partitioning data across cores with out sharing is necessary to maximize throughput, and almost always better than multithreading, but insufficient. Fortunately, mitigating transient hotspots is a mostly solved problem.
For this load shedding in a single node, are you imagining something more than work stealing style approaches?
In a multi-node cooperative setting you need some way to transmit information that a given node is overloaded, some way to find nodes that have available capacity, and a low overhead way to shift the work over to them. If the work to be done depends solely on data that you have local to you, it seems silly to shift the data as well (depending on how big it is); this would only make sense when you have to combine data from a variety of nodes (which can be done on any other node, assuming that the current node is overloaded).
Probably not worth doing anything about short lived hotspots (<1s). I wonder what kind of granularity you have used in your systems (probably different for within node vs across nodes).
Yes, something different than work-stealing on a single node. Instead of pulling work from other cores when idle, cores push the data that is attracting work to other cores. Even though shedding data that is attracting work is not the same as shedding work per se, the effect is equivalent in real systems if you can shed it with very low latency.
The last several setups I've seen worked this way. This has lower overhead and significantly less thread contention than work-stealing. The hotspot granularity was pretty coarse a decade ago -- on the order of a few seconds -- because the algorithms and techniques weren't as good and the overhead was relatively high. Modern versions have some tricks that allow them to shed load so quickly (millisecond range) and with so little overhead that it just looks really smooth to have it proactively running all the time. The system I am currently working with is designed to shed thousands of megabyte-scale data chunks per second without interfering with throughput but you typically need quite a bit less to keep things balanced.
The ideas were originally developed for large parallel clusters running workloads inherently prone to hotspotting. To your point is a more complex problem and operates at coarser time granularities, both out of necessity and because hotspots form more slowly anyway. It turns out it works even better when applied at the scale of a large server. The concept is pretty simple in the abstract, the challenge is arriving at an efficient implementation with few edge cases that also can provide strong ordering guarantees. Quite a few slightly broken but usable designs were put into production, including by myself, before people started to grok the design idioms for the single-node case.
An important point is that unlike work-stealing, you can't sprinkle this on top of a system that wasn't designed for it. Implementations lean heavily on schedule control mechanisms, which most software architectures don't provide.
Sharing read only data is not an issue at all. Sharing data written by only one thread also scales very well: The single writer principle is well understood and critical for designing efficient multi-threaded applications.
Interestingly, not sharing read-only data is a well-known optimization for parallel code -- it can famously produce a super-linear scaling effect. While counterintuitive, the reason is mundane.
The majority of high-performance code is limited by memory bandwidth, not CPU or I/O. Consequently, throughput is a function of the aggregate CPU cache efficiency of the entire system. If read-only data is aggressively shared across cores, it means a large part of the total CPU cache capacity of the system is being used to store the same data multiple times, thereby wasting memory bandwidth and reducing system throughput.
If you do not share any read-only data, it maximizes the amount of the workload resident in the various CPU caches. This can have a significant performance effect in real parallel systems.
The only way to achieve perfect linear scaling is to run N threads that never talk to each other, because a single thread program doesn't communicate with other threads.
Whether you run processes or threads shouldn't matter. As soon as you do communication, expect your program to get slower than what is theoretically possible. Multiprocess tends to have higher communication overhead (not always).
If you flip some NUMA switches so tasks on the other core don’t add overhead, I’m inclined to say yes.
But then modern schedulers are also so low-overhead that this case really only applies if you have a hard real-time requirement - in which case dedicated hardware is the norm
I think Joint Parallel is hard, as in when two threads on two different cores are trying to write+read to+from the same memory.
My solutions are:
1) Use Java on the server; it's GC:ed VM has a complex memory model that can handle fast lock-free concurrency relatively well, even using OO though you'll hit cache-misses.
2) Use C on the client, primitive variable types are generally atomic so if you know what you are doing (I don't because I haven't gotten that far yet) you should be able to joint parallelize cores with char, int and float arrays as data, also avoiding cache-misses!
There are many more arguments for this division, Java doesn't crash (usually) and C is truly portable (you cannot run Java on certain devices iOS, Switch, PS4).
This seems more like someone taking wild guesses instead of information from experience and fundamentals.
First, I don't think "Joint Parallel" is a common term and everything gets synced eventually at some level.
Your solutions are languages, which are separate from understanding parallelism and concurrency. Using java doesn't solve anything directly.
> it's GC:ed VM has a complex memory model that can handle fast lock-free concurrency relatively well, even using OO though you'll hit cache-misses.
This is almost like talking about the sound of a color. Lock free programming is not a factor of a language. If there are good lock free data structures that work for your specific problems, then that can go a long way to solving synchronization. OO and cache misses have nothing to do with anything here, that comes down to how something is written, not the language.
> Use C on the client, primitive variable types are generally atomic so if you know what you are doing
This is not true. C has nothing to with this, it is a matter of getting access to atomics. Also primitive types are not "generally atomic", 32 bit stores can be written atomically, but that doesn't address ordering, compare and swap or atomic adds.
> you should be able to joint parallelize cores with char, int and float arrays as data, also avoiding cache-misses!
This seems like real information fed through google translate a few times. 'joint parallelize' doesn't mean anything. Avoiding cache misses is a matter of accessing memory sequentially. Arrays are orthogonal to atomics, each element needs to be dealt with individually.
As for experience, I wrote my own app server with distributed database and multiplayer protocol from scratch by myself and proved it with a MMO that has 300.000 customers:
The things you said in your comment above are untrue for very fundamental reasons. This isn't a judgement call of the best techniques based github histories.
> I minted Joint Parallel and gave the definition!
That means it's nonsense. It is an oxymoron, it contradicts itself. Parallelism by definition is about separation of dependencies and synchronization. Join implies the exact opposite. That's like saying Bright Black while telling people you invented a new color space. It is a hiccup to being taken seriously (to put it mildly).
You have said stuff like 'java doesn't crash (usually)' and that multi-threading rendering adds at least 10 frames of latency to a game. Statements like these are not just incorrect, they indicated a lack of understanding by thinking mostly orthogonal concepts are somehow at the mercy of each other.
I looked at your link. The first file I opened says "non-blocking and asynchronous" in the first line of the comment.
Your methods that are supposedly doing this use the synchronize keyword whose purpose is to "block (suspend execution)" of other threads that call the same method until the first thread is done.
Joint because Parallel by itself means nothing, if I setup two separate computers with a load balancer they are parallel.
Joint as I said; and it's not rocket science so I'm getting a little annoyed; is when two separate cores work on the same real-time task that shares memory, most software 1) does not do this yet because it's hard 2) most programmers don't learn this because 1).
For example Erlang, Rust and Go are terrible at sharing memory because they have simple memory models, they solve parallelism by synchronizing hard and slow with copying and locks.
And yes games today have 10 frames of motion-to-photon latency, I measured it myself.
Non-blocking there is referring to the IO, synchronized only blocks if there is contention, you have to compare this to the synchronous HTTP APIs to understand why I guess?
wakeup() (and other methods) are synchronized because you don't want two threads to contest the hand-over point (in this case wakeup a HTTP response); all shared memory is done over java.util.concurrent data, which in most cases is atomic.
You have to look at the fuse project to see most joint parallel code, not saying it's pretty; but it works at, just as these quotes imply, around 10x perf:
"While I'm on the topic of concurrency I should mention my far too brief chat with Doug Lea. He commented that multi-threaded Java these days far outperforms C, due to the memory management and a garbage collector. If I recall correctly he said "only 12 times faster than C means you haven't started optimizing"." - Martin Fowler https://martinfowler.com/bliki/OOPSLA2005.html
"Many lock-free structures offer atomic-free read paths, notably concurrent containers in garbage collected languages, such as ConcurrentHashMap in Java. Languages without garbage collection have fewer straightforward options, mostly because safe memory reclamation is a hard problem..." - Travis Downs https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
Until you show me something joint parallel you have done and preferably proven (with say 450 concurrent real-time action users on one machine); I'm calling this settled, your arguments are thin! (also why am I arguing with a user called dildo?!)
You completely missed the point, which is that you fundamentally don't understand how these things work.
> Joint as I said; and it's not rocket science so I'm getting a little annoyed; is when two separate cores work on the same real-time task that shares memory, most software 1) does not do this yet because it's hard 2) most programmers don't learn this because 1).
That's just parallelism. What do you think openMP does? It is a fork-join model and is the easiest method of parallel programming because reading data that isn't being written to is trivial.
> For example Erlang, Rust and Go are terrible at sharing memory because they have simple memory models, they solve parallelism by synchronizing hard and slow with copying and locks.
This is not true. Again you are not understanding the fundamentals of what a system language like rust does.
> And yes games today have 10 frames of motion-to-photon latency, I measured it myself.
I measured it and it was 1 frame
> Until you show me something joint parallel you have done and preferably proven (with say 450 concurrent real-time action users on one machine); I'm calling this settled, your arguments are thin! (also why am I arguing with a user called dildo?!)
None of this has to do with validating the things you have said. If you can use the concurrent data structures then that's good, that's what they are there for.
I think you have a very warped idea of what you actually understand. Cars get people where they want to go but getting your driver's license doesn't mean you can design an engine.
I put off reading the book for ages because I was convinced that knowing about locks and atomic data accesses, and intentionally keeping things simple would be enough to stay out of trouble.
The book certainly does cover how to coordinate between multiple threads, but it goes further by pointing out that the real secret to multithreaded programming is designing things so that coordination is kept to a minimum. For one thing, the coordination often has the side effect of purging your CPU caches! You definitely get a lot of cache misses. Even if you use atomic operations on primitive types.
I believe you can get a lot out of the book even if you already know Java’s or POSIX’s threading models.
Indeed. It's hard to imagine people who could work basically anywhere in the world chooses to work for the company that has demonstrated they only care about money, don't care about its users and actively do more harm than good.
I could understand people who started working for Facebook 10 years ago, before we knew the beast. But now, just a year ago start working for Facebook? I really wish we (engineers working in the software industry) could be more ethical inclined and stop caring about money so much.
There are literally billions of people who are happy to use it and get happiness from it. Yes it is also, fundamentally, spyware.
I think the answer is more nuanced and I struggle to untease it.
Consider that everyone who owns an internal combustion vehicle is knowingly implicit in poisoning the atmosphere. Most of them only care about money and don’t care about the atmosphere. Are they actively doing more harm than good?
(FB is a large corp and doesn’t need defense from little old me. I’m pushing back though at the absolutism of your judgement)
Do you eat meat? Do you were clothes created in sweatshop? Do you use any cobalt mined in Congo?
I believe all of those are far more evil than Facebook. Any TV you buy, any webpage you go to (not literally any, but more or less) sells your data as well.
I understand that you, or most, might disagree with me. I do believe that is a matter of opinion.
You're conflating consumer choices (where you have little ability to affect the outcome outside an organized boycott) with actually producing stuff (which makes the system go).
Is your base moral code to do what is popular? Because lets be real, this thread would not appear for any of the other companies that do the exact same thing but aren't currently in vouge to hate on.
Very easy to write this if you’ve already made it. Plenty of people haven’t, however. And while Facebook might not be the nicest place doing the best stuff for the world, we should not be disparaging developers for trying to better their situation unless you’re gonna go ahead and disparage every developer who ever worked at Facebook too.
We don't know the circumstances. Maybe he didn't have that many options, and needed money badly? It's easy to criticize people for their choices, but I always wonder what the critics would actually do in exactly the same circumstances. I personally am not so sure I can always justify my life choices as 100% ethically pure.
Yeah, that's how I look at people who work for the gangs that killed 200 million people last century, and continue to do most of the killing even today. Most people call them governments and they find nothing wrong with working for them.
9/10 times, if you use the right tools & techniques, a single x86 core (assuming you have access to a 'real' one) can easily chew through millions of business transactions per second and easily satisfy all technical requirements for whatever business system.
Latency is the ultimate devil. Getting your synchronous state to fit into L1/2 and processing transactions in (micro)batches of contiguous data is how you win the game. The closer your data is to your compute, the faster you can go.
The biggest problem most developers are struggling with right now is that they got tricked into scattering their business systems across many computers and potentially across multiple physical regions. The latency in L1 is measured in nanoseconds. The latency between 2 datacenters is measured in milliseconds. This is SIX orders of magnitude difference. Even if you squander 3 of those to some bad programming practices, you are still 1000x better off on a single box using a single thread than if you scattered your infra to the seven winds.
See: LMAX Disruptor for a case study in what the opposite end of the "infrastructure isnt my problem" spectrum looks like.