Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
What is the real world benefit we will get in return?
In the rare case where I need to max out more than one CPU core, I usually implement that by having the OS run multiple instances of my program and put a bit of parallelization logic into the program itself. Like in the mandelbrot example the author gives, I would simply tell each instance of the program which part of the image it will calculate.
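The "run multiple instances and tell each one its slice" approach can be sketched like this. This is a minimal illustration, not the commenter's actual code; the shard count, the row-striping scheme, and the worker/coordinator split are all assumptions:

```python
import subprocess
import sys

N_SHARDS = 4  # e.g. one instance per CPU core; illustrative constant

def rows_for_shard(shard_index: int, height: int = 1000) -> range:
    """Each OS-level instance computes only every N_SHARDS-th image row."""
    return range(shard_index, height, N_SHARDS)

def launch_workers(script_path: str) -> None:
    """Coordinator: start one instance of the script per shard and wait."""
    workers = [
        subprocess.Popen([sys.executable, script_path, str(i)])
        for i in range(N_SHARDS)
    ]
    for w in workers:
        w.wait()
```

The parallelization logic inside the program then reduces to reading its shard index from `sys.argv` and iterating only `rows_for_shard(index)`.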
There is an argument that if you need in process multithreading you should use a different language. But a lot of people need to use python because everything else they’re doing is in python.
There are quite a few common cases where in-process multithreading is useful. The main ones are where the work units have large inputs or large outputs. In-process is nice because you can share that input or output state with the work units instead of having to copy it.
One very common case is almost all GUI applications, where you want to be able to do all work on background threads and just move data back and forth from the coordinating UI thread. JavaScript's lack of support here, outside of a native language compiled to Emscripten, is one reason web apps are so hard to make jankless. The copies of data across web workers or Python processes are quite expensive as far as things go.
Once a week or so, I run into a high-compute Python scenario where the existing forms of multiprocessing fail me: large shared inputs, and/or I don't want the multiprocess overhead; but the GIL slows everything down.
> Where you want to be able to do all work on background threads and just move data back and forth from the coordinating ui thread. JavaScript’s lack of support here, outside of a native language compiled into emscripten, is one reason web apps are so hard to make jankless
I thought transferring array buffers through web workers didn't involve any copies if you actually transferred ownership:
worker.postMessage(view.buffer, [view.buffer]);
I can understand that web workers might be more annoying to orchestrate than native threads and the like but I’m not sure that it lacks the primitives to make it possible. More likely it’s really hard to have a pauseless GC for JS (Python predominantly relies on reference counting and uses gc just to catch cycles).
This is true, but when do you really work with array buffers in Javascript? The default choice for whatever it is that you're doing is almost always something else, save for a few edge cases, and then you're stuck trying to bend your business logic to a different data type.
That’s a choice you get to make and probably depends on your problem domain and other things. For example when I was writing R2 it was all ArrayBuffers up and down the stack. And you could use something like capnproto or flat buffers for managing your object graph within an array buffer. But yeah, being able to transfer custom object graphs would be more powerful.
There is this assumption in these discussions that anything consuming significant CPU must necessarily have a simple interface that’s easy to reduce to a C-level ABI, like calling an ML library on an image, a numpy function on two big matrices or some encryption function. Therefore it is trivial to just move these to native code with an easy, narrow interface.
This assumption is incorrect. There are plenty of problems that consist entirely of business logic manipulating large and complex object graphs. “Just rewrite the hot function in rust, bro” and “just import multiprocessing, bro” are functionally identical to rewriting most of the application for these.
The performance work of the last few years (free threading and the JIT) is very valuable for these. All the rest is already written in C.
It's a good assumption though, because it keeps (in this case kept) closed the door to the absolutely nightmarish landscape of "multithreading to the masses". Those who made it open probably see it better, but, imo and ime, it should remain closed. Maybe they'll manage to handle it this time, but I'm 95% sure it's gonna be yet another round of ass pain for the world of python.
otoh, if all of your time is spent in Python code and you have performance issues, it's time to rewrite in a different language. Correct multithreaded code is quite hard and Python is inherently really slow. The inherent complexity of multithreaded code is enough that you should just write the single-threaded version in a language that's 10x faster (of which there are many).
> “Just rewrite the hot function in rust, bro” and “just import multiprocessing, bro” are functionally identical to rewriting most of the application for these.
Isn't "just use threads, bro" likely to be equally difficult?
As always, it depends a lot on what you're doing, and a lot of people are using Python for AI.
One of the drawbacks of multi-processing versus multi-threading is that you cannot share memory (easily, cheaply) between processes. During model training, and even during inference, this becomes a problem.
For example, imagine a high volume, low latency, synchronous computer vision inference service. If you're handling each request in a different process, then you're going to have to jump through a bunch of hoops to make this performant. For example, you'll need to use shared memory to move data around, because images are large, and sockets are slow. Another issue is that each process will need a different copy of the model in GPU memory, which is a problem in a world where GPU memory is at a premium. You could of course have a single process for the GPU processing part of your model, and then automatically batch inputs into this process, etc. etc. (and people do) but all this is just to work around the lack of proper threading support in Python.
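One of the hoops mentioned above, moving large images between processes without a socket round-trip, can be done with the standard library's `multiprocessing.shared_memory`. A minimal sketch (the function names and the bytes-based payload are illustrative; a real service would typically map NumPy arrays over the buffer):

```python
from multiprocessing import shared_memory

def put_image(data: bytes) -> str:
    """Producer: copy the image into a named shared-memory block once."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    name = shm.name
    shm.close()  # detach this handle; the named block stays alive
    return name

def get_image(name: str, size: int) -> bytes:
    """Consumer (typically another process): attach by name, no socket copy."""
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    shm.unlink()  # free the block once the last consumer is done
    return data
```

The consumer only receives the block's name and size over whatever control channel you use; the pixels themselves never travel through a socket. Note the explicit `size`: the OS may round the allocation up, so the logical length has to be passed alongside the name.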
By the way, if anyone is struggling with these challenges today, I recommend taking a peek at nvidia's Triton inference server (https://github.com/triton-inference-server/server), which handles a lot of these details for you. It supports things like zero-copy sharing of tensors between parts of your model running in different processes/threads and does auto-batching between requests as well. Especially auto-batching gave us big throughput increase with a minor latency penalty!
> For example, imagine a high volume, low latency, synchronous computer vision inference service.
I'm not in this space and this is probably too simplistic, but I would think pairing asyncio to do all IO (reading / decoding requests and preparing them for inference) coupled with asyncio.to_thread'd calls to do_inference_in_C_with_the_GIL_released(my_prepared_request), would get you nearly all of the performance benefit using current Python.
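The shape of that suggestion, as I understand it, looks roughly like this. `do_inference` below is a stand-in for a C-extension call that is assumed to release the GIL internally, as most native inference libraries do:

```python
import asyncio

def do_inference(prepared: bytes) -> str:
    # Stand-in for a native call that releases the GIL while it runs.
    return f"result for {len(prepared)} bytes"

async def handle_request(raw: bytes) -> str:
    prepared = raw.strip()  # decode/prep cheaply on the event loop
    # Heavy call runs in a worker thread; the loop keeps serving other requests.
    return await asyncio.to_thread(do_inference, prepared)

async def main() -> list[str]:
    # Many requests in flight; the native calls overlap in the thread pool.
    return await asyncio.gather(*(handle_request(b"  img ") for _ in range(3)))
```

While the GIL is released inside the native call, the threads genuinely run in parallel, so this gets close to the performance the parent comment describes without leaving current Python.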
The biggest use case (that I am aware of) for GIL-less Python is feeding data into ML model training in parallel.
* PyTorch currently uses `multiprocessing` for that, but it is fraught with bugs and has less-than-ideal performance, which matters a lot for ML training (a slow data loader can starve the GPU).
* Tensorflow just discards Python for data loading. Its data loaders are actually in C++ so it has no performance problems. But it is so inflexible that it is always painful for me to load data in TF.
Given how hot ML is, and how Python is currently the major language for ML, it makes sense for them to optimize for this.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
This was the original reason for CPython to retain the GIL for a very long time, and it was probably true for most of that time. That's why the eventual GIL removal had to be paired with other important performance improvements like the JIT, which was only implemented after some feasible paths were found and was explicitly funded by a big sponsor.
Python development is done in public, so you can just benchmark against the development version to see the improvement. In fact, daily benchmarks are already posted to [1]; they indicate around a 20% improvement (corresponding to 1.25x in the table) since 3.10. The only thing you can't easily verify is whether the GIL was indeed necessary historically.
My hunch is that in just a few years time single core computers will be almost extinct. Removing the GIL now feels to me like good strategic preparation for the near future.
I can't think of any actual computer outside of embedded that has been single core for at least a decade. The Core Duo and Athlon X2 were released almost 20 years ago now and within a few years basically everything was multicore.
(When did we get old?)
If you mean that single core workloads will be extinct, well, that's a harder sell.
Single core containers are also a terrible idea. Life got much less deadlocked as soon as there were 2+ processors everywhere.
(Huh, people like hard OS design problems for marginal behavior? OSes had trouble adopting SMP but we also got to jettison a lot of deadlock discussions as soon as there was CPU 2. It only takes a few people not prioritizing 1 CPU testing at any layer to make your 1 CPU container much worse than a 2 VCPU container limited to a 1 CPU average.)
It's actually quite difficult to get a "single core" container (ie: a container with access to only one logical processor).
When you set "request: 1" in Kubernetes or another container manager, you're saying "give me 1 CPU worth of CPU time" but if the underlying Linux host has 16 logical cores your container will still see them.
Your container is free to use 1/16th of each of them, 100% of one of them, or anything in-between.
You might think this doesn't matter in the end, but it can if you have a lot of workloads on that node and those cores are busy. Your single-threaded throughput can become quite compromised as a result.
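This visibility gap is observable from Python itself: `os.cpu_count()` reports the host's logical cores regardless of any CPU quota, while `os.sched_getaffinity` (Linux-only) reflects an actual cpuset/affinity restriction. A small sketch:

```python
import os

def visible_cpus() -> int:
    # Logical cores the OS reports; a cgroup CPU *quota* does not change this,
    # which is why a "request: 1" container can still see all 16 host cores.
    return os.cpu_count() or 1

def schedulable_cpus() -> int:
    # Cores this process may actually be scheduled on (cpuset/affinity aware).
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # not available on macOS/Windows
        return visible_cpus()
```

Thread-pool sizing code that keys off `os.cpu_count()` inside a quota-limited container will therefore oversubscribe unless it checks affinity too.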
I think the point was this isn't the norm though. If you know you need to be pinned to a core you CAN configure Kubernetes to do so, but it's not the default, and therefore you may unknowingly be leaving performance on the table.
I'm quite certain you'd leave more performance on the table by pinning in general and on average.
Just let the CPU scheduler do its job. Unless you know better, in which case, by all means go ahead and allocate computational resources manually. I don't see a way to make that a sensible default, though.
Neat, I didn't know it was a single flag in Docker.
The k8s method you linked definitely has some caveats, as it doesn't allow this scheduling at the pod level and requires quite a bit of fiddling to get working (at least on GKE). This isn't even available if you use a fully managed setup like Autopilot.
Maybe my expectations just aren't realistic but "easy" to me would mean I put the affinity right next to my CPU request in the podSpec :/
Most of my pods have a CPU request >= 1 so more containers than cores is rare. But obviously that really depends on your workload(s).
I don't think the scheduler picking a different core matters much unless your workload is super cache sensitive. My point is more about access to single threaded performance. If you have a single threaded workload (ex: an ffmpeg audio encode) and you want it to be able to access as many cycles from a single core as possible, it isn't always as simple as request: 1
> What is the real world benefit we will get in return?
If you have many CPU cores and an embarrassingly parallel algorithm, multi-threaded Python can now approach the performance of a single-threaded compiled language.
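A toy version of such an embarrassingly parallel workload, split across threads with `concurrent.futures`. On a free-threaded ("nogil") build these chunks can run on separate cores; under the GIL the same code still runs correctly but without the parallel speedup. The prime-counting workload is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def count_in_range(start: int, stop: int) -> int:
    # Pure-Python CPU work: naively count primes in [start, stop).
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n**0.5) + 1))
    return sum(1 for n in range(start, stop) if is_prime(n))

def parallel_count(stop: int, workers: int = 4) -> int:
    # Split [0, stop) into contiguous chunks, one per worker thread.
    step = stop // workers
    bounds = [(i * step, stop if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(lambda b: count_in_range(*b), bounds))
```

Because the threads share the interpreter's memory, no input needs to be pickled or copied, which is exactly the advantage over `multiprocessing` discussed elsewhere in this thread.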
The question really is whether one couldn't make multiprocessing better instead of going multithreaded. I did a ton of MPI work with Python ten years ago already.
What's more, I am now seeing in Julia that multithreading doesn't scale to larger core counts (like 128) due to the garbage collector. I had to revert to multiprocessing again.
Well I once had an analytics/statistics tool that regularly chewed through a couple GBs of CSV files. After enough features had been added it took almost 5 minutes per run which got really annoying.
It took me less than an hour to add multiprocessing to analyze each file in its own process and merge the results together at the end. The runtime dropped to a couple seconds on my 24 thread machine.
It really was much easier than expected. Rewriting it in C++ would have probably taken a week.
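The "one process per file, merge at the end" pattern described above takes only a few lines with `concurrent.futures`. The `analyze` function here is a trivial stand-in for the real per-file statistics pass:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def analyze(text: str) -> Counter:
    # Stand-in: tally comma-separated fields; the real tool parsed CSV files.
    return Counter(text.split(","))

def analyze_all(inputs: list[str]) -> Counter:
    # One worker process per input, partial results merged in the parent.
    merged: Counter = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(analyze, inputs):
            merged.update(partial)
    return merged
```

The key constraint is that `Counter` results pickle cheaply back to the parent; this pattern works well precisely when the per-file output is small relative to the input.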
let results = files |> Array.Parallel.map processFile
Literally that easy.
Earlier this week, I used a ProcessPoolExecutor to run some things in their own process. I needed a bare minimum of synchronization, so I needed a queue. Well, multiprocessing has its own queue. But that queue is not joinable. So I chose the multiprocessing JoinableQueue. Well, it turns out that that queue can't be used across processes. For that, you need to get a queue from the launching process' manager. That Queue is the regular Python queue.
It is a gigantic mess. And yes, asyncio also has its own queue class. So in Python you literally have half a dozen or so queue classes that are all incompatible, have different interfaces, and have different limitations that are rarely documented.
That's just one highlight of the mess between threading, asyncio, and multiprocessing.
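A quick tour of the queue types the comment lists makes the incompatibility concrete. None of them share one interface: only some are joinable, and only some cross process boundaries (the comments on each line summarize the limitations as described above):

```python
import asyncio
import multiprocessing as mp
import queue

thread_q = queue.Queue()        # joinable, but threads in one process only
async_q = asyncio.Queue()       # joinable, but a single event loop only
proc_q = mp.Queue()             # crosses processes, but has no .join()
joinable_proc_q = mp.JoinableQueue()  # joinable, but (as the comment notes)
                                      # cannot be passed to ProcessPoolExecutor
                                      # workers; for that you need a queue from
                                      # mp.Manager().Queue()
```

Four classes, three modules, and no common protocol: code written against one cannot be pointed at another without rework, which is the mess being described.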
That's not really correct. Python is by far the slowest mainstream language. It is embarrassingly slow. Furthermore, several mainstream compiled languages have been multicore-capable for decades. So comparing against a single-threaded language or program doesn't make sense.
All this really means is that Python catches up on decades old language design.
However, it simply adds yet another design input. Python's threading, multiprocessing, and asyncio paradigms were all developed to get around the limitations of Python's performance issues and the lack of support for multicore. So my question is, how does this change affect the decision tree for selecting which paradigm(s) to use?
> Python's threading, multiprocessing, and asyncio paradigms were all developed to get around the limitations of Python's performance issues and the lack of support for multicore.
Threading is literally just Python's multithreading support, using standard OS threads, and async exists for the same reason it exists in a bunch of languages without even a GIL: OS threads have overhead, multiplexing IO-bound work over OS threads is useful.
Only multiprocessing can be construed as having been developed to get around the GIL.
No, asyncio's implementation exists because threading in Python has huge overhead for switching between threads and because threads don't use more than one core. So asyncio was introduced as a single-threaded solution aimed specifically at network-based IO.
In any other language, async is implemented on top of the threading model, both because the threading model is more efficient than Python's and because it actually supports multiple cores.
Multiprocessing isn't needed in other languages because, again, their threading models support multiple cores.
So the three, relatively incompatible paradigms of asyncio, threading, and multiprocessing specifically in Python are indeed separate attempts to account for Python's poor design. Other languages do not have this embedded complexity.
> In any other language, async is implemented on top of the threading model
There are a lot of other languages. Javascript for example is a pretty popular language where async on a single threaded event loop has been the model since the beginning.
Async is useful even if you don't have an interpreter that introduces contention on a single "global interpreter lock." Just look at all the languages without this constraint that still work to implement async more naturally than just using callbacks.
Threads in Python are very useful even without removing the gil (performance critical sections have been written as extension modules for a long time, and often release the gil).
> are indeed separate attempts to account for Python's poor design
They all have tradeoffs. There are warts, but as designed it fits a particular use case very well.
Calling Python's design "poor" is hubris.
> So my question is, how does this change affect the decision tree for selecting which paradigm(s) to use?
The only effect I can see is that it reduces the chances that you'll reach for multiprocessing, unless you're using it with a process pool spread across multiple machines (so they can't share address space anyway)
Not in the least. Python is a poorly designed language by many accounts. Despite being the most popular language in the world, what language has it significantly influenced? None of note.
> Python is a poorly designed language by many accounts
Hubris isn't rare.
> what language has it significantly influenced?
I can think of at least 1 language designer[1] who doesn't think it's "poorly designed," based on its significant impact on what they're currently working on[2]
Who cares about how many other languages a language has influenced? If that was a metric of any consideration we all would write Algol or something. Programming languages are tools, tools to help you perform a task.
>Python is by far the slowest mainstream language. It is embarrassingly slow.
Oh? It is by far the fastest language for me. No language comes close on the time from starting to write to having code that runs. For me that time far outweighs the execution time, so it is a lot more important.
What you’re describing is basically using MPI in some way, shape or form. This works, but also can introduce a lot of complexity. If your program doesn’t need to communicate, then it’s easy. But that’s not the case for all programs. Especially once we’re talking about simulations and other applications running on HPC systems.
Sometimes it’s also easier to split work using multiple threads. Other programming languages let you do that and actually use multiple threads efficiently. In Python, the benefit was just too limited due to the GIL.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
There is a lot of Python code that either explicitly (or implicitly) relies on the GIL for correctness in multithreaded programs.
I myself have even written such code, explicitly relying on the GIL as synchronization primitive.
Removing the GIL will break that code in subtle and difficult to track down ways.
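A minimal sketch of the kind of code meant here (the cache example and function names are illustrative, not from the comment): a check-then-act on shared state. Under the GIL, thread switches are coarse enough that this almost never misfires, so code like it survives in production; with finer-grained switching, two threads can both take the "miss" branch. The locked version is correct on every build:

```python
import threading

def expensive(key: str) -> int:
    # Stand-in for a costly computation worth caching.
    return len(key) * 1000

cache: dict[str, int] = {}
lock = threading.Lock()

def get_racy(key: str) -> int:
    if key not in cache:             # another thread can slip in here...
        cache[key] = expensive(key)  # ...and both compute, one overwrites
    return cache[key]

def get_safe(key: str) -> int:
    with lock:                       # correct with or without a GIL
        if key not in cache:
            cache[key] = expensive(key)
        return cache[key]
```

The insidious part is that `get_racy` returns the right answer almost every time, so the bug only surfaces as an occasional duplicated computation or torn update under load.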
The good news is that a large percentage of this code will stay running on older versions of python (2.7 even) and so will always have a GIL around.
Some of it however will end up running on no-GIL Python, and I don't envy the developers who will be tasked with tracking down the bugs - but probably they will run on modern versions of Python using --with-gil or whatever other flag is provided to enable the GIL.
The benefit to the rest of the world then is that future programs will be able to take advantage of multiple cores with shared memory, without needing to jump through the hoops of multi-process Python.
Python has been feeling the pain of the GIL in this area for many years already, and removing the GIL will make Python more viable for a whole host of applications.
> What is the real world benefit we will get in return?
None. I've been using Python "in anger" for twenty years and the GIL has been a problem zero times. It seems to me that removing the GIL will only make for more difficulty in debugging.