Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
What is the real world benefit we will get in return?
In the rare case where I need to max out more than one CPU core, I usually implement that by having the OS run multiple instances of my program and put a bit of parallelization logic into the program itself. Like in the mandelbrot example the author gives, I would simply tell each instance of the program which part of the image it will calculate.
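The "run multiple instances and tell each one its slice" approach can be sketched like this. This is a minimal illustration, not the commenter's actual code; the shard count, the row-striping scheme, and the worker/coordinator split are all assumptions:

```python
import subprocess
import sys

N_SHARDS = 4  # e.g. one instance per CPU core; illustrative constant

def rows_for_shard(shard_index: int, height: int = 1000) -> range:
    """Each OS-level instance computes only every N_SHARDS-th image row."""
    return range(shard_index, height, N_SHARDS)

def launch_workers(script_path: str) -> None:
    """Coordinator: start one instance of the script per shard and wait."""
    workers = [
        subprocess.Popen([sys.executable, script_path, str(i)])
        for i in range(N_SHARDS)
    ]
    for w in workers:
        w.wait()
```

The parallelization logic inside the program then reduces to reading its shard index from `sys.argv` and iterating only `rows_for_shard(index)`.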
There is an argument that if you need in process multithreading you should use a different language. But a lot of people need to use python because everything else they’re doing is in python.
There are quite a few common cases where in-process multithreading is useful. The main ones are where the work units have large inputs or large outputs. In-process is nice because you can share that input or output state with the work units instead of having to copy it.
One very common case is almost all GUI applications, where you want to be able to do all work on background threads and just move data back and forth from the coordinating UI thread. JavaScript's lack of support here, outside of a native language compiled to Emscripten, is one reason web apps are so hard to make jankless. The copies of data across web workers or Python processes are quite expensive as far as things go.
Once a week or so, I run into a high-compute Python scenario where the existing forms of multiprocessing fail me: large shared inputs, and/or I don't want the multiprocess overhead; but the GIL slows everything down.
> Where you want to be able to do all work on background threads and just move data back and forth from the coordinating ui thread. JavaScript’s lack of support here, outside of a native language compiled into emscripten, is one reason web apps are so hard to make jankless
I thought transferring array buffers through web workers didn't involve any copies if you actually transferred ownership:
worker.postMessage(view.buffer, [view.buffer]);
I can understand that web workers might be more annoying to orchestrate than native threads and the like but I’m not sure that it lacks the primitives to make it possible. More likely it’s really hard to have a pauseless GC for JS (Python predominantly relies on reference counting and uses gc just to catch cycles).
This is true, but when do you really work with array buffers in Javascript? The default choice for whatever it is that you're doing is almost always something else, save for a few edge cases, and then you're stuck trying to bend your business logic to a different data type.
That’s a choice you get to make and probably depends on your problem domain and other things. For example when I was writing R2 it was all ArrayBuffers up and down the stack. And you could use something like capnproto or flat buffers for managing your object graph within an array buffer. But yeah, being able to transfer custom object graphs would be more powerful.
There is this assumption in these discussions that anything consuming significant CPU must necessarily have a simple interface that’s easy to reduce to a C-level ABI, like calling an ML library on an image, a numpy function on two big matrices or some encryption function. Therefore it is trivial to just move these to native code with an easy, narrow interface.
This assumption is incorrect. There are plenty of problems that consist entirely of business logic manipulating large and complex object graphs. “Just rewrite the hot function in rust, bro” and “just import multiprocessing, bro” are functionally identical to rewriting most of the application for these.
The performance work of the last few years (free threading and the JIT) is very valuable for these. All the rest is already written in C.
It's a good assumption though, because it keeps (in this case kept) closed the door to the absolutely nightmarish landscape of "multithreading to the masses". Those who made it open probably see it better, but, imo and ime, it should remain closed. Maybe they'll manage to handle it this time, but I'm 95% sure it's gonna be yet another round of ass pain for the world of python.
otoh, if all of your time is spent in Python code and you have performance issues, it's time to rewrite in a different language. Correct multithreaded code is quite hard and Python is inherently really slow. The inherent complexity of multithreaded code is enough that you should just write the single-threaded version in a language that's 10x faster (of which there are many).
> “Just rewrite the hot function in rust, bro” and “just import multiprocessing, bro” are functionally identical to rewriting most of the application for these.
Isn't "just use threads, bro" likely to be equally difficult?
As always, it depends a lot on what you're doing, and a lot of people are using Python for AI.
One of the drawbacks of multi-processing versus multi-threading is that you cannot share memory (easily, cheaply) between processes. During model training, and even during inference, this becomes a problem.
For example, imagine a high volume, low latency, synchronous computer vision inference service. If you're handling each request in a different process, then you're going to have to jump through a bunch of hoops to make this performant. For example, you'll need to use shared memory to move data around, because images are large, and sockets are slow. Another issue is that each process will need a different copy of the model in GPU memory, which is a problem in a world where GPU memory is at a premium. You could of course have a single process for the GPU processing part of your model, and then automatically batch inputs into this process, etc. etc. (and people do) but all this is just to work around the lack of proper threading support in Python.
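One of the hoops mentioned above, moving large images between processes without a socket round-trip, can be done with the standard library's `multiprocessing.shared_memory`. A minimal sketch (the function names and the bytes-based payload are illustrative; a real service would typically map NumPy arrays over the buffer):

```python
from multiprocessing import shared_memory

def put_image(data: bytes) -> str:
    """Producer: copy the image into a named shared-memory block once."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[:len(data)] = data
    name = shm.name
    shm.close()  # detach this handle; the named block stays alive
    return name

def get_image(name: str, size: int) -> bytes:
    """Consumer (typically another process): attach by name, no socket copy."""
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    shm.unlink()  # free the block once the last consumer is done
    return data
```

The consumer only receives the block's name and size over whatever control channel you use; the pixels themselves never travel through a socket. Note the explicit `size`: the OS may round the allocation up, so the logical length has to be passed alongside the name.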
By the way, if anyone is struggling with these challenges today, I recommend taking a peek at nvidia's Triton inference server (https://github.com/triton-inference-server/server), which handles a lot of these details for you. It supports things like zero-copy sharing of tensors between parts of your model running in different processes/threads and does auto-batching between requests as well. Especially auto-batching gave us big throughput increase with a minor latency penalty!
> For example, imagine a high volume, low latency, synchronous computer vision inference service.
I'm not in this space and this is probably too simplistic, but I would think pairing asyncio to do all IO (reading / decoding requests and preparing them for inference) coupled with asyncio.to_thread'd calls to do_inference_in_C_with_the_GIL_released(my_prepared_request), would get you nearly all of the performance benefit using current Python.
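The shape of that suggestion, as I understand it, looks roughly like this. `do_inference` below is a stand-in for a C-extension call that is assumed to release the GIL internally, as most native inference libraries do:

```python
import asyncio

def do_inference(prepared: bytes) -> str:
    # Stand-in for a native call that releases the GIL while it runs.
    return f"result for {len(prepared)} bytes"

async def handle_request(raw: bytes) -> str:
    prepared = raw.strip()  # decode/prep cheaply on the event loop
    # Heavy call runs in a worker thread; the loop keeps serving other requests.
    return await asyncio.to_thread(do_inference, prepared)

async def main() -> list[str]:
    # Many requests in flight; the native calls overlap in the thread pool.
    return await asyncio.gather(*(handle_request(b"  img ") for _ in range(3)))
```

While the GIL is released inside the native call, the threads genuinely run in parallel, so this gets close to the performance the parent comment describes without leaving current Python.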
The biggest use case (that I am aware of) for GIL-less Python is feeding data into ML model training in parallel.
* PyTorch currently uses `multiprocessing` for that, but it is fraught with bugs and has less-than-ideal performance, which matters a lot for ML training (a slow data loader can starve the GPU).
* Tensorflow just discards Python for data loading. Its data loaders are actually in C++ so it has no performance problems. But it is so inflexible that it is always painful for me to load data in TF.
Given how hot ML is, and how Python is currently the major language for ML, it makes sense for them to optimize for this.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
This was the original reason for CPython to retain the GIL for a very long time, and it was probably true for most of that time. That's why the eventual GIL removal had to be paired with other important performance improvements like the JIT, which was only implemented after some feasible paths were found and was explicitly funded by a big sponsor.
Python development is done in public, so you can just benchmark against the development version to see the improvement. In fact, daily benchmarks are already posted to [1]; they indicate around a 20% improvement (corresponding to 1.25x in the table) since 3.10. The only thing you can't easily verify is whether the GIL was indeed necessary historically.
My hunch is that in just a few years time single core computers will be almost extinct. Removing the GIL now feels to me like good strategic preparation for the near future.
I can't think of any actual computer outside of embedded that has been single core for at least a decade. The Core Duo and Athlon X2 were released almost 20 years ago now and within a few years basically everything was multicore.
(When did we get old?)
If you mean that single core workloads will be extinct, well, that's a harder sell.
Single core containers are also a terrible idea. Life got much less deadlocked as soon as there were 2+ processors everywhere.
(Huh, people like hard OS design problems for marginal behavior? OSes had trouble adopting SMP but we also got to jettison a lot of deadlock discussions as soon as there was CPU 2. It only takes a few people not prioritizing 1 CPU testing at any layer to make your 1 CPU container much worse than a 2 VCPU container limited to a 1 CPU average.)
It's actually quite difficult to get a "single core" container (ie: a container with access to only one logical processor).
When you set "request: 1" in Kubernetes or another container manager, you're saying "give me 1 CPU worth of CPU time" but if the underlying Linux host has 16 logical cores your container will still see them.
Your container is free to use 1/16th of each of them, 100% of one of them, or anything in-between.
You might think this doesn't matter in the end, but it can if you have a lot of workloads on that node and those cores are busy. Your single-threaded throughput can become quite compromised as a result.
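This visibility gap is observable from Python itself: `os.cpu_count()` reports the host's logical cores regardless of any CPU quota, while `os.sched_getaffinity` (Linux-only) reflects an actual cpuset/affinity restriction. A small sketch:

```python
import os

def visible_cpus() -> int:
    # Logical cores the OS reports; a cgroup CPU *quota* does not change this,
    # which is why a "request: 1" container can still see all 16 host cores.
    return os.cpu_count() or 1

def schedulable_cpus() -> int:
    # Cores this process may actually be scheduled on (cpuset/affinity aware).
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # not available on macOS/Windows
        return visible_cpus()
```

Thread-pool sizing code that keys off `os.cpu_count()` inside a quota-limited container will therefore oversubscribe unless it checks affinity too.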
I think the point was this isn't the norm though. If you know you need to be pinned to a core you CAN configure Kubernetes to do so, but it's not the default, and therefore you may unknowingly be leaving performance on the table.
I'm quite certain you'd leave more performance on the table by pinning in general and on average.
Just let the CPU scheduler do its job. Unless you know better, in which case, by all means go ahead and allocate computational resources manually. I don't see a way to make that a sensible default, though.
Neat, I didn't know it was a single flag in Docker.
The k8s method you linked definitely has some caveats, as it doesn't allow this scheduling at the pod level and requires quite a bit of fiddling to get working (at least on GKE). This isn't even available if you use a fully managed setup like Autopilot.
Maybe my expectations just aren't realistic but "easy" to me would mean I put the affinity right next to my CPU request in the podSpec :/
Most of my pods have a CPU request >= 1 so more containers than cores is rare. But obviously that really depends on your workload(s).
I don't think the scheduler picking a different core matters much unless your workload is super cache sensitive. My point is more about access to single threaded performance. If you have a single threaded workload (ex: an ffmpeg audio encode) and you want it to be able to access as many cycles from a single core as possible, it isn't always as simple as request: 1
> What is the real world benefit we will get in return?
If you have many CPU cores and an embarrassingly parallel algorithm, multi-threaded Python can now approach the performance of a single-threaded compiled language.
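A toy version of such an embarrassingly parallel workload, split across threads with `concurrent.futures`. On a free-threaded ("nogil") build these chunks can run on separate cores; under the GIL the same code still runs correctly but without the parallel speedup. The prime-counting workload is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def count_in_range(start: int, stop: int) -> int:
    # Pure-Python CPU work: naively count primes in [start, stop).
    def is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % d for d in range(2, int(n**0.5) + 1))
    return sum(1 for n in range(start, stop) if is_prime(n))

def parallel_count(stop: int, workers: int = 4) -> int:
    # Split [0, stop) into contiguous chunks, one per worker thread.
    step = stop // workers
    bounds = [(i * step, stop if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return sum(ex.map(lambda b: count_in_range(*b), bounds))
```

Because the threads share the interpreter's memory, no input needs to be pickled or copied, which is exactly the advantage over `multiprocessing` discussed elsewhere in this thread.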
The question really is whether one couldn't make multiprocessing better instead of going multithreaded. I did a ton of MPI work with Python ten years ago already.
What's more, I am now seeing in Julia that multithreading doesn't scale to larger core counts (like 128) due to the garbage collector. I had to revert to multiprocessing again.
Well I once had an analytics/statistics tool that regularly chewed through a couple GBs of CSV files. After enough features had been added it took almost 5 minutes per run which got really annoying.
It took me less than an hour to add multiprocessing to analyze each file in its own process and merge the results together at the end. The runtime dropped to a couple seconds on my 24 thread machine.
It really was much easier than expected. Rewriting it in C++ would have probably taken a week.
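The "one process per file, merge at the end" pattern described above takes only a few lines with `concurrent.futures`. The `analyze` function here is a trivial stand-in for the real per-file statistics pass:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def analyze(text: str) -> Counter:
    # Stand-in: tally comma-separated fields; the real tool parsed CSV files.
    return Counter(text.split(","))

def analyze_all(inputs: list[str]) -> Counter:
    # One worker process per input, partial results merged in the parent.
    merged: Counter = Counter()
    with ProcessPoolExecutor() as pool:
        for partial in pool.map(analyze, inputs):
            merged.update(partial)
    return merged
```

The key constraint is that `Counter` results pickle cheaply back to the parent; this pattern works well precisely when the per-file output is small relative to the input.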
let results = files |> Array.Parallel.map processFile
Literally that easy.
Earlier this week, I used a ProcessPoolExecutor to run some things in their own process. I needed a bare minimum of synchronization, so I needed a queue. Well, multiprocessing has its own queue. But that queue is not joinable. So I chose the multiprocessing JoinableQueue. Well, it turns out that that queue can't be used across processes. For that, you need to get a queue from the launching process' manager. That Queue is the regular Python queue.
It is a gigantic mess. And yes, asyncio also has its own queue class. So in Python you literally have half a dozen or so queue classes that are all incompatible, have different interfaces, and have different limitations that are rarely documented.
That's just one highlight of the mess between threading, asyncio, and multiprocessing.
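A quick tour of the queue types the comment lists makes the incompatibility concrete. None of them share one interface: only some are joinable, and only some cross process boundaries (the comments on each line summarize the limitations as described above):

```python
import asyncio
import multiprocessing as mp
import queue

thread_q = queue.Queue()        # joinable, but threads in one process only
async_q = asyncio.Queue()       # joinable, but a single event loop only
proc_q = mp.Queue()             # crosses processes, but has no .join()
joinable_proc_q = mp.JoinableQueue()  # joinable, but (as the comment notes)
                                      # cannot be passed to ProcessPoolExecutor
                                      # workers; for that you need a queue from
                                      # mp.Manager().Queue()
```

Four classes, three modules, and no common protocol: code written against one cannot be pointed at another without rework, which is the mess being described.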
That's not really correct. Python is by far the slowest mainstream language. It is embarrassingly slow. Furthermore, several mainstream compiled languages have been multicore-capable for decades. So comparing against a single-threaded language or program doesn't make sense.
All this really means is that Python catches up on decades old language design.
However, it simply adds yet another design input. Python's threading, multiprocessing, and asyncio paradigms were all developed to get around the limitations of Python's performance issues and the lack of support for multicore. So my question is, how does this change affect the decision tree for selecting which paradigm(s) to use?
> Python's threading, multiprocessing, and asyncio paradigms were all developed to get around the limitations of Python's performance issues and the lack of support for multicore.
Threading is literally just Python's multithreading support, using standard OS threads, and async exists for the same reason it exists in a bunch of languages without even a GIL: OS threads have overhead, multiplexing IO-bound work over OS threads is useful.
Only multiprocessing can be construed as having been developed to get around the GIL.
No, asyncio's implementation exists because threading in Python has huge overhead for switching between threads and because threads don't use more than one core. So asyncio was introduced as a single-threaded solution aimed specifically at network-based IO.
In any other language, async is implemented on top of the threading model, both because the threading model is more efficient than Python's and because it actually supports multiple cores.
Multiprocessing isn't needed in other languages because, again, their threading models support multiple cores.
So the three, relatively incompatible paradigms of asyncio, threading, and multiprocessing specifically in Python are indeed separate attempts to account for Python's poor design. Other languages do not have this embedded complexity.
> In any other language, async is implemented on top of the threading model
There are a lot of other languages. Javascript for example is a pretty popular language where async on a single threaded event loop has been the model since the beginning.
Async is useful even if you don't have an interpreter that introduces contention on a single "global interpreter lock." Just look at all the languages without this constraint that still work to implement async more naturally than just using callbacks.
Threads in Python are very useful even without removing the gil (performance critical sections have been written as extension modules for a long time, and often release the gil).
> are indeed separate attempts to account for Python's poor design
They all have tradeoffs. There are warts, but as designed it fits a particular use case very well.
Calling Python's design "poor" is hubris.
> So my question is, how does this change affect the decision tree for selecting which paradigm(s) to use?
The only effect I can see is that it reduces the chances that you'll reach for multiprocessing, unless you're using it with a process pool spread across multiple machines (so they can't share address space anyway)
Not in the least. Python is a poorly designed language by many accounts. Despite being the most popular language in the world, what language has it significantly influenced? None of note.
> Python is a poorly designed language by many accounts
Hubris isn't rare.
> what language has it significantly influenced?
I can think of at least 1 language designer[1] who doesn't think it's "poorly designed," based on its significant impact on what they're currently working on[2]
Who cares about how many other languages a language has influenced? If that was a metric of any consideration we all would write Algol or something. Programming languages are tools, tools to help you perform a task.
>Python is by far the slowest mainstream language. It is embarrassingly slow.
Oh? It is by far the fastest language for me. No language comes close on the time from starting to write to having code that runs. For me that time far outweighs the execution time, so it is a lot more important.
What you’re describing is basically using MPI in some way, shape or form. This works, but also can introduce a lot of complexity. If your program doesn’t need to communicate, then it’s easy. But that’s not the case for all programs. Especially once we’re talking about simulations and other applications running on HPC systems.
Sometimes it’s also easier to split work using multiple threads. Other programming languages let you do that and actually use multiple threads efficiently. In Python, the benefit was just too limited due to the GIL.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
There is a lot of Python code that either explicitly (or implicitly) relies on the GIL for correctness in multithreaded programs.
I myself have even written such code, explicitly relying on the GIL as synchronization primitive.
Removing the GIL will break that code in subtle and difficult to track down ways.
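A minimal sketch of the kind of code meant here (the cache example and function names are illustrative, not from the comment): a check-then-act on shared state. Under the GIL, thread switches are coarse enough that this almost never misfires, so code like it survives in production; with finer-grained switching, two threads can both take the "miss" branch. The locked version is correct on every build:

```python
import threading

def expensive(key: str) -> int:
    # Stand-in for a costly computation worth caching.
    return len(key) * 1000

cache: dict[str, int] = {}
lock = threading.Lock()

def get_racy(key: str) -> int:
    if key not in cache:             # another thread can slip in here...
        cache[key] = expensive(key)  # ...and both compute, one overwrites
    return cache[key]

def get_safe(key: str) -> int:
    with lock:                       # correct with or without a GIL
        if key not in cache:
            cache[key] = expensive(key)
        return cache[key]
```

The insidious part is that `get_racy` returns the right answer almost every time, so the bug only surfaces as an occasional duplicated computation or torn update under load.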
The good news is that a large percentage of this code will stay running on older versions of python (2.7 even) and so will always have a GIL around.
Some of it however will end up running on no-GIL Python, and I don't envy the developers who will be tasked with tracking down the bugs - but probably they will run on modern versions of Python using --with-gil or whatever other flag is provided to enable the GIL.
The benefit to the rest of the world then is that future programs will be able to take advantage of multiple cores with shared memory, without needing to jump through the hoops of multi-process Python.
Python has been feeling the pain of the GIL in this area for many years already, and removing the GIL will make Python more viable for a whole host of applications.
> What is the real world benefit we will get in return?
None. I've been using Python "in anger" for twenty years and the GIL has been a problem zero times. It seems to me that removing the GIL will only make for more difficulty in debugging.