I'm learning about inference by running vLLM on a k8s cluster (EKS), building a ...

iugtmkbdfil834 · 2026-05-10T19:06:09 1778439969

Deeper dives into those uncover interesting limitations that don't seem to be documented anywhere. On the other hand, it is through those reverse shibboleths that I am now able to tell that my boss's boss has no idea what he is talking about llm-wise.

korbonits · 2026-05-11T22:09:12 1778537352

Have you considered using vLLM on top of Ray Serve (on EKS with KubeRay)? KubeRay makes Ray cluster-aware and there could be some optimizations you could make e.g. keeping that GPU fully utilized all the time :)

nicoinstrument · 2026-05-12T05:10:01 1778562601

Thanks for the suggestion! Have you found that Ray Serve’s built-in autoscaling plays nicely with custom SLO-based concurrency limits, or do you usually let Ray handle the load balancing entirely?"

korbonits · 2026-05-12T18:06:36 1778609196

To be honest, I don't know because I have not hit many of those limits due to what I would call "moderate" scale. So far, I have just provisioned enough pods to handle the traffic as-is without using KubeRay. So k8s is handling the load balancing adequately at the moment, but Ray serve is not cluster-aware, only pod aware, for now.