I'm learning about inference by running vLLM on a k8s cluster (EKS), building a gateway to keep a <2s TTFT SLO.
Most recent aha moment: I kept wondering whether it was normal that each vLLM engine in my cluster could only process 4 requests per second (that just seemed really low to me).
I realized a better metric is in-flight requests: each engine is processing 70 requests at any given time, streaming tokens for over 30s.
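For anyone following along, requests per second and in-flight requests are tied together by Little's Law (in-flight = arrival rate x average time in the system). Below is a minimal sketch of that arithmetic using the rough figures above. The function names are made up for illustration, and it assumes the engine is in a steady state, which real traffic only approximates.

```python
def in_flight(throughput_rps: float, mean_latency_s: float) -> float:
    """Little's Law: average number of requests in the system (L = lambda * W)."""
    return throughput_rps * mean_latency_s


def mean_latency(in_flight_requests: float, throughput_rps: float) -> float:
    """Little's Law rearranged: average time a request spends in the system."""
    if throughput_rps <= 0:
        raise ValueError("throughput_rps must be positive")
    return in_flight_requests / throughput_rps


# Figures quoted in the post: ~4 req/s and ~70 requests in flight per engine.
avg_seconds = mean_latency(70, 4)
print(f"implied average time per request: {avg_seconds} s")
```

With those numbers the implied average is 17.5s per request, so "over 30s" presumably describes the longer streams rather than the mean. Either way, 4 requests per second is not low when each request occupies the engine for tens of seconds.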
Digging deeper into these metrics uncovers interesting limitations that don't seem to be documented anywhere. On the other hand, it is through these reverse shibboleths that I can now tell that my boss's boss has no idea what he is talking about, LLM-wise.
Have you considered using vLLM on top of Ray Serve (on EKS with KubeRay)? KubeRay makes Ray cluster-aware, and there may be some optimizations you could make, e.g. keeping the GPU fully utilized all the time :)
Thanks for the suggestion! Have you found that Ray Serve's built-in autoscaling plays nicely with custom SLO-based concurrency limits, or do you usually let Ray handle the load balancing entirely?
To be honest, I don't know, because I have not hit many of those limits at what I would call "moderate" scale. So far, I have just provisioned enough pods to handle the traffic as-is, without using KubeRay. So k8s is handling the load balancing adequately at the moment, but without KubeRay, Ray Serve is not cluster-aware, only pod-aware, for now.
Code: https://github.com/Nicolas-Richard/vllm-on-eks