
Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding: weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, IME you also want a larger chunked prefill size and a larger max batch size for CUDA graphs than the defaults.
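
To put rough numbers on the "weird prompts are slower" part, here is a minimal sketch using the standard idealized model of speculative decoding. It assumes each drafted token is accepted independently with probability alpha and that k tokens are drafted per step; the function name and the sample values are made up for illustration, not measurements from any real setup.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Idealized assumption: each of the k drafted tokens is accepted
    independently with probability alpha, and the target model always
    contributes one token of its own at the first rejection (or at the end).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted: k drafted tokens plus one from the target.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates: low for unusual prompts, high for
    # predictable output such as code diffs.
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, k=4), 2))
```

In this model the tokens per step climb steeply as the acceptance rate approaches 1, which is why predictable output benefits so much more than unusual prompts.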

