Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
This paper presents a flow-controlled scheduling method for large language model (LLM) inference that offers provable stability guarantees and improves inference efficiency under high load. The technique targets scalable production deployment of LLMs, reducing tail latency in high-volume inference scenarios.
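The abstract does not spell out the admission mechanism, so as a rough illustration only, a window-based flow-control scheduler might look like the sketch below. The class name, the `max_in_flight` parameter, and the queue discipline are all assumptions for exposition, not the paper's method; the shared idea is that bounding in-flight work keeps queues from growing without limit, which is the usual route to a stability argument.

```python
from collections import deque

class FlowControlledScheduler:
    """Illustrative sketch (not the paper's algorithm): admit inference
    requests into service only while the in-flight count stays below a
    fixed window, holding the rest in a backlog queue. Bounding in-flight
    work is the classic flow-control route to stability."""

    def __init__(self, max_in_flight: int):
        # Hypothetical tuning parameter: admission window size.
        self.max_in_flight = max_in_flight
        self.backlog: deque = deque()  # requests waiting for admission
        self.in_flight: set = set()    # requests currently being served

    def submit(self, request_id: str) -> None:
        # New work always enters the backlog first, then admission runs.
        self.backlog.append(request_id)
        self._admit()

    def complete(self, request_id: str) -> None:
        # Finishing a request frees capacity, letting the backlog drain.
        self.in_flight.discard(request_id)
        self._admit()

    def _admit(self) -> None:
        # Move backlog entries into service up to the admission window.
        while self.backlog and len(self.in_flight) < self.max_in_flight:
            self.in_flight.add(self.backlog.popleft())
```

For example, submitting five requests with a window of two leaves two in flight and three in the backlog; each completion admits the next waiting request, so the in-flight set never exceeds the window.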