Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
This paper proposes a flow-controlled scheduling method for LLM inference that provides provable stability guarantees while improving efficiency in high-throughput deployments. The approach dynamically balances request load to reduce latency and sustain throughput under variable demand, making it suitable for production LLM serving at scale.
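The abstract does not specify the scheduling policy, but the core idea of flow control, capping the admitted load so the request queue stays bounded, can be illustrated with a minimal sketch. The `FlowControlledScheduler` class, its `window` parameter, and the method names below are hypothetical illustrations, not the paper's actual algorithm: at most `window` requests are in flight at once, and further arrivals wait in a FIFO queue until a slot frees.

```python
from collections import deque


class FlowControlledScheduler:
    """Hypothetical sketch of flow-controlled admission (not the paper's method).

    At most `window` requests run concurrently; excess arrivals queue.
    Keeping admitted load below service capacity is what yields stability
    (a bounded queue) in the queueing-theoretic sense.
    """

    def __init__(self, window: int):
        self.window = window      # max concurrent (in-flight) requests
        self.queue = deque()      # waiting requests, FIFO order
        self.in_flight = 0        # currently admitted requests

    def submit(self, request_id) -> None:
        """Enqueue a new request and admit it if a slot is free."""
        self.queue.append(request_id)
        self._admit()

    def complete(self) -> None:
        """Mark one in-flight request finished, freeing a slot."""
        assert self.in_flight > 0
        self.in_flight -= 1
        self._admit()

    def _admit(self) -> None:
        """Move queued requests into flight up to the window limit."""
        while self.queue and self.in_flight < self.window:
            self.queue.popleft()
            self.in_flight += 1


# Usage: with window=2, only 2 of 5 submitted requests run; 3 wait.
sched = FlowControlledScheduler(window=2)
for i in range(5):
    sched.submit(i)
```

A real serving stack would additionally account for per-request token budgets and KV-cache memory when sizing the admission window, but the bounded-queue mechanism is the same.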
