2026 “核理安邦” Joint Doctoral Student Academic Forum

Name: 2026 “核理安邦” Joint Doctoral Student Academic Forum
Start: 2026-05-23T08:00:00+08:00
End: 2026-05-24T17:00:00+08:00
Location: Tsinghua University campus

23–24 May 2026

Asia/Shanghai timezone

刘孝、吕博群、谢彦哲、唐子凌

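A Reliability-Aware Scheduling Method for Disaggregated Large Language Model Inference Systems under Structural Network Degradation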

Not scheduled

12m

30 Shuangqing Road, Haidian District, Beijing

Oral presentation, Artificial Intelligence

Speaker: Feng Huang (Tsinghua University)

摘要

随着大语言模型推理服务从单体式架构转向 Prefill-Decode 解耦架构，跨节点 KV Cache 传输逐渐成为影响系统可靠性的关键因素。长上下文请求会产生大规模状态迁移，使推理服务的成功完成不仅依赖 GPU 计算能力和显存容量，也依赖数据中心网络路径的可用性。现有调度方法多依赖本地队列长度、连接数或显存余量进行逐请求路由，能够较好处理局部计算扰动，但在结构性网络退化下可能产生误判。当核心链路带宽下降时，请求可能在 KV Cache 传输阶段阻塞并最终超时，而该过程并不一定及时反映到目标节点的本地队列指标中，从而诱发持续的超时级联失效。

本文提出 OTF-R，一种面向解耦式大语言模型推理系统的双环可靠性调度方法。该方法通过慢环聚合全局拓扑与资源状态，将网络退化和资源拥塞转换为 pod 级风险价格信号；快环在逐请求尺度融合本地队列、显存状态与全局风险价格，实现故障感知路由。基于 SimPy 离散事件仿真，本文在 64 节点 Fat-Tree 推理集群中注入计算退化和核心链路退化故障，并采用失败率、退化可用性、韧性指数、超时失败数和失败构成等指标进行评估。结果表明，局部计算退化主要可由本地队列感知调度处理，而结构性网络退化需要全局风险价格抑制超时级联。在最严重网络退化场景下，OTF-R 将平均失败率由 20.1% 降低至 5.9%，将退化可用性由 0.32 提升至 0.99，显著提升了系统在退化工况下的可靠性与韧性。

Abstract

As large language model (LLM) inference systems move from monolithic serving architectures to prefill-decode disaggregation, cross-node key-value (KV) cache transfer becomes a critical factor in system reliability. Long-context requests generate large state transfers, so successful inference depends not only on GPU compute capacity and memory availability but also on the availability of data-center network paths. Existing schedulers usually rely on local queue length, active connections, or memory availability for per-request routing. Such local policies handle component-level compute perturbations well, but they can misjudge node health under structural network degradation. When core-link bandwidth drops, requests can stall during KV-cache transfer and eventually time out, yet the stall is not necessarily reflected promptly in the target node's local queue indicators. The result can be sustained, timeout-driven cascading failures.
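To give a rough sense of why a degraded core link can push a long-context request past its deadline, the following back-of-the-envelope sketch estimates KV-cache size and idealized transfer time. The model shape, request length, and bandwidth figures are hypothetical values chosen for illustration; none of them come from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV-cache size for one request: keys and values (factor 2) per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def transfer_seconds(size_bytes, bandwidth_gbps):
    """Idealized transfer time over a path with the given bandwidth in gigabits per second."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical model shape and request length (assumptions, not from the paper).
size = kv_cache_bytes(num_tokens=32_000, num_layers=32, num_kv_heads=8, head_dim=128)

healthy = transfer_seconds(size, bandwidth_gbps=100.0)  # assumed healthy path
degraded = transfer_seconds(size, bandwidth_gbps=5.0)   # assumed degraded core link

print(f"KV cache size: {size / 1e9:.2f} GB")
print(f"healthy path:  {healthy:.2f} s")
print(f"degraded path: {degraded:.2f} s")
```

Under these assumptions the transfer time grows in inverse proportion to the remaining bandwidth, while nothing in this calculation touches the target node's queue length, which is the blind spot the abstract describes.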

This paper proposes OTF-R, a dual-loop reliability-aware scheduling method for disaggregated LLM inference systems. The slow loop aggregates global topology and resource states and converts network degradation and resource congestion into a pod-level risk price. The fast loop then combines local queue information, memory state, and the global risk price to perform fault-aware routing for each request. Using a SimPy-based discrete-event simulator, we evaluate OTF-R on a 64-node Fat-Tree inference cluster with injected compute degradation and core-link degradation. Reliability is assessed using failure rate, degraded availability, resilience index, timeout failure count, and failure composition. The results show that component-level compute degradation can largely be handled by local queue-aware scheduling, whereas structural network degradation requires global risk pricing to suppress timeout cascades. Under the most severe network degradation scenario, OTF-R reduces the mean failure rate from 20.1% to 5.9% and improves degraded availability from 0.32 to 0.99, significantly enhancing system reliability and resilience under degraded operation.
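The dual-loop idea can be illustrated with a minimal sketch. The abstract does not give the OTF-R pricing rule or routing cost, so the linear price, the weighted cost, the weights, and all names below (`Node`, `slow_loop_prices`, `fast_loop_route`) are assumptions made for illustration and should not be read as the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    pod: str
    queue_len: int        # requests waiting locally
    mem_free_frac: float  # fraction of GPU memory still free, 0..1


def slow_loop_prices(pod_bandwidth_frac, pod_utilization, w_net=1.0, w_res=0.5):
    """Slow loop: turn global state into one risk price per pod.

    pod_bandwidth_frac: pod -> remaining fraction of nominal core-link bandwidth
    pod_utilization:    pod -> aggregate resource utilization, 0..1
    The linear rule and the weights are assumptions, not the paper's formula.
    """
    return {
        pod: w_net * (1.0 - pod_bandwidth_frac[pod]) + w_res * pod_utilization[pod]
        for pod in pod_bandwidth_frac
    }


def fast_loop_route(nodes, prices, alpha=1.0, beta=2.0, gamma=20.0):
    """Fast loop: per-request choice that adds the pod price to local signals."""
    def cost(node):
        return (alpha * node.queue_len
                + beta * (1.0 - node.mem_free_frac)
                + gamma * prices[node.pod])
    return min(nodes, key=cost)


nodes = [
    Node("n0", pod="A", queue_len=1, mem_free_frac=0.8),  # short queue, degraded pod
    Node("n1", pod="B", queue_len=4, mem_free_frac=0.6),  # longer queue, healthy pod
]
prices = slow_loop_prices(
    pod_bandwidth_frac={"A": 0.2, "B": 1.0},
    pod_utilization={"A": 0.3, "B": 0.5},
)

local_only = min(nodes, key=lambda n: n.queue_len)  # queue-length-only baseline
risk_aware = fast_loop_route(nodes, prices)
print(local_only.name, risk_aware.name)
```

In this toy setup the queue-only baseline picks the node behind the degraded link because its queue is shorter, while the price-aware router steers the request to the healthy pod. In a real system the slow loop would presumably refresh the prices on a longer period than the per-request fast loop.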

关键词	大语言模型推理；解耦式服务；可靠性调度；结构性网络退化；级联失效；韧性控制
Keywords	large language model inference; disaggregated serving; reliability-aware scheduling; structural network degradation; cascading failure; resilience control

Author: Feng Huang (Tsinghua University)

Presentation materials: OTF.pdf
