Tutorial A
Date & Time: December 2, 2025 (Tuesday), 17:30-19:00
Location: China Merchants Hall 1
Title: Knowledge Graph-Based Orchestration: From Business Context to Autonomous Scaling in Kubernetes
Presenters:
Abstract: Modern service-oriented systems, particularly those deployed on platforms like Kubernetes, are confronted with a 'growing complexity gap'. Standard orchestration tools, while efficient at managing resources, frequently lack the business context required for intelligent decision-making. This tutorial introduces a paradigm shift known as Knowledge Graph-based orchestration. The core concept is to build and maintain a dynamic Knowledge Graph that models the entire IT landscape, connecting high-level business demands with low-level infrastructure resources. This unified, real-time 'brain' bridges the gap between technical alerts and their actual business impact. In practice, the approach moves beyond simple metric-based autoscaling toward truly autonomous, service-oriented operations. The tutorial features a hands-on case study demonstrating how to use the graph to make context-aware scaling decisions within a Kubernetes environment. The session is intended for researchers, PhD students, cloud architects, and DevOps/SRE engineers. A foundational understanding of SOA and cloud computing is required and basic familiarity with Kubernetes concepts is recommended, but no prior expertise in Knowledge Graphs is necessary.
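As a flavor of the hands-on case study, the following minimal sketch (our illustration, not the presenters' implementation) shows how business context retrieved from a knowledge graph could drive a scaling decision through the official Kubernetes Python client. The SERVICE_CONTEXT mapping, the scale_for_alert helper, and the criticality-based policy are hypothetical stand-ins for a real graph query and decision model.

    # Minimal sketch: map a technical alert to its business context via a
    # knowledge-graph lookup, then scale the affected Deployment with the
    # official Kubernetes Python client. The mapping and policy are assumptions.
    from kubernetes import client, config

    # Hypothetical knowledge-graph view: service -> business context
    SERVICE_CONTEXT = {
        "checkout-api": {"business_capability": "order-processing",
                         "criticality": "high",
                         "deployment": "checkout-api",
                         "namespace": "shop"},
    }

    def scale_for_alert(service: str, current_replicas: int) -> None:
        ctx = SERVICE_CONTEXT.get(service)
        if ctx is None:
            return  # no business context known; fall back to default autoscaling
        # Illustrative policy: scale high-criticality services more aggressively.
        factor = 2 if ctx["criticality"] == "high" else 1
        target = current_replicas * factor

        config.load_kube_config()  # or config.load_incluster_config() in-cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name=ctx["deployment"], namespace=ctx["namespace"],
            body={"spec": {"replicas": target}},
        )

    scale_for_alert("checkout-api", current_replicas=3)

The design point the sketch tries to capture is that the replica target is derived from business criticality in the graph rather than from raw CPU or memory metrics.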
Tutorial B
Date & Time: December 2, 2025 (Tuesday), 17:30-19:00
Location: China Merchants Hall 2
Title: Cloud-native Systems for Fine-grained and Dynamic LLMs Serving
Presenters:
Abstract: The proliferation of Large Language Model (LLM) inference services in cloud environments has created unprecedented demand for efficient resource management. As companies like Amazon, Google, and Microsoft invest heavily in Model-as-a-Service (MaaS) paradigms, elastic scaling has become a critical capability for balancing service availability with cost efficiency. This tutorial explores cloud-native approaches to LLM serving, addressing three fundamental challenges: imprecise workload profiling due to variable input/output lengths, coarse-grained scaling at the instance level, and inflexible deployment configurations. We introduce module-level elastic scaling techniques that operate at Transformer-layer granularity, achieving up to 4× throughput improvements over traditional approaches. Through hands-on demonstrations of our CoCoServe platform, participants will learn how intelligent request scheduling, fine-grained resource orchestration, and dynamic deployment strategies can dramatically enhance LLM serving efficiency. The tutorial bridges cutting-edge research with practical implementation, offering valuable insights for the service computing community as it navigates the rapidly evolving landscape of AI services.
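To make Transformer-layer granularity concrete, the sketch below is a hedged illustration of a module-level scaling decision; it is not the CoCoServe API. The LayerProfile structure, the plan_layer_replication helper, and the budget-based policy are assumptions for the example.

    # Illustrative sketch of module-level scaling: given per-layer latency
    # profiles, replicate only the Transformer layers that dominate end-to-end
    # latency instead of cloning the whole model instance.
    from dataclasses import dataclass

    @dataclass
    class LayerProfile:
        index: int
        p99_latency_ms: float
        replicas: int = 1

    def plan_layer_replication(layers, latency_budget_ms: float):
        """Return {layer index: target replicas} for layers exceeding the budget."""
        plan = {}
        for layer in layers:
            if layer.p99_latency_ms > latency_budget_ms:
                # Assumed policy: add replicas in proportion to how far the
                # layer overshoots the per-layer latency budget.
                plan[layer.index] = layer.replicas + int(
                    layer.p99_latency_ms // latency_budget_ms)
        return plan

    profiles = [LayerProfile(0, 3.1), LayerProfile(1, 9.8), LayerProfile(2, 4.0)]
    print(plan_layer_replication(profiles, latency_budget_ms=5.0))
    # -> {1: 2}: only layer 1 gets an extra replica, keeping scaling fine-grained.

Replicating only the overloaded layers is what distinguishes module-level scaling from instance-level scaling, where an entire model copy would be spun up.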
Tutorial C
Date & Time: December 2, 2025 (Tuesday), 17:30-19:00
Location: China Merchants Hall 3
Title: Ubiquitous LLM Inference as a Service for Next-Generation Distributed Autonomous Devices
Presenters:
Abstract: Large Language Models are moving beyond cloud-only deployments into ubiquitous intelligence for robots, drones, and other autonomous systems that operate under tight latency and energy constraints. This tutorial frames LLM inference as a service spanning cloud, edge, and device. We outline systems techniques that address these constraints, including efficient on-device multimodal models (VLMs and omni models) on resource-constrained hardware such as smartphones and drones, emphasizing algorithm-level and system-level co-design for practical performance. We also illustrate adaptive expert routing in mixture-of-experts models, hierarchical and context caching with reuse, collaborative pipelines, and service-oriented orchestration. Participants will gain architectural blueprints and runtime mechanisms for dependable, scalable, and context-aware inference services. The tutorial distills techniques featured in venues such as ASPLOS and MobiCom, and uses our open-source inference engine mllm (https://github.com/UbiquitousLearning/mllm) to demonstrate real implementations and applications. We further discuss when to centralize versus partition, and how to co-optimize placement with service objectives through graceful degradation, failover, and monitoring for edge workloads. Short case studies in services computing with emerging AI inference infrastructure will show patterns that support intelligence anywhere, anytime.
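As a simple illustration of service-oriented orchestration across tiers, the sketch below (our own example, not part of mllm) routes a request to device, edge, or cloud under latency and energy constraints. The tier table, thresholds, and the route helper are assumed values for illustration only.

    # Illustrative routing sketch: choose where to run an inference request
    # across device, edge, and cloud. Estimates refer to latency seen by the
    # request and energy spent on the device itself (offloading spends less).
    TIERS = {
        "device": {"est_latency_ms": 120, "device_energy_mj": 900, "available": True},
        "edge":   {"est_latency_ms": 60,  "device_energy_mj": 150, "available": True},
        "cloud":  {"est_latency_ms": 45,  "device_energy_mj": 50,  "available": True},
    }

    def route(deadline_ms: float, battery_low: bool, offline: bool) -> str:
        if offline:
            return "device"  # graceful degradation: fall back to on-device inference
        candidates = [(name, t) for name, t in TIERS.items()
                      if t["available"] and t["est_latency_ms"] <= deadline_ms]
        if not candidates:
            return "device"
        # Prefer the tier that spends the least device energy when the battery
        # is low, otherwise the lowest-latency tier.
        key = ((lambda nt: nt[1]["device_energy_mj"]) if battery_low
               else (lambda nt: nt[1]["est_latency_ms"]))
        return min(candidates, key=key)[0]

    print(route(deadline_ms=80, battery_low=True, offline=False))  # -> "cloud"
    print(route(deadline_ms=80, battery_low=False, offline=True))  # -> "device"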