
Hands-On LLM Serving and Optimization: Hosting LLMs at Scale
Authors: Chi Wang, Peiheng Hu
- Publisher: O’Reilly Media
- Publication Date: June 9, 2026
- Edition: 1st
- Language: English
- Print length: 372 pages
- ASIN: B0G48JRRMF
- ISBN-13: 9798341621497
Book Description
Large language models (LLMs) are rapidly becoming the backbone of AI-driven applications. Without proper optimization, however, LLMs can be expensive to run, slow to serve, and prone to performance bottlenecks. As demand for real-time AI applications grows, Hands-On LLM Serving and Optimization offers a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.
In this hands-on book, authors Chi Wang and Peiheng Hu take a real-world approach backed by practical examples and code, assembling essential strategies for designing robust infrastructure that is equal to the demands of modern AI applications. Whether you're building high-performance AI systems or looking to deepen your knowledge of LLM optimization, this book aims to be an indispensable resource.
- Learn the key principles for designing a model-serving system tailored to popular business scenarios
- Understand the common challenges of hosting LLMs at scale while minimizing costs
- Pick up practical techniques for optimizing LLM serving performance
- Build a model-serving system that meets specific business requirements
- Improve LLM serving throughput and reduce latency
- Host LLMs in a cost-effective manner, balancing performance and resource efficiency
Editorial Reviews
About the Author
Peiheng Hu is an accomplished machine learning engineer with over 10 years of industry experience and expertise in building robust, large-scale AI-driven systems on the cloud. He holds a Master of Science in Computational Science & Engineering from Harvard University and a Bachelor of Science in Industrial Engineering and Operations Research from the Georgia Institute of Technology. Peiheng currently serves as a Principal Member of Technical Staff and ML Engineer at Salesforce, where he leads teams in developing cutting-edge machine learning inferencing solutions, including the launch of Salesforce's only unified ML inferencing solution, which now handles thousands of requests per second, and a novel automated model optimization framework for large language models (LLMs). His work has significantly enhanced model inference performance, scalability, and cost-efficiency, saving millions in hardware expenses.
Brief Table of Contents (Not Yet Final)
Chapter 1: Introduction to Model Serving and Optimization (available)
Chapter 2: Large Language Model (LLM) Serving (available)
Chapter 3: Model Serving Best Practices and Case Studies (available)
Chapter 4: Build an Agent Application with LLM from Scratch (unavailable)
Chapter 5: Performance Challenges When Serving LLMs (available)


