
Hands-On LLM Serving and Optimization: Hosting LLMs at Scale
Authors: Chi Wang, Peiheng Hu
- Publisher: O’Reilly Media
- Publication Date: June 9, 2026
- Edition: 1st
- Language: English
- Print length: 372 pages
- ASIN: B0G48JRRMF
- ISBN-13: 9798341621497
Book Description
Large language models (LLMs) are rapidly becoming the backbone of AI-driven applications. Without proper optimization, however, LLMs can be expensive to run, slow to serve, and prone to performance bottlenecks. As demand for real-time AI applications grows, Hands-On LLM Serving and Optimization offers a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.
In this hands-on book, authors Chi Wang and Peiheng Hu take a real-world approach backed by practical examples and code, assembling essential strategies for designing robust infrastructure that is equal to the demands of modern AI applications. Whether you're building high-performance AI systems or looking to deepen your knowledge of LLM optimization, this book aims to be an indispensable resource.
- Learn the key principles for designing a model-serving system tailored to popular business scenarios
- Understand the common challenges of hosting LLMs at scale while minimizing costs
- Pick up practical techniques for optimizing LLM serving performance
- Build a model-serving system that meets specific business requirements
- Improve LLM serving throughput and reduce latency
- Host LLMs in a cost-effective manner, balancing performance and resource efficiency
Editorial Reviews
About the Author
Peiheng Hu is an accomplished machine learning engineer with over 10 years of industry experience and expertise in building robust, large-scale AI-driven systems on the cloud. He holds a Master of Science in Computational Science & Engineering from Harvard University and a Bachelor of Science in Industrial Engineering and Operations Research from the Georgia Institute of Technology. Peiheng currently serves as a Principal Member of Technical Staff and ML Engineer at Salesforce, where he leads teams in developing cutting-edge machine learning inferencing solutions, including the launch of Salesforce's only unified ML inferencing solution, which now handles thousands of requests per second, and a novel automated model optimization framework for large language models (LLMs). His work has significantly enhanced model inference performance, scalability, and cost-efficiency, saving millions in hardware expenses.
Brief Table of Contents (Not Yet Final)
Chapter 1: Introduction to Model Serving and Optimization (available)
Chapter 2: Large Language Model (LLM) Serving (available)
Chapter 3: Model Serving Best Practices and Case Studies (available)
Chapter 4: Build an Agent Application with LLM from Scratch (unavailable)
Chapter 5: Performance Challenges When Serving LLMs (available)


