Hands-On LLM Serving and Optimization: Hosting LLMs at Scale


Author(s): Chi Wang (Author), Peiheng Hu (Author)

  • Publisher: O’Reilly Media
  • Publication Date: June 9, 2026
  • Edition: 1st
  • Language: English
  • Print length: 372 pages
  • ASIN: B0G48JRRMF
  • ISBN-13: 9798341621497

Book Description

Large language models (LLMs) are rapidly becoming the backbone of AI-driven applications. Without proper optimization, however, LLMs can be expensive to run, slow to serve, and prone to performance bottlenecks. As the demand for real-time AI applications grows, along comes Hands-On LLM Serving and Optimization, a comprehensive guide to the complexities of deploying and optimizing LLMs at scale.

In this hands-on book, authors Chi Wang and Peiheng Hu take a real-world approach backed by practical examples and code, and assemble essential strategies for designing robust infrastructures that are equal to the demands of modern AI applications. Whether you’re building high-performance AI systems or looking to enhance your knowledge of LLM optimization, this indispensable book will serve as a pillar of your success.

  • Learn the key principles for designing a model-serving system tailored to popular business scenarios
  • Understand the common challenges of hosting LLMs at scale while minimizing costs
  • Pick up practical techniques for optimizing LLM serving performance
  • Build a model-serving system that meets specific business requirements
  • Improve LLM serving throughput and reduce latency
  • Host LLMs in a cost-effective manner, balancing performance and resource efficiency

Editorial Reviews

About the Author

Chi Wang has over 17 years of experience in the tech industry, with a particular focus on artificial intelligence and distributed systems. For the past 8 years, Chi has been a key contributor at Salesforce’s Einstein AI group, where he leads the development of AI platforms and infrastructure that support millions of Salesforce customers and power hundreds of AI features. Currently, as Director of Engineering, Chi oversees two critical teams: one focused on model serving and optimization solutions, and the other on data science environments. Chi has also filed 12 patents in areas such as dataset management, model serving and optimization, data access authorization, and networking management. In addition, he holds an Artificial Intelligence Graduate Certificate from Stanford University, which he completed in 2020.

Peiheng Hu is an accomplished machine learning engineer with over 10 years of industry experience and expertise in building robust, large-scale AI-driven systems on the cloud. He holds a Master of Science in Computational Science & Engineering from Harvard University and a Bachelor of Science in Industrial Engineering Operations Research from Georgia Institute of Technology. Peiheng currently serves as a Principal Member of Technical Staff and ML Engineer at Salesforce, where he leads teams in developing cutting-edge machine learning inferencing solutions, including launching Salesforce’s unified ML inferencing solution, which now handles thousands of requests per second, and a novel automated model optimization framework for large language models (LLMs). His work has significantly enhanced model inference performance, scalability, and cost-efficiency, saving millions in hardware expenses.

Brief Table of Contents (Not Yet Final)
Chapter 1: Introduction to Model Serving and Optimization (available)

Chapter 2: Large Language Model (LLM) Serving (available)

Chapter 3: Model Serving Best Practices and Case Studies (available)

Chapter 4: Build an Agent Application with LLM from Scratch (unavailable)

Chapter 5: Performance Challenges When Serving LLMs (available)

