Join Gyeong-In Yu, CTO of FriendliAI, as he explores the challenges and solutions of generative AI inference at trillion-token scale, presented at the SuperAI Singapore 2025 conference. As AI technology advances rapidly, the demand for efficient, scalable AI inference grows with it. This session examines the factors driving that demand, including the rise of agentic AI and the shift toward using GPUs primarily for inference rather than training.
Yu outlines key strategies for tackling rising computational demands and infrastructure management. He emphasizes the need to reduce GPU costs and to apply specialized optimizations tailored to different AI applications. By prioritizing user satisfaction through rapid response times while maintaining high generation quality, FriendliAI aims to balance operational efficiency with superior service delivery.
Attendees gain valuable insights into optimization techniques such as batching, quantization, and caching that improve computational efficiency. Yu introduces FriendliAI's proprietary FP8 quantization method, which preserves model quality while accelerating execution. He also showcases Friendli Suite, a platform offering services such as Friendli Container and managed cloud solutions that deliver cost-effective, high-performance AI inference.
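FriendliAI's FP8 method itself is proprietary, but the general idea behind FP8 weight quantization can be sketched in a few lines: pick a per-tensor scale that maps the weights' dynamic range into the representable FP8 range (E4M3 here), then round each scaled value to the format's reduced mantissa precision. The sketch below simulates this in plain Python; the function names, the per-tensor scaling choice, and the sample weights are illustrative assumptions, not FriendliAI's implementation.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format

def quantize_fp8(x, scale):
    """Scale x into FP8 range, then round to 3 mantissa bits (E4M3-style)."""
    v = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale))  # clamp to FP8 range
    if v == 0.0:
        return 0.0
    m, e = math.frexp(v)       # v = m * 2**e with 0.5 <= |m| < 1
    m = round(m * 16) / 16     # keep implicit leading bit + 3 mantissa bits
    return math.ldexp(m, e)

def dequantize_fp8(q, scale):
    """Recover an approximation of the original value."""
    return q * scale

# Per-tensor scale chosen so the largest weight lands at the FP8 maximum.
weights = [0.013, -2.7, 0.004, 1.19]
scale = max(abs(w) for w in weights) / FP8_E4M3_MAX
recovered = [dequantize_fp8(quantize_fp8(w, scale), scale) for w in weights]
```

With 3 mantissa bits, each round-tripped weight stays within about 1/16 relative error of the original, which is the trade-off such methods exploit: far less memory traffic per weight at a small, bounded loss of precision.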
Don't miss the chance to deepen your understanding of AI inference at scale. Share your thoughts in the comments, like the video if you found it insightful, and subscribe to our channel for more groundbreaking AI discussions and innovations.