MK1 Flywheel has the Best Throughput and Latency for LLM Inference on NVIDIA and AMD

Summary

MK1 has built an optimized LLM inference engine, called Flywheel, that now blazes with higher throughput and lower latency on NVIDIA Ampere, Ada, and Hopper GPUs and on AMD Instinct. Try it on Amazon SageMaker and engage with us for large-scale deployment.

Inference Performance Comparison

Back in August, we released a closed beta of our LLM inference runtime, called Flywheel, that clients could deploy on their own infrastructure. Our mission was to enable companies to push LLM inference performance to its very limits, reducing cost and improving responsiveness.

Every step of the way, we listened to our early adopters. With their feedback, we developed an inference framework that offers the best throughput and latency performance on the market. Flywheel is a drop-in replacement for popular frameworks, with a focus on ease of use and cross-platform compatibility across NVIDIA and AMD (see our companion blog post for the AMD story).

The initial trials have exceeded our expectations, and today we are pleased to announce that Flywheel is in commercial use, servicing millions of active users for LLM chat applications.

Our Product: MK1 Flywheel

MK1 Flywheel is an enterprise-focused LLM inference framework that is easy to use, and a drop-in replacement for runtimes like vLLM, Hugging Face TGI, and NVIDIA TensorRT-LLM. It supports popular open source models like Llama and Mistral, and has a Python and PyTorch compatible interface that integrates with your infrastructure with a simple pip install. It’s also available for testing and deployment through Amazon SageMaker.

Unlike other inference frameworks, Flywheel was designed to work right out of the box with the highest possible performance without any configuration. The optimizations we apply under the hood were carefully selected through rigorous internal testing and first-principles analysis. To name a few, we support an in-house version of continuous batching backed by our library of optimized attention kernels, with tunings for Multi-Head Attention variants like MQA/GQA, sliding-window attention, and long context windows.
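To give a feel for why continuous batching matters, here is a toy sketch of the general technique. It is not Flywheel's in-house implementation: the function names are hypothetical, and it ignores prefill, memory limits, and kernel details. It only counts decode steps, assuming one token per in-flight request per step.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled before the next step."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step produces one token for every in-flight request;
        # requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) for eight requests.
lengths = [80, 5, 5, 5, 80, 5, 5, 5]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

In this made-up workload, static batching needs 160 decode steps while continuous batching needs 85, because short requests no longer wait for the longest request in their batch.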

One of our core principles was to constrain inference optimizations to ones that preserve the fidelity of the original FP16 model within the noise margin of the measurement procedure. Then, through comprehensive A/B testing with real users, we validated that there was no impact on key metrics like customer engagement and retention for chat applications, and retrieval and Q&A for more enterprise-focused applications.

We think the effort paid off: Flywheel is the most performant inference runtime on the market on both NVIDIA and AMD hardware. Below, we compare Flywheel's throughput and latency against vLLM for popular model types (Mistral 7B and Llama-2 13B) on NVIDIA RTX A6000 and AMD Instinct MI210. In addition, CodeLlama-34B is profiled on NVIDIA A100 80GB and NVIDIA H100.

Throughput vs Latency on NVIDIA Platform

Llama-2-13b-chat NVIDIA

CodeLlama-34b NVIDIA

We measure throughput (requests/second) against average latency (end-to-end round-trip time to service a generation request) across an increasing number of asynchronous workers (i.e. users) for a given distribution of input prompt and output generation lengths. The data distributions are curated from real-world chat scenarios, with a maximum context of 960 tokens and a maximum generation length of 80 tokens.
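For readers who want to reproduce this style of measurement, here is a minimal sketch of such a sweep. It is not our benchmarking tool: `fake_generate` is a hypothetical stand-in that only sleeps, and every request uses the maximum sizes rather than a sampled distribution. Swap in a real client call to measure an actual endpoint.

```python
import asyncio
import time


async def fake_generate(prompt_tokens: int, max_new_tokens: int) -> None:
    """Hypothetical stand-in for a real inference call; replace with your client."""
    # Simulated service time only; it does not model a real engine.
    await asyncio.sleep(0.001 * max_new_tokens)


async def worker(n_requests: int, latencies: list[float]) -> None:
    """One simulated user: sends requests back to back and records round-trip time."""
    for _ in range(n_requests):
        start = time.perf_counter()
        await fake_generate(prompt_tokens=960, max_new_tokens=80)
        latencies.append(time.perf_counter() - start)


async def run_benchmark(n_workers: int, requests_per_worker: int) -> tuple[float, float]:
    """Return (throughput in requests/second, average latency in seconds)."""
    latencies: list[float] = []
    start = time.perf_counter()
    await asyncio.gather(*(worker(requests_per_worker, latencies) for _ in range(n_workers)))
    elapsed = time.perf_counter() - start
    return len(latencies) / elapsed, sum(latencies) / len(latencies)


def sweep(worker_counts: list[int], requests_per_worker: int = 5) -> list[tuple[int, float, float]]:
    """Measure one (workers, throughput, latency) point per concurrency level."""
    return [(n, *asyncio.run(run_benchmark(n, requests_per_worker))) for n in worker_counts]


if __name__ == "__main__":
    for n, rps, lat in sweep([1, 2, 4, 8]):
        print(f"workers={n:2d}  throughput={rps:7.2f} req/s  avg latency={lat * 1000:6.1f} ms")
```

Each concurrency level yields one point on a throughput vs latency curve. With a real endpoint, latency typically rises as workers are added, which is what the curves in the figures trace out.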

Throughput vs Latency on AMD Instinct MI210

Llama-2-13b-chat AMD

Throughput vs latency on AMD Instinct MI210, measured with the same setup we used for the NVIDIA benchmarks. See our companion blog post for more details about Flywheel on AMD.

The key takeaway is that across all models tested, across different workload scenarios, and on both NVIDIA and AMD platforms, Flywheel has significantly higher throughput than vLLM at every measured latency. The increased performance translates into significant cost savings, as the same GPU instance can now service more users.
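The link between throughput and cost can be made concrete with a little arithmetic. The sketch below uses hypothetical prices and throughputs chosen for illustration only; they are not measured Flywheel or vLLM results.

```python
def cost_per_million_requests(instance_usd_per_hour: float, throughput_rps: float) -> float:
    """Cost in USD to serve one million requests at a sustained throughput."""
    if throughput_rps <= 0:
        raise ValueError("throughput must be positive")
    requests_per_hour = throughput_rps * 3600
    return instance_usd_per_hour / requests_per_hour * 1_000_000


# Hypothetical numbers, for illustration only.
baseline = cost_per_million_requests(instance_usd_per_hour=2.0, throughput_rps=5.0)
faster = cost_per_million_requests(instance_usd_per_hour=2.0, throughput_rps=10.0)
savings = 1 - faster / baseline
print(f"baseline=${baseline:.2f}  faster=${faster:.2f}  savings={savings:.0%}")
```

At a fixed instance price, cost per request is inversely proportional to throughput, so doubling throughput at the same latency target halves the serving cost.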

Take Flywheel for a Spin on Amazon SageMaker

We now offer Flywheel on Amazon SageMaker. The offering comes in four distinct flavors, providing off-the-shelf implementations for Llama2-Chat-7B, Llama2-Chat-13B, Mistral-Instruct-7B as well as a Bring-Your-Own-Model version that supports custom fine-tuned models based on the architectures previously mentioned. MK1 Flywheel is currently available for the single-GPU ml.g5 instance types, powered by NVIDIA A10 GPUs, and we’ve prepared a comprehensive GitHub repository with examples showing how to deploy an Amazon SageMaker endpoint for your applications. Support for multiple GPUs is coming soon, so stay tuned!

Here are our results for Flywheel on SageMaker, running on an NVIDIA A10.

Flywheel SageMaker

The benchmarks presented above were run on a g5.8xlarge EC2 instance, with both the endpoint and the benchmarking tool running on the same host. The triangle datapoint shows the response of each inference runtime when servicing 24 simultaneous requests.

For users planning enterprise deployments, MK1 is your partner in growth. We have different options for integrating MK1 Flywheel natively into your production stack. We provide personalized support to help scale your operations efficiently, ensuring that your AI infrastructure grows seamlessly with your business. Don’t hesitate to reach out!
