MK1 Flywheel Unlocks the Full Potential of AMD Instinct for LLM Inference

Summary

With the release of our new inference engine, MK1 Flywheel, we are excited to report that the AMD Instinct Series can achieve performance comparable to a compute-matched NVIDIA GPU. We’ve designed MK1 Flywheel for maximum performance on both AMD and NVIDIA hardware: our benchmarks demonstrate up to 3.7x higher throughput compared to vLLM.

AMD Enters AI Market with Instinct Series Accelerators

The 2023 holiday season marked a significant milestone for the AI community with the launch of AMD’s eagerly anticipated Instinct MI300 series accelerators, showcasing the company’s advanced CDNA 3 architecture. On paper, the MI300 has the potential to challenge NVIDIA’s market dominance for cloud AI workloads, bringing hope for genuine performance competition and a more level playing field. AMD’s Achilles heel up to this point has been its lagging software ecosystem; recently, however, there have been inroads into natively supporting AMD hardware in popular AI frameworks.

Once we realized what the AMD Instinct series cards were capable of, we challenged ourselves to port our LLM inference engine MK1 Flywheel to AMD. Having just achieved the best inference performance on NVIDIA hardware (see our companion post), we rolled up our sleeves and got to work.

How does AMD Stack Up to NVIDIA?

Before we take you through our journey with AMD Instinct and ROCm, let’s start with the results.

For reasons explained later on, we profiled the AMD Instinct MI100 and MI210 cards, and here focus on the newer MI210 for comparison. On the NVIDIA side, we chose the RTX A6000 since it has a similar hardware specification based on TFLOPS, memory and power.

Specification	AMD Instinct MI100	AMD Instinct MI210	NVIDIA RTX A6000
Architecture	CDNA 1	CDNA 2	Ampere
FP16 TFLOPS	184.6	181	154.8
Power	300 W	300 W	300 W
Memory	32 GB HBM2	64 GB HBM2e	48 GB GDDR6

Throughput vs. Latency - Typical

Throughput vs. Latency - Long Prompt

Inference performance of MK1 Flywheel and vLLM (v0.2.6) on Llama-2-13B. The benchmarks plot throughput (requests/second) against average latency (end-to-end round-trip time to service a request) across an increasing number of asynchronous workers (mimicking users) for a given distribution of input prompt and output generation lengths. The typical distribution (left) has a maximum context of 960 tokens and a maximum generation of 80 tokens; the long-prompt distribution (right) has a maximum context of 1800 tokens and a maximum generation of 120 tokens.
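To make the measurement concrete, here is a minimal sketch of a benchmark loop of the kind described in the caption. It is not our actual harness: `fake_backend` is a hypothetical stand-in that sleeps instead of calling an inference server, its per-token cost model is made up, and prompt and output lengths are drawn uniformly up to the stated maximums purely as an assumption (the real distributions are described in the companion post).

```python
import asyncio
import random
import time


async def fake_backend(prompt_tokens: int, output_tokens: int) -> None:
    """Hypothetical stand-in for an inference request; sleeps instead of calling a server."""
    # Made-up cost model: a small fixed cost per prompt token and per output token.
    await asyncio.sleep(1e-5 * prompt_tokens + 1e-4 * output_tokens)


async def worker(n_requests, max_context, max_new_tokens, latencies, rng):
    """One simulated user: issues requests back to back and records each latency."""
    for _ in range(n_requests):
        prompt_tokens = rng.randint(1, max_context)      # assumed uniform distribution
        output_tokens = rng.randint(1, max_new_tokens)   # assumed uniform distribution
        start = time.perf_counter()
        await fake_backend(prompt_tokens, output_tokens)
        latencies.append(time.perf_counter() - start)


async def run_benchmark(n_workers, n_requests=5, max_context=960,
                        max_new_tokens=80, seed=0):
    """Run n_workers concurrent users and return throughput and average latency."""
    rng = random.Random(seed)
    latencies = []
    start = time.perf_counter()
    await asyncio.gather(*(
        worker(n_requests, max_context, max_new_tokens, latencies, rng)
        for _ in range(n_workers)
    ))
    elapsed = time.perf_counter() - start
    return {
        "workers": n_workers,
        "requests": len(latencies),
        "throughput_rps": len(latencies) / elapsed,
        "avg_latency_s": sum(latencies) / len(latencies),
    }


if __name__ == "__main__":
    # Sweep the number of asynchronous workers, as in the plots.
    for n in (1, 2, 4, 8):
        print(asyncio.run(run_benchmark(n)))
```

Each sweep point yields one (latency, throughput) pair. Because the stand-in backend has no shared resource, its latency stays flat as workers are added; against a real GPU server, latency rises with load, which is what gives the curves their shape.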

The results are clear: MK1 Flywheel on an AMD Instinct MI210 now rivals a compute-matched NVIDIA GPU (also running Flywheel). Moreover, Flywheel shows higher throughput across all tested workloads on both AMD and NVIDIA compared to vLLM. The increased performance translates into significant cost savings as the same GPU instance can now service more users.

For clients considering using AMD Instinct for LLM inference workloads at scale, please reach out.

Recap of MK1 Flywheel

In our companion post, we introduced MK1 Flywheel, our LLM inference engine built for real-world enterprise applications. Working with our early adopters, we forged a performant inference solution that is now in commercial use, servicing millions of active users for LLM chat applications. As a quick recap, Flywheel offers:

Unparalleled throughput vs. latency characteristics.
Rapid auto-scaling for scaling up inference on cloud platforms.
For enterprise customers: seamless integration into your stack with native PyTorch compatibility, and a drop-in replacement for inference backends like vLLM, TensorRT-LLM or Hugging Face TGI.

You can take Flywheel for a spin on Amazon SageMaker. It currently runs on an NVIDIA backend, and we look forward to offering Flywheel on AMD backends.

Next up, we want to give you a behind-the-scenes journey of building the hardware and software components that brought Flywheel to life on AMD. Hope you have as much fun reading it as we had doing it!

Our Journey: Building out the Hardware

Our early exploration of the AMD Instinct series began with the harsh reality check that these cards aren’t ubiquitous across cloud platforms (for now). To chalk up a quick existence proof, we grabbed an MI100 (CDNA 1 architecture) from the bargain bin on eBay. The exercise was fairly touch-and-go at the start. Since the card does not have active cooling, we jury-rigged a 3D-printed fan hood onto it so that it could run in one of our typical desktop workstations. Only once everything was plugged in did we realize that the system had no video output! Fortunately, we rummaged up a relic in the form of an NVIDIA Titan X and strapped it in. The workstation was rounded out with an Intel Xeon CPU, for reasons that will require their own blog post. Once we had this chimera of a system up and running, the absurdly loud high-static-pressure fan did nothing to curb our enthusiasm when we generated text from Llama-2-7B. We have come a long way since then.

A few weeks later we were able to land an MI210 (CDNA 2 architecture). With the lessons learned from the MI100, we had it up and running in no time!

Our first chimera system

Our first chimera system with an AMD Instinct MI100 for AI acceleration, an NVIDIA Titan X for video output and an Intel Xeon CPU. The MI100 is cooled with a Silverstone FHS 80X mounted on a 3D-printed fan hood.

Our Journey: Building out the Software

AMD has truly stepped up its game on the ROCm software stack over the past year, and it shows. We now have tight integration into bread-and-butter AI frameworks like PyTorch and TensorFlow. And, for the first time, you can effortlessly run LLMs right off of Hugging Face with a few lines of Python.

As we designed MK1 Flywheel for the NVIDIA platform, we learned priceless lessons along the way and built the intuition and engineering necessary to push NVIDIA GPUs to their limits. Energized by the momentum we saw on the AMD platform and in its community, we challenged ourselves to put our theories and engineering to the test. Once again, staying true to our performance-obsessed nature, we took no shortcuts and built the framework backend from first principles for the AMD stack.

While the idiosyncrasies of the CDNA architecture made our ROCm kernel stack understandably different from our CUDA stack, the fundamentals that unlocked the true potential of the GPU hardware remained consistent. To start, we had to grok the CDNA architecture and its roots in GCN (Graphics Core Next), alongside the complementary graphics-forward RDNA architecture. In parallel, we had to discover the strengths and limits of the compiler, especially on a compute pipeline that relies on instruction counting. Certain system-level techniques like optimized kernel scheduling worked right away because they are platform agnostic by nature. However, all core device kernels that used platform-specific features (say, NVIDIA Tensor Cores) had to be written from scratch to use the equivalent on the AMD hardware (AMD Matrix Cores).

In our experience so far, ROCm is in great shape for building functional kernels right off the bat on AMD hardware. Yet, to extract the most out of the CDNA architecture, we’ve had to go full manual, more so than in our experience with CUDA. In all fairness, this can be chalked up to the fact that NVIDIA has had a significant head start with CUDA for AI workloads. We believe our work demonstrates that, after leveling the playing field on the software front with MK1 Flywheel, the AMD Instinct Series is a serious contender for today’s cloud inference workloads. We are excited to continue developing on ROCm and extending Flywheel to the MI300, as well as further optimizing for finetuning and training workloads.

Results: MK1 Flywheel on AMD Instinct MI210 and MI100

We further benchmarked Flywheel on AMD Instinct across Mistral-7B and Llama-2-13B for various workloads. These results confirm that Flywheel has excellent performance across different LLM use cases.

Throughput vs. Latency - Mistral-7B

Throughput vs. Latency - Llama-2-13B

Inference performance for MK1 Flywheel and vLLM (v0.2.6) for popular models. Data distribution characteristics are described further in our companion post. MI100 measurements are only provided for MK1 Flywheel, since vLLM does not officially support the CDNA 1 architecture.

Dev Notes for ROCm

There are a few things that we do miss from the CUDA stack. For example, the tight-loop performance debugging afforded by NVIDIA Nsight Compute was invaluable as we developed our inference stack for NVIDIA. Omnitrace and Omniperf on the ROCm stack got us some of the way but weren’t as polished as their NVIDIA counterparts. The documentation could use some love as well; our current experience working on the ROCm stack was like driving a stick shift. On the plus side, the fact that the ROCm toolkit and the AMD LLVM project are entirely open source allowed us to parse the true nature of the hardware and build design patterns. In contrast, CUDA is largely closed source.

The ROCm platform offers HIPIFY, a convenient utility that converts CUDA code to cross-platform HIP code. However, to extract the most out of the CDNA architecture, we couldn’t simply “hipify” our CUDA kernels, which are tuned for current NVIDIA architectures. This is where our efforts in building a framework from first principles really paid off, as we applied our learnings and theories to the new architecture and reaped the results.
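To illustrate why a mechanical translation falls short, here is a toy sketch of the kind of source-to-source renaming that HIPIFY performs. This is not the HIPIFY tool or its implementation: the `toy_hipify` function and the tiny rename table are our own illustration, covering only a handful of well-known CUDA runtime names and their HIP equivalents.

```python
import re

# A handful of well-known CUDA -> HIP runtime renames. The real HIPIFY tools
# cover far more of the API; this table is illustrative only.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Match whole identifiers only, trying longer names first so that
# cudaMemcpyHostToDevice is never partially rewritten as cudaMemcpy.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source: str) -> str:
    """Rename known CUDA runtime identifiers to their HIP equivalents."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

if __name__ == "__main__":
    print(toy_hipify(cuda_src))
```

A rename-level translation like this leaves the structure of a kernel untouched, including any tuning choices baked in for NVIDIA hardware, such as assuming a 32-thread warp where CDNA executes 64-wide wavefronts. That is the part that has to be rethought by hand.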

What’s next?

The journey has only begun, and there is a lot more work to be done. Despite the VRMs screeching for their dear lives, the GPUs are not on fire… yet. As indicated in our companion post, there is a good deal of stack optimization left on the table to maximize inference performance. MK1 Flywheel aims to be platform agnostic, offering performance parity across GPUs with similar hardware specifications, i.e., you will get your TFLOPS’ worth. This opens up a wider range of cloud hardware platforms and economic mobility to serve your inference applications, while retaining the familiar frontend and user experience. For users planning enterprise deployments, we have different options for integrating MK1 Flywheel natively into your production stack, regardless of platform. Don’t hesitate to reach out!

We are eager to explore the MI300X accelerator and, in theory, MK1 Flywheel should work out of the box on CDNA 3, as we have prepared for it. The bigger hurdle is access to the new hardware. The MI300A APU has us looking forward to the converged strength of the Zen 4 CPU and CDNA 3 GPU.

Footnotes

A post from EmbeddedLLM dated October 27, 2023 claims that the AMD Instinct MI210 achieves LLM inference parity with the NVIDIA A100. There seems to be an unexplained discrepancy, considering that the NVIDIA A100 has 312 TFLOPS compared to the 181 TFLOPS of the Instinct MI210. In contrast, MK1 Flywheel achieves a proportional increase in performance on the A100, which has roughly 1.7x the horsepower of the Instinct MI210, as can be seen in the Throughput vs. Latency comparisons.

We believe the EmbeddedLLM results may simply be out of date, as an older version of vLLM (v0.1.4) was used as the base to develop ROCm support and may not have included performance improvements from more recent versions. As of v0.2.4, vLLM natively supports AMD Instinct MI200 series GPUs.
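The 1.7x figure in the footnote above can be checked directly from the FP16 TFLOPS numbers quoted in this post. The short sketch below does that arithmetic; the `compute_ratio` helper is our own name for illustration, and the ratio is a spec-sheet comparison only, not a measured speedup.

```python
# FP16 TFLOPS as quoted in this post (spec-sheet values, not measurements).
FP16_TFLOPS = {
    "AMD Instinct MI100": 184.6,
    "AMD Instinct MI210": 181.0,
    "NVIDIA RTX A6000": 154.8,
    "NVIDIA A100": 312.0,
}


def compute_ratio(gpu: str, baseline: str = "AMD Instinct MI210") -> float:
    """Ratio of a GPU's quoted FP16 TFLOPS to the baseline GPU's."""
    return FP16_TFLOPS[gpu] / FP16_TFLOPS[baseline]


if __name__ == "__main__":
    for gpu in FP16_TFLOPS:
        print(f"{gpu}: {compute_ratio(gpu):.2f}x the MI210")
```

If inference throughput scaled purely with FP16 compute, the A100 would be expected to land at about 1.7x the MI210 rather than at parity, which is the discrepancy the footnote points out.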

Throughput vs. Latency - NVIDIA A100, AMD Instinct MI210

Inference performance on the NVIDIA A100 and AMD Instinct MI210 with MK1 Flywheel and vLLM v0.2.6
