Cut Costs and Accelerate LLM Inference using MK1 Flywheel on Modal


We are thrilled to announce that our LLM inference engine, MK1 Flywheel, is now available through Modal's developer-friendly cloud service. Modal lets you deploy your Generative AI applications quickly without setting up or managing servers, and MK1 supplies a state-of-the-art inference engine to run them. Whether you are building your first LLM application, running daily AI tasks for document analysis, or operating a chatbot with a vast user base, Modal with Flywheel is ready to serve your AI workloads.

Now you can:

  • Lower cost: Flywheel serves more requests per dollar than leading inference solutions.
  • Speed up responses: Flywheel processes requests significantly faster, crucial for latency-sensitive applications such as chat.
  • Jumpstart your applications: Set up an endpoint with just a few lines of Python code and be up and running within minutes. Works with standard and fine-tuned models.
  • Get instant access to GPUs: Get powerful GPUs such as A100s without any approval delays.
  • Scale with ease: Quickly scale up to hundreds of GPUs and back down again in seconds, paying only for what you use.
  • Control your token economics: Use your dedicated GPUs to secure consistent service and optimize cost for your specific workloads.
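To make the "few lines of Python" claim concrete, here is a minimal sketch of what a Flywheel endpoint on Modal could look like. The container image name, model identifier, and the `mk1.flywheel` client API shown here are illustrative assumptions, not MK1's actual package; only the Modal primitives (`App`, `cls`, `enter`, `method`, `local_entrypoint`) are real. Consult the official examples for the exact names.

```python
# Sketch of a Flywheel endpoint on Modal.
# Names marked HYPOTHETICAL are illustrative assumptions.
import modal

app = modal.App("flywheel-demo")

# HYPOTHETICAL: a base image with the Flywheel runtime preinstalled.
image = modal.Image.from_registry("mk1/flywheel:latest")

@app.cls(gpu="A100", image=image)
class Llama13B:
    @modal.enter()
    def load(self):
        # HYPOTHETICAL client API: load the model once per container.
        from mk1.flywheel import Engine
        self.engine = Engine("meta-llama/Llama-2-13b-chat-hf")

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        return self.engine.generate(prompt, max_tokens=max_tokens)

@app.local_entrypoint()
def main():
    print(Llama13B().generate.remote("Summarize this article: ..."))
```

Running `modal deploy` on a script like this gives you an autoscaling endpoint: Modal spins containers up on demand and back down when traffic stops.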

Join Modal today with $30 in free credits each month. After signing up, take Flywheel for a spin and experience the most performant LLM inference engine available on the market.

Why Modal's Platform?

Modal provides all the tools necessary to develop and deploy AI applications without the hassle of securing cloud contracts or worrying about infrastructure. For many users, the ability to iterate quickly with your own fine-tuned models in a development environment is a huge advantage. MK1 brings a state-of-the-art inference engine to the platform, further accelerating your workflow, unlocking cost savings, and boosting performance.

This offering is a convenient way for enterprise customers to experience Flywheel and discover how MK1 can add value to any Generative AI software stack. Please note that MK1 also has enterprise licensing options available for self-hosting, with native support to run Flywheel on AMD hardware in addition to NVIDIA. We invite you to reach out to us with your specific needs and discover how Flywheel can help you cut costs and speed up LLM inference.

To get you started with Flywheel on Modal, we provide two examples showing reduced costs for document processing and faster responses for chat applications.

Demo: Cutting costs for document processing

We estimated "instantaneous throughput" for a Llama-2-13B model summarizing news articles. Flywheel uses the GPU resources more efficiently, processing 2,048 articles in less time.

LLMs are a powerful tool for processing a large corpus of documents. Common examples include: summarizing news articles, extracting keywords from product descriptions, and parsing financial data from earnings statements.

MK1 Flywheel can significantly reduce the cost of many document processing tasks. The reason is simple: Flywheel reduces the time required to complete the job, and with Modal, you only pay for GPU time down to the second.

For a concrete demonstration that you can run, we use Flywheel to summarize news articles. Here, we randomly selected 2,048 articles from the cnn_dailymail dataset, and used a Llama-2-13B-chat model prompted to perform a summarization task.

On an A100-40GB GPU instance, this model on Flywheel churns through documents faster. Specifically, the instantaneous throughput, measured in tokens per second, is consistently higher, resulting in less time to complete the task (see figure).
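The "instantaneous throughput" plotted in the figure can be estimated by counting generated tokens inside a trailing time window. A minimal sketch of that calculation follows; the window length and event format are our own assumptions, not MK1's actual measurement code:

```python
from collections import deque

def instantaneous_throughput(events, window=5.0):
    """Tokens/sec over a trailing time window.

    events: (timestamp_sec, tokens_generated) pairs, sorted by time.
    Returns the throughput observed at each event's timestamp.
    """
    recent = deque()
    total = 0
    rates = []
    for t, n in events:
        recent.append((t, n))
        total += n
        # Drop events that have fallen out of the trailing window.
        while recent[0][0] <= t - window:
            total -= recent.popleft()[1]
        rates.append(total / window)
    return rates

# 10 tokens at t=0, 1, 2, then a gap until t=6:
print(instantaneous_throughput([(0, 10), (1, 10), (2, 10), (6, 10)]))
# -> [2.0, 4.0, 6.0, 4.0]
```

Averaging tokens over a short window like this is what makes a sustained throughput gap visible as separation between the two curves.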

With current pricing, summarizing 2,048 articles on Flywheel costs $0.48, while the same task costs $0.93 using vLLM (v0.3.0). This is a savings of over 45%!
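Those two job costs pin down the savings figure directly; the arithmetic is just:

```python
vllm_cost = 0.93      # USD to summarize 2,048 articles with vLLM v0.3.0
flywheel_cost = 0.48  # USD for the same job with Flywheel

savings = 1 - flywheel_cost / vllm_cost
print(f"{savings:.1%}")  # about 48%, hence "over 45%"
```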

Similar savings are possible using other model and GPU combinations. For example, we have found that smaller models (Llama-7B and Mistral) are cost effective on A10 GPUs, while larger models (Llama-30B and CodeLlama-34B) require an A100-80GB. While there are different tradeoffs for each application, a big advantage of using Modal is being able to quickly try out different combinations to find the best token economics for your use case.

Demo: Faster chat responses for concurrent users

Many LLM applications require fast response times for a satisfying user experience.

For example, a service running a chat application will typically target <4 seconds for a few-sentence response. For low traffic, a single endpoint can usually meet this requirement. However, as traffic increases, the only way to keep response latencies down is to spin up multiple endpoints and route the traffic accordingly. The threshold at which you need to spin up more endpoints determines your cost, since each endpoint requires one or more reserved GPUs.
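That scaling threshold is a simple ceiling computation over concurrency. In the sketch below, the per-endpoint capacity of 20 users is an assumed number for illustration; the real value depends on the model, GPU, and latency target:

```python
import math

def endpoints_needed(concurrent_users: int, users_per_endpoint: int) -> int:
    """Endpoints required so that no single endpoint is pushed past the
    concurrency at which it can still meet the latency target."""
    return math.ceil(concurrent_users / users_per_endpoint)

# Assuming one endpoint holds the <4 s target up to 20 concurrent users:
for users in (15, 20, 21, 90):
    print(users, "users ->", endpoints_needed(users, 20), "endpoint(s)")
```

Raising the per-endpoint capacity, which is exactly what a faster engine does, moves every one of those thresholds further out.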

Flywheel is able to service more concurrent users by processing requests significantly faster. This directly translates into lower cost and a better user experience.

For a concrete demonstration, we launched a Flywheel endpoint running a CodeLlama-34B on an A100-80GB GPU. We then simulated users sending requests drawn from a typical chat distribution with a max context of 960 tokens and max generation of 80 tokens.
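A workload simulation along those lines can be sketched as follows. The log-normal length distribution here is our own stand-in for "a typical chat distribution"; only the caps (960-token context, 80-token generation) come from the benchmark description:

```python
import random

def sample_chat_request(rng, max_context=960, max_generate=80):
    """Draw one simulated chat request: (prompt_tokens, generation_tokens).

    The log-normal parameters are illustrative assumptions; only the
    caps match the benchmark setup described above.
    """
    ctx = max(1, min(int(rng.lognormvariate(5.5, 0.8)), max_context))
    gen = max(1, min(int(rng.lognormvariate(3.5, 0.6)), max_generate))
    return ctx, gen

rng = random.Random(0)  # seeded for reproducibility
requests = [sample_chat_request(rng) for _ in range(1000)]
```

Replaying a stream of such requests against an endpoint at increasing arrival rates reveals the concurrency at which average latency crosses the 4-second target.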

Targeting an average latency of 4 seconds per response, a Flywheel endpoint is able to service 1.7x the traffic for the same GPU cost! At scale, this means you will need to spin up fewer GPUs to service the same number of users. Equivalently, you can run a bigger (more capable) model on the same GPU without sacrificing response times.
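At scale, the 1.7x figure compounds into concretely fewer GPUs. A back-of-the-envelope sketch, where the baseline capacity of 10 users per GPU is an assumed number (the 1.7x multiplier is the measured result above):

```python
import math

users = 500
baseline_users_per_gpu = 10  # assumed baseline capacity at the 4 s target
flywheel_users_per_gpu = baseline_users_per_gpu * 1.7  # measured 1.7x gain

baseline_gpus = math.ceil(users / baseline_users_per_gpu)
flywheel_gpus = math.ceil(users / flywheel_users_per_gpu)
print(baseline_gpus, "GPUs vs", flywheel_gpus, "GPUs")  # 50 vs 30
```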

We provide an example to launch your own endpoint using Flywheel. More details on how to build an end-to-end chat application are coming soon.

More to come

Stay tuned for additional examples showing the advantages of using Flywheel for all your LLM inference use cases.

For users planning enterprise deployments, MK1 is your partner in growth. In addition to our cloud offerings, we have different options for integrating MK1 Flywheel natively into your production stack. We provide personalized support to help scale your operations efficiently, ensuring that your AI infrastructure grows seamlessly with your business. Don't hesitate to reach out!