
Unlocking real-time chat with 1M context Llama3-70B model on AMD’s MI300X

A killer application of large language models (LLMs) is the ability to intelligently interact with vast amounts of text. Imagine conversing with a model that has comprehensive knowledge of entire codebases, literary works, or legal documents, and can surface valuable insights from them.

Recently, long context windows have emerged as a promising approach, with Google’s Gemini 1.5 Pro and Flash models supporting million-token contexts and outperforming popular techniques such as retrieval-augmented generation (RAG).

While Google’s long context models are undeniably impressive, they are unfortunately locked behind an API and come with significant limitations:

  1. Lack of Customization: Limited fine-tuning or modification capabilities
  2. Scalability Constraints: API rate limits hinder large-scale deployments
  3. Cost Inefficiency: Prohibitive expenses for long context token utilization
  4. Data Security Risks: Reliance on third-party API for sensitive data processing
  5. Feature Gaps: Absence of real-time chat with cached context

Addressing these challenges, we’ve developed an alternative approach that gives users the ability to run long context models fully under their control.

At a recent event hosted by TensorWave in San Francisco for the developer community, we showed an open-source 1M context Llama3-70B model running on AMD MI300X hardware.

Demonstrating Two Real-World Applications

Apollo 11 Transcript: Shows the model’s ability to maintain context over the entire transcript of the historic moon landing.

Three.js code examples: Shows the model’s proficiency at handling queries over a complex codebase.

A significant breakthrough is persistent context caching, a feature that enables real-time multi-user interaction with the same document. It greatly enhances development workflows: large document sets can be loaded directly, with no additional processing time for subsequent interactions.

For example, in our Three.js demo we were able to load the cache into memory in 15 seconds, versus 8 minutes for the original prefill stage, roughly a 32x speedup. Furthermore, there was no additional overhead for subsequent prompts.
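To make the idea concrete, here is a minimal sketch of the prefill-caching pattern using Hugging Face transformers: the KV cache produced by prefilling the document once is saved to disk and reloaded for later prompts, so the long prefill never has to be repeated. This is not the production implementation behind the demo; the model id, file names, and prompts below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; in practice a 1M-context Llama3-70B variant would be used.
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
# On MI300X, PyTorch built with ROCm exposes the GPU through the "cuda" device string.
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

# --- One-time prefill over the large document (the slow step) ---
document = open("threejs_examples.txt").read()  # hypothetical document dump
doc_ids = tokenizer(document, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    out = model(input_ids=doc_ids, use_cache=True)
# Persist the KV cache so later sessions can skip the prefill entirely.
# (Depending on the transformers version this is a Cache object or a tuple of
# tensors; either way torch.save can serialize it.)
torch.save(out.past_key_values, "threejs_kv_cache.pt")

# --- Later session: reload the cache instead of re-prefilling ---
past_key_values = torch.load("threejs_kv_cache.pt")
question_ids = tokenizer(
    "\nQ: How do I add fog to a scene?\nA:",
    return_tensors="pt",
    add_special_tokens=False,
).input_ids.to(device)

# generate() only processes the tokens not already covered by the cache,
# so the response starts almost immediately even for a very long document.
output = model.generate(
    input_ids=torch.cat([doc_ids, question_ids], dim=-1),
    past_key_values=past_key_values,
    max_new_tokens=200,
)
prefix_len = doc_ids.shape[-1] + question_ids.shape[-1]
print(tokenizer.decode(output[0][prefix_len:], skip_special_tokens=True))
```

Because the cache lives on disk rather than in a single session’s memory, it can be reloaded by any number of sessions, which is what makes real-time multi-user chat over the same document practical.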

Learn more about the announcement here.
