Red Hat launches the llm-d community, driving distributed generative AI inference at scale

llm-d, a new open-source project, has just been launched with the support of companies such as CoreWeave, Google Cloud, IBM Research, and NVIDIA. The initiative focuses on accelerating the most crucial need for the future of generative AI: inference at scale. Built on a Kubernetes-native architecture, the project uses distributed inference with vLLM and intelligent, AI-aware network routing, enabling the creation of robust inference clouds for large language models (LLMs) that meet the most demanding production-level service-level objectives (SLOs).
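As a reference point for the vLLM building block that llm-d distributes, here is a minimal sketch using vLLM's offline Python API; the model name and sampling settings are illustrative, and llm-d itself supplies the Kubernetes-native orchestration and routing layers on top of engines like this one.

```python
# Minimal vLLM usage sketch (model name and parameters are illustrative).
from vllm import LLM, SamplingParams

# Load a small model into a single vLLM engine; llm-d coordinates many such
# engines across a Kubernetes cluster.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Run a batch of prompts through the engine and print the generated text.
outputs = llm.generate(["What is distributed inference?"], params)
for request_output in outputs:
    print(request_output.outputs[0].text)
```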

While training remains critical, the true impact of gen AI depends on more efficient and scalable inference — the engine that turns AI models into actionable insights and user experiences. According to Gartner, by 2028, as the market matures, over 80% of workload accelerators in data centers will be deployed for inference rather than training. This means the future of gen AI lies in execution capability. The growing resource demands of increasingly sophisticated reasoning models limit the feasibility of centralized inference and threaten to bottleneck AI innovation through prohibitive costs and paralyzing latency.

Addressing the need for scalable inference 

Red Hat and its industry partners are addressing this challenge head-on with llm-d, a visionary project that extends the power of vLLM beyond the limits of a single server to unlock production-scale AI inference. Leveraging Kubernetes' proven orchestration power, llm-d integrates advanced inference capabilities into existing enterprise IT infrastructures. This unified platform empowers IT teams to meet the diverse service demands of critical business workloads while applying innovative techniques to maximize efficiency and significantly reduce the total cost of ownership (TCO) associated with high-performance AI accelerators.

llm-d offers a powerful set of innovations, including:

  • vLLM, which has quickly become the de facto open-source standard inference server, offering day-zero support for emerging frontier models and supporting a wide range of accelerators, now including Google Cloud’s Tensor Processing Units (TPUs).
  • Prefill and decode disaggregation, which separates the input-context (prefill) and token-generation (decode) phases of inference into distinct operations that can be distributed across multiple servers.
  • KV (key-value) cache offloading, based on LMCache, which shifts the KV cache memory burden from GPU memory to more economical and abundant standard storage, such as CPU memory or network storage.
  • Kubernetes-based clusters and controllers for more efficient scheduling of computing and storage resources as workload demands fluctuate, ensuring optimal performance and minimal latency.
  • AI-aware network routing, which schedules incoming requests to the servers and accelerators most likely to hold warm caches from previous prefill calculations (see the sketch after this list).
  • High-performance communication APIs for faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).
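To make the routing idea concrete, the following is a hypothetical sketch of prefix-affinity routing in Python. It is not llm-d's actual scheduler; the replica endpoints, the route function, and the prefix length are assumptions made for illustration, but it shows how requests that share a prompt prefix can be steered to the replica most likely to already hold the corresponding KV cache.

```python
# Hypothetical prefix-affinity router (illustrative only, not llm-d's scheduler).
import hashlib

# Illustrative vLLM replica endpoints behind the router.
REPLICAS = ["vllm-0:8000", "vllm-1:8000", "vllm-2:8000"]

def route(prompt: str, prefix_chars: int = 256) -> str:
    """Pick a replica from a hash of the prompt's leading characters, so
    requests that share a prefix (e.g. the same system prompt) land on the
    same replica and can reuse its cached prefill work."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

if __name__ == "__main__":
    system_prompt = "You are a helpful assistant. "
    # Both requests share the system prompt, so they map to the same replica.
    print(route(system_prompt + "Summarize this document."))
    print(route(system_prompt + "Translate this sentence to French."))
```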

llm-d: industry consensus

This new open-source project already has the backing of a formidable coalition of leading AI model providers, AI accelerator pioneers, and AI-focused cloud platforms. CoreWeave, Google Cloud, IBM Research, and NVIDIA are the founding collaborators, with AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI as partners, highlighting strong industry collaboration to architect the future of LLM execution at scale. The llm-d community also enjoys support from academic institutions such as the University of California’s Sky Computing Lab, creators of vLLM, and the University of Chicago’s LMCache Lab, creators of LMCache.

True to its unwavering commitment to open collaboration, Red Hat recognizes the critical importance of vibrant and accessible communities in the rapidly evolving landscape of gen AI inference. Red Hat will actively support the growth of the llm-d community, fostering an inclusive environment for new members and driving its continuous evolution.

Red Hat’s Vision: any model, any accelerator, any cloud

The future of AI must be defined by unlimited opportunities and not restricted by infrastructure silos. Red Hat envisions a horizon where organizations can deploy any model, on any accelerator, in any cloud, delivering an exceptional and more consistent user experience, without exorbitant costs. To unlock the true potential of investments in gen AI, companies need a universal inference platform — a new standard for continuous and high-performance AI innovations, both now and in the years to come.

Just as Red Hat pioneered the transformation of Linux into a foundation of modern IT, the company is now poised to architect the future of AI inference. vLLM has the potential to become a linchpin of standardized gen AI inference, and Red Hat is committed to building a thriving ecosystem not only around the vLLM community but also around llm-d and its focus on large-scale distributed inference. The vision is clear: regardless of the AI model, underlying accelerator, or deployment environment, Red Hat aims to make vLLM the ultimate open standard for inference in the new hybrid cloud.

Red Hat Summit

Join the Red Hat Summit keynotes to hear the latest updates from Red Hat executives, customers, and partners.