
Red Hat launches the llm-d community, driving distributed inference of Gen AI at scale

 llm-d, a new open-source project, has just been launched with support from companies including CoreWeave, Google Cloud, IBM Research, and NVIDIA. The initiative focuses on accelerating the most crucial need for the future of generative AI (gen AI): inference at scale. Built on a Kubernetes-native architecture, the project uses distributed inference with vLLM and intelligent, AI-aware network routing, enabling the creation of robust inference clouds for large language models (LLMs) that meet the most demanding service-level objectives (SLOs) in production.
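
For context, vLLM is the single-server inference engine that llm-d distributes. A minimal sketch of its offline Python API is shown below; the model name and prompt are illustrative placeholders, and this shows plain single-node serving rather than llm-d's distributed layer.

    # Minimal single-node vLLM usage (illustrative; llm-d's role is to
    # distribute this kind of serving across many servers via Kubernetes).
    from vllm import LLM, SamplingParams

    # Any Hugging Face-compatible model ID can go here; this one is only an example.
    llm = LLM(model="facebook/opt-125m")
    sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["Explain distributed inference in one sentence."], sampling)
    for out in outputs:
        print(out.outputs[0].text)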

Although training remains vital, the true impact of generative AI depends on more efficient and scalable inference: the mechanism that transforms AI models into practical insights and user experiences. According to Gartner, by 2028, as the market matures, more than 80% of workload accelerators in data centers will be deployed for inference rather than training. In other words, the future of gen AI lies in the ability to execute. The growing resource demands of ever more sophisticated reasoning models limit the viability of centralized inference and threaten to bottleneck AI innovation through prohibitive costs and paralyzing latency.

Responding to the need for scalable inference

Red Hat and its industry partners are tackling this challenge head-on with llm-d, a visionary project that extends the power of vLLM beyond the limits of a single server to enable production-scale AI inference. Leveraging the proven orchestration power of Kubernetes, llm-d integrates advanced inference capabilities into existing corporate IT infrastructures. This unified platform empowers IT teams to meet the diverse service demands of critical business workloads while applying innovative techniques to maximize efficiency and dramatically reduce the total cost of ownership (TCO) associated with high-performance AI accelerators.

llm-d offers a powerful set of innovations, including:

  • vLLM, which has quickly become the de facto open-source inference server, offering day-zero support for emerging frontier models and for a broad range of accelerators, now including Google Cloud's Tensor Processing Units (TPUs).
  • Prefill and decode disaggregation, which splits the input-context and AI token-generation phases of inference into distinct operations that can be distributed across multiple servers.
  • KV (key-value) cache offloading, based on LMCache, which shifts the memory burden of the KV cache from GPU memory onto more economical and abundant standard storage, such as CPU memory or network storage.
  • Kubernetes-based clusters and controllers for more efficient scheduling of compute and storage resources as workload demands fluctuate, maintaining performance and keeping latency low.
  • AI-aware network routing, which schedules incoming requests to the servers and accelerators most likely to hold recent caches of prior inference calculations, as illustrated in the sketch after this list.
  • High-performance communication APIs for faster and more efficient data transfer between servers, with support for the NVIDIA Inference Xfer Library (NIXL).
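
To make the cache-aware routing idea above more concrete, here is a minimal, hypothetical sketch of a router that scores replicas by how much of an incoming request's token prefix they already hold in their KV caches and sends the request to the warmest one. The Replica class, the scoring, and the route function are illustrative simplifications for this article, not llm-d's actual scheduler.

    # Hypothetical sketch of cache-aware request routing: pick the replica whose
    # KV cache already covers the longest prefix of the incoming request.
    # This is a toy illustration, not llm-d's real routing logic.
    from dataclasses import dataclass, field


    @dataclass
    class Replica:
        name: str
        # Token-ID prefixes assumed to still be resident in this replica's KV cache.
        cached_prefixes: list = field(default_factory=list)

        def cache_overlap(self, tokens):
            """Longest shared prefix (in tokens) between the request and any cached entry."""
            best = 0
            for prefix in self.cached_prefixes:
                n = 0
                for a, b in zip(prefix, tokens):
                    if a != b:
                        break
                    n += 1
                best = max(best, n)
            return best


    def route(request_tokens, replicas):
        """Send the request to the replica with the warmest matching cache."""
        return max(replicas, key=lambda r: r.cache_overlap(request_tokens))


    # Example: replica "b" already holds most of this prompt's prefix, so it wins.
    replicas = [
        Replica("a", cached_prefixes=[[1, 2]]),
        Replica("b", cached_prefixes=[[1, 2, 3, 4, 5]]),
    ]
    print(route([1, 2, 3, 4, 9], replicas).name)  # -> "b"

In llm-d, this kind of cache-aware placement is performed by the AI-aware routing layer inside Kubernetes rather than in application code.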

llm-d: unanimous support from industry leaders

This new open-source project already has the support of a formidable coalition of leading gen AI model providers, pioneers in AI accelerators, and top cloud platforms focused on AI. CoreWeave, Google Cloud, IBM Research, and NVIDIA are the founding collaborators, with AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI as partners, highlighting the strong industry collaboration to architect the future of large-scale LLM execution. The llm-d community also receives support from academic institutions such as the Sky Computing Lab at the University of California, creators of vLLM, and the LMCache Lab at the University of Chicago, creators of LMCache.

True to its unwavering commitment to open collaboration, Red Hat recognizes the critical importance of vibrant and accessible communities in the rapidly evolving landscape of gen AI inference. Red Hat will actively support the growth of the llm-d community, promoting an inclusive environment for new members and driving its continuous evolution.

Red Hat's vision: any model, any accelerator, any cloud

The future of AI should be defined by unlimited opportunities and not restricted by infrastructure silos. Red Hat envisions a horizon where organizations can deploy any model, on any accelerator, on any cloud, delivering an exceptional and more consistent user experience without exorbitant costs. To unlock the true potential of investments in gen AI, companies need a universal inference platform — a new standard for continuous and high-performance AI innovations, both now and in the coming years.

Just as Red Hat was a pioneer in transforming Linux into a fundamental backbone of modern IT, the company is now ready to architect the future of AI inference. vLLM has the potential to become a key component for standardized inference in gen AI, and Red Hat is committed to building a thriving ecosystem not only around the vLLM community but also the llm-d, focused on large-scale distributed inference. The vision is clear: regardless of the AI model, underlying accelerator, or deployment environment, Red Hat aims to make vLLM the definitive open standard for inference in the new hybrid cloud.

Red Hat Summit

Participate in the Red Hat Summit keynotes to hear the latest news from Red Hat executives, customers, and partners.
