Microsoft Azure: Splitwise tech reduces LLM power requirements by 20 percent


Microsoft Azure has developed a technique with which large language models can generate output 20 percent more efficiently. The technique splits the processing and generation phases of inference across different systems.

The Microsoft division explains in a blog post that Splitwise splits an LLM request into a prompt phase and a token phase. The former processes the user prompt, while the token phase generates the response, with each output token normally produced sequentially. By distributing these phases across separate GPU clusters, Microsoft claims it can achieve 1.4x higher LLM query throughput with 20 percent less power consumption. At the same power consumption, LLMs should be able to process 2.35 times as many queries in the same time.
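The split described above can be illustrated with a toy sketch: one worker pool handles the prompt (prefill) phase and hands its result to a second pool that runs the sequential token (decode) phase. Everything here is illustrative stand-in code under assumed names, not Microsoft's or vLLM's actual implementation; the "prefill" and "decode" steps are fake placeholders for real model computation.

```python
# Toy sketch of phase-split inference: a prompt pool "prefills" each
# request and hands the resulting context to a separate token pool,
# which generates output tokens sequentially. Names are hypothetical.
from queue import Queue
from threading import Thread

def prompt_worker(requests, handoff):
    # Prompt phase: process the full user prompt once per request,
    # producing a context (a stand-in for the KV cache).
    for req_id, prompt in requests:
        context = prompt.split()          # pretend "prefill" pass
        handoff.put((req_id, context))
    handoff.put(None)                     # sentinel: no more work

def token_worker(handoff, results):
    # Token phase: consume contexts and emit output tokens one at a
    # time, as a real decode loop would.
    while (item := handoff.get()) is not None:
        req_id, context = item
        tokens = [w.upper() for w in context]   # pretend decode loop
        results[req_id] = " ".join(tokens)

def run_splitwise(requests):
    # Wire the two pools together with a handoff queue, mimicking the
    # transfer of state between prompt and token GPU clusters.
    handoff, results = Queue(), {}
    producer = Thread(target=prompt_worker, args=(requests, handoff))
    consumer = Thread(target=token_worker, args=(handoff, results))
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return results

out = run_splitwise([(1, "hello world"), (2, "split the phases")])
print(out[1])  # HELLO WORLD
```

The point of the separation is that the two phases have different resource profiles (compute-bound prefill versus memory-bound decode), so each pool can be sized and powered independently.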

A third pool, running a mix of prompt and token generation, is used for mixed batching. This cluster scales dynamically in real time based on compute demand. Splitwise is now part of the vLLM open-source project and can therefore also be implemented in other frameworks, according to Microsoft.
