Exploring Hybrid CPU/GPU LLM Inference

Introduction

With releases of very large, capable, open-weight LLMs such as Meta’s Llama-3.1-405B and the more recent DeepSeek models, V3 and R1, there is increasing demand for systems capable of performing inference with these models. However, the size of these models puts pure GPU inference out of reach for many organizations and individual users. For example, even a system with eight 80GB GPUs (640GB of VRAM) is not capable of running Llama-405B in its native BF16 precision, which requires roughly 810GB for the weights alone, so two such nodes are required. Fully acknowledging the challenges of such a deployment, Meta released a version of the model quantized to FP8, capable of running on a single node equipped with eight 80GB GPUs. Even so, few can claim to have local access to such a system, even here at Puget Systems!

So what is a mere mortal to do? Renting cloud GPU time is one option and is often a great choice for short-term testing. However, it may or may not be a feasible solution in production due to costs or restrictions on transferring sensitive data. For a purely local solution, one option worth considering is a hybrid approach to inference, where system RAM is substituted for VRAM. Because system RAM has much lower bandwidth than VRAM, this comes with a sizeable performance penalty, but RAM is considerably less expensive per gigabyte, which makes the approach attractive. That raises the question: if you were to go this route, just how much slower would it be compared to an expensive multi-GPU setup?
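Before looking at numbers, it helps to have a rough mental model of why bandwidth matters so much. During token generation, every active weight has to be streamed from memory for each token, so memory bandwidth puts a hard ceiling on decode speed. The sketch below is a back-of-the-envelope estimate, not a measurement; the bandwidth figures, the ~37B active-parameter count for DeepSeek-R1, and the ~4-bit average weight size are all rounded assumptions.

```python
# Back-of-the-envelope upper bound on token generation speed for a
# memory-bandwidth-bound workload: every active weight must be read from
# memory once per generated token, so tokens/s <= bandwidth / bytes_per_token.
# All figures below are illustrative assumptions, not measurements.

def max_decode_tokens_per_second(bandwidth_gb_s, active_params_b, bytes_per_param):
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# DeepSeek-R1 activates roughly 37B of its 671B parameters per token; at a
# ~4-bit quantization that is on the order of 20 GB read per generated token.
active_b = 37
bytes_per_param = 4.5 / 8   # rough average for a 4-bit K-quant

for label, bw_gb_s in [
    ("Dual-channel desktop DDR5-5600", 89.6),
    ("12-channel server DDR5-5600 (one socket)", 537.6),
    ("RTX 6000 Ada GDDR6", 960.0),
]:
    limit = max_decode_tokens_per_second(bw_gb_s, active_b, bytes_per_param)
    print(f"{label}: <= {limit:.1f} tokens/s")
```

Real-world results fall well short of these ceilings, but the ratio between the rows is a useful guide to what substituting system RAM for VRAM costs.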

Test Setup

Thanks to our R&D team, I was able to perform some testing on one of our Puget Systems AMD EPYC Workstations, which is equipped with enough system RAM to load even the largest publicly available models.

Motherboard: Gigabyte MZ73-LM0 Rev 2.0
CPU: 2x AMD EPYC 9554 64-core Processors
RAM: 24x Kingston DDR5-5600 ECC Reg. 2R 64GB
GPU: NVIDIA RTX 6000 Ada Generation 48GB
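For context on the memory configuration: each EPYC 9004-series socket provides 12 DDR5 channels, so the 24 DIMMs populate one DIMM per channel across both sockets. A quick sketch of the theoretical peak bandwidth (illustrative arithmetic, not a measured figure):

```python
# Theoretical peak DRAM bandwidth for this system (not measured).
# DDR5-5600 transfers 5600 MT/s over a 64-bit (8-byte) channel.
channels_per_socket = 12        # EPYC 9004-series (SP5) memory channels
mt_per_s = 5600e6               # DDR5-5600
bytes_per_transfer = 8          # 64-bit channel width

per_socket_gb_s = channels_per_socket * mt_per_s * bytes_per_transfer / 1e9
print(f"Per socket:   ~{per_socket_gb_s:.1f} GB/s")      # ~537.6 GB/s
print(f"Both sockets: ~{2 * per_socket_gb_s:.1f} GB/s")  # ~1075.2 GB/s aggregate

# The aggregate figure is only reachable when each CPU reads from its own local
# DIMMs; traffic that crosses the socket-to-socket link is much more limited.
```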

For the model, I chose to test with Unsloth’s Q4_K_M quantization of DeepSeek-R1. Another exciting option provided by Unsloth is their sub-4-bit dynamic quantizations. These models’ size on disk ranges from 131GB (1.58-bit) to 212GB (2.51-bit) while offering improved output quality compared to statically quantized versions of the model. Compared to the 700GB of the unquantized model, these quantizations allow DeepSeek-R1 to be run on a much wider variety of hardware without significantly sacrificing the model’s quality.
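To put those file sizes in perspective, the rough relationship is simply total parameters multiplied by the average bits per weight. The sketch below is approximate; real GGUF quantizations keep some tensors at higher precision and add metadata, and the bit-width averages are assumptions.

```python
# Approximate on-disk size of a 671B-parameter model at different average bit
# widths. Real GGUF files mix tensor precisions and include metadata, so actual
# sizes differ somewhat from these estimates.
total_params = 671e9   # DeepSeek-R1 total parameter count

for label, avg_bits in [
    ("FP8 (native precision)", 8.0),
    ("Q4_K_M (~4.8 bits/weight average)", 4.8),
    ("Unsloth 2.51-bit dynamic", 2.51),
    ("Unsloth 1.58-bit dynamic", 1.58),
]:
    size_gb = total_params * avg_bits / 8 / 1e9
    print(f"{label}: ~{size_gb:.0f} GB")
```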
For the inference software, I chose to test with KTransformers, which advertises that it performs up to 28 times faster in the prefill phase and 3 times faster in the decode phase compared to llama.cpp. The exact performance gains depend on different factors like software versions and hardware used (e.g. KTransformers v0.3 includes optimizations for Intel Xeon AMX instructions). However, I did perform a few brief tests with llama.cpp and confirmed that KTransformers was significantly faster with the DeepSeek model chosen for this testing.
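For reference, a KTransformers run for this kind of setup looks roughly like the sketch below. The paths are placeholders, and the flag names follow the project’s DeepSeek documentation at the time of writing; treat them as assumptions and consult the `--help` output for your installed version.

```python
# Hedged sketch of launching KTransformers' local chat interface against a GGUF
# quantization of DeepSeek-R1. Paths are placeholders, and flag names are taken
# from the project's documentation at the time of writing -- verify them with
# `python -m ktransformers.local_chat --help` for your installed version.
import subprocess

subprocess.run([
    "python", "-m", "ktransformers.local_chat",
    "--model_path", "deepseek-ai/DeepSeek-R1",     # HF repo used for config/tokenizer
    "--gguf_path", "/models/DeepSeek-R1-Q4_K_M/",  # directory holding the GGUF shards
    "--cpu_infer", "126",                          # CPU threads used for the MoE experts
    "--max_new_tokens", "1000",
], check=True)
```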

Thoughts on FP4 Training

With the introduction of FP4 support on NVIDIA’s latest “Blackwell” GPUs, it’s worth remembering that the previous generation, “Ada Lovelace,” introduced support for FP8. Ada launched back in October of 2022, and researchers immediately jumped into investigating how FP8 capabilities could be used to make training LLMs faster and more efficient.

Unlike most models, DeepSeek-V3 (which served as the base model for creating R1) was trained with a mixed-precision framework using the FP8 data format. This means that V3 and R1’s native precision is FP8 rather than the more common FP16/BF16. DeepSeek-V3 wasn’t released until the very end of December 2024, more than two years after FP8-capable hardware became widely available. I’m not certain whether DeepSeek-V3 was the first publicly released model trained using the newly supported FP8 data format, but it has clearly had the biggest impact. Assuming a similar timeline for FP4 training, we shouldn’t expect to see models with a native precision of FP4 until 2026.

Hybrid CPU/GPU Inference Results

Although I was only able to scratch the surface of the testing I would ideally like to run, I do have some results to share from my time with this system. I used a series of four prompts of increasing size: the first at roughly 1,000 tokens, the second at roughly 7,500 tokens, and the final two at roughly 16,000 tokens each. Two series of tests were run using these prompts, one with 126 CPU threads utilized and one with 254 CPU threads.

Before diving into the data, it’s worth noting that, due to variations in the token count of the output, the duration of the decode phase is less directly comparable between tests than that of the prefill phase.

| 126 threads | Prefill Performance (Tokens per second) | Decode Performance (Tokens per second) | Prefill Time (seconds) | Decode Time (seconds) |
| --- | --- | --- | --- | --- |
| Prompt 1 | N/A | 13.71 | N/A | 64.79 |
| Prompt 2 | 152.19 | 12.2 | 50.65 | 94.93 |
| Prompt 3 | 146.46 | 10.73 | 107.59 | 172.99 |
| Prompt 4 | 157.94 | 9.99 | 107.05 | 111.20 |

Unfortunately, I found that the prefill result I had recorded for the first prompt during the 126-thread test was anomalous, so it has been omitted from the table above.

| 254 threads | Prefill Performance (Tokens per second) | Decode Performance (Tokens per second) | Prefill Time (seconds) | Decode Time (seconds) |
| --- | --- | --- | --- | --- |
| Prompt 1 | 172.74 | 8.42 | 5.53 | 116.24 |
| Prompt 2 | 150.43 | 7.39 | 51.24 | 126.55 |
| Prompt 3 | 125.89 | 7.96 | 125.03 | 204.99 |
| Prompt 4 | 126.49 | 7.89 | 134.00 | 160.91 |
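As a sanity check on the earlier caveat about output length, approximate prompt and output token counts can be recovered from the tables, since tokens ≈ tokens-per-second × seconds. A quick sketch using the 126-thread rows, assuming the reported rates are averages over each phase:

```python
# Recover approximate prompt and output lengths from the 126-thread results,
# assuming the reported rates are averages over each phase (tokens ~= rate * time).
runs_126 = {  # prompt: (prefill tok/s, decode tok/s, prefill s, decode s)
    "Prompt 2": (152.19, 12.2, 50.65, 94.93),
    "Prompt 3": (146.46, 10.73, 107.59, 172.99),
    "Prompt 4": (157.94, 9.99, 107.05, 111.20),
}

for name, (pf_rate, dec_rate, pf_time, dec_time) in runs_126.items():
    prompt_tokens = pf_rate * pf_time
    output_tokens = dec_rate * dec_time
    print(f"{name}: ~{prompt_tokens:.0f} prompt tokens, ~{output_tokens:.0f} output tokens")
```

The prompt estimates land close to the stated ~7,500 and ~16,000 token sizes, while the output lengths vary from roughly 1,100 to 1,900 tokens, which is exactly why the decode durations are not directly comparable between runs.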

When comparing the two series of tests, the immediate difference that stands out is that the 126-thread configuration consistently outperforms the 254-thread configuration during token generation. This makes sense considering that memory bandwidth is the limiting factor during the decoding phase. By using both CPUs, we are forcing the use of the interconnect between the CPUs, which not only limits memory bandwidth but also introduces additional latency and communication overhead.

By roughly doubling the number of CPU threads in use, we see an almost 40% drop in token generation speed when testing with the smallest prompt, and that performance gap shrinks to around 20-25% as the prompt size increases. Considering that we are almost certainly drawing more power to achieve this lower performance, the efficiency of the second configuration is abysmal, and it makes me regret not recording the system’s electricity consumption for comparison.

We also see that the prefill stage performance is more consistent when using fewer threads, staying at roughly 150 tokens per second. Although the computation during this phase is primarily performed by the GPU, it appears to be affected by the increased utilization of the CPU interconnect as well, with performance dropping to about 125 tokens per second when longer prompts are used in conjunction with the 254-thread configuration.

Performance Considerations

Regardless of the specifics of the software configuration, we can see that the end-user experience is considerably different from what we may be used to. Whereas pure GPU inference offers a largely real-time experience, with hybrid inference we can expect to spend several minutes waiting for a response to complete, depending on the size of the prompt and the desired length of the response. Depending on the use case, this may or may not be an acceptable trade-off for the ability to run models that are too large to fit entirely within VRAM.

Based on my limited testing, the best approach to selecting hardware for hybrid inference seems to involve selecting a single-socket motherboard with as many RAM channels as possible to maximize bandwidth. Although dual-socket motherboards offer higher CPU thread counts and total RAM capacities, the limitations of utilizing the CPU interconnect translate into decreased performance.
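If you are already working with a dual-socket system, one mitigation worth experimenting with is confining the inference process to a single socket so that weights are read from local DRAM. Below is a minimal sketch using Linux CPU affinity; the core numbering is an assumption (check `lscpu` or `numactl -H` for your system), and launching under `numactl --cpunodebind=0 --membind=0` is the more complete approach, since CPU affinity alone does not control where memory is allocated.

```python
# Pin the current process to the cores of NUMA node 0 so compute stays on one
# socket (Linux-only). The core ID range is an assumption -- verify the
# node-to-core mapping with `lscpu` or `numactl -H`. Note that affinity alone
# does not bind memory allocations; for that, launch the whole process under
# `numactl --cpunodebind=0 --membind=0` instead.
import os

node0_cores = set(range(0, 64))        # assumed: cores 0-63 live on socket 0
os.sched_setaffinity(0, node0_cores)   # pid 0 = the current process
print(f"Now restricted to {len(os.sched_getaffinity(0))} hardware threads")
```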

Conclusion

The cost of hardware is a major barrier to deploying the largest LLMs available today. Very large models like DeepSeek-R1 or Llama-3.1-405B can require hundreds of thousands of dollars worth of hardware to run entirely on GPUs. Hybrid inference, utilizing CPUs and system memory in addition to GPUs and VRAM, offers a more affordable alternative to pure GPU inference, albeit with slower performance.

A hybrid approach will never be as fast as a pure GPU one (whether in the cloud or on a local server) and can require several minutes to complete a response. However, for those who cannot afford a multi-hundred-thousand-dollar server and are willing to sacrifice inference speed for the improved output quality that larger models offer, a hybrid inference solution (based on something like our AMD EPYC Servers) is worth considering.
