
Hardware Recommendations for Large Language Model Servers

Our hardware recommendations for large language model (LLM) AI servers below are provided by Dr. Kinghorn. These answers are intended to provide broad guidance, but specific situations may have unique requirements.

Large Language Model Server System Requirements

Quickly Jump To: Processor (CPU) • Video Card (GPU) • Memory (RAM) • Storage (Drives)

AI large language models have been advancing at a break-neck pace in recent months, with newer and better models constantly coming out. Likewise, hardware focused on powering these models is constantly being developed by the likes of NVIDIA, Intel, AMD, and others. What the “best” hardware is for hosting a given model will vary depending on your exact situation, but there are some general guidelines we can provide. The Q&A discussion below, with answers provided by Dr. Donald Kinghorn along with other members of our Labs team, will cover the basics. Please check out his HPC blog for more info, and of course reach out to our team of expert consultants for help configuring and purchasing your own LLM server.

Processor (CPU)

For large language model servers, the particular CPU is generally not as important as the platform it is installed on. We strongly recommend a server grade platform like Intel Xeon or AMD EPYC for hosting LLMs and applications using them. Those platforms have key features like lots of PCI-Express lanes for GPUs and storage, high memory bandwidth/capacity, and ECC memory support.

What CPU is best for running large language models (LLMs)?

The two recommended CPU platforms are Intel’s Xeon W and AMD’s Threadripper PRO. Both of these offer high core counts, excellent memory performance & capacity, and large numbers of PCIe lanes. Specifically, the 32-core versions of either of these are recommended, as they offer a good balance of core utilization and memory performance.

Do more CPU cores make large language models faster?

Unless you are running an LLM on a CPU, which is not generally recommended, the number of CPU cores will have little impact beyond the need for at least one core per GPU in the system.

However, when your workflow involves more than just running generative models, the CPU can have a large impact. For example, if part of your workflow involves data collection, manipulation, or pre-processing, the CPU may be a critical component in your pipeline.

Data pipelines, including ingestion, preprocessing, initial parsing, creating embeddings, and vector stores, may be run on the CPU to keep those loads off of the GPU. In this case the CPU could be a significant component in your application chain.
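
As a rough illustration, here is a minimal sketch of a CPU-side ingestion and embedding step that keeps preprocessing off the serving GPUs. The sentence-transformers library and the model name are illustrative assumptions, not part of the recommendation above; any embedding library would work similarly.

```python
# Minimal CPU-side preprocessing/embedding sketch (illustrative; assumes the
# sentence-transformers package is installed).
from sentence_transformers import SentenceTransformer

def preprocess(doc: str) -> list[str]:
    # Naive paragraph chunking; real pipelines add cleaning, deduplication,
    # and chunk overlap before embedding.
    return [chunk.strip() for chunk in doc.split("\n\n") if chunk.strip()]

# device="cpu" pins the embedding work to CPU cores, leaving GPU VRAM free
# for the generative model being served.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

docs = ["First paragraph of a document.\n\nSecond paragraph with more detail."]
chunks = [c for doc in docs for c in preprocess(doc)]
vectors = embedder.encode(chunks)  # (num_chunks, embedding_dim) numpy array
print(vectors.shape)  # these vectors can now be loaded into a vector store
```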

Do large language models work better with Intel or AMD CPUs?

It mostly doesn’t matter if AMD or Intel CPUs are used as long as the overall platform is high-quality and server-grade.

Video Card (GPU)

Applications utilizing LLMs have been made possible entirely because of GPUs’ extraordinary performance for this type of computational problem!

What type of GPU (video card) is best for large language models?

For LLM server applications, “Professional” or “Compute” level GPUs are recommended. This is because larger amounts of VRAM are available and because they are better suited for the cooling environment of a server chassis. Examples would be NVIDIA’s RTX 6000 Ada, L40S, and H100 – or AMD’s Instinct MI series GPUs.

How much VRAM (video memory) do large language models need?

When working with LLMs, the total amount of VRAM available is often the limiting factor in what can be done. For example, serving a state-of-the-art (SOTA) ~70B-parameter model in its native precision requires nearly 200GB of VRAM. Llama3-70B, for instance, can be served with good performance in multi-user environments (small to mid-sized organizations) with 4x RTX 6000 Ada or L40S GPUs.
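
The back-of-envelope math behind that figure is simple: parameter count times bytes per parameter, plus runtime overhead. Here is a hedged sketch; the 1.3x overhead multiplier is a loose assumption covering KV cache and framework buffers, not a fixed rule.

```python
# Back-of-envelope VRAM estimate for serving an LLM. The overhead multiplier
# is an assumption; real usage depends on context length, batch size, and
# the serving framework.
def estimate_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.3) -> float:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes = GB
    return weights_gb * overhead

print(estimate_vram_gb(70))                       # FP16/BF16: ~182 GB
print(estimate_vram_gb(70, bytes_per_param=0.5))  # 4-bit quantized: ~45 GB
```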

Will multiple GPUs improve performance for large language model servers?

Yes! LLM servers and frameworks will make good use of multiple GPUs. A Linux-based server with 4 to 8 GPUs is a “standard” sized system.
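
As a concrete illustration, most serving frameworks expose multi-GPU support as a single tensor-parallelism setting. Below is a minimal sketch using vLLM’s Python API; vLLM and the model ID are illustrative assumptions, and other frameworks offer equivalent options.

```python
# Minimal multi-GPU serving sketch with vLLM (illustrative; assumes vLLM is
# installed and four GPUs are visible to the process).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model id
    tensor_parallel_size=4,  # shard the model's weights across 4 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Why use multiple GPUs for LLM serving?"], params)
print(outputs[0].outputs[0].text)
```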

Do large language models run better on NVIDIA or AMD GPUs?

NVIDIA is the historic leader in GPU computing and is largely responsible for the rapid development of AI. They continue to innovate and produce significant generation-over-generation improvements in their design. However, AMD has made great headway in the past year. AMD’s ROCm alternative to NVIDIA CUDA is being actively supported by Hugging Face and PyTorch.

Do large language models require a “professional” video card?

Technically no, but professional cards offer much higher amounts of VRAM per card – so for serious LLM servers they are almost always the right option. Moreover, consumer-grade video cards tend to take up more space and have cooling systems that are not well suited to use in rackmount chassis.

Most NVIDIA graphics cards no longer support NVLink, but there are select GPUs that do – like the NVIDIA H100 NVL. For supported GPUs, you would want to utilize NVLink… but it is not a requirement for LLM hosting.
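
If you want to verify how your GPUs are connected, PyTorch can report whether peer-to-peer access is available between devices. This is a quick diagnostic sketch, assuming PyTorch with CUDA support is installed; note that peer access may run over NVLink or PCIe, so it confirms connectivity rather than NVLink specifically.

```python
# Check GPU peer-to-peer connectivity (diagnostic sketch; peer access may
# use NVLink or PCIe depending on the hardware).
import torch

count = torch.cuda.device_count()
for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```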

Memory (RAM)

How much system RAM do large language models need?

NVIDIA (and we) recommend at least twice as much CPU system memory as there is total GPU VRAM. This accommodates full “memory pinning” to CPU space to facilitate efficient buffering.
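
In practice the rule of thumb is easy to apply, as in this short sketch (the example GPU configuration is illustrative):

```python
# The "2x total VRAM" system-RAM rule of thumb described above.
def recommended_ram_gb(num_gpus: int, vram_per_gpu_gb: int) -> int:
    total_vram = num_gpus * vram_per_gpu_gb
    return 2 * total_vram  # headroom for pinned buffers, OS, and services

print(recommended_ram_gb(4, 48))  # e.g., 4x 48GB RTX 6000 Ada -> 384GB RAM
```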

Storage (Hard Drives)

What storage configuration works best for large language models?

We recommend high-capacity (2-8TB) NVMe solid-state drives for systems hosting LLMs. Model parameters and data sets can occupy large amounts of storage and many servers may also need to host large databases for application use. Additional NVMe SSDs can be used for that storage, in software-controlled arrays if local data redundancy is desired.

Should I use network-attached storage for large language models?

LLM parameters should be kept locally on the server for best performance, but network-attached storage can be a viable option for backup or sharing data across multiple systems.


Looking for an LLM Server?

We build computers that are tailor-made for your workflow. 


Don’t know where to start? We can help!

Get in touch with one of our technical consultants today.