[MLOps] Multi-GPU server setup
This article was written on 2025-02-04.
Motherboard: ASUS ROG Strix B650-A
GPU: RTX 4090 × 2
CPU: AMD Ryzen 9 7950X
RAM: 128 GB
Install OS
I installed Ubuntu Server 24.04.1 LTS.
After installing the OS, update the packages first.
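The usual update on a fresh Ubuntu Server install:

```shell
# Refresh package lists and upgrade everything
sudo apt update && sudo apt upgrade -y
# Reboot if a new kernel was installed
sudo reboot
```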
Install Docker
In this case, I followed the official docs.
DO: install Docker from Docker's official apt repository, following the official documentation.
DO NOT: manually install Docker from Ubuntu's default packages.
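A sketch of the official-docs route (the apt repository method; these commands follow Docker's current Ubuntu instructions, so check the docs in case they have changed):

```shell
# Add Docker's official GPG key and apt repository
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install the Docker engine and plugins
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```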
Setup NVIDIA Driver
List the available drivers; you should see entries such as:
nvidia-utils-550-server
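The driver list above can be produced with `ubuntu-drivers` (assuming the `ubuntu-drivers-common` package is available):

```shell
sudo apt install -y ubuntu-drivers-common
ubuntu-drivers devices
# Output lists entries like "driver : nvidia-driver-550-server"
```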
This package contains only the user-space utilities for NVIDIA GPUs. It does not include the kernel driver, but provides:
- CUDA libraries (`libcuda.so`, `libnvcuvid.so`)
- NVIDIA tools (`nvidia-smi`, `nvidia-settings`)
- Compatibility for headless systems
🔹 Best for:
Servers or Docker containers that need GPU acceleration but do not require a display output.
AI/ML training environments where only CUDA runtime and GPU management are needed.
Headless machines (no X server, no GUI).
🔹 Installation:
Install the 550 server driver (or let Ubuntu install the recommended driver), and reboot!
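A sketch of the two options (package name assumed from the driver list above):

```shell
# Option A: install the 550 server driver explicitly
sudo apt install -y nvidia-driver-550-server
# Option B: let Ubuntu install the recommended driver
# sudo ubuntu-drivers install
sudo reboot
```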
Check driver & GPU status:
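After the reboot, verify the driver sees both cards:

```shell
nvidia-smi
# Both RTX 4090s should be listed, with the 550.xx driver version in the header
```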
Let's install CUDA.
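One way to get the CUDA toolkit is Ubuntu's repository package (NVIDIA's own apt repository is an alternative if you need a specific version):

```shell
sudo apt install -y nvidia-cuda-toolkit
nvcc --version   # verify the CUDA compiler is installed
```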
Stress test
Using GPU-Burn is a great way to stress-test your multi-GPU setup and verify stability after installing the NVIDIA 550-server driver.
🔹 Step 1: Install Dependencies
Before downloading GPU-Burn, install necessary dependencies:
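A minimal dependency install, assuming the CUDA toolkit is already present from the previous step:

```shell
sudo apt install -y git build-essential
```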
This ensures you have:
- Git (to clone GPU-Burn)
- Build tools (`gcc`, `make`)
- CUDA Toolkit (to compile GPU-Burn)
🔹 Step 2: Clone GPU-Burn Repository
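Assuming the commonly used upstream repository (wilicc/gpu-burn):

```shell
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
```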
🔹 Step 3: Compile GPU-Burn
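Compilation is a plain `make` (it invokes `nvcc`, so the CUDA toolkit must be on your PATH):

```shell
make
```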
This will generate the `gpu_burn` executable.
🔹 Step 4: Run GPU Stress Test
1️⃣ Test All GPUs for 60 Seconds
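GPU-Burn takes the test duration in seconds:

```shell
./gpu_burn 60
```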
🔹 Expected output: if both GPUs reach 100% load, multi-GPU support is working.
2️⃣ Run for a Longer Duration (e.g., 10 Minutes)
This will test GPUs for 10 minutes.
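Same command, longer duration:

```shell
./gpu_burn 600   # 600 s = 10 minutes
```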
3️⃣ Run Test on a Specific GPU
Run on GPU 0:
Run on GPU 1:
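One portable way to restrict the test to a single GPU is `CUDA_VISIBLE_DEVICES` (newer GPU-Burn builds also accept a device-index flag):

```shell
CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60   # GPU 0 only
CUDA_VISIBLE_DEVICES=1 ./gpu_burn 60   # GPU 1 only
```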
🔹 Step 5: Monitor GPU Load & Temperature
While GPU-Burn is running, monitor GPU stats with:
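In a second terminal:

```shell
watch -n 1 nvidia-smi
```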
This updates every 1 second, showing:
GPU utilization
Temperature
Power consumption
Active processes
🔍 Monitor While Running the Tests
Check GPU Clock Speed
PClk (core clock) should be ~2520 MHz under full load.
If stuck at ~2100 MHz or lower, power or cooling might be limiting performance.
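The current and maximum SM clocks can be queried directly:

```shell
nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm --format=csv
```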
Check PCIe Speed While Running
Ensure it shows 16GT/s, x16 (not 2.5GT/s, x4).
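The current link generation and width can also be queried with `nvidia-smi` (Gen 4 corresponds to 16GT/s):

```shell
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv
```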
Check PCIe settings
For a proper GPU burn test and PCIe slot assignment check on your dual RTX 4090 setup, follow these steps:
1. GPU Burn Test (Stress Test)
To check if your GPUs are performing at their peak, you can use `gpu-burn`, FurMark, or `stress-ng`.
Option 1: `gpu-burn` (Recommended)
This tool stresses CUDA cores to measure real-world performance.
Install Dependencies:
Download and Build `gpu-burn`:
Run the Test (adjust time as needed):
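These steps use the same commands as the earlier GPU-Burn section, consolidated here:

```shell
sudo apt install -y git build-essential
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make
./gpu_burn 120   # adjust duration (in seconds) as needed
```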
If temps go too high, press Ctrl + C to stop.
Option 2: FurMark (Windows)
For Windows users, use FurMark with the FurMark GPU Stress Test to check power limits.
2. Check PCIe Slot Assignment
Since you’re only getting 51 TFLOPS per GPU, it's possible that:
One GPU is running at PCIe x8 instead of x16.
Power limits are throttling performance.
Check PCIe Bandwidth (Linux)
Run `nvidia-smi` to check the link speed. Look for `Link Width` and `Link Speed`. For optimal performance, it should show 16GT/s, x16.
If you see x8, one of the GPUs is running at half speed.
Check PCIe Speed with `lspci`
Look for:
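A sketch that prints the negotiated link status for each NVIDIA GPU found by `lspci`:

```shell
# Find NVIDIA GPU bus addresses, then print each card's link status
for dev in $(lspci | grep -i 'nvidia' | grep -iE 'vga|3d' | cut -d' ' -f1); do
  echo "== $dev =="
  sudo lspci -vv -s "$dev" | grep 'LnkSta:'
done
# Healthy: "LnkSta: Speed 16GT/s, Width x16"
```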
If Width is x8, check BIOS settings.
Check PCIe Slot Usage (Windows)
Use GPU-Z and check the Bus Interface section.
3. Power Limit Check
If PCIe slots are fine, power limits may be capping performance.
Check Power Limits via `nvidia-smi`
RTX 4090 should be around 450W (stock) or 600W (unlocked).
If it's lower, try increasing it:
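Check the current and maximum limits, then raise them if the card's vBIOS allows:

```shell
sudo nvidia-smi -q -d POWER | grep -i 'power limit'
# Raise the limit per GPU (only works up to the card's max power limit)
sudo nvidia-smi -i 0 -pl 600
sudo nvidia-smi -i 1 -pl 600
```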
Check Performance Mode
Should be `P0` for full performance.
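The performance state can be queried directly:

```shell
nvidia-smi --query-gpu=index,pstate --format=csv
# Under load, both GPUs should report P0
```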
Possible Issues & Fixes
| Issue | Fix |
| --- | --- |
| GPU running at x8 | Check motherboard manual, BIOS settings, or re-seat the GPU |
| Low power limit | Increase with `nvidia-smi -pl 600` |
| Low GPU utilization | Check with `nvidia-smi dmon`, ensure workloads are running efficiently |
| Thermal throttling | Monitor `nvidia-smi`, adjust cooling |
Next Steps
- Run `gpu-burn` and check performance.
- Check PCIe bandwidth using `nvidia-smi` and `lspci`.
- Monitor power usage and adjust if needed.
Now that you've installed tmux, follow these steps to run GPU-Burn and monitor `nvidia-smi` in a split terminal.
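A minimal tmux workflow (key bindings assume the default prefix `Ctrl+b`):

```shell
tmux new -s gpuburn        # start a named session
# Ctrl+b then %            -> split the window vertically
# Left pane:  ./gpu_burn 600
# Right pane: watch -n 1 nvidia-smi
# Ctrl+b then arrow keys   -> move between panes
```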