[MLOps] Multi-GPU server setup

This article was written on 2025-02-04.

motherboard: ASUS ROG Strix B650-A

GPU: RTX 4090 * 2

CPU: AMD Ryzen 9 7950X

RAM: 128 GB

Install OS

I installed Ubuntu Server 24.04.1 LTS

After installing the OS, update the packages first:

sudo apt update && sudo apt upgrade -y

Install Docker

In this case, I followed the official docs.

DO: the convenience script

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh ./get-docker.sh --dry-run   # preview the steps the script will run
sudo sh ./get-docker.sh             # then run the actual installation

DO NOT: the manual install via the apt repository

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

Then install the Docker packages:

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Post-install: start Docker, verify it, and add your user to the docker group:

sudo service docker start
sudo docker run hello-world
sudo usermod -aG docker $USER
newgrp docker
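
Two optional follow-ups, both taken from Docker's standard post-install guidance rather than anything specific to this build: confirm Docker runs without sudo after re-logging in, and enable the services at boot so the server comes back up ready.

docker run --rm hello-world                                # should now work without sudo
sudo systemctl enable docker.service containerd.service    # start Docker automatically on boot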

Set Up NVIDIA Driver

Type:

nvidia-smi

Since the driver is not installed yet, Ubuntu suggests the matching driver packages, for example:

nvidia-utils-550-server

This package contains only the user-space utilities for NVIDIA GPUs. It does not include the kernel driver, but provides:

  • CUDA libraries (libcuda.so, libnvcuvid.so).

  • NVIDIA tools (nvidia-smi, nvidia-settings).

  • Compatibility for headless systems.

🔹 Best for:

  • Servers or Docker containers that need GPU acceleration but do not require a display output.

  • AI/ML training environments where only CUDA runtime and GPU management are needed.

  • Headless machines (no X server, no GUI).
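
Before installing, you can also cross-check which driver Ubuntu recommends for the detected GPUs. This is standard Ubuntu tooling, not part of the original steps:

ubuntu-drivers devices    # lists each GPU and the recommended driver package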

🔹 Installation:

Install the 550 server driver:

sudo apt install -y nvidia-driver-550-server

or

sudo ubuntu-drivers --gpgpu install nvidia:550-server

and reboot!

Check the driver & GPU status:

nvidia-smi
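
If nvidia-smi does not come up after the reboot, two quick checks (standard commands, assuming the 550-server packages installed cleanly) confirm whether the kernel module actually loaded:

lsmod | grep nvidia                  # the nvidia kernel modules should be listed
cat /proc/driver/nvidia/version      # reports the loaded driver version (expect 550.x)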

Let's install CUDA:

sudo apt install nvidia-cuda-toolkit
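
A quick sanity check for the toolkit; note that the Ubuntu-packaged nvidia-cuda-toolkit may lag behind NVIDIA's own CUDA repository:

nvcc --version    # prints the installed CUDA compiler version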

Stress test

Using GPU-Burn is a great way to stress-test a multi-GPU setup and verify stability after installing the NVIDIA 550-server driver.


🔹 Step 1: Install Dependencies

Before downloading GPU-Burn, install the necessary dependencies:

sudo apt update && sudo apt install -y git build-essential nvidia-cuda-toolkit

This ensures you have:

  • Git (to clone GPU-Burn)

  • Build tools (gcc, make)

  • CUDA Toolkit (to compile GPU-Burn)


🔹 Step 2: Clone GPU-Burn Repository

git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn

🔹 Step 3: Compile GPU-Burn

make

This will generate the executable gpu_burn.


🔹 Step 4: Run GPU Stress Test

1️⃣ Test All GPUs for 60 Seconds

./gpu_burn 60

Expected output:

GPU 0: NVIDIA RTX 4090, 100% load
GPU 1: NVIDIA RTX 4090, 100% load
...
PASS (all values within tolerance)

🔹 If both GPUs reach 100% load, multi-GPU support is working.

2️⃣ Run for a Longer Duration (e.g., 10 Minutes)

./gpu_burn 600

This will test GPUs for 10 minutes.

3️⃣ Run Test on a Specific GPU

Run on GPU 0:

CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60

Run on GPU 1:

CUDA_VISIBLE_DEVICES=1 ./gpu_burn 60
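
A plain ./gpu_burn run already exercises every visible GPU at once, but if you want a separate log per card, the single-GPU tests can be launched in parallel. A small sketch (the log file names are just examples):

CUDA_VISIBLE_DEVICES=0 ./gpu_burn 60 > gpu0.log 2>&1 &
CUDA_VISIBLE_DEVICES=1 ./gpu_burn 60 > gpu1.log 2>&1 &
wait                              # block until both background runs finish
tail -n 5 gpu0.log gpu1.log       # inspect the end of each log for errors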

🔹 Step 5: Monitor GPU Load & Temperature

While GPU-Burn is running, monitor GPU stats with:

watch -n 1 nvidia-smi

This updates every 1 second, showing:

  • GPU utilization

  • Temperature

  • Power consumption

  • Active processes
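
If you prefer a compact, scriptable view instead of the full nvidia-smi screen, the query interface can log just these fields (all standard nvidia-smi query options):

nvidia-smi --query-gpu=index,utilization.gpu,temperature.gpu,power.draw,clocks.sm --format=csv -l 1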

🔍 Monitor While Running the Tests

Check GPU Clock Speed

nvidia-smi dmon
  • PClk (core clock) should be ~2520 MHz under full load.

  • If stuck at ~2100 MHz or lower, power or cooling might be limiting performance.
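
Instead of eyeballing dmon, you can also compare the live SM clock against the maximum the card reports, using standard query fields:

nvidia-smi --query-gpu=index,clocks.sm,clocks.max.sm --format=csv -l 1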

Check PCIe Speed While Running

sudo watch -n 1 "lspci -vvv | grep -i nvidia -A 30 | grep 'Lnk'"
  • Ensure it shows 16GT/s, x16 (not 2.5GT/s, x4).

Check PCIe settings

nvidia-smi -q | grep -A 10 "PCI" | less

For a proper GPU burn test and a PCIe slot assignment check on a dual RTX 4090 setup, follow these steps:


1. GPU Burn Test (Stress Test)

To check if your GPUs are performing at their peak, you can use gpu-burn, FurMark, or stress-ng.

Option 1: gpu-burn (Recommended)

This tool stresses CUDA cores to measure real-world performance.

  1. Install Dependencies:

    sudo apt update && sudo apt install git build-essential nvidia-cuda-toolkit -y
  2. Download and Build gpu-burn:

    git clone https://github.com/wilicc/gpu-burn.git
    cd gpu-burn
    make
  3. Run the Test (adjust time as needed):

    ./gpu_burn 300  # Runs for 5 minutes
    • If temps go too high, press Ctrl + C to stop.

Option 2: FurMark (Windows)

For Windows users, FurMark's built-in GPU stress test can be used to check power limits.


2. Check PCIe Slot Assignment

In my case I was only getting about 51 TFLOPS per GPU, so it's possible that:

  • One GPU is running at PCIe x8 instead of x16.

  • Power limits are throttling performance.

Check PCIe Bandwidth (Linux)

  1. Run nvidia-smi to check the link speed:

    nvidia-smi -q | grep -A 10 "PCI"
    • Look for Link Width and Link Speed.

    • For optimal performance, it should show:

      Link Width: x16
      Link Speed: 16.0 GT/s
    • If you see x8, one of the GPUs is running at half speed.

  2. Check PCIe Speed with lspci

    sudo lspci -vvv | grep -i nvidia -A 30 | less
    • Look for:

      LnkSta: Speed 16GT/s, Width x16
    • If Width is x8, check BIOS settings.
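
nvidia-smi can also report the link state directly, which is sometimes easier than parsing lspci. The link may train down to a lower generation at idle, so run this while gpu_burn is loading the cards (field names are standard nvidia-smi query options):

nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv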

Check PCIe Slot Usage (Windows)

Use GPU-Z and check the Bus Interface section.


3. Power Limit Check

If PCIe slots are fine, power limits may be capping performance.

  1. Check Power Limits via nvidia-smi

    nvidia-smi -q | grep "Power Limit"
    • RTX 4090 should be around 450W (stock) or 600W (unlocked).

    • If it's lower, try increasing it:

      sudo nvidia-smi -i 0 -pl 600
      sudo nvidia-smi -i 1 -pl 600
  2. Check Performance Mode

    nvidia-smi -q | grep -i "performance state"
    • Should be P0 for full performance.
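
Two related notes, both standard nvidia-smi usage rather than anything from the original setup: limits set with -pl do not survive a reboot, and -q -d POWER shows the range of limits the card will actually accept before you request 600 W.

sudo nvidia-smi -pm 1     # enable persistence mode so the driver (and applied settings) stay loaded
nvidia-smi -q -d POWER    # shows default, min, max, and current power limits per GPU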


Possible Issues & Fixes

Issue                    Fix
GPU running at x8        Check the motherboard manual and BIOS settings, or re-seat the GPU
Low power limit          Increase with nvidia-smi -pl 600
Low GPU utilization      Check with nvidia-smi dmon; ensure workloads are running efficiently
Thermal throttling       Monitor nvidia-smi; adjust cooling


Next Steps

  1. Run gpu-burn and check performance.

  2. Check PCIe bandwidth using nvidia-smi and lspci.

  3. Monitor power usage and adjust if needed.

  4. Apply further optimizations based on the results.

Finally, tmux makes it easy to run GPU-Burn and monitor nvidia-smi side by side in a split terminal, as sketched below.
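
A minimal sketch, assuming tmux's default Ctrl+b prefix and that you are still in the gpu-burn directory from the steps above:

sudo apt install -y tmux    # if tmux is not installed yet
tmux new -s burn            # start a session named "burn"
# inside tmux: press Ctrl+b then "  to split into a top and bottom pane
# top pane:
./gpu_burn 600
# press Ctrl+b then an arrow key (or Ctrl+b o) to move to the bottom pane:
watch -n 1 nvidia-smi
# when done: exit both panes, or run  tmux kill-session -t burn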
