How to Set Up An Ubuntu GPU Machine for LLM Inference with vLLM
This guide walks you through setting up an Ubuntu GPU machine for running Large Language Model (LLM) inference with vLLM. It covers installing the NVIDIA drivers and CUDA Toolkit, setting up vLLM, and securing access with Nginx and Let's Encrypt.
The steps below take you from a fresh server to a working, HTTPS-protected LLM inference endpoint.
Prerequisites
Before you begin, ensure you have the following:
- Ubuntu 22.04 Server installed.
- NVIDIA CUDA-compatible GPU.
- Registered Domain Name (if you intend to access your server via a domain).
- DNS A records configured to point your domain to your server’s public IP address (if using a domain).
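If you are using a domain, you can confirm the A record has propagated before starting (example.com stands in for your domain; dig ships in the dnsutils package):
# Should print your server's public IP address
dig +short A example.com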
Step 1: Install NVIDIA Drivers and CUDA Toolkit
Follow these steps to install the necessary NVIDIA drivers and CUDA Toolkit on your Ubuntu 22.04 machine.
1.1. Upgrade Your Ubuntu System
Start by updating and upgrading your Ubuntu packages to ensure you have the latest software versions.
sudo apt update
sudo apt upgrade
1.2. List Recommended NVIDIA Drivers
Identify the recommended NVIDIA driver for your system using the ubuntu-drivers utility.
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers devices
This command will list your NVIDIA GPU and recommend a driver version. In the example output below, nvidia-driver-535 was recommended.
model : GP108M [GeForce MX150] (Mi Notebook Pro [GeForce MX150])
driver : nvidia-driver-535 - distro non-free recommended
1.3. Install the Recommended NVIDIA Driver
Install the recommended NVIDIA driver. Replace nvidia-driver-535 with the driver version recommended for your system if it’s different.
sudo apt install nvidia-driver-535
1.4. Reboot Your System
Reboot your system to apply the driver changes.
sudo reboot now
1.5. Verify Driver Installation
After rebooting, verify that the NVIDIA driver is installed correctly using nvidia-smi.
nvidia-smi
You should see output similar to the following, confirming the driver version and CUDA compatibility. Note that the CUDA Toolkit is not fully installed yet at this stage, but the driver component is ready.
NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2
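The same tool is useful later for watching GPU memory and utilization while the inference server is running; a simple refresh loop is enough:
# Refresh the GPU status every second (Ctrl+C to stop)
watch -n 1 nvidia-smi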
1.6. Install GCC Compiler
Ensure GCC compiler is installed, as it’s required for the CUDA Toolkit installation.
sudo apt install gcc
Verify GCC installation by checking its version.
gcc -v
1.7. Install CUDA Toolkit
Download and install the CUDA Toolkit from NVIDIA’s website. Follow these steps provided by NVIDIA for Ubuntu 22.04.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
If you encounter dependency issues, try fixing them with:
sudo apt --fix-broken install
1.8. Reboot Your System Again
Reboot your system after installing the CUDA Toolkit to load necessary modules.
sudo reboot now
1.9. Set Up Environment Variables
Configure environment variables to use CUDA effectively. Add the following lines to your ~/.bashrc file.
nano ~/.bashrc
Add these lines at the end of the file:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Note: The cuda-12.2 in LD_LIBRARY_PATH may need to be adjusted to match the CUDA version actually installed. If you installed the latest cuda package, /usr/local/cuda is a symbolic link to the latest versioned directory, so /usr/local/cuda/lib64 would also work; explicitly using cuda-12.2 as in the example pins the path to a specific version.
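If you are unsure which versioned directory to point LD_LIBRARY_PATH at, a quick look at the default install location shows what is present:
# List the installed CUDA toolkit directories
ls -d /usr/local/cuda*
# Show which versioned directory the /usr/local/cuda symlink resolves to
readlink -f /usr/local/cuda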
Save the file (Ctrl+X, then Y, then Enter) and reload the ~/.bashrc to apply the changes.
. ~/.bashrc
1.10. Test CUDA Toolkit Installation
Verify the CUDA Toolkit installation by checking the nvcc compiler version.
nvcc -V
This should output the nvcc version, confirming successful CUDA Toolkit installation.
Step 2: Install and Configure vLLM
Now, let’s set up vLLM for LLM inference.
2.1. Install Python and vLLM
It’s recommended to use a virtual environment for Python projects. You can use uv or conda. Here are instructions for both.
Using uv (recommended for speed):
First, install uv if you haven’t already (refer to uv documentation). Then, create a virtual environment and install vLLM.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm
Using conda:
If you prefer conda, install it first if you haven’t. Then, create a conda environment and install vLLM.
conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm
Note: Python versions 3.9 - 3.12 are supported. Python 3.12 is used in the examples above.
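With the environment activated, a quick check confirms the interpreter version and that vLLM landed in this environment (pip show prints the package name and version in its first two lines):
# Confirm the Python version and the installed vLLM package
python --version
pip show vllm | head -n 2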
2.2. Start vLLM Server
Start the vLLM server with its OpenAI-compatible API. Adjust the model, port, and --tensor-parallel-size as needed for your setup. This example uses the microsoft/Phi-4 model, port 11434, and a tensor parallelism size of 2 (adjust based on your GPU setup). The command runs in the background and logs to vllm.log.
nohup vllm serve microsoft/Phi-4 --port 11434 --tensor-parallel-size 2 --dtype=half > vllm.log 2>&1 &
- microsoft/Phi-4: Replace this with the desired model from the vLLM supported models list.
- --port 11434: The port on which the vLLM server will listen.
- --tensor-parallel-size 2: Adjust this based on the number of GPUs you want to use for inference. For a single GPU, use 1. For multiple GPUs, adjust accordingly.
- --dtype=half: Uses half-precision floating-point format (float16) for faster inference and reduced memory usage.
You can check the logs in vllm.log for any errors or to monitor the server startup.
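As a quick sanity check, assuming the port used above and vLLM's OpenAI-compatible routes, you can follow the startup log and then ask the server which models it is serving:
# Follow the startup log until the server reports it is ready (Ctrl+C to stop)
tail -f vllm.log
# List the models the running server exposes
curl http://localhost:11434/v1/models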
Step 3: Install and Configure Nginx as Reverse Proxy
Set up Nginx as a reverse proxy to access the vLLM server securely and potentially add basic authentication.
3.1. Install Nginx
Install Nginx on your Ubuntu server.
sudo apt update
sudo apt install nginx
3.2. Configure Nginx Server Block
Create or modify your Nginx server block configuration file. You can create a new file in /etc/nginx/sites-available/ (e.g., example.com) and then create a symbolic link to /etc/nginx/sites-enabled/. For this guide, let’s assume you are creating a new file:
sudo nano /etc/nginx/sites-available/example.com
Paste the following configuration into the file. Remember to replace example.com with your actual domain name and adjust the proxy_pass URL if you are using a different port for vLLM.
server {
    server_name example.com;

    location / {
        # Skip the authentication check for the specified IP address (replace "IP" with your IP if needed)
        if ($remote_addr = "IP") {
            # No authentication needed for this IP
            break;
        }

        # Reject requests whose Authorization header is missing or does not match
        if ($http_authorization != "Basic API_KEY") {
            return 401;
        }

        # Proxy the request if authenticated
        proxy_pass http://localhost:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Authorization $http_authorization;
        proxy_connect_timeout 240;
        proxy_send_timeout 240;
        proxy_read_timeout 240;
        send_timeout 240;
    }

    # Optional: Add a custom 401 response
    error_page 401 = @unauthorized;
    location @unauthorized {
        add_header Content-Type "application/json; charset=utf-8";
        return 401 '{"error":"Unauthorized"}';
    }

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = example.com) {
        return 301 https://$host$request_uri;
    } # managed by Certbot

    listen 80;
    server_name example.com;
    return 404; # managed by Certbot
}
Important Security Notes:
- Basic API_KEY: API_KEY is a placeholder. You MUST replace it with a strong, randomly generated Basic Auth token: create a strong username and password and Base64-encode them as username:password (see the example below this list). Never use a weak or example token in production.
- "IP": Requests from this address bypass authentication entirely. Replace it with your trusted IP address, or remove that if block if you don't need IP whitelisting.
- example.com: Replace this with your actual domain name.
- SSL certificate paths: The /etc/letsencrypt/live/example.com/... paths are placeholders for the Let's Encrypt SSL certificates. These will be configured in the next step with Let's Encrypt.
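One way to produce such a token, sketched with standard tools (myuser is a hypothetical username; any Base64 encoder works):
# Generate a random password and Base64-encode the username:password pair
PASSWORD=$(openssl rand -base64 24)
echo -n "myuser:${PASSWORD}" | base64   # paste the output in place of API_KEY
# Keep the plain username and password somewhere safe for your API clients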
3.3. Enable Nginx Site and Test Configuration
Create a symbolic link to enable the site and test the Nginx configuration.
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled
sudo nginx -t
If nginx -t shows “syntax is ok” and “test is successful”, reload Nginx to apply the configuration.
sudo systemctl reload nginx
Step 4: Secure Nginx with Let’s Encrypt SSL
Secure your Nginx server with Let’s Encrypt to enable HTTPS.
4.1. Install Certbot
Install Certbot and the Nginx plugin for Certbot.
sudo apt install certbot python3-certbot-nginx
4.2. Configure Firewall (UFW if enabled)
If you have the UFW firewall enabled, allow the “Nginx Full” profile and remove the now-redundant “Nginx HTTP” rule.
sudo ufw allow 'Nginx Full'
sudo ufw delete allow 'Nginx HTTP'
4.3. Obtain SSL Certificate with Certbot
Run Certbot to obtain and automatically install SSL certificates for your domain. Replace example.com with your domain.
sudo certbot --nginx -d example.com
Follow the prompts from Certbot. It will automatically configure your Nginx server block to use SSL certificates and set up auto-renewal.
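Optionally, you can list the certificates Certbot now manages to confirm the issuance and expiry date:
# Show managed certificates, their domains, and expiry dates
sudo certbot certificates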
4.4. Verify Auto-Renewal
Verify that Let’s Encrypt certificate auto-renewal is set up correctly.
sudo systemctl status certbot.timer
sudo certbot renew --dry-run
Step 5: Testing Your Setup
- Access vLLM via Browser or API Client: Access your vLLM server through your domain name (https://example.com in the example) or server IP address (if you are whitelisted in the Nginx config). If Basic Auth is enabled and you are not whitelisted, you will be prompted for credentials.
- Test the vLLM API: Use an API client (like curl, Postman, or Python requests) to send requests to your vLLM server endpoint (e.g., /v1/chat/completions if using the OpenAI-compatible API). Refer to the vLLM documentation for API details.
Example curl command (replace with your actual API endpoint and data):
curl https://example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YOUR_BASE64_ENCODED_CREDENTIALS" \
  -d '{
    "model": "microsoft/Phi-4",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}]
  }'
Remember to replace YOUR_BASE64_ENCODED_CREDENTIALS with your actual Basic Auth credentials if you are using authentication. If you have IP whitelisting enabled and are accessing from the whitelisted IP, you might not need the Authorization header.
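If a request through Nginx fails, it can help to isolate the problem by sending the same request directly to vLLM on the server itself, bypassing the proxy (this assumes the port from Step 2; no Authorization header is needed locally):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4", "messages": [{"role": "user", "content": "Hello, vLLM!"}]}'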
Conclusion
You have now set up an Ubuntu GPU machine for LLM inference with vLLM, secured with Nginx and Let’s Encrypt, and can deploy and test your LLM applications on this infrastructure. Remember to monitor the server’s performance and security regularly and adjust the configuration as needed.