How to Set Up An Ubuntu GPU Machine for LLM Inference with vLLM
This guide walks you through setting up an Ubuntu GPU machine for running Large Language Model (LLM) inference with vLLM. It covers installing the NVIDIA drivers and CUDA Toolkit, setting up vLLM, and securing access with Nginx and Let's Encrypt.
The steps below take you from a fresh server to a working, HTTPS-protected LLM inference endpoint.
Prerequisites
Before you begin, ensure you have the following:
- Ubuntu 22.04 Server installed.
- NVIDIA CUDA-compatible GPU.
- Registered Domain Name (if you intend to access your server via a domain).
- DNS A records configured to point your domain to your server’s public IP address (if using a domain).
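If you are using a domain, you can confirm the A record has propagated before starting (example.com stands in for your domain; dig ships in the dnsutils package):
# Should print your server's public IP address
dig +short A example.com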
Step 1: Install NVIDIA Drivers and CUDA Toolkit
Follow these steps to install the necessary NVIDIA drivers and CUDA Toolkit on your Ubuntu 22.04 machine.
1.1. Upgrade Your Ubuntu System
Start by updating and upgrading your Ubuntu packages to ensure you have the latest software versions.
sudo apt update
sudo apt upgrade
1.2. List Recommended NVIDIA Drivers
Identify the recommended NVIDIA driver for your system using the ubuntu-drivers utility.
sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers devices
This command will list your NVIDIA GPU and recommend a driver version. In the example output below, nvidia-driver-535 was recommended.
model : GP108M [GeForce MX150] (Mi Notebook Pro [GeForce MX150])
driver : nvidia-driver-535 - distro non-free recommended
1.3. Install the Recommended NVIDIA Driver
Install the recommended NVIDIA driver. Replace nvidia-driver-535 with the driver version recommended for your system if it’s different.
sudo apt install nvidia-driver-535
1.4. Reboot Your System
Reboot your system to apply the driver changes.
sudo reboot now
1.5. Verify Driver Installation
After rebooting, verify that the NVIDIA driver is installed correctly using nvidia-smi.
nvidia-smi
You should see output similar to the following, confirming the driver version and CUDA compatibility. Note that the CUDA Toolkit is not fully installed yet at this stage, but the driver component is ready.
NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2
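The same tool is useful later for watching GPU memory and utilization while the inference server is running; a simple refresh loop is enough:
# Refresh the GPU status every second (Ctrl+C to stop)
watch -n 1 nvidia-smi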
1.6. Install GCC Compiler
Ensure GCC compiler is installed, as it’s required for the CUDA Toolkit installation.
sudo apt install gcc
Verify GCC installation by checking its version.
gcc -v
1.7. Install CUDA Toolkit
Download and install the CUDA Toolkit from NVIDIA’s website. Follow these steps provided by NVIDIA for Ubuntu 22.04.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
If you encounter dependency issues, try fixing them with:
sudo apt --fix-broken install
1.8. Reboot Your System Again
Reboot your system after installing the CUDA Toolkit to load necessary modules.
sudo reboot now
1.9. Set Up Environment Variables
Configure environment variables to use CUDA effectively. Add the following lines to your ~/.bashrc file.
nano ~/.bashrc
Add these lines at the end of the file:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Note: The cuda-12.2 in LD_LIBRARY_PATH may need to be adjusted to match the CUDA version actually installed. If you installed the latest cuda package, /usr/local/cuda is a symbolic link to the latest versioned directory, so /usr/local/cuda/lib64 would also work; explicitly using cuda-12.2 as in the example pins the path to a specific version.
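If you are unsure which versioned directory to point LD_LIBRARY_PATH at, a quick look at the default install location shows what is present:
# List the installed CUDA toolkit directories
ls -d /usr/local/cuda*
# Show which versioned directory the /usr/local/cuda symlink resolves to
readlink -f /usr/local/cuda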
Save the file (Ctrl+X, then Y, then Enter) and reload the ~/.bashrc to apply the changes.
. ~/.bashrc
1.10. Test CUDA Toolkit Installation
Verify the CUDA Toolkit installation by checking the nvcc compiler version.
nvcc -V
This should output the nvcc version, confirming successful CUDA Toolkit installation.
Step 2: Install and Configure vLLM
Now, let’s set up vLLM for LLM inference.
2.1. Install Python and vLLM
It’s recommended to use a virtual environment for Python projects. You can use uv or conda. Here are instructions for both.
Using uv (recommended for speed):
First, install uv if you haven’t already (refer to uv documentation). Then, create a virtual environment and install vLLM.
uv venv myenv --python 3.12 --seed
source myenv/bin/activate
uv pip install vllm
Using conda:
If you prefer conda, install it first if you haven’t. Then, create a conda environment and install vLLM.
conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm
Note: Python versions 3.9 - 3.12 are supported. Python 3.12 is used in the examples above.
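With the environment activated, a quick check confirms the interpreter version and that vLLM landed in this environment (pip show prints the package name and version in its first two lines):
# Confirm the Python version and the installed vLLM package
python --version
pip show vllm | head -n 2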
2.2. Start vLLM Server
Start the vLLM server with its OpenAI-compatible API. Adjust the model, port, and --tensor-parallel-size as needed for your setup. This example uses the microsoft/Phi-4 model, port 11434, and a tensor parallelism size of 2 (adjust based on your GPU setup). The command runs in the background and logs to vllm.log.
nohup vllm serve microsoft/Phi-4 --port 11434 --tensor-parallel-size 2 --dtype=half > vllm.log 2>&1 &
- microsoft/Phi-4: Replace this with the desired model from the vLLM supported models list.
- --port 11434: The port on which the vLLM server will listen.
- --tensor-parallel-size 2: Adjust this based on the number of GPUs you want to use for inference. For a single GPU, use 1. For multiple GPUs, adjust accordingly.
- --dtype=half: Uses half-precision floating-point format (float16) for faster inference and reduced memory usage.
You can check the logs in vllm.log for any errors or to monitor the server startup.
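As a quick sanity check, assuming the port used above and vLLM's OpenAI-compatible routes, you can follow the startup log and then ask the server which models it is serving:
# Follow the startup log until the server reports it is ready (Ctrl+C to stop)
tail -f vllm.log
# List the models the running server exposes
curl http://localhost:11434/v1/models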
Step 3: Install and Configure Nginx as Reverse Proxy
Set up Nginx as a reverse proxy to access the vLLM server securely and potentially add basic authentication.
3.1. Install Nginx
Install Nginx on your Ubuntu server.
sudo apt update
sudo apt install nginx
3.2. Configure Nginx Server Block
Create or modify your Nginx server block configuration file. You can create a new file in /etc/nginx/sites-available/ (e.g., example.com) and then create a symbolic link to /etc/nginx/sites-enabled/. For this guide, let’s assume you are creating a new file:
sudo nano /etc/nginx/sites-available/example.com
Paste the following configuration into the file. Remember to replace example.com with your actual domain name and adjust the proxy_pass URL if you are using a different port for vLLM.
server {
    server_name example.com;

    location / {
        # Skip the authentication check for the specified IP address (replace "IP" with your IP if needed)
        if ($remote_addr = "IP") {
            # No authentication needed for this IP
            break;
        }

        # Reject requests whose Authorization header is missing or does not match
        if ($http_authorization != "Basic API_KEY") {
            return 401;
        }

        # Proxy the request if authenticated
        proxy_pass http://localhost:11434/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Authorization $http_authorization;
        proxy_connect_timeout 240;
        proxy_send_timeout 240;
        proxy_read_timeout 240;
        send_timeout 240;
    }

    # Optional: Add a custom 401 response
    error_page 401 = @unauthorized;
    location @unauthorized {
        add_header Content-Type "application/json; charset=utf-8";
        return 401 '{"error":"Unauthorized"}';
    }

    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/example.com/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = example.com) {
        return 301 https://$host$request_uri;
    } # managed by Certbot

    listen 80;
    server_name example.com;
    return 404; # managed by Certbot
}
Important Security Notes:
- Basic API_KEY: API_KEY is a placeholder. You MUST replace it with a strong, randomly generated Basic Auth token: create a strong username and password and Base64-encode them as username:password (see the example below this list). Never use a weak or example token in production.
- "IP": Requests from this address bypass authentication entirely. Replace it with your trusted IP address, or remove that if block if you don't need IP whitelisting.
- example.com: Replace this with your actual domain name.
- SSL certificate paths: The /etc/letsencrypt/live/example.com/... paths are placeholders for the Let's Encrypt SSL certificates. These will be configured in the next step with Let's Encrypt.
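One way to produce such a token, sketched with standard tools (myuser is a hypothetical username; any Base64 encoder works):
# Generate a random password and Base64-encode the username:password pair
PASSWORD=$(openssl rand -base64 24)
echo -n "myuser:${PASSWORD}" | base64   # paste the output in place of API_KEY
# Keep the plain username and password somewhere safe for your API clients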
3.3. Enable Nginx Site and Test Configuration
Create a symbolic link to enable the site and test the Nginx configuration.
sudo ln -s /etc/nginx/sites-available/example.com /etc/nginx/sites-enabled
sudo nginx -t
If nginx -t shows “syntax is ok” and “test is successful”, reload Nginx to apply the configuration.
sudo systemctl reload nginx
Step 4: Secure Nginx with Let’s Encrypt SSL
Secure your Nginx server with Let’s Encrypt to enable HTTPS.
4.1. Install Certbot
Install Certbot and the Nginx plugin for Certbot.
sudo apt install certbot python3-certbot-nginx
4.2. Configure Firewall (UFW if enabled)
If you have the UFW firewall enabled, allow the “Nginx Full” profile and remove the now-redundant “Nginx HTTP” rule.
sudo ufw allow 'Nginx Full'
sudo ufw delete allow 'Nginx HTTP'
4.3. Obtain SSL Certificate with Certbot
Run Certbot to obtain and automatically install SSL certificates for your domain. Replace example.com with your domain.
sudo certbot --nginx -d example.com
Follow the prompts from Certbot. It will automatically configure your Nginx server block to use SSL certificates and set up auto-renewal.
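Optionally, you can list the certificates Certbot now manages to confirm the issuance and expiry date:
# Show managed certificates, their domains, and expiry dates
sudo certbot certificates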
4.4. Verify Auto-Renewal
Verify that Let’s Encrypt certificate auto-renewal is set up correctly.
sudo systemctl status certbot.timer
sudo certbot renew --dry-run
Step 5: Testing Your Setup
- Access vLLM via Browser or API Client: Access your vLLM server through your domain name (https://example.com in the example) or server IP address (if you are whitelisted in the Nginx config). If Basic Auth is enabled and you are not whitelisted, you will be prompted for credentials.
- Test the vLLM API: Use an API client (like curl, Postman, or Python requests) to send requests to your vLLM server endpoint (e.g., /v1/chat/completions if using the OpenAI-compatible API). Refer to the vLLM documentation for API details.
Example curl command (replace with your actual API endpoint and data):
curl https://example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Basic YOUR_BASE64_ENCODED_CREDENTIALS" \
  -d '{
    "model": "microsoft/Phi-4",
    "messages": [{"role": "user", "content": "Hello, vLLM!"}]
  }'
Remember to replace YOUR_BASE64_ENCODED_CREDENTIALS with your actual Basic Auth credentials if you are using authentication. If you have IP whitelisting enabled and are accessing from the whitelisted IP, you might not need the Authorization header.
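If a request through Nginx fails, it can help to isolate the problem by sending the same request directly to vLLM on the server itself, bypassing the proxy (this assumes the port from Step 2; no Authorization header is needed locally):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4", "messages": [{"role": "user", "content": "Hello, vLLM!"}]}'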
Conclusion
You have now set up an Ubuntu GPU machine for LLM inference with vLLM, secured with Nginx and Let’s Encrypt, and can deploy and test your LLM applications on this infrastructure. Remember to monitor the server’s performance and security regularly and adjust the configuration as needed.