Deploying large language models (LLMs) has traditionally required significant effort and specialized knowledge. Tools like the LLM CLI now streamline the process, making it accessible even to those new to running AI models. Deepseek, a language model known for strong performance in understanding and generating text, can be deployed with the LLM CLI on an Ubuntu GPU server in just a few commands.
In this tutorial, we will explain how to use LLM CLI to deploy the Deepseek model on your Ubuntu GPU server.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user with sudo privileges.
- NVIDIA drivers installed.
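Before starting, you can confirm that the GPU and driver are visible. The nvidia-smi utility ships with the NVIDIA driver and should list your GPU and driver version.
nvidia-smi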
Step 1: Set Up a Python Environment
Ubuntu 24.04’s default repositories ship Python 3.12, which some of the ML libraries used in this guide do not yet fully support. We’ll use the deadsnakes PPA to install Python 3.11 alongside it.
1. Add the deadsnakes PPA for Python 3.11.
sudo add-apt-repository ppa:deadsnakes/ppa -y
2. Update package lists.
sudo apt update
3. Install Python 3.11 along with essential tools.
sudo apt install python3.11 python3.11-venv python3.11-distutils python3.11-dev -y
4. Create a virtual environment named ‘deepseek-env’.
python3.11 -m venv deepseek-env
5. Activate the environment.
source deepseek-env/bin/activate
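As a quick check, confirm that the virtual environment is using the Python 3.11 interpreter you just installed; the command below should report Python 3.11.x.
python --version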
Step 2: Install PyTorch with CUDA 12.1 for GPU Acceleration
PyTorch with CUDA support is crucial for GPU-accelerated deep learning. We’ll install the latest stable version compatible with NVIDIA CUDA 12.1.
1. Install PyTorch with CUDA 12.1 support.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
2. Install Hugging Face libraries for model loading and quantization.
pip install transformers accelerate bitsandbytes auto-gptq optimum
3. Install the LLM CLI and Llama.cpp integration.
pip install llm llm-llama-cpp llama-cpp-python
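Before moving on, it is worth confirming that PyTorch can see the GPU. The one-liner below is a simple sanity check, not part of the original setup; it should print True along with the CUDA version. Note also that the prebuilt llama-cpp-python wheel from PyPI may be CPU-only; if GPU offloading does not work in Step 4, one option is to reinstall it with CUDA enabled. Depending on your llama-cpp-python version, the build flag is -DGGML_CUDA=on (newer releases) or -DLLAMA_CUBLAS=on (older releases).
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python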
Step 3: Download the Deepseek GGUF Model
GGUF is the model file format used by llama.cpp for efficient CPU and GPU inference. We’ll download the Q5_K_M quantized version, which offers a good balance between quality and speed.
1. Create a directory to store models.
mkdir -p ~/models
2. Navigate into the models directory.
cd ~/models
3. Download the Deepseek-7B GGUF model.
wget https://huggingface.co/TheBloke/deepseek-llm-7B-chat-GGUF/resolve/main/deepseek-llm-7b-chat.Q5_K_M.gguf
4. Register the model with an alias ‘deepseek-chat’.
llm llama-cpp add-model ~/models/deepseek-llm-7b-chat.Q5_K_M.gguf --alias deepseek-chat
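You can confirm the registration by listing the models the llm tool knows about; the deepseek-chat alias should appear in the output.
llm models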
Step 4: Run Inference with GPU Offloading
To maximize GPU utilization, we set LLAMA_CPP_GPU_LAYERS=35, which offloads up to 35 of the model’s layers to the GPU; lower this value if you run out of VRAM.
Run the llm command with your query, using the deepseek-chat alias registered earlier.
LLAMA_CPP_GPU_LAYERS=35 llm -m deepseek-chat "What is Atlantic.Net Cloud GPU?"
Output.
llama_init_from_model: n_ctx_per_seq (4000) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
Atlantic.Net Cloud GPU is a service that provides high-performance computing resources using Graphics Processing Units (GPUs) for running compute-intensive applications and workloads. GPU-based cloud computing offers significant performance gains in comparison to CPU-based systems, particularly for tasks that require a lot of graphical processing and data-intensive applications like machine learning, deep learning, data analysis, and scientific simulations.
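The llm CLI also offers an interactive chat mode, which keeps the model loaded between prompts. A quick example, assuming the same GPU-offloading environment variable applies:
LLAMA_CPP_GPU_LAYERS=35 llm chat -m deepseek-chat
Type exit or quit to end the session.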
Conclusion
You’ve successfully deployed the Deepseek-7B-Chat model on an Ubuntu GPU server using the llm CLI. This setup allows for efficient, GPU-accelerated inference, making it ideal for AI-powered applications.