Deploying large language models (LLMs) has traditionally required significant effort and specialized knowledge. Tools like the LLM CLI now streamline the process, making it accessible even to those new to running AI models. Deepseek, a language model known for strong performance in understanding and generating text, can be deployed with the LLM CLI on an Ubuntu GPU server in just a few commands.
In this tutorial, we will explain how to use LLM CLI to deploy the Deepseek model on your Ubuntu GPU server.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user with sudo privileges.
- NVIDIA drivers installed.
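Before starting, you can confirm that the GPU and driver are visible. The nvidia-smi utility ships with the NVIDIA driver and should list your GPU and driver version.
nvidia-smi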
Step 1: Set Up a Python Environment
Ubuntu 24.04’s default repositories ship Python 3.12, which some of the ML libraries used in this guide do not yet fully support. We’ll use the deadsnakes PPA to install Python 3.11 alongside it.
1. Add the deadsnakes PPA for Python 3.11.
sudo add-apt-repository ppa:deadsnakes/ppa -y
2. Update package lists.
sudo apt update
3. Install Python 3.11 along with essential tools.
sudo apt install python3.11 python3.11-venv python3.11-distutils python3.11-dev -y
4. Create a virtual environment named ‘deepseek-env’.
python3.11 -m venv deepseek-env
5. Activate the environment.
source deepseek-env/bin/activate
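As a quick check, confirm that the virtual environment is using the Python 3.11 interpreter you just installed; the command below should report Python 3.11.x.
python --version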
Step 2: Install PyTorch with CUDA 12.1 for GPU Acceleration
PyTorch with CUDA support is crucial for GPU-accelerated deep learning. We’ll install the latest stable version compatible with NVIDIA CUDA 12.1.
1. Install PyTorch with CUDA 12.1 support.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
2. Install Hugging Face libraries for model loading and quantization.
pip install transformers accelerate bitsandbytes auto-gptq optimum
3. Install the LLM CLI and Llama.cpp integration.
pip install llm llm-llama-cpp llama-cpp-python
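Before moving on, it is worth confirming that PyTorch can see the GPU. The one-liner below is a simple sanity check, not part of the original setup; it should print True along with the CUDA version. Note also that the prebuilt llama-cpp-python wheel from PyPI may be CPU-only; if GPU offloading does not work in Step 4, one option is to reinstall it with CUDA enabled. Depending on your llama-cpp-python version, the build flag is -DGGML_CUDA=on (newer releases) or -DLLAMA_CUBLAS=on (older releases).
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python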
Step 3: Download the Deepseek GGUF Model
GGUF is the model file format used by llama.cpp for efficient CPU and GPU inference. We’ll download the Q5_K_M quantized version, which offers a good balance between quality and speed.
1. Create a directory to store models.
mkdir -p ~/models
2. Navigate into the models directory.
cd ~/models
3. Download the Deepseek-7B GGUF model.
wget https://huggingface.co/TheBloke/deepseek-llm-7B-chat-GGUF/resolve/main/deepseek-llm-7b-chat.Q5_K_M.gguf
4. Register the model with an alias ‘deepseek-chat’.
llm llama-cpp add-model ~/models/deepseek-llm-7b-chat.Q5_K_M.gguf --alias deepseek-chat
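You can confirm the registration by listing the models the llm tool knows about; the deepseek-chat alias should appear in the output.
llm models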
Step 4: Run Inference with GPU Offloading
To maximize GPU utilization, we set LLAMA_CPP_GPU_LAYERS=35, which offloads up to 35 of the model’s layers to the GPU; lower this value if you run out of VRAM.
Run the llm command with your query, using the deepseek-chat alias registered earlier.
LLAMA_CPP_GPU_LAYERS=35 llm -m deepseek-chat "What is Atlantic.Net Cloud GPU?"
Output.
llama_init_from_model: n_ctx_per_seq (4000) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
Atlantic.Net Cloud GPU is a service that provides high-performance computing resources using Graphics Processing Units (GPUs) for running compute-intensive applications and workloads. GPU-based cloud computing offers significant performance gains in comparison to CPU-based systems, particularly for tasks that require a lot of graphical processing and data-intensive applications like machine learning, deep learning, data analysis, and scientific simulations.
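The llm CLI also offers an interactive chat mode, which keeps the model loaded between prompts. A quick example, assuming the same GPU-offloading environment variable applies:
LLAMA_CPP_GPU_LAYERS=35 llm chat -m deepseek-chat
Type exit or quit to end the session.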
Conclusion
You’ve successfully deployed the Deepseek-7B-Chat model on an Ubuntu GPU server using the llm CLI. This setup allows for efficient, GPU-accelerated inference, making it ideal for AI-powered applications.