Generative AI has revolutionized how we create content, from text and images to music and code. Models like OpenAI’s GPT, Stable Diffusion, and DALL-E have demonstrated the immense potential of generative AI. However, deploying these models efficiently requires robust hardware and software configurations, especially when leveraging GPUs for accelerated performance.

In this guide, we’ll set up a GPT-2 text generation model on an Ubuntu server with an NVIDIA GPU, using FastAPI for the backend and a simple web interface for interaction.

Prerequisites

  • An Ubuntu 24.04 server with an NVIDIA GPU.
  • A non-root user with sudo privileges.
  • NVIDIA drivers installed.
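
You can confirm the driver is installed and the GPU is visible before moving on:

nvidia-smi

If this prints a table showing your GPU and driver version, you are ready to proceed.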

Step 1: Install System Dependencies

Before setting up the AI model, we must ensure the system has the required dependencies.

1. Update the package index.

sudo apt update

2. Install Python and Pip.

sudo apt install -y python3 python3-pip python3-venv

3. Create and activate the virtual environment.

python3 -m venv generative-ai-env
source generative-ai-env/bin/activate
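
To confirm the environment is active, check which interpreter is in use; the path should point inside generative-ai-env:

which python3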

Step 2: Install AI & Web Framework Dependencies

1. Create the required directories for your project.

mkdir templates static
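
For reference, the files created over the remaining steps will give you this layout (the top-level directory name is up to you; static/ stays empty in this guide but is the conventional place for CSS or JavaScript assets later):

.
├── app.py
├── templates/
│   └── index.html
└── static/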

2. Install PyTorch with GPU support.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
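
Once the install finishes, verify that PyTorch can actually see the GPU:

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"

This should print True followed by your GPU's name; if it prints False, revisit the driver installation from the prerequisites.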

3. Install Transformers and TensorFlow. Only transformers is strictly required for this guide; TensorFlow, Diffusers, and the OpenAI client are optional extras for further experimentation.

pip install tensorflow transformers diffusers openai

4. Install FastAPI and Uvicorn.

pip install fastapi uvicorn jinja2
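
Optionally, freeze the environment so the same setup can be reproduced on another server:

pip freeze > requirements.txt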

Step 3: Create the HTML Frontend

We’ll build a simple interface for text generation.

Create index.html.

nano templates/index.html

Add the following code:

<!DOCTYPE html>
<html>
<head>
    <title>GPT-2 Demo</title>
    <style>
        body { font-family: Arial; max-width: 800px; margin: 0 auto; padding: 20px; }
        textarea { width: 100%; height: 100px; }
        button { padding: 10px 20px; margin-top: 10px; }
        #output { white-space: pre-wrap; background: #f4f4f4; padding: 15px; }
    </style>
</head>
<body>
    <h1>GPT-2 Text Generator</h1>
    <form id="promptForm">
        <textarea id="prompt" placeholder="Type your prompt here..."></textarea>
        <button type="submit">Generate Text</button>
    </form>
    <h2>Output:</h2>
    <div id="output"></div>

    <script>
        document.getElementById("promptForm").addEventListener("submit", async (e) => {
            e.preventDefault();
            const prompt = document.getElementById("prompt").value;
            const response = await fetch("/generate", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({ prompt }),
            });
            const data = await response.json();
            document.getElementById("output").textContent = data.output ?? data.error ?? "Unexpected response";
        });
    </script>
</body>
</html>

Explanation:

  • A textarea for user input.
  • A submit button to trigger generation.
  • JavaScript Fetch API to communicate with the backend.
  • Simple CSS styling for better UX.

Step 4: Create the FastAPI Backend

The backend loads the GPT-2 model and handles requests.

Create the app.py file.

nano app.py

Add the following code:

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.templating import Jinja2Templates
from transformers import pipeline, set_seed
from pydantic import BaseModel
import torch
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI()
templates = Jinja2Templates(directory="templates")

class PromptRequest(BaseModel):
    prompt: str
    max_length: int = 150  # Total tokens (prompt + generated); modest default to prevent rambling

# Model loading with better error handling
def load_model():
    try:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"Loading model on device: {device}")
        
        # Using gpt2 instead of distilgpt2 for better quality
        model_name = "gpt2"
        
        generator = pipeline(
            "text-generation",
            model=model_name,
            device=device,
            framework="pt",
            torch_dtype=torch.float16 if device == "cuda" else torch.float32
        )
        
        logger.info(f"Successfully loaded {model_name}")
        return generator
    except Exception as e:
        logger.error(f"Model loading failed: {str(e)}")
        return None

generator = load_model()

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    return templates.TemplateResponse("index.html", {
        "request": request,
        "model_loaded": generator is not None
    })

@app.post("/generate")
async def generate_text(request_data: PromptRequest):
    if not generator:
        return JSONResponse(
            status_code=503,
            content={"error": "Model not loaded. Please try again later."}
        )
    
    try:
        set_seed(42)  # For reproducible results
        
        # Enhanced generation parameters
        output = generator(
            request_data.prompt,
            max_length=request_data.max_length,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,          # Controls randomness
            top_k=50,                # Limits to top 50 probable tokens
            top_p=0.9,               # Nucleus sampling threshold
            repetition_penalty=1.5,  # Strong penalty for repetition
            no_repeat_ngram_size=3,  # Prevents repeated 3-token phrases
            pad_token_id=generator.tokenizer.eos_token_id  # GPT-2 has no pad token; reusing EOS avoids a warning
        )
        
        # Clean up the output
        generated_text = output[0]["generated_text"]
        
        # Remove the input prompt if it appears in output
        if generated_text.startswith(request_data.prompt):
            generated_text = generated_text[len(request_data.prompt):].strip()
        
        # Truncate at last complete sentence
        last_period = generated_text.rfind('.')
        if last_period > 0:
            generated_text = generated_text[:last_period + 1]
            
        return {"output": generated_text}
        
    except Exception as e:
        logger.error(f"Generation error: {str(e)}")
        return JSONResponse(
            status_code=500,
            content={"error": f"Generation failed: {str(e)}"}
        )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(
        "app:app",  # reload=True requires an import string, not an app object
        host="0.0.0.0",
        port=8000,
        reload=True,
        log_level="info"
    )

Key Components:

  • FastAPI: Handles HTTP requests.
  • transformers: Loads the GPT-2 model.
  • device="cuda": Moves the model to the GPU when one is available (the code falls back to CPU otherwise).
  • POST /generate: Processes prompts and returns AI-generated text.
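
One detail worth noting: read_root passes a model_loaded flag to the template, but the index.html from Step 3 never uses it. If you want the page to surface a load failure, a minimal Jinja2 sketch like the following could be added inside <body>:

{% if not model_loaded %}
<p style="color: red;">Model failed to load. Text generation is unavailable.</p>
{% endif %}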

Step 5: Run the FastAPI Server

Start the development server (the --reload flag restarts the app on code changes and is meant for development, not production).

uvicorn app:app --reload --host 0.0.0.0 --port 8000
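
If a firewall such as ufw is active on the server, open the port so the interface is reachable from your browser:

sudo ufw allow 8000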

Step 6: Access and Test the API

1. Open a browser and go to http://your-server-ip:8000.
2. Enter a prompt (e.g., "What are neural networks?").
3. Click Generate Text and see the AI response!
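
You can also exercise the endpoint directly from the command line, which is handy for scripting or debugging:

curl -X POST http://your-server-ip:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time", "max_length": 100}'

The response is a JSON object with an output field containing the generated text (or an error field if something went wrong).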

Conclusion

You’ve deployed a Generative AI (GPT-2) model on an Ubuntu GPU server with a FastAPI backend and interactive web UI. This setup can be extended to models like LLaMA, Stable Diffusion, or Whisper. Now, go ahead and build amazing AI-powered applications!