Question Answering (QA) systems transform how users interact with data, allowing them to query information in natural language and get concise, direct answers. Building a QA system has become more straightforward with the wide availability of advanced NLP models and the growing popularity of frameworks like Hugging Face Transformers and OpenAI GPT.

This guide will show you how to set up and build a QA system on an Ubuntu GPU server.

Prerequisites

Before starting, ensure you have the following:

  • An Ubuntu 24.04 Cloud GPU Server.
  • CUDA Toolkit and cuDNN installed.
  • Root or sudo privileges.

Step 1: Install Required Packages

Before starting, update all system packages to their latest versions.

apt update
apt upgrade

Next, install Python with all required libraries.

apt install python3-full python3-virtualenv -y

Step 2: Create a Python Virtual Environment

A virtual environment isolates your project’s dependencies, ensuring that installed packages don’t interfere with other projects or system-wide packages.

Let’s create a new virtual environment for your project.

python3 -m venv venv

Activate your virtual environment.

source venv/bin/activate

Step 3: Install PyTorch with CUDA Support

PyTorch is the deep learning framework we'll use to run and fine-tune the QA models. Installing it with CUDA support lets it run on the GPU, which significantly accelerates both training and inference.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
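
The cu118 index installs wheels built against CUDA 11.8. If your server runs a different CUDA release, choose the matching index URL instead (for example, cu121 for CUDA 12.1). You can check the CUDA version your driver supports with:

nvidia-smi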

Next, install additional libraries for building and deploying the QA system.

pip install transformers datasets fastapi uvicorn

  • transformers: Provides access to pre-trained models and tools from Hugging Face.
  • datasets: Provides convenient access to datasets such as SQuAD for training and evaluation.
  • fastapi: A modern web framework for building APIs with Python.
  • uvicorn: A lightning-fast ASGI server for serving FastAPI applications.

Open a Python shell and verify that PyTorch is installed correctly.

python3
>>> import torch
>>> print(torch.__version__)

Output:

2.6.0+cu118

Confirm that CUDA is available for GPU acceleration.

>>> print(torch.cuda.is_available())

Output:

True
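
Optionally, you can also print the name of the GPU that PyTorch detected; the exact output depends on your hardware.

>>> print(torch.cuda.get_device_name(0))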

Press CTRL+D to exit the Python shell.

Step 4: Create a Basic QA Pipeline

In this step, we’ll set up a basic question-answering pipeline with a pre-trained model to see how the workflow operates before any customization.

Create a Python script that utilizes a pre-trained model to answer questions based on a provided context.

nano qa_pipeline.py

Add the following code:

from transformers import pipeline

def main():
    # Initialize QA pipeline with a DistilBERT model fine-tuned on SQuAD
    qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

    context = (
        "Ubuntu is a Linux distribution based on Debian and composed mostly of free and open-source software. "
        "Ubuntu is officially released in three editions: Desktop, Server, and Core for IoT devices and robots."
    )
    question = "What is Ubuntu based on?"

    result = qa_pipeline({"context": context, "question": question})
    print("Answer:", result["answer"])
    print("Score:", result["score"])

if __name__ == "__main__":
    main()

This script uses the Hugging Face pipeline to load a pre-trained QA model and processes a sample context and question.

Now, run the above script.

python3 qa_pipeline.py

This will display the model’s answer and its confidence score.

Answer: Debian
Score: 0.9923

Explanation:

  • Answer: The model identifies “Debian” as the answer to the question.
  • Score: A confidence score (ranging from 0 to 1) indicating the model’s certainty. A score of 0.9923 signifies high confidence.
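
Because the pipeline object is reusable, you can ask several questions against the same context without reloading the model. Here is a minimal sketch you could drop into main() of qa_pipeline.py (the second question is purely illustrative):

questions = [
    "What is Ubuntu based on?",
    "What editions is Ubuntu officially released in?",
]
for q in questions:
    result = qa_pipeline({"context": context, "question": q})
    print(q, "->", result["answer"])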

Step 5: Fine-Tune the Model

Fine-tuning tailors the pre-trained model to better suit our specific dataset, enhancing its performance on domain-specific questions.

Create a script to fine-tune the model using the SQuAD dataset, a widely used benchmark for QA tasks.

nano fine_tuning.py

Add the following code.

# fine_tuning.py

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer

def prepare_train_features(examples):
    # 1. Tokenize question + context
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation=True,
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True
    )

    # We'll create start_positions and end_positions for each example
    start_positions = []
    end_positions = []

    # Loop over each example in the batch
    for i, offsets in enumerate(tokenized_examples["offset_mapping"]):
        # With batched=True, examples["answers"] is a list of dicts, one per example
        answer = examples["answers"][i]
        # We take the first answer (SQuAD usually has one per example)
        answer_start_char = answer["answer_start"][0]
        answer_text = answer["text"][0]
        answer_end_char = answer_start_char + len(answer_text)

        # Initialize
        start_token_idx = None
        end_token_idx = None

        # Find start/end token indices
        for idx, (start, end) in enumerate(offsets):
            if start <= answer_start_char < end:
                start_token_idx = idx
            if start < answer_end_char <= end:
                end_token_idx = idx
                break

        # Fallback in case we don't find a matching token
        if start_token_idx is None:
            start_token_idx = 0
        if end_token_idx is None:
            end_token_idx = len(offsets) - 1

        start_positions.append(start_token_idx)
        end_positions.append(end_token_idx)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions

    # Remove offset mapping to save memory
    tokenized_examples.pop("offset_mapping")
    return tokenized_examples


def main():
    # 1. Load SQuAD
    squad = load_dataset("squad")

    # 2. Set up model/tokenizer
    model_name = "distilbert-base-uncased"
    global tokenizer  # make the tokenizer visible to prepare_train_features
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    # 3. Preprocess train/validation
    squad_encoded = squad.map(prepare_train_features, batched=True)

    # 4. Define training arguments
    training_args = TrainingArguments(
        output_dir="./qa_model",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=2,  # adjust as needed
        eval_strategy="epoch",  # `evaluation_strategy` was renamed to `eval_strategy` in newer transformers releases
        save_steps=1000,
        save_total_limit=1,
    )

    # 5. Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=squad_encoded["train"],
        eval_dataset=squad_encoded["validation"],
    )

    # 6. Train
    trainer.train()

    # 7. Save final model
    trainer.save_model("./qa_model")
    print("Fine-tuning complete. Model saved to ./qa_model")

if __name__ == "__main__":
    main()

This script loads the SQuAD dataset, preprocesses the data, and fine-tunes the model.
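
If training aborts with a CUDA out-of-memory error, lowering the batch size in TrainingArguments is usually the first thing to try; the values below are only an example, not a recommendation:

    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,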

Now, run the script to commence the fine-tuning process.

python3 fine_tuning.py

During training, you’ll observe outputs indicating the model’s progress, such as:

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 87599/87599 [00:24<00:00, 3514.81 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10570/10570 [00:03<00:00, 3419.28 examples/s]
{'loss': 3.2628, 'grad_norm': 23.89510726928711, 'learning_rate': 4.8858447488584476e-05, 'epoch': 0.05}                                                                
{'loss': 2.3261, 'grad_norm': 13.938529968261719, 'learning_rate': 4.7716894977168955e-05, 'epoch': 0.09}                                                               
{'loss': 2.0285, 'grad_norm': 20.18623161315918, 'learning_rate': 4.657534246575342e-05, 'epoch': 0.14}                                                                 
{'loss': 1.8877, 'grad_norm': 17.599531173706055, 'learning_rate': 4.54337899543379e-05, 'epoch': 0.18}                                                                 
{'loss': 1.8392, 'grad_norm': 20.058273315429688, 'learning_rate': 4.4292237442922375e-05, 'epoch': 0.23}                                                               
{'loss': 1.7229, 'grad_norm': 20.78485870361328, 'learning_rate': 4.3150684931506855e-05, 'epoch': 0.27}                                                                
{'loss': 1.6989, 'grad_norm': 29.019317626953125, 'learning_rate': 4.200913242009132e-05, 'epoch': 0.32}                                                                
{'loss': 1.6686, 'grad_norm': 20.27025604248047, 'learning_rate': 4.08675799086758e-05, 'epoch': 0.37}                                                                  
{'loss': 1.6013, 'grad_norm': 17.87761878967285, 'learning_rate': 3.9726027397260274e-05, 'epoch': 0.41}                                                                
{'loss': 1.6066, 'grad_norm': 25.986677169799805, 'learning_rate': 3.8584474885844754e-05, 'epoch': 0.46}                                                               
{'loss': 1.5666, 'grad_norm': 21.410463333129883, 'learning_rate': 3.744292237442922e-05, 'epoch': 0.5}                                                                 
{'loss': 1.6172, 'grad_norm': 22.285802841186523, 'learning_rate': 3.63013698630137e-05, 'epoch': 0.55}                                                                 
{'loss': 1.5496, 'grad_norm': 22.68842315673828, 'learning_rate': 3.5159817351598174e-05, 'epoch': 0.59} 

Fine-tuning customizes a pre-trained model to perform better on a specific task by training it on a relevant dataset. In this case, fine-tuning the DistilBERT model on the SQuAD dataset enhances its ability to answer questions accurately within given contexts.
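
A full SQuAD fine-tuning run can take a while, depending on your GPU. If you want a quicker trial run before committing to the full dataset, you can fine-tune on a subset first. A sketch of the change, placed right after load_dataset("squad") in main() (the subset sizes are arbitrary):

squad["train"] = squad["train"].select(range(5000))
squad["validation"] = squad["validation"].select(range(500))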

Step 6: Update the qa_pipeline.py Script to Use the Fine-Tuned Model

Open the existing qa_pipeline.py script:

nano qa_pipeline.py

Update the script with the following code:

from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer

model_path = "./qa_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

context = "Hugging Face is based in New York and Paris."
question = "Where is Hugging Face based?"
result = qa_pipeline({"context": context, "question": question})
print(result)
# e.g., {'score': 0.95, 'start': 17, 'end': 25, 'answer': 'New York'}

Save and close the file.
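
Run the updated script to confirm that the fine-tuned model loads from ./qa_model and still answers correctly.

python3 qa_pipeline.py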

Step 7: Deploy the Fine-Tuned Model with FastAPI

After fine-tuning our model, the next step is to deploy it so that it can serve real-time predictions. FastAPI is an excellent choice for this purpose due to its high performance and ease of use.

Create a Python script named app.py to set up our FastAPI application.

nano app.py

Add the following code.

# app.py

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load a QA model (pretrained or your fine-tuned one)
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

class QAPayload(BaseModel):
    context: str
    question: str

@app.post("/qa")
def get_answer(payload: QAPayload):
    result = qa_pipeline({"context": payload.context, "question": payload.question})
    return {"answer": result["answer"], "score": result["score"]}

This script initializes the FastAPI app and sets up an endpoint for our QA model.
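
The example above loads the stock distilbert-base-uncased-distilled-squad checkpoint, so it works even without the fine-tuning step. To serve the model fine-tuned in Step 5 instead, point the pipeline at the saved directory (assuming you start the server from the directory that contains ./qa_model):

qa_pipeline = pipeline("question-answering", model="./qa_model", tokenizer="./qa_model")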

Start the FastAPI server using Uvicorn.

uvicorn app:app --host 0.0.0.0 --port 8000

Output:

Device set to use cuda:0
INFO:     Started server process [14449]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

The above command starts the FastAPI application and makes it reachable on port 8000 on all of the server’s network interfaces.

Step 8: Test the FastAPI Application

Testing is crucial to ensure that our API functions as expected. We’ll use the curl utility for manual testing.

Open another terminal and send a POST request to our /qa endpoint using curl.

curl -X POST "http://localhost:8000/qa" -H "Content-Type: application/json" -d '{ "context": "Ubuntu is a Linux distribution based on Debian", "question": "What is Ubuntu based on?" }'

Output:

{
  "answer": "Debian",
  "score": 0.9965217709541321
}

The model identifies “Debian” as the answer to the question. The confidence score (e.g., 0.9965) indicates the model’s certainty regarding its answer. A score close to 1.0 signifies high confidence.
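
If you prefer testing from Python rather than curl, here is a minimal client sketch using the requests library (install it with pip install requests if it isn’t already available):

import requests

payload = {
    "context": "Ubuntu is a Linux distribution based on Debian",
    "question": "What is Ubuntu based on?",
}
response = requests.post("http://localhost:8000/qa", json=payload)
print(response.json())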

Conclusion

In this guide, we’ve walked through building a question-answering (QA) system on an Atlantic.Net Cloud GPU server using PyTorch, Hugging Face Transformers, and FastAPI. We set up the environment and verified GPU support, ran a baseline QA pipeline with a pre-trained DistilBERT model, fine-tuned it on the SQuAD dataset, and finally deployed the model behind a FastAPI endpoint and tested it with curl.