ClickHouse is an open-source, column-oriented database management system that excels at real-time analytical processing (OLAP). Deploying it on a GPU-enabled Ubuntu server lets you query large datasets at speed while keeping the host’s GPU available for other accelerated workloads.

This in-depth guide will walk you through setting up ClickHouse in a GPU-enabled Docker container, generating test data, loading it into ClickHouse, and verifying the installation.

Prerequisites

  • An Ubuntu 24.04 server with an NVIDIA GPU.
  • A non-root user with sudo privileges.
  • NVIDIA drivers installed.
  • Docker installed, with the NVIDIA Container Toolkit configured so containers can be started with --gpus all.

Step 1: Setting Up the Python Environment

First, we’ll create a clean Python environment for our data generation and verification scripts.

1. Install Python and the required packages.

sudo apt update
sudo apt install -y python3 python3-pip python3-venv

2. Create and activate the virtual environment.

python3 -m venv venv
source venv/bin/activate

3. Install the ClickHouse Python driver and supporting packages.

pip install numpy pandas clickhouse-driver

Explanation:

  • numpy for numerical operations
  • pandas for data manipulation
  • clickhouse-driver for Python connectivity to ClickHouse

Step 2: Generating Synthetic Test Data

Create a Python script to generate a large dataset for testing.

nano generate_data.py

Add the following code:

# generate_data.py
import pandas as pd
import numpy as np

# Generate 1 million rows of synthetic data
data = {
    'id': np.arange(1, 1_000_001),
    'feature1': np.random.rand(1_000_000),
    'feature2': np.random.rand(1_000_000),
    'feature3': np.random.rand(1_000_000)
}

df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
print("Generated data.csv with 1,000,000 rows")

Run the script.

python3 generate_data.py

This creates a CSV file with 1 million rows of random floating-point numbers across three features, plus an ID column.

Generated data.csv with 1,000,000 rows
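
If you want to spot-check the generated file before loading it, the short optional snippet below (a minimal sketch; the file name and columns are the ones produced by generate_data.py) previews the first few rows and counts the rest without reading the whole file into memory.

# inspect_data.py -- optional sanity check of the generated CSV
import pandas as pd

# Preview only the first five rows so the check stays fast
preview = pd.read_csv('data.csv', nrows=5)
print(preview)

# Count data rows without loading the whole file into memory
with open('data.csv') as f:
    row_count = sum(1 for _ in f) - 1  # subtract the header line
print(f"data.csv contains {row_count} data rows")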

Step 3: Setting Up ClickHouse with GPU Support

ClickHouse’s query engine runs on the CPU and does not currently offer GPU-accelerated query execution, but running the server in a GPU-enabled container keeps the host’s GPU available to anything else running in that environment. We’ll use Docker to run ClickHouse with GPU access.

docker run -d \
  --name clickhouse-server \
  --gpus all \
  -p 9000:9000 \
  -p 8123:8123 \
  -v $PWD/data:/var/lib/clickhouse \
  -e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
  clickhouse/clickhouse-server:latest

Key parameters:

  • --gpus all: Enables GPU access for the container
  • Ports 9000 (native protocol) and 8123 (HTTP interface) exposed
  • Data directory mounted for persistence
  • Access management enabled for security
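
Before creating any tables, it’s worth confirming that the container is up and listening on the exposed HTTP port. ClickHouse’s HTTP interface answers GET /ping with "Ok." once the server is ready, so a minimal check from the host needs only the Python standard library:

# check_clickhouse.py -- minimal health check against the HTTP interface
from urllib.request import urlopen

# The /ping endpoint returns "Ok." once the server is ready to accept queries
with urlopen('http://localhost:8123/ping', timeout=5) as response:
    print(response.read().decode().strip())  # expected output: Ok.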

Step 4: Preparing the ClickHouse Database

1. Copy the data file into the container.

docker cp data.csv clickhouse-server:/tmp/data.csv

2. Connect to the ClickHouse container.

docker exec -it clickhouse-server /bin/bash

3. Inside the container, connect to the ClickHouse server.

clickhouse-client

4. Create a table to hold our data. The MergeTree engine with ORDER BY id stores the rows sorted by id on disk, which is the standard choice for analytical tables like this one.

CREATE TABLE large_dataset (
    id UInt32,
    feature1 Float64,
    feature2 Float64,
    feature3 Float64
) ENGINE = MergeTree()
ORDER BY id;  

5. Exit from the ClickHouse shell.

exit

Step 5: Loading Data into ClickHouse

1. Use the ClickHouse client to import the CSV data. Because the file generated in Step 2 includes a header row, use the CSVWithNames format. (An alternative that loads the file directly from the host with Python is shown at the end of this step.)

clickhouse-client --query "INSERT INTO large_dataset FORMAT CSVWithNames" --progress < /tmp/data.csv

2. Exit from the Docker container.

exit
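
Alternatively, you can load the data directly from the host without copying the CSV into the container: the clickhouse-driver package installed in Step 1 can stream the rows over the native protocol on port 9000. The sketch below assumes the default user with an empty password and the table created in Step 4.

# load_data.py -- optional host-side load using clickhouse-driver
import pandas as pd
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

# Read the CSV generated in Step 2 and convert each row to plain Python types
df = pd.read_csv('data.csv')
rows = [
    (int(r.id), float(r.feature1), float(r.feature2), float(r.feature3))
    for r in df.itertuples(index=False)
]

# Insert all rows in a single batch over the native protocol
client.execute(
    'INSERT INTO large_dataset (id, feature1, feature2, feature3) VALUES',
    rows
)
print(f"Inserted {len(rows)} rows")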

Step 6: Verifying the Data

1. Create a verification script to check the data was loaded correctly:

nano verify_data.py

Add the code below.

# verify_data.py
from clickhouse_driver import Client

client = Client(
    host='localhost',
    user='default',
    password=''
)

count = client.execute('SELECT count() FROM large_dataset')[0][0]
print(f"Table contains {count} rows")

sample = client.execute('SELECT * FROM large_dataset LIMIT 5')
print("Sample rows:")
for row in sample:
    print(row)

2. Run the verification.

python3 verify_data.py

Output.

Table contains 1000000 rows
Sample rows:
(1, 0.38233336651112515, 0.29973790958931235, 0.936007728842977)
(2, 0.13126390736004656, 0.249047252224595, 0.6534321474304323)
(3, 0.7285561857877713, 0.38113395640456516, 0.6836105833319909)
(4, 0.5117479220423862, 0.7073417787216559, 0.8446349501745961)
(5, 0.6415940064857016, 0.7574774031938657, 0.7175210065963386)

Step 7: Running Performance Queries

Now that the data is loaded, let’s test some queries. ClickHouse’s columnar storage and vectorized execution shine with analytical queries, and a Python sketch for running them from the host appears at the end of this step:

1. First, install the ClickHouse client on the host.

sudo apt install -y clickhouse-client

2. Connect to the ClickHouse server.

clickhouse-client

Output.

ClickHouse client version 18.16.1.
Connecting to localhost:9000.
Connected to ClickHouse server version 25.3.2 revision 54476.

e7cf71653d41 :)

3. Run the query below for basic aggregation.

SELECT 
    avg(feature1) as avg_feature1,
    stddevPop(feature2) as std_feature2,
    quantile(0.99)(feature3) as p99_feature3
FROM large_dataset

Output:

SELECT 
    avg(feature1) AS avg_feature1, 
    stddevPop(feature2) AS std_feature2, 
    quantile(0.99)(feature3) AS p99_feature3
FROM large_dataset 

┌───────avg_feature1─┬────────std_feature2─┬───────p99_feature3─┐
│ 0.5002608413633995 │ 0.28872380769587086 │ 0.9908446093587597 │
└────────────────────┴─────────────────────┴────────────────────┘
↘ Progress: 1.00 million rows, 24.00 MB (108.22 million rows/s., 2.60 GB/s.)  99%
1 rows in set. Elapsed: 0.010 sec. Processed 1.00 million rows, 24.00 MB (103.62 million rows/s., 2.49 GB/s.) 

4. Run the query below for filtering and grouping.

SELECT 
    intDiv(id, 100000) as bucket,
    count() as count,
    avg(feature1 + feature2) as avg_combined
FROM large_dataset
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket

Output:

SELECT 
    intDiv(id, 100000) AS bucket, 
    count() AS count, 
    avg(feature1 + feature2) AS avg_combined
FROM large_dataset 
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket ASC

┌─bucket─┬─count─┬───────avg_combined─┐
│      0 │ 49802 │ 1.0000974894524948 │
│      1 │ 50131 │ 1.0002304278511396 │
│      2 │ 49990 │ 1.0011286054623088 │
│      3 │ 50031 │ 0.9968373869613624 │
│      4 │ 50022 │ 1.0022925786397234 │
│      5 │ 50123 │ 0.9986748463314014 │
│      6 │ 50218 │ 1.0004042381656353 │
│      7 │ 50041 │ 0.9998849136710541 │
│      8 │ 50138 │  1.000476200468667 │
│      9 │ 49908 │ 1.0030392683409466 │
└────────┴───────┴────────────────────┘
↙ Progress: 1.00 million rows, 28.00 MB (87.21 million rows/s., 2.44 GB/s.)  99%
10 rows in set. Elapsed: 0.012 sec. Processed 1.00 million rows, 28.00 MB (83.71 million rows/s., 2.34 GB/s.) 

5. Exit from the ClickHouse shell.

exit
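
The queries above can also be run from the host with clickhouse-driver, which is convenient for scripted benchmarking. The sketch below times the aggregation query from earlier in this step; timings will vary with your hardware.

# benchmark_query.py -- time the aggregation query from Python
import time
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

query = """
    SELECT
        avg(feature1) AS avg_feature1,
        stddevPop(feature2) AS std_feature2,
        quantile(0.99)(feature3) AS p99_feature3
    FROM large_dataset
"""

start = time.perf_counter()
result = client.execute(query)
elapsed = time.perf_counter() - start

print(f"Result: {result[0]}")
print(f"Elapsed: {elapsed:.3f} s")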

Conclusion

ClickHouse’s columnar storage and vectorized query execution deliver exceptional performance when querying large datasets, and running the server in Docker on a GPU-enabled Ubuntu server keeps the host’s GPU available for other accelerated workloads alongside it. The setup process is straightforward, and the performance benefits for analytical workloads are substantial.