ClickHouse is an open-source column-oriented database management system that excels at real-time analytical processing (OLAP). When deployed on a GPU-enabled Ubuntu server, ClickHouse can achieve even greater performance for querying large datasets.
This in-depth guide will walk you through setting up ClickHouse with GPU support, generating test data, loading it into ClickHouse, and verifying the installation.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user with sudo privileges.
- NVIDIA drivers installed.
Step 1: Setting Up the Python Environment
First, we’ll create a clean Python environment for our data generation and verification scripts.
1. Install Python and the required packages.
sudo apt install -y python3 python3-pip python3-venv
2. Create and activate the virtual environment.
python3 -m venv venv
source venv/bin/activate
3. Install the required Python packages.
pip install numpy pandas clickhouse-driver
Explanation:
- numpy for numerical operations
- pandas for data manipulation
- clickhouse-driver for Python connectivity to ClickHouse
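Before moving on, you can confirm the environment is working with a short script that imports each package and prints its version. The file name check_env.py is just a suggestion:
# check_env.py - quick sanity check that the packages are importable
import numpy as np
import pandas as pd
import clickhouse_driver

# Print versions so you know what the rest of the guide is running against
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("clickhouse-driver:", clickhouse_driver.__version__)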
Step 2: Generating Synthetic Test Data
Create a Python script to generate a large dataset for testing.
nano generate_data.py
Add the following code:
# generate_data.py
import pandas as pd
import numpy as np

# Generate 1 million rows of synthetic data
data = {
    'id': np.arange(1, 1_000_001),
    'feature1': np.random.rand(1_000_000),
    'feature2': np.random.rand(1_000_000),
    'feature3': np.random.rand(1_000_000)
}

df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
print("Generated data.csv with 1,000,000 rows")
Run the script.
python3 generate_data.py
This creates a CSV file with 1 million rows of random floating-point numbers across three features, plus an ID column.
Output:
Generated data.csv with 1,000,000 rows
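If you want to spot-check the file before loading it, a small optional script can read back the first few rows with pandas. This is not required for the rest of the guide:
# peek_data.py - optional spot check of the generated CSV
import pandas as pd

# Read only the first five rows; the full file holds 1,000,000
head = pd.read_csv('data.csv', nrows=5)
print(head)
print("Columns:", list(head.columns))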
Step 3: Setting Up ClickHouse with GPU Support
ClickHouse’s GPU support is experimental and limited to certain operations, so treat any GPU-related gains as workload-dependent. We’ll use Docker to run ClickHouse with the host GPU exposed to the container.
docker run -d \
--name clickhouse-server \
--gpus all \
-p 9000:9000 \
-p 8123:8123 \
-v $PWD/data:/var/lib/clickhouse \
-e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
clickhouse/clickhouse-server:latest
Key parameters:
- --gpus all: Enables GPU access for the container
- Ports 9000 (native protocol) and 8123 (HTTP interface) exposed
- Data directory mounted for persistence
- Access management enabled for security
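Before continuing, it is worth confirming that the container can actually see the GPU. The sketch below shells out to docker exec and runs nvidia-smi inside the container; it assumes the NVIDIA Container Toolkit is installed on the host, which is what makes the --gpus all flag work. Running docker exec clickhouse-server nvidia-smi -L directly in a terminal does the same thing:
# check_gpu.py - verify the ClickHouse container can see the GPU
# Assumes the NVIDIA Container Toolkit is installed on the host, which
# injects nvidia-smi into containers started with --gpus all.
import subprocess

result = subprocess.run(
    ["docker", "exec", "clickhouse-server", "nvidia-smi", "-L"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("GPU visible inside the container:")
    print(result.stdout.strip())
else:
    print("nvidia-smi failed inside the container:")
    print(result.stderr.strip())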
Step 4: Preparing the ClickHouse Database
1. Copy the data file into the container:
docker cp data.csv clickhouse-server:/tmp/data.csv
2. Connect to the ClickHouse container.
docker exec -it clickhouse-server /bin/bash
3. Inside the container, connect to the ClickHouse server.
clickhouse-client
4. Create a table to hold our data.
CREATE TABLE large_dataset (
    id UInt32,
    feature1 Float64,
    feature2 Float64,
    feature3 Float64
) ENGINE = MergeTree()
ORDER BY id;
5. Exit from the ClickHouse shell.
exit
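If you prefer to stay on the host rather than working inside the container, the same table can be created with clickhouse-driver over the native port (9000) that the docker run command published. This is an optional alternative sketch, assuming the default user with an empty password as used elsewhere in this guide:
# create_table.py - optional alternative: create the table from the host
from clickhouse_driver import Client

# Connect over the native protocol port published by the container
client = Client(host='localhost', port=9000, user='default', password='')

client.execute('''
    CREATE TABLE IF NOT EXISTS large_dataset (
        id UInt32,
        feature1 Float64,
        feature2 Float64,
        feature3 Float64
    ) ENGINE = MergeTree()
    ORDER BY id
''')
print("Table large_dataset is ready")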
Step 5: Loading Data into ClickHouse
1. Use the ClickHouse client to import the CSV data. The file includes a header row written by pandas, so use the CSVWithNames format to skip it:
clickhouse-client --progress --query "INSERT INTO large_dataset FORMAT CSVWithNames" < /tmp/data.csv
2. Exit from the Docker container.
exit
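As an alternative to copying the CSV into the container, you can stream it into ClickHouse from the host with pandas and clickhouse-driver. The sketch below assumes the table from Step 4 exists and that you have not already loaded the file another way (running both loads would duplicate the rows):
# load_data.py - optional alternative to docker cp plus the container's client
import pandas as pd
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

total = 0
# Stream the CSV in chunks so the whole file never sits in memory at once
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    rows = [
        (int(r.id), float(r.feature1), float(r.feature2), float(r.feature3))
        for r in chunk.itertuples(index=False)
    ]
    client.execute(
        'INSERT INTO large_dataset (id, feature1, feature2, feature3) VALUES',
        rows,
    )
    total += len(rows)

print(f"Inserted {total} rows")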
Step 6: Verifying the Data
1. Create a verification script to check the data was loaded correctly:
nano verify_data.py
Add the code below.
# verify_data.py
from clickhouse_driver import Client

# Connect to the ClickHouse server exposed on localhost by the container
client = Client(
    host='localhost',
    user='default',
    password=''
)

# Check that the expected number of rows was loaded
count = client.execute('SELECT count() FROM large_dataset')[0][0]
print(f"Table contains {count} rows")

# Fetch a few rows to inspect the values
sample = client.execute('SELECT * FROM large_dataset LIMIT 5')
print("Sample rows:")
for row in sample:
    print(row)
2. Run the verification.
python3 verify_data.py
Output.
Table contains 1000000 rows
Sample rows:
(1, 0.38233336651112515, 0.29973790958931235, 0.936007728842977)
(2, 0.13126390736004656, 0.249047252224595, 0.6534321474304323)
(3, 0.7285561857877713, 0.38113395640456516, 0.6836105833319909)
(4, 0.5117479220423862, 0.7073417787216559, 0.8446349501745961)
(5, 0.6415940064857016, 0.7574774031938657, 0.7175210065963386)
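Because the three feature columns were drawn from a uniform [0, 1) distribution, their averages should land close to 0.5, which gives a cheap statistical check on top of the row count. The following optional extension of the verification script is a sketch of that idea:
# sanity_check.py - optional statistical check on the loaded data
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

count, avg1, avg2, avg3 = client.execute(
    'SELECT count(), avg(feature1), avg(feature2), avg(feature3) FROM large_dataset'
)[0]

assert count == 1_000_000, f"expected 1,000,000 rows, found {count}"

# Uniform [0, 1) data should average close to 0.5
for name, avg in (('feature1', avg1), ('feature2', avg2), ('feature3', avg3)):
    print(f"{name}: avg = {avg:.4f} (expected ~0.5)")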
Step 7: Running Performance Queries
Now that the data is loaded, let’s test some queries. ClickHouse’s columnar storage and GPU acceleration shine with analytical queries (a Python sketch at the end of this step shows how to run the same queries programmatically):
1. First, install the ClickHouse client.
sudo apt install -y clickhouse-client
2. Connect to the ClickHouse server.
clickhouse-client
Output.
ClickHouse client version 18.16.1.
Connecting to localhost:9000.
Connected to ClickHouse server version 25.3.2 revision 54476.
e7cf71653d41 :)
3. Run the query below for basic aggregation.
SELECT
    avg(feature1) AS avg_feature1,
    stddevPop(feature2) AS std_feature2,
    quantile(0.99)(feature3) AS p99_feature3
FROM large_dataset;
Output:
SELECT
avg(feature1) AS avg_feature1,
stddevPop(feature2) AS std_feature2,
quantile(0.99)(feature3) AS p99_feature3
FROM large_dataset
┌───────avg_feature1─┬────────std_feature2─┬───────p99_feature3─┐
│ 0.5002608413633995 │ 0.28872380769587086 │ 0.9908446093587597 │
└────────────────────┴─────────────────────┴────────────────────┘
1 rows in set. Elapsed: 0.010 sec. Processed 1.00 million rows, 24.00 MB (103.62 million rows/s., 2.49 GB/s.)
4. Run the query below for filtering and grouping.
SELECT
    intDiv(id, 100000) AS bucket,
    count() AS count,
    avg(feature1 + feature2) AS avg_combined
FROM large_dataset
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket;
Output:
SELECT
intDiv(id, 100000) AS bucket,
count() AS count,
avg(feature1 + feature2) AS avg_combined
FROM large_dataset
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket ASC
┌─bucket─┬─count─┬───────avg_combined─┐
│ 0 │ 49802 │ 1.0000974894524948 │
│ 1 │ 50131 │ 1.0002304278511396 │
│ 2 │ 49990 │ 1.0011286054623088 │
│ 3 │ 50031 │ 0.9968373869613624 │
│ 4 │ 50022 │ 1.0022925786397234 │
│ 5 │ 50123 │ 0.9986748463314014 │
│ 6 │ 50218 │ 1.0004042381656353 │
│ 7 │ 50041 │ 0.9998849136710541 │
│ 8 │ 50138 │ 1.000476200468667 │
│ 9 │ 49908 │ 1.0030392683409466 │
└────────┴───────┴────────────────────┘
10 rows in set. Elapsed: 0.012 sec. Processed 1.00 million rows, 28.00 MB (83.71 million rows/s., 2.34 GB/s.)
5. Exit from the ClickHouse shell.
exit
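The same queries can also be run from Python, which is convenient when you want to time them or feed the results into further analysis. The sketch below uses clickhouse-driver and is separate from the interactive client session shown above:
# run_queries.py - run the analytical queries from Python and time them
import time
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

queries = {
    'aggregation': '''
        SELECT avg(feature1), stddevPop(feature2), quantile(0.99)(feature3)
        FROM large_dataset
    ''',
    'filter and group by': '''
        SELECT intDiv(id, 100000) AS bucket, count(), avg(feature1 + feature2)
        FROM large_dataset
        WHERE feature3 > 0.5
        GROUP BY bucket
        ORDER BY bucket
    ''',
}

for name, query in queries.items():
    start = time.perf_counter()
    rows = client.execute(query)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(rows)} row(s) in {elapsed:.3f}s")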
Conclusion
By combining ClickHouse’s columnar storage and vectorized query execution with GPU acceleration on an Ubuntu server, you can achieve exceptional performance when querying large datasets. The setup process is straightforward with Docker, and the performance benefits for analytical workloads are substantial.