ClickHouse is an open-source column-oriented database management system that excels at real-time analytical processing (OLAP). When deployed on a GPU-enabled Ubuntu server, ClickHouse can achieve even greater performance for querying large datasets.
This in-depth guide will walk you through setting up ClickHouse with GPU support, generating test data, loading it into ClickHouse, and verifying the installation.
Prerequisites
- An Ubuntu 24.04 server with an NVIDIA GPU.
- A non-root user with sudo privileges.
- NVIDIA drivers installed.
Step 1: Setting Up the Python Environment
First, we’ll create a clean Python environment for our data generation and verification scripts.
1. Install Python and the required packages.
sudo apt install -y python3 python3-pip python3-venv
2. Create and activate the virtual environment.
python3 -m venv venv
source venv/bin/activate
3. Install the required Python packages.
pip install numpy pandas clickhouse-driver
Explanation:
- numpy for numerical operations
- pandas for data manipulation
- clickhouse-driver for Python connectivity to ClickHouse
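Before moving on, you can confirm the environment is working with a short script that imports each package and prints its version. The file name check_env.py is just a suggestion:
# check_env.py - quick sanity check that the packages are importable
import numpy as np
import pandas as pd
import clickhouse_driver

# Print versions so you know what the rest of the guide is running against
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("clickhouse-driver:", clickhouse_driver.__version__)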
Step 2: Generating Synthetic Test Data
Create a Python script to generate a large dataset for testing.
nano generate_data.py
Add the following code:
# generate_data.py
import pandas as pd
import numpy as np

# Generate 1 million rows of synthetic data
data = {
    'id': np.arange(1, 1_000_001),
    'feature1': np.random.rand(1_000_000),
    'feature2': np.random.rand(1_000_000),
    'feature3': np.random.rand(1_000_000)
}

df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
print("Generated data.csv with 1,000,000 rows")
Run the script.
python3 generate_data.py
This creates a CSV file with 1 million rows of random floating-point numbers across three features, plus an ID column.
Output:
Generated data.csv with 1,000,000 rows
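If you want to spot-check the file before loading it, a small optional script can read back the first few rows with pandas. This is not required for the rest of the guide:
# peek_data.py - optional spot check of the generated CSV
import pandas as pd

# Read only the first five rows; the full file holds 1,000,000
head = pd.read_csv('data.csv', nrows=5)
print(head)
print("Columns:", list(head.columns))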
Step 3: Setting Up ClickHouse with GPU Support
ClickHouse’s GPU support is experimental and limited to certain operations, so treat any GPU-related gains as workload-dependent. We’ll use Docker to run ClickHouse with the host GPU exposed to the container.
docker run -d \
--name clickhouse-server \
--gpus all \
-p 9000:9000 \
-p 8123:8123 \
-v $PWD/data:/var/lib/clickhouse \
-e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
clickhouse/clickhouse-server:latest
Key parameters:
- --gpus all: Enables GPU access for the container
- Ports 9000 (native protocol) and 8123 (HTTP interface) exposed
- Data directory mounted for persistence
- Access management enabled for security
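Before continuing, it is worth confirming that the container can actually see the GPU. The sketch below shells out to docker exec and runs nvidia-smi inside the container; it assumes the NVIDIA Container Toolkit is installed on the host, which is what makes the --gpus all flag work. Running docker exec clickhouse-server nvidia-smi -L directly in a terminal does the same thing:
# check_gpu.py - verify the ClickHouse container can see the GPU
# Assumes the NVIDIA Container Toolkit is installed on the host, which
# injects nvidia-smi into containers started with --gpus all.
import subprocess

result = subprocess.run(
    ["docker", "exec", "clickhouse-server", "nvidia-smi", "-L"],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("GPU visible inside the container:")
    print(result.stdout.strip())
else:
    print("nvidia-smi failed inside the container:")
    print(result.stderr.strip())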
Step 4: Preparing the ClickHouse Database
1. Copy the data file into the container:
docker cp data.csv clickhouse-server:/tmp/data.csv
2. Connect to the ClickHouse container.
docker exec -it clickhouse-server /bin/bash
3. Inside the container, connect to the ClickHouse server.
clickhouse-client
4. Create a table to hold our data.
CREATE TABLE large_dataset (
    id UInt32,
    feature1 Float64,
    feature2 Float64,
    feature3 Float64
) ENGINE = MergeTree()
ORDER BY id;
5. Exit from the ClickHouse shell.
exit
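If you prefer to stay on the host rather than working inside the container, the same table can be created with clickhouse-driver over the native port (9000) that the docker run command published. This is an optional alternative sketch, assuming the default user with an empty password as used elsewhere in this guide:
# create_table.py - optional alternative: create the table from the host
from clickhouse_driver import Client

# Connect over the native protocol port published by the container
client = Client(host='localhost', port=9000, user='default', password='')

client.execute('''
    CREATE TABLE IF NOT EXISTS large_dataset (
        id UInt32,
        feature1 Float64,
        feature2 Float64,
        feature3 Float64
    ) ENGINE = MergeTree()
    ORDER BY id
''')
print("Table large_dataset is ready")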
Step 5: Loading Data into ClickHouse
1. Use the ClickHouse client to import the CSV data. The file includes a header row written by pandas, so use the CSVWithNames format to skip it:
clickhouse-client --progress --query "INSERT INTO large_dataset FORMAT CSVWithNames" < /tmp/data.csv
2. Exit from the Docker container.
exit
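As an alternative to copying the CSV into the container, you can stream it into ClickHouse from the host with pandas and clickhouse-driver. The sketch below assumes the table from Step 4 exists and that you have not already loaded the file another way (running both loads would duplicate the rows):
# load_data.py - optional alternative to docker cp plus the container's client
import pandas as pd
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

total = 0
# Stream the CSV in chunks so the whole file never sits in memory at once
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    rows = [
        (int(r.id), float(r.feature1), float(r.feature2), float(r.feature3))
        for r in chunk.itertuples(index=False)
    ]
    client.execute(
        'INSERT INTO large_dataset (id, feature1, feature2, feature3) VALUES',
        rows,
    )
    total += len(rows)

print(f"Inserted {total} rows")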
Step 6: Verifying the Data
1. Create a verification script to check the data was loaded correctly:
nano verify_data.py
Add the code below.
# verify_data.py
from clickhouse_driver import Client

# Connect to the ClickHouse server exposed on localhost by the container
client = Client(
    host='localhost',
    user='default',
    password=''
)

# Check that the expected number of rows was loaded
count = client.execute('SELECT count() FROM large_dataset')[0][0]
print(f"Table contains {count} rows")

# Fetch a few rows to inspect the values
sample = client.execute('SELECT * FROM large_dataset LIMIT 5')
print("Sample rows:")
for row in sample:
    print(row)
2. Run the verification.
python3 verify_data.py
Output.
Table contains 1000000 rows
Sample rows:
(1, 0.38233336651112515, 0.29973790958931235, 0.936007728842977)
(2, 0.13126390736004656, 0.249047252224595, 0.6534321474304323)
(3, 0.7285561857877713, 0.38113395640456516, 0.6836105833319909)
(4, 0.5117479220423862, 0.7073417787216559, 0.8446349501745961)
(5, 0.6415940064857016, 0.7574774031938657, 0.7175210065963386)
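Because the three feature columns were drawn from a uniform [0, 1) distribution, their averages should land close to 0.5, which gives a cheap statistical check on top of the row count. The following optional extension of the verification script is a sketch of that idea:
# sanity_check.py - optional statistical check on the loaded data
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

count, avg1, avg2, avg3 = client.execute(
    'SELECT count(), avg(feature1), avg(feature2), avg(feature3) FROM large_dataset'
)[0]

assert count == 1_000_000, f"expected 1,000,000 rows, found {count}"

# Uniform [0, 1) data should average close to 0.5
for name, avg in (('feature1', avg1), ('feature2', avg2), ('feature3', avg3)):
    print(f"{name}: avg = {avg:.4f} (expected ~0.5)")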
Step 7: Running Performance Queries
Now that the data is loaded, let’s test some queries. ClickHouse’s columnar storage and GPU acceleration shine with analytical queries (a Python sketch at the end of this step shows how to run the same queries programmatically):
1. First, install the ClickHouse client.
sudo apt install -y clickhouse-client
2. Connect to the ClickHouse server.
clickhouse-client
Output.
ClickHouse client version 18.16.1.
Connecting to localhost:9000.
Connected to ClickHouse server version 25.3.2 revision 54476.
e7cf71653d41 :)
3. Run the query below for basic aggregation.
SELECT
    avg(feature1) AS avg_feature1,
    stddevPop(feature2) AS std_feature2,
    quantile(0.99)(feature3) AS p99_feature3
FROM large_dataset;
Output:
SELECT
avg(feature1) AS avg_feature1,
stddevPop(feature2) AS std_feature2,
quantile(0.99)(feature3) AS p99_feature3
FROM large_dataset
┌───────avg_feature1─┬────────std_feature2─┬───────p99_feature3─┐
│ 0.5002608413633995 │ 0.28872380769587086 │ 0.9908446093587597 │
└────────────────────┴─────────────────────┴────────────────────┘
1 rows in set. Elapsed: 0.010 sec. Processed 1.00 million rows, 24.00 MB (103.62 million rows/s., 2.49 GB/s.)
4. Run the query below for filtering and grouping.
SELECT
    intDiv(id, 100000) AS bucket,
    count() AS count,
    avg(feature1 + feature2) AS avg_combined
FROM large_dataset
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket;
Output:
SELECT
intDiv(id, 100000) AS bucket,
count() AS count,
avg(feature1 + feature2) AS avg_combined
FROM large_dataset
WHERE feature3 > 0.5
GROUP BY bucket
ORDER BY bucket ASC
┌─bucket─┬─count─┬───────avg_combined─┐
│ 0 │ 49802 │ 1.0000974894524948 │
│ 1 │ 50131 │ 1.0002304278511396 │
│ 2 │ 49990 │ 1.0011286054623088 │
│ 3 │ 50031 │ 0.9968373869613624 │
│ 4 │ 50022 │ 1.0022925786397234 │
│ 5 │ 50123 │ 0.9986748463314014 │
│ 6 │ 50218 │ 1.0004042381656353 │
│ 7 │ 50041 │ 0.9998849136710541 │
│ 8 │ 50138 │ 1.000476200468667 │
│ 9 │ 49908 │ 1.0030392683409466 │
└────────┴───────┴────────────────────┘
10 rows in set. Elapsed: 0.012 sec. Processed 1.00 million rows, 28.00 MB (83.71 million rows/s., 2.34 GB/s.)
5. Exit from the ClickHouse shell.
exit
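The same queries can also be run from Python, which is convenient when you want to time them or feed the results into further analysis. The sketch below uses clickhouse-driver and is separate from the interactive client session shown above:
# run_queries.py - run the analytical queries from Python and time them
import time
from clickhouse_driver import Client

client = Client(host='localhost', user='default', password='')

queries = {
    'aggregation': '''
        SELECT avg(feature1), stddevPop(feature2), quantile(0.99)(feature3)
        FROM large_dataset
    ''',
    'filter and group by': '''
        SELECT intDiv(id, 100000) AS bucket, count(), avg(feature1 + feature2)
        FROM large_dataset
        WHERE feature3 > 0.5
        GROUP BY bucket
        ORDER BY bucket
    ''',
}

for name, query in queries.items():
    start = time.perf_counter()
    rows = client.execute(query)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(rows)} row(s) in {elapsed:.3f}s")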
Conclusion
By combining ClickHouse’s columnar storage and vectorized query execution with GPU acceleration on an Ubuntu server, you can achieve exceptional performance when querying large datasets. The setup process is straightforward with Docker, and the performance benefits for analytical workloads are substantial.