ClickHouse & Parquet
ClickHouse is an optional analytical database that runs alongside your ar.io gateway to accelerate GraphQL queries on large datasets. It stores stable historical data in a compressed columnar format while SQLite continues handling real-time writes and recent data.
ClickHouse requires version 24.8 or later. Earlier versions have known issues with projections on ReplacingMergeTree tables. The default Docker image uses ClickHouse 26.3.
How It Works
The gateway uses a speed/batch layer architecture:
- SQLite (speed layer) — Handles all real-time writes, recent/unstable data, and non-GraphQL operations (data retrieval, chunk fetches, admin APIs)
- ClickHouse (batch layer) — Stores stable historical data exported as Parquet files, optimized for analytical GraphQL queries
Query Routing
When ClickHouse is enabled, GraphQL queries are split by block height:
- ClickHouse handles blocks up to
(max_imported_height - 10)— the bulk of historical data - SQLite handles blocks above that boundary — recent, potentially unstable data
Results from both backends are merged and deduplicated before being returned to the client. If either backend is unavailable, a circuit breaker degrades gracefully rather than failing the entire query.
When to Enable ClickHouse
Enable ClickHouse if your gateway:
- Serves heavy GraphQL traffic on historical data
- Indexes millions of data items (large unbundling filters)
- Needs faster bootstrapping than a full SQLite reindex (import pre-built Parquet snapshots)
- Would benefit from 3-5x storage compression on historical data
Keep it disabled for small deployments, gateways that only serve data retrieval (not GraphQL), or single-node low-resource environments.
Quick Start
Configure Environment
Add ClickHouse settings to your .env file:
CLICKHOUSE_URL=http://clickhouse:8123
CLICKHOUSE_PASSWORD=your-password
CLICKHOUSE_SQLITE_MIN_HEIGHT_ENABLED=true
ADMIN_API_KEY=your-admin-keyIf using a non-default username:
CLICKHOUSE_USER=your-usernameConfigure Unbundling Filters
ClickHouse stores data items exported from SQLite. The more you unbundle, the more data flows into ClickHouse. At minimum, configure filters that match the pre-built snapshot you plan to import.
The ArDrive snapshot includes all ArDrive-uploaded data items. To match this:
ANS104_UNBUNDLE_FILTER='{ "and": [ { "not": { "or": [ { "tags": [ { "name": "Bundler-App-Name", "value": "Warp" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "Redstone" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "KYVE" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "AO" } ] }, { "attributes": { "owner_address": "-OXcT1sVRSA5eGwt2k6Yuz8-3e3g9WJi5uSE99CWqsBs" } }, { "attributes": { "owner_address": "ZE0N-8P9gXkhtK-07PQu9d8me5tGDxa_i4Mee5RzVYg" } }, { "attributes": { "owner_address": "6DTqSgzXVErOuLhaP0fmAjqF4yzXkvth58asTxP3pNw" } } ] } }, { "tags": [ { "name": "App-Name", "valueStartsWith": "ArDrive" } ] } ] }'
ANS104_INDEX_FILTER='{ "tags": [ { "name": "App-Name", "value": "ArDrive-App" } ] }'See the Filters guide for building custom filter configurations.
Start ClickHouse
ClickHouse runs as a Docker profile alongside your gateway:
docker compose --profile clickhouse up -dThis starts both the ClickHouse server and the auto-import daemon.
Import a Parquet Snapshot
Download and import a pre-built snapshot to bootstrap your ClickHouse instance:
# Download the ArDrive snapshot (~3.5GB, current to April 23, 2025)
curl -L https://turbo-gateway.com/JVmsuD2EmFkhitzWN71oi9woADE4WUfvrbBYgremCBM -o ardrive-parquet.tar.gz
# Extract
tar -xzf ardrive-parquet.tar.gz
# Move into the data directory
mkdir -p data/parquet
mv 2025-04-23-ardrive-ans104-parquet/* data/parquet/
# Run the import script
./scripts/clickhouse-importThe import takes 10-20 minutes depending on hardware.
Verify
Check the transaction count:
docker compose exec clickhouse clickhouse-client \
--password your-password \
-q 'SELECT COUNT(DISTINCT id) FROM transactions'Expected result: 32712311 (for the ArDrive snapshot).
Test a GraphQL query:
curl -g -X POST \
-H "Content-Type: application/json" \
-d '{"query":"query { transactions(ids: [\"YSNwoYB01EFIzbs6HmkGUjjxHW3xuqh-rckYhi0av4A\"]) { edges { node { block { height } bundledIn { id } } } } }"}' \
http://localhost:3000/graphqlAuto-Import
The clickhouse-auto-import container runs a continuous loop that exports new stable data from SQLite to Parquet and imports it into ClickHouse. This keeps ClickHouse up to date without manual intervention.
Each cycle:
- Reloads TTL rules from
config/clickhouse-ttl-rules.yaml - Advances the export window to the latest stable height
- Exports new Parquet partitions from SQLite
- Imports Parquet into ClickHouse staging tables
- Migrates staging to the final
transactionstable (applying TTL rules)
Auto-Import Configuration
| Variable | Default | Description |
|---|---|---|
CLICKHOUSE_AUTO_IMPORT_SLEEP_INTERVAL | 3600 | Seconds between import cycles |
CLICKHOUSE_AUTO_IMPORT_HEIGHT_INTERVAL | 1000 | Blocks per Parquet partition |
CLICKHOUSE_AUTO_IMPORT_MAX_ROWS_PER_FILE | - | Maximum rows per Parquet file |
The default 1-hour interval is conservative. For gateways with high unbundling throughput, reduce to 600-900 seconds to keep ClickHouse more current.
TTL Rules (Data Retention)
TTL rules control how long data items remain in ClickHouse before automatic expiration. Use them to manage storage costs by expiring ephemeral or low-value data while retaining important content indefinitely.
Configuration
Copy the example template and edit:
cp config/clickhouse-ttl-rules.example.yaml config/clickhouse-ttl-rules.yamlExample configuration:
# Keep L1 transactions forever, expire unmatched data items after 30 days
default_ttl_seconds: 2592000
l1_never_expires: true
rules:
# Keep ArDrive content forever
- tag_name: App-Name
tag_value: ArDrive
match: prefix
never_expire: true
# Expire ephemeral chat messages after 1 day
- tag_name: App-Name
tag_value: ephemeral-chat
ttl_seconds: 86400
# Expire test uploads after 1 hour
- tag_name: App-Name
tag_value: test-
match: prefix
ttl_seconds: 3600
# Expire specific owner's data after 7 days
- field: owner_address
value: abcDEF0123xyz
ttl_seconds: 604800Rule Precedence
When multiple rules could apply, the first matching branch wins:
l1_never_expires: trueAND row is L1 transaction → kept forever- Any matching rule with
never_expire: true→ kept forever - One or more matching TTL rules → shortest
ttl_secondswins default_ttl_secondsset → applied to all unmatched rows- Otherwise → kept indefinitely (no expiry)
Match Modes
exact(default) — O(1) dictionary lookup. Use for specific tag values.prefix— Matches the beginning of the value. Use forContent-Typewith parameters (e.g.,image/gifcatchesimage/gif; charset=...) or app name families.
TTL rules apply only to rows imported after the rules are loaded. Previously imported rows keep their existing expiration. To apply new rules retroactively, re-import affected Parquet partitions.
Normalization
- Tag names are lower-cased and trimmed
- Tag values are trimmed but case-preserving
- Owner values use base64url format (as displayed on Arweave explorers)
Deployment Topologies
Single Gateway (Default)
One gateway runs ClickHouse locally. The auto-import daemon handles continuous data export and import.
docker compose --profile clickhouse up -dShared ClickHouse Cluster
Multiple gateways share a replicated ClickHouse cluster. Only one gateway runs clickhouse-auto-import as the writer; others query the same cluster as readers.
# Writer gateway
CLICKHOUSE_URL=http://clickhouse-writer:8123
# runs clickhouse-auto-import
# Reader gateways
CLICKHOUSE_URL=http://clickhouse-reader:8123
# do NOT run clickhouse-auto-importReader gateways still need START_WRITERS=true because SQLite indexing is required for data retrieval, chunk fetches, and other non-GraphQL operations. Only GraphQL queries are routed to ClickHouse.
Independent ClickHouse per Gateway
Each gateway runs its own ClickHouse instance, bootstrapped from a shared Parquet snapshot (downloaded from Arweave, S3, or BitTorrent). No shared state between gateways.
This is the simplest topology for operators who want ClickHouse performance without cluster management.
Configuration Reference
Connection
| Variable | Default | Description |
|---|---|---|
CLICKHOUSE_URL | - | ClickHouse HTTP endpoint (e.g., http://clickhouse:8123) |
CLICKHOUSE_USER | default | ClickHouse username |
CLICKHOUSE_PASSWORD | - | ClickHouse password |
Query Routing
| Variable | Default | Description |
|---|---|---|
CLICKHOUSE_SQLITE_MIN_HEIGHT_ENABLED | false | Enable height-based query split between ClickHouse and SQLite |
CLICKHOUSE_SQLITE_MIN_HEIGHT_BUFFER | 10 | Number of blocks below ClickHouse max height still served by SQLite |
CLICKHOUSE_MAX_HEIGHT_CACHE_TTL_SECONDS | 60 | Cache duration for ClickHouse max-height lookup |
CLICKHOUSE_QUERY_TIMEOUT_SECONDS | 3 | Query timeout (both client and server side) |
CLICKHOUSE_GQL_MAX_ROWS_TO_READ | 10000000 | Safety guardrail: max rows scanned per query |
CLICKHOUSE_GQL_DEDUPE_HEADROOM | 4 | Pagination headroom multiplier for deduplication |
Circuit Breaker
The circuit breaker protects against cascading failures when the SQLite leg is unhealthy. When tripped, queries degrade to ClickHouse-only results with a partial result warning.
| Variable | Default | Description |
|---|---|---|
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_TIMEOUT_MS | 5000 | SQLite query timeout before counting as error |
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_ERROR_THRESHOLD_PERCENTAGE | 80 | Error rate percentage that trips the breaker |
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_ROLLING_COUNT_TIMEOUT_MS | 60000 | Rolling window for error rate calculation |
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_RESET_TIMEOUT_MS | 30000 | Wait time before attempting half-open trial |
Tuning
Short Pages in GraphQL Pagination
If GraphQL queries return fewer results than expected per page, unmerged duplicates in ClickHouse may be consuming pagination slots. Increase the deduplication headroom:
CLICKHOUSE_GQL_DEDUPE_HEADROOM=8 # default is 4This makes the inner query fetch more rows to compensate for duplicates that get collapsed.
Query Timeouts
For queries over very large datasets or complex tag filters, increase the timeout:
CLICKHOUSE_QUERY_TIMEOUT_SECONDS=10 # default is 3Max Rows Safety Guardrail
If queries hit the max_rows_to_read limit (returns an error), you can raise it — but do so with caution as it removes the protection against runaway full-table scans:
CLICKHOUSE_GQL_MAX_ROWS_TO_READ=50000000 # default is 10MUpgrading
Schema Evolution
Schema changes are applied automatically on each import cycle via idempotent ALTER TABLE statements. No manual migration steps are needed for routine upgrades.
Projection Rebuild (One-Time)
Gateways upgrading from pre-24.8 ClickHouse deployments may need to rebuild the owner projection:
ALTER TABLE transactions MATERIALIZE PROJECTION owner_projection;This runs in the background and may take time on large tables.
Rollback
If you need to roll back a ClickHouse upgrade:
- Drop the
transactionstable - Re-import from your Parquet files using
./scripts/clickhouse-import
Parquet files are the source of truth — ClickHouse is always rebuildable from them.
Troubleshooting
Import Failures
Failed imports leave files in data/etl/staging/ for inspection. Check the auto-import container logs:
docker compose --profile clickhouse logs clickhouse-auto-import -fFix the underlying issue (disk space, permissions, ClickHouse connectivity) and restart the container — the next cycle will retry.
Disk Usage Growing
If ClickHouse disk usage is higher than expected:
- Check your TTL rules are configured and expiring data as intended
- Verify background merges are running (rows are only deleted on merge):
docker compose exec clickhouse clickhouse-client \ --password your-password \ -q 'SELECT partition, count() FROM system.parts WHERE table = '\''transactions'\'' GROUP BY partition' - Force a merge on a specific partition if needed (heavyweight operation):
OPTIMIZE TABLE transactions PARTITION 14 FINAL
Checking Import Status
View the current max imported height:
docker compose exec clickhouse clickhouse-client \
--password your-password \
-q 'SELECT max(height) FROM transactions'Compare this to the gateway's current height (/height endpoint) — the difference indicates how far behind ClickHouse is.
Docker Commands
# Start gateway with ClickHouse
docker compose --profile clickhouse up -d
# Stop gateway with ClickHouse
docker compose --profile clickhouse down
# View ClickHouse logs
docker compose --profile clickhouse logs clickhouse -f
# View auto-import logs
docker compose --profile clickhouse logs clickhouse-auto-import -f
# Restart auto-import after config change
docker compose --profile clickhouse restart clickhouse-auto-import
# Pull latest images
docker compose --profile clickhouse pullRelated
Filters
Configure unbundling and indexing filters that control what data flows into ClickHouse
Grafana Monitoring
Visualize gateway and ClickHouse performance metrics
Environment Variables
Full configuration reference for all gateway services
Gateway Architecture
Understand the data layer and worker system architecture
How is this guide?