ar.io Logoar.io Documentation

ClickHouse & Parquet

ClickHouse is an optional analytical database that runs alongside your ar.io gateway to accelerate GraphQL queries on large datasets. It stores stable historical data in a compressed columnar format while SQLite continues handling real-time writes and recent data.

ClickHouse requires version 24.8 or later. Earlier versions have known issues with projections on ReplacingMergeTree tables. The default Docker image uses ClickHouse 26.3.

How It Works

The gateway uses a speed/batch layer architecture:

  • SQLite (speed layer) — Handles all real-time writes, recent/unstable data, and non-GraphQL operations (data retrieval, chunk fetches, admin APIs)
  • ClickHouse (batch layer) — Stores stable historical data exported as Parquet files, optimized for analytical GraphQL queries

Query Routing

When ClickHouse is enabled, GraphQL queries are split by block height:

  • ClickHouse handles blocks up to (max_imported_height - 10) — the bulk of historical data
  • SQLite handles blocks above that boundary — recent, potentially unstable data

Results from both backends are merged and deduplicated before being returned to the client. If either backend is unavailable, a circuit breaker degrades gracefully rather than failing the entire query.

When to Enable ClickHouse

Enable ClickHouse if your gateway:

  • Serves heavy GraphQL traffic on historical data
  • Indexes millions of data items (large unbundling filters)
  • Needs faster bootstrapping than a full SQLite reindex (import pre-built Parquet snapshots)
  • Would benefit from 3-5x storage compression on historical data

Keep it disabled for small deployments, gateways that only serve data retrieval (not GraphQL), or single-node low-resource environments.

Quick Start

Configure Environment

Add ClickHouse settings to your .env file:

CLICKHOUSE_URL=http://clickhouse:8123
CLICKHOUSE_PASSWORD=your-password
CLICKHOUSE_SQLITE_MIN_HEIGHT_ENABLED=true
ADMIN_API_KEY=your-admin-key

If using a non-default username:

CLICKHOUSE_USER=your-username

Configure Unbundling Filters

ClickHouse stores data items exported from SQLite. The more you unbundle, the more data flows into ClickHouse. At minimum, configure filters that match the pre-built snapshot you plan to import.

The ArDrive snapshot includes all ArDrive-uploaded data items. To match this:

ANS104_UNBUNDLE_FILTER='{ "and": [ { "not": { "or": [ { "tags": [ { "name": "Bundler-App-Name", "value": "Warp" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "Redstone" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "KYVE" } ] }, { "tags": [ { "name": "Bundler-App-Name", "value": "AO" } ] }, { "attributes": { "owner_address": "-OXcT1sVRSA5eGwt2k6Yuz8-3e3g9WJi5uSE99CWqsBs" } }, { "attributes": { "owner_address": "ZE0N-8P9gXkhtK-07PQu9d8me5tGDxa_i4Mee5RzVYg" } }, { "attributes": { "owner_address": "6DTqSgzXVErOuLhaP0fmAjqF4yzXkvth58asTxP3pNw" } } ] } }, { "tags": [ { "name": "App-Name", "valueStartsWith": "ArDrive" } ] } ] }'
ANS104_INDEX_FILTER='{ "tags": [ { "name": "App-Name", "value": "ArDrive-App" } ] }'

See the Filters guide for building custom filter configurations.

Start ClickHouse

ClickHouse runs as a Docker profile alongside your gateway:

docker compose --profile clickhouse up -d

This starts both the ClickHouse server and the auto-import daemon.

Import a Parquet Snapshot

Download and import a pre-built snapshot to bootstrap your ClickHouse instance:

# Download the ArDrive snapshot (~3.5GB, current to April 23, 2025)
curl -L https://turbo-gateway.com/JVmsuD2EmFkhitzWN71oi9woADE4WUfvrbBYgremCBM -o ardrive-parquet.tar.gz

# Extract
tar -xzf ardrive-parquet.tar.gz

# Move into the data directory
mkdir -p data/parquet
mv 2025-04-23-ardrive-ans104-parquet/* data/parquet/

# Run the import script
./scripts/clickhouse-import

The import takes 10-20 minutes depending on hardware.

Verify

Check the transaction count:

docker compose exec clickhouse clickhouse-client \
  --password your-password \
  -q 'SELECT COUNT(DISTINCT id) FROM transactions'

Expected result: 32712311 (for the ArDrive snapshot).

Test a GraphQL query:

curl -g -X POST \
  -H "Content-Type: application/json" \
  -d '{"query":"query { transactions(ids: [\"YSNwoYB01EFIzbs6HmkGUjjxHW3xuqh-rckYhi0av4A\"]) { edges { node { block { height } bundledIn { id } } } } }"}' \
  http://localhost:3000/graphql

Auto-Import

The clickhouse-auto-import container runs a continuous loop that exports new stable data from SQLite to Parquet and imports it into ClickHouse. This keeps ClickHouse up to date without manual intervention.

Each cycle:

  1. Reloads TTL rules from config/clickhouse-ttl-rules.yaml
  2. Advances the export window to the latest stable height
  3. Exports new Parquet partitions from SQLite
  4. Imports Parquet into ClickHouse staging tables
  5. Migrates staging to the final transactions table (applying TTL rules)

Auto-Import Configuration

VariableDefaultDescription
CLICKHOUSE_AUTO_IMPORT_SLEEP_INTERVAL3600Seconds between import cycles
CLICKHOUSE_AUTO_IMPORT_HEIGHT_INTERVAL1000Blocks per Parquet partition
CLICKHOUSE_AUTO_IMPORT_MAX_ROWS_PER_FILE-Maximum rows per Parquet file

The default 1-hour interval is conservative. For gateways with high unbundling throughput, reduce to 600-900 seconds to keep ClickHouse more current.

TTL Rules (Data Retention)

TTL rules control how long data items remain in ClickHouse before automatic expiration. Use them to manage storage costs by expiring ephemeral or low-value data while retaining important content indefinitely.

Configuration

Copy the example template and edit:

cp config/clickhouse-ttl-rules.example.yaml config/clickhouse-ttl-rules.yaml

Example configuration:

# Keep L1 transactions forever, expire unmatched data items after 30 days
default_ttl_seconds: 2592000
l1_never_expires: true

rules:
  # Keep ArDrive content forever
  - tag_name: App-Name
    tag_value: ArDrive
    match: prefix
    never_expire: true

  # Expire ephemeral chat messages after 1 day
  - tag_name: App-Name
    tag_value: ephemeral-chat
    ttl_seconds: 86400

  # Expire test uploads after 1 hour
  - tag_name: App-Name
    tag_value: test-
    match: prefix
    ttl_seconds: 3600

  # Expire specific owner's data after 7 days
  - field: owner_address
    value: abcDEF0123xyz
    ttl_seconds: 604800

Rule Precedence

When multiple rules could apply, the first matching branch wins:

  1. l1_never_expires: true AND row is L1 transaction → kept forever
  2. Any matching rule with never_expire: true → kept forever
  3. One or more matching TTL rules → shortest ttl_seconds wins
  4. default_ttl_seconds set → applied to all unmatched rows
  5. Otherwise → kept indefinitely (no expiry)

Match Modes

  • exact (default) — O(1) dictionary lookup. Use for specific tag values.
  • prefix — Matches the beginning of the value. Use for Content-Type with parameters (e.g., image/gif catches image/gif; charset=...) or app name families.

TTL rules apply only to rows imported after the rules are loaded. Previously imported rows keep their existing expiration. To apply new rules retroactively, re-import affected Parquet partitions.

Normalization

  • Tag names are lower-cased and trimmed
  • Tag values are trimmed but case-preserving
  • Owner values use base64url format (as displayed on Arweave explorers)

Deployment Topologies

Single Gateway (Default)

One gateway runs ClickHouse locally. The auto-import daemon handles continuous data export and import.

docker compose --profile clickhouse up -d

Shared ClickHouse Cluster

Multiple gateways share a replicated ClickHouse cluster. Only one gateway runs clickhouse-auto-import as the writer; others query the same cluster as readers.

# Writer gateway
CLICKHOUSE_URL=http://clickhouse-writer:8123
# runs clickhouse-auto-import

# Reader gateways
CLICKHOUSE_URL=http://clickhouse-reader:8123
# do NOT run clickhouse-auto-import

Reader gateways still need START_WRITERS=true because SQLite indexing is required for data retrieval, chunk fetches, and other non-GraphQL operations. Only GraphQL queries are routed to ClickHouse.

Independent ClickHouse per Gateway

Each gateway runs its own ClickHouse instance, bootstrapped from a shared Parquet snapshot (downloaded from Arweave, S3, or BitTorrent). No shared state between gateways.

This is the simplest topology for operators who want ClickHouse performance without cluster management.

Configuration Reference

Connection

VariableDefaultDescription
CLICKHOUSE_URL-ClickHouse HTTP endpoint (e.g., http://clickhouse:8123)
CLICKHOUSE_USERdefaultClickHouse username
CLICKHOUSE_PASSWORD-ClickHouse password

Query Routing

VariableDefaultDescription
CLICKHOUSE_SQLITE_MIN_HEIGHT_ENABLEDfalseEnable height-based query split between ClickHouse and SQLite
CLICKHOUSE_SQLITE_MIN_HEIGHT_BUFFER10Number of blocks below ClickHouse max height still served by SQLite
CLICKHOUSE_MAX_HEIGHT_CACHE_TTL_SECONDS60Cache duration for ClickHouse max-height lookup
CLICKHOUSE_QUERY_TIMEOUT_SECONDS3Query timeout (both client and server side)
CLICKHOUSE_GQL_MAX_ROWS_TO_READ10000000Safety guardrail: max rows scanned per query
CLICKHOUSE_GQL_DEDUPE_HEADROOM4Pagination headroom multiplier for deduplication

Circuit Breaker

The circuit breaker protects against cascading failures when the SQLite leg is unhealthy. When tripped, queries degrade to ClickHouse-only results with a partial result warning.

VariableDefaultDescription
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_TIMEOUT_MS5000SQLite query timeout before counting as error
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_ERROR_THRESHOLD_PERCENTAGE80Error rate percentage that trips the breaker
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_ROLLING_COUNT_TIMEOUT_MS60000Rolling window for error rate calculation
CLICKHOUSE_SQLITE_CIRCUIT_BREAKER_RESET_TIMEOUT_MS30000Wait time before attempting half-open trial

Tuning

Short Pages in GraphQL Pagination

If GraphQL queries return fewer results than expected per page, unmerged duplicates in ClickHouse may be consuming pagination slots. Increase the deduplication headroom:

CLICKHOUSE_GQL_DEDUPE_HEADROOM=8  # default is 4

This makes the inner query fetch more rows to compensate for duplicates that get collapsed.

Query Timeouts

For queries over very large datasets or complex tag filters, increase the timeout:

CLICKHOUSE_QUERY_TIMEOUT_SECONDS=10  # default is 3

Max Rows Safety Guardrail

If queries hit the max_rows_to_read limit (returns an error), you can raise it — but do so with caution as it removes the protection against runaway full-table scans:

CLICKHOUSE_GQL_MAX_ROWS_TO_READ=50000000  # default is 10M

Upgrading

Schema Evolution

Schema changes are applied automatically on each import cycle via idempotent ALTER TABLE statements. No manual migration steps are needed for routine upgrades.

Projection Rebuild (One-Time)

Gateways upgrading from pre-24.8 ClickHouse deployments may need to rebuild the owner projection:

ALTER TABLE transactions MATERIALIZE PROJECTION owner_projection;

This runs in the background and may take time on large tables.

Rollback

If you need to roll back a ClickHouse upgrade:

  1. Drop the transactions table
  2. Re-import from your Parquet files using ./scripts/clickhouse-import

Parquet files are the source of truth — ClickHouse is always rebuildable from them.

Troubleshooting

Import Failures

Failed imports leave files in data/etl/staging/ for inspection. Check the auto-import container logs:

docker compose --profile clickhouse logs clickhouse-auto-import -f

Fix the underlying issue (disk space, permissions, ClickHouse connectivity) and restart the container — the next cycle will retry.

Disk Usage Growing

If ClickHouse disk usage is higher than expected:

  • Check your TTL rules are configured and expiring data as intended
  • Verify background merges are running (rows are only deleted on merge):
    docker compose exec clickhouse clickhouse-client \
      --password your-password \
      -q 'SELECT partition, count() FROM system.parts WHERE table = '\''transactions'\'' GROUP BY partition'
  • Force a merge on a specific partition if needed (heavyweight operation):
    OPTIMIZE TABLE transactions PARTITION 14 FINAL

Checking Import Status

View the current max imported height:

docker compose exec clickhouse clickhouse-client \
  --password your-password \
  -q 'SELECT max(height) FROM transactions'

Compare this to the gateway's current height (/height endpoint) — the difference indicates how far behind ClickHouse is.

Docker Commands

# Start gateway with ClickHouse
docker compose --profile clickhouse up -d

# Stop gateway with ClickHouse
docker compose --profile clickhouse down

# View ClickHouse logs
docker compose --profile clickhouse logs clickhouse -f

# View auto-import logs
docker compose --profile clickhouse logs clickhouse-auto-import -f

# Restart auto-import after config change
docker compose --profile clickhouse restart clickhouse-auto-import

# Pull latest images
docker compose --profile clickhouse pull

How is this guide?