Optimizing Data Handling in AR.IO Gateway
The AR.IO Gateway provides powerful tools for optimizing how you access and serve specific types of data. By configuring filters and worker settings, you can focus your gateway on efficiently handling the data that matters most to your use case, ensuring quick and reliable access to relevant information.
Understanding the Filtering System
The AR.IO Gateway uses two filters to control how ANS104 data items are processed:
- ANS104_UNBUNDLE_FILTER: Controls which bundles are processed and unbundled
- ANS104_INDEX_FILTER: Controls which data items from unbundled bundles are stored in the database for querying
These filters are configured through environment variables:
ANS104_UNBUNDLE_FILTER=""
ANS104_INDEX_FILTER=""
By default, the gateway processes no bundles and indexes no data items. This allows you to selectively enable processing for the specific data types you need.
For a detailed explanation of how to construct these filters, see our Filtering System documentation.
Key Environment Variables
Several environment variables control how your gateway processes data:
Core Configuration
# Enable backfilling of missing bundle records
BACKFILL_BUNDLE_RECORDS=true
# Alternative: Manually queue bundles for backfilling via the admin API
# This can be more efficient than automatic backfilling when you know specific bundle IDs
# Requires admin API key to be set. See <gateway>/api-docs for more information
# Reprocess all indexed bundles with new filters
FILTER_CHANGE_REPROCESS=true
# Number of ANS-104 bundles to download in parallel
ANS104_DOWNLOAD_WORKERS=5
# Number of workers for unbundling (0 = disabled, 1 = enabled if filters set)
ANS104_UNBUNDLE_WORKERS=1
Data Management
# Number of new data items before flushing to stable storage
DATA_ITEM_FLUSH_COUNT_THRESHOLD=1000
# Maximum time between flushes (in seconds)
MAX_FLUSH_INTERVAL_SECONDS=600
# Maximum number of data items to queue for indexing
MAX_DATA_ITEM_QUEUE_SIZE=100000
# Enable background verification of unverified data
ENABLE_BACKGROUND_DATA_VERIFICATION=true
Data Item Flushing
The gateway uses a two-stage storage system for indexed data items:
- Temporary Storage: Newly indexed data items are first stored in a temporary table
- Stable Storage: Data items are periodically "flushed" from temporary to stable storage
This process is controlled by two environment variables:
DATA_ITEM_FLUSH_COUNT_THRESHOLD
: Number of items to queue before flushing (default: 1000)MAX_FLUSH_INTERVAL_SECONDS
: Maximum time between flushes (default: 600 seconds)
The gateway will flush data items when either:
- The number of items in temporary storage reaches the threshold
- The time since the last flush exceeds the interval
This batching approach helps optimize database performance by reducing the number of write operations.
GraphQL Configuration
The GRAPHQL_HOST
setting determines how your gateway handles GraphQL queries. You have two options:
GraphQL Host Options
# Set to arweave.net to proxy queries to a gateway with full index
GRAPHQL_HOST=arweave.net
# Or unset to enable local-only GraphQL queries
# GRAPHQL_HOST=
# Port for GraphQL queries (default: 443)
GRAPHQL_PORT=443
Using arweave.net (Recommended for new gateways)
- Proxies queries to a gateway with a complete index of the blockweave
- Provides immediate access to all historical data
- No need to wait for local indexing
- May introduce additional latency from proxying
Local-only Queries (Unset GRAPHQL_HOST)
- Responds to queries using only locally indexed data
- Faster response times for indexed data
- Requires complete local indexing (can take weeks for L1 transactions)
- No proxying overhead
- Only returns data that matches your indexing filters
Note: Even with GRAPHQL_HOST
set to arweave.net
, your gateway will still maintain its own index based on your filters. This allows for quick access to frequently requested data while ensuring availability of all historical data.
Common Use Cases
Optimizing for Specific Data Types
By configuring your filters and workers appropriately, you can optimize your gateway for different types of data:
- High-Volume Data: Configure workers to handle large amounts of data efficiently
- Specific Applications: Filter for particular app names or content types
Filter Examples
The AR.IO Gateway uses two distinct filters to control how ANS104 bundle data is processed:
- ANS104_UNBUNDLE_FILTER: Determines which bundles (including nested bundles) are unbundled
- ANS104_INDEX_FILTER: Determines which data items within a bundle have their data indexed
Here are some practical examples of how to configure these filters for specific use cases:
Specific Application Data
This configuration demonstrates how to focus your gateway on data from a specific application. In this example, we show how to process and index all ArDrive-related transactions, but you can adapt this pattern for any application, using the App-Name
tag. This approach is perfect for:
- Building application-specific services
- Creating application data archives
- Running application-focused analytics
- Supporting application infrastructure
- Reducing processing overhead by focusing only on relevant data
In this example, the index filter uses the ArFS
tag to only index ArFS-compliant data, which is a specific aspect of ArDrive applications. Index filters can be adjusted for any application's needs - the App-Name
tag is particularly useful here, as data items within a bundle can have a different App-Name
than the bundle that contains them.
All ArDrive Data
{
"or": [
{ "tags": [{ "name": "App-Name", "value": "ArDrive-App" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Web" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-CLI" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Desktop" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Mobile" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Core" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Sync" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive Turbo" }]}
]
}
// Alternative using valueStartsWith for a more concise filter
{
"tags": [{ "name": "App-Name", "valueStartsWith": "ArDrive" }]
}
Personal Data Gateway
This configuration is designed for users who want to run a personal gateway that only processes their own ArDrive data. It:
- Excludes common specific use case bundlers to reduce unnecessary processing
- Only indexes data owned by your wallet address
- Includes all ArDrive and Turbo app data
- Perfect for personal data management
Personal Data Gateway
All ArDrive Bundles (Excluding Common Bundlers)
This configuration is useful for gateways that want to process ArDrive data while avoiding common bundlers. It's ideal for:
- Reducing processing overhead by excluding known bundlers
- Maintaining a clean dataset focused on direct ArDrive transactions
- Optimizing storage and processing resources
- Supporting ArDrive infrastructure with reduced resource requirements
All ArDrive Bundles (Excluding Common Bundlers)
{
"and": [ // both of the wrapped filters must be true to process a bundle
{
"not": { // exclude common specific use case bundlers
"or": [
{ "tags": [{ "name": "Bundler-App-Name", "value": "Warp" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "Redstone" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "Kyve" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "AO" }]},
{ "attributes": { "owner_address": "-OXcT1sVRSA5eGwt2k6Yuz8-3e3g9WJi5uSE99CWqsBs" }},
{ "attributes": { "owner_address": "ZE0N-8P9gXkhtK-07PQu9d8me5tGDxa_i4Mee5RzVYg" }},
{ "attributes": { "owner_address": "6DTqSgzXVErOuLhaP0fmAjqF4yzXkvth58asTxP3pNw" }}
]
}
},
{
"or": [ // include ArDrive and Turbo apps
{ "tags": [{ "name": "App-Name", "value": "ArDrive-App" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Web" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-CLI" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Desktop" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Mobile" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Core" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive-Sync" }]},
{ "tags": [{ "name": "App-Name", "value": "ArDrive Turbo" }]}
]
}
]
}
// Alternative using valueStartsWith for a more concise filter while maintaining bundler exclusions
{
"and": [ // both of the wrapped filters must be true to process a bundle
{
"not": { // exclude common specific use case bundlers
"or": [
{ "tags": [{ "name": "Bundler-App-Name", "value": "Warp" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "Redstone" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "Kyve" }]},
{ "tags": [{ "name": "Bundler-App-Name", "value": "AO" }]},
{ "attributes": { "owner_address": "-OXcT1sVRSA5eGwt2k6Yuz8-3e3g9WJi5uSE99CWqsBs" }},
{ "attributes": { "owner_address": "ZE0N-8P9gXkhtK-07PQu9d8me5tGDxa_i4Mee5RzVYg" }},
{ "attributes": { "owner_address": "6DTqSgzXVErOuLhaP0fmAjqF4yzXkvth58asTxP3pNw" }}
]
}
},
{
"tags": [{ "name": "App-Name", "valueStartsWith": "ArDrive" }]
}
]
}
Important Filter Considerations
When configuring your filters, keep these points in mind:
- The unbundle filter determines which bundles are processed and unbundled
- The index filter determines which data items from unbundled bundles are indexed in the database
- When filtering by owner addresses, use the modulus of the Arweave public address in the unbundle filter
Common Bundler Exclusions
When configuring filters, you may want to exclude data from common bundlers:
Bundler Addresses:
- Irys Node 1:
-OXcT1sVRSA5eGwt2k6Yuz8-3e3g9WJi5uSE99CWqsBs
- Irys Node 2:
ZE0N-8P9gXkhtK-07PQu9d8me5tGDxa_i4Mee5RzVYg
- Irys Node 3:
6DTqSgzXVErOuLhaP0fmAjqF4yzXkvth58asTxP3pNw
Bundler App Names:
- Warp
- Redstone
- Kyve
- AO
- ArDrive
Best Practices
- Start Small: Begin with conservative worker counts and adjust based on system performance
- Monitor Resources: Watch system memory and CPU usage when adjusting worker counts
- Reprocess Bundles with New Filters: Use
FILTER_CHANGE_REPROCESS
to reprocess bundles after changing filters - Regular Maintenance: Enable background verification and cleanup features
Performance Considerations
When optimizing your gateway, consider these factors:
- System Resources: Worker counts should be balanced against available CPU cores and memory
- Storage Space: Indexing filters affect database size and query performance
- Network Bandwidth: Unbundling workers can generate significant network traffic
- Query Performance: More indexed data means larger databases but better query capabilities
Next Steps
- Review the Filtering System documentation for detailed filter syntax
- Check your system resources to determine optimal worker counts
- Start with basic filters and gradually refine based on your needs
- Monitor system performance and adjust settings as needed
Optimization Strategy
Focus on configuring your gateway to efficiently handle the specific data types you need. The default state processes no data, so you can selectively enable processing for your use case without worrying about unnecessary resource usage.