S3 Configuration
Configure Rockfish MCP to query Parquet files from S3-compatible storage.
AWS S3
Default Credentials
If the s3 section is omitted, DuckDB uses AWS credentials from:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- ~/.aws/credentials
- IAM role (EC2/ECS)
sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows
Explicit Credentials
s3:
  region: us-east-1
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Security Note: Prefer environment variables or IAM roles over config file credentials.
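For example, a minimal sketch of that recommended setup keeps credentials out of the file entirely and sets only the region, letting DuckDB pick up keys from the environment or an instance role (bucket name is a placeholder):

# Only the region is set here; access keys come from
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or the IAM role
s3:
  region: us-east-1

sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows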
MinIO
Self-hosted S3-compatible storage.
s3:
  endpoint: localhost:9000
  access_key_id: minioadmin
  secret_access_key: minioadmin
  use_ssl: false
  url_style: path  # Required for MinIO
sources:
  flow:
    path: s3://my-bucket/flows/
DigitalOcean Spaces
s3:
  endpoint: nyc3.digitaloceanspaces.com
  region: nyc3
  access_key_id: your-spaces-key
  secret_access_key: your-spaces-secret
sources:
  flow:
    path: s3://my-space/flows/
Cloudflare R2
s3:
  endpoint: <account-id>.r2.cloudflarestorage.com
  access_key_id: your-r2-key
  secret_access_key: your-r2-secret
sources:
  flow:
    path: s3://my-bucket/flows/
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
| region | string | - | AWS region (e.g., us-east-1) |
| access_key_id | string | - | Access key ID |
| secret_access_key | string | - | Secret access key |
| endpoint | string | - | Custom endpoint URL |
| use_ssl | bool | true | Use HTTPS |
| url_style | string | vhost | path or vhost |
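As an illustrative sketch only (the endpoint and credentials are placeholders), a config that sets every option for a generic S3-compatible service might look like:

s3:
  region: us-east-1
  access_key_id: your-key
  secret_access_key: your-secret
  endpoint: s3.example.com  # custom S3-compatible endpoint (placeholder)
  use_ssl: true             # HTTPS; set false only for local or test endpoints
  url_style: path           # many S3-compatible services require path-style URLs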
Querying S3 Data
Direct Path
sources:
  flow:
    path: s3://bucket/prefix/
    description: All flow data
Hive Partitioned Data
Rockfish Probe can organize uploads with Hive-style partitioning:
s3://bucket/flows/year=2025/month=01/day=28/*.parquet
Query specific partitions:
sources:
  flow:
    path: s3://bucket/flows/year=2025/month=01/
    description: January 2025 flows
Or use SQL with DuckDB’s Hive partitioning support:
query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
      hive_partitioning=true
    )
    LIMIT 100
Performance Tips
Use Partition Pruning
Structure queries to match the partitioning scheme:
# Efficient - matches Hive partitions
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"
Limit Column Selection
Only select needed columns:
query:
  source: flow
  columns: [saddr, daddr, sbytes]  # Much faster than SELECT *
Use Server-Side Aggregation
Push aggregation to DuckDB:
aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: count
      alias: flows
Troubleshooting
“Access Denied”
- Verify credentials are correct
- Check the bucket policy allows s3:GetObject and s3:ListBucket
- For cross-account access, verify IAM trust policies
“Bucket not found”
- Check region matches bucket region
- For custom endpoints, verify the url_style setting
“Connection refused”
- Verify endpoint URL is correct
- Check use_ssl matches the endpoint (http vs https)
- For MinIO, ensure url_style: path
Slow Queries
- Add partition filters to queries
- Select only needed columns
- Check network bandwidth to S3
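Combining the first two tips, a minimal sketch of a partition-pruned, column-limited query (reusing the source, filter, and columns keys from the earlier examples) might look like:

# Prunes to one day's partitions and reads only three columns
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"
  columns: [saddr, daddr, sbytes]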
Example: Multi-Source Configuration
s3:
  region: us-east-1

sources:
  # Production flows (licensed, validated)
  prod_flows:
    path: s3://prod-bucket/flows/
    description: Production network flows
    require_license: true

  # Development data (no validation)
  dev_flows:
    path: s3://dev-bucket/flows/
    description: Development test data

  # Threat intel from intel server
  threat_intel:
    path: s3://prod-bucket/intel/
    description: IP reputation data

output:
  default_format: json
  max_rows: 10000
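With this configuration in place, queries target a source by name using the same pattern as the earlier examples; the column list and filter value below are illustrative:

# Query the production source defined above
query:
  source: prod_flows
  columns: [saddr, daddr, dport]
  filter: "dport = 443"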