S3 Configuration

Configure Rockfish MCP to query Parquet files from S3-compatible storage.

AWS S3

Default Credentials

If the s3 section is omitted, DuckDB resolves AWS credentials from, in order:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. ~/.aws/credentials
  3. IAM role (EC2/ECS)

In that case the configuration needs only the sources section:

sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows

Explicit Credentials

s3:
  region: us-east-1
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Security Note: Prefer environment variables or IAM roles over config file credentials.
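
One safe pattern is to keep keys out of the config file entirely and rely on the environment, per the default-credentials setup above:

# No s3 section: DuckDB reads AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY from the environment
sources:
  flow:
    path: s3://my-bucket/flows/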

MinIO

Self-hosted S3-compatible storage.

s3:
  endpoint: localhost:9000
  access_key_id: minioadmin
  secret_access_key: minioadmin
  use_ssl: false
  url_style: path  # Required for MinIO

sources:
  flow:
    path: s3://my-bucket/flows/

DigitalOcean Spaces

s3:
  endpoint: nyc3.digitaloceanspaces.com
  region: nyc3
  access_key_id: your-spaces-key
  secret_access_key: your-spaces-secret

sources:
  flow:
    path: s3://my-space/flows/

Cloudflare R2

s3:
  endpoint: <account-id>.r2.cloudflarestorage.com
  access_key_id: your-r2-key
  secret_access_key: your-r2-secret

sources:
  flow:
    path: s3://my-bucket/flows/

Configuration Options

Option             Type    Default  Description
region             string  -        AWS region (e.g., us-east-1)
access_key_id      string  -        Access key ID
secret_access_key  string  -        Secret access key
endpoint           string  -        Custom endpoint URL
use_ssl            bool    true     Use HTTPS
url_style          string  vhost    path or vhost
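
Putting these together, a sketch for a generic S3-compatible service reached over HTTPS with path-style URLs (the endpoint and keys here are placeholders):

s3:
  region: us-east-1
  endpoint: storage.example.com
  access_key_id: your-access-key
  secret_access_key: your-secret-key
  use_ssl: true
  url_style: path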

Querying S3 Data

Direct Path

sources:
  flow:
    path: s3://bucket/prefix/
    description: All flow data

Hive Partitioned Data

Rockfish Probe can organize uploads with Hive-style partitioning:

s3://bucket/flows/year=2025/month=01/day=28/*.parquet

Query specific partitions:

sources:
  flow:
    path: s3://bucket/flows/year=2025/month=01/
    description: January 2025 flows

Or use SQL with DuckDB’s Hive partitioning support:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
      hive_partitioning=true
    )
    LIMIT 100
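
Because hive_partitioning=true exposes year, month, and day as columns, DuckDB can also prune partitions when a WHERE clause references them, even with a broad glob. A sketch using the same bucket layout:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=*/month=*/day=*/*.parquet',
      hive_partitioning=true
    )
    WHERE year = 2025 AND month = 1 AND day = 28
    LIMIT 100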

Performance Tips

Use Partition Pruning

Structure queries to match the partitioning scheme:

# Efficient - matches Hive partitions
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"

Limit Column Selection

Select only the columns you need:

query:
  source: flow
  columns: [saddr, daddr, sbytes]  # Much faster than SELECT *

Aggregate Server-Side

Push aggregation down to DuckDB rather than fetching raw rows:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: count
      alias: flows
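
Conceptually, this pushes down to DuckDB as SQL along these lines, so only the dport column is read from S3 and only the grouped counts are returned:

SELECT dport, count(*) AS flows
FROM read_parquet('s3://bucket/flows/*.parquet')
GROUP BY dport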

Troubleshooting

“Access Denied”

  • Verify credentials are correct
  • Check that the bucket policy allows s3:GetObject and s3:ListBucket (see the sketch below)
  • For cross-account access, verify IAM trust policies
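
If you control the IAM side, a minimal read-only policy looks roughly like this (bucket name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}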

“Bucket not found”

  • Check region matches bucket region
  • For custom endpoints, verify url_style setting

“Connection refused”

  • Verify endpoint URL is correct
  • Check use_ssl matches endpoint (http vs https)
  • For MinIO, ensure url_style: path

Slow Queries

  • Add partition filters to queries
  • Select only needed columns
  • Check network bandwidth to S3

Example: Multi-Source Configuration

s3:
  region: us-east-1

sources:
  # Production flows (licensed, validated)
  prod_flows:
    path: s3://prod-bucket/flows/
    description: Production network flows
    require_license: true

  # Development data (no validation)
  dev_flows:
    path: s3://dev-bucket/flows/
    description: Development test data

  # Threat intel from intel server
  threat_intel:
    path: s3://prod-bucket/intel/
    description: IP reputation data

output:
  default_format: json
  max_rows: 10000
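
With this in place, a query against one of the sources reuses the query fields shown earlier, for example (combining the filter and columns fields from the performance tips):

query:
  source: prod_flows
  columns: [saddr, daddr, sbytes]
  filter: "year = 2025 AND month = 1"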