S3 Configuration

Configure Rockfish MCP to query Parquet files from S3-compatible storage.

AWS S3

Default Credentials

If the s3 section is omitted, DuckDB resolves AWS credentials from, in order:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. ~/.aws/credentials
  3. IAM role (EC2/ECS)

In that case the configuration needs only the sources section:

sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows

Explicit Credentials

s3:
  region: us-east-1
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Security Note: Prefer environment variables or IAM roles over config file credentials.
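
One safe pattern is to keep keys out of the config file entirely and rely on the environment, per the default-credentials setup above:

# No s3 section: DuckDB reads AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY from the environment
sources:
  flow:
    path: s3://my-bucket/flows/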

MinIO

Self-hosted S3-compatible storage.

s3:
  endpoint: localhost:9000
  access_key_id: minioadmin
  secret_access_key: minioadmin
  use_ssl: false
  url_style: path  # Required for MinIO

sources:
  flow:
    path: s3://my-bucket/flows/

DigitalOcean Spaces

s3:
  endpoint: nyc3.digitaloceanspaces.com
  region: nyc3
  access_key_id: your-spaces-key
  secret_access_key: your-spaces-secret

sources:
  flow:
    path: s3://my-space/flows/

Cloudflare R2

s3:
  endpoint: <account-id>.r2.cloudflarestorage.com
  access_key_id: your-r2-key
  secret_access_key: your-r2-secret

sources:
  flow:
    path: s3://my-bucket/flows/

Configuration Options

Option             Type    Default  Description
region             string  -        AWS region (e.g., us-east-1)
access_key_id      string  -        Access key ID
secret_access_key  string  -        Secret access key
endpoint           string  -        Custom endpoint URL
use_ssl            bool    true     Use HTTPS
url_style          string  vhost    path or vhost
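
Putting these together, a sketch for a generic S3-compatible service reached over HTTPS with path-style URLs (the endpoint and keys here are placeholders):

s3:
  region: us-east-1
  endpoint: storage.example.com
  access_key_id: your-access-key
  secret_access_key: your-secret-key
  use_ssl: true
  url_style: path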

Querying S3 Data

Direct Path

sources:
  flow:
    path: s3://bucket/prefix/
    description: All flow data

Hive Partitioned Data

Rockfish Probe can organize uploads with Hive-style partitioning:

s3://bucket/flows/year=2025/month=01/day=28/*.parquet

Query specific partitions:

sources:
  flow:
    path: s3://bucket/flows/year=2025/month=01/
    description: January 2025 flows

Or use SQL with DuckDB’s Hive partitioning support:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
      hive_partitioning=true
    )
    LIMIT 100
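
Because hive_partitioning=true exposes year, month, and day as columns, DuckDB can also prune partitions when a WHERE clause references them, even with a broad glob. A sketch using the same bucket layout:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=*/month=*/day=*/*.parquet',
      hive_partitioning=true
    )
    WHERE year = 2025 AND month = 1 AND day = 28
    LIMIT 100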

Performance Tips

Use Partition Pruning

Structure queries to match the partitioning scheme:

# Efficient - matches Hive partitions
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"

Limit Column Selection

Select only the columns you need:

query:
  source: flow
  columns: [saddr, daddr, sbytes]  # Much faster than SELECT *

Aggregate Server-Side

Push aggregation down to DuckDB rather than fetching raw rows:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: count
      alias: flows
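
Conceptually, this pushes down to DuckDB as SQL along these lines, so only the dport column is read from S3 and only the grouped counts are returned:

SELECT dport, count(*) AS flows
FROM read_parquet('s3://bucket/flows/*.parquet')
GROUP BY dport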

Troubleshooting

“Access Denied”

  • Verify credentials are correct
  • Check that the bucket policy allows s3:GetObject and s3:ListBucket (see the sketch below)
  • For cross-account access, verify IAM trust policies
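
If you control the IAM side, a minimal read-only policy looks roughly like this (bucket name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}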

“Bucket not found”

  • Check region matches bucket region
  • For custom endpoints, verify url_style setting

“Connection refused”

  • Verify endpoint URL is correct
  • Check use_ssl matches endpoint (http vs https)
  • For MinIO, ensure url_style: path

Slow Queries

  • Add partition filters to queries
  • Select only needed columns
  • Check network bandwidth to S3

Example: Multi-Source Configuration

s3:
  region: us-east-1

sources:
  # Production flows (licensed, validated)
  prod_flows:
    path: s3://prod-bucket/flows/
    description: Production network flows
    require_license: true

  # Development data (no validation)
  dev_flows:
    path: s3://dev-bucket/flows/
    description: Development test data

  # Threat intel from intel server
  threat_intel:
    path: s3://prod-bucket/intel/
    description: IP reputation data

output:
  default_format: json
  max_rows: 10000
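
With this in place, a query against one of the sources reuses the query fields shown earlier, for example (combining the filter and columns fields from the performance tips):

query:
  source: prod_flows
  columns: [saddr, daddr, sbytes]
  filter: "year = 2025 AND month = 1"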