Introduction
Network Flow Telemetry. Simple. Affordable. AI-Ready.
Rockfish Toolkit captures network flows and writes them directly to your S3 in Apache Parquet format. That’s it. No intermediate databases, no proprietary formats, no vendor lock-in.
Your data. Your privacy. Your control.
Your data is immediately ready for analysis by DuckDB, Spark, Pandas, Python, R, or any tool that reads Parquet - which is virtually every modern data platform.
| Simple | One binary. Capture traffic. Write to S3. Done. |
| Affordable | Enterprise-grade network visibility for less than the price of a grande latte per day. |
| AI-Ready | Structured, queryable data that ML pipelines and AI assistants can consume immediately. |
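As a quick taste of the "AI-Ready" claim, a few lines of Python and DuckDB are enough to start exploring the output. This is a minimal sketch: the ./flows/ directory is an example path, and any Parquet-capable tool works the same way.

import duckdb

con = duckdb.connect()
# Peek at the first few flow records (returns a Pandas DataFrame)
print(con.sql("SELECT * FROM './flows/*.parquet' LIMIT 5").df())
# Count all flows across every file in the directory
print(con.sql("SELECT COUNT(*) AS flows FROM './flows/*.parquet'").fetchone())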
A Bolt-On Toolkit for SOC AI Readiness
The question “Is your SOC AI-ready?” has become central to modern security operations. Industry consensus is clear: AI readiness starts with SOC Data Foundations - structured, queryable security data that AI systems can actually consume.
The challenge? Traditional security tools generate logs in proprietary formats, scattered across siloed systems. Ripping and replacing your entire security stack isn’t practical.
Rockfish Toolkit is different. Deploy alongside your existing infrastructure to create an AI-ready data layer:
- No replacement required - Add Rockfish to your network without changing existing tools
- Deploy in minutes - Single binary or Docker container, no complex dependencies
- Immediate AI compatibility - Output flows directly to any ML pipeline, SIEM, or AI assistant
- Open data format - Apache Parquet works with DuckDB, Spark, Pandas, and every major analytics platform
- S3-native - Scalable, cost-effective cloud storage
Why Parquet for Network Data?
Rockfish Toolkit captures network flows and exports them as Apache Parquet files - the same columnar format used by data science platforms, ML pipelines, and modern SIEM architectures:
| Benefit | Description |
|---|---|
| Columnar storage | Fast analytical queries on specific fields |
| Schema enforcement | Consistent, typed data for ML models |
| 70-90% compression | Reduced storage costs vs. raw logs |
| Universal compatibility | Works with DuckDB, Spark, Pandas, and AI frameworks |
| S3-native | Scalable, cost-effective cloud storage |
This architecture enables security teams to add AI capabilities without rebuilding their entire SOC.
Why S3 Changes Everything
S3—and object storage generally—fundamentally changes what’s possible in cybersecurity by decoupling data collection from data analysis.
Traditional architectures force a painful tradeoff: either store everything and pay for expensive hot storage, or age out logs and lose forensic depth. S3 eliminates this with virtually unlimited, cheap, durable storage that can hold years of netflow, DNS logs, endpoint telemetry, and packet captures in columnar formats like Parquet.
This unlocks data science at scale:
- Train anomaly detection models on months of baseline behavior
- Run retrospective threat hunts when new IOCs emerge
- Feed AI-driven SOC tools with the volume of data they need to learn patterns rather than just match signatures
You own your data:
The hive-partitioned, schema-on-read model means you’re not locked into a SIEM vendor’s data model. Your data lives in open formats, queryable by any tool—Athena, Spark, DuckDB, Pandas, or a custom Rust binary polling for new files.
When storage is cheap and permanent, detection becomes a software problem rather than a retention policy negotiation—and that shifts the advantage back to defenders.
What Rockfish Provides
| Capability | Description |
|---|---|
| Network Flow Capture | High-performance packet capture with flow generation |
| Protocol Detection | Application-level protocol identification via nDPI |
| Device Fingerprinting | TLS/TCP fingerprints via nDPI for device identification |
| Threat Intelligence | IP reputation and risk scoring |
| Anomaly Detection | ML-based detection for enterprise deployments |
| MCP Integration | Query flows directly from AI assistants via Model Context Protocol |
Use Cases
Rockfish Toolkit provides network visibility and AI-ready telemetry across diverse environments:
| Environment | Use Case |
|---|---|
| Security Operations (SOC) | Threat detection, incident response, network forensics, AI-assisted investigation |
| IoT Networks | Device inventory, behavioral baselining, anomaly detection for connected devices |
| Industrial / Manufacturing | OT network monitoring, detecting unauthorized communications, compliance auditing |
| Robotics & Automation | Fleet communication analysis, identifying misconfigurations, performance monitoring |
| Healthcare | Medical device tracking, HIPAA compliance, detecting data exfiltration |
| SMB / Branch Offices | Affordable network visibility without enterprise SIEM costs |
| MSPs / MSSPs | Multi-tenant flow collection, centralized threat analysis across customers |
| Research & Education | Network traffic analysis, security research, ML model development |
Components
| Component | Description |
|---|---|
| rockfish_probe | Flow meter - captures packets and generates flow records |
| rockfish_mcp | MCP query server - SQL queries on Parquet files via DuckDB (Coming March 2025) |
| rockfish_detect | ML training and anomaly detection (Enterprise) |
| rockfish_intel | Threat intelligence caching server |
Data Pipeline
Network Traffic
|
v
rockfish_probe --> Parquet Files --> S3
|
v
rockfish_mcp (DuckDB queries)
|
v
AI Assistants / SIEM / Analytics
Parquet Schema by Tier
Rockfish outputs flow data in Apache Parquet format. The schema varies by license tier:
| Tier | Fields | Key Data |
|---|---|---|
| Community | 44 | 5-tuple, timing, traffic volumes, TCP flags, payload entropy |
| Basic | 54 | + nDPI application detection, GeoIP (country, city, ASN) |
| Professional | 60 | + GeoIP AS org, nDPI fingerprints |
| Enterprise | 63+ | + Anomaly scores, severity classification |
Key Fields
All tiers include:
- saddr, daddr - Source/destination IP addresses
- sport, dport - Source/destination ports
- proto - Protocol (TCP, UDP, ICMP)
- spkts, dpkts, sbytes, dbytes - Traffic volumes
- dur, rtt - Duration and round-trip time
- sentropy, dentropy - Payload entropy (encrypted traffic detection)
Basic+ adds:
- scountry, dcountry - Geographic country codes
- scity, dcity - Geographic city names
- sasn, dasn - Autonomous System Numbers
- ndpi_appid - Application identifier (e.g., "TLS.YouTube")
- ndpi_risk_score - Risk scoring
Professional+ adds:
- sasnorg, dasnorg - AS organization names
- ndpi_ja4, ndpi_ja3s - TLS fingerprints for device identification
- ndpi_tcp_fp - TCP fingerprint with OS detection hint
- ndpi_fp - nDPI composite fingerprint
Enterprise adds:
- anomaly_score - ML-derived anomaly score (0.0-1.0)
- anomaly_severity - Classification (LOW, MEDIUM, HIGH, CRITICAL)
See Parquet Schema for complete field reference.
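The fields above are enough for common hunting queries. The sketch below uses DuckDB over the Community-tier columns (saddr, daddr, dport, sbytes, sentropy); the ./flows/ path and both thresholds are illustrative, not Rockfish defaults, and the entropy scale in your data may differ.

import duckdb

# Large outbound transfers with high source-payload entropy
high_entropy_uploads = duckdb.sql("""
    SELECT saddr, daddr, dport, sbytes, sentropy
    FROM './flows/*.parquet'
    WHERE sentropy > 0.9               -- illustrative: likely encrypted/compressed payload
      AND sbytes > 10 * 1024 * 1024    -- illustrative: more than ~10 MB sent
    ORDER BY sbytes DESC
    LIMIT 20
""").df()
print(high_entropy_uploads)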
License Tiers
| Tier | Features |
|---|---|
| Community | Basic schema (44 fields), S3 upload |
| Basic | + nDPI labels, GeoIP (country, city, ASN), custom observation name (54 fields) |
| Professional | + GeoIP AS org, nDPI fingerprints (60 fields) |
| Enterprise | + ML models, anomaly detection |
See License Tiers for detailed comparison.
Getting Started
- Installation - Install from download portal
- Quick Start - Capture your first flows
- Licensing - Activate your license
Support
- Email: [email protected]
- Download Portal: download.rockfishnetworks.com
Installation
Quick Install
curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash
The installer auto-detects your platform and installs via the appropriate method (Debian package, Docker, or binary).
Options:
# Install specific version
ROCKFISH_VERSION=1.0.0 curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash
# Force Docker installation
ROCKFISH_METHOD=docker curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash
Manual Installation
Rockfish Toolkit is also available as a Debian package and Docker image from the Rockfish Networks download portal.
System Requirements
- Operating System: Debian 11+, Ubuntu 20.04+, or Docker-compatible host
- Architecture: x86_64 (amd64)
- Memory: 2GB minimum (4GB+ recommended for high-traffic networks)
- Storage: Depends on retention policy (10GB minimum)
- Network: Interface with capture capabilities
Debian Package Installation
Download the toolkit package from the Rockfish download portal:
# Download the package
wget https://download.rockfishnetworks.com/rockfish_toolkit.deb
# Install
sudo dpkg -i rockfish_toolkit.deb
# Install dependencies if needed
sudo apt-get install -f
The rockfish_toolkit.deb package includes all Rockfish Toolkit binaries:
| Binary | Description |
|---|---|
rockfish_probe | Network flow meter |
rockfish_mcp | MCP query server |
rockfish_detect | ML anomaly detection (Enterprise) |
rockfish_intel | Threat intelligence server |
Installed Files
After installation:
| Path | Description |
|---|---|
/usr/bin/rockfish_* | Rockfish binaries |
/etc/rockfish/ | Configuration directory |
/var/lib/rockfish/ | Data directory |
/var/log/rockfish/ | Log directory |
Docker Installation
Pull the Rockfish Toolkit image from Docker Hub:
docker pull rockfishnetworks/toolkit:latest
The toolkit image includes all Rockfish Toolkit binaries. Specify the command to run the desired component.
Running the Probe
docker run -d \
--name rockfish-probe \
--network host \
--cap-add NET_ADMIN \
--cap-add NET_RAW \
-v /etc/rockfish:/etc/rockfish:ro \
-v /var/lib/rockfish:/var/lib/rockfish \
rockfishnetworks/toolkit:latest \
rockfish_probe -c /etc/rockfish/probe.yaml
Running the MCP Server
docker run -d \
--name rockfish-mcp \
-p 8080:8080 \
-v /etc/rockfish:/etc/rockfish:ro \
-v /var/lib/rockfish:/var/lib/rockfish:ro \
rockfishnetworks/toolkit:latest \
rockfish_mcp -c /etc/rockfish/mcp.yaml
Docker Compose
Example docker-compose.yml:
version: '3.8'
services:
probe:
image: rockfishnetworks/toolkit:latest
network_mode: host
cap_add:
- NET_ADMIN
- NET_RAW
volumes:
- ./config:/etc/rockfish:ro
- ./data:/var/lib/rockfish
command: ["rockfish_probe", "-c", "/etc/rockfish/probe.yaml"]
restart: unless-stopped
mcp:
image: rockfishnetworks/toolkit:latest
ports:
- "8080:8080"
volumes:
- ./config:/etc/rockfish:ro
- ./data:/var/lib/rockfish:ro
command: ["rockfish_mcp", "-c", "/etc/rockfish/mcp.yaml"]
restart: unless-stopped
Verifying Installation
Check that the installation was successful:
# Check probe version
rockfish_probe --version
# Check MCP version
rockfish_mcp --version
Next Steps
- Quick Start - Run your first capture
- Licensing - Activate your license
- Configuration - Configure the probe
Quick Start
This guide walks you through capturing network flows and querying them.
1. Capture Flows
From a PCAP File
# Basic capture to Parquet
rockfish_probe -i capture.pcap --parquet-dir ./flows
# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows
Live Capture
# Standard libpcap capture (requires root)
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows
# High-performance AF_PACKET capture (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows
With a Configuration File
# Create config.yaml (see Configuration docs)
rockfish_probe -c config.yaml
2. Verify Output
# Check generated files
ls -la flows/
# View file info with DuckDB
duckdb -c "DESCRIBE SELECT * FROM 'flows/*.parquet'"
3. Query with MCP
Set up the MCP server to query your flows:
# mcp-config.yaml
sources:
flow:
path: ./flows/
description: Network flow data
output:
default_format: table
max_rows: 100
# Start MCP server
ROCKFISH_CONFIG=mcp-config.yaml rockfish_mcp
Example Queries
Using the MCP tools:
# Count total flows
count:
source: flow
# Top talkers by bytes
query:
source: flow
sql: |
SELECT saddr, SUM(sbytes + dbytes) as total_bytes
FROM {source}
GROUP BY saddr
ORDER BY total_bytes DESC
LIMIT 10
# Filter by protocol
query:
source: flow
filter: "proto = 'TCP'"
limit: 50
4. Upload to S3 (Optional)
Configure S3 upload in your probe config:
output:
parquet_dir: /var/lib/rockfish/flows
s3:
bucket: my-flow-data
region: us-east-1
hive_partitioning: true
delete_after_upload: true
Files are automatically uploaded and organized by date:
s3://my-flow-data/year=2025/month=01/day=28/rockfish-*.parquet
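Once flows are in S3, the Hive-partitioned layout can be queried in place. A minimal sketch with DuckDB's httpfs extension is shown below; the bucket name and date come from the example path above, and S3 credential setup is omitted (see S3 Configuration).

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Credentials/region configuration omitted here; see the S3 Configuration page.
daily = con.sql("""
    SELECT COUNT(*) AS flows, SUM(sbytes + dbytes) AS bytes
    FROM read_parquet('s3://my-flow-data/year=2025/month=01/day=28/*.parquet',
                      hive_partitioning = true)
""").df()
print(daily)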
Next Steps
- Configuration - Full configuration reference
- Capture Modes - High-performance capture options
- MCP Setup - Query server configuration
Licensing
Rockfish uses Ed25519-signed licenses with tier-based feature restrictions.
License Tiers
| Tier | Features |
|---|---|
| Community | Basic schema (48 fields), local storage only |
| Basic | + nDPI labels, custom observation name |
| Professional | + GeoIP, nDPI fingerprints (60 fields) |
| Enterprise | + ML models, anomaly detection |
License File
Licenses are JSON files with an Ed25519 signature:
{
"id": "lic_abc123",
"tier": "professional",
"customer_email": "[email protected]",
"company": "Example Corp",
"observation": "sensor-01",
"issued_at": "2025-01-01T00:00:00Z",
"expires_at": "2026-01-01T00:00:00Z",
"signature": "base64-encoded-signature"
}
Configuration
Specify the license file in your config:
license:
path: /opt/rockfish/etc/license.json
Or via environment variable:
export ROCKFISH_LICENSE_PATH=/opt/rockfish/etc/license.json
rockfish_probe -c config.yaml
Feature Matrix
| Feature | Community | Basic | Professional | Enterprise |
|---|---|---|---|---|
| Schema v1 (Simple) | Yes | Yes | Yes | Yes |
| Schema v2 (Extended) | No | No | Yes | Yes |
| GeoIP Fields | No | No | Yes | Yes |
| nDPI Fingerprints | No | No | Yes | Yes |
| nDPI Labeling | No | Yes | Yes | Yes |
| Custom Observation Domain | No | Yes | Yes | Yes |
| Anomaly Detection | No | No | No | Yes |
Parquet Metadata
Licensed files include metadata for validation:
| Key | Description |
|---|---|
rockfish.license_id | License identifier |
rockfish.tier | License tier |
rockfish.company | Company name |
rockfish.customer_email | Customer email |
rockfish.issued_at | License issue date |
rockfish.observation | Observation domain name |
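The metadata keys above travel in the Parquet file itself, so they can be inspected with any Parquet library. A sketch using pyarrow is shown below; the file name is illustrative, and which keys appear depends on your tier.

import pyarrow.parquet as pq

# Read only the file footer and print any rockfish.* key-value metadata
meta = pq.read_metadata("rockfish-flow-0001.parquet").metadata or {}
for key, value in meta.items():
    name = key.decode()
    if name.startswith("rockfish."):
        print(name, "=", value.decode())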
MCP License Validation
Rockfish MCP can validate that Parquet files were generated by a licensed probe:
sources:
licensed_flows:
path: s3://data/flows/
description: Licensed network flow data
require_license: true
enterprise_flows:
path: s3://data/enterprise/
description: Enterprise flow data
require_license: true
allowed_license_ids:
- "lic_abc123"
- "lic_def456"
Obtaining a License
Contact [email protected] for license inquiries.
Probe Overview
Rockfish Probe is a high-performance flow meter that captures network traffic and generates flow records in Apache Parquet format.
Features
- Packet capture via libpcap - Live interface capture or PCAP file reading
- High-performance AF_PACKET - Linux TPACKET_V3 with mmap ring buffer
- Fragment reassembly - Reassembles fragmented IP packets
- Bidirectional flows - Forward and reverse direction tracking
- nDPI integration - Application protocol detection
- GeoIP lookups - Geographic location via MaxMind databases
- IP reputation - AbuseIPDB integration with local caching
- S3 upload - Automatic upload to S3-compatible storage
Output Format
Flow records follow IPFIX Information Element naming conventions (RFC 5102/5103):
{
"flowStartMilliseconds": "2025-01-15T10:30:00.000Z",
"flowEndMilliseconds": "2025-01-15T10:30:05.123Z",
"flowDurationMilliseconds": 5123,
"ipVersion": 4,
"protocolIdentifier": 6,
"sourceIPAddress": "192.168.1.100",
"sourceTransportPort": 54321,
"destinationIPAddress": "93.184.216.34",
"destinationTransportPort": 443,
"octetTotalCount": 1234,
"packetTotalCount": 15,
"applicationName": "TLS"
}
Basic Usage
# Read from PCAP file
rockfish_probe -i capture.pcap --parquet-dir ./flows
# Live capture with libpcap
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows
# High-performance AF_PACKET (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows
# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows
Next Steps
- Configuration - Full configuration reference
- Capture Modes - Platform-specific capture options
- Performance Tuning - High-speed capture optimization
Configuration Reference
Rockfish Probe uses YAML configuration files. Command-line arguments override config file settings.
# Run with configuration file
rockfish_probe -c /path/to/config.yaml
# Override settings via CLI
rockfish_probe -c config.yaml --source eth1
Configuration Sections
License
license:
path: /opt/rockfish/etc/license.json
| Option | Type | Default | Description |
|---|---|---|---|
path | string | - | Path to license file (JSON with Ed25519 signature) |
Environment Variable: ROCKFISH_LICENSE_PATH
Input
input:
source: eth0
live_type: afpacket
filter: "tcp or udp"
snaplen: 65535
promisc_off: false
| Option | Type | Default | Description |
|---|---|---|---|
source | string | (required) | Interface name or PCAP file path/glob |
live_type | string | pcap | Capture method: pcap, afpacket, netmap, fmadio |
filter | string | - | BPF filter expression |
snaplen | int | 65535 | Maximum bytes per packet |
promisc_off | bool | false | Disable promiscuous mode |
BPF Filter Examples
# TCP and UDP only
filter: "tcp or udp"
# HTTP and HTTPS
filter: "port 80 or port 443"
# Specific subnet
filter: "net 192.168.1.0/24"
# Exclude SSH
filter: "not port 22"
Flow
flow:
idle_timeout: 300
active_timeout: 1800
max_flows: 0
max_payload: 500
udp_uniflow_port: 0
mac: true
| Option | Type | Default | Description |
|---|---|---|---|
idle_timeout | int | 300 | Seconds of inactivity before flow expires |
active_timeout | int | 1800 | Maximum flow duration before export |
max_flows | int | 0 | Maximum concurrent flows (0 = unlimited) |
max_payload | int | 500 | Max payload bytes for protocol detection |
udp_uniflow_port | int | 0 | UDP uniflow mode (0=off, 1=all) |
mac | bool | true | Include MAC addresses |
Note: TLS/TCP fingerprints (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp) are automatically extracted when nDPI is enabled and included in Professional+ tier output.
nDPI
ndpi:
enabled: true
protocol_file: /opt/rockfish/etc/ndpi-protos.txt
categories_file: /opt/rockfish/etc/ndpi-categories.txt
| Option | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable nDPI application labeling |
protocol_file | string | - | Custom protocol definitions |
categories_file | string | - | Custom category definitions |
Note: The nDPI library is bundled in all Rockfish packages; nDPI labeling requires a Basic tier license or above.
Fragment
fragment:
disabled: false
max_tables: 1024
timeout: 30
| Option | Type | Default | Description |
|---|---|---|---|
disabled | bool | false | Disable IP fragment reassembly |
max_tables | int | 1024 | Max concurrent fragment tables |
timeout | int | 30 | Fragment timeout in seconds |
Output
output:
parquet_dir: /var/run/rockfish/flows
parquet_batch_size: 1000000
parquet_file_prefix: rockfish-flow
parquet_schema: simple
observation: sensor-01
hive_boundary_flush: false
stats: true
verbose: 1
log_file: /var/log/rockfish/rockfish.log
| Option | Type | Default | Description |
|---|---|---|---|
parquet_dir | string | (required) | Output directory for Parquet files |
parquet_batch_size | int | 1000000 | Max flows per file before rotation |
parquet_file_prefix | string | rockfish-flow | Filename prefix |
parquet_schema | string | simple | Schema: simple (50 fields) or extended (62 fields) |
observation | string | gnat | Observation domain name |
hive_boundary_flush | bool | false | Flush at day boundaries for Hive partitioning |
verbose | int | 1 | 0=warnings, 1=info, 2=debug, 3=trace |
log_file | string | - | Log file path (enables daily rotation) |
AFPacket
Linux high-performance capture:
afpacket:
block_size: 2097152
block_count: 64
fanout_group: 0
fanout_mode: hash
| Option | Type | Default | Description |
|---|---|---|---|
block_size | int | 2097152 | Ring buffer block size (bytes) |
block_count | int | 64 | Number of ring buffer blocks |
fanout_group | int | 0 | Fanout group ID (0 = disabled) |
fanout_mode | string | hash | Distribution: hash, lb, cpu, rollover, random |
Memory: block_size × block_count (default: 128 MB)
Netmap
FreeBSD high-performance capture:
netmap:
rx_slots: 1024
tx_slots: 1024
poll_timeout: 1000
host_rings: false
S3
s3:
bucket: my-flow-bucket
prefix: flows
region: us-east-1
endpoint: https://nyc3.digitaloceanspaces.com
force_path_style: false
hive_partitioning: true
delete_after_upload: true
aggregate: true
aggregate_hold_minutes: 5
| Option | Type | Default | Description |
|---|---|---|---|
bucket | string | (required) | S3 bucket name |
prefix | string | - | S3 key prefix |
region | string | (required) | AWS region |
endpoint | string | - | Custom endpoint (MinIO, DO Spaces, etc.) |
force_path_style | bool | false | Use path-style URLs (required for MinIO) |
hive_partitioning | bool | false | Organize by year=/month=/day=/ |
delete_after_upload | bool | false | Delete local files after upload |
aggregate | bool | false | Merge files per minute before upload |
aggregate_hold_minutes | int | 1 | Hold time before aggregating |
GeoIP
geoip:
country_db: /opt/rockfish/etc/GeoLite2-Country.mmdb
city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb
Note: Requires --features geoip and MaxMind databases.
Threat Intel
threat_intel:
enabled: true
endpoint_url: "http://localhost:8080"
api_token: "your-api-token"
batch_size: 100
timeout_seconds: 10
| Option | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable threat intel lookups |
endpoint_url | string | (required) | API endpoint URL |
api_token | string | (required) | Bearer token for authentication |
batch_size | int | 100 | IPs per API request |
timeout_seconds | int | 10 | Request timeout |
Output goes to <parquet_dir>/intel/.
Complete Example
license:
path: /opt/rockfish/etc/license.json
input:
source: eth0
live_type: afpacket
filter: "tcp or udp"
flow:
idle_timeout: 300
active_timeout: 1800
max_flows: 1000000
max_payload: 500
ndpi:
enabled: true # Fingerprints (ndpi_ja4, ndpi_ja3s) extracted automatically
output:
parquet_dir: /var/run/rockfish/flows
observation: sensor-01
hive_boundary_flush: true
afpacket:
block_size: 2097152
block_count: 64
s3:
bucket: flow-data
prefix: sensors/sensor-01
region: us-east-1
hive_partitioning: true
delete_after_upload: true
geoip:
city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb
Capture Modes
Rockfish Probe supports multiple capture backends for different platforms and performance requirements.
Capture Types
| Type | Platform | Description |
|---|---|---|
pcap | All | Standard libpcap (portable) |
afpacket | Linux | AF_PACKET with TPACKET_V3 (high-performance) |
netmap | FreeBSD | Netmap framework (high-performance) |
fmadio | Linux | FMADIO appliance ring buffer |
libpcap (Default)
The most portable option, works on all platforms.
input:
source: eth0
live_type: pcap
filter: "tcp or udp"
snaplen: 65535
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows
Pros
- Works everywhere (Linux, FreeBSD, macOS)
- Supports BPF filters
- Well-documented
Cons
- Lower performance than kernel-bypass methods
- Copies packets through kernel
AF_PACKET (Linux)
High-performance capture using Linux’s TPACKET_V3 with memory-mapped ring buffers.
input:
source: eth0
live_type: afpacket
afpacket:
block_size: 2097152 # 2 MB blocks
block_count: 64 # 128 MB total ring
fanout_group: 0 # 0 = disabled
fanout_mode: hash
sudo rockfish_probe -i eth0 --live afpacket \
--afp-block-size 2097152 \
--afp-block-count 64 \
--parquet-dir ./flows
Ring Buffer Sizing
Total Ring Buffer = block_size × block_count
Default: 2 MB × 64 = 128 MB
For 10 Gbps+:
afpacket:
block_size: 4194304 # 4 MB
block_count: 128 # 512 MB total
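As a rough sanity check (not a Rockfish utility), you can estimate how long a given ring absorbs a burst if the consumer stalls completely at a given line rate. The numbers below match the 10 Gbps example above.

# Illustrative back-of-envelope ring sizing check
block_size = 4 * 1024 * 1024      # 4 MB per block
block_count = 128                 # 512 MB total ring
line_rate_gbps = 10

ring_bytes = block_size * block_count
bytes_per_sec = line_rate_gbps * 1e9 / 8
print(f"ring = {ring_bytes / 2**20:.0f} MB, "
      f"absorbs ~{ring_bytes / bytes_per_sec * 1000:.0f} ms of burst at {line_rate_gbps} Gbps")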
Fanout Mode
Distribute packets across multiple processes:
afpacket:
fanout_group: 1 # Non-zero enables fanout
fanout_mode: hash # Distribute by flow hash
| Mode | Description |
|---|---|
hash | By flow hash (recommended for flow analysis) |
lb | Round-robin load balancing |
cpu | By receiving CPU |
rollover | Fill one socket, then next |
random | Random distribution |
Multi-Process Capture
Run multiple instances with the same fanout group:
# Terminal 1
sudo rockfish_probe -i eth0 --live afpacket \
--afp-fanout-group 1 -o flows1/
# Terminal 2
sudo rockfish_probe -i eth0 --live afpacket \
--afp-fanout-group 1 -o flows2/
Netmap (FreeBSD)
High-performance capture using FreeBSD’s netmap framework.
input:
source: em0
live_type: netmap
netmap:
rx_slots: 1024
tx_slots: 1024
poll_timeout: 1000
host_rings: false
| Option | Default | Description |
|---|---|---|
rx_slots | driver default | RX ring slot count |
tx_slots | driver default | TX ring slot count |
poll_timeout | 1000 | Poll timeout (ms) |
host_rings | false | Enable host stack access |
FMADIO (Linux)
Capture from FMADIO 100G packet capture appliances.
input:
source: ring0
live_type: fmadio
fmadio:
ring_path: /opt/fmadio/queue/lxc_ring0
include_fcs_errors: false
Note: FMADIO support is included in all Rockfish packages.
Reading PCAP Files
Process existing capture files:
# Single file
rockfish_probe -i capture.pcap --parquet-dir ./flows
# Multiple files with glob
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows
# With application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows
BPF Filters
All capture modes support BPF filters (except FMADIO):
input:
filter: "tcp or udp"
Common filters:
# Web traffic only
--filter "port 80 or port 443"
# Specific subnet
--filter "net 10.0.0.0/8"
# Exclude broadcast
--filter "not broadcast"
# DNS traffic
--filter "port 53"
Choosing a Capture Mode
| Requirement | Recommended Mode |
|---|---|
| Portability | pcap |
| Linux high-speed (1-10 Gbps) | afpacket |
| Linux 40-100 Gbps | afpacket with large ring + fanout |
| FreeBSD high-speed | netmap |
| FMADIO appliance | fmadio |
Next Steps
- Performance Tuning - Optimize for high-speed capture
- Configuration - Full configuration reference
Performance Tuning
Optimize Rockfish Probe for high-speed network capture.
AF_PACKET Tuning
Ring Buffer Size
For 10 Gbps+ capture, increase the ring buffer:
afpacket:
block_size: 4194304 # 4 MB per block
block_count: 128 # 512 MB total ring buffer
Use Fanout for Multi-Queue NICs
Modern NICs have multiple RX queues. Use fanout to utilize all cores:
# Run multiple instances with same fanout group
taskset -c 0 rockfish_probe -i eth0 --live afpacket \
--afp-fanout-group 1 --parquet-dir ./flows1 &
taskset -c 1 rockfish_probe -i eth0 --live afpacket \
--afp-fanout-group 1 --parquet-dir ./flows2 &
Use hash fanout mode to keep flows together.
CPU Pinning
Pin to specific CPU cores:
taskset -c 0 rockfish_probe -i eth0 --live afpacket ...
Or use CPU isolation:
# /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=0,1"
System Tuning
Socket Buffers
Increase kernel buffer sizes:
# Temporary
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.rmem_default=134217728
# Permanent (/etc/sysctl.conf)
net.core.rmem_max=134217728
net.core.rmem_default=134217728
Network Budget
Increase NAPI budget for high packet rates:
sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000
IRQ Affinity
Distribute NIC interrupts across CPUs:
# Find NIC IRQs
cat /proc/interrupts | grep eth0
# Set affinity (example for 4 queues)
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
echo 8 > /proc/irq/27/smp_affinity
Or use irqbalance with proper configuration.
Disable CPU Power Saving
Prevent CPU frequency scaling:
# Set performance governor
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance > $cpu
done
Flow Table Sizing
Limit memory usage under high connection rates:
flow:
max_flows: 1000000 # Limit to 1M concurrent flows
idle_timeout: 60 # Shorter timeout for faster cleanup
Parquet Output Tuning
Batch Size
Larger batches = fewer files, better compression:
output:
parquet_batch_size: 2000000 # 2M flows per file
S3 Aggregation
Reduce small file overhead:
s3:
aggregate: true
aggregate_hold_minutes: 5 # Merge files for 5 minutes
delete_after_upload: true
Monitoring
Statistics Output
Enable periodic statistics:
output:
stats: true
verbose: 2 # Debug level
Key Metrics to Watch
- Packets/sec: Compare with NIC stats (ethtool -S eth0)
- Drops: Check for ring buffer overflows
- Flows/sec: Flow export rate
- Memory usage: Monitor with top or htop
Check for Drops
# NIC drops
ethtool -S eth0 | grep -i drop
# Kernel drops
cat /proc/net/dev | grep eth0
# AF_PACKET drops
cat /proc/net/packet
Hardware Recommendations
NIC Selection
For high-speed capture:
- Intel X710/XL710 (40 GbE)
- Intel E810 (100 GbE)
- Mellanox ConnectX-5/6
Enable RSS (Receive Side Scaling) for multi-queue distribution.
CPU
- Modern Intel Xeon or AMD EPYC
- At least 1 core per 10 Gbps
- Large L3 cache helps
Storage
For sustained capture:
- NVMe SSD for local Parquet files
- Fast S3-compatible storage with adequate bandwidth
Example: 10 Gbps Configuration
license:
path: /opt/rockfish/etc/license.json
input:
source: eth0
live_type: afpacket
flow:
idle_timeout: 120
active_timeout: 900
max_flows: 2000000
max_payload: 256
afpacket:
block_size: 4194304
block_count: 128
fanout_group: 1
fanout_mode: hash
output:
parquet_dir: /data/flows
parquet_batch_size: 2000000
observation: sensor-01
s3:
bucket: flow-data
region: us-east-1
aggregate: true
aggregate_hold_minutes: 2
delete_after_upload: true
Run with CPU pinning:
sudo taskset -c 0-3 rockfish_probe -c config.yaml
IP Reputation
Rockfish Probe integrates with threat intelligence services for IP reputation lookups.
Overview
Two approaches are available:
| Feature | ip_reputation | threat_intel |
|---|---|---|
| Provider | Direct AbuseIPDB | External API server |
| Caching | Local in-memory | Server-side |
| Rate limits | Managed locally | Server manages |
| Best for | Single sensor | Multiple sensors |
These features are mutually exclusive.
IP Reputation (Direct AbuseIPDB)
Query AbuseIPDB directly with local caching.
Configuration
ip_reputation:
enabled: true
api_key: "your-abuseipdb-api-key"
cache_ttl_hours: 24
max_age_in_days: 90
s3_upload: true
| Option | Default | Description |
|---|---|---|
enabled | false | Enable IP reputation lookups |
api_key | (required) | AbuseIPDB API key |
output_dir | <parquet_dir>/ip_reputation | Output directory |
cache_ttl_hours | 24 | Cache entry lifetime |
max_age_in_days | 90 | Max age for AbuseIPDB reports |
s3_upload | false | Upload parquet files to S3 |
How It Works
- For each flow, source and destination IPs are queued for lookup
- Lookups run in a background thread
- Results are cached in memory with reference counting
- Cache is exported to Parquet every hour
Rate Limiting
AbuseIPDB free tier: 1000 requests/day.
When rate-limited (HTTP 429):
- API requests pause
- Local cache continues serving
- Resumes at the next hour boundary
- Repeats if still rate-limited
Output Schema
Hourly Parquet exports include:
| Field | Type | Description |
|---|---|---|
ip_address | String | IP address |
abuse_confidence_score | Int32 | Score (0-100) |
country_code | String | Country code |
isp | String | ISP name |
domain | String | Associated domain |
total_reports | Int32 | Total abuse reports |
last_reported_at | Timestamp | Last report time |
is_whitelisted | Boolean | Whitelisted status |
reference_count | Int64 | Times seen in flows |
first_seen | Timestamp | First flow occurrence |
last_seen | Timestamp | Last flow occurrence |
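Because both flows and reputation exports are Parquet, they can be joined directly. The sketch below uses the columns from the table above; the paths (probe defaults place reputation exports under <parquet_dir>/ip_reputation) and the score threshold are illustrative.

import duckdb

hits = duckdb.sql("""
    SELECT f.saddr, f.daddr, r.abuse_confidence_score, r.country_code, r.isp
    FROM './flows/*.parquet' f
    JOIN './flows/ip_reputation/*.parquet' r
      ON r.ip_address = f.daddr
    WHERE r.abuse_confidence_score >= 75   -- illustrative threshold
    ORDER BY r.abuse_confidence_score DESC
    LIMIT 25
""").df()
print(hits)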
Threat Intel (External API)
Use an external threat intelligence server (e.g., rockfish_intel) for centralized lookups.
Configuration
threat_intel:
enabled: true
endpoint_url: "http://localhost:8080"
api_token: "your-api-token"
batch_size: 100
timeout_seconds: 10
| Option | Default | Description |
|---|---|---|
enabled | false | Enable threat intel lookups |
endpoint_url | (required) | API server URL |
api_token | (required) | Bearer token |
batch_size | 100 | IPs per request |
timeout_seconds | 10 | Request timeout |
Benefits
- Centralized caching: Share cache across multiple sensors
- Rate limit management: Server handles provider limits
- Multiple providers: Server can aggregate multiple sources
Output
Threat intel Parquet files are written to <parquet_dir>/intel/.
With S3 and Hive partitioning:
s3://bucket/prefix/intel/year=YYYY/month=MM/day=DD/filename.parquet
Setup with rockfish_intel
- Start the intel server with your AbuseIPDB key
- Create a client entry in clients.yaml
- Configure the probe:
threat_intel:
enabled: true
endpoint_url: "http://threatintel-server:8080"
api_token: "client-token-from-clients-yaml"
Choosing Between Options
| Scenario | Recommendation |
|---|---|
| Single sensor, simple setup | ip_reputation |
| Multiple sensors | threat_intel + rockfish_intel |
| Enterprise with custom providers | threat_intel |
| Limited API quota | threat_intel (shared cache) |
Getting an AbuseIPDB API Key
- Create account at abuseipdb.com
- Go to API settings
- Generate API key
Free tier: 1000 checks/day. Paid tiers: higher limits, additional features.
MCP Overview
Coming Soon: Rockfish MCP is currently under development and will be available in March 2025.
Rockfish MCP is a Model Context Protocol (MCP) server for querying Parquet files using DuckDB.
Features
- SQL queries via DuckDB - Full SQL support for Parquet files
- S3 support - AWS, MinIO, Cloudflare R2, DigitalOcean Spaces
- Configurable data sources - Abstract file locations from API
- Multiple output formats - JSON, JSON Lines, CSV, Table
- TLS support - Secure connections for remote access
- HTTP/WebSocket mode - Standard HTTP with Bearer token auth
- License validation - Verify Parquet files were generated by licensed probes
Operation Modes
| Mode | Transport | Use Case |
|---|---|---|
| stdio | stdin/stdout | Claude Desktop, local tools |
| TLS | Raw TCP+TLS | Custom integrations |
| HTTP | HTTPS+WebSocket | Web clients, standard tooling |
Built-in Tools
| Tool | Description |
|---|---|
list_sources | List configured data sources |
schema | Get column names and types |
query | Query with filters and column selection |
aggregate | Group and aggregate data |
sample | Get random sample rows |
count | Count rows with optional filter |
Quick Example
# config.yaml
sources:
flow:
path: s3://security-data/netflow/
description: Network flow data
output:
default_format: json
max_rows: 1000
ROCKFISH_CONFIG=config.yaml rockfish_mcp
Query example:
query:
source: flow
columns: [saddr, daddr, sbytes, dbytes]
filter: "sbytes > 1000000"
limit: 50
License Validation
Rockfish MCP will validate that Parquet files were generated by a licensed rockfish_probe. Each Parquet file includes signed metadata:
- rockfish.license_id - License identifier
- rockfish.tier - License tier (Community, Basic, Professional, Enterprise)
- rockfish.company - Company name
- rockfish.observation - Observation domain name
Configure validation per data source:
sources:
prod_flows:
path: s3://data/flows/
require_license: true # Reject unlicensed files
allowed_license_ids: # Optional: restrict to specific licenses
- "lic_abc123"
Next Steps
- Setup - Configure MCP server
- Authentication - Secure your server
- Tools & Queries - Query reference
MCP Setup
Configure Rockfish MCP for different deployment scenarios.
Configuration File
Create a config.yaml:
# S3 credentials (optional)
s3:
region: us-east-1
# access_key_id: your-key
# secret_access_key: your-secret
# endpoint: localhost:9000 # For MinIO/R2
# Output settings
output:
default_format: json
max_rows: 1000
pretty_print: true
# Data source mappings
sources:
flow:
path: s3://security-data/netflow/
description: Network flow data
require_license: true
ip_reputation:
path: /data/threat-intel/ip-reputation.parquet
description: IP reputation scores
stdio Mode (Default)
For Claude Desktop or local tools.
Claude Desktop Configuration
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/claude/claude_desktop_config.json
{
"mcpServers": {
"rockfish": {
"command": "/path/to/rockfish-mcp",
"env": {
"ROCKFISH_CONFIG": "/path/to/config.yaml"
}
}
}
}
HTTP/WebSocket Mode
For web applications and standard HTTP clients.
Quick Start
- Generate a self-signed certificate:

  ./generate-self-signed-cert.sh

- Generate an API key and hash:

  API_KEY=$(openssl rand -base64 32)
  echo "API Key: $API_KEY"
  echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"

- Configure config.yaml:

  tls:
    enabled: true
    http_mode: true
    bind_address: "0.0.0.0:8443"
    cert_path: "./certs/cert.pem"
    key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "web-client"
        key_hash: "paste-hash-here"

- Run the server:

  ROCKFISH_CONFIG=config.yaml rockfish_mcp

- Connect:

  python examples/python_client_bearer_auth.py \
    --host localhost --port 8443 \
    --token "$API_KEY" --skip-verify
Plain HTTP Mode (Development)
For local development or behind a reverse proxy:
tls:
enabled: true
http_mode: true
disable_tls: true # No encryption
bind_address: "127.0.0.1:8080"
auth:
api_keys:
- name: "dev-client"
key_hash: "your-hash-here"
Warning: Only use plain HTTP for local development or behind a TLS-terminating proxy.
TLS Server Mode
For custom integrations with raw TLS connections.
tls:
enabled: true
http_mode: false # Raw TLS mode
bind_address: "127.0.0.1:8443"
cert_path: "./certs/cert.pem"
key_path: "./certs/key.pem"
auth:
api_keys:
- name: "production-client"
key_hash: "your-key-hash-here"
License Validation
Require Parquet files to have valid Rockfish license metadata:
sources:
# Any valid Rockfish license
licensed_flows:
path: s3://data/flows/
description: Licensed network flow data
require_license: true
# Specific license IDs only
enterprise_flows:
path: s3://data/enterprise/
description: Enterprise flow data
require_license: true
allowed_license_ids:
- "lic_abc123"
- "lic_def456"
# No validation (default)
public_data:
path: /data/public/
description: Public datasets
Rockfish Probe embeds license metadata in Parquet files:
- rockfish.license.id
- rockfish.license.tier
- rockfish.license.customer_email
- rockfish.license.issued_at
Environment Variables
| Variable | Description |
|---|---|
ROCKFISH_CONFIG | Path to config.yaml |
AWS_ACCESS_KEY_ID | AWS credentials |
AWS_SECRET_ACCESS_KEY | AWS credentials |
AWS_REGION | AWS region |
Testing
# Start server
ROCKFISH_CONFIG=config.yaml rockfish_mcp
# Test with curl (HTTP mode)
curl -X POST https://localhost:8443/mcp \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'
Next Steps
- Authentication - Secure your server
- Tools & Queries - Query reference
- S3 Configuration - Cloud storage setup
Authentication
Rockfish MCP supports multiple authentication mechanisms.
Overview
| Method | Transport | Description |
|---|---|---|
| API Key (JSON) | Raw TLS | JSON frame before MCP session |
| Bearer Token | HTTP/WS | Standard Authorization header |
| Mutual TLS (mTLS) | Any TLS | Client certificate verification |
These can be combined for defense-in-depth.
Bearer Token Authentication (HTTP Mode)
Standard HTTP authentication using Authorization: Bearer <token> header.
Setup
- Generate API key and hash:

  API_KEY=$(openssl rand -base64 32)
  echo "API Key: $API_KEY"
  echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"

- Configure:

  tls:
    enabled: true
    http_mode: true
    bind_address: "0.0.0.0:8443"
    cert_path: "./certs/cert.pem"
    key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "production-client"
        key_hash: "a1b2c3d4e5f6..."
Client Examples
Python (websockets):
import asyncio
import websockets
import json
async def connect():
uri = "wss://localhost:8443/mcp"
headers = {"Authorization": "Bearer your-api-key"}
async with websockets.connect(uri, extra_headers=headers) as ws:
await ws.send(json.dumps({
"jsonrpc": "2.0",
"id": 1,
"method": "initialize",
"params": {
"protocolVersion": "2024-11-05",
"capabilities": {},
"clientInfo": {"name": "python-client", "version": "1.0"}
}
}))
print(await ws.recv())
asyncio.run(connect())
JavaScript/Node.js:
const WebSocket = require('ws');
const ws = new WebSocket('wss://localhost:8443/mcp', {
headers: { 'Authorization': 'Bearer your-api-key' },
rejectUnauthorized: true // false for self-signed certs
});
ws.on('open', () => {
ws.send(JSON.stringify({
jsonrpc: '2.0',
id: 1,
method: 'initialize',
params: {
protocolVersion: '2024-11-05',
capabilities: {},
clientInfo: { name: 'nodejs-client', version: '1.0' }
}
}));
});
ws.on('message', data => console.log(data.toString()));
cURL:
curl -i -N \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "Authorization: Bearer your-api-key" \
-H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" \
-H "Sec-WebSocket-Version: 13" \
https://localhost:8443/mcp
API Key Authentication (TLS Mode)
JSON-based authentication for raw TLS connections.
Protocol
- Client connects via TLS
- Client sends: {"api_key": "your-secret-key"}\n
- Server responds: {"success": true/false, "message": "..."}\n
- MCP session proceeds if successful
Configuration
tls:
enabled: true
http_mode: false # Raw TLS mode
bind_address: "127.0.0.1:8443"
cert_path: "./certs/cert.pem"
key_path: "./certs/key.pem"
auth:
api_keys:
- name: "production-client"
key_hash: "sha256-hash-here"
Client Example
import socket
import ssl
import json
context = ssl.create_default_context()
sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")
# Authenticate
auth = {"api_key": "your-secret-key"}
tls_sock.sendall((json.dumps(auth) + "\n").encode())
response = json.loads(tls_sock.recv(4096).decode().strip())
if not response["success"]:
raise Exception(f"Auth failed: {response['message']}")
# Proceed with MCP protocol...
Mutual TLS (mTLS)
Transport-level authentication using client certificates.
Create CA and Client Certificates
# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -new -x509 -key ca-key.pem -out ca-cert.pem -days 3650 \
-subj "/CN=Rockfish MCP CA/O=Your Org"
# Generate client certificate
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
-subj "/CN=client1/O=Your Org"
openssl x509 -req -in client.csr -CA ca-cert.pem -CAkey ca-key.pem \
-CAcreateserial -out client-cert.pem -days 365
Configuration
tls:
enabled: true
bind_address: "0.0.0.0:8443"
cert_path: "./certs/cert.pem"
key_path: "./certs/key.pem"
auth:
require_client_cert: true
client_ca_cert_path: "./certs/ca-cert.pem"
Client Example
import ssl
import socket
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.load_cert_chain(
certfile="client-cert.pem",
keyfile="client-key.pem"
)
context.load_verify_locations(cafile="server-ca-cert.pem")
sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")
# Connection authenticated via mTLS
Combining Authentication Methods
For maximum security, use both mTLS and API keys:
tls:
enabled: true
bind_address: "0.0.0.0:8443"
cert_path: "./certs/cert.pem"
key_path: "./certs/key.pem"
auth:
require_client_cert: true
client_ca_cert_path: "./certs/ca-cert.pem"
api_keys:
- name: "production-client"
key_hash: "a1b2c3d4e5f6..."
Both must succeed for authorization.
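A sketch of a client that satisfies both factors in raw TLS mode is shown below. It simply combines the two examples above: the TLS handshake presents a client certificate (mTLS), then the first frame carries the API key. File names, host, and port are illustrative.

import json
import socket
import ssl

# mTLS: present a client certificate and verify the server against its CA
context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.load_cert_chain(certfile="client-cert.pem", keyfile="client-key.pem")
context.load_verify_locations(cafile="server-ca-cert.pem")

sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")

# API key: first frame after the handshake
tls_sock.sendall((json.dumps({"api_key": "your-secret-key"}) + "\n").encode())
response = json.loads(tls_sock.recv(4096).decode().strip())
if not response["success"]:
    raise RuntimeError(f"Auth failed: {response['message']}")
# Both factors accepted; proceed with the MCP protocol...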
Security Best Practices
API Keys
- Generate with sufficient entropy: openssl rand -base64 32
- One key per client for audit/revocation
- Rotate regularly
- Never store plain-text keys in config
mTLS
- Protect CA private key: chmod 600 ca-key.pem
- Use short certificate lifetimes (90 days)
- Implement certificate revocation
- Unique certificates per client
General
- Use TLS in production
- Implement rate limiting
- Monitor authentication logs
- Use network segmentation
Troubleshooting
| Error | Solution |
|---|---|
| “Authentication failed” | Verify key matches hash |
| “Invalid auth request format” | Check JSON format, ensure \n at end |
| “Client certificate verification failed” | Check cert signed by configured CA |
| “require_client_cert without client_ca_cert_path” | Add CA path to config |
Utility: Generate API Key
#!/bin/bash
API_KEY=$(openssl rand -base64 32)
KEY_HASH=$(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)
echo "API Key: $API_KEY"
echo "Hash: $KEY_HASH"
echo ""
echo "Config entry:"
echo " - name: \"client-name\""
echo " key_hash: \"$KEY_HASH\""
Tools & Queries
Rockfish MCP provides SQL-based tools for querying Parquet data.
Available Tools
| Tool | Description |
|---|---|
list_sources | List configured data sources |
schema | Get column names and types |
query | Query with filters and column selection |
aggregate | Group and aggregate data |
sample | Get random sample rows |
count | Count rows with optional filter |
list_sources
List all configured data sources.
list_sources: {}
Response:
{
"sources": [
{"name": "flow", "description": "Network flow data"},
{"name": "ip_reputation", "description": "IP reputation scores"}
]
}
schema
Get column names and types for a data source.
schema:
source: flow
format: table
Parameters:
| Name | Required | Description |
|---|---|---|
source | Yes | Data source name |
format | No | Output format (default: table) |
query
Query with filtering, column selection, and custom SQL.
Basic Query
query:
source: flow
columns: [saddr, daddr, sbytes, dbytes]
filter: "sbytes > 1000000"
limit: 50
format: json
Parameters:
| Name | Required | Description |
|---|---|---|
source | Yes | Data source name |
columns | No | Columns to select (default: all) |
filter | No | WHERE clause condition |
order_by | No | ORDER BY clause |
limit | No | Maximum rows |
format | No | Output format |
Custom SQL
Use {source} placeholder for the data source:
query:
source: flow
sql: |
SELECT saddr, COUNT(*) as connection_count, SUM(sbytes) as total_bytes
FROM {source}
GROUP BY saddr
ORDER BY total_bytes DESC
LIMIT 10
Time-based Queries
query:
source: flow
filter: "stime >= '2025-01-01' AND stime < '2025-01-02'"
columns: [stime, saddr, daddr, proto]
Protocol Filtering
query:
source: flow
filter: "proto = 'TCP' AND dport = 443"
columns: [saddr, daddr, ndpi_appid]
aggregate
Group and aggregate data.
aggregate:
source: flow
group_by: [dport]
aggregations:
- function: sum
column: sbytes
alias: total_bytes
- function: count
alias: connection_count
filter: "proto = 'TCP'"
order_by: "total_bytes DESC"
limit: 20
format: table
Parameters:
| Name | Required | Description |
|---|---|---|
source | Yes | Data source name |
group_by | Yes | Columns to group by |
aggregations | Yes | Aggregation functions |
filter | No | WHERE clause |
order_by | No | ORDER BY clause |
limit | No | Maximum rows |
Aggregation Functions
| Function | Description |
|---|---|
count | Count rows |
sum | Sum values |
avg | Average |
min | Minimum |
max | Maximum |
count_distinct | Count unique values |
Examples
Top destination ports by traffic:
aggregate:
source: flow
group_by: [dport]
aggregations:
- function: sum
column: sbytes + dbytes
alias: total_bytes
- function: count
alias: flows
order_by: "total_bytes DESC"
limit: 10
Flows by country (requires GeoIP):
aggregate:
source: flow
group_by: [scountry, dcountry]
aggregations:
- function: count
alias: flow_count
filter: "scountry IS NOT NULL"
sample
Get random sample rows.
sample:
source: flow
n: 10
format: json
Parameters:
| Name | Required | Description |
|---|---|---|
source | Yes | Data source name |
n | No | Number of rows (default: 10) |
format | No | Output format |
count
Count rows with optional filter.
count:
source: flow
filter: "ndpi_risk_score > 50"
Parameters:
| Name | Required | Description |
|---|---|---|
source | Yes | Data source name |
filter | No | WHERE clause |
Output Formats
| Format | Description |
|---|---|
json | Pretty-printed JSON array |
jsonl / json_lines / ndjson | Newline-delimited JSON |
csv | CSV with header |
table / text | ASCII table |
Common Query Patterns
Top Talkers
query:
source: flow
sql: |
SELECT saddr,
COUNT(*) as flows,
SUM(sbytes) as sent,
SUM(dbytes) as received
FROM {source}
GROUP BY saddr
ORDER BY sent + received DESC
LIMIT 20
DNS Traffic
query:
source: flow
filter: "dport = 53 OR sport = 53"
columns: [stime, saddr, daddr, sbytes, dbytes]
High-Risk Flows
query:
source: flow
filter: "ndpi_risk_score > 100"
columns: [stime, saddr, daddr, ndpi_appid, ndpi_risk_list]
Long-Duration Flows
query:
source: flow
filter: "dur > 3600000" # > 1 hour in ms
columns: [stime, etime, dur, saddr, daddr, sbytes, dbytes]
order_by: "dur DESC"
External Traffic
query:
source: flow
filter: "NOT (saddr LIKE '10.%' OR saddr LIKE '192.168.%')"
columns: [saddr, daddr, scountry, dcountry]
Application Distribution
aggregate:
source: flow
group_by: [ndpi_appid]
aggregations:
- function: count
alias: flows
- function: sum
column: sbytes + dbytes
alias: bytes
filter: "ndpi_appid IS NOT NULL"
order_by: "bytes DESC"
limit: 20
S3 Configuration
Configure Rockfish MCP to query Parquet files from S3-compatible storage.
AWS S3
Default Credentials
If the s3 section is omitted, DuckDB uses AWS credentials from:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- ~/.aws/credentials
- IAM role (EC2/ECS)
sources:
flow:
path: s3://my-bucket/flows/
description: Network flows
Explicit Credentials
s3:
region: us-east-1
access_key_id: AKIAIOSFODNN7EXAMPLE
secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Security Note: Prefer environment variables or IAM roles over config file credentials.
MinIO
Self-hosted S3-compatible storage.
s3:
endpoint: localhost:9000
access_key_id: minioadmin
secret_access_key: minioadmin
use_ssl: false
url_style: path # Required for MinIO
sources:
flow:
path: s3://my-bucket/flows/
DigitalOcean Spaces
s3:
endpoint: nyc3.digitaloceanspaces.com
region: nyc3
access_key_id: your-spaces-key
secret_access_key: your-spaces-secret
sources:
flow:
path: s3://my-space/flows/
Cloudflare R2
s3:
endpoint: <account-id>.r2.cloudflarestorage.com
access_key_id: your-r2-key
secret_access_key: your-r2-secret
sources:
flow:
path: s3://my-bucket/flows/
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
region | string | - | AWS region (e.g., us-east-1) |
access_key_id | string | - | Access key ID |
secret_access_key | string | - | Secret access key |
endpoint | string | - | Custom endpoint URL |
use_ssl | bool | true | Use HTTPS |
url_style | string | vhost | path or vhost |
Querying S3 Data
Direct Path
sources:
flow:
path: s3://bucket/prefix/
description: All flow data
Hive Partitioned Data
Rockfish Probe can organize uploads with Hive-style partitioning:
s3://bucket/flows/year=2025/month=01/day=28/*.parquet
Query specific partitions:
sources:
flow:
path: s3://bucket/flows/year=2025/month=01/
description: January 2025 flows
Or use SQL with DuckDB’s Hive partitioning support:
query:
source: flow
sql: |
SELECT * FROM read_parquet(
's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
hive_partitioning=true
)
LIMIT 100
Performance Tips
Use Partition Pruning
Structure queries to match partitioning scheme:
# Efficient - matches Hive partitions
query:
source: flow
filter: "year = 2025 AND month = 1 AND day = 28"
Limit Column Selection
Only select needed columns:
query:
source: flow
columns: [saddr, daddr, sbytes] # Much faster than SELECT *
Use Aggregation Server-Side
Push aggregation to DuckDB:
aggregate:
source: flow
group_by: [dport]
aggregations:
- function: count
alias: flows
Troubleshooting
“Access Denied”
- Verify credentials are correct
- Check bucket policy allows s3:GetObject and s3:ListBucket
- For cross-account access, verify IAM trust policies
“Bucket not found”
- Check region matches bucket region
- For custom endpoints, verify the url_style setting
“Connection refused”
- Verify endpoint URL is correct
- Check use_ssl matches the endpoint (http vs https)
- For MinIO, ensure url_style: path
Slow Queries
- Add partition filters to queries
- Select only needed columns
- Check network bandwidth to S3
Example: Multi-Source Configuration
s3:
region: us-east-1
sources:
# Production flows (licensed, validated)
prod_flows:
path: s3://prod-bucket/flows/
description: Production network flows
require_license: true
# Development data (no validation)
dev_flows:
path: s3://dev-bucket/flows/
description: Development test data
# Threat intel from intel server
threat_intel:
path: s3://prod-bucket/intel/
description: IP reputation data
output:
default_format: json
max_rows: 10000
Rockfish Detect Overview
Rockfish Detect is the ML training and anomaly detection service for the Rockfish platform. It provides a complete pipeline for building models from network flow data and scoring flows for anomalies.
Note: Rockfish Detect requires an Enterprise tier license.
Features
- Data Sampling - Random sampling from S3-stored Parquet files
- Feature Engineering - Build normalization tables for ML training
- Feature Ranking - Identify most significant fields for detection
- Model Training - Train anomaly detection models (HBOS, Hybrid)
- Flow Scoring - Score flows using trained models
- Device Fingerprinting - Passive OS/device detection via nDPI fingerprints
- Automated Scheduling - Run as daemon with daily training cycles
Architecture
Network Traffic
|
v
Parquet Files in S3 (from rockfish_probe)
|
v
+------------------------------------------+
| rockfish_detect |
+------------------------------------------+
| Sampler |
| - Queries S3 with DuckDB |
| - Random sampling |
| - Output: sample/*.parquet |
+------------------------------------------+
| Feature Engineer |
| - Build normalization tables |
| - Histogram binning + frequency |
| - Output: extract/*.parquet |
+------------------------------------------+
| Feature Ranker |
| - Importance scoring |
| - Output: rockfish_rank.parquet |
+------------------------------------------+
| Model Trainer (HBOS/Hybrid) |
| - Train on sampled data |
| - Output: models/*.json |
+------------------------------------------+
| Flow Scorer |
| - Score flows using trained models |
| - Output: score/*.parquet |
+------------------------------------------+
|
v
Anomaly Scores --> rockfish_mcp --> Alerts
Algorithms
| Algorithm | Type | Description |
|---|---|---|
| HBOS | Unsupervised | Histogram-Based Outlier Score - fast, interpretable |
| Hybrid | Combined | HBOS + fingerprint correlation + threat intelligence |
| Random Forest | Supervised | Classification-based (framework) |
| Autoencoder | Neural Network | Reconstruction error-based (framework) |
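HBOS is simple enough to sketch in a few lines. The code below is illustrative only, not Rockfish's implementation: it builds per-feature histograms from baseline data, then scores new rows by summing negative log bin frequencies, so values that fall in rare bins get higher scores.

import numpy as np

def hbos_fit(X, num_bins=10):
    """X: (n_samples, n_features). Returns per-feature (bin edges, smoothed densities)."""
    models = []
    for j in range(X.shape[1]):
        counts, edges = np.histogram(X[:, j], bins=num_bins)
        density = (counts + 1) / (counts.sum() + num_bins)  # Laplace smoothing
        models.append((edges, density))
    return models

def hbos_score(X, models):
    scores = np.zeros(X.shape[0])
    for j, (edges, density) in enumerate(models):
        # Map each value to its bin; values outside the range land in the edge bins
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, len(density) - 1)
        scores += -np.log(density[idx])
    return scores

baseline = np.random.randn(1000, 3)                     # stand-in for sampled flow features
candidates = np.vstack([baseline[:5], [[8.0, 8.0, 8.0]]])  # last row is an obvious outlier
models = hbos_fit(baseline)
print(hbos_score(candidates, models))                   # the outlier gets the largest score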
Use Cases
- Unsupervised Anomaly Detection - HBOS identifies statistical outliers
- Behavioral Change Detection - Hybrid mode detects unusual fingerprint combinations
- Device Profiling - Fingerprinting detects lateral movement
- Threat Prioritization - Score-based reporting prioritizes investigations
- Network Baselining - Feature ranking identifies important characteristics
Quick Start
# Validate configuration
rockfish_detect -c config.yaml validate
# Run full pipeline for specific date
rockfish_detect -c config.yaml auto --date 2025-01-28
# Start as scheduler daemon
rockfish_detect -c config.yaml run
# Run immediately (don't wait for schedule)
rockfish_detect -c config.yaml run --run-now
Requirements
- Enterprise tier license
- S3-compatible storage with flow data from rockfish_probe
- Multi-core system recommended (Rockfish Detect uses half of the available cores)
Next Steps
- Configuration - Set up rockfish_detect
- Data Pipeline - Understand the processing stages
- Anomaly Detection - Configure detection models
Configuration Reference
Rockfish Detect uses YAML configuration files.
rockfish_detect -c /path/to/config.yaml [command]
Configuration Sections
License
license:
path: /etc/rockfish/license.json
observation: flows
| Option | Type | Required | Description |
|---|---|---|---|
path | string | No | License file path (auto-searches if not set) |
observation | string | Yes | S3 prefix / observation domain |
S3
s3:
bucket: my-flow-bucket
region: us-east-1
endpoint: https://s3.example.com
hive_partitioning: true
http_retries: 10
http_retry_wait_ms: 2000
http_retry_backoff: 2.0
| Option | Type | Default | Description |
|---|---|---|---|
bucket | string | (required) | S3 bucket name |
region | string | (required) | AWS region |
endpoint | string | - | Custom endpoint (MinIO, etc.) |
hive_partitioning | bool | true | Match rockfish_probe structure |
http_retries | int | 10 | Retry count for S3 operations |
http_retry_wait_ms | int | 2000 | Base wait between retries |
http_retry_backoff | float | 2.0 | Exponential backoff multiplier |
S3 Data Structure
Expected path structure (from rockfish_probe):
s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet
Sampling
sampling:
sample_percent: 10.0
retention_days: 7
sample_hour: 0
sample_minute: 30
output_prefix: flows/sample
| Option | Type | Default | Description |
|---|---|---|---|
sample_percent | float | 10.0 | Percentage of rows to sample (0-100) |
retention_days | int | 7 | Rolling window retention |
sample_hour | int | 0 | UTC hour for scheduled sampling |
sample_minute | int | random | Minute for scheduled sampling |
output_prefix | string | <obs>/sample/ | S3 output prefix |
Features
Configure feature engineering (normalization tables).
features:
num_bins: 10
histogram_type: quantile
ip_hash_modulus: 65536
sample_days: 7
| Option | Type | Default | Description |
|---|---|---|---|
num_bins | int | 10 | Histogram bins for numeric features |
histogram_type | string | quantile | quantile or equal_width |
ip_hash_modulus | int | 65536 | Dimensionality reduction for IPs |
sample_days | int | 7 | Days of samples to process |
Histogram Types
| Type | Description | Best For |
|---|---|---|
quantile | Equal sample count per bin | Skewed distributions |
equal_width | Equal value range per bin | Uniform distributions |
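The difference is easy to see with a short NumPy sketch (illustrative only, not part of the toolkit):

import numpy as np

# Skewed sample (e.g. flow durations in ms): most values small, a few very large.
rng = np.random.default_rng(0)
durations = rng.exponential(scale=200.0, size=10_000)
num_bins = 10

# Equal-width bins: each bin spans the same value range.
equal_width_edges = np.linspace(durations.min(), durations.max(), num_bins + 1)

# Quantile bins: each bin holds roughly the same number of samples.
quantile_edges = np.quantile(durations, np.linspace(0.0, 1.0, num_bins + 1))

print("equal_width:", np.round(equal_width_edges, 1))
print("quantile:   ", np.round(quantile_edges, 1))
# With skewed data most rows land in the first equal-width bin,
# while quantile bins keep the per-bin counts balanced.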
Training
training:
enabled: true
train_hour: 1
train_minute: 0
algorithm: hbos
model_output_dir: /var/lib/rockfish/models
min_importance_score: 0.7
hbos:
num_bins: 10
fields:
- dur
- rtt
- pcr
- spkts
- dpkts
- sbytes
- dbytes
- sentropy
- dentropy
hybrid:
hbos_weight: 0.5
correlation_weight: 0.3
threat_intel_weight: 0.2
hbos_filter_percentile: 90.0
min_observations: 3
| Option | Type | Default | Description |
|---|---|---|---|
enabled | bool | true | Enable training |
train_hour | int | 1 | UTC hour for scheduled training |
train_minute | int | random | Minute for scheduled training |
algorithm | string | hbos | hbos, hybrid, random_forest, autoencoder |
model_output_dir | string | - | Directory for trained models |
min_importance_score | float | 0.7 | Threshold for ranked features |
HBOS Options
| Option | Type | Default | Description |
|---|---|---|---|
num_bins | int | 10 | Histogram bins |
fields | list | - | Fields to include in model |
Hybrid Options
| Option | Type | Default | Description |
|---|---|---|---|
hbos_weight | float | 0.5 | Weight for HBOS score |
correlation_weight | float | 0.3 | Weight for fingerprint correlation |
threat_intel_weight | float | 0.2 | Weight for threat intel score |
hbos_filter_percentile | float | 90.0 | Pre-filter percentile |
min_observations | int | 3 | Min observations for correlation |
Fingerprint
Device/OS fingerprinting via nDPI signatures.
fingerprint:
enabled: false
history_days: 7
client_field: ndpi_ja4
server_field: ndpi_ja3s
min_observations: 10
anomaly_threshold: 0.7
max_fingerprints_per_host: 5
detect_suspicious: true
| Option | Type | Default | Description |
|---|---|---|---|
enabled | bool | false | Enable fingerprinting |
history_days | int | 7 | Days of history to analyze |
client_field | string | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
server_field | string | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
min_observations | int | 10 | Minimum observations for baseline |
anomaly_threshold | float | 0.7 | Threshold for anomaly detection |
max_fingerprints_per_host | int | 5 | Max expected fingerprints |
detect_suspicious | bool | true | Detect fingerprint changes |
Note: Requires nDPI fingerprint fields in flow data (Professional+ license for probe).
Logging
logging:
level: info
file: /var/log/rockfish/detect.log
| Option | Type | Default | Description |
|---|---|---|---|
level | string | info | Log level: error, warn, info, debug, trace |
file | string | - | Log file path (optional) |
Other Options
parallel_protocols: true
protocols:
- tcp
- udp
- icmp
duckdb:
autoload_extensions: false
| Option | Type | Default | Description |
|---|---|---|---|
parallel_protocols | bool | true | Process protocols in parallel |
protocols | list | tcp, udp, icmp | Protocols to process |
duckdb.autoload_extensions | bool | false | DuckDB extension autoload |
Complete Example
license:
path: /opt/rockfish/etc/license.json
observation: sensor-01
s3:
bucket: flow-data
region: us-east-1
hive_partitioning: true
sampling:
sample_percent: 10.0
retention_days: 7
sample_hour: 0
features:
num_bins: 10
histogram_type: quantile
sample_days: 7
training:
enabled: true
train_hour: 1
algorithm: hybrid
model_output_dir: /var/lib/rockfish/models
hbos:
num_bins: 10
fields:
- dur
- rtt
- pcr
- spkts
- dpkts
- sbytes
- dbytes
hybrid:
hbos_weight: 0.5
correlation_weight: 0.3
threat_intel_weight: 0.2
fingerprint:
enabled: true
history_days: 7
min_observations: 10
logging:
level: info
file: /var/log/rockfish/detect.log
Data Pipeline
Rockfish Detect processes data through a series of stages, each producing artifacts used by subsequent stages.
Pipeline Stages
sample --> extract --> rank --> train --> score
| Stage | Command | Input | Output |
|---|---|---|---|
| Sample | sample | Raw flow Parquet | Sampled Parquet |
| Extract | extract | Sampled Parquet | Normalization tables |
| Rank | rank | Normalization tables | Feature rankings |
| Train | train | Sampled + Normalization | Model files |
| Score | score | Raw flows + Model | Anomaly scores |
1. Sampling
Randomly samples flow data to reduce volume while maintaining statistical properties.
# Sample specific date
rockfish_detect -c config.yaml sample --date 2025-01-28
# Sample last N days
rockfish_detect -c config.yaml sample --days 7
# Clear state and resample all
rockfish_detect -c config.yaml sample --clear
Input Path
s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet
Output Path
s3://<bucket>/<observation>/sample/sample-YYYY-MM-DD.parquet
Configuration
sampling:
sample_percent: 10.0 # 10% of rows
retention_days: 7 # Keep 7 days of samples
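Conceptually, sampling behaves like a DuckDB sample query over the day's partition. A rough sketch using the duckdb Python API (bucket, observation, and output file are placeholders; the toolkit's actual SQL and state handling may differ):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # s3:// support; credentials from environment

# Placeholder bucket/observation, matching the rockfish_probe layout shown above.
src = "s3://my-flow-bucket/sensor-01/v2/year=2025/month=01/day=28/*.parquet"

con.execute(f"""
    COPY (
        SELECT * FROM read_parquet('{src}', hive_partitioning=true)
        USING SAMPLE 10 PERCENT (bernoulli)
    ) TO 'sample-2025-01-28.parquet' (FORMAT parquet)
""")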
State Tracking
Sampling maintains state to avoid reprocessing:
- Tracks which dates have been sampled
- Skips dates already in state file
- Use --clear to reset state
2. Feature Extraction
Builds normalization lookup tables for ML training.
# Extract features for all protocols
rockfish_detect -c config.yaml extract
# Specific protocol
rockfish_detect -c config.yaml extract -p tcp
# Sequential (not parallel)
rockfish_detect -c config.yaml extract --sequential
Processing
For each field, creates a normalization table:
Numeric fields (dur, rtt, bytes, etc.):
- Histogram binning (quantile or equal-width)
- Maps raw values to bin indices
- Normalizes to [0, 1] range
Categorical fields (proto, ports, IPs):
- Frequency counting
- Maps values to frequency scores
- Special handling for IPs (/24 truncation)
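A minimal Python sketch of the numeric binning and categorical frequency mapping described above (illustrative only; IP hashing and /24 truncation are omitted):

import numpy as np
import pandas as pd

# Toy flow sample; real input is the sampled Parquet from the previous stage.
flows = pd.DataFrame({
    "dur":   [12, 15, 900, 14, 13, 16, 4500, 11],
    "dport": [443, 443, 53, 443, 80, 443, 8443, 53],
})
num_bins = 4

# Numeric field: quantile bin edges, then map each value to a bin index scaled to [0, 1].
edges = np.quantile(flows["dur"], np.linspace(0, 1, num_bins + 1))
bin_idx = np.clip(np.searchsorted(edges, flows["dur"], side="right") - 1, 0, num_bins - 1)
flows["dur_norm"] = bin_idx / (num_bins - 1)

# Categorical field: frequency table, then map each value to its relative frequency.
freq = flows["dport"].value_counts(normalize=True)
flows["dport_norm"] = flows["dport"].map(freq)

print(flows)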
Output Path
s3://<bucket>/<observation>/extract/<protocol>/<field>.parquet
Configuration
features:
num_bins: 10 # Histogram resolution
histogram_type: quantile # Better for skewed data
ip_hash_modulus: 65536 # IP dimensionality reduction
3. Feature Ranking
Ranks features by importance for model training.
# Rank using reconstruction error
rockfish_detect -c config.yaml rank
# Rank using SVD
rockfish_detect -c config.yaml rank -a svd
# Specific protocol
rockfish_detect -c config.yaml rank -p tcp
Algorithms
| Algorithm | Description |
|---|---|
reconstruction | Autoencoder reconstruction error (default) |
svd | Singular Value Decomposition importance |
Output
s3://<bucket>/<observation>/extract/<protocol>/rockfish_rank.parquet
Contains importance scores (0-1) for each field.
Using Rankings
training:
min_importance_score: 0.7 # Only use features above this
4. Model Training
Trains anomaly detection models on sampled data.
# Train HBOS model
rockfish_detect -c config.yaml train -a hbos
# Train hybrid model
rockfish_detect -c config.yaml train -a hybrid
# Train with ranked features only
rockfish_detect -c config.yaml train-ranked -n 10
# Specific protocol
rockfish_detect -c config.yaml train -p tcp
Algorithms
HBOS (Histogram-Based Outlier Score):
- Fast, interpretable
- Inverse density scoring
- Good baseline algorithm
Hybrid:
- Combines HBOS + correlation + threat intel
- Weighted scoring model
- Better for complex environments
Output
Models saved to configured directory:
<model_output_dir>/<protocol>_model.json
Configuration
training:
algorithm: hbos
model_output_dir: /var/lib/rockfish/models
hbos:
num_bins: 10
fields: [dur, rtt, pcr, spkts, dpkts, sbytes, dbytes]
5. Flow Scoring
Scores flows using trained models.
# Score specific date
rockfish_detect -c config.yaml score -d 2025-01-28
# Score since timestamp
rockfish_detect -c config.yaml score --since 2025-01-28T00:00:00Z
# With severity threshold
rockfish_detect -c config.yaml score -t 0.8
# Limit results
rockfish_detect -c config.yaml score -n 1000
# Output to file
rockfish_detect -c config.yaml score -o anomalies.parquet
Options
| Option | Description |
|---|---|
-d, --date | Score specific date |
--since | Score since timestamp |
-p | Specific protocol |
-t, --threshold | Minimum score threshold |
-n, --limit | Maximum results |
-o, --output | Output file path |
Severity Classification
# Percentile-based (default)
severity_mode: percentile
# Fixed thresholds
severity_mode: fixed
severity_thresholds:
low: 0.5
medium: 0.7
high: 0.85
critical: 0.95
Output
s3://<bucket>/<observation>/score/score-YYYY-MM-DD.parquet
Includes:
- Original flow fields
- anomaly_score (0-1)
- severity (LOW, MEDIUM, HIGH, CRITICAL)
Automated Pipeline
Run the complete pipeline with a single command:
# Full pipeline for today
rockfish_detect -c config.yaml auto
# Specific date
rockfish_detect -c config.yaml auto --date 2025-01-28
# Last 7 days
rockfish_detect -c config.yaml auto --days 7
# Stop on first error
rockfish_detect -c config.yaml auto --fail-fast
Pipeline Order
- Sample data
- Extract features
- Rank features
- Train model
- Score flows
Reporting
Generate reports from scored data:
# Text report
rockfish_detect -c config.yaml report --date 2025-01-28
# JSON output
rockfish_detect -c config.yaml report -f json
# Filter by severity
rockfish_detect -c config.yaml report --min-severity HIGH
# Top N anomalies
rockfish_detect -c config.yaml report -n 50
Output Formats
| Format | Description |
|---|---|
text | Human-readable (default) |
json | Machine-readable JSON |
csv | CSV export |
Anomaly Detection
Rockfish Detect supports multiple anomaly detection algorithms for identifying unusual network flows.
Algorithms
| Algorithm | Type | Speed | Interpretability | Use Case |
|---|---|---|---|---|
| HBOS | Unsupervised | Fast | High | General anomaly detection |
| Hybrid | Combined | Medium | Medium | Complex environments |
| Random Forest | Supervised | Medium | Medium | Known threat patterns |
| Autoencoder | Neural Network | Slow | Low | Complex patterns |
HBOS (Histogram-Based Outlier Score)
HBOS is the default algorithm - fast, interpretable, and effective for network anomaly detection.
How It Works
- Build histograms for each feature from training data
- Calculate density for each bin
- Score new flows based on inverse density
- Combine scores across features
Flows falling in low-density bins receive high anomaly scores.
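A simplified HBOS sketch in Python (illustrative, not the toolkit's implementation) showing the histogram fit and inverse-density scoring:

import numpy as np

def hbos_fit(train, num_bins=10):
    """Build per-feature histograms (bin edges + relative densities)."""
    models = []
    for col in train.T:
        edges = np.histogram_bin_edges(col, bins=num_bins)
        counts, _ = np.histogram(col, bins=edges)
        density = np.clip(counts / counts.sum(), 1e-6, None)  # avoid log(0) on empty bins
        models.append((edges, density))
    return models

def hbos_score(models, flows):
    """Sum of log(1/density) across features; higher = more anomalous."""
    scores = np.zeros(len(flows))
    for (edges, density), col in zip(models, flows.T):
        idx = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, len(density) - 1)
        scores += np.log(1.0 / density[idx])
    return scores

# Toy example with two features (think dur and sbytes).
rng = np.random.default_rng(1)
train = rng.normal(loc=[100, 5_000], scale=[10, 500], size=(5_000, 2))
test = np.array([[102, 5_100],     # typical flow      -> low score
                 [400, 90_000]])   # low-density bins  -> high score
print(hbos_score(hbos_fit(train), test))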
Configuration
training:
algorithm: hbos
hbos:
num_bins: 10
fields:
- dur # Flow duration
- rtt # Round-trip time
- pcr # Producer-consumer ratio
- spkts # Source packets
- dpkts # Destination packets
- sbytes # Source bytes
- dbytes # Destination bytes
- sentropy # Source entropy
- dentropy # Destination entropy
- ssmallpktcnt # Small packet count
- slargepktcnt # Large packet count
Feature Selection
Choose fields that characterize normal behavior:
| Category | Fields | Detects |
|---|---|---|
| Volume | sbytes, dbytes, spkts, dpkts | Data exfiltration, DDoS |
| Timing | dur, rtt | Tunneling, beaconing |
| Behavior | pcr, entropy | C2, encrypted channels |
| Packets | smallpktcnt, largepktcnt | Protocol anomalies |
Example Output
Flow: 192.168.1.100:52341 -> 45.33.32.156:443
Score: 0.92 (CRITICAL)
Contributing factors:
- dbytes: 47MB (unusual outbound volume)
- dur: 28800s (8-hour connection)
- pcr: -0.98 (highly asymmetric)
Hybrid Algorithm
Combines multiple detection methods for improved accuracy.
Components
Final Score = (HBOS * W1) + (Correlation * W2) + (Threat Intel * W3)
| Component | Default Weight | Description |
|---|---|---|
| HBOS | 0.5 | Statistical outlier score |
| Correlation | 0.3 | Fingerprint pair frequency |
| Threat Intel | 0.2 | nDPI risk + IP reputation |
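The blend itself is a simple weighted sum; for example, with the default weights from the table above:

def hybrid_score(hbos, correlation, threat_intel,
                 w_hbos=0.5, w_corr=0.3, w_ti=0.2):
    """Weighted blend of the three component scores, each expected in [0, 1]."""
    return w_hbos * hbos + w_corr * correlation + w_ti * threat_intel

# Mild statistical outlier, never-seen fingerprint pair, poorly-reputed peer:
print(hybrid_score(hbos=0.6, correlation=0.9, threat_intel=0.8))  # 0.73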
Configuration
training:
algorithm: hybrid
hybrid:
hbos_weight: 0.5
correlation_weight: 0.3
threat_intel_weight: 0.2
hbos_filter_percentile: 90.0
min_observations: 3
Correlation Score
Based on nDPI fingerprint pair frequency (ndpi_ja4/ndpi_ja3s):
- Build database of (client_fingerprint, server_fingerprint) pairs
- Track frequency of each pair
- Score rare or never-seen combinations higher
Detects:
- New client/server combinations
- Unusual application behaviors
- Potential lateral movement
Threat Intel Score
Incorporates external intelligence:
- nDPI risk scores: Protocol-level risks
- IP reputation: AbuseIPDB confidence scores
- Known bad indicators: Blacklisted IPs/domains
Tuning Weights
| Environment | HBOS | Correlation | Threat Intel |
|---|---|---|---|
| General | 0.5 | 0.3 | 0.2 |
| High threat | 0.3 | 0.3 | 0.4 |
| Internal only | 0.6 | 0.4 | 0.0 |
Severity Classification
Anomaly scores are classified into severity levels.
Percentile-Based (Default)
Dynamic thresholds based on score distribution:
| Severity | Percentile |
|---|---|
| LOW | 50-75th |
| MEDIUM | 75-90th |
| HIGH | 90-95th |
| CRITICAL | >95th |
Adapts to your environment’s baseline.
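A small Python sketch of percentile-based classification (illustrative; labeling of scores below the 50th percentile is not specified here, so this sketch uses NONE):

import numpy as np

def classify_percentile(scores):
    """Map anomaly scores to severities using the percentile bands above."""
    p50, p75, p90, p95 = np.percentile(scores, [50, 75, 90, 95])
    severity = np.full(scores.shape, "NONE", dtype=object)
    severity[scores >= p50] = "LOW"
    severity[scores >= p75] = "MEDIUM"
    severity[scores >= p90] = "HIGH"
    severity[scores >= p95] = "CRITICAL"
    return severity

rng = np.random.default_rng(2)
scores = rng.beta(2, 8, size=1_000)   # toy score distribution
labels, counts = np.unique(classify_percentile(scores), return_counts=True)
print(dict(zip(labels, counts)))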
Fixed Thresholds
Static thresholds for consistent alerting:
severity_mode: fixed
severity_thresholds:
low: 0.5
medium: 0.7
high: 0.85
critical: 0.95
Protocol-Specific Models
Rockfish Detect trains separate models per protocol:
# Train TCP model only
rockfish_detect -c config.yaml train -p tcp
# Score UDP traffic
rockfish_detect -c config.yaml score -p udp
Why Separate Models?
- TCP, UDP, and ICMP have different characteristics
- Prevents cross-protocol noise
- Better detection accuracy per protocol
Configuration
protocols:
- tcp
- udp
- icmp
parallel_protocols: true # Process in parallel
Feature Ranking
Use feature importance to select the most relevant fields.
# Rank features
rockfish_detect -c config.yaml rank
# Train with top 10 ranked features
rockfish_detect -c config.yaml train-ranked -n 10
Benefits
- Reduces model complexity
- Improves training speed
- May improve detection accuracy
Configuration
training:
min_importance_score: 0.7 # Include features above this threshold
Best Practices
1. Start with HBOS
- Fast iteration
- Easy to interpret
- Good baseline performance
2. Use Adequate Training Data
- Minimum 7 days of samples
- Include normal business hours and off-hours
- Ensure representative traffic mix
3. Tune for Your Environment
- Adjust severity thresholds based on alert volume
- Weight algorithms based on threat model
- Include relevant fields for your use case
4. Regular Retraining
- Retrain weekly or monthly
- Network behavior changes over time
- New applications may appear as anomalies initially
5. Validate Results
- Review high-severity alerts
- Adjust thresholds to reduce false positives
- Document known-good anomalies
Troubleshooting
High False Positive Rate
- Increase severity thresholds
- Add more training data
- Exclude noisy fields from model
Missing True Positives
- Lower severity thresholds
- Include more fields in model
- Check training data for bias
Slow Scoring
- Use ranked features (fewer fields)
- Process protocols in parallel
- Increase hardware resources
Device Fingerprinting
Rockfish Detect includes ML-based passive device fingerprinting using network signals.
Note: Requires nDPI fingerprints in flow data (Professional+ license for rockfish_probe).
Overview
Device fingerprinting identifies devices and operating systems based on their network behavior, without requiring agents or active scanning.
Signals Used
| Priority | Signal | Field | Description |
|---|---|---|---|
| Primary | TLS client | ndpi_ja4 | JA4 TLS client fingerprint |
| Primary | TLS server | ndpi_ja3s | JA3 TLS server fingerprint |
| Secondary | TCP stack | ndpi_tcp_fp | TCP fingerprint with OS hint (TTL, window size, options) |
| Secondary | Composite | ndpi_fp | nDPI combined fingerprint for device correlation |
| Tertiary | Application | - | HTTP headers, DNS patterns |
Use Cases
- Asset Inventory - Discover devices on your network
- Baseline Monitoring - Track device behavior over time
- Lateral Movement Detection - Detect hosts changing fingerprints
- Unauthorized Devices - Identify unexpected device types
Commands
Build Fingerprint Database
Build baseline from historical data:
# Build from last 7 days
rockfish_detect -c config.yaml fingerprint build --days 7
# Build from specific date range
rockfish_detect -c config.yaml fingerprint build --start 2025-01-01 --end 2025-01-28
Detect Anomalies
Find hosts with unusual fingerprint changes:
# Detect for today
rockfish_detect -c config.yaml fingerprint detect
# Detect for specific date
rockfish_detect -c config.yaml fingerprint detect --date 2025-01-28
Profile Specific Host
Get fingerprint profile for an IP:
# Profile specific IP
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100
# With history
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100 --days 30
Configuration
fingerprint:
enabled: true
history_days: 7
client_field: ndpi_ja4
server_field: ndpi_ja3s
min_observations: 10
anomaly_threshold: 0.7
max_fingerprints_per_host: 5
detect_suspicious: true
| Option | Default | Description |
|---|---|---|
enabled | false | Enable fingerprinting |
history_days | 7 | Days of history to analyze |
client_field | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
server_field | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
min_observations | 10 | Minimum flows to establish baseline |
anomaly_threshold | 0.7 | Score threshold for anomalies |
max_fingerprints_per_host | 5 | Expected max fingerprints per device |
detect_suspicious | true | Flag suspicious changes |
How It Works
1. Baseline Building
For each IP address, collect:
- Set of observed ndpi_ja4 fingerprints (client connections)
- Set of observed ndpi_ja3s fingerprints (server connections)
- Frequency of each fingerprint
- First and last seen timestamps
2. Anomaly Detection
Flag hosts that:
- Present a new, never-seen fingerprint
- Exceed max_fingerprints_per_host
- Show sudden fingerprint changes
- Have rare fingerprint combinations
3. Correlation Scoring
Score fingerprint pairs by frequency:
Rare pair (first time seen) -> High anomaly score
Common pair (seen 1000+ times) -> Low anomaly score
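A rough sketch of the pair-frequency idea using DuckDB over Professional+ flow data (path is a placeholder; the rarity heuristic is illustrative, not the toolkit's exact correlation model):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder path; requires the Professional+ ndpi_ja4 / ndpi_ja3s fields.
flows = "s3://my-flow-bucket/sensor-01/v2/year=2025/month=01/*/*.parquet"

pairs = con.execute(f"""
    SELECT saddr,
           ndpi_ja4  AS client_fp,
           ndpi_ja3s AS server_fp,
           COUNT(*)  AS pair_count
    FROM read_parquet('{flows}', hive_partitioning=true)
    WHERE ndpi_ja4 IS NOT NULL AND ndpi_ja3s IS NOT NULL
    GROUP BY 1, 2, 3
""").df()

# Simple rarity heuristic: never-seen or rare pairs score close to 1.
pairs["rarity"] = 1.0 / (1.0 + pairs["pair_count"])
print(pairs.sort_values("rarity", ascending=False).head())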
Detection Scenarios
New Device on Network
Alert: New fingerprint detected
Host: 192.168.1.150
Fingerprint: t13d1516h2_8daaf6152771_b0da82dd1658
First seen: 2025-01-28T14:32:00Z
Action: Verify device is authorized
Host Fingerprint Change
Alert: Fingerprint change detected
Host: 192.168.1.100
Previous: t13d1516h2_8daaf6152771_b0da82dd1658 (Windows 11)
Current: t13d1517h2_5b57614c22b0_06cda9e17597 (Linux)
Risk: Possible lateral movement or VM switch
Unusual Client/Server Pair
Alert: Rare fingerprint combination
Client: 192.168.1.100 (ndpi_ja4: t13d1516h2_...)
Server: 45.33.32.156 (ndpi_ja3s: t120200_...)
Observations: 1 (first time)
Typical for this client: 847 connections to known servers
Risk: New external communication
Integration with Hybrid Scoring
Fingerprint correlation is a component of the hybrid algorithm:
training:
algorithm: hybrid
hybrid:
hbos_weight: 0.5
correlation_weight: 0.3 # Fingerprint correlation
threat_intel_weight: 0.2
Flows with rare fingerprint combinations receive higher anomaly scores.
Output Schema
Fingerprint analysis adds these fields to scored flows:
| Field | Type | Description |
|---|---|---|
fp_client | String | Client fingerprint (ndpi_ja4) |
fp_server | String | Server fingerprint (ndpi_ja3s) |
fp_pair_count | Int | Times this pair has been seen |
fp_client_count | Int | Times client has been seen |
fp_is_new | Bool | First observation of this pair |
fp_anomaly_score | Float | Fingerprint-specific anomaly score |
Best Practices
1. Build Sufficient Baseline
- Use at least 7 days of data
- Include weekdays and weekends
- Ensure coverage of all network segments
2. Tune Thresholds
- Start with defaults
- Adjust max_fingerprints_per_host for your environment
- Some hosts (proxies, VMs) legitimately have many fingerprints
3. Handle Known Exceptions
- Exclude known multi-fingerprint hosts
- Document expected fingerprint changes (updates, migrations)
4. Combine with Other Signals
- Use hybrid algorithm for combined scoring
- Correlate with threat intelligence
- Consider flow volume and timing
Limitations
- Requires nDPI fingerprint fields (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp) in flow data
- TLS fingerprints only available for TLS connections
- VPN/proxy traffic may obscure true fingerprints
- Fingerprints can change with software updates
Scheduler
Rockfish Detect can run as a daemon with automated scheduling for continuous anomaly detection.
Running as Daemon
# Start scheduler
rockfish_detect -c config.yaml run
# Run immediately without waiting
rockfish_detect -c config.yaml run --run-now
The scheduler runs two daily jobs:
- Sample job - Sample new flow data
- Train job - Retrain models with new samples
Schedule Configuration
sampling:
sample_hour: 0 # UTC hour (0-23)
sample_minute: 30 # Optional; random if not set
training:
train_hour: 1 # UTC hour (0-23)
train_minute: 0 # Optional; random if not set
Random Minutes
If sample_minute or train_minute is not set, a random minute (0-59) is selected at startup. This staggers job start times so multiple instances are unlikely to run at the same moment.
Example Schedule
# Sample at 00:30 UTC, train at 01:00 UTC
sampling:
sample_hour: 0
sample_minute: 30
training:
train_hour: 1
train_minute: 0
Timeline:
00:30 UTC - Sample yesterday's flow data
01:00 UTC - Retrain models with updated samples
Systemd Service
Create /etc/systemd/system/rockfish-detect.service:
[Unit]
Description=Rockfish Detect ML Service
After=network.target
[Service]
Type=simple
User=rockfish
ExecStart=/usr/local/bin/rockfish_detect -c /etc/rockfish/detect.yaml run
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable rockfish-detect
sudo systemctl start rockfish-detect
# Check status
sudo systemctl status rockfish-detect
# View logs
sudo journalctl -u rockfish-detect -f
Docker Deployment
# Pull the image
docker pull rockfishnetworks/toolkit:latest
# Run the scheduler
docker run -d \
--name rockfish-detect \
-v /path/to/config.yaml:/etc/rockfish/config.yaml \
-v /path/to/license.json:/etc/rockfish/license.json \
-e AWS_ACCESS_KEY_ID=xxx \
-e AWS_SECRET_ACCESS_KEY=xxx \
rockfishnetworks/toolkit:latest \
rockfish_detect -c /etc/rockfish/config.yaml run
Graceful Shutdown
The scheduler handles SIGTERM/SIGINT for graceful shutdown:
- Stops accepting new jobs
- Waits for running jobs to complete
- Saves state
- Exits cleanly
# Graceful stop
sudo systemctl stop rockfish-detect
# Or with kill
kill -TERM $(pgrep rockfish_detect)
State Management
The scheduler maintains state to avoid redundant work:
Sample State
Tracks which dates have been sampled:
s3://<bucket>/<observation>/sample/.state.json
Already-sampled dates are skipped on restart.
Score State
Tracks last scored timestamp:
s3://<bucket>/<observation>/score/.state.json
Scoring resumes from the last checkpoint.
Reset State
# Clear sample state
rockfish_detect -c config.yaml sample --clear
# Force rescore
rockfish_detect -c config.yaml score --since 2025-01-01T00:00:00Z
Monitoring
Log Output
logging:
level: info
file: /var/log/rockfish/detect.log
Log levels:
- error - Errors only
- warn - Warnings and errors
- info - Normal operation (default)
- debug - Detailed operation
- trace - Very verbose
Health Check
# Validate configuration
rockfish_detect -c config.yaml validate
# Test S3 connectivity
rockfish_detect -c config.yaml test-s3
# Check license
rockfish_detect -c config.yaml license
Metrics to Monitor
| Metric | Description |
|---|---|
| Sample job duration | Time to complete sampling |
| Train job duration | Time to complete training |
| Flows sampled | Number of flows per sample run |
| Anomalies detected | High-severity anomalies per day |
| S3 errors | Failed S3 operations |
Multi-Instance Deployment
For high availability or distributed processing:
Separate Responsibilities
# Instance 1: Sampling and training
rockfish_detect -c config-train.yaml run
# Instance 2: Scoring only
rockfish_detect -c config-score.yaml score --continuous
Shared State
All instances read/write to the same S3 bucket. State files prevent duplicate work.
Protocol Distribution
# Instance 1: TCP
rockfish_detect -c config.yaml run -p tcp
# Instance 2: UDP
rockfish_detect -c config.yaml run -p udp
Troubleshooting
Job Not Running
- Check system time (UTC)
- Verify schedule configuration
- Check logs for errors
Job Failing
# Run manually with verbose output
rockfish_detect -c config.yaml -vv auto
High Memory Usage
- Reduce sample_percent
- Process protocols sequentially
- Limit sample_days
Slow Jobs
- Enable parallel_protocols: true
- Use faster S3 storage
- Increase hardware resources
Parquet Schema
Rockfish exports flow data in Apache Parquet format with IPFIX-compliant field naming. The schema varies by license tier.
Schema by Tier
| Tier | Schema Version | Fields | Key Features |
|---|---|---|---|
| Community | v1 | 44 | Basic flow fields |
| Basic | v1 | 54 | + nDPI detection, GeoIP (country, city, ASN) |
| Professional | v2 | 60 | + GeoIP AS org, nDPI fingerprints |
| Enterprise | v2 | 63+ | + Anomaly scores, ML predictions |
Community Schema (44 Fields)
Basic flow capture with core network fields.
| # | Field | Type | Description |
|---|---|---|---|
| 1 | version | UInt16 | Schema version (1) |
| 2 | flowid | String | Unique flow UUID |
| 3 | obname | String | Observation domain name |
| 4 | stime | Timestamp | Flow start time (UTC) |
| 5 | etime | Timestamp | Flow end time (UTC) |
| 6 | dur | UInt32 | Duration (milliseconds) |
| 7 | rtt | UInt32 | Round-trip time (microseconds) |
| 8 | pcr | Int32 | Producer-consumer ratio |
| 9 | proto | String | Protocol (TCP, UDP, ICMP) |
| 10 | saddr | String | Source IP address |
| 11 | daddr | String | Destination IP address |
| 12 | sport | UInt16 | Source port |
| 13 | dport | UInt16 | Destination port |
| 14 | iflags | String | Initial TCP flags |
| 15 | uflags | String | Union of all TCP flags |
| 16 | stcpseq | UInt32 | Source initial TCP sequence |
| 17 | dtcpseq | UInt32 | Dest initial TCP sequence |
| 18 | svlan | UInt16 | Source VLAN ID |
| 19 | dvlan | UInt16 | Destination VLAN ID |
| 20 | spkts | UInt64 | Source packet count |
| 21 | dpkts | UInt64 | Destination packet count |
| 22 | sbytes | UInt64 | Source byte count |
| 23 | dbytes | UInt64 | Destination byte count |
| 24 | sentropy | UInt8 | Source payload entropy (0-255) |
| 25 | dentropy | UInt8 | Destination payload entropy |
| 26 | ssmallpktcnt | UInt32 | Source small packets (<60 bytes) |
| 27 | dsmallpktcnt | UInt32 | Dest small packets |
| 28 | slargepktcnt | UInt32 | Source large packets (>225 bytes) |
| 29 | dlargepktcnt | UInt32 | Dest large packets |
| 30 | snonemptypktcnt | UInt32 | Source non-empty packets |
| 31 | dnonemptypktcnt | UInt32 | Dest non-empty packets |
| 32 | sfirstnonemptycnt | UInt16 | Source first N non-empty sizes |
| 33 | dfirstnonemptycnt | UInt16 | Dest first N non-empty sizes |
| 34 | smaxpktsize | UInt16 | Source max packet size |
| 35 | dmaxpktsize | UInt16 | Dest max packet size |
| 36 | savgpayload | UInt16 | Source avg payload size |
| 37 | davgpayload | UInt16 | Dest avg payload size |
| 38 | sstdevpayload | UInt16 | Source payload std deviation |
| 39 | dstdevpayload | UInt16 | Dest payload std deviation |
| 40 | spd | String | Small packet direction flags |
| 41 | spdt | String | Small packet direction timing |
| 42 | reason | String | Flow termination reason |
| 43 | smac | String | Source MAC address |
| 44 | dmac | String | Destination MAC address |
Basic Schema (54 Fields)
Community schema + nDPI application detection + GeoIP (country, city, ASN).
GeoIP fields:
| # | Field | Type | Description |
|---|---|---|---|
| 45 | scountry | String | Source country (ISO 3166-1 alpha-2) |
| 46 | dcountry | String | Destination country |
| 47 | scity | String | Source city |
| 48 | dcity | String | Destination city |
| 49 | sasn | UInt32 | Source ASN |
| 50 | dasn | UInt32 | Destination ASN |
nDPI fields:
| # | Field | Type | Description |
|---|---|---|---|
| 51 | ndpi_appid | String | nDPI application ID (e.g., “TLS.YouTube”) |
| 52 | ndpi_category | String | nDPI category (e.g., “Streaming”) |
| 53 | ndpi_risk_score | UInt32 | nDPI cumulative risk score |
| 54 | ndpi_risk_severity | UInt8 | Risk severity (0=none, 1=low, 2=medium, 3=high) |
Professional Schema (60 Fields)
Basic schema + GeoIP AS organization names and nDPI fingerprinting.
Additional GeoIP fields (AS organization):
| # | Field | Type | Description |
|---|---|---|---|
| 55 | sasnorg | String | Source ASN organization |
| 56 | dasnorg | String | Destination ASN organization |
nDPI fingerprint fields:
| # | Field | Type | Description |
|---|---|---|---|
| 57 | ndpi_ja4 | String | JA4 TLS client fingerprint (via nDPI) |
| 58 | ndpi_ja3s | String | JA3 TLS server fingerprint (via nDPI) |
| 59 | ndpi_tcp_fp | String | TCP fingerprint with OS hint (via nDPI) |
| 60 | ndpi_fp | String | nDPI composite fingerprint |
Enterprise Schema (63+ Fields)
Professional schema + anomaly detection and ML predictions.
Anomaly detection fields:
| # | Field | Type | Description |
|---|---|---|---|
| 61 | anomaly_score | Float32 | Anomaly score (0.0 - 1.0) |
| 62 | anomaly_severity | String | Severity (LOW, MEDIUM, HIGH, CRITICAL) |
| 63 | anomaly_factors | String | Contributing factors |
File Naming
| Tier | File Pattern |
|---|---|
| Community | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Basic | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Professional | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
| Enterprise | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
S3 Path Structure
With Hive partitioning enabled:
s3://<bucket>/<prefix>/v1/year=YYYY/month=MM/day=DD/*.parquet
s3://<bucket>/<prefix>/v2/year=YYYY/month=MM/day=DD/*.parquet
Field Descriptions
Flow Identification
- flowid: Unique UUID for deduplication and correlation
- obname: Observation domain name (sensor identifier)
Timing
- stime/etime: Timestamps with microsecond precision, UTC
- dur: Duration in milliseconds
- rtt: Estimated TCP round-trip time
Network Addresses
- saddr/daddr: IPv4 or IPv6 addresses as strings
- sport/dport: Port numbers (0 for non-TCP/UDP)
- smac/dmac: MAC addresses in standard notation
Traffic Volumes
- spkts/dpkts: Packet counts per direction
- sbytes/dbytes: Byte counts per direction
- pcr: Producer-consumer ratio: (sent - recv) / (sent + recv)
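For example, assuming byte counts as sent/received (the stored pcr column may be scaled or packet-based):

def pcr(sbytes, dbytes):
    """Producer-consumer ratio: +1 = pure sender, -1 = pure receiver, 0 = balanced."""
    total = sbytes + dbytes
    return 0.0 if total == 0 else (sbytes - dbytes) / total

print(pcr(sbytes=1_000_000, dbytes=20_000))   #  0.96 -> host mostly sending
print(pcr(sbytes=20_000, dbytes=1_000_000))   # -0.96 -> host mostly receiving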
TCP Flags
- iflags: Initial TCP flags (SYN, ACK, etc.)
- uflags: Union of all flags seen in flow
Payload Analysis
- sentropy/dentropy: Shannon entropy (0-255)
  - ~230: Likely encrypted/compressed
  - ~140: English text
  - Low: Sparse or zero-padded
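A quick way to reproduce the scale, assuming the probe maps 0-8 bits of byte entropy linearly onto 0-255 (an assumption for illustration):

import math
import os
from collections import Counter

def entropy_0_255(payload: bytes) -> int:
    """Shannon entropy of the byte distribution, scaled from 0-8 bits to 0-255."""
    if not payload:
        return 0
    counts = Counter(payload)
    n = len(payload)
    bits = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return round(bits * 255 / 8)

print(entropy_0_255(os.urandom(4096)))                        # ~250+: random/encrypted-looking
print(entropy_0_255(b"GET /index.html HTTP/1.1\r\n" * 100))   # much lower: plain text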
Flow Termination
- reason: Why the flow ended
- idle: Idle timeout
- active: Active timeout
- eof: End of capture
- end: FIN exchange
- rst: TCP reset
GeoIP (Professional+)
- scountry/dcountry: ISO 3166-1 alpha-2 codes
- sasn/dasn: Autonomous System Numbers
- sasnorg/dasnorg: AS organization names
nDPI Detection (Basic+)
- ndpi_appid: Application identifier (e.g., “TLS.YouTube”)
- ndpi_category: Category (e.g., “Streaming”)
- ndpi_risk_score: Cumulative risk score
- ndpi_risk_severity: 0=none, 1=low, 2=medium, 3=high
nDPI Fingerprints (Professional+)
- ndpi_ja4: JA4 TLS client fingerprint
- ndpi_ja3s: JA3 TLS server fingerprint
- ndpi_tcp_fp: TCP fingerprint with OS detection hint (format: “fingerprint/os”)
- ndpi_fp: nDPI composite fingerprint for device correlation
Anomaly Detection (Enterprise)
- anomaly_score: 0.0-1.0 indicating how unusual the flow is
- anomaly_severity: Classification based on score percentile
- anomaly_factors: Fields contributing most to the score
Parquet File Metadata
Each file includes custom metadata:
| Key | Description |
|---|---|
rockfish.license_id | License identifier |
rockfish.tier | License tier |
rockfish.company | Company name |
rockfish.observation | Observation domain |
rockfish.schema_version | Schema version |
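The metadata can be read with any Parquet library, for example with pyarrow (file name is a placeholder):

import pyarrow.parquet as pq

# Placeholder file name; any Rockfish output file works.
meta = pq.ParquetFile("rockfish-sensor-01-v2-20250128-000000.parquet").metadata.metadata

# Key/value metadata is stored as bytes; decode and print the rockfish.* keys.
for key, value in (meta or {}).items():
    if key.startswith(b"rockfish."):
        print(key.decode(), "=", value.decode())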
Example Queries
DuckDB - Read from S3
SELECT * FROM read_parquet(
's3://bucket/v2/year=2025/month=01/day=28/*.parquet',
hive_partitioning=true
);
Count by Protocol
SELECT proto, COUNT(*) as count
FROM read_parquet('flows/*.parquet')
GROUP BY proto
ORDER BY count DESC;
Filter by Country (Professional+)
SELECT saddr, daddr, scountry, dcountry, ndpi_appid
FROM read_parquet('flows/*.parquet')
WHERE scountry = 'US' AND dcountry != 'US';
High-Risk Flows (Basic+)
SELECT stime, saddr, daddr, ndpi_appid, ndpi_risk_score
FROM read_parquet('flows/*.parquet')
WHERE ndpi_risk_score > 100
ORDER BY ndpi_risk_score DESC;
Anomalous Flows (Enterprise)
SELECT stime, saddr, daddr, anomaly_score, anomaly_severity
FROM read_parquet('flows/*.parquet')
WHERE anomaly_severity IN ('HIGH', 'CRITICAL')
ORDER BY anomaly_score DESC
LIMIT 100;
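Scored flows can also be pulled straight into a DataFrame for ML pipelines or AI tooling; a minimal sketch with the duckdb Python API (path is a placeholder):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder path to Enterprise-tier scored output.
df = con.execute("""
    SELECT stime, saddr, daddr, ndpi_appid, anomaly_score, anomaly_severity
    FROM read_parquet('s3://bucket/sensor-01/score/*.parquet')
    WHERE anomaly_severity IN ('HIGH', 'CRITICAL')
""").df()
print(df.head())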
CLI Reference
Command-line options for Rockfish tools.
rockfish_probe
Usage
rockfish_probe [OPTIONS]
Global Options
| Option | Short | Description |
|---|---|---|
--config <FILE> | -c | Configuration file path |
--help | -h | Show help |
--version | -V | Show version |
Input Options
| Option | Short | Description |
|---|---|---|
--source <SRC> | -i | Input source (interface or pcap file) |
--live <TYPE> | | Capture type: pcap, afpacket, netmap, fmadio |
--filter <EXPR> | | BPF filter expression |
--snaplen <BYTES> | | Maximum capture bytes per packet |
--promisc-off | | Disable promiscuous mode |
Flow Options
| Option | Description |
|---|---|
--idle-timeout <SECS> | Idle timeout (default: 300) |
--active-timeout <SECS> | Active timeout (default: 1800) |
--max-flows <COUNT> | Maximum flow table size |
--max-payload <BYTES> | Max payload bytes to capture |
--udp-uniflow <PORT> | UDP uniflow port (0=disabled) |
--ndpi | Enable nDPI (includes JA4/JA3s fingerprints) |
Fragment Options
| Option | Description |
|---|---|
--no-frag | Disable fragment reassembly |
--max-frag-tables <N> | Max fragment tables (default: 1024) |
--frag-timeout <SECS> | Fragment timeout (default: 30) |
AF_PACKET Options (Linux)
| Option | Description |
|---|---|
--afp-block-size <BYTES> | Ring buffer block size |
--afp-block-count <N> | Ring buffer block count |
--afp-fanout-group <ID> | Fanout group ID |
--afp-fanout-mode <MODE> | Fanout mode: hash, lb, cpu, rollover, random |
Output Options
| Option | Description |
|---|---|
--parquet-dir <DIR> | Output directory for Parquet files |
--parquet-batch-size <N> | Flows per file |
--parquet-prefix <PREFIX> | Filename prefix |
--parquet-schema <TYPE> | Schema: simple or extended |
--observation <NAME> | Observation domain name |
--hive-boundary-flush | Flush at day boundaries |
S3 Options
| Option | Description |
|---|---|
--s3-bucket <NAME> | S3 bucket name |
--s3-prefix <PREFIX> | S3 key prefix |
--s3-region <REGION> | AWS region |
--s3-endpoint <URL> | Custom S3 endpoint |
--s3-force-path-style | Use path-style URLs |
--s3-hive-partitioning | Enable Hive partitioning |
--s3-delete-after-upload | Delete local after upload |
--test-s3 | Test S3 connectivity and exit |
Logging Options
| Option | Short | Description |
|---|---|---|
--verbose | -v | Increase verbosity (-vv for debug) |
--quiet | -q | Quiet mode |
--stats | | Print statistics |
--log-file <PATH> | | Log file path |
License Options
| Option | Description |
|---|---|
--license <PATH> | License file path |
Environment: ROCKFISH_LICENSE_PATH
Examples
# Basic PCAP processing
rockfish_probe -i capture.pcap --parquet-dir ./flows
# Live capture with AF_PACKET
sudo rockfish_probe -i eth0 --live afpacket \
--afp-block-size 4194304 \
--afp-fanout-group 1 \
--parquet-dir ./flows
# With all features (nDPI includes fingerprints)
rockfish_probe -i eth0 --live afpacket \
--ndpi \
--parquet-dir ./flows \
--s3-bucket my-bucket \
--s3-region us-east-1 \
--s3-hive-partitioning \
-vv
# Test S3 connectivity
rockfish_probe --test-s3 \
--s3-bucket my-bucket \
--s3-region us-east-1
rockfish_mcp
Usage
rockfish_mcp [OPTIONS]
Options
| Option | Description |
|---|---|
--config <FILE> | Configuration file path |
--help | Show help |
--version | Show version |
Environment: ROCKFISH_CONFIG
Examples
# Start with config file
ROCKFISH_CONFIG=config.yaml rockfish_mcp
# Or via argument
rockfish_mcp --config /etc/rockfish/mcp.yaml
Common Patterns
Processing Multiple PCAPs
# Glob pattern
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows
# Multiple runs
for f in /data/captures/*.pcap; do
rockfish_probe -i "$f" --parquet-dir ./flows
done
High-Performance Capture
# Pin to CPUs, large ring buffer, fanout
sudo taskset -c 0-3 rockfish_probe -i eth0 --live afpacket \
--afp-block-size 4194304 \
--afp-block-count 128 \
--afp-fanout-group 1 \
--afp-fanout-mode hash \
--parquet-dir /data/flows
Development/Testing
# Verbose output, no S3
rockfish_probe -i test.pcap \
--parquet-dir ./test-flows \
--ndpi \
--stats \
-vv
Production Deployment
# Full featured with S3
rockfish_probe -c /opt/rockfish/etc/config.yaml \
--license /opt/rockfish/etc/license.json
License Tiers
Rockfish uses a tiered licensing model to enable different feature sets.
Tier Comparison
| Feature | Community | Basic | Professional | Enterprise |
|---|---|---|---|---|
| Core Features | ||||
| Packet capture | Yes | Yes | Yes | Yes |
| Flow generation | Yes | Yes | Yes | Yes |
| Parquet export | Yes | Yes | Yes | Yes |
| S3 upload | Yes | Yes | Yes | Yes |
| Schema | ||||
| v1 (Simple - up to 54 fields) | Yes | Yes | Yes | Yes |
| v2 (Extended - 60 fields) | - | - | Yes | Yes |
| Application Detection | ||||
| nDPI labeling | - | Yes | Yes | Yes |
| nDPI risk scoring | - | Yes | Yes | Yes |
| Network Intelligence | ||||
| GeoIP country/city/ASN | - | Yes | Yes | Yes |
| GeoIP AS organization | - | - | Yes | Yes |
| nDPI fingerprints (JA4, JA3s, TCP) | - | - | Yes | Yes |
| Customization | ||||
| Custom observation name | - | Yes | Yes | Yes |
| Advanced Features | ||||
| Anomaly detection | - | - | - | Yes |
| ML model integration | - | - | - | Yes |
Feature Details
Community Tier
Free tier with basic flow capture:
- Standard 5-tuple flow generation
- Parquet export (v1 schema)
- S3 upload support
- AF_PACKET high-performance capture
- Fragment reassembly
Basic Tier
Adds application visibility and GeoIP intelligence:
- All Community features
- nDPI application labeling
- nDPI risk scoring and categories
- GeoIP lookups (scountry, dcountry, scity, dcity, sasn, dasn)
- Custom observation domain name
- 54 fields total
Professional Tier
Adds AS organization names and device fingerprinting:
- All Basic features
- Extended schema (60 fields)
- GeoIP AS organization names (sasnorg, dasnorg)
- nDPI fingerprints (JA4 client, JA3 server, TCP fingerprint, composite)
Enterprise Tier
Full feature set:
- All Professional features
- Anomaly detection (HBOS)
- ML model integration
- SaaS schema (63+ fields)
- Correlation with rockfish_sensor
Schema Comparison
v1 (Simple) - Community/Basic
Up to 54 core fields (44 in Community; Basic adds the GeoIP and nDPI fields below):
- Flow identification (flowid, obname)
- Timing (stime, etime, dur, rtt)
- Addresses (saddr, daddr, sport, dport)
- Traffic (spkts, dpkts, sbytes, dbytes)
- TCP state (iflags, uflags, sequences)
- Payload analysis (entropy, packet sizes)
- GeoIP: scountry, dcountry, scity, dcity, sasn, dasn (Basic tier)
- nDPI results (Basic tier)
v2 (Extended) - Professional/Enterprise
60 fields (v1 + 6 additional):
- GeoIP AS organization: sasnorg, dasnorg
- nDPI fingerprints: ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp
v3 (SaaS) - Enterprise
63+ fields:
- All v2 fields
- Anomaly scores
- ML predictions
- Correlation IDs
License Enforcement
Parquet Metadata
Licensed files include metadata for validation:
rockfish.license_id: "lic_abc123"
rockfish.tier: "professional"
rockfish.company: "Example Corp"
rockfish.observation: "sensor-01"
MCP Validation
Configure license validation in MCP:
sources:
# Require valid license
prod_flows:
path: s3://data/flows/
require_license: true
# Restrict to specific licenses
enterprise_flows:
path: s3://data/enterprise/
require_license: true
allowed_license_ids:
- "lic_abc123"
Obtaining a License
Contact [email protected] for:
- License quotes
- Trial licenses
- Enterprise agreements
- Volume discounts