Rockfish Networks

Introduction

Network Flow Telemetry. Simple. Affordable. AI-Ready.

Rockfish Toolkit captures network flows and writes them directly to your S3 in Apache Parquet format. That’s it. No intermediate databases, no proprietary formats, no vendor lock-in.

Your data. Your privacy. Your control.

Your data is immediately ready for analysis by DuckDB, Spark, Pandas, Python, R, or any tool that reads Parquet - which is virtually every modern data platform.

Simple: One binary. Capture traffic. Write to S3. Done.
Affordable: Enterprise-grade network visibility for less than the price of a grande latte per day.
AI-Ready: Structured, queryable data that ML pipelines and AI assistants can consume immediately.
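
For example, a few lines of Python are enough to start exploring the probe's output (a minimal sketch; ./flows is a placeholder for the probe's output directory, and saddr/sbytes are flow fields described later in this guide):

import duckdb

# Top source addresses by bytes sent, straight from the probe's Parquet output.
# No intermediate database or ETL step is required.
print(duckdb.sql("""
    SELECT saddr, SUM(sbytes) AS bytes_sent
    FROM read_parquet('./flows/*.parquet')
    GROUP BY saddr
    ORDER BY bytes_sent DESC
    LIMIT 5
""").df())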

A Bolt-On Toolkit for SOC AI Readiness

The question “Is your SOC AI-ready?” has become central to modern security operations. Industry consensus is clear: AI readiness starts with SOC Data Foundations - structured, queryable security data that AI systems can actually consume.

The challenge? Traditional security tools generate logs in proprietary formats, scattered across siloed systems. Ripping and replacing your entire security stack isn’t practical.

Rockfish Toolkit is different. Deploy alongside your existing infrastructure to create an AI-ready data layer:

  • No replacement required - Add Rockfish to your network without changing existing tools
  • Deploy in minutes - Single binary or Docker container, no complex dependencies
  • Immediate AI compatibility - Output flows directly to any ML pipeline, SIEM, or AI assistant
  • Open data format - Apache Parquet works with DuckDB, Spark, Pandas, and every major analytics platform
  • S3-native - Scalable, cost-effective cloud storage

Why Parquet for Network Data?

Rockfish Toolkit captures network flows and exports them as Apache Parquet files - the same columnar format used by data science platforms, ML pipelines, and modern SIEM architectures:

Benefit | Description
Columnar storage | Fast analytical queries on specific fields
Schema enforcement | Consistent, typed data for ML models
70-90% compression | Reduced storage costs vs. raw logs
Universal compatibility | Works with DuckDB, Spark, Pandas, and AI frameworks
S3-native | Scalable, cost-effective cloud storage

This architecture enables security teams to add AI capabilities without rebuilding their entire SOC.

Why S3 Changes Everything

S3—and object storage generally—fundamentally changes what’s possible in cybersecurity by decoupling data collection from data analysis.

Traditional architectures force a painful tradeoff: either store everything and pay for expensive hot storage, or age out logs and lose forensic depth. S3 eliminates this with virtually unlimited, cheap, durable storage that can hold years of netflow, DNS logs, endpoint telemetry, and packet captures in columnar formats like Parquet.

This unlocks data science at scale:

  • Train anomaly detection models on months of baseline behavior
  • Run retrospective threat hunts when new IOCs emerge
  • Feed AI-driven SOC tools with the volume of data they need to learn patterns rather than just match signatures

You own your data:

The hive-partitioned, schema-on-read model means you’re not locked into a SIEM vendor’s data model. Your data lives in open formats, queryable by any tool—Athena, Spark, DuckDB, Pandas, or a custom Rust binary polling for new files.

When storage is cheap and permanent, detection becomes a software problem rather than a retention policy negotiation—and that shifts the advantage back to defenders.

What Rockfish Provides

Capability | Description
Network Flow Capture | High-performance packet capture with flow generation
Protocol Detection | Application-level protocol identification via nDPI
Device Fingerprinting | TLS/TCP fingerprints via nDPI for device identification
Threat Intelligence | IP reputation and risk scoring
Anomaly Detection | ML-based detection for enterprise deployments
MCP Integration | Query flows directly from AI assistants via Model Context Protocol

Use Cases

Rockfish Toolkit provides network visibility and AI-ready telemetry across diverse environments:

Environment | Use Case
Security Operations (SOC) | Threat detection, incident response, network forensics, AI-assisted investigation
IoT Networks | Device inventory, behavioral baselining, anomaly detection for connected devices
Industrial / Manufacturing | OT network monitoring, detecting unauthorized communications, compliance auditing
Robotics & Automation | Fleet communication analysis, identifying misconfigurations, performance monitoring
Healthcare | Medical device tracking, HIPAA compliance, detecting data exfiltration
SMB / Branch Offices | Affordable network visibility without enterprise SIEM costs
MSPs / MSSPs | Multi-tenant flow collection, centralized threat analysis across customers
Research & Education | Network traffic analysis, security research, ML model development

Components

Component | Description
rockfish_probe | Flow meter - captures packets and generates flow records
rockfish_mcp | MCP query server - SQL queries on Parquet files via DuckDB (Coming March 2025)
rockfish_detect | ML training and anomaly detection (Enterprise)
rockfish_intel | Threat intelligence caching server

Data Pipeline

Network Traffic
      |
      v
rockfish_probe  -->  Parquet Files  -->  S3
                           |
                           v
                    rockfish_mcp (DuckDB queries)
                           |
                           v
                    AI Assistants / SIEM / Analytics

Parquet Schema by Tier

Rockfish outputs flow data in Apache Parquet format. The schema varies by license tier:

Tier | Fields | Key Data
Community | 44 | 5-tuple, timing, traffic volumes, TCP flags, payload entropy
Basic | 54 | + nDPI application detection, GeoIP (country, city, ASN)
Professional | 60 | + GeoIP AS org, nDPI fingerprints
Enterprise | 63+ | + Anomaly scores, severity classification

Key Fields

All tiers include:

  • saddr, daddr - Source/destination IP addresses
  • sport, dport - Source/destination ports
  • proto - Protocol (TCP, UDP, ICMP)
  • spkts, dpkts, sbytes, dbytes - Traffic volumes
  • dur, rtt - Duration and round-trip time
  • sentropy, dentropy - Payload entropy (encrypted traffic detection)

Basic+ adds:

  • scountry, dcountry - Geographic country codes
  • scity, dcity - Geographic city names
  • sasn, dasn - Autonomous System Numbers
  • ndpi_appid - Application identifier (e.g., “TLS.YouTube”)
  • ndpi_risk_score - Risk scoring

Professional+ adds:

  • sasnorg, dasnorg - AS organization names
  • ndpi_ja4, ndpi_ja3s - TLS fingerprints for device identification
  • ndpi_tcp_fp - TCP fingerprint with OS detection hint
  • ndpi_fp - nDPI composite fingerprint

Enterprise adds:

  • anomaly_score - ML-derived anomaly score (0.0-1.0)
  • anomaly_severity - Classification (LOW, MEDIUM, HIGH, CRITICAL)

See Parquet Schema for complete field reference.
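
As a quick illustration of how these fields get used, a short DuckDB sketch (./flows is a placeholder path; ndpi_risk_score is available on Basic and above, and Enterprise deployments could order by anomaly_score instead):

import duckdb

# Surface the flows nDPI flagged as riskiest (sketch only)
print(duckdb.sql("""
    SELECT saddr, daddr, dport, ndpi_appid, ndpi_risk_score
    FROM read_parquet('./flows/*.parquet')
    WHERE ndpi_risk_score > 0
    ORDER BY ndpi_risk_score DESC
    LIMIT 10
""").df())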

License Tiers

Tier | Features
Community | Basic schema (44 fields), S3 upload
Basic | + nDPI labels, GeoIP (country, city, ASN), custom observation name (54 fields)
Professional | + GeoIP AS org, nDPI fingerprints (60 fields)
Enterprise | + ML models, anomaly detection

See License Tiers for detailed comparison.

Getting Started

  1. Installation - Install from download portal
  2. Quick Start - Capture your first flows
  3. Licensing - Activate your license

Support

Installation

Quick Install

curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

The installer auto-detects your platform and installs via the appropriate method (Debian package, Docker, or binary).

Options:

# Install specific version
ROCKFISH_VERSION=1.0.0 curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

# Force Docker installation
ROCKFISH_METHOD=docker curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

Manual Installation

Rockfish Toolkit is also available as a Debian package and Docker image from the Rockfish Networks download portal.

System Requirements

  • Operating System: Debian 11+, Ubuntu 20.04+, or Docker-compatible host
  • Architecture: x86_64 (amd64)
  • Memory: 2GB minimum (4GB+ recommended for high-traffic networks)
  • Storage: Depends on retention policy (10GB minimum)
  • Network: Interface with capture capabilities

Debian Package Installation

Download the toolkit package from the Rockfish download portal:

# Download the package
wget https://download.rockfishnetworks.com/rockfish_toolkit.deb

# Install
sudo dpkg -i rockfish_toolkit.deb

# Install dependencies if needed
sudo apt-get install -f

The rockfish_toolkit.deb package includes all Rockfish Toolkit binaries:

Binary | Description
rockfish_probe | Network flow meter
rockfish_mcp | MCP query server
rockfish_detect | ML anomaly detection (Enterprise)
rockfish_intel | Threat intelligence server

Installed Files

After installation:

Path | Description
/usr/bin/rockfish_* | Rockfish binaries
/etc/rockfish/ | Configuration directory
/var/lib/rockfish/ | Data directory
/var/log/rockfish/ | Log directory

Docker Installation

Pull the Rockfish Toolkit image from Docker Hub:

docker pull rockfishnetworks/toolkit:latest

The toolkit image includes all Rockfish Toolkit binaries. Specify the command to run the desired component.

Running the Probe

docker run -d \
  --name rockfish-probe \
  --network host \
  --cap-add NET_ADMIN \
  --cap-add NET_RAW \
  -v /etc/rockfish:/etc/rockfish:ro \
  -v /var/lib/rockfish:/var/lib/rockfish \
  rockfishnetworks/toolkit:latest \
  rockfish_probe -c /etc/rockfish/probe.yaml

Running the MCP Server

docker run -d \
  --name rockfish-mcp \
  -p 8080:8080 \
  -v /etc/rockfish:/etc/rockfish:ro \
  -v /var/lib/rockfish:/var/lib/rockfish:ro \
  rockfishnetworks/toolkit:latest \
  rockfish_mcp -c /etc/rockfish/mcp.yaml

Docker Compose

Example docker-compose.yml:

version: '3.8'

services:
  probe:
    image: rockfishnetworks/toolkit:latest
    network_mode: host
    cap_add:
      - NET_ADMIN
      - NET_RAW
    volumes:
      - ./config:/etc/rockfish:ro
      - ./data:/var/lib/rockfish
    command: ["rockfish_probe", "-c", "/etc/rockfish/probe.yaml"]
    restart: unless-stopped

  mcp:
    image: rockfishnetworks/toolkit:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config:/etc/rockfish:ro
      - ./data:/var/lib/rockfish:ro
    command: ["rockfish_mcp", "-c", "/etc/rockfish/mcp.yaml"]
    restart: unless-stopped

Verifying Installation

Check that the installation was successful:

# Check probe version
rockfish_probe --version

# Check MCP version
rockfish_mcp --version

Next Steps

Quick Start

This guide walks you through capturing network flows and querying them.

1. Capture Flows

From a PCAP File

# Basic capture to Parquet
rockfish_probe -i capture.pcap --parquet-dir ./flows

# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

Live Capture

# Standard libpcap capture (requires root)
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

# High-performance AF_PACKET capture (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows

With a Configuration File

# Create config.yaml (see Configuration docs)
rockfish_probe -c config.yaml

2. Verify Output

# Check generated files
ls -la flows/

# View file info with DuckDB
duckdb -c "DESCRIBE SELECT * FROM 'flows/*.parquet'"

3. Query with MCP

Set up the MCP server to query your flows:

# mcp-config.yaml
sources:
  flow:
    path: ./flows/
    description: Network flow data

output:
  default_format: table
  max_rows: 100

# Start MCP server
ROCKFISH_CONFIG=mcp-config.yaml rockfish_mcp

Example Queries

Using the MCP tools:

# Count total flows
count:
  source: flow

# Top talkers by bytes
query:
  source: flow
  sql: |
    SELECT saddr, SUM(sbytes + dbytes) as total_bytes
    FROM {source}
    GROUP BY saddr
    ORDER BY total_bytes DESC
    LIMIT 10

# Filter by protocol
query:
  source: flow
  filter: "proto = 'TCP'"
  limit: 50

4. Upload to S3 (Optional)

Configure S3 upload in your probe config:

output:
  parquet_dir: /var/lib/rockfish/flows

s3:
  bucket: my-flow-data
  region: us-east-1
  hive_partitioning: true
  delete_after_upload: true

Files are automatically uploaded and organized by date:

s3://my-flow-data/year=2025/month=01/day=28/rockfish-*.parquet
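
Once uploaded, the partitions can be queried in place. A sketch of what that looks like from Python, assuming read access to the bucket and AWS credentials in the environment:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # enables s3:// paths in DuckDB

# Count one day of flows directly from the Hive-partitioned layout shown above
print(con.execute("""
    SELECT COUNT(*) AS flows
    FROM read_parquet('s3://my-flow-data/year=2025/month=01/day=28/*.parquet',
                      hive_partitioning = true)
""").fetchdf())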

Next Steps

Licensing

Rockfish uses Ed25519-signed licenses with tier-based feature restrictions.

License Tiers

Tier | Features
Community | Basic schema (48 fields), local storage only
Basic | + nDPI labels, custom observation name
Professional | + GeoIP, nDPI fingerprints (60 fields)
Enterprise | + ML models, anomaly detection

License File

Licenses are JSON files with an Ed25519 signature:

{
  "id": "lic_abc123",
  "tier": "professional",
  "customer_email": "[email protected]",
  "company": "Example Corp",
  "observation": "sensor-01",
  "issued_at": "2025-01-01T00:00:00Z",
  "expires_at": "2026-01-01T00:00:00Z",
  "signature": "base64-encoded-signature"
}

Configuration

Specify the license file in your config:

license:
  path: /opt/rockfish/etc/license.json

Or via environment variable:

export ROCKFISH_LICENSE_PATH=/opt/rockfish/etc/license.json
rockfish_probe -c config.yaml

Feature Matrix

Feature | Community | Basic | Professional | Enterprise
Schema v1 (Simple) | Yes | Yes | Yes | Yes
Schema v2 (Extended) | No | No | Yes | Yes
GeoIP Fields | No | No | Yes | Yes
nDPI Fingerprints | No | No | Yes | Yes
nDPI Labeling | No | Yes | Yes | Yes
Custom Observation Domain | No | Yes | Yes | Yes
Anomaly Detection | No | No | No | Yes

Parquet Metadata

Licensed files include metadata for validation:

Key | Description
rockfish.license_id | License identifier
rockfish.tier | License tier
rockfish.company | Company name
rockfish.customer_email | Customer email
rockfish.issued_at | License issue date
rockfish.observation | Observation domain name
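
One way to inspect these keys is to read the Parquet footer directly, for example with pyarrow (an illustrative sketch; the filename is a placeholder):

import pyarrow.parquet as pq

# Key/value metadata lives in the Parquet file footer
meta = pq.read_metadata("rockfish-flow-000001.parquet").metadata or {}

for key, value in meta.items():
    key = key.decode()
    if key.startswith("rockfish."):
        print(f"{key} = {value.decode()}")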

MCP License Validation

Rockfish MCP can validate that Parquet files were generated by a licensed probe:

sources:
  licensed_flows:
    path: s3://data/flows/
    description: Licensed network flow data
    require_license: true

  enterprise_flows:
    path: s3://data/enterprise/
    description: Enterprise flow data
    require_license: true
    allowed_license_ids:
      - "lic_abc123"
      - "lic_def456"

Obtaining a License

Contact [email protected] for license inquiries.

Probe Overview

Rockfish Probe is a high-performance flow meter that captures network traffic and generates flow records in Apache Parquet format.

Features

  • Packet capture via libpcap - Live interface capture or PCAP file reading
  • High-performance AF_PACKET - Linux TPACKET_V3 with mmap ring buffer
  • Fragment reassembly - Reassembles fragmented IP packets
  • Bidirectional flows - Forward and reverse direction tracking
  • nDPI integration - Application protocol detection
  • GeoIP lookups - Geographic location via MaxMind databases
  • IP reputation - AbuseIPDB integration with local caching
  • S3 upload - Automatic upload to S3-compatible storage

Output Format

Flow records follow IPFIX Information Element naming conventions (RFC 5102/5103):

{
  "flowStartMilliseconds": "2025-01-15T10:30:00.000Z",
  "flowEndMilliseconds": "2025-01-15T10:30:05.123Z",
  "flowDurationMilliseconds": 5123,
  "ipVersion": 4,
  "protocolIdentifier": 6,
  "sourceIPAddress": "192.168.1.100",
  "sourceTransportPort": 54321,
  "destinationIPAddress": "93.184.216.34",
  "destinationTransportPort": 443,
  "octetTotalCount": 1234,
  "packetTotalCount": 15,
  "applicationName": "TLS"
}

Basic Usage

# Read from PCAP file
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Live capture with libpcap
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

# High-performance AF_PACKET (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows

# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

Next Steps

Configuration Reference

Rockfish Probe uses YAML configuration files. Command-line arguments override config file settings.

# Run with configuration file
rockfish_probe -c /path/to/config.yaml

# Override settings via CLI
rockfish_probe -c config.yaml --source eth1

Configuration Sections


License

license:
  path: /opt/rockfish/etc/license.json

Option | Type | Default | Description
path | string | - | Path to license file (JSON with Ed25519 signature)

Environment Variable: ROCKFISH_LICENSE_PATH


Input

input:
  source: eth0
  live_type: afpacket
  filter: "tcp or udp"
  snaplen: 65535
  promisc_off: false

Option | Type | Default | Description
source | string | (required) | Interface name or PCAP file path/glob
live_type | string | pcap | Capture method: pcap, afpacket, netmap, fmadio
filter | string | - | BPF filter expression
snaplen | int | 65535 | Maximum bytes per packet
promisc_off | bool | false | Disable promiscuous mode

BPF Filter Examples

# TCP and UDP only
filter: "tcp or udp"

# HTTP and HTTPS
filter: "port 80 or port 443"

# Specific subnet
filter: "net 192.168.1.0/24"

# Exclude SSH
filter: "not port 22"

Flow

flow:
  idle_timeout: 300
  active_timeout: 1800
  max_flows: 0
  max_payload: 500
  udp_uniflow_port: 0
  mac: true

Option | Type | Default | Description
idle_timeout | int | 300 | Seconds of inactivity before flow expires
active_timeout | int | 1800 | Maximum flow duration before export
max_flows | int | 0 | Maximum concurrent flows (0 = unlimited)
max_payload | int | 500 | Max payload bytes for protocol detection
udp_uniflow_port | int | 0 | UDP uniflow mode (0=off, 1=all)
mac | bool | true | Include MAC addresses

Note: TLS/TCP fingerprints (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp) are automatically extracted when nDPI is enabled and included in Professional+ tier output.


nDPI

ndpi:
  enabled: true
  protocol_file: /opt/rockfish/etc/ndpi-protos.txt
  categories_file: /opt/rockfish/etc/ndpi-categories.txt

Option | Type | Default | Description
enabled | bool | false | Enable nDPI application labeling
protocol_file | string | - | Custom protocol definitions
categories_file | string | - | Custom category definitions

Note: nDPI is included in all Rockfish packages (Basic tier and above).


Fragment

fragment:
  disabled: false
  max_tables: 1024
  timeout: 30

Option | Type | Default | Description
disabled | bool | false | Disable IP fragment reassembly
max_tables | int | 1024 | Max concurrent fragment tables
timeout | int | 30 | Fragment timeout in seconds

Output

output:
  parquet_dir: /var/run/rockfish/flows
  parquet_batch_size: 1000000
  parquet_file_prefix: rockfish-flow
  parquet_schema: simple
  observation: sensor-01
  hive_boundary_flush: false
  stats: true
  verbose: 1
  log_file: /var/log/rockfish/rockfish.log

Option | Type | Default | Description
parquet_dir | string | (required) | Output directory for Parquet files
parquet_batch_size | int | 1000000 | Max flows per file before rotation
parquet_file_prefix | string | rockfish-flow | Filename prefix
parquet_schema | string | simple | Schema: simple (50 fields) or extended (62 fields)
observation | string | gnat | Observation domain name
hive_boundary_flush | bool | false | Flush at day boundaries for Hive partitioning
verbose | int | 1 | 0=warnings, 1=info, 2=debug, 3=trace
log_file | string | - | Log file path (enables daily rotation)

AFPacket

Linux high-performance capture:

afpacket:
  block_size: 2097152
  block_count: 64
  fanout_group: 0
  fanout_mode: hash

Option | Type | Default | Description
block_size | int | 2097152 | Ring buffer block size (bytes)
block_count | int | 64 | Number of ring buffer blocks
fanout_group | int | 0 | Fanout group ID (0 = disabled)
fanout_mode | string | hash | Distribution: hash, lb, cpu, rollover, random

Memory: block_size × block_count (default: 128 MB)


Netmap

FreeBSD high-performance capture:

netmap:
  rx_slots: 1024
  tx_slots: 1024
  poll_timeout: 1000
  host_rings: false

S3

s3:
  bucket: my-flow-bucket
  prefix: flows
  region: us-east-1
  endpoint: https://nyc3.digitaloceanspaces.com
  force_path_style: false
  hive_partitioning: true
  delete_after_upload: true
  aggregate: true
  aggregate_hold_minutes: 5

Option | Type | Default | Description
bucket | string | (required) | S3 bucket name
prefix | string | - | S3 key prefix
region | string | (required) | AWS region
endpoint | string | - | Custom endpoint (MinIO, DO Spaces, etc.)
force_path_style | bool | false | Use path-style URLs (required for MinIO)
hive_partitioning | bool | false | Organize by year=/month=/day=/
delete_after_upload | bool | false | Delete local files after upload
aggregate | bool | false | Merge files per minute before upload
aggregate_hold_minutes | int | 1 | Hold time before aggregating

GeoIP

geoip:
  country_db: /opt/rockfish/etc/GeoLite2-Country.mmdb
  city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
  asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb

Note: Requires --features geoip and MaxMind databases.


Threat Intel

threat_intel:
  enabled: true
  endpoint_url: "http://localhost:8080"
  api_token: "your-api-token"
  batch_size: 100
  timeout_seconds: 10

Option | Type | Default | Description
enabled | bool | false | Enable threat intel lookups
endpoint_url | string | (required) | API endpoint URL
api_token | string | (required) | Bearer token for authentication
batch_size | int | 100 | IPs per API request
timeout_seconds | int | 10 | Request timeout

Output goes to <parquet_dir>/intel/.


Complete Example

license:
  path: /opt/rockfish/etc/license.json

input:
  source: eth0
  live_type: afpacket
  filter: "tcp or udp"

flow:
  idle_timeout: 300
  active_timeout: 1800
  max_flows: 1000000
  max_payload: 500

ndpi:
  enabled: true  # Fingerprints (ndpi_ja4, ndpi_ja3s) extracted automatically

output:
  parquet_dir: /var/run/rockfish/flows
  observation: sensor-01
  hive_boundary_flush: true

afpacket:
  block_size: 2097152
  block_count: 64

s3:
  bucket: flow-data
  prefix: sensors/sensor-01
  region: us-east-1
  hive_partitioning: true
  delete_after_upload: true

geoip:
  city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
  asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb

Capture Modes

Rockfish Probe supports multiple capture backends for different platforms and performance requirements.

Capture Types

Type | Platform | Description
pcap | All | Standard libpcap (portable)
afpacket | Linux | AF_PACKET with TPACKET_V3 (high-performance)
netmap | FreeBSD | Netmap framework (high-performance)
fmadio | Linux | FMADIO appliance ring buffer

libpcap (Default)

The most portable option, works on all platforms.

input:
  source: eth0
  live_type: pcap
  filter: "tcp or udp"
  snaplen: 65535

sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

Pros

  • Works everywhere (Linux, FreeBSD, macOS)
  • Supports BPF filters
  • Well-documented

Cons

  • Lower performance than kernel-bypass methods
  • Copies packets through kernel

AF_PACKET (Linux)

High-performance capture using Linux’s TPACKET_V3 with memory-mapped ring buffers.

input:
  source: eth0
  live_type: afpacket

afpacket:
  block_size: 2097152    # 2 MB blocks
  block_count: 64        # 128 MB total ring
  fanout_group: 0        # 0 = disabled
  fanout_mode: hash

sudo rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 2097152 \
    --afp-block-count 64 \
    --parquet-dir ./flows

Ring Buffer Sizing

Total Ring Buffer = block_size × block_count
Default: 2 MB × 64 = 128 MB

For 10 Gbps+:

afpacket:
  block_size: 4194304   # 4 MB
  block_count: 128      # 512 MB total

Fanout Mode

Distribute packets across multiple processes:

afpacket:
  fanout_group: 1       # Non-zero enables fanout
  fanout_mode: hash     # Distribute by flow hash

Mode | Description
hash | By flow hash (recommended for flow analysis)
lb | Round-robin load balancing
cpu | By receiving CPU
rollover | Fill one socket, then next
random | Random distribution

Multi-Process Capture

Run multiple instances with the same fanout group:

# Terminal 1
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 -o flows1/

# Terminal 2
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 -o flows2/

Netmap (FreeBSD)

High-performance capture using FreeBSD’s netmap framework.

input:
  source: em0
  live_type: netmap

netmap:
  rx_slots: 1024
  tx_slots: 1024
  poll_timeout: 1000
  host_rings: false

Option | Default | Description
rx_slots | driver default | RX ring slot count
tx_slots | driver default | TX ring slot count
poll_timeout | 1000 | Poll timeout (ms)
host_rings | false | Enable host stack access

FMADIO (Linux)

Capture from FMADIO 100G packet capture appliances.

input:
  source: ring0
  live_type: fmadio

fmadio:
  ring_path: /opt/fmadio/queue/lxc_ring0
  include_fcs_errors: false

Note: FMADIO support is included in all Rockfish packages.

Reading PCAP Files

Process existing capture files:

# Single file
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Multiple files with glob
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows

# With application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

BPF Filters

All capture modes support BPF filters (except FMADIO):

input:
  filter: "tcp or udp"

Common filters:

# Web traffic only
--filter "port 80 or port 443"

# Specific subnet
--filter "net 10.0.0.0/8"

# Exclude broadcast
--filter "not broadcast"

# DNS traffic
--filter "port 53"

Choosing a Capture Mode

Requirement | Recommended Mode
Portability | pcap
Linux high-speed (1-10 Gbps) | afpacket
Linux 40-100 Gbps | afpacket with large ring + fanout
FreeBSD high-speed | netmap
FMADIO appliance | fmadio

Next Steps

Performance Tuning

Optimize Rockfish Probe for high-speed network capture.

AF_PACKET Tuning

Ring Buffer Size

For 10 Gbps+ capture, increase the ring buffer:

afpacket:
  block_size: 4194304   # 4 MB per block
  block_count: 128      # 512 MB total ring buffer

Use Fanout for Multi-Queue NICs

Modern NICs have multiple RX queues. Use fanout to utilize all cores:

# Run multiple instances with same fanout group
taskset -c 0 rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 --parquet-dir ./flows1 &

taskset -c 1 rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 --parquet-dir ./flows2 &

Use hash fanout mode to keep flows together.

CPU Pinning

Pin to specific CPU cores:

taskset -c 0 rockfish_probe -i eth0 --live afpacket ...

Or use CPU isolation:

# /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=0,1"

System Tuning

Socket Buffers

Increase kernel buffer sizes:

# Temporary
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.rmem_default=134217728

# Permanent (/etc/sysctl.conf)
net.core.rmem_max=134217728
net.core.rmem_default=134217728

Network Budget

Increase NAPI budget for high packet rates:

sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000

IRQ Affinity

Distribute NIC interrupts across CPUs:

# Find NIC IRQs
cat /proc/interrupts | grep eth0

# Set affinity (example for 4 queues)
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
echo 8 > /proc/irq/27/smp_affinity

Or use irqbalance with proper configuration.

Disable CPU Power Saving

Prevent CPU frequency scaling:

# Set performance governor
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $cpu
done

Flow Table Sizing

Limit memory usage under high connection rates:

flow:
  max_flows: 1000000    # Limit to 1M concurrent flows
  idle_timeout: 60      # Shorter timeout for faster cleanup

Parquet Output Tuning

Batch Size

Larger batches = fewer files, better compression:

output:
  parquet_batch_size: 2000000   # 2M flows per file

S3 Aggregation

Reduce small file overhead:

s3:
  aggregate: true
  aggregate_hold_minutes: 5   # Merge files for 5 minutes
  delete_after_upload: true

Monitoring

Statistics Output

Enable periodic statistics:

output:
  stats: true
  verbose: 2    # Debug level

Key Metrics to Watch

  • Packets/sec: Compare with NIC stats (ethtool -S eth0)
  • Drops: Check for ring buffer overflows
  • Flows/sec: Flow export rate
  • Memory usage: Monitor with top or htop

Check for Drops

# NIC drops
ethtool -S eth0 | grep -i drop

# Kernel drops
cat /proc/net/dev | grep eth0

# AF_PACKET drops
cat /proc/net/packet

Hardware Recommendations

NIC Selection

For high-speed capture:

  • Intel X710/XL710 (40 GbE)
  • Intel E810 (100 GbE)
  • Mellanox ConnectX-5/6

Enable RSS (Receive Side Scaling) for multi-queue distribution.

CPU

  • Modern Intel Xeon or AMD EPYC
  • At least 1 core per 10 Gbps
  • Large L3 cache helps

Storage

For sustained capture:

  • NVMe SSD for local Parquet files
  • Fast S3-compatible storage with adequate bandwidth

Example: 10 Gbps Configuration

license:
  path: /opt/rockfish/etc/license.json

input:
  source: eth0
  live_type: afpacket

flow:
  idle_timeout: 120
  active_timeout: 900
  max_flows: 2000000
  max_payload: 256

afpacket:
  block_size: 4194304
  block_count: 128
  fanout_group: 1
  fanout_mode: hash

output:
  parquet_dir: /data/flows
  parquet_batch_size: 2000000
  observation: sensor-01

s3:
  bucket: flow-data
  region: us-east-1
  aggregate: true
  aggregate_hold_minutes: 2
  delete_after_upload: true

Run with CPU pinning:

sudo taskset -c 0-3 rockfish_probe -c config.yaml

IP Reputation

Rockfish Probe integrates with threat intelligence services for IP reputation lookups.

Overview

Two approaches are available:

Feature | ip_reputation | threat_intel
Provider | Direct AbuseIPDB | External API server
Caching | Local in-memory | Server-side
Rate limits | Managed locally | Server manages
Best for | Single sensor | Multiple sensors

These features are mutually exclusive.

IP Reputation (Direct AbuseIPDB)

Query AbuseIPDB directly with local caching.

Configuration

ip_reputation:
  enabled: true
  api_key: "your-abuseipdb-api-key"
  cache_ttl_hours: 24
  max_age_in_days: 90
  s3_upload: true

Option | Default | Description
enabled | false | Enable IP reputation lookups
api_key | (required) | AbuseIPDB API key
output_dir | <parquet_dir>/ip_reputation | Output directory
cache_ttl_hours | 24 | Cache entry lifetime
max_age_in_days | 90 | Max age for AbuseIPDB reports
s3_upload | false | Upload parquet files to S3

How It Works

  1. For each flow, source and destination IPs are queued for lookup
  2. Lookups run in a background thread
  3. Results are cached in memory with reference counting
  4. Cache is exported to Parquet every hour

Rate Limiting

AbuseIPDB free tier: 1000 requests/day.

When rate-limited (HTTP 429):

  1. API requests pause
  2. Local cache continues serving
  3. Resumes at the next hour boundary
  4. Repeats if still rate-limited
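
A minimal sketch of that back-off behavior (illustrative Python only; the probe implements this internally, and query_abuseipdb is a hypothetical stand-in for the actual API call):

import time

def query_abuseipdb(ip: str):
    """Hypothetical stand-in for the real AbuseIPDB check endpoint."""
    return 200, {"ip": ip, "abuse_confidence_score": 0}

def next_hour_boundary(now: float) -> float:
    """Epoch time of the next top-of-hour."""
    return (int(now) // 3600 + 1) * 3600

paused_until = 0.0
cache: dict[str, dict] = {}

def lookup(ip: str):
    global paused_until
    if ip in cache:                        # cached results keep serving
        return cache[ip]
    if time.time() < paused_until:         # rate-limited: skip the API call
        return None
    status, result = query_abuseipdb(ip)
    if status == 429:                      # HTTP 429: pause until next hour
        paused_until = next_hour_boundary(time.time())
        return None
    cache[ip] = result
    return result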

Output Schema

Hourly Parquet exports include:

Field | Type | Description
ip_address | String | IP address
abuse_confidence_score | Int32 | Score (0-100)
country_code | String | Country code
isp | String | ISP name
domain | String | Associated domain
total_reports | Int32 | Total abuse reports
last_reported_at | Timestamp | Last report time
is_whitelisted | Boolean | Whitelisted status
reference_count | Int64 | Times seen in flows
first_seen | Timestamp | First flow occurrence
last_seen | Timestamp | Last flow occurrence
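
These hourly exports are plain Parquet, so they can be queried or joined like the flow data itself. For example (a sketch; the path assumes parquet_dir is ./flows and output_dir was left at its default):

import duckdb

# Most concerning IPs seen in local traffic, from the hourly reputation export
print(duckdb.sql("""
    SELECT ip_address, abuse_confidence_score, total_reports, reference_count
    FROM read_parquet('./flows/ip_reputation/*.parquet')
    WHERE abuse_confidence_score >= 50
    ORDER BY abuse_confidence_score DESC, reference_count DESC
    LIMIT 20
""").df())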

Threat Intel (External API)

Use an external threat intelligence server (e.g., rockfish_intel) for centralized lookups.

Configuration

threat_intel:
  enabled: true
  endpoint_url: "http://localhost:8080"
  api_token: "your-api-token"
  batch_size: 100
  timeout_seconds: 10

Option | Default | Description
enabled | false | Enable threat intel lookups
endpoint_url | (required) | API server URL
api_token | (required) | Bearer token
batch_size | 100 | IPs per request
timeout_seconds | 10 | Request timeout

Benefits

  • Centralized caching: Share cache across multiple sensors
  • Rate limit management: Server handles provider limits
  • Multiple providers: Server can aggregate multiple sources

Output

Threat intel Parquet files are written to <parquet_dir>/intel/.

With S3 and Hive partitioning:

s3://bucket/prefix/intel/year=YYYY/month=MM/day=DD/filename.parquet

Setup with rockfish_intel

  1. Start the intel server with your AbuseIPDB key
  2. Create a client entry in clients.yaml
  3. Configure the probe:

threat_intel:
  enabled: true
  endpoint_url: "http://threatintel-server:8080"
  api_token: "client-token-from-clients-yaml"

Choosing Between Options

Scenario | Recommendation
Single sensor, simple setup | ip_reputation
Multiple sensors | threat_intel + rockfish_intel
Enterprise with custom providers | threat_intel
Limited API quota | threat_intel (shared cache)

Getting an AbuseIPDB API Key

  1. Create account at abuseipdb.com
  2. Go to API settings
  3. Generate API key

Free tier: 1000 checks/day
Paid tiers: Higher limits, additional features

MCP Overview

Coming Soon: Rockfish MCP is currently under development and will be available in March 2025.

Rockfish MCP is a Model Context Protocol (MCP) server for querying Parquet files using DuckDB.

Features

  • SQL queries via DuckDB - Full SQL support for Parquet files
  • S3 support - AWS, MinIO, Cloudflare R2, DigitalOcean Spaces
  • Configurable data sources - Abstract file locations from API
  • Multiple output formats - JSON, JSON Lines, CSV, Table
  • TLS support - Secure connections for remote access
  • HTTP/WebSocket mode - Standard HTTP with Bearer token auth
  • License validation - Verify Parquet files were generated by licensed probes

Operation Modes

Mode | Transport | Use Case
stdio | stdin/stdout | Claude Desktop, local tools
TLS | Raw TCP+TLS | Custom integrations
HTTP | HTTPS+WebSocket | Web clients, standard tooling

Built-in Tools

Tool | Description
list_sources | List configured data sources
schema | Get column names and types
query | Query with filters and column selection
aggregate | Group and aggregate data
sample | Get random sample rows
count | Count rows with optional filter

Quick Example

# config.yaml
sources:
  flow:
    path: s3://security-data/netflow/
    description: Network flow data

output:
  default_format: json
  max_rows: 1000

ROCKFISH_CONFIG=config.yaml rockfish_mcp

Query example:

query:
  source: flow
  columns: [saddr, daddr, sbytes, dbytes]
  filter: "sbytes > 1000000"
  limit: 50

License Validation

Rockfish MCP will validate that Parquet files were generated by a licensed rockfish_probe. Each Parquet file includes signed metadata:

  • rockfish.license_id - License identifier
  • rockfish.tier - License tier (Community, Basic, Professional, Enterprise)
  • rockfish.company - Company name
  • rockfish.observation - Observation domain name

Configure validation per data source:

sources:
  prod_flows:
    path: s3://data/flows/
    require_license: true              # Reject unlicensed files
    allowed_license_ids:               # Optional: restrict to specific licenses
      - "lic_abc123"

Next Steps

MCP Setup

Configure Rockfish MCP for different deployment scenarios.

Configuration File

Create a config.yaml:

# S3 credentials (optional)
s3:
  region: us-east-1
  # access_key_id: your-key
  # secret_access_key: your-secret
  # endpoint: localhost:9000  # For MinIO/R2

# Output settings
output:
  default_format: json
  max_rows: 1000
  pretty_print: true

# Data source mappings
sources:
  flow:
    path: s3://security-data/netflow/
    description: Network flow data
    require_license: true

  ip_reputation:
    path: /data/threat-intel/ip-reputation.parquet
    description: IP reputation scores

stdio Mode (Default)

For Claude Desktop or local tools.

Claude Desktop Configuration

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/claude/claude_desktop_config.json

{
  "mcpServers": {
    "rockfish": {
      "command": "/path/to/rockfish-mcp",
      "env": {
        "ROCKFISH_CONFIG": "/path/to/config.yaml"
      }
    }
  }
}

HTTP/WebSocket Mode

For web applications and standard HTTP clients.

Quick Start

  1. Generate self-signed certificate:

    ./generate-self-signed-cert.sh
    
  2. Generate API key and hash:

    API_KEY=$(openssl rand -base64 32)
    echo "API Key: $API_KEY"
    echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"
    
  3. Configure config.yaml:

    tls:
      enabled: true
      http_mode: true
      bind_address: "0.0.0.0:8443"
      cert_path: "./certs/cert.pem"
      key_path: "./certs/key.pem"
      auth:
        api_keys:
          - name: "web-client"
            key_hash: "paste-hash-here"
    
  4. Run the server:

    ROCKFISH_CONFIG=config.yaml rockfish_mcp
    
  5. Connect:

    python examples/python_client_bearer_auth.py \
      --host localhost --port 8443 \
      --token "$API_KEY" --skip-verify
    

Plain HTTP Mode (Development)

For local development or behind a reverse proxy:

tls:
  enabled: true
  http_mode: true
  disable_tls: true  # No encryption
  bind_address: "127.0.0.1:8080"
  auth:
    api_keys:
      - name: "dev-client"
        key_hash: "your-hash-here"

Warning: Only use plain HTTP for local development or behind a TLS-terminating proxy.

TLS Server Mode

For custom integrations with raw TLS connections.

tls:
  enabled: true
  http_mode: false  # Raw TLS mode
  bind_address: "127.0.0.1:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "production-client"
        key_hash: "your-key-hash-here"

License Validation

Require Parquet files to have valid Rockfish license metadata:

sources:
  # Any valid Rockfish license
  licensed_flows:
    path: s3://data/flows/
    description: Licensed network flow data
    require_license: true

  # Specific license IDs only
  enterprise_flows:
    path: s3://data/enterprise/
    description: Enterprise flow data
    require_license: true
    allowed_license_ids:
      - "lic_abc123"
      - "lic_def456"

  # No validation (default)
  public_data:
    path: /data/public/
    description: Public datasets

Rockfish Probe embeds license metadata in Parquet files:

  • rockfish.license.id
  • rockfish.license.tier
  • rockfish.license.customer_email
  • rockfish.license.issued_at

Environment Variables

Variable | Description
ROCKFISH_CONFIG | Path to config.yaml
AWS_ACCESS_KEY_ID | AWS credentials
AWS_SECRET_ACCESS_KEY | AWS credentials
AWS_REGION | AWS region

Testing

# Start server
ROCKFISH_CONFIG=config.yaml rockfish_mcp

# Test with curl (HTTP mode)
curl -X POST https://localhost:8443/mcp \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

Next Steps

Authentication

Rockfish MCP supports multiple authentication mechanisms.

Overview

Method | Transport | Description
API Key (JSON) | Raw TLS | JSON frame before MCP session
Bearer Token | HTTP/WS | Standard Authorization header
Mutual TLS (mTLS) | Any TLS | Client certificate verification

These can be combined for defense-in-depth.

Bearer Token Authentication (HTTP Mode)

Standard HTTP authentication using Authorization: Bearer <token> header.

Setup

  1. Generate API key and hash:

    API_KEY=$(openssl rand -base64 32)
    echo "API Key: $API_KEY"
    echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"
    
  2. Configure:

    tls:
      enabled: true
      http_mode: true
      bind_address: "0.0.0.0:8443"
      cert_path: "./certs/cert.pem"
      key_path: "./certs/key.pem"
      auth:
        api_keys:
          - name: "production-client"
            key_hash: "a1b2c3d4e5f6..."
    

Client Examples

Python (websockets):

import asyncio
import websockets
import json

async def connect():
    uri = "wss://localhost:8443/mcp"
    headers = {"Authorization": "Bearer your-api-key"}

    async with websockets.connect(uri, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "python-client", "version": "1.0"}
            }
        }))
        print(await ws.recv())

asyncio.run(connect())

JavaScript/Node.js:

const WebSocket = require('ws');

const ws = new WebSocket('wss://localhost:8443/mcp', {
  headers: { 'Authorization': 'Bearer your-api-key' },
  rejectUnauthorized: true  // false for self-signed certs
});

ws.on('open', () => {
  ws.send(JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      capabilities: {},
      clientInfo: { name: 'nodejs-client', version: '1.0' }
    }
  }));
});

ws.on('message', data => console.log(data.toString()));

cURL:

curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Authorization: Bearer your-api-key" \
  -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" \
  -H "Sec-WebSocket-Version: 13" \
  https://localhost:8443/mcp

API Key Authentication (TLS Mode)

JSON-based authentication for raw TLS connections.

Protocol

  1. Client connects via TLS
  2. Client sends: {"api_key": "your-secret-key"}\n
  3. Server responds: {"success": true/false, "message": "..."}\n
  4. MCP session proceeds if successful

Configuration

tls:
  enabled: true
  http_mode: false  # Raw TLS mode
  bind_address: "127.0.0.1:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "production-client"
        key_hash: "sha256-hash-here"

Client Example

import socket
import ssl
import json

context = ssl.create_default_context()
sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")

# Authenticate
auth = {"api_key": "your-secret-key"}
tls_sock.sendall((json.dumps(auth) + "\n").encode())

response = json.loads(tls_sock.recv(4096).decode().strip())
if not response["success"]:
    raise Exception(f"Auth failed: {response['message']}")

# Proceed with MCP protocol...

Mutual TLS (mTLS)

Transport-level authentication using client certificates.

Create CA and Client Certificates

# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -new -x509 -key ca-key.pem -out ca-cert.pem -days 3650 \
  -subj "/CN=Rockfish MCP CA/O=Your Org"

# Generate client certificate
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
  -subj "/CN=client1/O=Your Org"
openssl x509 -req -in client.csr -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out client-cert.pem -days 365

Configuration

tls:
  enabled: true
  bind_address: "0.0.0.0:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    require_client_cert: true
    client_ca_cert_path: "./certs/ca-cert.pem"

Client Example

import ssl
import socket

context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.load_cert_chain(
    certfile="client-cert.pem",
    keyfile="client-key.pem"
)
context.load_verify_locations(cafile="server-ca-cert.pem")

sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")
# Connection authenticated via mTLS

Combining Authentication Methods

For maximum security, use both mTLS and API keys:

tls:
  enabled: true
  bind_address: "0.0.0.0:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    require_client_cert: true
    client_ca_cert_path: "./certs/ca-cert.pem"
    api_keys:
      - name: "production-client"
        key_hash: "a1b2c3d4e5f6..."

Both must succeed for authorization.

Security Best Practices

API Keys

  • Generate with sufficient entropy: openssl rand -base64 32
  • One key per client for audit/revocation
  • Rotate regularly
  • Never store plain-text keys in config

mTLS

  • Protect CA private key: chmod 600 ca-key.pem
  • Use short certificate lifetimes (90 days)
  • Implement certificate revocation
  • Unique certificates per client

General

  • Use TLS in production
  • Implement rate limiting
  • Monitor authentication logs
  • Use network segmentation

Troubleshooting

Error | Solution
“Authentication failed” | Verify key matches hash
“Invalid auth request format” | Check JSON format, ensure \n at end
“Client certificate verification failed” | Check cert signed by configured CA
“require_client_cert without client_ca_cert_path” | Add CA path to config

Utility: Generate API Key

#!/bin/bash
API_KEY=$(openssl rand -base64 32)
KEY_HASH=$(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)

echo "API Key: $API_KEY"
echo "Hash: $KEY_HASH"
echo ""
echo "Config entry:"
echo "  - name: \"client-name\""
echo "    key_hash: \"$KEY_HASH\""

Tools & Queries

Rockfish MCP provides SQL-based tools for querying Parquet data.

Available Tools

Tool | Description
list_sources | List configured data sources
schema | Get column names and types
query | Query with filters and column selection
aggregate | Group and aggregate data
sample | Get random sample rows
count | Count rows with optional filter

list_sources

List all configured data sources.

list_sources: {}

Response:

{
  "sources": [
    {"name": "flow", "description": "Network flow data"},
    {"name": "ip_reputation", "description": "IP reputation scores"}
  ]
}

schema

Get column names and types for a data source.

schema:
  source: flow
  format: table

Parameters:

Name | Required | Description
source | Yes | Data source name
format | No | Output format (default: table)

query

Query with filtering, column selection, and custom SQL.

Basic Query

query:
  source: flow
  columns: [saddr, daddr, sbytes, dbytes]
  filter: "sbytes > 1000000"
  limit: 50
  format: json

Parameters:

Name | Required | Description
source | Yes | Data source name
columns | No | Columns to select (default: all)
filter | No | WHERE clause condition
order_by | No | ORDER BY clause
limit | No | Maximum rows
format | No | Output format

Custom SQL

Use {source} placeholder for the data source:

query:
  source: flow
  sql: |
    SELECT saddr, COUNT(*) as connection_count, SUM(sbytes) as total_bytes
    FROM {source}
    GROUP BY saddr
    ORDER BY total_bytes DESC
    LIMIT 10

Time-based Queries

query:
  source: flow
  filter: "stime >= '2025-01-01' AND stime < '2025-01-02'"
  columns: [stime, saddr, daddr, proto]

Protocol Filtering

query:
  source: flow
  filter: "proto = 'TCP' AND dport = 443"
  columns: [saddr, daddr, ndpi_appid]

aggregate

Group and aggregate data.

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: sum
      column: sbytes
      alias: total_bytes
    - function: count
      alias: connection_count
  filter: "proto = 'TCP'"
  order_by: "total_bytes DESC"
  limit: 20
  format: table

Parameters:

Name | Required | Description
source | Yes | Data source name
group_by | Yes | Columns to group by
aggregations | Yes | Aggregation functions
filter | No | WHERE clause
order_by | No | ORDER BY clause
limit | No | Maximum rows

Aggregation Functions

Function | Description
count | Count rows
sum | Sum values
avg | Average
min | Minimum
max | Maximum
count_distinct | Count unique values

Examples

Top destination ports by traffic:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: sum
      column: sbytes + dbytes
      alias: total_bytes
    - function: count
      alias: flows
  order_by: "total_bytes DESC"
  limit: 10

Flows by country (requires GeoIP):

aggregate:
  source: flow
  group_by: [scountry, dcountry]
  aggregations:
    - function: count
      alias: flow_count
  filter: "scountry IS NOT NULL"

sample

Get random sample rows.

sample:
  source: flow
  n: 10
  format: json

Parameters:

Name | Required | Description
source | Yes | Data source name
n | No | Number of rows (default: 10)
format | No | Output format

count

Count rows with optional filter.

count:
  source: flow
  filter: "ndpi_risk_score > 50"

Parameters:

Name | Required | Description
source | Yes | Data source name
filter | No | WHERE clause

Output Formats

Format | Description
json | Pretty-printed JSON array
jsonl / json_lines / ndjson | Newline-delimited JSON
csv | CSV with header
table / text | ASCII table

Common Query Patterns

Top Talkers

query:
  source: flow
  sql: |
    SELECT saddr,
           COUNT(*) as flows,
           SUM(sbytes) as sent,
           SUM(dbytes) as received
    FROM {source}
    GROUP BY saddr
    ORDER BY sent + received DESC
    LIMIT 20

DNS Traffic

query:
  source: flow
  filter: "dport = 53 OR sport = 53"
  columns: [stime, saddr, daddr, sbytes, dbytes]

High-Risk Flows

query:
  source: flow
  filter: "ndpi_risk_score > 100"
  columns: [stime, saddr, daddr, ndpi_appid, ndpi_risk_list]

Long-Duration Flows

query:
  source: flow
  filter: "dur > 3600000"  # > 1 hour in ms
  columns: [stime, etime, dur, saddr, daddr, sbytes, dbytes]
  order_by: "dur DESC"

External Traffic

query:
  source: flow
  filter: "NOT (saddr LIKE '10.%' OR saddr LIKE '192.168.%')"
  columns: [saddr, daddr, scountry, dcountry]

Application Distribution

aggregate:
  source: flow
  group_by: [ndpi_appid]
  aggregations:
    - function: count
      alias: flows
    - function: sum
      column: sbytes + dbytes
      alias: bytes
  filter: "ndpi_appid IS NOT NULL"
  order_by: "bytes DESC"
  limit: 20

S3 Configuration

Configure Rockfish MCP to query Parquet files from S3-compatible storage.

AWS S3

Default Credentials

If the s3 section is omitted, DuckDB uses AWS credentials from:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. ~/.aws/credentials
  3. IAM role (EC2/ECS)

sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows

Explicit Credentials

s3:
  region: us-east-1
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Security Note: Prefer environment variables or IAM roles over config file credentials.

MinIO

Self-hosted S3-compatible storage.

s3:
  endpoint: localhost:9000
  access_key_id: minioadmin
  secret_access_key: minioadmin
  use_ssl: false
  url_style: path  # Required for MinIO

sources:
  flow:
    path: s3://my-bucket/flows/

DigitalOcean Spaces

s3:
  endpoint: nyc3.digitaloceanspaces.com
  region: nyc3
  access_key_id: your-spaces-key
  secret_access_key: your-spaces-secret

sources:
  flow:
    path: s3://my-space/flows/

Cloudflare R2

s3:
  endpoint: <account-id>.r2.cloudflarestorage.com
  access_key_id: your-r2-key
  secret_access_key: your-r2-secret

sources:
  flow:
    path: s3://my-bucket/flows/

Configuration Options

Option | Type | Default | Description
region | string | - | AWS region (e.g., us-east-1)
access_key_id | string | - | Access key ID
secret_access_key | string | - | Secret access key
endpoint | string | - | Custom endpoint URL
use_ssl | bool | true | Use HTTPS
url_style | string | vhost | path or vhost

Querying S3 Data

Direct Path

sources:
  flow:
    path: s3://bucket/prefix/
    description: All flow data

Hive Partitioned Data

Rockfish Probe can organize uploads with Hive-style partitioning:

s3://bucket/flows/year=2025/month=01/day=28/*.parquet

Query specific partitions:

sources:
  flow:
    path: s3://bucket/flows/year=2025/month=01/
    description: January 2025 flows

Or use SQL with DuckDB’s Hive partitioning support:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
      hive_partitioning=true
    )
    LIMIT 100

Performance Tips

Use Partition Pruning

Structure queries to match partitioning scheme:

# Efficient - matches Hive partitions
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"

Limit Column Selection

Only select needed columns:

query:
  source: flow
  columns: [saddr, daddr, sbytes]  # Much faster than SELECT *

Use Aggregation Server-Side

Push aggregation to DuckDB:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: count
      alias: flows

Troubleshooting

“Access Denied”

  • Verify credentials are correct
  • Check bucket policy allows s3:GetObject and s3:ListBucket
  • For cross-account access, verify IAM trust policies

“Bucket not found”

  • Check region matches bucket region
  • For custom endpoints, verify url_style setting

“Connection refused”

  • Verify endpoint URL is correct
  • Check use_ssl matches endpoint (http vs https)
  • For MinIO, ensure url_style: path

Slow Queries

  • Add partition filters to queries
  • Select only needed columns
  • Check network bandwidth to S3

Example: Multi-Source Configuration

s3:
  region: us-east-1

sources:
  # Production flows (licensed, validated)
  prod_flows:
    path: s3://prod-bucket/flows/
    description: Production network flows
    require_license: true

  # Development data (no validation)
  dev_flows:
    path: s3://dev-bucket/flows/
    description: Development test data

  # Threat intel from intel server
  threat_intel:
    path: s3://prod-bucket/intel/
    description: IP reputation data

output:
  default_format: json
  max_rows: 10000

Rockfish Detect Overview

Rockfish Detect is the ML training and anomaly detection service for the Rockfish platform. It provides a complete pipeline for building models from network flow data and scoring flows for anomalies.

Note: Rockfish Detect requires an Enterprise tier license.

Features

  • Data Sampling - Random sampling from S3-stored Parquet files
  • Feature Engineering - Build normalization tables for ML training
  • Feature Ranking - Identify most significant fields for detection
  • Model Training - Train anomaly detection models (HBOS, Hybrid)
  • Flow Scoring - Score flows using trained models
  • Device Fingerprinting - Passive OS/device detection via nDPI fingerprints
  • Automated Scheduling - Run as daemon with daily training cycles

Architecture

Network Traffic
    |
    v
Parquet Files in S3 (from rockfish_probe)
    |
    v
+------------------------------------------+
|   rockfish_detect                        |
+------------------------------------------+
| Sampler                                  |
|   - Queries S3 with DuckDB               |
|   - Random sampling                      |
|   - Output: sample/*.parquet             |
+------------------------------------------+
| Feature Engineer                         |
|   - Build normalization tables           |
|   - Histogram binning + frequency        |
|   - Output: extract/*.parquet            |
+------------------------------------------+
| Feature Ranker                           |
|   - Importance scoring                   |
|   - Output: rockfish_rank.parquet        |
+------------------------------------------+
| Model Trainer (HBOS/Hybrid)              |
|   - Train on sampled data                |
|   - Output: models/*.json                |
+------------------------------------------+
| Flow Scorer                              |
|   - Score flows using trained models     |
|   - Output: score/*.parquet              |
+------------------------------------------+
    |
    v
Anomaly Scores --> rockfish_mcp --> Alerts

Algorithms

Algorithm | Type | Description
HBOS | Unsupervised | Histogram-Based Outlier Score - fast, interpretable
Hybrid | Combined | HBOS + fingerprint correlation + threat intelligence
Random Forest | Supervised | Classification-based (framework)
Autoencoder | Neural Network | Reconstruction error-based (framework)
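
For intuition, HBOS builds a histogram per feature and scores each flow by how rare its bin is in every histogram; rare combinations accumulate a high score. A toy sketch of the idea (not the shipped implementation):

import numpy as np

def hbos_score(sample: np.ndarray, baseline: np.ndarray, num_bins: int = 10) -> np.ndarray:
    """Toy HBOS: higher score means more anomalous. Columns are features."""
    scores = np.zeros(len(sample))
    for col in range(baseline.shape[1]):
        hist, edges = np.histogram(baseline[:, col], bins=num_bins, density=True)
        hist = np.clip(hist, 1e-9, None)                       # avoid log(0)
        idx = np.clip(np.digitize(sample[:, col], edges) - 1, 0, num_bins - 1)
        scores += -np.log(hist[idx])                           # rare bins add more
    return scores

# Placeholder arrays standing in for features like dur, sbytes, dbytes
baseline = np.random.lognormal(size=(1000, 3))
today = np.vstack([baseline[:5], [[50.0, 1e6, 1e6]]])          # one injected outlier
print(hbos_score(today, baseline))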

Use Cases

  1. Unsupervised Anomaly Detection - HBOS identifies statistical outliers
  2. Behavioral Change Detection - Hybrid mode detects unusual fingerprint combinations
  3. Device Profiling - Fingerprinting detects lateral movement
  4. Threat Prioritization - Score-based reporting prioritizes investigations
  5. Network Baselining - Feature ranking identifies important characteristics

Quick Start

# Validate configuration
rockfish_detect -c config.yaml validate

# Run full pipeline for specific date
rockfish_detect -c config.yaml auto --date 2025-01-28

# Start as scheduler daemon
rockfish_detect -c config.yaml run

# Run immediately (don't wait for schedule)
rockfish_detect -c config.yaml run --run-now

Requirements

  • Enterprise tier license
  • S3-compatible storage with flow data from rockfish_probe
  • Multi-core system recommended (uses half available cores)

Next Steps

Configuration Reference

Rockfish Detect uses YAML configuration files.

rockfish_detect -c /path/to/config.yaml [command]

Configuration Sections


License

license:
  path: /etc/rockfish/license.json
  observation: flows

Option | Type | Required | Description
path | string | No | License file path (auto-searches if not set)
observation | string | Yes | S3 prefix / observation domain

S3

s3:
  bucket: my-flow-bucket
  region: us-east-1
  endpoint: https://s3.example.com
  hive_partitioning: true
  http_retries: 10
  http_retry_wait_ms: 2000
  http_retry_backoff: 2.0

Option | Type | Default | Description
bucket | string | (required) | S3 bucket name
region | string | (required) | AWS region
endpoint | string | - | Custom endpoint (MinIO, etc.)
hive_partitioning | bool | true | Match rockfish_probe structure
http_retries | int | 10 | Retry count for S3 operations
http_retry_wait_ms | int | 2000 | Base wait between retries
http_retry_backoff | float | 2.0 | Exponential backoff multiplier

S3 Data Structure

Expected path structure (from rockfish_probe):

s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet

Sampling

sampling:
  sample_percent: 10.0
  retention_days: 7
  sample_hour: 0
  sample_minute: 30
  output_prefix: flows/sample

Option | Type | Default | Description
sample_percent | float | 10.0 | Percentage of rows to sample (0-100)
retention_days | int | 7 | Rolling window retention
sample_hour | int | 0 | UTC hour for scheduled sampling
sample_minute | int | random | Minute for scheduled sampling
output_prefix | string | <obs>/sample/ | S3 output prefix
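
Conceptually, the daily sampling pass corresponds to a query like the following (a sketch, assuming the S3 layout shown above, sample_percent: 10.0, and credentials available in the environment; bucket and observation names are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # enables s3:// reads

sample = con.execute("""
    SELECT *
    FROM read_parquet('s3://my-flow-bucket/flows/v2/year=2025/month=01/day=28/*.parquet',
                      hive_partitioning = true)
    USING SAMPLE 10 PERCENT (bernoulli)
""").fetchdf()
print(len(sample), "rows sampled")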

Features

Configure feature engineering (normalization tables).

features:
  num_bins: 10
  histogram_type: quantile
  ip_hash_modulus: 65536
  sample_days: 7
| Option | Type | Default | Description |
|---|---|---|---|
| num_bins | int | 10 | Histogram bins for numeric features |
| histogram_type | string | quantile | quantile or equal_width |
| ip_hash_modulus | int | 65536 | Dimensionality reduction for IPs |
| sample_days | int | 7 | Days of samples to process |

Histogram Types

| Type | Description | Best For |
|---|---|---|
| quantile | Equal sample count per bin | Skewed distributions |
| equal_width | Equal value range per bin | Uniform distributions |

Training

training:
  enabled: true
  train_hour: 1
  train_minute: 0
  algorithm: hbos
  model_output_dir: /var/lib/rockfish/models
  min_importance_score: 0.7

  hbos:
    num_bins: 10
    fields:
      - dur
      - rtt
      - pcr
      - spkts
      - dpkts
      - sbytes
      - dbytes
      - sentropy
      - dentropy

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2
    hbos_filter_percentile: 90.0
    min_observations: 3
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable training |
| train_hour | int | 1 | UTC hour for scheduled training |
| train_minute | int | random | Minute for scheduled training |
| algorithm | string | hbos | hbos, hybrid, random_forest, autoencoder |
| model_output_dir | string | - | Directory for trained models |
| min_importance_score | float | 0.7 | Threshold for ranked features |

HBOS Options

| Option | Type | Default | Description |
|---|---|---|---|
| num_bins | int | 10 | Histogram bins |
| fields | list | - | Fields to include in model |

Hybrid Options

| Option | Type | Default | Description |
|---|---|---|---|
| hbos_weight | float | 0.5 | Weight for HBOS score |
| correlation_weight | float | 0.3 | Weight for fingerprint correlation |
| threat_intel_weight | float | 0.2 | Weight for threat intel score |
| hbos_filter_percentile | float | 90.0 | Pre-filter percentile |
| min_observations | int | 3 | Min observations for correlation |

Fingerprint

Device/OS fingerprinting via nDPI signatures.

fingerprint:
  enabled: false
  history_days: 7
  client_field: ndpi_ja4
  server_field: ndpi_ja3s
  min_observations: 10
  anomaly_threshold: 0.7
  max_fingerprints_per_host: 5
  detect_suspicious: true
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable fingerprinting |
| history_days | int | 7 | Days of history to analyze |
| client_field | string | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
| server_field | string | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
| min_observations | int | 10 | Minimum observations for baseline |
| anomaly_threshold | float | 0.7 | Threshold for anomaly detection |
| max_fingerprints_per_host | int | 5 | Max expected fingerprints |
| detect_suspicious | bool | true | Detect fingerprint changes |

Note: Requires nDPI fingerprint fields in flow data (Professional+ license for probe).


Logging

logging:
  level: info
  file: /var/log/rockfish/detect.log
| Option | Type | Default | Description |
|---|---|---|---|
| level | string | info | Log level: error, warn, info, debug, trace |
| file | string | - | Log file path (optional) |

Other Options

parallel_protocols: true
protocols:
  - tcp
  - udp
  - icmp

duckdb:
  autoload_extensions: false
| Option | Type | Default | Description |
|---|---|---|---|
| parallel_protocols | bool | true | Process protocols in parallel |
| protocols | list | tcp, udp, icmp | Protocols to process |
| duckdb.autoload_extensions | bool | false | DuckDB extension autoload |

Complete Example

license:
  path: /opt/rockfish/etc/license.json
  observation: sensor-01

s3:
  bucket: flow-data
  region: us-east-1
  hive_partitioning: true

sampling:
  sample_percent: 10.0
  retention_days: 7
  sample_hour: 0

features:
  num_bins: 10
  histogram_type: quantile
  sample_days: 7

training:
  enabled: true
  train_hour: 1
  algorithm: hybrid
  model_output_dir: /var/lib/rockfish/models

  hbos:
    num_bins: 10
    fields:
      - dur
      - rtt
      - pcr
      - spkts
      - dpkts
      - sbytes
      - dbytes

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2

fingerprint:
  enabled: true
  history_days: 7
  min_observations: 10

logging:
  level: info
  file: /var/log/rockfish/detect.log

Data Pipeline

Rockfish Detect processes data through a series of stages, each producing artifacts used by subsequent stages.

Pipeline Stages

sample --> extract --> rank --> train --> score
| Stage | Command | Input | Output |
|---|---|---|---|
| Sample | sample | Raw flow Parquet | Sampled Parquet |
| Extract | extract | Sampled Parquet | Normalization tables |
| Rank | rank | Normalization tables | Feature rankings |
| Train | train | Sampled + Normalization | Model files |
| Score | score | Raw flows + Model | Anomaly scores |

1. Sampling

Randomly samples flow data to reduce volume while maintaining statistical properties.

# Sample specific date
rockfish_detect -c config.yaml sample --date 2025-01-28

# Sample last N days
rockfish_detect -c config.yaml sample --days 7

# Clear state and resample all
rockfish_detect -c config.yaml sample --clear

Input Path

s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet

Output Path

s3://<bucket>/<observation>/sample/sample-YYYY-MM-DD.parquet

Configuration

sampling:
  sample_percent: 10.0    # 10% of rows
  retention_days: 7       # Keep 7 days of samples

State Tracking

Sampling maintains state to avoid reprocessing:

  • Tracks which dates have been sampled
  • Skips dates already in state file
  • Use --clear to reset state

2. Feature Extraction

Builds normalization lookup tables for ML training.

# Extract features for all protocols
rockfish_detect -c config.yaml extract

# Specific protocol
rockfish_detect -c config.yaml extract -p tcp

# Sequential (not parallel)
rockfish_detect -c config.yaml extract --sequential

Processing

For each configured field, the extract stage creates a normalization table:

Numeric fields (dur, rtt, bytes, etc.):

  • Histogram binning (quantile or equal-width)
  • Maps raw values to bin indices
  • Normalizes to [0, 1] range

Categorical fields (proto, ports, IPs):

  • Frequency counting
  • Maps values to frequency scores
  • Special handling for IPs (/24 truncation)
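
As a rough illustration of these two normalization paths, here is a minimal sketch in Python (pandas and numpy assumed available; this is illustrative, not the shipped implementation):

import numpy as np
import pandas as pd

def normalize_numeric(values: pd.Series, num_bins: int = 10) -> pd.DataFrame:
    # Quantile binning: bin edges at equal sample-count quantiles,
    # then map raw values to bin indices normalized to [0, 1].
    edges = np.quantile(values, np.linspace(0, 1, num_bins + 1))
    bins = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, num_bins - 1)
    return pd.DataFrame({"value": values, "bin": bins, "norm": bins / (num_bins - 1)})

def normalize_categorical(values: pd.Series) -> pd.DataFrame:
    # Frequency counting: map each value to its relative frequency.
    freq = values.value_counts(normalize=True)
    return pd.DataFrame({"value": freq.index, "frequency": freq.values})

def truncate_ip(ip: str) -> str:
    # /24 truncation before frequency counting, e.g. 192.168.1.100 -> 192.168.1.0/24
    return ".".join(ip.split(".")[:3]) + ".0/24"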

Output Path

s3://<bucket>/<observation>/extract/<protocol>/<field>.parquet

Configuration

features:
  num_bins: 10              # Histogram resolution
  histogram_type: quantile  # Better for skewed data
  ip_hash_modulus: 65536    # IP dimensionality reduction

3. Feature Ranking

Ranks features by importance for model training.

# Rank using reconstruction error
rockfish_detect -c config.yaml rank

# Rank using SVD
rockfish_detect -c config.yaml rank -a svd

# Specific protocol
rockfish_detect -c config.yaml rank -p tcp

Algorithms

| Algorithm | Description |
|---|---|
| reconstruction | Autoencoder reconstruction error (default) |
| svd | Singular Value Decomposition importance |

Output

s3://<bucket>/<observation>/extract/<protocol>/rockfish_rank.parquet

Contains importance scores (0-1) for each field.

Using Rankings

training:
  min_importance_score: 0.7   # Only use features above this
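
For instance, a downstream step could load the ranking file and keep only fields above the threshold. This is a hedged sketch: the column names field and importance are assumptions, not a documented schema.

import pandas as pd

ranks = pd.read_parquet("rockfish_rank.parquet")   # column names are assumed
selected = ranks.loc[ranks["importance"] >= 0.7, "field"].tolist()
print(f"training on {len(selected)} ranked features: {selected}")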

4. Model Training

Trains anomaly detection models on sampled data.

# Train HBOS model
rockfish_detect -c config.yaml train -a hbos

# Train hybrid model
rockfish_detect -c config.yaml train -a hybrid

# Train with ranked features only
rockfish_detect -c config.yaml train-ranked -n 10

# Specific protocol
rockfish_detect -c config.yaml train -p tcp

Algorithms

HBOS (Histogram-Based Outlier Score):

  • Fast, interpretable
  • Inverse density scoring
  • Good baseline algorithm

Hybrid:

  • Combines HBOS + correlation + threat intel
  • Weighted scoring model
  • Better for complex environments

Output

Models saved to configured directory:

<model_output_dir>/<protocol>_model.json

Configuration

training:
  algorithm: hbos
  model_output_dir: /var/lib/rockfish/models

  hbos:
    num_bins: 10
    fields: [dur, rtt, pcr, spkts, dpkts, sbytes, dbytes]

5. Flow Scoring

Scores flows using trained models.

# Score specific date
rockfish_detect -c config.yaml score -d 2025-01-28

# Score since timestamp
rockfish_detect -c config.yaml score --since 2025-01-28T00:00:00Z

# With severity threshold
rockfish_detect -c config.yaml score -t 0.8

# Limit results
rockfish_detect -c config.yaml score -n 1000

# Output to file
rockfish_detect -c config.yaml score -o anomalies.parquet

Options

| Option | Description |
|---|---|
| -d, --date | Score specific date |
| --since | Score since timestamp |
| -p | Specific protocol |
| -t, --threshold | Minimum score threshold |
| -n, --limit | Maximum results |
| -o, --output | Output file path |

Severity Classification

# Percentile-based (default)
severity_mode: percentile

# Fixed thresholds
severity_mode: fixed
severity_thresholds:
  low: 0.5
  medium: 0.7
  high: 0.85
  critical: 0.95

Output

s3://<bucket>/<observation>/score/score-YYYY-MM-DD.parquet

Includes:

  • Original flow fields
  • anomaly_score (0-1)
  • severity (LOW, MEDIUM, HIGH, CRITICAL)

Automated Pipeline

Run the complete pipeline with a single command:

# Full pipeline for today
rockfish_detect -c config.yaml auto

# Specific date
rockfish_detect -c config.yaml auto --date 2025-01-28

# Last 7 days
rockfish_detect -c config.yaml auto --days 7

# Stop on first error
rockfish_detect -c config.yaml auto --fail-fast

Pipeline Order

  1. Sample data
  2. Extract features
  3. Rank features
  4. Train model
  5. Score flows

Reporting

Generate reports from scored data:

# Text report
rockfish_detect -c config.yaml report --date 2025-01-28

# JSON output
rockfish_detect -c config.yaml report -f json

# Filter by severity
rockfish_detect -c config.yaml report --min-severity HIGH

# Top N anomalies
rockfish_detect -c config.yaml report -n 50

Output Formats

| Format | Description |
|---|---|
| text | Human-readable (default) |
| json | Machine-readable JSON |
| csv | CSV export |

Anomaly Detection

Rockfish Detect supports multiple anomaly detection algorithms for identifying unusual network flows.

Algorithms

| Algorithm | Type | Speed | Interpretability | Use Case |
|---|---|---|---|---|
| HBOS | Unsupervised | Fast | High | General anomaly detection |
| Hybrid | Combined | Medium | Medium | Complex environments |
| Random Forest | Supervised | Medium | Medium | Known threat patterns |
| Autoencoder | Neural Network | Slow | Low | Complex patterns |

HBOS (Histogram-Based Outlier Score)

HBOS is the default algorithm - fast, interpretable, and effective for network anomaly detection.

How It Works

  1. Build histograms for each feature from training data
  2. Calculate density for each bin
  3. Score new flows based on inverse density
  4. Combine scores across features

Flows falling in low-density bins receive high anomaly scores.
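
A minimal sketch of that scoring idea (illustrative only, not the shipped implementation; numpy assumed):

import numpy as np

def hbos_train(X: np.ndarray, num_bins: int = 10):
    # X: rows = flows, columns = features. Build a histogram per feature.
    models = []
    for col in X.T:
        counts, edges = np.histogram(col, bins=num_bins)
        models.append((edges, counts / counts.sum()))   # (bin edges, per-bin density)
    return models

def hbos_score(x: np.ndarray, models, eps: float = 1e-6) -> float:
    # Sum of log inverse densities: values in low-density bins raise the score.
    score = 0.0
    for value, (edges, density) in zip(x, models):
        idx = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, len(density) - 1)
        score += np.log(1.0 / (density[idx] + eps))
    return score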

Configuration

training:
  algorithm: hbos

  hbos:
    num_bins: 10
    fields:
      - dur          # Flow duration
      - rtt          # Round-trip time
      - pcr          # Producer-consumer ratio
      - spkts        # Source packets
      - dpkts        # Destination packets
      - sbytes       # Source bytes
      - dbytes       # Destination bytes
      - sentropy     # Source entropy
      - dentropy     # Destination entropy
      - ssmallpktcnt # Small packet count
      - slargepktcnt # Large packet count

Feature Selection

Choose fields that characterize normal behavior:

| Category | Fields | Detects |
|---|---|---|
| Volume | sbytes, dbytes, spkts, dpkts | Data exfiltration, DDoS |
| Timing | dur, rtt | Tunneling, beaconing |
| Behavior | pcr, entropy | C2, encrypted channels |
| Packets | smallpktcnt, largepktcnt | Protocol anomalies |

Example Output

Flow: 192.168.1.100:52341 -> 45.33.32.156:443
Score: 0.92 (CRITICAL)
Contributing factors:
  - dbytes: 47MB (unusual outbound volume)
  - dur: 28800s (8-hour connection)
  - pcr: -0.98 (highly asymmetric)

Hybrid Algorithm

Combines multiple detection methods for improved accuracy.

Components

Final Score = (HBOS * W1) + (Correlation * W2) + (Threat Intel * W3)
| Component | Default Weight | Description |
|---|---|---|
| HBOS | 0.5 | Statistical outlier score |
| Correlation | 0.3 | Fingerprint pair frequency |
| Threat Intel | 0.2 | nDPI risk + IP reputation |
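
In code, the combination is just a weighted sum (a sketch; component scores are assumed to already be normalized to the 0-1 range):

def hybrid_score(hbos: float, correlation: float, threat_intel: float,
                 w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    # Final Score = (HBOS * W1) + (Correlation * W2) + (Threat Intel * W3)
    return hbos * w1 + correlation * w2 + threat_intel * w3

# Strong statistical outlier, rare fingerprint pair, no threat intel hit:
print(hybrid_score(hbos=0.9, correlation=0.8, threat_intel=0.0))   # 0.69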

Configuration

training:
  algorithm: hybrid

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2
    hbos_filter_percentile: 90.0
    min_observations: 3

Correlation Score

Based on nDPI fingerprint pair frequency (ndpi_ja4/ndpi_ja3s):

  1. Build database of (client_fingerprint, server_fingerprint) pairs
  2. Track frequency of each pair
  3. Score rare or never-seen combinations higher

Detects:

  • New client/server combinations
  • Unusual application behaviors
  • Potential lateral movement
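
One way to turn pair frequencies into a score, sketched under assumed data shapes (illustrative, not the product's exact formula):

from collections import Counter

def correlation_scores(pairs):
    # pairs: iterable of (client_fingerprint, server_fingerprint) tuples
    counts = Counter(pairs)
    total = sum(counts.values())
    # Rare or never-seen pairs score near 1.0; common pairs score near 0.0
    return {pair: 1.0 - (count / total) for pair, count in counts.items()}

baseline = [("ja4_a", "ja3s_x")] * 1000 + [("ja4_b", "ja3s_y")]
scores = correlation_scores(baseline)
print(scores[("ja4_b", "ja3s_y")])   # ~0.999 -> rare pair, high score
print(scores[("ja4_a", "ja3s_x")])   # ~0.001 -> common pair, low score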

Threat Intel Score

Incorporates external intelligence:

  • nDPI risk scores: Protocol-level risks
  • IP reputation: AbuseIPDB confidence scores
  • Known bad indicators: Blacklisted IPs/domains

Tuning Weights

| Environment | HBOS | Correlation | Threat Intel |
|---|---|---|---|
| General | 0.5 | 0.3 | 0.2 |
| High threat | 0.3 | 0.3 | 0.4 |
| Internal only | 0.6 | 0.4 | 0.0 |

Severity Classification

Anomaly scores are classified into severity levels.

Percentile-Based (Default)

Dynamic thresholds based on score distribution:

| Severity | Percentile |
|---|---|
| LOW | 50-75th |
| MEDIUM | 75-90th |
| HIGH | 90-95th |
| CRITICAL | >95th |

Adapts to your environment’s baseline.
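
A hedged sketch of percentile-based classification using the bands above (numpy assumed; the label for scores below the LOW band is an assumption):

import numpy as np

def classify(scores: np.ndarray) -> list:
    # Thresholds come from the score distribution itself
    p50, p75, p90, p95 = np.percentile(scores, [50, 75, 90, 95])
    labels = []
    for s in scores:
        if s > p95:
            labels.append("CRITICAL")
        elif s > p90:
            labels.append("HIGH")
        elif s > p75:
            labels.append("MEDIUM")
        elif s > p50:
            labels.append("LOW")
        else:
            labels.append("NONE")   # assumed label for scores below the LOW band
    return labels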

Fixed Thresholds

Static thresholds for consistent alerting:

severity_mode: fixed
severity_thresholds:
  low: 0.5
  medium: 0.7
  high: 0.85
  critical: 0.95

Protocol-Specific Models

Rockfish Detect trains separate models per protocol:

# Train TCP model only
rockfish_detect -c config.yaml train -p tcp

# Score UDP traffic
rockfish_detect -c config.yaml score -p udp

Why Separate Models?

  • TCP, UDP, and ICMP have different characteristics
  • Prevents cross-protocol noise
  • Better detection accuracy per protocol

Configuration

protocols:
  - tcp
  - udp
  - icmp

parallel_protocols: true   # Process in parallel

Feature Ranking

Use feature importance to select the most relevant fields.

# Rank features
rockfish_detect -c config.yaml rank

# Train with top 10 ranked features
rockfish_detect -c config.yaml train-ranked -n 10

Benefits

  • Reduces model complexity
  • Improves training speed
  • May improve detection accuracy

Configuration

training:
  min_importance_score: 0.7   # Include features above this threshold

Best Practices

1. Start with HBOS

  • Fast iteration
  • Easy to interpret
  • Good baseline performance

2. Use Adequate Training Data

  • Minimum 7 days of samples
  • Include normal business hours and off-hours
  • Ensure representative traffic mix

3. Tune for Your Environment

  • Adjust severity thresholds based on alert volume
  • Weight algorithms based on threat model
  • Include relevant fields for your use case

4. Regular Retraining

  • Retrain weekly or monthly
  • Network behavior changes over time
  • New applications may appear as anomalies initially

5. Validate Results

  • Review high-severity alerts
  • Adjust thresholds to reduce false positives
  • Document known-good anomalies

Troubleshooting

High False Positive Rate

  • Increase severity thresholds
  • Add more training data
  • Exclude noisy fields from model

Missing True Positives

  • Lower severity thresholds
  • Include more fields in model
  • Check training data for bias

Slow Scoring

  • Use ranked features (fewer fields)
  • Process protocols in parallel
  • Increase hardware resources

Device Fingerprinting

Rockfish Detect includes ML-based passive device fingerprinting using network signals.

Note: Requires nDPI fingerprints in flow data (Professional+ license for rockfish_probe).

Overview

Device fingerprinting identifies devices and operating systems based on their network behavior, without requiring agents or active scanning.

Signals Used

| Priority | Signal | Field | Description |
|---|---|---|---|
| Primary | TLS client | ndpi_ja4 | JA4 TLS client fingerprint |
| Primary | TLS server | ndpi_ja3s | JA3 TLS server fingerprint |
| Secondary | TCP stack | ndpi_tcp_fp | TCP fingerprint with OS hint (TTL, window size, options) |
| Secondary | Composite | ndpi_fp | nDPI combined fingerprint for device correlation |
| Tertiary | Application | - | HTTP headers, DNS patterns |

Use Cases

  • Asset Inventory - Discover devices on your network
  • Baseline Monitoring - Track device behavior over time
  • Lateral Movement Detection - Detect hosts changing fingerprints
  • Unauthorized Devices - Identify unexpected device types

Commands

Build Fingerprint Database

Build baseline from historical data:

# Build from last 7 days
rockfish_detect -c config.yaml fingerprint build --days 7

# Build from specific date range
rockfish_detect -c config.yaml fingerprint build --start 2025-01-01 --end 2025-01-28

Detect Anomalies

Find hosts with unusual fingerprint changes:

# Detect for today
rockfish_detect -c config.yaml fingerprint detect

# Detect for specific date
rockfish_detect -c config.yaml fingerprint detect --date 2025-01-28

Profile Specific Host

Get fingerprint profile for an IP:

# Profile specific IP
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100

# With history
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100 --days 30

Configuration

fingerprint:
  enabled: true
  history_days: 7
  client_field: ndpi_ja4
  server_field: ndpi_ja3s
  min_observations: 10
  anomaly_threshold: 0.7
  max_fingerprints_per_host: 5
  detect_suspicious: true
| Option | Default | Description |
|---|---|---|
| enabled | false | Enable fingerprinting |
| history_days | 7 | Days of history to analyze |
| client_field | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
| server_field | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
| min_observations | 10 | Minimum flows to establish baseline |
| anomaly_threshold | 0.7 | Score threshold for anomalies |
| max_fingerprints_per_host | 5 | Expected max fingerprints per device |
| detect_suspicious | true | Flag suspicious changes |

How It Works

1. Baseline Building

For each IP address, collect:

  • Set of observed ndpi_ja4 fingerprints (client connections)
  • Set of observed ndpi_ja3s fingerprints (server connections)
  • Frequency of each fingerprint
  • First and last seen timestamps
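
A minimal way to build such a baseline table with pandas (field names come from the flow schema; the sample file name and output layout are assumptions):

import pandas as pd

flows = pd.read_parquet("sample-2025-01-28.parquet",
                        columns=["saddr", "ndpi_ja4", "stime"])

baseline = (flows.dropna(subset=["ndpi_ja4"])
            .groupby(["saddr", "ndpi_ja4"])
            .agg(count=("ndpi_ja4", "size"),
                 first_seen=("stime", "min"),
                 last_seen=("stime", "max"))
            .reset_index())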

2. Anomaly Detection

Flag hosts that:

  • Present a new, never-seen fingerprint
  • Exceed max_fingerprints_per_host
  • Show sudden fingerprint changes
  • Have rare fingerprint combinations

3. Correlation Scoring

Score fingerprint pairs by frequency:

Rare pair (first time seen) -> High anomaly score
Common pair (seen 1000+ times) -> Low anomaly score

Detection Scenarios

New Device on Network

Alert: New fingerprint detected
Host: 192.168.1.150
Fingerprint: t13d1516h2_8daaf6152771_b0da82dd1658
First seen: 2025-01-28T14:32:00Z
Action: Verify device is authorized

Host Fingerprint Change

Alert: Fingerprint change detected
Host: 192.168.1.100
Previous: t13d1516h2_8daaf6152771_b0da82dd1658 (Windows 11)
Current: t13d1517h2_5b57614c22b0_06cda9e17597 (Linux)
Risk: Possible lateral movement or VM switch

Unusual Client/Server Pair

Alert: Rare fingerprint combination
Client: 192.168.1.100 (ndpi_ja4: t13d1516h2_...)
Server: 45.33.32.156 (ndpi_ja3s: t120200_...)
Observations: 1 (first time)
Typical for this client: 847 connections to known servers
Risk: New external communication

Integration with Hybrid Scoring

Fingerprint correlation is a component of the hybrid algorithm:

training:
  algorithm: hybrid

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3      # Fingerprint correlation
    threat_intel_weight: 0.2

Flows with rare fingerprint combinations receive higher anomaly scores.

Output Schema

Fingerprint analysis adds these fields to scored flows:

| Field | Type | Description |
|---|---|---|
| fp_client | String | Client fingerprint (ndpi_ja4) |
| fp_server | String | Server fingerprint (ndpi_ja3s) |
| fp_pair_count | Int | Times this pair has been seen |
| fp_client_count | Int | Times client has been seen |
| fp_is_new | Bool | First observation of this pair |
| fp_anomaly_score | Float | Fingerprint-specific anomaly score |
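
For example, new fingerprint pairs above the anomaly threshold can be pulled out of a scored file like this (a sketch; the file name is illustrative):

import pandas as pd

scored = pd.read_parquet("score-2025-01-28.parquet")
new_pairs = scored[scored["fp_is_new"] & (scored["fp_anomaly_score"] >= 0.7)]
print(new_pairs[["saddr", "daddr", "fp_client", "fp_server", "fp_anomaly_score"]])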

Best Practices

1. Build Sufficient Baseline

  • Use at least 7 days of data
  • Include weekdays and weekends
  • Ensure coverage of all network segments

2. Tune Thresholds

  • Start with defaults
  • Adjust max_fingerprints_per_host for your environment
  • Some hosts (proxies, VMs) legitimately have many fingerprints

3. Handle Known Exceptions

  • Exclude known multi-fingerprint hosts
  • Document expected fingerprint changes (updates, migrations)

4. Combine with Other Signals

  • Use hybrid algorithm for combined scoring
  • Correlate with threat intelligence
  • Consider flow volume and timing

Limitations

  • Requires nDPI fingerprint fields (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp) in flow data
  • TLS fingerprints only available for TLS connections
  • VPN/proxy traffic may obscure true fingerprints
  • Fingerprints can change with software updates

Scheduler

Rockfish Detect can run as a daemon with automated scheduling for continuous anomaly detection.

Running as Daemon

# Start scheduler
rockfish_detect -c config.yaml run

# Run immediately without waiting
rockfish_detect -c config.yaml run --run-now

The scheduler runs two daily jobs:

  1. Sample job - Sample new flow data
  2. Train job - Retrain models with new samples

Schedule Configuration

sampling:
  sample_hour: 0          # UTC hour (0-23)
  sample_minute: 30       # Optional; random if not set

training:
  train_hour: 1           # UTC hour (0-23)
  train_minute: 0         # Optional; random if not set

Random Minutes

If sample_minute or train_minute is not set, a random minute (0-59) is selected at startup. This staggers job start times so multiple instances are unlikely to run their jobs at the same moment.

Example Schedule

# Sample at 00:30 UTC, train at 01:00 UTC
sampling:
  sample_hour: 0
  sample_minute: 30

training:
  train_hour: 1
  train_minute: 0

Timeline:

00:30 UTC - Sample yesterday's flow data
01:00 UTC - Retrain models with updated samples

Systemd Service

Create /etc/systemd/system/rockfish-detect.service:

[Unit]
Description=Rockfish Detect ML Service
After=network.target

[Service]
Type=simple
User=rockfish
ExecStart=/usr/local/bin/rockfish_detect -c /etc/rockfish/detect.yaml run
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable rockfish-detect
sudo systemctl start rockfish-detect

# Check status
sudo systemctl status rockfish-detect

# View logs
sudo journalctl -u rockfish-detect -f

Docker Deployment

# Pull the image
docker pull rockfishnetworks/toolkit:latest

# Run the scheduler
docker run -d \
  --name rockfish-detect \
  -v /path/to/config.yaml:/etc/rockfish/config.yaml \
  -v /path/to/license.json:/etc/rockfish/license.json \
  -e AWS_ACCESS_KEY_ID=xxx \
  -e AWS_SECRET_ACCESS_KEY=xxx \
  rockfishnetworks/toolkit:latest \
  rockfish_detect -c /etc/rockfish/config.yaml run

Graceful Shutdown

The scheduler handles SIGTERM/SIGINT for graceful shutdown:

  1. Stops accepting new jobs
  2. Waits for running jobs to complete
  3. Saves state
  4. Exits cleanly
# Graceful stop
sudo systemctl stop rockfish-detect

# Or with kill
kill -TERM $(pgrep rockfish_detect)

State Management

The scheduler maintains state to avoid redundant work:

Sample State

Tracks which dates have been sampled:

s3://<bucket>/<observation>/sample/.state.json

Skip already-sampled dates on restart.

Score State

Tracks last scored timestamp:

s3://<bucket>/<observation>/score/.state.json

Resume scoring from last checkpoint.

Reset State

# Clear sample state
rockfish_detect -c config.yaml sample --clear

# Force rescore
rockfish_detect -c config.yaml score --since 2025-01-01T00:00:00Z

Monitoring

Log Output

logging:
  level: info
  file: /var/log/rockfish/detect.log

Log levels:

  • error - Errors only
  • warn - Warnings and errors
  • info - Normal operation (default)
  • debug - Detailed operation
  • trace - Very verbose

Health Check

# Validate configuration
rockfish_detect -c config.yaml validate

# Test S3 connectivity
rockfish_detect -c config.yaml test-s3

# Check license
rockfish_detect -c config.yaml license

Metrics to Monitor

| Metric | Description |
|---|---|
| Sample job duration | Time to complete sampling |
| Train job duration | Time to complete training |
| Flows sampled | Number of flows per sample run |
| Anomalies detected | High-severity anomalies per day |
| S3 errors | Failed S3 operations |

Multi-Instance Deployment

For high availability or distributed processing:

Separate Responsibilities

# Instance 1: Sampling and training
rockfish_detect -c config-train.yaml run

# Instance 2: Scoring only
rockfish_detect -c config-score.yaml score --continuous

Shared State

All instances read/write to the same S3 bucket. State files prevent duplicate work.

Protocol Distribution

# Instance 1: TCP
rockfish_detect -c config.yaml run -p tcp

# Instance 2: UDP
rockfish_detect -c config.yaml run -p udp

Troubleshooting

Job Not Running

  1. Check system time (UTC)
  2. Verify schedule configuration
  3. Check logs for errors

Job Failing

# Run manually with verbose output
rockfish_detect -c config.yaml -vv auto

High Memory Usage

  • Reduce sample_percent
  • Process protocols sequentially
  • Limit sample_days

Slow Jobs

  • Enable parallel_protocols: true
  • Use faster S3 storage
  • Increase hardware resources

Parquet Schema

Rockfish exports flow data in Apache Parquet format with IPFIX-compliant field naming. The schema varies by license tier.

Schema by Tier

| Tier | Schema Version | Fields | Key Features |
|---|---|---|---|
| Community | v1 | 44 | Basic flow fields |
| Basic | v1 | 54 | + nDPI detection, GeoIP (country, city, ASN) |
| Professional | v2 | 60 | + GeoIP AS org, nDPI fingerprints |
| Enterprise | v2 | 63+ | + Anomaly scores, ML predictions |

Community Schema (44 Fields)

Basic flow capture with core network fields.

| # | Field | Type | Description |
|---|---|---|---|
| 1 | version | UInt16 | Schema version (1) |
| 2 | flowid | String | Unique flow UUID |
| 3 | obname | String | Observation domain name |
| 4 | stime | Timestamp | Flow start time (UTC) |
| 5 | etime | Timestamp | Flow end time (UTC) |
| 6 | dur | UInt32 | Duration (milliseconds) |
| 7 | rtt | UInt32 | Round-trip time (microseconds) |
| 8 | pcr | Int32 | Producer-consumer ratio |
| 9 | proto | String | Protocol (TCP, UDP, ICMP) |
| 10 | saddr | String | Source IP address |
| 11 | daddr | String | Destination IP address |
| 12 | sport | UInt16 | Source port |
| 13 | dport | UInt16 | Destination port |
| 14 | iflags | String | Initial TCP flags |
| 15 | uflags | String | Union of all TCP flags |
| 16 | stcpseq | UInt32 | Source initial TCP sequence |
| 17 | dtcpseq | UInt32 | Dest initial TCP sequence |
| 18 | svlan | UInt16 | Source VLAN ID |
| 19 | dvlan | UInt16 | Destination VLAN ID |
| 20 | spkts | UInt64 | Source packet count |
| 21 | dpkts | UInt64 | Destination packet count |
| 22 | sbytes | UInt64 | Source byte count |
| 23 | dbytes | UInt64 | Destination byte count |
| 24 | sentropy | UInt8 | Source payload entropy (0-255) |
| 25 | dentropy | UInt8 | Destination payload entropy |
| 26 | ssmallpktcnt | UInt32 | Source small packets (<60 bytes) |
| 27 | dsmallpktcnt | UInt32 | Dest small packets |
| 28 | slargepktcnt | UInt32 | Source large packets (>225 bytes) |
| 29 | dlargepktcnt | UInt32 | Dest large packets |
| 30 | snonemptypktcnt | UInt32 | Source non-empty packets |
| 31 | dnonemptypktcnt | UInt32 | Dest non-empty packets |
| 32 | sfirstnonemptycnt | UInt16 | Source first N non-empty sizes |
| 33 | dfirstnonemptycnt | UInt16 | Dest first N non-empty sizes |
| 34 | smaxpktsize | UInt16 | Source max packet size |
| 35 | dmaxpktsize | UInt16 | Dest max packet size |
| 36 | savgpayload | UInt16 | Source avg payload size |
| 37 | davgpayload | UInt16 | Dest avg payload size |
| 38 | sstdevpayload | UInt16 | Source payload std deviation |
| 39 | dstdevpayload | UInt16 | Dest payload std deviation |
| 40 | spd | String | Small packet direction flags |
| 41 | spdt | String | Small packet direction timing |
| 42 | reason | String | Flow termination reason |
| 43 | smac | String | Source MAC address |
| 44 | dmac | String | Destination MAC address |

Basic Schema (54 Fields)

Community schema + nDPI application detection + GeoIP (country, city, ASN).

GeoIP fields:

| # | Field | Type | Description |
|---|---|---|---|
| 45 | scountry | String | Source country (ISO 3166-1 alpha-2) |
| 46 | dcountry | String | Destination country |
| 47 | scity | String | Source city |
| 48 | dcity | String | Destination city |
| 49 | sasn | UInt32 | Source ASN |
| 50 | dasn | UInt32 | Destination ASN |

nDPI fields:

| # | Field | Type | Description |
|---|---|---|---|
| 51 | ndpi_appid | String | nDPI application ID (e.g., “TLS.YouTube”) |
| 52 | ndpi_category | String | nDPI category (e.g., “Streaming”) |
| 53 | ndpi_risk_score | UInt32 | nDPI cumulative risk score |
| 54 | ndpi_risk_severity | UInt8 | Risk severity (0=none, 1=low, 2=medium, 3=high) |

Professional Schema (60 Fields)

Basic schema + GeoIP AS organization names and nDPI fingerprinting.

Additional GeoIP fields (AS organization):

| # | Field | Type | Description |
|---|---|---|---|
| 55 | sasnorg | String | Source ASN organization |
| 56 | dasnorg | String | Destination ASN organization |

nDPI fingerprint fields:

| # | Field | Type | Description |
|---|---|---|---|
| 57 | ndpi_ja4 | String | JA4 TLS client fingerprint (via nDPI) |
| 58 | ndpi_ja3s | String | JA3 TLS server fingerprint (via nDPI) |
| 59 | ndpi_tcp_fp | String | TCP fingerprint with OS hint (via nDPI) |
| 60 | ndpi_fp | String | nDPI composite fingerprint |

Enterprise Schema (63+ Fields)

Professional schema + anomaly detection and ML predictions.

Anomaly detection fields:

| # | Field | Type | Description |
|---|---|---|---|
| 61 | anomaly_score | Float32 | Anomaly score (0.0 - 1.0) |
| 62 | anomaly_severity | String | Severity (LOW, MEDIUM, HIGH, CRITICAL) |
| 63 | anomaly_factors | String | Contributing factors |

File Naming

| Tier | File Pattern |
|---|---|
| Community | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Basic | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Professional | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
| Enterprise | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
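
A small sketch for pulling the pieces out of these names (the regular expression is an assumption derived from the patterns above, not a published spec):

import re

PATTERN = re.compile(
    r"rockfish-(?:(?P<observation>.+)-)?v(?P<version>\d)-"
    r"(?P<date>\d{8})-(?P<time>\d{6})\.parquet"
)

m = PATTERN.match("rockfish-sensor-01-v2-20250128-143200.parquet")
print(m.groupdict())
# {'observation': 'sensor-01', 'version': '2', 'date': '20250128', 'time': '143200'}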

S3 Path Structure

With Hive partitioning enabled:

s3://<bucket>/<prefix>/v1/year=YYYY/month=MM/day=DD/*.parquet
s3://<bucket>/<prefix>/v2/year=YYYY/month=MM/day=DD/*.parquet

Field Descriptions

Flow Identification

  • flowid: Unique UUID for deduplication and correlation
  • obname: Observation domain name (sensor identifier)

Timing

  • stime/etime: Timestamps with microsecond precision, UTC
  • dur: Duration in milliseconds
  • rtt: Estimated TCP round-trip time

Network Addresses

  • saddr/daddr: IPv4 or IPv6 addresses as strings
  • sport/dport: Port numbers (0 for non-TCP/UDP)
  • smac/dmac: MAC addresses in standard notation

Traffic Volumes

  • spkts/dpkts: Packet counts per direction
  • sbytes/dbytes: Byte counts per direction
  • pcr: Producer-consumer ratio: (sent-recv)/(sent+recv)
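
For example, a flow that sends far more than it receives has a PCR close to +1 (whether the stored value is byte- or packet-based, and how it maps onto the Int32 column, is not specified here; bytes are assumed for illustration):

def pcr(sent: int, received: int) -> float:
    # Producer-consumer ratio: (sent - received) / (sent + received)
    return (sent - received) / (sent + received)

print(pcr(sent=47_000_000, received=500_000))   # ~0.98: heavily outbound (producer)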

TCP Flags

  • iflags: Initial TCP flags (SYN, ACK, etc.)
  • uflags: Union of all flags seen in flow

Payload Analysis

  • sentropy/dentropy: Shannon entropy scaled to 0-255
    • ~230 and above: Likely encrypted/compressed
    • ~140: English text
    • Low: Sparse or zero-padded
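
A sketch of how a byte-scaled Shannon entropy like this can be computed (the 255/8 scaling is an assumption for illustration; the probe's exact method is not documented here):

import math
from collections import Counter

def payload_entropy_255(payload: bytes) -> int:
    # Shannon entropy of the payload's byte distribution (0-8 bits), rescaled to 0-255
    if not payload:
        return 0
    counts = Counter(payload)
    total = len(payload)
    bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return round(bits * 255 / 8)

print(payload_entropy_255(bytes(range(256)) * 4))         # 255: uniform bytes, encrypted-like
print(payload_entropy_255(b"the quick brown fox " * 50))  # much lower: English-like text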

Flow Termination

  • reason: Why the flow ended
    • idle: Idle timeout
    • active: Active timeout
    • eof: End of capture
    • end: FIN exchange
    • rst: TCP reset

GeoIP (Professional+)

  • scountry/dcountry: ISO 3166-1 alpha-2 codes
  • sasn/dasn: Autonomous System Numbers
  • sasnorg/dasnorg: AS organization names

nDPI Detection (Basic+)

  • ndpi_appid: Application identifier (e.g., “TLS.YouTube”)
  • ndpi_category: Category (e.g., “Streaming”)
  • ndpi_risk_score: Cumulative risk score
  • ndpi_risk_severity: 0=none, 1=low, 2=medium, 3=high

nDPI Fingerprints (Professional+)

  • ndpi_ja4: JA4 TLS client fingerprint
  • ndpi_ja3s: JA3 TLS server fingerprint
  • ndpi_tcp_fp: TCP fingerprint with OS detection hint (format: “fingerprint/os”)
  • ndpi_fp: nDPI composite fingerprint for device correlation

Anomaly Detection (Enterprise)

  • anomaly_score: 0.0-1.0 indicating how unusual the flow is
  • anomaly_severity: Classification based on score percentile
  • anomaly_factors: Fields contributing most to the score

Parquet File Metadata

Each file includes custom metadata:

| Key | Description |
|---|---|
| rockfish.license_id | License identifier |
| rockfish.tier | License tier |
| rockfish.company | Company name |
| rockfish.observation | Observation domain |
| rockfish.schema_version | Schema version |
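
These keys can be read with any Parquet library, for example with pyarrow (a sketch; the file name is illustrative):

import pyarrow.parquet as pq

schema = pq.read_schema("rockfish-sensor-01-v2-20250128-143200.parquet")
meta = {k.decode(): v.decode() for k, v in (schema.metadata or {}).items()}
print(meta.get("rockfish.tier"), meta.get("rockfish.observation"))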

Example Queries

DuckDB - Read from S3

SELECT * FROM read_parquet(
    's3://bucket/v2/year=2025/month=01/day=28/*.parquet',
    hive_partitioning=true
);

Count by Protocol

SELECT proto, COUNT(*) as count
FROM read_parquet('flows/*.parquet')
GROUP BY proto
ORDER BY count DESC;

Filter by Country (Professional+)

SELECT saddr, daddr, scountry, dcountry, ndpi_appid
FROM read_parquet('flows/*.parquet')
WHERE scountry = 'US' AND dcountry != 'US';

High-Risk Flows (Basic+)

SELECT stime, saddr, daddr, ndpi_appid, ndpi_risk_score
FROM read_parquet('flows/*.parquet')
WHERE ndpi_risk_score > 100
ORDER BY ndpi_risk_score DESC;

Anomalous Flows (Enterprise)

SELECT stime, saddr, daddr, anomaly_score, anomaly_severity
FROM read_parquet('flows/*.parquet')
WHERE anomaly_severity IN ('HIGH', 'CRITICAL')
ORDER BY anomaly_score DESC
LIMIT 100;

CLI Reference

Command-line options for Rockfish tools.

rockfish_probe

Usage

rockfish_probe [OPTIONS]

Global Options

| Option | Short | Description |
|---|---|---|
| --config <FILE> | -c | Configuration file path |
| --help | -h | Show help |
| --version | -V | Show version |

Input Options

| Option | Short | Description |
|---|---|---|
| --source <SRC> | -i | Input source (interface or pcap file) |
| --live <TYPE> | | Capture type: pcap, afpacket, netmap, fmadio |
| --filter <EXPR> | | BPF filter expression |
| --snaplen <BYTES> | | Maximum capture bytes per packet |
| --promisc-off | | Disable promiscuous mode |

Flow Options

| Option | Description |
|---|---|
| --idle-timeout <SECS> | Idle timeout (default: 300) |
| --active-timeout <SECS> | Active timeout (default: 1800) |
| --max-flows <COUNT> | Maximum flow table size |
| --max-payload <BYTES> | Max payload bytes to capture |
| --udp-uniflow <PORT> | UDP uniflow port (0=disabled) |
| --ndpi | Enable nDPI (includes JA4/JA3s fingerprints) |

Fragment Options

| Option | Description |
|---|---|
| --no-frag | Disable fragment reassembly |
| --max-frag-tables <N> | Max fragment tables (default: 1024) |
| --frag-timeout <SECS> | Fragment timeout (default: 30) |

AF_PACKET Options (Linux)

| Option | Description |
|---|---|
| --afp-block-size <BYTES> | Ring buffer block size |
| --afp-block-count <N> | Ring buffer block count |
| --afp-fanout-group <ID> | Fanout group ID |
| --afp-fanout-mode <MODE> | Fanout mode: hash, lb, cpu, rollover, random |

Output Options

| Option | Description |
|---|---|
| --parquet-dir <DIR> | Output directory for Parquet files |
| --parquet-batch-size <N> | Flows per file |
| --parquet-prefix <PREFIX> | Filename prefix |
| --parquet-schema <TYPE> | Schema: simple or extended |
| --observation <NAME> | Observation domain name |
| --hive-boundary-flush | Flush at day boundaries |

S3 Options

| Option | Description |
|---|---|
| --s3-bucket <NAME> | S3 bucket name |
| --s3-prefix <PREFIX> | S3 key prefix |
| --s3-region <REGION> | AWS region |
| --s3-endpoint <URL> | Custom S3 endpoint |
| --s3-force-path-style | Use path-style URLs |
| --s3-hive-partitioning | Enable Hive partitioning |
| --s3-delete-after-upload | Delete local after upload |
| --test-s3 | Test S3 connectivity and exit |

Logging Options

| Option | Short | Description |
|---|---|---|
| --verbose | -v | Increase verbosity (-vv for debug) |
| --quiet | -q | Quiet mode |
| --stats | | Print statistics |
| --log-file <PATH> | | Log file path |

License Options

| Option | Description |
|---|---|
| --license <PATH> | License file path |

Environment: ROCKFISH_LICENSE_PATH

Examples

# Basic PCAP processing
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Live capture with AF_PACKET
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 4194304 \
    --afp-fanout-group 1 \
    --parquet-dir ./flows

# With all features (nDPI includes fingerprints)
rockfish_probe -i eth0 --live afpacket \
    --ndpi \
    --parquet-dir ./flows \
    --s3-bucket my-bucket \
    --s3-region us-east-1 \
    --s3-hive-partitioning \
    -vv

# Test S3 connectivity
rockfish_probe --test-s3 \
    --s3-bucket my-bucket \
    --s3-region us-east-1

rockfish_mcp

Usage

rockfish_mcp [OPTIONS]

Options

| Option | Description |
|---|---|
| --config <FILE> | Configuration file path |
| --help | Show help |
| --version | Show version |

Environment: ROCKFISH_CONFIG

Examples

# Start with config file
ROCKFISH_CONFIG=config.yaml rockfish_mcp

# Or via argument
rockfish_mcp --config /etc/rockfish/mcp.yaml

Common Patterns

Processing Multiple PCAPs

# Glob pattern
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows

# Multiple runs
for f in /data/captures/*.pcap; do
    rockfish_probe -i "$f" --parquet-dir ./flows
done

High-Performance Capture

# Pin to CPUs, large ring buffer, fanout
sudo taskset -c 0-3 rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 4194304 \
    --afp-block-count 128 \
    --afp-fanout-group 1 \
    --afp-fanout-mode hash \
    --parquet-dir /data/flows

Development/Testing

# Verbose output, no S3
rockfish_probe -i test.pcap \
    --parquet-dir ./test-flows \
    --ndpi \
    --stats \
    -vv

Production Deployment

# Full featured with S3
rockfish_probe -c /opt/rockfish/etc/config.yaml \
    --license /opt/rockfish/etc/license.json

License Tiers

Rockfish uses a tiered licensing model to enable different feature sets.

Tier Comparison

| Feature | Community | Basic | Professional | Enterprise |
|---|---|---|---|---|
| Core Features | | | | |
| Packet capture | Yes | Yes | Yes | Yes |
| Flow generation | Yes | Yes | Yes | Yes |
| Parquet export | Yes | Yes | Yes | Yes |
| S3 upload | Yes | Yes | Yes | Yes |
| Schema | | | | |
| v1 (Simple - 54 fields) | Yes | Yes | Yes | Yes |
| v2 (Extended - 60 fields) | - | - | Yes | Yes |
| Application Detection | | | | |
| nDPI labeling | - | Yes | Yes | Yes |
| nDPI risk scoring | - | Yes | Yes | Yes |
| Network Intelligence | | | | |
| GeoIP country/city/ASN | - | Yes | Yes | Yes |
| GeoIP AS organization | - | - | Yes | Yes |
| nDPI fingerprints (JA4, JA3s, TCP) | - | - | Yes | Yes |
| Customization | | | | |
| Custom observation name | - | Yes | Yes | Yes |
| Advanced Features | | | | |
| Anomaly detection | - | - | - | Yes |
| ML model integration | - | - | - | Yes |

Feature Details

Community Tier

Free tier with basic flow capture:

  • Standard 5-tuple flow generation
  • Parquet export (v1 schema)
  • S3 upload support
  • AF_PACKET high-performance capture
  • Fragment reassembly

Basic Tier

Adds application visibility and GeoIP intelligence:

  • All Community features
  • nDPI application labeling
  • nDPI risk scoring and categories
  • GeoIP lookups (scountry, dcountry, scity, dcity, sasn, dasn)
  • Custom observation domain name
  • 54 fields total

Professional Tier

Adds AS organization names and device fingerprinting:

  • All Basic features
  • Extended schema (60 fields)
  • GeoIP AS organization names (sasnorg, dasnorg)
  • nDPI fingerprints (JA4 client, JA3 server, TCP fingerprint, composite)

Enterprise Tier

Full feature set:

  • All Professional features
  • Anomaly detection (HBOS)
  • ML model integration
  • SaaS schema (63+ fields)
  • Correlation with rockfish_sensor

Schema Comparison

v1 (Simple) - Community/Basic

54 core fields:

  • Flow identification (flowid, obname)
  • Timing (stime, etime, dur, rtt)
  • Addresses (saddr, daddr, sport, dport)
  • Traffic (spkts, dpkts, sbytes, dbytes)
  • TCP state (iflags, uflags, sequences)
  • Payload analysis (entropy, packet sizes)
  • GeoIP: scountry, dcountry, scity, dcity, sasn, dasn (Basic tier)
  • nDPI results (Basic tier)

v2 (Extended) - Professional/Enterprise

60 fields (v1 + 6 additional):

  • GeoIP AS organization: sasnorg, dasnorg
  • nDPI fingerprints: ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp

v3 (SaaS) - Enterprise

63+ fields:

  • All v2 fields
  • Anomaly scores
  • ML predictions
  • Correlation IDs

License Enforcement

Parquet Metadata

Licensed files include metadata for validation:

rockfish.license_id: "lic_abc123"
rockfish.tier: "professional"
rockfish.company: "Example Corp"
rockfish.observation: "sensor-01"

MCP Validation

Configure license validation in MCP:

sources:
  # Require valid license
  prod_flows:
    path: s3://data/flows/
    require_license: true

  # Restrict to specific licenses
  enterprise_flows:
    path: s3://data/enterprise/
    require_license: true
    allowed_license_ids:
      - "lic_abc123"

Obtaining a License

Contact [email protected] for:

  • License quotes
  • Trial licenses
  • Enterprise agreements
  • Volume discounts