Rockfish Networks

Introduction

Network Flow Telemetry. Simple. Affordable. AI-Ready.

Rockfish Toolkit captures network flows and writes them directly to your S3 in Apache Parquet format. That’s it. No intermediate databases, no proprietary formats, no vendor lock-in.

Your data. Your privacy. Your control.

Your data is immediately ready for analysis by DuckDB, Spark, Pandas, Python, R, or any tool that reads Parquet - which is virtually every modern data platform.

Simple: One binary. Capture traffic. Write to S3. Done.
Affordable: Enterprise-grade network visibility for less than the price of a grande latte per day.
AI-Ready: Structured, queryable data that ML pipelines and AI assistants can consume immediately.
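
For example, a few lines of Python are enough to start exploring the probe's output (a minimal sketch; ./flows is a placeholder for the probe's output directory, and saddr/sbytes are flow fields described later in this guide):

import duckdb

# Top source addresses by bytes sent, straight from the probe's Parquet output.
# No intermediate database or ETL step is required.
print(duckdb.sql("""
    SELECT saddr, SUM(sbytes) AS bytes_sent
    FROM read_parquet('./flows/*.parquet')
    GROUP BY saddr
    ORDER BY bytes_sent DESC
    LIMIT 5
""").df())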

A Bolt-On Toolkit for SOC AI Readiness

The question “Is your SOC AI-ready?” has become central to modern security operations. Industry consensus is clear: AI readiness starts with SOC Data Foundations - structured, queryable security data that AI systems can actually consume.

The challenge? Traditional security tools generate logs in proprietary formats, scattered across siloed systems. Ripping and replacing your entire security stack isn’t practical.

Rockfish Toolkit is different. Deploy alongside your existing infrastructure to create an AI-ready data layer:

  • No replacement required - Add Rockfish to your network without changing existing tools
  • Deploy in minutes - Single binary or Docker container, no complex dependencies
  • Immediate AI compatibility - Output flows directly to any ML pipeline, SIEM, or AI assistant
  • Open data format - Apache Parquet works with DuckDB, Spark, Pandas, and every major analytics platform
  • S3-native - Scalable, cost-effective cloud storage

Why Parquet for Network Data?

Rockfish Toolkit captures network flows and exports them as Apache Parquet files - the same columnar format used by data science platforms, ML pipelines, and modern SIEM architectures:

Benefit | Description
Columnar storage | Fast analytical queries on specific fields
Schema enforcement | Consistent, typed data for ML models
70-90% compression | Reduced storage costs vs. raw logs
Universal compatibility | Works with DuckDB, Spark, Pandas, and AI frameworks
S3-native | Scalable, cost-effective cloud storage

This architecture enables security teams to add AI capabilities without rebuilding their entire SOC.

Why S3 Changes Everything

S3—and object storage generally—fundamentally changes what’s possible in cybersecurity by decoupling data collection from data analysis.

Traditional architectures force a painful tradeoff: either store everything and pay for expensive hot storage, or age out logs and lose forensic depth. S3 eliminates this with virtually unlimited, cheap, durable storage that can hold years of netflow, DNS logs, endpoint telemetry, and packet captures in columnar formats like Parquet.

This unlocks data science at scale:

  • Train anomaly detection models on months of baseline behavior
  • Run retrospective threat hunts when new IOCs emerge
  • Feed AI-driven SOC tools with the volume of data they need to learn patterns rather than just match signatures

You own your data:

The hive-partitioned, schema-on-read model means you’re not locked into a SIEM vendor’s data model. Your data lives in open formats, queryable by any tool—Athena, Spark, DuckDB, Pandas, or a custom Rust binary polling for new files.

When storage is cheap and permanent, detection becomes a software problem rather than a retention policy negotiation—and that shifts the advantage back to defenders.

What Rockfish Provides

Capability | Description
Network Flow Capture | High-performance packet capture with flow generation
Protocol Detection | Application-level protocol identification via nDPI
Device Fingerprinting | TLS/TCP fingerprints via nDPI for device identification
Threat Intelligence | IP reputation and risk scoring
Anomaly Detection | ML-based detection for enterprise deployments
MCP Integration | Query flows directly from AI assistants via Model Context Protocol

Use Cases

Rockfish Toolkit provides network visibility and AI-ready telemetry across diverse environments:

Environment | Use Case
Security Operations (SOC) | Threat detection, incident response, network forensics, AI-assisted investigation
IoT Networks | Device inventory, behavioral baselining, anomaly detection for connected devices
Industrial / Manufacturing | OT network monitoring, detecting unauthorized communications, compliance auditing
Robotics & Automation | Fleet communication analysis, identifying misconfigurations, performance monitoring
Healthcare | Medical device tracking, HIPAA compliance, detecting data exfiltration
SMB / Branch Offices | Affordable network visibility without enterprise SIEM costs
MSPs / MSSPs | Multi-tenant flow collection, centralized threat analysis across customers
Research & Education | Network traffic analysis, security research, ML model development

Components

Component | Description
rockfish_probe | Flow meter - captures packets and generates flow records
rockfish_mcp | MCP query server - SQL queries on Parquet files via DuckDB (Coming March 2025)
rockfish_detect | ML training and anomaly detection (Enterprise)
rockfish_intel | Threat intelligence caching server

Data Pipeline

Network Traffic
      |
      v
rockfish_probe  -->  Parquet Files  -->  S3
                           |
                           v
                    rockfish_mcp (DuckDB queries)
                           |
                           v
                    AI Assistants / SIEM / Analytics

Parquet Schema by Tier

Rockfish outputs flow data in Apache Parquet format. The schema varies by license tier:

Tier | Fields | Key Data
Community | 44 | 5-tuple, timing, traffic volumes, TCP flags, payload entropy
Basic | 54 | + nDPI application detection, GeoIP (country, city, ASN)
Professional | 60 | + GeoIP AS org, nDPI fingerprints
Enterprise | 63+ | + Anomaly scores, severity classification

Key Fields

All tiers include:

  • saddr, daddr - Source/destination IP addresses
  • sport, dport - Source/destination ports
  • proto - Protocol (TCP, UDP, ICMP)
  • spkts, dpkts, sbytes, dbytes - Traffic volumes
  • dur, rtt - Duration and round-trip time
  • sentropy, dentropy - Payload entropy (encrypted traffic detection)

Basic+ adds:

  • scountry, dcountry - Geographic country codes
  • scity, dcity - Geographic city names
  • sasn, dasn - Autonomous System Numbers
  • ndpi_appid - Application identifier (e.g., “TLS.YouTube”)
  • ndpi_risk_score - Risk scoring

Professional+ adds:

  • sasnorg, dasnorg - AS organization names
  • ndpi_ja4, ndpi_ja3s - TLS fingerprints for device identification
  • ndpi_tcp_fp - TCP fingerprint with OS detection hint
  • ndpi_fp - nDPI composite fingerprint

Enterprise adds:

  • anomaly_score - ML-derived anomaly score (0.0-1.0)
  • anomaly_severity - Classification (LOW, MEDIUM, HIGH, CRITICAL)

See Parquet Schema for complete field reference.
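
As a quick illustration of how these fields get used, a short DuckDB sketch (./flows is a placeholder path; ndpi_risk_score is available on Basic and above, and Enterprise deployments could order by anomaly_score instead):

import duckdb

# Surface the flows nDPI flagged as riskiest (sketch only)
print(duckdb.sql("""
    SELECT saddr, daddr, dport, ndpi_appid, ndpi_risk_score
    FROM read_parquet('./flows/*.parquet')
    WHERE ndpi_risk_score > 0
    ORDER BY ndpi_risk_score DESC
    LIMIT 10
""").df())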

License Tiers

Tier | Features
Community | Basic schema (44 fields), S3 upload
Basic | + nDPI labels, GeoIP (country, city, ASN), custom observation name (54 fields)
Professional | + GeoIP AS org, nDPI fingerprints (60 fields)
Enterprise | + ML models, anomaly detection

See License Tiers for detailed comparison.

Getting Started

  1. Installation - Install from download portal
  2. Quick Start - Capture your first flows
  3. Licensing - Activate your license

Support

Installation

Quick Install

curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

The installer auto-detects your platform and installs via the appropriate method (Debian package, Docker, or binary).

Options:

# Install specific version
ROCKFISH_VERSION=1.0.0 curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

# Force Docker installation
ROCKFISH_METHOD=docker curl -fsSL https://toolkit.rockfishnetworks.com/install.sh | bash

Manual Installation

Rockfish Toolkit is also available as a Debian package and Docker image from the Rockfish Networks download portal.

System Requirements

  • Operating System: Debian 11+, Ubuntu 20.04+, or Docker-compatible host
  • Architecture: x86_64 (amd64)
  • Memory: 2GB minimum (4GB+ recommended for high-traffic networks)
  • Storage: Depends on retention policy (10GB minimum)
  • Network: Interface with capture capabilities

Debian Package Installation

Download the toolkit package from the Rockfish download portal:

# Download the package
wget https://download.rockfishnetworks.com/rockfish_toolkit.deb

# Install
sudo dpkg -i rockfish_toolkit.deb

# Install dependencies if needed
sudo apt-get install -f

The rockfish_toolkit.deb package includes all Rockfish Toolkit binaries:

Binary | Description
rockfish_probe | Network flow meter
rockfish_mcp | MCP query server
rockfish_detect | ML anomaly detection (Enterprise)
rockfish_intel | Threat intelligence server

Installed Files

After installation:

Path | Description
/usr/bin/rockfish_* | Rockfish binaries
/etc/rockfish/ | Configuration directory
/var/lib/rockfish/ | Data directory
/var/log/rockfish/ | Log directory

Docker Installation

Pull the Rockfish Toolkit image from Docker Hub:

docker pull rockfishnetworks/toolkit:latest

The toolkit image includes all Rockfish Toolkit binaries. Specify the command to run the desired component.

Running the Probe

docker run -d \
  --name rockfish-probe \
  --network host \
  --cap-add NET_ADMIN \
  --cap-add NET_RAW \
  -v /etc/rockfish:/etc/rockfish:ro \
  -v /var/lib/rockfish:/var/lib/rockfish \
  rockfishnetworks/toolkit:latest \
  rockfish_probe -c /etc/rockfish/probe.yaml

Running the MCP Server

docker run -d \
  --name rockfish-mcp \
  -p 8080:8080 \
  -v /etc/rockfish:/etc/rockfish:ro \
  -v /var/lib/rockfish:/var/lib/rockfish:ro \
  rockfishnetworks/toolkit:latest \
  rockfish_mcp -c /etc/rockfish/mcp.yaml

Docker Compose

Example docker-compose.yml:

version: '3.8'

services:
  probe:
    image: rockfishnetworks/toolkit:latest
    network_mode: host
    cap_add:
      - NET_ADMIN
      - NET_RAW
    volumes:
      - ./config:/etc/rockfish:ro
      - ./data:/var/lib/rockfish
    command: ["rockfish_probe", "-c", "/etc/rockfish/probe.yaml"]
    restart: unless-stopped

  mcp:
    image: rockfishnetworks/toolkit:latest
    ports:
      - "8080:8080"
    volumes:
      - ./config:/etc/rockfish:ro
      - ./data:/var/lib/rockfish:ro
    command: ["rockfish_mcp", "-c", "/etc/rockfish/mcp.yaml"]
    restart: unless-stopped

Verifying Installation

Check that the installation was successful:

# Check probe version
rockfish_probe --version

# Check MCP version
rockfish_mcp --version

Next Steps

Quick Start

This guide walks you through capturing network flows and querying them.

1. Capture Flows

From a PCAP File

# Basic capture to Parquet
rockfish_probe -i capture.pcap --parquet-dir ./flows

# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

Live Capture

# Standard libpcap capture (requires root)
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

# High-performance AF_PACKET capture (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows

With a Configuration File

# Create config.yaml (see Configuration docs)
rockfish_probe -c config.yaml

2. Verify Output

# Check generated files
ls -la flows/

# View file info with DuckDB
duckdb -c "DESCRIBE SELECT * FROM 'flows/*.parquet'"

3. Query with MCP

Set up the MCP server to query your flows:

# mcp-config.yaml
sources:
  flow:
    path: ./flows/
    description: Network flow data

output:
  default_format: table
  max_rows: 100

# Start MCP server
ROCKFISH_CONFIG=mcp-config.yaml rockfish_mcp

Example Queries

Using the MCP tools:

# Count total flows
count:
  source: flow

# Top talkers by bytes
query:
  source: flow
  sql: |
    SELECT saddr, SUM(sbytes + dbytes) as total_bytes
    FROM {source}
    GROUP BY saddr
    ORDER BY total_bytes DESC
    LIMIT 10

# Filter by protocol
query:
  source: flow
  filter: "proto = 'TCP'"
  limit: 50

4. Upload to S3 (Optional)

Configure S3 upload in your probe config:

output:
  parquet_dir: /var/lib/rockfish/flows

s3:
  bucket: my-flow-data
  region: us-east-1
  hive_partitioning: true
  delete_after_upload: true

Files are automatically uploaded and organized by date:

s3://my-flow-data/year=2025/month=01/day=28/rockfish-*.parquet
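
Once uploaded, the partitions can be queried in place. A sketch of what that looks like from Python, assuming read access to the bucket and AWS credentials in the environment:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # enables s3:// paths in DuckDB

# Count one day of flows directly from the Hive-partitioned layout shown above
print(con.execute("""
    SELECT COUNT(*) AS flows
    FROM read_parquet('s3://my-flow-data/year=2025/month=01/day=28/*.parquet',
                      hive_partitioning = true)
""").fetchdf())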

Next Steps

Licensing

Rockfish uses Ed25519-signed licenses with tier-based feature restrictions.

License Tiers

Tier | Features
Community | Basic schema (48 fields), local storage only
Basic | + nDPI labels, custom observation name
Professional | + GeoIP, nDPI fingerprints (60 fields)
Enterprise | + ML models, anomaly detection

License File

Licenses are JSON files with an Ed25519 signature:

{
  "id": "lic_abc123",
  "tier": "professional",
  "customer_email": "[email protected]",
  "company": "Example Corp",
  "observation": "sensor-01",
  "issued_at": "2025-01-01T00:00:00Z",
  "expires_at": "2026-01-01T00:00:00Z",
  "signature": "base64-encoded-signature"
}

Configuration

Specify the license file in your config:

license:
  path: /opt/rockfish/etc/license.json

Or via environment variable:

export ROCKFISH_LICENSE_PATH=/opt/rockfish/etc/license.json
rockfish_probe -c config.yaml

Feature Matrix

Feature | Community | Basic | Professional | Enterprise
Schema v1 (Simple) | Yes | Yes | Yes | Yes
Schema v2 (Extended) | No | No | Yes | Yes
GeoIP Fields | No | No | Yes | Yes
nDPI Fingerprints | No | No | Yes | Yes
nDPI Labeling | No | Yes | Yes | Yes
Custom Observation Domain | No | Yes | Yes | Yes
Anomaly Detection | No | No | No | Yes

Parquet Metadata

Licensed files include metadata for validation:

Key | Description
rockfish.license_id | License identifier
rockfish.tier | License tier
rockfish.company | Company name
rockfish.customer_email | Customer email
rockfish.issued_at | License issue date
rockfish.observation | Observation domain name
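
One way to inspect these keys is to read the Parquet footer directly, for example with pyarrow (an illustrative sketch; the filename is a placeholder):

import pyarrow.parquet as pq

# Key/value metadata lives in the Parquet file footer
meta = pq.read_metadata("rockfish-flow-000001.parquet").metadata or {}

for key, value in meta.items():
    key = key.decode()
    if key.startswith("rockfish."):
        print(f"{key} = {value.decode()}")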

MCP License Validation

Rockfish MCP can validate that Parquet files were generated by a licensed probe:

sources:
  licensed_flows:
    path: s3://data/flows/
    description: Licensed network flow data
    require_license: true

  enterprise_flows:
    path: s3://data/enterprise/
    description: Enterprise flow data
    require_license: true
    allowed_license_ids:
      - "lic_abc123"
      - "lic_def456"

Obtaining a License

Contact [email protected] for license inquiries.

Probe Overview

Rockfish Probe is a high-performance flow meter that captures network traffic and generates flow records in Apache Parquet format.

Features

  • Packet capture via libpcap - Live interface capture or PCAP file reading
  • High-performance AF_PACKET - Linux TPACKET_V3 with mmap ring buffer
  • Fragment reassembly - Reassembles fragmented IP packets
  • Bidirectional flows - Forward and reverse direction tracking
  • nDPI integration - Application protocol detection
  • GeoIP lookups - Geographic location via MaxMind databases
  • IP reputation - AbuseIPDB integration with local caching
  • S3 upload - Automatic upload to S3-compatible storage

Output Format

Flow records follow IPFIX Information Element naming conventions (RFC 5102/5103):

{
  "flowStartMilliseconds": "2025-01-15T10:30:00.000Z",
  "flowEndMilliseconds": "2025-01-15T10:30:05.123Z",
  "flowDurationMilliseconds": 5123,
  "ipVersion": 4,
  "protocolIdentifier": 6,
  "sourceIPAddress": "192.168.1.100",
  "sourceTransportPort": 54321,
  "destinationIPAddress": "93.184.216.34",
  "destinationTransportPort": 443,
  "octetTotalCount": 1234,
  "packetTotalCount": 15,
  "applicationName": "TLS"
}

Basic Usage

# Read from PCAP file
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Live capture with libpcap
sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

# High-performance AF_PACKET (Linux)
sudo rockfish_probe -i eth0 --live afpacket --parquet-dir ./flows

# With nDPI application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

Next Steps

Configuration Reference

Rockfish Probe uses YAML configuration files. Command-line arguments override config file settings.

# Run with configuration file
rockfish_probe -c /path/to/config.yaml

# Override settings via CLI
rockfish_probe -c config.yaml --source eth1

Configuration Sections


License

license:
  path: /opt/rockfish/etc/license.json

Option | Type | Default | Description
path | string | - | Path to license file (JSON with Ed25519 signature)

Environment Variable: ROCKFISH_LICENSE_PATH


Input

input:
  source: eth0
  live_type: afpacket
  filter: "tcp or udp"
  snaplen: 65535
  promisc_off: false

Option | Type | Default | Description
source | string | (required) | Interface name or PCAP file path/glob
live_type | string | pcap | Capture method: pcap, afpacket, netmap, fmadio
filter | string | - | BPF filter expression
snaplen | int | 65535 | Maximum bytes per packet
promisc_off | bool | false | Disable promiscuous mode

BPF Filter Examples

# TCP and UDP only
filter: "tcp or udp"

# HTTP and HTTPS
filter: "port 80 or port 443"

# Specific subnet
filter: "net 192.168.1.0/24"

# Exclude SSH
filter: "not port 22"

Flow

flow:
  idle_timeout: 300
  active_timeout: 1800
  max_flows: 0
  max_payload: 500
  udp_uniflow_port: 0
  mac: true

Option | Type | Default | Description
idle_timeout | int | 300 | Seconds of inactivity before flow expires
active_timeout | int | 1800 | Maximum flow duration before export
max_flows | int | 0 | Maximum concurrent flows (0 = unlimited)
max_payload | int | 500 | Max payload bytes for protocol detection
udp_uniflow_port | int | 0 | UDP uniflow mode (0=off, 1=all)
mac | bool | true | Include MAC addresses

Note: TLS/TCP fingerprints (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp) are automatically extracted when nDPI is enabled and included in Professional+ tier output.


nDPI

ndpi:
  enabled: true
  protocol_file: /opt/rockfish/etc/ndpi-protos.txt
  categories_file: /opt/rockfish/etc/ndpi-categories.txt

Option | Type | Default | Description
enabled | bool | false | Enable nDPI application labeling
protocol_file | string | - | Custom protocol definitions
categories_file | string | - | Custom category definitions

Note: nDPI is included in all Rockfish packages (Basic tier and above).


Fragment

fragment:
  disabled: false
  max_tables: 1024
  timeout: 30

Option | Type | Default | Description
disabled | bool | false | Disable IP fragment reassembly
max_tables | int | 1024 | Max concurrent fragment tables
timeout | int | 30 | Fragment timeout in seconds

Output

output:
  parquet_dir: /var/run/rockfish/flows
  parquet_batch_size: 1000000
  parquet_file_prefix: rockfish-flow
  parquet_schema: simple
  observation: sensor-01
  hive_boundary_flush: false
  stats: true
  verbose: 1
  log_file: /var/log/rockfish/rockfish.log

Option | Type | Default | Description
parquet_dir | string | (required) | Output directory for Parquet files
parquet_batch_size | int | 1000000 | Max flows per file before rotation
parquet_file_prefix | string | rockfish-flow | Filename prefix
parquet_schema | string | simple | Schema: simple (50 fields) or extended (62 fields)
observation | string | gnat | Observation domain name
hive_boundary_flush | bool | false | Flush at day boundaries for Hive partitioning
verbose | int | 1 | 0=warnings, 1=info, 2=debug, 3=trace
log_file | string | - | Log file path (enables daily rotation)

AFPacket

Linux high-performance capture:

afpacket:
  block_size: 2097152
  block_count: 64
  fanout_group: 0
  fanout_mode: hash

Option | Type | Default | Description
block_size | int | 2097152 | Ring buffer block size (bytes)
block_count | int | 64 | Number of ring buffer blocks
fanout_group | int | 0 | Fanout group ID (0 = disabled)
fanout_mode | string | hash | Distribution: hash, lb, cpu, rollover, random

Memory: block_size × block_count (default: 128 MB)


Netmap

FreeBSD high-performance capture:

netmap:
  rx_slots: 1024
  tx_slots: 1024
  poll_timeout: 1000
  host_rings: false

S3

s3:
  bucket: my-flow-bucket
  prefix: flows
  region: us-east-1
  endpoint: https://nyc3.digitaloceanspaces.com
  force_path_style: false
  hive_partitioning: true
  delete_after_upload: true
  aggregate: true
  aggregate_hold_minutes: 5

Option | Type | Default | Description
bucket | string | (required) | S3 bucket name
prefix | string | - | S3 key prefix
region | string | (required) | AWS region
endpoint | string | - | Custom endpoint (MinIO, DO Spaces, etc.)
force_path_style | bool | false | Use path-style URLs (required for MinIO)
hive_partitioning | bool | false | Organize by year=/month=/day=/
delete_after_upload | bool | false | Delete local files after upload
aggregate | bool | false | Merge files per minute before upload
aggregate_hold_minutes | int | 1 | Hold time before aggregating

GeoIP

geoip:
  country_db: /opt/rockfish/etc/GeoLite2-Country.mmdb
  city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
  asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb

Note: Requires --features geoip and MaxMind databases.


Threat Intel

threat_intel:
  enabled: true
  endpoint_url: "http://localhost:8080"
  api_token: "your-api-token"
  batch_size: 100
  timeout_seconds: 10

Option | Type | Default | Description
enabled | bool | false | Enable threat intel lookups
endpoint_url | string | (required) | API endpoint URL
api_token | string | (required) | Bearer token for authentication
batch_size | int | 100 | IPs per API request
timeout_seconds | int | 10 | Request timeout

Output goes to <parquet_dir>/intel/.


Complete Example

license:
  path: /opt/rockfish/etc/license.json

input:
  source: eth0
  live_type: afpacket
  filter: "tcp or udp"

flow:
  idle_timeout: 300
  active_timeout: 1800
  max_flows: 1000000
  max_payload: 500

ndpi:
  enabled: true  # Fingerprints (ndpi_ja4, ndpi_ja3s) extracted automatically

output:
  parquet_dir: /var/run/rockfish/flows
  observation: sensor-01
  hive_boundary_flush: true

afpacket:
  block_size: 2097152
  block_count: 64

s3:
  bucket: flow-data
  prefix: sensors/sensor-01
  region: us-east-1
  hive_partitioning: true
  delete_after_upload: true

geoip:
  city_db: /opt/rockfish/etc/GeoLite2-City.mmdb
  asn_db: /opt/rockfish/etc/GeoLite2-ASN.mmdb

Capture Modes

Rockfish Probe supports multiple capture backends for different platforms and performance requirements.

Capture Types

Type | Platform | Description
pcap | All | Standard libpcap (portable)
afpacket | Linux | AF_PACKET with TPACKET_V3 (high-performance)
netmap | FreeBSD | Netmap framework (high-performance)
fmadio | Linux | FMADIO appliance ring buffer

libpcap (Default)

The most portable option, works on all platforms.

input:
  source: eth0
  live_type: pcap
  filter: "tcp or udp"
  snaplen: 65535

sudo rockfish_probe -i eth0 --live pcap --parquet-dir ./flows

Pros

  • Works everywhere (Linux, FreeBSD, macOS)
  • Supports BPF filters
  • Well-documented

Cons

  • Lower performance than kernel-bypass methods
  • Copies packets through kernel

AF_PACKET (Linux)

High-performance capture using Linux’s TPACKET_V3 with memory-mapped ring buffers.

input:
  source: eth0
  live_type: afpacket

afpacket:
  block_size: 2097152    # 2 MB blocks
  block_count: 64        # 128 MB total ring
  fanout_group: 0        # 0 = disabled
  fanout_mode: hash

sudo rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 2097152 \
    --afp-block-count 64 \
    --parquet-dir ./flows

Ring Buffer Sizing

Total Ring Buffer = block_size × block_count
Default: 2 MB × 64 = 128 MB

For 10 Gbps+:

afpacket:
  block_size: 4194304   # 4 MB
  block_count: 128      # 512 MB total

Fanout Mode

Distribute packets across multiple processes:

afpacket:
  fanout_group: 1       # Non-zero enables fanout
  fanout_mode: hash     # Distribute by flow hash

Mode | Description
hash | By flow hash (recommended for flow analysis)
lb | Round-robin load balancing
cpu | By receiving CPU
rollover | Fill one socket, then next
random | Random distribution

Multi-Process Capture

Run multiple instances with the same fanout group:

# Terminal 1
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 -o flows1/

# Terminal 2
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 -o flows2/

Netmap (FreeBSD)

High-performance capture using FreeBSD’s netmap framework.

input:
  source: em0
  live_type: netmap

netmap:
  rx_slots: 1024
  tx_slots: 1024
  poll_timeout: 1000
  host_rings: false

Option | Default | Description
rx_slots | driver default | RX ring slot count
tx_slots | driver default | TX ring slot count
poll_timeout | 1000 | Poll timeout (ms)
host_rings | false | Enable host stack access

FMADIO (Linux)

Capture from FMADIO 100G packet capture appliances.

input:
  source: ring0
  live_type: fmadio

fmadio:
  ring_path: /opt/fmadio/queue/lxc_ring0
  include_fcs_errors: false

Note: FMADIO support is included in all Rockfish packages.

Reading PCAP Files

Process existing capture files:

# Single file
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Multiple files with glob
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows

# With application labeling
rockfish_probe -i capture.pcap --ndpi --parquet-dir ./flows

BPF Filters

All capture modes support BPF filters (except FMADIO):

input:
  filter: "tcp or udp"

Common filters:

# Web traffic only
--filter "port 80 or port 443"

# Specific subnet
--filter "net 10.0.0.0/8"

# Exclude broadcast
--filter "not broadcast"

# DNS traffic
--filter "port 53"

Choosing a Capture Mode

Requirement | Recommended Mode
Portability | pcap
Linux high-speed (1-10 Gbps) | afpacket
Linux 40-100 Gbps | afpacket with large ring + fanout
FreeBSD high-speed | netmap
FMADIO appliance | fmadio

Next Steps

Performance Tuning

Optimize Rockfish Probe for high-speed network capture.

AF_PACKET Tuning

Ring Buffer Size

For 10 Gbps+ capture, increase the ring buffer:

afpacket:
  block_size: 4194304   # 4 MB per block
  block_count: 128      # 512 MB total ring buffer

Use Fanout for Multi-Queue NICs

Modern NICs have multiple RX queues. Use fanout to utilize all cores:

# Run multiple instances with same fanout group
taskset -c 0 rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 --parquet-dir ./flows1 &

taskset -c 1 rockfish_probe -i eth0 --live afpacket \
    --afp-fanout-group 1 --parquet-dir ./flows2 &

Use hash fanout mode to keep flows together.

CPU Pinning

Pin to specific CPU cores:

taskset -c 0 rockfish_probe -i eth0 --live afpacket ...

Or use CPU isolation:

# /etc/default/grub
GRUB_CMDLINE_LINUX="isolcpus=0,1"

System Tuning

Socket Buffers

Increase kernel buffer sizes:

# Temporary
sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.rmem_default=134217728

# Permanent (/etc/sysctl.conf)
net.core.rmem_max=134217728
net.core.rmem_default=134217728

Network Budget

Increase NAPI budget for high packet rates:

sudo sysctl -w net.core.netdev_budget=600
sudo sysctl -w net.core.netdev_budget_usecs=8000

IRQ Affinity

Distribute NIC interrupts across CPUs:

# Find NIC IRQs
cat /proc/interrupts | grep eth0

# Set affinity (example for 4 queues)
echo 1 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
echo 8 > /proc/irq/27/smp_affinity

Or use irqbalance with proper configuration.

Disable CPU Power Saving

Prevent CPU frequency scaling:

# Set performance governor
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $cpu
done

Flow Table Sizing

Limit memory usage under high connection rates:

flow:
  max_flows: 1000000    # Limit to 1M concurrent flows
  idle_timeout: 60      # Shorter timeout for faster cleanup

Parquet Output Tuning

Batch Size

Larger batches = fewer files, better compression:

output:
  parquet_batch_size: 2000000   # 2M flows per file

S3 Aggregation

Reduce small file overhead:

s3:
  aggregate: true
  aggregate_hold_minutes: 5   # Merge files for 5 minutes
  delete_after_upload: true

Monitoring

Statistics Output

Enable periodic statistics:

output:
  stats: true
  verbose: 2    # Debug level

Key Metrics to Watch

  • Packets/sec: Compare with NIC stats (ethtool -S eth0)
  • Drops: Check for ring buffer overflows
  • Flows/sec: Flow export rate
  • Memory usage: Monitor with top or htop

Check for Drops

# NIC drops
ethtool -S eth0 | grep -i drop

# Kernel drops
cat /proc/net/dev | grep eth0

# AF_PACKET drops
cat /proc/net/packet

Hardware Recommendations

NIC Selection

For high-speed capture:

  • Intel X710/XL710 (40 GbE)
  • Intel E810 (100 GbE)
  • Mellanox ConnectX-5/6

Enable RSS (Receive Side Scaling) for multi-queue distribution.

CPU

  • Modern Intel Xeon or AMD EPYC
  • At least 1 core per 10 Gbps
  • Large L3 cache helps

Storage

For sustained capture:

  • NVMe SSD for local Parquet files
  • Fast S3-compatible storage with adequate bandwidth

Example: 10 Gbps Configuration

license:
  path: /opt/rockfish/etc/license.json

input:
  source: eth0
  live_type: afpacket

flow:
  idle_timeout: 120
  active_timeout: 900
  max_flows: 2000000
  max_payload: 256

afpacket:
  block_size: 4194304
  block_count: 128
  fanout_group: 1
  fanout_mode: hash

output:
  parquet_dir: /data/flows
  parquet_batch_size: 2000000
  observation: sensor-01

s3:
  bucket: flow-data
  region: us-east-1
  aggregate: true
  aggregate_hold_minutes: 2
  delete_after_upload: true

Run with CPU pinning:

sudo taskset -c 0-3 rockfish_probe -c config.yaml

IP Reputation

Rockfish Probe integrates with threat intelligence services for IP reputation lookups.

Overview

Two approaches are available:

Feature | ip_reputation | threat_intel
Provider | Direct AbuseIPDB | External API server
Caching | Local in-memory | Server-side
Rate limits | Managed locally | Server manages
Best for | Single sensor | Multiple sensors

These features are mutually exclusive.

IP Reputation (Direct AbuseIPDB)

Query AbuseIPDB directly with local caching.

Configuration

ip_reputation:
  enabled: true
  api_key: "your-abuseipdb-api-key"
  cache_ttl_hours: 24
  max_age_in_days: 90
  s3_upload: true

Option | Default | Description
enabled | false | Enable IP reputation lookups
api_key | (required) | AbuseIPDB API key
output_dir | <parquet_dir>/ip_reputation | Output directory
cache_ttl_hours | 24 | Cache entry lifetime
max_age_in_days | 90 | Max age for AbuseIPDB reports
s3_upload | false | Upload parquet files to S3

How It Works

  1. For each flow, source and destination IPs are queued for lookup
  2. Lookups run in a background thread
  3. Results are cached in memory with reference counting
  4. Cache is exported to Parquet every hour

Rate Limiting

AbuseIPDB free tier: 1000 requests/day.

When rate-limited (HTTP 429):

  1. API requests pause
  2. Local cache continues serving
  3. Resumes at the next hour boundary
  4. Repeats if still rate-limited
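
A minimal sketch of that back-off behavior (illustrative Python only; the probe implements this internally, and query_abuseipdb is a hypothetical stand-in for the actual API call):

import time

def query_abuseipdb(ip: str):
    """Hypothetical stand-in for the real AbuseIPDB check endpoint."""
    return 200, {"ip": ip, "abuse_confidence_score": 0}

def next_hour_boundary(now: float) -> float:
    """Epoch time of the next top-of-hour."""
    return (int(now) // 3600 + 1) * 3600

paused_until = 0.0
cache: dict[str, dict] = {}

def lookup(ip: str):
    global paused_until
    if ip in cache:                        # cached results keep serving
        return cache[ip]
    if time.time() < paused_until:         # rate-limited: skip the API call
        return None
    status, result = query_abuseipdb(ip)
    if status == 429:                      # HTTP 429: pause until next hour
        paused_until = next_hour_boundary(time.time())
        return None
    cache[ip] = result
    return result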

Output Schema

Hourly Parquet exports include:

Field | Type | Description
ip_address | String | IP address
abuse_confidence_score | Int32 | Score (0-100)
country_code | String | Country code
isp | String | ISP name
domain | String | Associated domain
total_reports | Int32 | Total abuse reports
last_reported_at | Timestamp | Last report time
is_whitelisted | Boolean | Whitelisted status
reference_count | Int64 | Times seen in flows
first_seen | Timestamp | First flow occurrence
last_seen | Timestamp | Last flow occurrence
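
These hourly exports are plain Parquet, so they can be queried or joined like the flow data itself. For example (a sketch; the path assumes parquet_dir is ./flows and output_dir was left at its default):

import duckdb

# Most concerning IPs seen in local traffic, from the hourly reputation export
print(duckdb.sql("""
    SELECT ip_address, abuse_confidence_score, total_reports, reference_count
    FROM read_parquet('./flows/ip_reputation/*.parquet')
    WHERE abuse_confidence_score >= 50
    ORDER BY abuse_confidence_score DESC, reference_count DESC
    LIMIT 20
""").df())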

Threat Intel (External API)

Use an external threat intelligence server (e.g., rockfish_intel) for centralized lookups.

Configuration

threat_intel:
  enabled: true
  endpoint_url: "http://localhost:8080"
  api_token: "your-api-token"
  batch_size: 100
  timeout_seconds: 10

Option | Default | Description
enabled | false | Enable threat intel lookups
endpoint_url | (required) | API server URL
api_token | (required) | Bearer token
batch_size | 100 | IPs per request
timeout_seconds | 10 | Request timeout

Benefits

  • Centralized caching: Share cache across multiple sensors
  • Rate limit management: Server handles provider limits
  • Multiple providers: Server can aggregate multiple sources

Output

Threat intel Parquet files are written to <parquet_dir>/intel/.

With S3 and Hive partitioning:

s3://bucket/prefix/intel/year=YYYY/month=MM/day=DD/filename.parquet

Setup with rockfish_intel

  1. Start the intel server with your AbuseIPDB key
  2. Create a client entry in clients.yaml
  3. Configure the probe:

threat_intel:
  enabled: true
  endpoint_url: "http://threatintel-server:8080"
  api_token: "client-token-from-clients-yaml"

Choosing Between Options

Scenario | Recommendation
Single sensor, simple setup | ip_reputation
Multiple sensors | threat_intel + rockfish_intel
Enterprise with custom providers | threat_intel
Limited API quota | threat_intel (shared cache)

Getting an AbuseIPDB API Key

  1. Create account at abuseipdb.com
  2. Go to API settings
  3. Generate API key

Free tier: 1000 checks/day
Paid tiers: Higher limits, additional features

MCP Overview

Coming Soon: Rockfish MCP is currently under development and will be available in March 2025.

Rockfish MCP is a Model Context Protocol (MCP) server for querying Parquet files using DuckDB.

Features

  • SQL queries via DuckDB - Full SQL support for Parquet files
  • S3 support - AWS, MinIO, Cloudflare R2, DigitalOcean Spaces
  • Configurable data sources - Abstract file locations from API
  • Multiple output formats - JSON, JSON Lines, CSV, Table
  • TLS support - Secure connections for remote access
  • HTTP/WebSocket mode - Standard HTTP with Bearer token auth
  • License validation - Verify Parquet files were generated by licensed probes

Operation Modes

Mode | Transport | Use Case
stdio | stdin/stdout | Claude Desktop, local tools
TLS | Raw TCP+TLS | Custom integrations
HTTP | HTTPS+WebSocket | Web clients, standard tooling

Built-in Tools

Tool | Description
list_sources | List configured data sources
schema | Get column names and types
query | Query with filters and column selection
aggregate | Group and aggregate data
sample | Get random sample rows
count | Count rows with optional filter

Quick Example

# config.yaml
sources:
  flow:
    path: s3://security-data/netflow/
    description: Network flow data

output:
  default_format: json
  max_rows: 1000

ROCKFISH_CONFIG=config.yaml rockfish_mcp

Query example:

query:
  source: flow
  columns: [saddr, daddr, sbytes, dbytes]
  filter: "sbytes > 1000000"
  limit: 50

License Validation

Rockfish MCP will validate that Parquet files were generated by a licensed rockfish_probe. Each Parquet file includes signed metadata:

  • rockfish.license_id - License identifier
  • rockfish.tier - License tier (Community, Basic, Professional, Enterprise)
  • rockfish.company - Company name
  • rockfish.observation - Observation domain name

Configure validation per data source:

sources:
  prod_flows:
    path: s3://data/flows/
    require_license: true              # Reject unlicensed files
    allowed_license_ids:               # Optional: restrict to specific licenses
      - "lic_abc123"

Next Steps

MCP Setup

Configure Rockfish MCP for different deployment scenarios.

Configuration File

Create a config.yaml:

# S3 credentials (optional)
s3:
  region: us-east-1
  # access_key_id: your-key
  # secret_access_key: your-secret
  # endpoint: localhost:9000  # For MinIO/R2

# Output settings
output:
  default_format: json
  max_rows: 1000
  pretty_print: true

# Data source mappings
sources:
  flow:
    path: s3://security-data/netflow/
    description: Network flow data
    require_license: true

  ip_reputation:
    path: /data/threat-intel/ip-reputation.parquet
    description: IP reputation scores

stdio Mode (Default)

For Claude Desktop or local tools.

Claude Desktop Configuration

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/claude/claude_desktop_config.json

{
  "mcpServers": {
    "rockfish": {
      "command": "/path/to/rockfish-mcp",
      "env": {
        "ROCKFISH_CONFIG": "/path/to/config.yaml"
      }
    }
  }
}

HTTP/WebSocket Mode

For web applications and standard HTTP clients.

Quick Start

  1. Generate self-signed certificate:

    ./generate-self-signed-cert.sh
    
  2. Generate API key and hash:

    API_KEY=$(openssl rand -base64 32)
    echo "API Key: $API_KEY"
    echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"
    
  3. Configure config.yaml:

    tls:
      enabled: true
      http_mode: true
      bind_address: "0.0.0.0:8443"
      cert_path: "./certs/cert.pem"
      key_path: "./certs/key.pem"
      auth:
        api_keys:
          - name: "web-client"
            key_hash: "paste-hash-here"
    
  4. Run the server:

    ROCKFISH_CONFIG=config.yaml rockfish_mcp
    
  5. Connect:

    python examples/python_client_bearer_auth.py \
      --host localhost --port 8443 \
      --token "$API_KEY" --skip-verify
    

Plain HTTP Mode (Development)

For local development or behind a reverse proxy:

tls:
  enabled: true
  http_mode: true
  disable_tls: true  # No encryption
  bind_address: "127.0.0.1:8080"
  auth:
    api_keys:
      - name: "dev-client"
        key_hash: "your-hash-here"

Warning: Only use plain HTTP for local development or behind a TLS-terminating proxy.

TLS Server Mode

For custom integrations with raw TLS connections.

tls:
  enabled: true
  http_mode: false  # Raw TLS mode
  bind_address: "127.0.0.1:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "production-client"
        key_hash: "your-key-hash-here"

License Validation

Require Parquet files to have valid Rockfish license metadata:

sources:
  # Any valid Rockfish license
  licensed_flows:
    path: s3://data/flows/
    description: Licensed network flow data
    require_license: true

  # Specific license IDs only
  enterprise_flows:
    path: s3://data/enterprise/
    description: Enterprise flow data
    require_license: true
    allowed_license_ids:
      - "lic_abc123"
      - "lic_def456"

  # No validation (default)
  public_data:
    path: /data/public/
    description: Public datasets

Rockfish Probe embeds license metadata in Parquet files:

  • rockfish.license.id
  • rockfish.license.tier
  • rockfish.license.customer_email
  • rockfish.license.issued_at

Environment Variables

Variable | Description
ROCKFISH_CONFIG | Path to config.yaml
AWS_ACCESS_KEY_ID | AWS credentials
AWS_SECRET_ACCESS_KEY | AWS credentials
AWS_REGION | AWS region

Testing

# Start server
ROCKFISH_CONFIG=config.yaml rockfish_mcp

# Test with curl (HTTP mode)
curl -X POST https://localhost:8443/mcp \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'

Next Steps

Authentication

Rockfish MCP supports multiple authentication mechanisms.

Overview

Method | Transport | Description
API Key (JSON) | Raw TLS | JSON frame before MCP session
Bearer Token | HTTP/WS | Standard Authorization header
Mutual TLS (mTLS) | Any TLS | Client certificate verification

These can be combined for defense-in-depth.

Bearer Token Authentication (HTTP Mode)

Standard HTTP authentication using Authorization: Bearer <token> header.

Setup

  1. Generate API key and hash:

    API_KEY=$(openssl rand -base64 32)
    echo "API Key: $API_KEY"
    echo "Hash: $(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)"
    
  2. Configure:

    tls:
      enabled: true
      http_mode: true
      bind_address: "0.0.0.0:8443"
      cert_path: "./certs/cert.pem"
      key_path: "./certs/key.pem"
      auth:
        api_keys:
          - name: "production-client"
            key_hash: "a1b2c3d4e5f6..."
    

Client Examples

Python (websockets):

import asyncio
import websockets
import json

async def connect():
    uri = "wss://localhost:8443/mcp"
    headers = {"Authorization": "Bearer your-api-key"}

    async with websockets.connect(uri, extra_headers=headers) as ws:
        await ws.send(json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "initialize",
            "params": {
                "protocolVersion": "2024-11-05",
                "capabilities": {},
                "clientInfo": {"name": "python-client", "version": "1.0"}
            }
        }))
        print(await ws.recv())

asyncio.run(connect())

JavaScript/Node.js:

const WebSocket = require('ws');

const ws = new WebSocket('wss://localhost:8443/mcp', {
  headers: { 'Authorization': 'Bearer your-api-key' },
  rejectUnauthorized: true  // false for self-signed certs
});

ws.on('open', () => {
  ws.send(JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'initialize',
    params: {
      protocolVersion: '2024-11-05',
      capabilities: {},
      clientInfo: { name: 'nodejs-client', version: '1.0' }
    }
  }));
});

ws.on('message', data => console.log(data.toString()));

cURL:

curl -i -N \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Authorization: Bearer your-api-key" \
  -H "Sec-WebSocket-Key: x3JJHMbDL1EzLkh9GBhXDw==" \
  -H "Sec-WebSocket-Version: 13" \
  https://localhost:8443/mcp

API Key Authentication (TLS Mode)

JSON-based authentication for raw TLS connections.

Protocol

  1. Client connects via TLS
  2. Client sends: {"api_key": "your-secret-key"}\n
  3. Server responds: {"success": true/false, "message": "..."}\n
  4. MCP session proceeds if successful

Configuration

tls:
  enabled: true
  http_mode: false  # Raw TLS mode
  bind_address: "127.0.0.1:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    api_keys:
      - name: "production-client"
        key_hash: "sha256-hash-here"

Client Example

import socket
import ssl
import json

context = ssl.create_default_context()
sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")

# Authenticate
auth = {"api_key": "your-secret-key"}
tls_sock.sendall((json.dumps(auth) + "\n").encode())

response = json.loads(tls_sock.recv(4096).decode().strip())
if not response["success"]:
    raise Exception(f"Auth failed: {response['message']}")

# Proceed with MCP protocol...

Mutual TLS (mTLS)

Transport-level authentication using client certificates.

Create CA and Client Certificates

# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -new -x509 -key ca-key.pem -out ca-cert.pem -days 3650 \
  -subj "/CN=Rockfish MCP CA/O=Your Org"

# Generate client certificate
openssl genrsa -out client-key.pem 2048
openssl req -new -key client-key.pem -out client.csr \
  -subj "/CN=client1/O=Your Org"
openssl x509 -req -in client.csr -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out client-cert.pem -days 365

Configuration

tls:
  enabled: true
  bind_address: "0.0.0.0:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    require_client_cert: true
    client_ca_cert_path: "./certs/ca-cert.pem"

Client Example

import ssl
import socket

context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
context.load_cert_chain(
    certfile="client-cert.pem",
    keyfile="client-key.pem"
)
context.load_verify_locations(cafile="server-ca-cert.pem")

sock = socket.create_connection(("localhost", 8443))
tls_sock = context.wrap_socket(sock, server_hostname="localhost")
# Connection authenticated via mTLS

Combining Authentication Methods

For maximum security, use both mTLS and API keys:

tls:
  enabled: true
  bind_address: "0.0.0.0:8443"
  cert_path: "./certs/cert.pem"
  key_path: "./certs/key.pem"
  auth:
    require_client_cert: true
    client_ca_cert_path: "./certs/ca-cert.pem"
    api_keys:
      - name: "production-client"
        key_hash: "a1b2c3d4e5f6..."

Both must succeed for authorization.

Security Best Practices

API Keys

  • Generate with sufficient entropy: openssl rand -base64 32
  • One key per client for audit/revocation
  • Rotate regularly
  • Never store plain-text keys in config

mTLS

  • Protect CA private key: chmod 600 ca-key.pem
  • Use short certificate lifetimes (90 days)
  • Implement certificate revocation
  • Unique certificates per client

General

  • Use TLS in production
  • Implement rate limiting
  • Monitor authentication logs
  • Use network segmentation

Troubleshooting

Error | Solution
“Authentication failed” | Verify key matches hash
“Invalid auth request format” | Check JSON format, ensure \n at end
“Client certificate verification failed” | Check cert signed by configured CA
“require_client_cert without client_ca_cert_path” | Add CA path to config

Utility: Generate API Key

#!/bin/bash
API_KEY=$(openssl rand -base64 32)
KEY_HASH=$(echo -n "$API_KEY" | sha256sum | cut -d' ' -f1)

echo "API Key: $API_KEY"
echo "Hash: $KEY_HASH"
echo ""
echo "Config entry:"
echo "  - name: \"client-name\""
echo "    key_hash: \"$KEY_HASH\""

Tools & Queries

Rockfish MCP provides SQL-based tools for querying Parquet data.

Available Tools

Tool | Description
list_sources | List configured data sources
schema | Get column names and types
query | Query with filters and column selection
aggregate | Group and aggregate data
sample | Get random sample rows
count | Count rows with optional filter

list_sources

List all configured data sources.

list_sources: {}

Response:

{
  "sources": [
    {"name": "flow", "description": "Network flow data"},
    {"name": "ip_reputation", "description": "IP reputation scores"}
  ]
}

schema

Get column names and types for a data source.

schema:
  source: flow
  format: table

Parameters:

Name | Required | Description
source | Yes | Data source name
format | No | Output format (default: table)

query

Query with filtering, column selection, and custom SQL.

Basic Query

query:
  source: flow
  columns: [saddr, daddr, sbytes, dbytes]
  filter: "sbytes > 1000000"
  limit: 50
  format: json

Parameters:

Name | Required | Description
source | Yes | Data source name
columns | No | Columns to select (default: all)
filter | No | WHERE clause condition
order_by | No | ORDER BY clause
limit | No | Maximum rows
format | No | Output format

Custom SQL

Use {source} placeholder for the data source:

query:
  source: flow
  sql: |
    SELECT saddr, COUNT(*) as connection_count, SUM(sbytes) as total_bytes
    FROM {source}
    GROUP BY saddr
    ORDER BY total_bytes DESC
    LIMIT 10

Time-based Queries

query:
  source: flow
  filter: "stime >= '2025-01-01' AND stime < '2025-01-02'"
  columns: [stime, saddr, daddr, proto]

Protocol Filtering

query:
  source: flow
  filter: "proto = 'TCP' AND dport = 443"
  columns: [saddr, daddr, ndpi_appid]

aggregate

Group and aggregate data.

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: sum
      column: sbytes
      alias: total_bytes
    - function: count
      alias: connection_count
  filter: "proto = 'TCP'"
  order_by: "total_bytes DESC"
  limit: 20
  format: table

Parameters:

Name | Required | Description
source | Yes | Data source name
group_by | Yes | Columns to group by
aggregations | Yes | Aggregation functions
filter | No | WHERE clause
order_by | No | ORDER BY clause
limit | No | Maximum rows

Aggregation Functions

Function | Description
count | Count rows
sum | Sum values
avg | Average
min | Minimum
max | Maximum
count_distinct | Count unique values

Examples

Top destination ports by traffic:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: sum
      column: sbytes + dbytes
      alias: total_bytes
    - function: count
      alias: flows
  order_by: "total_bytes DESC"
  limit: 10

Flows by country (requires GeoIP):

aggregate:
  source: flow
  group_by: [scountry, dcountry]
  aggregations:
    - function: count
      alias: flow_count
  filter: "scountry IS NOT NULL"

sample

Get random sample rows.

sample:
  source: flow
  n: 10
  format: json

Parameters:

Name | Required | Description
source | Yes | Data source name
n | No | Number of rows (default: 10)
format | No | Output format

count

Count rows with optional filter.

count:
  source: flow
  filter: "ndpi_risk_score > 50"

Parameters:

Name | Required | Description
source | Yes | Data source name
filter | No | WHERE clause

Output Formats

Format | Description
json | Pretty-printed JSON array
jsonl / json_lines / ndjson | Newline-delimited JSON
csv | CSV with header
table / text | ASCII table

Common Query Patterns

Top Talkers

query:
  source: flow
  sql: |
    SELECT saddr,
           COUNT(*) as flows,
           SUM(sbytes) as sent,
           SUM(dbytes) as received
    FROM {source}
    GROUP BY saddr
    ORDER BY sent + received DESC
    LIMIT 20

DNS Traffic

query:
  source: flow
  filter: "dport = 53 OR sport = 53"
  columns: [stime, saddr, daddr, sbytes, dbytes]

High-Risk Flows

query:
  source: flow
  filter: "ndpi_risk_score > 100"
  columns: [stime, saddr, daddr, ndpi_appid, ndpi_risk_list]

Long-Duration Flows

query:
  source: flow
  filter: "dur > 3600000"  # > 1 hour in ms
  columns: [stime, etime, dur, saddr, daddr, sbytes, dbytes]
  order_by: "dur DESC"

External Traffic

query:
  source: flow
  filter: "NOT (saddr LIKE '10.%' OR saddr LIKE '192.168.%')"
  columns: [saddr, daddr, scountry, dcountry]

Application Distribution

aggregate:
  source: flow
  group_by: [ndpi_appid]
  aggregations:
    - function: count
      alias: flows
    - function: sum
      column: sbytes + dbytes
      alias: bytes
  filter: "ndpi_appid IS NOT NULL"
  order_by: "bytes DESC"
  limit: 20

S3 Configuration

Configure Rockfish MCP to query Parquet files from S3-compatible storage.

AWS S3

Default Credentials

If the s3 section is omitted, DuckDB uses AWS credentials from:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. ~/.aws/credentials
  3. IAM role (EC2/ECS)

sources:
  flow:
    path: s3://my-bucket/flows/
    description: Network flows

Explicit Credentials

s3:
  region: us-east-1
  access_key_id: AKIAIOSFODNN7EXAMPLE
  secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Security Note: Prefer environment variables or IAM roles over config file credentials.

MinIO

Self-hosted S3-compatible storage.

s3:
  endpoint: localhost:9000
  access_key_id: minioadmin
  secret_access_key: minioadmin
  use_ssl: false
  url_style: path  # Required for MinIO

sources:
  flow:
    path: s3://my-bucket/flows/

DigitalOcean Spaces

s3:
  endpoint: nyc3.digitaloceanspaces.com
  region: nyc3
  access_key_id: your-spaces-key
  secret_access_key: your-spaces-secret

sources:
  flow:
    path: s3://my-space/flows/

Cloudflare R2

s3:
  endpoint: <account-id>.r2.cloudflarestorage.com
  access_key_id: your-r2-key
  secret_access_key: your-r2-secret

sources:
  flow:
    path: s3://my-bucket/flows/

Configuration Options

Option | Type | Default | Description
region | string | - | AWS region (e.g., us-east-1)
access_key_id | string | - | Access key ID
secret_access_key | string | - | Secret access key
endpoint | string | - | Custom endpoint URL
use_ssl | bool | true | Use HTTPS
url_style | string | vhost | path or vhost

Querying S3 Data

Direct Path

sources:
  flow:
    path: s3://bucket/prefix/
    description: All flow data

Hive Partitioned Data

Rockfish Probe can organize uploads with Hive-style partitioning:

s3://bucket/flows/year=2025/month=01/day=28/*.parquet

Query specific partitions:

sources:
  flow:
    path: s3://bucket/flows/year=2025/month=01/
    description: January 2025 flows

Or use SQL with DuckDB’s Hive partitioning support:

query:
  source: flow
  sql: |
    SELECT * FROM read_parquet(
      's3://bucket/flows/year=2025/month=01/day=28/*.parquet',
      hive_partitioning=true
    )
    LIMIT 100

Performance Tips

Use Partition Pruning

Structure queries to match partitioning scheme:

# Efficient - matches Hive partitions
query:
  source: flow
  filter: "year = 2025 AND month = 1 AND day = 28"

Limit Column Selection

Only select needed columns:

query:
  source: flow
  columns: [saddr, daddr, sbytes]  # Much faster than SELECT *

Use Aggregation Server-Side

Push aggregation to DuckDB:

aggregate:
  source: flow
  group_by: [dport]
  aggregations:
    - function: count
      alias: flows

Troubleshooting

“Access Denied”

  • Verify credentials are correct
  • Check bucket policy allows s3:GetObject and s3:ListBucket
  • For cross-account access, verify IAM trust policies

“Bucket not found”

  • Check region matches bucket region
  • For custom endpoints, verify url_style setting

“Connection refused”

  • Verify endpoint URL is correct
  • Check use_ssl matches endpoint (http vs https)
  • For MinIO, ensure url_style: path

Slow Queries

  • Add partition filters to queries
  • Select only needed columns
  • Check network bandwidth to S3

Example: Multi-Source Configuration

s3:
  region: us-east-1

sources:
  # Production flows (licensed, validated)
  prod_flows:
    path: s3://prod-bucket/flows/
    description: Production network flows
    require_license: true

  # Development data (no validation)
  dev_flows:
    path: s3://dev-bucket/flows/
    description: Development test data

  # Threat intel from intel server
  threat_intel:
    path: s3://prod-bucket/intel/
    description: IP reputation data

output:
  default_format: json
  max_rows: 10000

Rockfish Detect Overview

Rockfish Detect is the ML training and anomaly detection service for the Rockfish platform. It provides a complete pipeline for building models from network flow data and scoring flows for anomalies.

Note: Rockfish Detect requires an Enterprise tier license.

Features

  • Data Sampling - Random sampling from S3-stored Parquet files
  • Feature Engineering - Build normalization tables for ML training
  • Feature Ranking - Identify most significant fields for detection
  • Model Training - Train anomaly detection models (HBOS, Hybrid)
  • Flow Scoring - Score flows using trained models
  • Device Fingerprinting - Passive OS/device detection via nDPI fingerprints
  • Automated Scheduling - Run as daemon with daily training cycles

Architecture

Network Traffic
    |
    v
Parquet Files in S3 (from rockfish_probe)
    |
    v
+------------------------------------------+
|   rockfish_detect                        |
+------------------------------------------+
| Sampler                                  |
|   - Queries S3 with DuckDB               |
|   - Random sampling                      |
|   - Output: sample/*.parquet             |
+------------------------------------------+
| Feature Engineer                         |
|   - Build normalization tables           |
|   - Histogram binning + frequency        |
|   - Output: extract/*.parquet            |
+------------------------------------------+
| Feature Ranker                           |
|   - Importance scoring                   |
|   - Output: rockfish_rank.parquet        |
+------------------------------------------+
| Model Trainer (HBOS/Hybrid)              |
|   - Train on sampled data                |
|   - Output: models/*.json                |
+------------------------------------------+
| Flow Scorer                              |
|   - Score flows using trained models     |
|   - Output: score/*.parquet              |
+------------------------------------------+
    |
    v
Anomaly Scores --> rockfish_mcp --> Alerts

Algorithms

Algorithm | Type | Description
HBOS | Unsupervised | Histogram-Based Outlier Score - fast, interpretable
Hybrid | Combined | HBOS + fingerprint correlation + threat intelligence
Random Forest | Supervised | Classification-based (framework)
Autoencoder | Neural Network | Reconstruction error-based (framework)
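
For intuition, HBOS builds a histogram per feature and scores each flow by how rare its bin is in every histogram; rare combinations accumulate a high score. A toy sketch of the idea (not the shipped implementation):

import numpy as np

def hbos_score(sample: np.ndarray, baseline: np.ndarray, num_bins: int = 10) -> np.ndarray:
    """Toy HBOS: higher score means more anomalous. Columns are features."""
    scores = np.zeros(len(sample))
    for col in range(baseline.shape[1]):
        hist, edges = np.histogram(baseline[:, col], bins=num_bins, density=True)
        hist = np.clip(hist, 1e-9, None)                       # avoid log(0)
        idx = np.clip(np.digitize(sample[:, col], edges) - 1, 0, num_bins - 1)
        scores += -np.log(hist[idx])                           # rare bins add more
    return scores

# Placeholder arrays standing in for features like dur, sbytes, dbytes
baseline = np.random.lognormal(size=(1000, 3))
today = np.vstack([baseline[:5], [[50.0, 1e6, 1e6]]])          # one injected outlier
print(hbos_score(today, baseline))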

Use Cases

  1. Unsupervised Anomaly Detection - HBOS identifies statistical outliers
  2. Behavioral Change Detection - Hybrid mode detects unusual fingerprint combinations
  3. Device Profiling - Fingerprinting detects lateral movement
  4. Threat Prioritization - Score-based reporting prioritizes investigations
  5. Network Baselining - Feature ranking identifies important characteristics

Quick Start

# Validate configuration
rockfish_detect -c config.yaml validate

# Run full pipeline for specific date
rockfish_detect -c config.yaml auto --date 2025-01-28

# Start as scheduler daemon
rockfish_detect -c config.yaml run

# Run immediately (don't wait for schedule)
rockfish_detect -c config.yaml run --run-now

Requirements

  • Enterprise tier license
  • S3-compatible storage with flow data from rockfish_probe
  • Multi-core system recommended (uses half available cores)

Next Steps

Configuration Reference

Rockfish Detect uses YAML configuration files.

rockfish_detect -c /path/to/config.yaml [command]

Configuration Sections


License

license:
  path: /etc/rockfish/license.json
  observation: flows

Option | Type | Required | Description
path | string | No | License file path (auto-searches if not set)
observation | string | Yes | S3 prefix / observation domain

S3

s3:
  bucket: my-flow-bucket
  region: us-east-1
  endpoint: https://s3.example.com
  hive_partitioning: true
  http_retries: 10
  http_retry_wait_ms: 2000
  http_retry_backoff: 2.0

Option | Type | Default | Description
bucket | string | (required) | S3 bucket name
region | string | (required) | AWS region
endpoint | string | - | Custom endpoint (MinIO, etc.)
hive_partitioning | bool | true | Match rockfish_probe structure
http_retries | int | 10 | Retry count for S3 operations
http_retry_wait_ms | int | 2000 | Base wait between retries
http_retry_backoff | float | 2.0 | Exponential backoff multiplier

S3 Data Structure

Expected path structure (from rockfish_probe):

s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet

Sampling

sampling:
  sample_percent: 10.0
  retention_days: 7
  sample_hour: 0
  sample_minute: 30
  output_prefix: flows/sample

Option | Type | Default | Description
sample_percent | float | 10.0 | Percentage of rows to sample (0-100)
retention_days | int | 7 | Rolling window retention
sample_hour | int | 0 | UTC hour for scheduled sampling
sample_minute | int | random | Minute for scheduled sampling
output_prefix | string | <obs>/sample/ | S3 output prefix
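
Conceptually, the daily sampling pass corresponds to a query like the following (a sketch, assuming the S3 layout shown above, sample_percent: 10.0, and credentials available in the environment; bucket and observation names are placeholders):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # enables s3:// reads

sample = con.execute("""
    SELECT *
    FROM read_parquet('s3://my-flow-bucket/flows/v2/year=2025/month=01/day=28/*.parquet',
                      hive_partitioning = true)
    USING SAMPLE 10 PERCENT (bernoulli)
""").fetchdf()
print(len(sample), "rows sampled")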

Features

Configure feature engineering (normalization tables).

features:
  num_bins: 10
  histogram_type: quantile
  ip_hash_modulus: 65536
  sample_days: 7
| Option | Type | Default | Description |
|---|---|---|---|
| num_bins | int | 10 | Histogram bins for numeric features |
| histogram_type | string | quantile | quantile or equal_width |
| ip_hash_modulus | int | 65536 | Dimensionality reduction for IPs |
| sample_days | int | 7 | Days of samples to process |

Histogram Types

| Type | Description | Best For |
|---|---|---|
| quantile | Equal sample count per bin | Skewed distributions |
| equal_width | Equal value range per bin | Uniform distributions |

Training

training:
  enabled: true
  train_hour: 1
  train_minute: 0
  algorithm: hbos
  model_output_dir: /var/lib/rockfish/models
  min_importance_score: 0.7

  hbos:
    num_bins: 10
    fields:
      - dur
      - rtt
      - pcr
      - spkts
      - dpkts
      - sbytes
      - dbytes
      - sentropy
      - dentropy

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2
    hbos_filter_percentile: 90.0
    min_observations: 3
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable training |
| train_hour | int | 1 | UTC hour for scheduled training |
| train_minute | int | random | Minute for scheduled training |
| algorithm | string | hbos | hbos, hybrid, random_forest, autoencoder |
| model_output_dir | string | - | Directory for trained models |
| min_importance_score | float | 0.7 | Threshold for ranked features |

HBOS Options

| Option | Type | Default | Description |
|---|---|---|---|
| num_bins | int | 10 | Histogram bins |
| fields | list | - | Fields to include in model |

Hybrid Options

| Option | Type | Default | Description |
|---|---|---|---|
| hbos_weight | float | 0.5 | Weight for HBOS score |
| correlation_weight | float | 0.3 | Weight for fingerprint correlation |
| threat_intel_weight | float | 0.2 | Weight for threat intel score |
| hbos_filter_percentile | float | 90.0 | Pre-filter percentile |
| min_observations | int | 3 | Min observations for correlation |

Fingerprint

Device/OS fingerprinting via nDPI signatures.

fingerprint:
  enabled: false
  history_days: 7
  client_field: ndpi_ja4
  server_field: ndpi_ja3s
  min_observations: 10
  anomaly_threshold: 0.7
  max_fingerprints_per_host: 5
  detect_suspicious: true
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable fingerprinting |
| history_days | int | 7 | Days of history to analyze |
| client_field | string | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
| server_field | string | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
| min_observations | int | 10 | Minimum observations for baseline |
| anomaly_threshold | float | 0.7 | Threshold for anomaly detection |
| max_fingerprints_per_host | int | 5 | Max expected fingerprints |
| detect_suspicious | bool | true | Detect fingerprint changes |

Note: Requires nDPI fingerprint fields in flow data (Professional+ license for probe).


Logging

logging:
  level: info
  file: /var/log/rockfish/detect.log
| Option | Type | Default | Description |
|---|---|---|---|
| level | string | info | Log level: error, warn, info, debug, trace |
| file | string | - | Log file path (optional) |

Other Options

parallel_protocols: true
protocols:
  - tcp
  - udp
  - icmp

duckdb:
  autoload_extensions: false
| Option | Type | Default | Description |
|---|---|---|---|
| parallel_protocols | bool | true | Process protocols in parallel |
| protocols | list | tcp, udp, icmp | Protocols to process |
| duckdb.autoload_extensions | bool | false | DuckDB extension autoload |

Complete Example

license:
  path: /opt/rockfish/etc/license.json
  observation: sensor-01

s3:
  bucket: flow-data
  region: us-east-1
  hive_partitioning: true

sampling:
  sample_percent: 10.0
  retention_days: 7
  sample_hour: 0

features:
  num_bins: 10
  histogram_type: quantile
  sample_days: 7

training:
  enabled: true
  train_hour: 1
  algorithm: hybrid
  model_output_dir: /var/lib/rockfish/models

  hbos:
    num_bins: 10
    fields:
      - dur
      - rtt
      - pcr
      - spkts
      - dpkts
      - sbytes
      - dbytes

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2

fingerprint:
  enabled: true
  history_days: 7
  min_observations: 10

logging:
  level: info
  file: /var/log/rockfish/detect.log

Data Pipeline

Rockfish Detect processes data through a series of stages, each producing artifacts used by subsequent stages.

Pipeline Stages

sample --> extract --> rank --> train --> score
| Stage | Command | Input | Output |
|---|---|---|---|
| Sample | sample | Raw flow Parquet | Sampled Parquet |
| Extract | extract | Sampled Parquet | Normalization tables |
| Rank | rank | Normalization tables | Feature rankings |
| Train | train | Sampled + Normalization | Model files |
| Score | score | Raw flows + Model | Anomaly scores |

1. Sampling

Randomly samples flow data to reduce volume while maintaining statistical properties.

# Sample specific date
rockfish_detect -c config.yaml sample --date 2025-01-28

# Sample last N days
rockfish_detect -c config.yaml sample --days 7

# Clear state and resample all
rockfish_detect -c config.yaml sample --clear

Input Path

s3://<bucket>/<observation>/v2/year=YYYY/month=MM/day=DD/*.parquet

Output Path

s3://<bucket>/<observation>/sample/sample-YYYY-MM-DD.parquet

Configuration

sampling:
  sample_percent: 10.0    # 10% of rows
  retention_days: 7       # Keep 7 days of samples

State Tracking

Sampling maintains state to avoid reprocessing:

  • Tracks which dates have been sampled
  • Skips dates already in state file
  • Use --clear to reset state

2. Feature Extraction

Builds normalization lookup tables for ML training.

# Extract features for all protocols
rockfish_detect -c config.yaml extract

# Specific protocol
rockfish_detect -c config.yaml extract -p tcp

# Sequential (not parallel)
rockfish_detect -c config.yaml extract --sequential

Processing

For each configured field, the extract stage creates a normalization table:

Numeric fields (dur, rtt, bytes, etc.):

  • Histogram binning (quantile or equal-width)
  • Maps raw values to bin indices
  • Normalizes to [0, 1] range

Categorical fields (proto, ports, IPs):

  • Frequency counting
  • Maps values to frequency scores
  • Special handling for IPs (/24 truncation)
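
As a rough illustration of these two normalization paths, here is a minimal sketch in Python (pandas and numpy assumed available; this is illustrative, not the shipped implementation):

import numpy as np
import pandas as pd

def normalize_numeric(values: pd.Series, num_bins: int = 10) -> pd.DataFrame:
    # Quantile binning: bin edges at equal sample-count quantiles,
    # then map raw values to bin indices normalized to [0, 1].
    edges = np.quantile(values, np.linspace(0, 1, num_bins + 1))
    bins = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, num_bins - 1)
    return pd.DataFrame({"value": values, "bin": bins, "norm": bins / (num_bins - 1)})

def normalize_categorical(values: pd.Series) -> pd.DataFrame:
    # Frequency counting: map each value to its relative frequency.
    freq = values.value_counts(normalize=True)
    return pd.DataFrame({"value": freq.index, "frequency": freq.values})

def truncate_ip(ip: str) -> str:
    # /24 truncation before frequency counting, e.g. 192.168.1.100 -> 192.168.1.0/24
    return ".".join(ip.split(".")[:3]) + ".0/24"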

Output Path

s3://<bucket>/<observation>/extract/<protocol>/<field>.parquet

Configuration

features:
  num_bins: 10              # Histogram resolution
  histogram_type: quantile  # Better for skewed data
  ip_hash_modulus: 65536    # IP dimensionality reduction

3. Feature Ranking

Ranks features by importance for model training.

# Rank using reconstruction error
rockfish_detect -c config.yaml rank

# Rank using SVD
rockfish_detect -c config.yaml rank -a svd

# Specific protocol
rockfish_detect -c config.yaml rank -p tcp

Algorithms

| Algorithm | Description |
|---|---|
| reconstruction | Autoencoder reconstruction error (default) |
| svd | Singular Value Decomposition importance |

Output

s3://<bucket>/<observation>/extract/<protocol>/rockfish_rank.parquet

Contains importance scores (0-1) for each field.

Using Rankings

training:
  min_importance_score: 0.7   # Only use features above this
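
For instance, a downstream step could load the ranking file and keep only fields above the threshold. This is a hedged sketch: the column names field and importance are assumptions, not a documented schema.

import pandas as pd

ranks = pd.read_parquet("rockfish_rank.parquet")   # column names are assumed
selected = ranks.loc[ranks["importance"] >= 0.7, "field"].tolist()
print(f"training on {len(selected)} ranked features: {selected}")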

4. Model Training

Trains anomaly detection models on sampled data.

# Train HBOS model
rockfish_detect -c config.yaml train -a hbos

# Train hybrid model
rockfish_detect -c config.yaml train -a hybrid

# Train with ranked features only
rockfish_detect -c config.yaml train-ranked -n 10

# Specific protocol
rockfish_detect -c config.yaml train -p tcp

Algorithms

HBOS (Histogram-Based Outlier Score):

  • Fast, interpretable
  • Inverse density scoring
  • Good baseline algorithm

Hybrid:

  • Combines HBOS + correlation + threat intel
  • Weighted scoring model
  • Better for complex environments

Output

Models saved to configured directory:

<model_output_dir>/<protocol>_model.json

Configuration

training:
  algorithm: hbos
  model_output_dir: /var/lib/rockfish/models

  hbos:
    num_bins: 10
    fields: [dur, rtt, pcr, spkts, dpkts, sbytes, dbytes]

5. Flow Scoring

Scores flows using trained models.

# Score specific date
rockfish_detect -c config.yaml score -d 2025-01-28

# Score since timestamp
rockfish_detect -c config.yaml score --since 2025-01-28T00:00:00Z

# With severity threshold
rockfish_detect -c config.yaml score -t 0.8

# Limit results
rockfish_detect -c config.yaml score -n 1000

# Output to file
rockfish_detect -c config.yaml score -o anomalies.parquet

Options

| Option | Description |
|---|---|
| -d, --date | Score specific date |
| --since | Score since timestamp |
| -p | Specific protocol |
| -t, --threshold | Minimum score threshold |
| -n, --limit | Maximum results |
| -o, --output | Output file path |

Severity Classification

# Percentile-based (default)
severity_mode: percentile

# Fixed thresholds
severity_mode: fixed
severity_thresholds:
  low: 0.5
  medium: 0.7
  high: 0.85
  critical: 0.95

Output

s3://<bucket>/<observation>/score/score-YYYY-MM-DD.parquet

Includes:

  • Original flow fields
  • anomaly_score (0-1)
  • severity (LOW, MEDIUM, HIGH, CRITICAL)

Automated Pipeline

Run the complete pipeline with a single command:

# Full pipeline for today
rockfish_detect -c config.yaml auto

# Specific date
rockfish_detect -c config.yaml auto --date 2025-01-28

# Last 7 days
rockfish_detect -c config.yaml auto --days 7

# Stop on first error
rockfish_detect -c config.yaml auto --fail-fast

Pipeline Order

  1. Sample data
  2. Extract features
  3. Rank features
  4. Train model
  5. Score flows

Reporting

Generate reports from scored data:

# Text report
rockfish_detect -c config.yaml report --date 2025-01-28

# JSON output
rockfish_detect -c config.yaml report -f json

# Filter by severity
rockfish_detect -c config.yaml report --min-severity HIGH

# Top N anomalies
rockfish_detect -c config.yaml report -n 50

Output Formats

| Format | Description |
|---|---|
| text | Human-readable (default) |
| json | Machine-readable JSON |
| csv | CSV export |

Anomaly Detection

Rockfish Detect supports multiple anomaly detection algorithms for identifying unusual network flows.

Algorithms

| Algorithm | Type | Speed | Interpretability | Use Case |
|---|---|---|---|---|
| HBOS | Unsupervised | Fast | High | General anomaly detection |
| Hybrid | Combined | Medium | Medium | Complex environments |
| Random Forest | Supervised | Medium | Medium | Known threat patterns |
| Autoencoder | Neural Network | Slow | Low | Complex patterns |

HBOS (Histogram-Based Outlier Score)

HBOS is the default algorithm - fast, interpretable, and effective for network anomaly detection.

How It Works

  1. Build histograms for each feature from training data
  2. Calculate density for each bin
  3. Score new flows based on inverse density
  4. Combine scores across features

Flows falling in low-density bins receive high anomaly scores.
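
A minimal sketch of that scoring idea (illustrative only, not the shipped implementation; numpy assumed):

import numpy as np

def hbos_train(X: np.ndarray, num_bins: int = 10):
    # X: rows = flows, columns = features. Build a histogram per feature.
    models = []
    for col in X.T:
        counts, edges = np.histogram(col, bins=num_bins)
        models.append((edges, counts / counts.sum()))   # (bin edges, per-bin density)
    return models

def hbos_score(x: np.ndarray, models, eps: float = 1e-6) -> float:
    # Sum of log inverse densities: values in low-density bins raise the score.
    score = 0.0
    for value, (edges, density) in zip(x, models):
        idx = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, len(density) - 1)
        score += np.log(1.0 / (density[idx] + eps))
    return score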

Configuration

training:
  algorithm: hbos

  hbos:
    num_bins: 10
    fields:
      - dur          # Flow duration
      - rtt          # Round-trip time
      - pcr          # Producer-consumer ratio
      - spkts        # Source packets
      - dpkts        # Destination packets
      - sbytes       # Source bytes
      - dbytes       # Destination bytes
      - sentropy     # Source entropy
      - dentropy     # Destination entropy
      - ssmallpktcnt # Small packet count
      - slargepktcnt # Large packet count

Feature Selection

Choose fields that characterize normal behavior:

| Category | Fields | Detects |
|---|---|---|
| Volume | sbytes, dbytes, spkts, dpkts | Data exfiltration, DDoS |
| Timing | dur, rtt | Tunneling, beaconing |
| Behavior | pcr, entropy | C2, encrypted channels |
| Packets | smallpktcnt, largepktcnt | Protocol anomalies |

Example Output

Flow: 192.168.1.100:52341 -> 45.33.32.156:443
Score: 0.92 (CRITICAL)
Contributing factors:
  - dbytes: 47MB (unusual outbound volume)
  - dur: 28800s (8-hour connection)
  - pcr: -0.98 (highly asymmetric)

Hybrid Algorithm

Combines multiple detection methods for improved accuracy.

Components

Final Score = (HBOS * W1) + (Correlation * W2) + (Threat Intel * W3)
| Component | Default Weight | Description |
|---|---|---|
| HBOS | 0.5 | Statistical outlier score |
| Correlation | 0.3 | Fingerprint pair frequency |
| Threat Intel | 0.2 | nDPI risk + IP reputation |
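
In code, the combination is just a weighted sum (a sketch; component scores are assumed to already be normalized to the 0-1 range):

def hybrid_score(hbos: float, correlation: float, threat_intel: float,
                 w1: float = 0.5, w2: float = 0.3, w3: float = 0.2) -> float:
    # Final Score = (HBOS * W1) + (Correlation * W2) + (Threat Intel * W3)
    return hbos * w1 + correlation * w2 + threat_intel * w3

# Strong statistical outlier, rare fingerprint pair, no threat intel hit:
print(hybrid_score(hbos=0.9, correlation=0.8, threat_intel=0.0))   # 0.69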

Configuration

training:
  algorithm: hybrid

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3
    threat_intel_weight: 0.2
    hbos_filter_percentile: 90.0
    min_observations: 3

Correlation Score

Based on nDPI fingerprint pair frequency (ndpi_ja4/ndpi_ja3s):

  1. Build database of (client_fingerprint, server_fingerprint) pairs
  2. Track frequency of each pair
  3. Score rare or never-seen combinations higher

Detects:

  • New client/server combinations
  • Unusual application behaviors
  • Potential lateral movement
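
One way to turn pair frequencies into a score, sketched under assumed data shapes (illustrative, not the product's exact formula):

from collections import Counter

def correlation_scores(pairs):
    # pairs: iterable of (client_fingerprint, server_fingerprint) tuples
    counts = Counter(pairs)
    total = sum(counts.values())
    # Rare or never-seen pairs score near 1.0; common pairs score near 0.0
    return {pair: 1.0 - (count / total) for pair, count in counts.items()}

baseline = [("ja4_a", "ja3s_x")] * 1000 + [("ja4_b", "ja3s_y")]
scores = correlation_scores(baseline)
print(scores[("ja4_b", "ja3s_y")])   # ~0.999 -> rare pair, high score
print(scores[("ja4_a", "ja3s_x")])   # ~0.001 -> common pair, low score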

Threat Intel Score

Incorporates external intelligence:

  • nDPI risk scores: Protocol-level risks
  • IP reputation: AbuseIPDB confidence scores
  • Known bad indicators: Blacklisted IPs/domains

Tuning Weights

| Environment | HBOS | Correlation | Threat Intel |
|---|---|---|---|
| General | 0.5 | 0.3 | 0.2 |
| High threat | 0.3 | 0.3 | 0.4 |
| Internal only | 0.6 | 0.4 | 0.0 |

Severity Classification

Anomaly scores are classified into severity levels.

Percentile-Based (Default)

Dynamic thresholds based on score distribution:

| Severity | Percentile |
|---|---|
| LOW | 50-75th |
| MEDIUM | 75-90th |
| HIGH | 90-95th |
| CRITICAL | >95th |

Adapts to your environment’s baseline.
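
A hedged sketch of percentile-based classification using the bands above (numpy assumed; the label for scores below the LOW band is an assumption):

import numpy as np

def classify(scores: np.ndarray) -> list:
    # Thresholds come from the score distribution itself
    p50, p75, p90, p95 = np.percentile(scores, [50, 75, 90, 95])
    labels = []
    for s in scores:
        if s > p95:
            labels.append("CRITICAL")
        elif s > p90:
            labels.append("HIGH")
        elif s > p75:
            labels.append("MEDIUM")
        elif s > p50:
            labels.append("LOW")
        else:
            labels.append("NONE")   # assumed label for scores below the LOW band
    return labels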

Fixed Thresholds

Static thresholds for consistent alerting:

severity_mode: fixed
severity_thresholds:
  low: 0.5
  medium: 0.7
  high: 0.85
  critical: 0.95

Protocol-Specific Models

Rockfish Detect trains separate models per protocol:

# Train TCP model only
rockfish_detect -c config.yaml train -p tcp

# Score UDP traffic
rockfish_detect -c config.yaml score -p udp

Why Separate Models?

  • TCP, UDP, and ICMP have different characteristics
  • Prevents cross-protocol noise
  • Better detection accuracy per protocol

Configuration

protocols:
  - tcp
  - udp
  - icmp

parallel_protocols: true   # Process in parallel

Feature Ranking

Use feature importance to select the most relevant fields.

# Rank features
rockfish_detect -c config.yaml rank

# Train with top 10 ranked features
rockfish_detect -c config.yaml train-ranked -n 10

Benefits

  • Reduces model complexity
  • Improves training speed
  • May improve detection accuracy

Configuration

training:
  min_importance_score: 0.7   # Include features above this threshold

Best Practices

1. Start with HBOS

  • Fast iteration
  • Easy to interpret
  • Good baseline performance

2. Use Adequate Training Data

  • Minimum 7 days of samples
  • Include normal business hours and off-hours
  • Ensure representative traffic mix

3. Tune for Your Environment

  • Adjust severity thresholds based on alert volume
  • Weight algorithms based on threat model
  • Include relevant fields for your use case

4. Regular Retraining

  • Retrain weekly or monthly
  • Network behavior changes over time
  • New applications may appear as anomalies initially

5. Validate Results

  • Review high-severity alerts
  • Adjust thresholds to reduce false positives
  • Document known-good anomalies

Troubleshooting

High False Positive Rate

  • Increase severity thresholds
  • Add more training data
  • Exclude noisy fields from model

Missing True Positives

  • Lower severity thresholds
  • Include more fields in model
  • Check training data for bias

Slow Scoring

  • Use ranked features (fewer fields)
  • Process protocols in parallel
  • Increase hardware resources

Device Fingerprinting

Rockfish Detect includes ML-based passive device fingerprinting using network signals.

Note: Requires nDPI fingerprints in flow data (Professional+ license for rockfish_probe).

Overview

Device fingerprinting identifies devices and operating systems based on their network behavior, without requiring agents or active scanning.

Signals Used

| Priority | Signal | Field | Description |
|---|---|---|---|
| Primary | TLS client | ndpi_ja4 | JA4 TLS client fingerprint |
| Primary | TLS server | ndpi_ja3s | JA3 TLS server fingerprint |
| Secondary | TCP stack | ndpi_tcp_fp | TCP fingerprint with OS hint (TTL, window size, options) |
| Secondary | Composite | ndpi_fp | nDPI combined fingerprint for device correlation |
| Tertiary | Application | - | HTTP headers, DNS patterns |

Use Cases

  • Asset Inventory - Discover devices on your network
  • Baseline Monitoring - Track device behavior over time
  • Lateral Movement Detection - Detect hosts changing fingerprints
  • Unauthorized Devices - Identify unexpected device types

Commands

Build Fingerprint Database

Build baseline from historical data:

# Build from last 7 days
rockfish_detect -c config.yaml fingerprint build --days 7

# Build from specific date range
rockfish_detect -c config.yaml fingerprint build --start 2025-01-01 --end 2025-01-28

Detect Anomalies

Find hosts with unusual fingerprint changes:

# Detect for today
rockfish_detect -c config.yaml fingerprint detect

# Detect for specific date
rockfish_detect -c config.yaml fingerprint detect --date 2025-01-28

Profile Specific Host

Get fingerprint profile for an IP:

# Profile specific IP
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100

# With history
rockfish_detect -c config.yaml fingerprint profile --ip 192.168.1.100 --days 30

Configuration

fingerprint:
  enabled: true
  history_days: 7
  client_field: ndpi_ja4
  server_field: ndpi_ja3s
  min_observations: 10
  anomaly_threshold: 0.7
  max_fingerprints_per_host: 5
  detect_suspicious: true
| Option | Default | Description |
|---|---|---|
| enabled | false | Enable fingerprinting |
| history_days | 7 | Days of history to analyze |
| client_field | ndpi_ja4 | Field for client fingerprint (JA4 via nDPI) |
| server_field | ndpi_ja3s | Field for server fingerprint (JA3 via nDPI) |
| min_observations | 10 | Minimum flows to establish baseline |
| anomaly_threshold | 0.7 | Score threshold for anomalies |
| max_fingerprints_per_host | 5 | Expected max fingerprints per device |
| detect_suspicious | true | Flag suspicious changes |

How It Works

1. Baseline Building

For each IP address, collect:

  • Set of observed ndpi_ja4 fingerprints (client connections)
  • Set of observed ndpi_ja3s fingerprints (server connections)
  • Frequency of each fingerprint
  • First and last seen timestamps
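
A minimal way to build such a baseline table with pandas (field names come from the flow schema; the sample file name and output layout are assumptions):

import pandas as pd

flows = pd.read_parquet("sample-2025-01-28.parquet",
                        columns=["saddr", "ndpi_ja4", "stime"])

baseline = (flows.dropna(subset=["ndpi_ja4"])
            .groupby(["saddr", "ndpi_ja4"])
            .agg(count=("ndpi_ja4", "size"),
                 first_seen=("stime", "min"),
                 last_seen=("stime", "max"))
            .reset_index())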

2. Anomaly Detection

Flag hosts that:

  • Present a new, never-seen fingerprint
  • Exceed max_fingerprints_per_host
  • Show sudden fingerprint changes
  • Have rare fingerprint combinations

3. Correlation Scoring

Score fingerprint pairs by frequency:

Rare pair (first time seen) -> High anomaly score
Common pair (seen 1000+ times) -> Low anomaly score

Detection Scenarios

New Device on Network

Alert: New fingerprint detected
Host: 192.168.1.150
Fingerprint: t13d1516h2_8daaf6152771_b0da82dd1658
First seen: 2025-01-28T14:32:00Z
Action: Verify device is authorized

Host Fingerprint Change

Alert: Fingerprint change detected
Host: 192.168.1.100
Previous: t13d1516h2_8daaf6152771_b0da82dd1658 (Windows 11)
Current: t13d1517h2_5b57614c22b0_06cda9e17597 (Linux)
Risk: Possible lateral movement or VM switch

Unusual Client/Server Pair

Alert: Rare fingerprint combination
Client: 192.168.1.100 (ndpi_ja4: t13d1516h2_...)
Server: 45.33.32.156 (ndpi_ja3s: t120200_...)
Observations: 1 (first time)
Typical for this client: 847 connections to known servers
Risk: New external communication

Integration with Hybrid Scoring

Fingerprint correlation is a component of the hybrid algorithm:

training:
  algorithm: hybrid

  hybrid:
    hbos_weight: 0.5
    correlation_weight: 0.3      # Fingerprint correlation
    threat_intel_weight: 0.2

Flows with rare fingerprint combinations receive higher anomaly scores.

Output Schema

Fingerprint analysis adds these fields to scored flows:

| Field | Type | Description |
|---|---|---|
| fp_client | String | Client fingerprint (ndpi_ja4) |
| fp_server | String | Server fingerprint (ndpi_ja3s) |
| fp_pair_count | Int | Times this pair has been seen |
| fp_client_count | Int | Times client has been seen |
| fp_is_new | Bool | First observation of this pair |
| fp_anomaly_score | Float | Fingerprint-specific anomaly score |
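
For example, new fingerprint pairs above the anomaly threshold can be pulled out of a scored file like this (a sketch; the file name is illustrative):

import pandas as pd

scored = pd.read_parquet("score-2025-01-28.parquet")
new_pairs = scored[scored["fp_is_new"] & (scored["fp_anomaly_score"] >= 0.7)]
print(new_pairs[["saddr", "daddr", "fp_client", "fp_server", "fp_anomaly_score"]])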

Best Practices

1. Build Sufficient Baseline

  • Use at least 7 days of data
  • Include weekdays and weekends
  • Ensure coverage of all network segments

2. Tune Thresholds

  • Start with defaults
  • Adjust max_fingerprints_per_host for your environment
  • Some hosts (proxies, VMs) legitimately have many fingerprints

3. Handle Known Exceptions

  • Exclude known multi-fingerprint hosts
  • Document expected fingerprint changes (updates, migrations)

4. Combine with Other Signals

  • Use hybrid algorithm for combined scoring
  • Correlate with threat intelligence
  • Consider flow volume and timing

Limitations

  • Requires nDPI fingerprint fields (ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp) in flow data
  • TLS fingerprints only available for TLS connections
  • VPN/proxy traffic may obscure true fingerprints
  • Fingerprints can change with software updates

Scheduler

Rockfish Detect can run as a daemon with automated scheduling for continuous anomaly detection.

Running as Daemon

# Start scheduler
rockfish_detect -c config.yaml run

# Run immediately without waiting
rockfish_detect -c config.yaml run --run-now

The scheduler runs two daily jobs:

  1. Sample job - Sample new flow data
  2. Train job - Retrain models with new samples

Schedule Configuration

sampling:
  sample_hour: 0          # UTC hour (0-23)
  sample_minute: 30       # Optional; random if not set

training:
  train_hour: 1           # UTC hour (0-23)
  train_minute: 0         # Optional; random if not set

Random Minutes

If sample_minute or train_minute is not set, a random minute (0-59) is selected at startup. This staggers job start times so multiple instances are unlikely to run their jobs at the same moment.

Example Schedule

# Sample at 00:30 UTC, train at 01:00 UTC
sampling:
  sample_hour: 0
  sample_minute: 30

training:
  train_hour: 1
  train_minute: 0

Timeline:

00:30 UTC - Sample yesterday's flow data
01:00 UTC - Retrain models with updated samples

Systemd Service

Create /etc/systemd/system/rockfish-detect.service:

[Unit]
Description=Rockfish Detect ML Service
After=network.target

[Service]
Type=simple
User=rockfish
ExecStart=/usr/local/bin/rockfish_detect -c /etc/rockfish/detect.yaml run
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable rockfish-detect
sudo systemctl start rockfish-detect

# Check status
sudo systemctl status rockfish-detect

# View logs
sudo journalctl -u rockfish-detect -f

Docker Deployment

# Pull the image
docker pull rockfishnetworks/toolkit:latest

# Run the scheduler
docker run -d \
  --name rockfish-detect \
  -v /path/to/config.yaml:/etc/rockfish/config.yaml \
  -v /path/to/license.json:/etc/rockfish/license.json \
  -e AWS_ACCESS_KEY_ID=xxx \
  -e AWS_SECRET_ACCESS_KEY=xxx \
  rockfishnetworks/toolkit:latest \
  rockfish_detect -c /etc/rockfish/config.yaml run

Graceful Shutdown

The scheduler handles SIGTERM/SIGINT for graceful shutdown:

  1. Stops accepting new jobs
  2. Waits for running jobs to complete
  3. Saves state
  4. Exits cleanly
# Graceful stop
sudo systemctl stop rockfish-detect

# Or with kill
kill -TERM $(pgrep rockfish_detect)

State Management

The scheduler maintains state to avoid redundant work:

Sample State

Tracks which dates have been sampled:

s3://<bucket>/<observation>/sample/.state.json

Skip already-sampled dates on restart.

Score State

Tracks last scored timestamp:

s3://<bucket>/<observation>/score/.state.json

Resume scoring from last checkpoint.

Reset State

# Clear sample state
rockfish_detect -c config.yaml sample --clear

# Force rescore
rockfish_detect -c config.yaml score --since 2025-01-01T00:00:00Z

Monitoring

Log Output

logging:
  level: info
  file: /var/log/rockfish/detect.log

Log levels:

  • error - Errors only
  • warn - Warnings and errors
  • info - Normal operation (default)
  • debug - Detailed operation
  • trace - Very verbose

Health Check

# Validate configuration
rockfish_detect -c config.yaml validate

# Test S3 connectivity
rockfish_detect -c config.yaml test-s3

# Check license
rockfish_detect -c config.yaml license

Metrics to Monitor

| Metric | Description |
|---|---|
| Sample job duration | Time to complete sampling |
| Train job duration | Time to complete training |
| Flows sampled | Number of flows per sample run |
| Anomalies detected | High-severity anomalies per day |
| S3 errors | Failed S3 operations |

Multi-Instance Deployment

For high availability or distributed processing:

Separate Responsibilities

# Instance 1: Sampling and training
rockfish_detect -c config-train.yaml run

# Instance 2: Scoring only
rockfish_detect -c config-score.yaml score --continuous

Shared State

All instances read/write to the same S3 bucket. State files prevent duplicate work.

Protocol Distribution

# Instance 1: TCP
rockfish_detect -c config.yaml run -p tcp

# Instance 2: UDP
rockfish_detect -c config.yaml run -p udp

Troubleshooting

Job Not Running

  1. Check system time (UTC)
  2. Verify schedule configuration
  3. Check logs for errors

Job Failing

# Run manually with verbose output
rockfish_detect -c config.yaml -vv auto

High Memory Usage

  • Reduce sample_percent
  • Process protocols sequentially
  • Limit sample_days

Slow Jobs

  • Enable parallel_protocols: true
  • Use faster S3 storage
  • Increase hardware resources

Parquet Schema

Rockfish exports flow data in Apache Parquet format with IPFIX-compliant field naming. The schema varies by license tier.

Schema by Tier

| Tier | Schema Version | Fields | Key Features |
|---|---|---|---|
| Community | v1 | 44 | Basic flow fields |
| Basic | v1 | 54 | + nDPI detection, GeoIP (country, city, ASN) |
| Professional | v2 | 60 | + GeoIP AS org, nDPI fingerprints |
| Enterprise | v2 | 63+ | + Anomaly scores, ML predictions |

Community Schema (44 Fields)

Basic flow capture with core network fields.

| # | Field | Type | Description |
|---|---|---|---|
| 1 | version | UInt16 | Schema version (1) |
| 2 | flowid | String | Unique flow UUID |
| 3 | obname | String | Observation domain name |
| 4 | stime | Timestamp | Flow start time (UTC) |
| 5 | etime | Timestamp | Flow end time (UTC) |
| 6 | dur | UInt32 | Duration (milliseconds) |
| 7 | rtt | UInt32 | Round-trip time (microseconds) |
| 8 | pcr | Int32 | Producer-consumer ratio |
| 9 | proto | String | Protocol (TCP, UDP, ICMP) |
| 10 | saddr | String | Source IP address |
| 11 | daddr | String | Destination IP address |
| 12 | sport | UInt16 | Source port |
| 13 | dport | UInt16 | Destination port |
| 14 | iflags | String | Initial TCP flags |
| 15 | uflags | String | Union of all TCP flags |
| 16 | stcpseq | UInt32 | Source initial TCP sequence |
| 17 | dtcpseq | UInt32 | Dest initial TCP sequence |
| 18 | svlan | UInt16 | Source VLAN ID |
| 19 | dvlan | UInt16 | Destination VLAN ID |
| 20 | spkts | UInt64 | Source packet count |
| 21 | dpkts | UInt64 | Destination packet count |
| 22 | sbytes | UInt64 | Source byte count |
| 23 | dbytes | UInt64 | Destination byte count |
| 24 | sentropy | UInt8 | Source payload entropy (0-255) |
| 25 | dentropy | UInt8 | Destination payload entropy |
| 26 | ssmallpktcnt | UInt32 | Source small packets (<60 bytes) |
| 27 | dsmallpktcnt | UInt32 | Dest small packets |
| 28 | slargepktcnt | UInt32 | Source large packets (>225 bytes) |
| 29 | dlargepktcnt | UInt32 | Dest large packets |
| 30 | snonemptypktcnt | UInt32 | Source non-empty packets |
| 31 | dnonemptypktcnt | UInt32 | Dest non-empty packets |
| 32 | sfirstnonemptycnt | UInt16 | Source first N non-empty sizes |
| 33 | dfirstnonemptycnt | UInt16 | Dest first N non-empty sizes |
| 34 | smaxpktsize | UInt16 | Source max packet size |
| 35 | dmaxpktsize | UInt16 | Dest max packet size |
| 36 | savgpayload | UInt16 | Source avg payload size |
| 37 | davgpayload | UInt16 | Dest avg payload size |
| 38 | sstdevpayload | UInt16 | Source payload std deviation |
| 39 | dstdevpayload | UInt16 | Dest payload std deviation |
| 40 | spd | String | Small packet direction flags |
| 41 | spdt | String | Small packet direction timing |
| 42 | reason | String | Flow termination reason |
| 43 | smac | String | Source MAC address |
| 44 | dmac | String | Destination MAC address |

Basic Schema (54 Fields)

Community schema + nDPI application detection + GeoIP (country, city, ASN).

GeoIP fields:

| # | Field | Type | Description |
|---|---|---|---|
| 45 | scountry | String | Source country (ISO 3166-1 alpha-2) |
| 46 | dcountry | String | Destination country |
| 47 | scity | String | Source city |
| 48 | dcity | String | Destination city |
| 49 | sasn | UInt32 | Source ASN |
| 50 | dasn | UInt32 | Destination ASN |

nDPI fields:

| # | Field | Type | Description |
|---|---|---|---|
| 51 | ndpi_appid | String | nDPI application ID (e.g., “TLS.YouTube”) |
| 52 | ndpi_category | String | nDPI category (e.g., “Streaming”) |
| 53 | ndpi_risk_score | UInt32 | nDPI cumulative risk score |
| 54 | ndpi_risk_severity | UInt8 | Risk severity (0=none, 1=low, 2=medium, 3=high) |

Professional Schema (60 Fields)

Basic schema + GeoIP AS organization names and nDPI fingerprinting.

Additional GeoIP fields (AS organization):

| # | Field | Type | Description |
|---|---|---|---|
| 55 | sasnorg | String | Source ASN organization |
| 56 | dasnorg | String | Destination ASN organization |

nDPI fingerprint fields:

| # | Field | Type | Description |
|---|---|---|---|
| 57 | ndpi_ja4 | String | JA4 TLS client fingerprint (via nDPI) |
| 58 | ndpi_ja3s | String | JA3 TLS server fingerprint (via nDPI) |
| 59 | ndpi_tcp_fp | String | TCP fingerprint with OS hint (via nDPI) |
| 60 | ndpi_fp | String | nDPI composite fingerprint |

Enterprise Schema (63+ Fields)

Professional schema + anomaly detection and ML predictions.

Anomaly detection fields:

| # | Field | Type | Description |
|---|---|---|---|
| 61 | anomaly_score | Float32 | Anomaly score (0.0 - 1.0) |
| 62 | anomaly_severity | String | Severity (LOW, MEDIUM, HIGH, CRITICAL) |
| 63 | anomaly_factors | String | Contributing factors |

File Naming

| Tier | File Pattern |
|---|---|
| Community | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Basic | rockfish-v1-YYYYMMDD-HHMMSS.parquet |
| Professional | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
| Enterprise | rockfish-<observation>-v2-YYYYMMDD-HHMMSS.parquet |
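
A small sketch for pulling the pieces out of these names (the regular expression is an assumption derived from the patterns above, not a published spec):

import re

PATTERN = re.compile(
    r"rockfish-(?:(?P<observation>.+)-)?v(?P<version>\d)-"
    r"(?P<date>\d{8})-(?P<time>\d{6})\.parquet"
)

m = PATTERN.match("rockfish-sensor-01-v2-20250128-143200.parquet")
print(m.groupdict())
# {'observation': 'sensor-01', 'version': '2', 'date': '20250128', 'time': '143200'}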

S3 Path Structure

With Hive partitioning enabled:

s3://<bucket>/<prefix>/v1/year=YYYY/month=MM/day=DD/*.parquet
s3://<bucket>/<prefix>/v2/year=YYYY/month=MM/day=DD/*.parquet

Field Descriptions

Flow Identification

  • flowid: Unique UUID for deduplication and correlation
  • obname: Observation domain name (sensor identifier)

Timing

  • stime/etime: Timestamps with microsecond precision, UTC
  • dur: Duration in milliseconds
  • rtt: Estimated TCP round-trip time

Network Addresses

  • saddr/daddr: IPv4 or IPv6 addresses as strings
  • sport/dport: Port numbers (0 for non-TCP/UDP)
  • smac/dmac: MAC addresses in standard notation

Traffic Volumes

  • spkts/dpkts: Packet counts per direction
  • sbytes/dbytes: Byte counts per direction
  • pcr: Producer-consumer ratio: (sent-recv)/(sent+recv)
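
For example, a flow that sends far more than it receives has a PCR close to +1 (whether the stored value is byte- or packet-based, and how it maps onto the Int32 column, is not specified here; bytes are assumed for illustration):

def pcr(sent: int, received: int) -> float:
    # Producer-consumer ratio: (sent - received) / (sent + received)
    return (sent - received) / (sent + received)

print(pcr(sent=47_000_000, received=500_000))   # ~0.98: heavily outbound (producer)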

TCP Flags

  • iflags: Initial TCP flags (SYN, ACK, etc.)
  • uflags: Union of all flags seen in flow

Payload Analysis

  • sentropy/dentropy: Shannon entropy scaled to 0-255
    • ~230 and above: Likely encrypted/compressed
    • ~140: English text
    • Low: Sparse or zero-padded
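
A sketch of how a byte-scaled Shannon entropy like this can be computed (the 255/8 scaling is an assumption for illustration; the probe's exact method is not documented here):

import math
from collections import Counter

def payload_entropy_255(payload: bytes) -> int:
    # Shannon entropy of the payload's byte distribution (0-8 bits), rescaled to 0-255
    if not payload:
        return 0
    counts = Counter(payload)
    total = len(payload)
    bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return round(bits * 255 / 8)

print(payload_entropy_255(bytes(range(256)) * 4))         # 255: uniform bytes, encrypted-like
print(payload_entropy_255(b"the quick brown fox " * 50))  # much lower: English-like text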

Flow Termination

  • reason: Why the flow ended
    • idle: Idle timeout
    • active: Active timeout
    • eof: End of capture
    • end: FIN exchange
    • rst: TCP reset

GeoIP (Professional+)

  • scountry/dcountry: ISO 3166-1 alpha-2 codes
  • sasn/dasn: Autonomous System Numbers
  • sasnorg/dasnorg: AS organization names

nDPI Detection (Basic+)

  • ndpi_appid: Application identifier (e.g., “TLS.YouTube”)
  • ndpi_category: Category (e.g., “Streaming”)
  • ndpi_risk_score: Cumulative risk score
  • ndpi_risk_severity: 0=none, 1=low, 2=medium, 3=high

nDPI Fingerprints (Professional+)

  • ndpi_ja4: JA4 TLS client fingerprint
  • ndpi_ja3s: JA3 TLS server fingerprint
  • ndpi_tcp_fp: TCP fingerprint with OS detection hint (format: “fingerprint/os”)
  • ndpi_fp: nDPI composite fingerprint for device correlation

Anomaly Detection (Enterprise)

  • anomaly_score: 0.0-1.0 indicating how unusual the flow is
  • anomaly_severity: Classification based on score percentile
  • anomaly_factors: Fields contributing most to the score

Parquet File Metadata

Each file includes custom metadata:

| Key | Description |
|---|---|
| rockfish.license_id | License identifier |
| rockfish.tier | License tier |
| rockfish.company | Company name |
| rockfish.observation | Observation domain |
| rockfish.schema_version | Schema version |
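
These keys can be read with any Parquet library, for example with pyarrow (a sketch; the file name is illustrative):

import pyarrow.parquet as pq

schema = pq.read_schema("rockfish-sensor-01-v2-20250128-143200.parquet")
meta = {k.decode(): v.decode() for k, v in (schema.metadata or {}).items()}
print(meta.get("rockfish.tier"), meta.get("rockfish.observation"))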

Example Queries

DuckDB - Read from S3

SELECT * FROM read_parquet(
    's3://bucket/v2/year=2025/month=01/day=28/*.parquet',
    hive_partitioning=true
);

Count by Protocol

SELECT proto, COUNT(*) as count
FROM read_parquet('flows/*.parquet')
GROUP BY proto
ORDER BY count DESC;

Filter by Country (Professional+)

SELECT saddr, daddr, scountry, dcountry, ndpi_appid
FROM read_parquet('flows/*.parquet')
WHERE scountry = 'US' AND dcountry != 'US';

High-Risk Flows (Basic+)

SELECT stime, saddr, daddr, ndpi_appid, ndpi_risk_score
FROM read_parquet('flows/*.parquet')
WHERE ndpi_risk_score > 100
ORDER BY ndpi_risk_score DESC;

Anomalous Flows (Enterprise)

SELECT stime, saddr, daddr, anomaly_score, anomaly_severity
FROM read_parquet('flows/*.parquet')
WHERE anomaly_severity IN ('HIGH', 'CRITICAL')
ORDER BY anomaly_score DESC
LIMIT 100;

CLI Reference

Command-line options for Rockfish tools.

rockfish_probe

Usage

rockfish_probe [OPTIONS]

Global Options

| Option | Short | Description |
|---|---|---|
| --config <FILE> | -c | Configuration file path |
| --help | -h | Show help |
| --version | -V | Show version |

Input Options

| Option | Short | Description |
|---|---|---|
| --source <SRC> | -i | Input source (interface or pcap file) |
| --live <TYPE> | | Capture type: pcap, afpacket, netmap, fmadio |
| --filter <EXPR> | | BPF filter expression |
| --snaplen <BYTES> | | Maximum capture bytes per packet |
| --promisc-off | | Disable promiscuous mode |

Flow Options

| Option | Description |
|---|---|
| --idle-timeout <SECS> | Idle timeout (default: 300) |
| --active-timeout <SECS> | Active timeout (default: 1800) |
| --max-flows <COUNT> | Maximum flow table size |
| --max-payload <BYTES> | Max payload bytes to capture |
| --udp-uniflow <PORT> | UDP uniflow port (0=disabled) |
| --ndpi | Enable nDPI (includes JA4/JA3s fingerprints) |

Fragment Options

| Option | Description |
|---|---|
| --no-frag | Disable fragment reassembly |
| --max-frag-tables <N> | Max fragment tables (default: 1024) |
| --frag-timeout <SECS> | Fragment timeout (default: 30) |

AF_PACKET Options (Linux)

| Option | Description |
|---|---|
| --afp-block-size <BYTES> | Ring buffer block size |
| --afp-block-count <N> | Ring buffer block count |
| --afp-fanout-group <ID> | Fanout group ID |
| --afp-fanout-mode <MODE> | Fanout mode: hash, lb, cpu, rollover, random |

Output Options

| Option | Description |
|---|---|
| --parquet-dir <DIR> | Output directory for Parquet files |
| --parquet-batch-size <N> | Flows per file |
| --parquet-prefix <PREFIX> | Filename prefix |
| --parquet-schema <TYPE> | Schema: simple or extended |
| --observation <NAME> | Observation domain name |
| --hive-boundary-flush | Flush at day boundaries |

S3 Options

| Option | Description |
|---|---|
| --s3-bucket <NAME> | S3 bucket name |
| --s3-prefix <PREFIX> | S3 key prefix |
| --s3-region <REGION> | AWS region |
| --s3-endpoint <URL> | Custom S3 endpoint |
| --s3-force-path-style | Use path-style URLs |
| --s3-hive-partitioning | Enable Hive partitioning |
| --s3-delete-after-upload | Delete local after upload |
| --test-s3 | Test S3 connectivity and exit |

Logging Options

| Option | Short | Description |
|---|---|---|
| --verbose | -v | Increase verbosity (-vv for debug) |
| --quiet | -q | Quiet mode |
| --stats | | Print statistics |
| --log-file <PATH> | | Log file path |

License Options

| Option | Description |
|---|---|
| --license <PATH> | License file path |

Environment: ROCKFISH_LICENSE_PATH

Examples

# Basic PCAP processing
rockfish_probe -i capture.pcap --parquet-dir ./flows

# Live capture with AF_PACKET
sudo rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 4194304 \
    --afp-fanout-group 1 \
    --parquet-dir ./flows

# With all features (nDPI includes fingerprints)
rockfish_probe -i eth0 --live afpacket \
    --ndpi \
    --parquet-dir ./flows \
    --s3-bucket my-bucket \
    --s3-region us-east-1 \
    --s3-hive-partitioning \
    -vv

# Test S3 connectivity
rockfish_probe --test-s3 \
    --s3-bucket my-bucket \
    --s3-region us-east-1

rockfish_mcp

Usage

rockfish_mcp [OPTIONS]

Options

| Option | Description |
|---|---|
| --config <FILE> | Configuration file path |
| --help | Show help |
| --version | Show version |

Environment: ROCKFISH_CONFIG

Examples

# Start with config file
ROCKFISH_CONFIG=config.yaml rockfish_mcp

# Or via argument
rockfish_mcp --config /etc/rockfish/mcp.yaml

Common Patterns

Processing Multiple PCAPs

# Glob pattern
rockfish_probe -i "/data/captures/*.pcap" --parquet-dir ./flows

# Multiple runs
for f in /data/captures/*.pcap; do
    rockfish_probe -i "$f" --parquet-dir ./flows
done

High-Performance Capture

# Pin to CPUs, large ring buffer, fanout
sudo taskset -c 0-3 rockfish_probe -i eth0 --live afpacket \
    --afp-block-size 4194304 \
    --afp-block-count 128 \
    --afp-fanout-group 1 \
    --afp-fanout-mode hash \
    --parquet-dir /data/flows

Development/Testing

# Verbose output, no S3
rockfish_probe -i test.pcap \
    --parquet-dir ./test-flows \
    --ndpi \
    --stats \
    -vv

Production Deployment

# Full featured with S3
rockfish_probe -c /opt/rockfish/etc/config.yaml \
    --license /opt/rockfish/etc/license.json

License Tiers

Rockfish uses a tiered licensing model to enable different feature sets.

Tier Comparison

| Feature | Community | Basic | Professional | Enterprise |
|---|---|---|---|---|
| Core Features | | | | |
| Packet capture | Yes | Yes | Yes | Yes |
| Flow generation | Yes | Yes | Yes | Yes |
| Parquet export | Yes | Yes | Yes | Yes |
| S3 upload | Yes | Yes | Yes | Yes |
| Schema | | | | |
| v1 (Simple - 54 fields) | Yes | Yes | Yes | Yes |
| v2 (Extended - 60 fields) | - | - | Yes | Yes |
| Application Detection | | | | |
| nDPI labeling | - | Yes | Yes | Yes |
| nDPI risk scoring | - | Yes | Yes | Yes |
| Network Intelligence | | | | |
| GeoIP country/city/ASN | - | Yes | Yes | Yes |
| GeoIP AS organization | - | - | Yes | Yes |
| nDPI fingerprints (JA4, JA3s, TCP) | - | - | Yes | Yes |
| Customization | | | | |
| Custom observation name | - | Yes | Yes | Yes |
| Advanced Features | | | | |
| Anomaly detection | - | - | - | Yes |
| ML model integration | - | - | - | Yes |

Feature Details

Community Tier

Free tier with basic flow capture:

  • Standard 5-tuple flow generation
  • Parquet export (v1 schema)
  • S3 upload support
  • AF_PACKET high-performance capture
  • Fragment reassembly

Basic Tier

Adds application visibility and GeoIP intelligence:

  • All Community features
  • nDPI application labeling
  • nDPI risk scoring and categories
  • GeoIP lookups (scountry, dcountry, scity, dcity, sasn, dasn)
  • Custom observation domain name
  • 54 fields total

Professional Tier

Adds AS organization names and device fingerprinting:

  • All Basic features
  • Extended schema (60 fields)
  • GeoIP AS organization names (sasnorg, dasnorg)
  • nDPI fingerprints (JA4 client, JA3 server, TCP fingerprint, composite)

Enterprise Tier

Full feature set:

  • All Professional features
  • Anomaly detection (HBOS)
  • ML model integration
  • SaaS schema (63+ fields)
  • Correlation with rockfish_sensor

Schema Comparison

v1 (Simple) - Community/Basic

54 core fields:

  • Flow identification (flowid, obname)
  • Timing (stime, etime, dur, rtt)
  • Addresses (saddr, daddr, sport, dport)
  • Traffic (spkts, dpkts, sbytes, dbytes)
  • TCP state (iflags, uflags, sequences)
  • Payload analysis (entropy, packet sizes)
  • GeoIP: scountry, dcountry, scity, dcity, sasn, dasn (Basic tier)
  • nDPI results (Basic tier)

v2 (Extended) - Professional/Enterprise

60 fields (v1 + 6 additional):

  • GeoIP AS organization: sasnorg, dasnorg
  • nDPI fingerprints: ndpi_ja4, ndpi_ja3s, ndpi_tcp_fp, ndpi_fp

v3 (SaaS) - Enterprise

63+ fields:

  • All v2 fields
  • Anomaly scores
  • ML predictions
  • Correlation IDs

License Enforcement

Parquet Metadata

Licensed files include metadata for validation:

rockfish.license_id: "lic_abc123"
rockfish.tier: "professional"
rockfish.company: "Example Corp"
rockfish.observation: "sensor-01"

MCP Validation

Configure license validation in MCP:

sources:
  # Require valid license
  prod_flows:
    path: s3://data/flows/
    require_license: true

  # Restrict to specific licenses
  enterprise_flows:
    path: s3://data/enterprise/
    require_license: true
    allowed_license_ids:
      - "lic_abc123"

Obtaining a License

Contact [email protected] for:

  • License quotes
  • Trial licenses
  • Enterprise agreements
  • Volume discounts