Benchmarking LLM APIs Under High-Velocity BGP Streams
Published December 28, 2025

Reason For This Experiment
Most LLM benchmarks evaluate static prompts, offline datasets, or batch inference. That tells us very little about how LLM APIs behave in real-time streaming systems.
This experiment asks a different question:
What happens when you push high-velocity, unbounded network telemetry into an LLM and expect near-real-time output?
To explore this, I routed live BGP update streams into multiple LLM APIs using identical prompts, parameters, and infrastructure.
The goal is not to declare a “best” model, but to understand architectural trade-offs:
- latency vs reasoning
- verbosity vs efficiency
- stability under continuous load
Test Setup
Streaming Source
- WebSocket: `wss://ris-live.ripe.net/v1/ws/?client=turbomart-test`
- Subscription message:

```json
{
  "type": "ris_subscribe",
  "data": { "host": "rrc21" }
}
```
This produces a live firehose of BGP updates with no batching or artificial throttling.
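For reference, here is a minimal sketch of how such a stream can be consumed, assuming Python's `websockets` library; the library choice and message handling are illustrative, not the exact harness used in this test.

```python
import asyncio
import json

import websockets  # assumed client library; any WebSocket client works

RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=turbomart-test"


async def stream_bgp_updates():
    """Yield live BGP updates from RIS Live for a single route collector."""
    async with websockets.connect(RIS_LIVE_URL) as ws:
        # Subscribe to rrc21, matching the test setup above.
        await ws.send(json.dumps({
            "type": "ris_subscribe",
            "data": {"host": "rrc21"},
        }))
        # RIS Live pushes one JSON message per update -- no batching.
        async for raw in ws:
            yield json.loads(raw)


async def main():
    async for update in stream_bgp_updates():
        print(update.get("type"), update.get("data", {}).get("host"))


if __name__ == "__main__":
    asyncio.run(main())
```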
Prompt Configuration (Identical Across All Models)
System Prompt
You are an expert network engineer who makes a living analyzing BGP feeds. You work at a data center, and in the evenings you volunteer managing a local IXP in Texas. You can read a BGP feed like it is second nature.
User Prompt
Summarize the following BGP update in under 140 characters for a real-time network alert. Include ASN details such as who owns the ASN, the prefix, and the region if known.
No model-specific tuning, retries, or truncation guards were applied.
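For concreteness, a minimal sketch of how each BGP update could be wrapped in the identical prompt; the function and variable names are illustrative, not the exact harness.

```python
import json

SYSTEM_PROMPT = (
    "You are an expert network engineer who makes a living analyzing BGP feeds. "
    "You work at a data center, and in the evenings you volunteer managing a local "
    "IXP in Texas. You can read a BGP feed like it is second nature."
)

USER_PROMPT_TEMPLATE = (
    "Summarize the following BGP update in under 140 characters for a real-time "
    "network alert. Include ASN details such as who owns the ASN, the prefix, "
    "and the region if known.\n\n{update}"
)


def build_messages(bgp_update: dict) -> list[dict]:
    """Build the provider-agnostic chat payload used for every request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": USER_PROMPT_TEMPLATE.format(update=json.dumps(bgp_update))},
    ]
```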
Metrics Collected
For each request, the following metrics were recorded:
- Time to First Token (TTFT)
- Total latency (full completion time)
- Tokens in
- Tokens out
- Compression ratio (tokens_out ÷ tokens_in)
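A minimal sketch of how these metrics can be captured around a streaming chat completion, assuming the OpenAI Python SDK; other providers need their equivalent streaming call, and the model name is a placeholder.

```python
import time

from openai import OpenAI  # assumed SDK; each provider needs its own streaming call

client = OpenAI()


def measure_request(messages: list[dict], model: str = "gpt-4o-mini") -> dict:
    """Time one streaming completion and derive the per-request metrics."""
    start = time.perf_counter()
    ttft = None
    text_parts = []
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # Time to First Token
            text_parts.append(chunk.choices[0].delta.content)

    total_latency = time.perf_counter() - start
    tokens_in = usage.prompt_tokens if usage else None
    tokens_out = usage.completion_tokens if usage else None
    return {
        "ttft_ms": (ttft or total_latency) * 1000,
        "total_latency_ms": total_latency * 1000,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        # Compression ratio as defined above: tokens_out / tokens_in.
        "compression": (tokens_out / tokens_in) if tokens_in else None,
        "summary": "".join(text_parts),
    }
```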
These metrics directly affect real-time systems such as:
- alerting pipelines
- streaming dashboards
- LLM-in-the-loop observability
- backpressure-sensitive architectures
Average Quantitative Results
Averages computed across all samples per provider.
| LLM Provider | Avg TTFT (ms) | Avg Total Latency (ms) | Avg Tokens In | Avg Tokens Out | Avg Compression | Samples |
|---|---|---|---|---|---|---|
| OpenAI | 829.87 | 1,842.37 | 5,140.17 | 45.40 | 0.01 | 30 |
| Anthropic | 2,120.74 | 6,349.84 | 5,074.66 | 136.71 | 0.03 | 38 |
| Azure OpenAI | 2,815.26 | 2,815.29 | 5,103.84 | 9,481.97 | 1.85 | 31 |
| Gemini | 3,024.51 | 3,409.26 | 5,104.94 | 9,633.51 | 1.84 | 35 |
| Grok | 19,320.29 | 19,733.14 | 5,278.57 | 33.50 | 0.01 | 14 |
Model-by-Model Analysis
OpenAI
Observed behavior
- Fastest average TTFT
- Lowest total latency
- High compression
- Strong adherence to output constraints
Interpretation
OpenAI consistently behaves like a stream-safe summarizer. It avoids context echoing, filler text, and verbosity inflation.
This makes it suitable for:
- real-time alerts
- dashboards
- high-frequency summarization
Anthropic
Observed behavior
- Higher TTFT and total latency
- Significantly higher token output
- Richer semantic interpretation
Interpretation
Anthropic performs deeper inference per request. That cost appears clearly in latency and token usage.
It behaves less like an alert engine and more like an analyst reviewing the feed.
Best suited for:
- offline analysis
- anomaly investigation
- post-incident review
Azure OpenAI
Observed behavior
- Partial context ingestion
- High output token counts
- Repetitive filler text
- Inconsistent prompt adherence
Interpretation
A compression ratio above 1.8 means the model emits nearly twice as many tokens as it ingests, which is output inflation rather than summarization and is problematic in streaming pipelines.
This configuration likely requires:
- strict token caps
- aggressive chunking
- tighter context controls
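As one possible mitigation, a hard output cap can be set per request. A sketch assuming the `openai` SDK's Azure client; the endpoint, key, API version, deployment name, and cap value are placeholders, and `build_messages` comes from the prompt sketch above.

```python
import os

from openai import AzureOpenAI  # assumed SDK; credentials come from the environment

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder API version
)

response = client.chat.completions.create(
    model="gpt-4o-alerts",                  # Azure deployment name (placeholder)
    messages=build_messages(bgp_update),    # identical prompt payload from the sketch above
    max_tokens=60,                          # hard output cap: roughly 140 characters
    temperature=0,                          # discourage filler and verbosity drift
)
print(response.choices[0].message.content)
```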
Gemini
Observed behavior
- Moderate latency
- Truncated responses
- Low semantic richness
- High output verbosity
Interpretation
Gemini appears optimized for short-form answers rather than structured telemetry interpretation.
Not well suited for BGP-style stream summarization.
Grok
Observed behavior
- Extremely high TTFT
- Minimal output
- Focused on signaling change rather than describing it
Interpretation
Grok behaves like a delta notifier rather than a summarizer.
Useful for:
- detecting that something changed
Not suitable for:
- explaining what changed or why
Key Takeaways
This benchmark demonstrates that LLM APIs are not interchangeable when used in real-time streaming systems.
Each model embeds assumptions about:
- time sensitivity
- verbosity
- reasoning depth
- context discipline
In high-velocity environments:
- latency beats intelligence
- consistency beats creativity
- token efficiency beats verbosity
An answer that arrives late is operationally equivalent to noise.
Future work will include:
- p95 / p99 latency distributions
- streaming jitter analysis
- cost-normalized efficiency metrics
- sustained-load backpressure testing