Benchmarking LLM APIs Under High-Velocity BGP Streams
Published December 28, 2025

Reason For This Experiment
Most LLM benchmarks evaluate static prompts, offline datasets, or batch inference. That tells us very little about how LLM APIs behave in real-time streaming systems.
This experiment asks a different question:
What happens when you push high-velocity, unbounded network telemetry into an LLM and expect near-real-time output?
To explore this, I routed live BGP update streams into multiple LLM APIs using identical prompts, parameters, and infrastructure.
The goal is not to declare a “best” model, but to understand architectural trade-offs:
- latency vs reasoning
- verbosity vs efficiency
- stability under continuous load
Test Setup
Streaming Source
- WebSocket: `wss://ris-live.ripe.net/v1/ws/?client=turbomart-test`
- Subscription message:

```json
{
  "type": "ris_subscribe",
  "data": { "host": "rrc21" }
}
```
This produces a live firehose of BGP updates with no batching or artificial throttling.
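For reference, here is a minimal sketch of how such a stream can be consumed, assuming Python's `websockets` library; the library choice and message handling are illustrative, not the exact harness used in this test.

```python
import asyncio
import json

import websockets  # assumed client library; any WebSocket client works

RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=turbomart-test"


async def stream_bgp_updates():
    """Yield live BGP updates from RIS Live for a single route collector."""
    async with websockets.connect(RIS_LIVE_URL) as ws:
        # Subscribe to rrc21, matching the test setup above.
        await ws.send(json.dumps({
            "type": "ris_subscribe",
            "data": {"host": "rrc21"},
        }))
        # RIS Live pushes one JSON message per update -- no batching.
        async for raw in ws:
            yield json.loads(raw)


async def main():
    async for update in stream_bgp_updates():
        print(update.get("type"), update.get("data", {}).get("host"))


if __name__ == "__main__":
    asyncio.run(main())
```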
Prompt Configuration (Identical Across All Models)
System Prompt
You are an expert network engineer who makes a living analyzing BGP feeds. You work at a data center, and in the evenings you volunteer managing a local IXP in Texas. You can read a BGP feed like it is second nature.
User Prompt
Summarize the following BGP update in under 140 characters for a real-time network alert. Include ASN details such as who owns the ASN, the prefix, and the region if known.
No model-specific tuning, retries, or truncation guards were applied.
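For concreteness, a minimal sketch of how each BGP update could be wrapped in the identical prompt; the function and variable names are illustrative, not the exact harness.

```python
import json

SYSTEM_PROMPT = (
    "You are an expert network engineer who makes a living analyzing BGP feeds. "
    "You work at a data center, and in the evenings you volunteer managing a local "
    "IXP in Texas. You can read a BGP feed like it is second nature."
)

USER_PROMPT_TEMPLATE = (
    "Summarize the following BGP update in under 140 characters for a real-time "
    "network alert. Include ASN details such as who owns the ASN, the prefix, "
    "and the region if known.\n\n{update}"
)


def build_messages(bgp_update: dict) -> list[dict]:
    """Build the provider-agnostic chat payload used for every request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": USER_PROMPT_TEMPLATE.format(update=json.dumps(bgp_update))},
    ]
```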
Metrics Collected
For each request, the following metrics were recorded:
- Time to First Token (TTFT)
- Total latency (full completion time)
- Tokens in
- Tokens out
- Compression ratio (tokens_out ÷ tokens_in)
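A minimal sketch of how these metrics can be captured around a streaming chat completion, assuming the OpenAI Python SDK; other providers need their equivalent streaming call, and the model name is a placeholder.

```python
import time

from openai import OpenAI  # assumed SDK; each provider needs its own streaming call

client = OpenAI()


def measure_request(messages: list[dict], model: str = "gpt-4o-mini") -> dict:
    """Time one streaming completion and derive the per-request metrics."""
    start = time.perf_counter()
    ttft = None
    text_parts = []
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start  # Time to First Token
            text_parts.append(chunk.choices[0].delta.content)

    total_latency = time.perf_counter() - start
    tokens_in = usage.prompt_tokens if usage else None
    tokens_out = usage.completion_tokens if usage else None
    return {
        "ttft_ms": (ttft or total_latency) * 1000,
        "total_latency_ms": total_latency * 1000,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        # Compression ratio as defined above: tokens_out / tokens_in.
        "compression": (tokens_out / tokens_in) if tokens_in else None,
        "summary": "".join(text_parts),
    }
```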
These metrics directly affect real-time systems such as:
- alerting pipelines
- streaming dashboards
- LLM-in-the-loop observability
- backpressure-sensitive architectures
Average Quantitative Results
Averages computed across all samples per provider.
| LLM Provider | Avg TTFT (ms) | Avg Total Latency (ms) | Avg Tokens In | Avg Tokens Out | Avg Compression | Samples |
|---|---|---|---|---|---|---|
| OpenAI | 829.87 | 1,842.37 | 5,140.17 | 45.40 | 0.01 | 30 |
| Anthropic | 2,120.74 | 6,349.84 | 5,074.66 | 136.71 | 0.03 | 38 |
| Azure OpenAI | 2,815.26 | 2,815.29 | 5,103.84 | 9,481.97 | 1.85 | 31 |
| Gemini | 3,024.51 | 3,409.26 | 5,104.94 | 9,633.51 | 1.84 | 35 |
| Grok | 19,320.29 | 19,733.14 | 5,278.57 | 33.50 | 0.01 | 14 |
Model-by-Model Analysis
OpenAI
Observed behavior
- Fastest average TTFT
- Lowest total latency
- High compression
- Strong adherence to output constraints
Interpretation
OpenAI consistently behaves like a stream-safe summarizer. It avoids context echoing, filler text, and verbosity inflation.
This makes it suitable for:
- real-time alerts
- dashboards
- high-frequency summarization
Anthropic
Observed behavior
- Higher TTFT and total latency
- Significantly higher token output
- Richer semantic interpretation
Interpretation
Anthropic performs deeper inference per request. That cost appears clearly in latency and token usage.
It behaves less like an alert engine and more like an analyst reviewing the feed.
Best suited for:
- offline analysis
- anomaly investigation
- post-incident review
Azure OpenAI
Observed behavior
- Partial context ingestion
- High output token counts
- Repetitive filler text
- Inconsistent prompt adherence
Interpretation
A compression ratio above 1.8 means the model emits nearly twice as many tokens as it ingests, which is output inflation rather than summarization and is problematic in streaming pipelines.
This configuration likely requires:
- strict token caps
- aggressive chunking
- tighter context controls
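As one possible mitigation, a hard output cap can be set per request. A sketch assuming the `openai` SDK's Azure client; the endpoint, key, API version, deployment name, and cap value are placeholders, and `build_messages` comes from the prompt sketch above.

```python
import os

from openai import AzureOpenAI  # assumed SDK; credentials come from the environment

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder API version
)

response = client.chat.completions.create(
    model="gpt-4o-alerts",                  # Azure deployment name (placeholder)
    messages=build_messages(bgp_update),    # identical prompt payload from the sketch above
    max_tokens=60,                          # hard output cap: roughly 140 characters
    temperature=0,                          # discourage filler and verbosity drift
)
print(response.choices[0].message.content)
```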
Gemini
Observed behavior
- Moderate latency
- Truncated responses
- Low semantic richness
- High output verbosity
Interpretation
Gemini appears optimized for short-form answers rather than structured telemetry interpretation.
Not well suited for BGP-style stream summarization.
Grok
Observed behavior
- Extremely high TTFT
- Minimal output
- Focused on signaling change rather than describing it
Interpretation
Grok behaves like a delta notifier rather than a summarizer.
Useful for:
- detecting that something changed
Not suitable for:
- explaining what changed or why
Key Takeaways
This benchmark demonstrates that LLM APIs are not interchangeable when used in real-time streaming systems.
Each model embeds assumptions about:
- time sensitivity
- verbosity
- reasoning depth
- context discipline
In high-velocity environments:
- latency beats intelligence
- consistency beats creativity
- token efficiency beats verbosity
An answer that arrives late is operationally equivalent to noise.
Future work will include:
- p95 / p99 latency distributions
- streaming jitter analysis
- cost-normalized efficiency metrics
- sustained-load backpressure testing