Benchmarking LLM APIs Under High-Velocity BGP Streams

Published December 28, 2025

Tags: LLM, Benchmarking, BGP, Real-time, AI

Reason For This Experiment

Most LLM benchmarks evaluate static prompts, offline datasets, or batch inference. That tells us very little about how LLM APIs behave in real-time streaming systems.

This experiment asks a different question:

What happens when you push high-velocity, unbounded network telemetry into an LLM and expect near-real-time output?

To explore this, I routed live BGP update streams into multiple LLM APIs using identical prompts, parameters, and infrastructure.

The goal is not to declare a “best” model, but to understand architectural trade-offs:

  • latency vs reasoning
  • verbosity vs efficiency
  • stability under continuous load

Test Setup

Streaming Source

  • WebSocket: wss://ris-live.ripe.net/v1/ws/?client=turbomart-test
  • Subscription Message:
{
  "type": "ris_subscribe",
  "data": { "host": "rrc21" }
}

This produces a live firehose of BGP updates with no batching or artificial throttling.
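For reference, here is a minimal sketch of the consumer side, assuming the Python websockets package (the actual collection harness is not shown in this post):

import asyncio
import json

import websockets  # assumed client library; any WebSocket client works

RIS_URL = "wss://ris-live.ripe.net/v1/ws/?client=turbomart-test"

async def stream_updates() -> None:
    async with websockets.connect(RIS_URL) as ws:
        # Subscribe to a single route collector (rrc21), exactly as above.
        await ws.send(json.dumps({
            "type": "ris_subscribe",
            "data": {"host": "rrc21"},
        }))
        # Every frame that follows is one live BGP update: an unbounded firehose.
        async for raw in ws:
            update = json.loads(raw)
            print(update.get("type"), update.get("data", {}).get("host"))

if __name__ == "__main__":
    asyncio.run(stream_updates())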


Prompt Configuration (Identical Across All Models)

System Prompt

You are an expert network engineer who makes a living analyzing BGP feeds. You work at a data center, and in the evenings you volunteer managing a local IXP in Texas. You can read a BGP feed like it is second nature.

User Prompt

Summarize the following BGP update in under 140 characters for a real-time network alert. Include ASN details such as who owns the ASN, the prefix, and the region if known.

No model-specific tuning, retries, or truncation guards were applied.
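For clarity, both prompts were wrapped into the same chat-style payload for every provider. A rough sketch of that structure follows; the OpenAI-style role/content fields and the way the raw update is appended to the user prompt are assumptions, and each provider's SDK receives its own equivalent of this:

SYSTEM_PROMPT = (
    "You are an expert network engineer who makes a living analyzing BGP feeds. "
    "You work at a data center, and in the evenings you volunteer managing a "
    "local IXP in Texas. You can read a BGP feed like it is second nature."
)

USER_PROMPT = (
    "Summarize the following BGP update in under 140 characters for a real-time "
    "network alert. Include ASN details such as who owns the ASN, the prefix, "
    "and the region if known."
)

def build_messages(bgp_update_json: str) -> list[dict]:
    # Identical structure for every provider: no tuning, retries, or truncation guards.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{USER_PROMPT}\n\n{bgp_update_json}"},
    ]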


Metrics Collected

For each request, the following metrics were recorded:

  • Time to First Token (TTFT)
  • Total latency (full completion time)
  • Tokens in
  • Tokens out
  • Compression ratio (tokens_out ÷ tokens_in)

These metrics directly affect real-time systems such as:

  • alerting pipelines
  • streaming dashboards
  • LLM-in-the-loop observability
  • backpressure-sensitive architectures
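As a minimal sketch, the five numbers above can be captured per request by timing a provider's streaming iterator. The chunk iterator and the length-based token estimate below are placeholders; the actual run relied on each provider's reported usage:

import time

def measure_stream(chunks, tokens_in: int) -> dict:
    """Time one streaming completion.

    `chunks` is any iterator yielding text fragments as the provider emits
    them; `tokens_in` is the prompt token count for this request.
    """
    start = time.perf_counter()
    ttft_ms = None
    pieces = []
    for chunk in chunks:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0  # time to first token
        pieces.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000.0  # full completion time
    text = "".join(pieces)
    tokens_out = max(1, len(text) // 4)  # rough estimate; the real harness used provider usage data
    return {
        "ttft_ms": ttft_ms,
        "total_latency_ms": total_ms,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "compression": tokens_out / tokens_in if tokens_in else None,
    }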

Average Quantitative Results

Averages computed across all samples per provider.

LLM Provider  | Avg TTFT (ms) | Avg Total Latency (ms) | Avg Tokens In | Avg Tokens Out | Avg Compression | Samples
OpenAI        | 829.87        | 1,842.37               | 5,140.17      | 45.40          | 0.01            | 30
Anthropic     | 2,120.74      | 6,349.84               | 5,074.66      | 136.71         | 0.03            | 38
Azure OpenAI  | 2,815.26      | 2,815.29               | 5,103.84      | 9,481.97       | 1.85            | 31
Gemini        | 3,024.51      | 3,409.26               | 5,104.94      | 9,633.51       | 1.84            | 35
Grok          | 19,320.29     | 19,733.14              | 5,278.57      | 33.50          | 0.01            | 14
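Each row is a plain average over that provider's samples. A small sketch of the aggregation, assuming per-request records shaped like the output of the measurement sketch above plus a provider tag:

from collections import defaultdict
from statistics import mean

def summarize(samples: list[dict]) -> dict:
    # Group per-request records by their "provider" tag, then average each metric.
    by_provider = defaultdict(list)
    for s in samples:
        by_provider[s["provider"]].append(s)
    table = {}
    for provider, rows in by_provider.items():
        table[provider] = {
            "avg_ttft_ms": mean(r["ttft_ms"] for r in rows),
            "avg_total_latency_ms": mean(r["total_latency_ms"] for r in rows),
            "avg_tokens_in": mean(r["tokens_in"] for r in rows),
            "avg_tokens_out": mean(r["tokens_out"] for r in rows),
            "avg_compression": mean(r["compression"] for r in rows),
            "samples": len(rows),
        }
    return table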

Model-by-Model Analysis

OpenAI

Observed behavior

  • Fastest average TTFT
  • Lowest total latency
  • High compression
  • Strong adherence to output constraints

Interpretation

OpenAI consistently behaves like a stream-safe summarizer. It avoids context echoing, filler text, and verbosity inflation.

This makes it suitable for:

  • real-time alerts
  • dashboards
  • high-frequency summarization

Anthropic

Observed behavior

  • Higher TTFT and total latency
  • Significantly higher token output
  • Richer semantic interpretation

Interpretation

Anthropic performs deeper inference per request. That cost appears clearly in latency and token usage.

It behaves less like an alert engine and more like an analyst reviewing the feed.

Best suited for:

  • offline analysis
  • anomaly investigation
  • post-incident review

Azure OpenAI

Observed behavior

  • Partial context ingestion
  • High output token counts
  • Repetitive filler text
  • Inconsistent prompt adherence

Interpretation

A compression ratio above 1.8 means the model emitted nearly twice as many tokens as it received: output inflation rather than compression, which is problematic in streaming pipelines.

This configuration likely requires:

  • strict token caps
  • aggressive chunking
  • tighter context controls

Gemini

Observed behavior

  • Moderate latency
  • Truncated responses
  • Low semantic richness
  • High output verbosity

Interpretation

Gemini appears optimized for short-form answers rather than structured telemetry interpretation.

Not well suited for BGP-style stream summarization.


Grok

Observed behavior

  • Extremely high TTFT
  • Minimal output
  • Focused on signaling change rather than describing it

Interpretation

Grok behaves like a delta notifier rather than a summarizer.

Useful for:

  • detecting that something changed

Not suitable for:

  • explaining what changed or why

Key Takeaways

This benchmark demonstrates that LLM APIs are not interchangeable when used in real-time streaming systems.

Each model embeds assumptions about:

  • time sensitivity
  • verbosity
  • reasoning depth
  • context discipline

In high-velocity environments:

  • latency beats intelligence
  • consistency beats creativity
  • token efficiency beats verbosity

An answer that arrives late is operationally equivalent to noise.

Future work will include:

  • p95 / p99 latency distributions
  • streaming jitter analysis
  • cost-normalized efficiency metrics
  • sustained-load backpressure testing