// Production / Inference Optimization

Scalable LLM deployment on Ray Serve.

Production Ray Serve clusters tuned for throughput, latency, and cost - built by engineers who've put 1B+ tokens/day in front of paying customers.

// Why Ray Serve

The ML team gets velocity. The infra team gets predictable scale.

01

Built for traffic spikes

Queue-aware autoscaling that adds replicas on real backpressure - not on flat CPU charts. Scale to zero when traffic drops so your GPU bill follows usage.
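
A minimal sketch of what that looks like as a Ray Serve autoscaling config - the class and thresholds are illustrative, and the exact config keys shift slightly between Ray versions:

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 0,             # scale to zero when traffic drops
        "max_replicas": 8,
        "target_ongoing_requests": 4,  # scale on in-flight requests (backpressure), not CPU
        "upscale_delay_s": 10,         # react quickly to bursts
        "downscale_delay_s": 300,      # drain slowly to avoid flapping
    },
)
class ChatModel:
    def __init__(self):
        # load weights here (PyTorch, vLLM, ...)
        pass

    async def __call__(self, http_request) -> str:
        return "ok"  # stand-in for the real generation call

app = ChatModel.bind()
# serve.run(app)  # local smoke test
```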

02

Multi-model, multi-tenant by design

Run dozens of models on the same cluster with fractional GPU allocation, per-route weighting, and isolation that prevents one tenant from starving another.
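
Fractional allocation is a per-deployment declaration. A sketch with placeholder model classes - note that Ray's GPU fractions are a scheduling hint, so hard memory isolation still comes from MIG or per-model memory caps:

```python
from ray import serve

# Two small models packed onto the same GPU; Ray places replicas by the declared fraction.
@serve.deployment(ray_actor_options={"num_gpus": 0.25}, num_replicas=2)
class Reranker:
    async def __call__(self, http_request) -> str:
        return "rerank result"      # stand-in for the real model

@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Embedder:
    async def __call__(self, http_request) -> list:
        return [0.0] * 768          # stand-in for the real embedding
```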

03

Python-native, no rewrite

Your research code becomes the production code. Ray Serve wraps existing PyTorch and vLLM logic in deployments - no FastAPI plumbing, no protobuf, no Triton model repository to maintain.
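
As a sketch, assuming a `generate()` function you already call from research code - the wrapper is a decorated class and a `bind()`:

```python
from ray import serve

def generate(prompt: str) -> str:
    # Existing PyTorch / vLLM inference code, unchanged.
    return prompt.upper()

@serve.deployment
class Generator:
    async def __call__(self, http_request) -> str:
        payload = await http_request.json()  # Ray Serve hands you a Starlette request
        return generate(payload["prompt"])

app = Generator.bind()
# serve.run(app)  # then POST {"prompt": "..."} to the route
```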

// Case Study

Production LLM Processing at Surfer Scale

We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.

  • 300B+ tokens processed
  • 100k+ credits sold in 6 months
  • 5 months from concept to full product release

Read the case study

// What we deliver

Production LLM serving, end to end.

From cluster design to on-call runbooks. We build it to be operated by your team - not by ours forever.

Cluster architecture

Deployment graphs, replica groups, and fractional GPU scheduling sized to your traffic shape - not someone's reference architecture.

  • Multi-node clusters on EKS, GKE, on-prem K8s, or KubeRay
  • Fractional GPU allocation for small models, MIG slices on H100
  • Deployment graphs for RAG, ensembles, and multi-step pipelines
  • Spot/on-demand mix with safe drain on preemption
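
A sketch of the deployment-graph pattern from the list above, with stub retriever and generator stages standing in for the real components:

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Retriever:
    async def fetch(self, query: str) -> str:
        return f"context for {query}"      # stand-in for a vector-store lookup

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Generator:
    async def complete(self, prompt: str) -> str:
        return f"answer({prompt})"         # stand-in for the actual LLM call

@serve.deployment
class RAGPipeline:
    def __init__(self, retriever: DeploymentHandle, generator: DeploymentHandle):
        self.retriever = retriever
        self.generator = generator

    async def __call__(self, http_request) -> str:
        query = (await http_request.json())["query"]
        context = await self.retriever.fetch.remote(query)
        return await self.generator.complete.remote(f"{context}\n{query}")

app = RAGPipeline.bind(Retriever.bind(), Generator.bind())
# serve.run(app)
```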

Inference optimization

We pick the right backend (vLLM, TGI, TensorRT-LLM) for your model and traffic, then tune the knobs that actually move p99.

  • Continuous batching and paged attention via vLLM
  • KV cache reuse across turns and prefix sharing
  • Speculative decoding and assisted generation where it pays off
  • Quantization (AWQ, GPTQ, FP8) with quality regression checks
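
For illustration, the engine-level knobs we usually start from - the model id is a placeholder (the quantization flag has to match the checkpoint format) and argument names move between vLLM releases:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-awq-checkpoint",  # placeholder HF-format model id
    quantization="awq",                    # or "gptq" / "fp8", gated on eval regressions
    gpu_memory_utilization=0.90,           # headroom for the paged-attention KV cache
    max_num_seqs=256,                      # continuous-batching concurrency ceiling
    enable_prefix_caching=True,            # reuse shared prompt prefixes across requests
)

out = llm.generate(
    ["Summarize Ray Serve in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(out[0].outputs[0].text)
```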

Autoscaling that respects cost

Most autoscalers either over-provision GPUs or melt under burst. We tune for both directions.

  • Queue-length and TTFT-based scaling signals
  • Cold-start mitigation via warm pools and model preloading
  • Per-deployment min/max replicas and cooldowns
  • Cost dashboards down to $ per million tokens
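
The cost math itself is simple; the work is wiring real throughput into it. A back-of-the-envelope helper with illustrative numbers:

```python
def dollars_per_million_tokens(gpu_hourly_usd: float,
                               num_gpus: int,
                               tokens_per_second: float) -> float:
    """Rough serving cost: GPU spend per hour divided by tokens served per hour."""
    tokens_per_hour = tokens_per_second * 3600
    cost_per_hour = gpu_hourly_usd * num_gpus
    return cost_per_hour / tokens_per_hour * 1_000_000

# Example: 2 GPUs at $2.50/hr sustaining 2,500 output tokens/s
# -> 5.00 / 9,000,000 * 1e6 ≈ $0.56 per million tokens
print(round(dollars_per_million_tokens(2.50, 2, 2500.0), 2))
```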

Routing and traffic control

The boring infrastructure that makes safe rollouts possible - canary, shadow, A/B, kill switches.

  • Canary and shadow deploys for new models or prompts
  • Per-tenant rate limiting and priority queues
  • Streaming responses with SSE and token-level cancellation
  • Sticky sessions for stateful agent workloads
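
A minimal canary-router sketch built on Ray Serve deployment handles - the model classes and the 5% split are placeholders:

```python
import random

from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Model:
    def __init__(self, tag: str):
        self.tag = tag                      # stand-in for loading a specific checkpoint

    async def generate(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"

@serve.deployment
class CanaryRouter:
    def __init__(self, stable: DeploymentHandle, canary: DeploymentHandle,
                 canary_weight: float = 0.05):
        self.stable, self.canary = stable, canary
        self.canary_weight = canary_weight  # set to 0.0 and redeploy as the kill switch

    async def __call__(self, http_request) -> str:
        prompt = (await http_request.json())["prompt"]
        handle = self.canary if random.random() < self.canary_weight else self.stable
        return await handle.generate.remote(prompt)

app = CanaryRouter.bind(Model.bind("stable"), Model.bind("candidate"))
# serve.run(app)
```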

Observability built in

If you can't see it, you can't scale it. We wire metrics, traces, and evals into every deployment from day one.

  • Token, latency, and cost tracking with Prometheus + Grafana
  • Per-route eval suites tied to deploy gates
  • Distributed traces across deployment graphs
  • Alerting on regressions, drift, and quota exhaustion
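
Custom counters and histograms ride on Ray's built-in Prometheus export. A sketch using `ray.util.metrics`, with stub values where the real model call would go:

```python
import time

from ray import serve
from ray.util.metrics import Counter, Histogram

@serve.deployment
class InstrumentedModel:
    def __init__(self):
        # Exported through Ray's metrics endpoint, scraped by Prometheus, graphed in Grafana.
        self.tokens = Counter("llm_output_tokens",
                              description="Output tokens served",
                              tag_keys=("route",))
        self.ttft = Histogram("llm_ttft_seconds",
                              description="Time to first token",
                              boundaries=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
                              tag_keys=("route",))

    async def __call__(self, http_request) -> str:
        start = time.perf_counter()
        text = "stub completion"  # stand-in for the real generation call
        self.ttft.observe(time.perf_counter() - start, tags={"route": "chat"})
        self.tokens.inc(len(text.split()), tags={"route": "chat"})
        return text
```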

Production hardening

Production-grade reliability whether you run in your cloud, in ours, or air-gapped on your customer's metal.

  • Air-gapped install runbooks and offline model registry
  • Graceful degradation under partial GPU failure
  • Runtime upgrades with zero-downtime model swaps
  • Audit logging suitable for regulated environments

// Expert insight

Most teams reach for Ray Serve and stop at the basic deployment. The leverage is in the deployment graphs - chaining a router, a base model, and a dozen LoRA adapters in a single Python file you can actually reason about.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Scientists who ship. Operators who don't leave.

We bridge the gap between research-grade ML and production ops - because most teams have one or the other, not both.

1B+ tokens/day in production

We've operated LLM platforms at the scale of a top-10 SEO product - not just spun up a demo.

Open-weight specialists

Llama, Qwen, Mistral, DeepSeek - we know which checkpoint to start from and which knobs to tune for your workload.

Ray + vLLM + Triton

Multi-backend by default. We pick the right one per route, not the one we read about last week.

16+ open-source models

Hugging Face contributors with 80K+ monthly downloads. We ship models, not just slides about models.

On-prem & air-gapped capable

Deployed in environments that block outbound traffic and require signed-off bills of materials.

Senior team, no juniors

Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your project.

// FAQ

Common questions about Ray Serve deployments

Why do we need Ray Serve if we already run vLLM?

vLLM is a fantastic engine. Ray Serve is the layer above it - autoscaling, multi-model routing, deployment graphs, and the ops you'd otherwise hand-build with FastAPI, k8s HPA, and a lot of glue. We use both: vLLM as the inference backend, Ray Serve as the orchestration layer.

Can you deploy on our existing Kubernetes cluster?

Yes. KubeRay runs on any conformant Kubernetes (EKS, GKE, AKS, OpenShift, on-prem). We integrate with your existing ingress, service mesh, secrets store, and observability stack rather than asking you to swap them out.

How do you keep GPU costs under control when traffic is spiky?

We scale on backpressure (queue length, time-to-first-token), not CPU. Combined with warm pools, model preloading, and per-deployment min/max replicas, the cluster matches traffic in seconds while staying near zero off-hours. Cost dashboards translate scaling decisions into $ per million tokens so you can see the tradeoffs.

Can you serve our fine-tuned models?

Most of our deployments use customer-specific fine-tunes - LoRA adapters, full SFT, DPO. Ray Serve makes this cleaner via deployment graphs: a base model deployment with multiple LoRA adapters routed by request metadata.

How long does a first deployment take?

First production deployment ranges from 3 to 8 weeks depending on scope. A simple single-model rollout with autoscaling and observability lands in around 3 weeks. Multi-model platforms with custom routing, evals, and on-prem compliance work take longer. We work in weekly increments with a working system at the end of each.

Do you work in air-gapped or regulated environments?

Yes - we've deployed in air-gapped environments for regulated and defense-adjacent customers. We provide signed install bundles, an offline model registry, and runbooks that don't assume outbound network access.

Who operates the platform after launch?

Two options. Either we hand off with documentation, runbooks, and a few weeks of paired on-call with your team - or we stay on as the team that operates the platform. Most customers start with the second and graduate to the first.

// Let's ship it

Ship LLM inference that scales - without cost surprises.

Tell us your traffic shape, your model, and your latency budget. We'll come back with a deployment plan and a number, usually within a business day.

Karol Gawron

Head of R&D @ bards.ai