// Production / Inference Optimization

Scalable LLM deployment on Ray Serve.

Production Ray Serve clusters tuned for throughput, latency, and cost - built by engineers who've put 1B+ tokens/day in front of paying customers.

// Why Ray Serve

The ML team gets velocity. The infra team gets predictable scale.

01

Built for traffic spikes

Queue-aware autoscaling that adds replicas on real backpressure - not on flat CPU charts. Scale to zero when traffic drops so your GPU bill follows usage.
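
A minimal sketch of what that looks like as a Ray Serve autoscaling config - the class and thresholds are illustrative, and the exact config keys shift slightly between Ray versions:

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 0,             # scale to zero when traffic drops
        "max_replicas": 8,
        "target_ongoing_requests": 4,  # scale on in-flight requests (backpressure), not CPU
        "upscale_delay_s": 10,         # react quickly to bursts
        "downscale_delay_s": 300,      # drain slowly to avoid flapping
    },
)
class ChatModel:
    def __init__(self):
        # load weights here (PyTorch, vLLM, ...)
        pass

    async def __call__(self, http_request) -> str:
        return "ok"  # stand-in for the real generation call

app = ChatModel.bind()
# serve.run(app)  # local smoke test
```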

02

Multi-model, multi-tenant by design

Run dozens of models on the same cluster with fractional GPU allocation, per-route weighting, and isolation that prevents one tenant from starving another.
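
Fractional allocation is a per-deployment declaration. A sketch with placeholder model classes - note that Ray's GPU fractions are a scheduling hint, so hard memory isolation still comes from MIG or per-model memory caps:

```python
from ray import serve

# Two small models packed onto the same GPU; Ray places replicas by the declared fraction.
@serve.deployment(ray_actor_options={"num_gpus": 0.25}, num_replicas=2)
class Reranker:
    async def __call__(self, http_request) -> str:
        return "rerank result"      # stand-in for the real model

@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Embedder:
    async def __call__(self, http_request) -> list:
        return [0.0] * 768          # stand-in for the real embedding
```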

03

Python-native, no rewrite

Your research code becomes the production code. Ray Serve wraps existing PyTorch and vLLM logic in deployments - no FastAPI plumbing, no protobuf, no Triton model repository to maintain.
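
As a sketch, assuming a `generate()` function you already call from research code - the wrapper is a decorated class and a `bind()`:

```python
from ray import serve

def generate(prompt: str) -> str:
    # Existing PyTorch / vLLM inference code, unchanged.
    return prompt.upper()

@serve.deployment
class Generator:
    async def __call__(self, http_request) -> str:
        payload = await http_request.json()  # Ray Serve hands you a Starlette request
        return generate(payload["prompt"])

app = Generator.bind()
# serve.run(app)  # then POST {"prompt": "..."} to the route
```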

// Case Study

Production LLM Processing at Surfer Scale

We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.

  • 300B+ tokens processed
  • 100k+ credits sold in 6 months
  • 5 months from concept to full product release

Read the case study

// What we deliver

Production LLM serving, end to end.

From cluster design to on-call runbooks. We build it to be operated by your team - not by ours forever.

Cluster architecture

Deployment graphs, replica groups, and fractional GPU scheduling sized to your traffic shape - not someone's reference architecture.

  • Multi-node clusters on EKS, GKE, on-prem K8s, or KubeRay
  • Fractional GPU allocation for small models, MIG slices on H100
  • Deployment graphs for RAG, ensembles, and multi-step pipelines
  • Spot/on-demand mix with safe drain on preemption
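
A sketch of the deployment-graph pattern from the list above, with stub retriever and generator stages standing in for the real components:

```python
from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment
class Retriever:
    async def fetch(self, query: str) -> str:
        return f"context for {query}"      # stand-in for a vector-store lookup

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Generator:
    async def complete(self, prompt: str) -> str:
        return f"answer({prompt})"         # stand-in for the actual LLM call

@serve.deployment
class RAGPipeline:
    def __init__(self, retriever: DeploymentHandle, generator: DeploymentHandle):
        self.retriever = retriever
        self.generator = generator

    async def __call__(self, http_request) -> str:
        query = (await http_request.json())["query"]
        context = await self.retriever.fetch.remote(query)
        return await self.generator.complete.remote(f"{context}\n{query}")

app = RAGPipeline.bind(Retriever.bind(), Generator.bind())
# serve.run(app)
```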

Inference optimization

We pick the right backend (vLLM, TGI, TensorRT-LLM) for your model and traffic, then tune the knobs that actually move p99.

  • Continuous batching and paged attention via vLLM
  • KV cache reuse across turns and prefix sharing
  • Speculative decoding and assisted generation where it pays off
  • Quantization (AWQ, GPTQ, FP8) with quality regression checks
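
For illustration, the engine-level knobs we usually start from - the model id is a placeholder (the quantization flag has to match the checkpoint format) and argument names move between vLLM releases:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-awq-checkpoint",  # placeholder HF-format model id
    quantization="awq",                    # or "gptq" / "fp8", gated on eval regressions
    gpu_memory_utilization=0.90,           # headroom for the paged-attention KV cache
    max_num_seqs=256,                      # continuous-batching concurrency ceiling
    enable_prefix_caching=True,            # reuse shared prompt prefixes across requests
)

out = llm.generate(
    ["Summarize Ray Serve in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.2),
)
print(out[0].outputs[0].text)
```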

Autoscaling that respects cost

Most autoscalers either over-provision GPUs or melt under burst. We tune for both directions.

  • Queue-length and TTFT-based scaling signals
  • Cold-start mitigation via warm pools and model preloading
  • Per-deployment min/max replicas and cooldowns
  • Cost dashboards down to $ per million tokens
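
The cost math itself is simple; the work is wiring real throughput into it. A back-of-the-envelope helper with illustrative numbers:

```python
def dollars_per_million_tokens(gpu_hourly_usd: float,
                               num_gpus: int,
                               tokens_per_second: float) -> float:
    """Rough serving cost: GPU spend per hour divided by tokens served per hour."""
    tokens_per_hour = tokens_per_second * 3600
    cost_per_hour = gpu_hourly_usd * num_gpus
    return cost_per_hour / tokens_per_hour * 1_000_000

# Example: 2 GPUs at $2.50/hr sustaining 2,500 output tokens/s
# -> 5.00 / 9,000,000 * 1e6 ≈ $0.56 per million tokens
print(round(dollars_per_million_tokens(2.50, 2, 2500.0), 2))
```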

Routing and traffic control

The boring infrastructure that makes safe rollouts possible - canary, shadow, A/B, kill switches.

  • Canary and shadow deploys for new models or prompts
  • Per-tenant rate limiting and priority queues
  • Streaming responses with SSE and token-level cancellation
  • Sticky sessions for stateful agent workloads
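
A minimal canary-router sketch built on Ray Serve deployment handles - the model classes and the 5% split are placeholders:

```python
import random

from ray import serve
from ray.serve.handle import DeploymentHandle

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Model:
    def __init__(self, tag: str):
        self.tag = tag                      # stand-in for loading a specific checkpoint

    async def generate(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"

@serve.deployment
class CanaryRouter:
    def __init__(self, stable: DeploymentHandle, canary: DeploymentHandle,
                 canary_weight: float = 0.05):
        self.stable, self.canary = stable, canary
        self.canary_weight = canary_weight  # set to 0.0 and redeploy as the kill switch

    async def __call__(self, http_request) -> str:
        prompt = (await http_request.json())["prompt"]
        handle = self.canary if random.random() < self.canary_weight else self.stable
        return await handle.generate.remote(prompt)

app = CanaryRouter.bind(Model.bind("stable"), Model.bind("candidate"))
# serve.run(app)
```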

Observability built in

If you can't see it, you can't scale it. We wire metrics, traces, and evals into every deployment from day one.

  • Token, latency, and cost tracking with Prometheus + Grafana
  • Per-route eval suites tied to deploy gates
  • Distributed traces across deployment graphs
  • Alerting on regressions, drift, and quota exhaustion
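
Custom counters and histograms ride on Ray's built-in Prometheus export. A sketch using `ray.util.metrics`, with stub values where the real model call would go:

```python
import time

from ray import serve
from ray.util.metrics import Counter, Histogram

@serve.deployment
class InstrumentedModel:
    def __init__(self):
        # Exported through Ray's metrics endpoint, scraped by Prometheus, graphed in Grafana.
        self.tokens = Counter("llm_output_tokens",
                              description="Output tokens served",
                              tag_keys=("route",))
        self.ttft = Histogram("llm_ttft_seconds",
                              description="Time to first token",
                              boundaries=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0],
                              tag_keys=("route",))

    async def __call__(self, http_request) -> str:
        start = time.perf_counter()
        text = "stub completion"  # stand-in for the real generation call
        self.ttft.observe(time.perf_counter() - start, tags={"route": "chat"})
        self.tokens.inc(len(text.split()), tags={"route": "chat"})
        return text
```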

Production hardening

Production-grade reliability whether you run in your cloud, in ours, or air-gapped on your customer's metal.

  • Air-gapped install runbooks and offline model registry
  • Graceful degradation under partial GPU failure
  • Runtime upgrades with zero-downtime model swaps
  • Audit logging suitable for regulated environments

// Expert insight

Most teams reach for Ray Serve and stop at the basic deployment. The leverage is in the deployment graphs - chaining a router, a base model, and a dozen LoRA adapters in a single Python file you can actually reason about.

Karol Gawron

Head of R&D @ bards.ai

See our open-source work

// Why bards.ai

Scientists who ship. Operators who don't leave.

We bridge the gap between research-grade ML and production ops - because most teams have one or the other, not both.

1B+ tokens/day in production

We've operated LLM platforms at the scale of a top-10 SEO product - not just spun up a demo.

Open-weight specialists

Llama, Qwen, Mistral, DeepSeek - we know which checkpoint to start from and which knobs to tune for your workload.

Ray + vLLM + Triton

Multi-backend by default. We pick the right one per route, not the one we read about last week.

16+ open-source models

Hugging Face contributors with 80K+ monthly downloads. We ship models, not just slides about models.

On-prem & air-gapped capable

Deployed in environments that block outbound traffic and require signed-off bills of materials.

Senior team, no juniors

Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your project.

// FAQ

Common questions about Ray Serve deployments

Why do we need Ray Serve if we already run vLLM?

vLLM is a fantastic engine. Ray Serve is the layer above it - autoscaling, multi-model routing, deployment graphs, and the ops you'd otherwise hand-build with FastAPI, k8s HPA, and a lot of glue. We use both: vLLM as the inference backend, Ray Serve as the orchestration layer.

Can you deploy on our existing Kubernetes cluster?

Yes. KubeRay runs on any conformant Kubernetes (EKS, GKE, AKS, OpenShift, on-prem). We integrate with your existing ingress, service mesh, secrets store, and observability stack rather than asking you to swap them out.

How do you keep GPU costs under control when traffic is spiky?

We scale on backpressure (queue length, time-to-first-token), not CPU. Combined with warm pools, model preloading, and per-deployment min/max replicas, the cluster matches traffic in seconds while staying near zero off-hours. Cost dashboards translate scaling decisions into $ per million tokens so you can see the tradeoffs.

Can you serve our fine-tuned models?

Most of our deployments use customer-specific fine-tunes - LoRA adapters, full SFT, DPO. Ray Serve makes this cleaner via deployment graphs: a base model deployment with multiple LoRA adapters routed by request metadata.

How long does a first deployment take?

First production deployment ranges from 3 to 8 weeks depending on scope. A simple single-model rollout with autoscaling and observability lands in around 3 weeks. Multi-model platforms with custom routing, evals, and on-prem compliance work take longer. We work in weekly increments with a working system at the end of each.

Do you work in air-gapped or regulated environments?

Yes - we've deployed in air-gapped environments for regulated and defense-adjacent customers. We provide signed install bundles, an offline model registry, and runbooks that don't assume outbound network access.

Who operates the platform after launch?

Two options. Either we hand off with documentation, runbooks, and a few weeks of paired on-call with your team - or we stay on as the team that operates the platform. Most customers start with the second and graduate to the first.

// Let's ship it

Ship LLM inference that scales - without cost surprises.

Tell us your traffic shape, your model, and your latency budget. We'll come back with a deployment plan and a number, usually within a business day.

Karol Gawron

Head of R&D @ bards.ai