// Production / Inference Optimization
Scalable LLM deployment on Ray Serve.
Production Ray Serve clusters tuned for throughput, latency, and cost - built by engineers who've put 1B+ tokens/day in front of paying customers.
// Why Ray Serve
The ML team gets velocity. The infra team gets predictable scale.
01
Built for traffic spikes
Queue-aware autoscaling that adds replicas on real backpressure - not on flat CPU charts. Scale to zero when traffic drops so your GPU bill follows usage.
02
Multi-model, multi-tenant by design
Run dozens of models on the same cluster with fractional GPU allocation, per-route weighting, and isolation that prevents one tenant from starving another.
03
Python-native, no rewrite
Your research code becomes the production code. Ray Serve wraps existing PyTorch and vLLM logic in deployments - no FastAPI plumbing, no protobuf, no Triton model repository to maintain.
// Case Study
Production LLM Processing at Surfer Scale
We helped Surfer handle massive content generation workloads with a reliable, cost-optimized LLM pipeline built for scale.
300B+
tokens processed
100k+
credits sold in 6 months
5 months
from concept to full product release

// What we deliver
Production LLM serving, end to end.
From cluster design to on-call runbooks. We build it to be operated by your team - not by ours forever.
Cluster architecture
Deployment graphs, replica groups, and fractional GPU scheduling sized to your traffic shape - not someone's reference architecture.
- Multi-node clusters on EKS, GKE, on-prem K8s, or KubeRay
- Fractional GPU allocation for small models, MIG slices on H100
- Deployment graphs for RAG, ensembles, and multi-step pipelines
- Spot/on-demand mix with safe drain on preemption
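Fractional allocation is a one-line setting in the Serve config. An illustrative fragment (application and deployment names hypothetical) packing four small-model replicas onto a single GPU:

```yaml
# Illustrative Serve config fragment - four replicas share one GPU.
applications:
  - name: classifier
    import_path: app:classifier_app
    deployments:
      - name: Classifier
        num_replicas: 4
        ray_actor_options:
          num_gpus: 0.25   # fractional GPU per replica
```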
Inference optimization
We pick the right backend (vLLM, TGI, TensorRT-LLM) for your model and traffic, then tune the knobs that actually move p99.
- Continuous batching and paged attention via vLLM
- KV cache reuse across turns and prefix sharing
- Speculative decoding and assisted generation where it pays off
- Quantization (AWQ, GPTQ, FP8) with quality regression checks
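Continuous batching is the single biggest throughput lever in that list. A toy model of the scheduling idea - not vLLM's actual scheduler, just the core trick that finished sequences free their batch slot immediately instead of waiting for the longest request:

```python
# Toy model of continuous batching: finished sequences leave the batch
# each decode step, and queued requests immediately take their slots.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (name, tokens_to_generate).
    Returns (decode steps taken, batch occupancy per step)."""
    queue = deque(requests)
    active = {}          # name -> tokens remaining
    occupancy, steps = [], 0
    while queue or active:
        # Admit queued requests into any free slots before each step.
        while queue and len(active) < max_batch:
            name, tokens = queue.popleft()
            active[name] = tokens
        occupancy.append(len(active))
        steps += 1
        # One decode step: every active sequence emits one token.
        for name in list(active):
            active[name] -= 1
            if active[name] == 0:
                del active[name]     # slot freed immediately
    return steps, occupancy
```

With requests of 2, 5, 1, 3, and 2 tokens at batch size 2, this finishes in 7 steps; static batching (wait for the whole batch) would take 10, because short requests sit idle behind the 5-token one.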
Autoscaling that respects cost
Most autoscalers either over-provision GPUs or melt under burst. We tune for both directions.
- Queue-length and TTFT-based scaling signals
- Cold-start mitigation via warm pools and model preloading
- Per-deployment min/max replicas and cooldowns
- Cost dashboards down to $ per million tokens
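The policy behind those bullets fits in a few lines. An illustrative scaling decision (our own sketch, not Ray Serve's implementation): target a queue depth per replica, clamp to min/max, and refuse to shrink during the cooldown window:

```python
# Illustrative queue-aware scaling policy (not Ray Serve internals).
def desired_replicas(queue_len, replicas, *, target_per_replica=16,
                     min_replicas=0, max_replicas=8,
                     seconds_since_downscale=1e9, cooldown_s=300):
    # Ceiling division: enough replicas to hold queue depth at target.
    want = -(-queue_len // target_per_replica) if queue_len else min_replicas
    want = max(min_replicas, min(max_replicas, want))
    if want < replicas and seconds_since_downscale < cooldown_s:
        return replicas           # honor cooldown when shrinking
    return want
```

Scaling up is immediate; scaling to zero waits out the cooldown, which is what keeps the GPU bill tracking usage without thrashing on bursty traffic.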
Routing and traffic control
The boring infrastructure that makes safe rollouts possible - canary, shadow, A/B, kill switches.
- Canary and shadow deploys for new models or prompts
- Per-tenant rate limiting and priority queues
- Streaming responses with SSE and token-level cancellation
- Sticky sessions for stateful agent workloads
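The canary half of that list reduces to stable hashing. A sketch of the routing decision (route names hypothetical; in production this lives in a router deployment ahead of the models): hash the user ID so each user consistently lands on the same side of the split:

```python
# Sketch of weighted canary routing keyed on a stable hash, so a given
# user always sees the same model version.
import hashlib

def pick_route(user_id, canary_weight=0.05):
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "canary" if bucket < canary_weight else "stable"
```

Deterministic routing is also what makes the kill switch clean: set the weight to zero and every user snaps back to stable.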
Observability built in
If you can't see it, you can't scale it. We wire metrics, traces, and evals into every deployment from day one.
- Token, latency, and cost tracking with Prometheus + Grafana
- Per-route eval suites tied to deploy gates
- Distributed traces across deployment graphs
- Alerting on regressions, drift, and quota exhaustion
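Two of the metrics above are worth showing in full, because teams often get them subtly wrong - a back-of-envelope $/Mtok formula (illustrative; plug in your own GPU pricing) and a p99 that indexes into the sorted sample rather than averaging:

```python
# $ per million output tokens - the unit the cost dashboards report in.
def dollars_per_million_tokens(gpu_hours, hourly_rate, tokens_out):
    return gpu_hours * hourly_rate / tokens_out * 1_000_000

# p99 latency from raw samples (nearest-rank; no interpolation).
def p99(latencies_ms):
    s = sorted(latencies_ms)
    return s[min(len(s) - 1, int(0.99 * len(s)))]
```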
Production hardening
On-prem-grade reliability whether you run in our cloud, yours, or air-gapped on the customer's metal.
- Air-gapped install runbooks and offline model registry
- Graceful degradation under partial GPU failure
- Runtime upgrades with zero-downtime model swaps
- Audit logging suitable for regulated environments
// Expert insight
“Most teams reach for Ray Serve and stop at the basic deployment. The leverage is in the deployment graphs - chaining a router, a base model, and a dozen LoRA adapters in a single Python file you can actually reason about.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Scientists who ship. Operators who don't leave.
We bridge the gap between research-grade ML and production ops - because most teams have one or the other, not both.
1B+ tokens/day in production
We've operated LLM platforms at the scale of a top-10 SEO product - not just spun up a demo.
Open-weight specialists
Llama, Qwen, Mistral, DeepSeek - we know which checkpoint to start from and which knobs to tune for your workload.
Ray + vLLM + Triton
Multi-backend by default. We pick the right one per route, not the one we read about last week.
16+ open-source models
Hugging Face contributors with 80K+ monthly downloads. We ship models, not just slides about models.
On-prem & air-gapped capable
Deployed in environments that block outbound traffic and require signed-off bills of materials.
Senior team, no juniors
Every engineer has shipped LLMs to production at scale. We don't bring ramp-up time to your project.
// FAQ
Common questions about Ray Serve deployments
Why Ray Serve instead of just vLLM?
vLLM is a fantastic engine. Ray Serve is the layer above it - autoscaling, multi-model routing, deployment graphs, and ops that you'd otherwise hand-build with FastAPI, k8s HPA, and a lot of glue. We use both: vLLM as the inference backend, Ray Serve as the orchestration layer.
Can this run on our existing Kubernetes setup?
Yes. KubeRay runs on any conformant Kubernetes (EKS, GKE, AKS, OpenShift, on-prem). We integrate with your existing ingress, service mesh, secrets store, and observability stack rather than asking you to swap them out.
How do you keep GPU costs under control?
We scale on backpressure (queue length, time-to-first-token), not CPU. Combined with warm pools, model preloading, and per-deployment min/max replicas, the cluster matches traffic in seconds while staying near zero off-hours. Cost dashboards translate scaling decisions into $ per million tokens so you can see the tradeoffs.
Do you support fine-tuned models and LoRA adapters?
Most of our deployments use customer-specific fine-tunes - LoRA adapters, full SFT, DPO. Ray Serve makes this cleaner via deployment graphs: a base model deployment with multiple LoRA adapters routed by request metadata.
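The routing step in that graph is deliberately boring. A sketch with hypothetical tenant and adapter names - request metadata resolves to a LoRA adapter, with a fallback to the base model when no adapter matches:

```python
# Sketch: request metadata -> LoRA adapter, falling back to the base
# model. Tenant and adapter names are hypothetical.
ADAPTERS = {
    "acme-support": "lora/acme-v3",
    "beta-legal": "lora/legal-v1",
}

def resolve_adapter(metadata, default=None):
    """Return the adapter path for this request, or `default`
    (i.e. the plain base model) when no tenant-specific adapter exists."""
    return ADAPTERS.get(metadata.get("tenant", ""), default)
```

In a deployment graph this function sits in the router; the base-model deployment never needs to know how many adapters exist.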
How long does a deployment take?
First production deployment ranges from 3 to 8 weeks depending on scope. A simple single-model rollout with autoscaling and observability lands in around 3 weeks. Multi-model platforms with custom routing, evals, and on-prem compliance work take longer. We work in weekly increments with a working system at the end of each.
Can you deploy on-prem or fully air-gapped?
Yes - we've deployed in air-gapped environments for regulated and defense-adjacent customers. We provide signed install bundles, an offline model registry, and runbooks that don't assume outbound network access.
What happens after launch - do you hand off or stay on?
Two options. Either we hand off with documentation, runbooks, and a few weeks of paired on-call with your team - or we stay on as the team that operates the platform. Most customers start with the second and graduate to the first.
// Let's ship it
Ship LLM inference that scales - without the cost surprise.
Tell us your traffic shape, your model, and your latency budget. We'll come back with a deployment plan and a number, usually within a business day.
Karol Gawron
Head of R&D @ bards.ai