// Production / Inference Optimization
Production deployment on Triton Inference Server.
Triton clusters tuned for heterogeneous workloads - TensorRT, ONNX, PyTorch, and Python backends behind a single gRPC endpoint. We pick Triton when it actually wins, and we tune it like we mean it.
// Why Triton
One server for every framework. Real batching. Real metrics.
01
Multi-framework, one endpoint
TensorRT for the LLM, ONNX for the embedder, PyTorch for the reranker, Python backend for the glue - all served from the same model repository, all reachable over gRPC or HTTP.
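As a sketch of what that looks like from the calling side - assuming placeholder model and tensor names, which in practice come from each model's config.pbtxt:

```python
# Two models, two backends, one gRPC endpoint.
# Model and tensor names here are illustrative placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.internal:8001")

# Embedder (ONNX backend): send token IDs, read back a dense vector.
ids = grpcclient.InferInput("input_ids", [1, 128], "INT64")
ids.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))
emb = client.infer(model_name="embedder", inputs=[ids])
vector = emb.as_numpy("embedding")

# LLM (TensorRT-LLM backend): same client, same endpoint, different model name.
prompt = grpcclient.InferInput("text_input", [1, 1], "BYTES")
prompt.set_data_from_numpy(np.array([[b"What does the batcher do?"]], dtype=object))
max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[64]], dtype=np.int32))
gen = client.infer(model_name="llm", inputs=[prompt, max_tokens])
answer = gen.as_numpy("text_output")
```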
02
Dynamic batching that actually pays off
Triton's batcher coalesces concurrent requests at the framework level, not in your application code. Tuned right, it doubles GPU utilization without touching the model.
03
Ensembles and BLS without glue code
Wire preprocessing, model, and postprocessing into a single ensemble - or use Business Logic Scripting for branching pipelines. No FastAPI sidecar, no extra hop, no extra latency.
// What we deliver
Triton, deployed and tuned to your traffic.
From model repository design to on-call runbooks. We deliver something your team can operate after we leave.
Model repository design
The model_repository layout decides how much pain you have for the next two years. We design it for safe rollouts, version pinning, and clean A/B.
- Version directories with policy-based loading
- Shared backends across models to keep memory bounded
- Repository layout for canary, shadow, and stable channels
- S3, GCS, or local filesystem with hot reload
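A sketch of the kind of rollout gate we check in next to the repository - it polls the server until every pinned model version reports ready. Model names and versions are placeholders:

```python
# Gate a rollout on Triton actually having loaded what the repository promises.
# The model list and pinned versions below are illustrative placeholders.
import sys
import time
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

REQUIRED = {"embedder": "2", "reranker": "1", "llm": "3"}  # model -> pinned version

client = grpcclient.InferenceServerClient(url="triton.internal:8001")

def ready(name, version):
    try:
        return client.is_model_ready(name, model_version=version)
    except InferenceServerException:
        return False  # not known to the server yet counts as not ready

deadline = time.time() + 300  # five-minute budget for engine load and warmup
while time.time() < deadline:
    missing = [f"{m}:{v}" for m, v in REQUIRED.items() if not ready(m, v)]
    if not missing:
        print("all pinned model versions are live")
        sys.exit(0)
    time.sleep(5)

print(f"timed out waiting for: {missing}")
sys.exit(1)
```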
Backend selection and tuning
TensorRT-LLM for big LLMs, ONNX for cross-framework models, PyTorch for fast iteration, Python backend for everything weird. We pick per model, not per project.
- TensorRT-LLM engines with INT8/FP8 quantization
- ONNX Runtime with CUDA, TensorRT, or OpenVINO providers
- Python backend for custom logic and HF transformers
- vLLM backend integration for OSS LLMs where it wins
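For the Python backend specifically, the skeleton is small. A minimal, hedged model.py - tensor names are placeholders and must match the model's config.pbtxt:

```python
# models/glue/1/model.py - minimal Triton Python backend skeleton.
# Input/output tensor names are placeholders; they must match config.pbtxt.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] carries the JSON-serialized config.pbtxt if needed.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            # Placeholder logic: length of each input string as a float feature.
            lengths = np.array([[len(t)] for t in text.reshape(-1)], dtype=np.float32)
            out = pb_utils.Tensor("FEATURES", lengths)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```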
Dynamic batching and instance groups
The two knobs nobody tunes correctly. We profile your real traffic shape and configure preferred batch sizes, queue delays, and instance counts that match it.
- Preferred batch sizes derived from latency budget
- Queue delay tuning per route, not per cluster
- Instance groups across GPUs with MIG slicing on H100
- Model warmup to eliminate cold first-request spikes
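A short sketch of how we verify those knobs against a running server instead of trusting the repo - reading the loaded config back over gRPC (the model name is a placeholder):

```python
# Read back the batching and instance-group settings Triton is actually running with.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.internal:8001")
cfg = client.get_model_config("embedder").config  # model name is a placeholder

db = cfg.dynamic_batching
print("preferred_batch_size:", list(db.preferred_batch_size))
print("max_queue_delay_us:  ", db.max_queue_delay_microseconds)

for group in cfg.instance_group:
    print("instances:", group.count, "on gpus:", list(group.gpus))
```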
Ensembles and BLS pipelines
Multi-step inference without an external orchestrator. Tokenize, embed, rerank, and decode inside a single Triton call.
- Ensemble scheduling for linear pipelines
- Business Logic Scripting for branching and loops
- Zero-copy tensor passing between steps
- Streaming responses via gRPC for token-level output
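Inside the Python backend, BLS branching looks roughly like this - model and tensor names are placeholders:

```python
# Inside a Python backend's execute(): call another model on the same server (BLS).
# Model and tensor names are placeholders for whatever the pipeline defines.
import triton_python_backend_utils as pb_utils

def rerank_if_needed(query_tensor, candidate_tensor, num_candidates):
    # Branching an ensemble can't express: skip the reranker for tiny result sets.
    if num_candidates < 2:
        return candidate_tensor

    request = pb_utils.InferenceRequest(
        model_name="reranker",
        requested_output_names=["SCORES"],
        inputs=[query_tensor, candidate_tensor],
    )
    response = request.exec()
    if response.has_error():
        raise pb_utils.TritonModelException(response.error().message())
    return pb_utils.get_output_tensor_by_name(response, "SCORES")
```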
Observability and profiling
Triton's metrics endpoint is rich, and almost nobody scrapes it correctly. We wire it into your stack and add the traces it doesn't ship with.
- Prometheus scrape of per-model latency, queue, and batch stats
- NVTX traces for kernel-level bottleneck hunting
- perf_analyzer harnesses checked into your repo
- GPU utilization and memory dashboards per model
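For illustration, the queue and batch counters we alert on come straight off Triton's metrics port (8002 by default); a minimal pull looks like this:

```python
# Pull Triton's Prometheus metrics and surface the queue/batch counters we alert on.
import urllib.request

METRICS_URL = "http://triton.internal:8002/metrics"  # Triton's default metrics port
INTERESTING = (
    "nv_inference_request_success",    # completed requests per model
    "nv_inference_count",              # inferences executed (post-batching)
    "nv_inference_exec_count",         # model executions; count/exec ratio ~= batch size
    "nv_inference_queue_duration_us",  # cumulative time spent waiting in the batcher
)

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

for line in body.splitlines():
    if line.startswith(INTERESTING):
        print(line)
```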
Kubernetes and on-prem rollout
Triton is happiest behind a real ingress, with proper health probes, GPU node pools, and rolling upgrades. We handle the boring parts.
- Helm charts for EKS, GKE, OpenShift, or vanilla K8s
- GPU operator integration with NVIDIA's enterprise stack
- Canary and shadow channels via repository policies
- Air-gapped install bundles with offline model registry
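One of the boring parts, spelled out: readiness should assert that the models are loaded, not just that the process is up. Triton exposes /v2/health/ready and /v2/models/<name>/ready over HTTP; a hedged exec-probe sketch with placeholder model names:

```python
#!/usr/bin/env python3
# Readiness probe: healthy only when the server AND the models we serve are ready.
# Model names are placeholders; wire the real list in via env or config.
import sys
import urllib.request

BASE = "http://localhost:8000"  # Triton's default HTTP port inside the pod
MODELS = ["embedder", "reranker", "llm"]

def ok(path):
    try:
        with urllib.request.urlopen(BASE + path, timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

ready = ok("/v2/health/ready") and all(ok(f"/v2/models/{m}/ready") for m in MODELS)
sys.exit(0 if ready else 1)
```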
// Expert insight
“Teams adopt Triton for the LLM and miss the actual win - running the embedder, reranker, and a CV model on the same server. One model_repository, one gRPC endpoint, one set of metrics. The savings show up on the bill before they show up in the latency chart.”
Karol Gawron
Head of R&D @ bards.ai
// Why bards.ai
Scientists who ship. Operators who don't leave.
We bridge research-grade ML and production ops - most teams have one or the other, not both.
1B+ tokens/day in production
We've operated inference platforms at the scale of a top-10 SEO product - not just spun up a demo on a single A100.
TensorRT, ONNX, vLLM, Triton
Multi-backend by default. We pick the right one per model and route, not the one we read about last week.
Heterogeneous workloads
LLMs, embedders, rerankers, CV models, and Python glue served from the same cluster - without a separate stack per family.
16+ open-source models
Hugging Face contributors with 80K+ monthly downloads. We ship models, not just slides about models.
On-prem & air-gapped capable
Deployed in environments that block outbound traffic and require signed-off bills of materials.
Senior team, no juniors
Every engineer has shipped models to production at scale. We don't bring ramp-up time to your project.
// FAQ
Common questions about Triton deployments
Do we actually need Triton, or is vLLM enough?
If you only serve OSS LLMs and need throughput, vLLM behind Ray Serve is usually simpler. Triton wins when you have heterogeneous models - a CV pipeline, an embedder, a reranker, an LLM - and want one server, one ingress, one metrics endpoint. It also wins in regulated environments where NVIDIA's enterprise support is the deciding factor.
How does Triton relate to TensorRT-LLM?
TensorRT-LLM is a backend that runs inside Triton. You get TensorRT-LLM's continuous batching, paged KV cache, and FP8 kernels with Triton's model repository, ensembles, and metrics on top. For OSS LLMs at very high throughput, the vLLM backend is also a strong option inside Triton.
How do you deploy Triton on Kubernetes?
Helm charts with the NVIDIA GPU Operator, GPU node pools tagged by capability, and proper liveness and readiness probes that check model load - not just process up. Model repositories live in S3 or GCS with periodic poll, or in a PVC for air-gapped installs. We integrate with your existing ingress, service mesh, and observability stack.
How do canary releases and A/B tests work?
Triton's repository policies handle the loading side - multiple model versions live side by side and you can pin which versions are serving. The traffic split happens at the ingress or routing layer. We typically wire this through the gateway so that canary, shadow, and stable channels are addressable independently.
Won't dynamic batching hurt our latency?
Only if you tune it wrong. The right preferred_batch_size and max_queue_delay come from your real traffic - not the defaults. We profile with perf_analyzer at expected concurrency, then set the knobs so p99 stays inside your latency budget while throughput rises 2-4x.
Can you deploy on-prem or fully air-gapped?
Yes - we've deployed Triton in air-gapped environments for regulated and defense-adjacent customers. Signed install bundles, offline model registry, runbooks that don't assume outbound network access, and integration with NVIDIA's enterprise support contract where customers have one.
How long does a first deployment take?
First production deployment ranges from 3 to 8 weeks. A single-model rollout with dynamic batching and observability lands in around 3 weeks. Multi-model platforms with ensembles, custom backends, and on-prem compliance work take longer. We work in weekly increments with a working system at the end of each.
// Let's ship it
Ship Triton inference that earns its place on the bill.
Tell us your model mix, your traffic shape, and your latency budget. We'll come back with a deployment plan and a number, usually within a business day.
Karol Gawron
Head of R&D @ bards.ai