// Research / Computer Vision

Object detection that holds up on your data.

Public datasets won't cut it for industry-specific objects, unusual viewpoints, occlusion, or whatever else your environment throws at a detector. We build custom detectors and segmenters - labeled with foundation-model loops, trained for your latency budget, and shipped to the GPU or edge device that actually runs in production.


// What we see

Generic models pass the demo. Production has a different bar.

The long tail eats the accuracy

Pre-trained YOLO works on the 80% of scenes that look like COCO. Production traffic is mostly the other 20% - weird angles, occlusion, your specific lighting, your specific objects. The headline mAP looks fine; customer complaints don't.

Labels are 80% of the project

Most teams pick the detector first and the labeling pipeline last. That's backwards. The architecture you can swap in a week. The 50,000 labeled images you cannot. Teams that under-budget annotation discover this at week six, not week one.

The model fits the laptop, not the device

It hits 0.91 mAP in your notebook on an A100. Then it has to run on a Jetson Orin at 30 FPS, or on a CPU-only on-prem server that processes 200 camera feeds. The deployment target is where most CV projects actually stall.

// Case Study

Text-search across 200 live city camera feeds

Municipal operators type a description and the system surfaces matching events from across the city's live CCTV network. We built it for Neural; the City of Oława's Straż Miejska (municipal guard) runs it on-prem. 200 cameras per server; review time on a typical incident dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.

200 live cameras per on-prem server
~88% less time per incident review
~33K residents covered (Oława)



// What we do

Three things that decide whether the detector ships.

Most production CV problems aren't solved by a better architecture. They're solved by the labeling loop, the right hardware target picked early, and an eval that sees the slices mAP hides.

Foundation-model labeling loops

Grounding DINO, SAM 2, and CLIP-based weak supervision turn a few hundred annotated images into tens of thousands of high-quality pseudo labels. Humans verify rather than label from scratch - 3-5x faster, higher inter-annotator agreement, and the loop tightens as the model improves.
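
One way the verify-rather-than-label step could be wired up, as a minimal Python sketch: pseudo labels are routed by the proposal model's confidence so annotators spend their time where the model is unsure. The `PseudoLabel` record, the `triage` function, and both thresholds are illustrative assumptions, not the pipeline described above. Real proposals would come from a model such as Grounding DINO or SAM 2, which this sketch does not call.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabel:
    """Hypothetical record for one machine-proposed box."""
    image_id: str
    class_name: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float  # confidence reported by the proposal model


def triage(labels, accept_at=0.80, discard_below=0.30):
    """Split pseudo labels into three queues by confidence.

    confirm: high confidence, a human only needs a quick yes/no
    review:  mid confidence, a human corrects the box or class
    discard: low confidence, dropped before it reaches an annotator
    The default thresholds are assumptions and should be tuned per dataset.
    """
    if not 0.0 <= discard_below <= accept_at <= 1.0:
        raise ValueError("need 0 <= discard_below <= accept_at <= 1")
    queues = {"confirm": [], "review": [], "discard": []}
    for label in labels:
        if label.score >= accept_at:
            queues["confirm"].append(label)
        elif label.score >= discard_below:
            queues["review"].append(label)
        else:
            queues["discard"].append(label)
    return queues


if __name__ == "__main__":
    proposals = [
        PseudoLabel("frame_001.jpg", "forklift", (10.0, 20.0, 110.0, 220.0), 0.92),
        PseudoLabel("frame_002.jpg", "forklift", (5.0, 5.0, 60.0, 90.0), 0.55),
        PseudoLabel("frame_003.jpg", "pallet", (0.0, 0.0, 30.0, 30.0), 0.12),
    ]
    for name, items in triage(proposals).items():
        print(name, [p.image_id for p in items])
```

As the detector improves between rounds, more proposals would land in the quick-confirm queue, which is one plausible reading of the loop tightening over time.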

Architecture chosen for your hardware

We profile the candidate models on your target device before final selection - YOLOv11 for 60 FPS edge, RT-DETR for accuracy on big GPUs, EfficientDet when memory is tight. The detector is picked after the deployment constraints are known, not before.
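
A rough Python sketch of what "profile on the target device, then pick" can look like. The `benchmark` and `pick_model` helpers, the candidate names, and every number in the example are hypothetical. A real run would wrap the exported model's inference call and execute on the actual device.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(infer: Callable[[], object], warmup: int = 5, runs: int = 50) -> dict:
    """Time a zero-argument inference callable.

    The callable must block until the result is ready (on a GPU that means
    synchronizing inside it), otherwise the timings are meaningless.
    """
    for _ in range(warmup):  # warm caches and lazy initialization first
        infer()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()
    p95_index = min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))
    mean_ms = statistics.fmean(latencies_ms)
    fps = 1000.0 / mean_ms if mean_ms > 0 else float("inf")
    return {"fps": fps, "p95_ms": latencies_ms[p95_index]}


def pick_model(candidates: dict, min_fps: float) -> Optional[str]:
    """Return the most accurate candidate that meets the frame-rate budget.

    candidates maps a name to {"fps": measured_fps, "map": validation_mAP}.
    Returns None when nothing meets the budget.
    """
    eligible = {name: c for name, c in candidates.items() if c["fps"] >= min_fps}
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["map"])


if __name__ == "__main__":
    # Stand-in workload; replace with a real inference call on the device.
    print(benchmark(lambda: sum(range(10_000))))
    # Made-up measurements, for illustration only.
    measured = {
        "candidate_a": {"fps": 70.0, "map": 0.80},
        "candidate_b": {"fps": 25.0, "map": 0.90},
        "candidate_c": {"fps": 65.0, "map": 0.84},
    }
    print(pick_model(measured, min_fps=60.0))
```

The ordering matters: the frame-rate budget filters first and accuracy ranks second, so a model that wins on accuracy but misses the budget never gets selected.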

Eval beyond a single mAP number

A single mAP hides every failure that matters. We slice by class, scene type, occlusion, small-object size, low-light - and add a hard-negatives suite from your real production failures. The metric on your dashboard reflects what breaks, not what averages out.
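
To make the slicing idea concrete, here is a small Python sketch. It is an assumption-laden simplification: it reports precision and recall at a fixed confidence threshold rather than mAP, and it expects each record to already carry a match outcome (`tp`, `fp`, or `fn`) plus slice tags. The `sliced_metrics` function and the record layout are hypothetical.

```python
from collections import defaultdict


def sliced_metrics(records, slice_key):
    """Group match outcomes by one slice tag and report precision/recall.

    records: iterable of dicts with an "outcome" in {"tp", "fp", "fn"}
             and a value under slice_key (e.g. "lighting").
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for record in records:
        counts[record[slice_key]][record["outcome"]] += 1

    report = {}
    for value, c in counts.items():
        predicted = c["tp"] + c["fp"]  # boxes the model emitted
        actual = c["tp"] + c["fn"]     # ground-truth objects in the slice
        report[value] = {
            "precision": c["tp"] / predicted if predicted else None,
            "recall": c["tp"] / actual if actual else None,
            "support": actual,
        }
    return report


if __name__ == "__main__":
    # Toy data: the aggregate looks acceptable, the low-light slice does not.
    records = (
        [{"lighting": "daylight", "outcome": "tp"}] * 9
        + [{"lighting": "daylight", "outcome": "fn"}] * 1
        + [{"lighting": "low_light", "outcome": "tp"}] * 2
        + [{"lighting": "low_light", "outcome": "fn"}] * 3
        + [{"lighting": "low_light", "outcome": "fp"}] * 1
    )
    for slice_value, metrics in sliced_metrics(records, "lighting").items():
        print(slice_value, metrics)
```

In the toy data, overall recall is 11 of 15 objects, while the low-light slice finds only 2 of 5. That gap is the kind of failure a single averaged number hides.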

// Method fit

Custom detection isn't the right tool for every CV problem.

skip it if

A cloud API covers your objects
If your detection target is common objects from common viewpoints (people, cars, faces, brand logos), Amazon Rekognition, Google Cloud Vertex AI Vision, or Azure AI Vision will usually hit your accuracy bar with no training. Custom is for the long tail those APIs miss.
Your real problem is OCR or document layout
Reading text, extracting tables, parsing forms - those are document-AI problems with their own toolchain (LayoutLM, Donut, OCR engines). A general object detector is the wrong instrument.
You only need a label, not a box
If 'is there a forklift in this frame' is enough and you don't need to know where, a fine-tuned classifier needs 5-10x less labeling than a detector. Skip the boxes until you actually need spatial information.

use it if

Custom object detection fits when you have proprietary visual data, unusual viewpoints or environments, latency or hardware constraints that rule out cloud APIs, or accuracy requirements on industry-specific classes that public datasets don't cover.

// How we work

Hardware audit first. Iterate in the open. Hand off the retraining loop.

Every CV engagement starts with the constraints that are expensive to undo - target hardware, latency budget, label availability. The model gets picked after those are known, not before.

Data and hardware audit (week one)

We sit with your team, profile sample data, benchmark candidate detectors on your actual target device, and write down the labeling budget. The output is a design your engineers approve before any training run.

Iterate in a shared workspace

Every training run lands in a Weights & Biases or MLflow workspace your engineers can see. You watch the loss curves, the per-class precision, and the failure-case montages - in real time, not in a Friday demo.

Hand off the retraining loop

We hand off the model export (TensorRT, ONNX, OpenVINO), the eval suite, the labeling pipeline, and a runbook for retraining when the data drifts. We stay on Slack for 30 days after delivery for the questions that come up after we leave.

// Expert insight

“Most teams pick the detector first and the labeling pipeline last. That's backwards. The architecture you can swap in a week. The 50,000 labeled images you can't. Foundation models changed what's possible for a labeling budget - the team that adopts that loop early ships in half the time.”

Norbert Ropiak

Co-founder @ bards.ai


// Why bards.ai

Why us, instead of two senior CV engineers you'd hire.

You could hire the team. It would take a year, and they'd learn your hardware constraints at your expense. We've already learned them - on production engagements at Comcast, Oława's municipal CCTV network, and others.

Production CV deployments at scale

Comcast's UI element detector runs daily on millions of VOD screenshots. Neural's video-search system runs on-prem in Oława across 200 live CCTV feeds. Both shipped, both still running.

Edge and on-prem deployment, not just GPU

TensorRT, ONNX Runtime, OpenVINO, Triton, Jetson Orin, and CPU-only on-prem boxes for environments where data can't leave. We benchmark on your hardware before we commit to a frame rate.

Foundation-model labeling loops as default

Grounding DINO and SAM 2 annotation pipelines on every engagement - not an upgrade tier. We've run them across industrial, surveillance, retail, and medical datasets. The 3-5x labeling speedup is real and repeatable.

10+ peer-reviewed CV and ML publications

CLARIN-PL spinoff. We've published on detection, segmentation, and representation learning. The eval discipline and per-class slicing in every test harness come from the same rigor we'd apply to a submission.

16+ open-source models on Hugging Face

80K+ monthly downloads across NLP and CV. We publish the models we build. Proof of engineering before you sign anything.

Senior engineers only, no juniors

Every person on your engagement has shipped CV models to paying customers. No ramp-up tax, no learning the labeling-loop story on your dollar.

// FAQ

Common questions about custom object detection

With foundation-model bootstrapping (Grounding DINO + SAM 2 pseudo labels, human verification), useful detectors start at 200-500 verified samples. Production-grade accuracy on most tasks lands in the 2-5K range. Without bootstrapping, multiply by 5-10x. The first thing we measure is your label budget; the architecture comes after.

Yes. We profile the candidate models on the target device in week one - before final architecture selection - so we don't over-train a model that won't fit. Jetson AGX Orin runs YOLOv11s at 30-60 FPS with INT8 quantization. CPU-only servers run smaller variants or batched inference for offline pipelines. We benchmark before we commit.

Cloud APIs are great when your objects look like the COCO/ImageNet distribution. They fall over on industry-specific classes, unusual viewpoints, regulated data that can't leave the customer perimeter, and edge deployment without internet. Custom models also let you tune the precision/recall tradeoff per class and own the retraining loop.
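
As an illustration of per-class precision/recall tuning, the Python sketch below picks, for one class, the lowest confidence threshold that still meets a target precision, which keeps recall as high as that target allows. The `threshold_for_precision` helper and the example scores are hypothetical; the input is assumed to be validation detections already marked correct or incorrect.

```python
def threshold_for_precision(scored, target_precision):
    """Return the lowest confidence threshold whose precision meets the target.

    scored: iterable of (confidence, is_correct) pairs for one class.
    Detections with confidence >= the returned value are kept.
    Returns None if no threshold reaches the target.
    """
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    tp = fp = 0
    for i, (score, is_correct) in enumerate(ordered):
        if is_correct:
            tp += 1
        else:
            fp += 1
        # Only cut where the next score differs, so tied scores stay together.
        last_of_tie = i + 1 == len(ordered) or ordered[i + 1][0] != score
        if last_of_tie and tp / (tp + fp) >= target_precision:
            best = score  # precision is not monotonic, so keep scanning
    return best


if __name__ == "__main__":
    # Made-up validation detections per class, for illustration only.
    per_class = {
        "forklift": [(0.95, True), (0.90, True), (0.80, False),
                     (0.70, True), (0.60, True), (0.40, False)],
        "person": [(0.90, True), (0.80, False), (0.70, True)],
    }
    # Hypothetical policy: a stricter precision target for the person class.
    targets = {"forklift": 0.80, "person": 0.95}
    for name, detections in per_class.items():
        print(name, threshold_for_precision(detections, targets[name]))
```

With a hosted API the threshold is typically a single knob. Owning the model and its validation set is what makes a separate target per class practical.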

Engagements start at $40K. Most custom detection projects land between $40K and $120K depending on labeling scope, target hardware complexity, segmentation requirements, and whether the eval suite is greenfield. Fixed-fee proposal after the first scoping call - no time-and-materials surprise.

YOLOv11 is the default for edge and latency-constrained workloads - it runs at 30-120 FPS on Jetson Orin, L4, and T4 with INT8 quantization and covers most industrial, retail, and surveillance use cases without modification. RT-DETR is the pick when accuracy is the hard constraint and you have GPU headroom: it closes the gap on small objects, dense occlusion, and unusual aspect ratios, but runs 3-5x slower at similar parameter counts. EfficientDet and YOLO-NAS cover memory-limited deployments. We benchmark candidate architectures on your target device in week one before committing - frames-per-second at your accuracy threshold on your hardware matters more than COCO leaderboard rank.

Yes - streaming inference is the standard production case for surveillance, manufacturing quality control, and automated inspection. The pipeline handles RTSP and RTMP ingestion, GStreamer or OpenCV-based capture, and GPU batching across concurrent streams. A single L40S processes 4-16 streams at detection grade depending on resolution and frame rate. On-device Jetson Orin runs 1-4 streams for edge deployments where video can't leave the facility. We design the ingestion pipeline alongside the model so end-to-end latency from frame capture to detection result meets your SLA - typically 50-200ms for real-time alerting.
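
A back-of-the-envelope Python sketch for sizing concurrent streams per device. It assumes the measured batched throughput of the model is the only bottleneck, which ignores decode, pre-processing, and memory limits, so treat it as a rough upper bound. The `streams_per_device` helper, the headroom factor, and the example numbers are all hypothetical.

```python
import math


def streams_per_device(model_fps: float, stream_fps: float, headroom: float = 0.8) -> int:
    """Estimate how many streams one device can serve.

    model_fps:  measured frames per second the model sustains on the device
    stream_fps: frames per second analysed per camera stream
    headroom:   fraction of throughput to plan for, leaving slack for spikes
    """
    if model_fps <= 0 or stream_fps <= 0:
        raise ValueError("frame rates must be positive")
    if not 0.0 < headroom <= 1.0:
        raise ValueError("headroom must be in (0, 1]")
    return math.floor(model_fps * headroom / stream_fps)


if __name__ == "__main__":
    # Made-up measurement: 400 frames/s sustained, cameras analysed at 25 FPS.
    print(streams_per_device(model_fps=400.0, stream_fps=25.0))
    # Analysing every fifth frame (5 FPS per stream) stretches the same budget.
    print(streams_per_device(model_fps=400.0, stream_fps=5.0))
```

The second call shows why frame rate matters as much as resolution: sampling fewer frames per stream raises the stream count without touching the model.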

Typically 6-12 weeks. Weeks 1-2: hardware audit, sample data profiling, target-device benchmarks, labeling plan. Weeks 2-5: annotation pipeline with foundation-model bootstrapping, initial label batch. Weeks 5-8: training iterations, per-class and per-scene eval, hard-negative mining. Weeks 8-12: deployment packaging (TensorRT or ONNX or OpenVINO), inference server wiring, monitoring, and runbook handoff. Timeline is most affected by label volume and the number of novel failure modes that surface in the first training run - we scope this in week one with explicit milestones.

// Let's ship it

Send us a folder of images. We'll send back a plan.

Tell us about the objects, the environment, the target hardware, and the failure mode you can't fix with a cloud API. We'll come back with a labeling plan, an architecture pick, and a benchmark on your device - usually within a business day. Engagements from $40K, typically 6-12 weeks.

Book a meeting hello@bards.ai
