// AI Video Search

Text-search across 200 live city camera feeds

Municipal operators type a description and the system surfaces matching events from across the city's live CCTV network. We built it for Neural; the City of Oława's Straż Miejska runs it on-prem. 200 cameras per server; review time on a typical incident dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.

offices: Oława, Poland

size: ~33,000 residents

industry: Municipal public safety

revenue: -

// Outcomes

The numbers that matter

  • 200

    live cameras per on-prem server

  • ~88%

    less time per incident review

  • ~33K

    residents covered (Oława)

01 · Watching 200 live cameras isn't a human problem

The Challenge

A municipal CCTV network is dozens to hundreds of cameras streaming 24/7. When something happens - a break-in, a stolen bike, a missing person - operators have to scrub through hours of footage across the cameras nearest the incident, jumping between feeds and replaying segments at speed until the right frame turns up. Existing video-management software supports motion search and basic object filters, but not the way an operator actually thinks about an incident: in plain words.

The end-to-end review on a typical incident was ~8 hours: watch the footage, log the timestamps, cross-reference cameras, write up the report. The 30-second event the operator was actually looking for was buried inside that. The bottleneck was discovery, not response.

Two hard constraints shaped the build. Footage and indices couldn't leave the municipality's network - data residency and privacy posture for municipal CCTV rule out cloud-only solutions. And the system had to scale to the full camera count on commodity on-prem hardware, because that's what city budgets actually approve.

02 · Text-first search. Vision-language embeddings. On-prem.

Approach

Step 1: Continuous frame-level embedding.

Every camera feed is ingested in real time and sampled at a configurable rate. Each sampled frame is encoded into a vision-language embedding space (CLIP-class), tagged with camera ID, timestamp, and geolocation, and written to a searchable embedding store. The index grows continuously as long as the cameras are streaming.
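The ingest side can be sketched as an append-only embedding store keyed by camera metadata. This is a minimal illustration, not the production code: the `EmbeddingStore` class and a 512-dimensional embedding are assumptions, and a real deployment would encode frames with a CLIP-class model rather than receive raw vectors.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FrameRecord:
    camera_id: str
    timestamp: float        # epoch seconds
    geo: tuple              # (lat, lon) of the camera
    embedding: np.ndarray   # L2-normalized vision-language embedding

class EmbeddingStore:
    """Append-only store of frame embeddings; grows as cameras stream."""

    def __init__(self, dim: int = 512):
        self.dim = dim
        self.records: list[FrameRecord] = []

    def add_frame(self, camera_id, timestamp, geo, raw_embedding):
        # Normalize once at write time so cosine similarity at query
        # time reduces to a plain dot product.
        v = np.asarray(raw_embedding, dtype=np.float32)
        v = v / np.linalg.norm(v)
        self.records.append(FrameRecord(camera_id, timestamp, geo, v))

    def matrix(self) -> np.ndarray:
        # Stack all embeddings into an (N, dim) matrix for batched search.
        return np.stack([r.embedding for r in self.records])
```

Normalizing at write time is the design choice that matters here: it moves work out of the query path, where the operator is waiting.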

Step 2: Natural-language query at the operator's console.

An operator types a description - "red jacket near the riverside", "blue van at the rynek around 14:00", "man with a bicycle entering the alley". The query is encoded into the same embedding space and the system retrieves the top-K matching frames across all cameras, with timestamps, camera positions on the city map, and confidence scores. The operator clicks through to the matching clip in their existing dispatch console - no new tool to learn.
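The retrieval step above amounts to a nearest-neighbor search in the shared embedding space. A minimal sketch, assuming the frame embeddings are already L2-normalized and stacked into an `(N, dim)` matrix with a parallel metadata list (both names are illustrative):

```python
import numpy as np

def search(store_matrix, metadata, query_embedding, k=5):
    """Return the top-k frames by cosine similarity to the query.

    store_matrix: (N, dim) array of L2-normalized frame embeddings.
    metadata:     list of N frame descriptors (camera ID, timestamp, ...).
    query_embedding: the text query encoded into the same space.
    """
    q = np.asarray(query_embedding, dtype=np.float32)
    q = q / np.linalg.norm(q)
    # Rows are normalized, so the dot product IS the cosine similarity.
    scores = store_matrix @ q
    top = np.argsort(-scores)[:k]
    return [(metadata[i], float(scores[i])) for i in top]
```

At city scale the brute-force matrix product would give way to an approximate-nearest-neighbor index, but the contract is the same: one query vector in, top-K scored frames out.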

Step 3: Inference path optimized for 200 streams per box.

GPU-resident model with batched inference across cameras, frame-rate adaptive sampling so quiet feeds don't waste cycles, and an embedding store sized for a city's typical retention window. The result: a single mid-range on-prem server keeps up with 200 live camera streams in real time. The hardware footprint stays inside what a municipal IT budget can sign off on.
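Frame-rate adaptive sampling can be illustrated with a simple inter-frame activity heuristic: quiet feeds back off to a long sampling interval, busy feeds sample at the base rate. The function name, thresholds, and the mean-pixel-difference metric are all illustrative assumptions, not the deployed logic.

```python
import numpy as np

def next_sample_interval(prev_frame, curr_frame,
                         base_interval=1.0, max_interval=8.0,
                         activity_threshold=4.0):
    """Pick the next sampling interval (seconds) for one camera feed.

    Activity is the mean absolute pixel change between consecutive
    grayscale frames. Busy scenes keep the base interval; quiet scenes
    stretch toward max_interval so they don't waste GPU cycles.
    """
    activity = float(np.mean(np.abs(
        curr_frame.astype(np.int16) - prev_frame.astype(np.int16))))
    if activity >= activity_threshold:
        return base_interval  # busy feed: sample often
    # Scale the interval up linearly as activity drops toward zero.
    scale = 1.0 + (activity_threshold - activity) / activity_threshold \
                  * (max_interval / base_interval - 1.0)
    return min(base_interval * scale, max_interval)
```

Per-camera intervals like this are what let one GPU batch 200 streams: the scheduler only enqueues frames whose interval has elapsed, so the effective embedding load tracks how much is actually happening on the streets.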

Step 4: On-prem by design.

Footage, embeddings, and query history all live on the municipality's hardware behind the municipality's firewall. Nothing leaves the network. The data-residency posture matches what municipal CCTV procurement actually requires, without the operator having to think about it.

03 · 200 cameras per server. ~8h → <1h on a typical incident.

Result

Built for Neural and deployed by the City of Oława's Straż Miejska (Municipal Guard). Oława is a town of ~33,000 in Lower Silesia, Poland, and the system covers the same camera network the Guard already operated - just with a text-search layer over it.

  • 200 live camera streams supported per single on-prem server - typical mid-range GPU + batched inference, no cloud dependency.
  • Incident review time on a typical case dropped from ~8 hours of manual scrubbing to under 1 hour - an ~88% reduction.
  • Plain-language search ("red jacket near the bridge at 14:00") returns top matches across the full camera network in seconds.
  • On-prem deployment - footage and embeddings never leave the municipality's network, satisfying the data-residency requirements municipal CCTV procurement is built around.
  • Native integration into the operator's existing dispatch console - no separate tool to learn, no extra workflow to maintain.

The win is what the discovery step costs now. Once the index is text-queryable, the bottleneck stops being how fast a human can scrub through hours of footage and starts being how fast the operator can describe what they're looking for. That's the whole engagement.

// Expert insight

Watching 200 cameras in real time isn't a human problem - once the index is text-queryable, what used to be hours of scrubbing collapses to a sentence and a few seconds. We built it for Neural; the City of Oława's Straż Miejska runs it day-to-day, and the discovery step on a typical incident dropped from ~8 hours to under 1 hour.
Norbert Ropiak

Co-founder @ bards.ai

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting