Sentinel
Agentic Video Intelligence Platform with Confluent + Vertex AI
Vision → Decision → Action: Transform any video feed into governed event streams that drive real-time, explainable actions across manufacturing, healthcare, retail, logistics, and beyond, with an audit-grade evidence trail you can replay.
💡 Inspiration - The Problem We're Solving
Organizations across every industry deploy thousands of cameras - factories, warehouses, hospitals, retail stores, construction sites, farms, energy infrastructure, and smart cities. Yet most footage remains reactive, reviewed only after incidents occur, when the damage is already done.
The cost of waiting is universal and staggering:
- Downtime: $36K–$2.3M per hour depending on industry (automotive, FMCG, manufacturing) - Siemens 2024
- Major outages: 54% cost over $100K; 20% exceed $1M - Uptime Institute 2024
- Safety incidents: $176.5B annual workplace injury costs, averaging $43K per medically consulted injury - NSC 2023
- Operational impact: 90%+ of enterprises estimate hourly downtime exceeds $300K - ITIC 2024
The opportunity: Video represents ~80% of global data by 2025 (175 zettabytes projected) - IDC/Seagate 2018, yet organizations across all industries struggle to operationalize multimodal AI in real-time with governance, cost control, and auditability.
Sentinel closes that gap, not for one industry, but for every industry where visual monitoring matters.
🎯 What It Does
Sentinel is an agentic video intelligence platform that continuously converts raw video into governed operational intelligence across any industry:
End-to-End Pipeline
📹 Video Feed
↓ (Motion Detection + Sampling)
🔍 Observe (Gemini Multimodal Analysis)
↓ (Structured JSON Observations)
🧠 Think (Reasoning + Domain Knowledge Grounding via Vertex AI Search)
↓ (Explainable Decisions with Citations)
⚡ Act (Automated Alerts / Actions / webhooks)
↓ (Deduped, Cooldown-Protected)
📊 Audit + Real-Time KPIs (BigQuery + Flink SQL)
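For illustration, the "Structured JSON Observations" stage might carry an event shaped like this Pydantic sketch (field names are illustrative assumptions, not the exact schema we ship):

```python
from pydantic import BaseModel

class Observation(BaseModel):
    trace_id: str        # correlates this event across every pipeline stage
    camera_id: str
    clip_uri: str        # e.g. Cloud Storage path to the evidence clip
    start_s: float       # clip timestamp range backing the observation
    end_s: float
    signals: list[str]   # e.g. ["ppe_missing", "spill"]
    confidence: float    # per-clip model confidence, 0.0-1.0
```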
Cross-Industry Applications
The same architecture solves different problems across verticals:
| Industry | Use Case | Detection | Action | Impact |
|---|---|---|---|---|
| Manufacturing | Equipment anomaly detection | Abnormal vibrations, leaks, smoke | Predictive maintenance alert | Prevent $2M/hour downtime |
| Healthcare | Patient safety monitoring | Fall detection, mobility issues | Immediate staff alert | Reduce adverse events |
| Retail | Queue & service optimization | Long wait times, checkout bottlenecks | Staff reallocation | Improve customer experience |
| Logistics | Loading dock safety | Forklift near-misses, improper stacking | Stop operations, supervisor alert | Prevent $43K injuries |
| Agriculture | Crop & livestock monitoring | Irrigation issues, animal distress | Automated intervention | Prevent yield loss |
| Energy | Infrastructure monitoring | Pipeline leaks, equipment corrosion | Emergency shutdown | Prevent environmental disasters |
| Construction | Site safety compliance | Missing PPE, unsafe scaffolding | Stop work order | Reduce OSHA violations |
| Smart Cities | Traffic & crowd management | Congestion, crowd density | Dynamic signal control | Optimize urban flow |
Demo Implementations (Included)
We've built two use-case examples to showcase the platform's flexibility:
1. Security & Safety Monitoring
- Detects violations in real-time (PPE missing, unsafe behavior, spills)
- Evaluates severity with confidence scores
- Executes stop-line commands or alerts with full evidence chain
- Shows trace-linked video clips and reasoning
2. Assembly SOP Compliance
- Sessionizes station workflows into discrete work units
- Validates completion against SOP requirements
- Identifies missing steps with citations to procedure documents
- Provides operator-ready corrective instructions
The key insight: Both demos use the exact same streaming architecture, only the prompts, knowledge bases, and action handlers change. This proves the platform's universality.
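To make that concrete, here's a hypothetical wiring sketch of how the two demos could differ; the file paths, datastore names, and handler names are illustrative, not the repo's actual layout:

```python
# Hypothetical per-use-case configuration; the streaming backbone
# underneath stays identical for both.
USE_CASES = {
    "safety_monitoring": {
        "observer_prompt": "prompts/safety_observer.txt",
        "knowledge_base": "datastores/safety-policies",       # Vertex AI Search
        "action_handler": "handlers.safety.stop_line_alert",
    },
    "sop_compliance": {
        "observer_prompt": "prompts/assembly_observer.txt",
        "knowledge_base": "datastores/assembly-sops",
        "action_handler": "handlers.sop.corrective_instruction",
    },
}
```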
🏗️ How We Built It - Architecture

(Architecture diagram) Three-plane design: Streaming (Confluent), Intelligence (Vertex AI), and Audit (BigQuery + Flink)
Three-Plane Design
1. Streaming Plane (Confluent Cloud)
- Multi-stage event choreography through Kafka topics
- Schema-governed contracts via Schema Registry (JSON Schema)
- Independent scaling per agent via consumer groups
- Replay-first architecture for forensics and iteration
2. Intelligence Plane (Vertex AI)
- Gemini multimodal: Zero-shot video understanding
- Gemini reasoning: Severity assessment and action planning
- Vertex AI Search: RAG-grounded SOP lookups with citations
- Embeddings API: Semantic retrieval for knowledge base
3. Audit & Analytics Plane
- BigQuery: Immutable audit logs with correlated trace IDs
- Flink SQL: Real-time KPIs computed directly over Kafka streams
- Cloud Storage: Clip archival for evidence replay
Multi-Agent Streaming Architecture
video.clips → [Observer Agent]
↓
video.observations → [Sessionizer Agent]
↓
station.sessions → [Thinker Agent + SOP Grounding]
↓
sop.decisions → [Action Agent + Dedup]
↓
workflow.actions → [Audit Sink]
↓
audit.events (BigQuery)
Key Innovation: Each agent is an independent service consuming/producing from Kafka topics. This enables:
- Horizontal Scaling where it matters (e.g., 10x Observer instances for 100 cameras)
- Independent Evolution (swap models/prompts without downstream rewrites)
- Fault Isolation (one agent failure doesn't crash the pipeline)
- Clean Contracts (Schema Registry ensures safe evolution)
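For a feel of the pattern, here's a minimal consume→process→produce loop in confluent-kafka, shown for the Thinker agent. This is a sketch, not our production code: the think() stub stands in for the Gemini call, and a real Confluent Cloud client also needs SASL credentials.

```python
import json
import os

from confluent_kafka import Consumer, Producer

BOOTSTRAP = os.environ["KAFKA_BOOTSTRAP"]  # Confluent Cloud bootstrap server

def think(observation: dict) -> dict:
    # Stand-in for the Thinker's Gemini reasoning call.
    return {"trace_id": observation.get("trace_id"), "severity": "low"}

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "thinker-agent",        # one consumer group per agent type
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})

consumer.subscribe(["video.observations"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    decision = think(json.loads(msg.value()))
    producer.produce("sop.decisions", json.dumps(decision).encode("utf-8"))
    producer.poll(0)                    # serve delivery callbacks
```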
🔧 Technical Implementation - Why This Works at Scale
1️⃣ Cost-Controlled Multimodal Inference
Video AI inference can bankrupt a deployment. We built multiple cost gates:
Motion Detection Prefilter
- Pixel-change detection + background subtraction (see the sketch below)
- Filters 80-90% of "quiet" clips before inference
- Turns "impossible economics" into viable deployment
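A minimal sketch of such a motion gate using OpenCV background subtraction; the pixel-ratio threshold is an illustrative default, not our tuned value:

```python
import cv2

def clip_has_motion(path: str, pixel_ratio: float = 0.01) -> bool:
    """Return True if any frame in the clip shows enough changed pixels."""
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    cap = cv2.VideoCapture(path)
    moving = False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # Gate on the fraction of foreground pixels in the frame.
        if (mask > 0).mean() > pixel_ratio:
            moving = True
            break
    cap.release()
    return moving
```

Clips that fail this check never reach Gemini, which is where the 80-90% inference savings come from.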
Smart Sampling & Segmentation
- Configurable clip length and FPS
- Prevents redundant processing of static scenes
Streaming Deduplication
- Cooldown windows prevent alert storms (see the sketch below)
- Action-level dedup across topics
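A cooldown gate can be as simple as this sketch; the window length and key shape are illustrative:

```python
import time

COOLDOWN_SECONDS = 300  # illustrative window per (camera, violation) key
_last_fired: dict[tuple[str, str], float] = {}

def should_fire(camera_id: str, violation: str) -> bool:
    """Suppress repeat alerts for the same key inside the cooldown window."""
    key = (camera_id, violation)
    now = time.monotonic()
    if now - _last_fired.get(key, float("-inf")) < COOLDOWN_SECONDS:
        return False  # still cooling down; drop the duplicate
    _last_fired[key] = now
    return True
```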
Cost Model:
API calls/day ≈ (N_cameras × 1440 min/day ÷ T_clip) × (1 − filter_rate)
Example: 10 cameras, 30-second clips (T_clip = 0.5 min)
Unfiltered: (10 × 1440) / 0.5 = 28,800 calls/day
With the 80% prefilter → 28,800 × 0.2 = 5,760 calls/day
2️⃣ Explainability & Governance as First-Class Output
Every decision includes:
- Evidence: Exact clip timestamp range
- Rationale: Human-readable explanation
- Confidence scores: Per-signal uncertainty
- Citations: When grounded in SOP/policy (via Vertex AI Search)
- Trace ID: Correlates across all pipeline stages
Operators trust the system because they can see why it decided, not just what it decided.
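For illustration, a decision event carrying that evidence chain might look like this (field names assumed for the sketch):

```python
# Illustrative decision payload; not our published schema.
decision = {
    "trace_id": "cam7-2024-06-01T12:03:15Z-001",
    "evidence": {"clip_uri": "gs://sentinel-clips/clip.mp4",
                 "start_s": 12.0, "end_s": 19.5},
    "rationale": "Operator entered marked zone without a hard hat.",
    "confidence": {"ppe_missing": 0.93},
    "citations": [{"doc": "SOP-114", "section": "4.2"}],
}
```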
3️⃣ Real-Time KPIs Without Extra Infrastructure
The same Kafka streams that drive actions also power analytics:
Flink SQL Queries (examples included):
- Rule hit rates by violation type
- Stop-line frequency trends
- Confidence distribution analysis
- P95 end-to-end latency tracking
- Alert storm detection windows
Value: No separate analytics pipeline; KPIs are computed in-stream.
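Here's a sketch of the first of those KPIs, assuming sop.decisions is exposed as a Flink table with a watermarked event_time; column names are illustrative:

```sql
-- Rule hit rate per violation type over 5-minute tumbling windows.
-- Assumes `sop.decisions` is mapped to a Flink table with columns
-- violation_type, confidence, and a watermarked event_time.
SELECT
  violation_type,
  COUNT(*)        AS hits,
  AVG(confidence) AS avg_confidence,
  window_start,
  window_end
FROM TABLE(
  TUMBLE(TABLE `sop.decisions`, DESCRIPTOR(event_time), INTERVAL '5' MINUTE))
GROUP BY violation_type, window_start, window_end;
```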
4️⃣ Production-Grade Replay & Forensics
Kafka's retention + correlated trace_id enables:
- Incident investigation: Replay exactly what the system saw
- Model tuning: Re-run decisions with updated prompts
- Compliance audits: Full evidence chain for regulatory review
- A/B testing: Compare model outputs on same event history
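A forensic replay can be sketched as a throwaway consumer that rewinds to the start of retention and filters on the incident's trace_id (the trace ID value here is made up):

```python
import json
import os

from confluent_kafka import Consumer, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "group.id": "forensic-replay",      # throwaway group; offsets don't matter
    "auto.offset.reset": "earliest",
})

def rewind(c, partitions):
    for p in partitions:
        p.offset = OFFSET_BEGINNING     # start of retained history
    c.assign(partitions)

consumer.subscribe(["sop.decisions"], on_assign=rewind)

TARGET_TRACE = "cam7-2024-06-01T12:03:15Z-001"  # made-up incident ID
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("trace_id") == TARGET_TRACE:
        print(event)                    # the pipeline's view of the incident
```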
🚀 Confluent Cloud Usage
This project is Confluent-native by design.
What We Used & Why It Matters
| Confluent Feature | How We Use It | Business Value |
|---|---|---|
| Kafka Topics | Multi-agent backbone with topic-per-stage pattern | Decoupling, fault isolation, clean evolution |
| Consumer Groups | Scale Observer instances (10x) independently from Thinker (2x) | Cost-efficient horizontal scaling |
| Schema Registry | JSON Schema contracts generated from Pydantic models | Safe prompt/model evolution, fewer breakages |
| Replayability | Replay by trace_id for forensics and iteration | Incident investigation, compliance, A/B testing |
| Flink SQL | Real-time KPIs + stream-side cost filters | Operational visibility, upstream cost gates |
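As a sketch of the Pydantic-to-Schema-Registry flow from the table above (the model and env vars are illustrative; the subject name follows the default TopicNameStrategy):

```python
import json
import os

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient
from pydantic import BaseModel

class Observation(BaseModel):   # illustrative event contract
    trace_id: str
    camera_id: str
    signals: list[str]
    confidence: float

client = SchemaRegistryClient({
    "url": os.environ["SCHEMA_REGISTRY_URL"],
    "basic.auth.user.info": os.environ["SR_CREDENTIALS"],  # "key:secret"
})

# Generate JSON Schema from the Pydantic model and register it against
# the topic's value subject.
schema = Schema(json.dumps(Observation.model_json_schema()), schema_type="JSON")
schema_id = client.register_schema("video.observations-value", schema)
print(f"registered schema id {schema_id}")
```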
Critical Design Decision: We chose event choreography over orchestration. Each agent is autonomous, consuming from upstream topics and producing to downstream topics. This creates natural backpressure, enables independent scaling, and makes the system resilient to partial failures.
🧠 Vertex AI Usage
What We Used & Why It Matters
| Vertex AI Feature | How We Use It | Business Value |
|---|---|---|
| Gemini Multimodal | Observer agent reads video clips, emits structured signals | Zero-shot understanding, no CV pipeline required |
| Gemini Reasoning | Thinker + Doer convert signals → severity → actions | Operational judgment with consistent JSON |
| Vertex AI Search | RAG grounding for SOP compliance checks | Citation-backed decisions, reduced hallucinations |
| Embeddings API | Semantic SOP chunk retrieval | Scalable knowledge grounding |
Critical Design Decision: We use structured output prompting (strict JSON schemas) to ensure Gemini outputs are Kafka-ready events, not unstructured text. This makes the pipeline reliable and testable.
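A sketch of what that looks like with the Vertex AI SDK; the response schema here is a simplified stand-in for our actual decision contract:

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-2.5-flash")

# Simplified decision schema; the real contract carries more fields.
decision_schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "rationale": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["severity", "rationale", "confidence"],
}

response = model.generate_content(
    "Assess this observation: operator near press without gloves.",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=decision_schema,  # constrains Gemini to this shape
    ),
)
print(response.text)  # parsable JSON, ready to publish to Kafka
```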
💪 Challenges We Overcame
1. Latency vs. Cost Trade-off
Problem: High-resolution video at 30 FPS = 1,800 frames/minute. At $0.05/frame, that's $90/minute = $129,600/day per camera. Impossible.
Solution:
- Motion detection cuts inference by 80-90%
- Segment into 15-30 second clips
- Sample at 1-2 FPS for analysis
- Result: ~$10-20/day per camera (economically viable)
2. Multi-Agent Coordination Without Brittle Orchestration
Problem: Centralized orchestrators become single points of failure and bottlenecks.
Solution:
- Event choreography via Kafka topics
- Schema Registry enforces contracts between agents
- Each agent scales independently
- Natural backpressure prevents cascade failures
3. Trust & Explainability for Compliance Use Cases
Problem: "AI said stop the line" isn't acceptable in regulated environments.
Solution:
- Vertex AI Search grounds decisions in actual SOP documents
- Every decision includes citations to specific procedure sections
- Full audit trail with correlated trace IDs
- Operators can replay incidents to understand "why"
💼 Potential Value Applications
Manufacturing & Industrial:
- Downtime prevention: Early detection of equipment issues addresses documented $36K–$2.3M/hour costs
- Predictive maintenance: Faster awareness of visual anomalies (smoke, leaks, vibrations)
- Quality control: Real-time visual inspection of assembly processes
Safety & Compliance:
- Injury prevention: With $43K average cost per workplace injury, early hazard detection has measurable value
- Compliance monitoring: Automated verification of safety protocols
- Regulatory support: Documented audit trails for incident investigation
Operational Efficiency:
- Process monitoring: Visual verification of workflow completion
- Audit support: Reduced time spent on manual video review
- Quality feedback: Faster identification of process deviations
Healthcare & Patient Safety:
- Fall detection: Addressing documented hospital fall costs
- Early intervention: Real-time alerts for patient mobility issues
- Staff support: Automated monitoring between routine checks
Retail & Customer Experience:
- Queue optimization: Visual monitoring of checkout wait times
- Loss prevention: Automated detection of unusual activity
- Service quality: Real-time awareness of customer service needs
🎯 What's Next
We built Sentinel as an architectural foundation. Natural next steps for us are:
Near-Term (1-3 Months)
- Flink-first cost gating: Move motion/signal thresholds into stream processing (materialized views)
- Connector ecosystem: Slack, PagerDuty, ServiceNow, Jira (action handlers already modular)
- Policy pack system: Plug-in SOP libraries per station/site/customer with version control
Medium-Term (3-6 Months)
- Vector Search hardening: Upgrade SOP retrieval to Vertex Vector Search for lower latency
- Multi-modal expansion: Add audio analysis (machine sounds, alarms) to video
- Edge deployment: Run Observer agents closer to cameras for ultra-low latency
Long-Term (6-12 Months)
- Federated learning: Train station-specific anomaly models on local data
- Predictive maintenance: Correlate visual signals with equipment telemetry
- Cross-site benchmarking: Compare SOP adherence across facilities
🛠️ Built With
Confluent Cloud (Core Platform)
- Kafka Topics: Multi-agent event backbone
- Consumer Groups: Independent scaling per agent type
- Schema Registry: Governed JSON Schema contracts
- Flink SQL: Real-time KPIs and stream analytics
- Replayability: Forensic replay by trace ID
Google Cloud Vertex AI (Intelligence Layer)
- Gemini 2.5 Flash (Multimodal): Video understanding
- Gemini 2.5 Pro (Reasoning): Decision synthesis
- Vertex AI Search: RAG-grounded SOP retrieval
- Embeddings API: Semantic knowledge base
Google Cloud Infrastructure
- Cloud Storage: Clip archival
- BigQuery: Audit log warehouse
- Cloud Run: FastAPI control plane (demo UI)
- Secret Manager: API key governance
🎬 Try It Yourself
Demo Video: https://youtu.be/-X6tXlmWlvM
Live Demo: https://sentinel-464199486062.us-central1.run.app/ui
Code Repository: https://github.com/Niket93/sentinel
👥 Team
Niket Shah - LinkedIn
📚 References & Research
- IDC/Seagate (2018): "The Digitization of the World - From Edge to Core" - 175 zettabytes of data by 2025, 80% video/video-like
- Siemens (2024): "The True Cost of Downtime 2024" - Downtime costs $36K/hour (FMCG) to $2.3M/hour (automotive)
- Uptime Institute (2024/2025): "Annual Outage Analysis" - 54% of outages cost >$100K, 20% cost >$1M
- ITIC (2024): "Hourly Cost of Downtime Survey" - 90%+ of enterprises estimate >$300K/hour downtime cost
- National Safety Council (2023): "Work Injury Costs - Injury Facts" - $176.5B total workplace injury costs, $43K average per injury
🏆 Why This Matters
Real-time video intelligence is hard to operationalize. Most approaches sacrifice cost-efficiency, explainability, or governance. Sentinel demonstrates that you don't have to choose.
✅ Solves Documented Problems
Addresses $36K–$2.3M/hour downtime costs and $43K workplace injury costs with a practical, economically viable approach.
✅ Production-First Design
Cost controls aren't an afterthought; they're built into the architecture. Motion detection, sampling strategies, and deduplication make multimodal AI inference economically feasible at scale.
✅ Deep Sponsor Integration
This isn't a shallow integration. We use Confluent's event choreography for multi-agent coordination, Schema Registry for safe evolution, and Flink SQL for real-time KPIs. Vertex AI powers zero-shot video understanding, grounded reasoning with RAG, and citation-backed decisions.
✅ Explainability as Output
Every decision includes evidence, rationale, confidence scores, and citations. This isn't a black box; it's a system operators can trust and auditors can verify.
✅ Demonstrates Architectural Thinking
Two different use cases running on the same infrastructure proves the approach is adaptable. The streaming backbone doesn't change, only prompts and knowledge bases do.
Sentinel shows how Confluent and Vertex AI can work together to make video intelligence operationally viable: governed, explainable, cost-controlled, and production-ready.
Built With
- bash
- confluent
- confluent-kafka
- css
- docker
- fastapi
- flink
- gcp
- gcs
- google-bigquery
- html
- javascript
- kafka
- opencv
- python
- sql
- uvicorn
- vertex