Sentinel
Agentic Video Intelligence Platform with Confluent + Vertex AI
Vision → Decision → Action: Transform any video feed into governed event streams that drive real-time, explainable actions across manufacturing, healthcare, retail, logistics, and beyond, with an audit-grade evidence trail you can replay.
💡 Inspiration - The Problem We're Solving
Organizations across every industry deploy thousands of cameras - factories, warehouses, hospitals, retail stores, construction sites, farms, energy infrastructure, and smart cities. Yet most footage remains reactive, reviewed only after incidents occur, when the damage is already done.
The cost of waiting is universal and staggering:
- Downtime: $36K–$2.3M per hour depending on industry (automotive, FMCG, manufacturing) - Siemens 2024
- Major outages: 54% cost over $100K; 20% exceed $1M - Uptime Institute 2024
- Safety incidents: $176.5B annual workplace injury costs, averaging $43K per medically consulted injury - NSC 2023
- Operational impact: 90%+ of enterprises estimate hourly downtime exceeds $300K - ITIC 2024
The opportunity: Video represents ~80% of global data by 2025 (175 zettabytes projected) - IDC/Seagate 2018, yet organizations across all industries struggle to operationalize multimodal AI in real-time with governance, cost control, and auditability.
Sentinel closes that gap, not for one industry, but for every industry where visual monitoring matters.
🎯 What It Does
Sentinel is an agentic video intelligence platform that continuously converts raw video into governed operational intelligence across any industry:
End-to-End Pipeline
📹 Video Feed
↓ (Motion Detection + Sampling)
🔍 Observe (Gemini Multimodal Analysis)
↓ (Structured JSON Observations)
🧠 Think (Reasoning + Domain Knowledge Grounding via Vertex AI Search)
↓ (Explainable Decisions with Citations)
⚡ Act (Automated Alerts / Actions / webhooks)
↓ (Deduped, Cooldown-Protected)
📊 Audit + Real-Time KPIs (BigQuery + Flink SQL)
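For illustration, the "Structured JSON Observations" stage might carry an event shaped like this Pydantic sketch (field names are illustrative assumptions, not the exact schema we ship):

```python
from pydantic import BaseModel

class Observation(BaseModel):
    trace_id: str        # correlates this event across every pipeline stage
    camera_id: str
    clip_uri: str        # e.g. Cloud Storage path to the evidence clip
    start_s: float       # clip timestamp range backing the observation
    end_s: float
    signals: list[str]   # e.g. ["ppe_missing", "spill"]
    confidence: float    # per-clip model confidence, 0.0-1.0
```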
Cross-Industry Applications
The same architecture solves different problems across verticals:
| Industry | Use Case | Detection | Action | Impact |
|---|---|---|---|---|
| Manufacturing | Equipment anomaly detection | Abnormal vibrations, leaks, smoke | Predictive maintenance alert | Prevent $2M/hour downtime |
| Healthcare | Patient safety monitoring | Fall detection, mobility issues | Immediate staff alert | Reduce adverse events |
| Retail | Queue & service optimization | Long wait times, checkout bottlenecks | Staff reallocation | Improve customer experience |
| Logistics | Loading dock safety | Forklift near-misses, improper stacking | Stop operations, supervisor alert | Prevent $43K injuries |
| Agriculture | Crop & livestock monitoring | Irrigation issues, animal distress | Automated intervention | Prevent yield loss |
| Energy | Infrastructure monitoring | Pipeline leaks, equipment corrosion | Emergency shutdown | Prevent environmental disasters |
| Construction | Site safety compliance | Missing PPE, unsafe scaffolding | Stop work order | Reduce OSHA violations |
| Smart Cities | Traffic & crowd management | Congestion, crowd density | Dynamic signal control | Optimize urban flow |
Demo Implementations (Included)
We've built two use-case examples to showcase the platform's flexibility:
1. Security & Safety Monitoring
- Detects violations in real-time (PPE missing, unsafe behavior, spills)
- Evaluates severity with confidence scores
- Executes stop-line commands or alerts with full evidence chain
- Shows trace-linked video clips and reasoning
2. Assembly SOP Compliance
- Sessionizes station workflows into discrete work units
- Validates completion against SOP requirements
- Identifies missing steps with citations to procedure documents
- Provides operator-ready corrective instructions
The key insight: Both demos use the exact same streaming architecture, only the prompts, knowledge bases, and action handlers change. This proves the platform's universality.
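To make that concrete, here's a hypothetical wiring sketch of how the two demos could differ; the file paths, datastore names, and handler names are illustrative, not the repo's actual layout:

```python
# Hypothetical per-use-case configuration; the streaming backbone
# underneath stays identical for both.
USE_CASES = {
    "safety_monitoring": {
        "observer_prompt": "prompts/safety_observer.txt",
        "knowledge_base": "datastores/safety-policies",       # Vertex AI Search
        "action_handler": "handlers.safety.stop_line_alert",
    },
    "sop_compliance": {
        "observer_prompt": "prompts/assembly_observer.txt",
        "knowledge_base": "datastores/assembly-sops",
        "action_handler": "handlers.sop.corrective_instruction",
    },
}
```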
🏗️ How We Built It - Architecture

(Architecture diagram) Three-plane design: Streaming (Confluent), Intelligence (Vertex AI), and Audit (BigQuery + Flink)
Three-Plane Design
1. Streaming Plane (Confluent Cloud)
- Multi-stage event choreography through Kafka topics
- Schema-governed contracts via Schema Registry (JSON Schema)
- Independent scaling per agent via consumer groups
- Replay-first architecture for forensics and iteration
2. Intelligence Plane (Vertex AI)
- Gemini multimodal: Zero-shot video understanding
- Gemini reasoning: Severity assessment and action planning
- Vertex AI Search: RAG-grounded SOP lookups with citations
- Embeddings API: Semantic retrieval for knowledge base
3. Audit & Analytics Plane
- BigQuery: Immutable audit logs with correlated trace IDs
- Flink SQL: Real-time KPIs computed directly over Kafka streams
- Cloud Storage: Clip archival for evidence replay
Multi-Agent Streaming Architecture
video.clips → [Observer Agent]
↓
video.observations → [Sessionizer Agent]
↓
station.sessions → [Thinker Agent + SOP Grounding]
↓
sop.decisions → [Action Agent + Dedup]
↓
workflow.actions → [Audit Sink]
↓
audit.events (BigQuery)
Key Innovation: Each agent is an independent service consuming/producing from Kafka topics. This enables:
- Horizontal Scaling where it matters (e.g., 10x Observer instances for 100 cameras)
- Independent Evolution (swap models/prompts without downstream rewrites)
- Fault Isolation (one agent failure doesn't crash the pipeline)
- Clean Contracts (Schema Registry ensures safe evolution)
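For a feel of the pattern, here's a minimal consume→process→produce loop in confluent-kafka, shown for the Thinker agent. This is a sketch, not our production code: the think() stub stands in for the Gemini call, and a real Confluent Cloud client also needs SASL credentials.

```python
import json
import os

from confluent_kafka import Consumer, Producer

BOOTSTRAP = os.environ["KAFKA_BOOTSTRAP"]  # Confluent Cloud bootstrap server

def think(observation: dict) -> dict:
    # Stand-in for the Thinker's Gemini reasoning call.
    return {"trace_id": observation.get("trace_id"), "severity": "low"}

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "thinker-agent",        # one consumer group per agent type
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})

consumer.subscribe(["video.observations"])
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    decision = think(json.loads(msg.value()))
    producer.produce("sop.decisions", json.dumps(decision).encode("utf-8"))
    producer.poll(0)                    # serve delivery callbacks
```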
🔧 Technical Implementation - Why This Works at Scale
1️⃣ Cost-Controlled Multimodal Inference
Video AI inference can bankrupt a deployment. We built multiple cost gates:
Motion Detection Prefilter
- Pixel-change detection + background subtraction (see the sketch below)
- Filters 80-90% of "quiet" clips before inference
- Turns "impossible economics" into viable deployment
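A minimal sketch of such a motion gate using OpenCV background subtraction; the pixel-ratio threshold is an illustrative default, not our tuned value:

```python
import cv2

def clip_has_motion(path: str, pixel_ratio: float = 0.01) -> bool:
    """Return True if any frame in the clip shows enough changed pixels."""
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    cap = cv2.VideoCapture(path)
    moving = False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # Gate on the fraction of foreground pixels in the frame.
        if (mask > 0).mean() > pixel_ratio:
            moving = True
            break
    cap.release()
    return moving
```

Clips that fail this check never reach Gemini, which is where the 80-90% inference savings come from.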
Smart Sampling & Segmentation
- Configurable clip length and FPS
- Prevents redundant processing of static scenes
Streaming Deduplication
- Cooldown windows prevent alert storms (see the sketch below)
- Action-level dedup across topics
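A cooldown gate can be as simple as this sketch; the window length and key shape are illustrative:

```python
import time

COOLDOWN_SECONDS = 300  # illustrative window per (camera, violation) key
_last_fired: dict[tuple[str, str], float] = {}

def should_fire(camera_id: str, violation: str) -> bool:
    """Suppress repeat alerts for the same key inside the cooldown window."""
    key = (camera_id, violation)
    now = time.monotonic()
    if now - _last_fired.get(key, float("-inf")) < COOLDOWN_SECONDS:
        return False  # still cooling down; drop the duplicate
    _last_fired[key] = now
    return True
```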
Cost Model:
API calls/day ≈ (N_cameras × 1440 min/day ÷ T_clip) × (1 − filter_rate)
Example: 10 cameras, 30-second clips (T_clip = 0.5 min)
Unfiltered: (10 × 1440) / 0.5 = 28,800 calls/day
With the 80% prefilter → 28,800 × 0.2 = 5,760 calls/day
2️⃣ Explainability & Governance as First-Class Output
Every decision includes:
- Evidence: Exact clip timestamp range
- Rationale: Human-readable explanation
- Confidence scores: Per-signal uncertainty
- Citations: When grounded in SOP/policy (via Vertex AI Search)
- Trace ID: Correlates across all pipeline stages
Operators trust the system because they can see why it decided, not just what it decided.
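For illustration, a decision event carrying that evidence chain might look like this (field names assumed for the sketch):

```python
# Illustrative decision payload; not our published schema.
decision = {
    "trace_id": "cam7-2024-06-01T12:03:15Z-001",
    "evidence": {"clip_uri": "gs://sentinel-clips/clip.mp4",
                 "start_s": 12.0, "end_s": 19.5},
    "rationale": "Operator entered marked zone without a hard hat.",
    "confidence": {"ppe_missing": 0.93},
    "citations": [{"doc": "SOP-114", "section": "4.2"}],
}
```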
3️⃣ Real-Time KPIs Without Extra Infrastructure
The same Kafka streams that drive actions also power analytics:
Flink SQL Queries (examples included):
- Rule hit rates by violation type
- Stop-line frequency trends
- Confidence distribution analysis
- P95 end-to-end latency tracking
- Alert storm detection windows
Value: No separate analytics pipeline; KPIs are computed in-stream.
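Here's a sketch of the first of those KPIs, assuming sop.decisions is exposed as a Flink table with a watermarked event_time; column names are illustrative:

```sql
-- Rule hit rate per violation type over 5-minute tumbling windows.
-- Assumes `sop.decisions` is mapped to a Flink table with columns
-- violation_type, confidence, and a watermarked event_time.
SELECT
  violation_type,
  COUNT(*)        AS hits,
  AVG(confidence) AS avg_confidence,
  window_start,
  window_end
FROM TABLE(
  TUMBLE(TABLE `sop.decisions`, DESCRIPTOR(event_time), INTERVAL '5' MINUTE))
GROUP BY violation_type, window_start, window_end;
```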
4️⃣ Production-Grade Replay & Forensics
Kafka's retention + correlated trace_id enables:
- Incident investigation: Replay exactly what the system saw
- Model tuning: Re-run decisions with updated prompts
- Compliance audits: Full evidence chain for regulatory review
- A/B testing: Compare model outputs on same event history
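A forensic replay can be sketched as a throwaway consumer that rewinds to the start of retention and filters on the incident's trace_id (the trace ID value here is made up):

```python
import json
import os

from confluent_kafka import Consumer, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "group.id": "forensic-replay",      # throwaway group; offsets don't matter
    "auto.offset.reset": "earliest",
})

def rewind(c, partitions):
    for p in partitions:
        p.offset = OFFSET_BEGINNING     # start of retained history
    c.assign(partitions)

consumer.subscribe(["sop.decisions"], on_assign=rewind)

TARGET_TRACE = "cam7-2024-06-01T12:03:15Z-001"  # made-up incident ID
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    if event.get("trace_id") == TARGET_TRACE:
        print(event)                    # the pipeline's view of the incident
```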
🚀 Confluent Cloud Usage
This project is Confluent-native by design.
What We Used & Why It Matters
| Confluent Feature | How We Use It | Business Value |
|---|---|---|
| Kafka Topics | Multi-agent backbone with topic-per-stage pattern | Decoupling, fault isolation, clean evolution |
| Consumer Groups | Scale Observer instances (10x) independently from Thinker (2x) | Cost-efficient horizontal scaling |
| Schema Registry | JSON Schema contracts generated from Pydantic models | Safe prompt/model evolution, fewer breakages |
| Replayability | Replay by trace_id for forensics and iteration | Incident investigation, compliance, A/B testing |
| Flink SQL | Real-time KPIs + stream-side cost filters | Operational visibility, upstream cost gates |
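As a sketch of the Pydantic-to-Schema-Registry flow from the table above (the model and env vars are illustrative; the subject name follows the default TopicNameStrategy):

```python
import json
import os

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient
from pydantic import BaseModel

class Observation(BaseModel):   # illustrative event contract
    trace_id: str
    camera_id: str
    signals: list[str]
    confidence: float

client = SchemaRegistryClient({
    "url": os.environ["SCHEMA_REGISTRY_URL"],
    "basic.auth.user.info": os.environ["SR_CREDENTIALS"],  # "key:secret"
})

# Generate JSON Schema from the Pydantic model and register it against
# the topic's value subject.
schema = Schema(json.dumps(Observation.model_json_schema()), schema_type="JSON")
schema_id = client.register_schema("video.observations-value", schema)
print(f"registered schema id {schema_id}")
```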
Critical Design Decision: We chose event choreography over orchestration. Each agent is autonomous, consuming from upstream topics and producing to downstream topics. This creates natural backpressure, enables independent scaling, and makes the system resilient to partial failures.
🧠 Vertex AI Usage
What We Used & Why It Matters
| Vertex AI Feature | How We Use It | Business Value |
|---|---|---|
| Gemini Multimodal | Observer agent reads video clips, emits structured signals | Zero-shot understanding, no CV pipeline required |
| Gemini Reasoning | Thinker + Doer convert signals → severity → actions | Operational judgment with consistent JSON |
| Vertex AI Search | RAG grounding for SOP compliance checks | Citation-backed decisions, reduced hallucinations |
| Embeddings API | Semantic SOP chunk retrieval | Scalable knowledge grounding |
Critical Design Decision: We use structured output prompting (strict JSON schemas) to ensure Gemini outputs are Kafka-ready events, not unstructured text. This makes the pipeline reliable and testable.
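A sketch of what that looks like with the Vertex AI SDK; the response schema here is a simplified stand-in for our actual decision contract:

```python
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-2.5-flash")

# Simplified decision schema; the real contract carries more fields.
decision_schema = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        "rationale": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["severity", "rationale", "confidence"],
}

response = model.generate_content(
    "Assess this observation: operator near press without gloves.",
    generation_config=GenerationConfig(
        response_mime_type="application/json",
        response_schema=decision_schema,  # constrains Gemini to this shape
    ),
)
print(response.text)  # parsable JSON, ready to publish to Kafka
```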
💪 Challenges We Overcame
1. Latency vs. Cost Trade-off
Problem: High-resolution video at 30 FPS = 1,800 frames/minute. At $0.05/frame, that's $90/minute = $129,600/day per camera. Impossible.
Solution:
- Motion detection cuts inference by 80-90%
- Segment into 15-30 second clips
- Sample at 1-2 FPS for analysis
- Result: ~$10-20/day per camera (economically viable)
2. Multi-Agent Coordination Without Brittle Orchestration
Problem: Centralized orchestrators become single points of failure and bottlenecks.
Solution:
- Event choreography via Kafka topics
- Schema Registry enforces contracts between agents
- Each agent scales independently
- Natural backpressure prevents cascade failures
3. Trust & Explainability for Compliance Use Cases
Problem: "AI said stop the line" isn't acceptable in regulated environments.
Solution:
- Vertex AI Search grounds decisions in actual SOP documents
- Every decision includes citations to specific procedure sections
- Full audit trail with correlated trace IDs
- Operators can replay incidents to understand "why"
💼 Potential Value Applications
Manufacturing & Industrial:
- Downtime prevention: Early detection of equipment issues addresses documented $36K–$2.3M/hour costs
- Predictive maintenance: Faster awareness of visual anomalies (smoke, leaks, vibrations)
- Quality control: Real-time visual inspection of assembly processes
Safety & Compliance:
- Injury prevention: With $43K average cost per workplace injury, early hazard detection has measurable value
- Compliance monitoring: Automated verification of safety protocols
- Regulatory support: Documented audit trails for incident investigation
Operational Efficiency:
- Process monitoring: Visual verification of workflow completion
- Audit support: Reduced time spent on manual video review
- Quality feedback: Faster identification of process deviations
Healthcare & Patient Safety:
- Fall detection: Addressing documented hospital fall costs
- Early intervention: Real-time alerts for patient mobility issues
- Staff support: Automated monitoring between routine checks
Retail & Customer Experience:
- Queue optimization: Visual monitoring of checkout wait times
- Loss prevention: Automated detection of unusual activity
- Service quality: Real-time awareness of customer service needs
🎯 What's Next
We built Sentinel as an architectural foundation. Natural next steps for us are:
Near-Term (1-3 Months)
- Flink-first cost gating: Move motion/signal thresholds into stream processing (materialized views)
- Connector ecosystem: Slack, PagerDuty, ServiceNow, Jira (action handlers already modular)
- Policy pack system: Plug-in SOP libraries per station/site/customer with version control
Medium-Term (3-6 Months)
- Vector Search hardening: Upgrade SOP retrieval to Vertex Vector Search for lower latency
- Multi-modal expansion: Add audio analysis (machine sounds, alarms) to video
- Edge deployment: Run Observer agents closer to cameras for ultra-low latency
Long-Term (6-12 Months)
- Federated learning: Train station-specific anomaly models on local data
- Predictive maintenance: Correlate visual signals with equipment telemetry
- Cross-site benchmarking: Compare SOP adherence across facilities
🛠️ Built With
Confluent Cloud (Core Platform)
- Kafka Topics: Multi-agent event backbone
- Consumer Groups: Independent scaling per agent type
- Schema Registry: Governed JSON Schema contracts
- Flink SQL: Real-time KPIs and stream analytics
- Replayability: Forensic replay by trace ID
Google Cloud Vertex AI (Intelligence Layer)
- Gemini 2.5 Flash (Multimodal): Video understanding
- Gemini 2.5 Pro (Reasoning): Decision synthesis
- Vertex AI Search: RAG-grounded SOP retrieval
- Embeddings API: Semantic knowledge base
Google Cloud Infrastructure
- Cloud Storage: Clip archival
- BigQuery: Audit log warehouse
- Cloud Run: FastAPI control plane (demo UI)
- Secret Manager: API key governance
🎬 Try It Yourself
Demo Video: https://youtu.be/-X6tXlmWlvM
Live Demo: https://sentinel-464199486062.us-central1.run.app/ui
Code Repository: https://github.com/Niket93/sentinel
👥 Team
Niket Shah - LinkedIn
📚 References & Research
- IDC/Seagate (2018): "The Digitization of the World - From Edge to Core" - 175 zettabytes of data by 2025, 80% video/video-like
- Siemens (2024): "The True Cost of Downtime 2024" - Downtime costs $36K/hour (FMCG) to $2.3M/hour (automotive)
- Uptime Institute (2024/2025): "Annual Outage Analysis" - 54% of outages cost >$100K, 20% cost >$1M
- ITIC (2024): "Hourly Cost of Downtime Survey" - 90%+ of enterprises estimate >$300K/hour downtime cost
- National Safety Council (2023): "Work Injury Costs - Injury Facts" - $176.5B total workplace injury costs, $43K average per injury
🏆 Why This Matters
Real-time video intelligence is hard to operationalize. Most approaches sacrifice cost-efficiency, explainability, or governance. Sentinel demonstrates that you don't have to choose.
✅ Solves Documented Problems
Addresses $36K–$2.3M/hour downtime costs and $43K workplace injury costs with a practical, economically viable approach.
✅ Production-First Design
Cost controls aren't an afterthought; they're built into the architecture. Motion detection, sampling strategies, and deduplication make multimodal AI inference economically feasible at scale.
✅ Deep Sponsor Integration
This isn't a shallow integration. We use Confluent's event choreography for multi-agent coordination, Schema Registry for safe evolution, and Flink SQL for real-time KPIs. Vertex AI powers zero-shot video understanding, grounded reasoning with RAG, and citation-backed decisions.
✅ Explainability as Output
Every decision includes evidence, rationale, confidence scores, and citations. This isn't a black box; it's a system operators can trust and auditors can verify.
✅ Demonstrates Architectural Thinking
Two different use cases running on the same infrastructure proves the approach is adaptable. The streaming backbone doesn't change, only prompts and knowledge bases do.
Sentinel shows how Confluent and Vertex AI can work together to make video intelligence operationally viable: governed, explainable, cost-controlled, and production-ready.
Built With
- bash
- confluent
- confluent-kafka
- css
- docker
- fastapi
- flink
- gcp
- gcs
- google-bigquery
- html
- javascript
- kafka
- opencv
- python
- sql
- uvicorn
- vertex