Introduction

Video stream sources, such as CCTV, smartphones, drones, and other devices, play a crucial role in generating video content, prompting the development of systems for video content analysis, management, and retrieval. As online video traffic is projected to exceed a zettabyte annually by 2030, the need for a scalable and distributed Content-based Video Retrieval (CBVR) architecture is becoming crucial [1]. This rapidly expanding volume of video data, referred to as "video big data", far exceeds the capabilities of conventional analytics methods. CBVR, which retrieves relevant videos based on content rather than metadata, demands scalable systems for indexing, querying, and feature extraction at scale.

The rapid increase in video data from sources like CCTV, drones, and online platforms has far exceeded the capacity of traditional retrieval systems. In distributed setups, real-time analytics and historical video searches usually require separate processing systems, which results in duplicated infrastructure, inconsistent indexing, and higher latency. Although distributed computing frameworks like Hadoop [2] and Spark [3] offer scalable infrastructure, they were not originally designed to handle the temporal, semantic, and high-dimensional nature of video data. Consequently, adapting these platforms to efficiently support CBVR remains a significant challenge. Systems designed for large-scale CBVR must effectively integrate video ingestion, deep feature computation, indexing, and retrieval for both streaming and batch pipelines. Despite ongoing research, there is still a lack of scalable, end-to-end solutions capable of handling multi-modal video content in both real-time and offline settings.

CBVR has gained significant attention in recent years [4,5,6]. Prior works have focused on specific aspects of video retrieval such as handcrafted features [7,8,9], deep feature extraction [10, 11], indexing [11,12,13,14], retrieval modes such as text-based, image-based, or clip-based search [9, 15, 16], action recognition [17, 18], and object/face recognition [19]. Other works address similarity search [20, 21] or specific execution models like batch [19, 22] and stream [20, 23] processing. Some systems adopt distributed architectures [20, 24], but none offer a unified solution that supports distributed, multi-modal, and real-time CBVR. This fragmentation reveals a significant gap: existing systems fail to offer general-purpose, scalable, and low-latency CBVR frameworks that address end-to-end video analytics workflows.

To address this, the lambda architecture paradigm [25] offers a promising foundation for scalable video analytics and has attracted both industry and research attention [26,27,28,29]. However, its application to CBVR remains underexplored [30, 31], and adapting it to handle video-specific challenges like feature extraction and indexing remains non-trivial. Although distributed processing engines such as Spark [3] enable in-memory distributed computing, they lack native support for video processing and deep learning workloads. This limitation poses a key challenge in effectively extending and optimizing Spark to handle large-scale, low-latency video analytics pipelines for both streaming and batch videos.

Similarly, distributed streaming systems [32] are widely used for real-time, fault-tolerant message streaming. Their application, however, is non-trivial in video big data workflows. As such, integrating such systems into a CBVR system requires significant architectural adaptation to ensure efficient stream acquisition, buffering, and delivery of video and feature data. Moreover, the widespread use of Convolutional Neural Networks (CNNs) for video analysis brings significant challenges related to feature diversity. Different CNN architectures generate different feature representations, such as global features, object features, and temporal semantics, which differ in scale, dimensionality, and structure. This variety makes it harder to develop unified indexing and retrieval strategies in distributed CBVR systems.

Implementing big data technology stacks for end-to-end video analytics encounters several key system-level bottlenecks. These include issues with orchestration, scalability, inefficiencies in acquiring real-time streams, challenges in distributed processing, and the computational overhead of indexing and querying high-dimensional deep features. This highlights the need for a complete and unified architectural framework that can effectively support both real-time and batch processing of large-scale video data in distributed settings.

While it is theoretically possible to connect different systems for each workload, this often creates significant architectural and operational challenges. Batch systems expect static data and global aggregation, while streaming systems need incremental updates and quick processing. These approaches differ not only in data models and state handling but also in how they maintain consistency. Simple integration results in duplication, synchronization issues, and inconsistent retrieval results. The lambda architecture partially solves this by defining separate batch and stream layers that meet at a serving layer. We build on this idea to unify not just ingestion and feature extraction, but also indexing and querying from both real-time streams and offline video data. This architecture bridges the batch-stream divide, enables scalable distributed processing, and supports low-latency video search in dynamic environments.

In this work, we present \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a scalable distributed framework for CBVR based on the lambda architecture to support both batch and real-time processing. \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers, the Video Big Data Layer (VBDL), the Video Big Data Analytics Layer (VBDAL), and the Web Service Layer (WSL), which manage the acquisition, structural analysis, deep feature extraction, and indexing of large-scale video data. We develop the Distributed Encoded Deep Feature Indexer (DEFI) to encode deep global and object-level features for indexing, facilitating efficient multimodal retrieval. The design philosophy of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) reflects the practical challenges of merging streaming and batch processing.

The remainder of this paper is organized as follows: Section "Related Work" discusses related work in CBVR, big data processing, and indexing. Section "Research Objectives" outlines the research objectives and architectural challenges. Section "Proposed Architecture" presents the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) architecture in detail. Section "Evaluation and Discussion" describes the experimental setup, datasets, and evaluation metrics, followed by the performance evaluation. Finally, Section "Conclusion and Future Work" concludes the paper.

Related work

Several approaches have been proposed to address CBVR, focusing on feature extraction, indexing, and system scalability. These methods span from traditional handcrafted descriptors to more recent deep learning and cloud-based techniques. However, most prior works address isolated components of the pipeline rather than providing an integrated architecture suitable for large-scale, hybrid video processing.

Traditional CBVR systems relied on handcrafted features and static metadata. Amato et al. [13] proposed a large-scale system for retrieval using text, color histograms, and object detection. Sun et al. [33] presented a framework for segment-level video retrieval. Li et al. [15] proposed Sentence Encoder Assembly to support text-to-video matching. These systems, while effective for specific retrieval tasks, do not address scalability or unified system architecture. Other efforts explored temporal and semantic representations. Shang et al. [23] used frame-level gray-scale intensities and time-oriented video structures but incurred high parallel processing costs. Sivic et al. [34] developed classifiers for character-specific tracking in television content, and Song et al. [8] applied multi-feature hashing in Hamming space. Lai et al. [9] combined appearance and motion trajectories for object-centric retrieval, while Araujo et al. [35] performed large-scale retrieval using image queries. These works generally assume static datasets and lack mechanisms for real-time or adaptive stream processing. Zhang et al. [12] created a video frame information extraction model based on a CNN and Bag of Visual Words for large-scale video retrieval with image queries. Such conventional CBVR approaches are not effective candidates for video big data; moreover, they offer limited or no support for near-real-time video streams.

Recent efforts have looked into big data technologies to improve scalability in CBVR. Saoudi et al. [36] developed a distributed system to create compact video signatures using motion and residual features along with clustering-based indexing. Gao et al. [10] introduced a cloud-based actor identification framework that keeps spatial coherence in feature encoding. Wang et al. [21] used MapReduce and GPU acceleration for near-duplicate retrieval with a Hessian-Affine detector [37], encoded through Bag of Features and stored on HDFS. Zhu et al. [20] presented Marlin, a streaming pipeline for distributed indexing and micro-batch feature extraction; however, it does not support batch analytics or generalized features. Muhling et al. [17] use a cluster computing approach based on the Distributed Resource Management Application API [24] and deep learning to propose a CBVR system for professional film, television, and media production. Lin et al. [38] developed a cloud-based facial retrieval system, and other approaches [19, 22, 39] also target cloud-based CBVR. These approaches, however, are task-specific and lack architectural integration.

Indexing is crucial in information retrieval systems, and various indexing strategies have been proposed for video retrieval. Chen et al. [40] convert spatial-temporal features into compact binary codes by training hash functions with supervised hashing. Liu et al. [41] index features on a large scale by encoding each image with a pre-trained CNN and constructing a visual dictionary of codewords; they then design a hash-based inverted index for retrieval. However, this approach suffers from linear growth in search time with increasing data volume and lacks scalability for distributed or cloud environments. Amato et al. [42] proposed a permutation-based strategy for CNN features, where permutations are created by combining a collection of metric space reference objects. However, it necessitates calculating the distances between pivots and targets, which is time-consuming in the case of deep features. Anuranji et al. [43] argue that existing hashing methods inadequately represent video frame features and fail to exploit temporal information, leading to significant feature loss during dimensionality reduction. They propose a hashing mechanism that captures both spatial and temporal features efficiently using stacked convolutional networks and bi-directional learning for improved feature extraction.

Despite these advances, several fundamental challenges remain unaddressed. First, most existing CBVR systems lack a unified architecture that supports both real-time streaming and offline batch video processing, leading to fragmented pipelines and redundant indexing logic. Second, current indexing methods often struggle to scale with high-dimensional, heterogeneous features (e.g., spatial, temporal, object-level), limiting their applicability across varied video analytics tasks. Third, many systems are not designed for distributed, cloud-native environments, making them difficult to deploy or extend at scale.

Table 1 Feature-wise comparison of the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with the state-of-the-art.

To address these limitations, we introduce \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a scalable and fault-tolerant system based on the lambda architecture that combines streaming and batch pipelines. Table 1 compares key CBVR systems alongside our platform, emphasizing the architectural and operational aspects important for scaling video retrieval. Lambda Architecture refers to systems that integrate batch and stream processing layers to ensure both low-latency and accurate results. A Scalable System can manage increasing data volumes and user demands without losing performance. A Pluggable Framework makes it easy to integrate new components or algorithms. Service Oriented means the system is divided into independent services, which improves maintainability and scalability. Big Data Store enables efficient storage and access to large video datasets. Distributed Messaging allows decoupled, fault-tolerant communication between components, which is essential for coordinating real-time data flows in streaming pipelines. An Incremental Index updates as new data arrives, avoiding expensive full re-indexing. Multi-Type Features indicates support for diverse feature types. Feature Encoding refers to the transformation of computed features for efficient indexing. Batch Processing deals with large amounts of offline video data. Stream Processing receives and analyzes video streams in real time. A Multistream Environment refers to the ability to acquire and process video data from a diverse range of connected sources. Query By Text allows users to find videos using text input, while Query By Image and Query By Clip enable visual input in content-based searches.

Research objectives

The main research objectives of this work are listed below:

  • Lambda-architecture based CBVR: To design a scalable system for video big data indexing and retrieval that leverages lambda architecture principles to process, index, and retrieve both streaming and batch video data. The integration of these two processing modes poses significant architectural challenges due to differences in latency requirements, data consistency, and processing semantics, challenges that are largely unaddressed in existing CBVR systems.

  • In-memory distributed video analytics: General-purpose Directed Acyclic Graph based distributed computing engines such as Spark lack native support for video analytics, including specialized data structures and video-specific semantics. This limitation forces developers to manage low-level complexities manually, which is labor-intensive, error-prone, and hinders scalability. Addressing this gap is significant for enabling efficient, high-throughput video processing in large-scale systems. Implementing an end-to-end, video-aware processing framework on top of such engines is a non-trivial task, but essential for supporting real-time and batch video analytics in distributed environments.

  • Deep feature indexer: The diversity of video features, including textual metadata, deep visual representations, and spatiotemporal patterns, poses a major challenge for CBVR. These features differ in structure, scale, and semantic meaning, making unified indexing complex. Moreover, the high-dimensional and dense nature of deep features significantly increases the computational cost of indexing and retrieval. Designing a scalable, distributed indexer that can efficiently handle and unify these heterogeneous feature types across both streaming and batch video data is a critical and non-trivial challenge with direct impact on retrieval accuracy and system performance.

  • High-level abstractions: To design and optimize high-level abstractions over big data technologies that support scalable video analytics while hiding the complexity of the underlying distributed computing stack. This objective aims to simplify the development of video processing pipelines by providing domain-specific constructs that encapsulate data ingestion, transformation, and feature extraction, enabling efficient deployment in both real-time and batch processing environments.

  • Bottlenecks analysis: To systematically identify and analyze performance bottlenecks that arise in distributed video analytics pipelines. These include challenges related to video storage and orchestration, scalability, real-time stream acquisition, and high-dimensional deep feature indexing. The goal is to understand the limitations of existing big data technologies when applied to video workloads, and to use these insights to guide the design of more efficient, scalable, and fault-tolerant CBVR systems. Special attention is given to how architectural choices affect latency, throughput, and resource utilization in both streaming and batch processing contexts.

The main research contributions of this work are as follows:

  • We propose a lambda-inspired layered architecture that supports both near real-time and batch video processing, enabling scalable and unified retrieval of video big data.

  • We design and implement an in-memory video analytics framework that introduces a high-level abstraction over distributed processing engines, facilitating efficient video operations such as frame extraction and deep feature computation.

  • We address the challenge of heterogeneous feature representation by introducing DEFI, a unified distributed indexer capable of indexing and retrieving multi-type deep features, including spatial, object-level, and temporal representations, from video sources.

  • We introduce a set of high-level abstractions built over big data technologies to simplify the construction of scalable video analytics pipelines. These abstractions encapsulate complex tasks such as data ingestion, transformation, and feature extraction, and are designed to support both streaming video (processed in real-time with low-latency constraints) and batch video (processed offline for large-scale analysis), ensuring flexibility and consistency across diverse workloads.

  • We develop and evaluate the proposed system using three benchmark video datasets, demonstrating its scalability and performance under real-world conditions. Our analysis highlights key system-level bottlenecks across storage, stream acquisition, distributed processing, and deep feature indexing.

Fig. 1
figure 1

The proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers. The initial layer, VBDL, orchestrates video big data across its life cycle, from acquisition and storage to archiving or obsolescence. The VBDAL performs structural analysis and mining operations on top of Spark, an in-memory processing engine that extends the MapReduce model for efficient computations. The WSL is the gateway that exposes the high-level features of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) as role-based web services, providing the proposed framework’s functionality across the web.

Proposed architecture

Here, we formally introduce the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) architecture, followed by detailed technical specifications in the subsequent sections. \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers: the Video Big Data Layer (VBDL), the Video Big Data Analytics Layer (VBDAL), and the Web Service Layer (WSL), as illustrated in Fig. 1. VBDL forms the base layer of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), managing the video big data lifecycle, from acquisition and storage to archiving or obsolescence. Within VBDL, we introduce the unified feature indexer DEFI, which encodes and indexes deep global and object features, referred to as Intermediate Results (IR) throughout this work. Additionally, the Structured Metastore manages structured metadata, including user information, video sources and configurations, video retrieval service subscriptions, application logs, and system workloads. VBDAL handles structural analysis and mining operations, generates query maps, and performs ranking operations. Lastly, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) consolidates all advanced functionalities into role-based web services through the WSL, which is built on top of both VBDL and VBDAL, enabling the proposed framework to operate seamlessly over the web. In the subsequent sections, we provide the technical details of each layer within the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\).

Fig. 2
figure 2

Video Stream Acquisition and Producer Service. (a) Internal logic and flow. (b) Data structures. SDS denotes the various connected Streaming Data Sources, and the MSG schema represents each Record.

Video Big Data Layer

VBDL is responsible for large-scale acquisition and management of batch video data and real-time video streams, along with indexing encoded deep features. The four key components of VBDL are the Video Stream Manager, Video Big Data Store, Structured Metastore, and DEFI.

Video Stream Manager

The source devices provide the real-time video streams, and executors perform on-the-fly processing and deep feature extraction on these streams. The role of this component is to acquire video streams in real time and orchestrate them. The Video Stream Manager comprises five modules: the Video Stream Acquisition Service (VSAS), Distributed Message Broker, Broker Client Services, Intermediate Results Manager, and Video Stream Consumer.

Algorithm 1
figure a

Video stream event producer

Video Stream Acquisition and Producer Service

Video Stream Acquisition and Producer Service (VSAPS) provides interfaces for the configuration and acquisition of video streams from the source devices. For a specific video source subscribed to the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), the VSAPS obtains the configuration metadata from the Structured Metastore and then configures the video streaming source device accordingly. Once configured, the video stream is decoded by the VSAPS, followed by the frame extraction process. Once the frames are extracted, scenario-dependent preprocessing operations are performed on them, including metadata extraction, frame correction, and resizing. Communication with the video stream source occurs through a JSON object comprising five fields: dataSourceID, the width and height of the frame, a timestamp, and the Payload, which is the actual frame data. We use the term Record for this JSON object. Each Record is subsequently sent to the Producer Handler after instantiation. Figure 2 and Algorithm 1 illustrate this process. The VSAPS then serializes the Records into mini-batches, compresses them using Snappy compression [46], and sends them to the corresponding Kafka topic in the Kafka Distributed Message Broker.
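To make this flow concrete, the following minimal sketch decodes a stream, wraps each frame in a Record, and publishes Snappy-compressed messages to a Kafka topic. It assumes the kafka-python client, OpenCV for decoding, and a broker at localhost:9092; the topic follows the UID_VDS convention, while the helper names and addresses are hypothetical.

```python
# Minimal sketch of the VSAPS Record flow (assumptions: kafka-python, OpenCV,
# a broker at localhost:9092; field names mirror the Record schema above).
import json
import time
import cv2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="snappy",                       # Snappy-compressed batches
    value_serializer=lambda r: json.dumps(r).encode("utf-8"),
)

def produce_stream(source_url: str, topic: str, data_source_id: str):
    """Decode a video stream, wrap each frame as a Record, and publish it."""
    cap = cv2.VideoCapture(source_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (480, 320))        # frame correction & sizing
        record = {
            "dataSourceID": data_source_id,
            "width": frame.shape[1],
            "height": frame.shape[0],
            "timestamp": time.time(),
            "Payload": cv2.imencode(".jpg", frame)[1].tobytes().hex(),
        }
        producer.send(topic, record)                  # handed to the Producer Handler
    cap.release()
    producer.flush()
```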

Fig. 3
figure 3

Workflow of VSAPS. Here UID, VDS, IR, and P represent user id, video stream data source id, intermediate results, and partition respectively.

Distributed Message Broker

The Distributed Message Broker contains topics, and each topic contains one or more partitions. Broker Client Services manages the topics and partitions in the Distributed Message Broker. Figure 3 illustrates the Broker Client Services, which comprise three sub-modules: Topic Manager, Partition Manager, and Replication Factor Manager. The Topic Manager sub-module dynamically creates new topics in the Broker Cluster when a registered user subscribes to a new video stream source. The naming convention for the topics is UID_VDS and UID_VDS_IR, where UID, VDS, and IR represent the unique user identifier, the video source identifier, and the intermediate results identifier, respectively. These identifiers are dynamically provided by the Structured Metastore. The actual video streams and the intermediate results computed from the video frames are held in these topics.

Video stream records are distributed across partitions within each topic, with the number of partitions proportional to the degree of parallelism and the overall throughput. When a new video stream source is registered with the framework, the Partition Manager creates a new partition under the relevant user’s topic. To ensure fault tolerance, the Replication Protocol manages partition replication across the brokers, and APIs in the Replication Factor Manager set the replication factor to optimize its use. Through experimentation, we found that a replication factor of 3 is both optimal and sufficient, provided it does not exceed the total number of Broker Servers, as depicted in the flowchart in Fig. 4.
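The sketch below illustrates the Topic Manager and Partition Manager logic under the UID_VDS naming convention, with the replication factor capped at min(3, number of brokers) as in Fig. 4. It assumes the kafka-python admin client; the function and variable names are hypothetical.

```python
# Minimal sketch of topic/partition management for a newly subscribed source.
from kafka.admin import KafkaAdminClient, NewTopic, NewPartitions

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

def register_stream_source(uid: str, vds: str, num_brokers: int, partitions: int = 1):
    """Create the raw-stream and IR topics for a newly subscribed source."""
    replication = min(3, num_brokers)                 # recommended factor, capped by broker count
    topics = [
        NewTopic(f"{uid}_{vds}", num_partitions=partitions,
                 replication_factor=replication),
        NewTopic(f"{uid}_{vds}_IR", num_partitions=partitions,
                 replication_factor=replication),
    ]
    admin.create_topics(topics)

def add_partition(topic: str, new_total: int):
    """Grow a topic's partition count when the degree of parallelism must increase."""
    admin.create_partitions({topic: NewPartitions(total_count=new_total)})
```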

Fig. 4
figure 4

Flowchart for replication factor

Video Stream Consumer

We acquire the incoming streams from the video sources as mini-batches that then reside inside their respective topics in the brokers. The Video Stream Consumer helps VBDAL read these mini-batches from their topics in the broker for IR extraction.

Intermediate Results Manager

The VBDAL generates deep features that require proper management and indexing. The extracted IR are published to and consumed from the topic \(UID\_VDS\_IR\) using the Intermediate Results Manager. The schema is composed of dataSourceId, frameId, timestamp, width, height, and frameData.

Video Big Data Store

The VBDL provides permanent, distributed persistence and storage space for video big data. The data is systematically stored and mapped in accordance with the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) ’s business logic. A User Space is allocated when a new user registers with the proposed system. The User Space is organized into a hierarchical subspace structure, with the owner holding specific read and write privileges. The subspace hierarchy under the User Space is mapped and synchronized with the user identification in the User Centric Metastore. The User Space is subdivided into two high-level subspaces: the Raw Video Space, where the video data is managed, and the Model Space, where the models trained and used for IR extraction are stored. The Raw Video Space is further subdivided into two subspaces: one for batch video data subject to batch analytics, and one for streaming data acquired from the connected streaming sources. The streams are timestamped for persistence, and the level of granularity for raw streams depends on the video data sources and the functional requirements of the users.

Since streams are acquired from heterogeneous connected sources, their storage and access must be orchestrated efficiently within the proposed system. To handle this issue, we carefully design the Video Stream Unit (VDSU). We define the segment size \(\eta\) as a regularizing factor for the VDSU; it is tightly related to the HDFS block size. Choosing a reasonable value for \(\eta\) is important because it is the controlling factor of the VDSU. If \(\eta\) is small, many segments are generated, which imposes significant overhead on the NameNode as well as network traffic. On the other hand, if \(\eta\) is large, processing performance suffers because more memory is required and the granularity level increases; moreover, the segment spans multiple blocks stored on different data nodes, incurring additional overhead. The incoming frames from the heterogeneous connected sources are appended to a segment until the segment size reaches \(\eta\). Sufficient metadata is added to the segment, enabling it to be processed without loading the entire stream. The segment is then stored in the VDSU, as illustrated in Fig. 5.
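The following sketch shows the VDSU accumulation logic: frames are appended to a segment until its size reaches \(\eta\), segment-level metadata is attached, and the segment is flushed to storage. It assumes \(\eta\) is expressed in bytes and aligned with the HDFS block size; write_segment_to_hdfs is a hypothetical stand-in for the actual HDFS writer.

```python
# Minimal sketch of VDSU segment accumulation (eta in bytes, HDFS writer injected).
import json

class VideoStreamUnit:
    """Accumulates incoming frame Records into segments of at most eta bytes."""

    def __init__(self, eta_bytes: int, write_segment_to_hdfs):
        self.eta = eta_bytes
        self.write = write_segment_to_hdfs
        self.frames, self.size = [], 0

    def append(self, record: dict):
        payload = bytes.fromhex(record["Payload"])
        self.frames.append(record)
        self.size += len(payload)
        if self.size >= self.eta:                 # segment reached eta: flush it
            self.flush()

    def flush(self):
        if not self.frames:
            return
        segment = {
            # segment-level metadata so the segment can be processed on its own
            "dataSourceID": self.frames[0]["dataSourceID"],
            "start_ts": self.frames[0]["timestamp"],
            "end_ts": self.frames[-1]["timestamp"],
            "num_frames": len(self.frames),
            "frames": self.frames,
        }
        self.write(json.dumps(segment).encode("utf-8"))
        self.frames, self.size = [], 0
```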

Fig. 5
figure 5

Video stream unit (VDSU).

Structured Metastore

The Structured Metastore manages the structured data of the proposed system, which includes the User Centric Metastore, Video Data Source Metastore, Subscription Metastore, System Logs, and Sys Configs. The User Centric Metastore maintains user logs and related information. We deploy a customized salted encryption scheme based on [47] to encrypt the user information for security purposes. Furthermore, the proposed framework uses the Video Data Source Metastore to handle two types of video sources: stream sources and datasets. The Video Data Source Metastore manages the meta-information and access rights for these sources. After registering a data source with \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), it can be subscribed to for video indexing and retrieval. Such subscriptions are managed through the Subscription Metastore. The System Logs and Sys Configs Metastores store the logs generated by the proposed system and its configurations, respectively. Finally, the DIO Reader and Writer modules enable users to access the underlying data from the distributed and Structured Data Stores to manage and operationalize the data.

Distributed Encoded Deep Feature Indexer

We designed DEFI as a unified indexing mechanism for \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), composed of four modules: Feature Encoder, IR Aggregator, IR Indexer, and Query Handler, as shown in Fig. 6.

Fig. 6
figure 6

IR Indexer: A unified deep feature indexing and retrieval component in \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\). The acquired deep features are transformed, aggregated, and indexed in the inverted index, where they immediately become available for retrieval.

To avoid multiple indexing schemes, we designed a unified indexer that facilitates indexing and querying multiple types of features. However, because of their high dimensionality and dense nature, incorporating deep features directly into an inverted index-based system poses a significant challenge. Thus, we designed a Feature Encoder motivated by [48] to transform the deep features into a representation most suitable for the inverted index-based system. Given a feature vector \(f_v \in {\mathbb {R}}^D\), we use a transformation function to make it compatible with our inverted index system.

Table 2 The \({\mathcal {Q}}\) factor and its effect on Empty Yield.

For feature encoding, [48] uses a hard-coded constant \({\mathcal {Q}}\) to quantize the feature vectors. The drawbacks of this approach are two-fold. First, finding the optimal value of the quantization factor is difficult: since \({\mathcal {Q}}\) is a magic number, it must be determined by trial and error. Second, if \({\mathcal {Q}}\) is small, the floor function maps features whose values are near zero in every dimension to empty (null) encodings, so the corresponding videos are missed. On the other hand, the encoding size is proportional to \({\mathcal {Q}}\): a larger value of \({\mathcal {Q}}\) yields larger encodings. Table 2 summarizes the effect of this empty yield on 1M features extracted from the Youtube-8M [49], V3C1 [50], and Sports-1M [51] datasets using the VGG-19 prediction layer.

In our proposed approach, we replace the fixed quantization factor \({\mathcal {Q}}\) with an adaptive factor q, defined as:

$$\begin{aligned} q = e \cdot \log (D) \end{aligned}$$
(1)

where \(D\) is the dimensionality of \(f_v\) and e is Euler's number. This adaptive quantization factor prevents under- or over-sampling of features in the inverted index. For each component \(f_{v,d} \in f_v\) (where \(d = 1, 2, \ldots , D\)), the encoding transformation is:

$$\begin{aligned} s_{v,d} = \left| \left\lfloor q \cdot f_{v,d} \right\rfloor \right| \end{aligned}$$
(2)

After quantization, we prune zero or near-zero values:

$$\begin{aligned} \text {tf}_{v} = \{ s_{v,d} \mid s_{v,d} > 0 \} \end{aligned}$$
(3)

This step helps reduce the data’s size, removing dimensions that contribute little to the overall representation and thereby improving storage efficiency in the index. Each transformed component \(s_{v,d}\) is further converted to a hexadecimal representation, \(\text {Hex}(s_{v,d})\), and repeated \(s_{v,d}\) times:

$$\begin{aligned} encoding (S_{v,d}) = \underbrace{\text {Hex}(s_{v,d}) \, \text {Hex}(s_{v,d}) \, \dots \, \text {Hex}(s_{v,d})}_{s_{v,d} \text { times}} \end{aligned}$$
(4)

Finally, the encoded feature vector \({\mathbb {F}}_v\) is assembled by concatenating the hexadecimal values across all dimensions. The purpose of hexadecimal encoding in the proposed DEFI is to create a compact, invertible representation of high-dimensional, dense feature vectors. Encoding each dimension of the feature vector in hexadecimal allows us to efficiently store, retrieve, and index these transformed features.
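A minimal sketch of this encoding pipeline (Eqs. 1–4) is given below, assuming a dense feature vector such as a VGG-19 prediction-layer output; NumPy is the only dependency, and the function name is illustrative.

```python
# Minimal sketch of the DEFI feature encoding (quantize, prune, hex-encode).
import math
import numpy as np

def encode_feature(f_v: np.ndarray) -> str:
    """Transform a deep feature vector into a token string for the inverted index."""
    D = f_v.shape[0]
    q = math.e * math.log(D)                      # adaptive quantization factor (Eq. 1)
    s = np.abs(np.floor(q * f_v)).astype(int)     # quantized components (Eq. 2)
    tokens = []
    for s_vd in s:
        s_vd = int(s_vd)
        if s_vd > 0:                              # prune zero / near-zero terms (Eq. 3)
            # hex token repeated s_vd times, so term frequency encodes magnitude (Eq. 4)
            tokens.extend([format(s_vd, "x")] * s_vd)
    return " ".join(tokens)

# Example: a 4096-dimensional feature vector
feature = np.random.rand(4096).astype(np.float32)
print(encode_feature(feature)[:80], "...")
```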

Algorithm 2
figure b

Global features transformation and encoding

The proposed DEFI aggregates multiple types of features, represented as:

$$\begin{aligned} F_v = \begin{bmatrix} F_v^{\text {spatial}}, F_v^{\text {object}}, F_v^{\text {textual}} \end{bmatrix} \end{aligned}$$
(5)

where \(F_v^{\text {spatial}}\), \(F_v^{\text {object}}\), and \(F_v^{\text {textual}}\) are the encoded feature vectors for spatial, object, and textual features, respectively. These are indexed collectively in the distributed inverted indexer. Algorithm 2 outlines the proposed DEFI.

In distributed systems, managing the query load across multiple servers (in this case, the N servers of the distributed inverted indexer) is crucial for performance, scalability, and reliability. The Weighted Round Robin (WRR) algorithm is an effective way to allocate queries based on the capacity of each server, ensuring that servers with more resources handle a greater proportion of the load. The weight of each server is computed as:

$$\begin{aligned} W_i = \frac{C_i}{\sum _{j=1}^N C_j} \end{aligned}$$
(6)

where \(W_i\) is the weight assigned to server i based on its computational power \(C_i\). Queries are then dispatched in proportion to these weights, balancing the workload dynamically.
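The sketch below illustrates weighted round-robin dispatch following Eq. (6): each server appears in the dispatch cycle in proportion to its capacity \(C_i\). The server names and capacities are hypothetical.

```python
# Minimal sketch of weighted round-robin query dispatch (Eq. 6).
import itertools

def build_schedule(capacities: dict) -> list:
    """Expand per-server capacities into a dispatch cycle proportional to W_i."""
    schedule = []
    for server, c_i in capacities.items():
        schedule.extend([server] * c_i)            # server i appears C_i times per cycle
    return schedule

# relative computational power C_i of the index servers (illustrative values)
servers = {"indexer-1": 4, "indexer-2": 2, "indexer-3": 1}
dispatch = itertools.cycle(build_schedule(servers))

for query_id in range(7):
    print(query_id, "->", next(dispatch))          # queries spread 4:2:1 across servers
```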

Fig. 7
figure 7

MapReduce-based workflow for batch video analytics. In this figure, MR represents MapReduce. VidRDD, FeRDD, and EFeRDD represent the Video, Feature, and Encoded Feature abstractions on top of Spark’s native RDD. (a) Step-by-step workflow. (b) Data representation inside VidRDDs. (c) Processing carried out on each node

Video Big Data Analytics Layer

We designed VBDAL comprising four components: Video Grabber, Structure Analyzer, Feature Extractor, and Query Handler. We utilize Spark [3] for distributed video processing and analytics. At the time of this writing, Spark has no native support for video data handling. To address this issue, we first build the Video Resilient Distributed Dataset (VidRDD), a unified API wrapper over Spark’s Resilient Distributed Dataset (RDD) that can be integrated with both Spark and Spark Streaming. We load the video data as a binary stream into the VidRDD. All operations of VBDAL are performed on our proposed VidRDD, as shown in Fig. 7.
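As an illustration of this abstraction, the sketch below wraps videos loaded via Spark’s binaryFiles as an RDD of (path, bytes) pairs and exposes a frame-extraction transformation. It assumes PySpark and OpenCV are available on the workers; the class is an illustrative stand-in for VidRDD, not its actual API.

```python
# Minimal sketch of a VidRDD-style wrapper over Spark's binaryFiles.
import tempfile
import cv2
from pyspark.sql import SparkSession

class VidRDD:
    """Holds (file_path, raw_bytes) pairs of video files as a Spark RDD."""

    def __init__(self, spark: SparkSession, path: str):
        self.rdd = spark.sparkContext.binaryFiles(path)   # one record per video file

    def extract_frames(self, step: int = 30):
        """Decode videos on the workers and emit every `step`-th frame."""
        def decode(record):
            path, raw = record
            with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
                tmp.write(raw)
                tmp.flush()
                cap = cv2.VideoCapture(tmp.name)
                idx = 0
                while cap.isOpened():
                    ok, frame = cap.read()
                    if not ok:
                        break
                    if idx % step == 0:
                        # (path, frame number, width, height, pixel data)
                        yield (path, idx, frame.shape[1], frame.shape[0], frame)
                    idx += 1
                cap.release()
        return self.rdd.flatMap(decode)                   # FrameRDD-like output

spark = SparkSession.builder.appName("vidrdd-sketch").getOrCreate()
frames = VidRDD(spark, "hdfs:///user/demo/raw_videos/*.mp4").extract_frames(step=30)
print(frames.take(1))
```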

Video Grabber

The Video Grabber component comprises two modules: Streaming Mini-batch Reader and Batch video Loader. The Streaming Mini-batch Reader allows us to subscribe to distributed broker’s topics to acquire and load video mini-batches into distributed main memory (as shown in Algorithm 1) for near real-time video analytics. Likewise, the Batch video Loader performs the loading of batch video data as VidRDD into distributed main memory from the Video Big Data Storage (Raw Video Space) for batch video analytics.

Structure Analyzer

Once the VidRDD is initialized and populated with video data, structure analysis operations are carried out by the Structure Analyzer which in most cases include frame extraction, metadata extraction, and preprocessing. At this stage, the video data resides in the distributed main memory as VidRDD. The Metadata Extractor, Frames Extractor, and Preprocessor components operate in an inline manner, as shown in Fig. 7 (MR1) and in Algorithm 3 (step 2–6) for batch structure analysis and Algorithm 4 for Streaming Video Data in near real-time structure analysis.

Algorithm 3
figure c

Distributed batch video processing

Algorithm 4
figure d

Video stream processing & analytics

Metadata extraction is vital for indexing and retrieving video data and video frames. Metadata falls into two categories. General metadata contains information about the video file or stream, such as the format (MPEG-4, MKV, etc.), codec, file size, duration, bitrate mode (constant or variable) and overall bitrate, timestamp, etc. Video metadata includes the format, profile, actual bitrate, width and height of the frame, display aspect ratio, frame rate mode (constant or variable), actual frame rate, color space, bit depth, and stream size. This metadata helps in understanding the video content. For batch video data, the metadata is extracted directly from the video data [52]. In the case of video streams, the metadata packed within the Kafka messages is used, as shown in Algorithm 4 (steps 2–8).
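A minimal sketch of batch metadata extraction is shown below, assuming ffprobe (from FFmpeg) is available on the worker nodes; the extractor actually used in this work [52] may differ, and the returned field selection is illustrative.

```python
# Minimal sketch of batch metadata extraction via ffprobe (assumed available).
import json
import subprocess

def extract_metadata(video_path: str) -> dict:
    """Return selected general and video-stream metadata for a video file."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", video_path],
        capture_output=True, check=True, text=True,
    )
    probe = json.loads(out.stdout)
    video = next(s for s in probe["streams"] if s["codec_type"] == "video")
    return {
        # general metadata
        "format": probe["format"]["format_name"],
        "duration": float(probe["format"]["duration"]),
        "size_bytes": int(probe["format"]["size"]),
        "overall_bitrate": int(probe["format"]["bit_rate"]),
        # video metadata
        "codec": video["codec_name"],
        "width": video["width"],
        "height": video["height"],
        "frame_rate": video["avg_frame_rate"],
        "bit_depth": video.get("bits_per_raw_sample"),
    }
```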

The Frame Extractor performs the video frame extraction operation on the VidRDD. Once the frames are extracted, the VidRDD is transformed into a FrameRDD. The FrameRDD holds attributes such as FrameNo, TimeStamp, Dimensions (width and height), RGB channels, and the actual payload. In the case of streaming video, this operation is performed by the VSAPS, as described in the "Video Stream Manager" section.

Since video content is often noisy due to non-uniform lighting conditions, objects in an occluded context, and other conditions, the extracted video frames must first be preprocessed before inputting them into the convolutional neural network. The Preprocessor component handles operations like frame resizing, downsampling, background subtraction, color and gamma correction, histogram equalization, and exposure and white balance. Once the structural analysis is performed the next step is IR extraction.

IR Extractor

In the context of CBVR, the IR are of two types: deep global features, and object features, which are computed by the Deep Global Feature Extractor and Object Extractor, respectively.

Deep Global Feature Extractor

Global features provide a good description of image content and are typically represented by a single feature vector. The Deep Global Feature Extractor employs VGG-19 [53] to compute global spatial features from video frames.

We use video keyframes for feature extraction for two reasons: first, it reduces the computational overhead; second, keyframes are sufficient for a global representation of the video because consecutive frames between two keyframes differ little. For batch video processing, we load the video data into the VidRDD, which resides in distributed memory. Structure analysis (decomposition into frames, metadata extraction, etc.) on the VidRDD is performed by the Structure Analyzer. The Preprocessor then performs pre-processing operations such as downsampling, resizing, and histogram equalization on the extracted frames to improve object precision and accelerate deep feature extraction.

We compute spatial features after preprocessing using the VGG-19-based Deep Global Feature Extractor. These features are stored in the FeatureRDD (FeRDD), as shown in Fig. 7; the global features computed by the Deep Global Feature Extractor and the objects detected by the Deep Object Extractor are both kept in the FeRDD. Finally, we encode and aggregate the features in the FeRDD as described in the "Distributed Encoded Deep Feature Indexer" section, and immediately send them to the Feature Indexer, which indexes them in DEFI.
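The sketch below illustrates distributed global feature extraction, running VGG-19 once per partition over keyframes. It assumes PyTorch and torchvision are installed on the workers; `keyframes` stands for a FrameRDD-like RDD of (frame_id, BGR array) pairs and is hypothetical.

```python
# Minimal sketch of per-partition VGG-19 global feature extraction.
import numpy as np
import torch
from torchvision import models, transforms

def extract_partition(frames):
    """Run VGG-19 on every keyframe of one partition and yield feature vectors."""
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
    prep = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    with torch.no_grad():
        for frame_id, bgr in frames:
            rgb = np.ascontiguousarray(bgr[:, :, ::-1])   # BGR -> RGB
            logits = model(prep(rgb).unsqueeze(0))        # prediction-layer output
            yield frame_id, logits.squeeze(0).numpy()

# features = keyframes.mapPartitions(extract_partition)   # FeRDD-like RDD
```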

Deep Object Extractor

Similar to the Deep Global Feature Extractor, the object features are computed from the frames residing in the FrameRDD, as shown in Fig. 7 and Algorithm 3 (steps 6–10). We compute the object features from the keyframes using YOLO [54]. The object information is properly structured and stored alongside the deep global features in the FeRDD. To maintain the semantics, we create a list-style semantic structure depicting the transitions of objects across frames.

Fig. 8
figure 8

Query execution plan against three types of queries

Query Handler

The purpose of the Query Handler is to receive user queries, invoke the APIs of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) against those queries, and return the results to the users. Upon receiving a request, the Query Generator builds an execution plan and executes it. For user assistance, we generate query maps using the Query Map Generator and return them to the user. The query execution plan, outlined in Fig. 8, depends on the type of query. A textual query is transformed into the DEFI format and executed directly. For image and video-clip queries, structure analysis and feature extraction are performed first, and the resulting features are then transformed and executed by DEFI. The top-k results are ranked and returned to the user.

Evaluation and discussion

In this section, we describe the experimental configuration for \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), encompassing the datasets, performance parameters, and the evaluation experiments. The assessment of video retrieval performance involves testing both the accuracy and the speed of video retrieval.

Fig. 9
figure 9

Experimental in-house cluster setup

Table 3 Topics for Video Stream Services and IR management. Parts indicate the number of partitions

Experimental setup

We set up an in-house distributed cloud environment with ten computing nodes for testing and evaluation. The cluster configuration and specifications for each node are detailed in Fig. 9. The Cluster consists of five types of nodes: \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server, Worker Agents, HDFS Server, Solr Server, and Broker Servers. The \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server hosts VSAS, Video Stream Producer (VSP), and the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Web Server, along with Ambari Server [55] for cluster management and configuration.

The HDFS Server (Agent-1) hosts the HDFS NameNode, Spark2 History Server, YARN Resource Manager, and ZooKeeper Server. The Worker Agents consist of four different types of agents that host the components of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), namely VBDL (Video Stream Consumer, Video Big Data Store, and DEFI) and VBDAL (Structure Analyzer and deep feature extractor services). The real-time and offline video analytics are carried out by these nodes. For clarity, we used these Worker Agents first for near real-time evaluation and then for batch evaluation.

The Broker Servers are configured on agents six, seven, and eight to buffer large-scale video streams and the IR (the deep features). Agent nine deploys the Solr Server and the Structured Metastore schema. Some nodes in Fig. 9 have client and data node services configured: the clients are instances of ZooKeeper, YARN, Spark, and Solr, and the data node is an HDFS DataNode. Table 3 shows the topics for Video Stream Services and IR management.

Datasets

We evaluated \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) on three datasets: V3C1 [50], YouTube-8M [49], and Sports-1M [51]. V3C1 consists of 7,475 Vimeo clips with an average duration of eight minutes, spanning around 1000 hours (1.3 TB in size); video metadata such as keywords, descriptions, and titles is available in a JSON file. YouTube-8M contains approximately 6 million video IDs with high-quality machine-generated annotations drawn from a diverse vocabulary of over 3800 visual entities. Each video within this dataset has a duration ranging from 120 to 500 seconds. Sports-1M contains 1 million YouTube videos categorized into 487 sports categories.

Video Big Data Layer evaluation

We do a thorough analysis of the various Video Big Data Layer components, including DEFI, Video Stream Manager, and Video Big Data store.

Video Stream Acquisition Service performance evaluation

As the VSAS is agnostic to heterogeneous devices, we registered several heterogeneous devices along with offline video stream sources. These devices have different frame rates: cellphones at 60 fps, and others such as a depth camera [56], an IP camera [57], and an RTSP source [58] at 30 fps. A video file saved on secondary storage was also tested with the VSAS sub-component for offline video analytics.

Fig. 10
figure 10

Video big data store performance evaluation: (a) Video stream acquisition and synchronization benchmarking (b) VSAS and video stream consumer service performance

The VSAS set the resolution of acquired frames to \(480 \times 320\) pixels, so each captured frame was 623.584 KB in size. The VSAS transforms the acquired frames into messages in 6 ms on average and forwards them to the VSP. The VSP then compresses each message to 140.538 KB on average before sending it to the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker Server, taking 12 ms per message on average. Both of these modules are deployed on the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server for collecting streams and transferring them to the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker Servers. The improvements can be seen in Fig. 10(a): we achieved average frame rates of 30 and 58 fps from heterogeneous real-time and offline sources, respectively, comfortably above the 25 fps typically required for real-time video analytics.

We also assessed the scalability of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) by increasing the number of video sources from five to 140 on the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server. At the same time, this experiment stressed the VSP and VSAS sub-components by increasing the average number of messages to 54 per second for each video stream. Furthermore, we acquired and produced streams from 70 devices at 40 messages per stream using the VSAS and VSP modules, which is a significant boost in performance. However, adding more streaming devices results in performance degradation, so we recommend connecting up to 70 sources per system for better performance. In our case, we can scale further by adding more brokers and producers.

Table 4 Performance evaluation of VSCS in terms of messages per second on a single thread (ST) and maximum (optimal) number of threads (MT)

Video Stream Consumer Services performance evaluation

In this section, we evaluate Video Stream Consumer Services (VSCS) performance which is responsible for acquiring video streams from the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker server as mini-batches for analytics. In the context of near real-time analytics, mini-batch size is significant and is determined by the MAX_REQUEST_SIZE_CONF parameter. We established four distinct scenarios for the evaluation, namely SCN-1, SCN-2, SCN-3, and SCN-4, as shown in Table 4. We configured the mini-batch size to be 4096, 6144, 8192, and 10,240 KB respectively. A single thread on a single worker machine achieves average message processing rates of 50, 88, 101, and 166 messages per second for the respective scenarios with synchronous replication. We deploy 22 threads to receive mini-batches from the Broker servers in each scenario. In SCN-1, the optimal performance is attained with 20 threads, achieving a reception rate of 814 messages per second. Likewise, in SCN-2, SCN-3, and SCN-4, the best performance is achieved with 13, 9, and 7 threads, respectively, as shown in Table 4. Additional threads do not contribute to increased performance. Figure 10(b) shows the impact of the Video Stream Consumer Service in the production environment, where the messages sent on UID_1 and UID_2 nearly coincide with those received.

Fig. 11
figure 11

Video big data store performance evaluation: (a) Performance of distributed persistent big data store (active data writer) (b) Performance evaluation of passive data reader and writer

Video Big Data Store performance evaluation

We conducted experiments on both the Active Data Reader & Writer and the Passive Data Reader & Writer. The Active Data Writer consumes the video stream from the topic UID_VDS (Broker server) and stores it in distributed persistent storage. Figure 11(a) illustrates the Active Data Writer performance. The results indicate that the Active Data Writer ensures proper data distribution as well as data locality. Similarly, Fig. 11(b) shows the disk write counts correlated with the bytes aggregated across HDFS hosts during three hours of active write operations.

Likewise, we evaluate the performance of the Passive Data Reader and Writer operations (shown in Fig. 11(b)) on batch video data. These operations were conducted for five different batch video sizes, i.e., 1024 MB, 5120 MB, 10240 MB, 15360 MB, and 20480 MB. The outcomes reveal that the write operation outpaces the read operation, and the time for both read and write grows in proportion to the volume of batch video data.

Distributed Encoded Deep Feature Indexer evaluation

We evaluated the indexing performance in terms of feature encoding size, concurrent queries and their response time with and without load balancing, the effect of the batch commit operation, and retrieval time. Our findings indicate that the feature encoding in our proposed system performs strongly in all of these aspects. Figure 12(a) compares the index storage sizes of the deep global features and their encoded counterparts. The size of the global features without encoding exhibits linear growth, while the size of the encoded features remains consistently small even as the number of features increases linearly. The compact size not only reduces indexing time but also lowers query latency, thereby enhancing overall system performance. Without encoding, the preferred similarity measure for CNN features is cosine similarity, as outlined in Eq. (7). However, cosine similarity is computationally expensive when dealing with very large numbers of features. The TF-IDF similarity measure of the inverted index not only boosts the performance of the similarity computation significantly but also improves video retrieval accuracy.

Fig. 12
figure 12

Distributed encoded deep feature indexer evaluation: (a) Feature index size comparison (b) Feature load balancing (c) Feature Batch size time (d) Feature query vs time

$$\begin{aligned} d\left( f,q\right) =\frac{f\cdot q}{|f| \times |q| }=\frac{\sum ^{n}_{i=1}f_{i}\times q_{i}}{\sqrt{\sum ^{n}_{i=1}f^{2}_{i}}\times \sqrt{\sum ^{n}_{i=1}q^{2}_{i}}} \end{aligned}$$
(7)

Figure 12(b) shows the query performance of our system. The proposed framework demonstrates commendable query performance under concurrent queries. Moreover, the load balancing mechanism substantially reduces retrieval time. Figure 12(c) shows the feature indexing time on the Youtube-8M, V3C1, and Sports-1M datasets. Specifying the ideal batch size is crucial due to the computational and IO cost associated with commit operations after each batch. If the batch size is too small, more commit operations are required, leading to increased feature indexing time. Conversely, excessively increasing the batch size does not enhance indexing efficiency. The experiments reveal that a batch size ranging from 500 to 1000 is optimal. Additionally, Fig. 12(d) depicts the query and retrieval time with respect to a query feature. Retrieval time grows with the number of videos in the result, so employing incremental retrieval is a prudent approach for latency reduction.

Fig. 13
figure 13

Video big data analytics performance evaluation on 1-Million frames of V3C1, Youtube-8m, and Sports-1M datasets: (a) Processing time and scalability testing (b) Accuracy results of keyword, image, and query clip

Video Big Data Analytics performance

We retrieved the top k videos based on a query video clip, image, or keyword, and then calculated the precision, recall, and accuracy as indicated by (8), (9), and (10), where True Positive (TP) represents the number of relevant videos retrieved, False Positive (FP) represents the number of non-relevant videos retrieved, True Negative (TN) represents the number of non-relevant videos not retrieved, and False Negative (FN) represents the number of relevant videos not retrieved.

$$\begin{aligned} Precision=\frac{TP}{TP + FP} \end{aligned}$$
(8)
$$\begin{aligned} Recall=\frac{TP}{TP + FN} \end{aligned}$$
(9)
$$\begin{aligned} Accuracy=\frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(10)
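For concreteness, the small helper below computes these metrics, together with the F1-score reported in Fig. 13(b), from the four counts; the numbers in the example call are illustrative only.

```python
# Worked example of Eqs. (8)-(10) plus F1 from retrieval counts.
def retrieval_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# illustrative counts for one top-k query
print(retrieval_metrics(tp=8, fp=2, tn=85, fn=5))
# precision = 0.80, recall ~ 0.62, accuracy = 0.93, f1 ~ 0.70
```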

Figure 13(a) shows the performance of feature and object extraction tasks using over 1 million keyframes from three datasets: Sports-1M, YouTube-8M, and V3C1. For all datasets, we observe a steady decrease in processing time as the number of computing nodes increases from 1 to 4, evidence of effective parallelism and workload distribution across nodes in our proposed system. For example, the total processing time for the Sports-1M dataset drops from over 600 minutes on a single node to under 200 minutes on four nodes. This threefold reduction shows that our system achieves near-linear scalability in practice. Overall, feature extraction using the VGG-19 model is more time-consuming than object extraction with YOLO, likely because of the deeper network structure and greater number of convolutional layers in VGG-19 compared to the optimized real-time design of YOLO.

Figure 13(b) shows the retrieval performance for three query types (text, image, and video clip) on the V3C1, YouTube-8M, and Sports-1M datasets. The evaluation uses four standard metrics: Accuracy, Precision, Recall, and F1-Score. Across the datasets, image-based retrieval consistently achieves the highest accuracy, suggesting that static visual features provide a strong semantic match for retrieval tasks. Video clip-based queries usually perform better in recall and F1-score, especially on the YouTube-8M and Sports-1M datasets, indicating that temporal features play a significant role in capturing relevant content. Text-based queries have relatively lower precision and recall, especially on large datasets like YouTube-8M, reflecting the inherent ambiguity of text-based queries and the semantic gap between keywords and visual content. While V3C1 shows fairly balanced results across all query types, YouTube-8M and Sports-1M show more variability, with text-based queries performing relatively weakly compared to image and video-clip queries. These differences underscore the importance of modality selection in retrieval systems and highlight the trade-offs between semantic specificity, content structure, and dataset characteristics.

Comparison with state-of-the-art

Our primary focus is on the architectural aspect of CBVR in a cloud environment with a complete end-to-end framework. To the best of our knowledge, there is no comparable work in the literature to date. In this section, we provide an in-depth feature-wise comparison of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with various state-of-the-art approaches in content-based video retrieval, as illustrated in Table 1. Our proposed system excels in multiple aspects, showcasing a comprehensive and unified approach to address the complexities of large-scale video analytics. Unlike [13], \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) adopts a Lambda Architecture, allowing it to efficiently handle both real-time and batch processing. The scalability of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) is evident through its scalable system architecture, pluggable framework, and service-oriented design, setting it apart from [44] and [36]. Additionally, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) leverages big data stores, distributed messaging, and incremental indexing, addressing the limitations of [38] and other baseline systems. The flexibility to process multi-type features, advanced feature encoding techniques, and support for both stream and batch processing contribute to the superior performance of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\). Notably, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) outperforms [7, 10, 17, 21, 45], and [20] by incorporating a multistream environment and providing robust support for querying by text, image, and clip. The combination of these features positions \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) as a highly efficient, versatile, and scalable solution for content-based video retrieval in large-scale environments.

Moreover, Table 5 presents a comprehensive performance comparison of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with several state-of-the-art approaches in content-based video retrieval. Our proposed system exhibits superior performance with a processing time of 10.91 s, outperforming baseline systems such as [44] (baseline 1) with a processing time of 500 s and [36] (baseline 2) with 12.8 s. Notably, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) achieves a remarkable accuracy of 0.93 and precision of 0.67, surpassing other methods in the evaluation metrics. The enhanced performance of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) can be attributed to its unified indexing mechanism (DEFI), which efficiently encodes deep features using a quantization method inspired by [48]. This encoding strategy not only avoids the drawbacks of a fixed constant quantization factor but also addresses the high dimensionality and dense nature of deep features. Additionally, the integration of Apache Spark, VidRDD, and a fine-tuned load-balancing mechanism contributes to the system’s scalability and efficient distributed processing. The adoption of VGG-19 and You Only Look Once (YOLO) v3 for deep feature extraction ensures the system’s ability to capture rich global spatial and object features, leading to improved accuracy in content-based video retrieval. Therefore, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) emerges as a robust and efficient solution for large-scale video analytics, offering significant advancements over existing state-of-the-art techniques.

Table 5 Platform performance comparison with State of the Art

Conclusion and future work

This paper introduces \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a feature-rich content-based video retrieval (CBVR) system designed for offline and near-real-time video retrieval from diverse sources, such as IP cameras. Tailored for cloud environments, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) emphasizes robustness, speed, and scalability. Built on lambda architecture principles, the system leverages distributed deep learning and in-memory computation for efficient video analytics and processing. The VidRDD abstraction serves as the core unit for distributed in-memory video analytics.

To enhance video retrieval, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) computes multi-type features and supports seamless integration, indexing, and retrieval of these features through an incremental distributed index. Retrieval is further optimized by generating a query execution plan before the actual query is triggered. The system’s distributed data management and in-memory computation technologies ensure efficient operations.

We evaluated \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) using three benchmark datasets: YouTube-8M, Sports-1M, and V3C1, focusing on key bottlenecks. Results demonstrate satisfactory performance in terms of scalability, efficiency, computation time, and accuracy. We believe this work contributes to the research community and the industry, offering a scalable and efficient video big data analytics solution.