Introduction

Video stream sources, such as CCTV, smartphones, drones, and other devices, play a crucial role in generating video content, prompting the development of systems for video content analysis, management, and retrieval. As online video traffic is projected to exceed a zettabyte annually by 2030, the need for a scalable and distributed Content-based Video Retrieval (CBVR) architecture is becoming crucial [1]. This rapidly expanding volume of video data, referred to as "video big data", far exceeds the capabilities of conventional analytics methods. CBVR, which retrieves relevant videos based on content rather than metadata, demands scalable systems for indexing, querying, and feature extraction at scale.

The rapid increase in video data from sources like CCTV, drones, and online platforms has far exceeded the capacity of traditional retrieval systems. In distributed setups, real-time analytics and historical video searches usually require separate processing systems, which results in duplicated infrastructure, inconsistent indexing, and higher latency. Although distributed computing frameworks like Hadoop [2] and Spark [3] offer scalable infrastructure, they were not originally designed to handle the temporal, semantic, and high-dimensional nature of video data. Consequently, adapting these platforms to efficiently support CBVR remains a significant challenge. Systems designed for large-scale CBVR must effectively integrate video ingestion, deep feature computation, indexing, and retrieval for both streaming and batch pipelines. Despite ongoing research, there is still a lack of scalable, end-to-end solutions capable of handling multi-modal video content in both real-time and offline settings.

CBVR has gained significant attention in recent years [4,5,6]. Prior works have focused on specific aspects of video retrieval such as handcrafted features [7,8,9], deep feature extraction [10, 11], indexing [11,12,13,14], retrieval modes such as text-based, image-based, or clip-based search [9, 15, 16], action recognition [17, 18], and object/face recognition [19]. Other works address similarity search [20, 21] or specific execution models like batch [19, 22] and stream [20, 23] processing. Some systems adopt distributed architectures [20, 24], but none offer a unified solution that supports distributed, multi-modal, and real-time CBVR. This fragmentation reveals a significant gap: existing systems fail to offer general-purpose, scalable, and low-latency CBVR frameworks that address end-to-end video analytics workflows.

To address this, the lambda architecture paradigm [25] offers a promising foundation for scalable video analytics and has attracted both industry and research attention [26,27,28,29]. However, its application to CBVR remains underexplored [30, 31], and adapting it to handle video-specific challenges like feature extraction and indexing remains non-trivial. Although distributed processing engines such as Spark [3] enable in-memory distributed computing, they lack native support for video processing and deep learning workloads. This limitation poses a key challenge in effectively extending and optimizing Spark to handle large-scale, low-latency video analytics pipelines for both streaming and batch videos.

Similarly, distributed streaming systems [32] are widely used for real-time, fault-tolerant message streaming. Their application, however, is non-trivial in video big data workflows. As such, integrating such systems into a CBVR system requires significant architectural adaptation to ensure efficient stream acquisition, buffering, and delivery of video and feature data. Moreover, the widespread use of Convolutional Neural Networks (CNNs) for video analysis brings significant challenges related to feature diversity. Different CNN architectures generate different feature representations, such as global features, object features, and temporal semantics, which differ in scale, dimensionality, and structure. This variety makes it harder to develop unified indexing and retrieval strategies in distributed CBVR systems.

Implementing big data technology stacks for end-to-end video analytics encounters several key system-level bottlenecks. These include issues with orchestration, scalability, inefficiencies in acquiring real-time streams, challenges in distributed processing, and the computational overhead of indexing and querying high-dimensional deep features. This highlights the need for a complete and unified architectural framework that can effectively support both real-time and batch processing of large-scale video data in distributed settings.

While it is theoretically possible to connect different systems for each workload, this often creates significant architectural and operational challenges. Batch systems expect static data and global aggregation, while streaming systems need incremental updates and quick processing. These approaches differ not only in data models and state handling but also in how they maintain consistency. Simple integration results in duplication, synchronization issues, and inconsistent retrieval results. The lambda architecture partially solves this by defining separate batch and stream layers that meet at a serving layer. We build on this idea to unify not just ingestion and feature extraction, but also indexing and querying from both real-time streams and offline video data. This architecture bridges the batch-stream divide, enables scalable distributed processing, and supports low-latency video search in dynamic environments.

In this work, we present \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a scalable distributed framework for CBVR based on the lambda architecture to support both batch and real-time processing. \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers, the Video Big Data Layer (VBDL), the Video Big Data Analytics Layer (VBDAL), and the Web Service Layer (WSL), which manage the acquisition, structural analysis, deep feature extraction, and indexing of large-scale video data. We develop the Distributed Encoded Deep Feature Indexer (DEFI) to encode deep global and object-level features for indexing, facilitating efficient multimodal retrieval. The design philosophy of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) reflects the practical challenges of merging streaming and batch processing.

The remainder of this paper is organized as follows: Section "Related Work" discusses related work in CBVR, big data processing, and indexing. Section "Research Objectives" outlines the research objectives and architectural challenges. Section "Proposed Architecture" presents the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) architecture in detail. Section "Evaluation and Discussion" describes the experimental setup, datasets, and evaluation metrics, followed by the performance evaluation. Finally, Section "Conclusion and Future Work" concludes the paper.

Related work

Several approaches have been proposed to address CBVR, focusing on feature extraction, indexing, and system scalability. These methods span from traditional handcrafted descriptors to more recent deep learning and cloud-based techniques. However, most prior works address isolated components of the pipeline rather than providing an integrated architecture suitable for large-scale, hybrid video processing.

Traditional CBVR systems relied on handcrafted features and static metadata. Amato et al. [13] proposed a large-scale system for retrieval using text, color histograms, and object detection. Sun et al. [33] presented a framework for segment-level video retrieval. Li et al. [15] proposed Sentence Encoder Assembly to support text-to-video matching. These systems, while effective for specific retrieval tasks, do not address scalability or unified system architecture. Other efforts explored temporal and semantic representations. Shang et al. [23] used frame-level gray-scale intensities and time-oriented video structures but incurred high parallel processing costs. Sivic et al. [34] developed classifiers for character-specific tracking in television content, and Song et al. [8] applied multi-feature hashing in Hamming space. Lai et al. [9] combined appearance and motion trajectories for object-centric retrieval, while Araujo et al. [35] performed large-scale retrieval using image queries. These works generally assume static datasets and lack mechanisms for real-time or adaptive stream processing. Zhang et al. [12] created a video frame information extraction model based on a CNN and Bag of Visual Words for large-scale video retrieval with image queries. Such conventional CBVR approaches are not effective candidates for video big data; moreover, they offer limited or no support for near-real-time video streams.

Recent efforts have looked into big data technologies to improve scalability in CBVR. Saoudi et al. [36] developed a distributed system to create compact video signatures using motion and residual features along with clustering-based indexing. Gao et al. [10] introduced a cloud-based actor identification framework that keeps spatial coherence in feature encoding. Wang et al. [21] used MapReduce and GPU acceleration for near-duplicate retrieval with a Hessian-Affine detector [37], encoded through Bag of Features and stored on HDFS. Zhu et al. [20] presented Marlin, a streaming pipeline for distributed indexing and micro-batch feature extraction; however, it does not support batch analytics or generalized features. Muhling et al. [17] use a cluster computing approach based on the Distributed Resource Management Application API [24] and deep learning to propose a CBVR system for professional film, television, and media production. Lin et al. [38] developed a cloud-based facial retrieval system, and other approaches [19, 22, 39] also target cloud-based CBVR. These approaches, however, are task-specific and lack architectural integration.

Indexing is crucial in information retrieval systems, and various indexing strategies have been proposed for video retrieval. Chen et al. [40] convert spatial-temporal features into compact binary codes by training hash functions with supervised hashing. Liu et al. [41] index features on a large scale by encoding each image with a pre-trained CNN and constructing a visual dictionary of codewords; they then design a hash-based inverted index for retrieval. However, this approach suffers from linear growth in search time with increasing data volume and lacks scalability for distributed or cloud environments. Amato et al. [42] proposed a permutation-based strategy for CNN features, where permutations are created by combining a collection of metric space reference objects. However, it necessitates calculating the distances between pivots and targets, which is time-consuming in the case of deep features. Anuranji et al. [43] argue that existing hashing methods inadequately represent video frame features and fail to exploit temporal information, leading to significant feature loss during dimensionality reduction. They propose a hashing mechanism that captures both spatial and temporal features efficiently using stacked convolutional networks and bi-directional learning for improved feature extraction.

Despite these advances, several fundamental challenges remain unaddressed. First, most existing CBVR systems lack a unified architecture that supports both real-time streaming and offline batch video processing, leading to fragmented pipelines and redundant indexing logic. Second, current indexing methods often struggle to scale with high-dimensional, heterogeneous features (e.g., spatial, temporal, object-level), limiting their applicability across varied video analytics tasks. Third, many systems are not designed for distributed, cloud-native environments, making them difficult to deploy or extend at scale.

Table 1 Feature-wise comparison of the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with the state-of-the-art.

To address these limitations, we introduce \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a scalable and fault-tolerant system based on the lambda architecture that combines streaming and batch pipelines. Table 1 compares key CBVR systems alongside our platform, emphasizing the architectural and operational aspects important for scaling video retrieval. Lambda Architecture refers to systems that integrate batch and stream processing layers to ensure both low-latency and accurate results. A Scalable System can manage increasing data volumes and user demands without losing performance. A Pluggable Framework makes it easy to integrate new components or algorithms. Service Oriented means the system is divided into independent services, which improves maintainability and scalability. Big Data Store enables efficient storage and access to large video datasets. Distributed Messaging allows decoupled, fault-tolerant communication between components, which is essential for coordinating real-time data flows in streaming pipelines. An Incremental Index updates as new data arrives, avoiding expensive full re-indexing. Multi-Type Features indicates support for diverse feature types. Feature Encoding refers to the transformation of computed features for efficient indexing. Batch Processing deals with large amounts of offline video data. Stream Processing receives and analyzes video streams in real time. A Multistream Environment refers to the ability to acquire and process video data from a diverse range of connected sources. Query By Text allows users to find videos using text input, while Query By Image and Query By Clip enable visual input in content-based searches.

Research objectives

The main research objectives of this work are listed below:

  • Lambda-architecture based CBVR: To design a scalable system for video big data indexing and retrieval that leverages lambda architecture principles to process, index, and retrieve both streaming and batch video data. The integration of these two processing modes poses significant architectural challenges due to differences in latency requirements, data consistency, and processing semantics, challenges that are largely unaddressed in existing CBVR systems.

  • In-memory distributed video analytics: General-purpose Directed Acyclic Graph based distributed computing engines such as Spark lack native support for video analytics, including specialized data structures and video-specific semantics. This limitation forces developers to manage low-level complexities manually, which is labor-intensive, error-prone, and hinders scalability. Addressing this gap is significant for enabling efficient, high-throughput video processing in large-scale systems. Implementing an end-to-end, video-aware processing framework on top of such engines is a non-trivial task, but essential for supporting real-time and batch video analytics in distributed environments.

  • Deep feature indexer: The diversity of video features, including textual metadata, deep visual representations, and spatiotemporal patterns, poses a major challenge for CBVR. These features differ in structure, scale, and semantic meaning, making unified indexing complex. Moreover, the high-dimensional and dense nature of deep features significantly increases the computational cost of indexing and retrieval. Designing a scalable, distributed indexer that can efficiently handle and unify these heterogeneous feature types across both streaming and batch video data is a critical and non-trivial challenge with direct impact on retrieval accuracy and system performance.

  • High-level abstractions: To design and optimize high-level abstractions over big data technologies that support scalable video analytics while hiding the complexity of the underlying distributed computing stack. This objective aims to simplify the development of video processing pipelines by providing domain-specific constructs that encapsulate data ingestion, transformation, and feature extraction, enabling efficient deployment in both real-time and batch processing environments.

  • Bottlenecks analysis: To systematically identify and analyze performance bottlenecks that arise in distributed video analytics pipelines. These include challenges related to video storage and orchestration, scalability, real-time stream acquisition, and high-dimensional deep feature indexing. The goal is to understand the limitations of existing big data technologies when applied to video workloads, and to use these insights to guide the design of more efficient, scalable, and fault-tolerant CBVR systems. Special attention is given to how architectural choices affect latency, throughput, and resource utilization in both streaming and batch processing contexts.

The main research contributions of this work are as follows:

  • We propose a lambda-inspired layered architecture that supports both near real-time and batch video processing, enabling scalable and unified retrieval of video big data.

  • We design and implement an in-memory video analytics framework that introduces a high-level abstraction over distributed processing engines, facilitating efficient video operations such as frame extraction and deep feature computation.

  • We address the challenge of heterogeneous feature representation by introducing DEFI, a unified distributed indexer capable of indexing and retrieving multi-type deep features, including spatial, object-level, and temporal representations, from video sources.

  • We introduce a set of high-level abstractions built over big data technologies to simplify the construction of scalable video analytics pipelines. These abstractions encapsulate complex tasks such as data ingestion, transformation, and feature extraction, and are designed to support both streaming video (processed in real-time with low-latency constraints) and batch video (processed offline for large-scale analysis), ensuring flexibility and consistency across diverse workloads.

  • We develop and evaluate the proposed system using three benchmark video datasets, demonstrating its scalability and performance under real-world conditions. Our analysis highlights key system-level bottlenecks across storage, stream acquisition, distributed processing, and deep feature indexing.

Fig. 1
figure 1

The proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers. The initial layer, VBDL, orchestrates video big data across its life cycle, from acquisition and storage to archiving or obsolescence. The VBDAL performs structural analysis and mining operations on top of Spark, an in-memory processing engine that extends the MapReduce model for efficient computations. The WSL is the gateway that exposes the high-level features of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) as role-based web services, providing the proposed framework’s functionality across the web.

Proposed architecture

Here, we formally introduce the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) architecture, followed by detailed technical specifications in the subsequent sections. \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) comprises three layers: the Video Big Data Layer (VBDL), the Video Big Data Analytics Layer (VBDAL), and the Web Service Layer (WSL), as illustrated in Fig. 1. VBDL forms the base layer of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), managing the video big data lifecycle, from acquisition and storage to archiving or obsolescence. Within VBDL, we introduce the unified feature indexer DEFI, which encodes and indexes deep global and object features, referred to as Intermediate Results (IR) throughout this work. Additionally, the Structured Metastore manages structured metadata, including user information, video sources and configurations, video retrieval service subscriptions, application logs, and system workloads. VBDAL handles structural analysis and mining operations, generates query maps, and performs ranking operations. Lastly, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) consolidates all advanced functionalities into role-based web services through the WSL, which is built on top of both VBDL and VBDAL, enabling the proposed framework to operate seamlessly over the web. In the subsequent sections, we provide the technical details of each layer within the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\).

Fig. 2
figure 2

Video Stream Acquisition and Producer Service. (a) Internal logic and flow. (b) Data structures. SDS denotes the various connected Streaming Data Sources, and the MSG schema represents each Record.

Video Big Data Layer

VBDL is responsible for large-scale acquisition and management of batch video data and real-time video streams, along with indexing encoded deep features. The four key components of VBDL are the Video Stream Manager, Video Big Data Store, Structured Metastore, and DEFI.

Video Stream Manager

The source devices provide the real-time video streams, and executors perform on-the-fly processing and deep feature extraction on these streams. The role of this component is to acquire video streams in real time and orchestrate them. The Video Stream Manager comprises five modules: the Video Stream Acquisition Service (VSAS), Distributed Message Broker, Broker Client Services, Intermediate Results Manager, and Video Stream Consumer.

Algorithm 1
figure a

Video stream event producer

Video Stream Acquisition and Producer Service

Video Stream Acquisition and Producer Service (VSAPS) provides interfaces for the configuration and acquisition of video streams from the source devices. For a specific video source subscribed to the proposed \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), the VSAPS obtains the configuration metadata from the Structured Metastore and then configures the video streaming source device accordingly. Once configured, the video stream is decoded by the VSAPS, followed by the frame extraction process. Once the frames are extracted, scenario-dependent preprocessing operations are performed on them, including metadata extraction, frame correction, and resizing. Communication with the video stream source occurs through a JSON object comprising five fields: dataSourceID, the width and height of the frame, a timestamp, and the Payload, which is the actual frame data. We use the term Record for this JSON object. Each Record is subsequently sent to the Producer Handler after instantiation. Figure 2 and Algorithm 1 illustrate this process. The VSAPS then serializes the Records into mini-batches, compresses them using Snappy compression [46], and sends them to the corresponding Kafka topic in the Kafka Distributed Message Broker.
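To make this flow concrete, the following minimal sketch decodes a stream, wraps each frame in a Record, and publishes Snappy-compressed messages to a Kafka topic. It assumes the kafka-python client, OpenCV for decoding, and a broker at localhost:9092; the topic follows the UID_VDS convention, while the helper names and addresses are hypothetical.

```python
# Minimal sketch of the VSAPS Record flow (assumptions: kafka-python, OpenCV,
# a broker at localhost:9092; field names mirror the Record schema above).
import json
import time
import cv2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="snappy",                       # Snappy-compressed batches
    value_serializer=lambda r: json.dumps(r).encode("utf-8"),
)

def produce_stream(source_url: str, topic: str, data_source_id: str):
    """Decode a video stream, wrap each frame as a Record, and publish it."""
    cap = cv2.VideoCapture(source_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (480, 320))        # frame correction & sizing
        record = {
            "dataSourceID": data_source_id,
            "width": frame.shape[1],
            "height": frame.shape[0],
            "timestamp": time.time(),
            "Payload": cv2.imencode(".jpg", frame)[1].tobytes().hex(),
        }
        producer.send(topic, record)                  # handed to the Producer Handler
    cap.release()
    producer.flush()
```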

Fig. 3
figure 3

Workflow of VSAPS. Here UID, VDS, IR, and P represent user id, video stream data source id, intermediate results, and partition respectively.

Distributed Message Broker

The Distributed Message Broker contains topics, and each topic contains one or more partitions. Broker Client Services manages the topics and partitions in the Distributed Message Broker. Figure 3 illustrates the Broker Client Services, which comprise three sub-modules: Topic Manager, Partition Manager, and Replication Factor Manager. The Topic Manager sub-module dynamically creates new topics in the Broker Cluster when a registered user subscribes to a new video stream source. The naming convention for the topics is UID_VDS and UID_VDS_IR, where UID, VDS, and IR represent the unique user identifier, the video source identifier, and the intermediate results identifier, respectively. These identifiers are dynamically provided by the Structured Metastore. The actual video streams and the intermediate results computed from the video frames are held in these topics.

Video stream records are distributed across partitions within each topic, with the number of partitions proportional to the degree of parallelism and the overall throughput. When a new video stream source is registered with the framework, the Partition Manager creates a new partition under the relevant user’s topic. To ensure fault tolerance, the Replication Protocol manages partition replication across the brokers, and APIs in the Replication Factor Manager set the replication factor to optimize its use. Through experimentation, we found that a replication factor of 3 is both optimal and sufficient, provided it does not exceed the total number of Broker Servers, as depicted in the flowchart in Fig. 4.
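The sketch below illustrates the Topic Manager and Partition Manager logic under the UID_VDS naming convention, with the replication factor capped at min(3, number of brokers) as in Fig. 4. It assumes the kafka-python admin client; the function and variable names are hypothetical.

```python
# Minimal sketch of topic/partition management for a newly subscribed source.
from kafka.admin import KafkaAdminClient, NewTopic, NewPartitions

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

def register_stream_source(uid: str, vds: str, num_brokers: int, partitions: int = 1):
    """Create the raw-stream and IR topics for a newly subscribed source."""
    replication = min(3, num_brokers)                 # recommended factor, capped by broker count
    topics = [
        NewTopic(f"{uid}_{vds}", num_partitions=partitions,
                 replication_factor=replication),
        NewTopic(f"{uid}_{vds}_IR", num_partitions=partitions,
                 replication_factor=replication),
    ]
    admin.create_topics(topics)

def add_partition(topic: str, new_total: int):
    """Grow a topic's partition count when the degree of parallelism must increase."""
    admin.create_partitions({topic: NewPartitions(total_count=new_total)})
```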

Fig. 4
figure 4

Flowchart for replication factor

Video Stream Consumer

We acquire the incoming streams from the video sources as mini-batches that then reside inside their respective topics in the brokers. The Video Stream Consumer helps VBDAL read these mini-batches from their topics in the broker for IR extraction.

Intermediate Results Manager

The VBDAL generates deep features that require proper management and indexing. The extracted IR are published to and consumed from the topic \(UID\_VDS\_IR\) using the Intermediate Results Manager. The schema is composed of dataSourceId, frameId, timestamp, width, height, and frameData.

Video Big Data Store

The VBDL provides permanent, distributed persistence and storage space for video big data. The data is systematically stored and mapped in accordance with the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) ’s business logic. A User Space is allocated when a new user registers with the proposed system. The User Space is organized into a hierarchical subspace structure, with the owner holding specific read and write privileges. The subspace hierarchy under the User Space is mapped and synchronized with the user identification in the User Centric Metastore. The User Space is subdivided into two high-level subspaces: the Raw Video Space, where the video data is managed, and the Model Space, where the models trained and used for IR extraction are stored. The Raw Video Space is further subdivided into two subspaces: one for batch video data subject to batch analytics, and one for streaming data acquired from the connected streaming sources. The streams are timestamped for persistence, and the level of granularity for raw streams depends on the video data sources and the functional requirements of the users.

Since streams are acquired from heterogeneous connected sources, their storage and access must be orchestrated efficiently within the proposed system. To handle this issue, we carefully design the Video Stream Unit (VDSU). We define the segment size \(\eta\) as a regularizing factor for the VDSU; it is tightly related to the HDFS block size. Choosing a reasonable value for \(\eta\) is important because it is the controlling factor of the VDSU. If \(\eta\) is small, many segments are generated, which imposes significant overhead on the NameNode as well as network traffic. On the other hand, if \(\eta\) is large, processing performance suffers because more memory is required and the granularity level increases; moreover, the segment spans multiple blocks stored on different data nodes, incurring additional overhead. The incoming frames from the heterogeneous connected sources are appended to a segment until the segment size reaches \(\eta\). Sufficient metadata is added to the segment, enabling it to be processed without loading the entire stream. The segment is then stored in the VDSU, as illustrated in Fig. 5.
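The following sketch shows the VDSU accumulation logic: frames are appended to a segment until its size reaches \(\eta\), segment-level metadata is attached, and the segment is flushed to storage. It assumes \(\eta\) is expressed in bytes and aligned with the HDFS block size; write_segment_to_hdfs is a hypothetical stand-in for the actual HDFS writer.

```python
# Minimal sketch of VDSU segment accumulation (eta in bytes, HDFS writer injected).
import json

class VideoStreamUnit:
    """Accumulates incoming frame Records into segments of at most eta bytes."""

    def __init__(self, eta_bytes: int, write_segment_to_hdfs):
        self.eta = eta_bytes
        self.write = write_segment_to_hdfs
        self.frames, self.size = [], 0

    def append(self, record: dict):
        payload = bytes.fromhex(record["Payload"])
        self.frames.append(record)
        self.size += len(payload)
        if self.size >= self.eta:                 # segment reached eta: flush it
            self.flush()

    def flush(self):
        if not self.frames:
            return
        segment = {
            # segment-level metadata so the segment can be processed on its own
            "dataSourceID": self.frames[0]["dataSourceID"],
            "start_ts": self.frames[0]["timestamp"],
            "end_ts": self.frames[-1]["timestamp"],
            "num_frames": len(self.frames),
            "frames": self.frames,
        }
        self.write(json.dumps(segment).encode("utf-8"))
        self.frames, self.size = [], 0
```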

Fig. 5
figure 5

Video stream unit (VDSU).

Structured Metastore

The Structured Metastore manages the structured data of the proposed system, which includes the User Centric Metastore, Video Data Source Metastore, Subscription Metastore, System Logs, and Sys Configs. The User Centric Metastore maintains user logs and related information. We deploy a customized salted encryption scheme based on [47] to encrypt the user information for security purposes. Furthermore, the proposed framework uses the Video Data Source Metastore to handle two types of video sources: stream sources and datasets. The Video Data Source Metastore manages the meta-information and access rights for these sources. After registering a data source with \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), it can be subscribed to for video indexing and retrieval. Such subscriptions are managed through the Subscription Metastore. The System Logs and Sys Configs Metastores store the logs generated by the proposed system and its configurations, respectively. Finally, the DIO Reader and Writer modules enable users to access the underlying data from the distributed and Structured Data Stores to manage and operationalize the data.

Distributed Encoded Deep Feature Indexer

We designed DEFI as a unified indexing mechanism for \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), composed of four modules: Feature Encoder, IR Aggregator, IR Indexer, and Query Handler, as shown in Fig. 6.

Fig. 6
figure 6

IR Indexer: A unified deep feature indexing and retrieval component in \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\). The acquired deep features are transformed, aggregated, and indexed in the inverted index, where they immediately become available for retrieval.

To avoid multiple indexing schemes, we designed a unified indexer that facilitates indexing and querying multiple types of features. However, because of their high dimensionality and dense nature, incorporating deep features directly into an inverted index-based system poses a significant challenge. Thus, we designed a Feature Encoder motivated by [48] to transform the deep features into a representation most suitable for the inverted index-based system. Given a feature vector \(f_v \in {\mathbb {R}}^D\), we use a transformation function to make it compatible with our inverted index system.

Table 2 The \({\mathcal {Q}}\) factor and its effect on Empty Yield.

For feature encoding, [48] uses a hard-coded constant \({\mathcal {Q}}\) to quantize the feature vectors. The drawbacks of this approach are two-fold. First, finding the optimal value of the quantization factor is difficult: since \({\mathcal {Q}}\) is a magic number, it must be determined by trial and error. Second, if \({\mathcal {Q}}\) is small, the floor function maps features whose values are near zero in every dimension to empty (null) encodings, so the corresponding videos are missed. On the other hand, the encoding size is proportional to \({\mathcal {Q}}\): a larger value of \({\mathcal {Q}}\) yields larger encodings. Table 2 summarizes the effect of this empty yield on 1M features extracted from the Youtube-8M [49], V3C1 [50], and Sports-1M [51] datasets using the VGG-19 prediction layer.

In our proposed approach, we replace the fixed quantization factor \({\mathcal {Q}}\) with an adaptive factor q, defined as:

$$\begin{aligned} q = e \cdot \log (D) \end{aligned}$$
(1)

where \(D\) is the dimensionality of \(f_v\) and e is Euler's number. This adaptive quantization factor prevents under- or over-sampling of features in the inverted index. For each component \(f_{v,d} \in f_v\) (where \(d = 1, 2, \ldots , D\)), the encoding transformation is:

$$\begin{aligned} s_{v,d} = \left| \left\lfloor q \cdot f_{v,d} \right\rfloor \right| \end{aligned}$$
(2)

After quantization, we prune zero or near-zero values:

$$\begin{aligned} \text {tf}_{v} = \{ s_{v,d} \mid s_{v,d} > 0 \} \end{aligned}$$
(3)

This step helps reduce the data’s size, removing dimensions that contribute little to the overall representation and thereby improving storage efficiency in the index. Each transformed component \(s_{v,d}\) is further converted to a hexadecimal representation, \(\text {Hex}(s_{v,d})\), and repeated \(s_{v,d}\) times:

$$\begin{aligned} encoding (S_{v,d}) = \underbrace{\text {Hex}(s_{v,d}) \, \text {Hex}(s_{v,d}) \, \dots \, \text {Hex}(s_{v,d})}_{s_{v,d} \text { times}} \end{aligned}$$
(4)

Finally, the encoded feature vector \({\mathbb {F}}_v\) is assembled by concatenating the hexadecimal values across all dimensions. The purpose of hexadecimal encoding in the proposed DEFI is to create a compact, invertible representation of high-dimensional, dense feature vectors. Encoding each dimension of the feature vector in hexadecimal allows us to efficiently store, retrieve, and index these transformed features.
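A minimal sketch of this encoding pipeline (Eqs. 1–4) is given below, assuming a dense feature vector such as a VGG-19 prediction-layer output; NumPy is the only dependency, and the function name is illustrative.

```python
# Minimal sketch of the DEFI feature encoding (quantize, prune, hex-encode).
import math
import numpy as np

def encode_feature(f_v: np.ndarray) -> str:
    """Transform a deep feature vector into a token string for the inverted index."""
    D = f_v.shape[0]
    q = math.e * math.log(D)                      # adaptive quantization factor (Eq. 1)
    s = np.abs(np.floor(q * f_v)).astype(int)     # quantized components (Eq. 2)
    tokens = []
    for s_vd in s:
        s_vd = int(s_vd)
        if s_vd > 0:                              # prune zero / near-zero terms (Eq. 3)
            # hex token repeated s_vd times, so term frequency encodes magnitude (Eq. 4)
            tokens.extend([format(s_vd, "x")] * s_vd)
    return " ".join(tokens)

# Example: a 4096-dimensional feature vector
feature = np.random.rand(4096).astype(np.float32)
print(encode_feature(feature)[:80], "...")
```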

Algorithm 2
figure b

Global features transformation and encoding

The proposed DEFI aggregates multiple types of features, represented as:

$$\begin{aligned} F_v = \begin{bmatrix} F_v^{\text {spatial}}, F_v^{\text {object}}, F_v^{\text {textual}} \end{bmatrix} \end{aligned}$$
(5)

where \(F_v^{\text {spatial}}\), \(F_v^{\text {object}}\), and \(F_v^{\text {textual}}\) are the encoded feature vectors for spatial, object, and textual features, respectively. These are indexed collectively in the distributed inverted indexer. Algorithm 2 outlines the proposed DEFI.

In distributed systems, managing the query load across multiple servers (in this case, the N servers of the distributed inverted indexer) is crucial for performance, scalability, and reliability. The Weighted Round Robin (WRR) algorithm is an effective way to allocate queries based on the capacity of each server, ensuring that servers with more resources handle a greater proportion of the load. The weight of each server is computed as:

$$\begin{aligned} W_i = \frac{C_i}{\sum _{j=1}^N C_j} \end{aligned}$$
(6)

where \(W_i\) is the weight assigned to server i based on its computational power \(C_i\). Queries are then dispatched in proportion to these weights, balancing the workload dynamically.
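The sketch below illustrates weighted round-robin dispatch following Eq. (6): each server appears in the dispatch cycle in proportion to its capacity \(C_i\). The server names and capacities are hypothetical.

```python
# Minimal sketch of weighted round-robin query dispatch (Eq. 6).
import itertools

def build_schedule(capacities: dict) -> list:
    """Expand per-server capacities into a dispatch cycle proportional to W_i."""
    schedule = []
    for server, c_i in capacities.items():
        schedule.extend([server] * c_i)            # server i appears C_i times per cycle
    return schedule

# relative computational power C_i of the index servers (illustrative values)
servers = {"indexer-1": 4, "indexer-2": 2, "indexer-3": 1}
dispatch = itertools.cycle(build_schedule(servers))

for query_id in range(7):
    print(query_id, "->", next(dispatch))          # queries spread 4:2:1 across servers
```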

Fig. 7
figure 7

MapReduce-based workflow for batch video analytics. In this figure, MR represents MapReduce. VidRDD, FeRDD, and EFeRDD represent the Video, Feature, and Encoded Feature abstractions on top of Spark’s native RDD. (a) Step-by-step workflow. (b) Data representation inside VidRDDs. (c) Processing carried out on each node

Video Big Data Analytics Layer

We designed VBDAL comprising four components: Video Grabber, Structure Analyzer, Feature Extractor, and Query Handler. We utilize Spark [3] for distributed video processing and analytics. At the time of this writing, Spark has no native support for video data handling. To address this issue, we first build the Video Resilient Distributed Dataset (VidRDD), a unified API wrapper over Spark’s Resilient Distributed Dataset (RDD) that can be integrated with both Spark and Spark Streaming. We load the video data as a binary stream into the VidRDD. All operations of VBDAL are performed on our proposed VidRDD, as shown in Fig. 7.
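As an illustration of this abstraction, the sketch below wraps videos loaded via Spark’s binaryFiles as an RDD of (path, bytes) pairs and exposes a frame-extraction transformation. It assumes PySpark and OpenCV are available on the workers; the class is an illustrative stand-in for VidRDD, not its actual API.

```python
# Minimal sketch of a VidRDD-style wrapper over Spark's binaryFiles.
import tempfile
import cv2
from pyspark.sql import SparkSession

class VidRDD:
    """Holds (file_path, raw_bytes) pairs of video files as a Spark RDD."""

    def __init__(self, spark: SparkSession, path: str):
        self.rdd = spark.sparkContext.binaryFiles(path)   # one record per video file

    def extract_frames(self, step: int = 30):
        """Decode videos on the workers and emit every `step`-th frame."""
        def decode(record):
            path, raw = record
            with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
                tmp.write(raw)
                tmp.flush()
                cap = cv2.VideoCapture(tmp.name)
                idx = 0
                while cap.isOpened():
                    ok, frame = cap.read()
                    if not ok:
                        break
                    if idx % step == 0:
                        # (path, frame number, width, height, pixel data)
                        yield (path, idx, frame.shape[1], frame.shape[0], frame)
                    idx += 1
                cap.release()
        return self.rdd.flatMap(decode)                   # FrameRDD-like output

spark = SparkSession.builder.appName("vidrdd-sketch").getOrCreate()
frames = VidRDD(spark, "hdfs:///user/demo/raw_videos/*.mp4").extract_frames(step=30)
print(frames.take(1))
```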

Video Grabber

The Video Grabber component comprises two modules: Streaming Mini-batch Reader and Batch video Loader. The Streaming Mini-batch Reader allows us to subscribe to distributed broker’s topics to acquire and load video mini-batches into distributed main memory (as shown in Algorithm 1) for near real-time video analytics. Likewise, the Batch video Loader performs the loading of batch video data as VidRDD into distributed main memory from the Video Big Data Storage (Raw Video Space) for batch video analytics.

Structure Analyzer

Once the VidRDD is initialized and populated with video data, structure analysis operations are carried out by the Structure Analyzer which in most cases include frame extraction, metadata extraction, and preprocessing. At this stage, the video data resides in the distributed main memory as VidRDD. The Metadata Extractor, Frames Extractor, and Preprocessor components operate in an inline manner, as shown in Fig. 7 (MR1) and in Algorithm 3 (step 2–6) for batch structure analysis and Algorithm 4 for Streaming Video Data in near real-time structure analysis.

Algorithm 3
figure c

Distributed batch video processing

Algorithm 4
figure d

Video stream processing & analytics

Metadata extraction is vital for indexing and retrieving video data and video frames. Metadata falls into two categories. General metadata contains information about the video file or stream, such as the format (MPEG-4, MKV, etc.), codec, file size, duration, bitrate mode (constant or variable) and overall bitrate, timestamp, etc. Video metadata includes the format, profile, actual bitrate, width and height of the frame, display aspect ratio, frame rate mode (constant or variable), actual frame rate, color space, bit depth, and stream size. This metadata helps in understanding the video content. For batch video data, the metadata is extracted directly from the video data [52]. In the case of video streams, the metadata packed within the Kafka messages is used, as shown in Algorithm 4 (steps 2–8).
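A minimal sketch of batch metadata extraction is shown below, assuming ffprobe (from FFmpeg) is available on the worker nodes; the extractor actually used in this work [52] may differ, and the returned field selection is illustrative.

```python
# Minimal sketch of batch metadata extraction via ffprobe (assumed available).
import json
import subprocess

def extract_metadata(video_path: str) -> dict:
    """Return selected general and video-stream metadata for a video file."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", video_path],
        capture_output=True, check=True, text=True,
    )
    probe = json.loads(out.stdout)
    video = next(s for s in probe["streams"] if s["codec_type"] == "video")
    return {
        # general metadata
        "format": probe["format"]["format_name"],
        "duration": float(probe["format"]["duration"]),
        "size_bytes": int(probe["format"]["size"]),
        "overall_bitrate": int(probe["format"]["bit_rate"]),
        # video metadata
        "codec": video["codec_name"],
        "width": video["width"],
        "height": video["height"],
        "frame_rate": video["avg_frame_rate"],
        "bit_depth": video.get("bits_per_raw_sample"),
    }
```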

The Frame Extractor performs the video frame extraction operation on the VidRDD. Once the frames are extracted, the VidRDD is transformed into a FrameRDD. The FrameRDD holds attributes such as FrameNo, TimeStamp, Dimensions (width and height), RGB channels, and the actual payload. In the case of streaming video, this operation is performed by the VSAPS, as described in the "Video Stream Manager" section.

Since video content is often noisy due to non-uniform lighting conditions, objects in an occluded context, and other conditions, the extracted video frames must first be preprocessed before inputting them into the convolutional neural network. The Preprocessor component handles operations like frame resizing, downsampling, background subtraction, color and gamma correction, histogram equalization, and exposure and white balance. Once the structural analysis is performed the next step is IR extraction.

IR Extractor

In the context of CBVR, the IR are of two types: deep global features, and object features, which are computed by the Deep Global Feature Extractor and Object Extractor, respectively.

Deep Global Feature Extractor

Global features provide a good description of image content and are typically represented by a single feature vector. The Deep Global Feature Extractor employs VGG-19 [53] to compute global spatial features from video frames.

We use video keyframes for feature extraction for two reasons: first, it reduces the computational overhead; second, keyframes are sufficient for a global representation of the video because consecutive frames between two keyframes differ little. For batch video processing, we load the video data into the VidRDD, which resides in distributed memory. Structure analysis (decomposition into frames, metadata extraction, etc.) on the VidRDD is performed by the Structure Analyzer. The Preprocessor then performs pre-processing operations such as downsampling, resizing, and histogram equalization on the extracted frames to improve object precision and accelerate deep feature extraction.

We compute spatial features after preprocessing using the VGG-19-based Deep Global Feature Extractor. These features are stored in the FeatureRDD (FeRDD), as shown in Fig. 7; the global features computed by the Deep Global Feature Extractor and the objects detected by the Deep Object Extractor are both kept in the FeRDD. Finally, we encode and aggregate the features in the FeRDD as described in the "Distributed Encoded Deep Feature Indexer" section, and immediately send them to the Feature Indexer, which indexes them in DEFI.
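The sketch below illustrates distributed global feature extraction, running VGG-19 once per partition over keyframes. It assumes PyTorch and torchvision are installed on the workers; `keyframes` stands for a FrameRDD-like RDD of (frame_id, BGR array) pairs and is hypothetical.

```python
# Minimal sketch of per-partition VGG-19 global feature extraction.
import numpy as np
import torch
from torchvision import models, transforms

def extract_partition(frames):
    """Run VGG-19 on every keyframe of one partition and yield feature vectors."""
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
    prep = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    with torch.no_grad():
        for frame_id, bgr in frames:
            rgb = np.ascontiguousarray(bgr[:, :, ::-1])   # BGR -> RGB
            logits = model(prep(rgb).unsqueeze(0))        # prediction-layer output
            yield frame_id, logits.squeeze(0).numpy()

# features = keyframes.mapPartitions(extract_partition)   # FeRDD-like RDD
```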

Deep Object Extractor

Similar to the Deep Global Feature Extractor, the object features are computed from the frames residing in the FrameRDD, as shown in Fig. 7 and Algorithm 3 (steps 6–10). We compute the object features from the keyframes using YOLO [54]. The object information is properly structured and stored alongside the deep global features in the FeRDD. To maintain the semantics, we create a list-style semantic structure depicting the transitions of objects across frames.

Fig. 8
figure 8

Query execution plan against three types of queries

Query Handler

The purpose of the Query Handler is to receive user queries, invoke the APIs of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) against those queries, and return the results to the users. Upon receiving a request, the Query Generator builds an execution plan and executes it. For user assistance, we generate query maps using the Query Map Generator and return them to the user. The query execution plan, outlined in Fig. 8, depends on the type of query. A textual query is transformed into the DEFI format and executed directly. For image and video-clip queries, structure analysis and feature extraction are performed first, and the resulting features are then transformed and executed by DEFI. The top-k results are ranked and returned to the user.

Evaluation and discussion

In this section, we describe the experimental configuration for \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), encompassing the datasets, performance parameters, and the evaluation experiments. The assessment of video retrieval performance involves testing both the accuracy and the speed of video retrieval.

Fig. 9
figure 9

Experimental in-house cluster setup

Table 3 Topics for Video Stream Services and IR management. Parts indicate the number of partitions

Experimental setup

We set up an in-house distributed cloud environment with ten computing nodes for testing and evaluation. The cluster configuration and specifications for each node are detailed in Fig. 9. The Cluster consists of five types of nodes: \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server, Worker Agents, HDFS Server, Solr Server, and Broker Servers. The \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server hosts VSAS, Video Stream Producer (VSP), and the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Web Server, along with Ambari Server [55] for cluster management and configuration.

The HDFS Server (Agent-1) hosts the HDFS NameNode, Spark2 History Server, YARN Resource Manager, and ZooKeeper Server. The Worker Agents consist of four different types of agents that host the components of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), namely VBDL (Video Stream Consumer, Video Big Data Store, and DEFI) and VBDAL (Structure Analyzer and deep feature extractor services). The real-time and offline video analytics are carried out by these nodes. For clarity, we used these Worker Agents first for near real-time evaluation and then for batch evaluation.

The Broker Servers are configured on agents six, seven, and eight to buffer large-scale video streams and the IR (the deep features). Agent nine deploys the Solr Server and the Structured Metastore schema. Some nodes in Fig. 9 have client and data node services configured: the clients are instances of ZooKeeper, YARN, Spark, and Solr, and the data node is an HDFS DataNode. Table 3 shows the topics for Video Stream Services and IR management.

Datasets

We evaluated \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) on three datasets: V3C1 [50], YouTube-8M [49], and Sports-1M [51]. V3C1 consists of 7,475 Vimeo clips with an average duration of eight minutes, spanning around 1000 hours (1.3 TB in size); video metadata such as keywords, descriptions, and titles is available in a JSON file. YouTube-8M contains approximately 6 million video IDs with high-quality machine-generated annotations drawn from a diverse vocabulary of over 3800 visual entities. Each video within this dataset has a duration ranging from 120 to 500 seconds. Sports-1M contains 1 million YouTube videos categorized into 487 sports categories.

Video Big Data Layer evaluation

We do a thorough analysis of the various Video Big Data Layer components, including DEFI, Video Stream Manager, and Video Big Data store.

Video Stream Acquisition Service performance evaluation

As the VSAS is agnostic to heterogeneous devices, we registered several heterogeneous devices along with offline video stream sources. These devices have different frame rates: cellphones at 60 fps, and others such as a depth camera [56], an IP camera [57], and an RTSP source [58] at 30 fps. A video file saved on secondary storage was also tested with the VSAS sub-component for offline video analytics.

Fig. 10
figure 10

Video big data store performance evaluation: (a) Video stream acquisition and synchronization benchmarking (b) VSAS and video stream consumer service performance

The VSAS set the resolution of acquired frames to \(480 \times 320\) pixels, so each captured frame was 623.584 KB in size. The VSAS transforms the acquired frames into messages in 6 ms on average and forwards them to the VSP. The VSP then compresses each message to 140.538 KB on average before sending it to the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker Server, taking 12 ms per message on average. Both of these modules are deployed on the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server for collecting streams and transferring them to the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker Servers. The improvements can be seen in Fig. 10(a): we achieved average frame rates of 30 and 58 fps from heterogeneous real-time and offline sources, respectively, comfortably above the 25 fps typically required for real-time video analytics.

We also assessed the scalability of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) by increasing the number of video sources from five to 140 on the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Server. At the same time, this experiment stressed the VSP and VSAS sub-components by increasing the average number of messages to 54 per second for each video stream. Furthermore, we acquired and produced streams from 70 devices at 40 messages per stream using the VSAS and VSP modules, which is a significant boost in performance. However, adding more streaming devices results in performance degradation, so we recommend connecting up to 70 sources per system for better performance. In our case, we can scale further by adding more brokers and producers.

Table 4 Performance evaluation of VSCS in terms of messages per second on a single thread (ST) and maximum (optimal) number of threads (MT)

Video Stream Consumer Services performance evaluation

In this section, we evaluate Video Stream Consumer Services (VSCS) performance which is responsible for acquiring video streams from the \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) Broker server as mini-batches for analytics. In the context of near real-time analytics, mini-batch size is significant and is determined by the MAX_REQUEST_SIZE_CONF parameter. We established four distinct scenarios for the evaluation, namely SCN-1, SCN-2, SCN-3, and SCN-4, as shown in Table 4. We configured the mini-batch size to be 4096, 6144, 8192, and 10,240 KB respectively. A single thread on a single worker machine achieves average message processing rates of 50, 88, 101, and 166 messages per second for the respective scenarios with synchronous replication. We deploy 22 threads to receive mini-batches from the Broker servers in each scenario. In SCN-1, the optimal performance is attained with 20 threads, achieving a reception rate of 814 messages per second. Likewise, in SCN-2, SCN-3, and SCN-4, the best performance is achieved with 13, 9, and 7 threads, respectively, as shown in Table 4. Additional threads do not contribute to increased performance. Figure 10(b) shows the impact of the Video Stream Consumer Service in the production environment, where the messages sent on UID_1 and UID_2 nearly coincide with those received.

Fig. 11
figure 11

Video big data store performance evaluation: (a) Performance of distributed persistent big data store (active data writer) (b) Performance evaluation of passive data reader and writer

Video Big Data Store performance evaluation

We conducted experiments on both the Active Data Reader & Writer and the Passive Data Reader & Writer. The Active Data Writer consumes the video stream from the topic UID_VDS (Broker server) and stores it in distributed persistent storage. Figure 11(a) illustrates the Active Data Writer performance. The results indicate that the Active Data Writer ensures proper data distribution as well as data locality. Similarly, Fig. 11(b) shows the disk write counts correlated with the bytes aggregated across HDFS hosts during three hours of active write operations.

Likewise, we evaluate the performance of the Passive Data Reader and Writer operations (shown in Fig. 11(b)) on batch video data. These operations were conducted for five different batch video sizes, i.e., 1024 MB, 5120 MB, 10240 MB, 15360 MB, and 20480 MB. The outcomes reveal that the write operation outpaces the read operation, and the time for both read and write grows in proportion to the volume of batch video data.

Distributed Encoded Deep Feature Indexer evaluation

We evaluated the indexing performance in terms of feature encoding size, concurrent queries and their response time with and without load balancing, the effect of the batch commit operation, and retrieval time. Our findings indicate that the feature encoding in our proposed system performs strongly in all of these aspects. Figure 12(a) compares the index storage sizes of the deep global features and their encoded counterparts. The size of the global features without encoding exhibits linear growth, while the size of the encoded features remains consistently small even as the number of features increases linearly. The compact size not only reduces indexing time but also lowers query latency, thereby enhancing overall system performance. Without encoding, the preferred similarity measure for CNN features is cosine similarity, as outlined in Eq. (7). However, cosine similarity is computationally expensive when dealing with very large numbers of features. The TF-IDF similarity measure of the inverted index not only boosts the performance of the similarity computation significantly but also improves video retrieval accuracy.

Fig. 12
figure 12

Distributed encoded deep feature indexer evaluation: (a) Feature index size comparison (b) Feature load balancing (c) Feature Batch size time (d) Feature query vs time

$$\begin{aligned} d\left( f,q\right) =\frac{f\cdot q}{|f| \times |q| }=\frac{\sum ^{n}_{i=1}f_{i}\times q_{i}}{\sqrt{\sum ^{n}_{i=1}f^{2}_{i}}\times \sqrt{\sum ^{n}_{i=1}q^{2}_{i}}} \end{aligned}$$
(7)

Figure 12(b) shows the query performance of our system. The proposed framework demonstrates commendable query performance under concurrent queries. Moreover, the load balancing mechanism substantially reduces retrieval time. Figure 12(c) shows the feature indexing time on the Youtube-8M, V3C1, and Sports-1M datasets. Specifying the ideal batch size is crucial due to the computational and IO cost associated with commit operations after each batch. If the batch size is too small, more commit operations are required, leading to increased feature indexing time. Conversely, excessively increasing the batch size does not enhance indexing efficiency. The experiments reveal that a batch size ranging from 500 to 1000 is optimal. Additionally, Fig. 12(d) depicts the query and retrieval time with respect to a query feature. Retrieval time grows with the number of videos in the result, so employing incremental retrieval is a prudent approach for latency reduction.

Fig. 13
figure 13

Video big data analytics performance evaluation on 1-Million frames of V3C1, Youtube-8m, and Sports-1M datasets: (a) Processing time and scalability testing (b) Accuracy results of keyword, image, and query clip

Video Big Data Analytics performance

We retrieved the top k videos based on a query video clip, image, or keyword, and then calculated the precision, recall, and accuracy as indicated by (8), (9), and (10), where True Positive (TP) represents the number of relevant videos retrieved, False Positive (FP) represents the number of non-relevant videos retrieved, True Negative (TN) represents the number of non-relevant videos not retrieved, and False Negative (FN) represents the number of relevant videos not retrieved.

$$\begin{aligned} Precision=\frac{TP}{TP + FP} \end{aligned}$$
(8)
$$\begin{aligned} Recall=\frac{TP}{TP + FN} \end{aligned}$$
(9)
$$\begin{aligned} Accuracy=\frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(10)
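For concreteness, the small helper below computes these metrics, together with the F1-score reported in Fig. 13(b), from the four counts; the numbers in the example call are illustrative only.

```python
# Worked example of Eqs. (8)-(10) plus F1 from retrieval counts.
def retrieval_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# illustrative counts for one top-k query
print(retrieval_metrics(tp=8, fp=2, tn=85, fn=5))
# precision = 0.80, recall ~ 0.62, accuracy = 0.93, f1 ~ 0.70
```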

Figure 13(a) shows the performance of feature and object extraction tasks using over 1 million keyframes from three datasets: Sports-1M, YouTube-8M, and V3C1. For all datasets, we observe a steady decrease in processing time as the number of computing nodes increases from 1 to 4, evidence of effective parallelism and workload distribution across nodes in our proposed system. For example, the total processing time for the Sports-1M dataset drops from over 600 minutes on a single node to under 200 minutes on four nodes. This threefold reduction shows that our system achieves near-linear scalability in practice. Overall, feature extraction using the VGG-19 model is more time-consuming than object extraction with YOLO, likely because of the deeper network structure and greater number of convolutional layers in VGG-19 compared to the optimized real-time design of YOLO.

Figure 13(b) shows the retrieval performance for three query types (text, image, and video clip) on the V3C1, YouTube-8M, and Sports-1M datasets. The evaluation uses four standard metrics: Accuracy, Precision, Recall, and F1-Score. Across the datasets, image-based retrieval consistently achieves the highest accuracy, suggesting that static visual features provide a strong semantic match for retrieval tasks. Video clip-based queries usually perform better in recall and F1-score, especially on the YouTube-8M and Sports-1M datasets, indicating that temporal features play a significant role in capturing relevant content. Text-based queries have relatively lower precision and recall, especially on large datasets like YouTube-8M, reflecting the inherent ambiguity of text-based queries and the semantic gap between keywords and visual content. While V3C1 shows fairly balanced results across all query types, YouTube-8M and Sports-1M show more variability, with text-based queries performing relatively weakly compared to image and video-clip queries. These differences underscore the importance of modality selection in retrieval systems and highlight the trade-offs between semantic specificity, content structure, and dataset characteristics.

Comparison with state-of-the-art

Our primary focus is on the architectural aspect of CBVR in a cloud environment with a complete end-to-end framework. To the best of our knowledge, there is no comparable work in the literature to date. In this section, we provide an in-depth feature-wise comparison of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with various state-of-the-art approaches in content-based video retrieval, as illustrated in Table 1. Our proposed system excels in multiple aspects, showcasing a comprehensive and unified approach to address the complexities of large-scale video analytics. Unlike [13], \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) adopts a Lambda Architecture, allowing it to efficiently handle both real-time and batch processing. The scalability of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) is evident through its scalable system architecture, pluggable framework, and service-oriented design, setting it apart from [44] and [36]. Additionally, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) leverages big data stores, distributed messaging, and incremental indexing, addressing the limitations of [38] and other baseline systems. The flexibility to process multi-type features, advanced feature encoding techniques, and support for both stream and batch processing contribute to the superior performance of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\). Notably, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) outperforms [7, 10, 17, 21, 45], and [20] by incorporating a multistream environment and providing robust support for querying by text, image, and clip. The combination of these features positions \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) as a highly efficient, versatile, and scalable solution for content-based video retrieval in large-scale environments.

Moreover, Table 5 presents a comprehensive performance comparison of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) with several state-of-the-art approaches in content-based video retrieval. Our proposed system exhibits superior performance with a processing time of 10.91 s, outperforming baseline systems such as [44] (baseline 1) with a processing time of 500 s and [36] (baseline 2) with 12.8 s. Notably, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) achieves a remarkable accuracy of 0.93 and precision of 0.67, surpassing other methods in the evaluation metrics. The enhanced performance of \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) can be attributed to its unified indexing mechanism (DEFI), which efficiently encodes deep features using a quantization method inspired by [48]. This encoding strategy not only avoids the drawbacks of a fixed constant quantization factor but also addresses the high dimensionality and dense nature of deep features. Additionally, the integration of Apache Spark, VidRDD, and a fine-tuned load-balancing mechanism contributes to the system’s scalability and efficient distributed processing. The adoption of VGG-19 and You Only Look Once (YOLO) v3 for deep feature extraction ensures the system’s ability to capture rich global spatial and object features, leading to improved accuracy in content-based video retrieval. Therefore, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) emerges as a robust and efficient solution for large-scale video analytics, offering significant advancements over existing state-of-the-art techniques.

Table 5 Platform performance comparison with State of the Art

Conclusion and future work

This paper introduces \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\), a feature-rich content-based video retrieval (CBVR) system designed for offline and near-real-time video retrieval from diverse sources, such as IP cameras. Tailored for cloud environments, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) emphasizes robustness, speed, and scalability. Built on lambda architecture principles, the system leverages distributed deep learning and in-memory computation for efficient video analytics and processing. The VidRDD abstraction serves as the core unit for distributed in-memory video analytics.

To enhance video retrieval, \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) computes multi-type features and supports seamless integration, indexing, and retrieval of these features through an incremental distributed index. Retrieval is further optimized by generating a query execution plan before the actual query is triggered. The system’s distributed data management and in-memory computation technologies ensure efficient operations.

We evaluated \(\uplambda \text {-}\hspace{2pt}\mathcal {CLOVR}\) using three benchmark datasets: YouTube-8M, Sports-1M, and V3C1, focusing on key bottlenecks. Results demonstrate satisfactory performance in terms of scalability, efficiency, computation time, and accuracy. We believe this work contributes to the research community and the industry, offering a scalable and efficient video big data analytics solution.