Abstract
The rise of mass video data, such as surveillance footage and personal video collections, has created a strong need for automatic video understanding technology. Several methods for automatic video captioning have been presented previously, but the existing methods suffer from problems such as long processing times over large numbers of frames and overfitting. Automating video captioning is a difficult task, and these problems degrade the accuracy of the final caption. To overcome these issues, this paper proposes Automatic Video Captioning using a Tree Hierarchical Deep Convolutional Neural Network and an attention segmental recurrent neural network with bi-directional Long Short-Term Memory (ASRNN-bi-directional LSTM). The captioning pipeline contains two phases: a feature encoder and a decoder. In the feature encoder phase, the Tree Hierarchical Deep Convolutional Neural Network (Tree CNN) encodes the video into a vector representation and extracts three kinds of features. In the decoder phase, the attention segmental recurrent neural network (ASRNN) decodes the vector into a textual description. ASRNN-based methods struggle with long-term dependencies; to address this, the model attends over all words generated by the bi-directional LSTM caption generator to extract global context information, since the context carried by the caption generator's hidden state alone is local and incomplete. In addition, Golden Eagle Optimization (GEO) is exploited to tune the ASRNN weight parameters. The proposed method is implemented in Python. It achieves 34.89%, 29.06% and 20.78% higher accuracy and 23.65%, 22.10% and 29.68% lower Mean Squared Error compared with the existing methods.
1 Introduction
The rise of mass video data, such as personal video collections and surveillance videos, has created a strong demand for automatic video understanding technology [1]. Image captioning and video captioning are difficult undertakings because they require expertise in both language processing and visual-content processing [2]. It is necessary to detect and represent image elements in visual material, such as scenery and objects [3], while coherent sentences must follow language conventions such as textual context [4]. Even though video captioning has made enormous strides, several issues remain unresolved [5, 6]. Although language dependence helps produce more plausible descriptions, excessive language dependence causes errors by associating terms that do not appear together in the video yet regularly co-occur in the dataset [7]. This affects the correctness of the final caption. Some remedies have been proposed to solve these problems, but the existing techniques have high error rates and do not classify video captions accurately. These drawbacks motivated this work.
The primary contributions of this work are abridged below:
-
Development of a two-phase captioning process comprising a Feature Encoder phase using Tree CNN [8] to extract 2D image, 3D motion, and sentence-context features.
-
Decoder phase using ASRNN [9] to convert these features into textual descriptions.
-
Implementation of Golden Eagle Optimization (GEO) to enhance ASRNN weight parameters, addressing long-term dependency issues and improving model accuracy and robustness [10, 11].
-
Demonstrated significant improvements in accuracy and reductions in mean squared error compared to existing methods.
2 Literature survey
Amongst the various studies on automatic video captioning, some recent research is reviewed here.
Prakash et al. [12] presented a method that leverages deep machine learning to automatically caption videos for proactive video management. A pipeline based on convolutional neural networks (CNNs) was suggested to process video reliably and efficiently; the video clip is processed and indexed in a database. The same indexing makes the method suitable for content-based image retrieval (CBIR) tasks. It offers greater accuracy but lower PSNR.
Deng et al. [13] introduced a syntax-guided hierarchical attention network (SHAN) for video captioning. The SHAN model incorporates visual and sentence-context information, with horizontal content and syntactic attention introduced to flexibly fuse features of varying temporal span and quality. It provides lower computation time but a higher error rate.
Zhao et al. [14] suggested a co-attention-model-based RNN for video captioning, known as CAM-RNN. The visual and textual data were encoded by the co-attention model (CAM), and the decoder, a recurrent neural network, generated the video caption. During generation, the attention component can adaptively concentrate on the salient areas of each frame that are closely related to the caption. It provides lower mean squared error and higher precision.
Islam et al. [15] presented a comprehensive survey of deep learning models for video captioning. It focused on state-of-the-art methodologies, with special emphasis on deep learning models, evaluated benchmark datasets across a range of parameters, and categorised the benefits and drawbacks of evaluation measures based on prior research in the deep learning field. It reached a higher F-score but a lower structural similarity index.
Zheng et al. [16] presented reasoning over visual dialogues with structural and partial observations. The task is explicitly formulated as inference in a graphical model with partially observed edges and an unobserved network structure, with the given dialogue entities treated as observed nodes. An Expectation-Maximization technique is introduced to infer both the underlying dialogue structure and the missing node values. It provides greater accuracy but high mean squared error.
Zellers et al. [17] presented "From Recognition to Cognition: Visual Commonsense Reasoning", developed as a visual application of common sense. A machine must correctly answer a difficult question about an image and then justify its answer. A novel method produces multiple-choice questions with minimal bias from rich annotations. It provides a lower error rate but higher computation time.
Alkalouti and Masre [18] presented an encoder-decoder model for automated video captioning using the YOLO algorithm. Unlike other video captioning methods that rely on other deep learning models, the suggested method applies YOLO to the MSVD dataset and shows good efficiency and accuracy, describing the video content in a meaningful sentence. It offers a lower execution time but a lower F-score.
3 Proposed methodology
In this manuscript, Automatic Video Captioning using a Tree Hierarchical Deep Convolutional Neural Network and ASRNN-bi-directional LSTM (Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC) is proposed for generating a title and a concise abstract for a video. The block diagram of the proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC technique is given in Fig. 1. The input video frames are taken from the jssprz/video-captioning datasets. The captioning pipeline contains two phases: a feature encoder and a decoder. In the feature encoder phase, Tree CNN encodes the video into a vector representation and extracts three types of features. In the decoder phase, ASRNN-LSTM decodes the vector into a textual description. GEO is employed to update the ASRNN weight parameters.
3.1 Image acquisition
The dataset used in this method is the jssprz/video-captioning dataset, from which the input video frames are acquired for the two-phase captioning pipeline (feature encoder and decoder).
3.2 Feature encoder under tree hierarchical deep convolutional neural network
In this section, the feature encoder based on Tree CNN [8] is discussed. The Tree CNN encodes the video into a vector representation and extracts the three types of features mentioned above. Its hierarchical architecture enables the network to learn hierarchical representations of the input, capturing features at multiple levels of abstraction and facilitating the extraction of complex patterns and relationships. Organising layers in a tree-like structure allows efficient feature extraction, with specific features extracted by different branches or pathways. The resulting hierarchical representations frequently generalise better to previously unseen data: they capture features invariant across scales, orientations, and positions, improving the model's robustness. The branching structure also enables multi-task learning, with different branches focusing on different tasks at the same time, and its parallel processing capability can yield computational efficiencies. Finally, the hierarchical structure provides insight into how features are organised, aiding interpretability and understanding of the learned representations.
A neural network is applied to the input video frames to learn correlation patterns and predict the target values. The Tree CNN model is inspired by hierarchical classifiers and comprises multiple nodes linked in a tree structure [19, 20]. The root node, the largest node of the tree, processes the video representation; each child node refines the classification tag, and the process continues until a leaf node is reached. Three kinds of information can be extracted: 3D motion, 2D image, and sentence-context features. While sentence-context features serve as the primary input for generating non-visual words, the 2D and 3D features are most significant for generating visual words. From each video, \(s^{\prime }\) frames and \(r^{\prime }\) optical-flow clips are extracted and fed to the network. Tree CNN then encodes the video into a vector representation and extracts the three types of features: 2D image, 3D motion, and sentence-context features.
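The root-to-leaf routing described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the `TreeNode` class and its argmax scoring rule are hypothetical placeholders for the CNN branches that would actually score each child.

```python
import numpy as np

class TreeNode:
    """One node of a tree-hierarchical classifier (illustrative stub)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def classify(self, features):
        # Stand-in for a CNN branch: route to the child whose index matches
        # the argmax of the feature scores (hypothetical scoring rule).
        if not self.children:
            return self.label  # leaf node: final tag
        idx = int(np.argmax(features[: len(self.children)]))
        return self.children[idx].classify(features)

# Root routes the video representation to one of three feature branches.
root = TreeNode("root", [TreeNode("2d-image"),
                         TreeNode("3d-motion"),
                         TreeNode("sentence-context")])
print(root.classify(np.array([0.1, 0.7, 0.2])))  # routes to "3d-motion"
```

In the actual model each branch would itself be a stack of convolutional layers; here a single score vector is enough to show the root-to-leaf control flow.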
Initially, the Tree CNN is trained on a dataset with states \(D = \left\{ {1,2,3,\ldots,n} \right\}\) containing \(n\) data items. The proposed network is a multi-layer neural network activated through a root node together with numerous leaf nodes, and the extraction of each of the three feature types is treated as a task. The root-node model creates a smaller data sample through encoding. The output of a Tree CNN node is a three-dimensional matrix denoted \(D^{K \times M \times N}\), where \(K\) is the number of children of the root node, \(M\) is the number of classes (video titles), and \(N\) is the number of data items per class. \(D(k,m,n)\) is the output of the \(k{\text{th}}\) neuron for the \(n{\text{th}}\) data item of the \(m{\text{th}}\) class, with \(k \in [1,K]\), \(m \in [1,M]\) and \(n \in [1,N]\). The output averaged over the \(N\) data items, denoted \(D_{avg}^{K \times M}\), is estimated in Eq. (1),
$$D_{avg} (k,m) = \frac{1}{N}\sum\limits_{n = 1}^{N} {D(k,m,n)} \quad (1)$$
In Eq. (1), the 2D image and 3D motion features are extracted. The soft-max probability is then computed over \(D_{avg}^{K \times M}\), and the probability matrix \(R^{K \times M}\) is measured by Eq. (2),
$$R(k,m) = \frac{{e^{{D_{avg} (k,m)}} }}{{\sum\nolimits_{m = 1}^{M} {e^{{D_{avg} (k,m)}} } }} \quad (2)$$
where \(R^{K \times M}\) is the probability matrix and \(e^{{D_{avg} (k,m)}}\) is the exponentiated node output used to improve the accuracy of the method. Next, an ordered feature list \((L)\) with similar properties is built from \(R^{K \times M}\); Eq. (2) yields the sentence-context features. The ordered list \((L)\) contains data samples \(\left[ {N_{1} ,N_{2} ,N_{3} } \right]\), each carrying \(m\) loads, and the video-captioning output values are arranged in descending order as \(\left[ {N_{1} \ge N_{2} \ge N_{3} } \right]\). Sorting ensures that the extracted features with the highest probability values are placed at the leaf nodes of the Tree CNN. Finally, the 2D image, 3D motion, and sentence-context features are extracted.
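The per-class averaging of Eq. (1), the soft-max of Eq. (2), and the descending sort can be sketched in a few lines of NumPy. The shapes and random values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 4, 3, 5                 # children of the root, classes, samples per class
D = rng.normal(size=(K, M, N))    # node outputs D[k, m, n]

# Eq. (1): average the N per-class outputs
D_avg = D.mean(axis=2)            # shape (K, M)

# Eq. (2): soft-max over classes gives the probability matrix R
expD = np.exp(D_avg - D_avg.max(axis=1, keepdims=True))  # numerically stabilised
R = expD / expD.sum(axis=1, keepdims=True)

# Ordered list L: class probabilities sorted in descending order,
# so the highest-probability features end up at the leaf nodes.
L_sorted = np.sort(R, axis=1)[:, ::-1]
```

Subtracting the row maximum before exponentiating does not change the soft-max result but avoids overflow for large activations.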
3.3 Decoder phase using ASRNN
In this section, the decoder phase using ASRNN [9] is discussed. The ASRNN decodes the vector into a textual description during the decoding phase. It comprises three layers, arranged bottom to top: a character-level encoder, a word-level encoder, and a semi-Markov conditional random field (semi-CRF) layer. During decoding, the segmental RNN processes a segment of the previously generated output sequence at each time step and employs an attention mechanism to focus on the relevant parts of the encoded input sequence; this attention mechanism helps determine the input parts most important for producing the next output element. The bottom character-level encoder extracts the character-level details of every word. The word embedding vectors and the collected character-level information form an n-dimensional vector that serves as the word-level encoder's input; this encoder produces a neural feature score for each label once phrase data has been extracted from each concatenated input vector. To train the semi-CRF jointly, these feature scores are transmitted to the semi-CRF layer, where they are combined with the original semi-CRF feature scores. The three layers are described in detail below.
-
Character-level representation for ASRNN
A neural network encoder combines the character-level data in the bottom layer. Compared with word-level information, it has two key advantages. First, character-level data helps model out-of-vocabulary words, which are common in Named Entity Recognition (NER). Second, character-level data provides additional morphological information from the characters of words; for instance, when identifying parts of speech, suffixes such as "ing" and "ed" are essential identifiers. The character-level encoder is built as follows. A bidirectional LSTM extracts global context information, because the context carried by the caption generator's hidden state alone is limited and insufficient; attending over all words formed by the caption generator resolves this. This also reduces overfitting in classification and the training time needed during feature extraction. The bottom Bi-LSTM maps the input series to a vector representation \((c_i)\), and the top Bi-LSTM scales the representations for every segment (word). The concatenation of the bottom Bi-LSTM's forward and backward passes over the raw character sequence serves as the representation of each token. After the representation of each word's characters is computed with the bottom Bi-LSTM, the representation of each span is computed with the top Bi-LSTM. The formulas are given in Eqs. (3), (4) and (5).
where \(V_{jT}\) is the concatenated output, \(g_{jT}\) is the backward hidden state at position \(j\), and \(T\) is the time step.
where \(\beta_{jT}\) is the corresponding attention weight, \(V_{jT}\) is the hidden representation at the current time step, and \(V_{w\prime }\) is the global context vector.
where \(R_{j}\) is the final word representation, \(\beta_{jT}\) is the corresponding attention weight, and \(g_{jT}\) is the backward hidden state.
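A minimal NumPy sketch of the attention pooling behind Eqs. (3)-(5), under simplifying assumptions: the forward and backward LSTM states are stubbed with random arrays (a real encoder would compute them), and the attention score is taken as a plain dot product against the global vector \(V_{w'}\).

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 8                       # characters in the word, hidden size per direction
f = rng.normal(size=(T, d))       # forward LSTM states (stubbed)
g = rng.normal(size=(T, d))       # backward LSTM states g_jT (stubbed)

# Eq. (3): concatenated Bi-LSTM output V_jT
V = np.concatenate([f, g], axis=1)          # shape (T, 2d)

# Eq. (4): attention weights beta_jT against a global vector V_w'
V_w = rng.normal(size=2 * d)
scores = V @ V_w
beta = np.exp(scores - scores.max())        # stabilised soft-max
beta /= beta.sum()

# Eq. (5): attention-pooled word representation R_j
R_j = (beta[:, None] * V).sum(axis=0)       # shape (2d,)
```

The same pattern repeats at the word level, with the top Bi-LSTM playing the role of `f`/`g` over word positions instead of characters.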
-
Word-level representation for ASRNN
The word-level encoder also employs the ASRNN model. The retrieved character-level representations are concatenated with pre-trained GloVe word embeddings. A Bi-LSTM produces the context vector \(C_{k\prime }\) at every time step \(k^{\prime }\); to estimate the segmental representation \(se_{j}\) for every segment \(j\), an attention procedure is applied over the context vectors \(C_{k\prime }\). A dynamic recursion computes segments of lengths 1 to L, with L the longest segment allowed. Label categorisation is then carried out by a fully connected layer on each segmental representation \(se_{j}\). The label scores calculated by the fully connected layer serve directly as the neural feature scores; no convolution layers are required, since the scores are obtained directly from the fully connected layer. These neural feature scores are transmitted to the semi-CRF together with the conventional semi-CRF features to train it jointly. The computation process is given below.
where \(y\) is the input concatenation of the character-level data and the pre-trained word embedding, \((:)\) denotes the concatenation operation, \(C_{jT}\) is the context vector, \(\beta_{jT}\) is the attention weight, and MLP is the fully connected projection layer on top of the Bi-LSTM that computes the neural feature values. Both the character-level and word-level encoders include an attention mechanism, enabling the model to attend to particular letters and words in different ways; this information is essential for sequence-labelling tasks. In POS tagging, the suffixes "ing" and "ed", which are character-level cues, indicate that the word is likely an adjective; word-level cues such as "is" demonstrate that the preceding portion in an NER problem is a noun phrase. The RNN's attention strategy helps the model exploit this information and improve performance. The time complexity of the attention-based encoder is \(TC\left( {M^{2} } \right)\), where \(M\) is the input sentence length. Although this is higher than that of a traditional RNN, \(\left( {TC\left( M \right)} \right)\), the performance boost is the main topic of this paper.
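The dynamic recursion over segments of lengths 1 to L described above boils down to enumerating candidate spans. A small sketch (the token list and `max_len` are illustrative):

```python
def enumerate_segments(tokens, max_len):
    """All candidate segments of length 1..max_len, as (start, end) spans."""
    spans = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end <= len(tokens):
                spans.append((start, end))
    return spans

spans = enumerate_segments(["the", "cat", "sits"], max_len=2)
# Every position yields segments of length 1 and, where possible, length 2.
```

In the full model, each span would be fed through the attention procedure to obtain its segmental representation \(se_j\) before label scoring; bounding the span length by L keeps the number of candidates linear in the sentence length.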
-
Semi-CRF layer for ASRNN
Here, \(se_{j}\) represents the segmental representations, and \(X_{1} ,X_{2}\) are the measured output sequences, drawn as circular nodes; the optional semi-CRF features are indicated by dashed lines. Unlike the sparse features of a traditional CRF, the objective function is modified to attend to the neural features. Equation (9) gives the conditional distribution of a potential output sequence \(r^{\prime }\) over an input sequence \(y\) in the improved semi-CRF layer,
$$Q\left( {r^{\prime } |y} \right) = \frac{1}{z\left( y \right)}\exp \left( {w_{1}^{\prime } B\left( {y,r^{\prime } } \right) + w_{2}^{\prime } f\left( {y,r^{\prime } } \right)} \right) \quad (9)$$
where \(B\left( {y,r^{\prime } } \right)\) is the conventional semi-CRF feature score, \(w_{1}^{\prime } ,\;w_{2}^{\prime }\) are the weights of the sparse CRF and neural features, \(f\left( {y,r^{\prime } } \right)\) is the set of neural feature scores, and \(z\left( y \right)\) is the factor normalising over all feasible segmentations \(r^{\prime }\) of \(y\). Maximum conditional likelihood estimation is used to train the neural semi-CRF. The log-likelihood over the training set \(\left\{ {\left( {y_{j} ,r_{j}^{\prime } } \right)} \right\}\) is expressed in Eq. (10),
$$LL = \sum\limits_{j} {\log Q\left( {r_{j}^{\prime } |y_{j} } \right)} \quad (10)$$
The time complexity of the semi-CRF layer is expressed in Eq. (11),
$$TC\left( {m\,l\,X^{2} } \right) \quad (11)$$
where \(m\) is the sequence length, \(l\) is the maximum segment length, and \(X\) is the number of labels.
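The conditional distribution in Eq. (9) can be illustrated by brute force on a tiny sequence. Everything here is a simplified stand-in: `seg_score` is a hypothetical placeholder for the combined score \(w_1' B + w_2' f\), and the normaliser \(z(y)\) is computed by enumerating every segmentation rather than by the dynamic program a real semi-CRF would use.

```python
import math

def segmentations(n, max_len):
    """All ways to split positions 0..n into contiguous segments of length <= max_len."""
    if n == 0:
        return [[]]
    out = []
    for length in range(1, min(max_len, n) + 1):
        for rest in segmentations(n - length, max_len):
            out.append([(n - length, n)] + rest)
    return out

def seg_score(span):
    # Hypothetical stand-in for w1'*B(y, span) + w2'*f(y, span);
    # here it simply favours longer segments.
    return 0.3 * (span[1] - span[0]) ** 2

def semi_crf_prob(target, n, max_len):
    """Eq. (9): Q(r'|y) = exp(score(r')) / z(y), z(y) summed over all segmentations."""
    all_scores = {tuple(sorted(s)): sum(seg_score(sp) for sp in s)
                  for s in segmentations(n, max_len)}
    z = sum(math.exp(v) for v in all_scores.values())
    return math.exp(all_scores[tuple(sorted(target))]) / z

p = semi_crf_prob([(0, 2), (2, 3)], n=3, max_len=2)
```

Exhaustive enumeration is exponential in the sequence length; the forward recursion behind Eq. (11) computes the same \(z(y)\) in \(O(m\,l\,X^{2})\) time.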
3.4 Golden eagle optimization for optimizing attention segmental recurrent neural network (ASRNN)
The golden eagle optimization (GEO) technique is exploited to tune the ASRNN's parameters \(l_{d} \left( {W^{\prime } } \right)\;{\text{and}}\;Q\left( {r^{\prime } |y} \right)\). GEO was chosen for optimizing the ASRNN-bi-directional LSTM in automatic video captioning because of its ability to balance exploration and exploitation in the search space, which is crucial for navigating the complex parameter landscape of this neural network. GEO mimics the hunting behaviour of golden eagles, providing robust adaptability and strong performance across optimization scenarios. Preliminary experiments showed that GEO outperforms traditional algorithms in convergence speed and error reduction, which is crucial for minimizing training time and enhancing model accuracy. Thus, GEO's efficiency, adaptability, and robustness make it a suitable choice for optimizing the ASRNN-bi-directional LSTM in this research. The process is repeated until the ideal result is obtained. The flow chart for GEO is shown in Fig. 2, and the stepwise process is given below:
3.4.1 Step 1: Initialization
The golden eagle population is initialized based on the golden eagles' spiral motion. Each golden eagle remembers a previously visited location, so both the memory of every eagle and the population are initialized. In this work, the population size equals the total number of video-captioning images.
3.4.2 Step 2: Random generation
In each iteration \((i)\), each golden eagle \((l)\) randomly selects the prey of another golden eagle \((n)\), i.e. the best location visited by eagle \((n)\), and circles it. Eagle \((n)\) is chosen from the eagles' memories, \(n \in \left\{ {1,2,3,\ldots,Pop\_size} \right\}\).
3.4.3 Step 3: Fitness function
A random solution is created from the initialized values. The fitness function evaluates the optimization parameters, \(l_{d} \left( {W^{\prime } } \right)\;{\text{and}}\;Q\left( {r^{\prime } |y} \right)\), which update the ASRNN weight and bias parameters.
3.4.4 Step 4: Update exploitation behaviour of golden eagle to enhance \(l_{d} \left( {W^{\prime } } \right)\)
The attack depends on the golden eagle's current position and ends at the prey location stored in the eagle's memory. The golden eagle exploitation (attack) vector \((E\vec{a}_{k} )\) is
$$E\vec{a}_{k} = \vec{X}_{m}^{*} - \vec{X}_{k}$$
where \(\vec{X}_{m}^{*}\) is the best place visited by golden eagle \((m)\) and \(\vec{X}_{k}\) is the present position of eagle \((k)\). The exploitation vector directs the golden eagle population to the best visited locations; this constitutes the exploitation stage of GEO.
3.4.5 Step 5: Update exploration behaviour (cruise) of golden eagle to enhance \({\text{Q}}\left( {{\text{r}}^{\prime } {\text{|y}}} \right)\)
The cruise vector is determined from the exploitation vector: it is perpendicular to the exploitation vector and tangent to the circle. For a golden eagle tracking prey, cruising translates into linear speed. Consequently, the cruise vector in n dimensions lies in the tangent hyperplane of the circle. This is computed by Eq. (14),
where \(E\vec{a}_{k} = \left[ {ea_{1} ,\ldots,ea_{n} } \right]\) is the exploitation vector, \(\vec{X}_{k} = \left[ {x_{1} ,x_{2} ,\ldots,x_{n} } \right]\) is the decision-variable vector, and \(X^{*} = \left[ {x_{1}^{ * } ,\ldots,x_{n}^{ * } } \right]\) is the selected prey position used to improve the parameters. The fitness value is updated by combining exploration with the GEO exploitation behaviour, which determines the best classification parameters.
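The attack and cruise vectors can be sketched numerically. This is a simplified illustration, not GEO's exact Eq. (14): the cruise vector is obtained here by projecting a random vector onto the hyperplane orthogonal to the attack vector (which satisfies the perpendicularity requirement), and the positions and mixing coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 4
X_k = rng.normal(size=dim)        # current eagle position
X_star = rng.normal(size=dim)     # best position in the chosen eagle's memory (the "prey")

# Attack (exploitation) vector: points from the eagle to the prey
attack = X_star - X_k

# Cruise (exploration) vector: must be perpendicular to the attack vector,
# i.e. tangent to the circle. Simplified construction: remove from a random
# vector its component along `attack`.
rand = rng.normal(size=dim)
cruise = rand - (rand @ attack) / (attack @ attack) * attack

# Candidate move combining both behaviours (coefficients are illustrative;
# GEO anneals the attack/cruise weights over iterations).
step = 0.5 * attack + 0.5 * cruise
```

In the full algorithm the attack weight grows and the cruise weight shrinks over iterations, shifting the search from exploration to exploitation.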
3.4.6 Step 6: Termination
GEO updates the exploitation and exploration behaviours to optimize the ASRNN classifier parameters. The objective function is used to increase accuracy while decreasing computation time and error. The algorithm repeats from Step 3, with \(i = i + 1\), until the termination condition is met.
4 Results and discussion
The experimental outcomes of the proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method are discussed in this section. The implementation begins with the jssprz/video-captioning datasets, which include over 25 video datasets paired with textual descriptions. Video frames are extracted and processed with Tree CNN to derive 2D image, 3D motion, and sentence-context features, which are encoded into vector representations. Caption generation uses the ASRNN-bi-directional LSTM architecture, and GEO refines the ASRNN weight parameters for enhanced performance. The model is trained on a training/validation split of the dataset and evaluated using the metrics below. The experimental setup involves a high-performance CPU and an NVIDIA GPU with CUDA support, with at least 32 GB of RAM and SSD storage; Python is used alongside TensorFlow or PyTorch for the deep learning implementation. Training parameters include varying batch sizes, adaptive learning rates, and epochs typically ranging from 50 to 100. The obtained outcomes of the proposed method are then compared with the existing CNN-VGG16-LSTM-AVC [13], SHAN-LSTM-AVC [14], and CAM-RNN-LSTM-AVC [15] models.
4.1 Dataset description
The jssprz/video-captioning dataset is utilized here; it comprises a comprehensive collection of over 25 video datasets paired with textual descriptions for training and evaluating video-captioning models. Each dataset includes temporal-localization information for each description, aligning it with specific segments of the video content. Notably, the videos encompass audio components, adding a multimodal dimension to the captioning task. On average, the textual descriptions contain more than eleven words, providing sufficient context and detail for accurate caption generation. In total, the dataset comprises 80,838 captions associated with 1,970 video clips, ensuring a diverse and robust environment for training and evaluating automated video-captioning systems.
4.2 Performance metrics
The following metrics are used to validate the performance of the proposed system [21].
4.2.1 Mean squared error (MSE)
The difference between the original and compressed image data is used to evaluate image quality:
$$MSE = \frac{1}{{M^{\prime } N^{\prime } }}\sum\limits_{a = 1}^{{M^{\prime } }} {\sum\limits_{b = 1}^{{N^{\prime } }} {\left[ {g\left( {a,b} \right) - g^{\prime } \left( {a,b} \right)} \right]^{2} } }$$
where \(g\left( {a,b} \right)\) denotes the original input image, \(g^{\prime } \left( {a,b} \right)\) the compressed image, and \(M^{\prime } N^{\prime }\) the image dimensions.
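The MSE defined above is a direct element-wise computation; a minimal sketch:

```python
import numpy as np

def mse(original, compressed):
    """Mean squared error between two equal-size images."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(compressed, dtype=float)
    return float(np.mean((a - b) ** 2))

print(mse([[0, 0], [0, 0]], [[1, 1], [1, 1]]))  # 1.0
```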
4.2.2 Peak signal to noise ratio (PSNR)
This value represents the ratio of signal intensity to noise in the transmission; the higher the PSNR, the higher the image quality. The PSNR is expressed in Eq. (16),
$$PSNR = 10\log_{10} \left( {\frac{{\max^{2} }}{MSE}} \right) \quad (16)$$
where \(\max^{2}\) is the squared maximum pixel intensity of the ideal image.
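Eq. (16) in code, assuming 8-bit images (maximum pixel value 255):

```python
import math

def psnr(mse_value, max_pixel=255.0):
    """Eq. (16): PSNR = 10 * log10(max^2 / MSE), in decibels."""
    if mse_value == 0:
        return float("inf")       # identical images: no noise
    return 10.0 * math.log10(max_pixel ** 2 / mse_value)
```

An MSE equal to the squared peak value gives 0 dB, and smaller errors give proportionally higher PSNR on the log scale.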
4.2.3 Structural similarity index (SSIM)
SSIM measures the perceptual difference between two similar pictures. SSIM is expressed in Eq. (17),
$$SSIM = \frac{{\left( {2\nu_{a} \nu_{b} + sc_{1} } \right)\left( {2\lambda_{ab} + sc_{2} } \right)}}{{\left( {\nu_{a}^{2} + \nu_{b}^{2} + sc_{1} } \right)\left( {\lambda_{a}^{2} + \lambda_{b}^{2} + sc_{2} } \right)}} \quad (17)$$
where \(\nu_{a} \;{\text{and}}\;\nu_{b}\) are the mean luminances of images \(a\) and \(b\), \(\lambda_{a} \;{\text{and}}\;\lambda_{b}\) are the standard deviations (the contrast estimates of the signals), \(\lambda_{ab}\) is their covariance, and \(sc_{1} \;{\text{and}}\;sc_{2}\) are very small constants.
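A simplified sketch of the SSIM formula, computed over the whole image in a single window rather than over local sliding windows as production implementations do; the constants correspond to the usual choices for the 8-bit pixel range and are assumptions, not values from the paper:

```python
import numpy as np

def ssim_global(a, b, c1=6.5025, c2=58.5225):
    """Single-window SSIM over whole images (simplified illustration)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mu_a, mu_b = a.mean(), b.mean()          # luminance terms
    var_a, var_b = a.var(), b.var()          # contrast terms
    cov = ((a - mu_a) * (b - mu_b)).mean()   # structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Identical images score exactly 1; any luminance, contrast, or structure mismatch pulls the score below 1.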
4.2.4 Accuracy
Accuracy is the ratio of exact predictions to the total count of instances in the dataset. It is calculated in Eq. (18),
$$Accuracy = \frac{{T_{P}^{\prime } + \hat{T}_{N} }}{{T_{P}^{\prime } + \hat{T}_{N} + F_{P}^{\prime } + \hat{F}_{N} }} \quad (18)$$
where \(T_{P}^{\prime }\) specifies true positives, \(\hat{T}_{N}\) true negatives, \(F_{P}^{\prime }\) false positives, and \(\hat{F}_{N}\) false negatives.
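Eq. (18) as a one-liner, with illustrative counts:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (18): correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(40, 45, 10, 5))  # 0.85
```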
4.2.5 Computational time
This measures the time needed to produce the final output image and is calculated by Eq. (19),
$$CT = J \times CPI \times P \quad (19)$$
where \(J\) is the number of images, \(CPI\) the cycles per instruction, and \(P\) the clock period.
4.3 Simulation results
The simulation results of Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC are illustrated in Figs. 3, 4, 5, 6, 7, 8 and 9. The metrics above are analysed against the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods.
Figure 3 shows the accuracy assessment, which evaluates the proposed method's performance by counting correctly detected instances relative to the total instances in the dataset. The Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 34.89%, 29.06% and 20.78% higher accuracy than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
Figure 4 displays the mean squared error analysis. MSE is a performance metric used in deep learning and statistics to assess the accuracy of a prediction technique: it computes the average squared difference between predicted and actual values, expressed here as a percentage. The proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 23.65%, 22.10% and 29.68% lower MSE than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
Figure 5 depicts the peak signal-to-noise ratio analysis. PSNR examines the quality of reconstructed or compressed images and videos by comparing them with the original, uncompressed version; it is expressed here as a percentage. The proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 29.63%, 19.68% and 29.67% greater PSNR than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
Figure 6 shows the structural similarity index analysis. SSIM is a widely used metric for assessing the similarity between input images, commonly employed in image processing and computer vision to evaluate the quality of image compression, denoising, and other enhancement techniques. The SSIM index compares structural information such as brightness, contrast, and structure, expressed here as a percentage. The proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 30.21%, 39.65% and 28.74% higher SSIM than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
Figure 7 represents the efficiency analysis, which assesses how well the proposed Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method utilizes resources and achieves its goals, expressed as a percentage. The proposed method shows 22.36%, 29.68% and 33.17% higher efficiency than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods.
Figure 8 shows the error rate analysis. The error rate measures the accuracy of a predictive model as the ratio of incorrectly predicted instances to the total instances in the dataset. The Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 14.32%, 19.78% and 18.65% lower error rates than the CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
Figure 9 shows the computation time analysis. Computation time measures the time needed to complete a task and assesses the effectiveness and speed of the algorithm. The Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method shows 9.63%, 39.67% and 49.67% lower computation time than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively.
5 Conclusion
In this manuscript, Automatic Video Captioning using a Tree Hierarchical Deep Convolutional Neural Network and ASRNN-bi-directional LSTM was proposed and effectively applied to generating captions for input videos. The simulation was carried out in Python and the performance was evaluated using the metrics above. The Tree-CNN-ASRNN-Bi-LSTM-GEO-AVC method attains 9.63%, 39.67% and 49.67% lower computation time and 30.21%, 39.65% and 28.74% greater SSIM than the existing CNN-VGG16-LSTM-AVC, SHAN-LSTM-AVC, and CAM-RNN-LSTM-AVC methods respectively. While effective in improving video-captioning accuracy, the proposed research still faces challenges, such as substantial processing time on large datasets due to the complexity of the Tree CNN and ASRNN-bi-directional LSTM architectures, and a potential for overfitting, particularly with limited or variable training data. The model's performance depends on the quality and diversity of the dataset, and it may struggle with nuanced or complex contexts. Future work should focus on enhancing efficiency through more streamlined architectures or hybrid optimization techniques, implementing advanced regularization methods to further mitigate overfitting, and using larger, more diverse datasets, including multilingual options, to improve generalization. Additionally, integrating multimodal data, exploring transfer learning, enabling real-time captioning, and incorporating user-interaction mechanisms for adaptive learning can significantly enhance the model's robustness and applicability.
Data availability
Data sharing does not apply to this article as no new data has been created or analyzed in this study.
Change history
05 March 2025
The original online version of this article was revised to update the corresponding author’s affiliation. This has been corrected now.
13 March 2025
A Correction to this paper has been published: https://doi.org/10.1007/s00607-025-01446-7
References
Shi X, Cai J, Gu J, Joty S (2020) Video captioning with boundary-aware hierarchical language decoding and joint video prediction. Neurocomputing 417:347–356
Xu N, Zhang H, Liu AA, Nie W, Su Y, Nie J, Zhang Y (2019) Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Trans Multimedia 22(5):1372–1383
Bahrehdar AR, Adams B, Purves RS (2020) Streets of London: Using Flickr and Open Street Map to build an interactive image of the city. Comput Environ Urban Syst 84:101524
Abdi A, Shamsuddin SM, Hasan S, Piran J (2019) Deep learning-based sentiment classification of evaluative text based on Multi-feature fusion. Inf Process Manage 56(4):1245–1259
Jasper GnanaChandran J, Karthick R, Rajagopal R, Meenalochini P (2023) Dual-channel capsule generative adversarial network optimized with golden eagle optimization for pediatric bone age assessment from hand X-Ray image. Int J Pattern Recognit Artif Intell 37(02):2354001
Karthick S, Gomathi N (2024) IoT-based COVID-19 detection using recalling-enhanced recurrent neural network optimized with golden eagle optimization algorithm. Med Biol Eng Comput 62(3):925–940
Aafaq N, Akhtar N, Liu W, Mian A (2021) Empirical autopsy of deep video captioning encoder-decoder architecture. Array 9:100052
Roy D, Panda P, Roy K (2020) Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Netw 121:148–160
Lin JC, Shao Y, Djenouri Y, Yun U (2021) ASRNN: a recurrent neural network with an attention model for sequence labeling. Knowl-Based Syst 212:106548
Mohammadi-Balani A, Nayeri MD, Azar A, Taghizadeh-Yazdi M (2021) Golden eagle optimizer: A nature-inspired metaheuristic algorithm. Comput Ind Eng 152:107050
https://github.com/jssprz/video_captioning_datasets
Om Prakash S, Udhayakumar S, Anjum Khan R, Priyadarshan R (2021) Video captioning for proactive video management using deep machine learning. In: Advances in smart system technologies: Select proceedings of ICFSST 2019, Springer Singapore, pp 801–811
Deng J, Li L, Zhang B, Wang S, Zha Z, Huang Q (2021) Syntax-guided hierarchical attention network for video captioning. IEEE Trans Circuits Syst Video Technol 32(2):880–892
Zhao B, Li X, Lu X (2019) CAM-RNN: Co-attention model based RNN for video captioning. IEEE Trans Image Process 28(11):5552–5565
Islam S, Dash A, Seum A, Raj AH, Hossain T, Shah FM (2021) Exploring video captioning techniques: a comprehensive survey on deep learning methods. SN Comput Sci 2(2):1–28
Zheng Z, Wang W, Qi S, Zhu SC (2019) Reasoning visual dialogs with structural and partial observations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6669–6678
Zellers R, Bisk Y, Farhadi A, Choi Y (2019) From recognition to cognition: visual commonsense reasoning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition 2019, pp 6720–6731
Alkalouti HN, Masre MA (2021) Encoder-decoder model for automatic video captioning using YOLO algorithm. In: 2021 IEEE International IOT, electronics and mechatronics conference (IEMTRONICS), IEEE, pp 1–4
Gao L, Wang X, Song J, Liu Y (2020) Fused GRU with semantic-temporal attention for video captioning. Neurocomputing 395:222–228
Zhang B, Zou G, Qin D, Lu Y, Jin Y, Wang H (2021) A novel Encoder-Decoder model based on read-first LSTM for air pollutant prediction. Sci Total Environ 765:144507
Sara U, Akter M, Uddin MS (2019) Image quality assessment through FSIM, SSIM, MSE and PSNR—a comparative study. J Comput Commun 7(3):8–18
Acknowledgements
Not applicable
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Authors and Affiliations
Contributions
Dr. N. Kavitha-(Corresponding Author)—Conceptualization Methodology, Original draft preparation. Dr. K. Ruba Soundar—Supervision. Dr. R. Karthick—Supervision. Mrs. J. Kohila—Supervision.
Corresponding author
Ethics declarations
Conflict of interests
The authors declare no conflict of interests.
Consent to participate
This article does not contain any studies with human participants performed by any of the authors.
Consent for publication
Not Applicable.
Human and animal ethics
Not Applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kavitha, N., Soundar, K.R., Karthick, R. et al. Automatic video captioning using tree hierarchical deep convolutional neural network and ASRNN-bi-directional LSTM. Computing 106, 3691–3709 (2024). https://doi.org/10.1007/s00607-024-01334-6