Depth-Aware Unpaired Video Dehazing

IEEE Transactions on Image Processing, vol. 33, 2024
Abstract— This paper investigates a novel unpaired video dehazing framework, which can be a good candidate in practice by relieving the pressure of collecting paired data. In such a paradigm, two key issues need to be considered for satisfactory performance: 1) temporal consistency, which is not involved in single image dehazing, and 2) better dehazing ability. To handle these problems, we resort to introducing depth information to construct additional regularization and supervision. Specifically, we synthesize realistic motions with depth information to improve the effectiveness and applicability of traditional temporal losses, thus better regularizing the spatiotemporal consistency. Moreover, the depth information is also considered in terms of adversarial learning. For haze removal, the depth information guides the local discriminator to focus on regions where haze residuals are more likely to exist. The dehazing performance is consequently improved by more pertinent guidance from our depth-aware local discriminator. Extensive experiments are conducted to validate our effectiveness and superiority over other competitors. To the best of our knowledge, this study is the initial foray into the task of unpaired video dehazing. Our code is available at https://github.com/YaN9-Y/DUVD.

Index Terms— Video dehazing, image dehazing, unpaired learning.

Manuscript received 16 August 2023; revised 30 January 2024; accepted 3 March 2024. Date of publication 22 March 2024; date of current version 28 March 2024. This work was supported by the National Natural Science Foundation of China under Grant 62072327 and Grant 62372251. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jun Cheng. (Corresponding author: Xiaojie Guo.)
Yang Yang and Xiaojie Guo are with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China (e-mail: [email protected]; [email protected]).
Chun-Le Guo is with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TIP.2024.3378472

I. INTRODUCTION

THE visibility of images and videos captured under hazy conditions is typically degraded by the scattering effect of aerosol particles [1]. Both human observers and computer vision systems may be affected by such degradation. With the emergence of deep learning, deep-neural-network-based dehazing methods have dominated the mainstream [2], [3], [4], [5]. However, most of these methods heavily depend on (massive) paired hazy-clean data. To mitigate the requirement of paired data, the unpaired dehazing paradigm is an effective option, which translates images between the hazy and clean domains. For unpaired single image dehazing, a number of approaches already exist, such as [6], [7], [8], [9], and [10]. But, without consideration of the temporal inconsistency issue, a key factor in video quality, they usually produce temporally inconsistent videos with severe flickering artifacts. Thus, developing effective unpaired video dehazing algorithms is demanded, as the task has rarely been studied in the literature.

In comparison with single image dehazing, a core concern for video dehazing is i) how to guarantee the temporal consistency across frames under unpaired supervision. Previous attempts first construct future frames and then employ the recycle loss [11], which encourages consistency between the future frames produced by translating across time and domains (or its variants), to achieve temporal consistency. Specifically, RecycleGAN [11] directly estimates the future frame, and MocycleGAN [12] predicts the relative motion of the future frame with an additional network. To avoid the error caused by the motion estimator, URecycGAN [13] randomly synthesizes motion to produce future frames. However, these manners can hardly be adapted to the unpaired video dehazing task, for the following primary reasons. First, the estimated or synthesized motion is inaccurate or even random [13], resulting in severely distorted warped future frames. Consequently, the generator is forced to fit such low-quality frames, which very likely undermines the performance on low-level tasks. Moreover, most of the above-mentioned methods are built for general video translation tasks. One important characteristic of the dehazing task was not considered in those frameworks: it is difficult to synthesize relatively accurate future hazy frames for constructing temporal constraints, since the haze distribution should vary according to scene depth changes along with motions, and modeling such variation is beyond the capacity of the original schemes.

In addition, as a common issue for CycleGAN-based unpaired dehazing, we also have to face that ii) the original CycleGAN-based architecture shows limited performance on dehazing tasks. In other words, most CycleGAN-based methods cannot accurately perform haze removal and usually leave observable haze. The global adversarial loss only measures whether the processed image conforms to the distribution of the whole target domain, which is too general to provide sufficient guidance for faithfully recovering local details. Consequently, how to provide more powerful supervision to meet the dehazing requirement becomes another thorny problem.

To address the above issues, this work exploits depth information as a clue to build a novel unpaired video dehazing framework. Specifically, we propose to utilize depth information to simulate ego-motion, enabling the synthesis of future frames with greater accuracy than methods based on random generation or estimation. This technique significantly enhances the temporal consistency by using synthesized frames as a basis for building cross-frame unpaired supervision. Moreover, we integrate depth information with the Atmospheric Scattering Model (ASM) in a cross-frame manner, allowing model
1941-0042 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: University Town Library of Shenzhen. Downloaded on May 22,2024 at 08:57:23 UTC from IEEE Xplore. Restrictions apply.
refinement of the subtle variations in haze density resulting from object motion in video sequences. This design significantly elevates the authenticity and quality of the synthesized frames. Consequently, it not only enhances the temporal consistency but also remarkably improves dehazing performance. Finally, our approach adopts depth information to strategically direct the local discriminator's attention towards more distant regions, typically characterized by denser and more challenging haze. This targeted enhancement in haze removal capability effectively addresses areas traditionally difficult to dehaze.

Overall, our contributions can be summarized as follows:
• We propose a novel depth-aware unpaired video dehazing framework. To the best of our knowledge, this is the pioneering work in unpaired video dehazing.
• We introduce depth information to accurately simulate the ego-motion and model the haze variation across frames. The obtained synthetic frames are then utilized to attain the spatiotemporal consistency.
• We employ the depth information as guidance for the local discriminator to focus more on distant regions, where the haze tends to be denser and harder to remove. The depth-aware discriminator can provide more powerful supervision than the original one.
• Extensive experiments together with ablation studies are conducted to reveal the effectiveness of our design, and its superiority over other state-of-the-art alternatives.

II. RELATED WORK

Since the task of unpaired video dehazing has rarely been studied up till now, in what follows, we briefly introduce the works that are closely related to ours, which mainly include single image dehazing, video dehazing, and unpaired video translation.

A. Single Image Dehazing

Single image dehazing aims to recover the clean image from a hazy input. Existing methods can be roughly divided into prior-based, supervised, and unsupervised learning-based approaches.

1) Prior-Based Methods: The prior-based methods concentrate on discovering priors on haze-free images from statistical analysis to estimate the transmission and the atmospheric light [14], [15], [16], [17], [18]. Specifically, DCP [15] assumes that in haze-free images, the lowest intensity in a local block should be close to zero. CAP [17] builds a linear model between scene depth and the difference of pixel saturation and value in HSV space. More recently, SLP [18] employs the intrinsic relevance between pixels to achieve a reliable saturation line construction for transmission estimation. However, as a serious drawback, these algorithms are built upon hand-crafted priors, which are vulnerable to the variation of the environment.

2) Supervised Learning-Based Methods: The supervised learning-based strategies adopt powerful deep models to learn the dehazing process from paired training data. Early works focus on estimating the transmission or its variants [2], [19], [20], while later methods [3], [4], [21], [22], [23] turn to directly estimating the haze-free image. More recent approaches tend to introduce other specific designs to further improve the performance. For instance, AECR-Net [5] proposed the contrastive perceptual loss as regularization, and C2PNet [24] further improves it by classifying the negative samples at a finer granularity. FFANet [4] proposed the channel and pixel attention mechanism to improve the representation ability of the dehazing network, while SHA [25] employed the separable hybrid attention module to encode haze density by capturing features in the orthogonal directions. Besides, with the great success of vision transformers (ViTs) in computer vision tasks [26], [27], some works attempt to apply ViTs to assist the dehazing task. Specifically, DehazeFormer [28] improves the original vision transformer in several aspects of network design for the dehazing task. DeHamer [29] exploits the dark channel map to build a 3D positional embedding in the vision transformer to provide the relative position and suggest the haze density of different spatial regions. Although they achieved remarkable performance on specific benchmark datasets, these methods depend heavily on paired data and easily suffer from the overfitting problem [30].

3) Unsupervised Learning-Based Methods: The unsupervised solutions aim to learn the dehazing ability without relying on paired training data. A few works solely adopt hazy observations for training [31], [32], [33], [34], but these methods often heavily rely on minimizing hand-crafted-prior-based objective functions, which also tend to make the models vulnerable to the variation of the environment. Other methods attempt to learn the transformation under unpaired supervision [6], [7], [8], [10], [35], [36], [37], [38], [39]. Most unpaired methods are built upon cyclic consistency [40] with other specific designs. In particular, Cycle-Defog2Refog [41] uses a two-stage mapping strategy in both translation paths to improve the performance of haze removal. D4 [10] seeks to disentangle the transmission map into density and depth to make the translation process more physically reasonable. Inspired by [10], POGAN leverages depth-density priors and combines a Physical Restoration Network with a Degradation Rendering Network to enhance generalization to real-world degraded images in various harsh scenarios. CDD-GAN [9] exploits contrastive learning to suppress the task-irrelevant factors and enhance the task-relevant ones. U2D2Net further introduced an unsupervised denoising network based on Noise2Noise [42] to simultaneously perform dehazing and denoising in an unpaired manner. SNSPGAN [35] exploits an SN-Soft-Patch GAN with a new cyclic self-perceptual loss to compute the perceptual similarity, and a color loss to brighten the dehazed images as humans expect. USID [39] introduces a compact multi-scale feature attention module for better feature representation, and a mechanism named OctEncoder to include multi-frequency representations that can capture more global information. ALGC-GAN [37] also introduces an attention mechanism to address the issue of uneven haze distribution in images, a global-local consistent discriminator to detect haze that varies across different spatial regions (thereby enhancing the stability and performance of the discriminator), and a dynamic feature enhancement module along with an adaptive mix-up module that dynamically extracts and organizes spatially structured features.

Nevertheless, all of the mentioned methods are designed for single image dehazing. Directly applying them to hazy videos often suffers from temporal inconsistency, which may further affect the applicability on downstream tasks.

B. Video Dehazing

Different from single image dehazing, video dehazing can draw support from other (neighboring) frames to improve the dehazing quality of a certain frame. A majority of early works first apply single image dehazing algorithms to each individual frame, then correct the temporal inconsistency in a post-processing way [43], [44], [45]. For instance, Borkar and Mukherjee [45] initiate the dehazing sequences by estimating transmission using DCP [15], then apply an augmented MRF to address spatial and temporal inconsistencies, which involves statistical smoothing operations in the temporal and spatial dimensions. Recent approaches introduce CNN models for video dehazing. Specifically, VDHNet [46] incorporates global semantic priors as input to regularize the dehazing result, and EVDD-Net [47] jointly trains the dehazing network with a video object detection model for more stable detection results. Zhang et al. [48] collect a real-world hazy video dataset named REVIDE and propose a video dehazing model, CG-IDN, with deformable alignment modules. Patil et al. [49] propose a dual-frame spatio-temporal feature modulation architecture to handle the degradation. Afterward, PM [50] further introduces a meta-learning strategy to deal with data-poor weather-degraded conditions. MAP-Net [51] introduces modules to encode the prior-related features into long-range memory, and to capture space-time dependencies in multiple space-time ranges for effective temporal information aggregation. Though these methods can achieve remarkable dehazing performance thanks to the great representative ability of deep models, their training highly relies on paired video data that is expensive to collect.

C. Video Restoration

Beyond targeted video dehazing, various techniques have been developed for additional video restoration tasks, including video denoising, deraining, and deblurring. Certain methods address multiple restoration challenges within a single model [52], [53], [54], while others focus on distinct, individual tasks [55], [56], [57], and most of these methods are supervised by paired video data. This part of the discussion also includes an overview of notable studies to thoroughly understand diverse approaches in video restoration.

In the realm of video restoration, the primary focus of research has been directed towards developing universal network architectures based on multi-frame [52], [54], [58] or recurrent [53], [59] frame input to adapt video processing. Complementary to these designs are features like alignment modules [52] and cross-frame attention [54] mechanisms, which are instrumental in augmenting temporal consistency in video streams. Taking some representative methods as examples, EDVR [52] tackles large motions through a Pyramid, Cascading and Deformable (PCD) alignment module, aligning frames at the feature level using deformable convolutions in a coarse-to-fine approach. Additionally, a Temporal and Spatial Attention (TSA) fusion module applies attention both temporally and spatially to highlight key features for restoration. RVRT [53] processes local neighboring frames in parallel within a globally recurrent framework, which divides videos into clips, using the features of one clip to estimate the next. For aligning clips, it uses guided deformable attention to identify and aggregate key features across clips with an attention mechanism. Shift-Net [54] uses grouped spatial-temporal shift, a lightweight technique that captures inter-frame correspondences for multi-frame aggregation. We can deduce that these video restoration designs are typically versatile, making them suitable for a range of application scenarios. However, this versatility also restricts their relevance to specific tasks, as they do not account for the unique characteristics inherent to each individual task.

Video restoration models designed for specific tasks often bear considerable resemblance to general video restoration models. Differently, they are typically enhanced with features unique to the specific task at hand, ensuring a more tailored fit and improved performance for those particular applications. Taking video deraining as an example, Liu et al. [60] redefine the issue of rain occlusion by taking rain accumulation into account and introduce a recurrent CNN for video deraining which combines rain degradation classification, spatial texture-based rain removal, and temporal coherence-based background detail reconstruction. Yang et al. [55] treat the deraining problem as an inverse process of rainfall synthesis and solve it by estimating the parameters of such a procedure. As for video deblurring, Pan et al. [56] proposed a network that simultaneously estimates the optical flow and latent frames, and exploits a temporal sharpness prior to further assist video deblurring. STFAN [61] is a network performing alignment and deblurring in a unified framework, which contains a Filter Adaptive Convolutional (FAC) layer to align the deblurred features of the previous frame with the current frame and deal with blurs in the current frame. Zhu et al. [57] propose that temporal blur degradation can be accessibly decoupled in the potential frequency domain, and propose to integrate the temporal spectrum into the video deblurring process comprising feature extraction, alignment, aggregation, and optimization. We can observe that task-specific methods often propose more specialized designs tailored to the unique characteristics of different tasks. These targeted approaches typically result in enhanced performance, achieving superior restoration quality and improved temporal consistency. However, a significant limitation previously mentioned is that all these methods necessitate paired video data for supervision, a process that demands substantial labor for data collection. Moreover, in the absence of robust paired supervision, it remains uncertain whether the prevalent designs can continue to yield significant improvements in temporal consistency.

D. Video-to-Video Translation

Unpaired video dehazing can also be regarded as an unpaired video-to-video translation problem. The key is to
perform domain translation while maintaining the spatio-temporal consistency. RecycleGAN [11] introduces an additional network to predict the future frame, and produces an effective recycle loss to constrain the temporal consistency. MocycleGAN [12] substitutes the future frame predictor with a motion estimator to improve the realness of motion across the transformation, and [62] further proposes the content-preserving loss for semantic consistency. Afterward, to deal with the error caused by the motion estimator, URecycGAN [13] directly generates the optical flow in a random way to synthesize pseudo future frames for setting temporal constraints. Although these works have validated the effectiveness of the recycle loss in keeping the temporal consistency, it is challenging to transplant the recycle loss to the unpaired video dehazing task, since the haze variation in future frames is hard to model. In this paper, we introduce depth information and apply it to accurately simulate the ego-motion and the haze variation; the temporal consistency can then be constrained by the recycle loss.

III. METHODOLOGY

A. Overall Framework

Given a clean video C_{1:T} = {C_t}_{t=1}^{T} with T frames from the clean domain X_C and a hazy video H_{1:S} = {H_s}_{s=1}^{S} from the hazy domain X_H, our target is to learn a dehazing network G_{H2C} which maps the hazy frames from X_H to X_C without using paired information. As illustrated by Fig. 1, our whole framework consists of two modules: the dehazing module and the rehazing module.

The dehazing module generates the clean image from its hazy observation in a pixel-to-pixel manner. In addition, it also estimates the scattering coefficient β̂ of the hazy scene (i.e., the density of the haze). For simplicity, by omitting β̂, the dehazing procedure can be formulated as Ĉ = G_{H2C}(H). The dehazing module has a concise U-Net-shaped architecture with four residual blocks between the encoder and decoder as the main branch to generate the dehazed frames. Besides, to predict the scattering coefficient from the hazy images, the network also contains a scattering coefficient estimation branch, where the multi-scale features from the three layers of the encoder are first processed by global average pooling to extract the global information. Then these features are further processed by two serial convolutions with 1 × 1 kernels, which act like fully-connected layers, to reach the final prediction of the scattering coefficient. The detailed architecture of the dehazing module is shown in Tab. I. In our research, we focus primarily on integrating depth information to enhance training losses for improved dehazing performance and temporal consistency. Consequently, we maintain a network architecture that is both straightforward and concise. However, we posit that refining the network architecture, particularly through the incorporation of multi-frame input and a corresponding multi-frame attention mechanism, as in MAPNet [51] and CG-IDN [48], would positively impact both dehazing performance and temporal consistency. Moreover, the potential integration of depth information within the multi-frame attention mechanism presents a feasible prospect. This integration could be beneficial in reducing the ambiguity of single-frame depth information, enabling depth information to better contribute to the dehazing process. Furthermore, this approach could enhance the model's ability to aggregate multi-frame information more efficiently during the inference stage, offering a promising avenue for further exploration.

The rehazing module aims to transform the given clean image to its hazy version; it contains a depth estimation network G_D and a refinement network G_R. Inspired by [10], we introduce depth information for haze generation and randomly sample the scattering coefficient to generate hazy images with various densities. Given a clean image C and a scattering coefficient β, the pretrained depth estimation network G_D [63] first captures its depth map, then synthesizes a coarse hazy image according to the ASM [1]. The refinement network G_R refines the coarse hazy image, making it better fit the real hazy distribution. In the dehazing-rehazing branch (lower branch in Fig. 1), the given scattering coefficient β is the same as
TABLE I
The architecture of the dehazing network and the rehazing network. The rehazing network only contains the modules in the upper part of the table, while the dehazing network contains both the upper part (works as the dehazing branch) and the lower part (works as the scattering coefficient estimation branch). Each line represents a sequence of layers or a whole block. The k, c, s represent the kernel size, output channels, and stride of the convolution layers, respectively. For both the dehazing network and the rehazing network, we set n = 96 for the large model (Ours (L)) and n = 64 for the light model (Ours (S)). The layers with skip connections are shown in the first column. Other layers directly take the output of the previous layers as input. IN, Global AvgPool, and LReLU represent instance normalization, global average pooling, and Leaky ReLU with slope = 0.2, respectively. The size of the input image is assumed to be 256 × 256 for simplicity. In fact, our network can process images of arbitrary sizes.

Fig. 2. The definition of our coordinate system.
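The scattering coefficient estimation branch summarized in Tab. I can be illustrated with a minimal NumPy sketch. The toy shapes and random weights below are our own stand-ins (not the released implementation): three pooled encoder scales are concatenated and passed through two 1 × 1 convolutions, which on pooled vectors reduce to fully-connected layers.

```python
import numpy as np

def global_avg_pool(feat):
    """(C, H, W) feature map -> (C,) global descriptor."""
    return feat.mean(axis=(1, 2))

def scattering_branch(feats, w1, b1, w2, b2):
    """Sketch of the scattering-coefficient estimation branch:
    multi-scale encoder features are globally average-pooled,
    concatenated, and processed by two serial 1x1 convolutions
    (equivalent to fully-connected layers on pooled vectors)."""
    pooled = np.concatenate([global_avg_pool(f) for f in feats])
    z = w1 @ pooled + b1
    z = np.maximum(0.2 * z, z)   # LeakyReLU with slope 0.2, as in Tab. I
    beta_hat = w2 @ z + b2       # final scalar prediction
    return float(beta_hat[0])

# toy example: three encoder scales with n = 64 channels (light model);
# random weights stand in for the trained layers
rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, s, s)) for s in (64, 32, 16)]
w1, b1 = rng.standard_normal((32, 192)) * 0.01, np.zeros(32)
w2, b2 = rng.standard_normal((1, 32)) * 0.01, np.zeros(1)
beta_hat = scattering_branch(feats, w1, b1, w2, b2)
```

Since pooling removes all spatial extent before the 1 × 1 convolutions, the branch accepts inputs of arbitrary size, consistent with the caption above.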
Fig. 4. Visualization of motion and future frame synthesis. Our method can generate more reasonable motion and more realistic distortion. Please zoom in
for better visualization.
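The synthetic motions compared in Fig. 4 are derived from depth. As a rough illustration only (the paper's exact parameterization of ξ* is not reproduced here), the sketch below assumes a pinhole camera undergoing a pure forward translation tz, for which the induced optical flow scales with depth; both function names are hypothetical.

```python
import numpy as np

def flow_from_depth(depth, tz, cx=None, cy=None):
    """Sketch of F(.): optical flow induced by a forward camera
    translation tz under a pinhole model. A point at depth d reprojects
    at a radius scaled by d / (d - tz) around the principal point
    (cx, cy), so nearer points move more than distant ones."""
    h, w = depth.shape
    cx = (w - 1) / 2 if cx is None else cx
    cy = (h - 1) / 2 if cy is None else cy
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    scale = depth / np.maximum(depth - tz, 1e-6)
    return (xs - cx) * (scale - 1), (ys - cy) * (scale - 1)  # (du, dv)

def warp(img, flow):
    """Sketch of W(.): nearest-neighbour backward warping."""
    du, dv = flow
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs - du).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - dv).astype(int), 0, h - 1)
    return img[src_y, src_x]

# a flat scene and a small forward motion: the flow points outward
depth = np.full((8, 8), 2.0)
flow = flow_from_depth(depth, tz=0.5)
future = warp(np.arange(64.0).reshape(8, 8), flow)
```

A real implementation would use sub-pixel (bilinear) sampling and a richer motion model, but this captures why depth-conditioned flow produces the looming distortion visible in Fig. 4.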
synthetic future frame, from which we can infer that our method generates more realistic motion and better simulates the distortion caused by ego-motion (see the green-boxed regions in Fig. 4(a) and (e)), while the motion and future frames synthesized by [13] are severely distorted (Fig. 4(b) and (c)). Besides, by randomly drawing the parameters ξ*, we can synthesize future frames with various motions, which acts as an augmentation mechanism to further improve the temporal consistency and performance.

C. Depth-Aware Future Frame Synthesis

Given a clean frame C_t from its source video C_{1:T}, to synthesize its future frame, we first estimate its depth by G_D, then randomly generate a set of parameters ξ* controlling the ego-motion. Afterward, we synthesize the ego-motion Õ according to ξ* and warp the current frame C_t with it to fetch the synthetic future frame C̃_{t+1}. The procedure of synthesizing a clean future frame can be formulated as d̃ = G_D(C_t), Õ = F(C_t, d̃, ξ*), C̃_{t+1} = W(C_t, Õ), where F(·) and W(·) denote the functions of generating optical flow and warping, respectively.

However, when synthesizing future hazy frames, since the haze varies with depth, simply warping the hazy image with the synthetic optical flow cannot model such variation. Fortunately, by introducing depth information, the ASM can help us achieve the goal. Suppose we have a hazy frame H_s from H_{1:S}. If we first consider the depth variation caused by camera motion and the corresponding change in haze intensity, without considering the change of pixel positions (warping), then H_s and its synthetic future frame before warping, H̄_{s+1}, can be expressed by the ASM as:

H_s(x) = J(x) e^{-β d_1(x)} + A (1 - e^{-β d_1(x)}),   (3)
H̄_{s+1}(x) = J(x) e^{-β d_2(x)} + A (1 - e^{-β d_2(x)}).   (4)

Dividing Eq. (4) by Eq. (3) yields the following relationship:

H̄_{s+1}(x) = (H_s(x) - A) · e^{-β(d_2(x) - d_1(x))} + A,   (5)

by which, after we synthesize the motion by Ĉ_s = G_{H2C}(H_s), d̃_1 = G_D(Ĉ_s), and Õ = F(H_s, d̃_1, ξ*), we can smoothly synthesize the hazy future frame by first calculating H̄_{s+1}, then changing the pixel positions (warping) as:

H̄_{s+1}(x) = (H_s(x) - A) · e^{-β(d̄_2(x) - d̃_1(x))} + A,   (6)
H̃_{s+1} = W(H̄_{s+1}, Õ),   (7)

where d̄_2 denotes the depth of H̄_{s+1}, which is calculated according to d̃_1 and the motion parameters ξ*, and A is the atmospheric light estimated by the prior. Fig. 5 shows cases of the frames generated by our depth-aware future frame synthesis method. Specifically, Fig. 5(b) simulates the frame when the camera moves nearer to the objects. With the objects in the image drawing nearer, the haze also becomes lighter, which better fits the ASM. Similarly, in Fig. 5(c), when the camera moves further away, the haze correspondingly becomes denser. The illustration shows that, with the proposed depth-aware future frame synthesis method, our framework can more accurately model the haze variation caused by camera motion, which provides better support for constructing further constraints or regularizations.

D. The Temporal Constraint

Following [13], as illustrated by Fig. 1, having the current frames C_t/H_s and their translated frames Ĥ_t/Ĉ_s, we can now simulate their synthetic future frames C̃_{t+1}/H̃_{s+1} and W(Ĥ_t, Õ)/W(Ĉ_s, Õ). Afterward, we map W(Ĥ_t, Õ)/W(Ĉ_s, Õ) back to their original domains as G_{H2C}(W(Ĥ_t, Õ))/G_{C2H}(W(Ĉ_s, Õ)). Then we can set the constraint between them and the synthetic future frames to guarantee the temporal consistency as follows:

L_recyc = ‖M_t · (C̃_{t+1} - G_{H2C}(W(Ĥ_t, Õ)))‖_1 + ‖M_s · (H̃_{s+1} - G_{C2H}(W(Ĉ_s, Õ)))‖_1,   (8)

where M_t and M_s refer to the masks of pixels that do not exceed the bounds of the image during the warping operations W(Ĥ_t, Õ) and W(Ĉ_s, Õ), respectively, and ‖·‖_1 denotes the ℓ1 norm. Moreover, to ensure that warping and domain translation are commutative with each other, we also constrain the translated future frames obtained via different routes to be consistent, i.e.:

L_spa = ‖M_t · (G_{C2H}(C̃_{t+1}) - W(Ĥ_t, Õ))‖_1 + ‖M_s · (G_{H2C}(H̃_{s+1}) - W(Ĉ_s, Õ))‖_1.   (9)

E. Depth-Aware Local Discriminator

Besides the temporal consistency, another crucial issue for unpaired video dehazing is to further enhance the supervision so that the dehazing performance can be well improved. For most unpaired methods, the dehazing ability is sourced from the adversarial loss. However, it is far beyond the ability of a single global discriminator to precisely discriminate all tiny hazy regions in the dehazed image. As a result, we can usually notice obvious haze residuals in the dehazed result. To mitigate this problem, a straightforward solution is to
Fig. 5. The illustration of depth-aware future frame synthesis. In our framework, the haze in synthetic frames faithfully changes with the camera motion.
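The photometric update of Eqs. (5)–(6) can be checked numerically: applying the update to a hazy frame rendered at depth d1 must reproduce the ASM rendering at depth d2. The sketch below uses toy values for J, A, β, and the depths; the function name is ours.

```python
import numpy as np

def hazy_future_frame(H, A, beta, d1, d2_bar):
    """Eq. (5)/(6): adjust haze intensity for the depth change caused by
    camera motion, before any pixel displacement (warping).
    H, d1, d2_bar are (h, w) arrays; A and beta are scalars."""
    return (H - A) * np.exp(-beta * (d2_bar - d1)) + A

# sanity check: compose Eq. (3) at depth d1 with the update and compare
# against Eq. (4) evaluated directly at depth d2
rng = np.random.default_rng(1)
J = rng.uniform(0, 1, (4, 4))     # latent clean scene radiance
d1 = rng.uniform(1, 5, (4, 4))
d2 = d1 + 0.5                     # camera moved away: haze thickens
A, beta = 0.9, 0.6
H1 = J * np.exp(-beta * d1) + A * (1 - np.exp(-beta * d1))   # Eq. (3)
H2 = J * np.exp(-beta * d2) + A * (1 - np.exp(-beta * d2))   # Eq. (4)
print(np.allclose(hazy_future_frame(H1, A, beta, d1, d2), H2))  # True
```

After this intensity update, the frame is warped by the synthesized flow Õ to obtain H̃_{s+1}, matching Eq. (7).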
introduce another discriminator that focuses more on the local content. In fact, previous dehazing methods [8] have already introduced local discriminators and achieved promising gains in performance. Nevertheless, we find that the traditional local discriminator samples blocks from the dehazed images during training in a totally random manner. Yet after a period of training, the models are usually well trained to handle light haze regions but still have difficulties in thoroughly processing dense haze regions. According to the ASM [1], dense haze regions are more likely to be located in areas distant from the camera. Based on this observation, we propose the depth-aware local discriminator to make the local discriminator focus more on the distant regions where the haze is likely to be denser.

To achieve our purpose, we require a relatively reasonable depth map for the hazy frame. However, the dense haze in the hazy frame prevents the depth estimator from generating an accurate depth map. To address this issue, in the dehazing-rehazing branch, we estimate the depth map from the dehazed image rather than the hazy image. Afterward, given a threshold value τ, the local discriminator only chooses regions centered at pixels whose depth is within the top τ of the image when computing the adversarial loss for the generator during training. Besides, since the dehazing network is not well trained at the beginning of training, the remaining haze tends to distribute globally rather than locally in distant regions of the dehazed image. Consequently, we let a threshold τ′ change linearly to the set value τ over the first 50 epochs. More intuitively, our depth-aware local discriminator gradually shrinks its region of interest from the global image to the distant local region.

F. Training Objectives

Besides the temporal constraint, we also employ the cycle-consistency loss, the adversarial loss, and the pseudo scattering coefficient supervision loss to train the framework.

1) Cycle-Consistency Loss: The cycle-consistency loss maintains content consistency by enforcing that a translated image can be translated back to its original domain. It is formulated as:

L_cyc = ‖H_s − G_C2H(G_H2C(H_s))‖_1 + ‖C_t − G_H2C(G_C2H(C_t))‖_1.  (10)

2) Adversarial Loss: The adversarial losses measure whether the dehazed/rehazed image fits the distribution of the training set X_C/X_H. Taking the dehazing module G_H2C and the discriminator D_c as an example, the adversarial loss is expressed as:

L_adv(D_c) = E[(D_c(C_t) − 1)^2] + E[D_c(G_H2C(H_s))^2],  (11)

L_adv(G_H2C) = E[(D_c(G_H2C(H_s)) − 1)^2],  (12)

where H_s and C_t are hazy and clean frames sampled from X_H and X_C, respectively. The adversarial loss has the same form for the rehazing module G_C2H and discriminator D_h. The dehazing module G_H2C is optimized by adversarial losses from both the global and local discriminators, thus we denote it as L_adv^{L,G} in Fig. 1, while the adversarial loss for the rehazing module is provided by the global discriminator only, and we denote it as L_adv^G.

3) Pseudo Scattering Coefficient Supervision Loss: Since we employ the self-augmented mechanism in [10], in the rehazing-dehazing branch (see the upper branch in Fig. 1), a constraint between the scattering coefficient β_c randomly sampled for the rehazing module and the scattering coefficient β̂_c estimated by the dehazing module is required to enable the dehazing module to properly estimate the scattering coefficient. The training loss is defined as:

L_sc = ‖β_c − β̂_c‖_1.  (13)

Taking all the loss terms into consideration, the overall objective function to optimize our framework is formulated as:

L = L_cyc + λ_recyc L_recyc + λ_spa L_spa + λ_adv L_adv + λ_sc L_sc,  (14)

where λ_recyc, λ_spa, λ_adv, and λ_sc are the weights for L_recyc, L_spa, L_adv, and L_sc, respectively. In our experiments, setting λ_recyc = 0.2, λ_spa = 0.2, λ_adv = 0.3, and λ_sc = 0.5 works well.

IV. EXPERIMENTS

In this section, we introduce the datasets used for training and evaluation, implementation details, metrics, and baselines. Then, extensive quantitative and qualitative comparisons with other closely related works are presented to validate our superiority on both synthetic and real-world datasets. Finally, ablation studies are conducted to validate the effectiveness of each part of our design.
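To make the training objectives concrete, the loss terms in Eqs. (10)-(14) can be sketched with scalar stand-ins (a minimal sketch; real training uses image tensors and network outputs, and the recycle/spatial losses are placeholder zeros here):

```python
def lsgan_d_loss(d_real, d_fake):
    # Eq. (11): the discriminator pushes real scores to 1, fake scores to 0.
    return (d_real - 1.0) ** 2 + d_fake ** 2

def lsgan_g_loss(d_fake):
    # Eq. (12): the generator pushes the discriminator's fake score to 1.
    return (d_fake - 1.0) ** 2

def l1(a, b):
    # The L1 distance used by the cycle-consistency loss, Eq. (10).
    return sum(abs(x - y) for x, y in zip(a, b))

# Scalars standing in for discriminator scores, images, and coefficients.
L_adv_D = lsgan_d_loss(d_real=0.9, d_fake=0.2)
L_adv_G = lsgan_g_loss(d_fake=0.2)
L_cyc = l1([0.2, 0.5], [0.25, 0.45]) + l1([0.7, 0.6], [0.7, 0.65])
L_sc = abs(0.8 - 0.75)            # Eq. (13): |beta_c - beta_hat_c|
L_recyc = L_spa = 0.0             # placeholders, not sketched here
# Eq. (14) with the weights reported in our experiments.
total = L_cyc + 0.2 * L_recyc + 0.2 * L_spa + 0.3 * L_adv_G + 0.5 * L_sc
```

In practice the discriminator minimizes its own loss while the generators minimize the weighted sum, alternating as in standard GAN training.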
YANG et al.: DEPTH-AWARE UNPAIRED VIDEO DEHAZING 2395
TABLE II
The quantitative comparison results on REVIDE and NYU-Depth. SID, VD and VT imply methods are from single image dehazing, video dehazing and video translation, respectively. The best result is denoted in bold and is colored in blue or red according to whether using paired data for training. Patil et al. offers neither training code nor pretrained model on the NYU-Depth dataset.
Fig. 6. Visual comparison in haze removal on samples from the REVIDE dataset.
method is in an inferior position in terms of most metrics compared to modern dehazing methods that employ paired training data like [4], [50], and [52]. Even when we apply the pretrained model of Patil et al. [50], originally trained on the original-size data, to the 4x downsampled dataset due to the unavailability of its training code, it still attains the best performance in restoration quality. As an exception, EDVR [52], a supervised video restoration framework, does not perform well on the REVIDE dataset since it suffers from unstable training and fails to estimate large offsets when trained on the REVIDE dataset; this phenomenon has been well validated by [48]. Apart from some powerful supervised methods, our method achieves significantly better results than other unsupervised/unpaired approaches, which clearly validates our effectiveness.

2) Qualitative Comparison: We also provide qualitative results of several state-of-the-art unsupervised/unpaired methods and two supervised methods on the REVIDE and NYU-Depth datasets for more direct visual comparisons. The results are shown in Fig. 6 and Fig. 7. In the first case of Fig. 6, the model developed by Patil et al. produces the best result, excelling in global presentation, local detail, and color accuracy. FFA-Net generates a globally smoother and clearer result. However, within the area highlighted by the red box, it appears to struggle with color shifts and a loss of detail. Our method effectively removes haze and retains richer details, yet it does introduce
Fig. 7. Visual comparison in haze removal on samples from the NYU-Depth dataset.
Fig. 8. Visual comparison in dehazed sequential frames from the REVIDE dataset.
TABLE VII
Ablation study on the REVIDE dataset. L_rec, L_spa, D-ES, D-FS, D-Local D and V-Local D refer to whether to use the unsupervised recycle loss, unsupervised spatial loss, depth-aware ego-motion synthesis, depth-aware future frame synthesis, depth-aware local discriminator and vanilla local discriminator, respectively.
Fig. 10. Visual comparison of different configurations of our model on a challenging sample from REVIDE dataset.
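The region-selection rule of the depth-aware local discriminator, i.e., a threshold τ′ that shrinks linearly from covering the whole image to the deepest top-τ pixels over the first 50 epochs, can be sketched as follows (a toy sketch on a 1-D depth map; the function and variable names are ours, not the released code's):

```python
def warmup_threshold(epoch, tau=0.5, warmup_epochs=50):
    """Linearly move the effective threshold from 1.0 (whole image)
    down to the target tau over the warm-up period."""
    if epoch >= warmup_epochs:
        return tau
    return 1.0 + (tau - 1.0) * epoch / warmup_epochs

def candidate_pixels(depth, epoch):
    """Indices whose depth lies in the top-tau' fraction of the frame;
    local-discriminator patches are centered on these pixels."""
    tau_eff = warmup_threshold(epoch)
    ranked = sorted(range(len(depth)), key=lambda i: depth[i], reverse=True)
    keep = max(1, round(tau_eff * len(depth)))
    return set(ranked[:keep])

depth = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]    # toy per-pixel depth map
early = candidate_pixels(depth, epoch=0)  # whole image is eligible
late = candidate_pixels(depth, epoch=50)  # only the deepest half remains
```

Early in training every pixel can seed a patch, so residual haze anywhere is penalized; after warm-up only the distant regions, where haze is likely densest, feed the local adversarial loss.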
depth-aware local discriminator could gain an extra 0.15 dB in PSNR, which is a 16.7% improvement. Fig. 10(f)-(h) also shows that our depth-aware local discriminator is able to guide the network to remove the haze more thoroughly.

Moreover, we have conducted experiments to examine how the choice of the depth threshold (τ) and the weight of the adversarial loss (λ_adv) influence performance. To assess the impact of the depth threshold τ, we evaluated dehazing performance at thresholds of 0.3, 0.5, and 0.7. The quantitative results, presented in Table VIII, reveal that setting the depth threshold to τ = 0.5 yields the most effective dehazing results. Regarding the weight of the adversarial loss λ_adv, increasing it from 0.3 to 0.5 leads to a pronounced drop in performance, since over-emphasizing the adversarial loss usually introduces unnecessary artifacts. Conversely, reducing the weight from 0.3 to 0.1 results in inadequate adversarial supervision. Both adjustments away from the optimal setting are found to compromise the model's overall performance.

5) Haze Generation in Training: Our model also incorporates certain hyper-parameters specifically related to the haze generation process. These include the range of the scattering coefficient β, which determines the minimum and maximum haze densities for haze generation in the hazing-dehazing branch during training, and λ_sc, the weight assigned to the pseudo scattering coefficient loss.

Concerning the range of β, it is evident that over-enlarging or over-shrinking this range can lead to the production of hazy images that deviate from the desired distribution. Such deviations can subsequently result in a decline in model performance, as shown in Table VIII. Regarding λ_sc, its associated loss term is designed to utilize generated hazy images and their corresponding scattering coefficient values as pseudo training data, and trains the scattering coefficient estimation branch of the dehazing network to proficiently estimate the scattering coefficient from hazy inputs. As indicated in Table VIII, varying λ_sc from 0.1 to 1 appears to have a negligible impact on performance, since this loss only affects the scattering coefficient estimation branch, and thus moderate variation of its weight will not largely affect its behavior during training. However, as highlighted in [10], the presence of λ_sc is vital for the overall framework. Its absence would reduce the model to a vanilla CycleGAN-like structure, leading to a substantial loss in performance.

V. CONCLUSION AND FUTURE WORK

This paper has proposed an unpaired video dehazing framework that fully exploits depth information to improve temporal consistency and dehazing performance. Specifically, we proposed to simulate ego-motion and future frames faithfully by using depth information. Such simulation is introduced to form pseudo supervision to constrain temporal consistency. Besides, we designed a depth-aware local discriminator that concentrates on the regions that are distant from the camera and more likely to be covered by denser haze. To the best of our knowledge, this is the first work studying the unpaired video dehazing task. Extensive experimental results have validated our superiority in video dehazing performance over other state-of-the-art methods.

Moving forward, we recognize the potential for further enhancements in our unpaired video dehazing framework, particularly by improving the network architecture. One promising direction is the incorporation of multi-frame input and corresponding multi-frame attention mechanisms. This enhancement can provide a more nuanced understanding of the temporal dynamics in video sequences, crucial for maintaining consistency across frames in unpaired video dehazing. Such a design has been validated as effective in supervised video dehazing [50], [51], [52], demonstrating its capability in enhancing video clarity and detail preservation across multiple frames. Moreover, the multi-frame attention mechanism can also incorporate depth information. This integration could potentially reduce the ambiguity of single-frame depth information, helping depth information better contribute to the dehazing process. However, such specific network architecture design has been less explored in the realm of unpaired video dehazing. Our future work will focus on adapting and optimizing this mechanism for unpaired scenarios. We aim to bridge this gap by developing innovative techniques that leverage multi-frame attention to achieve superior dehazing results, ensuring both high-quality dehazing performance and robust temporal consistency in unpaired video environments.

REFERENCES

[1] S. G. Narasimhan and S. K. Nayar, "Vision and the atmosphere," Int. J. Comput. Vis., vol. 48, no. 3, pp. 233–254, 2002.
[2] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, "AOD-Net: All-in-one dehazing network," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017, pp. 4770–4778.
[3] X. Liu, Y. Ma, Z. Shi, and J. Chen, "GridDehazeNet: Attention-based multi-scale network for image dehazing," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7314–7323.
[4] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, "FFA-Net: Feature fusion attention network for single image dehazing," in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 11908–11915.
[5] H. Wu et al., "Contrastive learning for compact single image dehazing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 10551–10560.
[6] D. Engin, A. Genc, and H. K. Ekenel, "Cycle-Dehaze: Enhanced CycleGAN for single image dehazing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 825–833.
[7] S. Zhao, L. Zhang, Y. Shen, and Y. Zhou, "RefineDNet: A weakly supervised refinement framework for single image dehazing," IEEE Trans. Image Process., vol. 30, pp. 3391–3404, 2021.
[8] X. Yang, Z. Xu, and J. Luo, "Towards perceptual image dehazing by physics-based disentanglement and adversarial training," in Proc. AAAI Conf. Artif. Intell., 2018, vol. 32, no. 1, pp. 7485–7492.
[9] X. Chen et al., "Unpaired deep image dehazing using contrastive disentanglement learning," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 632–648.
[10] Y. Yang, C. Wang, R. Liu, L. Zhang, X. Guo, and D. Tao, "Self-augmented unpaired image dehazing via density and depth decomposition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2037–2046.
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised video retargeting," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 119–135.
[12] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei, "Mocycle-GAN: Unpaired video-to-video translation," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 647–655.
[13] K. Wang, K. Akash, and T. Misu, "Learning temporally and semantically consistent unpaired video-to-video translation through pseudo-supervision from synthetic optical flow," in Proc. AAAI Conf. Artif. Intell., 2022, pp. 2477–2486.
[14] R. Fattal, "Single image dehazing," ACM Trans. Graph., vol. 27, no. 3, pp. 1–9, 2008.
[15] K. He, J. Sun, and X. Tang, "Single image haze removal using dark channel prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2341–2353, Dec. 2010.
[16] D. Berman, T. Treibitz, and S. Avidan, "Non-local image dehazing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1674–1682.
[17] Q. Zhu, J. Mai, and L. Shao, "A fast single image haze removal algorithm using color attenuation prior," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3522–3533, Nov. 2015.
[18] P. Ling, H. Chen, X. Tan, Y. Jin, and E. Chen, "Single image dehazing using saturation line prior," IEEE Trans. Image Process., vol. 32, pp. 3238–3253, 2023.
[19] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, "Single image dehazing via multi-scale convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 154–169.
[20] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, "DehazeNet: An end-to-end system for single image haze removal," IEEE Trans. Image Process., vol. 25, no. 11, pp. 5187–5198, Nov. 2016.
[21] Y. Qu, Y. Chen, J. Huang, and Y. Xie, "Enhanced Pix2pix dehazing network," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8160–8168.
[22] Q. Deng, Z. Huang, C.-C. Tsai, and C.-W. Lin, "HardGAN: A haze-aware representation distillation GAN for single image dehazing," in Proc. ECCV. Cham, Switzerland: Springer, 2020, pp. 722–738.
[23] H. Dong et al., "Multi-scale boosted dehazing network with dense feature fusion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2157–2167.
[24] Y. Zheng, J. Zhan, S. He, J. Dong, and Y. Du, "Curricular contrastive regularization for physics-aware single image dehazing," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2023, pp. 5785–5794.
[25] T. Ye et al., "Perceiving and modeling density for image dehazing," in Proc. ECCV, 2022, pp. 130–145.
[26] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[27] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[28] Y. Song, Z. He, H. Qian, and X. Du, "Vision transformers for single image dehazing," IEEE Trans. Image Process., vol. 32, pp. 1927–1941, 2023.
[29] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, "Image dehazing transformer with transmission-aware 3D position embedding," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5812–5820.
[30] X. Guo, Y. Yang, C. Wang, and J. Ma, "Image dehazing via enhancement, restoration, and fusion: A survey," Inf. Fusion, vols. 86–87, pp. 146–170, Oct. 2022.
[31] B. Li, Y. Gou, S. Gu, J. Z. Liu, J. T. Zhou, and X. Peng, "You only look yourself: Unsupervised and untrained single image dehazing neural network," Int. J. Comput. Vis., vol. 129, no. 5, pp. 1754–1767, 2021.
[32] A. Golts, D. Freedman, and M. Elad, "Unsupervised single image dehazing using dark channel prior loss," IEEE Trans. Image Process., vol. 29, pp. 2692–2701, 2019.
[33] B. Li, Y. Gou, J. Z. Liu, H. Zhu, and J. T. Zhou, "Zero-shot image dehazing," IEEE Trans. Image Process., vol. 29, pp. 8457–8466, 2020.
[34] A. Yang et al., "Visual-quality-driven unsupervised image dehazing," Neural Netw., vol. 167, pp. 1–9, Oct. 2023.
[35] Y. Wang et al., "Cycle-SNSPGAN: Towards real-world image dehazing via cycle spectral normalized soft likelihood estimation patch GAN," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 20368–20382, Nov. 2022.
[36] B. Ding et al., "U2D2Net: Unsupervised unified image dehazing and denoising network for single hazy image enhancement," IEEE Trans. Multimedia, vol. 26, pp. 202–217, 2023.
[37] R. S. Jaisurya and S. Mukherjee, "AGLC-GAN: Attention-based global-local cycle-consistent generative adversarial networks for unpaired single image dehazing," Image Vis. Comput., vol. 140, Dec. 2023, Art. no. 104859.
[38] Y. Qiao, M. Shao, L. Wang, and W. Zuo, "Learning depth-density priors for Fourier-based unpaired image restoration," IEEE Trans. Circuits Syst. Video Technol., early access, 2024.
[39] J. Li, Y. Li, L. Zhuo, L. Kuang, and T. Yu, "USID-Net: Unsupervised single image dehazing network via disentangled representations," IEEE Trans. Multimedia, vol. 25, pp. 3587–3601, 2022.
[40] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[41] W. Liu, X. Hou, J. Duan, and G. Qiu, "End-to-end single image fog removal using enhanced cycle consistent adversarial networks," IEEE Trans. Image Process., vol. 29, pp. 7819–7833, 2020.
[42] J. Lehtinen et al., "Noise2Noise: Learning image restoration without clean data," 2018, arXiv:1803.04189.
[43] J. Zhang, L. Li, Y. Zhang, G. Yang, X. Cao, and J. Sun, "Video dehazing with spatial and temporal coherence," Vis. Comput., vol. 27, nos. 6–8, pp. 749–757, Jun. 2011.
[44] Z. Li, P. Tan, R. T. Tan, D. Zou, S. Z. Zhou, and L.-F. Cheong, "Simultaneous video defogging and stereo reconstruction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4988–4997.
[45] K. Borkar and S. Mukherjee, "Video dehazing using LMNN with respect to augmented MRF," in Proc. 11th Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2018, pp. 1–9.
[46] W. Ren et al., "Deep video dehazing with semantic segmentation," IEEE Trans. Image Process., vol. 28, no. 4, pp. 1895–1908, Apr. 2019.
[47] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, "End-to-end united video dehazing and detection," in Proc. AAAI, 2018, vol. 32, no. 1, pp. 7016–7023.
[48] X. Zhang et al., "Learning to restore hazy video: A new real-world dataset and a new method," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 9239–9248.
[49] P. W. Patil, S. Gupta, S. Rana, and S. Venkatesh, "Dual-frame spatio-temporal feature modulation for video enhancement," Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108822.
[50] P. W. Patil, S. Gupta, S. Rana, and S. Venkatesh, "Video restoration framework and its meta-adaptations to data-poor conditions," in Proc. ECCV. Cham, Switzerland: Springer, 2022, pp. 143–160.
[51] J. Xu et al., "Video dehazing via a multi-range temporal alignment network with physical prior," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18053–18062.
[52] X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy, "EDVR: Video restoration with enhanced deformable convolutional networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019.
[53] J. Liang et al., "Recurrent video restoration transformer with guided deformable attention," in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 378–393.
[54] D. Li et al., "A simple baseline for video restoration with grouped spatial-temporal shift," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 9822–9832.
[55] W. Yang, R. T. Tan, J. Feng, S. Wang, B. Cheng, and J. Liu, "Recurrent multi-frame deraining: Combining physics guidance and adversarial learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8569–8586, Nov. 2022.
[56] J. Pan, H. Bai, and J. Tang, "Cascaded deep video deblurring using temporal sharpness prior," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3043–3051.
[57] Q. Zhu, M. Zhou, N. Zheng, C. Li, J. Huang, and F. Zhao, "Exploring temporal frequency spectrum in deep video deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12428–12437.
[58] L. Xu et al., "Transcoded video restoration by temporal spatial auxiliary network," in Proc. AAAI, 2022, vol. 36, no. 3, pp. 2875–2883.
[59] Y. Wang and X. Bai, "Versatile recurrent neural network for wide types of video restoration," Pattern Recognit., vol. 138, Jun. 2023, Art. no. 109360.
[60] J. Liu, W. Yang, S. Yang, and Z. Guo, "Erase or fill? Deep joint recurrent rain removal and reconstruction in videos," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3233–3242.
[61] S. Zhou, J. Zhang, J. Pan, W. Zuo, H. Xie, and J. Ren, "Spatio-temporal filter adaptive network for video deblurring," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2482–2491.
[62] K. Park, S. Woo, D. Kim, D. Cho, and I. S. Kweon, "Preserving semantic and temporal consistency for unpaired video-to-video translation," in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1248–1257.
[63] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1623–1637, Mar. 2022.
[64] T. Takahashi and T. Kurita, "Reconstructing optical flow generated by camera rotation via autoassociative learning," in Proc. IJCNN, vol. 4, Jun. 2000, pp. 279–283.
[65] B. Cai, X. Xu, and D. Tao, "Real-time video dehazing based on spatio-temporal MRF," in Proc. Pacific Rim Conf. Multimedia. Cham, Switzerland: Springer, 2016, pp. 315–325.
[66] N. Silberman and R. Fergus, "Indoor scene segmentation using a structured light sensor," in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 601–608.
[67] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. CVPR, 2017, pp. 5967–5976.
[68] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980.
[69] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[70] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[71] G. Sharma, W. Wu, and E. N. Dalal, "The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations," Color Res. Appl., vol. 30, no. 1, pp. 21–30, Feb. 2005.
[72] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang, "Learning blind video temporal consistency," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 170–185.
[73] L. K. Choi, J. You, and A. C. Bovik, "Referenceless prediction of perceptual fog density and perceptual image defogging," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3888–3901, Nov. 2015.
[74] A. Mittal, R. Soundararajan, and A. C. Bovik, "Making a 'completely blind' image quality analyzer," IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Apr. 2012.
[75] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[76] C. Lei, Y. Xing, and Q. Chen, "Blind video temporal consistency via deep video prior," in Proc. NeurIPS, vol. 33, 2020, pp. 1083–1093.

Yang Yang received the B.S. degree in software engineering from Hebei University of Technology, Tianjin, China, in 2019. He is currently pursuing the Ph.D. degree in software engineering with the College of Intelligence and Computing, Tianjin University, Tianjin. His research interests include image dehazing, image restoration, and deep learning.

Chun-Le Guo (Member, IEEE) received the Ph.D. degree from Tianjin University, China, under the supervision of Prof. Ji-Chang Guo. He was a Visiting Ph.D. Student with the School of Electronic Engineering and Computer Science, Queen Mary University of London (QMUL), U.K. He was a Research Associate with the Department of Computer Science, City University of Hong Kong (CityU of HK). He was a Postdoctoral Researcher with Prof. Ming-Ming Cheng at Nankai University. He is currently an Associate Professor with Nankai University. His research interests include image processing, computer vision, and deep learning.

Xiaojie Guo (Senior Member, IEEE) is currently a tenured Associate Professor with Tianjin University, Tianjin, China. He has published about 100 scientific papers in well-recognized conferences (CVPR, ICCV, NeurIPS, ECCV, ACM MM, and IJCAI) and journals (IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, and IEEE Transactions on Image Processing) in the fields of computer vision, multimedia, and machine learning. He is a fellow of IET and a Senior Member of CCF. He was a recipient of the Wu Wenjun AI Excellent Youth Award from CAAI in 2020, the Piero Zamperoni Best Student Paper Award in ICPR 2010, the Best Student Paper Runner-Up in ICME 2018, and the Best Student Paper Runner-Up in PRCV 2020. He was a recipient of the Best AE of the Year 2023. He serves/served as the SAC/AC/SPC for CVPR, ACM MM, IJCAI, WACV, and PRCV. He serves/served as an AE for Information Fusion and Image Vision Computing.