
2388 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 33, 2024

Depth-Aware Unpaired Video Dehazing

Yang Yang, Chun-Le Guo, Member, IEEE, and Xiaojie Guo, Senior Member, IEEE

Abstract— This paper investigates a novel unpaired video dehazing framework, which can be a good candidate in practice by relieving the pressure of collecting paired data. In such a paradigm, two key issues, including 1) temporal consistency, uninvolved in single image dehazing, and 2) better dehazing ability, need to be considered for satisfactory performance. To handle these problems, we resort to introducing depth information to construct additional regularization and supervision. Specifically, we attempt to synthesize realistic motions with depth information to improve the effectiveness and applicability of traditional temporal losses, and thus better regularize the spatiotemporal consistency. Moreover, the depth information is also considered in terms of adversarial learning. For haze removal, the depth information guides the local discriminator to focus on regions where haze residuals are more likely to exist. The dehazing performance is consequently improved by more pertinent guidance from our depth-aware local discriminator. Extensive experiments are conducted to validate our effectiveness and superiority over other competitors. To the best of our knowledge, this study is the initial foray into the task of unpaired video dehazing. Our code is available at https://github.com/YaN9-Y/DUVD.

Index Terms— Video dehazing, image dehazing, unpaired learning.

I. INTRODUCTION

THE visibility of images and videos captured under hazy conditions is typically degraded by the scattering effect of aerosol particles [1]. Both human observers and computer vision systems may be affected by such degradation. With the emergence of deep learning, deep neural network based dehazing methods have dominated the mainstream [2], [3], [4], [5]. However, most of these methods heavily depend on (massive) paired hazy-clean data. To mitigate the requirement of paired data, the unpaired dehazing paradigm seems to be an effective option by translating images between the hazy and clean domains. For unpaired single image dehazing, there already exist a number of approaches like [6], [7], [8], [9], and [10]. But, without consideration of the temporal inconsistency issue, a key factor in video quality, they usually produce temporally inconsistent videos with severe flickering artifacts. Thus, developing effective unpaired video dehazing algorithms is demanded, as the task has rarely been studied in the literature.

In comparison with single image dehazing, a core concern for video dehazing is i) how to guarantee the temporal consistency across frames under unpaired supervision. Previous attempts tried to first construct future frames and then employ the recycle loss [11], which encourages consistency between the future frames produced by translation along time only and those produced across both time and domains, or its variants, to achieve temporal consistency. Specifically, RecycleGAN [11] directly estimates the future frame, and MocycleGAN [12] predicts the relative motion of the future frame with an additional network. To avoid the error caused by the motion estimator, URecycGAN [13] randomly synthesizes motion to produce future frames. However, these manners can hardly be adapted to the unpaired video dehazing task. The primary reason is that the estimated or synthesized motion is inaccurate or even random [13], resulting in severely distorted warped future frames. Consequently, the generator is forced to fit such low-quality frames, which highly likely undermines the performance on low-level tasks. Moreover, most of the above-mentioned methods are built for general video translation tasks. One important characteristic of the dehazing task was not considered in those frameworks, i.e., it is difficult to synthesize relatively accurate future hazy frames for constructing temporal constraints, since the haze distribution should vary according to scene depth changes along with motions, and modeling such variation is beyond the capacity of the original schemes.

In addition, as a common issue for CycleGAN-based unpaired dehazing, we also have to face that ii) the original CycleGAN-based architecture shows limited performance on dehazing tasks. In other words, most CycleGAN-based methods cannot accurately perform haze removal and usually leave observable haze. The global adversarial loss only measures whether the processed image conforms to the distribution of the whole target domain, which is too general to provide sufficient guidance for faithfully recovering local details. Consequently, how to provide more powerful supervision to meet the dehazing requirement becomes another thorny problem.
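The recycle-style temporal constraint discussed above can be sketched numerically as follows; the two translators, the future-frame predictor, and all array shapes are hypothetical stand-ins for illustration, not the exact networks of [11], [12], [13]:

```python
import numpy as np

# Toy sketch of a recycle-style temporal-consistency constraint.
# `translate_xy`/`translate_yx` stand for the two domain translators and
# `predict_future` for a future-frame predictor; all are placeholders.

def l1(a, b):
    """Mean absolute error between two frame arrays."""
    return float(np.mean(np.abs(a - b)))

def recycle_loss(frame_t, frame_t1, translate_xy, translate_yx, predict_future):
    """Translate frame_t to the other domain, predict its future there,
    translate back, and compare with the real future frame frame_t1."""
    fake_t = translate_xy(frame_t)        # G_XY(x_t)
    fake_t1 = predict_future(fake_t)      # P(G_XY(x_t)): future frame in domain Y
    recycled_t1 = translate_yx(fake_t1)   # G_YX(P(G_XY(x_t))): back to domain X
    return l1(recycled_t1, frame_t1)

# With identity mappings the constraint is trivially satisfied.
identity = lambda z: z
frames = np.zeros((8, 8, 3))
loss = recycle_loss(frames, frames, identity, identity, identity)  # 0.0
```

In the variant of [13], the future frame in the other domain is instead obtained by warping with randomly generated motion rather than by a learned predictor.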
Manuscript received 16 August 2023; revised 30 January 2024; accepted 3 March 2024. Date of publication 22 March 2024; date of current version 28 March 2024. This work was supported by the National Natural Science Foundation of China under Grant 62072327 and Grant 62372251. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Jun Cheng. (Corresponding author: Xiaojie Guo.)
Yang Yang and Xiaojie Guo are with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China (e-mail: [email protected]; [email protected]).
Chun-Le Guo is with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]).
Digital Object Identifier 10.1109/TIP.2024.3378472
1941-0042 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

To address the above issues, this work exploits depth information as a clue to build a novel unpaired video dehazing framework. Specifically, we propose to utilize depth information to simulate ego-motion, enabling the synthesis of future frames with greater accuracy than methods based on random generation or estimation. This technique significantly enhances the temporal consistency by using the synthesized frames as a basis for building cross-frame unpaired supervision. Moreover, we integrate depth information with the Atmospheric Scattering Model (ASM) in a cross-frame manner, allowing model


refinement of the subtle variations in haze density resulting from object motion in video sequences. This design significantly elevates the authenticity and quality of the synthesized frames. Consequently, it not only enhances the temporal consistency but also remarkably improves the dehazing performance. Finally, our approach adopts depth information to strategically direct the local discriminator's attention towards more distant regions, typically characterized by denser and more challenging haze. This targeted enhancement in haze removal capability marks a novel development in effectively addressing regions that are traditionally difficult to dehaze.

Overall, our contributions can be summarized as follows:
• We propose a novel depth-aware unpaired video dehazing framework. To the best of our knowledge, this is the pioneering work in unpaired video dehazing.
• We introduce depth information to accurately simulate the ego-motion and model the haze variation across frames. The obtained synthetic frames are then utilized to attain the spatiotemporal consistency.
• We employ the depth information as guidance for the local discriminator to focus more on distant regions where the haze tends to be denser and harder to remove. The depth-aware discriminator can provide more powerful supervision than the original one.
• Extensive experiments together with ablation studies are conducted to reveal the effectiveness of our design, and its superiority over other state-of-the-art alternatives.

II. RELATED WORK

Since the task of unpaired video dehazing has rarely been studied up till now, in what follows, we briefly introduce the works that are closely related to ours, which mainly include single image dehazing, video dehazing and unpaired video translation.

A. Single Image Dehazing

Single image dehazing aims to recover the clean image from a hazy input. Existing methods can be roughly divided into prior-based, supervised, and unsupervised learning-based approaches.

1) Prior-Based Methods: The prior-based methods concentrate on discovering priors on haze-free images from statistical analysis to estimate the transmission and the atmospheric light [14], [15], [16], [17], [18]. Specifically, DCP [15] assumes that in haze-free images, the lowest intensity in a local block should be close to zero. CAP [17] builds a linear model between scene depth and the difference of pixel saturation and value in HSV space. More recently, SLP [18] employs the intrinsic relevance between pixels to achieve a reliable saturation line construction for transmission estimation. However, as a serious drawback, these algorithms are built upon hand-crafted priors, which are vulnerable to the variation of the environment.

2) Supervised Learning-Based Methods: The supervised learning-based strategies adopt powerful deep models to learn the dehazing process from paired training data. Early works focus on estimating the transmission or its variants [2], [19], [20], while later methods [3], [4], [21], [22], [23] turn to directly estimating the haze-free image. More recent approaches tend to introduce other specific designs to further improve the performance. For instance, AECR-Net [5] proposed the contrastive perceptual loss as regularization, and C2PNet [24] further improves it by finer classification of the negative samples. FFANet [4] proposed the channel and pixel attention mechanism to improve the representation ability of the dehazing network, while SHA [25] employed the separable hybrid attention module to encode haze density by capturing features in the orthogonal directions. Besides, with the great success of vision transformers (ViTs) in computer vision tasks [26], [27], some works attempt to apply ViTs to assist the dehazing task. Specifically, DehazeFormer [28] improves the original vision transformer in several aspects of network design for the dehazing task. DeHamer [29] exploits the dark channel map to build a 3D positional embedding in the vision transformer to provide the relative position and suggest the haze density of different spatial regions. Although they achieved remarkable performance on specific benchmark datasets, these methods depend heavily on paired data, and easily suffer from the overfitting problem [30].

3) Unsupervised Learning-Based Methods: The unsupervised solutions aim to learn the dehazing ability without relying on paired training data. A few works solely adopt hazy observations for training [31], [32], [33], [34], but these methods often heavily rely on minimizing hand-crafted-prior-based objective functions, which also tend to make the models vulnerable to the variation of the environment, while other methods attempt to learn the transformation under unpaired supervision [6], [7], [8], [10], [35], [36], [37], [38], [39]. Most unpaired methods are built upon cyclic consistency [40] with other specific designs. In particular, Cycle-Defog2Refog [41] uses a two-stage mapping strategy in both translation paths to improve the performance of haze removal. D4 [10] seeks to disentangle the transmission map into density and depth to make the translation process more physically reasonable. Inspired by [10], POGAN leverages depth-density priors and combines a Physical Restoration Network with a Degradation Rendering Network to enhance generalization to real-world degraded images in various harsh scenarios. CDD-GAN [9] exploits contrastive learning to suppress the task-irrelevant factors and enhance the task-relevant factors. U2D2Net further introduced an unsupervised denoising network based on Noise2Noise [42] to simultaneously perform dehazing and denoising in an unpaired manner. SNSPGAN [35] exploits an SN-Soft-Patch GAN with a new cyclic self-perceptual loss to compute the perceptual similarity, and a color loss to brighten the dehazed images as humans expect. USID [39] introduces a compact multi-scale feature attention module for better feature representation, and a mechanism named OctEncoder to include multi-frequency representations that can capture more global information. ALGC-GAN [37] also introduces an attention mechanism to address the issue of uneven haze distribution in images; a global-local consistent discriminator to detect haze that varies across different spatial regions, thereby enhancing the stability and performance of the discriminator; and a dynamic feature enhancement module along with an adaptive mix-up module that dynamically extracts and organizes spatially structured features.

Nevertheless, all of the mentioned methods are designed for single image dehazing. Directly applying them to hazy videos often suffers from temporal inconsistency, which may further affect their applicability to downstream tasks.

B. Video Dehazing

Different from single image dehazing, video dehazing can draw support from other (neighboring) frames for a certain frame to improve the dehazing quality. A majority of early works first utilize single image dehazing algorithms on each individual frame, then correct the temporal inconsistency in a post-processing way [43], [44], [45]. For instance, Borkar and Mukherjee [45] initiate the dehazing sequences by estimating transmission using DCP [15], then applying an augmented MRF to address spatial and temporal inconsistencies, which involves statistical smoothing operations in the temporal and spatial dimensions. Recent approaches introduce CNN models for video dehazing. Specifically, VDHNet [46] incorporates global semantic priors as input to regularize the dehazing result, and EVDD-Net [47] jointly trains the dehazing network with a video object detection model for more stable detection results. Zhang et al. [48] collect a real-world hazy video dataset named REVIDE and propose a video dehazing model, CG-IDN, with deformable alignment modules. Patil et al. [49] propose a dual-frame spatio-temporal feature modulation architecture to handle the degradation. Afterward, PM [50] further introduces a meta-learning strategy to deal with data-poor weather-degraded conditions. MAP-Net [51] introduces modules to encode the prior-related features into long-range memory, and to capture space-time dependencies over multiple space-time ranges for effective temporal information aggregation. Though these methods can achieve remarkable dehazing performance thanks to the great representative ability of deep models, their training highly relies on paired video data that is expensive to collect.

C. Video Restoration

Beyond targeted video dehazing, various techniques are developed for additional video restoration tasks, including video denoising, deraining, and deblurring. Certain methods address multiple restoration challenges within a single model [52], [53], [54], while others focus on distinct, individual tasks [55], [56], [57], and most of these methods are supervised by paired video data. This part of the discussion also includes an overview of notable studies to thoroughly understand diverse approaches in video restoration.

In the realm of video restoration, the primary focus of research has been directed towards developing universal network architectures based on multi-frame [52], [54], [58] or recurrent [53], [59] frame inputs to adapt video processing. Complementary to these designs are features like alignment modules [52] and cross-frame attention [54] mechanisms, which are instrumental in augmenting temporal consistency in video streams. Taking some representative methods as examples, EDVR [52] tackles large motions through a Pyramid, Cascading and Deformable (PCD) alignment module, aligning frames at the feature level using deformable convolutions in a coarse-to-fine approach. Additionally, a Temporal and Spatial Attention (TSA) fusion module applies attention both temporally and spatially to highlight key features for restoration. RVRT [53] processes local neighboring frames in parallel within a globally recurrent framework, which divides videos into clips, using the features of one clip to estimate the next. For aligning clips, it uses guided deformable attention to identify and aggregate key features across clips with an attention mechanism. Shift-Net [54] uses grouped spatial-temporal shift, a lightweight technique that captures inter-frame correspondences for multi-frame aggregation. We can deduce that these video restoration designs are typically versatile, making them suitable for a range of application scenarios. However, this versatility also restricts their relevance to specific tasks, as they do not account for the unique characteristics inherent to each individual task.

Video restoration models designed for specific tasks often bear considerable resemblance to general video restoration models. Differently, they are typically enhanced with features unique to the specific task at hand, ensuring a more tailored fit and improved performance for those particular applications. Taking video deraining as an example, Liu et al. [60] redefine the issue of rain occlusion by taking into account rain accumulation, and introduce a recurrent CNN for video deraining which combines rain degradation classification, spatial texture-based rain removal, and temporal coherence-based background detail reconstruction. Yang et al. [55] treat the deraining problem as an inverse process of rainfall synthesis and solve it by estimating the parameters of such a procedure. As for video deblurring, Pan et al. [56] proposed a network that simultaneously estimates the optical flow and latent frames, and exploits a temporal sharpness prior to further assist video deblurring. STFAN [61] is a network performing alignment and deblurring in a unified framework, which contains a Filter Adaptive Convolutional (FAC) layer to align the deblurred features of the previous frame with the current frame and deal with blurs in the current frame. Zhu et al. [57] propose that temporal blur degradation can be accessibly decoupled in the potential frequency domain, and propose to integrate the temporal spectrum into the video deblurring process comprising feature extraction, alignment, aggregation, and optimization. We can observe that task-specific methods often propose more specialized designs tailored to the unique characteristics of different tasks. These targeted approaches typically result in enhanced performance, achieving superior restoration quality and improved temporal consistency. However, a significant limitation previously mentioned is that all these methods necessitate paired video data for supervision, a process that demands substantial labor for data collection. Moreover, in the absence of robust paired supervision, it remains uncertain whether the prevalent designs can continue to yield significant improvements in temporal consistency.

D. Video-to-Video Translation

Unpaired video dehazing can also be regarded as an unpaired video-to-video translation problem. The key is to


Fig. 1. The illustration of our whole framework.

perform domain translation while maintaining the spatio-temporal consistency. RecycleGAN [11] introduces an additional network to predict the future frame, and produces an effective recycle loss to constrain the temporal consistency. MocycleGAN [12] substitutes the future frame predictor with a motion estimator to improve the realness of motion across the transformation, and [62] further proposes the content preserving loss for semantic consistency. Afterward, to deal with the error caused by the motion estimator, URecycGAN [13] directly generates the optical flow in a random way to synthesize pseudo future frames for setting temporal constraints. Although these works have validated the effectiveness of the recycle loss in keeping the temporal consistency, it is challenging to transplant the recycle loss to the unpaired video dehazing task, since the haze variation in future frames is hard to model. In this paper, we introduce depth information and apply it to accurately simulate the ego-motion and haze variation; the temporal consistency can then be constrained by the recycle loss.

III. METHODOLOGY

A. Overall Framework

Given a clean video C_{1:T} = {C_t}_{t=1}^{T} with T frames from the clean domain X_C and a hazy video H_{1:S} = {H_s}_{s=1}^{S} from the hazy domain X_H, our target is to learn a dehazing network G_{H2C} which maps the hazy frames from X_H to X_C without using paired information. As illustrated by Fig. 1, our whole framework consists of two modules: the dehazing module and the rehazing module.

The dehazing module is to generate the clean image from its hazy observation in a pixel-to-pixel manner. In addition, it also estimates the scattering coefficient β̂ of the hazy scene (i.e., the density of the haze). For simplicity, by omitting β̂, the dehazing procedure can be formulated as Ĉ = G_{H2C}(H). The dehazing module has a concise U-Net-shaped architecture with four residual blocks between the encoder and decoder as the main branch to generate the dehazed frames. Besides, to predict the scattering coefficient from the hazy images, the network also contains a scattering coefficient estimation branch, where the multi-scale features from the three layers of the encoder are first processed by global average pooling to extract the global information. Then these features are further processed by two serial convolutions with 1 × 1 kernels, which act like fully-connected layers, to reach the final prediction of the scattering coefficient. The detailed architecture of the dehazing module is shown in Tab. I. In our research, we focus primarily on integrating depth information to enhance training losses for improved dehazing performance and temporal consistency. Consequently, we maintain a network architecture that is both straightforward and concise. However, we posit that refining the network architecture, particularly through the incorporation of multi-frame input and a corresponding multi-frame attention mechanism, as in MAPNet [51] and CG-IDN [48], would positively impact both dehazing performance and temporal consistency. Moreover, the potential integration of depth information within the multi-frame attention mechanism presents a feasible prospect. This integration could be beneficial in reducing the ambiguity of single-frame depth information, enabling depth information to better contribute to the dehazing process. Furthermore, this approach could enhance the model's ability to aggregate multi-frame information more efficiently during the inference stage, offering a promising avenue for further exploration.

The rehazing module aims to transform the given clean image into its hazy version, and contains a depth estimation network G_D and a refine network G_R. Inspired by [10], we introduce depth information for haze generation and randomly sample the scattering coefficient to generate hazy images with variant densities. Having a clean image C and a scattering coefficient β, the pretrained depth estimation network G_D [63] first captures its depth map, then synthesizes a coarse hazy image according to the ASM [1]. The refine network G_R is to refine the coarse hazy image, making it better fit the real hazy distribution. In the dehazing-rehazing branch (lower branch in Fig. 1), the given scattering coefficient β is the same as


TABLE I
The architecture of the dehazing network and the rehazing network. The rehazing network only contains the modules in the upper part of the table, while the dehazing network contains both the upper part (which works as the dehazing branch) and the lower part (which works as the scattering coefficient estimation branch). Each line represents a sequence of layers or a whole block. The k, c, s represent the kernel size, output channels and stride of the convolution layers, respectively. For both the dehazing network and the rehazing network, we set n = 96 for the large model (Ours (L)) and n = 64 for the light model (Ours (S)). The layers with skip connections are shown in the first column; other layers directly take the output of previous layers as input. IN, Global AvgPool and LReLU represent instance normalization, global average pooling and leaky ReLU with slope = 0.2, respectively. The size of the input image is assumed to be 256 × 256 for simplicity; in fact, our network can process images of arbitrary sizes.

Fig. 2. The definition of our coordinate system.

Fig. 3. The effect of different components in ξ*.
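The ASM-based coarse hazy image synthesis performed by the rehazing module can be sketched as follows; the array shapes and value ranges are illustrative assumptions, and the learned refinement network G_R that follows this step is omitted:

```python
import numpy as np

# Sketch of coarse hazy image synthesis with the Atmospheric Scattering
# Model (ASM): H = J * exp(-beta * d) + A * (1 - exp(-beta * d)).
# Shapes, units and value ranges are illustrative; the learned refine
# network that post-processes this coarse result is omitted.

def synthesize_coarse_haze(J, depth, beta, A):
    """Render a hazy version of clean image J (H, W, 3) from its depth map."""
    t = np.exp(-beta * depth)[..., None]  # transmission, broadcast over RGB
    return J * t + A * (1.0 - t)

J = np.full((4, 4, 3), 0.2)       # dark clean image in [0, 1]
depth = np.full((4, 4), 50.0)     # constant scene depth (illustrative units)
hazy = synthesize_coarse_haze(J, depth, beta=0.04, A=1.0)
```

Note how a larger scattering coefficient or a deeper scene pushes pixel values toward the atmospheric light A, which is exactly the depth-dependent behavior the framework later exploits.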

that estimated by the dehazing network for the sake of cycle consistency, while in the rehazing-dehazing branch (upper branch in Fig. 1), the scattering coefficient β is randomly sampled to encourage the diversity of the generated hazy images. For ease of exposition, we omit the scattering coefficient β here; the whole rehazing procedure is expressed as Ĥ = G_{C2H}(C). The refine network G_R has the same architecture as the main branch of the dehazing network G_{H2C}.

B. Depth-Aware Ego-Motion Synthesis

According to the discussion in Sec. I, to achieve promising temporal consistency on the dehazing task, the framework should be able to generate realistic motion and accurately synthesize future hazy frames. Yet previous arts usually predict the motion by a pretrained motion estimator [12] or directly generate motion in a random manner [13], which inevitably produces monotonous or distorted future frames. We believe that simulating ego-motion can generate more accurate and flexible future frames, thus better constraining the temporal consistency and improving the dehazing performance. Consequently, we propose the temporal constraint based on depth-aware ego-motion synthesis for video dehazing, which is described as follows.

Considering that the ego-motion includes translation and rotation, we introduce five parameters to describe the camera motion, i.e., ξ* = {T_x, T_y, T_z, θ_x, θ_y}, where T_x, T_y and T_z denote the translation along the X, Y and Z axes of the coordinate system defined in Fig. 2, respectively, and θ_x, θ_y refer to the rotation angles around the X and Y axes, respectively. The components of the optical flow Õ at position (x, y) are synthesized via [64]:

$$\tilde{O}_t(x, y) = \begin{bmatrix} -\frac{f}{d} & 0 & \frac{x}{d} \\ 0 & -\frac{f}{d} & \frac{y}{d} \end{bmatrix} \begin{bmatrix} T_x \\ T_y \\ T_z \end{bmatrix}, \quad (1)$$

$$\tilde{O}_r(x, y) = -\begin{bmatrix} \frac{xy}{f} & f + \frac{x^2}{f} \\ -f - \frac{y^2}{f} & \frac{xy}{f} \end{bmatrix} \begin{bmatrix} \theta_x \\ \theta_y \end{bmatrix}, \quad (2)$$

where Õ = Õ_t + Õ_r, and Õ_t and Õ_r denote the optical flow caused by camera translation and rotation, respectively. Besides, f denotes the focal length and d is the depth at (x, y). Fig. 3 illustrates the effect of each component of ξ* in generating future frames. We can see that our framework is able to accurately simulate the motion and even the caused distortion (see the green-boxed region in Fig. 3(d) compared to Fig. 3). Further, by altering and combining these parameters, we can flexibly synthesize future frames with arbitrary motion. Fig. 4(e) and (f) give an example of motion including translation towards the Z-axis and rotation along the X-axis and the corresponding


Fig. 4. Visualization of motion and future frame synthesis. Our method can generate more reasonable motion and more realistic distortion. Please zoom in for better visualization.
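The ego-motion flow of Eqs. (1) and (2) can be sketched per pixel as below; the centered pixel grid and the sign conventions are assumptions that follow the standard pinhole ego-motion flow equations, and all names are illustrative rather than the paper's actual code:

```python
import numpy as np

# Sketch of depth-aware ego-motion flow synthesis (Sec. III-B). The
# centered pixel grid and the signs follow the standard pinhole-camera
# ego-motion flow equations; names and conventions are illustrative.

def ego_motion_flow(depth, f, T, theta):
    """Per-pixel flow (u, v) from a depth map (H, W), focal length f,
    translation T = (Tx, Ty, Tz) and rotation theta = (theta_x, theta_y)."""
    h, w = depth.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    x -= (w - 1) / 2.0          # pixel coordinates relative to the
    y -= (h - 1) / 2.0          # principal point
    Tx, Ty, Tz = T
    thx, thy = theta
    # Translational component: scaled by inverse depth (cf. Eq. (1)).
    u = (-f * Tx + x * Tz) / depth
    v = (-f * Ty + y * Tz) / depth
    # Rotational component: independent of depth (cf. Eq. (2)).
    u += (x * y / f) * thx - (f + x ** 2 / f) * thy
    v += (f + y ** 2 / f) * thx - (x * y / f) * thy
    return u, v
```

Warping the current frame with such a flow field (the function W in the text) then yields the synthetic future frame; forward translation along Z produces the expected outward, depth-dependent flow away from the image center.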

synthetic future frame, from which we can infer that our method can generate more realistic motion and better simulate the distortion caused by ego-motion (see the green-boxed regions in Fig. 4(a) and (e)), while the motion and future frames synthesized by [13] are severely distorted (Fig. 4(b) and (c)). Besides, by randomly setting the parameters ξ*, we can synthesize future frames with various motions, which can act as an augmentation mechanism to further improve the temporal consistency and performance.

C. Depth-Aware Future Frame Synthesis

Given a clean frame C_t from its source video C_{1:T}, to synthesize its future frame, we first estimate its depth by G_D, then randomly generate a set of parameters ξ* controlling the ego-motion. Afterward, we synthesize the ego-motion Õ according to ξ* and warp the current frame C_t with it to fetch the synthetic future frame C̃_{t+1}. The procedure of synthesizing a clean future frame can be formulated by d̃ = G_D(C_t), Õ = F(C_t, d̃, ξ*), C̃_{t+1} = W(C_t, Õ), where F(·) and W(·) denote the functions of generating optical flow and warping, respectively.

However, when synthesizing future hazy frames, since the haze varies with depth, simply warping the hazy image with the synthetic optical flow cannot model such variation. Fortunately, by introducing depth information, the ASM can help us achieve this goal. Suppose we have a hazy frame H_s from H_{1:S}. If we first consider the depth variation caused by camera motion and the corresponding change of haze intensities, without considering the change of pixel positions (warping), H_s and its synthetic future frame before warping, H̄_{s+1}, can be expressed by the ASM as:

$$H_s(x) = J(x)e^{-\beta d_1(x)} + A(1 - e^{-\beta d_1(x)}), \quad (3)$$

$$\bar{H}_{s+1}(x) = J(x)e^{-\beta d_2(x)} + A(1 - e^{-\beta d_2(x)}). \quad (4)$$

Dividing Eq. (4) by Eq. (3) yields the following relationship:

$$\bar{H}_{s+1}(x) = (H_s(x) - A) \cdot e^{-\beta(d_2(x) - d_1(x))} + A, \quad (5)$$

by which, after we synthesize the motion by Ĉ_s = G_{H2C}(H_s), d̃_1 = G_D(Ĉ_s), and Õ = F(H_s(x), d̃_1(x), ξ*), we can smoothly synthesize the hazy future frame by first calculating H̄_{s+1}, then changing the pixel positions (warping) as:

$$\bar{H}_{s+1}(x) = (H_s(x) - A) \cdot e^{-\beta(\bar{d}_2(x) - \tilde{d}_1(x))} + A, \quad (6)$$

$$\tilde{H}_{s+1} = W(\bar{H}_{s+1}(x), \tilde{O}), \quad (7)$$

where d̄_2 denotes the depth of H̄_{s+1}, which is calculated according to d̃_1 and the motion parameters ξ*, and A is the atmospheric light estimated by the prior. Fig. 5 shows cases of the frames generated by our depth-aware future frame synthesis method. Specifically, Fig. 5(b) simulates the frame when the camera moves nearer to the objects. With the objects in the image drawing nearer, the haze also becomes lighter, which better fits the ASM. Similarly, in Fig. 5(c), when the camera moves farther away, the haze correspondingly becomes denser. The illustration shows that with the proposed depth-aware future frame synthesis method, our framework can more accurately model the haze variation caused by camera motion, which provides better support for constructing further constraints or regularizations.

D. The Temporal Constraint

Following [13], as illustrated by Fig. 1, having the current frames C_t/H_s and their translated frames Ĥ_t/Ĉ_s, we can now simulate their synthetic future frames C̃_{t+1}/H̃_{s+1} and W(Ĥ_t, Õ)/W(Ĉ_s, Õ). Afterward, we map W(Ĥ_t, Õ)/W(Ĉ_s, Õ) back to their original domains as G_{H2C}(W(Ĥ_t, Õ))/G_{C2H}(W(Ĉ_s, Õ)). Then we can set the constraint between them and the synthetic future frames to guarantee the temporal consistency as follows:

$$\mathcal{L}_{recyc} = \| M_t \cdot (\tilde{C}_{t+1} - G_{H2C}(W(\hat{H}_t, \tilde{O}))) \|_1 + \| M_s \cdot (\tilde{H}_{s+1} - G_{C2H}(W(\hat{C}_s, \tilde{O}))) \|_1, \quad (8)$$

where M_t and M_s refer to the masks of pixels that do not exceed the bounds of the image during the warping operations W(Ĥ_t, Õ) and W(Ĉ_s, Õ), respectively, and ∥·∥_1 denotes the ℓ1 norm. Moreover, to ensure that warping and domain translation are commutative with each other, we also constrain the translated future frames from different routes to be consistent, i.e.:

$$\mathcal{L}_{spa} = \| M_t \cdot (G_{C2H}(\tilde{C}_{t+1}) - W(\hat{H}_t, \tilde{O})) \|_1 + \| M_s \cdot (G_{H2C}(\tilde{H}_{s+1}) - W(\hat{C}_s, \tilde{O})) \|_1. \quad (9)$$

E. Depth-Aware Local Discriminator

Besides the temporal consistency, another crucial issue for unpaired video dehazing is to further enhance the supervision so that the dehazing performance can be well improved. For most unpaired methods, the dehazing ability is sourced from the adversarial loss. However, it is far beyond the ability of a single global discriminator to precisely discriminate all tiny hazy regions in the dehazed image. As a result, we can usually notice obvious haze residuals in the dehazed result. To mitigate such a problem, a straightforward solution is to
Authorized licensed use limited to: University Town Library of Shenzhen. Downloaded on May 22,2024 at 08:57:23 UTC from IEEE Xplore. Restrictions apply.
2394 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 33, 2024

Fig. 5. The illustration of depth-aware future frame synthesis. In our framework, the haze in synthetic frames faithfully changes with the camera motion.
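The synthesis procedure illustrated in Fig. 5 can be summarized in a few lines. The sketch below is a minimal NumPy illustration under our own assumptions, namely a normalized pinhole camera, rotation about the vertical axis only, and function names of our choosing; the paper's exact parameterization of ξ* and the flow generator F(·) may differ. `synthesize_ego_flow` mirrors F(·), and `hazy_intensity_update` mirrors Eq. (5).

```python
import numpy as np

def synthesize_ego_flow(depth, f=0.035, theta_y=0.0, t=(0.0, 0.0, 0.0)):
    """Synthesize an optical-flow field for a virtual ego-motion.

    depth: (H, W) depth map; f: focal length in normalized units;
    theta_y: rotation about the vertical axis (radians); t: camera translation.
    """
    H, W = depth.shape
    # normalized pixel grid in [-1, 1]
    u, v = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    # back-project each pixel to 3-D with the pinhole model
    X, Y, Z = u * depth / f, v * depth / f, depth
    # rotate about the y-axis, then translate the scene
    c, s = np.cos(theta_y), np.sin(theta_y)
    Xr, Yr, Zr = c * X + s * Z + t[0], Y + t[1], -s * X + c * Z + t[2]
    # re-project and take the pixel displacement as the flow
    return np.stack([f * Xr / Zr - u, f * Yr / Zr - v], axis=-1)

def hazy_intensity_update(H_s, A, beta, d1, d2):
    """Eq. (5): re-render the per-pixel haze intensity for a new depth
    map d2, before any spatial warping is applied."""
    return (H_s - A) * np.exp(-beta * (d2 - d1)) + A
```

Warping the updated frame by the returned flow, as in Eq. (7), then yields the synthetic hazy future frame.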

introduce another discriminator that focuses more on the local content. In fact, previous dehazing methods [8] have already introduced local discriminators and achieved promising gains in performance. Nevertheless, we find that the traditional local discriminator samples blocks from the dehazed images during training in a totally random manner. Yet, after a period of training, the models are usually well trained to handle light haze regions but still have difficulties in thoroughly processing dense haze regions. According to the ASM [1], dense haze regions are more likely to be located in areas distant from the camera. Based on this observation, we propose the depth-aware local discriminator to make the local discriminator focus more on the distant regions where the haze is more likely to be denser.

To achieve our purpose, we require a relatively reasonable depth map for the hazy frame. However, the dense haze existing in the hazy frame will prevent the depth estimator from generating an accurate depth map. To address this issue, in the dehazing-rehazing branch, we estimate the depth map from the dehazed image rather than the hazy image. Afterward, given a threshold value τ, the local discriminator only chooses regions centered at pixels whose depth is in the top τ of the image when computing the adversarial loss for the generator during training. Besides, since at the beginning of training the dehazing network is not yet well trained, the remaining haze tends to distribute globally rather than locally in distant regions of the dehazed image. Consequently, we make the working threshold τ′ change linearly to the set value τ over the first 50 epochs. More intuitively, our depth-aware local discriminator gradually shrinks its region of interest from the global image to the distant local region.

F. Training Objectives

Besides the temporal constraint, we also employ the cycle-consistency loss, the adversarial loss, and the pseudo scattering coefficient supervision loss to train the framework.

1) Cycle-Consistency Loss: The cycle-consistency loss maintains content consistency by enforcing that a translated image can be translated back to its original domain. The cycle-consistency loss is formulated as:

L_cyc = ∥Hs − G_C2H(G_H2C(Hs))∥1 + ∥Ct − G_H2C(G_C2H(Ct))∥1. (10)

2) Adversarial Loss: The adversarial losses measure whether the dehazed/rehazed image fits the distribution of the training set XC/XH. Taking the dehazing module G_H2C and the discriminator Dc as an example, the adversarial loss is expressed as:

L_adv(Dc) = E[(Dc(Ct) − 1)^2] + E[Dc(G_H2C(Hs))^2], (11)
L_adv(G_H2C) = E[(Dc(G_H2C(Hs)) − 1)^2], (12)

where Hs and Ct are hazy and clean frames sampled from XH and XC, respectively. The adversarial loss has the same form for the rehazing module G_C2H and discriminator Dh. The dehazing module G_H2C is optimized by adversarial losses from both the global and local discriminators, thus we denote it as L_adv^{L,G} in Fig. 1, while the adversarial loss for the rehazing module is provided by the global discriminator, and we denote it as L_adv^G.

3) Pseudo Scattering Coefficient Supervision Loss: Since we employ the self-augmented mechanism in [10], in the rehazing-dehazing branch (see the upper branch in Fig. 1), a constraint between the randomly sampled scattering coefficient βc fed to the rehazing module and the scattering coefficient β̂c estimated by the dehazing module is required to enable the dehazing module to properly estimate the scattering coefficient. The training loss is defined as:

L_sc = ∥βc − β̂c∥1. (13)

Taking all the loss terms into consideration, the overall objective function to optimize our framework is formulated as:

L = L_cyc + λ_recyc L_recyc + λ_spa L_spa + λ_adv L_adv + λ_sc L_sc, (14)

where λ_recyc, λ_spa, λ_adv, and λ_sc are weights for L_recyc, L_spa, L_adv, and L_sc, respectively. In our experiments, setting λ_recyc = 0.2, λ_spa = 0.2, λ_adv = 0.3, and λ_sc = 0.5 works well.

IV. EXPERIMENTS

In this section, we introduce the datasets used for training and evaluation, the implementation details, metrics, and baselines. Then, extensive quantitative and qualitative comparisons with other closely related works are presented to validate our superiority on both synthetic and real-world datasets. Finally, ablation studies are conducted to validate the effectiveness of each part of our design.

TABLE II
THE QUANTITATIVE COMPARISON RESULTS ON REVIDE AND NYU-DEPTH. SID, VD AND VT IMPLY METHODS ARE FROM SINGLE IMAGE DEHAZING, VIDEO DEHAZING AND VIDEO TRANSLATION, RESPECTIVELY. THE BEST RESULT IS DENOTED IN BOLD AND IS COLORED IN BLUE OR RED ACCORDING TO WHETHER USING PAIRED DATA FOR TRAINING. PATIL ET AL. OFFERS NEITHER TRAINING CODE NOR PRETRAINED MODEL ON NYU-DEPTH DATASET

A. Experimental Configurations

1) Datasets: To validate the effectiveness of our proposed framework, we adopt the video dehazing dataset REVIDE [48] and NYU-Depth [66] for training and evaluation. The REVIDE dataset contains 48 paired sequences of high-resolution hazy and clean frames. Its training set contains 1747 frame-pairs from 42 videos, while the test set has 284 frame-pairs from 6 videos. Due to the limitation of computing resources, we conduct all experiments on its 4× downsampled version. The NYU-Depth dataset includes 64 clean videos with depth. We generate the hazy frames by the ASM with A = [0.9, 0.9, 0.9], β = [0.4, 0.6, 0.8], and use the official tool to split the train/test set. The generated dataset includes 2187 frame-pairs for training and 3588 frame-pairs for testing. Moreover, we collect a set of 6 real hazy videos with 352 frames to evaluate the generalization ability on real outdoor hazy scenes. When training unpaired methods, the frames from both domains are shuffled to avoid introducing paired information.

2) Implementation Details: At the training stage, we adopt the discriminator in [67] with a patch size of 30 × 30. The Adam optimizer [68] with β1 = 0.9, β2 = 0.999, lr = 0.0001, and a batch size of 1 is applied to optimize the networks. The network's parameters are initialized using the Xavier initialization method [69], and the model is trained for 300 epochs with the learning rate progressively adjusted using a cosine annealing strategy. The training samples are randomly cropped to 256×256, and half of them are horizontally flipped for data augmentation. As for the value of the focal length in the depth-aware optical flow synthesis part, we set f = 0.035 in the experiments. The range of the scattering coefficient in estimation and haze generation is set to [0.2, 1.0]. The experiments are conducted on an NVIDIA RTX-2080Ti GPU. In our study, we implement two versions of the model: a lighter variant (Ours(S)) with n = 64, and a larger variant (Ours(L)) using n = 96. The aim of the lighter model is to investigate performance levels when the model's complexity aligns with that of most unpaired models. Conversely, the larger model is designed to explore the model's potential capabilities by moderately increasing its complexity.

3) Comparing Methods: As mentioned in Sec. II, since unpaired video dehazing is a rarely studied topic, we select a series of methods from related tasks, including prior-based image dehazing (DCP [15] and SLP [18]), supervised image dehazing (AODNet [2], FFANet [4]), unpaired image dehazing (DisentGAN [8], USID [39], SNSPGAN [35], RefineDNet [7], D4 [10]), video dehazing (ST-MRF [65], EDVR [52], Patil et al. [50]), and unpaired video translation (RecycleGAN [11], URecycGAN [13]) to make comprehensive comparisons with our method. All comparing models are retrained on the training set, except Patil et al. [50], which only provides a pretrained model on full-size REVIDE and no training code.

4) Evaluation Metrics: We select both referenced and non-referenced metrics for evaluation. For the benchmarks with ground truth, the commonly used referenced metrics Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [70], and CIEDE2000 [71] are adopted. Among them, PSNR measures the pixel-wise error of the given image pair, SSIM better reflects the consistency of structural information, while CIEDE2000 concentrates more on color information. Besides, since keeping temporal consistency is crucial for video dehazing, we employ another metric named Warping Error (WE) [72] to evaluate the temporal stability over frames. Moreover, we also adopt the non-referenced metrics FADE [73], NIQE [74], and BRISQUE [75] to measure the dehazing ability on real-world hazy videos without ground truths. FADE reflects the haze remaining in the image, while NIQE and BRISQUE correspond to whether the image is visually natural.

B. Performance Evaluation

1) Quantitative Comparison: We employ the test sets of REVIDE and NYU-Depth to perform the quantitative comparisons, and the results are reported in Tab. II, from which we can infer that our method shows clear advantages in terms of all three metrics over other methods without using paired training data. However, in accordance with expectations, our

Fig. 6. Visual comparison in haze removal on samples from the REVIDE dataset.
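Among the referenced metrics reported in Tab. II, PSNR is simple enough to state inline. A minimal sketch assuming frames scaled to [0, 1] (`psnr` is our helper name; SSIM and CIEDE2000 are usually taken from standard implementations [70], [71]):

```python
import numpy as np

def psnr(ref, out, peak=1.0):
    """Peak Signal-to-Noise Ratio between a ground-truth frame and a
    restored frame, both scaled to [0, peak]."""
    mse = np.mean((ref - out) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```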

method is in an inferior position in terms of most metrics compared to modern dehazing methods that employ paired training data, like [4], [50], and [52]. Even when we apply the pretrained model of Patil et al. [50], originally trained on the original-size data, to the 4× downsampled dataset due to the unavailability of its training code, it still attains the best performance in restoration quality. As an exception, EDVR [52], a supervised video restoration framework, does not perform well on the REVIDE dataset, since it suffers from unstable training and fails to estimate large offsets when trained on the REVIDE dataset; this phenomenon has been well validated by [48]. Apart from some powerful supervised methods, our method achieves significantly better results than other unsupervised/unpaired approaches, which clearly validates our effectiveness.

2) Qualitative Comparison: We also provide qualitative results of several state-of-the-art unsupervised/unpaired methods and two supervised methods on the REVIDE and NYU-Depth datasets for more direct visual comparisons. The results are shown in Fig. 6 and Fig. 7. In the first case of Fig. 6, the model developed by Patil et al. produces the best result, excelling in global presentation, local detail, and color accuracy. FFANet generates a globally smoother and clearer result. However, within the area highlighted by the red box, it appears to struggle with color shifts and a loss of detail. Our method effectively removes haze and retains richer details, yet it does introduce


Fig. 7. Visual comparison in haze removal on samples from the NYU-Depth dataset.
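The NYU-Depth hazy frames used in these comparisons are generated from clean frames and their depth via the ASM with A = [0.9, 0.9, 0.9] and β sampled from {0.4, 0.6, 0.8}. A minimal sketch of that rendering step, following Eq. (3); `render_haze` is our name for it:

```python
import numpy as np

def render_haze(J, depth, A=(0.9, 0.9, 0.9), beta=0.6):
    """Render a synthetic hazy frame from a clean frame J (H, W, 3) and
    its depth map (H, W) via the atmospheric scattering model."""
    t = np.exp(-beta * depth)[..., None]          # transmission map
    return J * t + np.asarray(A) * (1.0 - t)      # attenuation + airlight
```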

TABLE III
THE COMPARISON ON THE COMPLEXITY OF THE MODELS

TABLE IV
THE WARPING ERROR OF DIFFERENT METHODS ON THE REVIDE DATASET
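As background for the complexity figures in Tab. III, the cost of a plain convolutional layer can be counted directly; since both parameters and multiply-accumulates scale with c_in · c_out, widening a layer between n-channel feature maps from n = 64 (Ours(S)) to n = 96 (Ours(L)) raises its cost by roughly (96/64)² = 2.25×. A small counting helper of our own (stride 1, output map of the given size):

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """Parameter count and multiply-accumulate count of one stride-1
    k x k convolution producing an h x w feature map."""
    params = c_in * c_out * k * k + c_out   # weights + biases
    macs = h * w * c_in * c_out * k * k     # one MAC per weight per output pixel
    return params, macs
```

For example, `conv2d_cost(256, 256, 96, 96, 3)` reports exactly 2.25× the MACs of `conv2d_cost(256, 256, 64, 64, 3)`.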

some extraneous artifacts. Compared to the larger model, our lighter model restores fine detail but suffers from a more obvious global color shift. USID can also remove the haze, but its details are blurred and it generates more artifacts. Generally, other unpaired methods tend to leave noticeable amounts of haze in their results to varying extents. As for the second case, both FFANet and our method successfully produce globally clearer results. However, there is a notable difference in local color rendition, as observed in the red-boxed region: FFANet tends to generate a locally darker hue, whereas our method results in a lighter color. Our larger model better removes the haze on the chair in the red-boxed region than the lighter model. In contrast, other methods either fail to completely remove the haze or experience significant color shifts. Similarly, in Fig. 7, only our method and the supervised methods FFANet and Patil et al. remove the haze, while others leave thin haze in the green-boxed region. Compared to our larger model, our lighter model retains a little more haze in the green-boxed region. In all three cases, the prior-based dehazing methods DCP and SLP tend to over-dehaze the image, causing severe color distortion and low brightness globally.

3) Model Complexity Comparison: To compare the complexity of our model with other methods, we assessed both the number of parameters and the FLOPs (floating point operations) involved. These results are presented in Tab. III. When considering both the complexity and the dehazing performance (as shown in Tab. II), our model with fewer channels (denoted as Ours(S)) maintains a similar or even lower count of parameters and FLOPs relative to most unpaired methods while simultaneously surpassing their performance. This balance between performance and computational efficiency highlights the effectiveness and practicality of our approach in the field of unpaired video dehazing.

Furthermore, by increasing the number of channels, our model with more channels (denoted as Ours(L)) achieves notable performance improvements with a justifiable increase in computational cost. While this expanded version possesses more parameters and FLOPs than some unpaired methods, it remains markedly more efficient than the majority of large-scale supervised models. To fully exhibit the potential ability of our model, we adopt the larger model for the following experiments, including exploring its temporal consistency, generalization ability, and ablation studies.

4) Temporal Consistency Comparison: Besides restoration fidelity, another essential requirement for video dehazing is to generate temporally consistent frames. To evaluate such ability, we first compare the quantitative results, i.e., the warping error for different methods, shown in Tab. IV. From the table we can infer that the supervised FFANet achieves better temporal consistency from paired supervision.
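The Warping Error reported in Tab. IV follows [72]: the current frame is compared against the previous frame warped by the optical flow between them. A simplified NumPy sketch under our own naming; the metric in [72] additionally masks occluded pixels, which we omit here for brevity:

```python
import numpy as np

def bilinear_warp(img, flow):
    """Backward-warp a grayscale image by a per-pixel flow (H, W, 2) in pixels."""
    H, W = img.shape[:2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = np.clip(u + flow[..., 0], 0, W - 1)
    y = np.clip(v + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def warping_error(frame_prev, frame_cur, flow):
    """Mean L1 between the current frame and the flow-warped previous frame."""
    return np.abs(frame_cur - bilinear_warp(frame_prev, flow)).mean()
```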


Fig. 8. Visual comparison in dehazed sequential frames from the REVIDE dataset.

Compared to other unpaired dehazing methods, our method yields the best result in terms of warping error, while other video dehazing approaches that consider temporal consistency show lower but acceptable performance. Yet the unpaired image dehazing methods like RefineDNet perform poorly on temporal consistency, even though we have tried to apply the post-processing method [76] to improve their temporal stability. This indicates that, without proper constraints applied, it is relatively difficult to learn a dehazing network under unpaired supervision, and our depth-aware motion synthesis strategy is more effective in maintaining temporal consistency. Moreover, we also provide visual results on consecutive frames for more intuitive comparisons, which are shown in Fig. 8. Without an explicit temporal constraint, D4 [10] suffers from the most severe temporal inconsistency. As shown in Fig. 8, we can observe obvious inconsistency between frame 1 and frame 3 in color and brightness. URecGAN performs better, but we can still observe inconsistency in the haze left in the green-boxed regions of frame 1 and frame 4. In comparison, our method generates the most consistent and clearest dehazed results.

TABLE V
NON-REFERENCE METRICS ON REAL-WORLD HAZY VIDEOS

TABLE VI
THE WARPING ERROR OF DIFFERENT CONFIGURATIONS OF OUR MODEL ON THE REVIDE DATASET

C. Comparison on Real-World Videos

To check our generalization ability on real outdoor hazy scenes, we compare the dehazing performance of the representative models on our collected real-world hazy videos. The quantitative results are reported in Tab. V, which indicates that our model achieves the best results on most of the metrics. In addition, we also provide visual samples in Fig. 9. In both samples, our model generates the clearest and least hazy results. Both quantitative and qualitative results validate our superior generalization ability over other unpaired/unsupervised methods.

Besides, it is noteworthy that FFANet, despite not being specifically tailored for video dehazing, demonstrates commendable performance when the evaluation distribution closely aligns with its training distribution, thanks to its robust paired supervision. However, its efficacy diminishes in real-world scenarios where the distribution diverges from that of the training dataset. In contrast, our model maintains consistent performance under these varied conditions. This disparity underscores the tendency of supervised models to overfit, whereas unpaired models often mitigate this issue more effectively.

D. Ablation Study

To validate the effectiveness of each part of our design, we conduct a series of ablation studies.

1) Temporal Constraints: To validate the effectiveness of the temporal constraints in our loss, i.e., the unsupervised recycle loss and spatial loss, we remove them from our overall objective function. Since these two terms are employed to constrain temporal consistency, we test them through metrics of both temporal consistency and restoration fidelity. From Tab. VI, we can infer that the warping error increases significantly without L_rec or L_spa, indicating that these two

Fig. 9. Visual comparison frames from real-world hazy videos.

TABLE VII
ABLATION STUDY ON THE REVIDE DATASET. L_rec, L_spa, D-ES, D-FS, D-LOCALD AND V-LOCALD REFER TO WHETHER TO USE THE UNSUPERVISED RECYCLE LOSS, UNSUPERVISED SPATIAL LOSS, DEPTH-AWARE EGO-MOTION SYNTHESIS, DEPTH-AWARE FUTURE FRAME SYNTHESIS, DEPTH-AWARE LOCAL DISCRIMINATOR AND VANILLA LOCAL DISCRIMINATOR, RESPECTIVELY

terms of temporal constraint are crucial to maintaining temporal consistency.

TABLE VIII
EXPERIMENTS ON THE IMPACT OF THE HYPER-PARAMETER VARIANCE ON DEHAZING PERFORMANCE

Besides, the results on restoration fidelity are shown in groups (a), (b) of Tab. VII and Fig. 10(b), (c). Quantitatively, compared to our full model (Tab. VII(g)), removing either L_rec or L_spa causes degradation in most of the metrics. From Fig. 10(b) and Fig. 10(c), the ablated models also tend to leave denser haze in the results. To assess the optimal weighting of these loss terms, we explored how varying weight settings affect model performance, as detailed in Tab. VIII. Notably, when we enlarged or reduced the value of λ_rec (from its baseline of 0.2) by a factor of four, the model continued to benefit from the loss term, albeit with diminished returns. λ_spa exhibited a similar trend to λ_rec; however, decreasing its weight led to a more pronounced decrease in performance, which highlights the significant role λ_spa plays in our model's effectiveness. Consequently, both quantitative and qualitative results show that these temporal constraints also benefit the restoration quality of our dehazing network.

2) Depth-Aware Ego-Motion Synthesis: We then explore the effectiveness of our proposed depth-aware ego-motion synthesis procedure. Specifically, we replace our ego-motion synthesis procedure with the random motion generation as in [13]. From Tab. VII(c) and Fig. 10(d), compared to randomly generating motion, our proposed depth-aware ego-motion synthesis significantly improves the dehazing performance. Moreover, as shown in Tab. VI, our depth-aware ego-motion synthesis mechanism also achieves better results in terms of temporal consistency.

Additionally, as indicated in Eq. 1 and Eq. 2, the depth-aware ego-motion synthesis process is intricately linked to the focal length value f. Fig. 11 illustrates synthesized future

Fig. 10. Visual comparison of different configurations of our model on a challenging sample from REVIDE dataset.
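The depth-aware sampling ablated here picks discriminator patches around the deepest pixels of the dehazed frame, with the working threshold τ′ annealed from the whole image to the set value τ over the first 50 epochs. A minimal sketch under our own naming (`tau_schedule`, `sample_deep_patch`); the actual implementation may differ in how the patch center is drawn:

```python
import numpy as np

def tau_schedule(epoch, tau=0.5, warmup=50):
    # the working threshold is tightened linearly from 1.0 (whole image)
    # down to tau over the first `warmup` epochs
    return 1.0 - (1.0 - tau) * min(epoch, warmup) / warmup

def sample_deep_patch(depth, tau, size, rng):
    """Pick a patch origin whose center lies among the top-tau deepest pixels."""
    H, W = depth.shape
    thresh = np.quantile(depth, 1.0 - tau)
    ys, xs = np.nonzero(depth >= thresh)
    i = rng.integers(len(ys))
    # keep the patch fully inside the image
    cy = int(np.clip(ys[i], size // 2, H - size // 2))
    cx = int(np.clip(xs[i], size // 2, W - size // 2))
    return (cy - size // 2, cx - size // 2, size)
```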

frames when rotating with θy = −5° under different f, from which we can infer that, compared to f = 0.035, the future frame with f = 0.020 suffers from too severe distortion, which is visually unnatural (see the desk part in the image), while f = 0.050 causes unreal distortion, making the rotation more like a plain translation. To investigate the impact of varying f on performance, we conducted experiments with the focal length adjusted from 0.035 to 0.020 and 0.050. The quantitative results are reported in Tab. VIII, which reveal that, when strictly measured, an appropriate focal length leads to optimal performance. Nevertheless, according to our experiments on real outdoor scenes, our model trained under a specific focal length setting generalizes well to frames captured under various settings, which shows its broad applicability.

Fig. 11. The effect of different f when rotating with θy = −5°.

3) Depth-Aware Future Frame Synthesis: In our framework, the depth-aware future frame synthesis is also an important design that enables the framework to more accurately model the minor variations caused by depth changes. In Fig. 5, we have already validated our design by visualizing the synthetic future frames. Here, we further validate its effectiveness by comparison of dehazing ability. As shown in Tab. VII(d) and Fig. 10(e), if we abandon modeling such variation, the dehazing ability of our model suffers obvious degradation in most of the metrics and visual effects.

4) Depth-Aware Local Discriminator: Apart from the motion-related designs in our framework, the depth-aware local discriminator also plays an important part in guiding the network to focus on denser hazy regions, thus improving the dehazing ability. To validate such ability, we conduct experiments to test how much our designed depth-aware guidance can boost the original local discriminator. Specifically, we respectively evaluate the performance of our model without the local discriminator, with the vanilla local discriminator, and with the depth-aware local discriminator. The quantitative results are shown in Tab. VII. From Tab. VII(e), (f), and (g), applying the vanilla local discriminator could increase the dehazing performance by 0.90 dB in PSNR, while using our

depth-aware local discriminator could gain an extra 0.15dB have validated our superiority in video dehazing performance
on PSNR, which is a 16.7% improvement. Fig. 10(f)-(h) also over other state-of-the-art methods.
shows that our depth-aware local discriminator is able to guide Moving forward, we recognize the potential for further
the network to remove the haze more thoroughly. enhancements in our unpaired video dehazing framework,
Moreover, we have conducted experiments to examine how particularly by improving the network architecture. One
the choice of depth threshold (τ ) and the weight of the promising direction is the incorporation of multi-frame input
adversarial loss (λadv ) influence performance. To assess the and corresponding multi-frame attention mechanisms. This
impact of the depth threshold τ , we evaluated dehazing enhancement can provide a more nuanced understanding of the
performance at thresholds of 0.3, 0.5, and 0.7. The quantitative temporal dynamics in video sequences, crucial for maintaining
results, presented in Table VIII, reveal that setting the depth consistency across frames in unpaired video dehazing. Such
threshold to τ = 0.5 yields the most effective dehazing design has been validated as effective in supervised video
results. Regarding the weight of the adversarial loss λadv , dehazing [50], [51], [52], demonstrating its capability in
increasing it from 0.3 to 0.5 leads to a more pronounced drop enhancing video clarity and detail preservation across multiple
in performance, since over-emphasizing the adversarial loss frames. Moreover, the multi-frame attention mechanism can
usually introduces unnecessary artifacts. Conversely, reducing also incorporate depth information. This integration could
the weight from 0.3 to 0.1 results in inadequate adversarial potentially reduce the ambiguity of single-frame depth infor-
learning supervision. Both adjustments away from the optimal mation, helping depth information better contribute to the
setting are found to compromise the model’s overall perfor- dehazing process. However, such specific network architec-
mance. ture design has been less explored in the realm of unpaired
5) Haze generation in training: Our model also incorpo- video dehazing. Our future work will focus on adapting and
rates certain hyper-parameters specifically related to the haze optimizing this mechanism for unpaired scenarios. We aim
generation process. These include the range of the scattering to bridge this gap by developing innovative techniques that
coefficient β, which determines the minimum and maximum leverage multi-frame attention to achieve superior dehazing
densities for haze generation in the hazing-dehazing branch results, ensuring both high-quality dehazing performance and
during training. And λsc , the weight assigned to the pseudo robust temporal consistency in unpaired video environments.
scattering coefficient loss.
Concerning the range of β, it is evident that over-enlarging R EFERENCES
or shrinking this range can lead to the production of hazy [1] S. G. Narasimhan and S. K. Nayar, “Vision and the atmosphere,” Int.
images that deviate from the desired distribution. Such J. Comput. Vis., vol. 48, no. 3, pp. 233–254, 2002.
deviations can subsequently result in a decline in model per- [2] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “AOD-Net: All-in-one
dehazing network,” in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2017,
formance, as shown in Tab. VIII. Regarding λsc , its associated pp. 4770–4778.
loss term is designed to utilize generated hazy images and their [3] X. Liu, Y. Ma, Z. Shi, and J. Chen, “GridDehazeNet: Attention-based
corresponding scattering coefficient values as pseudo training multi-scale network for image dehazing,” in Proc. IEEE/CVF Int. Conf.
Comput. Vis. (ICCV), Oct. 2019, pp. 7314–7323.
data, and trains the scattering coefficient estimation branch of [4] X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “FFA-Net: Feature fusion
the dehazing network to proficiently estimate the scattering attention network for single image dehazing,” in Proc. AAAI Conf. Artif.
coefficient from hazy inputs. As indicated in Table VIII, Intell., 2020, vol. 34, no. 7, pp. 11908–11915.
[5] H. Wu et al., “Contrastive learning for compact single image dehaz-
variations in λsc’s weight from 0.1 to 1 appear to have a negligible impact on performance: since this loss only affects the scattering-coefficient estimation branch, moderate variation of its weight does not substantially change the branch’s behavior during training. However, as highlighted in [10], the presence of λsc is vital for the overall framework; its absence would reduce the model to a vanilla CycleGAN-like structure, leading to a substantial loss in performance.

V. CONCLUSION AND FUTURE WORK

This paper has proposed an unpaired video dehazing framework that fully exploits depth information to improve temporal consistency and dehazing performance. Specifically, we proposed to faithfully simulate ego-motion and future frames by using depth information. Such simulation is introduced to form pseudo supervision that constrains temporal consistency. Besides, we designed a depth-aware local discriminator that concentrates on the regions that are distant from the camera and more likely to be covered by denser haze. To the best of our knowledge, this is the first work studying the unpaired video dehazing task. Extensive experimental results

[5] …ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 10551–10560.
[6] D. Engin, A. Genc, and H. K. Ekenel, “Cycle-Dehaze: Enhanced CycleGAN for single image dehazing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2018, pp. 825–833.
[7] S. Zhao, L. Zhang, Y. Shen, and Y. Zhou, “RefineDNet: A weakly supervised refinement framework for single image dehazing,” IEEE Trans. Image Process., vol. 30, pp. 3391–3404, 2021.
[8] X. Yang, Z. Xu, and J. Luo, “Towards perceptual image dehazing by physics-based disentanglement and adversarial training,” in Proc. AAAI Conf. Artif. Intell., 2018, vol. 32, no. 1, pp. 7485–7492.
[9] X. Chen et al., “Unpaired deep image dehazing using contrastive disentanglement learning,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 632–648.
[10] Y. Yang, C. Wang, R. Liu, L. Zhang, X. Guo, and D. Tao, “Self-augmented unpaired image dehazing via density and depth decomposition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 2037–2046.
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, “Recycle-GAN: Unsupervised video retargeting,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 119–135.
[12] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei, “Mocycle-GAN: Unpaired video-to-video translation,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 647–655.
[13] K. Wang, K. Akash, and T. Misu, “Learning temporally and semantically consistent unpaired video-to-video translation through pseudo-supervision from synthetic optical flow,” in Proc. AAAI Conf. Artif. Intell., 2022, pp. 2477–2486.
Authorized licensed use limited to: University Town Library of Shenzhen. Downloaded on May 22,2024 at 08:57:23 UTC from IEEE Xplore. Restrictions apply.
2402 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 33, 2024
[14] R. Fattal, “Single image dehazing,” ACM Trans. Graph., vol. 27, no. 3, pp. 1–9, 2008.
[15] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2341–2353, Dec. 2010.
[16] D. Berman, T. Treibitz, and S. Avidan, “Non-local image dehazing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1674–1682.
[17] Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3522–3533, Nov. 2015.
[18] P. Ling, H. Chen, X. Tan, Y. Jin, and E. Chen, “Single image dehazing using saturation line prior,” IEEE Trans. Image Process., vol. 32, pp. 3238–3253, 2023.
[19] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, “Single image dehazing via multi-scale convolutional neural networks,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 154–169.
[20] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “DehazeNet: An end-to-end system for single image haze removal,” IEEE Trans. Image Process., vol. 25, no. 11, pp. 5187–5198, Nov. 2016.
[21] Y. Qu, Y. Chen, J. Huang, and Y. Xie, “Enhanced Pix2pix dehazing network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8160–8168.
[22] Q. Deng, Z. Huang, C.-C. Tsai, and C.-W. Lin, “HardGAN: A haze-aware representation distillation GAN for single image dehazing,” in Proc. ECCV. Cham, Switzerland: Springer, 2020, pp. 722–738.
[23] H. Dong et al., “Multi-scale boosted dehazing network with dense feature fusion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2157–2167.
[24] Y. Zheng, J. Zhan, S. He, J. Dong, and Y. Du, “Curricular contrastive regularization for physics-aware single image dehazing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2023, pp. 5785–5794.
[25] T. Ye et al., “Perceiving and modeling density for image dehazing,” in Proc. ECCV, 2022, pp. 130–145.
[26] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[27] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[28] Y. Song, Z. He, H. Qian, and X. Du, “Vision transformers for single image dehazing,” IEEE Trans. Image Process., vol. 32, pp. 1927–1941, 2023.
[29] C. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehazing transformer with transmission-aware 3D position embedding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 5812–5820.
[30] X. Guo, Y. Yang, C. Wang, and J. Ma, “Image dehazing via enhancement, restoration, and fusion: A survey,” Inf. Fusion, vols. 86–87, pp. 146–170, Oct. 2022.
[31] B. Li, Y. Gou, S. Gu, J. Z. Liu, J. T. Zhou, and X. Peng, “You only look yourself: Unsupervised and untrained single image dehazing neural network,” Int. J. Comput. Vis., vol. 129, no. 5, pp. 1754–1767, 2021.
[32] A. Golts, D. Freedman, and M. Elad, “Unsupervised single image dehazing using dark channel prior loss,” IEEE Trans. Image Process., vol. 29, pp. 2692–2701, 2019.
[33] B. Li, Y. Gou, J. Z. Liu, H. Zhu, and J. T. Zhou, “Zero-shot image dehazing,” IEEE Trans. Image Process., vol. 29, pp. 8457–8466, 2020.
[34] A. Yang et al., “Visual-quality-driven unsupervised image dehazing,” Neural Netw., vol. 167, pp. 1–9, Oct. 2023.
[35] Y. Wang et al., “Cycle-SNSPGAN: Towards real-world image dehazing via cycle spectral normalized soft likelihood estimation patch GAN,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 11, pp. 20368–20382, Nov. 2022.
[36] B. Ding et al., “U2D2Net: Unsupervised unified image dehazing and denoising network for single hazy image enhancement,” IEEE Trans. Multimedia, vol. 26, pp. 202–217, 2023.
[37] R. S. Jaisurya and S. Mukherjee, “AGLC-GAN: Attention-based global-local cycle-consistent generative adversarial networks for unpaired single image dehazing,” Image Vis. Comput., vol. 140, Dec. 2023, Art. no. 104859.
[38] Y. Qiao, M. Shao, L. Wang, and W. Zuo, “Learning depth-density priors for Fourier-based unpaired image restoration,” IEEE Trans. Circuits Syst. Video Technol., early access, 2024.
[39] J. Li, Y. Li, L. Zhuo, L. Kuang, and T. Yu, “USID-Net: Unsupervised single image dehazing network via disentangled representations,” IEEE Trans. Multimedia, vol. 25, pp. 3587–3601, 2022.
[40] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2223–2232.
[41] W. Liu, X. Hou, J. Duan, and G. Qiu, “End-to-end single image fog removal using enhanced cycle consistent adversarial networks,” IEEE Trans. Image Process., vol. 29, pp. 7819–7833, 2020.
[42] J. Lehtinen et al., “Noise2Noise: Learning image restoration without clean data,” 2018, arXiv:1803.04189.
[43] J. Zhang, L. Li, Y. Zhang, G. Yang, X. Cao, and J. Sun, “Video dehazing with spatial and temporal coherence,” Vis. Comput., vol. 27, nos. 6–8, pp. 749–757, Jun. 2011.
[44] Z. Li, P. Tan, R. T. Tan, D. Zou, S. Z. Zhou, and L.-F. Cheong, “Simultaneous video defogging and stereo reconstruction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4988–4997.
[45] K. Borkar and S. Mukherjee, “Video dehazing using LMNN with respect to augmented MRF,” in Proc. 11th Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2018, pp. 1–9.
[46] W. Ren et al., “Deep video dehazing with semantic segmentation,” IEEE Trans. Image Process., vol. 28, no. 4, pp. 1895–1908, Apr. 2019.
[47] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “End-to-end united video dehazing and detection,” in Proc. AAAI, 2018, vol. 32, no. 1, pp. 7016–7023.
[48] X. Zhang et al., “Learning to restore hazy video: A new real-world dataset and a new method,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2021, pp. 9239–9248.
[49] P. W. Patil, S. Gupta, S. Rana, and S. Venkatesh, “Dual-frame spatio-temporal feature modulation for video enhancement,” Pattern Recognit., vol. 130, Oct. 2022, Art. no. 108822.
[50] P. W. Patil, S. Gupta, S. Rana, and S. Venkatesh, “Video restoration framework and its meta-adaptations to data-poor conditions,” in Proc. ECCV. Cham, Switzerland: Springer, 2022, pp. 143–160.
[51] J. Xu et al., “Video dehazing via a multi-range temporal alignment network with physical prior,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 18053–18062.
[52] X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy, “EDVR: Video restoration with enhanced deformable convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019, p. 0.
[53] J. Liang et al., “Recurrent video restoration transformer with guided deformable attention,” in Proc. Adv. Neural Inf. Process. Syst., 2022, pp. 378–393.
[54] D. Li et al., “A simple baseline for video restoration with grouped spatial–temporal shift,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 9822–9832.
[55] W. Yang, R. T. Tan, J. Feng, S. Wang, B. Cheng, and J. Liu, “Recurrent multi-frame deraining: Combining physics guidance and adversarial learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 8569–8586, Nov. 2022.
[56] J. Pan, H. Bai, and J. Tang, “Cascaded deep video deblurring using temporal sharpness prior,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3043–3051.
[57] Q. Zhu, M. Zhou, N. Zheng, C. Li, J. Huang, and F. Zhao, “Exploring temporal frequency spectrum in deep video deblurring,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2023, pp. 12428–12437.
[58] L. Xu et al., “Transcoded video restoration by temporal spatial auxiliary network,” in Proc. AAAI, 2022, vol. 36, no. 3, pp. 2875–2883.
[59] Y. Wang and X. Bai, “Versatile recurrent neural network for wide types of video restoration,” Pattern Recognit., vol. 138, Jun. 2023, Art. no. 109360.
[60] J. Liu, W. Yang, S. Yang, and Z. Guo, “Erase or fill? Deep joint recurrent rain removal and reconstruction in videos,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3233–3242.
[61] S. Zhou, J. Zhang, J. Pan, W. Zuo, H. Xie, and J. Ren, “Spatio-temporal filter adaptive network for video deblurring,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2482–2491.
[62] K. Park, S. Woo, D. Kim, D. Cho, and I. S. Kweon, “Preserving semantic and temporal consistency for unpaired video-to-video translation,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1248–1257.
[63] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 3, pp. 1623–1637, Mar. 2022.
[64] T. Takahashi and T. Kurita, “Reconstructing optical flow generated by camera rotation via autoassociative learning,” in Proc. IJCNN, vol. 4, Jun. 2000, pp. 279–283.
[65] B. Cai, X. Xu, and D. Tao, “Real-time video dehazing based on spatio-temporal MRF,” in Proc. Pacific Rim Conf. Multimedia. Cham, Switzerland: Springer, 2016, pp. 315–325.
[66] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Nov. 2011, pp. 601–608.
[67] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. CVPR, 2017, pp. 5967–5976.
[68] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[69] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 249–256.
[70] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[71] G. Sharma, W. Wu, and E. N. Dalal, “The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations,” Color Res. Appl., vol. 30, no. 1, pp. 21–30, Feb. 2005.
[72] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang, “Learning blind video temporal consistency,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 170–185.
[73] L. K. Choi, J. You, and A. C. Bovik, “Referenceless prediction of perceptual fog density and perceptual image defogging,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3888–3901, Nov. 2015.
[74] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Apr. 2012.
[75] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[76] C. Lei, Y. Xing, and Q. Chen, “Blind video temporal consistency via deep video prior,” in Proc. NeurIPS, vol. 33, 2020, pp. 1083–1093.

Yang Yang received the B.S. degree in software engineering from Hebei University of Technology, Tianjin, China, in 2019. He is currently pursuing the Ph.D. degree in software engineering with the College of Intelligence and Computing, Tianjin University, Tianjin. His research interests include image dehazing, image restoration, and deep learning.

Chun-Le Guo (Member, IEEE) received the Ph.D. degree from Tianjin University, China, under the supervision of Prof. Ji-Chang Guo. He was a Visiting Ph.D. Student with the School of Electronic Engineering and Computer Science, Queen Mary University of London (QMUL), U.K. He was a Research Associate with the Department of Computer Science, City University of Hong Kong (CityU of HK). He was a Postdoctoral Researcher with Prof. Ming-Ming Cheng at Nankai University. He is currently an Associate Professor with Nankai University. His research interests include image processing, computer vision, and deep learning.

Xiaojie Guo (Senior Member, IEEE) is currently a tenured Associate Professor with Tianjin University, Tianjin, China. He has published about 100 scientific papers in well-recognized conferences (CVPR, ICCV, NeurIPS, ECCV, ACM MM, and IJCAI) and journals (IEEE Transactions on Pattern Analysis and Machine Intelligence, International Journal of Computer Vision, and IEEE Transactions on Image Processing) in the fields of computer vision, multimedia, and machine learning. He is a fellow of IET and a Senior Member of CCF. He was a recipient of the Wu Wenjun AI Excellent Youth Award from CAAI in 2020, the Piero Zamperoni Best Student Paper Award in ICPR 2010, the Best Student Paper Runner-Up in ICME 2018, and the Best Student Paper Runner-Up in PRCV 2020. He was also a recipient of the Best AE of the Year 2023. He serves/served as the SAC/AC/SPC for CVPR, ACM MM, IJCAI, WACV, and PRCV. He serves/served as an AE for Information Fusion and Image Vision Computing.