On combining denoising with learning-based image decoding
ABSTRACT
Noise is an intrinsic part of any sensor and is present, in various degrees, in any content that has been captured
in real life environments. In imaging applications, several pre- and post-processing solutions have been proposed
to cope with noise in captured images. More recently, learning-based solutions have shown impressive results
in image enhancement in general, and in image denoising in particular. In this paper, we review multiple novel
solutions for image denoising in the compressed domain, by integrating denoising operations into the decoder
of a learning-based compression method. The paper starts by explaining the advantages of such an approach
from different points of view. We then describe the proposed solutions, including both blind and non-blind
methods, and compare them to state-of-the-art methods. Finally, conclusions are drawn from the obtained results,
summarizing the advantages and drawbacks of each method.
Keywords: Image denoising, learning-based compression, latent space, image processing, deep-learning
1. INTRODUCTION
Capturing images through digital devices such as smartphones, tablets or cameras has recently become a common
practice, leading to a growing demand for storage of trillions of pictures per year.∗ The vast amount of stored
data motivates the research toward novel and more efficient compression methods, which could allow reducing
the enormous needs for storage space. While a number of conventional standards have been proposed in the past,
recent research efforts are mostly devoted to learning-based compression methods.1 As an example, the JPEG
Committee has recently organized an activity with the goal of standardizing a novel learning-based compression
algorithm, also known as JPEG AI. The group first reported the leading performance of learning-based methods
over conventional image compression during a Call for Evidence,2 and successively compared different emerging
technologies in a Call for Proposals.3
Recent trends reveal that images are nowadays not only intended for human consumption, but also for
computer vision applications. Therefore, compressed content should not only maximize the perceptual similarity
with its original version, but also guarantee good performance for computer vision and image processing tasks.
In this context, JPEG AI proposed, in the “Use Cases and Requirements for JPEG AI” document,4 a framework
that allows image processing and computer vision tasks to be applied directly in the latent space of learning-based
image compression, therefore without the need for standard decoding. In particular, the compressed stream
should not only allow reconstruction with a standard decoder that specifically targets human vision, but
should also allow computer vision tasks to be applied in the compressed domain, as well as non-normative decoders
for image processing operations like denoising or super-resolution. This framework has the advantage that it does not
require preliminary information about the target application, and that the non-normative decoders or computer
vision networks can be updated to the most recent technology without the need to transcode or re-capture the
content.
Noise is a common disturbance factor in images, which impacts the visual quality of images and the performance
of multiple computer vision methods, including face detection and recognition.5 Noise is often caused
by both intrinsic factors, like the camera sensor, and extrinsic factors, like the ambient light, and may be
impossible to avoid in many situations. This makes image denoising necessary and desirable, and a classical and
well-studied problem in the state of the art. Generally, the goal of image denoising is to reconstruct an estimate
x̂ of the clean image x from its noisy observation y = x + n. The noise n is often approximated in the literature
as additive white Gaussian noise (AWGN), which is signal-independent with zero mean and standard deviation σ.
Real noise is more faithfully approximated with the Gaussian-Poissonian model,6 where the noise is modeled as a
Poissonian signal-dependent component ηp plus a Gaussian signal-independent component ηg.
∗ https://blog.mylio.com/how-many-photos-will-be-taken-in-2021-stats/
In this paper, we propose and assess different non-normative decoders able to jointly reconstruct and denoise
a compressed stream generated by a learned encoder. Notably, different blind and non-blind solutions are
implemented and compared, and the results are assessed using a number of objective quality metrics. Moreover,
the benefit of including extra information, e.g. the standard deviation of the noise σ, is discussed. All the
proposed solutions allow for improved performance when compared to the anchor methods (including compression
and denoising in cascade) in terms of perceptual visual quality and computational complexity.
The remainder of this paper is structured as follows: Section 2 summarizes the state of the art in learning-based
image compression, image denoising, and computer vision and image processing methods applied directly to the
latent space of image compression. Section 3 reviews the different proposed methods for combined compression
and denoising. Results are reported and discussed in Section 4, while conclusions are drawn in Section 5.
2. RELATED WORK
Following the constant growth in the total number of images taken by and stored on digital devices, new and
more efficient solutions to image compression are consistently being researched. Recently, a number of image
compression solutions based on autoencoders have been investigated,7–12 reporting high performance in terms of
compression efficiency and perceived visual quality.13 In particular, Ballé et al. first proposed an autoencoder
solution using nonlinear transforms in cascade to linear convolutions,7 which was then extended by introducing
side information in the form of a hyperprior that captures the spatial dependencies in the latent representation,8
and by including an autoregressive model to reduce the amount of side information.9 More recently, generative models
have been proposed, synthesizing details of the image to improve the performance at the lowest bitrates first,11
and successively maximizing perceptual similarity metrics to generate images with improved visual quality.12
In conventional scenarios, image compression is accompanied by pre- or post-processing operations, with
the goal of limiting the distortions introduced by capture, compression and other factors. In this context, image
denoising is used as both a pre- and a post-processing operation. Multiple conventional denoising methods have
been proposed in the state of the art. As an example, wavelet thresholding14 relies on the wavelet transform
to denoise images. More recently, denoising methods based on neural networks15, 16 were able to achieve better
performance at the cost of additional computational complexity. Notably, Zhang et al. proposed a denoising solution
based on a deep convolutional neural network (CNN), known as DnCNN, trained to estimate the residual noise
from a noisy observation,15 and successively improved the method by integrating a uniform noise level map as
input to the network16 in FFDNet. This additional information enables the network to handle a wide range of
noise levels and to compromise between noise reduction and detail preservation. Recently, Guo et al.17 proposed
a learning-based approach combining a noise level estimation network with a non-blind denoising network into a
unified blind method known as CBDNet, trained on realistic noise and with emphasis on mitigating noise level
under-estimation. Finally, Yue et al.18 proposed an innovative deep-learning-based Bayesian framework for blind
image denoising and noise modeling, based on variational inference.
In recent years, due to the large number of images that are intended for machine consumption, researchers
in image compression have sought to design compression methods able to encode images that are not only visually
pleasing after reconstruction with a conventional decoder, but that also optimize the performance of computer
vision and image processing tasks.4, 19 A limited number of methods attempted to apply computer vision and image
processing methods directly in the latent space of image compression. Early results have been presented by
Torfason et al.,20 who proposed to apply image classification and semantic segmentation in the latent representation
of a learning-based image compression method, showing improvements in run-time, memory usage, robustness and
synergy, while compromising performance only at the lowest bitrates. More recently, super-resolution algorithms
have been applied to the latent space of image compression,21 showing promising results in terms of visual
quality. Preliminary work in the compressed domain image denoising field proposed a non-normative decoder
solution able to combine decoding and denoising operations, while reducing the computational complexity of
the pipeline.22 A different approach for latent-space denoising was proposed by Alvar et al.,23 where a joint
compression and denoising network based on a scalable latent space achieved BD-rate savings while
simultaneously improving the quality of images. A joint compression and denoising method designed for satellite
images was also proposed, training both the encoder and the decoder of a learning-based compression algorithm
with an alternative loss function.24 Finally, Cheng et al.25 recently proposed a pipeline for joint compression and
denoising, with the goal of reducing the storage space by minimizing the bits allocated to the noise
information. While these last methods have demonstrated improved performance, they are suitable only for a
limited set of applications; for instance, reconstructing the original image without denoising is
desirable for preserving artistic intent. Focusing the denoising operations at the decoder side allows for a more
flexible choice of the desired decoder, without the need to store multiple versions of the same content.
Regardless, the research on learning-based computer vision and image processing techniques applied to the
latent space of image compression is still at an early stage, and more efforts are needed to design robust coding
methods which are suitable for both machine and human vision. Notably, the impact of different architectures
on the performance of denoising methods applied in the latent space of image compression has not been fully
investigated yet.
$$L(x, \hat{x}_n) = D(x, \hat{x}_n) \tag{1}$$
[Figure 1 diagram; recoverable block labels: NOISE GENERATOR, FROZEN.]
Figure 1: Training pipeline of the proposed blind combined decoding and denoising method. Here, x represents
the original noise-free image, x̃ the noisy input image, ga the encoder, gs the decoder, ỹ the latent representation,
and x̂ the reconstructed noise-free image.
[Figure 2 diagram; recoverable labels: reshape, average pooling, non-blind B, non-blind S.]
Figure 2: Noise map computation process. The 12-channel noise map σb is used in the non-blind B method,
while the 3-channel noise map σs is used in the non-blind S method.
[Figure 3 diagram; recoverable block labels: NOISE GENERATOR, NOISE MAP COMPUTATION, CONCAT., FROZEN.]
Figure 3: Training pipeline of the proposed non-blind denoising and decoding methods. Here σ refers to the low
resolution noise map with either 3 channels σs for non-blind S or 12 channels σb for non-blind B.
[Figure 4 diagram; recoverable block labels: NOISE GENERATOR, CONCAT., FROZEN.]
Figure 4: Training pipeline of the proposed blind E denoising and decoding method. The architecture is built
upon the non-blind method, with gs referring to the non-blind denoising decoder and ge to the noise level
estimation network. The ground truth uniform noise level map σu contains the overall noise level of the noisy
image x̃, i.e., an estimation of the standard deviation of x̃ − x over all pixels.
level map in the original RGB space, given by the parameters of the Poisson-Gaussian model and by the original
image. For an original RGB image x, in channel i, at a 2D-position p, the local noise level (standard deviation)
σi (p) is given by:
$$\sigma_i(p) = \sqrt{a_i \, x_i(p) + b_i} \tag{2}$$
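To make Eq. (2) concrete, the following is a minimal sketch in plain Python that evaluates the local noise level map for a toy image and draws one noisy realization. The parameter values `a` and `b`, the [0, 1] intensity range, and the use of a Gaussian with matching variance to approximate Poisson-Gaussian noise are assumptions made for illustration; they are not taken from the noise generator used in the paper.

```python
import math
import random


def noise_level_map(image, a, b):
    """Per-pixel noise standard deviation sigma_i(p) = sqrt(a_i * x_i(p) + b_i).

    image: list of channels, each a 2D list of intensities (assumed in [0, 1]).
    a, b:  per-channel Poisson-Gaussian parameters (hypothetical values below).
    """
    return [
        [[math.sqrt(a[c] * value + b[c]) for value in row] for row in channel]
        for c, channel in enumerate(image)
    ]


def add_noise(image, sigma, rng):
    """Assumed approximation: x_tilde = x + sigma * N(0, 1), sampled per pixel."""
    return [
        [[value + s * rng.gauss(0.0, 1.0) for value, s in zip(row, s_row)]
         for row, s_row in zip(channel, s_channel)]
        for channel, s_channel in zip(image, sigma)
    ]


# Toy RGB image: 3 channels of size 2x2.
x = [[[0.0, 0.25], [0.5, 1.0]] for _ in range(3)]
a = [0.01, 0.01, 0.01]        # signal-dependent part (illustrative)
b = [0.0004, 0.0004, 0.0004]  # signal-independent part (illustrative)

sigma = noise_level_map(x, a, b)
x_noisy = add_noise(x, sigma, random.Random(0))
```

Note that the noise level grows with pixel intensity: dark pixels are governed by `b` alone, while bright pixels receive the additional `a`-weighted contribution.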
[Figure 6 diagram; recoverable block labels: Conv. 32 x 3 x 3 (four times), ReLU (four times), IGDN.]
Figure 6: ge architecture used in blind E to estimate the uniform noise level map σu, based on the architecture
of CNN_E from CBDNet.17
[Figure 7 diagram; recoverable block labels: Conv. 2M x 3 x 3 (four times), Conv. M x 3 x 3, ReLU (five times).]
Figure 7: ge architecture used in blind L to estimate the point-wise latent noise level map σ, based on the
architecture of CNN_E from CBDNet.17 M is the number of channels of the noisy latent ỹ.
As obtaining the ground truth spatially variant noise level map may not always be feasible in practice, a
relaxation of the non-blind S model is proposed. During inference only, a uniform noise map σu containing the
empirical noise level of the image is used instead of the ground truth spatially variant noise map. We refer to
this pipeline as non-blind U.
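The construction of such a uniform map can be sketched as follows, taking the empirical noise level to be the standard deviation of x̃ − x over all pixels, as in the caption of Figure 4. The helper names, the toy image, and the map size are hypothetical; in particular, the spatial resolution at which the map is fed to the decoder is not specified here and is chosen arbitrarily.

```python
import math


def empirical_noise_level(clean, noisy):
    """Standard deviation of (noisy - clean) over all channels and pixels."""
    diffs = [
        n - c
        for clean_ch, noisy_ch in zip(clean, noisy)
        for clean_row, noisy_row in zip(clean_ch, noisy_ch)
        for c, n in zip(clean_row, noisy_row)
    ]
    mean = sum(diffs) / len(diffs)
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(variance)


def uniform_noise_map(level, channels, height, width):
    """Constant map holding the same noise level at every position."""
    return [[[level] * width for _ in range(height)] for _ in range(channels)]


# Toy single-channel example with a known noise amplitude of 0.1.
clean = [[[0.0, 0.0], [0.0, 0.0]]]
noisy = [[[0.1, -0.1], [0.1, -0.1]]]

level = empirical_noise_level(clean, noisy)
sigma_u = uniform_noise_map(level, channels=3, height=2, width=2)
```

Replacing the spatially variant map by a single scalar discards the signal dependence of the noise, which is the relaxation the non-blind U pipeline accepts in exchange for a simpler input.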
3.3 Blind combined decoding and denoising with noise map estimation
An additional blind solution is proposed, denoted as blind E, taking advantage of an additional learned model to
estimate the noise map. Similarly to CBDNet,17 the pipeline is composed of a subnetwork (denoted here as ge ) to
estimate the noise level, and of a non-blind denoising subnetwork. The estimation network is trained separately
from the decoder, using the mean squared error (MSE) between a uniform ground truth noise level map and the output
noise map as the objective. The employed denoising network is the non-blind denoising decoder gs presented
in Section 3.2. More specifically, the decoder is chosen depending on the dimensionality of the latent space in
the baseline compression network. The image latent is composed of either 192 or 384 channels, for which the
non-blind S and the non-blind B decoders are used respectively.
The pipeline of the blind E method is represented in Figure 4.
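As a minimal sketch of the two ingredients described in this section, the snippet below shows the MSE objective between an estimated noise map and a uniform ground truth map, and the choice of decoder from the number of latent channels. The function names and the toy values are hypothetical, and the estimation network ge itself is not modeled; only its training objective is illustrated.

```python
def select_decoder(latent_channels):
    """Pick the non-blind decoder variant from the latent dimensionality."""
    if latent_channels == 192:
        return "non-blind S"
    if latent_channels == 384:
        return "non-blind B"
    raise ValueError(f"unsupported latent size: {latent_channels}")


def mse(estimated_map, target_map):
    """Mean squared error between two noise maps of identical shape."""
    estimated = [v for ch in estimated_map for row in ch for v in row]
    target = [v for ch in target_map for row in ch for v in row]
    return sum((e - t) ** 2 for e, t in zip(estimated, target)) / len(target)


# Uniform ground truth map (noise level 0.1) and a toy network output.
sigma_u = [[[0.1, 0.1], [0.1, 0.1]]]
sigma_est = [[[0.1, 0.2], [0.1, 0.0]]]

loss = mse(sigma_est, sigma_u)
decoder_name = select_decoder(192)
```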
3.4 Blind combined decoding and denoising with noise modeling in the latent space
Instead of estimating the noise level information in the original RGB space, an additional method that aims at
inferring the point-wise latent noise level directly from the latent space, here denoted as blind L, is presented.
Notably, we define the latent noise as the difference between the quantized latent ỹ of the noisy image and the
inversely quantized latent y of the clean image. The relationship between the latent noise and the noise applied
to the original image is unknown, as the latent noise is influenced by the encoding and quantization operations.
We approximate this noise in the latent image as zero-mean, point-wise independent Gaussian noise applied to
the clean latent, as is typically done in the literature for noise with unknown properties applied to RGB
images.17, 18 The network is trained for inference of the latent noise level map σ and for combined denoising and
decoding of the noisy latent ỹ simultaneously, using the following loss function:
$$L(D; \theta) = \mathbb{E}_{(x,\tilde{y}) \sim D} \left[ \frac{1}{2} \sum_i \left( \log(\sigma_i^2 + \varepsilon) + \frac{(\tilde{y}_i - \hat{y}_i)^2 + \varepsilon}{\sigma_i^2 + \varepsilon} + \frac{\hat{y}_i^2}{s_i^2 + \varepsilon} \right) + \lambda \lVert \hat{x} - x \rVert_2^2 \right] \tag{3}$$
Here, θ refers to the learned parameters of gs and ge, D is the set of clean-image and noisy-latent pairs
(x, ỹ) that compose the training dataset, si is the noise-free latent scale given by the hyperprior (see
Appendix A), and λ > 0 and ε > 0 are hyperparameters. The detailed derivation of the loss
function can be found in Appendix A. In our implementation, we choose ε = 10^−3 and use the same λ value
as in the rate-distortion trade-off from the baseline compression model. Analogously to the non-blind methods
presented in Section 3.2, the number of output channels in the first convolutional layer of the decoder is doubled
and σ is concatenated to ỹ before decoding.
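The per-sample value of the loss in Eq. (3) can be sketched in plain Python as below. This is an illustrative reading of the formula rather than the training code: all tensors are flattened to lists, the inputs ŷ, σ, s and x̂ are assumed to be provided by the networks, the squared norm is taken as a plain sum of squares, and the expectation over the dataset would be approximated by averaging this value over a batch.

```python
import math


def blind_l_loss(y_tilde, y_hat, sigma, s, x_hat, x, lam, eps=1e-3):
    """Loss of Eq. (3) for a single (x, y_tilde) pair, on flattened inputs.

    y_tilde: noisy latent            y_hat: reconstructed latent
    sigma:   latent noise level map  s:     noise-free latent scale
    x_hat:   reconstructed image     x:     original noise-free image
    """
    latent_term = 0.0
    for yt, yh, sg, sc in zip(y_tilde, y_hat, sigma, s):
        latent_term += math.log(sg ** 2 + eps)                 # log-variance penalty
        latent_term += ((yt - yh) ** 2 + eps) / (sg ** 2 + eps)  # latent residual
        latent_term += yh ** 2 / (sc ** 2 + eps)               # prior on the latent
    distortion = sum((xh - xo) ** 2 for xh, xo in zip(x_hat, x))
    return 0.5 * latent_term + lam * distortion


# Toy values: one latent element, one pixel.
loss_value = blind_l_loss(
    y_tilde=[1.0], y_hat=[1.0], sigma=[1.0], s=[1.0],
    x_hat=[0.5], x=[0.5], lam=0.01,
)
```

The first two terms pull σ toward the magnitude of the latent residual: a larger σ reduces the residual term but is penalized by the logarithm.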
1. Original anchor: the learning-based anchor denoising method, i.e. FFDNet,16 is used to denoise the
images in the JPEG AI noisy test dataset. The denoising is applied before any compression, thus avoiding
any compression artifact.
2. Decoded anchor: the learning-based anchor denoising method, i.e. FFDNet,16 is applied in the pixel
domain after encoding and decoding the noisy test images with the variational autoencoder with a scale
hyperprior model at multiple bitrates.8
The results are reported both in the form of rate-distortion plots and through visual examples. In this paper,
only the results for images ‘00001’ and ‘00016’ of the JPEG AI datasets28 are reported. Notably, the first
image was chosen as it presents a wide smooth area, i.e. a white background; instead, the second image presents
high-frequency patterns, corresponding to the feathers of a bird. Therefore, the performance of the proposed
methods is assessed on a variety of conditions.
Figure 8 and Figure 9 present the objective results for image ‘00001’ and image ‘00016’ respectively. The
results are presented in the form of rate-distortion plots for a number of metrics, namely PSNR Y (i.e. computed
on the luminance component), MS-SSIM Y, VIFp Y, FSIM, and VMAF.28 The objective metrics have been
computed using the objective quality framework provided by JPEG AI.†
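For readers unfamiliar with the luminance-only variant of PSNR, the following is a minimal sketch of how such a metric can be computed. The BT.601 luma weights and the 8-bit peak value are assumptions made for illustration; the JPEG AI objective quality framework may use a different color conversion, and its implementation should be treated as the reference.

```python
import math

BT601_WEIGHTS = (0.299, 0.587, 0.114)  # assumed luma weights


def luma(pixel, weights=BT601_WEIGHTS):
    """Luminance of an (R, G, B) pixel as a weighted sum of its components."""
    r, g, b = pixel
    return weights[0] * r + weights[1] * g + weights[2] * b


def psnr_y(reference, distorted, peak=255.0):
    """PSNR in dB computed on the luminance component only."""
    errors = [
        (luma(p) - luma(q)) ** 2
        for ref_row, dis_row in zip(reference, distorted)
        for p, q in zip(ref_row, dis_row)
    ]
    mse = sum(errors) / len(errors)
    if mse == 0.0:
        return math.inf  # identical luminance planes
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy 2x2 images whose luminance differs by 10 at every pixel.
reference = [[(100, 100, 100), (50, 60, 70)], [(0, 0, 0), (255, 255, 255)]]
distorted = [[(110, 110, 110), (60, 70, 80)], [(10, 10, 10), (245, 245, 245)]]

value = psnr_y(reference, distorted)
```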
Figure 10 and Figure 11, on the other hand, present some visual examples of details from images decoded
and denoised with the proposed methods. Notably, the results for the highest rate (i.e. approximately 1.5 bpp)
are presented, as the effects of compression are milder at such rates and therefore the visual difference between
the methods is more prominent.
4.1 Discussion
The objective quality and visual results presented above highlight that all the proposed methods
improve on the performance of the decoded anchor, but generally not on that of the original anchor.
This can be explained by the fact that FFDNet was trained only on noisy uncompressed images, therefore the
performance on the decoded images is expected to be lower and could be improved by including examples of
encoded noisy images during the training of the network. In addition, the following observations can be drawn
from the rate-distortion plots:
• the proposed blind method, being the simplest and least complex method, shows lower performance than
the non-blind methods. This indicates that the decoder benefits from the added information about the
noise, generating better results both in terms of subjective visual quality and according to the objective
quality metrics. Regardless, the blind method has the advantage of not requiring any prior information
about the noise level, or any additional network which estimates information about the noise, making it
suitable for applications with low-latency constraints.
† https://gitlab.com/wg1/jpeg-ai/jpeg-ai-qaf
[Figure 8 plots: rate-distortion curves against BPP; surviving panel titles: (d) FSIM, (e) VMAF. Legend: original anchor FFDNET, decoded anchor FFDNET, blind, non-blind U, non-blind S, non-blind B, blind E, blind L.]
Figure 8: Rate-distortion results for image ‘00001’ of the JPEG AI noisy test set. The results regard only the
images with the highest noise level.
[Figure 9 plots: rate-distortion curves against BPP; surviving panel titles: (d) FSIM, (e) VMAF. Legend: original anchor FFDNET, decoded anchor FFDNET, blind, non-blind U, non-blind S, non-blind B, blind E, blind L.]
Figure 9: Rate-distortion results for image ‘00016’ of the JPEG AI noisy test set. The results regard only the
images with the highest noise level.
[Figure panels: (a) original, (b) noisy, (c) original anchor.]
5. CONCLUSIONS
In this paper, different methods for integrating denoising operations into the decoder of a learning-based
compression framework are proposed. Notably, both blind and non-blind solutions have been explored. Experimental
results reveal that additional information about the noise distribution benefits the combined methods, which
achieve higher performance both objectively and subjectively when compared to an anchor performing decoding
and denoising in cascade. While in this paper the proposed strategies are only applied to a single framework,
they are flexible enough to be adapted to a wide variety of other learning-based compression methods; for example,
they could in the future be applied to the upcoming JPEG AI learning-based codec. In this work, only the distortion
metric used by the original compression model (i.e. MSE) is used. As future work, a trade-off between two objective
metrics (e.g. MSE and SSIM) or a metric specific to noise reduction performance assessment might be used to
further improve the perceptual visual quality of the decoded and denoised images. Additionally, more advanced
approaches to estimate properties of latent noise might be explored.
ACKNOWLEDGMENTS
The authors would like to acknowledge support from the Swiss National Scientific Research project entitled
“Advanced Visual Representation and Coding in Augmented and Virtual Reality” under grant number
200021_178854.
REFERENCES
[1] Testolina, M., Upenik, E., and Ebrahimi, T., “Comprehensive assessment of image compression algorithms,”
in [Applications of Digital Image Processing XLIII ], 11510, 469–485, SPIE (2020).
[2] ISO/IEC JTC 1/SC29/WG1 N89022, “Report on the JPEG AI Call for Evidence Results.” 89th JPEG
Meeting, Online, October 2020.
[3] ISO/IEC JTC 1/SC29/WG1 N100250, “Report on the JPEG AI Call for Proposals Results.” 96th JPEG
Meeting, Online, July 2022.
[4] ISO/IEC JTC 1/SC29/WG1 N100094, “Use Cases and Requirements for JPEG AI.” 94th JPEG Meeting,
Online, January 2022.
[5] Lu, Y., Barras, L., and Ebrahimi, T., “A novel framework for assessment of deep face recognition systems
in realistic conditions,” in [10th European Workshop on Visual Information Processing (EUVIP)], IEEE
(2022).
[6] Foi, A., Trimeche, M., Katkovnik, V., and Egiazarian, K., “Practical Poissonian-Gaussian noise modeling
and fitting for single-image raw-data,” IEEE Transactions on Image Processing 17(10), 1737–1754 (2008).
[7] Ballé, J., Laparra, V., and Simoncelli, E. P., “End-to-end optimized image compression,” in [5th International
Conference on Learning Representations, ICLR 2017], (2017).
[8] Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, N., “Variational image compression with a
scale hyperprior,” in [International Conference on Learning Representations ], (2018).
[9] Minnen, D., Ballé, J., and Toderici, G. D., “Joint autoregressive and hierarchical priors for learned image
compression,” Advances in neural information processing systems 31 (2018).
[10] Cheng, Z., Sun, H., Takeuchi, M., and Katto, J., “Learned image compression with discretized Gaussian
mixture likelihoods and attention modules,” in [Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition], 7939–7948 (2020).
[11] Agustsson, E., Tschannen, M., Mentzer, F., Timofte, R., and Gool, L. V., “Generative adversarial networks
for extreme learned image compression,” in [Proceedings of the IEEE/CVF International Conference on
Computer Vision], 221–231 (2019).
[12] Mentzer, F., Toderici, G. D., Tschannen, M., and Agustsson, E., “High-fidelity generative image compression,”
Advances in Neural Information Processing Systems 33, 11913–11924 (2020).
[13] Ascenso, J., Akyazi, P., Pereira, F., and Ebrahimi, T., “Learning-based image coding: early solutions
reviewing and subjective quality evaluation,” in [Optics, Photonics and Digital Technologies for Imaging
Applications VI], 11353, 164–176, SPIE (2020).
[14] Chang, S. G., Yu, B., and Vetterli, M., “Adaptive wavelet thresholding for image denoising and compression,”
IEEE Transactions on Image Processing 9(9), 1532–1546 (2000).
[15] Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L., “Beyond a Gaussian denoiser: Residual learning
of deep CNN for image denoising,” IEEE Transactions on Image Processing 26(7), 3142–3155 (2017).
[16] Zhang, K., Zuo, W., and Zhang, L., “FFDNet: Toward a fast and flexible solution for CNN-based image
denoising,” IEEE Transactions on Image Processing 27(9), 4608–4622 (2018).
[17] Guo, S., Yan, Z., Zhang, K., Zuo, W., and Zhang, L., “Toward convolutional blind denoising of real
photographs,” in [Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ],
1712–1722 (2019).
[18] Yue, Z., Yong, H., Zhao, Q., Meng, D., and Zhang, L., “Variational denoising network: Toward blind noise
modeling and removal,” Advances in neural information processing systems 32 (2019).
[19] Choi, H. and Bajić, I. V., “Scalable image coding for humans and machines,” IEEE Transactions on Image
Processing 31, 2739–2754 (2022).
[20] Torfason, R., Mentzer, F., Agustsson, E., Tschannen, M., Timofte, R., and Van Gool, L., “Towards image
understanding from deep compression without decoding,” in [International Conference on Learning
Representations], (2018).
[21] Upenik, E., Testolina, M., and Ebrahimi, T., “Towards super resolution in the compressed domain of
learning-based image codecs,” in [Applications of Digital Image Processing XLIV ], 11842, 531–541, SPIE
(2021).
[22] Testolina, M., Upenik, E., and Ebrahimi, T., “Towards image denoising in the latent space of learning-based
compression,” in [Applications of Digital Image Processing XLIV ], 11842, 412–422, SPIE (2021).
[23] Alvar, S. R., Ulhaq, M., Choi, H., and Bajić, I. V., “Joint image compression and denoising via latent-space
scalability,” arXiv preprint arXiv:2205.01874 (2022).
[24] de Oliveira, V. A., Chabert, M., Oberlin, T., Poulliat, C., Bruno, M., Latry, C., Carlavan, M., Henrot,
S., Falzon, F., and Camarero, R., “Satellite image compression and denoising with neural networks,” IEEE
Geoscience and Remote Sensing Letters 19, 1–5 (2022).
[25] Cheng, K. L., Xie, Y., and Chen, Q., “Optimizing image compression via joint learning with denoising,”
arXiv preprint arXiv:2207.10869 (2022).
[26] Bégaint, J., Racapé, F., Feltman, S., and Pushparaja, A., “CompressAI: a PyTorch library and evaluation
platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029 (2020).
[27] Alvar, S. R. and Bajić, I. V., “Practical noise simulation for RGB images,” arXiv preprint arXiv:2201.12773
(2022).
[28] ISO/IEC JTC1/SC29/WG1 N100106, “JPEG AI Common Training and Testing Conditions.” 94th Meeting,
Online, January 2022.
APPENDIX A. LOSS FUNCTION DERIVATION OF THE BLIND L METHOD
Notation :
x : original noise-free image
y : (unquantized) latent representation of the noise-free image, y = ga (x)
s : noise-free latent scale hyperprior s = hs (ha (Q{y}))
ỹ : noisy latent
x̂ : reconstructed image, x̂ = gs (ỹ, σ)
ŷ : reconstructed latent, ŷ = ga (x̂) = ga (gs (ỹ, σ))
σ : noise level map of the noisy latent, σ = ge (ỹ)
z : unobserved noise-free latent
The combined denoising and decoding problem is first posed as the modeling of our data, original noise-free
image/noisy-latent pairs (x, ỹ). The objective is to find the network parameter values that maximize the expected
log-likelihood of the joint distribution p(x, ỹ), over the dataset D of original noise-free images/noisy-latent pairs.
E(x,ỹ)∈D [log p(x, ỹ)] = E(x,ỹ)∈D [log p(ỹ)] + E(x,ỹ)∈D [log p(x|ỹ)] (4)
Under the same assumptions as in the compression framework,8 where λ is the hyper-parameter of the
rate-distortion trade-off :
The above allows for interpretation of the objective in parallel to that of a compression model :
Where the term R corresponds to the rate of latent ỹ and D to the distortion in a classic rate distortion
trade-off, where before quantization, the latent is perturbed by a more complex noise source. Note that unlike
learned compression, our scope here is not to find a latent representation that minimizes rate, but to minimize
the rate given the fixed noisy latent, by inferring parameters σ and ŷ on the distribution of the latent.
As the evidence log p(ỹ) is intractable, we instead consider its evidence lower bound, using an approximation
q(z) of the true distribution p(z|ỹ), similarly to what is proposed for VDNet.18 Note that unlike in the framework
presented by Yue et al.,18 only the noise-free latent z is an unobserved variable and not the noise level map σ.
Similarly to the approach taken by VDNet framework18 for denoising in the RGB space but here in the
latent space, a true distribution is imposed on z, where ε is a hyperparameter. The distribution q(z) which
approximates p(z|ỹ) is also defined:
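$$z \sim \mathcal{N}(0, S + \varepsilon I) \tag{10}$$
$$S_{ij} = \begin{cases} s_i^2 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{11}$$
$$z \overset{q}{\sim} \mathcal{N}(\hat{y}, \varepsilon I) \tag{12}$$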
Finally, based on our point-wise independent Gaussian latent noise assumption, the distribution of ỹ|z is
given by:
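This gives the following loss function, to be minimized with respect to the learned parameters θ of gs and ge:
$$L(D; \theta) = \mathbb{E}_{(x,\tilde{y}) \sim D} \left[ \frac{1}{2} \sum_i \left( \log(\sigma_i^2 + \varepsilon) + \frac{(\tilde{y}_i - \hat{y}_i)^2 + \varepsilon}{\sigma_i^2 + \varepsilon} + \frac{\hat{y}_i^2}{s_i^2 + \varepsilon} \right) + \lambda \lVert \hat{x} - x \rVert_2^2 \right] \tag{15}$$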
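As a consistency check on (15), the latent terms can be recovered from a negative evidence lower bound. The sketch below assumes a diagonal Gaussian likelihood for ỹ given z, with mean z and per-element variance σi² + ε. This likelihood form is an assumption made here for illustration, chosen because, together with the prior (10) and the approximate posterior (12), it reproduces the terms of (15); it should not be read as a restatement of the original derivation.

```latex
\begin{aligned}
\mathbb{E}_{q(z)}\!\left[(\tilde{y}_i - z_i)^2\right]
  &= (\tilde{y}_i - \hat{y}_i)^2 + \varepsilon, \\
-\mathbb{E}_{q(z)}\!\left[\log p(\tilde{y} \mid z)\right]
  &= \frac{1}{2} \sum_i \left[ \log\!\big(2\pi(\sigma_i^2 + \varepsilon)\big)
     + \frac{(\tilde{y}_i - \hat{y}_i)^2 + \varepsilon}{\sigma_i^2 + \varepsilon} \right], \\
\mathrm{KL}\big(q(z) \,\|\, p(z)\big)
  &= \frac{1}{2} \sum_i \left[ \log\frac{s_i^2 + \varepsilon}{\varepsilon}
     + \frac{\hat{y}_i^2 + \varepsilon}{s_i^2 + \varepsilon} - 1 \right].
\end{aligned}
```

Summing the last two lines and discarding every term that does not depend on θ (recall that s is computed from the noise-free latent by ha and hs, which are not part of θ) leaves the three latent terms inside the sum of (15); the remaining λ-weighted term is the image distortion.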