Image Colorization Using Generative Adversarial Networks
Abstract. Over the last decade, the process of automatic image col-
orization has been of significant interest for several application areas
including restoration of aged or degraded images. This problem is highly
ill-posed due to the large degrees of freedom during the assignment of
color information. Many of the recent developments in automatic col-
orization involve images that contain a common theme or require highly
processed data such as semantic maps as input. In our approach, we
attempt to fully generalize the colorization procedure using a condi-
tional Deep Convolutional Generative Adversarial Network (DCGAN).
The network is trained over datasets that are publicly available such as
CIFAR-10 and Places365. The results of the generative model and traditional
deep neural networks are compared.
1 Introduction
The automatic colorization of grayscale images has been an active area of re-
search in machine learning for an extensive period of time. This is due to the
large variety of applications such as color restoration and image colorization for
animations. In this manuscript, we will explore the method of colorization using
generative adversarial networks (GANs) proposed by Goodfellow et al. [1]. The
network is trained on the datasets CIFAR-10 and Places365 [2] and its results
will be compared with those obtained using existing convolutional neural net-
works (CNN).
Models for the colorization of grayscale images date back to the early 2000s. In 2002,
Welsh et al. [3] proposed an algorithm that colorized images through texture
synthesis. Colorization was done by matching luminance and texture informa-
tion between an existing color image and the grayscale image to be colorized.
However, this proposed algorithm was defined as a forward problem, thus all so-
lutions were deterministic. Levin et al. [4] proposed an alternative formulation to
the colorization problem in 2004. This formulation followed an inverse approach,
where the cost function was designed by penalizing the difference between each
pixel and a weighted average of its neighboring pixels. Both of these proposed
methods still required significant user intervention which made the solutions less
than ideal.
In [5], a colorization method based on conditional GANs was proposed, and its
results were compared with those of convolutional neural networks. The mod-
els in the study not only learn the mapping from input to output image, but
also learn a loss function to train this mapping. Their approach was effective in
ill-posed problems such as synthesizing photos from label maps, reconstructing
objects from edge maps, and colorizing images. We aim to extend their approach
by generalizing the colorization procedure to high resolution images and suggest
training strategies that speed up the process and greatly stabilize it.
2 Generative Adversarial Network

In 2014, Goodfellow et al. [1] proposed a new type of generative model: genera-
tive adversarial networks (GANs). A GAN is composed of two smaller networks
called the generator and discriminator. As the name suggests, the generator’s
task is to produce results that are indistinguishable from real data. The discrim-
inator’s task is to classify whether a sample came from the generator’s model
distribution or the original data distribution. Both of these subnetworks are
trained simultaneously until the generator is able to consistently produce results
that the discriminator cannot classify.
In the original formulation, the generator and discriminator both follow a multilayer
perceptron model. Since colorization is a class of image translation problems, in our
work the generator and discriminator are both convolutional neural networks (CNNs).
The generator is represented by the mapping G(z; θG ), where z is a noise vari-
able (uniformly distributed) that acts as the input of the generator. Similarly,
the discriminator is represented by the mapping D(x; θD ) to produce a scalar
between 0 and 1, where x is a color image. The output of the discriminator can
be interpreted as the probability of the input originating from the training data.
These constructions of G and D enable us to determine the optimization prob-
lem for training the generator and discriminator: G is trained to minimize the
probability that the discriminator makes a correct prediction on generated data,
while D is trained to maximize the probability of assigning the correct label.
Mathematically, this can be expressed as

\min_{\theta_G} J^{(G)}(\theta_D, \theta_G) = \min_{\theta_G} \mathbb{E}_z[\log(1 - D(G(z)))], \qquad (1)

\max_{\theta_D} J^{(D)}(\theta_D, \theta_G) = \max_{\theta_D} \big( \mathbb{E}_y[\log(D(y))] + \mathbb{E}_z[\log(1 - D(G(z)))] \big). \qquad (2)

The above two equations provide the cost functions required to train a GAN.
In the literature, these two cost functions are often presented as a single minimax
game problem with the value function V (G, D):

\min_G \max_D V(G, D) = \mathbb{E}_y[\log(D(y))] + \mathbb{E}_z[\log(1 - D(G(z)))]. \qquad (3)
In our model, we have decided to use an alternate cost function for the generator.
In equation 1, the cost function is defined by minimizing the probability of the
discriminator being correct. However, this approach presents two issues: 1) If
the discriminator performs well during training stages, the generator will have a
near-zero gradient during back-propagation. This tremendously slows down the rate
of convergence, because the generator will continue to produce similar results
during training. 2) The original cost function is a strictly decreasing function
that is unbounded below. This will cause the cost function to diverge to −∞
during the minimization process.
To address the above issues, we have redefined the generator’s cost function
by maximizing the probability of the discriminator being mistaken, as opposed
to minimizing the probability of the discriminator being correct. The new cost
function was suggested by Goodfellow in his NIPS 2016 tutorial [6] as a heuristic,
non-saturating game, and is presented as:
\max_{\theta_G} J^{(G)^*}(\theta_D, \theta_G) = \max_{\theta_G} \mathbb{E}_z[\log(D(G(z)))], \qquad (4)

or, equivalently, in its minimization form,

\min_{\theta_G} J^{(G)^*}(\theta_D, \theta_G) = \min_{\theta_G} -\mathbb{E}_z[\log(D(G(z)))]. \qquad (5)
The comparison between the cost functions in equations 1 and 5 can be visualized
in figure 1 by the blue and red curves respectively.

Fig. 1: Comparison of cost functions J^{(G)} (dashed blue) and -J^{(G)^*} (red).

In addition, the cost function was further modified by using the $\ell_1$-norm in the
regularization term [5]. This
produces an effect where the generator is forced to produce results that are
similar to the ground truth images. This will theoretically preserve the structure
of the original images and prevent the generator from assigning arbitrary colors
to pixels just to “fool” the discriminator. The cost function takes the form
\min_{\theta_G} J^{(G)^*}(\theta_D, \theta_G) = \min_{\theta_G} -\mathbb{E}_z[\log(D(G(z)))] + \lambda \| G(z) - y \|_1, \qquad (6)

where y is the corresponding ground-truth color image.
2.1 Conditional GAN
In a conditional GAN [7], both the generator and the discriminator are conditioned
on additional information; in our case, the condition is the grayscale input image x.
The cost functions are modified accordingly:

\min_{\theta_G} J^{(G)}(\theta_D, \theta_G) = \min_{\theta_G} -\mathbb{E}_z[\log(D(G(0_z|x)))] + \lambda \| G(0_z|x) - y \|_1, \qquad (7)

\max_{\theta_D} J^{(D)}(\theta_D, \theta_G) = \max_{\theta_D} \big( \mathbb{E}_y[\log(D(y|x))] + \mathbb{E}_z[\log(1 - D(G(0_z|x)|x))] \big). \qquad (8)

Here 0_z denotes that no explicit noise vector is provided to the generator; noise is
introduced only in the form of dropout (section 3).
The discriminator receives colored images from both the generator and the original
data, along with the grayscale input as the condition, and tries to decide whether
each pair contains the true colored image.
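As an illustration, equations 7 and 8 can be written as binary cross-entropy terms plus an L1 penalty, since the discriminator outputs a probability. The following is a minimal PyTorch-style sketch under that assumption; the `generator` and `discriminator` modules and tensor names are placeholders for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def gan_losses(generator, discriminator, gray, real_color, lam=100.0):
    """Non-saturating conditional GAN losses with an L1 penalty (eqs. 7 and 8).

    gray:       grayscale condition x, shape (N, 1, H, W)
    real_color: ground-truth color image y, shape (N, 3, H, W)
    lam:        weight lambda of the L1 term in eq. 7
    The discriminator is assumed to take (color, grayscale) and return a probability.
    """
    fake_color = generator(gray)      # G(0_z | x); noise enters only through dropout in G

    # Eq. 8: maximize log D(y|x) + log(1 - D(G(0_z|x)|x)), i.e. minimize the BCE below.
    d_real = discriminator(real_color, gray)
    d_fake = discriminator(fake_color.detach(), gray)   # detach so only D is updated here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))

    # Eq. 7: minimize -log D(G(0_z|x)|x) + lambda * ||G(0_z|x) - y||_1 (non-saturating game).
    d_fake_for_g = discriminator(fake_color, gray)
    g_loss = F.binary_cross_entropy(d_fake_for_g, torch.ones_like(d_fake_for_g)) \
           + lam * F.l1_loss(fake_color, real_color)

    return g_loss, d_loss
```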
3 Method
For our baseline model, we follow the “fully convolutional network”[8] model
where the fully connected layers are replaced by convolutional layers which in-
clude upsampling instead of pooling operators. This idea is based on encoder-
decoder networks [9], where the input is progressively downsampled using a series of
contractive encoding layers, and then the process is reversed using a series of
expansive decoding layers to reconstruct the input. Using this method we can
train the model end-to-end without consuming large amounts of memory. Note
that the successive downsampling leads to much more compact feature learning
in the middle layers. This strategy is a crucial attribute of the network; otherwise,
the resolution would be limited by GPU memory.
Our baseline model needs to find a direct mapping from the grayscale image
space to the color image space. However, the encoder-decoder architecture has an
information bottleneck that prevents the flow of low-level information through the
network. To fix this problem, features from the contracting path are concatenated
with the upsampled output in the expansive path within the network.
This also makes the input and output share the locations of prominent edges in
grayscale and colored images. This architecture is called U-Net [10], where skip
connections are added between layer i and layer n-i.
The architecture of the model is symmetric, with n encoding units and n decod-
ing units. The contracting path consists of 4 × 4 convolution layers with stride 2
for downsampling, each followed by batch normalization [11] and a Leaky-ReLU
[12] activation function with a slope of 0.2. The number of channels is doubled
after each step. Each unit in the expansive path consists of a 4 × 4 transposed
convolutional layer with stride 2 for upsampling, concatenation with the acti-
vation map of the mirroring layer in the contracting path, followed by batch
normalization and a ReLU activation function. The last layer of the network is a
1 × 1 convolution, which is equivalent to a cross-channel parametric pooling layer.
We use the tanh function for the last layer, as proposed by [5]. The number of chan-
nels in the output layer is 3, corresponding to the L*a*b* color space (Fig. 2).
The baseline network is trained to minimize the L2 (Euclidean) distance between the
predicted and ground-truth color images,

\min_{\theta} J(x; \theta) = \min_{\theta} \frac{1}{3n} \sum_{p=1}^{n} \sum_{\ell=1}^{3} \big( h(x; \theta)^{(p,\ell)} - y^{(p,\ell)} \big)^2, \qquad (9)

where x is our grayscale input image, y is the corresponding color image, p and
\ell are indices of pixels and color channels respectively, n is the total number of
pixels, and h is a function mapping from grayscale to color images.
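The generator architecture described above can be summarized by the following minimal PyTorch-style sketch. It is only a sketch under stated assumptions: the depth and channel counts are illustrative, and the released code may be organized differently.

```python
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    """Encoder-decoder generator with skip connections (illustrative channel sizes)."""

    def __init__(self, in_ch=1, out_ch=3, base=64, depth=4):
        super().__init__()
        self.encoders = nn.ModuleList()
        self.decoders = nn.ModuleList()

        # Contracting path: 4x4 conv, stride 2, BatchNorm, LeakyReLU(0.2); channels double each step.
        ch = in_ch
        enc_chs = []
        for i in range(depth):
            nxt = base * (2 ** i)
            self.encoders.append(nn.Sequential(
                nn.Conv2d(ch, nxt, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(nxt) if i > 0 else nn.Identity(),  # no BatchNorm on the first layer
                nn.LeakyReLU(0.2, inplace=True)))
            enc_chs.append(nxt)
            ch = nxt

        # Expansive path: 4x4 transposed conv, stride 2, concat with mirrored encoder output, BN, ReLU.
        for i in reversed(range(depth - 1)):
            skip = enc_chs[i]
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose2d(ch, skip, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(skip),
                nn.ReLU(inplace=True)))
            ch = skip * 2          # after concatenation with the skip connection

        # Last decoding unit plus the 1x1 convolution and tanh output; no BatchNorm on the last layer.
        self.final = nn.Sequential(
            nn.ConvTranspose2d(ch, base, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(base, out_ch, kernel_size=1),   # 1x1 conv (cross-channel parametric pooling)
            nn.Tanh())

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1]          # the bottleneck has no mirrored skip
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = torch.cat([dec(x), skip], dim=1)
        return self.final(x)
```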
For the GAN model, we follow the architecture guidelines of DCGAN [13] for the
generator and discriminator architectures. The architecture was also modified as
a conditional GAN instead of a traditional DCGAN; we also follow the guidelines in
[5] and provide noise only in the form of dropout [14], applied on several layers
of our generator. The architecture of generator G is the same as the baseline. For
discriminator D, we use an architecture similar to the baseline's contracting path:
a series of 4 × 4 convolutional layers with stride 2 with the number of channels
being doubled after each downsampling. All convolution layers are followed by
batch normalization and a leaky ReLU activation with slope 0.2. After the last layer,
a convolution is applied to map to a one-dimensional output, followed by a sigmoid
function that returns the probability of the input being real or fake. The input
of the discriminator is a colored image either coming from the generator or true
labels, concatenated with the grayscale image.
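A corresponding PyTorch-style sketch of the discriminator follows: a stack of 4 × 4 stride-2 convolutions with channel doubling, batch normalization (except on the first layer) and leaky ReLU, followed by a convolution to a one-dimensional output and a sigmoid. The number of downsampling steps, channel widths, and the final kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Maps a (color image, grayscale condition) pair to a single real/fake probability."""

    def __init__(self, in_ch=4, base=64, depth=4):   # 3 color channels + 1 grayscale condition
        super().__init__()
        layers = []
        ch = in_ch
        for i in range(depth):
            nxt = base * (2 ** i)
            layers += [nn.Conv2d(ch, nxt, kernel_size=4, stride=2, padding=1)]
            if i > 0:                                  # no BatchNorm on the first layer
                layers += [nn.BatchNorm2d(nxt)]
            layers += [nn.LeakyReLU(0.2, inplace=True)]
            ch = nxt
        # Final convolution collapses the remaining spatial map to one value per image;
        # kernel_size=2 assumes a 2x2 feature map at this point (e.g. 32x32 inputs, depth 4).
        layers += [nn.Conv2d(ch, 1, kernel_size=2), nn.Flatten(), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, color, gray):
        # Probability that the (color, grayscale) pair is a real training sample.
        return self.net(torch.cat([color, gray], dim=1))
```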
For training our network, we used the Adam [15] optimizer and the weight initial-
ization proposed by [16]. We used an initial learning rate of 2 × 10−4 for both the
generator and the discriminator, and manually decayed the learning rate by a factor
of 10 whenever the loss function started to plateau. For the hyper-parameter λ
we followed the protocol from [5] and chose λ = 100, which forces the generator
to produce images similar to ground truth.
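A sketch of the corresponding training setup, assuming the PyTorch Adam optimizer; the β1 = 0.5 momentum value anticipates the reduced-momentum guideline discussed below, and the scheduler is only one possible stand-in for the manual decay-by-10 rule described above.

```python
import torch

def build_optimizers(generator, discriminator, lr=2e-4, betas=(0.5, 0.999)):
    """Adam optimizers with the learning rate and reduced momentum used in our experiments."""
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=betas)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=betas)
    return opt_g, opt_d

def build_scheduler(optimizer):
    # Decay the learning rate by a factor of 10 whenever the monitored loss plateaus.
    return torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
```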
GANs are known to be very difficult to train, as training requires finding a Nash
equilibrium of a non-convex game with continuous, high-dimensional parameters
[17]. We followed a set of constraints and techniques proposed by [5,13,17,18] to
encourage convergence of our convolutional GAN and make it stable to train.
– Alternative Cost Function
The heuristic, non-saturating cost function introduced in section 2 is used so that
the generator still receives strong gradients when the discriminator performs well,
and learns to improve its samples rather than the most ideal way of fooling the
discriminator.
– Batch Normalization
Batch normalization [11] is proven to be essential to train both networks, preventing
the generator from collapsing all samples to a single point [13]. Batch-Norm is not
applied on the first layer of the generator and discriminator or on the last layer of
the generator, as suggested by [5].
– All Convolutional Net
Strided convolutions are used instead of spatial pooling functions. This effec-
tively allows the model to learn its own downsampling/upsampling rather
than relying on a fixed downsampling/upsampling method. This idea was
proposed in [20] and has been shown to improve training performance, as the net-
work learns all necessary invariances just with convolutional layers.
– Reduced Momentum
We use the Adam optimizer [15] for training both networks. Recent research has
shown that using a large momentum term β1 (such as the commonly suggested 0.9)
can result in oscillation and instability during training. We followed the suggestion in [13]
to reduce the momentum term to 0.5.
– LeakyReLU Activation Function
Radford et al. [13] showed that using leaky ReLU [12] activation functions in
the discriminator resulted in better performance than using regular ReLUs.
We also found that using leaky ReLU in the encoder part of the generator
as suggested by [5] works slightly better.
4 Experimental Results
To measure the performance, we have chosen to employ mean absolute error
(MAE) and accuracy. MAE is computed by taking the mean of the absolute
error of the generated and source images on a pixel level for each color channel.
Accuracy is measured by the ratio between the number of pixels that have the
same color information as the source and the total number of pixels. Any two
pixels are considered to have the same color if their underlying color channels
lie within some threshold distance ε. This is mathematically represented by
\mathrm{acc}(x, y) = \frac{1}{n} \sum_{p=1}^{n} \prod_{\ell=1}^{3} \mathbb{1}_{[0, \varepsilon_\ell]} \big( | h(x)^{(p,\ell)} - y^{(p,\ell)} | \big) \qquad (10)
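Both metrics are straightforward to compute; a NumPy sketch follows, where `eps` corresponds to the per-channel threshold ε in equation 10 and the images are assumed to be float arrays scaled to [0, 1].

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error over all pixels and color channels."""
    return np.mean(np.abs(pred - target))

def accuracy(pred, target, eps=0.02):
    """Fraction of pixels whose three channels all lie within eps of the ground truth (eq. 10).

    pred, target: float arrays of shape (H, W, 3) scaled to [0, 1];
    eps: threshold distance (e.g. 0.02 and 0.05 for the 2% and 5% columns below).
    """
    within = np.abs(pred - target) <= eps        # indicator 1_[0, eps] per pixel and channel
    return np.mean(np.all(within, axis=-1))      # product over channels, mean over pixels
```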
Dataset    Network  Batch Size  Epochs  MAE  Accuracy (ε = 2%)  Accuracy (ε = 5%)
CIFAR-10   U-Net    128         200     7.9  13.7%              37.2%
CIFAR-10   GAN      128         200     5.1  24.1%              65.5%
Places365  GAN      16          20      7.5  18.3%              47.3%
In the CIFAR-10 results (Appendix A), many car images were colored red. This is
most likely due to the significantly larger number of images with red cars than
images with cars of other colors.
The preliminary results using Places365 (256 × 256) are shown in Appendix
B. We noticed that there were some instances of mis-colorization: regions of
images that have high fluctuations are frequently colored green. This is likely
caused by the large number of grassland images in the training set, thus the
model leans towards green whenever it detects a region with high fluctuations in
pixel intensity values. We also noticed that some colorized images exhibited the
“sepia effect” seen with CIFAR-10 under U-Net. This hue is especially evident
in images with a clear sky, where the color of the sky includes a strange color
gradient between blue and light yellow. We suspect that this was caused by
insufficient training and would be corrected with further training.
5 Conclusion

In this study, we were able to automatically colorize grayscale images using a GAN,
to an acceptable visual degree. With the CIFAR-10 dataset, the model was able
to consistently produce qualitatively better-looking images than U-Net. Many
of the images generated by U-Net had a brownish hue in their results, known as
the “sepia effect,” in L*a*b* color space. This is due to the L2 loss function
that was applied to the baseline CNN, which is known to cause a blurring effect.
We obtained mixed results when colorizing grayscale images using the Places365
dataset. Mis-colorization was a frequent occurrence with images containing high
levels of textured details. This leads us to believe that the model has identified
these regions as grass since many images in the training set contained leaves
or grass in an open field. In addition, this network was not as well-trained as
the CIFAR-10 counterpart due to its significant increase in resolution (256 × 256
versus 32 × 32) and the size of the dataset (1.8 million versus 50,000). We expect
the results will improve if the network is trained further.
We would also need to seek a better quantitative metric to measure performance,
since all evaluations of image quality in our tests were qualitative. Adopting a new
or existing quantitative metric such as the peak signal-to-noise ratio (PSNR) or the
root mean square error (RMSE) would enable a much more robust process of
quantifying performance.
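For reference, both candidate metrics are simple to compute; a NumPy sketch, assuming images scaled to [0, 1]:

```python
import numpy as np

def rmse(pred, target):
    """Root mean square error between two images."""
    return np.sqrt(np.mean((pred - target) ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (max_val is the maximum possible pixel value)."""
    err = np.mean((pred - target) ** 2)
    return float('inf') if err == 0 else 10.0 * np.log10(max_val ** 2 / err)
```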
Source code is publicly available at:
https://github.com/ImagingLab/Colorizing-with-GANs
Acknowledgments This research was supported in part by an NSERC Discovery
Grant for ME. The authors gratefully acknowledge the support of NVIDIA Corporation
for donation of GPUs through its Academic Grant Program.
References
1. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets.
In Advances in neural information processing systems, pages 2672–2680, 2014.
2. Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva.
Places: An image database for deep scene understanding. 2016.
3. Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to
greyscale images. In ACM TOG, volume 21, pages 277–280, 2002.
4. Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In
ACM Transactions on Graphics (TOG), volume 23, pages 689–694. ACM, 2004.
5. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image
translation with conditional adversarial networks. 2016.
6. Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. 2016.
7. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. 2014.
8. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks
for semantic segmentation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3431–3440, 2015.
9. Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, 2006.
10. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional net-
works for biomedical image segmentation. In International Conference on Medical
Image Computing and Computer-Assisted Intervention. Springer, 2015.
11. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep net-
work training by reducing internal covariate shift. In International Conference on
Machine Learning, 2015.
12. Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities
improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
13. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation
learning with deep convolutional generative adversarial networks. 2015.
14. Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
Journal of machine learning research, 15(1):1929–1958, 2014.
15. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
16. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet classification. In Pro-
ceedings of the IEEE international conference on computer vision, 2015.
17. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford,
and Xi Chen. Improved techniques for training gans. In Advances in Neural
Information Processing Systems, pages 2234–2242, 2016.
18. Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sen-
gupta, and Anil A Bharath. Generative adversarial networks: An overview. 2017.
19. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. Rethinking the inception architecture for computer vision. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
20. Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Ried-
miller. Striving for simplicity: The all convolutional net. 2014.
A CIFAR-10 Results
Fig. 3: Colorization results with CIFAR-10. (a) Grayscale. (b) Original Image. (c) Col-
orized with U-Net. (d) Colorized with GAN.
B Places365 Results
Fig. 4: Colorization results with Places365. (a) Grayscale. (b) Original Image. (c) Col-
orized with GAN.