
IMAGE UPSCALING USING UNET

A PROJECT REPORT

Submitted by

VIRUTHIRAN S
17CSR231

VIGNESH S
17CSR227

VIGNESH RAAJA M S
17CSL253

in partial fulfillment of the requirements for the


award of the degree of

BACHELOR OF ENGINEERING IN
COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SCHOOL OF COMMUNICATION AND COMPUTER SCIENCES

KONGU ENGINEERING COLLEGE


(Autonomous)
PERUNDURAI ERODE – 638 060
APRIL 2021

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


KONGU ENGINEERING COLLEGE
(Autonomous)

PERUNDURAI ERODE – 638 060


APRIL 2021

BONAFIDE CERTIFICATE

This is to certify that the Project report entitled IMAGE UPSCALING USING UNET is the

bonafide record of the project work done by S. VIRUTHIRAN (Register No: 17CSR231), S.

VIGNESH (Register No: 17CSR227), M.S. VIGNESH RAAJA (Register No: 17CSL253) in

partial fulfillment of the requirements for the award of the Degree of Bachelor of Engineering

in Computer Science and Engineering of Anna University Chennai during the year 2020 – 2021.

SUPERVISOR HEAD OF THE DEPARTMENT


(Signature with seal)

Date:

Submitted for the end semester viva voce examination held on

INTERNAL EXAMINER EXTERNAL EXAMINER

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
KONGU ENGINEERING COLLEGE
(Autonomous)

PERUNDURAI ERODE – 638 060


APRIL 2021

DECLARATION

We affirm that the Project Report titled IMAGE UPSCALING USING UNET being

submitted in partial fulfillment of the requirements for the award of Bachelor of

Engineering is the original work carried out by us. It has not formed the part of any other

project report or dissertation on the basis of which a degree or award was conferred on an

earlier occasion on this or any other candidate.

VIRUTHIRAN S

(Reg.No: 17CSR231)

VIGNESH S

(Reg. No: 17CSR227)

VIGNESH RAAJA M S

Date: (Reg. No:17CSL253)

I certify that the declaration made by the above candidates is true to the best of my
knowledge.

Date: Name and Signature of the Supervisor with seal



ABSTRACT

Image upscaling (increasing the quality of an image) is performed with the help of U-Net, a deep convolutional neural network built from a series of residual blocks with skip connections.

The problem that deep-learning-based super resolution tries to solve is that traditional, algorithm-based upscaling methods lack fine detail and cannot remove defects and compression artifacts, while for humans who carry out these tasks manually the process is very slow and painstaking.

For image upscaling, machine learning employs simulated neurons, or neural networks, to predict a high-resolution image from a given low-resolution image. To predict upscaled images with high accuracy, a neural network model must be trained on a large number of images. The deployed AI model can then take low-resolution video and produce sharpness and enhanced detail that no traditional scaler can recreate. As an advancement over the existing system, we aim to give more precise and accurate predictions by using U-Net instead of RDN.

ACKNOWLEDGEMENT

Owing deeply to the supreme, we extend our thanks to the Almighty, who has blessed us to come out successfully with our project. We take pleasure in expressing our deep sense of gratitude to our beloved parents.

We express our hearty sincere thanks and gratitude to our beloved Correspondent,

Thiru. P.SACHITHANANDAN for giving the opportunity to pursue this course. We are

extremely thankful with no words of formal nature to the dynamic Principal,

Dr.V.BALUSAMY B.E (Hons), MTech, Ph.D., for providing the necessary facilities to

complete our work. We would like to express our sincere gratitude to our respected Head of

the Department, Dr.N.SHANTHI M.E., Ph.D., for providing necessary facilities.

We express our sincere thanks to our project coordinator, Dr.E.GOTHAI M.E.,

Ph.D., for her constant encouragement for the development of our project.

We take immense pleasure to express our hearty thanks and gratitude to our

supervisor, Dr.N.KRISHNA MOORTHY B.E.,M.E.,Ph.D., for his valuable ideas and

suggestions, which have been very helpful in the project. We are grateful to all the faculty

members of the Computer Science and engineering Department for their support.

TABLE OF CONTENTS
CHAPTER No. TITLE PAGE No.

I ABSTRACT iii

II LIST OF FIGURES vi

III LIST OF ABBREVIATIONS vii

IV LIST OF TABLES vii

1 INTRODUCTION 1

1.1 EXISTING SYSTEM 1

1.2 OBJECTIVE AND SCOPE 2

2 GENERAL DESCRIPTION 4

2.1 PROJECT PERSPECTIVE 4

2.2 USER CHARACTERISTICS 5

2.3 DESIGN AND IMPLEMENTATION CONSTRAINTS 5

3 REQUIREMENTS 7

3.1 HARDWARE REQUIREMENTS 7

3.2 SOFTWARE REQUIREMENTS 7

3.3 SOFTWARE DESCRIPTION 8

3.3.1 COLABORATORY 8

3.3.2 PYTHON 8

4 DETAILED DESIGN 9

4.1 ARCHITECTURES 9

4.1.1 UNET 9

4.1.2 VGG16 12

4.2 PROCEDURE 12

II. LIST OF FIGURES

FIGURE NO. FIGURE NAME PAGE NO.

1.1 A COMPARISON BETWEEN THE INTERPOLATION METHODS 2

1.2 ORIGINAL IMAGE & UNET 3

2.1 UNET ARCHITECTURE 5

4.1 UNET WORKING FLOW 9

4.2 ResBlock WITH ResNet 11

4.3 THE LOSS SURFACE WITH AND WITHOUT THE SKIP CONNECTIONS 11

4.4 VGG16 ARCHITECTURE 12

4.5 UNET WITH ENCODER AND DECODER PATH 13

4.6 DATA AUGMENTATION 14

5.1 LR SAMPLE DATASET FROM ImageNet 16

5.2 LEARNING RATE OF THE MODEL 20

5.3 LOSS WITH TRAINING TIME 20

5.4 OUTPUT 21

5.5 PSNR PATCHES 21



III. LIST OF ABBREVIATIONS

ISR Image Super Resolution

RDN Residual Dense Network

PSNR Peak Signal to Noise Ratio

MSE Mean Squared Error

MAE Mean Absolute Error

IR Image Resolution

HQ High Quality

LQ Low Quality

CAR Compression and Reduction

IDN Image Denoising

ICNR Initialized to Convolution NN Resize

VGG Visual Geometry Group

CIFAR Canadian Institute For Advanced Research

SSIM Structural Similarity Index


IV. LIST OF TABLES

TABLE No TABLE NAME PAGE No

3.1 Hardware Requirements 7

3.2 Software Requirements 7

CHAPTER 1

INTRODUCTION

When you expand (zoom in on) an image, the computer performs bi-linear interpolation to
upscale it, and the result becomes blurred due to over-smoothing (averaging the neighbouring
pixels). The same applies to low quality images captured by surveillance cameras. They are all
uncomfortable to view because they lack fine detail.

In order to avoid the loss of fine detail, we are in need of a new technology which
upscales an image without losing the quality. We implemented a deep convolutional neural
network (U-Net) which is widely used in the area of Image segmentation. It helps us to
upscale the image without losing the quality. We used a subset of the ImageNet dataset and
Oxford-IIIT PETS dataset for training and testing the model. The prediction results are far
better than the classic image upscaling methods with remarkably fine details. Thus, we think
our model serves the need of a good alternative to the Classical Image Upscaling method
which is used in the area of Image Processing and Image Restoration.

1.1. EXISTING SYSTEM

The traditional way of doing this Super Resolution(SR) is Bi-linear Interpolation.


Interpolation can be used to determine a height value of a point by using known neighboring
points. It can be used for estimating the values on a continuous grid based model. Linear
interpolation means we estimate the value using linear polynomials. Bi-linear interpolation
means applying linear interpolation in two directions. It uses the 4 nearest neighbours and
takes their weighted average to produce the output; essentially, it is linear interpolation in both
the X and Y directions (rows and columns, and Z if 3D). The four cell centres of the input
raster closest to the centre of the output cell are weighted by distance and then averaged.

In computer vision and image processing, bi-linear interpolation is used to resample


images and textures. An algorithm is used to map a screen pixel location to a corresponding
point on the texture map. A weighted average of the attributes (color, transparency, etc.) of
the four surrounding textures is computed and applied to the screen pixel. This process is
repeated for each pixel forming the object being textured. When an image needs to be scaled
up, each pixel of the original image needs to be moved in a certain direction based on the
scale constant. However, when scaling up an image by a non-integral scale factor, there are
pixels (i.e., holes) that are not assigned appropriate pixel values. In this case, those holes
should be assigned appropriate RGB or grayscale values so that the output image does not
have non-valued pixels.
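As a minimal illustration of the classical approach described above (a sketch assuming Pillow is installed; file names are hypothetical), bilinear upscaling can be done in a few lines:

from PIL import Image

# Classical bilinear upscaling: each output pixel is a distance-weighted
# average of the 4 nearest input pixels.
img = Image.open('input.jpg')
upscaled = img.resize((img.width * 2, img.height * 2), resample=Image.BILINEAR)
upscaled.save('upscaled_bilinear.jpg')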

FIGURE 1.1: A COMPARISON BETWEEN THE BILINEAR AND BICUBIC INTERPOLATION

1.2. OBJECTIVE AND SCOPE

The classical method for upscaling an Image has been done using Bi-Linear
Interpolation, a method which interpolates a pixel in a 2d image. This bi-linear interpolation
takes 4 nearest neighbors for the current selected pixel and outputs the results based on the
weighted average taken. This traditional bilinear interpolation upscaling method leaves the
scaled image blurry due to over-smoothening of neighbouring pixels i.e. averaging the pixels
of the neighbours. In order to avoid the loss of fine detail, we are in need of a new
technology which upscales an image without losing the quality.

Without good quality images, fields that deal with crime, medicine, etc. are at a
disadvantage when making rational decisions, which can cause bigger problems. Image
upscaling becomes necessary whenever a clear image is needed in situations like these. The
traditional way of doing this is mediocre and inefficient; modern problems need modern
solutions with updated technological advancements. That is where deep convolutional neural
networks come in. With this method, the output can be viewed in more detail.

We implemented a deep convolutional neural network (U-Net) which is widely used


in the area of Image segmentation. It helps us to upscale the image without losing the
quality. We used a subset of the ImageNet dataset and Oxford-IIIT PETS dataset for training
and testing the model. The prediction results are far better than the classic image upscaling
methods with remarkably fine details. Thus, we think our model serves the need of a good
alternative to the Classical Image Upscaling method which is used in the area of Image
Processing.

FIGURE 1.2: ORIGINAL IMAGE & UNET

CHAPTER 2

LITERATURE REVIEW

1. Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Yun Fu (2018): Residual Dense
Network for Image Super-Resolution. Residual learning makes the loss surface smoother than
that of a plain deep neural network with a disordered loss surface. They split the training model
into two phases, namely global feature fusion and local feature fusion. In local feature fusion,
the model learns the local features of an image, which later help to generate the high resolution
image. With global feature fusion, the model learns the features of an image efficiently and
generates a more authentic high resolution image.

2. Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue (2019): This
review presents image super-resolution with deep neural networks, categorised into two groups
according to the major contributions to image super-resolution: experimenting with efficient
deep neural networks for single image super-resolution, and optimising away redundant layers,
i.e. layers that learn nothing beneficial towards the target.

3. Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham,
Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe
Shi (2017): In this SRGAN method, a generative adversarial network (GAN) is used for
image super-resolution (SR). This was the first neural network architecture able to infer 4x
upscaling factors on photo-realistic natural images. A perceptual loss with an L1 error is used to
train the model to generate authentic images. The generator generates the authentic images
while the discriminator estimates the similarity between the generated image and the original
image with the perceptual loss.

4. Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee (2017): This
paper is about the enhanced deep super-resolution network (EDSR), whose performance
exceeds the current state-of-the-art super-resolution methods. The performance of this model
improved significantly because it was optimised by removing redundant layers from the
network, and the method efficiently stabilises the performance of the model while increasing
the number of layers. A new multi-scale deep super-resolution system (MDSR) and training
method is used to generate high resolution images while upscaling them 2x.

5. Ying Tai, Jian Yang, Xiaoming Liu (2017): This paper is about a deep convolutional
neural network model called the Deep Recursive Residual Network (DRRN), with 52
convolutional layers. Usually, a deep neural network's loss diverges more than a shallow
network's loss; this is known as the "vanishing gradient" problem: in theory, having many
layers in a neural network helps to minimise the loss, but not in practice. So, they use residual
blocks to avoid the divergence of the loss while increasing the number of layers in the network.

CHAPTER 3

REQUIREMENTS

3.1. HARDWARE REQUIREMENTS

Table 3.1 Hardware Requirements

Processor Intel/AMD

Processor Speed 2.25 GHz

Hard Disk 512 GB

RAM 16 GB

3.2. SOFTWARE REQUIREMENTS

Table 3.2 Software Requirements

Programming language Python 3

Software Colaboratory

Operating system(OS) Any OS



3.3.SOFTWARE DESCRIPTION

3.3.1. Colaboratory

Colaboratory is a Google research project created to help disseminate machine


learning education and research. It's a Jupyter notebook environment that requires no setup
to use and runs entirely in the cloud. It provides notebooks for several of our models that
allow you to interact with them on a hosted Google Cloud instance for free.

3.3.2. Python

Python is an interpreted, object-oriented, high-level programming language with


dynamic semantics. Its high-level built in data structures, combined with dynamic typing
and dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance.

Python supports modules and packages, which encourages program modularity and
code reuse. The Python interpreter and the extensive standard library are available in source
or binary form without charge for all major platforms, and can be freely distributed.

CHAPTER 4

DETAILED DESIGN

The goal of this project is to upscale and improve the quality of low resolution
images. The project was developed with UNET, a model originally designed for bio-medical
segmentation, used here for single image super-resolution (ISR), together with a multi-output
version of the VGG16 network for deep feature extraction used in the perceptual loss.

4.1. ARCHITECTURES

4.1.1. UNET

FIGURE 4.1: UNET WORKING FLOW


A U-Net is a convolutional neural network architecture that was developed for
biomedical image segmentation. U-Nets have been found to be very effective for tasks
where the output is of similar size as the input and the output needs that amount of spatial
resolution. This makes them very good for creating segmentation masks and for image
processing/generation such as super resolution.

When convolutional neural nets are used with images for classification, the image is
downsampled into one or more classifications using a series of stride-two convolutions,
reducing the grid size each time.

To be able to output a generated image of the same size as the input, or larger, there
needs to be an upsampling path to increase the grid size. This makes the network layout
resemble a U shape: in a U-Net the downsampling/encoder path forms the left-hand side of the
U and the upsampling/decoder path forms the right-hand side.

In the upsampling/decoder path, several transposed convolutions accomplish this,
each adding pixels between and around the existing pixels; essentially the reverse of the
downsampling path is carried out. The options for the upsampling algorithms are discussed
further on.

This UNET model contains a workflow (shown in figure 4.1) of the following things,

 A U-Net architecture with cross connections similar to a DenseNet

 A ResNet-34 based encoder and a decoder based on ResNet-34

 Pixel Shuffle upscaling with ICNR initialization

 Transfer learning from pretrained ImageNet models

 A loss function based on activations from a VGG-16 model

 Discriminative learning rates and progressive resizing
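As a rough illustration of one of these components, the following PyTorch sketch (not the exact project code; class and function names are our own) shows pixel shuffle upscaling with ICNR initialization, the upsampling step used in the decoder path:

import torch
import torch.nn as nn

def icnr_(weight, scale=2, init=nn.init.kaiming_normal_):
    # ICNR: initialise the conv feeding PixelShuffle so that, right after
    # initialisation, the upscaled output is free of checkerboard artefacts.
    out_ch, in_ch, kh, kw = weight.shape
    sub = torch.zeros(out_ch // (scale ** 2), in_ch, kh, kw)
    init(sub)
    sub = sub.repeat_interleave(scale ** 2, dim=0)  # copy each filter scale^2 times
    with torch.no_grad():
        weight.copy_(sub)

class PixelShuffleICNR(nn.Module):
    # One decoder upsampling step: conv -> PixelShuffle(scale) -> ReLU
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=3, padding=1)
        icnr_(self.conv.weight, scale)
        self.shuffle = nn.PixelShuffle(scale)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shuffle(self.conv(x)))

# Example: a 64-channel 32x32 feature map becomes a 32-channel 64x64 map.
x = torch.randn(1, 64, 32, 32)
print(PixelShuffleICNR(64, 32)(x).shape)  # torch.Size([1, 32, 64, 64])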

FIGURE 4.2: ResBlock WITHIN ResNet

H(x) := F(x) + x

Here, the main advantage of adding skip connections is that if any layer hurts the
performance of the architecture, it can effectively be skipped through regularization. As a
result, very deep neural networks can be trained without the problems caused by
vanishing/exploding gradients. This was experimented with on networks of 100-1000 layers on
the CIFAR-10 dataset. Similar approaches exist elsewhere, but they could not provide better
accuracy than the UNET architecture.
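A minimal PyTorch sketch of such a residual block (our own illustrative code, not the exact project implementation) makes the identity H(x) = F(x) + x concrete:

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Residual block implementing H(x) = F(x) + x: the skip connection adds the
    # input back onto the output of two convolutions.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # F(x) + x

print(ResBlock(16)(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])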

FIGURE 4.3: THE LOSS SURFACE WITH AND WITHOUT THE SKIP CONNECTIONS

4.1.2. VGG16

One caveat of this generative model is that it fails to reproduce valuable features
like eyes, hair, grass, etc. in the image. This information is learned by a pre-trained model
named VGG-16 (Visual Geometry Group), a 16-layer CNN architecture trained to classify
images based on their features.

FIGURE 4.4: VGG16 ARCHITECTURE

In this model, VGG-16 is implemented to estimate the similarity between the
generated image and the original image (ground truth). VGG-16 features together with the
MSE pixel loss aid the generator in producing more authentic, higher quality images in terms
of perceptual loss.

4.2. PROCEDURE

4.2.1. Data Preprocessing

ImageNet is a vast dataset of human-annotated images, widely used for computer
vision research. There are more than 14 million images in the ImageNet dataset, categorised
into sub-categories called "synsets" (synonym sets), each containing on the order of a thousand
images. The dataset can be downloaded via this link (http://image-net.org/download-imageurls).

4.2.2. Data Preparation

Each image is downscaled by compression for detail analysis and stored alongside a copy
of the original. After that, the dataset is separated into training and validation sets, as sketched below.
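A minimal Python sketch of this step (assuming Pillow; directory names are hypothetical, not the project's actual paths):

import random
from pathlib import Path
from PIL import Image

# Make a low-quality, downscaled copy of every image, then split into train/valid.
src, dst = Path('images/original'), Path('images/low_res')
dst.mkdir(parents=True, exist_ok=True)
files = sorted(src.glob('*.jpg'))
for fn in files:
    img = Image.open(fn).convert('RGB')
    img = img.resize((img.width // 2, img.height // 2), resample=Image.BILINEAR)
    img.save(dst / fn.name, quality=50)   # JPEG compression adds the artefacts to learn from

random.seed(42)
random.shuffle(files)
n_valid = int(0.1 * len(files))
valid_files, train_files = files[:n_valid], files[n_valid:]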

4.2.3. Analysis and Model

UNet is a deep convolutional network with skip connections from each


downsampling layer to its corresponding upsampling layer. It is a modified architecture of
ResNet which has an identity shortcut function between two layers.

FIGURE 4.5: UNET WITH ENCODER AND DECODER PATH

A low-res input image goes through a series of convolutions i.e. the encoder path
which encodes the useful features for the reconstruction of low-resolution image via skip
connections through the corresponding upsampling layer. The flattened encoded image now
goes through a series of deconvolutions i.e. the decoder path which upsamples the image
with the help of features obtained from the corresponding downsampling layer. This series
of convolutions and deconvolutions of the architecture resembles the letter ‘U’, thus it is
named as UNet.

When using a deep neural network, the loss tends to diverge more than with a shallow
network. This is resolved by using ResBlocks, which place an identity shortcut function
between every two layers to help minimise the loss; the technique is widely adopted in the
machine learning community to train very deep neural nets without worrying about vanishing
gradients. U-Net uses the same idea in its skip connections to help the loss converge faster, as
in the sketch below.
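The following toy PyTorch model (our own illustrative sketch, not the project's actual network) shows how each decoder stage receives the matching encoder activation through a skip connection:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # Minimal U-shape: skip connections concatenate encoder features into the decoder.
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)    # 1/2 resolution
        self.enc2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)   # 1/4 resolution
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)      # back to 1/2
        self.dec1 = nn.Conv2d(16 + 16, 16, 3, padding=1)        # skip from enc1
        self.up2 = nn.ConvTranspose2d(16, 8, 2, stride=2)       # back to full size
        self.out = nn.Conv2d(8 + 3, 3, 3, padding=1)            # skip from the input

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d1 = torch.relu(self.dec1(torch.cat([self.up1(e2), e1], dim=1)))
        d2 = self.up2(d1)
        return self.out(torch.cat([d2, x], dim=1))

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])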

4.2.4. Training and Testing

Training begins with downloading a subset of the ImageNet dataset and compressing
each image to 64 x 64 resolution at 50% quality using bi-linear interpolation. The
low-resolution images are placed in a folder called "Train" and the original high-resolution
images in another folder called "Valid".

Data augmentation during training is a very helpful process that allows the model to
learn and generalise in many respects. The augmentation is done in many ways, including
compressing the quality of the image, flipping it horizontally and vertically, adjusting its
lighting and perspective, and adding noise.
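With the fastai v1 library used in this project, such an augmentation pipeline can be declared roughly like this (parameter values here are illustrative, not the exact ones used):

from fastai.vision import get_transforms

# Flips, rotation, zoom, lighting and perspective warping applied on the fly during training.
tfms = get_transforms(do_flip=True, flip_vert=True, max_rotate=10.,
                      max_zoom=2., max_lighting=0.3, max_warp=0.2)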

FIGURE 4.6: DATA AUGMENTATION

Then, the training images are loaded into a data loader with 128 images per batch.
Each input low-resolution image goes through a series of stride-two convolutions, which halve
the image size at each convolution, and the result is flattened into a one-dimensional vector of
pixels at the end of the encoder path.

Upscaling starts with the flattened 1D vector of pixels and performs de-convolutions,
which pad extra pixels between and around every existing pixel. This process is repeated until
the output reaches the size of the ground truth. The features also gradually improve under the
perceptual loss provided by VGG-16 together with the MSE pixel loss.

We observe the generative loss converges to a quite reasonable extent after 30 epochs
of training. Now, our model is all set to evaluate its inference. Testing is done with another
subset of ImageNet dataset with images compressed and reduced to low quality.

4.2.5. Perceptual loss function

The previous approach to noise cancelling proved not to be working as we hoped: in


order to minimize the pixel-wise MSE loss, the network indiscriminately smoothes the
output image to avoid sharp artefacts, which would ultimately increase the average error.
This has the drawback of diminishing the sharpness of the textures and patterns that we
would want to instead enhance, or even erase them completely.

What we really want is the network to learn patterns idiosyncratic to the noise and
replace the corresponding portions with a realistic approximation of the original image,
hence preserving the precious details and delivering an image that is both clean of noise and
with a good perceptual quality.

A common effect of using a pixel-wise MSE loss function for image super resolution
is a smooth output: minimizing the MSE for each pixel leads to "average good results",
which, however, are not correlated with perceptually pleasing results, i.e. the crisp details that
we are looking for. The idea is then to use a loss function that more heavily penalizes
reconstruction errors committed on perceptually relevant elements of the image, such as
edges, lines, shapes, textures and so on.
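A compact PyTorch sketch of such a perceptual (feature) loss, assuming torchvision's pre-trained VGG-16 (the project's actual loss, listed in Appendix 1, also adds Gram-matrix style terms):

import torch
import torch.nn.functional as F
from torchvision.models import vgg16_bn

# Compare VGG-16 activations of the prediction and the target, not just raw pixels.
vgg = vgg16_bn(pretrained=True).features[:23].eval()   # truncated at a mid-level block
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target, pixel_weight=1.0, feat_weight=1.0):
    pixel = F.l1_loss(pred, target)              # keeps colours / low frequencies right
    feats = F.l1_loss(vgg(pred), vgg(target))    # penalises missing textures and edges
    return pixel_weight * pixel + feat_weight * feats

x, y = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(perceptual_loss(x, y).item())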

CHAPTER 5

RESULTS AND DISCUSSION



5.1. DATASET DESCRIPTION

The Oxford-IIIT Pet Dataset contains 37 pet categories with roughly 200 images for
each class. The data contain large variations in scale, pose and lighting. All the images in the
set have associated ground-truth annotations of various kinds, such as breed, head ROI, and
pixel-level trimap segmentation.

The ImageNet training data contains 1.2 million images in a thousand categories, which
keeps the implementation simple. These images are used only for training, not for validating
the model; random subsets of over 50,000 images are picked for that. An overall view of the
data:

1. Filled synsets: 21,841
2. Overall images: 14,197,122
3. Images with bounding box annotations: 1,034,908
4. Synsets with SIFT features: 1,000
5. Images with SIFT features: 1.2 million

5.1.1 SAMPLE DATASET



FIGURE 5.1: LR SAMPLE DATASET FROM ImageNet

5.2. METRICS AND ACCURACY

Among the image quality assessment metrics, MSE (Mean Squared Error) and PSNR
(Peak Signal to Noise Ratio) are the most widely used due to their simplicity. MSE is defined
as the average of the pixel-wise squared differences:

MSE = (1 / mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [ I(i, j) − K(i, j) ]²

Where I and K are the baseline image and the distorted image respectively. Based on
MSE, PSNR (Peak signal to noise ratio) is defined as

PSNR = 20 · log10(MAX_I) − 10 · log10(MSE)

where MAX_I is the maximum possible signal value (255 is used in image evaluation
applications). However, MSE and PSNR do not perform as well when it comes to detailed
texture. Because these metrics take a regression-to-the-mean approach, multiple potential
solutions with rich texture detail are averaged into a smooth reconstruction, which causes an
over-smoothing problem in textured regions.
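A small NumPy sketch of these two metrics, following the formulas above (our own illustrative code):

import numpy as np

def mse(I, K):
    # Average pixel-wise squared difference between reference I and distorted K.
    return np.mean((I.astype(np.float64) - K.astype(np.float64)) ** 2)

def psnr(I, K, max_i=255.0):
    m = mse(I, K)
    return float('inf') if m == 0 else 20 * np.log10(max_i) - 10 * np.log10(m)

ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
dist = np.clip(ref + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(mse(ref, dist), psnr(ref, dist))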

Algorithms based on PSNR or MSE as their sole loss function always face the
problem of poor perceptual quality. While error-sensitivity-based image quality assessment
metrics can reflect the pixel-wise difference of an image compared to the non-distorted
image, it is not clear that these metrics can reflect the fidelity of an image. The reason is that
natural images are highly structured: there exist strong dependencies within a local region
and among regions, reflecting the structure of objects. A pixel-wise difference is unable to
capture this type of structural information.

Following the above philosophy, in order to evaluate image degradation according to
changes in structural information, the Structural Similarity Index (SSIM) was designed. It
evaluates three independent components of an image: luminance, contrast and structure. The
luminance component l(x, y) is measured from the mean intensity of each signal. The contrast
component c(x, y) is estimated by removing the mean intensity from the signal and taking the
standard deviation (the square root of the variance). Finally, the structure component s(x, y) is
measured on the signals normalised by their own standard deviations.

All three components are combined to yield an overall similarity measure:

SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ

When α = β = γ = 1, the SSIM index becomes

SSIM(x, y) = (2 μx μy + C1)(2 σxy + C2) / ((μx² + μy² + C1)(σx² + σy² + C2))


Refer Appendix 2 for accuracy results.

5.3.RESULT

The trained model, with its skip connections and the VGG16 feature extractor, gave
results far better than the classic models.

FIGURE 5.2: LEARNING RATE OF THE MODEL

Figure 5.2 shows the learning rate of the model plotted against the loss, and it can be
interpreted that the region of low loss indicates the learning rate that best drives the model's learning.

FIGURE 5.3: LOSS WITH TRAINING TIME


FIGURE 5.4: OUTPUT

Figure 5.4 contains three images side by side: the first (left) is the low-res input image,
the middle one is the predicted output image, and the final one on the right is the original
ground truth.

5.4.DISCUSSIONS

To understand where our model generalizes well and where it does not, we extracted
patches that have high PSNR values and patches that have a low value of the metric from the
validation set.

FIGURE 5.5: PSNR PATCHES

Unsurprisingly, the best performing patches are the ones with large flat areas, while
more complex patterns are harder to reproduce accurately. We might then want to focus on
these areas for both training and evaluation of the results. The same is also highlighted by a
heat map representing the error between the original HR image and the SR output of the
network: darker colors correspond to higher pixel-wise mean squared error, while lighter
colors correspond to lower error, or better results.

We can see how areas with more patterns correspond to higher errors but also
intuitively “simpler” transition areas (cloud-sky for instance) are fairly dark.

CHAPTER 6

CONCLUSION AND FUTURE WORK

With the advancement of AI technology, classic upscaling can no longer solve the
problem adequately, and its results are not reliable enough to satisfy users of the traditional
algorithms. This has led to the current situation, where upscaling is critical in many fields.
Thus, to avoid the problems of the classic solutions and to keep pace with these advancements,
deep convolutional neural networks have become essential to solve this issue and to automate
the painstaking manual work of restoring an image by hand.

It is only a matter of time before super resolution reaches every field. The U-Net based
model, initially developed for biomedical segmentation, is well suited to what needs to be done
here. Earlier models achieved only lower prediction quality, and over the past decade U-Net
has become one of the best models for this super resolution purpose.

Our future work involves extending the model from super resolution to repairing
images that are distorted, noisy, overlapped, damaged, and so on. Neural networks are
becoming powerful enough to achieve results far beyond what earlier methods could, and our
future work will build on these advances.

APPENDIX 1

import fastai
from fastai.vision import *
from fastai.callbacks import *
from fastai.utils.mem import *
from torchvision.models import vgg16_bn
 
path = untar_data(URLs.PETS)
path_hr = path/'images'
path_lr = path/'small-96'
path_mr = path/'small-256'

il = ImageList.from_folder(path_hr)
 
def resize_one(fn, i, path, size):
    dest = path/fn.relative_to(path_hr)
    dest.parent.mkdir(parents=True, exist_ok=True)
    img = PIL.Image.open(fn)
    targ_sz = resize_to(img, size, use_min=True)
    img = img.resize(targ_sz, resample=PIL.Image.BILINEAR).convert('RGB')
    img.save(dest, quality=60)
 
sets = [(path_lr, 96), (path_mr, 256)]
for p,size in sets:
    if not p.exists(): 
        print(f"resizing to {size} into {p}")
        parallel(partial(resize_one, path=p, size=size), il.items)
 
bs,size=32,128
arch = models.resnet34
src = ImageImageList.from_folder(path_lr).split_by_rand_pct(0.1, seed=42)
 
def get_data(bs,size):
    data = (src.label_from_func(lambda x: path_hr/x.name)
           .transform(get_transforms(max_zoom=2.), size=size, tfm_y=True)
           .databunch(bs=bs).normalize(imagenet_stats, do_y=True))
 
    data.c = 3
    return data
 
data = get_data(bs,size)
data.show_batch(ds_type=DatasetType.Valid, rows=2, figsize=(9,9))
t = data.valid_ds[0][1].data
t = torch.stack([t,t])
 
def gram_matrix(x):
    n,c,h,w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1,2))/(c*h*w)
 

gram_matrix(t)
base_loss = F.l1_loss
vgg_m = vgg16_bn(True).features.cuda().eval()
requires_grad(vgg_m, False)
blocks = [i-1 for i,o in enumerate(children(vgg_m)) if isinstance(o,nn.MaxPool2d)]
blocks, [vgg_m[i] for i in blocks]
 
class FeatureLoss(nn.Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = ['pixel',] + [f'feat_{i}' for i in range(len(layer_ids))
              ] + [f'gram_{i}' for i in range(len(layer_ids))]
 
    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]
    
    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input,target)]
        self.feat_losses += [base_loss(f_in, f_out)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out))*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))
        return sum(self.feat_losses)
    
    def __del__(self): self.hooks.remove()
 
feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5,15,2])
 
 
wd = 1e-3
learn = unet_learner(data, arch, wd=wd, loss_func=feat_loss, callback_fns=LossMetrics,
                     blur=True, norm_type=NormType.Weight)
gc.collect();
learn.lr_find()
learn.recorder.plot()
lr = 1e-3
 
def do_fit(save_name, lrs=slice(lr), pct_start=0.9):
    learn.fit_one_cycle(10, lrs, pct_start=pct_start)
    learn.save(save_name)
    learn.show_results(rows=1, imgsize=5)
 
do_fit('1a', slice(lr*10))
learn.unfreeze()
do_fit('1b', slice(1e-5,lr))
data = get_data(12,size*2)
learn.data = data
learn.freeze()
gc.collect()
learn.load('1b');
do_fit('2a')
learn.unfreeze()
do_fit('2b', slice(1e-6,1e-4), pct_start=0.3)
 
 
learn = None
gc.collect();
free = gpu_mem_get_free_no_cache()
# the max size of the test image depends on the available GPU RAM 
if free > 8000: size=(1280, 1600) # >  8GB RAM
else:           size=( 820, 1024) # <= 8GB RAM
print(f"using size={size}, have {free}MB of GPU RAM free")
data_mr = (ImageImageList.from_folder(path_mr).split_by_rand_pct(0.1, seed=42)
          .label_from_func(lambda x: path_hr/x.name)
          .transform(get_transforms(), size=size, tfm_y=True)
          .databunch(bs=1).normalize(imagenet_stats, do_y=True))
data_mr.c = 3
 
learn = unet_learner(data, arch, loss_func=F.l1_loss, blur=True, norm_type=NormType.Weight)
 

learn.load('2b');
learn.data = data_mr
fn = data_mr.valid_ds.x.items[0]; fn
img = open_image(fn); img.shape
p,img_hr,b = learn.predict(img)
show_image(img, figsize=(18,15), interpolation='nearest');
Image(img_hr).show(figsize=(18,15))

epoch  train_loss  valid_loss  pixel     feat_0    feat_1    feat_2    gram_0    gram_1    gram_2
1      2.061799    2.078714    0.167578  0.257674  0.282523  0.147208  0.330824  0.539797  0.353109
2      2.063589    2.077507    0.167022  0.257501  0.282275  0.146879  0.331494  0.539560  0.352776
3      2.057191    2.074605    0.167656  0.257041  0.282204  0.146925  0.330117  0.538417  0.352247
4      2.050781    2.073395    0.166610  0.256625  0.281680  0.146585  0.331580  0.538651  0.351665
5      2.054705    2.068747    0.167527  0.257295  0.281612  0.146392  0.327932  0.536814  0.351174
6      2.052745    2.067573    0.167166  0.256741  0.281354  0.146101  0.328510  0.537147  0.350554
7      2.051863    2.067076    0.167222  0.257276  0.281607  0.146188  0.327575  0.536701  0.350506
8      2.046788    2.064326    0.167110  0.257002  0.281313  0.146055  0.326947  0.535760  0.350139
9      2.054460    2.065581    0.167222  0.257077  0.281246  0.146016  0.327586  0.536377  0.350057
10     2.052605    2.064459    0.166879  0.256835  0.281252  0.146135  0.327505  0.535734  0.350118

APPENDIX 2

OXFORD-IIIT PETS and ImageNet Dataset

import fastai
from fastai.vision import *
from fastai.callbacks import *
from fastai.utils.mem import *
from torchvision.models import vgg16_bn
 
torch.cuda.set_device(0)
path = Path('data/imagenet')
path_hr = path/'train'
path_lr = path/'small-64/train'
path_mr = path/'small-256/train'
 
path_pets = untar_data(URLs.PETS)
 
il = ImageList.from_folder(path_hr)
def resize_one(fn, i, path, size):
    dest = path/fn.relative_to(path_hr)
    dest.parent.mkdir(parents=True, exist_ok=True)
    img = PIL.Image.open(fn)
    targ_sz = resize_to(img, size, use_min=True)
    img = img.resize(targ_sz, resample=PIL.Image.BILINEAR).convert('RGB')
    img.save(dest, quality=60)
 
assert path.exists(), f"need imagenet dataset @ {path}"
# create smaller image sets the first time this nb is run
sets = [(path_lr, 64), (path_mr, 256)]
for p,size in sets:
    if not p.exists(): 
        print(f"resizing to {size} into {p}")
        parallel(partial(resize_one, path=p, size=size), il.items)
 
free = gpu_mem_get_free_no_cache()
# the max size of the test image depends on the available GPU RAM 
if free > 8200: bs,size=16,256  
else:           bs,size=8,256
print(f"using bs={bs}, size={size}, have {free}MB of GPU RAM free")
 
arch = models.resnet34
# sample = 0.1
sample = False
 
tfms = get_transforms()
 
src = ImageImageList.from_folder(path_lr)
if sample: src = src.filter_by_rand(sample, seed=42)
src = src.split_by_rand_pct(0.1, seed=42)
 
def get_data(bs,size):
    data = (src.label_from_func(lambda x: path_hr/x.relative_to(path_lr))
           .transform(get_transforms(max_zoom=2.), size=size, tfm_y=True)
           .databunch(bs=bs).normalize(imagenet_stats, do_y=True))
 
    data.c = 3
    return data
 
data = get_data(bs,size)
 
def gram_matrix(x):
    n,c,h,w = x.size()
    x = x.view(n, c, -1)
    return (x @ x.transpose(1,2))/(c*h*w)
 
vgg_m = vgg16_bn(True).features.cuda().eval()
requires_grad(vgg_m, False)

blocks = [i-1 for i,o in enumerate(children(vgg_m)) if isinstance(o,nn.MaxPool2d)]


 
base_loss = F.l1_loss
 
class FeatureLoss(nn.Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)
        self.wgts = layer_wgts
        self.metric_names = ['pixel',] + [f'feat_{i}' for i in range(len(layer_ids))
              ] + [f'gram_{i}' for i in range(len(layer_ids))]
 
    def make_features(self, x, clone=False):
        self.m_feat(x)
        return [(o.clone() if clone else o) for o in self.hooks.stored]
    
    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True)
        in_feat = self.make_features(input)
        self.feat_losses = [base_loss(input,target)]
        self.feat_losses += [base_loss(f_in, f_out)*w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.feat_losses += [base_loss(gram_matrix(f_in), gram_matrix(f_out))*w**2 * 5e3
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))
        return sum(self.feat_losses)
    
    def __del__(self): self.hooks.remove()
feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5,15,2])
 
wd = 1e-3
learn = unet_learner(data, arch, wd=wd, loss_func=feat_loss, callback_fns=LossMetrics, blur=True,
norm_type=NormType.Weight)
gc.collect();
learn.unfreeze()
learn.load((path_pets/'small-96'/'models'/'2b').absolute());
learn.fit_one_cycle(1, slice(1e-6,1e-4))

learn.save('imagenet')
learn.show_results(rows=3, imgsize=5)
learn.recorder.plot_losses()
 
 
_=learn.load('imagenet')
data_mr = (ImageImageList.from_folder(path_mr).split_by_rand_pct(0.1, seed=42)
          .label_from_func(lambda x: path_hr/x.relative_to(path_mr))
          .transform(get_transforms(), size=(820,1024), tfm_y=True)
          .databunch(bs=2).normalize(imagenet_stats, do_y=True))
 
learn.data = data_mr
fn = path_pets/'other'/'dropout.jpg'
img = open_image(fn); img.shape
_,img_hr,b = learn.predict(img)
show_image(img, figsize=(18,15), interpolation='nearest');
Image(img_hr).show(figsize=(18,15))

REFERENCE

[1] Andreas Lugmayr, Martin Danelljan, Radu Timofte. NTIRE 2020 Challenge on Real-
World Image Super-Resolution: Methods and Results. 2020 CVPRW

[2] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee. Enhanced Deep
Residual Networks for Single Image Super-Resolution. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition Workshops, 2 - 8

[3] Gwantae Kim, Kanghyu Lee, Junyeop Lee, Jeongki Min,Bokyeung Lee, Jaihyun Park,
David K. Han, and Hanseok Ko. Unsupervised real-world super resolution with cycle
generative adversarial network and domain discriminator. In CVPR Workshops, 2020

[4] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein. Visualizing the Loss
Landscape of Neural Nets. arXiv:1712.09913v3 [cs.LG] 7 Nov 2018

[5] He Zhang, Vishwanath Sindagi, Vishal M. Patel. Image De-raining Using a Conditional
Generative Adversarial Network. In IEEE Transactions on Circuits and Systems for Video
Technology. Volume: 30, Nov. 2020

[6] Justin Johnson, Alexandre Alahi, Li Fei-Fei. Perceptual Losses for Real-Time Style
Transfer and Super-Resolution. In European Conference on Computer Vision 2016.

[7] Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional


Networks. In 2013 European Conference on Computer Vision

[8] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-
based single image super resolution. In CVPR, pages 1865–1873. IEEE Computer Society,
2016.

[9] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep
learning. arXiv:1603.07285v2 [stat.ML] 11 Jan 2018.

[10] Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, Qingmin
Liao. Deep Learning for Single Image Super-Resolution: A Brief Review. IEEE
Transactions on Multimedia. Volume: 21, Dec. 2019

[11] Xintao Wang, Ke Yu, Chao Dong, Chen Change Loy. Recovering Realistic Texture in
Image Super-resolution by Deep Spatial Feature Transform. arXiv:1804.02815v1 [cs.CV] 9
Apr 2018.

[12] Ying Tai, Jian Yang and Xiaoming Liu. Image Super-Resolution via Deep Recursive
Residual Network. In CVPR 2017

[13] Yiwen Huang and Ming Qin. Densely connected high order residual network for single
frame image super resolution. arXiv preprint arXiv:1804.05902, 2018.
