IMAGE TO IMAGE TRANSLATION
USING CGAN
A Theme Based Project Report submitted in partial fulfilment of the academic requirement
for the award of the degree of
BACHELOR OF ENGINEERING In
ELECTRONICS AND COMMUNICATION
ENGINEERING
By
1602-21-735-015 Y Hemantha Jawahar
1602-21-735-063 G. Yogendra
1602-21-735-052 [Link] Sai
Under the guidance of
Mr. [Link] Mahesh Babu
Associate Professor, ECE
Department of Electronics and Communication Engineering
Vasavi College of Engineering (Autonomous)
ACCREDITED BY NAAC WITH 'A++' GRADE
IBRAHIMBAGH, HYDERABAD-500031
2021-2025
Department of Electronics and Communication Engineering Vasavi College of Engineering (Autonomous)
ACCREDITED BY NAAC WITH 'A++' GRADE
IBRAHIMBAGH, HYDERABAD-500031
CERTIFICATE
This is to certify that the theme-based project work titled
IMAGE TO IMAGE TRANSLATION USING CGAN
submitted by
1602-21-735-015 Y Hemantha Jawahar
1602-21-735-063 G. Yogendra
1602-21-735-052 [Link] Sai
students of the Electronics and Communication Engineering Department, Vasavi College of Engineering in
partial fulfilment of the requirement for the award of the degree of Bachelor of Engineering in Electronics
and Communication Engineering is a record of the bonafide work carried out by them during the academic
year 2024-2025. The result embodied in this theme-based project report has not been submitted to any
other university or institute for the award of any degree.
Internal Guide: Mr. [Link] Mahesh Babu, Associate Professor, E.C.E Department
Head of the Department: [Link] Rao, Professor & HoD, E.C.E Department
DECLARATION
This is to state that the work presented in this theme-based project report titled "IMAGE
TO IMAGE TRANSLATION USING CGAN" is a record of work done by us in the Department of
Electronics and Communication Engineering, Vasavi College of Engineering, Hyderabad. No
part of the report is copied from books, journals, or the internet, and wherever material is
taken, it has been duly cited in the text. The report is based on project work done entirely
by us and not copied from any other source. We hereby declare that the matter embodied
in this report has not been submitted by us, in full or in part, for the award of any degree
or diploma of any other institution or university.
Signature of the students
1602-21-735-015 Y Hemantha Jawahar
1602-21-735-063 [Link]
1602-21-735-052 [Link] Sai
CONTENTS
1. Aim and Objectives
2. Introduction and Applications
3. Abstract & Block Diagram
4. Methodology
5. Specifications
6. Results
7. Conclusion
8. Future Scope
9. References
AIM:
To develop and evaluate a framework for image-to-image translation using
Conditional Generative Adversarial Networks (cGANs), focused on generating
high-quality, realistic target images from source images under specified
conditions, capable of automatically transforming images from one domain to
another with high accuracy and realism.
OBJECTIVE:
To create an efficient conditional GAN (cGAN) model for high-quality image-to-
image translation. This includes transforming 2D images into 3D representations
and converting block-based building facades into realistic, detailed facades.
The project focuses on optimizing the model's design, training methods, and
parameters to produce accurate and visually appealing results while preserving
key features of the original images.
Fig. 1. Architecture of Conditional GANs
INTRODUCTION :
Generative Adversarial Networks (GANs) have emerged as a powerful
framework for both supervised and unsupervised learning, capable of
generating high-quality synthetic data. Generative Adversarial Networks
(GANs) are a class of machine learning frameworks that consist of two
neural networks, the generator and the discriminator, which are trained
simultaneously in a competitive setting. The generator’s goal is to produce
synthetic data samples, such as images, text, or other data types, while the
discriminator's role is to distinguish between real and fake data. The
generator creates new, artificial samples based on patterns learned from
the training data, and the discriminator evaluates these generated
samples by comparing them to real data, aiming to classify them as either
authentic or generated. As training progresses, the generator improves its
ability to produce increasingly realistic data, while the discriminator
refines its capacity to detect subtle differences between real and
generated samples. This adversarial process continues until the generated
data becomes so convincing that even the discriminator struggles to tell
real from fake.
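The adversarial game described above is commonly formalized, following Goodfellow et al. (2014), as a two-player minimax objective. This equation is standard background rather than a result of this report:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator D maximizes V while the generator G minimizes it; at equilibrium the generated distribution matches the data distribution and D outputs 1/2 everywhere.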
APPLICATION:
The proposed conditional GAN (cGAN)-based framework has a wide range
of practical applications. In architectural design, it can transform block
diagrams or facade blueprints into detailed, realistic visualizations, aiding
in planning and presentations. For 3D modeling and animation, it enables
the conversion of 2D images into 3D representations, useful in gaming,
simulations, and virtual reality. It can also enhance data augmentation by
generating synthetic yet realistic data for training machine learning
models, especially in fields like medical imaging and autonomous driving.
Additionally, the framework supports creative design by allowing artists
and designers to produce lifelike transformations of conceptual sketches,
and in urban planning it can help visualize proposed developments from
schematic layouts.
ABSTRACT:
Generative Adversarial Networks (GANs) have significantly advanced the
field of generative models, especially in image-to-image translation. This
project focuses on utilizing GANs to perform image-to-image processing,
where one visual domain is transformed into another. GANs have been
effectively applied to tasks such as season-to-season translation, altering
the time of day in images, and synthesizing photorealistic depictions of
objects, scenes, and people that are nearly indistinguishable from genuine
photos. This project seeks to harness the full potential of GANs to generate
high-fidelity, realistic images that closely mimic human visual perception.
The study further investigates the impact of hyperparameter tuning,
including activation functions, optimizers, batch sizes, and stride sizes, on
the performance of the cGAN. Extensive experiments on façade datasets
demonstrate that using combinations like Leaky ReLU and Adam optimizer
significantly enhances the quality of the generated images.
BLOCK DIAGRAM:
Fig. 2. Block Diagram of Conditional GAN
METHODOLOGY:
Conditional GAN Framework
The Conditional GAN framework maps source images to target images
based on specific conditions applied to the input. The model ensures that
domain-independent attributes (such as edges) remain intact, while
domain-specific attributes (such as color or style) are transformed.
Data Collection and Preprocessing
Dataset: We collected paired datasets of input and output images for each use
case (e.g., block facades paired with real facade images, aerial map images
paired with Google Maps images). We sourced publicly available datasets from
repositories such as the Cityscapes Dataset for facade translation and
DeepGlobe Land Cover for aerial maps.
Preprocessing: Input images were resized to 256×256 resolution to match the
training requirements of Pix2Pix, with normalization applied to scale pixel
values between -1 and 1. Data augmentation techniques such as random cropping,
rotation, and horizontal flipping were applied to increase dataset diversity
and improve generalization.
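The normalization and augmentation steps above can be sketched as follows. This is an illustrative NumPy version (the actual pipeline presumably uses TensorFlow ops); `normalize`, `denormalize`, and `random_flip` are hypothetical helper names, not part of the project code:

```python
import numpy as np

def normalize(img):
    # Scale uint8 pixel values from [0, 255] to [-1, 1] to match a tanh-output generator.
    return img.astype(np.float32) / 127.5 - 1.0

def denormalize(img):
    # Inverse mapping for display: [-1, 1] back to [0, 255].
    return np.clip((img + 1.0) * 127.5, 0, 255).astype(np.uint8)

def random_flip(img, rng):
    # Horizontal flip with probability 0.5 (one of the augmentations mentioned above).
    return img[:, ::-1] if rng.random() < 0.5 else img

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
scaled = normalize(random_flip(sample, rng))
```

Scaling into [-1, 1] matters because the generator's final tanh layer produces values in exactly that range.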
Model Architecture and Training Setup
Generator and Discriminator: We used a Pix2Pix model, which consists of a
U-Net generator and a multi-scale PatchGAN discriminator to capture high-
resolution details. The U-Net generator uses skip connections to preserve
spatial details, while the multi-scale discriminator enhances fine-grained
realism in high-resolution outputs.
Generator Architecture:
The Generator employs a U-Net-inspired structure consisting of two
main parts:
Encoder (Contraction Part): Uses convolutional and pooling layers
to extract features from the input image, reducing its resolution
while preserving essential features.
Decoder (Expansion Part): Uses transposed convolutional layers to
upsample the image, reconstructing a high-resolution output that
accurately maps the extracted features to the target image.
This structure allows the model to retain both feature presence and
their spatial locations, improving image quality.
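The encoder's halving of resolution and the decoder's doubling can be checked with simple shape arithmetic. The sketch below assumes 'same' padding, stride-2 convolutions, and the eight encoder / eight decoder stages used in the code later in this report; `down_shape` and `up_shape` are illustrative names:

```python
def down_shape(h, w, stride=2):
    # A stride-2 convolution with 'same' padding halves each spatial dimension (ceil division).
    return -(-h // stride), -(-w // stride)

def up_shape(h, w, stride=2):
    # A stride-2 transposed convolution with 'same' padding doubles each dimension.
    return h * stride, w * stride

shape = (256, 256)
encoder_shapes = [shape]
for _ in range(8):            # eight downsampling stages in the encoder
    shape = down_shape(*shape)
    encoder_shapes.append(shape)

for _ in range(8):            # seven upsampling stages plus the final Conv2DTranspose
    shape = up_shape(*shape)
```

For a 256×256 input the bottleneck is 1×1, which is why the generator needs exactly eight upsampling steps to restore the original resolution.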
Discriminator Architecture:
The Discriminator uses a PatchGAN approach, which classifies small
patches (N x N) of the image as real or fake instead of the entire image. This
method improves computational efficiency and allows for better local
texture analysis, enhancing the model’s ability to detect fine-grained
details.
Loss Functions:
Distance Loss: Measures the difference between the generated image
and the ground truth image.
Conditional Adversarial Loss: Ensures the Generator produces images
that are indistinguishable from real images by minimizing the
adversarial loss.
Combined Loss Function: The total loss is a combination of the
conditional adversarial loss and the distance loss, weighted by a
hyperparameter λ.
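The combined objective referred to above, reconstructed from the Pix2Pix formulation this project follows:

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big]
    + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]

G^{*} = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
```

Here x is the input image, y the target image, and z the noise injected into the generator (via dropout in Pix2Pix).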
Hyperparameters:
Batch Size: Various batch sizes between 1 and 5 were tested.
Stride Size: Stride sizes of 1 and 2 were experimented with in the
convolutional layers.
Activation Functions: Different activation functions, including ReLU,
Leaky ReLU, and ELU, were tested.
Optimizers: Optimizers such as Adam, Stochastic Gradient Descent
(SGD), and RMSprop were explored for their impact on convergence
and loss reduction.
The model was trained with a learning rate of 0.0002, batch sizes from 1 to 5,
and the Adam optimizer with beta values of 0.5 and 0.999. We conducted
experiments to find optimal hyperparameters and minimize training time
without compromising output quality.
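A single Adam update with these settings can be sketched in NumPy to show what the optimizer actually computes; `adam_step` is an illustrative name, not part of the project code:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    # One Adam update with the report's settings (lr=0.0002, beta1=0.5, beta2=0.999).
    m = beta1 * m + (1 - beta1) * grad           # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, np.array([1.0, -1.0]), m, v, t=1)
```

After bias correction, the very first update moves each parameter by roughly the learning rate in magnitude, regardless of the gradient scale; the lowered beta1 = 0.5 (versus the default 0.9) is the standard GAN choice to reduce momentum-induced oscillation.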
Fig. 3. Flowchart of Proposed Approach
Training and Evaluation Process
Training: The model was trained for 100 epochs on a single NVIDIA GPU. To
monitor progress, we saved checkpoints every 10 epochs and evaluated the
intermediate results for realism and accuracy.
Evaluation Metrics: To assess the quality of translated images, we
used both quantitative and qualitative metrics: Structural Similarity Index
(SSIM) and Peak Signal-to-Noise Ratio (PSNR) were used to measure
similarity to target images. Perceptual Quality Assessment involved human
evaluations of output images for realism, detail, and fidelity to target
characteristics.
Hyperparameter Analysis:
Hyperparameters play a crucial role in optimizing the performance of the
model, and each parameter affects different aspects of training and image
quality.
Batch Size (1-5): The batch size determines how many samples are
processed simultaneously during training. Smaller batch sizes can
improve generalization but may lead to noisier gradients, while larger
batch sizes can stabilize training at the cost of higher memory
requirements.
Stride Size (1-2): This affects how quickly the convolutional filter moves
across the image. Smaller strides preserve spatial resolution and fine
details but increase computational overhead. Larger strides reduce
computation time but may sacrifice image quality.
Activation Functions:
Leaky ReLU: Commonly used in the discriminator, Leaky ReLU
introduces a small gradient for negative input values, which prevents
the model from becoming stuck during training.
ReLU: Used in the generator, ReLU helps in efficient feature
extraction by introducing non-linearity.
Optimizers: The choice of optimizer impacts the convergence rate and
stability of the model. The Adam optimizer is preferred due to its
adaptive learning rate and momentum, which stabilize the training
process.
Loss Values:
Discriminator loss: stabilizes between 0.4 and 0.6.
Generator loss: ranges between 0.3 and 0.7, but can vary based on the
complexity of the task.
Figure 4: Different losses induce different quality of results. Each column
shows results trained under a different loss
REPORT
Initial Results
Table 1: Comparison of loss based on hyper-parameter tuning
Experimenting
Figure 5: Adding skip connections to an encoder-decoder to create a "U-Net"
yields much higher quality results.
Table 2: FCN-scores for different losses, evaluated on Cityscapes labels↔photos
Table 3: FCN-scores for different receptive field sizes of the discriminator, evaluated
on Cityscapes labels→photos. Note that input images are 256 × 256 pixels and larger
receptive fields are padded with zeros.
Fig. 6. Different Pixels of Patches
Figure 7: Color distribution matching property of the cGAN, tested on
Cityscapes. Note that the histogram intersection scores are dominated by
differences in the high-probability region, which are imperceptible in the
plots; the plots show log probability and therefore emphasize differences in
the low-probability regions.
Table 4: Histogram intersection against ground truth
Table 5: AMT "real vs fake" test on Maps ↔ Aerial Photos
Table 6: AMT "real vs fake" test on colorization
Table 7: Performance of photo→labels on Cityscapes
Best Combination
The best combination for producing high-quality output, as indicated in
the paper, is using a combination of L1 loss and conditional GAN (cGAN).
This combination balances sharpness and realism while reducing artifacts:
L1 Loss helps minimize differences between generated and ground-
truth images, reducing blurriness.
cGAN ensures the output looks realistic by forcing the network to
distinguish between real and fake images.
The study found that combining these two, with a high weight on L1 (λ =
100), produced sharper and more realistic results compared to using L1
or cGAN alone.
FCN-score (Fully Convolutional Network Score)
Purpose: Measures the semantic accuracy of generated images by
evaluating how well an off-the-shelf semantic segmentation model
classifies the generated images.
FCN-Score (Cityscapes labels ↔ photos)
Top Scorer: L1 + cGAN
Values:
Per-pixel accuracy: 0.66
Per-class accuracy: 0.23
Class IoU: 0.17
AMT Perceptual Study (Amazon Mechanical Turk)
Purpose: Measures human perception of realism in generated images
Top Scorer (Map ↔ Aerial Photo): L1 + cGAN
Aerial Photo to Map: 18.9% ± 2.5% of Turkers labeled the outputs real
Top Scorer (Colorization): Zhang et al. 2016
27.8% ± 2.7% of Turkers labeled the outputs real
Histogram Intersection in Color Space
Purpose: Evaluates how well the color distribution of the generated
images matches the ground truth in Lab color space.
Top Scorer: cGAN
Values:
L (Lightness): 0.87
a (Green-Red): 0.74
b (Blue-Yellow): 0.84
Conclusion:
L1 + cGAN consistently outperforms individual loss functions in most
tasks, combining realism and structure.
Zhang et al. 2016 performs best in colorization due to task-specific
engineering.
cGAN excels in generating sharp, vivid colors that match real-world
distributions.
SPECIFICATIONS
Discriminator
def Discriminator():
    initializer = tf.random_normal_initializer(0., 0.02)

    inp = tf.keras.layers.Input(shape=[256, 256, 3], name='input_image')
    tar = tf.keras.layers.Input(shape=[256, 256, 3], name='target_image')

    # Condition the discriminator on the input by concatenating it with the target.
    x = tf.keras.layers.concatenate([inp, tar])  # (batch_size, 256, 256, channels*2)

    down1 = downsample(64, 4, False)(x)   # (batch_size, 128, 128, 64)
    down2 = downsample(128, 4)(down1)     # (batch_size, 64, 64, 128)
    down3 = downsample(256, 4)(down2)     # (batch_size, 32, 32, 256)

    zero_pad1 = tf.keras.layers.ZeroPadding2D()(down3)  # (batch_size, 34, 34, 256)
    conv = tf.keras.layers.Conv2D(512, 4, strides=1,
                                  kernel_initializer=initializer,
                                  use_bias=False)(zero_pad1)  # (batch_size, 31, 31, 512)

    batchnorm1 = tf.keras.layers.BatchNormalization()(conv)
    leaky_relu = tf.keras.layers.LeakyReLU()(batchnorm1)

    zero_pad2 = tf.keras.layers.ZeroPadding2D()(leaky_relu)  # (batch_size, 33, 33, 512)

    # Each unit of this 30x30 map judges one patch of the input (PatchGAN).
    last = tf.keras.layers.Conv2D(1, 4, strides=1,
                                  kernel_initializer=initializer)(zero_pad2)  # (batch_size, 30, 30, 1)

    return tf.keras.Model(inputs=[inp, tar], outputs=last)
Generator
def Generator():
    inputs = tf.keras.layers.Input(shape=[256, 256, 3])

    down_stack = [
        downsample(64, 4, apply_batchnorm=False),  # (batch_size, 128, 128, 64)
        downsample(128, 4),  # (batch_size, 64, 64, 128)
        downsample(256, 4),  # (batch_size, 32, 32, 256)
        downsample(512, 4),  # (batch_size, 16, 16, 512)
        downsample(512, 4),  # (batch_size, 8, 8, 512)
        downsample(512, 4),  # (batch_size, 4, 4, 512)
        downsample(512, 4),  # (batch_size, 2, 2, 512)
        downsample(512, 4),  # (batch_size, 1, 1, 512)
    ]

    up_stack = [
        upsample(512, 4, apply_dropout=True),  # (batch_size, 2, 2, 1024)
        upsample(512, 4, apply_dropout=True),  # (batch_size, 4, 4, 1024)
        upsample(512, 4, apply_dropout=True),  # (batch_size, 8, 8, 1024)
        upsample(512, 4),  # (batch_size, 16, 16, 1024)
        upsample(256, 4),  # (batch_size, 32, 32, 512)
        upsample(128, 4),  # (batch_size, 64, 64, 256)
        upsample(64, 4),   # (batch_size, 128, 128, 128)
    ]

    initializer = tf.random_normal_initializer(0., 0.02)
    last = tf.keras.layers.Conv2DTranspose(OUTPUT_CHANNELS, 4,
                                           strides=2,
                                           padding='same',
                                           kernel_initializer=initializer,
                                           activation='tanh')  # (batch_size, 256, 256, 3)

    # Downsampling through the model, saving activations for the skip connections.
    x = inputs
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)
    skips = reversed(skips[:-1])

    # Upsampling and concatenating with the mirrored encoder activations.
    for up, skip in zip(up_stack, skips):
        x = up(x)
        x = tf.keras.layers.Concatenate()([x, skip])

    x = last(x)

    # Auxiliary 1x1 and 3x3 convolution heads used by the feature-matching loss.
    intermediate_output1 = tf.keras.layers.Conv2D(OUTPUT_CHANNELS, 1,
                                                  padding="same", activation="tanh")(x)
    intermediate_output2 = tf.keras.layers.Conv2D(OUTPUT_CHANNELS, 3,
                                                  padding="same", activation="tanh")(x)

    return tf.keras.Model(inputs=inputs,
                          outputs=[x, intermediate_output1, intermediate_output2])
Generator_loss
def generator_loss(disc_generated_output, gen_output, target, feature_layers):
    gan_loss = loss_object(tf.ones_like(disc_generated_output), disc_generated_output)

    # Mean absolute error (L1) for pixel-to-pixel similarity.
    l1_loss = tf.reduce_mean(tf.abs(target - gen_output))

    # Feature-matching loss: compare each auxiliary generator output against the target.
    fm_loss = sum(tf.reduce_mean(tf.abs(target - layer)) for layer in feature_layers)

    total_gen_loss = gan_loss + (LAMBDA * l1_loss) + (0.1 * fm_loss)  # weight feature loss as needed

    return total_gen_loss, gan_loss, l1_loss
Discriminator_loss
def discriminator_loss(disc_real_output, disc_generated_output):
    real_loss = loss_object(tf.ones_like(disc_real_output), disc_real_output)
    generated_loss = loss_object(tf.zeros_like(disc_generated_output),
                                 disc_generated_output)
    total_disc_loss = real_loss + generated_loss
    return total_disc_loss
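To see what a "stabilized" discriminator loss means numerically, the sketch below evaluates the same two-term objective with plain NumPy, assuming `loss_object` is binary cross-entropy (computed here from probabilities rather than logits, purely for clarity):

```python
import numpy as np

def bce(labels, probs):
    # Binary cross-entropy, mirroring what loss_object computes (from probabilities).
    probs = np.clip(probs, 1e-7, 1 - 1e-7)
    return float(np.mean(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))))

# A maximally confused discriminator outputs 0.5 for every patch:
real_loss = bce(np.ones(16), np.full(16, 0.5))
generated_loss = bce(np.zeros(16), np.full(16, 0.5))
total_disc_loss = real_loss + generated_loss   # 2 * ln(2), about 1.386, at equilibrium
```

A total loss well below 2·ln 2 (such as the 0.4-0.6 range reported later) therefore indicates a discriminator that still distinguishes real from generated patches with some confidence.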
Optimizer
generator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
Train_step
@tf.function
def train_step(input_image, target, step):
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # The generator returns the main output plus two auxiliary feature heads.
        gen_output, feat1, feat2 = generator(input_image, training=True)

        disc_real_output = discriminator([input_image, target], training=True)
        disc_generated_output = discriminator([input_image, gen_output],
                                              training=True)

        gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(
            disc_generated_output, gen_output, target, [feat1, feat2])
        disc_loss = discriminator_loss(disc_real_output, disc_generated_output)

    generator_gradients = gen_tape.gradient(gen_total_loss,
                                            generator.trainable_variables)
    discriminator_gradients = disc_tape.gradient(disc_loss,
                                                 discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(generator_gradients,
                                            generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(discriminator_gradients,
                                                discriminator.trainable_variables))

    with summary_writer.as_default():
        tf.summary.scalar('gen_total_loss', gen_total_loss, step=step//1000)
        tf.summary.scalar('gen_gan_loss', gen_gan_loss, step=step//1000)
        tf.summary.scalar('gen_l1_loss', gen_l1_loss, step=step//1000)
        tf.summary.scalar('disc_loss', disc_loss, step=step//1000)
PSNR, SSIM, MAE
import cv2
import numpy as np
import matplotlib.pyplot as plt
from skimage.metrics import structural_similarity as ssim
from skimage.metrics import peak_signal_noise_ratio as psnr

def calculate_mae(image1, image2):
    # Cast to float first so uint8 subtraction cannot wrap around.
    return np.mean(np.abs(image1.astype(np.float64) - image2.astype(np.float64)))

original_image = cv2.imread("original_image.png", cv2.IMREAD_GRAYSCALE)
generated_image = cv2.imread("generated_image.png", cv2.IMREAD_GRAYSCALE)

if original_image.shape != generated_image.shape:
    generated_image = cv2.resize(generated_image,
                                 (original_image.shape[1], original_image.shape[0]))

mae_value = calculate_mae(original_image, generated_image)
psnr_value = psnr(original_image, generated_image)
ssim_value, _ = ssim(original_image, generated_image, full=True)

metrics = ['MAE', 'PSNR', 'SSIM']
values = [mae_value, psnr_value, ssim_value]

plt.figure(figsize=(8, 5))
plt.bar(metrics, values, color=['blue', 'green', 'red'])
plt.title("Image Quality Metrics")
plt.ylabel("Metric Values")
plt.ylim(0, max(values) + 10)
for i, v in enumerate(values):
    plt.text(i, v + 0.5, f"{v:.2f}", ha='center', fontsize=10)
plt.show()
SPECIFICATIONS
1. Model Architecture
1.1 Generator: U-Net Architecture
Type: Encoder-Decoder with Skip Connections.
Purpose: Preserves low-level features by connecting mirrored layers in the
encoder and decoder.
Input: Conditioned on an input image.
Output: Generates an output image corresponding to the input domain.
Activation: ReLU for intermediate layers, Tanh for the output layer.
Kernel Size: Size of the convolutional filter.
Padding: Number of pixels added to the borders of the image.
Stride: Step size at which the filter is moved over the image.
1.2 Discriminator: PatchGAN
Type: Convolutional network that classifies each image patch as real or
fake.
Receptive Field Size: 70×70.
Purpose: Focuses on high-frequency details to ensure sharpness and local
realism.
Activation: Leaky ReLU.
PatchGAN structure, which classifies N×N patches
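The 70×70 receptive field follows from the discriminator's layer stack (4-tap kernels with strides 2, 2, 2, 1, 1, matching the Discriminator code earlier in this report); the standard backward receptive-field recurrence verifies it:

```python
# (kernel, stride) for each conv layer of the PatchGAN discriminator, in forward order:
layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

def receptive_field(layers):
    # Work backwards from one output unit: r_in = s * r_out + (k - s).
    rf = 1
    for k, s in reversed(layers):
        rf = s * rf + (k - s)
    return rf

patch = receptive_field(layers)   # each output unit sees a 70x70 input patch
```

Each unit of the 30×30 output map therefore judges only a 70×70 region of the input, which is what restricts the discriminator to modeling local texture and high-frequency structure.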
2. Loss Function: Combined Loss
2.1 Conditional GAN Loss (cGAN):

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 - D(x, G(x, z)))]

where:
x: input image (source domain).
y: real target image (target domain).
z: random noise vector supplied to the generator.
G(x, z): image generated by the generator G, conditioned on x and z.
D(x, y): probability that the discriminator D classifies y as real when
conditioned on x.

The first term rewards the discriminator D for correctly classifying the
real image y. The second term rewards it for classifying the generated image
G(x, z) as fake; the generator is trained to minimize this same objective.
2.2 Adversarial Loss
The first term represents the loss when the discriminator correctly
identifies real images.
The second term penalizes the generator when the discriminator
identifies generated images as fake.
2.3 Distance Loss (L1 Loss)
The distance loss encourages the generator to produce images that are
close to the ground truth (target image). It is computed as the L1 norm
between the generated image G(x, z) and the real image y:

L_L1(G) = E_{x,y,z}[ ||y - G(x, z)||_1 ]
2.4 Combined Loss Function

G* = arg min_G max_D L_cGAN(G, D) + λ · L_L1(G)

λ is a hyperparameter that controls the trade-off between the adversarial
loss and the L1 loss, balancing realistic image generation (adversarial)
with accurate image translation (distance loss).
Suggested λ (weight for L1 loss): 100
2.5 Hyperparameter Tuning
The key hyperparameters based on the paper are the learning rate
(0.0002) and L1 loss weight (λ = 100), with Adam optimizer settings: β1 =
0.5 and β2 = 0.999. Additionally, dropout is applied at various layers in the
generator to introduce noise and prevent overfitting.
3. Training Parameters
Optimizer: Adam
Learning Rate: 0.0002
Momentum Parameters:
β1=0.5
β2=0.999
Batch Size: Between 1 and 10, depending on the dataset.
Dropout: Applied at several layers to introduce noise and prevent
overfitting.
4. Evaluation Metrics
MAE (Mean Absolute Error): Measures pixel-wise differences.
PSNR (Peak Signal-to-Noise Ratio): Evaluates image fidelity;
PSNR = 10 · log10(MAX_I² / MSE), where MAX_I is the maximum possible
pixel value of the image.
SSIM (Structural Similarity Index Measure): Assesses structural
similarity.
FCN-Score: Evaluates semantic accuracy using a pre-trained
segmentation model.
AMT Perceptual Study: Human evaluation for realism.
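As a concrete check of the PSNR definition above, here is a minimal NumPy implementation (illustrative, independent of the scikit-image call used elsewhere in this report; `psnr_db` is a hypothetical name):

```python
import numpy as np

def psnr_db(original, generated, max_i=255.0):
    # PSNR = 10 * log10(MAX_I^2 / MSE); higher means the images are closer.
    mse = np.mean((original.astype(np.float64) - generated.astype(np.float64)) ** 2)
    return float(10 * np.log10(max_i ** 2 / mse))

a = np.zeros((8, 8), dtype=np.uint8)
b = np.full((8, 8), 16, dtype=np.uint8)   # uniform error of 16 gray levels -> MSE = 256
quality = psnr_db(a, b)                   # about 24 dB
```

Halving the per-pixel error raises PSNR by roughly 6 dB, which is why small visual improvements show up as modest dB gains.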
5. Datasets
Cityscapes: Semantic labels ↔ photos.
CMP Facades: Architectural labels ↔ photos.
Google Maps Data: Map ↔ Aerial photos.
HED Edge Detector: For edge ↔ photo tasks.
6. Inference Time
GPU: Runs efficiently on a Pascal Titan X GPU.
Time per Image: Well under 1 second per image during inference.
IDEAL GRAPH
RESULTS
CONCLUSION:
This project demonstrated the effectiveness of the Pix2Pix GAN model for
image-to-image translation, particularly in transforming architectural label
images into realistic building facades. Through the integration of
conditional GANs, we achieved high-quality outputs that accurately
captured both structural and textural details necessary for photorealistic
representations.
Quantitatively, the model's performance was evaluated using metrics such
as the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio
(PSNR). The SSIM scores consistently reflected high structural fidelity
between generated and target images, while PSNR values confirmed the
accuracy of pixel-level details. These metrics validate the generator's
ability to produce realistic images closely aligned with ground truth data.
Optimization techniques, such as doubling the filters in the generator and
discriminator, balancing loss functions with increased L1 loss weight, and
implementing a learning rate scheduler, further enhanced the model's
performance. These adjustments proved particularly valuable for
maintaining high-quality outputs, even when constrained by small batch
sizes due to limited computational resources. The use of Leaky ReLU and
Adam optimizer was found to provide the best performance, yielding
improved results compared to other configurations.
FUTURE SCOPE:
1. Temporal Consistency in Video Translation
In video processing, it is crucial to maintain temporal consistency across
frames to avoid flickering and unnatural transitions between consecutive
images. This is particularly important in applications such as video
synthesis, video style transfer, and video super-resolution. Future models
can focus on improving spatial-temporal coherence, ensuring that the
generated frames are not only realistic but also smoothly transition from
one frame to another.
Approach: This can be achieved by modifying the GAN architecture to
incorporate temporal constraints that ensure smooth transitions and
coherent visual effects. The integration of Recurrent Neural Networks
(RNNs) or 3D Convolutional Networks (3D CNNs) into the Generator
could help in learning temporal features alongside spatial features.
2. Unsupervised Video-to-Video Translation
Similar to how unpaired image-to-image translation is handled in cGANs
(like CycleGAN), unsupervised methods for video-to-video translation can
be explored. These approaches do not require paired training datasets,
which can be challenging to obtain for video data. Using techniques like
Cycle-consistency and Dual-GAN, the model can learn to generate video
sequences from one domain to another without paired video frames.
Potential Use Cases: For example, translating day to night video,
seasonal changes, or even converting real-world videos to animated
ones can be tackled without the need for ground truth video pairs.
REFERENCES:
1. Zhao, Y., et al., "Unpaired Image-to-Image Translation using Adversarial
Consistency Loss" (2020).
2. Aziz Alotaibi, "Deep Generative Adversarial Networks for Image-to-Image
Translation: A Review" (2020).
3. Kusam Lata, Mayank Dave, Nishanth K N, "Image-to-Image Translation Using
Generative Adversarial Network" (2019).
4. Gerda Bosman, Tom Kenter, Rolf Jagerman, and Daan Gosman (2017).
5. Simon Karlsson & Per Welander, "Generative Adversarial Networks for
Image-to-Image Translation on Street View and MR Images" (2018).
6. Boddu Manoj, Boda Bhagya Rishiroop, "Image to Image Translation Using
Generative Adversarial Network" (2020).
7. Anant Veer Bagrodia, "Image-to-Image Translation using Generative
Adversarial Networks (GANs)" (2023).
8. Utsab Saha, Sawradip Saha, Shaikh Anowarul Fattah, Mohammad Saquib,
"Npix2Cpix: A GAN-based Image-to-Image Translation Network with
Retrieval-Classification Integration for Watermark Retrieval from
Historical Document Images" (2024).
9. "Image-to-Image Translation" (2000-2021).
10. "Generative Adversarial Network (GAN)".
Timeline
WEEK 1-3
Types of GANs: Exploration of various GAN types, such as DCGAN, CycleGAN, Pix2Pix, DualGAN, etc.,
with a focus on their specific use cases and strengths.
WEEK 4-6
Aim: Experiment with different GANs for image translation in Google Colab.
Process: Implement GANs like CycleGAN, Pix2Pix, etc., to understand each model's advantages,
disadvantages, and key parameters (learning rate, batch size, loss functions, generator/discriminator
architectures).
Outcome: Develop a detailed understanding of each GAN's performance and suitability for various
image translation tasks.
WEEK 7-9
To enhance the Pix2Pix model for high-definition (HD) image-to-image translation, achieving more
realistic and detailed outputs with minimal input requirements by incorporating modifications based
on the Pix2PixHD architecture.
By analyzing the underlying mechanisms of Pix2PixHD, we aim to modify and optimize the Pix2Pix
code to produce high-resolution images with detailed textures and refined visual fidelity. This
approach is expected to reduce input dependency while maximizing output quality
WEEK 10-12
Key stages include selecting a suitable journal or conference, preparing the manuscript according to
submission standards, and addressing peer reviews to refine the paper. This publication will share our
findings with the broader research community.