StructGAN: Image Restoration Maintaining Structural
Consistency Using A Two-Step Generative Adversarial Network

Authors

Nahian Muhtasim Zahin
160041008

Md. Mushfiqur Rahman
160041011

Kazi Raiyan Mahmud
160041058

Supervised by

Md. Hasanul Kabir, Ph.D.
Professor, Department of CSE, IUT

A thesis submitted to the Department of CSE
in partial fulfillment of the requirements for the degree of

B.Sc. Engg. in Computer Science and Engineering

Department of Computer Science and Engineering
Islamic University of Technology (IUT)

Organization of Islamic Cooperation (OIC)
Dhaka, Bangladesh

March 2021


Originality Statement

We hereby declare that this submission is our own work and to the best of our knowledge
it contains no materials previously published or written by another person, or substantial
proportions of material which have been accepted for the award of any other degree or
diploma at IUT or any other educational institution, except where due acknowledgement
is made in the thesis. Any contribution made to the research by others, with whom we
have worked at IUT or elsewhere, is explicitly acknowledged in the thesis. The authors
also declare that the intellectual content of this thesis is the product of their own work,
except to the extent that assistance from others in the project’s design and conception or
in style, presentation and linguistic expression is acknowledged.

Authors:

————————————————
Nahian Muhtasim Zahin

(Student ID: 160041008)

————————————————
Md. Mushfiqur Rahman

(Student ID: 160041011)

————————————————
Kazi Raiyan Mahmud
(Student ID: 160041058)

Supervisor:

————————————————
Md. Hasanul Kabir, Ph.D.

Professor
Department of Computer Science and Engineering (CSE)

Islamic University of Technology (IUT)

04 March 2021


Abstract

Image restoration deals with the removal of noise, blurriness, missing patches, and other
kinds of distortions in broken images. Traditional reconstruction and restoration ap-
proaches suffer from different kinds of limitations. In our work, we have improved upon
those models by introducing novel structure loss that emphasizes the overall image struc-
ture rather than individual pixels. Our proposed model StructGAN can achieve a higher
SSIM (Structural Similarity Index Measure) score while not massively compromising other
noise metrics. Overall, our proposed model uses generative adversarial networks with a
two-step generator network, a dual discriminator network, and coherent semantic atten-
tion (CSA) layer. The two-step generator helps refine the output. The dual discriminator
ensures local and global correctness. The CSA layer ensures semantic consistency. Along
with these, our model incorporates the novel structure loss. The structure loss is based
on the Laplacian filter that calculates the overall structure-map of the image and tries to
replicate the structure-map in the generation step. The results obtained by our model are
qualitatively comparable to the performance of the state-of-the-art models. For certain
metrics, e.g. SSIM, StructGAN quantitatively outperforms other models.

i


Acknowledgement

We are indebted to Professor Dr. Md. Hasanul Kabir for guiding us throughout this
research. His valuable time and input were provided throughout this thesis work, from
the initial phase of topic introduction, subject selection, hypothesis proposition, to the
project implementation and finalization which helped us to do our thesis work correctly.
Without his supervision, this research work would not have been possible.

We would also like to thank Mr. A. B. M. Ashikur Rahman, Mr. Redwan Karim Sony, Mr.
Sabbir Ahmed, and all the other faculties of the Computer Vision Research Lab. Their
time-to-time suggestions and interesting insights were massively helpful for our research.
We are also grateful to Mr. Moshiur Farazi for his valuable opinion and advice regarding
our work. His suggestions helped us improve our final work.

We would also like to thank the former head of department, Professor Dr. Muhammad
Mahbub Alam, and the current head of department, Professor Dr. Abu Raihan Mostofa
Kamal for creating a research-friendly environment at IUT. Such a research-friendly envi-
ronment is vital for proper research.

ii


Contents

Abstract i

Acknowledgement ii

Contents iii

List of Figures vi

List of Tables viii

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 In-painting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.3 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.4 Deblurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Organization of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

iii


2 Literature Review 7

2.1 Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Generative Adversarial Network . . . . . . . . . . . . . . . . . . . . 8

2.1.2 Image-to-image translation . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Restoration Tasks and Methods . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Statistical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Persistent Memory Modeling (PMM) . . . . . . . . . . . . . . . . . . 12

2.2.3 Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.4 Deblurring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.5 Super Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Loss Functions and Similarity Metrics . . . . . . . . . . . . . . . . . . . . . 17

2.4 Image in-painting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5 Image in-painting with GANs . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5.1 With two-step GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Proposed Methodology 21

3.1 Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.1 Adversarial Loss and Refinement Loss . . . . . . . . . . . . . . . . . 23

3.2.2 Consistency Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2.3 Structure Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2.4 Combined Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3 Generator Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3.1 Rough Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3.2 Refinement Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4 Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

iv


3.4.1 Patch discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.2 Global discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Adversarial training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Experimental Setup 31

4.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.3 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.1 Places 2 (Val) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3.2 COCO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.4 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5 Result and Discussion 36

5.1 Sample Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2.1 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Conclusion 43

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

References 46

v


List of Figures

1.1 Example of Image Super-Resolution [1] . . . . . . . . . . . . . . . . . . . . . 2

1.2 Example of Image In-Painting [2] . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Example of Image Denoising [3] . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Example of Image Deblurring [4] . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Simplified architecture of Generative Adversarial Networks (laptrinhx.com [5]) 8

2.2 Simplified version of Pix2pix architecture (Isola et al. [6]) . . . . . . . . . . 10

2.3 Examples of Image-to-image translation works (Isola et al. [6]) . . . . . . . 11

2.4 Free-Form Image Inpainting With Gated Convolution [7] . . . . . . . . . . . 20

2.5 Coherent Semantic Attention for Image Inpainting [8] . . . . . . . . . . . . 20

3.1 Architecture overview of proposed model . . . . . . . . . . . . . . . . . . . . 22

3.2 Example of Laplacian Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.3 Rough Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.4 Consistency loss (Similar to the work of Liu et al. [8]) . . . . . . . . . . . . 28

3.5 Patch discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Performance comparison between Google Colab and Kaggle . . . . . . . . . 32

4.2 Few samples from the Places2 (Val) dataset . . . . . . . . . . . . . . . . . . 34

4.3 Few samples from the COCO dataset . . . . . . . . . . . . . . . . . . . . . . 34

vi


4.4 Example of Center-square Occlusion Mask . . . . . . . . . . . . . . . . . . . 35

5.1 Sample Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Sample Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.3 Sample Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.4 SSIM Score of StructGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.5 PSNR Score of StructGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.6 MSE Score of StructGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

vii


List of Tables

5.1 Quantitative Analysis of our proposed model (StructGAN). . . . . . . . . . 38

viii


Abbreviations

CNN Convolutional Neural Network

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

GAN Generative Adversarial Network

GPU Graphics Processing Unit

MSE Mean Squared Error

PSNR Peak Signal-to-Noise Ratio

SSIM Structural Similarity Index Measure

StructGAN Structure GAN

ix


Chapter 1

Introduction

Images, both physical and digital, can quite efficiently store visual information for a long

period of time. All kinds of artworks, engravings on stones, photographs, etc., can be

encompassed under the broader definition of the image. For ages, humans have been

using this ingenious tool to pass information into the future. But images, especially

physical images, do not last forever. With the passage of time, these go through different

forms of wear and tear and get distorted. In the case of digitally stored images, wear

and tear, over time, is not so common. But the poor quality of capturing devices or

inefficiency of encryption algorithms can still cause deviation from the originally intended

image. Noise, distortion, corruption, and any other form of deviation from the originally

intended image results in loss of visual information. Generally, this loss is irreversible.

Thus, no direct formula exists to reverse this process with full confidence. However, with

some specific methods, this irreversible process can be reversed to some extent. Such

methods are regarded as image restoration techniques. In plain words, image restoration

is the task of regenerating the original image from the distorted image with some form of

prior knowledge of the context of the image.

1


1.1. MOTIVATION

1.1 Motivation

Image restoration is a cognitive task. With a deep knowledge of the context of the original

image and exceptional skill in generating images of similar form, an expert can quite

successfully recreate an image with high accuracy. With enough practice, an amateur can

improve their dexterity in recreating an almost identical copy of the original image, from

only the distorted image. This paper deals with digital image restoration and considers

it as a cognitive image-to-image translation task. Our primary motivation is to make a

machine learning algorithm capable of restoring and reconstructing images.

Super-resolution, in-painting, denoising, and deblurring are some of the most commonly

ventured aspects in digital image restoration literature. Our goal is to create a deep

learning model that addresses all 4 of these topics simultaneously.

1.1.1 Super-resolution

Figure 1.1: Example of Image Super-Resolution [1]

The process of up-scaling and enhancing the details within a low-resolution image is known

as image super-resolution. In most cases, a low-resolution image is taken as the input of

the system and the image is upscaled to a higher resolution to generate the output. The

details in the high-resolution output are filled in where the details are essentially unknown.

2


1.1. MOTIVATION

Figure 1.2: Example of Image In-Painting [2]

1.1.2 In-painting

Broken images often have large missing patches. Obtaining the original information for

these missing portions is almost impossible. But with advanced algorithms and techniques

and with prior knowledge of the image, it is possible to generate information for the missing

portions that makes it consistent with the remaining image. This task of generating

information for the missing portion of an image to make the overall image consistent is

called image in-painting.

1.1.3 Denoising

Figure 1.3: Example of Image Denoising [3]

The signal processing methods which can reconstruct a 1-D, 2-D, or 3-D signal from a

noisy one is known as denoising. Its primary objective is to remove noise and retain useful

information about the signal. In the case of images, there are different types of noises. A

denoising technique detects and removes those noises. The biggest challenge here is, the

3


1.1. MOTIVATION

information obscured by noise is totally missing and is often irretrievable. So, the task of

denoising involves generating new information coherent to the neighboring regions.

1.1.4 Deblurring

Figure 1.4: Example of Image Deblurring [4]

Blurry images do not have distinct edges. Deblurring algorithms aim at sharpening the

edges and enhancing the structure of the image. Traditional image processing algorithms,

like, morphological sharpening [9], un-sharp masking [10], etc., can successfully fix simple

blurriness. However, these traditional algorithms often fail to deblur images with complex

blurriness where a huge portion of original information goes missing. One such complex

blurriness is the motion blur.

As exact regeneration is not possible, the success of a re-generator is measured by calcu-

lating the similarity of the generated image with the original non-distorted image. The

primary objective of this research is to minimize the difference between X and X
′ and

maximize their similarity.

min(
∣∣∣X −X ′

∣∣∣)
max(similarity(X,X ′))

The similarity is a broad and vague concept. It can be calculated in various ways and

the performance of the re-generator vastly depends on the metric that is used to calculate

the similarity. Traditionally, pixel-by-pixel mapping has been used for this purpose. If an

exact copy of the image was necessary, then such a similarity metric could be of great use.

But since image restoration focuses more on regenerating a semantically sensible version

4


1.2. PROBLEM STATEMENT

of the image and not just a digital replica with inconsistent information, only using the

pixel-by-pixel similarity is counter-intuitive.

1.2 Problem Statement

If an original image, represented by X, has a noise-map N . And the noise in the image

has an intensity α, then the distorted image can be represented by:

f(X) = αN + (1− α)X

In this research, the primary goal is to obtain the original image, X, when the distorted

image, f(X), is given. Since N and α are totally unknown and can be anything, it

ultimately requires an intelligent system to find those missing information and generate

X
′ from some prior knowledge of the image, the noise, and the environment.

X
′ = f(Distorted, prior)

Therefore, image restoration and reconstruction can be defined as the task of creating a

function that uses a distorted image and utilizes the image-prior to generate an output

very similar to the original image.

1.3 Objectives

The primary objective of this research is to find an algorithm that can generate missing

portions of an image while maintaining the overall structural consistency. Regeneration

tasks are quantitatively compared with different scoring metrics. The additional objective

of this research is to have high scores in these metrics.

5


1.4. CONTRIBUTIONS

1.4 Contributions

In this report, we have introduced a novel approach to calculate the structural similarity

of images in in-painting tasks. Our approach focuses more on the overall structure of the

image rather than individual pixels. We have incorporated this mechanism in the form of

a loss function in our two-step generative model. Our model achieves a high SSIM score.

1.5 Organization of the thesis

This report has 6 chapters. These are: Introduction, Literature Review, Proposed Method-

ology, Experimental Setup, Result and Discussion, and Conclusion. The Introduction

chapter introduces the problem statement and gives an overview of the report. The Liter-

ature Review chapter discusses related works in the domain. The Proposed Methodology

chapter describes our proposed model. The Experimental Setup chapter gives a detailed

explanation of our experiments. The Result and Discussion chapter gives a qualitative

and a quantitative analysis of the outputs of our work. The Conclusion chapter concludes

the work with a brief summary of our work. Prior to these 6 chapters, this report has a

list of all the figures, a list of all the tables, and a list of all the abbreviations used in this

report.

6


Chapter 2

Literature Review

Much work has been done on image restoration. But contrary to our approach, most of

these papers only target one form of distortion, whereas, our goal is to create an algorithm

capable of solving multiple types of image distortion.

2.1 Generative Models

The use of deep learning has shown promising results when it came to discovering models

capturing the probability distribution of different types of data like images, audios, natural

languages, etc. Most of these models were discriminative models. As a whole, these are

known as Auto-Encoders. These use the Kullback-Leibler(KL) divergence. Although they

work well for some cases, they massively fall short in the case of a true generation where

samples are very divergent from generated image. This divergence causes the generative

loss to bloat up.

7


2.1. GENERATIVE MODELS

2.1.1 Generative Adversarial Network

Producing good generative models has not been much success in the past due to the diffi-

culties of calculating probabilities and utilizing the benefits of piece-wise linear functions.

Ian Goodfellow et al. [11] proposed a framework that can produce generative models with

very good accuracy. The basis of this architecture is the Jensen-Shannon (JS) divergence

instead of the KL divergence. This new framework has pushed the boundaries of deep

learning and has provided some handy solutions to a lot of problems. In this paper, two

models are trained simultaneously. One model is the generative model (G) which works

with the data distribution and generates new data. The other one is a discriminative

model (D) which calculates the probability that an input sample came from the training

data rather than the other model, G. The discriminative model (D) takes data as an input

and outputs a scalar value, either real or fake. The goal of the generative model (G) is

to fool the other model by producing such data which maximizes the probability of D

classifying a fake input as real. On the other hand, the goal of the discriminative model

(D) is to correctly classify any input data fed into it. This adversarial process makes both

the models better as they basically compete against each other.

A simplified architecture of Generative Adversarial Networks has been shown in Figure

2.1.

Figure 2.1: Simplified architecture of Generative Adversarial Networks (laptrinhx.com [5])

8


2.1. GENERATIVE MODELS

This concept opened new dimensions in the field of deep learning. A lot of other variations

of GAN architecture were later developed which could also produce satisfactory results.

The whole idea of this paper is highly relevant to our research. A major portion of our

implementation incorporated the core scheme of this architecture.

2.1.2 Image-to-image translation

An image can be represented in many ways like RGB representation, edge map, gradient

field, semantic map. When working with digital images, we need to work with a variety of

these representations according to our purpose. Isola et al. [6] mainly focuses on image-

image translation using Conditional Adversarial Networks. Zhu et al. [12] also use

image-to-image translation but for unpaired images.

The goal of this network is firstly to learn a mapping from an input image to an out-

put image. Secondly, the network also learns a loss function which trains this model to

achieve a more general approach to this problem. If the same job was approached more

traditionally with CNNs, the loss function for each type of input would vary and had to

be designed manually which is more challenging and less efficient. As discussed earlier,

GANs are generative models that learn a mapping from random noise vector z to output

image y, G:z → y. In contrast, the cGANs learn to map y from observed image x and

random noise vector z. G:(x, z) → y. For building this network, the “U-net” architecture

was used. In the “U-net” architecture, the input is first downsampled until a point that

is known as the bottleneck. Then the whole process is reversed. To ensure that some

low-level information is not lost during the downsampling process, skip connections are

used between mirrored layers across the whole architecture. These skip connections pass

necessary low-level information between layers.

Figure 2.2 illustrates a simplified version of the training architecture of Pix2pix proposed

by Isola et al. [6]

It was seen that common losses like the L1 and L2 norms, can accurately capture low-

9


2.1. GENERATIVE MODELS

Figure 2.2: Simplified version of Pix2pix architecture (Isola et al. [6])

level frequency information of an image though they failed to do so with the high-level

frequencies. As a result, the images produced were blurry. The authors of this paper

smartly added the L1 loss with the loss function instead of building a whole new framework.

This worked well for both the low and high-level frequency information of any input image

and produced good results.

The method of this paper was tested on numerous tasks and datasets like:

• Semantic segmentation map ←→ realistic photo

• Architectural facade segmentation −→ realistic photo

• Black and white −→ colored photos

• Map ←→ aerial photo

• Edges −→ photo

• Sketch −→ photo

• Day −→ night

• Thermal −→ color

• Photo with missing pixels −→ in-painted photo

10


2.2. RESTORATION TASKS AND METHODS

Figure 2.3: Examples of Image-to-image translation works (Isola et al. [6])

Few examples of image-to-image translation domains have been illustrated in Figure 2.3.

All of the results showed that their loss function was working better than any common

or fixed loss function. This paper was pertinent to our research as they incorporated a

new loss function with GAN architecture and also worked with some tasks (like image

in-painting) which were similar to what we are trying to accomplish.

2.2 Restoration Tasks and Methods

2.2.1 Statistical Modeling

The use of statistical algorithms is also quite common in image restoration tasks. Zhang

et al. [13] propose a method that statistically characterizes local smoothness and nonlocal

self-similarity of natural images to handle image restoration.

Most papers on image restoration assume images to be locally smooth except for the edges.

Regularization techniques based on this assumption (total variation (TV), half quadrature

formulation, and Mumford-Shah (MS) models) can preserve edge smoothness effectively

but smear out image details. The alternative to this assumption is the use of NLM (non-

local means) in creating weighted filters by analyzing surrounding pixels from the image

prior. In recent literature, the use of non-local self-similarity property can be seen both in

11


2.2. RESTORATION TASKS AND METHODS

pixel level (for denoising) and in block/patch level (for super-resolution and deblurring).

The paper proposes a novel model (Joint Statistical Modeling) that combines the local

statistical modeling in space-domain(2D) and non-local self-similarity in transform-domain

(3D). The proposed regularization term is as follows:

ΨJSM (u) = τ.ΨLSM (u) + λ.ΨNLSM (u) (2.1)

Here, Ψ represents regularization.

2.2.2 Persistent Memory Modeling (PMM)

A very common problem in deep CNNs is that the prior states/layers have very little

influence on the subsequent ones. Tai et al. [14] propose a very deep persistent memory

network (MemNet) that introduces a memory block, consisting of a recursive unit and a

gate unit. The representations and the outputs from the previous memory block are sent

to the gate unit which controls how much of the previous states should be reserved and

how much of the current states will be added to the memory. MemNet addressed three

image restoration tasks – image denoising, super-resolution, and JPEG deblocking. They

used the following loss function:

L(Θ) = 1
2N

N∑
i=1

∣∣∣(x̃)(i) −D(x̃)(i)
∣∣∣2 (2.2)

2.2.3 Denoising

Image denoising is one of the most widely studied topics of computer vision. Many great

works are available in this field. Their results are mostly excellent. Scientists have used

a wide variety of algorithms for image denoising. The use of the statistical approaches

[15–18], the non-local approaches [19–22], and the filtering approaches [23,24] have shown

the best results so far.

12


2.2. RESTORATION TASKS AND METHODS

The statistical approaches mostly work with wavelet coefficients. Mihcak et al. [15] in-

troduced a simple spatially adaptive statistical model for wavelet image coefficients and

applied it to image denoising. Their model was inspired by another wavelet image com-

pression algorithm, the Estimation Quantization Coder [25]. They modeled the wavelet

image coefficients as zero-mean Gaussian random variables [26] with high local correlation.

Their model presupposed a marginal prior distribution on wavelet coefficient variances.

This distribution was estimated using the Maximum A Posteriori Probability rule. Then

they applied an approximate Minimum Mean Squared Error estimation procedure to re-

store the noisy wavelet image coefficients. Despite the simplicity of their method, both

in its concept and implementation, their denoising results are among the best reported in

the literature.

Buades, Coll, and Morel [19] proposed a new measure, the method noise, to evaluate

and compare the performance of digital image denoising methods. They firstly computed

and analyzed this method for a wide class of denoising algorithms, namely, the local

smoothing filters. Secondly, they proposed a new algorithm, the non-local means (NL-

means), based on a non-local averaging of all pixels in the image. Finally, they presented

some experiments comparing the NL-means algorithm and the local smoothing filters.

The results obtained by the aforementioned state-of-the-art techniques are already great

so our research will not delve deeper into them. But we have plans to leverage their

techniques and incorporate them into our system.

2.2.4 Deblurring

Prior works on deblurring show promising results. Most deblurring literature deal with

motion blur [27–32]. In our research, we are also primarily targeting motion blur. Shan et

al. [32] presented a novel algorithm for removing motion blur from a single image. Their

method constructs a deblurred image on the basis of a probabilistic model. The probabilis-

tic model that they use, unifies blur kernel estimation with unblurred image restoration.

They presented a thorough analysis of the common reasons for artifacts found in outputs

13


2.2. RESTORATION TASKS AND METHODS

of current deblurring methods. They also introduced several novel terms within their

probabilistic model. These terms include a) a model of the spatial randomness of noise

in the blurred image and b) a new local smoothness prior that reduced the ringing effect

of artifacts by constraining contrast in the unblurred image wherever the blurred image

has low contrast. Finally, they described an efficient optimization scheme that alternates

between blur kernel estimation and unblurred image restoration until convergence. As

a result of these steps, they were able to produce high-quality deblurred results in low

computation time.

2.2.5 Super Resolution

Super-resolution has made huge progress in recent years. Previously, cubic and bicubic

methods were primarily used to zoom-in or expand images. But with modern technologies

and algorithms, currently, deep learning is massively used in the image super-resolution

domain. Sahu [33] quite brilliantly explains the evolution of the use of deep learning in a

single image super-resolution domain.

2.2.5.1 Interpolation

Prior to the wide use of deep learning techniques, interpolation was the go-to method for

researchers working in the super-resolution domain. Common interpolation methods are:

• Nearest-neighborhood interpolation

• Bilinear interpolation

• Bicubic interpolation

14


2.2. RESTORATION TASKS AND METHODS

2.2.5.2 SRCNN

After the success of fully convolutional neural networks (FCNN) [34], its popularity in

various fields of computer vision promulgated rapidly. CNNs has two primary functional

blocks – one extracts features and the other one classifies outputs. The fully-connected

(FC) layers at the end-part of CNNs are the classifier whose task is to map the extracted

features to class probabilities. SRCNN [35] was one of the primary applications of FCNN.

In SRCNN, first, the image is unsampled using any interpolation technique (Dong et

al. [35] recommended bicubic interpolation). The output of the interpolation is fed into

a simple FCNN. The method does not use any pooling operation, so the output has the

same spatial size as that of the unsampled input image. In the last step, SRCNN computes

the MSE loss between the target high-resolution image and the obtained output.

2.2.5.3 SRResNet and Sub-pixel convolution

With the promising progress of SRCNN, the next step in the evolution was achieved

through the use of ResNet (CNN architecture with skip connections) in super-resolution

models. SRResNet [36] replaced simple convolutional blocks of SRCNN with residual

blocks. This gave a huge boost in the accuracy of the algorithm.

Upsampling operations were implemented with stridden convolution gradients which adds

zero values to upscale the image, which has to be later filled in with meaningful values.

2.2.5.4 Perceptual Loss

MSE or MSE-based error mechanisms only measure the pixel-difference between two cor-

responding pixels in the generated image and the ground-truth image. These are generally

too smooth and thus have poor perceptual quality. Therefore, it is advised to not check

PSNR alone while comparing the performance of any two methods in such tasks [33].

Perceptual loss [37] is calculated by finding the changes between two images based on

15


2.2. RESTORATION TASKS AND METHODS

high-level representations from a pre-trained CNN model. The feature is used to compare

high-level variations between pictures, such as content and style differences. In our work,

we have utilized this technique to some extent.

2.2.5.5 SR-GAN

The GAN technique allows reconstructions to move into search space regions with a high

likelihood of containing photo-realistic images, taking them closer to the natural image.

SR-GAN [36] is another GAN-based network. Upsampling is done by the Generator using

ResNet and sub-pixel convolution.

min
θG

max
θD

(
E
[
logDθD

(IHR)
]

+ E
[
log

(
1−DθD

(GθD
ILR

)
)
])

lSRGen =
N∑
n=1
− logDθD

(
GθG

(
ILR

))

lSR = lSRX + 10−3lSRGen

Here, the discriminator tries to maximize the net loss and the generator tries to minimize

it to minimize it. Though SR-GAN has great results, the hallucinated details are often

accompanied by unpleasant artifacts.

2.2.5.6 ESRGAN

The latest addition to the SR algorithms is the Enhanced Super-Resolution Generative

Adversarial Network (ESRGAN) [38]. It is capable of generating realistic textures better

than all the previously mentioned algorithms. ESRGAN improved and enhanced all the

key components of SR-GAN – network architecture, adversarial loss, and perceptual loss.

16


2.3. LOSS FUNCTIONS AND SIMILARITY METRICS

In particular, the paper introduced a novel type of dense block, namely, Residual-in-

Residual Dense Block. The block does not have any batch normalization layer. Their other

contribution was the use of relativistic GAN for discriminators where the discriminator

predicts relative realness instead of absolute realness. Finally, they improved perceptual

loss by using the features before activation, which could provide stronger supervision for

brightness consistency and texture recovery.

2.3 Loss Functions and Similarity Metrics

Zhao et al. [39] evaluates the performance of L2 loss for different image restoration tasks

(image super-resolution, JPEG artifacts removal, and joint denoising plus demosaicking)

and proposes a novel loss function that works better for image restoration. They compared

the L2 loss function with four other image quality error metrics: L1, SSIM (structural

similarity index), MS-SSIM (multi-scale structural similarity index) [22] and mix (a novel

loss the paper proposes). The paper claims, the mixed loss works better because the

human perception of image quality does not resonate with L2 loss but with SSIM and

MS-SSIM. The paper uses the same CNN model and applies different loss functions to

it to obtain comparable outputs. The comparison shows that L1 loss alone works better

than SSIM and MS-SSIM. But their proposed mixed loss performs better than both L1

and SSIM/MS-SSIM losses.

Patch Loss:

Lε(P ) = 1
N

∑
p∈P

ε(p) (2.3)

L1 Loss:

Ll1(P ) = 1
N

∑
p∈P
|x(p)− y(p)| (2.4)

17


2.4. IMAGE IN-PAINTING

SSIM Function:

SSIM(p) = 2µxµy + C1
µ2
x + µ2

y + C2
.

2σxσy + C1
σ2
x + σ2

y + C2
(2.5)

= l(p).cs(p) (2.6)

SSIM Loss:

LSSIM (P ) = 1
N

∑
p∈P

1− SSIM(p) (2.7)

LSSIM (P ) = 1− SSIM(p̃) (2.8)

MS-SSIM function:

MS-SSIM(p) = lαM (p).
M∏
j=1

cs
βj

j (p) (2.9)

MS-SSIM loss:

LMS−SSIM (P ) = 1−MS-SSIM(p̃) (2.10)

Mix Loss:

LMix = αLMS−SSIM + (1− α).GσM
G
.Ll1 (2.11)

2.4 Image in-painting

One of the most popular methods of reconstructing broken or damaged images is image

in-painting. The damaged, deteriorating, or missing parts of images or artworks are filled

in with the help of a neural network. In fact, machine-generated images in paints are

better at filling up the missing parts than a human artist. To learn about in-painting, we

need to discuss context encoders. It is a type of autoencoder that consists of an encoder,

bottleneck, and decoder. Its purpose is to reduce the image size ignoring the noise of that

image. Now the context encoder is a type of convolutional neural network which considers

the surrounding of an image to predict the missing parts of that image. The encoder

18


2.5. IMAGE IN-PAINTING WITH GANS

part’s duty is to try to capture the context of the image in a compact latent feature

representation. The decoder, on the other hand, uses that representation to produce the

missing image content. We feed our model with a huge dataset of images with missing

bits. There are several ways to create the blocked part of an image, it is known as region

mask. We can handle it in 3 ways;

• Central region: The central part of the image is set to zero. This is way too simple

and causes generalization.

• Random block: Instead of putting the block in middle, it is randomized. Several

overlapping squares take up-to one-fourth of the image.

• Random region: This creates sharp boundaries of the mask with arbitrary shapes

around the image.

In painting, the model consists of an encoder, a decoder that works as the generator. This

generates our desired image and tries to get better with the help of a discriminator. The

discriminator finally with the help of the sigmoid function gives us a scalar output to

decide how well the model did.

2.5 Image in-painting with GANs

All the aforementioned methods can fall under the broad domain of image restoration. But

there are papers that target image restoration as a whole instead of the smaller domains.

2.5.1 With two-step GAN

Yu et al. [7] proposed a two-step generative model that roughly in-paints in the first step

and refines the outputs in the next step. The model has that two-step GANs or gated

GANs can give great performance in image restoration domain. Beside using two back-

to-back generators, the also introduce two simultaneous discriminators. In Figure 2.4, the

19


2.5. IMAGE IN-PAINTING WITH GANS

overview of their proposed model has been described. Many other literature has developed

upon this work and has added different kinds of attention mechanisms to it.

Figure 2.4: Free-Form Image Inpainting With Gated Convolution [7]

Liu et al. [8] proposed a special type of attention layer that is semantically coherent. Their

paper uses the attention layer on top of the two-step GAN proposed by Yu et al. [7]. This

improves the model even more. This model can capture semantic information and retain

consistency across different parts of the image. The Figure 2.5 gives an illustration of the

working mechanism of their model.

Figure 2.5: Coherent Semantic Attention for Image Inpainting [8]

The issue with structural consistency still persists in these works.

20


Chapter 3

Proposed Methodology

This paper treats image restoration as an image-to-image translation problem. Isola et

al. proposes the conditional GAN (cGAN) [6] for image-to-image translation problems.

In this paper, we have used a variation of cGAN to restore images.

3.1 Architecture Overview

The core architecture of our work is based on the model proposed by Liu et al. [8]. The

model consists of two separate generators – rough network, and refinement network. These

two generators are designed to perform two distinct tasks. The model also includes two

discriminators. Inspired by Liu et al.’s work, we have also used a Coherent Semantic

Attention (CSA) block in our refinement network. Our contribution to the architecture

is the novel structure extractor sub-model and the structure loss obtained from this sub-

model.

An overview of our proposed model has been given in Figure 3.1. The input to the model

is marked with ’B Real’. After adding some sort of distortion, the image becomes ’A’.

The goal of the model would be to produce an image almost similar to ’B real’ from ’A’.

Passing ’A’ through the first u-net, ’B1 Fake’ is obtained. Conditioning ’B1 Fake’ on ’A’,

21


3.2. LOSS FUNCTIONS

we get the input for the next u-net. In this report, the first u-net is defined as Rough

Network and the second u-net is defined as Refinement Network.

Figure 3.1: Architecture overview of proposed model

3.2 Loss Functions

Let for any example case, X be the distorted image and y be the ground-truth. The goal of

the overall architecture would be to generate a generator function, such that, G(X) ≈ X ′ .

So, if L(X, y) gives the pixel-wise loss of the generated image and the original image. But

for generative models, a pixel-by-pixel loss is not sufficient and a loss calculated directly

does not ensure a good generative mechanism. So, this paper uses a combination of three

other kinds of loss functions to overcome these problems.

22


3.2. LOSS FUNCTIONS

3.2.1 Adversarial Loss and Refinement Loss

Adversarial loss is the min-max approach from game theory. This paper uses a variation

of the adversarial loss similar to that of cGAN [6]. This adversarial loss facilitates the

primary generative mechanism of the model by introducing a discriminator. BesidesX and

y, adversarial loss uses an additional noise vector z. The aim is to generate G : x, z → y.

The discriminator predicts how close the generated output is to x.

LcGAN (G,D) = Ey [logD(y)] + Ex,z [log(1−D(G(x, z)))] (3.1)

LL1(G) = Ex,y,z [y −G(x, z))] (3.2)

The objective of the adversarial loss function is:

Lr = argmin
G

max
D
LcGAN (G,D) + λLL1(G) (3.3)

3.2.2 Consistency Loss

Similar to the CSA inpainting paper [8], our model has consistency loss. This is a re-

designed form of the perceptual loss. The loss is defined as:

Lc =
∑
y∈M
||CSA(Iip)y − Φn(Igt)y||22 + ||CSAd(Iip)y − Φn(Igt)y||22 (3.4)

3.2.3 Structure Loss

The adversarial loss is sufficient in ensuring that the generator learns. But due to the pixel-

wise L1 loss in the objective, the generator learns to get every pixel correct. However,

in reality, the shape and structure of the objects are more important than pixel-wise

correctness. An image with a different brightness level and color contrast is deemed

23


3.2. LOSS FUNCTIONS

accurate if the overall structure matches. To achieve structural similarity, we have patch-

wise compared the edge map of the generated image with that of the original image.

So, besides minimizing the adversarial loss, the model has to minimize the structural

differences.

Any function, S : x → edge(x), can be used for the filter. We have tried experimenting

with first derivative filters, e.g., Sobel Filter, and second derivative filters, e.g, Laplacian

Filter. Though the high noise-sensitivity is a big issue for Laplacian Filters, the early stages

of experimentation show better result for the Laplacian Filter. Therefore our model uses

a Laplacian operator-based edge detector. [40].

∆fL = δ2f

δx2 + δ2f

δy2 + δ2f

δz2 (3.5)

This entire function can be achieved using the filter:


−1 −1 −1

−1 8 −1

−1 −1 −1



Original Image Image after Laplacian Filter

Figure 3.2: Example of Laplacian Filter

The use of a novel structure loss in addition to the adversarial loss is the primary contri-

bution of this paper. The loss function objective is minimizing the difference of input and

output structure:

Ls = 1
n

n∑
i=1

(fL(G(X))− fL(y))2 (3.6)

24


3.3. GENERATOR NETWORKS

3.2.4 Combined Loss Function

L = α(βLs + (1− β)LcGAN (G,D)) + λLL1(G) + (1− α− λ)Lc (3.7)

In Eq. 3.7, LcGAN , Lc, and Ls are refinement loss, consistency loss and structural loss

respectively. α,β, and γ determines their influence in the overall loss.

3.3 Generator Networks

Our work heavily relies on the generator network. Like most GAN architectures, our work

uses u-net in the generator network. Instead of using only one u-net, our model uses two

u-net architectures. In this report, the first u-net is defined as Rough Network and the

second u-net is defined as Refinement Network.

3.3.1 Rough Network

The rough in-painting network takes input image (distorted image) of size 3× 256× 256.

Like any generator network, this network has two portions – an encoder and a decoder.

The key structural features of our rough network are:

• Encoder: The encoder consists of 4× 4 convolution blocks. After each convolution

block, the image size halves.

• Decoder: The decoder consists of subsequent deconvolution blocks. Skip-connection

is added to each layer in the decoder portion from each corresponding encoder layer.

• Loss: L1 reconstruction loss is used to train the rough network

The Figure 3.3 shows the architecture of the rough network. In the figure, the green boxes

represent outputs of convolutional blocks and the red boxes represent the outputs of the

deconvolutional blocks. There are skip connections between the conv and deconv blocks.

25


3.3. GENERATOR NETWORKS

Figure 3.3: Rough Network

The purpose of the rough network is to generate a quick prediction. As our prediction

should be a fully restored image, the rough network generates a partially restored image

with enough patchiness and blurriness.

26


3.3. GENERATOR NETWORKS

3.3.2 Refinement Network

Similar to the rough network, the refinement network also consists of an encoder and a

decoder. The input of this network is the output of the rough network conditioned on the

original input image (distorted image). The key structural features of this layer are:

• Encoder: The encoder has numerous encoder blocks each with one 3×3 convolution

and one 4×4 dilated convolution. The 3×3 convolution keeps the spatial size of the

image unchanged but doubles the number of channels. The 4×4 dilated convolution

reduces the spatial size to half while keeping the number of channels unchanged. So,

in each subsequent encoder block in the refinement network, the image is spatially

halved and the number of channels is doubled. We have also used a CSA block in

the encoder and placed it right after the 3rd encoder block

• Decoder: The decoder of the refinement network is symmetrical to the encoder.

The decoder consists of numerous decoder blocks instead of encoder blocks and in

each block, the 4 × 4 convolution is replaced with a deconvolution operation. Like

rough networks, skip connection from the corresponding encoder layer is also used

here.

• Coherent Semantic Attention: Though image in-painting is a very difficult task,

various deep learning approaches have provided us with excellent results. The prob-

lem with most of the models is that, when there is a discontinuity of local pixels in

the image, the results tend to have blurry textures and distorted structures. Hongyu

Liu Et al. [8] proposed a different approach using a two-step process that involved

consistency loss and feature patch discriminator. Their model addressed the general

in-painting task just like a human being keeping the semantic relevance and feature

continuity in mind. The overall concept of this paper is very close to what we are

trying to achieve and helped us a lot while building our own model which aims to

achieve better results by using different discriminator models.

The Figure 3.4 shows how the consistency loss works. Each pixel in the blank portion

27


3.3. GENERATOR NETWORKS

is dependent on pixels outside the blocked region. During training, the model learns

to assign these attention values.

Figure 3.4: Consistency loss (Similar to the work of Liu et al. [8])

• Loss: The refinement network primarily uses the adversary loss and the structure

mentioned in the section 3.2.

This type of input stacks the information of the known areas to urge the network to capture

the valid features faster, which is critical for rebuilding the content of hole regions.

28


3.4. DISCRIMINATOR

3.4 Discriminator

In this report, along with using two generators, we have used two discriminators as well.

Our intuition is that one of the discriminators analyzes global correctness and the other

one analyzes local correctness. We have named the first one as Global discriminator and

the later one as Patch discriminator.

3.4.1 Patch discriminator

We have developed the patch discriminator in accordance with the patchGAN proposed

by Isola et al. [6]. The patch discriminator is trained using the adversarial loss function

mentioned in section 3.2. The discriminator is built on the first few layers of pre-trained

VGG-16 followed by three 4× 4 convolutional blocks.

The Figure 3.5 shows the architecture of patch discriminator. The discriminator is just

an image-classifier with several convolutional blocks. The output of the classifier ends few

steps prior to the convention of (1,1). In case of the discriminator proposed in this report,

the output is 14× 14. Each pixel in this output represents a portion of the original image.

A partially trained model ensures a faster convergence. Since the discriminator and the

generator learns simultaneously with the min-max process it is necessary to have a learning

balance between them.

3.4.2 Global discriminator

The global discriminator is not too different from the patch discriminator. The only

difference is the output shape of the model. Here, the output is directly 1 × 1. So, the

output gives a global perspective if the overall image is real or not.

29


3.5. ADVERSARIAL TRAINING

Figure 3.5: Patch discriminator

3.5 Adversarial training

Training a generative model is quite difficult. Conventional method of training does not

work. The training objective of such generative models are quite different from the ones

used in ordinary CNNs. The training objective used in this report is based on the condi-

tional Generative Adversarial Network.

30


Chapter 4

Experimental Setup

4.1 Environment

We performed our experiments in Google Colaboratory [41]. Google Colaboratory allows

users to execute Jupyter Notebooks and provides access to free GPUs. For a given session,

they only offer 1 GPU. The Colab provides 1 12GB Tesla K80 GPU. Tesla K80 is not

one of the fastest GPUs in the market, but this is the best we had at our disposal.

There is also a time limit on usage. One session can run for 12 hours. A user needs to

wait another 12 hours before getting the opportunity to use GPUs for training. Google

Colab allocates RAM and CPU configuration according to the session requirements. If

the RAM requirement is high, the session automatically shifts to a higher RAM. For our

experimentations, the RAM requirement was around 4GB.

We used Google Drive for storage. The Google Colab can be directly mounted on a Drive

location. This is the conventional method of running experiments with large datasets

in Google Colab. However, there was an issue with this method. As Google Drive is

in a remote server, the I/O operation between Colab and Drive is incredibly slow. The

workaround for it would be loading the entire dataset to the RAM. But since the size of

the dataset is huge, we were not able to opt for this option.

31


4.2. HYPER-PARAMETERS

Another viable option for our training was using Kaggle. Kaggle provides Tesla P100

GPU. However, Kaggle does not support easy access to Google Drive storage. There is

also a weekly usage limit. Therefore it is more troublesome to train using Kaggle. In

Figure 4.1, a performance comparison between Kaggle and Colab has been shown. The

training time mentioned in the chart is for the FastAI dataset with batch size 16. [42]

0 5 10 15

GPU Memory [GB]

Train Time [min]

Time Limit [per session]

Google Drive Storage

GPU Count

17

19.5

9

0

1

12

11.3

12

1

1 Kaggle
Colab

Figure 4.1: Performance comparison between Google Colab and Kaggle

4.2 Hyper-parameters

Due to limited computational resources, the most difficult part of training the model was

tuning the hyper-parameters. For external models, we did not tune the hyper-parameters

at all. For StructGAN, we tried to find the best options.

Since the generative adversarial network is the most important part of our model, the

decision for choosing the GAN type was also vital. After thorough experimentation, we

opted for the LSGAN [43].

Another important part of our proposal was the Structural Loss. In the calculation of

structural loss, the type of the structural element was vital. We tried experimenting with

Laplacian and Sobel structures. The Laplacian structure gave significantly better results

in comparison to Sobel structure.

32


4.3. DATASET

We used initial learning rate α = 0.002. For learning rate policy, we used Lambda decay.

We used 50 as the decay interval.

The GAN weight and Struct weight were also important determining factor for the per-

formance of the model. We used 20% GAN weight and 20% Struct weight during our

experimentations. These values can be further tuned to increase the performance of the

model.

4.3 Dataset

We used two distinct datasets for our training. Viz – COCO [44] dataset and Places2

(Val) [45] dataset.

4.3.1 Places 2 (Val)

The Places2 Database [45] was primarily developed for Places365-Challenge, an image-

recognition competition. The database consists of 6.2 million extra images in addition to

the 1.8 million images from Paces365-Standard dataset. Therefore, the total number of

train images become 8 million. In our experiments, we only used the validation set of the

database to train our models. The validation set contains 36500 images. The database is

sub-divided into 365 categories and each category contains 100 images.

4.3.2 COCO

The Common Object in Context (COCO) [44] dataset is a widely used dataset for training

deep learning models. The dataset was created by Google to aid future research for object

detection, instance segmentation, image captioning, and person key points localization.

The motivation of the COCO dataset is to check how well our model understands com-

mon objects in context. In the sector of computer vision, the COCO dataset is renowned.

It is one of the most popular datasets with 330K images (200K labeled), 1.5 million object

33


4.3. DATASET

Figure 4.2: Few samples from the Places2 (Val) dataset

instances, 80 object categories, 91 stuff categories, 250,000 people with key points. The

biggest advantage of using this dataset is its sheer size. It is one of the best image datasets

available, so it is widely used in cutting-edge image recognition artificial intelligence re-

search. It is used in open-source projects such as Facebook Research’s Detectron [46],

Matterport’s Mask R-CNN, Endernewton’s Tensorflow Faster RCNN for Object Detec-

tion, and others.

The dataset contains segmentation map, object detection annotation and many other

feature for each image. Figure 4.3 shows few examples from the COCO dataset with their

corresponding segmentation maps.

Figure 4.3: Few samples from the COCO dataset

34


4.4. PREPROCESSING

4.4 Preprocessing

Data preprocessing was essential for our work. We used random transformations to make

the model more robust. The input size of the model is batch_size × 3 × 256 × 256. The

preprocessing step involved converting the data to the appropriate size. For our model, we

used batch_size to be 1. In most generative adversarial networks, this is the convention.

For the normalization layer, we used “instance norm”.

Another important portion of data preprocessing was the addition of the noise. To main-

tain in-painting convention, we have used a center-square occlusion system. With this

system, a fixed-sized square region is cropped from the center of each image and it is

replaced with gray color.

Figure 4.4: Example of Center-square Occlusion Mask

35


Chapter 5

Result and Discussion

5.1 Sample Results

Our model performed fairly well on the Places2 dataset. Figure 5.1, Figure 5.2, and Figure

5.3 are few of the selected samples from our experiments. The outputs shown in Figure 5.1

are mostly natural scenes with less complicated details. Our model performs excellently

on this kind of images.

Sample 1 Sample 2 Sample 3 Sample 4

Figure 5.1: Sample Outputs

36


5.1. SAMPLE RESULTS

Images from Figure 5.2 are also natural scenes. However, there are many complicated

edges in these images. Our model’s performance on such images is also outstanding. Our

model has successfully captured the fine details of these natural scenes.

Sample 5 Sample 6 Sample 7 Sample 8

Sample 9 Sample 10 Sample 11 Sample 12

Figure 5.2: Sample Outputs

For images with too non-natural objects, our model does not perform so well. The samples

shown in Figure 5.3 are examples of such images. For such objects, our model has produced

blurry outputs.

37


5.2. QUANTITATIVE ANALYSIS

Sample 13 Sample 14 Sample 15 Sample 16

Figure 5.3: Sample Outputs

5.2 Quantitative Analysis

We performed two types of quantitative analyses – Ablation Study and Performance Eval-

uation over epochs.

5.2.1 Ablation Study

Table 5.1: Quantitative Analysis of our proposed model (StructGAN).

Gen 1 Gen 2 Disc 1 Disc 2 Struct SSIM PSNR MSE
Pix2pix [6] unet_256 - basic - - 0.831 31.72 0.114
CSA [8] unet_256 u_csa basic feature - 0.984 69.31 0.079

Struct_Pix2pix u_struct - basic - Laplacian 0.852 30.12 0.127
Struct_CSA unet_256 u_struct basic feature Sobel 0.91 41.92 0.131
StructGAN unet_256 u_struct basic feature Laplacian 0.997 64.77 0.081

The Table 5.1 shows a comparative study of StructGAN with other image in-painting

models. The experiments were conducted in our experimental setup. Therefore the scores

for the external models can increase if better hyper-parameters are used during training.

The four other relevant models chosen for the ablation study were: Pix2pix [6], CSA

[8], Struct_Pix2pix and Struct_CSA. The Struct_Pix2pix was formed by just adding a

38


5.2. QUANTITATIVE ANALYSIS

Laplacian structural element with the Pix2pix architecture. Similarly, the Struct_CSA

was formed by adding a Sobel structural element with the CSA architecture. The models

had varying number of generator and discriminator units which was helpful for the ablation

study as the effect of having multiple units of generators and/or discriminators could also

be sensed.

The model with our proposed novelty produced the best SSIM value of 0.997 in comparison

with the other models (0.984, 0.91, 0.852 and 0.831). This indicated that the structural

consistency is being maintained in our produced images as the Structural Similarity Index

Metric (SSIM) is very good at finding the structural similarity between two images. Our

model is slightly lagging behind when it comes to the PSNR and MSE values compared

to the other ones. We hope that these values will get better with more training. However,

getting the highest SSIM value was a huge success for us as this metric independently

focuses on the structural consistency between two images, which was our main motivation

behind this research.

5.2.2 Performance Evaluation

Judging the performance of the model from the produced images can be highly subjective.

We did a bit of background study about the metrics that can be used for this purpose. After

a careful selection, we selected three widely used quantitative metrics for our evaluation.

The metrics were Structural Similarity Index Metric (SSIM), Peak Signal-to-Noise Ratio

(PSNR) and Mean Squared Error (MSE). We wrote independent scripts using Python to

find the values of these metrics between images.

The core metric that we have worked on is the Similarity Index Metric (SSIM). The reason

behind this was, this metric is very good at finding the structural similarities between two

images. The higher the SSIM value, the more the structural consistency between two

images. Our model has produced a value of 0.997 which is better than all the other

39


5.2. QUANTITATIVE ANALYSIS

models.

Figure 5.4 shows a Number of steps vs SSIM value graph. As we can see from the graph,

the SSIM value kept on increasing as the number of steps increased, This shows that our

model got better and better in producing more structurally consistent images over time.

The Peak Signal-to-Noise Ration (PSNR) is a metric which is used to measure the level

Figure 5.4: SSIM Score of StructGAN

of noise removed from the output image when compared to the input. It is an excellent

metric when it comes to Noise Removal tasks. Our model produced a slightly less PSNR

value (64.77) in comparison with the CSA model. We hope that this value will get better

with more time and some fine tuning.

Figure 5.5 shows a Number of steps vs PSNR value graph. As we can see from the graph,

the PSNR value increased over time which shows the model got better in producing images

with less noise compared to the input.

The third and final metric is one of the most common metrics used in the domain of

Computer Vision, Mean Squared Error (MSE). It takes the average of squared differences

among the pixel values of the input and output images. Our model has produced a MSE

value of 0.081 which will decrease as we train it more.

40


5.2. QUANTITATIVE ANALYSIS

Figure 5.5: PSNR Score of StructGAN

Figure 5.6 shows a Number of steps vs MSE value graph. As we can see from the graph,

the MSE value decreased over time which shows the model got better in producing images

with less mean squared error which was expected.

Figure 5.6: MSE Score of StructGAN

41


5.3. QUALITATIVE ANALYSIS

5.3 Qualitative Analysis

Image reconstruction is not a yes-no task. Therefore, off-the-shelf metrics like SSIM,

PSNR, etc. are not able to capture the overall quality of the model. For this reason, it is

essential to perform qualitative analysis as well.

The sample outputs from Figure 5.1, Figure 5.2 are excellent and most of the images

are almost perfect. These are the general conditions of our generated images that include

scenic views. So, we can say, the model generates great results for natural scenes that

have low detail.

The samples from Figure 5.3 are not as good as the previous ones. The original images

of these samples had much more details. Our model failed to generate such details. The

reason for this failure is the lack of information in the surrounding. The texts, or the

human face structure that went missing in these generated outputs are very complex. To

have a model that can generate outputs with such details, we need to train the model on

one particular object. For face generation, we could train the model on faces dataset.

5.4 Discussion

In light of Section 5.2 and Section 5.2, we can say that our proposed model performs fairly

well. Considering the complexity of the task and despite the computational limitation

during the training period, the model has learned to generate images of excellent quality

that look almost real to normal eyes.

Our intuition behind structure loss was that the model will learn to understand image

structure of a region from its surrounding regions. Our model has a very high SSIM score

which indicates to the fact that our model is successfully understanding the structural

consistency. However, our model under-performs in case of PSNR. The reason for it is,

our model is not able to generate sharp edges.

42


Chapter 6

Conclusion

6.1 Summary

Our research aimed at maintaining the structural consistency of an image while restoration

using a novel approach. The existing methods of image in painting created blurry textures

while filling in the missing pixel values. Our model was motivated from the cognitive

behavior of a human being while restoring the image.

For maintaining the structural consistency, we proposed a novel loss function, which uses

a Laplacian structural element along with two other losses (Consistency Loss and Refine-

ment Loss), to achieve its goal. The overall architecture consists of two generator networks

(Rough Network and Refinement Network) and two discriminator networks (Patch Dis-

criminator and Structure Discriminator).

Judging the efficiency of the model from just the qualitative results can be a cumbersome.

The quality of in-painted output images will highly vary from person to person. Finding

a good quantitative metric to perfectly judge the performance of our model was difficult.

Studying the previous researches in this sub-domain, we have selected three metrics to

perform the quantitative analysis; Structural Similarity Index Metric (SSIM), Peak Signal-

to-Noise Ratio (PSNR) and Mean Squared Error (MSE). We have performed an ablation

43


6.2. FUTURE WORKS

study including a total of 5 relevant models. So far, the model with our proposed novelty

and architecture has outperformed the other models. It could produce a SSIM value of

0.997 which is the best SSIM value when compared to the other models. However, the

model could not perform as good as the CSA model in terms of the PSNR and MSE

values. We really hope that the values will keep on getting better as we keep on training

the model.

Given the complexity of our model, finding sufficient computational resources was a con-

stant challenge for us. We have used Google Colaboratory for the whole implementation.

We are still training our model and the results are becoming satisfactory with time. Since

the main goal of our research was to produce synthetic images keeping the structural con-

sistency in focus, we can say that, with the current SSIM value of 0.997, the research goal

has been successfully reached.

6.2 Future Works

We have developed a custom script using Python that can produce realistic noisy images.

The script incorporates various kinds of noise. The script randomly chooses one or more

of the following types of noise to prepare the image.

• Doodle noise: A doodle is a rough line drawn with a brush which generally replaces

the original pixels with a solid color. This noise is important for our research because

in many cases, old photos have similar noises where a certain portion of the image

is destroyed because of folding. Doodle creates a similar effect and helps our model

to learn how to generate the pixels lost due to the doodle noise.

While running, the algorithm mainly selects three things randomly: the brush size,

the doodle length and the doodle color. More attributes (the starting point, the

ending point, whether the doodles will be connected or not etc.) are flexible and can

be changed as needed.

• Salt and pepper noise

44


6.2. FUTURE WORKS

• Gaussian noise

• Poisson noise [47]

The algorithm assigns different priorities to each noise. We have not incorporated this

noise script into our training phase due to the lack of a global benchmark. However,

we have tried to show output of our model for these types of noises. The output was

underwhelming because the model was not trained on such type of noise. In future, we

would like to develop a new benchmark for our doodle noise. We would then train the

model to maximize the accuracy for the benchmark.

We would also like to develop a custom metric that best describes correctness for a realistic

broken image. This new metric can be designed based on coherent correctness of the image

itself. This metric needs a lot of further study.

45


References

[1] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in Proceedings

of the IEEE international conference on computer vision, 2017, pp. 5439–5448.

[2] X. Li, G. Hu, J. Zhu, W. Zuo, M. Wang, and L. Zhang, “Learning symmetry consistent

deep cnns for face completion,” IEEE Transactions on Image Processing, vol. 29, pp.

7641–7655, 2020.

[3] “Wikipedia: Total variation denoising,” https://en.wikipedia.org/wiki/Total_

variation_denoising, accessed: 2020-09-28.

[4] “Before and after comparisons of adobe’s amazing im-

age deblurring feature,” https://www.slrlounge.com/

zoom-and-enhance-google-brain-super-resolution-tech-make-tv-trope-a-reality/,

accessed: 2020-09-28.

[5] “Learning generative adversarial networks (gans),” https://laptrinhx.com/

learning-generative-adversarial-networks-gans-2910834212/, accessed: 2021-03-

01.

[6] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with

conditional adversarial networks,” in Proceedings of the IEEE conference on computer

vision and pattern recognition, 2017, pp. 1125–1134.

46

https://en.wikipedia.org/wiki/Total_variation_denoising
https://en.wikipedia.org/wiki/Total_variation_denoising
https://www.slrlounge.com/zoom-and-enhance-google-brain-super-resolution-tech-make-tv-trope-a-reality/
https://www.slrlounge.com/zoom-and-enhance-google-brain-super-resolution-tech-make-tv-trope-a-reality/
https://laptrinhx.com/learning-generative-adversarial-networks-gans-2910834212/
https://laptrinhx.com/learning-generative-adversarial-networks-gans-2910834212/


[7] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting

with contextual attention,” in Proceedings of the IEEE conference on computer vision

and pattern recognition, 2018, pp. 5505–5514.

[8] H. Liu, B. Jiang, Y. Xiao, and C. Yang, “Coherent semantic attention for image in-

painting,” in Proceedings of the IEEE International Conference on Computer Vision,

2019, pp. 4170–4179.

[9] J. G. Schavemaker, M. J. Reinders, J. J. Gerbrands, and E. Backer, “Image sharpening

by morphological filtering,” Pattern Recognition, vol. 33, no. 6, pp. 997–1012, 2000.

[10] A. Polesel, G. Ramponi, and V. J. Mathews, “Image enhancement via adaptive un-

sharp masking,” IEEE transactions on image processing, vol. 9, no. 3, pp. 505–510,

2000.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,

A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural

information processing systems, 2014, pp. 2672–2680.

[12] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation

using cycle-consistent adversarial networks,” in Proceedings of the IEEE international

conference on computer vision, 2017, pp. 2223–2232.

[13] J. Zhang, D. Zhao, R. Xiong, S. Ma, and W. Gao, “Image restoration using joint

statistical modeling in a space-transform domain,” IEEE Transactions on Circuits

and Systems for Video Technology, vol. 24, no. 6, pp. 915–928, 2014.

[14] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for

image restoration,” in Proceedings of the IEEE international conference on computer

vision, 2017, pp. 4539–4547.

[15] M. K. Mihcak, I. Kozintsev, K. Ramchandran, and P. Moulin, “Low-complexity image

denoising based on statistical modeling of wavelet coefficients,” IEEE Signal Process-

ing Letters, vol. 6, no. 12, pp. 300–303, 1999.

47


[16] M. K. Mihcak, I. Kozintsev, and K. Ramchandran, “Spatially adaptive statistical

modeling of wavelet image coefficients and its application to denoising,” in 1999 IEEE

International Conference on Acoustics, Speech, and Signal Processing. Proceedings.

ICASSP99 (Cat. No. 99CH36258), vol. 6. IEEE, 1999, pp. 3253–3256.

[17] G. Fan and X.-G. Xia, “Wavelet-based statistical image processing using hidden

markov tree model,” in Proc. 34th Annual Conference on Information Sciences and

Systems, Princeton, NJ, USA, 2000.

[18] A. Pizurica, W. Philips, I. Lemahieu, and M. Acheroy, “A joint inter-and intrascale

statistical model for bayesian wavelet based image denoising,” IEEE Transactions on

Image Processing, vol. 11, no. 5, pp. 545–557, 2002.

[19] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in

2005 IEEE Computer Society Conference on Computer Vision and Pattern Recogni-

tion (CVPR’05), vol. 2. IEEE, 2005, pp. 60–65.

[20] S. Lefkimmiatis, “Non-local color image denoising with convolutional neural net-

works,” in Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, 2017, pp. 3587–3596.

[21] C.-A. Deledalle, J. Salmon, A. S. Dalalyan et al., “Image denoising with patch based

pca: local versus global.” in BMVC, vol. 81, 2011, pp. 425–455.

[22] J. Wang, Y. Guo, Y. Ying, Y. Liu, and Q. Peng, “Fast non-local algorithm for image

denoising,” in 2006 International Conference on Image Processing. IEEE, 2006, pp.

1429–1432.

[23] T. Chen, K.-K. Ma, and L.-H. Chen, “Tri-state median filter for image denoising,”

IEEE Transactions on Image processing, vol. 8, no. 12, pp. 1834–1838, 1999.

[24] M. Zhang and B. K. Gunturk, “Multiresolution bilateral filtering for image denoising,”

IEEE Transactions on image processing, vol. 17, no. 12, pp. 2324–2333, 2008.

48


[25] S. M. LoPresto, K. Ramchandran, and M. T. Orchard, “Image coding based on mix-

ture modeling of wavelet coefficients and a fast estimation-quantization framework,”

in Proceedings DCC’97. Data Compression Conference. IEEE, 1997, pp. 221–230.

[26] N. R. Goodman, “Statistical analysis based on a certain multivariate complex gaussian

distribution (an introduction),” The Annals of mathematical statistics, vol. 34, no. 1,

pp. 152–177, 1963.

[27] S. Cho and S. Lee, “Fast motion deblurring,” in ACM SIGGRAPH Asia 2009 papers,

2009, pp. 1–8.

[28] L. Xu and J. Jia, “Two-phase kernel estimation for robust motion deblurring,” in

European conference on computer vision. Springer, 2010, pp. 157–170.

[29] S. K. Nayar and M. Ben-Ezra, “Motion-based motion deblurring,” IEEE transactions

on pattern analysis and machine intelligence, vol. 26, no. 6, pp. 689–698, 2004.

[30] J. Biemond, R. L. Lagendijk, and R. M. Mersereau, “Iterative methods for image

deblurring,” Proceedings of the IEEE, vol. 78, no. 5, pp. 856–883, 1990.

[31] P. C. Hansen, J. G. Nagy, and D. P. O’leary, Deblurring images: matrices, spectra,

and filtering. SIAM, 2006.

[32] Q. Shan, J. Jia, and A. Agarwala, “High-quality motion deblurring from a single

image,” Acm transactions on graphics (tog), vol. 27, no. 3, pp. 1–10, 2008.

[33] B. Sahu, “Towards data science: An evolution in single image su-

per resolution using deep learning,” https://towardsdatascience.com/

an-evolution-in-single-image-super-resolution-using-deep-learning-66f0adfb2d6b,

2019, accessed: 2020-09-29.

[34] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic

segmentation,” in Proceedings of the IEEE conference on computer vision and pattern

recognition, 2015, pp. 3431–3440.

49

https://towardsdatascience.com/an-evolution-in-single-image-super-resolution-using-deep-learning-66f0adfb2d6b
https://towardsdatascience.com/an-evolution-in-single-image-super-resolution-using-deep-learning-66f0adfb2d6b


[35] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convo-

lutional networks,” IEEE transactions on pattern analysis and machine intelligence,

vol. 38, no. 2, pp. 295–307, 2015.

[36] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken,

A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using

a generative adversarial network,” in Proceedings of the IEEE conference on computer

vision and pattern recognition, 2017, pp. 4681–4690.

[37] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer

and super-resolution,” in European conference on computer vision. Springer, 2016,

pp. 694–711.

[38] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy,

“Esrgan: Enhanced super-resolution generative adversarial networks,” in Proceedings

of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

[39] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with

neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp.

47–57, 2016.

[40] X. Wang, “Laplacian operator-based edge detectors,” IEEE transactions on pattern

analysis and machine intelligence, vol. 29, no. 5, pp. 886–890, 2007.

[41] “Google colaboratory,” https://colab.research.google.com/, accessed: 2010-09-30.

[42] “Kaggle vs. colab faceoff — which free gpu provider is tops?” https://

towardsdatascience.com/kaggle-vs-colab-faceoff-which-free-gpu-provider-is-tops-d4f0cd625029,

accessed: 2020-01-02.

[43] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares

generative adversarial networks,” in Proceedings of the IEEE international conference

on computer vision, 2017, pp. 2794–2802.

50

https://colab.research.google.com/
https://towardsdatascience.com/kaggle-vs-colab-faceoff-which-free-gpu-provider-is-tops-d4f0cd625029
https://towardsdatascience.com/kaggle-vs-colab-faceoff-which-free-gpu-provider-is-tops-d4f0cd625029


[44] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and

C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference

on computer vision. Springer, 2014, pp. 740–755.

[45] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, “Places: An image

database for deep scene understanding,” arXiv preprint arXiv:1610.02055, 2016.

[46] A. Joulin and F. Paris, “Facebook ai research,” Learning Visual Features from Large

Weakly Supervised Data, 2015.

[47] “Poisson noise - wikipedia,” https://en.wikipedia.org/wiki/Shot_noise, accessed:

2010-09-30.

51

https://en.wikipedia.org/wiki/Shot_noise

	Abstract
	Acknowledgement
	Contents
	List of Figures
	List of Tables
	Introduction
	Motivation
	Super-resolution
	In-painting
	Denoising
	Deblurring

	Problem Statement
	Objectives
	Contributions
	Organization of the thesis

	Literature Review
	Generative Models
	Generative Adversarial Network
	Image-to-image translation

	Restoration Tasks and Methods
	Statistical Modeling
	Persistent Memory Modeling (PMM)
	Denoising
	Deblurring
	Super Resolution

	Loss Functions and Similarity Metrics
	Image in-painting
	Image in-painting with GANs
	With two-step GAN


	Proposed Methodology
	Architecture Overview
	Loss Functions
	Adversarial Loss and Refinement Loss
	Consistency Loss
	Structure Loss
	Combined Loss Function

	Generator Networks
	Rough Network
	Refinement Network

	Discriminator
	Patch discriminator
	Global discriminator

	Adversarial training

	Experimental Setup
	Environment
	Hyper-parameters
	Dataset
	Places 2 (Val)
	COCO

	Preprocessing

	Result and Discussion
	Sample Results
	Quantitative Analysis
	Ablation Study
	Performance Evaluation

	Qualitative Analysis
	Discussion

	Conclusion
	Summary
	Future Works

	References