
Detection of Cluster of Buildings in Aerial Image

Thesis submitted in partial fulfilment of the requirements for the
Degree of Master of Technology
in Computer Science and Engineering
in the Department of Computer Science and Engineering
University of Kalyani

by

Piyal Roy

Under the supervision of


Prof. Priya Ranjan Sinha Mahapatra

Department of Computer Science and Engineering


University of Kalyani
Kalyani - 741235, India
July 2021
CERTIFICATE

This is to certify that the thesis entitled “Detection of Cluster of Buildings in Aerial Image”, submitted by Piyal Roy (Roll No. 90/CSE/190005, Registration No. 100140 of 2019-2020) to the University of Kalyani, is a record of bona fide project work carried out under my supervision in partial fulfilment of the requirements for the Degree of M.Tech in Computer Science and Engineering, and is worthy of consideration.

Signature of Supervisor Signature of Head of Department


Prof. Priya Ranjan Sinha Mahapatra Dr. Anirban Mukhopadhyay
Professor Professor and Head
Dept. of Computer Science and Engineering Dept. of Computer Science and Engineering
University of Kalyani University of Kalyani
Kalyani, Nadia Kalyani, Nadia

Date:
Place: Kalyani

ACKNOWLEDGEMENT

I would like to take this opportunity to record my deep sense of gratitude to my


supervisor and guide, Prof. Priya Ranjan Sinha Mahapatra of the Department of Computer Science and Engineering, University of Kalyani, for his supervision, advice, guidance, and constant inspiration throughout the period of my work. I am extremely thankful and humbled to have him as my mentor and guide. I am very much grateful to him for his role as a continuous mentor in detecting and rectifying faults in all of my efforts.
I am also very much thankful to Prof. Anirban Mukhopadhyay, Prof. Utpal
Biswas, Prof. Jyotsna Kumar Mandal, Prof. Kalyani Mali, Dr. Debabrata Sarddar
and Mr. Sukanta Majumder. If I have forgotten anyone, I humbly request their
forgiveness.
I would like to show my heartfelt thankfulness to Prof. Sourav Saha and Mr.
Shambo Chatterjee for their help, support, inspiration and words of wisdom.
A special thanks to my family.
Finally, I am grateful to my friends and classmates for inspiring me, guiding
me, helping me whenever they could and for helping me through all the difficulties.

Date: Piyal Roy


Place: Kalyani

ABSTRACT

Automatic detection of clusters of buildings in aerial images has significant consequences for an extensive scope of applications. One of the most important applications is facility location, where the clusters of buildings are fed as input locations for different facility problems. Furthermore, the clusters of buildings can also be used for the detection of individual buildings, the nature of the buildings, shadow information, the height and area of the buildings, and so on. A model based on the Matterport Mask R-CNN is proposed and implemented in this thesis. The proposed technique is applied to 3 different variations of images, namely RGB images, Canny edge, and Local Binary Pattern representations. From the performance evaluation, it is observed that the model gives the best result (81%) for the RGB images.

Contents

CERTIFICATE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 INTRODUCTION 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background Study & Related Work 4


2.1 Background Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Aerial Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 RGB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Gaussian Blur . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.4 Canny Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.5 Local Binary Pattern . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.6 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Proposed Methodology 16
3.1 Overview of Proposed Methodology . . . . . . . . . . . . . . . . . . . 16
3.2 Detailed Proposed Methodology . . . . . . . . . . . . . . . . . . . . . 17

4 Experimental Result & Analysis 21
4.1 Data Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.1 Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.1.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Processing Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3.1 Quantitative Assessment . . . . . . . . . . . . . . . . . . . . . 27
4.3.2 Qualitative Assessment . . . . . . . . . . . . . . . . . . . . . . 28
4.4 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Conclusion & Future Work 34


5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

References 35

List of Figures

2.1 Determining scale of an aerial image. . . . . . . . . . . . . . . . . . . 5


2.2 RGB Cube. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 (a) Original Aerial Image. (b) Gaussian Blurred Image. . . . . . . . . 7
2.4 (a) Original Aerial Image. (b) Canny Edge Representation. . . . . . . 8
2.5 An example of LBP computation. . . . . . . . . . . . . . . . . . . . . 8
2.6 (a) Original Aerial Image. (b) LBP Representation. . . . . . . . . . . 9
2.7 Workflow of Mask R-CNN. . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Building Block of Residual Learning. . . . . . . . . . . . . . . . . . . 12
2.9 ResNet 101 Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 Flowchart of the Proposed Methodology. . . . . . . . . . . . . . . . . 17


3.2 Detailed Flowchart of the Proposed Methodology. . . . . . . . . . . . 18

4.1 Mask R-CNN architecture implementation overview. . . . . . . . . . 25


4.2 Confusion Matrix for Image Segmentation. . . . . . . . . . . . . . . . 26
4.3 (a) Train Loss for RGB images. (b) Validation Loss for RGB images. 31
4.4 (a) Train Loss for Canny images. (b) Validation Loss for Canny images. 32
4.5 (a) Train Loss for LBP images. (b) Validation Loss for LBP images. . 32
4.6 Training Time Per epoch. . . . . . . . . . . . . . . . . . . . . . . . . 33

List of Tables

4.1 Sample Building Dominant Images and Corresponding Ground Truth


Masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Sample Non-Building Images and Corresponding Ground Truth Masks. 23
4.3 Data Size for Processes. . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.4 Confusion Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.5 Performance Evaluation Results. . . . . . . . . . . . . . . . . . . . . . 27
4.6 Sample Result for RGB images. . . . . . . . . . . . . . . . . . . . . . 28
4.7 Sample Result for Canny images. . . . . . . . . . . . . . . . . . . . . 29
4.8 Sample Result for LBP images. . . . . . . . . . . . . . . . . . . . . . 30

List of Algorithms

1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 Postprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Chapter 1

INTRODUCTION

1.1 Motivation
Object detection in aerial images has been a major research topic in the field of
computer vision for several years. The edge or line extraction problem is one of the
key challenges that drives much of the research in this field. Another important one
is the segmentation problem, which amounts to finding the desired object and
separating it from the background in the presence of distractions. These distractions
are caused by features such as exterior markings, vegetation, shadows, and highlights.
Automatic detection of clusters of buildings from aerial images is an important task in
a wide range of applications. One of the most important applications is facility
location, where the clusters are fed as input locations for different facility
algorithms. Other important applications of cluster-of-buildings detection include
generating computerized maps, urban planning, real estate management, surveying
land use, updating GIS databases, and local area route planning [8]. The clusters of
buildings can further be used for individual rooftop detection. Precise identification
and localization of rooftops in urban images is a vital step in regional planning and
city modeling. Likewise, information on the area, profile, and density of buildings
can be extremely helpful in assessing the distribution of a city's population. Specifically,
building detection can be utilized for different facility location projects such
as planting solar panels on rooftops, placing Covid-19 vaccination centers, placing
disaster relief camps, and so on.
Individual building detection from aerial images can be very difficult. One reason
is that aerial images usually differ in resolution and brightness. Another reason is
that rooftops may have distinct and intricate shapes and structures that can easily
be confused with similar objects such as cars, roads, and patios. Therefore, this
project aims to detect the clusters of buildings in aerial images, which can reduce
the complexity of individual building detection.

1.2 Problem Formulation


Many researchers have suggested various building or rooftop segmentation models
using the available technologies. The initial systems were built with conventional
techniques such as edge detection filters and other mathematical methods.
Nowadays, machine learning approaches have become the dominant technique in this
field of research. Most researchers have focused on detecting individual rooftops
from the image, but comparatively little work addresses the detection of clusters of
buildings in aerial images. This project therefore focuses on the detection of clusters
of buildings in aerial images. If this segmented result is used to further detect the
individual buildings, it can minimize the complexity of detecting them, because
most of the non-building regions are already eliminated in the segmented image.

1.3 Objective
The objective of this thesis is to develop a robust, reliable algorithm capable of
detecting clusters of buildings in aerial images in the presence of distracting features
such as surface markings, vegetation, shadows, and highlights.

1.4 Organization of the Thesis


The organization of the thesis is as follows:
Chapter 2 provides the theoretical background and work related to cluster-of-buildings
segmentation from aerial images.
Next, the design of the proposed methodology is given in Chapter 3.
The subsequent Chapter 4 presents the necessary steps for the data and processing
setup, the detailed criteria for performance evaluation, and an analysis of the
performance of the proposed methodology.
Finally, in Chapter 5, conclusions and future work are presented.

Chapter 2

Background Study & Related


Work

2.1 Background Study

2.1.1 Aerial Image

An aerial image is any image or photograph taken from an aircraft or other flying
object¹. Normally, this type of photo is captured obliquely or vertically from drones,
fixed-wing aircraft, helicopters, balloons, blimps and dirigibles, rockets, satellites,
kites, parachutes, and stand-alone telescoping or vehicle-mounted poles². More than
one aerial image of the same area may look different because of the type of film,
scale, and overlap used. Basic concepts of aerial photography are given below:

• Film: Mostly black-and-white film is used; however, color, infrared, and false-
color infrared films are also used in special cases.

• Focal length: It is the distance between the middle of the camera lens and
the focal plane. It should be measured precisely when the camera is calibrated.
Focal length and image distortion are inversely proportional.

• Scale: It is the ratio of the distance between two points on an image to the
actual distance between the same two points on the ground (i.e., 1 unit on the
photo equals "a" units on the ground). If a 1 km stretch of roadway covers 5
cm on an air photo, the scale is calculated as follows:

SCALE = PHOTO DISTANCE / GROUND DISTANCE = 5 cm / 1 km = 5 cm / 100000 cm = 1/20000

Another method used to determine the scale of an aerial image is to find the
ratio between the focal length of the camera and the altitude of the plane above the
ground being photographed.

Figure 2.1: Determining scale of an aerial image. (Source: https://www.nrcan.gc.ca/sites/www.nrcan.gc.ca/files/earthsciences/images/photos101/images/E_T1609_image2.jpg)

If the focal length of a camera is 156 mm, and the altitude of the plane Above
Ground Level (AGL) is 7800 m, using the same equation as above, the scale is
calculated as follows:

SCALE = FOCAL LENGTH / ALTITUDE (AGL) = 156 mm / 7800 m = 156 mm / 7800000 mm = 1/50000

¹ https://www.nrcan.gc.ca/maps-tools-publications/satellite-imagery-air-photos/air-photos/national-air-photo-library/about-aerial-photography/concepts-aerial-photography/9687
² https://en.wikipedia.org/wiki/Aerial_photography

2.1.2 RGB

In the RGB color model, each color appears in terms of its primary components
(Red, Green, Blue). This color model is based on a Cartesian coordinate system, as
shown in Figure 2.2. Individual values for the red, green, and blue channels are
stored in the RGB color model, and in any color space based on it, the three primary
components are added together to generate colors ranging from completely black to
completely white.

Figure 2.2: RGB Cube. (Source: https://www.dynamsoft.com/blog/wp-content/uploads/2019/05/Group-3.png)

Different cameras produce different color image data when scanning the same image,
and different monitors produce different color display results when rendering the
same image. This is because the RGB color space is tied to the device operating on
it. Many different color spaces are derived from this color model; standard RGB
(sRGB) is one of them.

2.1.3 Gaussian Blur

The Gaussian blur filter is a nonuniform low-pass filter. It is typically achieved by
convolving an image with a Gaussian kernel (see https://www.sciencedirect.com/topics/engineering/gaussian-blur). The Gaussian kernel in 2-D form is given below:

G_2D(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))

Here, x and y are the image pixel locations and σ is the standard deviation of the
distribution. It determines the variance around the mean of the Gaussian
distribution, which defines the extent of the blurring effect around a pixel.
Gaussian blur is applied to reduce the amount of noise and remove speckles
within the image. It is important to remove the very high-frequency components that
exceed those associated with the gradient filter used; otherwise, they may cause false
edges to be detected.

Figure 2.3: (a) Original Aerial Image. (b) Gaussian Blurred Image.
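In practice, this step is a single OpenCV call. The following is a minimal sketch; the file name, kernel size, and sigma are illustrative assumptions rather than the thesis's exact settings.

    # Minimal Gaussian blur sketch with OpenCV.
    import cv2

    image = cv2.imread("aerial.png")              # hypothetical input tile
    # 5x5 kernel; sigma=0 tells OpenCV to derive sigma from the kernel size
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    cv2.imwrite("aerial_blurred.png", blurred)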

2.1.4 Canny Edge

The first step in using the Canny operator is to smooth the input image by applying a
Gaussian blur. Then a simple 2-D first-derivative operator is applied to the smoothed
image to highlight regions of the image with high first spatial derivatives (see
https://homepages.inf.ed.ac.uk/rbf/HIPR2/canny.htm). In the gradient magnitude
image, edges give rise to ridges. The next step is to track along the top of these
ridges and set to zero all pixels that are not actually on a ridge. This process is
known as non-maximal suppression, and it produces a thin line as output. The
tracking process uses hysteresis guided by two thresholds, T1 and T2 (T1 > T2).
Tracking starts from a point on a ridge that is higher than T1 and continues in both
directions from that point until the height of the ridge falls below T2. This hysteresis
ensures that noisy edges are not broken up into multiple edge fragments [4].

Figure 2.4: (a) Original Aerial Image. (b) Canny Edge Representation.
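The whole pipeline above (smoothing, gradient computation, non-maximal suppression, and hysteresis tracking) is available through OpenCV. A minimal sketch follows; the two hysteresis threshold values are illustrative, since the thesis does not list the exact values it used.

    # Minimal Canny sketch with OpenCV. cv2.Canny internally performs the
    # derivative, non-maximal suppression, and hysteresis tracking steps.
    import cv2

    image = cv2.imread("aerial.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input tile
    blurred = cv2.GaussianBlur(image, (5, 5), 0)            # smooth before edge detection
    edges = cv2.Canny(blurred, 50, 150)                     # 50 = lower threshold (T2), 150 = upper (T1)
    cv2.imwrite("aerial_canny.png", edges)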

2.1.5 Local Binary Pattern

Local Binary Pattern (LBP) is an effective texture descriptor operator. It labels the
pixels of an image by thresholding the neighborhood of each pixel with the center
pixel value and returns the result as a binary number [2]. Figure 2.5 shows the LBP
computation process, where the notation (P, R) means P sampling points on a circle
of radius R.

Figure 2.5: An example of LBP computation. (Source: http://www.scholarpedia.org/w/images/7/77/LBP.jpg)

Figure 2.6: (a) Original Aerial Image. (b) LBP Representation.
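A minimal sketch of computing the LBP representation with scikit-image is given below; the (P, R) values follow the notation of Figure 2.5 but are illustrative choices, not necessarily the ones used in the thesis.

    # Minimal LBP sketch using scikit-image's implementation.
    import cv2
    from skimage.feature import local_binary_pattern

    image = cv2.imread("aerial.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input tile
    P, R = 8, 1                          # P sampling points on a circle of radius R
    lbp = local_binary_pattern(image, P, R, method="default")
    cv2.imwrite("aerial_lbp.png", lbp.astype("uint8"))      # values fit in uint8 for P=8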

2.1.6 Mask R-CNN

Transfer learning is an approach in which previously gained knowledge is used to
solve a related problem. In the context of deep learning, transfer learning refers to a
training procedure that commences from a pre-trained model rather than training
from scratch on a dataset.

Figure 2.7: Workflow of Mask R-CNN.
In the adopted methodology, a pre-trained ResNet 101 model is used as the backbone.
Given an input image, it extracts a feature matrix, which is then given as input to
the RPN (Region Proposal Network). During training, the sizes of the anchor boxes
are set based on the object size, so that an anchor box is capable of containing an
object.
When a feature map is given to the RPN, the first convolutional network scans the
feature matrix using the sliding-window method. After running through the entire
image, it generates proposed regions, i.e., regions that are likely to contain objects.
Since the implemented method performs a selective search, once the first
convolutional layer has finished, only the proposed regions are passed to the other two
convolutional layers within the RPN. The convolutional network with 2×9 output
channels acts as the classifier, i.e., it identifies whether a region is foreground or not,
and the other network with 4×9 output channels acts as the regressor, putting a
bounding box around the object(s) through anchor boxes.
Next, the different-sized feature maps of the regions are fed to the ROIAlign layer,
which converts them to the same size.
Finally, the feature maps are passed through 2 consecutive FCNs, which are
responsible for generating binary masks. Hence the classifier classifies the object,
the regressor puts a bounding box around the object, and the mask head generates
instance masks around the object.

ResNet 101

ResNet 101 is a deep neural network that uses skip connections, which mitigate the
vanishing gradient problem. The 101 indicates that a total of 101 layers are used to
build the network.

Figure 2.8: Building Block of Residual Learning.

In Figure 2.8, x is the input and F(x) is the residual learned by the stacked layers,
so the prediction can be written as y = F(x) + x.
The main idea behind ResNet is that, when the optimal mapping is close to the
identity, the prediction and the input should be the same. Other deep neural
networks learn from the prediction output and adjust the weights of all layers through
backpropagation. ResNet instead learns from the difference between the input and
the prediction, i.e., the residual, and aims to drive F(x) towards 0. The basic
architecture of ResNet 101 is illustrated in Figure 2.9.

Figure 2.9: ResNet 101 Architecture.

These skip connections can be used directly when the input and output shapes are
the same. When the input and output shapes differ, one of the following two options
can be used (a minimal code sketch of the residual block follows this list):

1. Identity mapping is performed with extra zero entries padded for the increased
dimensions. This option introduces no extra parameters.

2. A linear projection is performed by the shortcut connection to match the
dimensions.
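As a concrete illustration, the following is a minimal Keras sketch of the bottleneck residual block of Figure 2.8 with an identity shortcut; the filter counts are illustrative, not the exact ResNet 101 configuration.

    # Minimal sketch of a bottleneck residual block (Figure 2.8) in Keras,
    # the framework used by the Matterport Mask R-CNN implementation [1].
    from tensorflow.keras import layers

    def identity_block(x, filters):
        """Identity-shortcut block; assumes x already has 4*filters channels."""
        shortcut = x                                          # the skip connection
        y = layers.Conv2D(filters, 1)(x)                      # 1x1 reduce
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv2D(filters, 3, padding="same")(y)      # 3x3 conv
        y = layers.BatchNormalization()(y)
        y = layers.ReLU()(y)
        y = layers.Conv2D(4 * filters, 1)(y)                  # 1x1 expand
        y = layers.BatchNormalization()(y)
        y = layers.Add()([y, shortcut])                       # F(x) + x
        return layers.ReLU()(y)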

2.2 Related Work
Research related to aerial image processing is extensive and quickly growing, which
makes it challenging to provide a comprehensive overview of the area. In this section,
some related research is presented, ranging from semi-automatic to fully automatic
detection of buildings from aerial images.
A semi-automatic approach was developed by Rüther et al. [12] using digital
surface models (DSMs) to generate initial raised-structure hypotheses from elevation
blobs; the model was also fine-tuned via active contours. In another semi-automatic
approach, Besbes et al. [3] presented an adaptive variational segmentation method
for aerial images. They used level-set formulations and the evaluation of spectral
and texture features from each image region to cope with the content heterogeneity
of remotely sensed data.
Kabolizade et al. [9] proposed an improved snake-based segmentation model for
automatic building extraction. This model was based on the radiometric and
geometric behaviour of roofs. Another automatic variational level-set building
detection model was proposed by Yang and Lin [15], using a neighborhood-based
image analysis framework and a novel energy term related to the height and roughness
of non-terrain objects derived from LiDAR data.
Wang and Liu [13] developed a semi-automatic rectilinear-shape rooftop detection
algorithm using multi-scale object-oriented classification and the probabilistic Hough
transform. In another semi-automatic approach, Liu et al. [11] utilized region
growing and localized multi-scale object-oriented segmentation to detect small
rectilinear rooftops. This approach was later refined and applied to more complex cases
using a node-graph search technique.
Lafarge et al. [10] proposed an automatic model-based building extraction method
using digital elevation models (DEMs). In this approach, rough approximations of
buildings were first identified by rectangle layouts and then fine-tuned using height
discontinuities.
Cote and Saeedi [6] combined the strength of energy-based approaches with the
distinctiveness of corners, which are assessed using multiple color and color-invariance
spaces. A rooftop outline is generated from selected corner candidates and further
refined to fit the best possible boundaries through level-set curve evolution.

Xu et al. [14] proposed a novel salient rooftop detector that integrates four
correlative RGB-D priors (depth cue, uniqueness prior, shape prior, and transition
surface prior) for improved rooftop extraction, addressing the complex issues
mentioned above. These correlative cues are computed from image layers created by
multilevel segmentation and further fused into a state-of-the-art high-order
conditional random field (CRF) framework to locate the rooftop.

Chapter 3

Proposed Methodology

3.1 Overview of Proposed Methodology


In this thesis, a model-based methodology is proposed for cluster-of-buildings
detection in aerial images. In the proposed approach, preprocessing steps are performed
on the input aerial image, the Mask R-CNN model is applied to generate the
segmentation masks for the clusters of buildings, and these generated masks are then fed
to some postprocessing steps to obtain the final segmentation masks. The final masks
are binary, i.e., white represents the segmented cluster-of-buildings areas and black
represents the background areas.
In Figure 3.1, these steps and the relations among them are given as a flowchart. In
the figure, the boxes represent the processes applied and the ellipses represent the results.

Figure 3.1: Flowchart of the Proposed Methodology.

3.2 Detailed Proposed Methodology


A detailed flowchart including the intermediate steps is shown in Figure 3.2. In
the figure, boxes represent the methods used in the thesis and the ellipses represent
the results. In the preprocessing step, Gaussian blur is applied to reduce the noise
in the images. Then the RGB images are also converted to their corresponding
Canny edge and LBP representations. The processed image is obtained as the result
of the preprocessing steps.


Figure 3.2: Detailed Flowchart of the Proposed Methodology.

The annotation is generated from the corresponding ground truth mask. Then
the processed image, along with the annotation, is fed to train the model.
The trained model is then applied to a test image, which also goes through
the preprocessing steps. The model generates multi-colored candidate masks for
the multiple detected clusters.
multiple detected clusters.
In the postprocessing step, these multiple masks are combined into a single mask
by a union operation and converted into a binary mask (a minimal sketch of this
step is given below). Finally, all the small holes in the mask are removed to obtain
the final segmentation mask.
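A minimal sketch of the union step, assuming the Matterport convention of returning detections as a (height, width, instances) boolean mask array:

    # Combine per-instance candidate masks into one binary mask by union.
    import numpy as np

    def combine_candidate_masks(instance_masks):
        """instance_masks: bool array of shape (H, W, N), one channel per detection."""
        combined = instance_masks.any(axis=2)       # union across detected clusters
        return combined.astype(np.uint8) * 255      # binary mask: 255 = buildings, 0 = background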

Algorithm 1 Preprocessing
1: procedure PREPROCESSING(IMAGES, GROUND_TRUTH_MASKS)
2:     for each g ∈ GROUND_TRUTH_MASKS do
3:         Generate connected components of g
4:         Generate annotations in COCO format from the connected components
5:     end for
6:     for each i ∈ IMAGES do
7:         Apply Gaussian blur on i
8:         Apply Canny edge detection on i
9:         Apply LBP on i
10:    end for
11: end procedure
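A minimal sketch of steps 3-4 of Algorithm 1 is given below, using OpenCV connected components and the standard COCO polygon layout. The exact annotation fields of the thesis pipeline are not shown in the text, so the dictionary keys below are assumptions following the usual COCO convention.

    # Turn a binary ground-truth mask into COCO-style annotations,
    # one object per connected component.
    import cv2
    import numpy as np

    def mask_to_coco_annotations(mask, image_id):
        """mask: 2-D uint8 array, nonzero = cluster of buildings."""
        num_labels, labels = cv2.connectedComponents((mask > 0).astype(np.uint8))
        annotations = []
        for label in range(1, num_labels):           # label 0 is the background
            component = (labels == label).astype(np.uint8)
            contours, _ = cv2.findContours(component, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            for contour in contours:
                if contour.shape[0] < 3:              # a polygon needs >= 3 points
                    continue
                x, y, w, h = cv2.boundingRect(contour)
                annotations.append({
                    "image_id": image_id,
                    "category_id": 1,                 # single class: cluster of buildings
                    "segmentation": [contour.flatten().tolist()],
                    "bbox": [x, y, w, h],
                    "area": float(cv2.contourArea(contour)),
                    "iscrowd": 0,
                })
        return annotations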

Algorithm 2 Training
1: procedure TRAINING(IMAGES, ANNOTATIONS)    ▷ Initialize the parameters of the model
2:     set IMAGES_PER_GPU = 1
3:     set NUM_OF_CLASSES = 1 + 1
4:     set STEPS_PER_EPOCH = IMAGES_PER_GPU × COUNT_TRAIN_IMAGES
5:     set VALIDATION_STEPS = IMAGES_PER_GPU × COUNT_VALIDATE_IMAGES
6:     set LEARNING_RATE = 0.00001
7:     set EPOCHS = 33
8:     for each e ∈ EPOCHS do
9:         Train the model
10:    end for
11: end procedure

Algorithm 3 Postprocessing
1: procedure POSTPROCESSING(IMAGE, PREDICTED_MASK, HORIZONTAL_GRID, VERTICAL_GRID, THRESH)
2:     Virtually break the IMAGE and PREDICTED_MASK according to the grids
3:     for each g ∈ IMAGE_GRIDS do
4:         Perform median blur on g
5:         Perform histogram equalization on g
6:         if HSV(g) < THRESH then
7:             Remove g
8:         end if
9:     end for
10: end procedure

In the postprocessing step, the original aerial image, along with the corresponding
predicted mask, is virtually broken into grids. The numbers of vertical and horizontal
grids are chosen by empirical study. In the next step, the color space of each grid
is converted from RGB to HSV. Next, a median blur and histogram equalization
are consecutively applied to each grid of the image. Then the number of pixels in
each grid that fall into the non-building region is counted (by observing the H, S,
and V values). If the HSV criteria of the region within a grid are not met, it is
considered a false-positive region. If two detected regions in neighbouring grids are
adjacent, they are merged. A minimal code sketch of this filtering follows.
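In the sketch below, the grid counts, the HSV criterion, and THRESH are illustrative placeholders, since the thesis chose them empirically and does not list the exact values; the merging of adjacent regions is omitted.

    # Grid-based false-positive removal (Algorithm 3), illustrative values only.
    import cv2

    def postprocess(image_bgr, predicted_mask, horizontal_grids=4, vertical_grids=4,
                    thresh=0.5):
        """Zero out grid cells of the mask whose pixels look like non-building regions."""
        mask = predicted_mask.copy()
        height, width = mask.shape[:2]
        cell_h, cell_w = height // vertical_grids, width // horizontal_grids
        for row in range(vertical_grids):
            for col in range(horizontal_grids):
                ys = slice(row * cell_h, (row + 1) * cell_h)
                xs = slice(col * cell_w, (col + 1) * cell_w)
                cell = cv2.medianBlur(image_bgr[ys, xs], 5)    # suppress speckle noise
                h_ch, s_ch, v_ch = cv2.split(cv2.cvtColor(cell, cv2.COLOR_BGR2HSV))
                v_ch = cv2.equalizeHist(v_ch)                  # histogram equalization
                # Illustrative non-building cue: bright, saturated green pixels
                # (vegetation). OpenCV hue runs from 0 to 179.
                non_building = ((h_ch > 35) & (h_ch < 85) &
                                (s_ch > 60) & (v_ch > 40))
                if non_building.mean() > thresh:               # mostly non-building cell
                    mask[ys, xs] = 0                           # drop as false positive
        return mask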

Chapter 4

Experimental Result & Analysis

4.1 Data Setup


In this project, one publicly available dataset is used.

4.1.1 Source

The Aerial Imagery for Roof Segmentation (AIRS) dataset is used in this project.
This dataset aims at benchmarking algorithms for roof segmentation from high-
resolution aerial imagery. It was published by Chen, Wang, Wu, Wu, Guo, and
Waslander in 2019 [5] and can be acquired from the website https://www.airs-dataset.com.
The dataset has the following features:

• Contains orthorectified aerial images covering a 457 km² area with over 220,000
buildings.

• The spatial resolution of the imagery is very high (0.075 m), i.e., each pixel
represents an area of 0.075 m × 0.075 m on the ground.

• Strictly refined ground truths aligned with rooftop outlines.

The images are downsampled to a resolution of 1000 × 1000 pixels for faster
processing. The original dataset contains masks only for the individual rooftops;
therefore, the masks for clusters of buildings have been generated manually for this
project.
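A minimal sketch of the downsampling step is given below; cv2.INTER_AREA is an assumed interpolation choice suited to shrinking imagery, as the thesis does not state which interpolation was used.

    # Downsample an AIRS tile to 1000 x 1000 pixels.
    import cv2

    tile = cv2.imread("airs_tile.png")   # hypothetical file name
    small = cv2.resize(tile, (1000, 1000), interpolation=cv2.INTER_AREA)
    cv2.imwrite("airs_tile_1000.png", small)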
Some building dominant sample images from the dataset and the corresponding
manually generated ground truth masks are given in Table 4.1.

Image Ground Truth Mask

Table 4.1: Sample Building Dominant Images and Corresponding Ground Truth
Masks.

Some non-building sample images from the dataset and the corresponding man-
ually generated ground truth masks are given in Table 4.2.

Image Ground Truth Mask

Table 4.2: Sample Non-Building Images and Corresponding Ground Truth Masks.

4.1.2 Setup

The original dataset contains 857, 94, and 95 images for training, validation, and
testing, respectively. It includes a large number of non-building images, which
leads to poor accuracy. To overcome this issue, a selective subset of images from
the dataset is used for this project. A detailed view is given in Table 4.3.

Process # Building Dominant Images # Non-building Images Total


Training 393 97 490
Validation 50 10 60
Testing 78 20 98

Table 4.3: Data Size for Processes.

Here a total of 490 images are used for training, out of which 393 are building
dominant images and the rest of them are non-building images. For validation, 50
building-dominant and 10 non-building images are used. And for testing, 78 building
dominant and 20 non-building images are used.

4.2 Processing Setup


The proposed model was implemented in Python and executed on the
Google Colab platform with an NVIDIA GPU and 32 GB of RAM.
The execution time was calculated separately for each sub-function, as described
in Section 4.4.
Mask R-CNN was designed for high-resolution images in the original implementation
[1]. Since the images used in this project have a lower resolution, some of the
parameters are fine-tuned based on empirical study.

i. The Google Colab GPU has 15 GB of memory, which can fit two images. For
faster performance, 1 image was fitted per GPU.

ii. The number of classes is set to 2 (= 1 + 1): 1 for the cluster of buildings and 1 for
the background.

iii. A total of 33 epochs is used for training.

iv. steps_per_epoch for training is calculated as the product of the number of training
images and images_per_gpu (i.e., steps_per_epoch = 490 × 1 = 490). Similarly,
steps_per_epoch for validation is calculated as the product of the number of
validation images and images_per_gpu (i.e., validation_steps = 60 × 1 = 60). The
number of images for each process is discussed in Subsection 4.1.2.

v. A low learning rate is good for better learning. So, the learning rate here is
set to 0.00001.

Figure 4.1: Mask R-CNN architecture implementation overview.
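The following is a minimal sketch of how these parameter choices map onto the Matterport Mask R-CNN configuration and training calls [1]. The values mirror the configuration above; the weight file name, log directory, and dataset objects are illustrative assumptions, not the thesis's exact training script.

    # Minimal training sketch against the Matterport Mask R-CNN API [1].
    from mrcnn.config import Config
    from mrcnn import model as modellib

    class BuildingClusterConfig(Config):
        NAME = "roof"
        IMAGES_PER_GPU = 1        # item (i): one image per GPU
        NUM_CLASSES = 1 + 1       # item (ii): background + cluster of buildings
        STEPS_PER_EPOCH = 490     # item (iv): 490 training images x 1 image per GPU
        VALIDATION_STEPS = 60     # item (iv): 60 validation images x 1 image per GPU
        LEARNING_RATE = 1e-5      # item (v): low learning rate

    config = BuildingClusterConfig()
    model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")
    # Transfer learning: start from COCO weights, re-initializing the heads
    model.load_weights("mask_rcnn_coco.h5", by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])
    # dataset_train / dataset_val are assumed to be prepared mrcnn Dataset objects
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=33, layers="all")  # item (iii): 33 epochs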
4.3 Performance Evaluation
The following Confusion Matrix measures proposed in [7] are computed to evaluate
the performance of the proposed method.

• True positive (TP): The number of pixels assigned as cluster of buildings in
both the segmentation and the ground truth results.

• True negative (TN): The number of pixels assigned as non-buildings in both
the segmentation and the ground truth results.

• False positive (FP): The number of pixels assigned as cluster of buildings in
the segmentation result but not in the ground truth.

• False negative (FN): The number of pixels assigned as cluster of buildings in
the ground truth but not in the segmentation result.

These are combined to get the following measures:

PRECISION = TP / (TP + FP)

RECALL = TP / (TP + FN)

F1-SCORE = (2 × PRECISION × RECALL) / (PRECISION + RECALL)
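A minimal NumPy sketch of computing these pixel-wise measures from a predicted binary mask and its ground truth:

    # Pixel-wise precision, recall, and F1-score for binary segmentation masks.
    import numpy as np

    def evaluate_masks(pred, truth):
        """pred, truth: boolean 2-D arrays, True = cluster of buildings.
        Assumes at least one predicted and one ground-truth positive pixel."""
        tp = np.logical_and(pred, truth).sum()
        fp = np.logical_and(pred, ~truth).sum()
        fn = np.logical_and(~pred, truth).sum()
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1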

Figure 4.2: Confusion Matrix for Image Segmentation. (Source: https://www.omicsonline.org/articles-images/JCSB-07-209-g003.html)
4.3.1 Quantitative Assessment

Table 4.4 shows the confusion matrix of the proposed method.

                        RGB                      CANNY                    LBP
Actual \ Predicted      Buildings  Non-Buildings Buildings  Non-Buildings Buildings  Non-Buildings
Buildings               0.52       0.09          0.57       0.08          0.51       0.08
Non-Buildings           0.02       0.37          0.02       0.33          0.04       0.37

Table 4.4: Confusion Matrix.

In the case of RGB, it is observed that 52% of pixels are correctly predicted as
building regions and 37% as non-building regions, while 9% of building regions have
been missed and 2% of non-building regions have been falsely detected as buildings.
In the case of Canny, 57% of pixels are correctly predicted as building regions and
33% as non-building regions, while 8% have been missed and 2% falsely detected.
In the case of LBP, 51% of pixels are correctly predicted as building regions and
37% as non-building regions, while 8% have been missed and 4% falsely detected.

The evaluation is performed for each of the input image formats, and the
results are given in Table 4.5.

Input Image Format Precision Recall F-Score


RGB Image 0.75 0.86 0.81
Canny Edge 0.72 0.83 0.77
LBP 0.75 0.83 0.79

Table 4.5: Performance Evaluation Results.

The proposed method gives precision, recall and F-score for RGB images as 75%,
86% and 81%, respectively.
These values for Canny edge images are 72%, 83% and 77% respectively.
These values for LBP images are 75%, 83% and 79% respectively.

4.3.2 Qualitative Assessment

Sample evaluation results of the proposed model are given below:

Original Image Ground Truth Prediction

Table 4.6: Sample Result for RGB images.

In the ground truth column, the red area represents the ground truth for the cluster
of buildings, and in the prediction column, the red area represents the cluster of
buildings detected by the proposed model.

Original Image Ground Truth Prediction

Table 4.7: Sample Result for Canny images.

Original Image Ground Truth Prediction

Table 4.8: Sample Result for LBP images.

4.4 Result Analysis
From the evaluation results (Section 4.3.1), it is evident that the proposed method
for detecting clusters of buildings in aerial images gives better results for the RGB
images than for the Canny and LBP images. The technique performs very well in
regions where the buildings are densely packed. It does not perform as well on
evenly spaced built-up areas where the buildings are farther apart from one another.
Training and validation loss graphs are given below:


Figure 4.3: (a) Train Loss for RGB images. (b) Validation Loss for RGB images.


Figure 4.4: (a) Train Loss for Canny images. (b) Validation Loss for Canny images.


Figure 4.5: (a) Train Loss for LBP images. (b) Validation Loss for LBP images.

Computation Time Analysis

The execution time for the system depends on the complexity and number of the
images and the number of features in them. The annotation generation took 1653
seconds for the train images and 202 seconds for the validation images.

During training, the first epoch took 480 seconds; this time decreased monotonically
and stabilized at 430 seconds from the 25th epoch onwards (details are given in
Figure 4.6). Thus, the total time for training was about 5 hours.

Figure 4.6: Training Time Per epoch.

For testing, the total execution time was 368 seconds.

Chapter 5

Conclusion & Future Work

5.1 Summary
In this project, a methodology for the detection of clusters of buildings in aerial
images is proposed and implemented. The proposed methodology is based on the
Matterport Mask R-CNN [1].
Firstly, the ground truth masks are created manually for the cluster of buildings
and then the annotation is generated from them. Next, 3 types of representations
(RGB, Canny & LBP) are created for training the model. Some of the original
parameters are fine-tuned for better training based on empirical studies. Evaluations
are performed after training. The acquired results are satisfactory.
The proposed model faces difficulties when the buildings are not densely packed,
and in some cases the segmentation result contains some vegetation area.
From the overall performance evaluation, it can be observed that the proposed
model gives the best accuracy (81%) for the RGB images and the lowest accuracy
(77%) for the Canny edge images.

5.2 Future Work
As a future scope of this research, the clusters of buildings detected by the proposed
method can be used to detect the individual buildings using the same Mask R-CNN
architecture. Next, the nature of the buildings, shadow information, and the height
and area of the buildings can also be predicted. Finally, all this information can be
used for different facility location tasks.
The model reported in this thesis can also be further optimized for better
performance.

References

[1] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation
on Keras and TensorFlow. GitHub, 2017.

[2] Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. Face description with
local binary patterns: Application to face recognition. volume 28, pages 2037–
2041. IEEE, 2006.

[3] Olfa Besbes, Ziad Belhadj, and Nozha Boujemaa. A variational framework for
adaptive satellite images segmentation. In International Conference on Scale
Space and Variational Methods in Computer Vision, pages 675–686. Springer,
2007.

[4] John Canny. A computational approach to edge detection. Number 6, pages


679–698. IEEE, 1986.

[5] Qi Chen, Lei Wang, Yifan Wu, Guangming Wu, Zhiling Guo, and Steven L
Waslander. Aerial imagery for roof segmentation: A large-scale dataset towards
automatic mapping of buildings. volume 147, pages 42–55, 2019.

[6] Melissa Cote and Parvaneh Saeedi. Automatic rooftop extraction in nadir aerial
imagery of suburban regions using corners and variational level set evolution.
volume 51, pages 313–328. IEEE, 2012.

[7] Tom Fawcett. An introduction to ROC analysis. volume 27, pages 861–874, 06 2006.

[8] Hasan Volkan Güdücü. Building detection from satellite images using shadow
and color information. 2008.

[9] Mostafa Kabolizade, Hamid Ebadi, and Salman Ahmadi. An improved snake
model for automatic extraction of buildings from urban aerial images and LiDAR
data. volume 34, pages 435–441. Elsevier, 2010.

[10] Florent Lafarge, Xavier Descombes, Josiane Zerubia, and Marc Pierrot-
Deseilligny. Automatic building extraction from DEMs using an object approach
and application to the 3D-city modeling. volume 63, pages 365–381. Elsevier,
2008.

[11] Zhengjun Liu, Shiyong Cui, and Qin Yan. Building extraction from high res-
olution satellite imagery based on multi-scale image segmentation and model
matching. In 2008 International workshop on earth observation and remote
sensing applications, pages 1–7. IEEE, 2008.

[12] Heinz Rüther, Hagai M Martine, and EG Mtalo. Application of snakes and dy-
namic programming optimisation technique in modeling of buildings in informal
settlement areas. volume 56, pages 269–282. Elsevier, 2002.

[13] Z Wang and W Liu. Building extraction from high resolution imagery based on
multi-scale object oriented classification and probabilistic hough transform. In
Proceedings of 2005 International Geoscience and Remote Sensing Symposium
(IGARSS’05), Seoul, South Korea, pages 25–29, 2005.

[14] Shibiao Xu, Xingjia Pan, Er Li, Baoyuan Wu, Shuhui Bu, Weiming Dong,
Shiming Xiang, and Xiaopeng Zhang. Automatic building rooftop extraction
from aerial images via hierarchical rgb-d priors. volume 56, pages 7369–7387.
IEEE, 2018.

[15] Yun Yang and Ying Lin. Object-based level set model for building detection
in urban area. In 2009 Joint Urban Remote Sensing Event, pages 1–6. IEEE,
2009.
