MTech Thesis

Detection of Cluster of Buildings in Aerial Image
by
Piyal Roy
Date:
Place: Kalyani
ACKNOWLEDGEMENT
ABSTRACT
Contents

CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
List of Figures
List of Tables
List of Algorithms

1 INTRODUCTION
1.1 Motivation
1.2 Problem Formulation
1.3 Objective
1.4 Organization of the Thesis

3 Proposed Methodology
3.1 Overview of Proposed Methodology
3.2 Detailed Proposed Methodology

4 Experimental Result & Analysis
4.1 Data Setup
4.1.1 Source
4.1.2 Setup
4.2 Processing Setup
4.3 Performance Evaluation
4.3.1 Quantitative Assessment
4.3.2 Qualitative Assessment
4.4 Result Analysis

References
List of Figures
List of Tables
List of Algorithms

1 Preprocessing
2 Training
3 Postprocessing
Chapter 1
INTRODUCTION
1.1 Motivation
Object detection in aerial images has been a major research topic in computer vision for several years. Edge and line extraction is one of the key challenges that drive much of the research in this field. Another important one is the segmentation problem: finding the desired object and separating it from the background in the presence of distractions. These distractions are caused by features such as exterior markings, vegetation, shadows, and highlights.
Automatic detection of clusters of buildings from aerial images is an important task in a wide range of applications. One of the most important is facility location, where the detected clusters are fed as the input locations to different facility location algorithms. Other important applications include generating computerized maps, urban planning, real estate management, land-use surveying, updating GIS databases, and local area route planning [8]. Detected clusters of buildings can be further used for individual rooftop detection. Precise identification and localization of rooftops in urban images is a vital step in regional planning and city modeling. Likewise, information on the area, profile, and density of buildings can be extremely helpful in estimating the distribution of a city's population. Specifically, building detection can be utilized for different facility location projects such as installing solar panels on rooftops, placing COVID-19 vaccination centers, placing disaster relief camps, and so on.
Individual building detection from aerial images can be very difficult. One reason is that aerial images usually differ in resolution and brightness. Another is that rooftops may have distinct and intricate shapes and structures that are easily confused with similar-looking objects such as cars, roads, and patios. Therefore, this project aims to detect clusters of buildings from aerial images, which can reduce the complexity of individual building detection.
1.3 Objective
The objective of this thesis is to develop a robust, reliable algorithm capable of detecting clusters of buildings in aerial images in the presence of distracting features such as surface markings, vegetation, shadows, and highlights.
1.4 Organization of the Thesis
Next, the design of the proposed methodology is given in Chapter 3. The subsequent Chapter 4 presents the necessary steps for the data and processing setup, the detailed criteria for performance evaluation, and an analysis of the performance of the proposed methodology. Finally, in Chapter 5, conclusions and future work are presented.
Chapter 2
• Film: Mostly black and white film is used; however, color, infrared, and false-
color infrared films are also used for special cases.
• Focal length: It is the distance between the middle of the camera lens and the focal plane. It should be measured precisely when the camera is calibrated. Focal length is inversely proportional to image distortion: the longer the focal length, the lower the distortion.

1. https://www.nrcan.gc.ca/maps-tools-publications/satellite-imagery-air-photos/air-photos/national-air-photo-library/about-aerial-photography/concepts-aerial-photography/9687
2. https://en.wikipedia.org/wiki/Aerial_photography
• Scale: It is the ratio of the distance between two points on an image to the actual distance between the same two points on the ground (i.e., 1 unit on the photo equals "a" units on the ground). If a 1 km stretch of roadway covers 5 cm on an air photo, the scale is calculated as follows:

SCALE = PHOTO DISTANCE / GROUND DISTANCE = 5 cm / 1 km = 5 cm / 100000 cm = 1/20000
Another method used to determine the scale of an aerial image is to find the ratio between the focal length of the camera and the altitude of the plane above the ground being photographed.
If the focal length of a camera is 156 mm, and the altitude of the plane Above Ground Level (AGL) is 7800 m, using the same equation as above, the scale is calculated as follows:

SCALE = FOCAL LENGTH / ALTITUDE (AGL) = 156 mm / 7800 m = 156 mm / 7800000 mm = 1/50000
2.1.2 RGB
In the RGB color model, each color appears in terms of its primary components (Red, Green, Blue). This color model is based on a Cartesian coordinate system, as shown in Figure 2.2. Individual values for the red, green, and blue channels are stored in the RGB color model. In any color space based on the RGB color model, the three primary components are added together to generate colors ranging from completely black to completely white.
Different cameras produce different color image data when scanning the same scene, and different monitors provide different color display results when rendering the same image. This is because the RGB color space is correlated with the device operating on it, i.e., it is device-dependent. Many different color spaces are derived from this color model; standard RGB (sRGB) is one of them.
The 2-D Gaussian function used for the blurring has the standard form

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

Here, x, y are the image pixel locations and σ is the standard deviation of the distribution. It determines the variance around the mean of the Gaussian distribution, which defines the extent of the blurring effect around a pixel.

4. https://www.dynamsoft.com/blog/wp-content/uploads/2019/05/Group-3.png
5. https://www.sciencedirect.com/topics/engineering/gaussian-blur
Gaussian blur is applied to reduce the amount of noise and remove speckles within the image. It is important to remove the very high-frequency components that exceed those associated with the gradient filter used; otherwise, they may cause false edges to be detected.
Figure 2.3: (a) Original Aerial Image. (b) Gaussian Blurred Image.
The first step of the Canny operator is to smooth the input image by applying a Gaussian blur. Then a simple 2-D first-derivative operator is applied to the smoothed image to highlight regions of the image with high first spatial derivatives6. In the gradient magnitude image, edges give rise to ridges. The next step is to track along the top of these ridges and set to zero all pixels that are not actually on a ridge. This process is known as non-maximal suppression, and it produces a thin line as output. The tracking process exhibits hysteresis, guided by two thresholds T1 and T2 (T1 > T2): tracking starts from a point on a ridge higher than T1 and continues in both directions from that point until the height of the ridge falls below T2. This hysteresis process ensures that noisy edges are not broken up into multiple edge fragments [4].
6. https://homepages.inf.ed.ac.uk/rbf/HIPR2/canny.htm
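For illustration, this whole pipeline is available as a single OpenCV call; the threshold values below are hypothetical, with threshold2 playing the role of the upper threshold T1 and threshold1 the lower threshold T2 of the text.

import cv2

gray = cv2.imread("aerial.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 1.0)
# cv2.Canny internally performs gradient computation, non-maximal
# suppression, and hysteresis tracking between the two thresholds.
edges = cv2.Canny(blurred, threshold1=100, threshold2=200)
cv2.imwrite("aerial_canny.png", edges)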
Figure 2.4: (a) Original Aerial Image. (b) Canny Edge Representation.
Local Binary Pattern (LBP) is an effective texture descriptor. It labels the pixels of an image by thresholding the neighborhood of each pixel against the center pixel value and returns the result as a binary number [2]. Figure 2.5 shows the LBP computation process, where the notation (P, R) means P sampling points on a circle of radius R.
7. http://www.scholarpedia.org/w/images/7/77/LBP.jpg
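As a sketch, an LBP representation can be computed with scikit-image (the package installed in Section 4.2); the values of P, R and the method are illustrative.

import cv2
from skimage.feature import local_binary_pattern

gray = cv2.imread("aerial.png", cv2.IMREAD_GRAYSCALE)
# P = 8 sampling points on a circle of radius R = 1 (illustrative values).
lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")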
Figure 2.5: The LBP computation process.
[Figure: Mask R-CNN pipeline — the input image passes through a ResNet 101 backbone that produces a feature matrix; the RPN (a 3×3 conv with 512 output channels, followed by a 3×3 conv with 2×9 output channels and a 3×3 conv with 4×9 output channels) proposes regions within anchor boxes; the feature maps of different sizes for each region of interest pass through ROIAlign; finally, three heads classify which object is in the image, put bounding boxes around objects, and generate a mask for each region of interest.]
In the referred methodology, a pre-trained ResNet 101 model is used as the backbone. Given an input image, it extracts a feature matrix from the image. This feature matrix is then given as input to the RPN (Region Proposal Network). During training, the sizes of the anchor boxes are set based on the object size, so that an anchor box is capable of containing an object.
When a feature map is given to the RPN, the first convolutional layer checks the feature matrix using the sliding window method. After running through the entire image, it generates proposed regions, i.e., regions that are likely to contain objects.
Since the method searches selectively over the proposals, once the first convolutional layer completes, only the proposed regions are passed to the other two convolutional layers within the RPN. The convolutional layer with 2×9 output channels acts as the classifier, i.e., it identifies whether a region is foreground or background, and the layer with 4×9 output channels acts as the regressor: it puts a bounding box around the object(s) through the anchor boxes. A minimal sketch of these two heads follows.
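The following Keras sketch shows only the RPN head shapes described above, assuming 9 anchors per sliding-window location; the layer sizes follow the figure, and everything else is illustrative.

import tensorflow as tf

num_anchors = 9  # anchors per sliding-window location
feature_map = tf.keras.Input(shape=(None, None, 512))
# Shared 3x3 convolution with 512 output channels.
shared = tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu")(feature_map)
# Classifier head: 2 scores (foreground/background) per anchor.
cls_scores = tf.keras.layers.Conv2D(2 * num_anchors, 3, padding="same")(shared)
# Regressor head: 4 bounding-box deltas per anchor.
bbox_deltas = tf.keras.layers.Conv2D(4 * num_anchors, 3, padding="same")(shared)
rpn = tf.keras.Model(feature_map, [cls_scores, bbox_deltas])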
The different-sized feature maps of the regions of interest are then fed to the ROIAlign layer, which converts them all to the same size.
Finally, the feature maps are passed through two consecutive fully convolutional networks (FCNs), which are responsible for generating binary masks. Thus, the classifier classifies the object, the regressor puts a bounding box around the object, and the mask head generates an instance mask for each object.
ResNet 101
ResNet 101 is a deep neural network that uses skip connections, which enable it to mitigate the vanishing gradient problem. The "101" signifies that a total of 101 layers are used to build the network.
[Figure: A residual building block — the input x passes through a weight layer, a ReLU, and a second weight layer to produce F(x), while an identity shortcut carries x around the block; the two are summed to give F(x) + x, followed by a ReLU.]
[Figure: ResNet 101 architecture — Image → conv 7×7, 64, /2 → 3×3 max pool, /2 → [1×1, 64; 3×3, 64; 1×1, 256] ×3 → [1×1, 128; 3×3, 128; 1×1, 512] ×4 → [1×1, 256; 3×3, 256; 1×1, 1024] ×23 → [1×1, 512; 3×3, 512; 1×1, 2048] ×3 → average pool → fully connected layer.]
These skip connections can be used directly when the input and output shapes are the same. When the input and output shapes differ, one of the two following options can be used (a Keras sketch of a residual block follows the list):
1. Identity mapping is performed, with extra zero entries padded for the increased dimensions. This option introduces no extra parameters.
2. A projection shortcut (a 1×1 convolution) is used to match the dimensions. This option introduces extra parameters.
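A minimal Keras sketch of one residual block with an identity skip connection, assuming the input and output shapes match (all sizes are illustrative).

import tensorflow as tf

def residual_block(x, filters=64):
    # F(x): two weight layers with a ReLU in between.
    fx = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    fx = tf.keras.layers.Conv2D(filters, 3, padding="same")(fx)
    # Skip connection: add the identity x to F(x), then apply ReLU.
    return tf.keras.layers.ReLU()(tf.keras.layers.Add()([fx, x]))

inputs = tf.keras.Input(shape=(64, 64, 64))
outputs = residual_block(inputs)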
2.2 Related Work
Research related to aerial image processing is extensive and growing quickly, which makes it challenging to provide a comprehensive overview of the area. This section presents some related research, ranging from semi-automatic to fully automatic detection of buildings from aerial images.
A semi-automatic approach was developed by Rüther et al. [12], using digital surface models (DSMs) to generate initial raised-structure hypotheses from elevation blobs; the model was also fine-tuned via active contours. In another semi-automatic approach, Besbes et al. [3] presented an adaptive variational segmentation method for aerial images. They used level set formulations and the evaluation of spectral and texture features from each image region to cope with the content heterogeneity of remotely sensed data.
Kabolizade et al. [9] proposed an improved snake-based segmentation model for automatic building extraction. This model was based on the radiometric and geometric behaviors of roofs. Another automatic variational level set building detection model was proposed by Yang and Lin [15], using a neighborhood-based image analysis framework and a novel energy term related to the height and roughness of non-terrain objects derived from LiDAR data.
Wang and Liu [13] developed a semi-automatic rectilinear-shape rooftop detection algorithm using multi-scale object-oriented classification and the probabilistic Hough transform. In another semi-automatic approach, Liu et al. [11] utilized region growing and localized multi-scale object-oriented segmentation to detect small rectilinear rooftops. This approach was later refined and applied to more complex cases using a node graph search technique.
Lafarge et al. [10] proposed an automatic model-based building extraction method using digital elevation models (DEMs). In this approach, rough approximations of buildings were first identified by rectangle layouts and then fine-tuned using height discontinuities.
Cote and Saeedi [6] combined the strength of energy-based approaches with the distinctiveness of corners, which are assessed using multiple color and color-invariance spaces. A rooftop outline is generated from selected corner candidates and further refined to fit the best possible boundaries through level-set curve evolution.
Xu et al. [14] proposed a novel salient rooftop detector that integrates four correlative RGB-D priors (depth cue, uniqueness prior, shape prior, and transition surface prior) for improved rooftop extraction, addressing the complex issues mentioned above. These correlative cues are computed from image layers created by multilevel segmentation and then fused into a state-of-the-art high-order conditional random field (CRF) framework to locate the rooftop.
Chapter 3
Proposed Methodology
3.1 Overview of Proposed Methodology
[Figure: Overview of the proposed methodology — the input image is preprocessed into a processed image, and an annotation is generated from the input mask; both are used to train the model; the trained model generates an initial segmentation mask, which is postprocessed into the final mask.]
3.2 Detailed Proposed Methodology
[Figure: Detailed proposed methodology — the input image is preprocessed (Gaussian blur, Canny edge, LBP) into a processed image, and an annotation is generated from the input mask; both are used to train the model; the trained model generates multi-color candidate masks; postprocessing combines the candidate masks into a single mask, converts it to a binary mask, removes vegetation land, and removes barren land to produce the final mask.]
The annotation is generated from the corresponding ground truth mask. The processed image, along with the annotation, is then fed to train the model.
The trained model is then applied to a test image, which also goes through the same preprocessing steps. The model generates multi-colored candidate masks for the multiple detected clusters.
In the postprocessing step, these multiple masks are combined into a single mask by a union operation and converted into a binary mask. Finally, all small holes in the mask are removed to obtain the final segmentation mask.
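A minimal sketch of this combination step, assuming the candidate masks are NumPy arrays; the function name and the hole-area threshold are hypothetical.

import numpy as np
from skimage.morphology import remove_small_holes

def combine_masks(candidate_masks, hole_area=64):
    # Union of all candidate masks into one binary mask.
    combined = np.zeros_like(candidate_masks[0], dtype=bool)
    for mask in candidate_masks:
        combined |= mask.astype(bool)
    # Remove small holes; the area threshold is an illustrative value.
    return remove_small_holes(combined, area_threshold=hole_area)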
Algorithm 1 Preprocessing
1: procedure PREPROCESSING(IMAGES, GROUND_TRUTH_MASKS)
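As a companion to Algorithm 1, the preprocessing described in Section 3.2 (Gaussian blur, Canny edge, LBP) can be sketched in Python as follows; all parameter values are illustrative, not the project's exact settings.

import cv2
from skimage.feature import local_binary_pattern

def preprocess(image):
    # Smooth the image to suppress noise (illustrative kernel and sigma).
    blurred = cv2.GaussianBlur(image, (5, 5), 1.0)
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
    # Alternative single-channel representations used for training.
    canny = cv2.Canny(gray, 100, 200)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    return blurred, canny, lbp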
Algorithm 2 Training
1: procedure TRAINING(IMAGES, ANNOTATIONS) ▷ Initialize the parameters of the model
2: set IMAGES_PER_GPU = 1
3: set NUM_OF_CLASSES = 1 + 1
4: set STEPS_PER_EPOCH = IMAGES_PER_GPU × COUNT_TRAIN_IMAGES
5: set VALIDATION_STEPS = IMAGES_PER_GPU × COUNT_VALIDATE_IMAGES
6: set LEARNING_RATE = 0.00001
7: set EPOCHS = 33
8: for each e ∈ EPOCHS do
9: Train the Model
10: end for
11: end procedure
Algorithm 3 Postprocessing
1: procedure POSTPROCESSING(IMAGE, PREDICTED_MASK,
In the postprocessing step, the original aerial image, along with the corresponding predicted mask, is virtually broken into grids. The number of vertical and horizontal grids is chosen by empirical study. In the next step, the color space of each grid is converted from RGB to HSV. Next, a median blur and histogram equalization are applied consecutively to each grid of the image. Then the number of pixels in each grid that fall into the non-building region is counted (by observing the H, S, and V values). If the HSV criteria of the region within a grid do not match, it is considered a false-positive region. If two adjacent regions lie in neighboring grids, the detected regions are merged. A hedged sketch of this grid check follows.
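The sketch below assumes a 3-channel BGR image; the grid counts, the choice of applying the filters to the V channel, and the HSV range are hypothetical placeholders rather than the empirically chosen values.

import cv2
import numpy as np

def grid_non_building_mask(image, grid_rows=4, grid_cols=4):
    # Virtually break the image into grids and flag non-building cells.
    h, w = image.shape[:2]
    flagged = np.zeros((h, w), dtype=bool)
    for r in range(grid_rows):
        for c in range(grid_cols):
            ys = slice(r * h // grid_rows, (r + 1) * h // grid_rows)
            xs = slice(c * w // grid_cols, (c + 1) * w // grid_cols)
            cell = cv2.cvtColor(image[ys, xs], cv2.COLOR_BGR2HSV)
            hch, sch, vch = cv2.split(cell)
            # Median blur followed by histogram equalization (V channel here).
            vch = cv2.equalizeHist(cv2.medianBlur(vch, 5))
            cell = cv2.merge([hch, sch, vch])
            # Hypothetical HSV range for vegetation/barren-land pixels.
            non_building = cv2.inRange(cell, (25, 40, 40), (95, 255, 255))
            flagged[ys, xs] = non_building > 0
    return flagged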
Chapter 4
Experimental Result & Analysis
4.1 Data Setup
4.1.1 Source
The Aerial Imagery for Roof Segmentation (AIRS) dataset is used in this project. The dataset aims at benchmarking algorithms for roof segmentation from high-resolution aerial imagery. It was published by Qi Chen, Lei Wang, Yifan Wu, Guangming Wu, Zhiling Guo, and Steven L. Waslander in 2019 [5] and can be acquired from the website https://www.airs-dataset.com.
The dataset has the following features:
• Contains orthorectified aerial images covering a 457 km² area with over 220,000 buildings.
• The spatial resolution of the imagery is very high (0.075 m), i.e., each pixel represents a 0.075 m × 0.075 m area of the ground.
The images are downsampled to a resolution of 1000 × 1000 pixels for faster processing. The original dataset contains masks only for individual rooftops; therefore, the masks for clusters of buildings have been generated manually for this project.
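The downsampling step can be sketched with OpenCV as follows; the file names are illustrative, and the interpolation method is an assumption since the thesis does not specify one.

import cv2

image = cv2.imread("airs_tile.png")
# Downsample to 1000 x 1000 pixels; INTER_AREA is a common choice for shrinking.
small = cv2.resize(image, (1000, 1000), interpolation=cv2.INTER_AREA)
cv2.imwrite("airs_tile_1000.png", small)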
Some building-dominant sample images from the dataset and the corresponding manually generated ground truth masks are given in Table 4.1.
Table 4.1: Sample Building Dominant Images and Corresponding Ground Truth
Masks.
Some non-building sample images from the dataset and the corresponding manually generated ground truth masks are given in Table 4.2.
Table 4.2: Sample Non-Building Images and Corresponding Ground Truth Masks.
4.1.2 Setup
The original dataset contains 857, 94, and 95 images for training, validation, and testing, respectively. The dataset contains a large number of non-building images, which leads to poor accuracy. To overcome this issue, a selected subset of images from the dataset is used for this project. A detailed view is given in Table 4.3.
Here a total of 490 images are used for training, of which 393 are building-dominant images and the remaining 97 are non-building images. For validation, 50 building-dominant and 10 non-building images are used. For testing, 78 building-dominant and 20 non-building images are used.
4.2 Processing Setup
i. The Google Colab GPU has 15 GB of memory, which can fit two images. For faster performance, 1 image was fitted per GPU.
ii. The number of classes is set to 2 (= 1 + 1): 1 for the cluster of buildings and 1 for the background.
!pip install scikit-image==0.16.2

Training

iv. Steps per epoch for training is calculated as the product of the number of train images and images per GPU (i.e., steps per epoch = 490×1 = 490). Similarly, steps per epoch for validation is calculated as the product of the number of validation images and images per GPU (i.e., validation steps = 60×1 = 60). The number of images for each process is discussed in Subsection 4.1.2.

v. A low learning rate is good for better learning. So, the learning rate here is set to 0.00001.

In [8]:
train_path = os.path.join(root_dir, "train_single_stage.py")
dataset_path = os.path.join(root_dir, "dataset")
!python3.6 "$train_path" train --dataset=$dataset_path --year='2021' --model=coco
# !python3.6 '/content/gdrive/My Drive/piyal mtech/singlestage trainnew.py' train --dataset='/co

Using TensorFlow backend.
Command: train
Model: coco
Dataset: /content/gdrive/MyDrive/data/piyal/dataset
Year: 2021
Logs: /content/gdrive/.shortcut-targets-by-id/1tnGoGL8q1dTZMpEjd6EmKErE8kGneyDs/piyal/logs
Auto Download: False
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 1e-05
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0,
'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME roof
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 490
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 60
WEIGHT_DECAY 0.0001
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_de
colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a f
Instructions for updating:
Colocations handled automatically by placer.
Loading weights /content/gdrive/.shortcut-targets-by-id/1tnGoGL8q1dTZMpEjd6EmKErE8kGneyDs/piyal
o.h5
2021-07-21 12:17:30.379483: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU suppor
that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-07-21 12:17:30.597320: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successf
ad from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NU
2021-07-21 12:17:30.598260: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x184f

Figure 4.1: Mask R-CNN architecture implementation overview.
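For reference, these settings correspond to a configuration subclass in the Matterport Mask R-CNN package [1] roughly as sketched below; the class name is hypothetical, and the attribute values mirror the dump above.

from mrcnn.config import Config

class RoofConfig(Config):
    NAME = "roof"
    BACKBONE = "resnet101"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1      # one image per GPU (Section 4.2, item i)
    NUM_CLASSES = 1 + 1     # background + cluster of buildings
    STEPS_PER_EPOCH = 490   # 490 training images x 1 image per GPU
    VALIDATION_STEPS = 60   # 60 validation images x 1 image per GPU
    LEARNING_RATE = 1e-5

config = RoofConfig()
config.display()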
4.3 Performance Evaluation
The following confusion matrix measures, proposed in [7], are computed to evaluate the performance of the proposed method:
PRECISION = TP / (TP + FP)

RECALL = TP / (TP + FN)

F1-SCORE = (2 × PRECISION × RECALL) / (PRECISION + RECALL)

1. https://www.omicsonline.org/articles-images/JCSB-07-209-g003.html
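A small self-contained sketch of these measures; the counts in the example call are made-up values, chosen only so that the precision and recall match the magnitudes reported below.

def precision_recall_f1(tp, fp, fn):
    # Confusion matrix measures as defined above.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: tp=86, fp=29, fn=14 gives precision ~0.75, recall 0.86, F1 ~0.80.
print(precision_recall_f1(tp=86, fp=29, fn=14))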
4.3.1 Quantitative Assessment
In the case of RGB, it is observed that, of the correct predictions, 52% are building regions and 37% are non-building regions; 9% of the building regions have been missed and 2% of the non-building regions have been detected falsely.
In the case of Canny, of the correct predictions, 57% are building regions and 33% are non-building regions; 8% have been missed and 2% have been detected falsely.
In the case of LBP, of the correct predictions, 51% are building regions and 37% are non-building regions; 8% have been missed and 4% have been detected falsely.
The evaluation is performed for each type of input image format, and the results are given in Table 4.5. The proposed method gives precision, recall, and F1-score for RGB images of 75%, 86%, and 81%, respectively. These values for Canny edge images are 72%, 83%, and 77%, respectively, and for LBP images they are 75%, 83%, and 79%, respectively.
4.3.2 Qualitative Assessment
In the ground truth column, the red area represents the ground truth for the cluster of buildings; in the prediction column, the red area represents the cluster of buildings detected by the proposed model.
[Qualitative results — columns: Original Image, Ground Truth, Prediction.]
4.4 Result Analysis
From the evaluation results (Section 4.3.1), it is evident that the proposed method for detecting clusters of buildings in aerial images gives better results for the RGB images than for the Canny and LBP images. The technique performs very well in detecting clusters of buildings in regions where the buildings are densely packed. It does not perform as well in recognizing evenly spaced built-up areas where the buildings are farther apart from one another.
The training and validation loss graphs are given below:
Figure 4.3: (a) Train Loss for RGB images. (b) Validation Loss for RGB images.
Figure 4.4: (a) Train Loss for Canny images. (b) Validation Loss for Canny images.
Figure 4.5: (a) Train Loss for LBP images. (b) Validation Loss for LBP images.
The execution time of the system depends on the complexity and number of the images and the number of features in them. The annotation generation took 1653 seconds for the train images and 202 seconds for the validation images.
During training, the first epoch took 480 seconds; this time decreased monotonically and stabilized at 430 seconds from the 25th epoch onward (details are given in Figure 4.6). Thus, the total time for training was about 5 hours.
Chapter 5
Conclusion and Future Work
5.1 Summary
In this project, a methodology for the detection of clusters of buildings in aerial images is proposed and implemented. The proposed methodology is based on the Matterport Mask R-CNN [1].
First, the ground truth masks for the clusters of buildings are created manually, and the annotations are generated from them. Next, three types of representations (RGB, Canny, and LBP) are created for training the model. Some of the original parameters are fine-tuned for better training based on empirical studies. Evaluations are performed after training, and the acquired results are satisfactory.
The proposed model faces difficulties when the buildings are not densely packed, and in some cases the segmentation result contains some vegetation area.
From the overall performance evaluation, it can be observed that the proposed model gives the best accuracy (81%) for the RGB images and the lowest accuracy (77%) for the Canny edge images.
5.2 Future Work
As a future scope of this research, the clusters of buildings detected by the proposed method can be used to detect individual buildings using the same Mask R-CNN architecture. Next, the nature of the buildings, shadow information, and the height and area of the buildings can also be predicted. Finally, all this information can be used for different facility location tasks.
The model reported in this thesis can also be further optimized for better performance.
References
[1] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. GitHub, 2017.
[2] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:2037–2041, 2006.
[3] Olfa Besbes, Ziad Belhadj, and Nozha Boujemaa. A variational framework for
adaptive satellite images segmentation. In International Conference on Scale
Space and Variational Methods in Computer Vision, pages 675–686. Springer,
2007.
[5] Qi Chen, Lei Wang, Yifan Wu, Guangming Wu, Zhiling Guo, and Steven L. Waslander. Aerial imagery for roof segmentation: A large-scale dataset towards automatic mapping of buildings. ISPRS Journal of Photogrammetry and Remote Sensing, 147:42–55, 2019.
[6] Melissa Cote and Parvaneh Saeedi. Automatic rooftop extraction in nadir aerial imagery of suburban regions using corners and variational level set evolution. IEEE Transactions on Geoscience and Remote Sensing, 51:313–328, 2012.
[7] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27:861–874, 2006.
[8] Hasan Volkan Güdücü. Building detection from satellite images using shadow
and color information. 2008.
[9] Mostafa Kabolizade, Hamid Ebadi, and Salman Ahmadi. An improved snake
model for automatic extraction of buildings from urban aerial images and lidar
data. volume 34, pages 435–441. Elsevier, 2010.
[10] Florent Lafarge, Xavier Descombes, Josiane Zerubia, and Marc Pierrot-
Deseilligny. Automatic building extraction from dems using an object approach
and application to the 3d-city modeling. volume 63, pages 365–381. Elsevier,
2008.
[11] Zhengjun Liu, Shiyong Cui, and Qin Yan. Building extraction from high res-
olution satellite imagery based on multi-scale image segmentation and model
matching. In 2008 International workshop on earth observation and remote
sensing applications, pages 1–7. IEEE, 2008.
[12] Heinz Rüther, Hagai M. Martine, and E. G. Mtalo. Application of snakes and dynamic programming optimisation technique in modeling of buildings in informal settlement areas. ISPRS Journal of Photogrammetry and Remote Sensing, 56:269–282, 2002.
[13] Z Wang and W Liu. Building extraction from high resolution imagery based on
multi-scale object oriented classification and probabilistic hough transform. In
Proceedings of 2005 International Geoscience and Remote Sensing Symposium
(IGARSS’05), Seoul, South Korea, pages 25–29, 2005.
[14] Shibiao Xu, Xingjia Pan, Er Li, Baoyuan Wu, Shuhui Bu, Weiming Dong, Shiming Xiang, and Xiaopeng Zhang. Automatic building rooftop extraction from aerial images via hierarchical RGB-D priors. IEEE Transactions on Geoscience and Remote Sensing, 56:7369–7387, 2018.
[15] Yun Yang and Ying Lin. Object-based level set model for building detection
in urban area. In 2009 Joint Urban Remote Sensing Event, pages 1–6. IEEE,
2009.