VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Nguyen Minh Trang
ADVANCED DEEP LEARNING METHODS
AND APPLICATIONS IN
OPEN-DOMAIN QUESTION ANSWERING
MASTER THESIS
Major: Computer Science
Supervisor: Assoc. Prof. Ha Quang Thuy
Ph.D. Nguyen Ba Dat
HA NOI - 2019
Abstract
Since the Internet became ubiquitous, the amount of data accessible to
information retrieval systems has grown exponentially. For information
consumers, being able to obtain a short and accurate answer to any query is one
of the most desirable features. This motivation, along with the rise of deep
learning, has led to a boom in open-domain Question Answering (QA) research. An
open-domain QA system usually consists of two modules, a retriever and a
reader, each developed to solve a particular task. While document comprehension
has seen considerable success thanks to large training corpora and the
emergence of attention mechanisms, document retrieval in open-domain QA has not
made much progress. In this thesis, we propose a novel encoding method for
learning question-aware self-attentive document representations. These
representations are then used in a pairwise ranking approach. The resulting
model is a Document Retriever, called QASA, which is integrated with a machine
reader to form a complete open-domain QA system. Our system is thoroughly
evaluated on the QUASAR-T dataset and achieves better results than other
state-of-the-art methods.
Keywords: Open-domain Question Answering, Document Retrieval, Learning to
Rank, Self-attention mechanism.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my
supervisor, Assoc. Prof. Ha Quang Thuy, for his continuous support of my
Master’s study and research, and for his patience, motivation, enthusiasm, and
immense knowledge. His guidance helped me throughout the research and the
writing of this thesis.
I would also like to thank my co-supervisor, Ph.D. Nguyen Ba Dat, who has not
only provided me with valuable guidance but also generously funded my research.
My sincere thanks also go to Assoc. Prof. Chng Eng-Siong and M.Sc. Vu Thi Ly
for offering me the summer internship opportunities at NTU, Singapore, and for
guiding me in my work on diverse and exciting projects.
I thank my fellow labmates in KTLab, M.Sc. Le Hoang Quynh, B.Sc. Can Duy Cat,
and B.Sc. Tran Van Lien, for the stimulating discussions and for all the fun we
have had in the last two years.
Last but not least, I would like to thank my parents for giving birth to me in
the first place and for supporting me spiritually throughout my life.
Declaration
I declare that this thesis has been composed by myself and that the work has
not been submitted for any other degree or professional qualification. I
confirm that the work submitted is my own, except where work which has formed
part of jointly-authored publications has been included.
My contributions and those of the other authors to this work have been
explicitly indicated below. I confirm that appropriate credit has been given
within this thesis where reference has been made to the work of others. The
work presented in Chapter 3 was previously published in Proceedings of the 3rd
ICMLSC as “QASA: Advanced Document Retriever for Open Domain Question Answering
by Learning to Rank Question-Aware Self-Attentive Document Representations” by
Trang M. Nguyen (myself), Van-Lien Tran, Duy-Cat Can, Quang-Thuy Ha (my
supervisor), Ly T. Vu, and Eng-Siong Chng. This study was conceived by all of
the authors. My contributions include proposing the method, carrying out the
experiments, and writing the paper.
Master student
Nguyen Minh Trang
Table of Contents
Abstract
Acknowledgements
Declaration
Table of Contents
Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Open-domain Question Answering
    1.1.1 Problem Statement
    1.1.2 Difficulties and Challenges
  1.2 Deep learning
  1.3 Objectives and Thesis Outline
2 Background knowledge and Related work
  2.1 Deep learning in Natural Language Processing
    2.1.1 Distributed Representation
    2.1.2 Long Short-Term Memory network
    2.1.3 Attention Mechanism
  2.2 Employed Deep learning techniques
    2.2.1 Rectified Linear Unit activation function
    2.2.2 Mini-batch gradient descent
    2.2.3 Adaptive Moment Estimation optimizer
    2.2.4 Dropout
    2.2.5 Early Stopping
  2.3 Pairwise Learning to Rank approach
  2.4 Related work
3 Material and Methods
  3.1 Document Retriever
    3.1.1 Embedding Layer
    3.1.2 Question Encoding Layer
    3.1.3 Document Encoding Layer
    3.1.4 Scoring Function
    3.1.5 Training Process
  3.2 Document Reader
    3.2.1 DrQA Reader
    3.2.2 Training Process and Integrated System
4 Experiments and Results
  4.1 Tools and Environment
  4.2 Dataset
  4.3 Baseline models
  4.4 Experiments
    4.4.1 Evaluation Metrics
    4.4.2 Document Retriever
    4.4.3 Overall system
Conclusions
List of Publications
References
Acronyms
Adam      Adaptive Moment Estimation
AoA       Attention-over-Attention
BiDAF     Bi-directional Attention Flow
BiLSTM    Bi-directional Long Short-Term Memory
CBOW      Continuous Bag-Of-Words
EL        Embedding Layer
EM        Exact Match
GA        Gated-Attention
IR        Information Retrieval
LSTM      Long Short-Term Memory
NLP       Natural Language Processing
QA        Question Answering
QASA      Question-Aware Self-Attentive
QEL       Question Encoding Layer
R3        Reinforced Ranker-Reader
ReLU      Rectified Linear Unit
RNN       Recurrent Neural Network
SGD       Stochastic Gradient Descent
TF-IDF    Term Frequency – Inverse Document Frequency
TREC      Text Retrieval Conference
List of Figures
1.1 An overview of Open-domain Question Answering system.
1.2 The pipeline architecture of an Open-domain QA system.
1.3 The relationship among three related disciplines.
1.4 The architecture of a simple feed-forward neural network.
2.1 Embedding look-up mechanism.
2.2 Recurrent Neural Network.
2.3 Long short-term memory cell.
2.4 Attention mechanism in the encoder-decoder architecture.
2.5 The Rectified Linear Unit function.
3.1 The architecture of the Document Retriever.
3.2 The architecture of the Embedding Layer.
4.1 Example of a question with its corresponding answer and contexts from QUASAR-T.
4.2 Distribution of question genres (left) and answer entity-types (right).
4.3 Top-1 accuracy on the validation dataset after each epoch.
4.4 Loss diagram of the training dataset calculated after each epoch.
List of Tables
1.1 An example of problems encountered by the Document Retriever.
4.1 Environment configuration.
4.2 QUASAR-T statistics.
4.3 Hyperparameter Settings.
4.4 Evaluation of retriever models on the QUASAR-T test set.
4.5 The overall performance of various open-domain QA systems.
Chapter 1
Introduction
1.1 Open-domain Question Answering
We are living in the Information Age, where many aspects of our lives are
driven by information and technology. Since the boom of the Internet a few
decades ago, a colossal amount of data has become available, and this amount
continues to grow exponentially. Obtaining all of these data is one thing;
using them efficiently and extracting information from them is another, and it
is one of the most pressing requirements. Generally, the activity of acquiring
useful information from a data collection is called Information Retrieval (IR).
A search engine, such as Google or Bing, is a type of IR system. Search engines
are so extensively used that it is hard to imagine our lives today without
them. Despite their usefulness, current search engines and similar IR systems
can only produce a list of documents relevant to the user’s query. To find the
exact answer needed, users still have to examine these documents manually.
Because of this, although IR systems are handy, retrieving the desired
information is still a time-consuming process.
A Question Answering (QA) system is another type of IR system that is more
sophisticated than a search engine in that it offers a more natural form of
human-computer interaction [27]. Users can express their information needs in
natural language instead of a series of keywords as in search engines.
Furthermore, instead of a list of documents, QA systems try to return the most
concise and coherent answers possible. With the vast amount of data available
nowadays, QA systems can save considerable effort in retrieving information.
Depending on usage, there are two types of QA: closed-domain and open-domain.
Unlike closed-domain QA, which is restricted to a certain domain and requires
manually constructed knowledge bases, open-domain QA aims to answer questions
about virtually anything. Hence, it mostly relies on world knowledge in the
form of large unstructured corpora, e.g. Wikipedia, but databases are also used
if needed. Figure 1.1 shows an overview of an open-domain QA system.
Figure 1.1: An overview of Open-domain Question Answering system.
Research on QA systems has a long history, tracing back to the 1960s when
Green et al. [20] first proposed BASEBALL. About a decade later, Woods et al.
[48] introduced LUNAR. Both of these systems are closed-domain, and they use
manually defined language patterns to transform questions into structured
database queries. From then on, knowledge bases and closed-domain QA systems
became dominant [27]. They allow users to ask questions about certain topics
but not all. It was not until the beginning of this century that open-domain QA
research became popular, following the launch of the question answering track
of the annual Text Retrieval Conference (TREC) [44] in 1999. Ever since, the
TREC competitions, especially the open-domain QA tracks, have grown in the size
and complexity of the datasets provided, and their evaluation strategies have
improved [36]. Attention is now shifting to open-domain QA, and in recent years
the number of studies on the subject has increased considerably.
1.1.1 Problem Statement
In QA systems, the questions are natural-language sentences, and there are many
types of them based on their semantic categories, such as factoid, list,
causal, confirmation, and hypothetical questions. The most common ones, which
attract most studies in the literature, are factoid questions, which usually
begin with Wh-interrogative words, i.e. What, When, Where, Who [27]. In
open-domain QA, the questions are not restricted to any particular domain;
users can ask whatever they want. Answers to these questions are facts, and
they can simply be expressed in text format.
From an overview perspective, as presented in Figure 1.1, the input and output
of an open-domain QA system are straightforward. The input is the question,
which is unrestricted, and the output is the answer; both are coherent
natural-language sentences represented as text sequences. The system can use
resources from the web or available databases. Any system of this kind can be
considered an open-domain QA system. However, open-domain QA is usually broken
down into smaller sub-tasks, since giving concise answers to arbitrary
questions is not trivial. Each sub-task has a dedicated component. Typically,
there are two sub-tasks: document retrieval and document comprehension (or
machine comprehension). Accordingly, open-domain QA systems customarily
comprise two modules: a Document Retriever and a Document Reader. As their
names suggest, the Document Retriever handles the document retrieval task and
the Document Reader deals with the machine comprehension task. The two modules
can be integrated in a pipeline manner, e.g. [7, 46], to form a complete
open-domain QA system. This architecture is depicted in Figure 1.2.
Figure 1.2: The pipeline architecture of an Open-domain QA system.
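To make the pipeline concrete, the following is a minimal sketch of how a
retriever and a reader could be chained. It is an illustration only: the
function names, the word-overlap ranking and the last-word "reader" are
hypothetical placeholders chosen for brevity, and they stand in for, rather
than reproduce, the QASA retriever and the machine reader used in this thesis.

```python
import re
from typing import Callable, List

# Hypothetical interfaces for illustration only (not taken from the thesis).
Retriever = Callable[[str, List[str], int], List[str]]
Reader = Callable[[str, List[str]], str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract its alphanumeric word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_retriever(question: str, corpus: List[str], top_k: int) -> List[str]:
    """Toy Document Retriever: rank documents by the number of distinct
    words they share with the question and keep the top_k documents."""
    question_words = set(tokenize(question))
    ranked = sorted(
        corpus,
        key=lambda doc: len(question_words & set(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def last_word_reader(question: str, documents: List[str]) -> str:
    """Toy Document Reader: return the last word of the top-ranked document.
    A real reader would extract an answer span with a trained model."""
    if not documents:
        return ""
    words = re.findall(r"\w+", documents[0])
    return words[-1] if words else ""


def answer_question(
    question: str,
    corpus: List[str],
    retriever: Retriever,
    reader: Reader,
    top_k: int = 2,
) -> str:
    """Pipeline: the retriever narrows the corpus, then the reader answers."""
    documents = retriever(question, corpus, top_k)
    return reader(question, documents)


if __name__ == "__main__":
    corpus = [
        "The capital of France is Paris.",
        "Bananas are yellow.",
        "Hanoi is the capital of Vietnam.",
    ]
    question = "What is the capital of France?"
    print(answer_question(question, corpus, overlap_retriever, last_word_reader))
    # prints "Paris"
```

Because the two modules communicate only through a list of documents, either
one can be replaced independently, which is what allows a learned retriever to
be paired with an existing reader.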