Thesis: Named Entity Recognition in Vietnamese Documents

This master's thesis by Le Ngoc Anh focuses on Named Entity Recognition (NER) for Vietnamese documents, addressing the challenges posed by the language's characteristics and the lack of a standardized corpus. The study explores various approaches to NER, including handcrafted rules and machine learning methods, and demonstrates that supervised machine learning techniques, specifically Conditional Random Fields and Support Vector Machines, outperform traditional methods. The thesis emphasizes the importance of feature selection tailored to the Vietnamese language and presents experimental results validating the effectiveness of the proposed methods.

UNIVERSITY OF ENGINEERING AND TECHNOLOGY

VIETNAM NATIONAL UNIVERSITY, HANOI

LE NGOC ANH

NAMED ENTITY RECOGNITION FOR

VIETNAMESE DOCUMENTS

MASTER’S THESIS OF INFORMATION TECHNOLOGY

Ha Noi - 2015
Major: Computer Science

Code: 60.48.0101

Supervisor: Assoc. Prof. Dr. Le Anh Cuong


Originality statement
“I hereby declare that the work contained in this thesis is my own and has not been
previously submitted for a degree or diploma at this or any other higher education
institution. To the best of my knowledge and belief, the thesis contains no material
previously published or written by another person except where due reference or
acknowledgement is made.”

Signature:…………………………………………

Supervisor’s approval
“I hereby approve that the thesis in its current form is ready for committee
examination as a requirement for the Master of Computer Science degree at the
University of Engineering and Technology.”

Signature:………………………………………………

Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into
pre-defined entity classes. It is fundamental to many natural language processing
tasks such as machine translation, information extraction and question answering.
NER has been extensively studied for languages such as English, Japanese and
Chinese. However, NER for Vietnamese remains challenging because of the
characteristics of the language and the lack of Vietnamese corpora.

In this thesis, we study approaches to NER including handcrafted rules, machine
learning and hybrid methods. We present challenges in NER for Vietnamese such as
the lack of a standard evaluation corpus and of standard methods for constructing data
sets. In particular, we focus on labeling entities in Vietnamese, since most studies have not
presented the details of handcrafting entities in Vietnamese. We apply supervised
machine learning methods to Vietnamese NER based on Conditional Random Fields
and Support Vector Machines, with feature selection adapted to
Vietnamese. The evaluation shows that these methods outperform traditional
NER methods such as Hidden Markov Models and rule-based methods.

Acknowledgement
First, I would like to thank my supervisor, Assoc. Prof. Dr. Le Anh Cuong, for his
advice and support. This thesis would not have been possible without him and without
the freedom and encouragement he has given me over the last two years I spent at the
Faculty of Technology of the University of Engineering and Technology, Vietnam
National University (VNU), Ha Noi.

I have been working with amazing friends in the K19CS class. I extend my gratitude
to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would
especially like to thank the teachers at the University of Engineering and Technology,
VNU, for their collaboration, great ideas and feedback during my dissertation.

Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice and
support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc
Khanh Le, for their endless love and sacrifice over the last two years. They gave me
the strength and encouragement to complete this thesis.

Ha Noi, September, 2015

Le Ngoc Anh

Contents
Supervisor's approval
Abstract
Acknowledgement
List of Figures
List of Tables
List of Abbreviations
Chapter 1 Introduction
1.1 Information Extraction
1.2 Named entity recognition
1.3 Evaluation for NER
1.4 Our work
Chapter 2 Approaches to Named Entity Recognition
2.1 Rules based methods
2.2 Machine learning methods
2.3 Hybrid methods
Chapter 3 Feature Extraction
3.1 Characteristics of Vietnamese language
3.1.1 Lexical Resource
3.1.2 Word Formation
3.1.3 Spelling Variation
3.2 Feature selection for NER
3.2.1 Feature selection methods
3.2.2 Mask methods
3.2.3 Taxonomy of features
3.3 Feature selection for Vietnamese NER
4.1 Data preparation
4.2 Machine learning methods for Vietnamese NER
4.2.1 SVM method
4.2.2 CRF method
4.3 Experimental results
4.4 An example of experimental results and error analysis
Chapter 5 Conclusion
References

List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a
terrorist attack. Source [18]
Figure 2.1: Directed graph represents HMM
Figure 2.2: How to compute transition probabilities
Figure 2.3: A two dimensional SVM, with the hyperplanes representing the margin
drawn as dashed lines.
Figure 2.4: The mapping of input data from the input space into an infinitely
dimensional Hilbert Space in a non-linear SVM classifier. Source [17].
Figure 3.1: A taxonomy of feature selection methods. Source [21].
Figure 4.1: Generating training data stages
Figure 4.2: Vietnamese NER based on SVM
Figure 4.3: Vietnamese NER based on CRF

List of Tables
Table 3.1: Word-level features
Table 3.2: Gazetteer features
Table 3.3: Document and corpus features
Table 3.4: Orthographic features for Vietnamese
Table 3.5: Lexical and POS features for Vietnamese
Table 4.1: An example of a sentence in training data
Table 4.2: Statistics of training data in entity level
Table 4.3: The number of label types in training data and test data
Table 4.4: Results on testing data of SVM Learner
Table 4.5: Results on testing data of NER using CRF method
Table 4.6: Annotating table

List of Abbreviations
Abbreviation    Stands for
CRF             Conditional Random Field
HMM             Hidden Markov Model
IE              Information Extraction
MEMM            Maximum Entropy Markov Model
NER             Named Entity Recognition
SVM             Support Vector Machine
ML              Machine Learning
MUC             Message Understanding Conferences
CoNLL           Conferences on Natural Language Learning
MET             Multilingual Entity Tasks
SigNLL          Special Interest Group on Natural Language Learning
POS             Part of Speech
IIS             Improved Iterative Scaling

Chapter 1 Introduction

1.1 Information Extraction

Information Extraction (IE) is a research area in Natural Language Processing (NLP). It
focuses on techniques to identify a predefined set of concepts in a specific domain,
where a domain consists of a text corpus together with a well-defined information
need. In other words, IE is about deriving structured information from unstructured
text. For instance, we may be interested in extracting information on violent events from
online news, which involves identifying the main actors of the event, its
location and the number of people affected [18]. Figure 1.1 shows an example of a text
snippet from a news article about a terrorist attack and the structured information
derived from that snippet. The process of extracting such structured information
involves identifying certain small-scale structures such as noun phrases
denoting a person or a group of persons, geographical references and numerical
expressions, as well as finding semantic relations between them. However, in this
scenario some domain-specific knowledge is required (e.g., understanding the fact that
terrorist attacks might result in people being killed or injured) in order to correctly
aggregate the partially extracted information into a structured form.

“Three bombs have exploded in north-eastern Nigeria, killing 25 people and
wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs
exploded on Sunday afternoon in the city of Maiduguri.”

Figure 1.1: Example of automatically extracted information from a news article on a
terrorist attack. Source [18]
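
As a rough illustration of what "structured information" could look like for the snippet above, the following Python sketch fills a small record from the text with regular expressions. The record layout (`AttackRecord` and its field names) and the patterns are assumptions made for this example only; they are not taken from [18] or from the figure, and real IE systems rely on far more general techniques than patterns written for a single sentence.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttackRecord:
    """Hypothetical target structure; the field names are illustrative only."""
    location: Optional[str]
    city: Optional[str]
    time: Optional[str]
    killed: Optional[int]
    wounded: Optional[int]


SNIPPET = (
    "Three bombs have exploded in north-eastern Nigeria, killing 25 people and "
    "wounding 12 in an attack carried out by an Islamic sect. Authorities said "
    "the bombs exploded on Sunday afternoon in the city of Maiduguri."
)


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return the first captured group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_attack(text: str) -> AttackRecord:
    """Fill the record using patterns written by hand for this snippet."""
    killed = first_group(r"killing (\d+)", text)
    wounded = first_group(r"wounding (\d+)", text)
    return AttackRecord(
        location=first_group(r"exploded in ([\w-]+ [A-Z]\w+)", text),
        city=first_group(r"city of ([A-Z]\w+)", text),
        time=first_group(r"\bon (\w+ \w+)", text),
        killed=int(killed) if killed else None,
        wounded=int(wounded) if wounded else None,
    )


record = extract_attack(SNIPPET)
print(record)
```

The patterns encode a little domain knowledge directly (for example, that the number after "killing" counts victims), which is the kind of knowledge the paragraph above says is needed to aggregate partial extractions.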

Starting in 1987, a series of Message Understanding Conferences (MUC) was
held, focusing on the following domains:

• MUC-1 (1987), MUC-2 (1989): Naval operations messages.
• MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
• MUC-5 (1993): Joint ventures and microelectronics.
• MUC-6 (1995): News articles on management changes.
• MUC-7 (1998): Satellite launch reports.

The significance of IE is related to the growing amount of information available in
unstructured form. Tim Berners-Lee, the inventor of the World Wide Web
(WWW), refers to the existing Internet as the web of documents and advocates
making more content available as a web of data. Until this happens, the web largely
consists of unstructured documents without semantic metadata. Knowledge contained
in these documents can be made more accessible for machine processing by transforming
information into relational form, or by marking it up with XML tags. For instance, an
intelligent agent monitoring a news data feed requires IE to transform unstructured
data (i.e., text) into something that can be reasoned with. A typical application of IE is
to scan a set of documents written in a natural language and populate a database with
the extracted information.

IE on text aims at creating a structured view, i.e., a representation of the information
that is machine understandable. According to [18], the classical IE tasks include:

Named Entity Recognition addresses the problem of the identification (detection) and
classification of predefined types of named entities, such as organizations (e.g., 'World
Health Organisation'), persons (e.g., 'Muammar Kaddafi'), place names (e.g., 'the
Baltic Sea'), temporal expressions (e.g., '1 September 2011'), numerical and currency
expressions (e.g., '20 Million Euros'), etc.
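
To make the "identification and classification" step concrete, here is a minimal Python sketch that tags the example entities listed above using a tiny gazetteer plus two regular expressions. The label names, the gazetteer contents, the patterns and the test sentence are all assumptions invented for illustration; this is not the method studied in the thesis, which uses supervised learning.

```python
import re

# Hypothetical gazetteer built only from the examples in the text above.
GAZETTEER = {
    "ORGANIZATION": ["World Health Organisation"],
    "PERSON": ["Muammar Kaddafi"],
    "LOCATION": ["the Baltic Sea"],
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hand-written patterns for temporal and currency expressions (illustrative).
PATTERNS = {
    "DATE": rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}\b",
    "MONEY": r"\b\d+(?:\.\d+)? (?:(?:Million|Billion) )?(?:Euros|Dollars)\b",
}


def recognize(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.end(), label))
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.end(), label))
    found.sort()
    return [(text[start:end], label) for start, end, label in found]


# A made-up sentence that strings the examples together.
sentence = (
    "The World Health Organisation met Muammar Kaddafi near the Baltic Sea "
    "on 1 September 2011 to discuss 20 Million Euros."
)
entities = recognize(sentence)
for surface, label in entities:
    print(f"{label:13s} {surface}")
```

A lookup like this only finds names it already knows, which is one reason the later chapters turn to learned models with richer features.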

Co-reference Resolution requires the identification of multiple (co-referring)
mentions of the same entity in the text. For example, "International Business
Machines" and "IBM" refer to the same real-world entity. If we take the two sentences
"M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect
that "he" is referring to the previously detected person "M. Smith".
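
One simple cue for the first example is that "IBM" is formed from the initials of "International Business Machines". The Python sketch below links mentions on that basis alone. The function names are hypothetical, and the heuristic is only an assumption for illustration: it does nothing for pronouns such as "he", which need context-aware resolution.

```python
def acronym(name: str) -> str:
    """Build an acronym from the first letter of each whitespace-separated word."""
    return "".join(word[0].upper() for word in name.split())


def link_acronyms(mentions):
    """Map each mention to the first multi-word mention whose initials equal it."""
    links = {}
    for short in mentions:
        for long in mentions:
            if " " in long and short != long and acronym(long) == short:
                links[short] = long
                break
    return links


mentions = ["International Business Machines", "IBM", "M. Smith"]
links = link_acronyms(mentions)
print(links)
```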

Relation Extraction focuses on detecting and classifying predefined relationships
between entities identified in text. For example:

• PERSON works for ORGANIZATION (extracted from the sentence "Bill works
for IBM.")
• PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
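
The two relations above can be sketched as typed patterns over sentences whose entities are already labeled. In the Python example below, the entity-type table stands in for the output of an NER step, and the relation names and patterns are assumptions chosen for this illustration rather than part of any system described in the thesis.

```python
import re

# Assumed output of a prior NER step for the two example sentences.
ENTITY_TYPES = {"Bill": "PERSON", "IBM": "ORGANIZATION", "France": "LOCATION"}

# (relation name, subject type, object type, surface pattern) - illustrative.
RELATION_PATTERNS = [
    ("works_for", "PERSON", "ORGANIZATION", re.compile(r"^(\w+) works for (\w+)\.$")),
    ("located_in", "PERSON", "LOCATION", re.compile(r"^(\w+) is in (\w+)\.$")),
]


def extract_relations(sentence):
    """Return (subject, relation, object) triples whose entity types fit."""
    relations = []
    for name, subj_type, obj_type, pattern in RELATION_PATTERNS:
        match = pattern.match(sentence)
        if not match:
            continue
        subj, obj = match.groups()
        # Keep the triple only if both arguments have the required types.
        if ENTITY_TYPES.get(subj) == subj_type and ENTITY_TYPES.get(obj) == obj_type:
            relations.append((subj, name, obj))
    return relations


for sentence in ["Bill works for IBM.", "Bill is in France."]:
    print(sentence, "->", extract_relations(sentence))
```

The type check is what separates "Bill is in France." from a sentence such as "Bill is in IBM.", which matches the surface pattern but not the PERSON located in LOCATION signature.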

Event Extraction refers to the identification of events in free text and the derivation of
detailed and structured information about them. Ideally, it should identify who did
what to whom, when, where, through what methods (instruments), and why. Normally,
event extraction involves extracting several entities and the relationships between them.
For instance, extracting information on terrorist attacks from the text fragment
'Masked gunmen armed with assault rifles and grenades attacked a wedding party in
mainly Kurdish southeast Turkey, killing at least 44 people.' would identify the
perpetrators (masked gunmen), victims (people), number of
killed/injured (at least 44), weapons used (rifles and grenades), and location (southeast
Turkey).

1.2 Named entity recognition

In IE, named entity recognition (NER) is slightly more complex. The named entities
(e.g. the location, actors and targets of a terrorist event) need to be recognized as such.
This NER task (also known as 'proper name classification') involves the identification
and classification of named entities: people, places, organizations, products,
companies, and even dates, times, or monetary amounts. For example, Figure 1.2
demonstrates an English NER system which identifies and classifies entities in text
documents. It identifies four entity types: person, location, organization and
miscellaneous (misc).

Figure 1.2: A named entity recognition system. Source: footnote 1

1 http://www.iti.illinois.edu/tech-transfer/technologies/natural-language-processing-nlp