RLP AppDev Guide
Release 6.0.0
Basis Technology, We Put the World in the World Wide Web, Rosette, Nichibei, and Geoscope are registered trademarks of Basis Technology
Corporation. All other brand names may be trademarks of their respective owners.
Lexical data used in this product Copyright © 2006 Basis Technology Corporation. Individual portions of the lexical data used in this product are
Copyright © 2005 Nara Institute of Science and Technology, Copyright © 2005 Appen Pty Ltd., Copyright © 2005 Hangul Research and Engineering
Center, Copyright © 2005 University of Hawaii.
The PCRE regular expression engine is Copyright © 1997-2004 University of Cambridge, All Rights Reserved.
The Expat XML parser is Copyright © 1998-2000 Thai Open Source Software Center Ltd. and Clark Cooper, Copyright © 2001-2003 Expat
Maintainers.
The Tcl Regular Expressions Manual Page is Copyright © 1998 Sun Microsystems, Inc., Copyright © 1999 Scriptics Corporation, Copyright ©
1995-1997 Roger E. Critchlow Jr.
Preface
The Rosette Linguistics Platform (RLP) is designed for document handling systems that need to identify,
classify, analyze, index, and search unstructured text in many different languages. By integrating Rosette,
developers can enable their applications to use any source of raw text data by identifying the language and
encoding of a given document, converting the text to Unicode so that it can be processed, and performing
comprehensive linguistic analysis and entity extraction of text in English and a variety of Asian, European
and Middle Eastern languages.
1. In this Guide
This guide explains how to install, configure, and use RLP to process and analyze text in a variety of languages.
2. Other Documentation
This developer's guide is intended to be used with the following documents:
• The RLP Release Notes contain up-to-date information about new features and bug fixes in this release.
• The API Reference includes HTML documentation generated from source code for the C++ and Java APIs.
3. What's New
The following features are new in RLP 6.0.0:
• Added support for new named entity types: RELIGION, NATIONALITY, GPE (geo-political entity), and FACILITY (a man-made structure or architectural entity).
• A preliminary .NET API [61] that provides limited coverage of RLP functionality.
• Merging of the Korean Hangul and compound noun dictionaries into a single compiled Korean user
dictionary [194] that users can edit and recompile.
• The Rosette Language Identifier (RLI) [164] returns DETECTED_SCRIPT [78], the ISO 15924 code for the writing script of the text to be processed.
• The Rosette Language Identifier (RLI) [164] is able to detect UTF-16 encoding.
• Moved the routine for scanning the RLP license and generating a list of supported features from the
introductory RLP sample applications to separate C++, Java, and C sample applications.
• Expanded the scope of Tokenizer [173] to tokenize all languages. In a context configuration, Tokenizer should be placed after processors that provide their own language-specific tokenization (BL1, CLA, JLA, and KLA), and (along with the Sentence Boundary Detector) before processors that use the tokenization it provides (ARBL, FABL, and URBL).
Consult the RLP Release Notes (RLP-6.0.0-ReadMe.html) for full details about changes to RLP in this
release.
Chapter 1. Introduction to the Rosette Linguistics Platform
The Rosette Linguistics Platform (RLP) is the backbone of Basis Technology's text and language analysis
technology. RLP provides advanced natural-language processing techniques to help your applications
unlock information in unstructured text. RLP includes modules for language and encoding identification,
converting text to Unicode, identifying basic linguistic features, and locating key entities like the names
of people, places, and objects of interest. RLP supports English and a variety of Asian, European, and
Middle Eastern languages. The detailed linguistic information provided by RLP can be used to increase
the accuracy and depth of information-retrieval, text-mining, entity-extraction, and other text-analysis
applications.
Language support for each of these operations is indicated in the following table:
Table 1.1. RLP Language Support for Base Linguistics (BL) and Named Entity Extraction (NE)

Language | Tokenization | POS | SBD | BNP | Stemming | Compounds | Readings | NE
Arabic | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | | ✓
Chinese (Simplified) | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓ | ✓
Chinese (Traditional) | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓ | ✓
Czech | ✓ | ✓ | ✓ | | ✓ | n/a | |
Dutch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓
English | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | n/a | ✓
French | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | | ✓
German | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | ✓
Greek | ✓ | ✓ | ✓ | | ✓ | n/a | |
Hungarian | ✓ | ✓ | ✓ | | ✓ | ✓ | |
Italian | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | | ✓
Japanese | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Korean | ✓ | ✓ | ✓ | | ✓ | ✓ | | ✓
Persian | ✓ | ✓ | ✓ | | ✓ | n/a | | ✓
Polish | ✓ | ✓ | ✓ | | ✓ | n/a | |
Portuguese | ✓ | ✓ | ✓ | ✓ | ✓ | n/a | |
Russian | ✓ | ✓ | ✓ | | ✓ | n/a | |

(POS = part-of-speech tagging; SBD = sentence boundary detection; BNP = base noun phrase detection. An empty cell indicates the operation is not available for that language.)
Architecture Overview
If you work with multilingual input data, RLP provides tools for locating regions of contiguous text in a
single language, so that you can process each region with the appropriate language processors.
In addition to the languages listed above, the Rosette Language Identifier (RLI) [164] can identify text
in the following languages: Albanian, Transliterated Arabic, Bahasa Indonesia, Bahasa Malay, Bengali,
Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Gujarati, Hebrew, Hindi, Icelandic, Kannada,
Kurdish, Latvian, Lithuanian, Malayalam, Norwegian, Pashto, Transliterated
Persian, Romanian, Serbian (Cyrillic and Latin), Slovak, Slovenian, Somali, Swedish, Tagalog, Telugu,
Thai, Turkish, Ukrainian, Transliterated Urdu, Uzbek (Cyrillic and Latin), and Vietnamese.
• C++, C, and Java APIs are available. The APIs do not vary from one human language to another.
Note
RLP's features are enabled by license keys issued by Basis Technology. Please contact us to obtain
the required evaluation or production license file, and refer to Installing RLP [5] for
information about where to put the license file.
All of the language processors available to the system -- as defined by the RLP license -- are kept in an
environment. A context specifies an ordered set of language processors used to perform a particular task.
A single environment can support multiple concurrent contexts. The language processors and their related
data (such as dictionaries or language models) are stored in one location where they may be shared by
different contexts.
Each context provides access to the results generated by the language processors in that context.
RLP Language Processors
An application uses an ordered sequence of language processors (a context) to process the input text. In
many cases, one language processor requires the output of another language processor as input for its own
operation.
An RLP context specifies an ordered series of language processors. Created by the environment, a single
context defines a set of operations to be performed on a document.
For example, a context for extracting Japanese named entities might consist of the Unicode Converter (for
converting UTF-8, for example, to the appropriate form of UTF-16 for the platform), the Japanese Language
Analyzer (for tokenization and POS tagging), the Japanese Orthographic Analyzer (to find normalized
forms for word variants), the Sentence Boundary Detector, the Base Noun Phrase detector, the Named
Entity Extractor, the Gazetteer (to find named entities listed in gazetteers), the Regular Expression
processor (to find named entities, such as dates or email addresses, identified with regular expressions),
the Named Entity Redactor (for resolving duplication or overlaps), and the REXML processor (for
generating a report).
RLP Configuration
Multiple contexts can be created from a single environment, and contexts can be executed in their own
threads, given that all RLP processors are re-entrant.
An environment can be used in more than one processing thread with no locking. A context can only be
used in one thread at a time unless the application protects it with an appropriate synchronization
mechanism for the platform.
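The following sketch illustrates this model. It assumes an environment (rlp) that has already been initialized, the embedded context-configuration string (CONTEXT) used by the sample application later in this guide, and C++11 threads for brevity; because each thread creates and destroys its own context, no locking is needed:

#include <cstring>
#include <thread>

// Each worker creates its own context from the shared environment,
// processes one file, and destroys the context when it is done.
static void analyzeFile(BT_RLP_Environment* rlp,
                        const char* path,
                        BT_LanguageID lang)
{
    BT_RLP_Context* context = 0;
    if (rlp->GetContextFromBuffer((const unsigned char*) CONTEXT,
                                  strlen(CONTEXT), &context) != BT_OK)
        return;
    context->ProcessFile(path, lang);  // results are read from this context only
    rlp->DestroyContext(context);
}

// One environment, two concurrent contexts:
void processTwoFilesConcurrently(BT_RLP_Environment* rlp, BT_LanguageID langID)
{
    std::thread t1(analyzeFile, rlp, "doc1.txt", langID);
    std::thread t2(analyzeFile, rlp, "doc2.txt", langID);
    t1.join();
    t2.join();
}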
Chapter 2. RLP Getting Started
This chapter shows you how to download and install RLP, and how to use the RLP command-line utility
to process sample data and your own text.
When you contact Basis Technology to obtain a copy of RLP (for production use or for evaluation), we send you an email with private download links to the license file (rlp-licenses.xml), the SDK package, and the documentation package. These links expire after 30 days. If you need an extension, please contact [email protected].

The SDK package must be the correct package for your platform (see Supported Platforms [16]). The SDK is an executable .msi installer for 32-bit Windows, a .zip file for 64-bit Windows, or a compressed archive (.tar.gz) for Unix. The file names include a version number and a platform designation.

The documentation package includes this book, the release notes, and HTML API references for the C++, C, and Java APIs. The documentation is a .zip file for Windows or a compressed archive (.tar.gz) for Unix. See Documentation package file name [18].

The license contains a set of keys that define the language operations you are authorized to perform with RLP.
Windows 32 Installation

1. Run the SDK executable (for example, rlp-6.0.0-sdk-ia32-w32-msvc71.msi). You may double-click it in the Explorer window or go to Start → Run and either browse for the file or enter its exact location. The RLP SDK Installer appears:
2. The Installer guides you through the installation process. Select Next to proceed.
3. Select the destination directory for your installation. The default directory is C:\Program Files\Basis
Technology\Rlp-SDK-6.0.0.
4. Review the installation specifications. If they are incorrect, select the Back button until you reach the
appropriate window and make your corrections. Then select Next until you have returned to the review
window. Select Next to proceed with the installation.
5. After installing RLP, you must copy the rlp-license.xml license file into the BT_ROOT\rlp\rlp\licenses directory. RLP does not run without a valid license file. If you wish to upgrade from an evaluation license or add support for another language or another RLP feature, contact Basis Technology Corp. at [email protected].
Windows 64 Installation

1. Extract the compressed SDK file (for example, rlp-6.0.0-sdk-amd64-w64-msvc80.zip) into a directory on your local volume, such as C:\Program Files\Basis Technology\Rlp-SDK-6.0.0\. For example:
unzip rlp-6.0.0-sdk-amd64-w64-msvc80.zip
-d "C:\Program Files\Basis Technology\RLP-SDK-6.0.0"
2. Copy the license file provided in the installation package into BT_ROOT\rlp\rlp\licenses. RLP does
not run without a valid license file. If you wish to upgrade from an evaluation license or add support
for another language or another RLP feature, contact Basis Technology Corp. at
[email protected] .
Unix Installation

1. Extract the compressed SDK file (for example, rlp-6.0.0-sdk-ia32-glibc22-gcc32.tar.gz) into a directory on your local volume. For example:

cd /usr/local/BasisTech/BT_RLP
gunzip rlp-6.0.0-sdk-ia32-glibc22-gcc32.tar.gz
tar -xf rlp-6.0.0-sdk-ia32-glibc22-gcc32.tar
2. Copy the license file provided in the installation package into BT_ROOT/rlp/rlp/licenses. RLP does
not run without a valid license file. If you wish to upgrade from an evaluation license or add support
for another language or another RLP feature, contact Basis Technology Corp. at
[email protected] .
3. Add the RLP library directory to the LD_LIBRARY_PATH environment variable (or its equivalent for
your Linux operating system). The RLP library directory is BT_ROOT/rlp/lib/BT_BUILD , where
BT_BUILD is the platform identifier embedded in your SDK package file name (see Supported
Platforms [16] ). For example:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\
/usr/local/BasisTech/BT_RLP/rlp/lib/ia32-glibc22-gcc32/; export LD_LIBRARY_PATH;
Running the RLP Command-line Utility
The RLP command-line utility is an executable: rlp.exe in Windows and rlp in Unix. This utility is in the
binary directory:
BT_ROOT/rlp/bin/BT_BUILD
where BT_BUILD is the platform identifier embedded in your SDK package file name (see Supported
Platforms [16] ).
For example:

/usr/local/BasisTech/BT_RLP/rlp/bin/ia32-glibc22-gcc32/rlp

The go scripts are in the sample scripts directory, BT_ROOT/rlp/samples/scripts/BT_BUILD. For example:

/usr/local/BasisTech/BT_RLP/rlp/samples/scripts/ia32-glibc22-gcc32/go.sh
1. Use the command-line prompt (Windows) or bash shell (Unix) to navigate to the sample scripts
directory.
Warning
To find the required resources and work correctly, the go script must be run from the sample
scripts directory: BT_ROOT/rlp/samples/scripts/BT_BUILD .
2. Run the go script with one argument: the two-letter ISO639 code (more than two letters in special cases,
such as Simplified Chinese) for a language for which you have an RLP license. Use one of the codes
from the following table.
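For example, to process the English sample text (the Windows script name is assumed here to mirror the Unix go.sh):

./go.sh en (Unix)
go.bat en (Windows)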
The RLP command-line utility analyzes the input text and generates an XML report on what it finds.
The report includes an xml-stylesheet processing instruction.
Use your browser to open rlp-output.xml. The browser applies the xml-stylesheet to the file and
displays the resulting HTML.
What Takes Place When You Run the go Script
The RLP command-line utility begins by using the -root parameter to set the Basis root directory.
3. Environment
The environment defines the scope of operations available to RLP during this session and provides a
pointer to your RLP license, which specifies which operations you are licensed to perform. The
environment is specified with an XML configuration file: BT_ROOT/rlp/etc/rlp-global.xml.
4. Context
The context defines the sequence of processors that the RLP command-line utility applies to the sample input text. The context is defined with an XML configuration file. The go script specifies BT_ROOT/rlp/samples/etc/rlp-context.xml.
5. Input text
The input text is a UTF-8 text file in BT_ROOT/rlp/samples/data. The file name takes the form
<ln>-text.txt where <ln> is the language code you enter on the command line. For example, the
sample file for English is en-text.txt.
4. Uses the context object to process the input text. The context object defined with rlp-context.xml
performs the following tasks:
b. Tokenizes the text (each token is a word, multiword expression, possessive affix, or punctuation).
c. Tags the part of speech for each token (not currently supported for Persian and Urdu).
f. Finds named entities of various types (not supported for some languages).
If the XML report you see is missing any of this information, your RLP license does not authorize one or
more processors for the language you selected. To upgrade your license, contact Basis Technology Corp.
at [email protected] . Note: Named Entity extraction, Noun Phrase extraction,
and Part-of-Speech detection are not supported for some languages.
2.4.3. Using the RLP Command-Line Utility to Process Your Own Text
Instead of using the go script, you can run the RLP command-line utility with the necessary arguments
from the command line or with your own batch file or shell script. Call rlp with arguments for language,
RLP root directory, environment configuration file, context configuration file, and input file:
ln is the two-letter ISO639 language code. You must include -l ln unless your context includes the Rosette Language Identifier (RLI) and your RLP license authorizes the use of RLI.
BT_ROOT is the Basis Technology root directory. You must include -root BT_ROOT.
ln is the ISO639 language code. If you do not know the language code, you can omit this parameter:
RLI determines the language.
To handle such text you need a context configuration file with RLI and RCLU, both listed before the
processors that tokenize the text, tag parts of speech, and locate noun phrases, named entities, and sentence
boundaries.
The sample context configuration file rlp-context-rclu.xml includes the required processors.
If you know the language, include -l and the two-letter ISO639 language code. If you do not know the
language, do not include this parameter.
contextConfig is the pathname to the context configuration file that starts with RLI and RCLU:
BT_ROOT/rlp/samples/etc/rlp-context-rclu.xml.
You can include command-line parameters to specify encoding and property values to be passed to RLP
processors.
For a complete list of the arguments that the RLP command-line utility accepts, run
rlp[.exe] -h
The following table lists the sample context configuration files that are shipped with RLP. They are in BT_ROOT/rlp/samples/etc.
Using the Windows GUI Demo

To launch the Demo, select All Programs → Basis Technology → Rosette 6.0.0 Demo from the Windows Start menu.
Alternatively, you can select Edit → Edit/Write (or click the pencil icon) and type or paste text into the
Text Window on the top right. When you are done entering text, select Edit → Edit/Write (or click the
pencil icon) again.
RLP displays the language and writing script, MIME type, encoding, and length of the text.
To analyze the text, choose from the Demo menu. Your choices are:
• Base Linguistics
• Named Entity Extraction
• Base Noun Phrase Extraction
• Universal Base Linguistics
• Chinese Script Conversion (Simplified to Traditional or Traditional to Simplified)
The following figure shows the results of selecting Demo → Base Linguistics for the sample Arabic text
file.
Keep in mind that not all operations are available for all languages. For example, Named Entity Extraction
is currently not available for Czech, Greek, Hungarian, Polish, Portuguese, or Russian. Base Noun Phrase
Extraction is currently not available for Korean, Czech, Greek, Hungarian, Persian, Polish, Russian, or
Urdu. Base Linguistics is currently not available for Persian or Urdu.
You can also use the Windows RLP Demo to edit gazetteers and regular expression files. For
comprehensive information about this demo, select Help → Help/Usage. This information also appears
in this manual: Windows GUI Demo [209] .
Supported Platforms and BT_BUILD Values
If you are planning to use the C++ (or C) API, use the SDK package that incorporates the compiler you
plan to use.
If you are planning to use the Java API, you can use any Java SDK 1.5 package (or later) built for your OS
version and architecture. Java is supported except where noted otherwise.
If your platform and compiler do not appear in the following list, please contact Basis Technology Corp.
at [email protected] .
SDK Package File Name
The SDK package file name takes the form rlp-<ver>-sdk-<BT_BUILD>.<ext>, where <ver> is the RLP version (6.0.0 for the 6.0.0 release), BT_BUILD is in the table above, and <ext> is .tar.gz for Unix platforms and .msi or .zip for Windows.
For the RLP 6.0.0 release, the package file names are:
• rlp-6.0.0-sdk-ppc-aix52-xlc.tar.gz
• rlp-6.0.0-sdk-ia32-freebsd48-gcc34.tar.gz
• rlp-6.0.0-sdk-amd64-freebsd6-gcc344.tar.gz
• rlp-6.0.0-sdk-ia32-freebsd6-gcc344.tar.gz
• rlp-6.0.0-sdk-ia64-hpux11-aCC541.tar.gz
• rlp-6.0.0-sdk-parisc-hpux11-aCC333-aa.tar.gz
• rlp-6.0.0-sdk-ia32-glibc22-gcc32.tar.gz
• rlp-6.0.0-sdk-amd64-glibc23-gcc34.tar.gz
• rlp-6.0.0-sdk-amd64-glibc23-gcc40.tar.gz
• rlp-6.0.0-sdk-ia32-glibc23-gcc32.tar.gz
• rlp-6.0.0-sdk-ia32-glibc23-gcc34.tar.gz
• rlp-6.0.0-sdk-ia32-glibc23-gcc40.tar.gz
• rlp-6.0.0-sdk-amd64-glibc24-gcc41.tar.gz
• rlp-6.0.0-sdk-ia32-glibc24-gcc41.tar.gz
• rlp-6.0.0-sdk-amd64-glibc25-gcc42.tar.gz
• rlp-6.0.0-sdk-ia32-glibc25-gcc41.tar.gz
• rlp-6.0.0-sdk-ia32-glibc25-gcc42.tar.gz
• rlp-6.0.0-sdk-ia32-darwin891-gcc40.tar.gz
• rlp-6.0.0-sdk-amd64-solaris10-cc58.tar.gz
• rlp-6.0.0-sdk-amd64-solaris10-gcc412.tar.gz
• rlp-6.0.0-sdk-ia32-solaris10-cc58.tar.gz
• rlp-6.0.0-sdk-ia32-solaris10-gcc34.tar.gz
• rlp-6.0.0-sdk-sparc-solaris10-cc58.tar.gz
• rlp-6.0.0-sdk-sparc-solaris10-cc58-64.tar.gz
• rlp-6.0.0-sdk-sparc-solaris10-gcc412-64.tar.gz
• rlp-6.0.0-sdk-sparc-solaris28-cc52.tar.gz
• rlp-6.0.0-sdk-sparc-solaris28-cc52-64.tar.gz
• rlp-6.0.0-sdk-ia32-solaris9-gcc34.tar.gz
• rlp-6.0.0-sdk-sparc-solaris9-cc58.tar.gz
• rlp-6.0.0-sdk-sparc-solaris9-gcc345.tar.gz
• rlp-6.0.0-sdk-sparc-solaris9-cc58-64.tar.gz
• rlp-6.0.0-sdk-ia32-w32-msvc71.msi
• rlp-6.0.0-sdk-ia32-w32-msvc71-static.msi
• rlp-6.0.0-sdk-ia32-w32-msvc80.msi
• rlp-6.0.0-sdk-ia32-w32-msvc80-static.msi
• rlp-6.0.0-sdk-amd64-w64-msvc80.zip
• rlp-6.0.0-sdk-amd64-w64-msvc80-static.zip
Documentation Package File Name

For the RLP 6.0.0 release, the documentation package file names are:
• rlp-6.0.0-doc-unix.tar.gz
• rlp-6.0.0-doc-win.zip
• rlp-6.0.0-doc-ja-unix.tar.gz
• rlp-6.0.0-doc-ja-win.zip
• rlp-6.0.0-japanese-only-doc-ja-unix.tar.gz (documents the handling of Japanese text only)
• rlp-6.0.0-japanese-only-doc-ja-win.zip (documents the handling of Japanese text only)
Chapter 3. Creating an RLP Application
This chapter walks you through the process of creating an RLP application to extract information from
text. Prior to running the application, you may or may not know the language of the text, but the text is
assumed to all be in the same language. For information about how to process text input that may contain
passages in different languages, see Processing Multilingual Text [65] .
3.1. Overview
• Define the objectives [19]
Perhaps you want all the words that appear in the text, in which case you should decide whether you want
the words as they appear in text (tokens) or in their dictionary form (stems). You may want nouns or noun
phrases. If the input is Arabic, you may want vocalized transliterations of the names that appear in the text.
You may want all sentences in which a particular verb or noun, or named entity appears. You may want
RLP to identify the language or to process streams of text in multiple languages. Once you have determined
what kind of data you need, you can define the RLP context that generates the relevant result data.
Based on your objectives and the language and encoding of your input text, you will define an environment
and a context to process the text. Applying the context to the input text generates the result data that your
application can use.
For information about the types of data that RLP can produce, see RLP Result Types [77] .
RLP is distributed with an environment configuration file that you can use as is: BT_ROOT/rlp/etc/rlp-global.xml, where BT_ROOT/rlp is the Basis root directory.
You may want to edit the preload setting for some processors. If a processor is used frequently, setting
preload to "yes" may improve performance. When preload is set to "no", RLP does not load the
processor until it is actually used. You can also remove any processors that you do not plan to use in the
applications that use this environment configuration.
Defining an RLP Context
The RLP context object posts the input text to internal storage. All language processors get their input from
and post results to internal storage. Some language processors process raw data and generate UTF-16 raw
text. Other language processors scan the UTF-16 raw text and post tokens, named entities, etc. A given
processor may depend on other processors, which means it requires as input the output generated by one
or more of the processors that precede it in the context sequence.
The context determines which RLP results are created. In most cases, the body of your application works
with these results.
To perform linguistic analysis, the RLP language processors work with text encoded as UTF-16. The byte
order is big-endian or little-endian, depending on your platform. If your input text is UTF-16 in the correct
byte order for your platform, no conversion of the input text is required. If the input includes a byte order
mark (BOM), see Handling the BOM [72] .
For plain text in any standard encoding, you can use the Unicode Converter, followed by the RLI
and RCLU language processors. If the input is Unicode, Unicode Converter converts it to UTF-16
with the correct byte order for the current platform. If the input is not Unicode, RLI detects the encoding
(RLI also detects the language). If the input is not Unicode, RCLU converts the text to the required form
of UTF-16. See Preparing Text in Any Encoding [71] .
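A sketch of the start of such a processor sequence, using the <languageprocessor> syntax shown later in this chapter; the RLI and RCLU element names here are assumptions, so consult the shipped rlp-context-rclu.xml for the exact names:

<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>RLI</languageprocessor>
<languageprocessor>RCLU</languageprocessor>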
If the input includes markup in addition to text (HTML, XML, PDF, RTF, and Microsoft Office documents), you can use the iFilter processor (Windows only) or the HTML Stripper to remove the markup before linguistic analysis takes place. For more information, see Preparing Marked Up Input [73].
The order in which processors appear in the context configuration determines the order in which they will
run. Order is important. Some processors take their input from the output of a previous processor rather
than directly from the input text. In other words, some processors depend on the inclusion of another
processor. For example, you must run the Tokenizer and SentenceBoundaryDetector (in that
order) before you can run ARBL (Arabic Base Linguistics), FABL (Persian Base Linguistics), or URBL
(Urdu Base Linguistics).
For information about each of the language processors, their settings and dependencies, see Language
Processors [113] .
[1] In releases prior to RLP 5.2.0, processors were divided into three categories in an RLP context: an input processor, language processors, and an optional output processor. This distinction no longer exists. All RLP processors are now called language processors.
Language Analyzer User Dictionaries
Currently, user dictionaries are supported for the following language analyzers: Base Linguistics Language
Analyzer (European languages), Chinese Language Analyzer, Japanese Language Analyzer, and Korean
Language Analyzer. For more information about creating and using user dictionaries with these language
analyzers, please see their specific documentation in RLP Processors [113] .
Properties can be set via entries in the context configuration XML document. The following entry, for
example, tells REXML to send its output (an XML report) to a file:
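For instance, an entry of the following form (the property name shown here is illustrative, not confirmed; see RLP Processors [113] for the exact REXML property names):

<property name="com.basistech.rexml.output_pathname" value="rlp-output.xml"/>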
Context properties can also be set through the API. See Setting Context Properties [28] .
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>CLA</languageprocessor>
<languageprocessor>KLA</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
<languageprocessor>FABL</languageprocessor>
<languageprocessor>URBL</languageprocessor>
<languageprocessor>Stopwords</languageprocessor>
<languageprocessor>BaseNounPhrase</languageprocessor>
<languageprocessor>NamedEntityExtractor</languageprocessor>
<languageprocessor>Gazetteer</languageprocessor>
<languageprocessor>RegExpLP</languageprocessor>
<languageprocessor>NERedactLP</languageprocessor>
</languageprocessors>
</contextconfig>
3.4.5.2. Unicode Input and Base Linguistic Analysis for One Language
The following sample is much more specific. It uses the Unicode Converter to convert Japanese text
from UTF-8 (or some other Unicode encoding) to UTF-16. The JLA language processor (Japanese
Language Analyzer) segments Japanese text into separate words and assigns part-of-speech tags to each
word. If com.basistech.jla.deep_compound_decomposition is set to "true" (the default
is "false"), JLA recursively decomposes into smaller components any tokens marked in the dictionary
as decomposable. JON (the Japanese Orthographic Normalizer) uses a normalization dictionary to return
the normalized form of Japanese word variants. This context does not identify named entities.
<?xml version="1.0"?>
<!DOCTYPE contextconfig SYSTEM "contextconfig.dtd">
<contextconfig>
<properties>
<property name="com.basistech.jla.deep_compound_decomposition" value="true"/>
</properties>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>JON</languageprocessor>
</languageprocessors>
</contextconfig>
3.4.5.3. Other
For an example that handles HTML, PDF, or Microsoft Office documents, see Using iFilter [73] .
For an example that processes multilingual text, see RLBL Context [65] and Single-Language
Context [66] .
Steps

C++ only: Verify that the RLP runtime library is compatible with the library used for compilation.[2]

[2] See Using the RLP C API [49].
Setting Up the RLP Environment
Standard parameters include path to the Basis root directory, environment configuration XML, context
configuration XML, context and property settings, language and encoding of the input text.
Set the Basis root directory, set up diagnostic logging, and instantiate and configure the environment.
You may also want to collect information about the RLP features your license authorizes you to use.
Use the environment object and an XML context configuration file (or string) to instantiate a context
object. Set context properties as appropriate (see RLP Processors [113] ).
5. Handle the RLP results generated during the previous step. This is the heart of your application. For
detailed information, see Accessing RLP Results [77] .
If you are using C++, use BT_RLP_Environment::SetBTRootDirectory to set the Basis root
directory.
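A minimal sketch of this step in C++, with btRoot and envConfig taken from the command line; the Create and InitializeFromFile names are assumptions based on the patterns in the sample applications and the C API, so check the C++ API Reference for the exact signatures:

// Set the Basis root directory before creating the environment.
BT_RLP_Environment::SetBTRootDirectory(btRoot);
BT_RLP_Environment* rlp = BT_RLP_Environment::Create();  // assumed factory name
BT_Result rc = rlp->InitializeFromFile(envConfig);       // assumed initializer name
if (rc != BT_OK) {
    // The environment could not be initialized; report and exit.
}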
If you are using Java, there are two ways to set the Basis root directory:
• Set the bt.root system property. You can do this from the command line when you launch the Java
virtual machine:
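For instance (YourApp stands in for your main class; the path is illustrative):

java -Dbt.root=/usr/local/BasisTech YourApp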
In Java, you must also create an EnvironmentParameters object before you proceed to the next step.
Capturing Log Output
You can set the logging level to a single channel (e.g., "warning"), a comma-delimited list of channels (e.g., "warning,error"), or "all" (equivalent to "warning,error,info"). To mute all channels, set the logging level to "none".
For information about the codes returned if you include the "error" level, see Error codes [269] .
info_p is the data passed as the first parameter when you use SetLogCallbackFunction to register
the callback: it can be any value you want (or NULL).
channel is the channel number the message is being written to: BT_LOG::WARNING_CHANNEL,
BT_LOG::ERROR_CHANNEL, or BT_LOG::INFO_CHANNEL.
Here is a callback function that writes log messages to the open file that is established when the function
is registered:
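This is the log_callback function used by the sample C++ application later in this chapter; it prefixes each message with its channel and writes it to the FILE* supplied at registration:

static void log_callback(void* info_p, int channel, const char* message)
{
    static const char* szINFO    = "INFO : ";
    static const char* szERROR   = "ERROR : ";
    static const char* szWARN    = "WARN : ";
    static const char* szUNKNOWN = "UNKWN : ";
    const char* szLevel = szUNKNOWN;
    switch (channel) {
    case 0: szLevel = szWARN;  break;  // warning channel
    case 1: szLevel = szERROR; break;  // error channel
    case 2: szLevel = szINFO;  break;  // info channel
    }
    // info_p is the FILE* passed when the callback was registered.
    fprintf((FILE*) info_p, "%s%s\n", szLevel, message);
}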
The environment class provides static functions for registering the callback function and setting log level.
For example:
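A sketch using the function names given above; the argument order for SetLogCallbackFunction (info_p first, then the callback) is inferred from the info_p description:

// Route RLP log messages to stderr through log_callback.
BT_RLP_Environment::SetLogCallbackFunction(stderr, log_callback);
BT_RLP_Environment::SetLogLevel("warning,error");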
3.6.2.4. Alternative
If you want RLP to post messages to standard error, you can set the BT_RLP_LOG_LEVEL environment
value to indicate which channels you want sent to standard error (warnings, errors, and/or info). If you set
this environment variable, you do not need to perform any of the steps listed above.
Valid values are (case insensitive): none, all, info, warning, error. Multiple values can be set in a comma
separated list. For example:
export BT_RLP_LOG_LEVEL=all
export BT_RLP_LOG_LEVEL=none
export BT_RLP_LOG_LEVEL=warning,error
Note

Any setting of BT_RLP_LOG_LEVEL is overridden by a call to the C++ SetLogLevel function or the Java setLogLevel method, unless you pass "" as the parameter, in which case RLP uses the BT_RLP_LOG_LEVEL setting (if it is set) or "error" (if it is not set).
Getting License Information
If you attempt to use a processor or process a language for which you do not hold a license, RLP issues a
warning and continues operation.
Warning
Be sure you have purchased the licenses that are required for the processors that you want to use
in your applications. Turn on the warning channel in your logging callback to receive such
notification.
RLP provides an API for gathering information about your license. The environment object provides
methods for determining whether you have a valid license, whether it authorizes base linguistics and named
entity extraction for a given language, and whether it authorizes support for a given named feature. This
release includes sample applications that demonstrate how to use this API: samples/cplusplus/
examine_license.cpp and samples/java/ExamineLicense.java.
To determine whether your license includes base linguistics or named entity extraction for a given language,
use
To determine whether your license includes support for a named feature, use
Language IDs are defined in bt_language_names.h. The arguments you can use for functionality
and feature are defined in bt_rlp_license_types.h.
These methods are used in the examineLicenses method of the Sample C++ Application [30] .
To determine whether your license includes base linguistics or named entity extraction for a given language, use
To determine whether your license includes support for a named feature, use
Setting Up the Context
These methods are used in the examineLicenses method of the Sample Java Application [37] .
The following fragment gets the context configuration from an external file:
The sample C++ application that appears later in this chapter gets the context from a buffer. See item 3. in
Sample C++ Application [30] .
Warning
If you use an XML String rather than an XML file, and the XML declaration contains an
encoding attribute (optional), the encoding MUST be set to UTF-8. For example: <?xml
version='1.0' encoding='utf-8'?>.
Once you have a ContextParameters object with a context, you can use the RLPEnvironment
getContext method to instantiate the context.
The following fragment gets the context from a String within the application:
The sample Java application that appears later in this chapter gets the context from a file. See item 3. in
Sample Java Application [36] .
Setting Context Properties

The property-setting methods take two String parameters: property name and property value. For information about context properties, see RLP Processors [113].
For detailed information about the C++ API, see ProcessFile, and ProcessBuffer in
BT_RLP_Context [api-reference/cpp-reference/classBT__RLP__Context.html].
For the details of the Java API, see the process methods in RLPContext [api-reference/java-reference/
com/basistech/rlp/RLPContext.html].
In Java, use one of the RLPContext.process methods that takes a String pathname parameter.
For an example, see item 4. in the Sample Java Application [36] .
These methods post raw input data to internal storage. Unless you are using iFilter to process the file,
the context must include RLI/RCLU or Unicode Converter to generate the UTF-16 raw text required
by other language processors.
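In C++, the corresponding call mirrors the sample application later in this chapter (langID is the BT_LanguageID of the input text):

BT_Result rc = context->ProcessFile(inputFile, langID);
if (rc != BT_OK) {
    // The file could not be processed; check the log output for details.
}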
In Java, use one of the RLPContext process methods that includes a byte[], ByteBuffer, or
char[] parameter for the input data.
These methods post raw input data to RLP internal storage. The context must include RLI and RCLU, or the Unicode Converter, to generate the UTF-16 raw text required by other language processors.
In Java, use one of the RLPContext.process methods that takes a String data parameter.
The methods described in this section generate the UTF-16 raw text required for processing. Your context
does not need RCLU or Unicode Converter; if it includes either of those processors, they are ignored.
Introduction to Our Sample Applications

The code comments indicate how each application implements the steps listed above. The fundamental changes you must make are the following:
For runtime flexibility, you will probably use input parameters to define most of the following: Basis root directory, language[4], encoding, context configuration, input text.
• Set up the RLP log callback to provide the information you need to understand any problems that may
arise as you are running RLP.
You can embed your context configuration in the application. For flexibility and wider use, you may
want to maintain a separate context configuration file.
[3] For a sample C application, see Using the RLP C API [49].
[4] Use the ISO639 language code [12] to designate the language.
This is the heart of your application. The samples illustrate the basic procedures for accessing the
different kinds of result objects that processing text input can generate.
For information about building and running these samples and the other samples included with the RLP
distribution, see Building and Running the Sample C++ Applications [43] and Building and Running
the Sample Java Applications [45] .
Sample C++ Application

// C++ includes
#include <iostream>
#include <fstream>
#include <cstring>
#include <vector>
//prototypes
static void handleResults(BT_RLP_Context* context, ofstream& out);
static void log_callback(void* info_p, int channel, const char* message);
// The context configuration, embedded below as a string: base linguistics plus named entities.
static const char* CONTEXT =
"<?xml version='1.0'?>"
"<contextconfig>"
"<languageprocessors>"
"<languageprocessor>Unicode Converter</languageprocessor>"
"<languageprocessor>BL1</languageprocessor>"
"<languageprocessor>JLA</languageprocessor>"
"<languageprocessor>CLA</languageprocessor>"
"<languageprocessor>KLA</languageprocessor>"
"<languageprocessor>Tokenizer</languageprocessor>"
"<languageprocessor>SentenceBoundaryDetector</languageprocessor>"
"<languageprocessor>ARBL</languageprocessor>"
"<languageprocessor>FABL</languageprocessor>"
"<languageprocessor>URBL</languageprocessor>"
"<languageprocessor>Stopwords</languageprocessor>"
"<languageprocessor>BaseNounPhrase</languageprocessor>"
"<languageprocessor>NamedEntityExtractor</languageprocessor>"
"<languageprocessor>Gazetteer</languageprocessor>"
"<languageprocessor>RegExpLP</languageprocessor>"
"<languageprocessor>NERedactLP</languageprocessor>"
"</languageprocessors>"
"</contextconfig>";
/**
* 1. Process input parameters.
* 2. Set up RLP environment.
* 3. Set up RLP context for processing input.
* 4. Process input.
* 5. Work with the results.
* 6. Clean up.
*/
int main(int argc, const char* argv[])
{
if (!BT_RLP_Library::VersionIsCompatible()) {
fprintf(stderr, "RLP library mismatch: have %s expect %s\n",
BT_RLP_Library::VersionString(),
BT_RLP_LIBRARY_VERSION_STRING);
return 1;
}
//3. Get a context from the environment. In this case the context
// configuration is embedded in the app as a string. It could also
// be read in from a file.
BT_RLP_Context* context;
rc = rlp->GetContextFromBuffer((const unsigned char*) CONTEXT,
strlen(CONTEXT),
&context);
if (rc != BT_OK) {
cerr << "Unable to create the context." << endl;
delete rlp;
return 1;
}
//4. Use the context object to process the input file. Must include
// language id unless using RLI processor to determine language.
rc = context->ProcessFile(inputFile, langID);
if (rc != BT_OK) {
cerr << "Unable to process the input file '"
<< inputFile << "'." << endl;
rlp->DestroyContext(context);
delete rlp;
return 1;
}
//5. Gather results of interest produced by processing the input text.
handleResults(context, out);
fprintf(stdout, "\nSee output file: %s\n\n", outputFile);
factory->Destroy();
vector<bt_xwstring> tokens;
while (token_iter->Next()) {
//Get the data you want for each token.
//Get the token (BT_RLP_TOKEN).
const BT_Char16* token = token_iter->GetToken();
tokens.push_back(bt_xwstring(token));
//5.3 Use result iterator to get other results, such as base noun phrases,
// and gazetteer names. Note: Can use result iterator to get any/all
// results.
//5.4 Use named entity iterator to get information about named entities.
//Cleanup
context->DestroyResultIterator(result_iter);
token_iter->Destroy();
}
catch (BT_RLP_InvalidResultRequest& e) {
cerr << "Exception: " << e.what() << endl;
}
catch (...) {
cerr << "Unhandled exception." << endl;
}
}
/**
* The application registers this function to receive diagnostic log entries.
 * RLP Environment LogLevel determines which message channels (error, warning,
 * info) are posted to the callback.
*/
static void log_callback(void* info_p, int channel, const char* message)
{
static const char* szINFO = "INFO : ";
static const char* szERROR = "ERROR : ";
static const char* szWARN = "WARN : ";
static const char* szUNKNOWN = "UNKWN : ";
const char* szLevel = szUNKNOWN;
switch(channel) {
case 0:
szLevel = szWARN;
break;
case 1:
szLevel = szERROR;
break;
case 2:
szLevel = szINFO;
break;
}
fprintf((FILE*) info_p, "%s%s\n", szLevel, message);
}
/*
Local Variables:
mode: c
tab-width: 2
c-basic-offset: 2
End:
*/
Sample Java Application

# RLPSample.properties
# input values for RLPSample
env = $BT_ROOT/rlp/etc/rlp-global.xml
context = $BT_ROOT/rlp/samples/etc/rlp-context-no-op.xml
input = $BT_ROOT/rlp/samples/data/de-text.txt
mime_charset = UTF-8
# output destination
out = RLPSample-out.txt
import com.basistech.rlp.RLPConstants;
import com.basistech.util.Pathnames;
import com.basistech.util.LanguageCode;
import com.basistech.rlp.EnvironmentParameters;
import com.basistech.rlp.RLPEnvironment;
import com.basistech.rlp.ContextParameters;
import com.basistech.rlp.RLPContext;
import com.basistech.rlp.RLPResultAccess;
import com.basistech.rlp.RLPResultRandomAccess;
import com.basistech.rlp.NamedEntityData;
import java.text.MessageFormat;
import java.text.ChoiceFormat;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.List;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.Enumeration;
/**
* Performs the steps listed above.
* @param btRoot the BT_ROOT directory (the install directory)
* @param rlpProps the path to the properties file
*/
void run(String btRoot, String rlpProps) {
try {
//1. Load input parameters.
// App gets the Basis root directory and a
// properties file from the command line.
// Other parameters are loaded from the
// properties file:
// * env - environment configuration file
// * context - context configuration file
// * input - text input file
// * lang - ISO639 language code
// * mime_charset - charset encoding of input
// * log_level - log level
//4. Process the input file, including the charset of the file and
// language id.
String input = props.getProperty("input");
String mime_charset = props.getProperty("mime_charset");
rlpContext.process(input, mime_charset, language);
/**
* Loads properties from RLPSample.properties: environment XML
* file, context XML file, input file, language, input file charset,
* log callback, log file, output file.
*/
Properties loadProperties(String btRoot, String rlpProps)
throws FileNotFoundException, IOException {
/**
* Instructs RLP to send message to designated output (standard out or a
* file). Can log warnings, errors, info messages. RLPEnvironment
* LogLevel determines which messages are posted to the callback.
*/
class LogCallback implements RLPEnvironment.LogCallback {
public void message(int channel, String message) {
MessageFormat form = new MessageFormat("{0} {1}");
ChoiceFormat numFormat =
new ChoiceFormat("0#WARN:|1#ERROR:|2#INFO:");
form.setFormatByArgumentIndex(0, numFormat);
System.err.print(form.format(new Object[]
{new Integer(channel), message}));
}
}
/**
* Assembles some of the result data and reports to the user:
* detected language
* detected encoding
* tokens
* noun phrases
* named entities
* stem and part of speech for each token
* compounds (if input is German or Dutch)
* sentence boundaries
*
* @param context RLP context responsible for processing the input
* @param outFile report file name
*/
void handleResults(RLPContext rlpContext, String outFile) {
try {
// Target for result data.
final PrintStream out;
// Write file with UTF-8 encoding.
FileOutputStream fos = new FileOutputStream(outFile);
out = new PrintStream(fos, false, "UTF-8");
// Inform the user.
System.out.println("See results in " + outFile);
Building and Running the Applications

Building and Running the Sample C++ Applications
• rlp_sample
Illustrates the basic pattern for processing text. See Sample C++ Application [30] .
• rlbl_sample
• examine_license
Illustrates the API for gathering information about the scope of your RLP license.
The source files for these samples are in BT_ROOT/rlp/samples/cplusplus. This directory also contains
a makefile and Microsoft Visual C++ solution and project files.
where target is one or more of the targets described in the following table.
target Description
all (Default) Build all samples. The executables are placed in BT_ROOT/rlp/bin/BT_BUILD.
clean Remove build files.
check Run the executables with sample command-line arguments.
2. With the configuration set to Release (not Debug), select Build → Build Solution to build the
applications.
Alternative: You can also use Visual Studio to build the applications from the command line:
or
On Unix
On Linux, you must set LD_LIBRARY_PATH (or its equivalent environment variable for your
Unix operating system) to include the RLP library directory: BT_ROOT/rlp/lib/BT_BUILD .
Building and Running the Sample Java Applications

To run your application, you must place the application, btrlp.jar, and btutil.jar on the classpath. On Linux, you must set LD_LIBRARY_PATH (or its equivalent environment variable for your Unix operating system) to include the RLP library directory: BT_ROOT/rlp/lib/BT_BUILD.
btutil.jar uses com.basistech.util.build.properties to set java.library.path. This path includes BT_BUILD, so btutil.jar can only be used on the platform on which it was originally installed. If you want to port the JAR to a different platform, you must replace or override com.basistech.util.build.properties.
• RLPSample
Illustrates the basic pattern for processing text. See Sample Java Application [36] .
• MultiLangRLP
• ExamineLicense
Illustrates the API for gathering information about the scope of your RLP license.
The source files for these samples are in BT_ROOT/rlp/samples/java. This directory also includes an Ant script that you can use to build and to run these samples. The script requires Ant (1.6.5 or later) with the JAVA_HOME environment variable set to the root of your Java SDK (1.5 or later). For more information, see Ant [http://ant.apache.org/].
where target is one of the Ant build targets as described in the following table.
target Description
[NONE] (Default) Build all samples. Samples are compiled and placed in
BT_ROOT/rlp/samples/java/build/BT_BUILD/rlpsamples.jar. For all
builds, the class files are purged after the jar is created.
clean Remove build files.
build.RLPSample Build RLPSample and put the class files and properties file in
rlpsamples.jar.
build.MultiLangRLP Build MultiLangRLP and put the class files and properties file in
rlpsamples.jar.
build.ExamineLicense Build ExamineLicense and put the class file in rlpsamples.jar.
RLPSample Run RLPSample: include BT_ROOT and the path to a properties file with
other arguments (RLPSample.properties) as command-line
arguments.
MultiLangRLP Run MultiLangRLP: include BT_ROOT and the path to a properties file
with other arguments (MultiLangRLP.properties) as command-
line arguments.
ExamineLicense Run ExamineLicense: include BT_ROOT and the path to the RLP
environment configuration file as command-line arguments.
As you create your own applications, you can use this Ant script as the starting point for establishing your
own build procedures.
Chapter 4. Using the RLP C API
4.1. Introduction
In response to customer requests, RLP now includes a C API. For reference documentation, see API
Reference [api-reference/index.html].
To review the basic structure of an RLP application, consult Creating an RLP Application [19] , which
also includes C++ and Java samples. The C samples that follow are designed to help you incorporate RLP
functionality in your C applications.
Sample C Application

// C includes
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
/**
* 1. Process input parameters.
* 2. Set up RLP environment.
* 3. Set up RLP context for processing input.
* 4. Process input.
* 5. Work with the results.
* 6. Clean up.
*/
int main(int argc, char* argv[])
{
BT_Result rc;
BT_RLP_EnvironmentC *envp;
BT_RLP_ContextC *contextp;
const char* btRoot;
BT_LanguageID langID;
const char* envConfig;
const char* inputFile;
const char* outputFile;
FILE* out;
if (!BT_RLP_CLibrary_VersionIsCompatible(BT_RLP_CLIBRARY_INTERFACE_VERSION)){
fprintf(stderr, "RLP library mismatch: have C lib %ld expect %ld\nOr incompatibility "
"between the core (%s) and C binding libraries.\n",
BT_RLP_CLibrary_VersionNumber(),
BT_RLP_CLIBRARY_INTERFACE_VERSION,
BT_RLP_Library_VersionString());
return 1;
}
btRoot = argv[1];
// Get BT language ID (defined in bt_language_names.h)
// from ISO639 code.
langID = BT_LanguageIDFromISO639(argv[2]) ;
if (langID == BT_LANGUAGE_UNKNOWN) {
fprintf(stderr, "Warning: Unknown ISO639 language code: %s\n", argv[2]);
return 1;
}
envConfig = argv[3];
inputFile = argv[4];
outputFile = argv[5];
if (rc != BT_OK){
fprintf(stderr, "Env initialize failed. %d returned.\n", rc);
return 1;
}
//3. Get a context from the environment. In this case the context
// configuration is embedded in the app as a string. It could also
// be read in from a file.
rc = BT_RLP_Environment_GetContextFromBuffer(envp,
(const unsigned char*)CONTEXT,
strlen(CONTEXT),
&contextp);
if (rc != BT_OK){
fprintf(stderr, "GetContextFromBuffer failed. %d returned.\n", rc);
BT_RLP_Environment_Destroy(envp);
return 1;
}
//4. Use the context object to process the input file. Must include
// language id unless using RLI processor to determine language.
rc = BT_RLP_Context_ProcessFile(contextp,
inputFile, langID, "UTF-8", 0);
if (rc != BT_OK){
fprintf(stderr, "Unable to process the input file %s. %d returned.\n", inputFile, rc);
BT_RLP_Environment_DestroyContext(envp, contextp);
BT_RLP_Environment_Destroy(envp);
return 1;
}
rawText =
BT_RLP_Context_GetUTF16StringResult(contextp, BT_RLP_RAW_TEXT, &len);
factoryp = BT_RLP_TokenIteratorFactory_Create();
if (factoryp == 0){
fprintf(stderr, "TokenIteratorFactory_Create failed. ");
exit(1);
}
BT_RLP_TokenIteratorFactory_Destroy(factoryp);
nn = BT_RLP_TokenIterator_Size(tkitp);
tokens = (const BT_Char16**) malloc(sizeof(BT_Char16*) * nn); // Must free()
fprintf(out, "\n");
while(BT_RLP_TokenIterator_Next(tkitp)){
const char *pos;
const BT_Char16* token;
BT_UInt32 ix, start, end;
token = BT_RLP_TokenIterator_GetToken(tkitp);
tokens[tkix++] = token;
pos = BT_RLP_TokenIterator_GetPartOfSpeech(tkitp);
ix = BT_RLP_TokenIterator_GetIndex(tkitp);
start = BT_RLP_TokenIterator_GetStartOffset(tkitp);
end = BT_RLP_TokenIterator_GetEndOffset(tkitp);
fprintf(out, "#%d: s:%d, e:%d, pos:%s, text:", ix, start, end, pos);
putus(token, out);
fprintf(out, "\n");
//5.3 Use result iterator to get other results, such as base noun phrases,
// gazetteer names, and named entities. Note: Can use result iterator
// to get any/all results.
BT_RLP_Context_DestroyResultIterator(contextp, resitp);
//Cleanup
free((void*)tokens);
BT_RLP_Context_DestroyResultIterator(contextp, resitp);
BT_RLP_TokenIterator_Destroy(tkitp);
fprintf(out, "End of sample program.\n");
fclose(out);
}
/**
* The application registers this function to receive diagnostic log entries.
 * RLP Environment LogLevel determines which message channels (error, warning,
 * info) are posted to the callback.
*/
static void log_callback(void* info_p, int channel, char const* message)
{
static char* szINFO = "INFO : ";
static char* szERROR = "ERROR : ";
static char* szWARN = "WARN : ";
static char* szUNKNOWN = "UNKWN : ";
char* szLevel = szUNKNOWN;
switch(channel) {
case 0:
szLevel = szWARN;
break;
case 1:
szLevel = szERROR;
break;
case 2:
szLevel = szINFO;
break;
}
fprintf((FILE*) info_p, "%s%s\n", szLevel, message);
}
/*
* Output a UTF-16 string to a file in UTF-8.
*/
static void putus(const BT_Char16* t, FILE* out)
{
#define BYTEBUF_MAX_SIZE 1024
static char bytebuf[BYTEBUF_MAX_SIZE];
bt_xutf16toutf8(bytebuf, t, BYTEBUF_MAX_SIZE);
fputs(bytebuf, out);
}
/*
Local Variables:
mode: c
tab-width: 2
c-basic-offset: 2
End:
*/
Sample C Application for Handling Arabic Alternative Analyses

This application is similar to the sample documented above. The important distinction is that it is designed to handle Arabic text and to extract alternative analyses (lemmas, roots, stems, normalized tokens, and parts of speech). The following fragments highlight the distinctive features of this sample.
//Use a token iterator (*tkitp) to get each token and to get alternative analyses.
while(BT_RLP_TokenIterator_Next(tkitp)){
const BT_Char16* token;
BT_UInt32 ix, start, end;
token = BT_RLP_TokenIterator_GetToken(tkitp);
assert(token!=0);
tokens[tkix++] = token;
ix = BT_RLP_TokenIterator_GetIndex(tkitp);
start = BT_RLP_TokenIterator_GetStartOffset(tkitp);
end = BT_RLP_TokenIterator_GetEndOffset(tkitp);
while(BT_RLP_TokenIterator_NextAnalysis(tkitp)) {
++count;
altNorm = BT_RLP_TokenIterator_GetNormalForm(tkitp);
printf("\tNormal\t%d :", count);
putus(altNorm);
printf("\n");
altPOS = BT_RLP_TokenIterator_GetPartOfSpeech(tkitp);
printf("\tPOS\t%d :%s\n", count, altPOS);
altStem = BT_RLP_TokenIterator_GetStemForm(tkitp);
printf("\tStem\t%d :", count);
putus(altStem);
printf("\n");
altLemma = BT_RLP_TokenIterator_GetLemmaForm(tkitp);
//com.basistech.arbl.lemmas could be "false".
if (altLemma!=0){
printf("\tLemma\t%d :", count);
putus(altLemma);
printf("\n");
}
altRoot = BT_RLP_TokenIterator_GetRootForm(tkitp);
//com.basistech.arbl.roots could be "false".
if (altRoot!=0){
printf("\tRoot\t%d :", count);
putus(altRoot);
printf("\n");
}
printf("\n");
}
printf("\n");
}
}
Sample C Application for Examining the RLP License

This application generates the list of features that your RLP license authorizes.

Building and Running the Sample C Applications

The source files for these samples are in the C samples directory:

• rlp_sample_c.c
• ar-rlp_sample_alternatives_c.c
• examine_license_c.c
This directory also contains a makefile and a Microsoft Visual C solution file with associated project files.
where target is one or more of the targets described in the following table.
target Description
all (Default) Build the samples. The executables are placed in BT_ROOT/rlp/bin/BT_BUILD.
clean Remove build files.
check Run the executables with sample command-line arguments.
2. With the configuration set to Release (not Debug), select Build → Build Solution to build the
applications.
Alternative: You can also use Visual Studio to build the applications from the command line:
or
Chapter 5. Using the .NET API
5.1. Introduction
RLP now includes a .NET wrapper that provides a .NET API for customers on the ia32-w32-msvc80 platform: Windows 32-bit with the Visual Studio 8.0 compiler. The .NET wrapper uses the .NET 2.0 Framework to provide access to core RLP functionality. For reference documentation, see API Reference [api-reference/index.html].
.NET Wrapper Source. The source code for the .NET wrapper is in BT_ROOT/rlp/samples/
dotnet_wrapper/managed. This source is used to build BasisTechnology.RLP.dll, which is shipped with
the release in BT_ROOT/rlp/bin/ia32-w32-msvc80. If you want to modify and rebuild the wrapper, you
can use the Visual Studio solution and project files (managed.sln and managed.vcproj) in the source code directory to regenerate BasisTechnology.RLP.dll.
To review the basic structure of an RLP application, consult Creating an RLP Application [19] , which
also includes C++ and Java samples. The C# sample that follows is designed to help you incorporate RLP
functionality in your .NET applications.
Sample C# Application

Input parameters for BT_ROOT (the Basis root directory), language ID, and input file are passed in from the command line. The sample uses the standard environment configuration file and one of the sample context files used for processing Unicode input.
using System;
using System.IO;
using System.Text;
using System.Collections;
using BasisTechnology.RLP;
namespace rlp_sample_csharp
{
/// <summary>
/// Basis Technology RLP C# sample program.
/// </summary>
class RLPSample
{
/// <summary>
/// The main entry point for the Basis Technology RLP C# sample program.
/// </summary>
[STAThread]
static void Main(string[] args)
{
Console.WriteLine("");
Console.WriteLine("Basis Technology RLP C# sample program.");
Console.WriteLine("");
if (args.Length != 3)
{
    // Wrong argument count: report usage and exit.
    // (The original message text is elided here.)
    Console.WriteLine("Usage: rlp_sample_csharp <BT_ROOT> <language_code> <input_file>");
    return;
}
//
// Extract command line options
//
string rlp_root = args[0];
string language_code = args[1];
string inputFile = args[2];
//
// Define the RLP configuration files we will use.
//
string env_def = rlp_root + "/etc/rlp-global.xml";
string context_def = rlp_root + "/samples/etc/rlp-context-no-op.xml";
//
// Echo our runtime parameters
//
Console.WriteLine(" RLP root dir: " + rlp_root);
Console.WriteLine(" Environment definition file: " + env_def);
Console.WriteLine(" Context definition file: " + context_def);
Console.WriteLine(" Language: " + language_code);
Console.WriteLine(" Input file: " + inputFile);
Console.WriteLine("");
// The remainder of the sample, which sets up the RLP environment and
// context and processes the input file, is not reproduced here.

Building and Running the Sample C# Application

Building the Sample C# Application
Important
The solution file is set up for 32-bit Windows and Visual Studio compiler 8.0.

1. Open the solution file in Visual Studio.
2. With the configuration set to Release (not Debug), select Build → Build Solution to build the application.

Alternative: You can also use Visual Studio to build the application from the command line.

Running the Sample C# Application
Use the command-line prompt to navigate to the directory that contains the script.
Like the sample go scripts [8], this script takes one argument: the language ID (ISO639 language code [12]). Like the go scripts, it uses the language ID to pick a sample input file.
The C# sample displays the parameters it is using, parses the input file, and outputs information about the
input to standard output. The output includes each token. If the input contains compounds (such as German,
Dutch, and Japanese text), the output also identifies the individual components that make up each
compound.
The output is encoded as UTF-8, so you may want to direct it to a file, which you can then read with an
application such as Notepad that can correctly display UTF-8.
For example, to run the sample with the sample German file and view the output:
go-samples-csharp.bat de >de-out.txt
Notepad de-out.txt
Chapter 6. Processing Multilingual Text
Text files containing material written in more than one language are not uncommon. This chapter provides
information about using RLP to process multilingual text.
Keep in mind that the procedure documented in Creating an RLP Application [19] is based on the assumption that you are using RLP to process input text in a single language. If you include the required language processors (and you have the necessary RLP license), you can use the same context to process files in a variety of languages, but each file must contain text in a single language. Either you identify the language when you use the context to process the input, or you include the Language Identifier (RLI) in the context, in which case RLI identifies the language for the language processors that follow.

To process multilingual text:

1. Use an RLP context with the Rosette Language Boundary Locator [65] (RLBL) to identify language regions within the input text. Each language region contains contiguous text in a single language.
2. Use a second, single-language context to process each language region, as described below.
6.2. RLBL
Formerly known as the Multilingual Language Identifier (MLI), the Rosette Language Boundary Locator (RLBL) is a collection of three language processors that you can use to detect boundaries between the language regions within multilingual input. You include these processors in an RLP context in the order in which they appear below.1

1. Text Boundary Detector [172] detects text boundaries (such as sentence boundaries) within the input.
2. Script Boundary Detector [170] detects boundaries between language scripts, such as Latin, Cyrillic, Arabic, and Hangul (Korean).
3. Language Boundary Detector [139] uses text boundaries, script boundaries, and RLI functionality to identify language regions.
A standard RLBL context configuration is as follows. Note the shortened processor names
("Detector" has been removed): Text Boundary, Script Boundary, and Language
Boundary. For information about properties you can set, see Language Boundary Detector Context
Properties [140] .
<?xml version='1.0'?>
<contextconfig>
1 For information about the Unicode standards regarding text boundaries and script names, see Unicode Standard Annex #29 [http://www.unicode.org/reports/tr29] and Unicode Standard Annex #24 [http://www.unicode.org/reports/tr24].
<properties>
<property name='com.basistech.lbd.min_region' value='50'/>
<property name='com.basistech.lbd.profile_depth' value='262422'/>
</properties>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>Text Boundary</languageprocessor>
<languageprocessor>Script Boundary</languageprocessor>
<languageprocessor>Language Boundary</languageprocessor>
</languageprocessors>
</contextconfig>
Code examples
1. Instantiate two context objects: an RLBL context for processing the input text and a standard context for processing each language region.
2. Use the RLBL context to process the input text.
3. Get the following result data from the RLBL context: raw text and language regions.
For each language region, the result data (LANGUAGE_REGION [79] ) is 6 integers. Of primary
interest are the raw text offsets that delimit each region and the language identifier for the region.
Note
You can also collect result data for script regions and sentence-level text boundaries. If you
want sentence boundaries, you can get more accurate data for each language region during the
next step.
4. Process each region with the second context, and use that context to get results of interest.
For each region, the input is a portion of the UTF-16 raw text with the appropriate language identifier.
Both fragments use two RLP context objects: rlblContext and context. The configuration for these
contexts is along the lines illustrated above. See RLBL Context [65] and Single-Language context
[66] .
For details about setting up the RLP environment, instantiating a context, processing an input file, and
handling the result data for each language region, see Creating an RLP Application [19] .
C++ Fragment
BT_UInt32 numbers[6];
rlblResult->AsUnsignedIntegerVector(numbers, 6);
BT_UInt32 start, end, level, type, script;
BT_LanguageID language;
start = numbers[0];
end = numbers[1];
level = numbers[2];
type = numbers[3];
script = numbers[4];
language = BT_LanguageID(numbers[5]);
For examples of how to extract result data for each region, see handleResults() in rlp_sample [30].
The complete application from which the preceding fragment has been extracted is in samples/
cplusplus: rlbl_sample.cpp. See Building and Running the Sample C++ Applications [43] .
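The fragment above yields the offsets and language identifier that drive step 4. The following is a minimal sketch of that step; utf16Text, the BT_Char16 type, and the ProcessBuffer argument list are illustrative assumptions, not the exact API (see rlbl_sample.cpp and BT_RLP_Context in the API Reference for the actual calls):

// Sketch only. Assumes utf16Text points to the UTF-16 raw text posted by
// the RLBL context, and that start, end, and language were read from a
// LANGUAGE_REGION result as shown above.
const BT_Char16* regionText = utf16Text + start;  // first code unit of the region
BT_UInt32 regionLength = end - start;             // region length in UTF-16 code units

// Process the region with the single-language context, using the language
// that the Language Boundary Detector assigned to this region. The argument
// list here is hypothetical; consult the API Reference for the real signature.
context->ProcessBuffer(regionText, regionLength, language);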
Java Fragment
For examples of how to extract result data for each region, see handleResults() in RLPSample [36].
The complete application from which the preceding fragment has been extracted is in samples/java:
MultiLang.java and MultiLang.properties. See Building and Running the Sample Java Applications
[45] .
Chapter 7. Preparing Your Data for Processing
Most of the RLP language processors process text in Unicode UTF-16LE or UTF-16BE (little-endian or
big-endian byte order, depending on the platform byte order). For accurate linguistic analysis, the text
should be plain text; it should not contain markup or a binary format, such as is found in HTML, XML,
PDF, or Microsoft Office documents. The MIME type (document type) should be text/plain; if the MIME
type is otherwise (such as text/html or application/pdf), plain text should be extracted from the input.
The text you want to process is likely not to be UTF-16 and it may contain markup that degrades the
accuracy of linguistic analysis.
RLP provides language processors for detecting the encoding and MIME type of your input, for extracting
plain text if the MIME type is not text/plain, and for converting the input to the required UTF-16 encoding.
This chapter explains how to define contexts that perform the necessary conversions before performing the operations that require UTF-16 and plain text. See Preparing Marked-Up or Binary Input [73].
RLP provides procedures for handling two basic categories of input text:
Category Description
Any Encoding [71] Plain text in any standard encoding.
UTF-16LE/BE (platform byte order) [72] Plain Text in Unicode UTF-16LE or UTF-16BE encoding
that conforms to the platform byte order.
Plain Text in Any Encoding

In the context configuration, use the Unicode Converter, RLI (Rosette Language Identifier), and RCLU
(Rosette Core Library for Unicode) language processors (in that order) as the first three language
processors. Note: if you are using mime_detector, you should put it first.
If the input is Unicode, Unicode Converter converts it to UTF-16 with the correct byte order for the current
platform.
If the input is not Unicode, RLI [164] detects encoding (it also detects language), and the RCLU
processor [154] converts the text to UTF-16 for use by subsequent language processors.
Here is a general purpose context that begins by identifying the encoding and converting the input to
UTF-16:
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>RLI</languageprocessor>
<languageprocessor>RCLU</languageprocessor>
<!-- Other language processors, such as the following -->
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>CLA</languageprocessor>
<languageprocessor>KLA</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
<languageprocessor>FABL</languageprocessor>
<languageprocessor>URBL</languageprocessor>
<languageprocessor>Stopwords</languageprocessor>
<languageprocessor>BaseNounPhrase</languageprocessor>
<languageprocessor>NamedEntityExtractor</languageprocessor>
<languageprocessor>Gazetteer</languageprocessor>
<languageprocessor>RegExpLP</languageprocessor>
<languageprocessor>NERedactLP</languageprocessor>
</languageprocessors>
</contextconfig>
Notes
If you know the encoding and the language, you can include these values as parameters when you
process the text, in which case RLI is not required in the context.
If you know the input is Unicode, you can omit RLI and RCLU (keep RLI if you need to identify
the language).
If you know the input is not Unicode, you can omit Unicode Converter.
RLP does not use a BOM in its internal UTF-16 encoding. Byte order for the platform is known,
and the BOM is unnecessary. If, however, your input is UTF-16 with a BOM and you do not
pass it through Unicode Converter or RCLU, the BOM remains. If you have UTF-16 data with
the correct byte order for your platform, but it includes a BOM, you probably want to strip the
BOM (the first character) before you process the data.
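For example, the following is a minimal sketch (plain C++, independent of the RLP API) of stripping a leading BOM from UTF-16 data that is already in platform byte order:

#include <cstddef>

// Strip a leading UTF-16 BOM (U+FEFF), if present, from text already in
// platform byte order. Advances *text past the BOM and returns the number
// of code units remaining.
std::size_t StripUtf16Bom(const unsigned short** text, std::size_t count) {
    if (count > 0 && (*text)[0] == 0xFEFF) {
        ++*text;
        --count;
    }
    return count;
}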
Plain Text in UTF-16LE/BE

If the input is already plain text in UTF-16 with the platform byte order, no conversion is needed. In the context configuration, do not include a processor for converting the input to UTF-16:
<contextconfig>
<languageprocessors>
<!--Note: If you want to detect language, start with RLI. -->
<languageprocessor>RLI</languageprocessor>
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>CLA</languageprocessor>
<languageprocessor>KLA</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
<languageprocessor>FABL</languageprocessor>
<languageprocessor>URBL</languageprocessor>
<languageprocessor>Stopwords</languageprocessor>
<languageprocessor>BaseNounPhrase</languageprocessor>
<languageprocessor>NamedEntityExtractor</languageprocessor>
<languageprocessor>Gazetteer</languageprocessor>
<languageprocessor>RegExpLP</languageprocessor>
<languageprocessor>NERedactLP</languageprocessor>
</languageprocessors>
</contextconfig>
Preparing Marked-Up or Binary Input

PDF files and Microsoft Office documents use proprietary binary formats. The text content must be extracted from such a file before RLP can process it.
iFilter extracts plain text from text with markup and from proprietary formats. The result is UTF-16 raw text from which all the extraneous data has been stripped. You can use iFilter to process files of the following MIME types:
• text/plain
• text/html
• text/xml
• text/rtf
• application/pdf
• application/msword
• application/vnd.ms-excel
• application/vnd.ms-powerpoint
• application/ms.access
You must provide iFilter a file pathname and a MIME type (from the table above). You can specify the
MIME type when you call the API method for processing the input. You can also use the mime_detector
[140] language processor to detect MIME type.
The following context is a general-purpose context for handling the MIME types listed above.
<contextconfig>
<languageprocessors>
<languageprocessor>mime_detector</languageprocessor>
<languageprocessor>iFilter</languageprocessor>
<languageprocessor>RLI</languageprocessor>
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>CLA</languageprocessor>
<languageprocessor>KLA</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
<languageprocessor>FABL</languageprocessor>
<languageprocessor>URBL</languageprocessor>
<languageprocessor>Stopwords</languageprocessor>
<languageprocessor>BaseNounPhrase</languageprocessor>
<languageprocessor>NamedEntityExtractor</languageprocessor>
<languageprocessor>Gazetteer</languageprocessor>
<languageprocessor>RegExpLP</languageprocessor>
<languageprocessor>NERedactLP</languageprocessor>
</languageprocessors>
</contextconfig>
HTML Stripper

The following context uses the HTML Stripper to extract plain text from HTML input:
<contextconfig>
<languageprocessors>
<languageprocessor>HTML Stripper</languageprocessor>
<languageprocessor>RLI</languageprocessor>
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>CLA</languageprocessor>
<languageprocessor>KLA</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
<languageprocessor>FABL</languageprocessor>
<languageprocessor>URBL</languageprocessor>
<languageprocessor>Stopwords</languageprocessor>
<languageprocessor>BaseNounPhrase</languageprocessor>
<languageprocessor>NamedEntityExtractor</languageprocessor>
<languageprocessor>Gazetteer</languageprocessor>
<languageprocessor>RegExpLP</languageprocessor>
<languageprocessor>NERedactLP</languageprocessor>
</languageprocessors>
</contextconfig>
The Rosette Core Library for Unicode [154] (RCLU) provides a number of properties that you can set
to normalize text. The Japanese Language Analyzer [134] (JLA) also includes a property that you can
set to normalize Arabic numerals in Japanese text (see com.basistech.jla.normalize.result.token [136] ).
We recommend that you split very large files into smaller files that can be more easily processed.
Chapter 8. Accessing RLP Result Data
While processing a text stream, the RLP language processors generate a number of types of result objects. Processors post these results to internal storage, making them available to subsequent processors and for handling by the RLP application, which is the subject of this chapter.
When you use a context object to process input, RLP begins by posting the raw data, with any parameters
you supply (such as language, encoding, filename, and MIME type) to internal storage. The first job of the
processors is to generate UTF-16 raw text (stripped of extraneous data if necessary, such as HTML tags).
Other language processors scan the raw text and generate other results types: tokens, part-of-speech tags,
sentence boundaries, base noun phrases, named entities, and so on. Each processor in the context has access
to the results that have been posted by earlier processors. The RLP context configuration, including property
settings, and the settings in the processor options files influence the output that the processors generate.
For more information, see RLP Processors [113] and Defining an RLP Context [20] .
The Type column in the following table provides the core type names that you can map to the appropriate C++ or Java constant. For C++ (BT_RLP_TYPE), see bt_rlp_result_types.h. For Java (RLPConstants.TYPE), see com.basistech.rlp.RLPConstants.
The Description column provides the following information about each result type:
Type Description
ALTERNATIVE_LEMMAS Integer and UTF-16 string vector: alternative LEMMA
[79] results. For each token, provides the token index and the
alternative lemmas.
ALTERNATIVE_ROOTS Integer and UTF-16 string vector: alternative ROOTS [80]
results. For each token, provides the token index and the
alternative roots.
LANGUAGE_REGION Six integers (5 currently used) defining a language region: start, end, level, type, script (not used), and language. Of primary interest are the raw text offsets that delimit the region and the language identifier for the region.
NAMED_ENTITY Integer triple: the index of the first token in the entity, the index + 1 of the last token, and the entity type.
SCRIPT_REGION Integer triple defining a script region.
If you have identified the language, you can use the Sentence Boundary Detector [169] to generate SENTENCE_BOUNDARY results (see above).
TOKEN UTF-16 string: An atomic element from the input text, such as
word, number, multiword expression, possessive affix, or
punctuation.
8.2. Handling RLP Results in C++

The C++ API provides several mechanisms for accessing result data:

Token Iterator — Provides access to tokens and several related result types. To pick up multiple result types that the token iterator supports, use a single token iterator instead of a set of result iterators.

Result Iterator — Provides access to all result types, but you need a separate iterator for each type. Use a result iterator to get a single result type; use multiple result iterators to get multiple result types not available with the token iterator.

Named Entity Iterator — Provides access to named entities. Added in release 5.3.0 to simplify access to named entities.

Context Object — Provides access to single-valued result types and avoids the overhead of creating a result iterator to perform a single iteration.
Using a Token Iterator

1. Create a token iterator factory.

BT_RLP_TokenIteratorFactory* factory =
BT_RLP_TokenIteratorFactory::Create();

2. If you want to decompose compound tokens or access readings (and the language processor supports the operation), set the factory object accordingly.
3. Use the factory object to create a token iterator for the context that processed the text.
4. Destroy the factory.

factory->Destroy();

5. Use the token iterator to access each token and obtain the data of interest. When you are finished, destroy the iterator.

while (iter->Next()) {
//Get the data you want for each token.
}
iter->Destroy();
Using a Result Iterator

To get the results of a single type, use a result iterator. To get the results of more than one type, you must use a separate result iterator for each type.
As you iterate through the results of a given type, the result iterator returns a pointer to each result object.
The structure of the data associated with the result object is determined by the result type. See Result data
structures [86] for the data structure associated with each result type and the BT_RLP_Result method
for accessing the data.
1. Use the context object to create a result iterator for the desired result type.
2. Use the result iterator to access each result object, and the result object to access the data.
3. When you are finished, use the context object to destroy the iterator.

context->DestroyResultIterator(iter);
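Putting the steps together, a loop over a single result type might look like the following sketch. The creation and accessor calls (CreateResultIterator, GetResult) are hypothetical placeholders; only DestroyResultIterator appears in this guide, so consult the API Reference for the actual names and signatures.

// Sketch only: CreateResultIterator and GetResult are placeholder names;
// see BT_RLP_Context and BT_RLP_Result in the API Reference.
BT_RLP_ResultIterator* iter = context->CreateResultIterator(BT_RLP_TOKEN);
while (iter->Next()) {
    const BT_RLP_Result* result = iter->GetResult();
    // Interpret the result according to its type; for BT_RLP_TOKEN the
    // data is a UTF-16 string (see the table below).
}
context->DestroyResultIterator(iter);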
The following table maps each result type to a result data structure. As indicated below, the
BT_RLP_Result class provides a method for accessing each data structure.
• BT_RLP_TOKEN
• BT_RLP_TOKEN_SOURCE_NAME
• BT_RLP_STEM
• BT_RLP_LEMMA
• BT_RLP_NORMALIZED_TOKEN
• BT_RLP_ROOTS
returns a UTF-16 string that is not null terminated for each of the following result types. The length
parameter returns the length of the string. A single iteration returns all the data. Alternatively, use the
context object [91] to return these single-valued types:
• BT_RLP_RAW_TEXT
• BT_RLP_TRANSCRIBED_TEXT
returns a null-terminated string of 8-bit chars for the following result types:
• BT_RLP_PART_OF_SPEECH
• BT_RLP_DETECTED_ENCODING
• BT_RLP_MIME_TYPE
8.2.2.5. Integer
The BT_RLP_Result method
BT_UInt32 AsUnsignedInteger()
returns a 32-bit unsigned integer for the following result types:
• BT_RLP_DETECTED_LANGUAGE
• BT_RLP_DETECTED_SCRIPT
• BT_RLP_STOPWORD
• BT_RLP_SENTENCE_BOUNDARY
• BT_RLP_TEXT_BOUNDARIES
• BT_RLP_MAP_OFFSETS
• BT_RLP_TOKEN_SOURCE_ID
returns the index to the token to which the result applies and a vector of UTF-16 strings for the following
types:
• BT_RLP_COMPOUND
• BT_RLP_READING
• BT_RLP_TOKEN_VARIATIONS
• BT_RLP_ALTERNATIVE_LEMMAS
• BT_RLP_ALTERNATIVE_NORM
• BT_RLP_ALTERNATIVE_ROOTS
• BT_RLP_ALTERNATIVE_STEMS
BT_UInt32 BT_RLP_Result_UTF16StringVector::Size()
returns the index to the token to which the result applies and a vector of ASCII strings for the following
type:
• BT_RLP_ALTERNATIVE_PARTS_OF_SPEECH
Returns a pair of 32-bit unsigned integers for the following result types:
• BT_RLP_TOKEN_OFFSET
• BT_RLP_BASE_NOUN_PHRASE
returns three 32-bit unsigned integers for the following result types:
• BT_RLP_NAMED_ENTITY
• BT_RLP_SCRIPT_REGION
returns a vector of unsigned 32-bit integers for the following result type:
• BT_RLP_LANGUAGE_REGION
The vector contains six integers: start, end, level, type, script (not used), and language.

Using the Named Entity Iterator
Named Entity Iterator or Result Iterator? As described in the previous section, you can use the result
iterator [85] to iterate through named entities. The named entity iterator simplifies access to named entities
and provides some additional control over the data you can collect. If normalized tokens [80] are available,
the named entity iterator provides direct access to the normalized tokens in the named entities. You can
also instruct the iterator to strip affixes (prefixes and suffixes) from these tokens. These features are useful
in applications designed to generate query strings.
Note: Currently, RLP generates normalized tokens and supports affix stripping for Arabic only. You can set Arabic Base Linguistics [115] to generate normalized tokens.

1. Create a named entity iterator factory.
2. If you want to strip prefixes and suffixes from the tokens in the named entities, set the factory StripAffixes flag to true. Currently, the stripping of affixes applies only to Arabic.
3. Use the factory object to create the iterator for the context with which you have processed the text.
4. Destroy the factory.

factory->Destroy();

5. Use the named entity iterator to iterate over the named entities in the context and get the data of interest.
while (ne_iter->Next()) {
//Get the data you want for each named entity.
//Get the named entity type (an integer), and its string representation
//("PERSON", "LOCATION", "ORGANIZATION", ...).
BT_UInt32 type = ne_iter->GetType();
const char* type_name = BT_RLP_NET_ID_TO_STRING(type);
//Get the token offsets for the start and end of the named entity.
BT_UInt32 start_token_offset = ne_iter->GetStartOffset();
BT_UInt32 end_token_offset = ne_iter->GetEndOffset();
//and so on ...
}
ne_iter->Destroy();
Getting Results from the Context Object

Single-valued Results
Use the context object to retrieve single-valued result types (such as raw text, detected language, and MIME type) directly; this avoids the overhead of creating a result iterator to perform a single iteration.
8.3. Handling RLP Results in Java
8.3.1. RLPResultAccess

1. Create an RLPResultAccess object for the context that processed the input.
2. Use the appropriate methods to retrieve the result data you want. For lists and maps, use the standard java.util API to access individual results. For example, getMapResult() returns a sorted set of Map entries; for each entry, the key is an Integer token index and the value is an array of Strings.
//If the language contains (and you are using a processor that supports)
//compound words, you can get a Map of compounds. Each Integer key
//is the index of the corresponding token, and the value is an array of
//the Strings that make up the compound.
//tokenList was obtained earlier, e.g. from getListResult(RLPConstants.TOKEN).
Map compoundMap = resultAccess.getMapResult(RLPConstants.COMPOUND);
Iterator iter = compoundMap.keySet().iterator();
while (iter.hasNext()){
    Integer key = (Integer)iter.next();
    //Can use the key to get the associated token.
    String token = (String)tokenList.get(key.intValue());
    String[] value = (String[])compoundMap.get(key);
    //Handle the compound...
}
String mimeCharset =
resultAccess.getStringResult(RLPConstants.DETECTED_ENCODING);
getMapResult() returns a sorted set of map entries where the key for each entry is an Integer index to the associated token and the value is String[] (compound components, readings, or token alternatives). Use this method to retrieve results of the following types:
• RLPConstants.COMPOUND
• RLPConstants.READING
• RLPConstants.TOKEN_VARIATIONS
• RLPConstants.ALTERNATIVE_LEMMAS
• RLPConstants.ALTERNATIVE_NORM
• RLPConstants.ALTERNATIVE_PARTS_OF_SPEECH
• RLPConstants.ALTERNATIVE_ROOTS
• RLPConstants.ALTERNATIVE_STEMS

Single-valued results, such as the following, are retrieved with getStringResult() (shown above for DETECTED_ENCODING):
• RLPConstants.DETECTED_LANGUAGE
• RLPConstants.DETECTED_SCRIPT
• RLPConstants.DETECTED_ENCODING
• RLPConstants.RAW_TEXT
• RLPConstants.TRANSCRIBED_TEXT
• RLPConstants.MIME_TYPE
8.3.2. RLPResultRandomAccess
The result random access object provides a single call that returns the entire result set for the specified
result type. Depending on type, you must cast the return value accordingly.
For most usage patterns, we recommend you use RLPResultAccess. For more information about this
lower-level API, see the Javadoc.
8.3.2.1. getNamedEntityData
You can use the getNamedEntityData() method to collect data about NAMED_ENTITY [80]
results generated by Named Entity Extractor [141] , the Regular Expression processor [150] , the
Gazetteer [130] processor, and the Named Entity Redactor [145] .
getNamedEntityData() or getListResult()? As described in the previous section, you can use getListResult() [94] to obtain the delimiting token indexes and entity type for each named entity. getNamedEntityData() simplifies access to named entities and provides some additional control over the data you can collect. If normalized tokens [80] are available, getNamedEntityData() provides direct access to the normalized tokens in the named entities. You can also instruct it to strip affixes (prefixes and suffixes) from these tokens. These features are useful in applications designed to generate query strings.
Note: Currently, RLP generates normalized tokens and supports affix stripping for Arabic only. You can
set Arabic Base Linguistics [115] to generate normalized tokens.
//If the text is Arabic and you have configured the processor to return normalized
//tokens, you can set stripAffixes to true to remove particle prefixes and suffixes
//from the named entity tokens.
boolean stripAffixes = true;
NamedEntityData[] neData = resultRA.getNamedEntityData(stripAffixes);
//For example, get normalized named entities and the String representation
//of their entity types. (out is a PrintStream or PrintWriter defined
//elsewhere in the sample.)
for (int i = 0; i < neData.length; i++){
    //For all languages, normalized named entities contain a single space between tokens.
    //For Arabic text, the tokens are normalized tokens, if Arabic Base Linguistics is
    //configured to return normalized tokens.
    //To get the named entities as they appear in the source text, use getRawNamedEntity().
    String normalizedNE = neData[i].getNormalizedNamedEntity();
    String typeName = neData[i].toString();
    //Handle the named entity.
    out.println("Normalized Named entity (" + typeName + "): " + normalizedNE);
}
Chapter 9. RLP Runtime Configuration
This chapter describes how to configure RLP to run with other applications or to be redistributed with other
applications.
9.1. Redistribution
When delivering your application to your customers, you may wish to include just a subset of RLP, either
to reduce your distribution size or to improve performance. This section explains how to pare off unwanted
processors, move portions of RLP into your own directory structure, and reconfigure your context(s).
Environment Configuration
Many RLP language processors use dictionaries and other resource files. The configuration file for each
processor specifies resource pathnames. Each pathname begins with <env name="root"/>. At
runtime, RLP replaces this element with the pathname to the BT_ROOT/rlp directory. When you distribute
an application, the location of the resources relative to the Basis root directory should not change. If it does
change, you must edit the processor configuration files accordingly.
Context Configuration

First, determine which processors you need. Then edit the context configuration XML document, as
described in Creating an RLP Application [19] , to include only those processors. Remember to check
each processor's dependencies, as discussed in RLP Processors [113] , and include all processors that
your chosen processors rely upon to function. For example, if you wish to include the Arabic Base
Linguistics language processor, you must also include the Tokenizer language processor.
For each processor that you remove, you can also remove the processor DLL or shared-object library file
from the binary directory ( BT_ROOT/rlp/bin/BT_BUILD ), where BT_ROOT is the Basis root directory
and BT_BUILD is the platform identifier embedded in your SDK package file name (see Supported
Platforms [16] ). Use rlp-global.xml to identify the processor DLL or shared-object library file (see
below).
Individual Processor Configuration
Many of the language processors use an options file to designate the location of dictionaries and data files
that the processor uses. For these processors, rlp-global.xml specifies the pathname of the options file. If
your context does not include a given processor, you do not need any of the resources listed in the options
file for that processor.
For example, rlp-global.xml contains an entry for Arabic Base Linguistics (ARBL). The processor DLL or shared-object library is identified by the path element: bt_lp_arb (the filename of the Windows DLL also contains a version number). The options file is identified by the optionspath element. <env name="root"/> means the RLP root directory (BT_ROOT/rlp). So the ARBL options file is BT_ROOT/rlp/etc/arbl-options.xml. This file designates the dictionary and data files that ARBL uses.
<arblconfig>
<compatpath><env name="root"/>/arla/dicts/compat_table-<env name="endian"/>.bin</compatpath>
<prefixpath><env name="root"/>/arla/dicts/dictPrefixes-<env name="endian"/>.bin</prefixpath>
<rootpath><env name="root"/>/arla/dicts/dictRoots-<env name="endian"/>.bin</rootpath>
<stempath><env name="root"/>/arla/dicts/dictStems-<env name="endian"/>.bin</stempath>
<suffixpath><env name="root"/>/arla/dicts/dictSuffixes-<env name="endian"/>.bin</suffixpath>
<modelpath><env name="root"/>/arla/dicts/ar_pos_model-<env name="endian"/>.bin</modelpath>
<model2path>
<env name="root"/>/arla/dicts/ar_pos_model2-<env name="endian"/>.bin</model2path>
</arblconfig>
Any element that starts with <env name="root"/> designates the pathname of a resource file. If the
pathname includes <env name="endian"/>, substitute LE or BE in the filename, depending on
whether the platform byte order is little-endian or big-endian.
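For example, on a little-endian platform, the compatpath entry above resolves to BT_ROOT/rlp/arla/dicts/compat_table-LE.bin.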
So if your context does not include ARBL (you are not processing Arabic text), you do not need the ARBL
processor or the data files listed above.
For a list of the files associated with each language processor, see Language Processor Resources
[103] .
Minimal Configuration
The following sections show configuration files designed to minimize RLP's memory usage.
Arabic Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>ARBL</languageprocessor>
</languageprocessors>
</contextconfig>
Chinese Minimal Configuration
<contextconfig>
<properties>
<!-- To minimize memory usage-->
<property name="com.basistech.cla.pos" value="no"/>
<property name="com.basistech.cla.readings" value="no"/>
</properties>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>CLA</languageprocessor>
</languageprocessors>
</contextconfig>
European Languages (BL1) Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>BL1</languageprocessor>
</languageprocessors>
</contextconfig>
Japanese Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>JLA</languageprocessor>
</languageprocessors>
</contextconfig>
Korean Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>KLA</languageprocessor>
</languageprocessors>
</contextconfig>
Persian Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>FABL</languageprocessor>
</languageprocessors>
</contextconfig>
Urdu Minimal Configuration
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>Tokenizer</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>URBL</languageprocessor>
</languageprocessors>
</contextconfig>
Language Processor Resources

The remainder of this section provides the following information about each language processor:
Name
Case-sensitive name used to designate the processor in a context configuration file.
Options File
The processor configuration file.
Data Files
The dictionaries and other data files used by the processor. These files are specified in the options file
and may include user-defined files, such as user dictionaries or stopword lists.
Many binary files exist in two forms depending on the byte order of your platform. For (LE|BE) in
a file name, substitute LE if the platform byte order is little endian, BE if the byte order is big endian.
For more information about each processor, see RLP Processors [113] .
ARBL
Name
ARBL
Options File
BT_ROOT/rlp/etc/arbl-options.xml
Data Files
BT_ROOT/rlp/arla/dicts/compat_table-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/dictPrefixes-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/dictRoots-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/dictStems-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/dictSuffixes-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/ar_pos_model-(LE|BE).bin
BT_ROOT/rlp/arla/dicts/ar_pos_model2-(LE|BE).bin
BaseNounPhrase
Name
BaseNounPhrase
Options File
BT_ROOT/rlp/etc/bnp-config.xml
Data Files
BT_ROOT/rlp/rlp/dicts/de_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/de_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/nl_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/nl_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/en_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/en_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/ja_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ja_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/es_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/es_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/pt_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/pt_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/fr_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/fr_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/it_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/it_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/zh_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh_bnp_data.bin
BT_ROOT/rlp/rlp/dicts/ar_bnp_(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ar_bnp_data.bin
BL1
Name
BL1
Options File
BT_ROOT/rlp/etc/bl1-config.xml
Data Files
BT_ROOT/rlp/bl1/dicts/el
BT_ROOT/rlp/bl1/dicts/el/lookup-mor.txt
BT_ROOT/rlp/bl1/dicts/el/tagger.hmm
BT_ROOT/rlp/bl1/dicts/hu/lookup-mor.txt
BT_ROOT/rlp/bl1/dicts/hu/tagger.hmm
Name
CLA
Options File
BT_ROOT/rlp/etc/cla-options.xml
Data Files
BT_ROOT/rlp/cma/dicts/zh_lex_(LE|BE).bin
BT_ROOT/rlp/cma/dicts/zh_reading_(LE|BE).bin
BT_ROOT/rlp/cma/dicts/zh_stop.utf8
Name
FABL
Options File
BT_ROOT/rlp/etc/fabl-options.xml
Data Files
BT_ROOT/rlp/fabl/dicts/compat_table-(LE|BE).bin
BT_ROOT/rlp/fabl/dicts/dictPrefixes-(LE|BE).bin
BT_ROOT/rlp/fabl/dicts/dictStems-(LE|BE).bin
BT_ROOT/rlp/fabl/dicts/dictSuffixes-(LE|BE).bin
Gazetteer
Name
Gazetteer
Options File
BT_ROOT/rlp/etc/gazetteer-options.xml
Data Files
BT_ROOT/rlp/samples/etc/rlpdemo-gazetteer.txt
HTML Stripper
Name
HTML Stripper
Options File
None
Data Files
None
iFilter
Name
iFilter
Options File
None
Data Files
None
JLA
Name
JLA
Options File
BT_ROOT/rlp/etc/jla-options.xml
Data Files
BT_ROOT/rlp/jma/dicts/JP_(LE|BE).bin
BT_ROOT/rlp/jma/dicts/JP_AD_(LE|BE).bin
BT_ROOT/rlp/jma/dicts/JP_(LE|BE)_Reading.bin
BT_ROOT/rlp/jma/dicts/JP_stop.utf8
Options File
BT_ROOT/rlp/etc/jon-norm-options.xml
Data Files
BT_ROOT/rlp/rlp/dicts/jon_(LE|BE).bin
KLA
Name
KLA
Options File
BT_ROOT/rlp/etc/kla-options.xml
Data Files
BT_ROOT/rlp/kma/dicts/
BT_ROOT/rlp/utilities/data/
BT_ROOT/rlp/kma/dicts/kr_stop.utf8
Options File
None
Data Files
None
mime_detector
Name
mime_detector
Options File
None
Data Files
None
NamedEntityExtractor
Name
NamedEntityExtractor
Options File
BT_ROOT/rlp/etc/ne-config.xml
Data Files
BT_ROOT/rlp/rlp/dicts/en-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/en-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/en-gazetteer-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/en_uc-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/en_uc-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/fr-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/fr-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/de-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/de-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/de-gazetteer-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/it-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/it-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/es-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/es-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/nl-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/nl-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ja-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ja-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh_sc-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh_sc-gazetteer-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh_tc-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/zh_tc-gazetteer-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ar-memm-ALL-model-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ar-funcwords-(LE|BE).bin
BT_ROOT/rlp/rlp/dicts/ar-gazetteer-(LE|BE).bin
Options File
None
Data Files
None
Options File
None
Data Files
None
REXML
Name
REXML
Options File
None
Data Files
None
Options File
None
Data Files
None
Other
Windows: bteuclid
Unix: BT_ROOT/rlp/lib/BT_BUILD/libbteuclid
Script Boundary
Name
Script Boundary
Options File
None
Data Files
None
SentenceBoundaryDetector
Name
SentenceBoundaryDetector
Options File
BT_ROOT/rlp/etc/sbd-config.xml
Data Files
BT_ROOT/rlp/rlp/dicts/de-dict.dict
BT_ROOT/rlp/rlp/dicts/en-dict.dict
Stopwords
Name
Stopwords
Options File
BT_ROOT/rlp/etc/stop-options.xml
Data Files
BT_ROOT/rlp/etc/en-stopwords.txt
Options File
None
Data Files
None
Tokenizer
Name
Tokenizer
Options File
None
Data Files
None
Unicode Converter
Name
Unicode Converter
Options File
None
Data Files
None
URBL
Name
URBL
Options File
BT_ROOT/rlp/etc/urbl-options.xml
Data Files
BT_ROOT/rlp/urbl/dicts/compat_table-(LE|BE).bin
BT_ROOT/rlp/urbl/dicts/dictPrefixes-(LE|BE).bin
BT_ROOT/rlp/urbl/dicts/dictStems-(LE|BE).bin
BT_ROOT/rlp/urbl/dicts/dictSuffixes-(LE|BE).bin
Chapter 10. RLP Processors
10.1. Overview
The RLP language processors are documented in this chapter in the format described below.
Note that licenses are required on a per-language, per-feature basis. Before a language processor runs, it
checks for a license for the given language and feature (e.g., English tokenizing). Licensing failures are
recorded in a log if logging is turned on to report the warning channel. See Logging [23] for more details.
RLP also provides an API for determining the scope of features enabled by your license. See Getting
License Information [26] .
Name
Name used to specify the processor in a context configuration file. Names are case-sensitive.
Dependencies
If the processor relies on the result(s) of another processor, then it is listed here by Name.
Note that the transitive closure of dependencies is not shown here; if processor C depends on the output
of B, and B depends on the output of A, then the Dependencies section for C only includes B, not B
and A.
Use processor dependencies to optimize the context configuration file. First, make sure that all desired
language processors are included in the file. Then make sure that all dependencies are included for
each of the processors. To maximize performance, remove everything else.
For example, to create a context that handles Unicode-encoded English and Japanese down to the
sentence boundary detection level, with REXML formatted output, the context configuration file
should look like the following:
<contextconfig>
<languageprocessors>
<languageprocessor>Unicode Converter</languageprocessor>
<languageprocessor>BL1</languageprocessor>
<languageprocessor>JLA</languageprocessor>
<languageprocessor>SentenceBoundaryDetector</languageprocessor>
<languageprocessor>REXML</languageprocessor>
</languageprocessors>
</contextconfig>
For English, the BL1 processor provides tokenization, POS tagging, and sentence boundaries. The
JLA does nothing, because it produces output only for Japanese. The SentenceBoundaryDetector does
nothing, because BL1 has already produced sentence boundaries.
For Japanese, BL1 does nothing because Japanese is not a supported BL1 language. JLA provides
tokenization and POS tagging (and possibly readings and compounds). SentenceBoundaryDetector provides sentence boundaries in this case, because there are no existing sentence boundary results when it is run.
Language Dependent
Indicates whether the processor is language specific and lists the languages it can process. You can
employ either of the following techniques to inform the language-specific processors of the language
of the input text. Processors in the context that do not process the specified language do nothing.
• Specify the language when you use the context object to process the input text. For the C++ API, see ProcessFile and ProcessBuffer in BT_RLP_Context [api-reference/cpp-reference/classBT__RLP__Context.html]. For the Java API, see the process methods in RLPContext [api-reference/java-reference/com/basistech/rlp/RLPContext.html].
• Use the Rosette Language Identifier (RLI) [164] to detect the language.
XML-Configurable Options
If the processor has an XML options file that the user can configure, this section details the format of
the file and describes each option.
Context Properties
Describes context properties that configure the runtime behavior of the processor. See also Global
Context Properties [115] .
Many processors have options that may be configured in the context configuration file. Properties are set via property entries in the context configuration XML file, as determined by the contextconfig DTD. The syntax is important: when a context property is specified in a configuration file, its name must be prefixed by com.basistech.<processor name>. For example, the following setting specifies the REXML output file:
<property name="com.basistech.rexml.output_pathname"
value="rlp-output.xml"/>
Boolean Properties
You must use one of the following case-sensitive values to set a boolean property: "yes", "true", "no", "false". For example, the following setting
<property name="com.basistech.cla.break_at_alphanum_intraword_punct"
value="TRUE"/>
does not work. Depending on the processor, an invalid value is either interpreted as false or is ignored (the default setting is used).
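The same setting with a valid value works as expected:
<property name="com.basistech.cla.break_at_alphanum_intraword_punct"
value="true"/>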
Context properties can also be specified via the API. See Setting Context Properties [28] .
Description
This section describes in detail the functionality provided by the processor.
10.2. Text Being Processed

One language at a time. Most RLP language processors are designed to process text in a single language.
You either specify the language when you use the context object to process the input text, or you use the
Language Identifier [164] to identify the language. The context may include multiple language
processors, but only the processors applicable to the language of the input text are used. For example, if
the input text is Japanese and the context includes the Japanese and Chinese language analyzers (JLA and
CLA), JLA processes the text and CLA is inactive.
Input text in more than one language. Three of the processors are designed to process input text that may
contain multiple languages and multiple writing scripts: Text Boundary Detector [172] , Script Boundary
Detector [170] , and Language Boundary Detector [139] . Collectively, these three language processors
make up the Rosette Language Boundary Locator (RLBL), formerly known as the Multilingual Language
Identifier (MLI). You use these processors in the order listed to determine language regions. Then you can
extract each region from the raw text and submit it to another context with the appropriate language
identifier for detailed linguistic processing.
10.3. Global Context Properties

The mechanism for setting global context properties and processor-specific context properties is the same; see Setting Context Properties [28].
10.4. Arabic Base Linguistics

Name
ARBL

Dependencies
Tokenizer, SentenceBoundaryDetector
Language Dependent
Arabic
XML-Configurable Options
None. The paths to the ARBL dictionaries and related resources are defined in BT_ROOT/rlp/etc/arbl-
options.xml.
Context Properties
For brevity, the com.basistech.arbl prefix has been removed from the property names in the
first column. Hence the full name for roots is com.basistech.arbl.roots.
When processing a query (a collection of one or more search terms) rather than prose (one or more
sentences), set the com.basistech.bl.query global context property [115] to true.
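For example, a context used to process search terms would include:

<property name="com.basistech.bl.query" value="true"/>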
Description
The ARBL language processor performs morphological analysis and part-of-speech (POS) tagging for
texts written in Modern Standard Arabic.
• STEM [81]
• NORMALIZED_TOKEN [80]
• PART_OF_SPEECH [80]
• TOKEN_VARIATIONS [82] — if com.basistech.arbl.variations is true (default is
false).
• ROOTS [80] — if com.basistech.arbl.roots is true (default is false).
• LEMMA [79] — if com.basistech.arbl.lemmas is true (default is false).
• Alternative Analyses — if com.basistech.arbl.alternatives is true (default is false).
Unless com.basistech.bl.query is set to true, the first analysis is the disambiguated analysis, and the others are not ordered. If com.basistech.bl.query is set to true, disambiguation does not take place and the analyses are not ordered.
• ALTERNATIVE_STEMS [78]
• ALTERNATIVE_NORM [77]
• ALTERNATIVE_PARTS_OF_SPEECH [77]
• ALTERNATIVE_ROOTS [78] — com.basistech.arbl.roots must be set to true
• ALTERNATIVE_LEMMAS [77] — com.basistech.arbl.lemmas must be set to true
Normalization
For languages written in Arabic script, normalization is performed in two stages: generic Arabic script
normalization [117] and language-specific normalization.
The following language-specific normalizations are performed on the output of the Arabic script
normalization:
• Zero-width non-joiner (U+200C) and superscript alef (U+0670) are removed.
• Fathatan (U+064B) is removed.
• Kaf ک (U+06A9) is converted to ك (U+0643).
• Heh ہ (U+06C1) or ھ (U+06BE) is converted to ه (U+0647).
Following morphological analysis, the normalizer does the following:
• Alef wasla ٱ (U+0671) is replaced with plain alef ا (U+0627).
• If a word starts with the incorrect form of an alef, the normalizer retrieves the correct form: plain alef ا (U+0627), alef with hamza above أ (U+0623), alef with hamza below إ (U+0625), or alef with madda above آ (U+0622).
Variations
The analyzer can generate a number of variant forms for each Arabic token to account for the
orthographic irregularity seen in contemporary written Arabic. Each variation is added to the output
of the previous variation, starting with the normalized form:
• If a token contains a word-final hamza preceded by yeh or alef maksura, then a variant is created
that replaces these with hamza seated on yeh.
• If a token contains waw followed by hamza on the line, a variant is created that replaces these with
hamza seated on waw.
• Variants are created where word-final heh is replaced by teh marbuta, and word-final alef
maksura is replaced by yeh.
The stem returned in the STEM result is the normalized token with affixes (such as prepositions,
conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.
When the com.basistech.arbl.roots property is true, the consonantal root for the token is
generated, if possible.
Arabic Script Normalization
Important
If processing text in Arabic script that includes characters from Arabic Presentation Forms A (U
+FB50 - U+FDFF) and/or Arabic Presentation Forms B (U+FE70 - U+FEFF), use RCLU
[154] with com.basistech.rclu.FormKCNormalization [155] set to true to
normalize these characters to standard Arabic script characters (U+0600 - U+06FF). Otherwise,
these characters are not recognized as Arabic-script characters and the words containing them are
not recognized as Arabic, Persian, or Urdu.
When you examine the results generated by RLP, keep in mind that some fonts do not accurately display all Arabic ligatures. The Scheherazade font from SIL International does an excellent job of rendering Arabic ligatures.
• The following diacritics are removed: kashida, dammatan, kasratan, fatha, damma, kasra, shadda,
sukun.
• The following characters are removed: left-to-right marker, right-to-left marker, zero-width joiner,
BOM, non-breaking space, soft hyphen, full stop.
• Alef maksura is converted to yeh unless it is at the end of the word or followed by hamza.
• All numbers are converted to Arabic numbers (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)1, thousand separators are removed, and the decimal separator is changed to a period (U+002E). The normalizer handles cases where ر (reh) is (incorrectly) used as the decimal separator.
• Alef with hamza above: ٵ (U+0675), ٲ (U+0672), or ا (U+0627) combined with hamza above (U+0654) is converted to أ (U+0623).
• Alef with madda above: ا (U+0627) combined with madda above (U+0653) is converted to آ (U+0622).
• Alef with hamza below: ٳ (U+0673) or ا (U+0627) combined with hamza below (U+0655) is converted to إ (U+0625).
• Misra to ain: misra (U+060F) is converted to ع (U+0639).
• Swash kaf to kaf: ڪ (U+06AA) is converted to ک (U+06A9).
• Heh: ە (U+06D5) is converted to ه (U+0647).
• Teh marbuta: ۃ (U+06C3) is converted to ة (U+0629).
• Yeh with hamza above: the following combinations are converted to ئ (U+0626):
ی (U+06CC) combined with hamza above (U+0654)
ى (U+0649) combined with hamza above (U+0654)
ي (U+064A) combined with hamza above (U+0654)
1 As distinguished from the Arabic-Indic numerals often used in Arabic script (٠, ١, ٢, ٣, ٤, ٥, ٦, ٧, ٨, ٩) or the Eastern Arabic-Indic numerals often used in Persian and Urdu text.
• Waw with hamza above: و (U+0648) combined with hamza above (U+0654), ٷ (U+0677), or ٶ (U+0676) is converted to ؤ (U+0624).
10.5. Base Linguistics Language Analyzer
Name
BL1
Dependencies
None (see below)
Language Dependent
Yes. This language processor processes many European languages. The supported languages are the
following:
See Appendix B [223] for a listing of the POS tags for these languages.
XML-Configurable Options
BL1 uses a memory limit, language-specific settings, and optional user dictionaries specified in
BT_ROOT/rlp/etc/bl1-config.xml.
For each language, the settings may include a tokenizing FST (finite state transducer), a morphological
lookup script, default part-of speech tags, special tags for internal use only, and an internal lemma
dictionary used during disambiguation.
Memory limit. The bl1config memory-limit attribute defines the limit on the amount of
memory BL1 will load. Each time BL1 is called, it loads resources for the language being processed,
and holds these resources for the remainder of the RLP environment session. If BL1 is called multiple
times with input in different languages, the memory requirements increase. If the defined limit is reached, BL1 reports a warning, clears memory, re-initializes, and continues processing the next language. By default, the limit is 200,000,000 bytes: <bl1config memory-limit="200000000">. You can modify this limit; if you set it to 0, there is no limit. Keep in mind that if the limit is too high, BL1 may be forced to start paging, and performance deteriorates.
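For example, to remove the limit entirely, set the attribute to zero:

<bl1config memory-limit="0">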
Morphological Lookup Script. The only setting you can modify for a language is the
morphological lookup script. Two scripts are provided for each language:
lookup-mor.txt
This script uses full morphological tagging internally. These extra tags are used by
NamedEntityExtractor [141] for more accurate results. This is the default script.
lookup-lem.txt
This script uses only part-of-speech tags. These are shorter than the full morphological tags, so
less memory is used and the lookups are faster (around 10 percent). Do not use this script if
NamedEntityExtractor results are needed.
For example, to change the morphological lookup script for English from full morphological tagging
to part-of-speech tagging, revise the <morpho-script> in the English section:
<bl1-options language="en">
<tokenizer><env name="root"/>/bl1/dicts/en/tokenize.fst</tokenizer>
<morpho-dir><env name="root"/>/bl1/dicts/en</morpho-dir>
<morpho-script>
<env name="root"/>/bl1/dicts/en/lookup-mor.txt
</morpho-script>
<morpho-last-resort-tags>+open +NOUN</morpho-last-resort-tags>
<morpho-replace-tags>^= ^_</morpho-replace-tags>
<pos-disamb><env name="root"/>/bl1/dicts/en/tagger.hmm</pos-disamb>
</bl1-options>
so that it reads
<morpho-script>
<env name="root"/>/bl1/dicts/en/lookup-lem.txt
</morpho-script>
User Dictionaries. You can create one or more user dictionaries for each language that BL1 supports.
To instruct BL1 to employ a user dictionary, you must add a <user-dict> element to the appropriate
language section in bl1-config.xml. The following example adds a German user dictionary.
<bl1-options language="de">
<tokenizer><env name="root"/>/bl1/dicts/de/tokenize.fst</tokenizer>
<morpho-dir><env name="root"/>/bl1/dicts/de</morpho-dir>
<morpho-script><env name="root"/>/bl1/dicts/de/lookup-mor.txt</morpho-script>
<morpho-last-resort-tags>+open +NOUN</morpho-last-resort-tags>
<morpho-compound-tags>^# ^-</morpho-compound-tags>
<morpho-compound-link-tags>^/</morpho-compound-link-tags>
<morpho-special-tags>^} ^{ ^] ^. ^+ ^= [* *]</morpho-special-tags>
<morpho-replace-tags>^_ ^&</morpho-replace-tags>
<pos-disamb><env name="root"/>/bl1/dicts/de/tagger.hmm</pos-disamb>
<lemma-dict>
<env name="root"/>/bl1/dicts/de/lemma-dict-<env name="endian"/>.bin</lemma-dict>
<user-dict>
<env name="root"/>/bl1/dicts/de/userdict-<env name="endian"/>.bin</user-dict>
</bl1-options>
Pathnames. The BL1 configuration file specifies pathnames to various resources, including the
morphological directory where FSTs, data files, and scripts are kept. Each pathname begins with <env
name="root">. At runtime, RLP replaces this element with the pathname to the RLP root directory
( BT_ROOT/rlp). When you distribute an application, the location of the resources relative to RLP
root should not change. See Environment Configuration [99] .
Context Properties
None
Description
The Base Linguistics Language Processor provides tokenization, sentence boundary detection,
stemming, and part-of-speech tagging for the supported languages. By default, it returns the following
results:
• TOKEN [82]
• TOKEN_OFFSET [82]
• SENTENCE_BOUNDARY [81]
• PART_OF_SPEECH [80]
• STEM [81] 2
For languages with compound words (German, Dutch, and Hungarian), the components are separated
and returned in COMPOUND [78] results.
Note
Tokenizer and Sentence Boundary Detector are no longer necessary for BL1, and thus they
do nothing when BL1 precedes them in the context.
User Dictionaries
You may create your own dictionaries for the languages that BL1 supports. See European Language
User Dictionaries [185] .
10.6. Base Noun Phrase Detector

Name
BaseNounPhrase

Dependencies
Tokenized text and part-of-speech tags: BL1, CLA, JLA, KLA; or Tokenizer and ARBL.
Language Dependent
Arabic, Chinese, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, Spanish.
XML-Configurable Options
The BaseNounPhrase options are defined in BT_ROOT/rlp/etc/bnp-config.xml. For example:
<bnpconfig version="2.0">
<config language="de">
<grammarpath><env name="root"/>/rlp/dicts/de_bnp.bin</grammarpath>
2 Depending on the PART_OF_SPEECH BL1 assigns to a TOKEN, the STEM may vary. For example, the English "getting" may return "get" (verb:
<datapath><env name="root"/>/rlp/dicts/de_bnp_data.bin</datapath>
</config>
<config language="en">
<grammarpath><env name="root"/>/rlp/dicts/en_bnp.bin</grammarpath>
<datapath><env name="root"/>/rlp/dicts/en_bnp_data.bin</datapath>
</config>
<config language="ja">
<grammarpath><env name="root"/>/rlp/dicts/ja_bnp.bin</grammarpath>
<datapath><env name="root"/>/rlp/dicts/ja_bnp_data.bin</datapath>
</config>
</bnpconfig>
BaseNounPhrase data is available for ten languages: Arabic, Chinese (Simplified and Traditional), Dutch, English, French, German, Italian, Japanese, Portuguese, and Spanish.
The names of the grammar and data files are all patterned on ln_bnp.bin and ln_bnp_data.bin,
where ln is replaced by the ISO 639-1 two-letter language code.
Context Properties
None
Description
One of the most important kinds of structure to assign to a document is the identification of noun
phrases (NP). A phrase is a self-contained group of words with a discrete meaning; a noun phrase is
a phrase that functions as a noun in a sentence. Examples of noun phrases include (almost) every title
of a book, movie, play and piece of music.
Noun phrases can also be recursive. That is, a noun phrase may contain other noun phrases as
component parts. For instance, the following are all noun phrases:
it
apples
the apple
the green apple
the round red juicy apple
the green apple on the table
the red apple on the table in the kitchen
the red apple that I ate at lunch yesterday
A base noun phrase is a noun phrase that is not recursive, that is, it does not contain other noun phrases
inside it. So, the first five noun phrases in the list above are base noun phrases, and the remaining ones
are complex noun phrases that contain base noun phrases inside them. Below, the list of noun phrases
is repeated with the base noun phrases bracketed:
[it]
[apples]
[the apple]
[the green apple]
[the round red juicy apple]
[the green apple] on [the table]
[the red apple] on [the table] in [the kitchen]
[the red apple] that [I] ate at [lunch] yesterday
Note
A noun phrase that involves an associative relationship between two nouns is treated as a
single base noun phrase. For example, 'king of France' and 'ambitious student of linguistics'
are single, non-recursive noun phrases.
For each supported language, RLP supplies a model of what constitutes a base noun phrase. At each
point in the input, RLP detects the longest possible base noun phrase consistent with the model.
The BASE_NOUN_PHRASE [78] results consist of a pair of integers for each noun-phrase identified:
index of the first token in the phrase and index + 1 of the last token in the phrase.
10.7. Chinese Language Analyzer
Dependencies
None
Language Dependent
Chinese (Simplified and Traditional)
XML-Configurable Options
The options for the Chinese Language Processor are described by the BT_ROOT/rlp/etc/cla-
options.xml file. For example:
<claconfig>
<dictionarypath>
<env name="root"/>/cma/dicts/zh_lex_<env name="endian"/>.bin</dictionarypath>
<readingdictionarypath>
<env name="root"/>/cma/dicts/zh_reading_<env name="endian"/>.bin</readingdictionarypath>
<stopwordspath><env name="root"/>/cma/dicts/zh_stop.utf8</stopwordspath>
</claconfig>
The dictionarypath specifies the path name to the main dictionary used for segmentation. Users
must use the main dictionary that comes with the analyzer. In addition, users can create and employ
user dictionaries [188] . This option must be specified at least once. Users can specify one main
dictionary and zero or more user dictionaries.
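Because dictionarypath may be specified more than once, a user dictionary is added as an additional
dictionarypath element. The following sketch is illustrative only; the user dictionary pathname is a
hypothetical example:
<claconfig>
<dictionarypath>
<env name="root"/>/cma/dicts/zh_lex_<env name="endian"/>.bin</dictionarypath>
<dictionarypath>
<env name="root"/>/cma/dicts/my_user_dict_<env name="endian"/>.bin</dictionarypath>
</claconfig>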
The readingdictionarypath specifies the path to the analyzer's reading dictionary, which is
used to look up readings for segmented tokens.
The stopwordspath specifies the pathname to the stopwords list used by the analyzer. To
customize the stopwords list, see Editing the Stopwords List for Chinese, Korean, or Japanese
[184] .
The lockdictionary value indicates whether or not the pages containing the dictionary are locked
in RAM.
Context Properties
The following table lists the context properties supported by the CLA processor. Note that for brevity
the com.basistech.cla prefix has been removed from the property names in the first column.
Hence the full name for break_at_alphanum_intraword_punct is
com.basistech.cla.break_at_alphanum_intraword_punct.
Description
The Chinese Language Processor segments Chinese text into separate tokens (words and punctuation)
and assigns part-of-speech (POS) tags to each token. For the list of POS tags with examples, see
Chinese POS Tags - Simplified and Traditional [229] . CLA also reports offsets for each token, and
alternative readings, if any, for Hanzi or Hanzi compounds.
CLA returns the following result types:
• TOKEN [82]
• TOKEN_OFFSET [82]
• PART_OF_SPEECH [80] — if com.basistech.cla.pos is true (the default)
• COMPOUND [78] — if com.basistech.cla.decomposecompound is true (the default)
• READING [80] (pinyin transcriptions) — if com.basistech.cla.readings is true (the
default)
• STEM [81] — if com.basistech.cla.normalize_result_token is true (the default
is false)
• STOPWORD [81] — if com.basistech.cla.ignore_stopwords is false (the default)
10.8. Chinese Script Converter
Dependencies
None
Language Dependent
Chinese (Simplified and Traditional)
XML-Configurable Options
The options for the Chinese Script Converter are defined in BT_ROOT/rlp/etc/csc-options.xml.
Modify this file as necessary. A sample configuration follows:
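Consult the shipped file for the exact contents; the following sketch shows the general shape, with the
element names taken from the descriptions below, and the root element name and pathnames assumed
by analogy with the other RLP configuration files:
<cscconfig>
<worddictionarypath>
<env name="root"/>/csc/dicts/zh_word_<env name="endian"/>.bin</worddictionarypath>
<mappingdictionarypath>
<env name="root"/>/csc/conversion/SCTTCmpt_<env name="endian"/>.bin</mappingdictionarypath>
</cscconfig>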
The worddictionarypath specifies the pathname to a dictionary used for segmentation, which is
performed in the orthographic and lexemic modes of conversion as described below.
The mappingdictionarypath specifies the pathname to a dictionary used in orthographic and lexemic
conversion. The dictionary prefixed "SCTTCmpt" converts Simplified Chinese to Traditional Chinese,
while the one prefixed "TCTSCmpt" converts Traditional Chinese to Simplified Chinese. Note that
the conversion dictionary is on a separate directory branch from the word dictionary.
Context Properties
The following table lists the context properties supported by the CSC processor. For brevity, the
com.basistech.csc. prefix has been removed from the property names in the first column. For
example, the full name of conversion_level is
com.basistech.csc.conversion_level.
Description
There are two forms of standard written Chinese: Simplified Chinese and Traditional Chinese.
Simplified Chinese (SC) is used in the People’s Republic of China (PRC). SC normally uses the
GB2312-80 or GBK character set. Traditional Chinese (TC) is used in Taiwan, Hong Kong, and Macau.
TC normally uses the Big Five character set. Conversion from one script to another is a complex matter.
The main problem of SC to TC conversion is that the mapping is one-to-many. For example, the
simplified form 发 maps to either of the traditional forms 發 or 髮. Conversion must also deal with
vocabulary differences and context-dependence.
The Chinese Script Converter converts text in simplified script to text in traditional script, or vice
versa. The conversion can be on any of three levels. The first is codepoint conversion, which uses a
mapping table to convert characters on a codepoint-by-codepoint basis. For example, the simplified
word 头发 ("hair") might be converted to a traditional form by first mapping 头 to 頭, and then 发 to
either 髮 or 發. Using this approach, however, there is no recognition of 头发 as a word; the choice
could be 發, in which case the end result 頭發 would be nonsense. On the other hand, the choice of 髮
would lead to errors for other words. So while codepoint conversion is straightforward, it is unreliable.
The second level of conversion is orthographic. This level relies upon identification of the words in a
text. Within each word, orthographic variants of each character may be reflected in the conversion. In
the above example, 头发 would be identified as a word. It would be converted to the traditional variant
of the word, 頭髮. There would be no basis for converting it to 頭發, because the conversion considers
the word as a whole rather than the individual characters.
The third level of conversion is lexemic. This level also relies upon identification of words. But rather
than converting a word to an orthographic variant, the aim here is to convert it to an entirely different
word. For example, "computer" is usually 计算机 in SC but 電脳 in TC. Whereas codepoint conversion
is strictly character-by-character and orthographic conversion is character-by-character within a word,
lexemic conversion is word-by-word.
The Chinese Script Converter returns the TOKEN [82] result type. In the case of orthographic and
lexemic conversion, the tokens are the converted words in the destination script. In the case of
codepoint conversion, the tokens are the converted characters in the destination script.
10.9. Gazetteer
Name
Gazetteer
Dependencies
Tokenized text: Tokenizer or language analyzer.
Language Dependent
No
XML-Configurable Options
The file BT_ROOT/rlp/etc/gazetteer-options.xml specifies normalization options and one or more
gazetteers. For example:
<gazetteerconfig>
<!-- normalization options (described below) appear here -->
<DictionaryPaths>
<XMLDictionaryPath>
<!-- pathname of an XML gazetteer -->
</XMLDictionaryPath>
<DictionaryPath>
<env name="root"/>/samples/etc/rlpdemo-gazetteer.txt
</DictionaryPath>
<!-- Insert your dictionaries here
<DictionaryPath>Your Path Here</DictionaryPath>
-->
</DictionaryPaths>
</gazetteerconfig>
Normalization Options. These normalization options are applied to gazetteer entries when the
gazetteers are loaded and to input text when scanned for matches.
NormalizeCase If true, normalizes gazetteer entries and input text to lower case.
NormalizeSpace If true, normalizes whitespace in gazetteer entries to a single space.
NormalizeKana If true, converts Hiragana characters to Katakana in gazetteer entries and
the input text.
NormalizeWidth If true, normalizes half-width and full-width characters in the gazetteer
and in the input text to "generic-width" characters.
NormalizeDiacritics If true, strips diacritics and accents from gazetteer entries and input
text.
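The exact markup for these options appears in gazetteer-options.xml; assuming element-style markup
like the rest of the file (a sketch, not necessarily the shipped syntax), enabling case and width
normalization might look like this:
<NormalizeCase>true</NormalizeCase>
<NormalizeWidth>true</NormalizeWidth>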
Context Properties
The following table lists the context properties supported by the Gazetteer processor. Note that for
brevity the com.basistech.gazetteer prefix has been removed from the property names in
the first column. For example, the full name for report_partial_matches is
com.basistech.gazetteer.report_partial_matches.
Description
The Gazetteer processes text that has already passed through another language processor and isolates
specific terms defined by the user in a Gazetteer Source File (GSF), which has the following properties:
• The first non-comment line is the named entity type, which applies to all entries in the gazetteer,
and will be used as the entity type name for output. The syntax for the name is type:subtype, where
subtype is optional. If the type and subtype appear in BT_ROOT/rlp/etc/ne-types.xml, the Named
Entity Redactor [145] will use the weighting assigned in that file to resolve duplicates or overlaps.
If the type does not appear in that file, it gets the default weighting of 10.
For example:
# File: en-gazetteer.txt
#
# This is a user-defined gazetteer file.
# Gazetteer file is a UTF-8 text file.
# Comment line starts with # in the beginning of line.
# The first non-comment line is the entity type,
# which will be used to label named entities found.
# All other lines after the named entity type are gazetteer entries.
#
MESSAGE
message in a bottle
ordinary mail
Sample Gazetteer files are available in BT_ROOT/rlp/samples/etc where BT_ROOT is the Basis root
directory (the directory where RLP is installed).
• Generates an internal lookup structure based on the gazetteer files specified in gazetteer-
options.xml.
• Searches the input raw text for matches of entries in the gazetteer and generates
NAMED_ENTITY [80] results. Each NE token consists of three integers:
• Index of the first token in the entity.
• Index + 1 of the last token in the entity.
• Entity type - which maps to the named entity type string that appears at the beginning of the
gazetteer.
For detailed information about creating Gazetteer files, see Customizing Gazetteer [178] .
10.10. HTML Stripper
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
None
Description
HTML input includes markup tags that degrade the accuracy of linguistic analysis. HTML Stripper
strips HTML tags from the input, detects the encoding, and converts the plain text to the correct UTF-16
for the runtime platform. If the MIME type of the input is not text/html, HTML Stripper does nothing.
If the input is HTML, the HTML Stripper generates the following results: RAW_TEXT [80] and
DETECTED_ENCODING [78] .
10.11. iFilter
Name
iFilter
Dependencies
Windows only. Requires input file pathname and MIME type.
To provide the MIME type, include the mime_detector [140] processor in the context before
iFilter, or include it with the pathname in the API call to process the input.
Language Dependent
No
XML-Configurable Options
None
Context Properties
None
Description
Uses the Microsoft Indexing Service to extract plain text from the input file. The output is UTF-16
RAW_TEXT [80] . RLP recognizes the following MIME types:
• text/plain
• text/html
• text/xml
• text/rtf
• application/pdf
• application/msword
• application/vnd.ms-excel
• application/vnd.ms-powerpoint
• application/ms.access
10.12. Japanese Language Analyzer
Dependencies
None
Language Dependent
Japanese
XML-Configurable Options
Settings for the Japanese Language Analyzer are specified in BT_ROOT/rlp/etc/jla-options.xml. This
file includes pathnames for the main dictionary used for tokenization and POS tagging, the reading
dictionary (with yomigana pronunciation aids expressed in Hiragana), a stopwords list, and may
include one or more user dictionaries.
The user can edit the stopwords list [182] and create user dictionaries [191] .
For example:
<jlaconfig>
<!-- main dictionary and reading dictionary path elements appear here -->
<StopwordsPath><env name="root"/>/jma/dicts/JP_stop.utf8</StopwordsPath>
</jlaconfig>
The <env name="endian"/> in the dictionary name is replaced at runtime with either "BE" or "LE" to
match the platform byte order: big-endian or little-endian. For example, Sun's SPARC and Hewlett
Packard's PA-RISC are big-endian, whereas Intel's x86 CPUs are little-endian.
The StopwordsPath specifies the pathname to the stopwords list used by the analyzer. To
customize the stopwords list, see Editing the Stopwords List for Chinese, Korean, or Japanese
[184] .
Context Properties
The following table lists the context properties supported by the JLA processor. Note that for brevity
the com.basistech.jla prefix has been removed from the property names in the first column.
Hence the full name for decomposecompound is
com.basistech.jla.decomposecompound.
Description
The Japanese Language Analyzer tokenizes Japanese text into separate words and assigns a Part-of-
Speech (POS) tag to each word; see Japanese POS Tags [243] . The Japanese Language Processor
returns the following result types:
• TOKEN [82]
• TOKEN_OFFSET [82]
• PART_OF_SPEECH [80]
• STEM [81] — if com.basistech.jla.normalize_result_token is true (the default
is false)
• COMPOUND [78] — if com.basistech.jla.decomposecompound is true (the default)
10.13. Japanese Orthographic Analyzer
Dependencies
Requires Japanese tokens: the Japanese Language Analyzer (JLA)
Language Dependent
Japanese
XML-Configurable Options
The Japanese Orthographic Analyzer uses an options file, BT_ROOT/rlp/etc/jon-norm-
options.xml, that specifies the path to a binary normalization dictionary. The user cannot modify this
dictionary.
Context Properties
None
Description
The Japanese Orthographic Analyzer performs a lookup in its normalization dictionary and returns
NORMALIZED_TOKEN [80] , a normalized token for each token. The normalization dictionary
contains normalized tokens and token variants. If the token does not appear in the dictionary, the token
and normalized token are identical.
• Normalization of words written in Katakana. Foreign and borrowed words are expressed
phonetically and thus may vary in their transcription to Japanese Katakana.
10.14. Korean Language Analyzer
Dependencies
None
Language Dependent
Korean
XML-Configurable Options
The options for the Korean Language Processor are defined in BT_ROOT/rlp/etc/kla-options.xml.
For example:
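Consult the shipped file for the exact contents; the following sketch shows the general shape, with
element names taken from the descriptions below, and the root element name and pathnames assumed
by analogy with the Chinese and Japanese configuration files:
<klaconfig>
<dictionarypath><env name="root"/>/kma/dicts</dictionarypath>
<utilitiesdatapath><env name="root"/>/utilities/data</utilitiesdatapath>
<stopwordspath><env name="root"/>/kma/dicts/kr_stop.utf8</stopwordspath>
</klaconfig>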
Note that for the Korean Language Processor, the dictionarypath points to the directory that
contains the required dictionaries. This is different from the Japanese and Chinese Language
Processors, which require a path to each dictionary that you are including.
The utilitiesdatapath must specify utilities/data. This contains internal transcription tables.
The stopwordspath specifies the pathname to the stopwords list used by the analyzer. To
customize the stopwords list, see Editing the Stopwords List for Chinese, Korean, or Japanese
[184] .
Context Properties
The following table lists the context property supported by the KLA processor. Note that for brevity
the com.basistech.kla prefix has been removed from the property names in the first column.
Hence the full name for ignore_stopwords is
com.basistech.kla.ignore_stopwords.
Description
The Korean Language Processor segments Korean text into separate words and compounds, reports
the length of each word and the stem, and assigns a Part-of-Speech (POS) tag to each word; see Korean
POS Tags [244] . KLA also returns a list of compound analyses (may be empty).
• TOKEN [82]
• TOKEN_OFFSET [82]
• PART_OF_SPEECH [80]
• COMPOUND [78]
• STEM [81]
• STOPWORD [81] — if com.basistech.kla.ignore_stopwords is false (the default)
10.15. Language Boundary Detector
Dependencies
Text Boundary Detector, Script Boundary Detector
Language Dependent
No
XML-Configurable Options
None
Context Properties
The context properties available to control the runtime behavior of the Language Boundary Detector
are listed below. As with all context properties, they can be specified either in the context or via the
API. For brevity, the com.basistech.lbd prefix has been removed from the property names in
the first column. Hence the full name for max_region is
com.basistech.lbd.max_region.
Description
The Language Boundary Detector determines language regions within input text; that is, all elements
within a language region belong to the same language. Language Boundary Detector returns a result,
LANGUAGE_REGION [79] , consisting of an array of integer sextuplets. Each sextuplet consists
of the beginning character offset of a region, the end offset + 1 of the region, the nesting level of the
region, the type of the region, the script of the region (currently unused), and the language of the region.
The Language Boundary Detector is used in conjunction with the Text Boundary Detector [172]
and Script Boundary Detector [170] to identify language regions within input text that contains
multiple languages. You may want to use the results to submit individual regions (from the raw text)
to an RLP context designed to perform linguistic processing for an individual language and script.
For more information about using the Language Boundary Detector to handle multilingual text, see
Processing Multilingual Text [65] .
10.16. mime_detector
Name
mime_detector
Dependencies
None. The pathname of the input file is optional.
Language Dependent
No
XML-Configurable Options
None
Context Properties
The context properties available to control the runtime behavior of the mime_detector processor are
listed below. For brevity, the com.basistech.mime_detector prefix has been removed from
the property names in the first column. Hence the full name for ignore_pathname is
com.basistech.mime_detector.ignore_pathname.
Description
Detects the MIME_TYPE [79] of the input file. Often used in conjunction with iFilter [133] or
HTML Stripper [132] to extract plain text from files with markup. The mime_detector processor can
use the file extension or analyze the contents to detect the following MIME types:
• text/plain
• text/html
• text/xml
• text/rtf
• application/pdf
• application/msword
• application/vnd.ms-excel
• application/vnd.ms-powerpoint
• application/ms.access
10.17. Named Entity Extractor
Dependencies
Depends on the language; see below.
Language Dependent
Arabic, Simplified and Traditional Chinese, Dutch, English, French, German, Italian, Japanese,
Korean, Persian, Spanish, and Urdu.
XML-Configurable Options
The Named Entity Extractor uses language-specific binary data files to locate named entities when it
is processing input. These files are not user configurable. The data file pathnames for each language
are specified in the named entities configuration file: BT_ROOT/rlp/etc/ne-config.xml. Each
pathname begins with <env name="root"/>. At runtime, RLP replaces this element with the
pathname to the RLP root directory ( BT_ROOT/rlp). When you distribute an application, the location
of the data files relative to RLP root should not change. See Defining an RLP Environment [19] .
Context Properties
None
Description
A named entity is a proper noun or adjective, such as the name of a person ("George W. Bush"), an
organization ("Red Cross"), a location ("Mt. Washington"), a geo-political entity ("New York"), a
facility ("Fenway Park"), a nationality ("American"), or a religion ("Christian"). The Named Entity
Extractor has been statistically trained to identify entities of these seven types in some or all of the
languages listed above: PERSON, ORGANIZATION, LOCATION, GPE, FACILITY,
NATIONALITY, and RELIGION. For more information, see Named Entities [221] .
The Named Entity Extractor returns a list of identified entities. Each entity is defined by three integers:
index of the first token in the entity, index of the last token + 1 in the entity, and the entity type. See
NAMED_ENTITY [80] .
3If you use the alternative configuration file, LOCATION is expanded to include GPE. See Named Entity Definitions [222] .
Examples of Named Entities in Different Languages
To find other entity types (such as dates, email addresses, or weapons), use Gazetteer [130] and the
Regular Expression [150] processors.
Currently, named entity extraction is supported for the following languages: Arabic, Simplified
Chinese, Traditional Chinese, Dutch, English, Upper-Case English (use the en_uc language code
when you process upper-case English text), French, German, Italian, Japanese, Korean, Persian,
Spanish, and Urdu. For the definition of named entity types, see Named Entity Type Definitions [222] .
10.18. Named Entity Redactor
Dependencies
Any of the following language processors: Gazetteer, NamedEntityExtractor, RegExpLP
Language Dependent
No
XML-Configurable Options
The Named Entity Redactor uses BT_ROOT/rlp/etc/ne-types.xml to remove duplicates when
named entities are returned by more than one processor or different processors tag the same or an
overlapping set of tokens as more than one named entity. For each named entity type, this file assigns
three integer weight values.
<ne_types>
<ne_type>
<name>PERSON</name>
<weight name="statistical" value="10" />
<weight name="gazetteer" value="10" />
<weight name="regex" value="10" />
</ne_type>
<ne_type>
<name>ORGANIZATION</name>
<subtypes>
<name>GOVERNMENT</name>
<name>COMMERCIAL</name>
<name>EDUCATIONAL</name>
<name>NONPROFIT</name>
</subtypes>
<weight name="statistical" value="10" />
<weight name="gazetteer" value="10" />
<weight name="regex" value="10" />
</ne_type>
<!-- other ne_types -->
...
...
</ne_types>
As the file is shipped, all weights are 10. Adjust these weights to instruct the Named Entity
Redactor which processor it should favor if more than one processor returns the same named entity,
or which entity type it should favor if processors return different types for the same set of tokens in
the input.
If, for example, you want to favor gazetteer entries over regular expressions, and favor both over values
returned by statistical analysis, you could set the weights as follows:
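For instance, the PERSON entry could be weighted like this (the values are illustrative; what matters
is their relative order):
<weight name="statistical" value="10" />
<weight name="gazetteer" value="30" />
<weight name="regex" value="20" />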
If different processors identify a given string as different types, processor weights determine which
type is returned. If, for example, statistical (Named Entity Extractor) identifies "Foo" as an
ORGANIZATION and gazetteer (Gazetteer) identifies it as MY_TYPE, the weights in the preceding
example specify that gazetteer outranks statistical, so the entity is returned as MY_TYPE.
When you define new entity types for gazetteers and regular expressions, you should add those entity
types to ne-types.xml if you want to control how the redactor resolves conflicts. Types that do not
appear in this file receive weights of 10 for all three processors.
Apart from setting weights, it is a good idea to put the entity types you define for gazetteers and regular
expressions in ne-types.xml so that the file also serves as a central repository for the entity types that
you are using.
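For example, the DISEASE:INFECTIOUS type used in the gazetteer examples in the next chapter
could be registered with an entry like the following sketch, which reuses the ne_type syntax shown
above (the weights are illustrative):
<ne_type>
<name>DISEASE</name>
<subtypes>
<name>INFECTIOUS</name>
</subtypes>
<weight name="statistical" value="10" />
<weight name="gazetteer" value="20" />
<weight name="regex" value="10" />
</ne_type>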
Context Properties
None
Description
A named entity is a proper noun, such as the name of a person ("Bill Gates") or an organization
("Microsoft") or a location ("New York City"). It can also be a specific date ("July 14, 1789"). Three
RLP processors detect named entities: Gazetteer [130] , Named Entity Extractor [141] , and Regular
Expressions [150] .
A named entity may be detected more than once or may overlap with another named entity (the same
token may appear in more than one named entity). The Named Entity Redactor detects and eliminates
duplication and overlapping. For each duplicate, Named Entity Redactor assigns the named entity to
a single source, using the weights assigned in BT_ROOT/rlp/etc/ne-types.xml. Types and subtypes
that do not appear in that file (types defined in gazetteers or regular expressions configuration file, and
not included in ne-types.xml), are assigned the default weight of 10. If Gazetteer, Regular Expressions,
and Named Entity Extractor have the same weight for a given named entity, the choice is arbitrary.
10.19. Persian Base Linguistics
Dependencies
Tokenizer, SentenceBoundaryDetector
Language Dependent
Persian (Farsi)
XML-Configurable Options
None. The paths to the FABL dictionaries and related resources are defined in BT_ROOT/rlp/etc/
fabl-options.xml.
Context Properties
For brevity, the com.basistech.fabl prefix has been removed from the property name in the
first column. Hence the full name for variations is com.basistech.fabl.variations.
When processing a query (a collection of one or more search terms) rather than prose (one or more
sentences), set the com.basistech.bl.query global context property [115] to true.
Description
The FABL language processor performs morphological analysis for texts written in Persian and
returns the following result types:
• STEM [81]
• NORMALIZED_TOKEN [80]
• TOKEN_VARIATIONS [82] (if the com.basistech.fabl.variations property is true)
A Note on Stemming
If com.basistech.bl.query is set to true, FABL treats the input as a set of isolated
tokens. Otherwise (the default), the processor uses contextual information to disambiguate
grammar and produce more accurate stems.
In Persian, as opposed to Arabic, there is less need for context to reduce a word to a useful
stem. Many prefixes and suffixes are unambiguous and require no context.
In removing prefixes and suffixes, FABL takes particular care with regard to the Unicode
zero-width non joiner (U+200C). Persian makes extensive use of compound words. In Persian
orthography, the components of a compound are not separated by whitespace. However, they
are also not joined even when the last letter of the leading component could be joined to the
first letter of the following component. The zero-width non joiner is used to prevent renderers
from joining. When stemming, FABL does not stem compounds joined with the zero-width
non joiner, such as روزنامه‌نگار (journalist); FABL does not remove either part of the
compound.
As a general principle, words containing zero-width non joiner and/or superscript alef will
have the same stem as the same words without zero-width non joiner or the superscript
alef.
The stem is determined as follows:
• If the word is not parseable, take the next parseable alternative orthographic variant and return
its stem.
• If no parseable variant exists, set the entire word as the value of the stem.
Normalization
Each Persian token is normalized prior to morphological analysis. During morphological analysis,
FABL may choose a token variant and normalize it to produce the value for the
NORMALIZED_TOKEN result.
Normalization is performed in two stages: generic Arabic script normalization [117] and Persian-
specific normalization.
The following Persian-specific normalizations are performed on the output of the Arabic script
normalization:
• Alef: أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
• Kaf: ك (U+0643) is converted to ک (U+06A9).
• Heh: heh goal (U+06C1) or heh doachashmee (U+06BE) is converted to heh (U+0647).
• Heh with hamza: ۂ (U+06C2) is converted to ۀ (U+06C0).
• Yeh: ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
Following morphological analysis:
• The zero-width non joiner (U+200C) and superscript alef (U+0670) are removed.
Variations
The analyzer can generate a variant form for some tokens to account for the orthographic irregularity
seen in contemporary written Persian. Each variation is generated from the normalized form:
• If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with
Farsi yeh (U+06CC).
• If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw
with waw (U+0648).
• If a word contains a zero-width non joiner (U+200C), a variant is generated without the zero-width
non joiner.
• If a word ends in teh marbuta (U+0629), variants are generated replacing the teh marbuta (1) with
teh (U+062A) and (2) with heh (U+0647).
The stem returned in the STEM result is the normalized token with affixes (prefixes and suffixes)
removed.
10.20. Regular Expression
Dependencies
Any tokenizing processor
Language Dependent
No
XML-Configurable Options
<regexp type="IDENTIFIER:PHONE_NUMBER"><
[\)]([\s]*))|((\d{3})[-]([\s]*)))?(\d{3})[\s]*[-][\s]*(\d{4})(?=\s)]]>
</regexp>
<regexp type="IDENTIFIER:URL"><![CDATA[\b(http://)?(([\w]+\.)+
(com|net|org|edu|gov))\b]]></regexp>
<regexp type="IDENTIFIER:EMAIL"><![CDATA[\b[\w\.]+@([\w]+\.)+
(com|net|org|edu|gov)\b]]></regexp>
<regexp lang="en" type="TEMPORAL:DATE"><![CDATA[\b(January|February|March|
April|May|June|July|August|September|October|November|December|Jan|
Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[\s]+(\d{1,2})(,[\s]+(\d{4}))?\b]]>
</regexp>
<regexp lang="en" type="TEMPORAL:DATE"><![CDATA[\b((Monday|Tuesday|
Wednesday|Thursday|Friday|Saturday|Sunday|Mon|Tue|Wed|Thu|Fri|Sat|Sun),
[\s]+)?(\d{1,2})[\s]+(January|February|March|April|May|June|July|
August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|
Jul|Aug|Sept|Oct|Nov|Dec)(,[\s]+(\d{4}))?\b]]>
</regexp>
<regexp lang="en" type="TEMPORAL:DATE"><![CDATA[\b(January|February|March|
April|May|June|July|August|September|October|November|December|Jan|
Feb|Mar|Apr|May|Jun|Jul|Aug|Sept|Oct|Nov|Dec)[\s]+(\d{4})\b]]>
</regexp>
</regexps>
The type attribute is the named-entity type to which text matched by the regular expression is
assigned.
In the XML file, values take the form "TYPE[:SUBTYPE]" (such as "IDENTIFIER" or
"IDENTIFIER:PHONE_NUMBER"). Regular expressions are particularly suited for identifying
named entities that display a fixed pattern, such as the following:
TYPE:SUBTYPE Description
TEMPORAL:DATE A date
TEMPORAL:TIME A time
IDENTIFIER:EMAIL An email address
IDENTIFIER:URL A URL
IDENTIFIER:DOMAIN_NAME An Internet domain name
IDENTIFIER:IP_ADDRESS An Internet IP address
IDENTIFIER:PHONE_NUMBER A phone number
IDENTIFIER:PERSONAL_ID_NUM A personal ID, such as social security number
The optional lang attribute is the ISO639 language code for the language that the regular expression
can be applied to (see ISO639 Language Codes [12] ). If the attribute is left out, the regular
expression is applied to all languages.
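For example, a rule for the IDENTIFIER:IP_ADDRESS type listed above could omit the lang
attribute so that it applies to text in any language (the pattern shown is an illustrative sketch, not
the shipped rule):
<regexp type="IDENTIFIER:IP_ADDRESS"><![CDATA[\b\d{1,3}(\.\d{1,3}){3}\b]]></regexp>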
Context Properties
None
Description
RegExpLP lets you define regular expressions for named entities. The expressions can be associated
with a single language (in which case they are only applied to documents in that language) or can be
applied to all languages.
The content of each regexp element is a Tcl regular expression. For information about regular
expression syntax, see Creating Regular Expressions [181] .
10.21. REXML
Name
REXML
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
The following context properties control the runtime behavior of the REXML processor. For brevity,
the com.basistech.rexml prefix has been removed from the property names in the first column.
Hence the full name for output_pathname is
com.basistech.rexml.output_pathname.
Description
The REXML processor converts the results of language processing into an XML format, specified
by BT_ROOT/rlp/config/DTDs/rexml.dtd. REXML does not generate any RLP results.
The following example shows REXML output for English input. The input text is UTF-8 and consists
of the single sentence, "The Patriots won." The Language Identifier (RLI) has determined that the
language is English.
<!-- document header omitted -->
<contents>
<tokens>
<token index='0'>
<word>The</word>
<position start='0' end='3' />
<pos>DET</pos>
<stem>the</stem>
</token>
<token index='1'>
<word>Patriots</word>
<position start='4' end='12' />
<pos>NOUN</pos>
<stem>patriot</stem>
</token>
<token index='2'>
<word>won</word>
<position start='13' end='16' />
<pos>VPAST</pos>
<stem>win</stem>
</token>
<token index='3'>
<word>.</word>
<position start='16' end='17' />
<pos>SENT</pos>
<stem>.</stem>
</token>
</tokens>
<sentences>
<sentence>
<sentenceStart>0</sentenceStart>
<sentenceEnd>4</sentenceEnd>
</sentence>
</sentences>
</contents>
</rexml:document>
10.22. Rosette Core Library for Unicode
Dependencies
Encoding: supplied by RLI [164] or by the user. See RCLU Encodings [157] .
Language Dependent
No
XML-Configurable Options
None
Context Properties
The following context properties control the runtime behavior of RCLU.
All of the following context properties activate character transformations when set to the value "yes"
or "true". The default setting for all these properties is "false". For brevity, the
com.basistech.rclu prefix has been removed from the property names in the first column.
Hence the full name for BackslashToYen is com.basistech.rclu.BackslashToYen.
Description
The RCLU language processor converts the input text to UTF-16 (RAW_TEXT [80] ) as required
by other language processors. RCLU also performs transformations, as determined by the context
properties described above and in the order the properties are listed in the context definition. If RCLU
normalizes the text and com.basistech.rclu.mapoffsets [156] is set to true (the default
is false), RCLU returns MAP_OFFSETS [79] .
If you do not provide an encoding, RLI must precede RCLU to detect the encoding. For more
information, see Language Identifier (RLI) [164] .
10.23. Rosette Language Identifier
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
The following context properties control the runtime behavior of the Language Identifier. For brevity,
the com.basistech.rli prefix has been removed from the property names in the first column.
Hence the full name for hint_language is com.basistech.rli.hint_language.
Description
The Language Identifier (RLI) identifies the language, encoding, and writing script of the input.
If the input is (or may be) Unicode, put Unicode Converter [173] in the context in front of RLI.
RLI compares the input document against the statistical profile for every supported language and
encoding (see RLI Languages and Encodings [166] ).
RLI uses an n-gram algorithm for its language and encoding detection. Each built-in profile contains
the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents using
a given language and encoding. The default number of n-grams is 10,000 for double-byte encodings
and 5,000 for single-byte encodings. When input text is submitted for detection, a similar n-gram
profile is built based on that data. The input profile is then compared with all the built-in profiles (a
vector distance measure between the input profile and the built-in profile is calculated). The built-in
profiles are then returned in ascending order of their distance from the input profile, with the closest
match first.
Note: If you are interested in ranking and analyzing multiple language possibilities, rather than simply
obtaining the most likely language, use the Basis Technology Rosette Language Identifier standalone
product, which allows you to set thresholds for tagging possible matches as ambiguous or invalid, to
influence rankings with language-hint settings, and to iterate over all possible language matches with
their respective rankings.
RLI returns three results associated with the profile that ranks highest:
• DETECTED_LANGUAGE [78]
An integer corresponding to one of the language IDs in the table below. The language IDs are defined
in bt_language_names.h (C++) and com.basistech.BTLanguageCodes (Java).
• DETECTED_ENCODING [78]
• DETECTED_SCRIPT [78]
An integer representing the ISO15924 code for the writing script. RLI can detect certain languages
(such as Arabic, Kurdish, Pashto, Persian, Serbian, Urdu, and Uzbek) when the text is transliterated
into Latin script, in addition to text in the native script for that language (see RLI Languages and
Encodings [166] ).
10.24. Sentence Boundary Detector
Dependencies
Requires tokens. See the following table:
Language                                   Dependency
Arabic, Persian, Urdu                      Tokenizer
Chinese (Simplified and Traditional)       CLA
Japanese                                   JLA
Korean                                     KLA
English, German, French, Italian, Spanish  Tokenizer (not recommended; see below)
Language Dependent
Arabic, Simplified and Traditional Chinese, and Japanese. For more accurate results with English,
German, French, Italian, or Spanish, use BL1 [119] to detect sentence boundaries. SentenceBoundaryDetector
does nothing when it follows BL1. If you do want to use SentenceBoundaryDetector with a European
language, do not include BL1.
XML-Configurable Options
None. If you use SentenceBoundaryDetector to process German or English text, it uses a dictionary
for more accurate boundary detection. The paths to the dictionaries are specified in BT_ROOT/rlp/
etc/sbd-config.xml. These dictionaries are not user configurable.
Context Properties
None
Description
In each language, RLP detects where sentence boundaries are in documents. This is more
straightforward in Japanese than in other languages, as the end-of-sentence marker in Japanese is not
used for other purposes. In German and English, in addition to marking sentence boundaries the period
is also used to mark abbreviations. This creates a great deal of ambiguity. Consider the following
sentences:
Mr. Smith went to N.Y.U. Law School. He then worked for John J. Jones in Mass.
He said, "I always sail on weekends." Finally, "Or on weekdays," he added.
Note that periods need not end sentences when they end an abbreviation; however, periods can end
abbreviations and at the same time end a sentence. Also they interact with other punctuation, especially
quotation marks.
10.25. Script Boundary Detector
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
None
Description
The Script Boundary Detector determines regions of homogeneous script within input text. That is,
all characters within a script region belong to a common script. The Script Boundary Detector returns
a result, SCRIPT_REGION [81] , consisting of an array of integer triples. Each triple consists of the
beginning character offset of a region, the end offset + 1 of the region, and an ISO15924 script code.
RLP provides utilities for mapping ISO15924 integer values to 4-character codes and English names.
For C++, see bt_iso15924.h. For Java, see com.basistech.util.ISO15924.
You may use the Script Boundary Detector in conjunction with the Text Boundary Detector [172]
and Language Boundary Detector [139] to identify individual language and script regions within
input text that contains multiple languages and scripts. You may want to use the results to submit
individual regions (from the raw text) to an RLP context designed to perform linguistic processing for
an individual language and script.
For more information about using the Language Boundary Detector to handle multilingual text, see
Processing Multilingual Text [65] .
10.26. Stopwords
Name
Stopwords
Dependencies
Tokenized text: Tokenizer or language analyzer.
Note: You cannot use the Stopwords processor with Chinese, Japanese, or Korean input. The
Chinese [123] , Japanese [134] , and Korean [138] language analyzers have an
ignore_stopwords context property that can be set to return (or not return) stopwords. The
Stopwords processor does nothing if run after any of these processors.
Language Dependent
You can create stopwords [182] for Arabic, Czech, Dutch, English, French, German, Greek,
Hungarian, Italian, Polish, Portuguese, Russian, and Spanish.
XML-Configurable Options
<stopoptions>
<dictionaries>
<!-- Paths to your stopword dictionary files go here. Each file is a
list of words, one-per-line, in UTF-8. Stopword processing is applied
to the tokens, not the stems.
-->
<dictionarypath language="en"><env name="root"/>/etc/en-stopwords.txt
</dictionarypath>
</dictionaries>
</stopoptions>
Each dictionarypath specifies the pathname of the user-defined stopwords file for the indicated
language. The stopwords file is a list of words, one per line, encoded in UTF-8. Blank lines and lines
starting with "#" are ignored.
Context Properties
Description
The Stopword processor marks tokens considered to be stopwords, based on their presence in the
appropriate language-specific stopword dictionary.
The STOPWORD [81] result type contains the results: the token number of each stopword. For
example, the English input sentence, "We went to the movie." has the following six tokens:
We
went
to
the
movie
.
(Note that token numbering starts at 0.) Assuming that "to" and "the" are in the English stopwords file,
the STOPWORD result will contain two entries: 2 and 3, indicating that tokens 2 and 3 are
stopwords.
In output from the REXML output processor, tokens identified as stopwords have the attribute
stopword="yes" attached to their token elements.
Stopword processing applies to the token's surface form, not to the stem. Comparison is case sensitive;
e.g., "The" and "the" are different tokens.
For more detailed information about creating Stopwords files, see User-Defined Data: Customizing
Stopwords [182] .
10.27. Text Boundary Detector
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
None
Description
For use with the Script Boundary Detector [170] and Language Boundary Detector [139] to identify
individual language and script regions within input text that contains multiple languages and scripts.
You can use the results to submit individual regions (from the raw text) to an RLP context designed
to perform linguistic processing for an individual language and script.
For more information about using the Language Boundary Detector to handle multilingual text, see
Processing Multilingual Text [65] .
The Text Boundary Detector determines sentence unit boundaries as defined by Unicode Standard
Annex #29 [http://www.unicode.org/reports/tr29/]. It does not search for sentence units but rather the
boundaries between them. Though there are language-dependent factors here (see Sentence Boundary
Detector [169] ), the processor uses Unicode requirements to find likely boundaries without knowing
the language of the text. The boundaries are defined in terms of character properties. The Text
Boundary Detector returns a result, TEXT_BOUNDARIES [81] , consisting of an array of character
offsets. Each offset represents the end character + 1 of each sentence unit in the input text. For example,
a value of 12 indicates that character 11 is the last character before the boundary.
10.28. Tokenizer
Name
Tokenizer
Dependencies
None
Language Dependent
No
XML-Configurable Options
None
Context Properties
None.
Description
The Tokenizer provides word tokenization functionality based on the algorithms provided in Chapter
5 of The Unicode Standard 3.0 and UAX #29 Text Boundaries in Unicode 4.0.
If BL1 [119] , CLA [123] , JLA [134] , and/or KLA [138] are in the context, they should precede
the Tokenizer. These language processors do their own tokenization, and the Tokenizer does nothing
if one of them has already run. DO NOT place Tokenizer before BL1, CLA, JLA, or KLA in the
context; if you do so, Tokenizer performs language-neutral tokenization, and the other processors are
unable to generate language-specific tokens, part-of-speech tags, and other critical information.
ARBL [115] , FABL [147] , and URBL [174] depend on the Tokenizer (and Sentence Boundary
Detector [169] ).
10.29. Unicode Converter
Dependencies
Input must be in a UTF-8, UTF-16, or UTF-32 encoding. If the user supplies the encoding in an API
call to a context object, it must be one of the following strings:
• UTF-8
• UTF-8BOM
• CESU-8
• UTF-16BE
• UTF-16LE
• UTF-16
• UTF-32BE
• UTF-32LE
• UTF-32
Language Dependent
No
XML-Configurable Options
None
Context Properties
None
Description
The Unicode Converter takes text in any of the Unicode encoding forms (UTF-8, and big- and
little-endian UTF-16 and UTF-32) and converts it to UTF-16 (RAW_TEXT) [80] . The processor
does not perform language detection.
If present, the Unicode byte-order mark (BOM) is removed prior to conversion to UTF-16; subsequent
language processors in the RLP context do not see the BOM.
Sections 2.5 and 2.6 of The Unicode Standard, Version 4.0 discuss the Unicode encoding forms and
encoding schemes at great length.
10.30. Urdu Base Linguistics
Dependencies
Tokenizer, SentenceBoundaryDetector
Language Dependent
Urdu
XML-Configurable Options
None. The paths to the URBL dictionaries and related resources are defined in urbl-options.xml.
Context Properties
When processing a query (a collection of one or more search terms) rather than prose (one or more
sentences), set the com.basistech.bl.query global context property [115] to true.
Description
The URBL language processor performs morphological analysis for texts written in Urdu and
returns the following result types:
• STEM [81]
• NORMALIZED_TOKEN [80]
Normalization
Each Urdu token is normalized prior to morphological analysis. The normalized form is returned in
the NORMALIZED_TOKEN result.
Normalization is performed in two stages: generic Arabic script normalization [117] and Urdu-
specific normalization.
The following language-specific normalizations are performed on the output of the Arabic script
normalization:
• Fathatan (U+064B), zero-width non joiner (U+200C), and jazm (U+06E1) are removed.
• Alef: أ (U+0623), إ (U+0625), or ٱ (U+0671) is converted to ا (U+0627).
• Kaf: ك (U+0643) is converted to ک (U+06A9).
• Heh with hamza: ۀ (U+06C0) is converted to ۂ (U+06C2).
• Yeh: ي (U+064A) or ى (U+0649) is converted to ی (U+06CC).
• Small high dotless head of khah (U+06E1) is removed.
Variations
The analyzer can generate a number of variant forms for each Urdu token to account for the
orthographic irregularity seen in contemporary written Urdu. Each variation is generated over the
output of the previous, starting with the normalized form:
• If a word contains hamza on yeh (U+0626), a variant is generated replacing the hamza on yeh with
Farsi yeh (U+06CC).
• If a word contains hamza on waw (U+0624), a variant is generated replacing the hamza on waw
with waw (U+0648).
• If a word contains a superscript alef (U+0670), a variant is generated without the superscript alef.
• If a word contains heh doachashmee (U+06BE), a variant is generated replacing the heh
doachashmee with heh goal (U+06C1).
• If a word ends with teh marbuta (U+0629), a variant is generated replacing the teh marbuta with
heh goal (U+06C1).
The stem returned in the STEM result is the normalized token with affixes (such as prepositions,
conjunctions, the definite article, proclitic pronouns, and inflectional prefixes) removed.
Chapter 11. User-Defined Data
Several of the RLP language processors use or even require user-defined data of some form. This chapter
describes how to create and integrate user-defined data with those RLP processors.
The Gazetteer and Regular Expressions processors are customizable in terms of the data (entities) they
return, and the entity types and subtypes (predefined or user-defined) with which they tag individual
entities.
Types and Subtypes. As explained in Result Types: NAMED_ENTITY [80] , an integer triple is
generated for each named entity. The first two integers define the range of tokens that make up the entity.
The third integer identifies the entity type and optional subtype, as well as the origin (Named Entity
Extractor, Gazetteer, or Regular Expression). RLP maps these integers to strings (TYPE[:SUBTYPE]),
such as "PERSON" and "ORGANIZATION:GOVERNMENT". By convention, the names for predefined
types and subtypes are upper case. The integers and strings for predefined types and subtypes are specified
in bt_ne_types.h (C++) and com.basistech.rlp.RLPNENamedConstants (Java), which also
defines the API for getting from integer to string and vice versa.
Defining your own Types and Subtypes. You can define your own types and subtypes. You define the
name; RLP takes care of defining the unique integer to be used to identify this type/subtype and origin.
You use the same API as for predefined types to get from integer to string and vice versa.
For Gazetteer, you define types and lists of the names you want to find for each type. The types may
be predefined or user defined. See Customizing Gazetteer [178] .
For the Regular Expression processor, you define the types and rules for finding named entities.
The types may be predefined or user defined. See Creating Regular Expressions [181] .
You can also provide "weights" to predefined and user-defined types to determine how the Named Entity
Redactor [145] resolves conflicts when more than one processor returns the same or an overlapping set
of tokens.
Customizing Gazetteer
Multiple files may be defined for multiple purposes, such as tracking famous personalities in news media,
infectious diseases in journal articles, or specific product names and trademarks in market reports.
Important
The Gazetteer only uses the gazetteers specified in the Gazetteer options file, BT_ROOT/rlp/etc/
gazetteer-options.xml.
Gazetteer processes text and isolates specific terms defined by the user in a Gazetteer Source File (GSF),
which has the following properties:
• The first non-comment line is the TYPE[:SUBTYPE], which applies to the entire GSF, and will be used
as the entity type name for output. Type and subtype may be predefined or user-defined [177] .
For example, let us say that you wish to track common infectious diseases. You might create a GSF that
looks like this:
# File: infectious-diseases-gazetteer.txt
#
DISEASE:INFECTIOUS
tuberculosis
e. coli
malaria
influenza
In many cases, a single GSF may not be enough. You may create as many GSFs as you like. For example,
perhaps you also want to search for the scientific names of the infectious disease in the example above.
You might create a file like this:
# File: latin-infectious-gazetteer.txt
#
DISEASE:INFECTIOUS
Mycobacterium tuberculosis
Escherichia coli
Plasmodium malariae
Orthomyxoviridae
# File: infectious-bacterial-gazetteer.txt
#
DISEASE:BACTERIAL
Escherichia coli
E. coli
Staphylococcus aureus
Streptococcus pneumoniae
Salmonella
# File: resistant-diseases-gazetteer.txt
#
DISEASE:RESISTANT
Staphylococcus aureus
Streptococcus pneumoniae
Salmonella spp.
Campylobacter jejuni
Escherichia coli
Enterococcus faecium
# File: antimicrobial-drugs-gazetteer.txt
#
DRUG:ANTIMICROBIAL
methicillin
vancomycin
macrolide
fluoroquinolone
Suppose you wish to track infectious diseases, particularly those diseases that have developed resistance
to antimicrobial drugs. Then you might configure gazetteer-options.xml as follows, using example GSFs
from the previous section:
<gazetteerconfig>
<DictionaryPaths>
<DictionaryPath><env name="root"/>
/rlp/source/samples/infectious-diseases-gazetteer.txt
</DictionaryPath>
<DictionaryPath><env name="root"/>
/rlp/source/samples/resistant-diseases-gazetteer.txt
</DictionaryPath>
<DictionaryPath><env name="root"/>
/rlp/source/samples/antimicrobial-drugs-gazetteer.txt
</DictionaryPath>
</DictionaryPaths>
</gazetteerconfig>
The following example shows the Infectious Diseases Gazetteer defined above in plain text [178] in the
XML format:
<gazetteer>
<!-- header and entity type declaration (DISEASE:INFECTIOUS) omitted -->
<entities>
<entity>
<names>
<name>
<data>tuberculosis</data>
</name>
</names>
</entity>
<entity>
<names>
<name>
<data>e. coli</data>
</name>
</names>
</entity>
<entity>
<names>
<name>
<data>malaria</data>
</name>
</names>
</entity>
<entity>
<names>
<name>
<data>influenza</data>
</name>
</names>
</entity>
</entities>
</gazetteer>
Creating Regular Expressions
The type attribute for each <regexp> element specifies the type and optional subtype of the entities to
be returned by the regular expression in the <regexp> element . Type and subtype may be predefined or
user-defined [177] . For example, if the type is "compound" and subtype is "organic",
type="compound:organic".
The Regular Expression language processor uses the Tcl regular expression engine with some character
class extensions. It is compatible with the Perl 5 regular expression syntax, with the exceptions noted below.
• Using character properties (\p{} and \P{}) to match characters is not supported.
The Basis Technology implementation has extended character classes to cover a number of character
properties (see below).
Tcl supports the full Unicode locale. Character classes are extended to cover all Unicode characters.
Standard Character Classes. [:alpha:] A letter. [:upper:] An upper-case letter. [:lower:] A lower-case
letter. [:digit:] A decimal digit. [:xdigit:] A hexadecimal digit. [:alnum:] An alphanumeric (letter or digit).
[:print:] An alphanumeric (same as [:alnum:]). [:blank:] A space or tab character. [:space:] A character
producing whitespace in displayed text. [:punct:] A punctuation character. [:graph:] A character with a
visible representation. [:cntrl:] A control character.
Character Classes for Unicode Properties. [:Cc:] Control. [:Cf:] Format. [:Co:] Private Use. [:Cs:]
Surrogate. [:Ll:] Lower-case letter. [:Lm:] Modifier letter. [:Lo:] Other letter. [:Lt:] Title-case letter. [:Lu:]
Upper-case letter. [:Mc:] Spacing mark. [:Me:] Enclosing mark. [:Mn:] Non-spacing mark. [:Nd:] Decimal
Number. [:Nl:] Letter number. [:No:] Other number. [:Pc:] Connector punctuation. [:Pd:] Dash punctuation.
[:Pe:] Close punctuation. [:Pf:] Final punctuation. [:Pi:] Initial punctuation. [:Po:] Other punctuation. [:Ps:]
Open punctuation. [:Sc:] Currency symbol. [:Sk:] Modifier symbol. [:Sm:] Mathematical symbol. [:So:]
Other symbol. [:Zl:] Line separator. [:Zp:] Paragraph separator. [:Zs:] Space separator.
Character Classes for Writing Scripts. [:Arabic:] [:Armenian:] [:Bengali:] [:Bopomofo:] [:Braille:]
[:Buginese:] [:Buhid:] [:Canadian_Aboriginal:] [:Cherokee:] [:Common:] [:Coptic:] [:Cyrillic:]
[:Devanagari:] [:Ethiopic:] [:Georgian:] [:Glagolitic:] [:Greek:] [:Gujarati:] [:Gurmukhi:] [:Han:]
[:Hangul:] [:Hanunoo:] [:Hebrew:] [:Hiragana:] [:Inherited:] [:Kannada:] [:Katakana:] [:Khmer:] [:Lao:]
[:Latin:] [:Limbu:] [:Malayalam:] [:Mongolian:] [:Myanmar:] [:New_Tai_Lue:] [:Ogham:] [:Oriya:]
[:Runic:] [:Sinhala:] [:Syloti_Nagri:] [:Syriac:] [:Tagalog:] [:Tagbanwa:] [:Tai_Le:] [:Tamil:] [:Telugu:]
[:Thaana:] [:Thai:] [:Tibetan:] [:Tifinagh:] [:Yi:]
You can use these character class extensions just like the standard character classes.
For example, [[:Hiragana:]] matches a single Hiragana character, and [[:Zs:]] matches a whitespace
character.
Reference Documentation. For a description of Tcl syntax for regular expressions, see Tcl Regular
Expression Syntax [259] . Unless you specify otherwise (see "Metasyntax"), a regular expression is
understood to be an Advanced Regular Expression (ARE) as described in that documentation.
Customizing Stopwords
For supported languages other than Chinese, Korean, and Japanese, the Stopwords language processor uses
user-defined dictionaries very similar to the Gazetteer Source Files to identify and exclude stopwords. A
sample stopword dictionary for English, en-stopwords.txt, is in the rlp/etc directory. Each dictionary file
is a list of words, one word per line, in UTF-8 format. Stopword processing of these files is applied to the
tokens, not the stems.
For Chinese, Korean, and Japanese, the corresponding language processor uses a stopwords list in
UTF-8 encoding. See Editing the Stopwords List for Chinese, Korean, or Japanese [184] .
Creating Stopword Dictionaries (Not for Chinese, Korean, or Japanese)

The following sample Spanish stopword dictionary illustrates the format (one word per line, in UTF-8):
a
adonde
al
como
con
conmigo
contigo
cuando
cuanto
de
del
donde
el
ella
ellas
ellos
eres
es
esta
estamos
estan
estas
la
las
los
mas
mucho
muchos
nosotros
pero
pues
que
quien
son
soy
todo
todos
tu
tus
usted
ustedes
yo
This list is by no means comprehensive. As you use RLP, you may find more words that should be treated
as stopwords.
Configuring Stopwords

To use the new dictionary, add a dictionarypath element for Spanish to the stopoptions configuration:
<stopoptions>
<dictionaries>
<!-- Paths to your stopword dictionary files go here. Each file is a list
of words, one-per-line, in UTF-8. Stopword processing is applied to the
tokens, not the stems.
-->
<dictionarypath language="en"><env name="root"/>/etc/en-stopwords.txt
</dictionarypath>
<dictionarypath language="es"><env name="root"/>/etc/es-stopwords.txt
</dictionarypath>
</dictionaries>
</stopoptions>
The config file will now use both the sample English stopword dictionary and your newly created Spanish
stopword dictionary.
For complete information on using the Stopwords language processor, see Stopwords [171] .
You may want to add stopwords to these files. When you edit one of these files, keep to the format described above: one word per line, in UTF-8.
Creating User Dictionaries
For the languages supported by these analyzers, you may want to create user dictionaries for words specific
to a particular industry or application.
For optimal performance, keep the number of dictionaries you create per language to a minimum.
To create and use a BL1 user dictionary:
1. Create a UTF-8 source file containing the dictionary entries (described below).
2. Compile the source file into a binary dictionary [186].
3. Put the binary file in the BL1 dictionary directory for the language [187].
4. Edit the BL1 configuration file to include the user dictionary [187].
The source file for a user dictionary is UTF-8 encoded. The file may begin with a byte order mark (BOM).
Empty lines are ignored.
Each entry is a single line, with a Tab character separating the word from its analysis:

word<Tab>analysis

In some cases (as described below) the word or the analysis may be empty.

The word is a TOKEN [82]. In a few cases (such as "New York"), it may contain one or more space characters. The three characters '[', ']', and '\' must be escaped by prefixing the character with '\'. If, for example, the word is 'ABC\XYZ', enter it as 'ABC\\XYZ'.
European Language User Dictionaries
Important
BL1 user dictionary lookups occur after tokenization. If your dictionary contains a word like 'hi
there', it will not be found because the tokenizer identifies 'hi' and 'there' as separate tokens.
The analysis is the STEM [81] with 0 or more morphological tags and special tags, and a required POS [80] tag. Tags are placed in square brackets ([ ]). POS tags and morphological tags begin with "+". The maximum size of an analysis is 128 units, where a normal character counts as 1 unit and a tag counts as 2.
Morphological tags are used by the Named Entity Extractor [141] to help identify named entities (Dutch,
English, French, German, Italian, and Spanish).
Special tags are used to divide compound words into their components (German, Dutch, and Hungarian)
and to define boundaries for multi-word baseforms, contractions and elisions, and words with clitics
(German, Dutch, Hungarian, English, French, Italian, and Portuguese).
Morphological and special tags appear in Appendix C: Morphological and Special Tags [253] .
A POS tag identifying part of speech is required and is the last (right-most) tag in the analysis. Valid POS
tags appear in Appendix B: Part-of-Speech Tags [223] .
English examples:
dog dog[+NOUN]
Peter Peter[+Masc][+PROP]
NEW YORK New[^_]York[+Place][+City][+PROP]
doesn't does[^=]not[+VDPRES]
Variations: You may want to provide more than one analysis for a word or more than one version of a
word for an analysis. To avoid unnecessary repetition, include lines with empty analyses (word + Tab),
and lines with empty words (Tab + analysis). A line with an empty analysis uses the previous non-empty
analysis. A line with an empty word uses the previous non-empty word.
The following example includes two analyses for "telephone" (noun and verb), and two renditions of "dog"
for the same analysis (noun). Note: the dictionary lookup is case sensitive.
telephone telephone[+NOUN]
telephone[+VI]
dog dog[+NOUN]
Dog
Prerequisites
• Unix or Cygwin (for Windows).
• The BT_ROOT environment variable must be set to the Basis Technology root directory (BT_ROOT). For example, if the RLP SDK is installed in /usr/local/basistech, set BT_ROOT to /usr/local/basistech.2
• The BT_BUILD environment variable must be set to the platform identifier embedded in your SDK
package file name (see Supported Platforms [16] ).2
To compile the dictionary into a binary format that BL1 can use, issue the following command:
where lang is the two-letter language code (en_uc for upper-case English; see BL1 [119]) and output is the pathname of the binary dictionary file. If you are generating both a little-endian and a big-endian dictionary, use user_dict-LE.bin for the little-endian file and user_dict-BE.bin for the big-endian file. Choose a descriptive name in place of user_dict.
To organize the placement of user dictionaries, you may want to put the binary dictionary file in a
BT_ROOT/rlp/bl1/dicts language directory where the directory name matches the language code. For
example, put an English user dictionary in BT_ROOT/rlp/bl1/dicts/en.
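For example (a sketch; my_terms is a hypothetical dictionary name):

mkdir -p BT_ROOT/rlp/bl1/dicts/en
cp my_terms-LE.bin my_terms-BE.bin BT_ROOT/rlp/bl1/dicts/en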
Example: an English user dictionary available as both a little-endian and a big-endian binary dictionary.
<bl1-options language="en">
...
...
<user-dict><env name="root"/>/bl1/dicts/en/userdict-<env name="endian"/>.bin</user-dict>
</bl1-options>
Example: a German user dictionary available in only one form (little-endian or big-endian).
2In place of setting BT_ROOT and BT_BUILD, you can set a single environment variable (BINDIR) to BT_ROOT/rlp/bin/BT_BUILD .
<bl1-options language="de">
...
...
<user-dict><env name="root"/>/bl1/dicts/de/userdict.bin</user-dict>
</bl1-options>
Note
At runtime, RLP replaces <env name="root"/> with the path to the RLP root directory.
For more information about the BL1 configuration file, see BL1 [119] .
Chinese User Dictionaries

For efficiency, Chinese user dictionaries are compiled into a binary form with big-endian or little-endian byte order to match the platform.
To create and use a Chinese user dictionary:
1. Create a user dictionary source file in UTF-8 (described below).
2. Compile the source file into a binary dictionary [189].
3. Put the binary file in the CLA dictionary directory.
4. Edit the CLA configuration file to include the user dictionary [190].
Each entry is a single line:

word<Tab>POS<Tab>DecompPattern

where word is the noun, POS is one of the user-dictionary part-of-speech tags listed below, and DecompPattern (optional) is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition). The individual components that make up the compound are in the COMPOUND [78] results.
• FOREIGN_PERSON
For example, the entry for the 6-character compound 深圳发展銀行 with the decomposition pattern 2,2,2 is decomposed into the components

深圳
发展
銀行

The sum of the digits in the pattern must match the number of characters in the entry. A pattern whose digits sum to 13, for example, is invalid for this entry, because the entry has 6 characters while such a pattern describes a 13-character string.
The POS and the decomposition pattern can also be entered with Chinese full-width numerals and full-width Roman letters. Decomposition can be prevented by specifying a pattern with the special value "0" or by specifying a pattern consisting of a single digit equal to the length of the entry.
For example:
北京人 noun 0
or
北京人 noun 3
Tokens matching this entry will not be decomposed. To prevent a word that is also listed in a system
dictionary from being decomposed, set com.basistech.cla.favor_user_dictionary to true.
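A sketch of how that property might be set: the property name comes from this guide, but the element syntax below is an assumption, so verify it against your own RLP context configuration.

<!-- Prefer user-dictionary entries over system entries (element syntax assumed). -->
<property name="com.basistech.cla.favor_user_dictionary" value="true"/>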
Prerequisites
• Unix or Cygwin (for Windows).
• The BT_ROOT environment variable must be set to the Basis Technology root directory (BT_ROOT). For example, if the RLP SDK is installed in /usr/local/basistech, set BT_ROOT to /usr/local/basistech.
• The BT_BUILD environment variable must be set to the platform identifier embedded in your SDK
package file name (see Supported Platforms [16] ).
To compile the dictionary into a binary format, issue the following command:
For example, if you have a user dictionary named user_dict.utf8, build the binary user dictionary
user_dict.bin with the following command:
Note
If you are making the user dictionary available for little-endian and big-endian platforms, you can
compile the dictionary on both platforms, and differentiate the dictionaries by using
user_dict_LE.bin for the little-endian dictionary and user_dict_BE.bin for the big-endian
dictionary.
• Word entries (one word per line with no POS, but may include a Tab and a decomposition pattern). The
decomposition pattern is a series of one or more digits without commas. For example:
深圳发展銀行 24
<claconfig>
...
...
<dictionarypath><env name="root"/>/cma/dicts/user_dict.bin</dictionarypath>
</claconfig>
If you are making the user dictionary available for little-endian and big-endian platforms, and you are
differentiating the two files as indicated above ("LE" and "BE"), you can set up the CLA configuration file
to choose the correct binary for the runtime platform:
<claconfig>
...
...
<dictionarypath><env name="root"/>/cma/dicts/user_dict_<env name="endian"/>.bin</dictionarypath>
</claconfig>
The <env name="endian"/> in the dictionary name is replaced at runtime with "BE" if the platform byte
order is big-endian or "LE" if the platform byte order is little-endian.
Note
At runtime, RLP replaces <env name="root"/> with the path to the RLP root directory.
If you are not compiling a Chinese user dictionary, you can put a reference to the source file in the CLA configuration file. For example, suppose the user dictionary is named userdict.utf8 and is in BT_ROOT/rlp/cma/dicts. Modify BT_ROOT/rlp/etc/cla-options.xml to include the new <dictionarypath> element.
<claconfig>
...
...
<dictionarypath><env name="root"/>/cma/dicts/user_dict.bin</dictionarypath>
<dictionarypath><env name="root"/>/cma/dicts/userdict.utf8</dictionarypath>
</claconfig>
Japanese User Dictionaries

For efficiency, Japanese user dictionaries are compiled into a binary form with big-endian or little-endian byte order to match the platform.
To create and use a Japanese user dictionary:
1. Create a user dictionary source file in UTF-8 (described below).
2. Compile the source file into a binary dictionary [192].
3. Put the binary file in the JLA dictionary directory.
4. Edit the JLA configuration file to include the user dictionary [194].
If you want to identify the dictionary (see TOKEN_SOURCE_NAME [82]) in which JLA found each token, you must assign each user dictionary a name, and you must compile the dictionary [192]. At the top of the file, enter a line of the following form, where Dictionary Name is the name you want to assign to the dictionary:
Each entry is a single line:

word<Tab>POS<Tab>DecompPattern

where word is the noun, POS is one of the user-dictionary part-of-speech tags listed below, and DecompPattern (optional) is the decomposition pattern: a comma-delimited list of numbers that specify the number of characters from word to include in each component of the compound (0 for no decomposition). The individual components that make up the compound are in the COMPOUND [78] results.
Examples:
デジタルカメラ NOUN
デジカメ NOUN 0
東京証券取引所 ORGANIZATION 2,2,3
狩野 SURNAME 0
The POS and the decomposition pattern can also be entered with Japanese full-width numerals and full-width Roman letters.
The "2,2,3" decomposition pattern instructs JLA to decompose this compound entry into
東京
証券
取引所
A user dictionary may also contain entries that include Private Use Area (PUA) characters. See Entering
Non-Standard Characters in a User Dictionary [197] .
The platform on which you compile the dictionary determines the byte order. To use the dictionary on both a little-endian platform (such as an Intel x86 CPU) and a big-endian platform (such as Sun Solaris), generate a binary dictionary on each of these platforms.
Prerequisites
• Unix or Cygwin (for Windows).
• The BT_ROOT environment variable must be set to the Basis Technology root directory (BT_ROOT). For example, if the RLP SDK is installed in /usr/local/basistech, set BT_ROOT to /usr/local/basistech.
• The BT_BUILD environment variable must be set to the platform identifier embedded in your SDK
package file name (see Supported Platforms [16] ).
To compile the dictionary into a binary format that JLA can use, issue the following command:
For example, if you have a user dictionary named user_dict.utf8, build the binary user dictionary
user_dict.bin with the following command:
Important
The extension for the Japanese dictionary files (system and user) must be .bin.
Note
If you are making the user dictionary available for little-endian and big-endian platforms, you can
compile the dictionary on both platforms, and differentiate the dictionaries by using
user_dict_LE.bin for the little-endian dictionary and user_dict_BE.bin for the big-endian
dictionary.
• Word entries (one word per line with no POS, but may include a Tab and a decomposition pattern). The
decomposition pattern is a series of one or more digits without commas. For example:
東京証券取引所 223
We recommend that you put your Japanese user dictionaries in BT_ROOT/rlp/jma/dicts, where
BT_ROOT is the root directory of the RLP SDK.
To use user_dict.bin with JLA, modify the jla-options.xml file to include it. For example, if you put your user dictionary in the location we recommend (the directory that contains the system Japanese dictionary), modify the file to read as follows:
<DictionaryPaths>
<DictionaryPath><env name="root"/>/jma/dicts/JP_<env name="endian"/>.bin</DictionaryPath>
<!-- Add a DictionaryPath for each user dictionary -->
<DictionaryPath>
<env name="root"/>/jma/dicts/user_dict.bin
</DictionaryPath>
</DictionaryPaths>
If you are making the user dictionary available for little-endian and big-endian platforms, and you are
differentiating the two files as indicated above ("LE" and "BE"), you can set up the JLA configuration file
to choose the correct binary for the runtime platform:
<DictionaryPaths>
<DictionaryPath><env name="root"/>/jma/dicts/JP_<env name="endian"/>.bin</DictionaryPath>
<!-- Add a DictionaryPath for each user dictionary -->
<DictionaryPath>
<env name="root"/>/jma/dicts/user_dict_<env name="endian"/>.bin
</DictionaryPath>
</DictionaryPaths>
The <env name="endian"/> in the dictionary name is replaced at runtime with "BE" if the platform byte
order is big-endian or "LE" if the platform byte order is little-endian.
Note
At runtime, RLP replaces <env name="root"/> with the path to the RLP root directory.
Korean User Dictionary

Note: Prior to Release 6.0, the contents of this dictionary were maintained in two separate dictionaries: a Hangul dictionary and a compound noun dictionary.
As specified by the dictionarypath element in the KLA options file [138], this dictionary in its compiled form is in BT_ROOT/rlp/kma/dicts. If your platform is little-endian, the compiled dictionary filename is kla-usr-LE.bin; if your platform is big-endian, it is kla-usr-BE.bin. You can modify and recompile this dictionary. Do not change its name.
Each entry is a single line:

word<Tab>POS<Tab>DecompPattern

word is the stem form of the word. Verbs and adjectives should not include the "-ta" suffix.
POS is one or more of the user-dictionary part-of-speech tags listed below. An entry can have multiple
parts of speech; simply concatenate the part of speech codes. For example, the POS for a verb that can be
used transitively and intransitively is "IT".
DecompPattern (optional) is the decomposition pattern for a compound noun: a comma-delimited list of
numbers that specify the number of characters from word to include in each component of the compound
(0 for no decomposition). KLA uses a decomposition algorithm to decompose compound nouns that contain
no DecompPattern. The individual components that make up the compound are in the COMPOUND
[78] results.
POS Meaning
N Noun
P Pronoun
U Auxiliary noun
M Numeral
c Compound noun
T Transitive verb
I Intransitive verb
W Auxiliary verb
S Passive verb
C Causative verb
J Adjective
K Auxiliary adjective
B Adverb
D Determiner
L Interjection (exclamation)
Examples:
개배때기 N
그러더니 B
그러던 D
꿰이 TC
개인홈페이지 c
경품대축제 c 2,3
One compound noun (개인홈페이지) contains no decomposition pattern, so KLA uses a decomposition
algorithm to decompose it. For the other compound noun (경품대축제), the "2,3" decomposition pattern
instructs KLA to decompose it into
경품
대축제
You can add new entries and modify or delete existing entries.
Prerequisites
• Unix or Cygwin (for Windows).
• The BT_ROOT environment variable must be set to the Basis Technology root directory (BT_ROOT). For example, if the RLP SDK is installed in /usr/local/basistech, set BT_ROOT to /usr/local/basistech.
• The BT_BUILD environment variable must be set to the platform identifier embedded in your SDK package name, such as ia32-glibc22-gcc32. For a list of the BT_BUILD values, see Supported Platforms and BT_BUILD Values [16].
To compile the dictionary into a binary format, issue the following command:
where input is the input filename (kla-userdict.u8, unless you have changed the name) and output is kla-usr-LE.bin if your platform is little-endian or kla-usr-BE.bin if your platform is big-endian.
There can only be one user dictionary, so we recommend you use the default filename. If you want to use
a different filename, you must add a userdictionarypath element to kla-options.xml with the
filename (no path). Suppose, for example, that you have compiled the user dictionary with the name my-
kla-usr-LE.bin and placed that file in the dictionary directory. Edit kla-options.xml so it contains
userdictionarypath as indicated below:
<klaconfig>
  <dictionarypath><env name="root"/>/kma/dicts</dictionarypath>
  <userdictionarypath>my-kla-usr-LE.bin</userdictionarypath>
  ...
  ...
</klaconfig>
Entering Non-Standard Characters in a Japanese User Dictionary

PUA Characters. Characters in the hexadecimal range E000-F8FF. Use \uxxxx notation, where the u is lower case and each x is a hexadecimal digit; for example, \ue000 represents U+E000.
Chapter 12. RLP - Lucene/Solr Integration
12.1. Introduction
BT_ROOT/rlp/samples/lucene contains sample integration code to integrate RLP with Lucene and
Solr, for applications that index and search English and Japanese documents. As described below, the code
also provides a starting point for indexing and searching text in any of the other languages that RLP
supports.
For the API documentation that accompanies this code, consult Java Lucene/Solr Sample Integration in
API Reference [api-reference/index.html].
RLPTokenizer is designed to be as flexible as possible, and it can be used for any of the languages that
RLP supports.
RLPEnAnalyzer and RLPJaAnalyzer, on the other hand, are provided purely as samples. As a
developer of search applications, you should make your own Analyzer, perhaps by modifying these
analyzers. See Writing Your Own Analyzer [205] .
A Lucene Filter named POSFilter is provided so that tokens of unwanted parts of speech can be filtered
out. It works with a TokenStream generated by RLPTokenizer, and the set of available POS tags varies
depending on the language. Please refer to Part-of-Speech Tags [223] for the list of the POS tags for
each language.
Demos. The org.apache.lucene.demo package contains a version of the standard Lucene demo
programs, modified to use the RLP-powered Analyzer. See Running the Lucene Command-Line-Interface
Demos [200] .
12.2. Requirements
• RLP version 6.0.0 or later with a license to run the BL1 language processor for the English Analyzer,
the JLA and RCLU 1 processors for the Japanese Analyzer.
1The Japanese Analyzer uses RCLU to normalize character width variations (Unicode Normalization Form KC). If your RLP license does not include
RCLU, remove
"<languageprocessor>RCLU</languageprocessor>" +
Ant assumes that you have set the JAVA_HOME environment variable to point to the root of the Java
SDK.
• A web application server if your application is a web application. Note: Solr comes with a web
application server to run the demo web application.
How To Build

BT_BUILD. BT_BUILD is the platform identifier embedded in your RLP SDK package file, such as ia32-w32-msvc71. For a list and description of the platform identifiers, see Supported Platforms and BT_BUILD Values [16].
cd BT_ROOT/rlp/samples/lucene
ant -Dbt.arch=BT_BUILD
Ant 1.6.5 or higher is required to run these demos. The -Dbt.arch=BT_BUILD parameter must be included
in each of the ant commands below.
On Unix, you must set LD_LIBRARY_PATH (or the equivalent environment variable for your Unix operating system) to include the RLP library directory, BT_ROOT/rlp/lib/BT_BUILD, before you use the ant commands.
How To Use

The indexing ant target runs the Lucene indexing demo program, which was modified to use the RLPEnAnalyzer default constructor indirectly via the RLPAnalyzerDispatcher helper class, to index and search the English plain text files found in the BT_ROOT/rlp/samples/lucene/sampledocs/en subdirectory. The Lucene index is created in the index directory.
The search ant target runs the search demo program. When asked to enter a search string, type "basis"; it should report 5 matches. Try other strings. When done, press the Enter key.
ant cleandemoindex
removes the index file directory. This is necessary before running the indexing program again.
To run the demo for Japanese, replace -Ddemo.lang=en with -Ddemo.lang=ja. The demo will
index and search the plain text files in UTF-8 encoding found in the BT_ROOT/rlp/samples/lucene/
sampledocs/ja subdirectory. For a search string, you can use "新聞" (newspaper). Try other strings. When
done, press the Enter key.
Ultimately, you should write your own implementation of Analyzer for the target language. The
Analyzer should use RLPTokenizer. You can use RLPEnAnalyzer and/or RLPJaAnalyzer as
your starting point.
2. For web applications, you must copy btrlp.jar and btutil.jar from BT_ROOT/rlp/lib/BT_BUILD to
the application server's common library directory (not the web application-specific library directory). If
you are using Tomcat 5.x, the common library directory is CATALINA_HOME/shared/lib, where
CATALINA_HOME is the Tomcat installation directory.
Also, copy the integration sample code JAR file, btrlplucene.jar, from BT_ROOT/rlp/lib/
BT_BUILD into either the application-specific library directory (WEB-INF/lib) or the server's
common library directory.
If you are using Tomcat 5.x on Windows, you can set the bt.root system property by right-clicking the Tomcat icon, selecting Configure, choosing the Java tab, and adding -Dbt.root=BT_ROOT on a new line in the text box named Java Options.
If you run your application from the java command line, add the command line option -
Dbt.root=BT_ROOT. The classpath must be set to include the JAR files mentioned above.
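For example (a sketch; MySearchApp is a hypothetical application class, and the classpath separator is ':' on Unix, ';' on Windows):

java -Dbt.root=/usr/local/basistech \
    -cp btrlp.jar:btutil.jar:btrlplucene.jar:lucene-core-2.1.0.jar \
    MySearchApp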
Using with Solr

We assume you have downloaded and installed Solr 1.1 or 1.2. This document refers to the Solr installation directory as SOLR_HOME. In your command-line shell, we assume that the current working directory is SOLR_HOME.
cd SOLR_HOME
• You are adding a new field type called text_ja (for Japanese) or text_en (for English).
• You are adding a new field of this type called title_ja (for Japanese) or title_en (for English).
• You are using the RLP-powered Tokenizer for both index and query.
• You are using POSFilter, with the default set of the POS tags, only for indexing, and not for searching,
because the part-of-speech tags for short strings tend to be inaccurate.
On Windows (Solr 1.2; note that you may have to create the target directory):
mkdir example\lib\ext
copy BT_ROOT\rlp\lib\BT_BUILD\btrlp.jar example\lib\ext
copy BT_ROOT\rlp\lib\BT_BUILD\btutil.jar example\lib\ext
On Unix (Solr 1.1):

cp BT_ROOT/rlp/lib/BT_BUILD/btrlp.jar example/ext
cp BT_ROOT/rlp/lib/BT_BUILD/btutil.jar example/ext
On Unix (Solr 1.2; note that you may have to create the target directory):
mkdir example/lib/ext
cp BT_ROOT/rlp/lib/BT_BUILD/btrlp.jar example/lib/ext
cp BT_ROOT/rlp/lib/BT_BUILD/btutil.jar example/lib/ext
On Windows:
mkdir example\solr\lib
copy BT_ROOT\rlp\lib\BT_BUILD\btrlplucene.jar example\solr\lib
On Unix:
mkdir example/solr/lib
cp BT_ROOT/rlp/lib/BT_BUILD/btrlplucene.jar example/solr/lib
On Windows:

copy BT_ROOT\rlp\samples\lucene\conf\* example\solr\conf
On Unix:
cp BT_ROOT/rlp/samples/lucene/conf/* example/solr/conf
4. (Unix only). Set LD_LIBRARY_PATH (or the equivalent environment variable for your Unix platform)
to include the RLP shared libraries directory: BT_ROOT/rlp/lib/BT_BUILD . For example:
LD_LIBRARY_PATH=BT_ROOT/rlp/lib/BT_BUILD:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
For Japanese:
<analyzer type="query">
<tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
rlpContext="SOLR_HOME/example/solr/conf/rlp-context-jla.xml"
lang="ja" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
For English:
<analyzer type="query">
<tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
rlpContext="SOLR_HOME/example/solr/conf/rlp-context-bl1.xml"
lang="en" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
For Japanese:
For English:
Note: You can combine the handling of Japanese text and English text in a single copy of
schema.xml.
6. From inside the example subdirectory, run start.jar with the bt.root system property set.
On Windows:
cd example
java -Dbt.root="BT_ROOT" -jar start.jar
On Unix:
cd example
java -Dbt.root=BT_ROOT -jar start.jar
b. Enter title_ja (for Japanese) or title_en (for English) in the Field name field.
c. Enter text (Japanese or English, depending on which you have set up) in either or both of the two
Field value fields.
You will see how Tokenizer and each Filter process the input text.
8. Refer to the Solr documentation and web site for instructions on feeding actual documents to the indexer.
12.5.3. Configuring the RLP-Solr Integration Code

12.5.3.1. lemmaMode
With the default setting for the com.basistech.rlp.solr.RLPTokenizerFactory, each word
in the source text is converted to its dictionary form ("lemma") before indexing or searching takes place.
For example, "The quicker foxes jumped" is transformed to "the quick fox jump", and indexed with these
lemmas. You can change this behavior by specifying the "lemma" attribute in the "tokenizer" tag, as in:
<tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
rlpContext="SOLR_HOME/example/solr/conf/rlp-context-bl1.xml"
lang="en" lemmaMode="none" />
lemmaMode="none" indicates that words will not be transformed to lemmas; the words are preserved in
the original form. If you use this setting, a search for "quick" will not find a document that contains "quicker"
but does not contain "quick".
Another valid value for the lemmaMode attribute is "synonym" 2 . With this setting, the tokenizer generates
two tokens for each word in the source text, provided the word is not in its dictionary form to begin with.
The first token is the original word; the second token is its lemma.
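For example, to index both the original word and its lemma (the attribute values are as documented above; the rest of the element mirrors the earlier examples):

<tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
    rlpContext="SOLR_HOME/example/solr/conf/rlp-context-bl1.xml"
    lang="en" lemmaMode="synonym" />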
12.5.3.2. compMode
The com.basistech.rlp.solr.RLPTokenizerFactory tokenizer compMode attribute
changes the way the tokenizer handles the compound noun for Japanese and German text. For example:
<tokenizer class="com.basistech.rlp.solr.RLPTokenizerFactory"
    rlpContext="SOLR_HOME/example/solr/conf/rlp-context-jla.xml"
    lang="ja" lemmaMode="none" compMode="decompose"/>
The default setting is to generate a token for each component of a compound word. For example, "国会議事堂前" (the National Assembly conference building front, a subway station name) is broken up into three tokens: "国会", "議事堂", and "前".
If you do not want to decompose compound words, use compMode="none". During searches, documents
with the compounds will be found, but not documents containing just the components that are in the
compound.
If you want to include compounds and their decomposed elements, use compMode="synonym"2. During searches, documents containing either the compounds or the elements that make them up will be found.
12.5.3.3. pos
POSFilterFactory can take a file name that lists the part-of-speech tags of the words to be indexed and searched. If you have followed the step-by-step instructions earlier, pos-en.txt and pos-ja.txt in SOLR_HOME/example/solr/conf are your part-of-speech tag lists. You can look at these files and comment out the part-of-speech tags that you want Solr to ignore, or remove "#" to turn on part-of-speech tags that are currently commented out.
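A hypothetical excerpt from such a list (the tag names are taken from the English examples in this guide; '#' disables a tag):

NOUN
PROP
#VI
#VDPRES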
Writing Your Own Analyzer

As a developer of a search application, you should write your own analyzer(s), because each search application has a unique set of requirements. Use the sample code as a basis.
Typically, you would need different analyzers for indexing and search. Sec. 4.1.2. of the book Lucene In
Action, by Otis Gospodnetic and Erik Hatcher (Manning Publications, ISBN 1-932394-28-1) reads:
Should you use the same analyzer with QueryParser that you used during indexing? The
short, most accurate, answer is, "it depends."
For example, RLPJaAnalyzer and RLPEnAnalyzer are not well suited for query analysis, because
accurate part-of-speech analysis is not possible with short strings. Some words may be given the wrong
POS (part-of-speech) tags, and useful words may be filtered out. Because you can reasonably expect search
2The "synonym" value refers to the Solr mechanism for handling multiple tokens associated with a single source token. The term does not imply that the alternative tokens are actually synonyms.
users to only type useful words in their dictionary form, you may want to skip the complex analysis using
RLP, and just use the WhitespaceTokenizer and LowercaseFilter for English.
You can easily write Analyzers for languages other than English and Japanese. For Chinese and Korean, use RLPJaAnalyzer as the basis; for other languages, use RLPEnAnalyzer. For German, however, you might want to mix these approaches, because German frequently uses compound words, and RLP is capable of handling them. You might also want to write a filter that maps "ü" to "ue", which are considered equivalent in German.
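As an illustration, here is a minimal sketch of a custom Analyzer in the style of the samples. It assumes Lucene 2.x and an RLPTokenizer constructor taking a context file path, a language code, and a Reader; the real signatures are in the sample source and the Java API documentation, so treat this only as a starting point.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class MyAnalyzer extends Analyzer {
    private final String rlpContext; // path to an RLP context file, e.g. rlp-context-bl1.xml

    public MyAnalyzer(String rlpContext) {
        this.rlpContext = rlpContext;
    }

    // Lucene 2.x entry point: build the token stream for one field.
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // RLPTokenizer ships with the sample code; this constructor signature is assumed.
        TokenStream stream = new RLPTokenizer(rlpContext, "en", reader);
        // A POSFilter could be inserted here to drop unwanted parts of speech.
        return new LowerCaseFilter(stream); // normalize case, as the sample analyzers do
    }
}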
Lines of code to use the popular logging facility log4j are in the sample code but they are commented out.
You can uncomment them, and make other changes to suit your situation.
Modifying RLPTokenizer

Consult the relevant material in RLP Processors [113] and the RLP Java API HTML documentation before you modify the source code to use the optional LPs.
com
The integration code, consisting of two packages: com.basistech.rlp.lucene and com.basistech.rlp.solr.
lia
Contains a utility class that accompanies the book Lucene In Action, Otis Gospodnetic & Erik Hatcher, Manning, ISBN 1-932394-28-1. It is licensed under the Apache License, version 2.0 <http://www.apache.org/licenses/LICENSE-2.0.html>.
org
A partial copy of the demo program source code from the Lucene 2.0 distribution. These files are slightly modified to use the RLP-powered Analyzer and to read files in UTF-8: org/apache/lucene/demo/FileDocument.java, org/apache/lucene/demo/IndexFiles.java, and org/apache/lucene/demo/SearchFiles.java. These files are licensed under the Apache License, version 2.0 <http://www.apache.org/licenses/LICENSE-2.0.html>.
• lucene-core-2.1.0.jar
• apache-solr-1.1.0-incubating.jar
Along with the JAR files in BT_ROOT/rlp/lib/BT_BUILD (btrlp.jar and btutil.jar), these JARs are
required to build the Lucene sample.
Chapter 13. The Windows GUI Demo
13.1. Launching the GUI Demo
Select All Programs → Basis Technology → Rosette 6.0.0 Demo from the Windows Start menu to launch the demo.
To see the list of languages these operations support, see RLP Key Features [1] .
If you manually entered text into the input field, load the text; RLP Demo will display the text's language, MIME type, encoding, and length (number of characters).
You can manually select a language from the Demo → Language submenu. You must do this if your RLP
license does not contain RLI. 2 If you select a new language, any text in the Text Window is reprocessed
immediately.
The manually selected language setting remains in place until you choose a different language, or switch
the demo back to Auto-detect Language (Demo → Language → Auto-detect Language).
1RLP accepts file formats including plain text, HTML, XML, PDF, Word, Excel, PowerPoint, and Access.
2If your license does not contain RLI, any plain text files you open for processing must be in a Unicode encoding (UTF-8, UTF-16, or UTF-32).
Apply a Demo Process
Not all functions may be applicable or available for all languages. See the full list of functions available
for each language [1] .
Base Linguistics. Tokenizes the text and tags parts of speech. For applicable languages [1], it may also derive stems, analyze compounds, and generate phonetic transcriptions.
Base Noun Phrase Extraction. Identifies base noun phrases in the input text.
Named Entity Extraction. Uses statistical analysis, gazetteers, and regular expressions to identify named entities.
• Built-in entity types: locations, organizations, persons, geo-political entities, facilities, religions, nationalities, and identifiers (credit card numbers, email addresses, latitudes and longitudes, money, percentages, personal ID numbers, phone numbers, URLs, and Universal Transverse Mercator coordinates).
• Adding entities: Once you have created a gazetteer [215] for a specific entity type, you can select
words in the Text Window and right click to add them to that named entity type. In general, use the
Named Entities Editor [214] to modify named entity definitions, create new named entity types, and
extend the set of languages that are analyzed for named entities.
• RLP Demo automatically reruns named entity extraction after you make an addition while using the
Named Entity Extraction process.
Universal Base Linguistics. Use this process to tokenize text in any language.
Chinese Script Conversion. Converts text in Simplified Chinese script to Traditional Chinese script
and vice versa.
View the analysis results
You can also save the input text [213] to a file to process again later.
The display is divided into four panes. You can use the mouse to drag the horizontal and vertical divider
bars between the panes. See Customizing the Display [213] .
Text Window Pane. Displays the input text. You can switch between Edit mode and Read-Only mode. In Edit mode (the background is white), you can edit the text. In Read-Only mode (the background is grey), RLP has loaded the text, converting it to Unicode UTF-16. After you select a Demo process (from the Demo menu), words or phrases may be highlighted in color. See Using the Text Window [213].
Text Profile Pane. Reports the language, encoding, MIME type, and length (in characters) of the input
text. The Text Profile should display this information before you select a process to run.
Legend Pane. Shows the Demo process applied and, for some processes, displays the key to the color-
coded parts of speech or named entity types in the input text. Click a part-of-speech (POS) tag or a named
entity type in the Legend, to highlight all corresponding elements in the Text Window.
List View Pane. Displays the processed results. The first column numbers each token. The other columns
shown depend on the process you have applied and the input language. See Using the List View [212]
for details.
Using the List View

The first column numbers each token. The other columns vary depending upon the process applied and the input language.
Navigation and Selection. Click on the List View and you can use the arrow keys to navigate up and
down between rows. When you select a row, the Text Window highlights the corresponding tokens in a
contrasting color.
Copying a Token. To copy the token associated with the selected row to the clipboard, select Edit →
Copy (Ctrl-C).
Saving Analysis Results. To save the entire contents of the List View as a UTF-8 encoded XML or CSV (comma-separated values) file, select File → Save Item List As. You are prompted to name the file and select a file format from a dropdown menu. Select .csv to view the file in Microsoft Excel.
In the XML file,3 field names are the List View column headers in lower case with spaces removed. In the CSV file, the column names are upper case. The # column is named <index> or INDEX.
To reorder columns, drag and drop column headers, or bring up the Select Columns dialog box by right
clicking any column header or using View → Select Columns. The dialog also lets you hide any column
except the # column.
To sort rows by a column's values, click the column header. Click again to reverse the sort order.
The column settings are specific to the current combination of input language and Demo process. See
Customizing and Saving Display Settings [213] to save these settings.
3Note that the XML format is not REXML, RLP's native XML format. To generate REXML, you must include the REXML processor in the context.
Using the Text Window
Token highlighting. Navigating to or clicking a highlighted token colors the corresponding item in the List View [212]. Toggle the highlighting on or off by selecting View → Mark Entities.
Font. To change the Text Window font, select View → Set View Font → Full Text View. This font
change only applies to text in the language of the current document (shown in Text Profile) for the current
session. See Customizing and Saving Display Settings [213] to save the font selected.
Copying Input Text. Highlight the text to copy using the mouse or Shift and the arrow keys. Then
select Edit → Copy (Ctrl-C) and paste the text into other applications. You cannot drag and drop text
from the Text Window.
Saving Input Text. Save the contents of the Text Window to a file with File → Save Text As (Ctrl-S). You can save the text in any encoding supported by the Rosette Core Library for Unicode (RCLU).
You can also edit .htm files that control some Legend display features [214] .
To make these settings the default, save the file as RLPDemoAppConfig.xml, the default filename, in the
Basis Technology subfolder of your Local Settings/Application Data profile. Otherwise, save the file to
a different name and manually choose the file when you launch the Demo by selecting File → Load
Configuration.
Alternatively, load the settings when you launch the demo with the /config switch from the command line:

btrlpdemo.exe /config my-customization.xml

loads the previously saved customization file named my-customization.xml from the Basis Technology subfolder of your Local Settings/Application Data profile.
Use the /config switch with no argument to load the default configuration file,
RLPDemoAppConfig.xml.
btrlpdemo.exe /config
Customizing the Legend
For the Part of Speech (POS) legends, the text is supplied in demo-pos-legend.xml and the coloring is
automatic.
For a more detailed explanation about customizing named entities in RLP, consult Customizing Named
Entities [177] .
If you are viewing the results of the Named Entities Extraction process, RLP Demo automatically reruns
the process after you make edits and close the editor.
Note: To view or edit any XML file with the Named Entities Editor, the corresponding DTD file must be in the folder with the XML file. For known file locations, RLP Demo copies the required DTD file from rlp/config/DTDs. Make sure that the XML files and the folders containing the XML files you want to edit are not write protected.
Adding Named Entities with Gazetteers
In the Editor list display, you can edit the names for user-defined types and subtypes, and the weights
associated with the three sources for each entity type.
Double click [Double Click to Add...] to add an entry to the gazetteer or double click an entry to edit it.
Gazetteer Options
In the example above, the user has created a gazetteer for the RANK type and ARMY subtype, and is
adding another entry.
Click Load to open a gazetteer you have already created for this type.
Click New to flush the contents of the list and start a new gazetteer.
Adding Named Entities with Regular Expressions
RLP supports Tcl regular expressions. See Tcl Regular Expression Syntax [259].
Optionally, enter a note to track the details of a given regular expression. Notes are saved in regex-config.xml.
To indicate that a regular expression is language-specific, enter the ISO 639 language tag:
Language Tag
Arabic ar
Chinese - Simplified zh_sc
Chinese - Traditional zh_tc
Czech cs
Dutch nl
English en
Upper-Case Englisha en_uc
French fr
German de
Greek el
Hungarian hu
Italian it
Japanese ja
Korean ko
Persian fa
Polish pl
Portuguese pt
Russian ru
Spanish es
Urdu ur
aFor more accurate processing of English text that is entirely upper case, use the en_uc language tag.
Appendix A. Named Entities
RLP defines a number of named entity types and includes several language processors that identify and
return named entities [80] when RLP processes text.
As shipped, RLP identifies named entities of various types for a variety of languages, as shown in the table
below. You can extend the scope of the named entities that RLP can identify (see Locating More Named Entities [222]).
As noted in the table, some of the regular expressions for locating named entities are language-specific;
others are generic.
Definitions
• FACILITY: A man-made structure or architectural entity such as a building, stadium, monument, airport,
bridge, factory, or museum.
• GPE: A geo-political entity composed of three elements: a population, a geographic location, and a government. A country, state, city, or other location that contains both a population and a centralized government.
• LOCATION: Name of a geographically defined place such as a continent, body of water, mountain,
park, or full address. It also refers to a region that either spans GPE boundaries (such as Middle East,
Northeast, West Coast) or is contained within a larger GPE (such as Sunni Triangle, Chinatown).
Important: For those languages that do not include GPE, LOCATION is expanded to include GPE.
Locating More Named Entities

• Create gazetteers and use the Gazetteer processor to find named entities of any type for any of the languages RLP supports.
Appendix B. Part-of-Speech Tags
B.1. Arabic POS Tags
Tag                Description                                                              Example
ABBREV             abbreviation                                                             افب
ADJ                adjective                                                                ُ أَحْمَر،أَبْجَدِي
ADV                adverb                                                                   لاَ سِيَّمَا،ْأَمْس
CASE_DEF_ACC       definite accusative case suffix                                          َاَلْكِتَاب
CASE_DEF_GEN       definite genitive case suffix                                            ِاَلْكِتَاب
CASE_DEF_NOM       definite nominative case suffix                                          ُاَلْكِتَاب
CASE_INDEF_ACC     indefinite accusative case suffix                                        ًكِتَابا
CASE_INDEF_GEN     indefinite genitive case suffix                                          ٍكِتَاب
CASE_INDEF_NOM     indefinite nominative case suffix                                        ٌكِتَاب
CONJ               conjunction                                                              َو
CVSUFF_DO_3D       imperative verb suffix, direct object 3rd person dual                    ھُمَا سَاعِدْھُمَا
CVSUFF_DO_3FP      imperative verb suffix, direct object 3rd person feminine plural         َّھُنَّ سَاعِدُھُن
CVSUFF_DO_3FS      imperative verb suffix, direct object 3rd person feminine singular       ھَا سَاعِدْھَا
CVSUFF_DO_3MP      imperative verb suffix, direct object, 3rd person masculine plural       ْھُمْ سَاعِدْھُم
CVSUFF_DO_3MS      imperative verb suffix, direct object, 3rd person masculine singular     ُهُ سَاعِدْه
CVSUFF_SUBJ_2FS    imperative verb suffix, subject 2nd person feminine singular             سَاعِدِي
CVSUFF_SUBJ_2MP    imperative verb suffix, subject 2nd person masculine plural              وا إِلْعَبِوا
CVSUFF_SUBJ_2MS    imperative verb suffix, subject 2nd person masculine singular            ْإِلْعَب
CV                 verb (imperative)                                                        ْ إِسْتَنْتِج،ْإِفْحَص
DEM_PRON           demonstrative pronoun                                                    ھَذَا
DET                determiner                                                               ال
EMPH_PART          emphatic particle                                                        لَ لَحْرِقَنَّكُم
EOS                end of sentence                                                          .
[POS tag tables follow for the remaining Arabic tags and for Chinese (Simplified and Traditional), Czech, Dutch, English, French, German, Greek, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Russian, and Spanish.]
Appendix C. Morphological and Special Tags
When creating a user dictionary for German, Dutch, Hungarian, English, French, Italian, Portuguese, or Spanish, you can include morphological tags and/or special tags in individual dictionary entries.
Morphological Tags. The Named Entity Extractor [141] uses the Morphological tags listed in the
following sections to help it identify named entities.
Special Tags. For German, Dutch, and Hungarian, the Base Linguistics Language Analyzer (BL1) uses
special tags to divide compounds into their components.
For German, Dutch, Hungarian, English, French, Italian, and Portuguese, BL1 uses special tags to indicate
the boundaries in multi-word baseforms, contractions and elisions, and words with clitics. BL1 displays
these boundaries as spaces in the STEM [81] results.
[Tables of morphological tags and special tags follow, including English morphological tags and Spanish, Italian, and Portuguese special tags.]
Appendix D. Tcl Regular Expression Syntax
The Regular Expression [150] processor uses the Tcl regular expression engine to identify named entities
in input text. To see the named entity types that the Regular Expression processor with the shipped regular
expressions file returns, see Named Entities [221] . For background information about adding your own
regular expressions, see Creating Regular Expressions [181] .
This appendix contains information extracted from the Tcl re_syntax Manual Page [http://www.tcl.tk/man/
tcl/TclCmd/re_syntax.htm]. It also contains the Tcl Software License [268] .
Name [259]
Description [259]
Different Flavors of REs [259]
Regular Expression Syntax [259]
Metasyntax [265]
Matching [266]
Limits and Compatibility [267]
Basic Regular Expressions [267]
D.1. Name
re_syntax - Syntax of Tcl regular expressions.
D.2. Description
A regular expression describes strings of characters. It's a pattern that matches certain strings and doesn't
match others.
D.3. Different Flavors of REs

This manual page primarily describes AREs. BREs mostly exist for backward compatibility in some old
programs; they will be discussed at the end. POSIX EREs are almost an exact subset of AREs. Features
of AREs that are not present in EREs will be indicated.
D.4. Regular Expression Syntax

An ARE is one or more branches, separated by '|', matching anything that matches any of the branches.
A branch is zero or more constraints or quantified atoms, concatenated. It matches a match for the first,
followed by a match for the second, etc; an empty branch matches the empty string.
A quantified atom is an atom possibly followed by a single quantifier. Without a quantifier, it matches a
match for the atom. The quantifiers, and what a so-quantified atom matches, are:
*
a sequence of 0 or more matches of the atom
+
a sequence of 1 or more matches of the atom
?
a sequence of 0 or 1 matches of the atom
{m}
a sequence of exactly m matches of the atom
{m,}
a sequence of m or more matches of the atom
{m,n}
a sequence of m through n (inclusive) matches of the atom; m may not exceed n
*? +? ?? {m}? {m,}? {m,n}?
non-greedy quantifiers, which match the same possibilities, but prefer the smallest number rather than the largest number of matches (see Matching)
The forms using { and } are known as bounds. The numbers m and n are unsigned decimal integers with
permissible values from 0 to 255 inclusive.
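For example, a{2,4} matches aa, aaa, or aaaa, preferring the longest; the non-greedy a{2,4}? matches the same strings but prefers the shortest.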
(re)
(where re is any regular expression) matches a match for re, with the match noted for possible reporting
(?:re)
as previous, but does no reporting (a "non-capturing" set of parentheses)
()
matches an empty string, noted for possible reporting
(?:)
matches an empty string, without reporting
[chars]
a bracket expression, matching any one of the chars (see Bracket Expressions below for more detail)
.
matches any single character
\k
(where k is a non-alphanumeric character) matches that character taken as an ordinary character, e.g.
\\ matches a backslash character
\c
where c is alphanumeric (possibly followed by other characters), an escape (AREs only), see Escapes
below
{
when followed by a character other than a digit, matches the left-brace character '{'; when followed
by a digit, it is the beginning of a bound (see above)
x
where x is a single character with no other significance, matches that character.
A constraint matches an empty string when specific conditions are met. A constraint may not be followed
by a quantifier. The simple constraints are as follows; some more constraints are described later, under
Escapes.
^
matches at the beginning of a line
$
matches at the end of a line
(?=re)
positive lookahead (AREs only), matches at any point where a substring matching re begins
(?!re)
negative lookahead (AREs only), matches at any point where no substring matching re begins
The lookahead constraints may not contain back references (see later), and all parentheses within them are
considered non-capturing.
D.5. Bracket Expressions

If two characters in the list are separated by '-', this is shorthand for the full range of characters between
those two (inclusive) in the collating sequence, e.g. [0-9] in ASCII matches any decimal digit. Two ranges
may not share an endpoint, so e.g. a-c-e is illegal. Ranges are very collating-sequence-dependent, and
portable programs should avoid relying on them.
To include a literal ] or - in the list, the simplest method is to enclose it in [. and .] to make it a collating element (see below). Alternatively, make it the first character (following a possible '^'), or (AREs only) precede it with '\'. Alternatively, for '-', make it the last character, or the second endpoint of a range. To use a literal - as the first endpoint of a range, make it a collating element or (AREs only) precede it with '\'. With the exception of these and some combinations using [ (see the next paragraphs), all other special characters lose their special significance within a bracket expression.
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as
if it were a single character, or a collating-sequence name for either) enclosed in [. and .] stands for the
sequence of characters of that collating element. The sequence is a single element of the bracket expression's
list. A bracket expression in a locale that has multi-character collating elements can thus match more than
one character. So (insidiously), a bracket expression that starts with ^ can match multi-character collating
elements even if none of them appear in the bracket expression! (Note: Tcl currently has no multi-character
collating elements. This information is only for illustration.)
For example, assume the collating sequence includes a ch multi-character collating element. Then the RE
[[.ch.]]*c (zero or more ch's followed by c) matches the first five characters of 'chchcc'. Also, the RE
[^c]b matches all of 'chb' (because [^c] matches the multi-character ch).
Within a bracket expression, a collating element enclosed in [= and =] is an equivalence class, standing
for the sequences of characters of all collating elements equivalent to that one, including itself. (If there
are no other equivalent collating elements, the treatment is as if the enclosing delimiters were '[.'and '.]'.)
For example, if o and ô are the members of an equivalence class, then '[[=o=]]', '[[=ô=]]', and '[oô]' are all
synonymous. An equivalence class may not be an endpoint of a range. (Note: Tcl currently implements
only the Unicode locale. It doesn't define any equivalence classes. The examples above are just
illustrations.)
Within a bracket expression, the name of a character class enclosed in [: and :] stands for the list of all
characters (not all collating elements!) belonging to that class. Standard character classes are:
alpha A letter.
upper An upper-case letter.
lower A lower-case letter.
digit A decimal digit.
xdigit A hexadecimal digit.
alnum An alphanumeric (letter or digit).
print An alphanumeric (same as alnum).
blank A space or tab character.
space A character producing white space in displayed text.
punct A punctuation character.
graph A character with a visible representation.
cntrl A control character.
A locale may provide others; see Character Classes [182]. (Note that the current Tcl implementation has only one locale: the Unicode locale.) A character class may not be used as an endpoint of a range.
There are two special cases of bracket expressions: the bracket expressions [[:<:]] and [[:>:]] are
constraints, matching empty strings at the beginning and end of a word respectively. A word is defined as
a sequence of word characters that is neither preceded nor followed by word characters. A word character
is an alnum character or an underscore (_). These special bracket expressions are deprecated; users of AREs
should use constraint escapes instead (see below).
D.6. Escapes
Escapes (AREs only), which begin with a \ followed by an alphanumeric character, come in several
varieties: character entry, class shorthands, constraint escapes, and back references. A \ followed by an
alphanumeric character but not constituting a valid escape is illegal in AREs. In EREs, there are no escapes:
outside a bracket expression, a \ followed by an alphanumeric character merely stands for that character
as an ordinary character, and inside a bracket expression, \ is an ordinary character. The latter is the one
actual incompatibility between EREs and AREs.
Character-entry escapes (AREs only) exist to make it easier to specify non-printing and otherwise
inconvenient characters in REs:
\a
alert (bell) character, as in C
\b
backspace, as in C
\B
synonym for \ to help reduce backslash doubling in some applications where there are multiple levels
of backslash processing
\cX
(where X is any character) the character whose low-order 5 bits are the same as those of X, and whose other bits are all zero
\e
the character whose collating-sequence name is 'ESC', or failing that, the character with octal value
033
\f
formfeed, as in C
\n
newline, as in C
\r
carriage return, as in C
\t
horizontal tab, as in C
\uwxyz
(where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering
\Ustuvwxyz
(where stuvwxyz is exactly eight hexadecimal digits) reserved for a somewhat-hypothetical Unicode extension to 32 bits
\v
vertical tab, as in C
\xhhh
(where hhh is any sequence of hexadecimal digits) the character whose hexadecimal value is 0xhhh (a single character no matter how many hexadecimal digits are used)
\0
the character whose value is 0
\xy
(where xy is exactly two octal digits, and is not a back reference (see below)) the character whose octal value is 0xy
\xyz
(where xyz is exactly three octal digits, and is not a back reference (see below)) the character whose octal value is 0xyz
Hexadecimal digits are '0'-'9', 'a'-'f', and 'A'-'F'. Octal digits are '0'-'7'.
The character-entry escapes are always taken as ordinary characters. For example, \135 is ] in ASCII, but
\135 does not terminate a bracket expression. Beware, however, that some applications (e.g., C compilers)
interpret such sequences themselves before the regular-expression package gets to see them, which may
require doubling (quadrupling, etc.) the '\'.
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used character classes:
\d
[[:digit:]]
\s
[[:space:]]
\w
[[:alnum:]_] (note underscore)
\D
[^[:digit:]]
\S
[^[:space:]]
\W
[^[:alnum:]_] (note underscore)
Within bracket expressions, '\d', '\s', and '\w' lose their outer brackets, and '\D', '\S', and '\W' are illegal. (So,
for example, [a-c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equivalent to [a-c^[:digit:]], is
illegal.)
A constraint escape (AREs only) is a constraint, matching the empty string if specific conditions are met,
written as an escape:
\A
matches only at the beginning of the string (see Matching, below, for how this differs from '^')
\m
matches only at the beginning of a word
\M
matches only at the end of a word
\y
matches only at the beginning or end of a word
\Y
matches only at a point that is not the beginning or end of a word
\Z
matches only at the end of the string (see Matching, below, for how this differs from '$')
\m
(where m is a nonzero digit) a back reference, see below
\mnn
(where m is a nonzero digit, and nn is some more digits, and the decimal value mnn is not greater than the number of closing capturing parentheses seen so far) a back reference, see below
A word is defined as in the specification of [[:<:]] and [[:>:]] above. Constraint escapes are illegal within
bracket expressions.
A back reference (AREs only) matches the same string matched by the parenthesized subexpression
specified by the number, so that (e.g.) ([bc])\1 matches bb or cc but not 'bc'. The subexpression must
entirely precede the back reference in the RE. Subexpressions are numbered in the order of their leading
parentheses. Non-capturing parentheses do not define subexpressions.
There is an inherent historical ambiguity between octal character-entry escapes and back references, which
is resolved by heuristics, as hinted at above. A leading zero always indicates an octal escape. A single non-
264
Metasyntax
zero digit, not followed by another digit, is always taken as a back reference. A multi-digit sequence not
starting with a zero is taken as a back reference if it comes after a suitable subexpression (i.e. the number
is in the legal range for a back reference), and otherwise is taken as octal.
D.7. Metasyntax
In addition to the main syntax described above, there are some special forms and miscellaneous syntactic
facilities available.
Normally the flavor of RE being used is specified by application-dependent means. However, this can be
overridden by a director. If an RE of any flavor begins with '***:', the rest of the RE is an ARE. If an RE
of any flavor begins with '***=', the rest of the RE is taken to be a literal string, with all characters considered
ordinary characters.
An ARE may begin with embedded options: a sequence (? xyz ) (where xyz is one or more alphabetic
characters) specifies options affecting the rest of the RE. These supplement, and can override, any options
specified by the application. The available option letters are:
b
rest of RE is a BRE
c
case-sensitive matching (usual default)
e
rest of RE is an ERE
i
case-insensitive matching (see Matching, below)
m
historical synonym for n
n
newline-sensitive matching (see Matching, below)
p
partial newline-sensitive matching (see Matching, below)
q
rest of RE is a literal ("quoted'') string, all ordinary characters
s
non-newline-sensitive matching (usual default)
t
tight syntax (usual default; see below)
w
inverse partial newline-sensitive ("weird'') matching (see Matching, below)
x
expanded syntax (see below)
Embedded options take effect at the ) terminating the sequence. They are available only at the start of an
ARE, and may not be used later within it.
265
Matching
In addition to the usual (tight) RE syntax, in which all characters are significant, there is an expanded
syntax, available in all flavors of RE with the -expanded switch, or in AREs with the embedded x option.
In the expanded syntax, white-space characters are ignored and all characters between a # and the following
newline (or the end of the RE) are ignored, permitting paragraphing and commenting a complex RE. There
are three exceptions to that basic rule:
Expanded-syntax white-space characters are blank, tab, newline, and any character that belongs to the
space character class.
Finally, in an ARE, outside bracket expressions, the sequence '(?# ttt )' (where ttt is any text not containing
a ')') is a comment, completely ignored. Again, this is not allowed between the characters of multi-character
symbols like '(?:'. Such comments are more a historical artifact than a useful facility, and their use is
deprecated; use the expanded syntax instead.
None of these metasyntax extensions is available if the application (or an initial ***= director) has specified
that the user's input be treated as a literal string rather than as an RE.
D.8. Matching
In the event that an RE could match more than one substring of a given string, the RE matches the one
starting earliest in the string. If the RE could match more than one substring starting at that point, its choice
is determined by its preference: either the longest substring, or the shortest.
Most atoms, and all constraints, have no preference. A parenthesized RE has the same preference (possibly
none) as the RE. A quantified atom with quantifier { m } or { m }? has the same preference (possibly none)
as the atom itself. A quantified atom with other normal quantifiers (including { m , n } with m equal to n)
prefers longest match. A quantified atom with other non-greedy quantifiers (including { m , n }? with m
equal to n) prefers shortest match. A branch has the same preference as the first quantified atom in it which
has a preference. An RE consisting of two or more branches connected by the | operator prefers longest
match.
Subject to the constraints imposed by the rules for matching the whole RE, subexpressions also match the
longest or shortest possible substrings, based on their preferences, with subexpressions starting earlier in
the RE taking priority over ones starting later. Note that outer subexpressions thus take priority over their
component subexpressions.
Note that the quantifiers {1,1} and {1,1}? can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
Match lengths are measured in characters, not collating elements. An empty string is considered longer
than no match at all. For example, bb* matches the three middle characters of 'abbbc', (week|wee)(night|
knights) matches all ten characters of 'weeknights', when (.*).* is matched against abc the parenthesized
subexpression matches all three characters, and when (a*)* is matched against bc both the whole RE and
the parenthesized subexpression match an empty string.
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from
the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a
bracket expression, it is effectively transformed into a bracket expression containing both cases, so that
x becomes '[xX]'. When it appears inside a bracket expression, all case counterparts of it are added to the
bracket expression, so that [x] becomes [xX] and [^x] becomes '[^xX]'.
266
Limits and Compatibility
If newline-sensitive matching is specified, . and bracket expressions using ^ will never match the newline
character (so that matches will never cross newlines unless the RE explicitly arranges it) and ^ and $ will
match the empty string after and before a newline respectively, in addition to matching at beginning and
end of string respectively. ARE \A and \Z continue to match beginning or end of string only.
If partial newline-sensitive matching is specified, this affects . and bracket expressions as with newline-
sensitive matching, but not ^ and '$'.
If inverse partial newline-sensitive matching is specified, this affects ^ and $ as with newline-sensitive
matching, but not . and bracket expressions. This isn't very useful but is provided for symmetry.
The only feature of AREs that is actually incompatible with POSIX EREs is that \ does not lose its special
significance inside bracket expressions. All other ARE features use syntax which is illegal or has undefined
or unspecified effects in POSIX EREs; the *** syntax of directors likewise is outside the POSIX syntax
for both BREs and EREs.
Many of the ARE extensions are borrowed from Perl, but some have been changed to clean them up, and
a few Perl extensions are not present. Incompatibilities of note include '\b', '\B', the lack of special treatment
for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-
sensitive matching, the restrictions on parentheses and back references in lookahead constraints, and the
longest/shortest-match (rather than first-match) matching semantics.
Henry Spencer's original 1986 regexp package, still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs. There are four incompatibilities between regexp's near-
EREs ('RREs' for short) and AREs. In roughly increasing order of significance:
In AREs, \ followed by an alphanumeric character is either an escape or an error, while in RREs, it was
just another way of writing the alphanumeric. This should not be a problem because there was no reason
to write such a sequence in RREs.
{ followed by a digit in an ARE is the beginning of a bound, while in RREs, { was always an ordinary
character. Such sequences should be rare, and will often result in an error because following characters
will not look like a valid bound.
In AREs, \ remains a special character within '[ ]', so a literal \ within [ ] must be written '\\'. \\ also gives
a literal \ within [ ] in RREs, but only truly paranoid programmers routinely doubled the backslash.
AREs report the longest/shortest match for the RE, rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that the first match would be reported.
The careful crafting of RREs to optimize the search order for fast matching is obsolete (AREs examine
all possible matches in parallel, and their performance is largely insensitive to their complexity) but
cases where the search order was exploited to deliberately find a match which was not the longest/shortest
will need rewriting.
267
Tcl License
a parenthesized subexpression (after a possible leading '^'). Finally, single-digit back references are
available, and \< and \> are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes are available.
The authors hereby grant permission to use, copy, modify, distribute, and license this software and its
documentation for any purpose, provided that existing copyright notices are retained in all copies and that
this notice is included verbatim in any distributions. No written agreement, license, or royalty fee is required
for any of the authorized uses. Modifications to this software may be copyrighted by their authors and need
not follow the licensing terms described here, provided that the new terms are clearly indicated on the first
page of each file where they apply.
GOVERNMENT USE: If you are acquiring this software on behalf of the U.S. government, the
Government shall have only "Restricted Rights" in the software and related documentation as defined in
the Federal Acquisition Regulations (FARs) in Clause 52.227.19 (c) (2). If you are acquiring the software
on behalf of the Department of Defense, the software shall be classified as "Commercial Computer
Software" and the Government shall have only "Restricted Rights" as defined in Clause 252.227-7013 (c)
(1) of DFARs. Notwithstanding the foregoing, the authors grant the U.S. Government and others acting in
its behalf permission to use and distribute the software in accordance with the terms specified in this license.
268
Appendix E. Error Codes
RLP APIs and logs may return the error codes described below. Positive error codes indicate success
BT_OK or a non-error condition, either BT_NO_MORE_DATA or BT_WANT_MORE_DATA. Negative error
codes fall into the following categories:
269
Error # Hex # Error Name This error code is returned when...
-51 -33 BT_ERR_INVALID_ARGUMENT An argument passed to the function is
invalid.
-52 -34 BT_ERR_INVALID_FILE_FORMAT A file whose name is passed to the function
has an invalid format.
-100 -64 BT_ERR_SYSTEM_ERROR The function encounters a system-level
error. The caller can check the value of
errno to determine the exact error.
-101 -65 BT_ERR_OUT_OF_MEMORY The function could not allocate new
memory, or if there is insufficient memory
for a required operation.
-102 -66 BT_ERR_FILE_NOT_FOUND A file whose name is passed to the function
could not be found.
-103 -67 BT_ERR_FILE_PERMISSION_DENIED A file whose name is passed to the function
could not be accessed because of a
permissions restriction.
-150 -96 BT_ERR_LICENSE_INVALID At initialization, a license file is present but
the required key is not valid. The key may be
missing a field, or a field value may not be
valid.
-151 -97 BT_ERR_LICENSE_EXPIRED At initialization, a license file is present but
has passed its expiration date.
-152 -98 BT_ERR_LICENSE_WRONG_PLATFORM At initialization, a license file is present but
the platform in the license key is not valid.
-153 -99 BT_ERR_LICENSE_NOT_AVAILABLE A license file is not available. This happens
when attempting to run a language processor
for which there is no license.
-154 -9A BT_ERR_LICENSE_FILE_NOT_FOUND At initialization, a license file could not be
found or could not be processed.
-155 -9B BT_ERR_DATA_NOT_LICENCED An attempt is made to create a dictionary
from a data file that is not licensed.
-156 -9C BT_ERR_DATUM_NOT_FOUND A datum, such as a record or table entry,
which has been requested, cannot be found
in the corresponding resource.
-157 -9D BT_ERR_DATA_GENERATION_INCOMP An attempt is made to initiate use of a
ATIBLE database, table, dictionary, or other data
collection which is not of a compatible
generation, i.e., was not built with, intended
to be used with, or operable with, the
versions of other comparable collections
already in use by the application or
subsystem.
-10000 -2710 BT_RLP_ERR_NO_LANGUAGE_PROCES No language processors have been defined
SORS (or could be loaded) within a context. A
context must have at least one language
processor to be valid.
270
Error # Hex # Error Name This error code is returned when...
-10002 -2712 BT_RLP_ERR_INVALID_PROCESSOR_ A processor's internal API version does not
VERSION match the version required by the core RLP
library. This can occur if the processor is
newer than the version of RLP being used.
-10003 -2713 BT_RLP_ERR_NO_LICENSES_AVAILA There are no license keys defined. This can
BLE occur when the named license file exists but
does not contain valid key values.
-10004 -2714 BT_RLP_ERR_LANGUAGE_NOT_SUPPO A processor does not support the language
RTED requested. This is not necessarily an error; a
language processor can be called for a
language it doesn't support, in which case it
will do nothing, returning this value.
-10005 -2715 BT_RLP_ERR_REQUIRED_DATA_MISS A language processor required data that is
ING not available. For example, if named entity
extraction requires that the input be POS-
tagged and it is not, the language processor
may return this value.
-10006 -2716 BT_RLP_ERR_CHARSET_NOT_SUPPOR The application passes a mime charset to
TED \texttt{ProcessBuffer} or
\texttt{ProcessFile} and either the charset is
not acceptable to the processor or is invalid
or undefined. Note that some processors
ignore the charset or treat it as a hint, not a
firm declaration.
-10007 -2717 BT_RLP_ERR_INVALID_INPUT_DATA A processor detects data that it cannot
process. For example, a processor for a
specific file format returns this error when
the data is not in the specified file format.
-10008 -2718 BT_RLP_ERR_NO_ROOT_DIRECTORY The application has not established a Basis
root directory.
-10010 -271 BT_RLP_ERR_UNINITIALIZED_ENVI The application attempts to create a context
A RONMENT before initializing an environment.
271
272
Glossary
In the glossary below, terms that have a specific meaning for RLP are capitalized, e.g., "Context." Non-capitalized
terms are general linguistic or computing terms.
A
abbreviation An abbreviation is a shortened way to write a long word or phrase. For example,
the word "miscellaneous" is frequently replaced with the abbreviation "misc."
adjective An adjective is a word that modifies a noun to denote quantity, extent, quality, or
to specify the noun as distinct from something else. Consider the sentence, "The
smart woman has a great job." The words "smart" and "great" are adjectives, which
describe "the woman" and the "job," respectively.
adverb An adverb is a word that modifies a verb to describe how the verb was performed.
Consider the sentence, "The racer drove quickly." The word "quickly" is an adverb
that describes the driving of the racer. Adverbs may also modify whole sentences.
Consider the sentence, "Frankly, I don't care." The word "frankly" is an adverb
that describes the rest of the sentence.
affix One or more sounds or characters attached to the beginning, middle, or end of a
word or base to create a derived word or inflectional form.
alternative readings Alternative readings are returned in Japanese when the recognized word has more
than one valid pronunciation.
auxiliary verb A verb that is used in forming certain tenses, aspects and moods of other verbs.
Consider the sentence, "The man is running." The verb "is" is an auxiliary verb to
the main verb "run." The main difference in English between "runs" and "is
running" is aspectual (habitual vs. progressive). Also, the difference between
"saw" and "was seen," with the auxiliary verb "was," is mood (active vs. passive).
B
bopomofo A method for transcribing Chinese text using Chinese character elements for their
reading value. For more information, see Ken Lunde’s CJKV Information
Processing, O’Reilly 1999.
broken plurals Refer to a class of nouns whose plural is formed in an irregular way. Use is specific
to Semitic languages such as Arabic.
bound morpheme A bound morpheme is a morpheme that cannot stand alone and must be combined
with some other morpheme. An affix is an example of a bound morpheme.
See Also morpheme.
273
BT_BUILD BT_BUILD designates the platform on which RLP runs. It is embedded in the
name of the RLP SDK package that you use to install RLP, and it is used to name
platform-specific subdirectories, such as BT_ROOT/rlp/bin and BT_ROOT/rlp/
lib. See Supported Platforms and BT_BUILD Values [16] .
BT_ROOT BT_ROOT designates the Basis root directory, the directory where the RLP
SDK is installed. During initialization, an RLP application must set the path to
BT_ROOT.
C
choseong Leading consonants or syllabic initial jamo in written Korean.
compound analysis Dissects a compound word (a word composed of many words combined) into its
constituent pieces. For example, in German, the word
'"Bibliothekskatalogen" (library catalog) can be decompounded into "Bibliothek"
and "Katalog".
conjunction A conjunction is word that links two phrases, such as the words "and" and "or" in
English.
consonant-stem verb (五段動 A category of Japanese verbs whose stems end in consonants. Some examples of
詞) these verbs are hakob(u) 運ぶ, kak(u) 書く, and shir(u) 知る. An easy way to tell
if a Japanese verb is a consonant-stem is to look at the Romanization of the direct-
style negative form of the verb. If the letter preceding the suffix -anai is a
consonant, it is a consonant-stem verb.
Context RLP context is an XML document that defines a sequence of language processors
to apply to the input text. It may include property settings to customize the
processing.
copula verb A copula is a form of the verb "to be" used for equating two phrases. In Japanese,
copula refers to the word da and its various forms, which is used with some
Japanese nouns and adjectives.
count noun A count noun is a noun that has a plural form. Most nouns in English are count
nouns.
See Also unit noun, mass noun.
D
decomposing Decomposing refers to taking tokens and further breaking them down into smaller
constituent parts where possible.
See Also segmentation.
274
determiner A determiner is a word which specifies a particular noun phrase. For example, in
English, an article such as "the" ("the book").
direct-style Refers to a politeness level in Japanese, which shows intimacy between the speaker
and listener. Direct-style is the opposite of distal style.
See Also distal-style.
distal-style Refers to a politeness level in Japanese, which shows distance (and thus politeness
and deference to the listener) between the speaker and listener. The distal-style is
marked by verb endings containing -mas-, and use of the copula desu instead of
da. Distal-style is the opposite of direct-style.
See Also direct-style.
E
Environment The RLP environment is an XML document that represents the global state of RLP.
The environment is responsible for loading and maintaining the various language
processors authorized by the software license.
eumjeol Eumjeol are a space delimited sequence of eojeol in Korean Hangul writing.
See Also Hangul, eojeol.
F
FACILITY A Named Entity type. See Named Entity Definitions [222] .
fully-productive derivational Fully-productive means that one need not remember which words can be used with
suffix the suffix — anyplace the suffix is used is considered valid. The best example of
a fully-productive derivation suffix in English is -ness, which can be used with any
noun.
furigana In Japanese, furigana refers to the Katakana or Hiragana characters used to show
the pronunciation of a Japanese word containing Kanji. Also known as
yomigana or rubi.
275
See Also Kanji.
G
Gazetteer A list of words or phrases that share a certain property. Gazetteers are used to
extend support for named entities. For example, a gazetteer could contain a list of
companies that provide a particular service. See Gazetteer [130] language
processor and Gazetteer source files [178] .
GPE. Geo-Political Entity. A Named Entity type. See Named Entity Definitions
[222] .
H
half-width Half-width refers to characters which would appear as they do in the ASCII
encoding. The opposite of "half-width" is "full-width."
See Also full-width.
Han characters Han characters refers generally to the Chinese ideographic characters which are
used in Chinese, Japanese, and Korean. 头 is an example of two Han characters.
hanja Hanja is the Korean word referring to Chinese ideographic characters used in
Korean.
hanyu pinyin This system of pinyin uses Roman letters to express the pronunciation of Chinese
words. This system is used widely in Mainland China.
See Also pinyin.
Hanzi Hanzi is the Chinese word referring to Chinese ideographic characters used in
Chinese.
Hiragana The Japanese phonetic alphabet used to write native Japanese words (as opposed
to Katakana, the alphabet used for borrowed foreign words).
honorific prefix A set of characters or sounds added to the beginning of a word to make it into an
honorific word. For example, in Japanese, the prefix o-/go- 头 is added to the
beginning of some nouns to make them honorific.
honorific suffix A set of characters or sounds added to the end of a word make the word honorific.
For example, in Japanese, the suffix -sama 头 is added to the personal name when
addressing a person.
276
I
IDENTIFIER:UTM Univeral Transverse Mercator. A Named Entity type. See Named Entity
Definitions [222] .
irregular verb An irregular verb is one which does not follow the regular conjugation patterns of
the other verbs in the language.
Iterator Used to access the results generated by the processors in the context.
J
jamo Jamo are the basic phonetic building blocks for writing Korean in Hangul. Jamo
consists of 10 basic vowels and 14 consonants. The combination of jamo forms a
syllable called an eojeol.
See Also eojeol.
K
kana Kana refers to the Japanese alphabets Hiragana and Katakana collectively.
See Also Hiragana, Katakana.
Kanji Chinese ideographs used in the Japanese language are called Kanji. An example
of Kanji: 日本語.
Katakana The Japanese phonetic alphabet used to write borrowed foreign words.
See Also Hiragana.
L
Language Processor Processes input text given to RLP and generates analytical results for use by other
language processors and the RLP application.
lemma A lemma is the form of the word a person would look for when looking it up in a
dictionary, i.e., the "dictionary form" of a word. In English, it is the singular of
nouns ("book," not "books") and the "base form" of verbs ("define," not "defines,"
"defining," or "defined"). Other languages have somewhat different notions of
what the "dictionary form" or the "base form" is, but whatever form it takes, it is
the lemma.
277
lexeme A lexeme is an item in the vocabulary of a language, frequently called a word.
lexicalized A word that has entered the language as a single word, where the meaning of the
whole word no longer relates to the meaning of the constituent parts of the word,
is said to be lexicalized.
lexicon A listing of all the words in a language with information about its form, meaning,
and part of speech.
M
Main Dictionary The primary lexicon shipped with a language analyzer.
mass noun A mass noun is a type of noun that does not have a plural form. In English, "water"
is a typical mass noun; instead of making it plural, one must use a measure word
(e.g. "two glasses of water"). In Japanese and Chinese, every noun is a mass noun.
See Also unit noun, count noun.
morpheme A morpheme is the smallest unit for building words (composed of characters or
sounds). That is, morphemes cannot be broken down any further into meaningful
parts. Note: All morphemes are lexemes, but not all lexemes are morphemes. E.g.,
book + s = books. "book" and "s" are morphemes.
See Also lexeme, bound morpheme.
N
named entities The names of persons, places, times or things: "Bill Gates", "New York City",
"11/22/63" and "Harvard University" are all named entities.
named entity extraction Named entity extraction is the act of extracting known types of entities such as
personal names, organization corporate names, and geographical names from a
stream of text.
normalization The act of removing variations from words due to spelling or writing conventions
that are not significant from the point of view of information retrieval.
numeric Numeric describes a token or word containing a number either in Arabic numerals
(0, 1, 2, 3, ...), Arabic-Indic numerals (used in Arabic script), or characters that
represent numbers.
278
O
onomatope A word that expresses a sound. For example, in English "glug, glug" is the
onomatope for the sound of someone drinking.
ordinal numeric An ordinal numeric is a number that designates a place in a sequence, such as
"first," "second," etc.
P
part-of-speech Each language has several dozen parts of speech; each part of speech (POS)
describes how a word can combine with other words in that language. A part of
speech tag (POS tag) can be assigned to each token.
particle A particle is a word that indicates the role of a preceding or following word in a
sentence. For example, in Japanese, the particle wa (头) appears after the subject of
a sentence.
pinyin Pinyin is a system that uses the Roman alphabet used to show the pronunciation
of Chinese words. There are several types of pinyin.
See Also hanyu pinyin.
prefix A bound morpheme that attaches to the beginning of a word or base to create a
derivational word or inflectional form.
See Also bound morpheme, morpheme.
preposition A preposition is a word that appears before a noun phrase and combines to form a
phrase. It expresses the relation that the noun phrase has with the rest of the
sentence. For example, the phrase "under the boardwalk" contains the preposition
"under".
postposition A postposition is the same as a preposition, except that it appears after a noun
phrase instead of before it.
pronoun A pronoun is a word that refers to a noun. Some English pronouns are "he," "she,"
and "it."
proper noun A proper noun is the name of a person, place, or entity. For example, "London",
"Elizabeth", and "Basis Technology" are proper nouns. In English, proper nouns
are always capitalized.
R
regular expression A way of writing a pattern that a string can match or not. Regular expressions are,
most commonly, written in the Perl-compatible regular expression (PCRE) format.
For example, "ab*a" matches a string that starts and ends with an "a", with zero or
more occurrences of "b" in the middle.
279
Romanization Transcription or transliteration of text from another alphabet or writing script to
the Latin alphabet.
See Also transcription, transliteration.
Rosette Language Boundary Formerly known as the Multilingual Language identifier (MLI), RLBL detects
Locator boundaries between areas of different languages in text containing multiple
languages and between areas of different scripts.
S
script A script is a system of writing that is associated with a language. Hiragana,
Katakana, and Kanji are examples of written scripts used in the Japanese language,
but which are not, by themselves, languages.
segmentation The act of dividing a stream of text into segments or tokens. Broadly speaking,
RLP uses a "longest match" principle to segment text.
See Also decomposing, token.
sentence boundary detection Uses the language-based rules of punctuation ito determine the end of sentences.
In many languages (such as Chinese and Japanese), end-of-sentence punctuation
is largely unambiguous, and so sentence boundary detection is relatively easy. In
other languages, such as English and most European languages, sentence-ending
punctuation can also be used for other purposes. For example, periods can be used
in English to end sentence, to mark abbreviations, or in some cases do both at the
same time. A sentence boundary detector for English, then, has to distinguish
among all the uses of sentence-ending punctuation and decide when it is actually
ending a sentence.
stem A stem is the hypothetical form from which all the forms of a word are created.
Frequently, the stem is not an independent word that can stand alone. For example,
the English verb "decode" has the following observed forms: "decode,"
"decoding," "decodes," and "decoded." The stem is "decod"; it clearly is a string
from which the observed words are built, but the stem itself is never seen as a real
word.
stemming The process of removing prefixes and suffixes from a word until only a "stem"
remains. Note that a stem is not necessarily the same as the lemma (dictionary
form) of the word.
See Also lemma, stem.
suffix A bound morpheme that attaches to the beginning of a word or base to create a
derived word or inflectional form.
See Also fully-productive derivational suffix.
suru-verb (サ変動詞) A suru-verb is a Japanese verb that is formed of a one character plus the verb
suru (する) or other related forms such as siru (しる), which means "to do." Usually
the first character does not stand alone as a word, although it may in some cases.
280
For example, ai(suru) (愛する), kan(suru) (関する), and syou(jiru) (生じる) are
suru-verbs.
T
tatweel See kashida.
Tokenizer Language processor that parses a document and generates a sequence of tokens.
transcription The representation of the sound of the words in a language using another alphabet
or set of symbols created for that purpose. Transcription is not concerned with
representing characters; it strives to give a phonetically (or phonologically)
accurate representation of the word. This may differ, depending on the language
or set of symbols into which the word is being transcribed.
transliteration The spelling of the words in one language and writing script with characters from
another writing script. Ideally, transliteration is a character-for-character
replacement so the reverse transliteration into the original script is possible.
Transliteration is not concerned with representing the phonetics of the original; it
only strives to accurately represent the characters.
U
unit noun A unit noun is what is also called sometimes called a "counter," "classifier," or
"measure word." In some languages, objects cannot be counted directly in the way
English never counts bread except by a unit (e.g., a loaf). In English, one would
never say "two breads" but rather "two loaves of bread." In that case, "loaf/loaves"
is a unit noun.
See Also mass noun.
User Dictionary A language dictionary created and maintained by the user for use by a language
analyzer in conjunction with the standard dictionary or dictionaries that RLP
provides for that language. Depending on the language analyzer, the dictionary
may provide segmentation or morphological information. Currently supported for
Chinese [123] , Japanese [134] , Korean [138] , and Base Linguistics
Analyzer [119] (European languages).
V
verbal noun (サ変動詞) A verbal noun is a verb composed of a noun with two or more characters and the
verb suru (する) ("to do"). For example, benkyou suru (勉強する) means "to
study" and is composed of the word for "study" with the verb suru. Because the
noun can stand alone, it may be segmented from the verb suru.
See Also suru-verb (サ変動詞).
vowel-stem verb (一段動詞) This is a category of Japanese nouns whose stems end in a vowel. Some examples
of these are tabe(ru) (头头), i(ru) (头头), and age(ru) (头头). An easy way to tell if a Japanese
281
verb is a vowel-stem, is to look at the romanization of the direct-style negative
form of the verb. If the letter preceding the suffix -nai is a vowel, then it is a vowel-
stem verb.
Y
yomigana See furigana.
Z
zenkaku See half-width.
282
language processor, 119
Index morphological tags, 256
POS tags, 231
A special tags, 256
ALTERNATIVE_LEMMAS, 77 user dictionaries, 185
ALTERNATIVE_NORM, 77
ALTERNATIVE_PARTS_OF_SPEECH, 77 E
ALTERNATIVE_ROOTS, 78 English
ALTERNATIVE_STEMS, 78 language processor, 119
applications morphological tags, 254
building, 43 POS tags, 234
Arabic special tags, 254
language processor, 115 user dictionaries, 185
arbl, 115 environment
configuration, 19
B error codes, 269
BASE_NOUN_PHRASE, 78 European
BaseNounPhrase, 121 language processor, 119
Base Noun Phrase Detector, 122 user dictionaries, 185
BL1, 119
BOM, 72 F
BT_BUILD, 16 fabl, 147
BT_RLP_LOG_LEVEL, 25 French
BT_ROOT, 6 , 7 language processor, 119
morphological tags, 255
C POS tags, 233
Chinese special tags, 255
language processor, 124 user dictionaries, 185
POS tags, 229
user dictionaries, 188 G
CLA, 123 Gazetteer, 131
command-line utility examples, 178
RLP, 8 options file, 132
COMPOUND, 78 source file, 131 , 178
compound noun dictionary XML, 180
Korean, 194 German
context language processor, 119
configuration, 21 POS tags, 236
minimal, 101 user dictionaries, 185
properties, 21 Greek
CSC, 128 language processor, 119
Czech POS tags, 238
language processor, 119 user dictionaries, 185
POS tags, 230
user dictionaries, 185 H
Hangul dictionary, 194
D HTML input, 73
DETECTED_ENCODING, 78 HTML Stripper, 132
DETECTED_LANGUAGE, 78 language processor, 133
DETECTED_SCRIPT, 78 Hungarian
dictionaries language processor, 119
user, 185 POS tags, 239
Dutch special tags, 255
283
user dictionaries, 185 Japanese Orthographic Analyzer, 137
Korean Language Analyzer, 138
I Language Boundary Detector, 140
iFilter, 73 , 133 Language Identification, 164
processor, 133 mime_detector, 141
input file size, 75 Named Entity Extractor, 142
installation Named Entity Redactor, 147
package contents, 5 Persian, 147 , 148
Unix, 7 Regular Expression, 150
Windows 32, 5 REXML, 152
ISO639 Script Boundary Detector, 170
language codes, 12 Stopwords, 171
Italian Tokenizer, 173
language processor, 119 Urdu, 174
morphological tags, 255 LEMMA, 79
POS tags, 241 license file, 5
special tags, 256 installation, 6
user dictionaries, 185 logging, 23
Lucene
Demos, 200
J Integration with RLP, 199
Japanese
language processor, 134 , 137
POS tags, 243 M
user dictionaries, 191 MAP_OFFSETS, 79
JLA, 134 marked-up text, 73
jla-options.xml, 134 Microsoft Office input, 73
JOA, JON, 137 mime_detector, 140
language processor, 141
K MIME_TYPE, 79
morphological tags, 253
KLA, 138
multilingual text
kla-options.xml, 138
processing, 65
Korean
compound noun dictionary, 194
Hangul dictionary, 194 N
language processor, 138 NAMED_ENTITY, 80 , 142
POS tags, 244 named entities
user dictionaries, 194 types, 221
NamedEntityExtractor, 141
L Named Entity Extractor, 142
LANGUAGE_REGION, 79 , 140 Named Entity Redactor, 147
language analyzers, 21 NERedactLP, 145
Language Boundary, 139 NORMALIZED_TOKEN, 80
language codes, 12 normalizing text, 75
language processors, 114
Arabic, 115 P
Base Noun Phrase Detector, 122 PART_OF_SPEECH, 80
Chinese Language Analyzer, 124 PDF input, 73
Core Library for Unicode, 154 Persian
European languages, 119 language processor, 147 , 148
Gazetteer, 131 plain text, 71
HTML Stripper, 133 platforms
iFilter, 133 OS, CPU, compiler, 16
Japanese Language Analyzer, 134 Polish
284
language processor, 119 BASE_NOUN_PHRASE, 78
POS tags, 245 COMPOUND, 78
user dictionaries, 185 DETECTED_ENCODING, 78
Portuguese DETECTED_LANGUAGE, 78
language processor, 119 DETECTED_SCRIPT, 78
POS tags, 246 LANGUAGE_REGION, 79 , 140
special tags, 257 LEMMA, 79
user dictionaries, 185 MAP_OFFSETS, 79
POS Tags MIME_TYPE, 79
Chinese, 229 NAMED_ENTITY, 80
Czech, 230 NORMALIZED_TOKEN, 80
Dutch, 231 PART_OF_SPEECH, 80
English, 234 RAW_TEXT, 80
French, 233 READING, 80
German, 236 ROOTS, 80
Greek, 238 SCRIPT_REGION, 81
Hungarian, 239 SENTENCE_BOUNDARY, 81
Italian, 241 STEM, 81
Japanese, 243 STOPWORD, 81
Korean, 244 TEXT_BOUNDARIES, 81
Polish, 245 TOKEN, 82
Portuguese, 246 TOKEN_OFFSET, 82
Russian, 248 TOKEN_SOURCE_ID, 82
Spanish, 249 TOKEN_SOURCE_NAME, 82
Private Use Area (PUA) characters, 197 TOKEN_VARIATIONS, 82
REXML, 152
Q Rich Text Format input, 73
RLI, 164
query
RLP command-line utility, 8
processing a, 115
ROOTS, 80
Rosette Language Boundary Locater, 65
R runtime configuration, 99
RAW_TEXT, 80 Russian
RCLU, 154 language processor, 119
READING, 80 POS tags, 248
reading dictionary user dictionaries, 185
Chinese, 124
Japanese, 134
redistribution, 99
S
sample applications, 43
RegExpLP, 150
C
regular expression, 150
ar-rlp_sample_alternatives_c, 56
regular expressions
examine_license_c, 58
creating, 181
rlp_sample_c, 49
syntax, 259
C#
result data
RLPSample, 61
accessing in C++, 82
C++
accessing in Java, 92
examine_license, 43
result iterator (C++), 85
rlbl_sample, 43
result type
rlp_sample, 30 , 43
ALTERNATIVE_LEMMAS, 77
Java
ALTERNATIVE_NORM, 77
ExamineLicense, 45
ALTERNATIVE_PARTS_OF_SPEECH, 77
MultiLangRLP, 45
ALTERNATIVE_ROOTS, 78
RLPSample, 36 , 45
ALTERNATIVE_STEMS, 78
285
SCRIPT_REGION, 81 , 170 Polish, 185
Search applications Portuguese, 185
writing an analyzer, 205 Russian, 185
search terms, 115 Spanish, 185
SENTENCE_BOUNDARY, 81 UTF-16
SentenceBoundaryDetector, 169 Unicode Converter, 174
Solr UTF-16LE/BE text, 72
Integration with RLP, 199 UTF-32
Spanish Unicode Converter, 174
language processor, 119 UTF-8
morphological tags, 254 Unicode Converter, 174
POS tags, 249
special tags, 255 X
user dictionaries, 185
XML input, 73 , 75
special tags, 253
STEM, 81
stopconfig.dtd, 171 , 184
STOPWORD, 81
Stopwords, 171
configuration, 184
dictionaries, 182
examples, 182
T
TEXT_BOUNDARIES, 81 , 172
Text Boundary, 172
TOKEN, 82
TOKEN_OFFSET, 82
TOKEN_SOURCE_ID, 82
TOKEN_SOURCE_NAME, 82
TOKEN_VARIATIONS, 82
token iterator (C++), 83
Tokenizer, 173
U
Unicode Converter, 173
urbl, 174
Urdu
language processor, 174
user-defined data, 177
user dictionaries, 21 , 185
Chinese, 188
Czech, 185
Dutch, 185
English, 185
European, 185
French, 185
German, 185
Greek, 185
Hungarian, 185
Italian, 185
Japanese, 191
Korean, 194
286