NLP | Chunk Tree to Text and Chaining Chunk Transformation
Last Updated : 08 Aug, 2022
A chunk tree or subtree can be converted back into a sentence or chunk string by joining the words of its leaves. The code below demonstrates this on the first tree of the treebank_chunk corpus.
Code #1 : Joining the words in a tree with spaces.
Python3
# Loading library
from nltk.corpus import treebank_chunk

# First chunk tree in the corpus
tree = treebank_chunk.chunked_sents()[0]
print("Tree : \n", tree)
print("\nTree leaves : \n", tree.leaves())
# Each leaf is a (word, tag) pair; keep only the words
print("\nSentence from tree : \n",
      ' '.join([w for w, t in tree.leaves()]))
Output :
Tree :
(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)
Tree leaves :
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Sentence from tree :
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
In the output of Code #1, the punctuation is spaced incorrectly: the commas and the period are separate tokens in the tree, so joining on spaces puts a space in front of each of them. The code below fixes this with a regular expression substitution.
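Code #2 : chunk_tree_to_sent() function to improve Code #1
Python3
import re

# Whitespace followed by a punctuation mark (captured as group 1)
punct_re = re.compile(r'\s([,\.;\?])')

def chunk_tree_to_sent(tree, concat=' '):
    # Join the words of the (word, tag) leaves with the separator
    s = concat.join([w for w, t in tree.leaves()])
    # Drop the whitespace in front of each punctuation mark
    return re.sub(punct_re, r'\g<1>', s)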
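The substitution can also be tried without the corpus. The following is a minimal sketch, assuming a hand-written list of (word, tag) pairs in place of tree.leaves(); the sample tokens are illustrative and not taken from treebank_chunk. It shows the string before and after the whitespace in front of each punctuation mark is removed.

```python
import re

# One whitespace character followed by a comma, period, semicolon
# or question mark; the punctuation mark is captured as group 1.
punct_re = re.compile(r'\s([,\.;\?])')

# Hand-written (word, tag) pairs standing in for tree.leaves().
# Illustrative data only, not taken from the corpus.
leaves = [('Yes', 'UH'), (',', ','), ('it', 'PRP'), ('works', 'VBZ'),
          (';', ':'), ('does', 'VBZ'), ('it', 'PRP'), ('?', '.')]

# Plain join: every token, including punctuation, gets a leading space.
joined = ' '.join(w for w, t in leaves)

# Replace "<space><punct>" with just "<punct>".
fixed = punct_re.sub(r'\g<1>', joined)

print(joined)  # Yes , it works ; does it ?
print(fixed)   # Yes, it works; does it?
```

The character class only covers the comma, period, semicolon and question mark, so other marks such as "!" or ":" would keep their leading space unless they are added to the pattern.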
Code #3 : Evaluating chunk_tree_to_sent()
Python3
# Loading libraries
from nltk.corpus import treebank_chunk
# chunk_tree_to_sent() from Code #2, assumed to be saved in transforms.py
from transforms import chunk_tree_to_sent

# First chunk tree in the corpus
tree = treebank_chunk.chunked_sents()[0]
print("Tree : \n", tree)
print("\nTree leaves : \n", tree.leaves())
print("\nTree to sentence : \n", chunk_tree_to_sent(tree))
Output :
Tree :
(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD)
  ./.)
Tree leaves :
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'),
('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Tree to sentence :
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Chaining Chunk Transformation
Transformation functions can be chained together to normalize chunks. The resulting chunks are often shorter but still hold the same meaning.
In the code below, transform_chunk() takes a single chunk and an optional list of transform functions. It calls each transform function on the chunk in turn, passing the result of one to the next, and returns the final chunk. If trace is set, it also prints the chunk after each step.
Code #4 :
Python3
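# filter_insignificant, swap_verb_phrase, swap_infinitive_phrase and
# singularize_plural_noun are assumed to be defined earlier in the same module
def transform_chunk(
        chunk, chain=[filter_insignificant,
                      swap_verb_phrase, swap_infinitive_phrase,
                      singularize_plural_noun], trace=0):
    for f in chain:
        # Feed the result of each transform into the next one
        chunk = f(chunk)
        if trace:
            print(f.__name__, ':', chunk)
    return chunk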
Code #5 : Evaluating transform_chunk
Python3
# transform_chunk() from Code #4, assumed to be saved in transforms.py
from transforms import transform_chunk

chunk = [('the', 'DT'), ('book', 'NN'), ('of', 'IN'),
         ('recipes', 'NNS'), ('is', 'VBZ'), ('delicious', 'JJ')]
print("Chunk : \n", chunk)
print("\nTransformed Chunk : \n", transform_chunk(chunk))
Output :
Chunk :
[('the', 'DT'), ('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS'),
('is', 'VBZ'), ('delicious', 'JJ')]
Transformed Chunk :
[('delicious', 'JJ'), ('recipe', 'NN'), ('book', 'NN')]
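The four transform functions used by transform_chunk() are not defined in this article, so the chaining pattern itself may be easier to follow with stand-ins. The following is a minimal sketch, assuming two toy transforms (drop_determiners and lowercase_words) and a function named apply_chain; all three names are hypothetical and are much simpler than the real transforms.

```python
def drop_determiners(chunk):
    """Toy transform: remove words tagged DT."""
    return [(w, t) for w, t in chunk if t != 'DT']

def lowercase_words(chunk):
    """Toy transform: lowercase every word and keep its tag."""
    return [(w.lower(), t) for w, t in chunk]

def apply_chain(chunk, chain=(drop_determiners, lowercase_words), trace=False):
    """Apply each transform in order, feeding each result to the next."""
    for f in chain:
        chunk = f(chunk)
        if trace:
            # Show the intermediate chunk after this step
            print(f.__name__, ':', chunk)
    return chunk

chunk = [('The', 'DT'), ('Book', 'NN'), ('of', 'IN'), ('Recipes', 'NNS')]
result = apply_chain(chunk, trace=True)
# drop_determiners : [('Book', 'NN'), ('of', 'IN'), ('Recipes', 'NNS')]
# lowercase_words : [('book', 'NN'), ('of', 'IN'), ('recipes', 'NNS')]
```

Every transform takes a chunk and returns a new chunk, which is what makes the functions interchangeable and reorderable in the chain. The sketch uses a tuple for the default chain so that the default cannot be modified by accident between calls.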