Open In App

NLP | Location Tags Extraction

Last Updated : 26 Feb, 2019
Comments
Improve
Suggest changes
Like Article
Like
Report
Different kind of ChunkParserI subclass can be used to identify the LOCATION chunks. As it uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader class that contains the following location words:
  • Country names
  • U.S. states and abbreviations
  • Mexican states
  • Major U.S. cities
  • Canadian provinces
LocationChunker class looking for words that are found in the gazetteers corpus by iterating over a tagged sentence. It creates a LOCATION chunk using IOB tags when it finds one or more location words. The IOB LOCATION tags are produced in the iob_locations() and the parse() method converts the IOB tags to Tree. Code #1 : LocationChunker class Python3 1==
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
    def __init__(self):
        self.locations = set(gazetteers.words())
        self.lookahead = 0
        for loc in self.locations:
            nwords = loc.count(' ')
        if nwords > self.lookahead:
            self.lookahead = nwords
  Code #2 : iob_locations() method Python3 1==
def iob_locations(self, tagged_sent):
    
    i = 0
    l = len(tagged_sent)
    inside = False
    
    while i < l:
        word, tag = tagged_sent[i]
        j = i + 1
        k = j + self.lookahead
        nextwords, nexttags = [], []
        loc = False
        
    while j < k:
        if ' '.join([word] + nextwords) in self.locations:
            if inside:
                yield word, tag, 'I-LOCATION'
            else:
                yield word, tag, 'B-LOCATION'
            for nword, ntag in zip(nextwords, nexttags):
                yield nword, ntag, 'I-LOCATION'
                loc, inside = True, True
                i = j
                break
            
        if j < l:
            nextword, nexttag = tagged_sent[j]
            nextwords.append(nextword)
            nexttags.append(nexttag)
            j += 1
        else:
            break
        if not loc:
            inside = False
            i += 1
            yield word, tag, 'O'
            
    def parse(self, tagged_sent):
        iobs = self.iob_locations(tagged_sent)
        return conlltags2tree(iobs)
  Code #3 : use the LocationChunker class to parse the sentence Python3 1==
from nltk.chunk import ChunkParserI
from chunkers import sub_leaves
from chunkers import LocationChunker

t = loc.parse([('San', 'NNP'), ('Francisco', 'NNP'),
               ('CA', 'NNP'), ('is', 'BE'), ('cold', 'JJ'), 
               ('compared', 'VBD'), ('to', 'TO'), ('San', 'NNP'),
               ('Jose', 'NNP'), ('CA', 'NNP')])

print ("Location : \n", sub_leaves(t, 'LOCATION'))
Output :
Location : 
[[('San', 'NNP'), ('Francisco', 'NNP'), ('CA', 'NNP')], 
[('San', 'NNP'), ('Jose', 'NNP'), ('CA', 'NNP')]]

Next Article
Practice Tags :

Similar Reads