Showing posts with label web-mining. Show all posts

Sunday, January 4, 2015

from pattern.web import Google; google.search()

By Vasudev Ram


$ pip install pattern
# test_pattern_google_search.py
from pattern.web import Google, plaintext

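# Search Google for the exact phrase "python"; cached=False bypasses pattern's local cache.
google = Google(language='en')
for result in google.search('"python"', cached=False):
    try:
        # plaintext() strips HTML markup from the result snippet.
        print unicode(plaintext(result.text))
    except UnicodeEncodeError:
        print "UnicodeEncodeError, skipping this result"
    except Exception:
        print "Exceptions happen"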
$ python test_pattern_google_search.py >t
$ less t # more coffee
The official home of the Python Programming Language.
Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows ...
The original implementation of Python, written in C.
Learn to program in Python, a powerful language used by sites like YouTube and Dropbox.
Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented ...
Welcome to the 3rd Edition of Learn Python the Hard Way. You can visit the companion site to the book at http://learnpythonthehardway.org/ where you can ...
Python 3.4.2 documentation. Welcome! This is the documentation for Python 3.4.2, last updated Jan 01, 2015. Parts of the documentation: ...
Try Python in your browser ... Best way to learn Python for Raspberry Pi? ... Are there any python package that can intelligently parse strings containing numbers
...
Python 2.7.9 documentation. Welcome! This is the documentation for Python 2.7.9, last updated Dec 28, 2014. Parts of the documentation: ...
You can get xkcd shirts, prints, and posters in the store! Python ... Image URL (for
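
The script above skips any result that raises UnicodeEncodeError. A likely cause (an assumption, since no traceback is shown) is that Python 2 falls back to ASCII when stdout is redirected to a file, as with ">t". Below is a minimal sketch of one workaround: write the text to a file with an explicit UTF-8 encoding instead of relying on redirection. The helper name write_lines_utf8 and the sample strings are hypothetical stand-ins for the plaintext(result.text) values; nothing here calls pattern itself.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_lines_utf8(path, lines):
    """Write each text line to `path` as UTF-8, one per line.

    Returns the number of lines written.
    """
    count = 0
    # io.open accepts an encoding argument in both Python 2 and Python 3.
    with io.open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + u'\n')
            count += 1
    return count


# Hypothetical sample data standing in for search result snippets.
sample = [u'Python is an easy to learn language.',
          u'Caf\u00e9 \u2026 na\u00efve']

# Write to a temporary file so the demo leaves the working directory alone.
demo_path = os.path.join(tempfile.mkdtemp(), 't.txt')
written = write_lines_utf8(demo_path, sample)
```

With this approach, non-ASCII results are kept rather than skipped, and the output file can be viewed with any UTF-8-aware pager.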

- Vasudev Ram - Dancing Bison Enterprises


Sunday, October 14, 2012

pattern, a Python web mining and NLP tool

By Vasudev Ram


pattern is a web mining and NLP (Natural Language Processing) library for Python.

It is from CLiPS (Computational Linguistics & Psycholinguistics), "a research center associated with the Linguistics department of the faculty of Arts of the University of Antwerp."

From the site:

[ It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, KNN, SVM), and data visualization (graph networks). ]

Example usage and output - from the site:
>>> from pattern.web import Twitter, plaintext
>>> for tweet in Twitter().search('"more important than"', cached=False):
...     print plaintext(tweet.description)
 
'HINT: The mobile web is more important than mobile apps.'
'Start slowly, direction is more important than speed.'
'Imagination is more important than knowledge. - Albert Einstein'
...
I installed it (by downloading the zip file, extracting it, and running "python setup.py install"), then tried it out with the above test program and a few variations on it. It partially works: it is able to fetch some tweets, but in some cases it gives errors that seem to be related to Unicode.
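
Errors like these often come from printing text that contains characters the console encoding cannot represent; that is an assumption here, since the exact error is not shown. A minimal sketch of one way around it is to replace the unencodable characters before printing, rather than dropping the whole tweet. The helper name to_printable and the sample string are hypothetical and are not part of pattern.

```python
# -*- coding: utf-8 -*-


def to_printable(text, encoding='ascii'):
    """Return `text` with characters that `encoding` cannot represent
    replaced by '?', so that printing it does not raise UnicodeEncodeError."""
    return text.encode(encoding, 'replace').decode(encoding)


# Hypothetical tweet text containing a non-ASCII dash (U+2013).
tweet = u'Imagination is more important than knowledge \u2013 Albert Einstein'
print(to_printable(tweet))
```

Passing a wider encoding such as 'utf-8' leaves the text unchanged, so the same helper can be used when the console does support it.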

It also has an NLP module for English and a few other languages, along with the other tools listed in the description above.

UPDATE:

It is now working. Got it to fetch these recent tweets of mine (from my @vasudevram Twitter profile):
IGNORE THIS (testing a Twitter tool). test===444
IGNORE THIS (testing a Twitter tool). test===333
IGNORE THIS (testing a Twitter tool). test===222
IGNORE THIS (testing a Twitter tool). test===111

- Vasudev Ram - Dancing Bison Enterprises