Retrieving and Visualizing Data: Charles Severance
Visualize
Clean/Process
• http://spark.apache.org/
• https://aws.amazon.com/redshift/
• http://community.pentaho.com/
• ....
"Personal Data Mining"
Our goal is to make you better programmers – not to make you data
mining experts
GeoData
• Makes a Google Map from user-entered data
[Diagram labels: Google geodata, geoload.py, geodata.sqlite, geodump.py, where.js]
http://www.py4e.com/code3/pagerank.zip
Search Engine Architecture
• Web Crawling
• Index Building
• Searching
http://infolab.stanford.edu/~backrub/google.html
Web Crawler
A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner. Web crawlers are
mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded
pages to provide fast searches.
http://en.wikipedia.org/wiki/Web_crawler
Web Crawler
• Retrieve a page
• Repeat...
http://en.wikipedia.org/wiki/Web_crawler
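One common way to structure a retrieve-and-repeat loop like this is sketched below. It is a minimal illustration, not the course's own spider code: the names `LinkParser`, `fetch`, `crawl`, and the `max_pages` and `fetch_page` parameters are all hypothetical, and the breadth-first ordering is an assumption made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl; returns {url: [absolute links found on it]}."""
    to_visit = deque([start_url])
    seen = {start_url}           # never queue the same URL twice
    pages = {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        try:
            html = fetch_page(url)   # retrieve a page
        except Exception:
            continue                 # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            absolute = urldefrag(urljoin(url, href)).url
            links.append(absolute)
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
        pages[url] = links           # then repeat with the next queued URL
    return pages
```

Passing a different `fetch_page` function lets the loop run against canned pages, which is a convenient way to experiment without touching a real site.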
Web Crawling Policy
• a selection policy that states which pages to download,
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
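As a small illustration of the Robots Exclusion Standard linked above, the sketch below checks URLs against a robots.txt file using Python's standard `urllib.robotparser` module. The sample rules, the `example.test` host, the `example-crawler` user agent, and the helper names `build_robot_rules` and `allowed` are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule set we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(rules, url, user_agent="example-crawler"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)
print(allowed(rules, "http://example.test/index.html"))
print(allowed(rules, "http://example.test/private/notes.html"))
```

A crawler would typically run a check like this before retrieving each page and skip any URL the site has asked robots to avoid.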
Google Architecture
• Web Crawling
• Index Building
• Searching
http://infolab.stanford.edu/~backrub/google.html
Search Indexing
Search engine indexing collects, parses, and stores data to
facilitate fast and accurate information retrieval. The purpose
of storing an index is to optimize speed and performance in
finding relevant documents for a search query. Without an
index, the search engine would scan every document in the
corpus, which would require considerable time and
computing power.
http://en.wikipedia.org/wiki/Index_(search_engine)
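To make the idea of an index concrete, here is a minimal sketch of an inverted index: a dictionary mapping each word to the set of documents that contain it. With it, a query looks up a few dictionary entries instead of scanning every document. The function names and the tiny sample corpus are hypothetical, and real search engines store far more (positions, rankings, and so on).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words made of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document corpus
docs = {
    1: "Web crawlers retrieve pages",
    2: "Search engines index pages",
    3: "An index makes search fast",
}
index = build_index(docs)
print(search(index, "index search"))
```

The index is built once, up front; after that, each search costs roughly one lookup per query word.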
[Diagram labels: spreset.py, sprank.py, spdump.py, force.js, force.html, d3.js]
http://www.py4e.com/code3/pagerank.zip
Mailing Lists - Gmane
http://www.py4e.com/code3/gmane.zip
Warning: This Dataset is > 1GB
• Do not just point this application at gmane.org and let it run
• There is no rate limit – these are cool folks
http://mbox.dr-chuck.net/sakai.devel/4/5
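Because the server does not enforce a rate limit, any throttling has to come from the client. The sketch below shows one way to pause between requests. It is not the code in gmane.py: the helper names and the one-second delay are hypothetical, and it assumes the two numbers at the end of the URL above are consecutive message numbers.

```python
import time
from urllib.request import urlopen

BASE_URL = "http://mbox.dr-chuck.net/sakai.devel/"


def message_url(start, base_url=BASE_URL):
    """Build a URL such as .../sakai.devel/4/5 (assumed URL pattern)."""
    return f"{base_url}{start}/{start + 1}"


def fetch_text(url):
    """Default fetcher (hypothetical helper): download a document as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_messages(first, count, delay_seconds=1.0,
                   fetch=fetch_text, sleep=time.sleep):
    """Retrieve `count` messages starting at `first`, pausing between requests."""
    messages = {}
    for number in range(first, first + count):
        if number > first:
            sleep(delay_seconds)     # be polite: wait before the next request
        messages[number] = fetch(message_url(number))
    return messages
```

The `fetch` and `sleep` parameters exist only so the loop can be tried out with stand-in functions before it is pointed at a real server.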
[Diagram labels: mbox.dr-chuck.net, gmane.py, content.sqlite, gmodel.py, mapping.sqlite, gbasic.py, gword.py, gword.js, gword.htm, gline.py, gline.js, gline.htm, d3.js]
How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184
...
http://www.py4e.com/code3/gmane.zip
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (
www.dr-chuck.com) of the University of Michigan School of
Information and open.umich.edu and made available under a Creative
Commons Attribution 4.0 License. Please maintain this last slide in all
copies of the document to comply with the attribution requirements of
the license. If you make a change, feel free to add your name and
organization to the list of contributors on this page as you republish the
materials.