Retrieving and Visualizing Data

Charles Severance

Python for Everybody

www.py4e.com
Multi-Step Data Analysis
[Diagram: Data Source, Gather, Clean/Process, Analyze, Visualize]

(5, 1.0, 0.985, 3, u'http://www.dr..')
(3, 1.0, 2.135, 4, u'http://www.dr..')
(1, 1.0, 0.659, 2, u'http://www.dr..')
(1, 1.0, 0.659, 5, u'http://www.dr..')
....
Many Data Mining Technologies
• https://hadoop.apache.org/
• http://spark.apache.org/
• https://aws.amazon.com/redshift/
• http://community.pentaho.com/

• ....
"Personal Data Mining"
Our goal is to make you better programmers – not to make you data
mining experts
GeoData
• Makes a Google Map from user-entered data
• Uses the Google Geodata API
• Caches data in a database to avoid rate limiting and allow restarting
• Visualized in a browser using the Google Maps API

http://www.py4e.com/code3/geodata.zip
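The caching idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in geodata.zip: the table name `Locations`, the `lookup` helper, and the `fake_geocode` stand-in (used instead of a real API request) are all assumptions made for the example.

```python
import json
import sqlite3

def fake_geocode(address):
    # Hypothetical stand-in for a real geocoding request; counts its calls
    fake_geocode.calls += 1
    return json.dumps({'address': address})
fake_geocode.calls = 0

def lookup(conn, address, fetch):
    """Return cached data for address, calling fetch only on a cache miss."""
    cur = conn.cursor()
    cur.execute('SELECT geodata FROM Locations WHERE address = ?', (address,))
    row = cur.fetchone()
    if row is not None:
        return row[0]          # already retrieved on an earlier run
    data = fetch(address)      # cache miss: retrieve once, then store
    cur.execute('INSERT INTO Locations (address, geodata) VALUES (?, ?)',
                (address, data))
    conn.commit()              # commit so a restart keeps the progress
    return data

# An in-memory database keeps the sketch self-contained
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS Locations '
             '(address TEXT PRIMARY KEY, geodata TEXT)')

first = lookup(conn, 'Ann Arbor, MI', fake_geocode)
second = lookup(conn, 'Ann Arbor, MI', fake_geocode)
print(first == second, fake_geocode.calls)
```

Because the second lookup is answered from the database, the slow or rate-limited retrieval step runs only once per address, and an interrupted run can pick up where it left off.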
[Diagram: where.data, Google geodata, geoload.py, geodata.sqlite, geodump.py, where.js, where.html]

Northeastern University, ... Boston, MA 02115, USA 42.3396998 -71.08975
Bradley University, 1501 ... Peoria, IL 61625, USA 40.6963857 -89.6160811
...
Technion, Viazman 87, Kesalsaba, 32000, Israel 32.7775 35.0216667
Monash University Clayton ... VIC 3800, Australia -37.9152113 145.134682
Kokshetau, Kazakhstan 53.2833333 69.3833333
...
12 records written to where.js
Open where.html to view the data in a browser

http://www.py4e.com/code3/geodata.zip
Page Rank
• Write a simple web page crawler
• Compute a simple version of Google's Page Rank algorithm
• Visualize the resulting network

http://www.py4e.com/code3/pagerank.zip
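As a rough idea of what "a simple version" of the ranking step might look like, the sketch below repeatedly splits each page's rank evenly among the pages it links to. It is an illustration under stated assumptions, not the code in pagerank.zip: the function name `simple_page_rank`, the three-page `links` graph, and the choice to omit a damping factor are all made up for the example.

```python
def simple_page_rank(links, iterations=20):
    """links maps each page to the list of pages it links to.

    Assumes every link target also appears as a key in links.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no outbound links keeps its own rank
                new_ranks[page] += ranks[page]
                continue
            share = ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical three-page web: a links to b and c, b to c, c back to a
links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
ranks = simple_page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The total rank stays constant from one iteration to the next; pages that receive links from well-ranked pages end up with a larger share of it.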
Search Engine Architecture
• Web Crawling
• Index Building
• Searching

http://infolab.stanford.edu/~backrub/google.html
Web Crawler
A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner. Web crawlers are
mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded
pages to provide fast searches.

http://en.wikipedia.org/wiki/Web_crawler
Web Crawler
• Retrieve a page
• Look through the page for links
• Add the links to a list of “to be retrieved” sites
• Repeat...

http://en.wikipedia.org/wiki/Web_crawler
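The four steps above can be sketched as a small loop. To keep the example runnable without network access, it crawls a made-up in-memory dictionary (`FAKE_WEB`) rather than real sites; the `crawl` function, the `LinkParser` class, and the `max_pages` limit are illustrative assumptions, not the course's spider.py.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href value of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    to_retrieve = [start_url]          # the "to be retrieved" list
    seen = {start_url}
    visited = []
    while to_retrieve and len(visited) < max_pages:
        url = to_retrieve.pop(0)
        html = fetch(url)              # retrieve a page
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)              # look through the page for links
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:   # add new links to the list
                seen.add(absolute)
                to_retrieve.append(absolute)
    return visited                     # the loop above is the "repeat"

# Hypothetical pages standing in for the real web
FAKE_WEB = {
    'http://example.com/': '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    'http://example.com/a.html': '<a href="/b.html">B</a>',
    'http://example.com/b.html': '<a href="/">Home</a>',
}

visited = crawl('http://example.com/', FAKE_WEB.get)
print(visited)
```

The `seen` set is what stops the loop from retrieving the same page twice when pages link back to each other.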
Web Crawling Policy
• a selection policy that states which pages to download,
• a re-visit policy that states when to check for changes to the pages,
• a politeness policy that states how to avoid overloading Web sites, and
• a parallelization policy that states how to coordinate distributed Web crawlers
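As one small illustration of the first item in the list above, a selection policy could be as simple as "only follow http(s) links on hosts we have chosen". The helper name `same_site` and the sample URLs are assumptions for this sketch; real crawlers usually combine several such rules.

```python
from urllib.parse import urlparse

def same_site(url, allowed_hosts):
    """Selection rule: keep only http(s) URLs on an allowed host."""
    parts = urlparse(url)
    return parts.scheme in ('http', 'https') and parts.netloc in allowed_hosts

candidates = [
    'http://www.dr-chuck.com/csev-blog/',
    'http://example.org/other',
    'mailto:someone@example.org',
]
allowed = {'www.dr-chuck.com'}
selected = [url for url in candidates if same_site(url, allowed)]
print(selected)
```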
robots.txt
• A way for a web site to communicate with web crawlers
• An informal and voluntary standard
• Sometimes folks make a “Spider Trap” to catch “bad” spiders

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
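A polite crawler can check rules like these before retrieving a page. The sketch below feeds the example rules from this slide to Python's standard `urllib.robotparser` module; the crawler name `MyCrawler` and the `example.com` URLs are placeholders chosen for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the slide, supplied as text instead of
# being downloaded from a live site
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed = parser.can_fetch('MyCrawler', 'http://example.com/index.html')
blocked = parser.can_fetch('MyCrawler', 'http://example.com/private/notes.html')
print(allowed, blocked)
```

Since the standard is voluntary, nothing enforces the answer; it is up to the crawler to skip the pages for which `can_fetch` returns False.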
Google Architecture
• Web Crawling
• Index Building
• Searching

http://infolab.stanford.edu/~backrub/google.html
Search Indexing
Search engine indexing collects, parses, and stores data to
facilitate fast and accurate information retrieval. The purpose
of storing an index is to optimize speed and performance in
finding relevant documents for a search query. Without an
index, the search engine would scan every document in the
corpus, which would require considerable time and
computing power.
http://en.wikipedia.org/wiki/Index_(search_engine)
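One common way to avoid scanning every document is an inverted index, which maps each word to the documents that contain it. The sketch below is a toy version under simple assumptions (whitespace tokenizing, lowercase matching, made-up documents); the names `build_index` and `search` are illustrative only.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical corpus of three tiny documents
documents = {
    1: 'python for everybody',
    2: 'web crawler in python',
    3: 'page rank',
}
index = build_index(documents)
hits = search(index, 'python crawler')
print(hits)
```

The index is built once, up front; each query then only touches the entries for its own words instead of reading the whole corpus.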
[Diagram: The Web, spider.py, spider.sqlite, spreset.py, sprank.py, spdump.py, spjson.py, force.js, force.html, d3.js]

(5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog')
(3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm')
(1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/')
(1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm')
4 rows.

http://www.py4e.com/code3/pagerank.zip
Mailing Lists - Gmane
• Crawl the archive of a mailing list
• Do some analysis / cleanup
• Visualize the data as a word cloud and lines

http://www.py4e.com/code3/gmane.zip
Warning: This Dataset is > 1GB
• Do not just point this application at gmane.org and let it run
• There is no rate limit – these are cool folks

Use this for your testing:

http://mbox.dr-chuck.net/sakai.devel/4/5
[Diagram: mbox.dr-chuck.net, gmane.py, content.sqlite, gmodel.py, mapping.sqlite, gbasic.py, gword.py, gword.js, gword.htm, gline.py, gline.js, gline.htm, d3.js]

How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584
Top 5 Email list participants
[email protected] 2657
[email protected] 1742
[email protected] 1591
[email protected] 1304
[email protected] 1184
...

http://www.py4e.com/code3/gmane.zip
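A "top participants" summary like the one above boils down to counting how often each sender appears and keeping the largest counts. The sketch below shows that step with `collections.Counter`; the `top_senders` helper and the sample addresses are invented for the example and are not taken from the course code or data.

```python
from collections import Counter

def top_senders(senders, how_many):
    """Return the how_many most frequent senders with their counts."""
    return Counter(senders).most_common(how_many)

# Hypothetical list with one sender address per message
senders = [
    'a@example.org', 'b@example.org', 'a@example.org',
    'c@example.org', 'a@example.org', 'b@example.org',
]
top = top_senders(senders, 2)
for address, count in top:
    print(address, count)
```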
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance
(www.dr-chuck.com) of the University of Michigan School of
Information and open.umich.edu and made available under a Creative
Commons Attribution 4.0 License. Please maintain this last slide in all
copies of the document to comply with the attribution requirements of
the license. If you make a change, feel free to add your name and
organization to the list of contributors on this page as you republish the
materials.

Initial Development: Charles Severance, University of Michigan School
of Information

… Insert new Contributors here
