The Record of our
Recent Past
Large-Scale Text Mining in Canadian Web Archives
Ian Milligan
Assistant Professor
Why?
The sheer amount of social,
cultural, and political
information generated every
day presents new
opportunities for historians.
Could one
even study
the 1990s
and
beyond
without
web
archives?
No.
Historians need to do this now, or
we’re going to be left behind.
One Case Study
• Archive-It Research
Services/University of
Toronto Libraries: “Canadian
Political Parties and Political
Interest Groups” collection
• 2005 - 2015
• ARC/WARC/WAT Files
Problem One:
Historians want content, but we
can only locally work with
metadata
WAT files
vs.
ARC/WARC files
Do we want metadata
or content analysis?
Historians NEED content,
but metadata can help us
find and contextualize it
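A minimal sketch of how WAT metadata can point towards content, assuming one WAT record's JSON payload is already in hand. The sample record, its URLs, and the function name `outbound_hosts` are made up for illustration. The Envelope / Payload-Metadata / HTML-Metadata / Links nesting follows the usual WAT layout, but it should be checked against real files before relying on it.

```python
import json
from collections import Counter
from urllib.parse import urljoin, urlparse

# Illustrative WAT-style payload (hypothetical data, not from the collection).
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "WARC-Header-Metadata": {
      "WARC-Target-URI": "http://www.example-party.ca/news/index.html",
      "WARC-Date": "2006-01-15T12:00:00Z"
    },
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"path": "A@/href", "url": "/about.html"},
            {"path": "A@/href", "url": "http://www.example-union.ca/campaign"},
            {"path": "A@/href", "url": "http://www.example-union.ca/join"},
            {"path": "IMG@/src", "url": "http://img.example-cdn.com/logo.png"}
          ]
        }
      }
    }
  }
}
"""


def outbound_hosts(wat_record: dict) -> Counter:
    """Count hosts linked to via anchor tags in one WAT record, skipping self-links."""
    envelope = wat_record.get("Envelope", {})
    source = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI", "")
    source_host = urlparse(source).netloc
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    hosts = Counter()
    for link in links:
        if not link.get("path", "").startswith("A@"):
            continue  # keep only anchor links; ignore images, scripts, etc.
        target = urljoin(source, link.get("url", ""))  # resolve relative URLs
        host = urlparse(target).netloc
        if host and host != source_host:
            hosts[host] += 1
    return hosts


record = json.loads(SAMPLE_WAT_JSON)
print(outbound_hosts(record))
```

Aggregating these counts over many records gives a host-to-host link graph, which is one way metadata alone can show which sites are worth pulling from the much larger ARC/WARC files.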
Metadata Extraction
• Results @ http://ianmilligan.ca/2015/02/05/topic-modeling-web-archive-modularity-classes/
Metadata Extraction
• Conservative themes (2014): economic
development, family, immigration, legislation,
women’s issues, senior issues, Ukrainians,
constituency offices, some prominent (and not-so-
prominent) MPs, and of course, our economic
action plan.
• Liberal themes (2014): Justin Trudeau (the new
leader), cuts to social programs, child poverty,
mental health, municipal issues, labour, workers,
Stop the Cuts, and housing.
Metadata Extraction
• Conservative themes (2006): education, universities,
and also tons of information on Aboriginal issues;
• Liberal themes (2006): community questions,
electoral topics, universities, human rights, child
care support.
As well as short stories...
Congress text-mining-event
2005 Canadian Federal Election
WATs help us find the files
we need to use - and to
contextualize them
Problem Two:
You can do amazing things with
the content (WARCs), but you
need a cluster.
WARC Analysis
• 2005-2009: 244 GB of content;
2.9 GB of plain text
• 10,606,822 websites
• On a powerful local node (3
GHz 8-core Intel Xeon E5, 64
GB RAM, data on SSD), about
three to four hours per query
• On a cluster, about 10-20
minutes per query, depending
on traffic
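The figures above imply a rough ratio of plain text to raw content and a rough cluster speed-up. The back-of-the-envelope arithmetic below works both out. Only the numbers come from the slide; the variable names are illustrative.

```python
# Figures from the slide; everything derived below is simple arithmetic.
raw_gb = 244.0                      # WARC content, 2005-2009
text_gb = 2.9                       # extracted plain text
local_minutes = (3 * 60, 4 * 60)    # three to four hours per query on one node
cluster_minutes = (10, 20)          # ten to twenty minutes per query on a cluster

text_share = text_gb / raw_gb
# Smallest gain: fastest local run against slowest cluster run.
slowest_speedup = local_minutes[0] / cluster_minutes[1]
# Largest gain: slowest local run against fastest cluster run.
fastest_speedup = local_minutes[1] / cluster_minutes[0]

print(f"Plain text is about {text_share:.1%} of the raw content")
print(f"Cluster speed-up: roughly {slowest_speedup:.0f}x to {fastest_speedup:.0f}x")
```

So roughly one percent of the archive by size is text, and the cluster turns a half-day wait into something closer to a coffee break.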
Large-Scale Text
Analysis
• With Hadoop, about 15-20
minutes to extract all plain text
for any specified query:
e.g. all pages belonging to the
Green Party, Liberal Party,
Conservative Party, Council of
Canadians, etc.
• Compared to “out of memory” or
going home for an extended
weekend on a local node
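A single-machine sketch of the kind of query described above: keep only pages from specified domains and reduce them to plain text. It uses the Python standard library only and is not the Hadoop job itself. The record list, the domain names, and the helper names `plain_text` and `extract_for_domains` are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def plain_text(html: str) -> str:
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def extract_for_domains(records, domains):
    """Yield (url, text) for records whose host is, or is a subdomain of, a wanted domain."""
    wanted = {d.lower() for d in domains}
    for url, html in records:
        host = urlparse(url).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in wanted):
            yield url, plain_text(html)


# Hypothetical records standing in for (URL, HTML body) pairs read from WARC files.
records = [
    ("http://www.example-party.ca/platform",
     "<html><body><h1>Platform</h1><p>Child care support.</p></body></html>"),
    ("http://www.unrelated.example.com/", "<p>Not wanted</p>"),
]
results = list(extract_for_domains(records, ["example-party.ca"]))
print(results)
```

Each record is processed independently, which is what makes this kind of filter a natural fit for a parallel environment.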
Large-Scale Text Analysis
• NER/LDA/keyword frequency broken
down by scrape date: e.g. for scrapes
carried out 2005-10, see change over
time
• Downside: not everything is optimized
for a parallel environment; if it is not, it crawls
(there goes a day)
• Downside: scrape date != creation
date, so dating content requires separate temporal analysis
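A minimal sketch of keyword frequency broken down by scrape date, assuming plain text has already been extracted and each page carries a 14-digit crawl timestamp. The sample pages and the function name `keyword_counts_by_month` are hypothetical. The grouping is by scrape date, so the caveat above about creation dates still applies.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (scrape_timestamp, text) pairs; timestamps are YYYYMMDDhhmmss.
pages = [
    ("20051012000000", "Child care and housing were central. Housing costs rose."),
    ("20060115000000", "The election platform stressed child care."),
    ("20060120000000", "Housing, housing, and more housing."),
]


def keyword_counts_by_month(pages, keywords):
    """Return {YYYY-MM: {keyword: count}} for the given keywords, grouped by scrape month."""
    wanted = {k.lower() for k in keywords}
    counts = defaultdict(Counter)
    for stamp, text in pages:
        month = f"{stamp[:4]}-{stamp[4:6]}"
        # Lowercase and split on anything that is not a letter or apostrophe.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in wanted:
                counts[month][token] += 1
    return {month: dict(c) for month, c in sorted(counts.items())}


print(keyword_counts_by_month(pages, ["housing", "election"]))
```

Plotting these monthly counts gives a simple change-over-time view of a term across successive scrapes.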
Trial Three:
Culturomics for Web
Archives?
(switch to browser)
Some code/walkthroughs/
sample data available at
https://github.com/ianmilligan1/WAHR
Thank you!
Ian Milligan
Assistant Professor
https://uwaterloo.ca/web-archive-group/
