Data Mining Tutorial: Session 2: Stack Overflow Data Set

The document describes preprocessing a Stack Overflow dataset for use in an Apriori frequent itemset mining experiment. It involves decompressing the dataset files without extracting to disk, streaming XML processing to select only question posts and their tags, and outputting the tags to a file. This preprocessing reduces the dataset size to around 50 MB uncompressed or 15 MB gzipped, suitable for the Apriori analysis.
Data Mining Tutorial
Session 2: Stack Overflow data set

Erich Schubert, Eirini Ntoutsi
Ludwig-Maximilians-Universität München
2012-05-10, KDD class tutorial

Outline: Introduction, Downloading, Preprocessing, Apriori FIM, Conclusions


Stack Overflow
Introduction to “SO”

Stack Overflow is a programming Q&A website: http://stackoverflow.com/

- Users post programming questions
- Other users post answers
- Up- and downvotes on questions and answers
- Awards for good answers, questions and active users
- Tags to organize questions
- Moderation by users with high reputation
- Size: 2.8m questions, 5.8m answers, 11m comments, 22m votes, 30k tags

(Yes, you could post your homework questions there. This is not recommended, as teachers usually want you to solve the problems yourself so that you learn from the problem, not from the solution. Plus, a good question there should already contain source code.)

Stack Overflow
Screenshot of first question

First (non-deleted) question on SO:
[Screenshot not reproduced]

Stack Overflow data set
Getting the data

Stack Overflow publishes data dumps: http://blog.stackoverflow.com/category/cc-wiki-dump/

- A torrent download of about 5 GB, 7zip-compressed.
- About 4 GB for the main stackoverflow site.
- The main .xml file is 8 GB, the post history is 11 GB.

So this means:

- That is pretty big!
- Maybe not everyone here should download it.
- I will not demo this live, but provide result data for you.
- You cannot load the XML into a DOM parser.
- In fact, you might even be unable to decompress it (due to the 4 GB file size limit of file systems such as FAT32).

Stack Overflow data set
A first peek inside the 7zip file

    > 7z l stackoverflow.com.7z.001
    Date       Time             Size   Compressed  Name
    ------------------- ------------ ------------  ------------------------
    2011-09-06 20:05:06    170594039    457479414  092011 Stack Overflow/badges.xml
    2011-09-06 20:04:52   1916999879               092011 Stack Overflow/comments.xml
    2011-09-06 21:10:00  10958639384   1841985260  092011 Stack Overflow/posthistory.xml
    2011-09-06 19:57:53   7569879502   1454543990  092011 Stack Overflow/posts.xml
    2011-09-06 20:00:50    193250161    132626278  092011 Stack Overflow/users.xml
    2011-09-06 19:59:56   1346527241               092011 Stack Overflow/votes.xml
    2011-06-13 09:26:12         1786               092011 Stack Overflow/license.txt
    2011-09-06 19:41:44         4780               092011 Stack Overflow/readme.txt
    2011-09-06 20:05:07            0            0  092011 Stack Overflow
    ------------------- ------------ ------------  ------------------------
                         22155896772   3886634942  8 files, 1 folders

We are interested in the posts.xml file for our Apriori experiment.

Stack Overflow data set
Preprocessing: the plan

So we need to preprocess the data to get it down to a “workable” size. Here is what we have to do:

- Decompress into a stream, not to the hard disk.
- Process the XML as a stream with an XML pull parser (to avoid loading the full XML file at once).
- Select the interesting data only, and dump only these parts.
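The three steps of the plan can be tried on a tiny example before touching the real archive. The following is only a minimal sketch in Python 3 under stated assumptions: `SAMPLE` is a made-up stand-in for posts.xml, gzip stands in for the 7z stream, `iter_question_tags` is a hypothetical helper name, and `xml.etree.ElementTree.iterparse` is used as the incremental parser instead of a pull parser.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for posts.xml (not taken from the real dump).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Tags="&lt;c#&gt;&lt;winforms&gt;" />
  <row Id="2" PostTypeId="2" />
  <row Id="3" PostTypeId="1" Tags="&lt;php&gt;" />
</posts>
"""


def iter_question_tags(stream):
    """Yield the decoded Tags attribute of each question, one row at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "1":
            yield elem.get("Tags", "")
        elem.clear()  # drop the element's attributes and text again


# Step 1: compress the sample in memory, then decompress it as a stream
# (gzip is an assumption here; the real data would come from 7z).
compressed = gzip.compress(SAMPLE)
with gzip.open(io.BytesIO(compressed), "rb") as stream:
    # Steps 2 and 3: parse incrementally and keep only the question tags.
    selected = list(iter_question_tags(stream))

print(selected)
```

On the real data, the stream would presumably come from the 7z process rather than from gzip; the parsing loop itself would not need to change.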
So, let us have a peek into the posts.xml file.

Stack Overflow data set
Preprocessing: posts.xml file

Inspect the file with:

    7z x -so stackoverflow.com.7z.001 "092011 Stack Overflow/posts.xml" | less

    <?xml version="1.0" encoding="utf-8"?>
    <posts>
    <row Id="4" PostTypeId="1" AcceptedAnswerId="7"
    CreationDate="2008-07-31T21:42:52.667" Score="83" ViewCount="7351"
    Body="&lt;p&gt;I'm new to C#, and I want to use a track-bar to change a form's
    opacity.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;This is my
    code:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;decimal trans = trackBar1.Value
    / 5000&#xA;this.Opacity =
    trans&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;When I try to build it, I
    get this error:&lt;/p&gt;&#xA;&#xA;&lt;blockquote&gt;&#xA; &lt;p&gt;Cannot
    implicitly convert type 'decimal' to
    'double'&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&#xA;&lt;p&gt;I tried making
    &lt;code&gt;trans&lt;/code&gt; a double, but then the control doesn't work.
    This code worked fine for me in VB.NET. &lt;/p&gt;&#xA;&#xA;&lt;p&gt;What do I
    need to do differently?&lt;/p&gt;&#xA;" OwnerUserId="8"
    LastEditorUserId="140328" LastEditorDisplayName="Rich B"
    LastEditDate="2011-08-20T23:22:24.213"
    LastActivityDate="2011-08-31T19:42:18.077" Title="When setting a form's opacity
    should I use a decimal or double?" Tags="&lt;c#&gt;&lt;winforms&gt;"
    AnswerCount="12" CommentCount="19" FavoriteCount="13" />
    [...]
    </posts>

Stack Overflow data set
Preprocessing: posts.xml processing

It looks worse than it is. The Tags="..." attribute is quite simple; it decodes to:

    <c#><winforms>

- Stream via 7z x -so
- Process one <row ../> element at a time
- Extract the Tags="..." attribute
- Extract the individual tags with the pattern <([^>]+)>
- Output the tags for use in Apriori into an ASCII file
- Do not include anything else: no ID, no title (we do not need them here)
- Worst-case size estimate: 2.8m questions × up to 5 tags × 20 characters = 280 MB
- Reality: 50 MB uncompressed, 15 MB gzipped.
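The tag extraction step can be checked on the single Tags value shown above. This is a minimal sketch in Python 3; the variable names are made up for illustration, and `html.unescape` only mimics the entity decoding that an XML parser performs on its own.

```python
import html
import re

# The attribute value as it appears in the raw XML text.
raw = "&lt;c#&gt;&lt;winforms&gt;"

# An XML parser decodes the entities itself; html.unescape mimics that here.
decoded = html.unescape(raw)

# Each tag is a run of non-'>' characters between '<' and '>'.
tag_pattern = re.compile(r"<([^>]+)>")
tags = tag_pattern.findall(decoded)

# A greedy '<.*>' matches from the first '<' to the last '>' instead,
# so it cannot separate the individual tags.
greedy = re.findall(r"<(.*)>", decoded)

print(decoded)
print(tags)
print(greedy)
```

The comparison with the greedy pattern shows why the character class `[^>]` is needed: without it, both tags end up in a single match.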

Stack Overflow data set
Preprocessing: posts.xml in Python

    #!/usr/bin/env python3
    import subprocess, sys, re
    from xml.dom import pulldom

    archive = "stackoverflow.com.7z.001"
    xmlfile = "092011 Stack Overflow/posts.xml"

    # Let 7z write the extracted file to stdout (-so) instead of to disk.
    cmd = ["7z", "x", "-so", archive, xmlfile]
    proc = subprocess.Popen(cmd, stdin=None,
                            stdout=subprocess.PIPE, shell=False)
    # The pull parser reads the stream incrementally, never the whole file.
    events = pulldom.parse(proc.stdout)

    for event, node in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == "row":
                # Build a DOM subtree for this one element only.
                events.expandNode(node)
                # NEXT SLIDE (must be defined before this loop runs)
                processRow(node)
                # print(node.toxml())

    proc.stdout.close()
    proc.wait()

Stack Overflow data set
Preprocessing: posts.xml in Python

And the main processing function:

    tagre = re.compile("<([^>]+)>")

    def processRow(node):
        # Questions (type 1) only
        typ = node.getAttribute("PostTypeId")
        if typ != "1":
            return
        # Get Tags attribute
        tags = node.getAttribute("Tags")
        if not tags:
            return
        # Remove the <> wrappers, separate by space.
        print(" ".join(tagre.findall(tags)))
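The parsing loop and the row function can be exercised without the archive. The following is a minimal sketch in Python 3, assuming a made-up three-row sample in place of the 7z stream; `row_to_line` and `extract_lines` are hypothetical names, and the lines are collected in a list instead of being printed so that the result can be inspected.

```python
import io
import re
from xml.dom import pulldom

# Hypothetical sample rows; only the attributes used below are included.
SAMPLE = b"""<posts>
<row Id="4" PostTypeId="1" Tags="&lt;c#&gt;&lt;winforms&gt;" />
<row Id="7" PostTypeId="2" />
<row Id="9" PostTypeId="1" Tags="&lt;c#&gt;&lt;.net&gt;&lt;datetime&gt;" />
</posts>"""

TAG_RE = re.compile("<([^>]+)>")


def row_to_line(node):
    """Return the space-separated tags of a question row, or None otherwise."""
    if node.getAttribute("PostTypeId") != "1":
        return None
    tags = node.getAttribute("Tags")
    if not tags:
        return None
    return " ".join(TAG_RE.findall(tags))


def extract_lines(stream):
    """Pull-parse the stream and collect one output line per question."""
    lines = []
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "row":
            events.expandNode(node)
            line = row_to_line(node)
            if line is not None:
                lines.append(line)
    return lines


lines = extract_lines(io.BytesIO(SAMPLE))
print(lines)
```

The answer row (PostTypeId 2) is skipped, so only two lines remain, in the same format as the preprocessed data file.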

Stack Overflow data set
Preprocessed data file: all-tags.txt

Resulting data set:

    c# winforms
    html css internet-explorer-7
    c# conversion j#
    c# datetime
    c# .net datetime timespan
    html browser time timezone
    c# math
    c# linq web-services .net-3.5
    mysql database
    performance algorithm language-agnostic unix pi
    php
    mysql database triggers
    c++ c sockets mainframe zos
    flex actionscript-3
    sql-server datatable
    c# .net vb.net timer

Now, let us do some Apriori on this data set!


Planning Apriori

Weka unfortunately does not scale well to a data set of this size. Plus, we would first need to convert it into an .arff file for Weka.

Why not just write Apriori ourselves?

Choosing appropriate minsup values might be tricky, too. So we will just look at the top itemsets in each run.

With just 50 MB of uncompressed data, we should be able to keep all of it in memory!
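Since minsup is a threshold on the support of an itemset, it helps to see what is being counted. This is a minimal sketch in Python 3 using the usual definition (the number of records that contain all items of the itemset); the function name `support` is made up, and the six records are copied from the all-tags.txt sample.

```python
# A few records copied from the all-tags.txt sample.
db = [
    ["c#", "winforms"],
    ["c#", "datetime"],
    ["c#", ".net", "datetime", "timespan"],
    ["mysql", "database"],
    ["mysql", "database", "triggers"],
    ["c#", ".net", "vb.net", "timer"],
]


def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for record in transactions if wanted.issubset(record))


print(support(["c#"], db))
print(support(["c#", ".net"], db))
print(support(["mysql", "database"], db))
print(support(["c#", "mysql"], db))
```

With a minimum support of 2, the first three itemsets would count as frequent here and the last one would not.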

Loading the data with Python

Python is for lazy people. Loading text data is easy:

    #!/usr/bin/env python3
    import gzip

    db = []
    # Text mode ("rt") yields str lines rather than bytes.
    with gzip.open("all-tags.txt.gz", "rt", encoding="utf-8") as infile:
        for line in infile:
            # One question per line. The tags are sorted so that they
            # match the sorted token order of the candidate itemsets.
            db.append(sorted(line.strip().split(" ")))

    print("Database size:", len(db))

Output:

    Database size: 2012348


Itemset class

Class to represent an itemset:

    class itemset(object):
        def __init__(self, tokens, support=0):
            self.tokens = list(tokens)
            self.support = support

        def tokenstr(self):
            return "+".join(self.tokens)

        def __str__(self):
            return self.tokenstr() + ": " + str(self.support)

        # Itemsets compare and hash by their tokens only, not by support,
        # so they can be sorted by tags and used as dictionary keys.
        def __eq__(self, other):
            return self.tokenstr() == other.tokenstr()

        def __lt__(self, other):
            return self.tokenstr() < other.tokenstr()

        def __hash__(self):
            return hash(self.tokenstr())

Computing the 1-Itemsets

    oneitems = dict()
    for rec in db:
        for tag in rec:
            item = itemset([tag])
            # Reuse the stored itemset for this tag if there already is one.
            item = oneitems.setdefault(item, item)
            item.support += 1
    oneitems = list(oneitems.keys())
    # Inspect: sort by descending support.
    oneitems.sort(key=lambda s: s.support, reverse=True)
    print(len(oneitems), [str(s) for s in oneitems[:10]])
    print(oneitems[100], oneitems[200])

Output:

    29551 ['c#: 211338', 'java: 153561', 'php: 142125',
    'javascript: 126296', 'jquery: 109129', 'iphone: 96748',
    'android: 93247', '.net: 89646', 'asp.net: 88938',
    'c++: 84777']
    asp.net-mvc-2: 7219 c#-4.0: 4055
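For comparison, the same counting can be written with the standard library alone. This is a minimal sketch in Python 3 that swaps the itemset class for `collections.Counter` on plain tag strings; it is an alternative formulation, not the code used in this tutorial, and the six records are copied from the all-tags.txt sample.

```python
from collections import Counter

# A few records copied from the all-tags.txt sample.
db = [
    ["c#", "winforms"],
    ["c#", "datetime"],
    ["c#", ".net", "datetime", "timespan"],
    ["mysql", "database"],
    ["mysql", "database", "triggers"],
    ["c#", ".net", "vb.net", "timer"],
]

# Count in how many records each tag occurs (tags are unique per record).
counts = Counter(tag for rec in db for tag in rec)

# Tags ranked by descending support, like the sorted list of 1-itemsets.
ranked = counts.most_common()

print(len(counts), ranked[:1])
```

The itemset class is still needed later on, because the larger itemsets carry several tokens and a support value each.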

Computing the 1-Itemsets

So we will try with minsupport = 1000:

    minsupport = 1000
    # Keep only the frequent 1-itemsets.
    oneitems = [x for x in oneitems if x.support >= minsupport]
    itemsets = [oneitems]
    print(len(oneitems))

Output:

    777

Apriori-Gen

Generating candidates:

    import bisect

    def contains(sorteditems, item):
        # Assumed helper (not shown on the slides): binary search
        # in the list of itemsets, which is sorted by tags.
        pos = bisect.bisect_left(sorteditems, item)
        return pos < len(sorteditems) and sorteditems[pos] == item

    def apriorigen(curitems):
        curitems.sort()  # by tags
        for i in range(0, len(curitems) - 1):
            toka = curitems[i].tokens
            for j in range(i + 1, len(curitems)):
                tokb = curitems[j].tokens
                # Prefix test:
                if not toka[:-1] == tokb[:-1]: break
                cand = toka + tokb[-1:]  # Extend with last
                # Pruning test: the two subsets without one of the last
                # two tokens are toka and tokb, so skip those.
                ok = True
                for k in range(len(cand) - 2):
                    t = cand[:k] + cand[k+1:]  # without k
                    if not contains(curitems, itemset(t)):
                        ok = False
                        break
                if ok: yield itemset(cand)  # generate itemset
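The join and prune steps are easier to follow on a hand-sized input. The following is a minimal sketch in Python 3 that uses plain sorted tuples instead of the itemset class; `generate_candidates` is a hypothetical name and the four frequent pairs are made up for illustration.

```python
from itertools import combinations


def generate_candidates(frequent):
    """Join frequent k-itemsets sharing a (k-1)-prefix, then prune.

    `frequent` is a collection of sorted tuples, all of the same length k.
    """
    frequent = sorted(frequent)
    known = set(frequent)
    candidates = []
    for i, a in enumerate(frequent):
        for b in frequent[i + 1:]:
            if a[:-1] != b[:-1]:
                break  # sorted input: no later itemset shares the prefix
            cand = a + b[-1:]
            # Prune: every k-subset of the candidate must be frequent.
            if all(sub in known for sub in combinations(cand, len(a))):
                candidates.append(cand)
    return candidates


# Made-up frequent 2-itemsets.
frequent_pairs = [
    ("c#", "wpf"), ("c#", "xaml"), ("wpf", "xaml"),
    ("c#", "linq"),
]
candidates = generate_candidates(frequent_pairs)
print(candidates)
```

Here the join step proposes c#+linq+wpf, c#+linq+xaml and c#+wpf+xaml, but only the last one survives pruning, because linq+wpf and linq+xaml are not frequent.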

Apriori FIM

Main loop for FIM:

    import itertools

    while True:
        size = len(itemsets) + 1
        cand = dict()
        for c in apriorigen(itemsets[-1]): cand[c] = c
        if len(cand) == 0: break
        # Count the support of all candidates in one pass over the data.
        # Assumes the tags of each record are in sorted order,
        # like the tokens of the candidates.
        for rec in db:
            for subset in itertools.combinations(rec, size):
                subset = cand.get(itemset(subset))
                if subset is not None: subset.support += 1
        itemsets.append([i for i in cand if i.support >= minsupport])
        itemsets[-1].sort(key=lambda s: s.support, reverse=True)
        print(size, [str(i) for i in itemsets[-1][:5]])

Output:

    ['ruby+ruby-on-rails: 12266', 'c#+winforms: 12236',
    'android+java: 11252', 'c#+wpf: 11143', 'ios+iphone: 10431']
    ['c#+wpf+xaml: 1600', 'mysql+php+sql: 1064']
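The whole level-wise procedure also fits into one short function. This is a minimal self-contained sketch in Python 3 that uses sorted tuples and `collections.Counter` in place of the itemset class; the function name `apriori` and the choice of minsupport = 2 are assumptions for illustration, and the six records are copied from the all-tags.txt sample.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, minsupport):
    """Return one dict per itemset size, mapping sorted tuples to support."""
    records = [tuple(sorted(set(rec))) for rec in transactions]
    counts = Counter((tag,) for rec in records for tag in rec)
    level = {items: n for items, n in counts.items() if n >= minsupport}
    levels = []
    while level:
        levels.append(level)
        size = len(levels) + 1
        # Join step: merge itemsets that share all but their last item.
        prev = sorted(level)
        candidates = set()
        for i, a in enumerate(prev):
            for b in prev[i + 1:]:
                if a[:-1] != b[:-1]:
                    break
                cand = a + b[-1:]
                # Prune step: all subsets one item smaller must be frequent.
                if all(s in level for s in combinations(cand, size - 1)):
                    candidates.add(cand)
        # Count step: one pass over the data per itemset size.
        counts = Counter()
        for rec in records:
            for subset in combinations(rec, size):
                if subset in candidates:
                    counts[subset] += 1
        level = {items: n for items, n in counts.items() if n >= minsupport}
    return levels


# A few records copied from the all-tags.txt sample.
db = [
    ["c#", "winforms"],
    ["c#", "datetime"],
    ["c#", ".net", "datetime", "timespan"],
    ["mysql", "database"],
    ["mysql", "database", "triggers"],
    ["c#", ".net", "vb.net", "timer"],
]

levels = apriori(db, minsupport=2)
for size, level in enumerate(levels, start=1):
    print(size, sorted(level.items(), key=lambda kv: -kv[1]))
```

On these six records the run stops after the 2-itemsets: no two frequent pairs share a first item, so no 3-itemset candidate is generated.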

More frequent itemsets

Setting minsupport = 500 finds more itemsets:

Output:

    ['ruby+ruby-on-rails: 12266', 'c#+winforms: 12236', 'android+java: 11252',
    'c#+wpf: 11143', 'ios+iphone: 10431', 'cocoa-touch+iphone: 10371',
    'c#+linq: 8184', 'c+c++: 8131', 'ajax+javascript: 7951', 'java+swing: 7296',
    'mysql+sql: 7018', 'cocoa+objective-c: 6670', 'cocoa-touch+objective-c: 6598',
    'jquery+php: 6564', 'ipad+iphone: 6311', 'asp.net+javascript: 5761',
    'sql-server+tsql: 5728', 'jquery+jquery-ui: 5728', 'eclipse+java: 5537',
    'hibernate+java: 5336']

    ['c#+wpf+xaml: 1600', 'mysql+php+sql: 1064', 'cocoa-touch+ios+iphone: 935',
    'cocoa-touch+iphone+uikit: 879', 'hibernate+java+orm: 838',
    'activerecord+ruby+ruby-on-rails: 783', 'c#+databinding+wpf: 754',
    'gui+java+swing: 730', 'oracle+plsql+sql: 698',
    'cocoa-touch+iphone+uitableview: 693', 'cocoa+cocoa-touch+iphone: 652',
    'cocoa+cocoa-touch+objective-c: 570', 'database+mysql+sql: 567',
    'database+database-design+mysql: 532', 'ios+iphone+uitableview: 514',
    'c#+silverlight+windows-phone-7: 513', 'cocoa-touch+ipad+iphone: 500']

Conclusions

- The results were okay, but not very surprising (in fact, most results are very obvious!)
- The data contains redundant tags (mysql, sql)
- The 5-tag limit affects the output
- Data mining does not guarantee new results, unfortunately
