Data Mining Tutorial: Session 2: Stack Overflow Data Set

Erich Schubert, Eirini Ntoutsi
Ludwig-Maximilians-Universität München

Stack Overflow

Stack Overflow is a programming Q&A website:
http://stackoverflow.com/

- Users post programming questions
- Other users post answers
- Up- and downvotes on questions and answers
- Awards for good answers, questions and active users
- Tags to organize questions
- Moderation by users with high reputation
- Size: 2.8m questions, 5.8m answers, 11m comments, 22m votes, 30k tags

(Yes, you could post your homework questions there. This is not recommended,
as usually the teachers want you to solve the problems yourself to learn from the
problem, not the solution. Plus, a good question there should already contain
source code)

Stack Overflow
Screenshot of first question

First (non-deleted) question on SO:
[Screenshot of the first non-deleted Stack Overflow question]

Stack Overflow data set
Getting the data

Stack Overflow publishes data dumps:
http://blog.stackoverflow.com/category/cc-wiki-dump/

> 7z l stackoverflow.com.7z.001

      Date    Time             Size   Compressed  Name
------------------- ------------ ------------ ------------------------
2011-09-06 20:05:06     170594039    457479414  092011 Stack Overflow/badges.xml
2011-09-06 20:04:52    1916999879               092011 Stack Overflow/comments.xml
2011-09-06 21:10:00   10958639384   1841985260  092011 Stack Overflow/posthistory.xml
2011-09-06 19:57:53    7569879502   1454543990  092011 Stack Overflow/posts.xml
2011-09-06 20:00:50     193250161    132626278  092011 Stack Overflow/users.xml
2011-09-06 19:59:56    1346527241               092011 Stack Overflow/votes.xml
2011-06-13 09:26:12          1786               092011 Stack Overflow/license.txt
2011-09-06 19:41:44          4780               092011 Stack Overflow/readme.txt
2011-09-06 20:05:07             0            0  092011 Stack Overflow
------------------- ------------ ------------ ------------------------
                      22155896772   3886634942  8 files, 1 folders

Stack Overflow data set
Preprocessing

So we will need to preprocess the data to get it into a "workable" size.
Here is what we have to do:

- Decompress into a stream, not to the hard disk.
- Streaming XML processing using XML pull (to avoid processing the full XML file at once)
- Select only the interesting data and dump just those parts

So, let us have a peek into the posts.xml file.

Stack Overflow data set
Preprocessing — posts.xml file

Inspect the file with:
  7z x -so stackoverflow.com.7z.001 "092011 Stack Overflow/posts.xml" | less

<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7"
CreationDate="2008-07-31T21:42:52.667" Score="83" ViewCount="7351"
Body="&lt;p>I'm new to C#, and I want to use a track-bar to change a form's
opacity.&lt;/p>
&#xA;&lt;p>This is my code:&lt;/p>
&#xA;&lt;pre>&lt;code>decimal trans = trackBar1.Value / 5000&#xA;this.Opacity = trans
&lt;/code>&lt;/pre>
&#xA;&lt;p>When I try to build it, I get this error:&lt;/p>
&#xA;&lt;blockquote>&#xA; &lt;p>Cannot implicitly convert type 'decimal' to 'double'&lt;/p>&#xA;&lt;/blockquote>
&#xA;&lt;p>I tried making &lt;code>trans&lt;/code> a double, but then the control doesn't work.
This code worked fine for me in VB.NET. &lt;/p>
&#xA;&lt;p>What do I need to do differently?&lt;/p>&#xA;" OwnerUserId="8"
LastEditorUserId="140328" LastEditorDisplayName="Rich B"
LastEditDate="2011-08-20T23:22:24.213"
LastActivityDate="2011-08-31T19:42:18.077" Title="When setting a form's opacity
should I use a decimal or double?" Tags="&lt;c#>&lt;winforms>"
AnswerCount="12" CommentCount="19" FavoriteCount="13" />
[...]
</posts>

Stack Overflow data set
Preprocessing — posts.xml processing

Looks worse than it is:
The Tags="..." attribute is quite simple, it decodes to:
  <c#><winforms>

- Extract the Tags="..." attribute
- Extract the individual tags with the pattern <([^>]+)>, one match per <...> group
  (see the sketch below)
- Output the tags into an ASCII file for use in Apriori
- Do not include anything else: no ID, no title — we do not need them here
- Worst-case size estimate: 2.8m questions × up to 5 tags × 20 characters = 280 MB
- Reality: 50 MB uncompressed, 15 MB gzipped.

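As a quick check (a minimal sketch, not part of the tutorial script), the pattern really does split a decoded Tags value into single tags:

import re

tagre = re.compile("<([^>]+)>") # one group per <...> tag
print tagre.findall("<c#><winforms>") # prints ['c#', 'winforms']
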
Stack Overflow data set
Preprocessing — posts.xml in python

#!/usr/bin/env python
import subprocess, sys, re
from xml.dom import pulldom

archive = "stackoverflow.com.7z.001"
xmlfile = "092011 Stack Overflow/posts.xml"

cmd = ["7z", "x", "-so", archive, xmlfile]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
# ... process proc.stdout here (see the next slide) ...
proc.stdout.close()

Stack Overflow data set
Preprocessing — posts.xml in python

And the main processing function:

tagre = re.compile("<([^>]+)>")

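The remaining processing can be sketched as follows, assuming the proc pipe from the previous slide and the all-tags.txt.gz output file that is read back later (the actual tutorial code may differ in details):

import gzip

out = gzip.open("all-tags.txt.gz", "w")
for event, node in pulldom.parse(proc.stdout):
    # Handle each <row .../> element as it streams by
    if event == pulldom.START_ELEMENT and node.tagName == "row":
        tags = tagre.findall(node.getAttribute("Tags"))
        if tags: # only questions carry tags
            out.write(" ".join(tags) + "\n")
out.close()

Writing gzip output directly keeps the result at roughly the 15 MB estimated before.
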
Resulting data set:

c# winforms
html css internet-explorer-7
c# conversion j#
c# datetime
c# .net datetime timespan
html browser time timezone
c# math
c# linq web-services .net-3.5
mysql database
performance algorithm language-agnostic unix pi
php
mysql database triggers
c++ c sockets mainframe zos
flex actionscript-3
sql-server datatable
c# .net vb.net timer

Weka unfortunately does not scale up well to a data set of this size.
Plus, we would first need to convert the data into the .arff format for Weka.

Why not just write Apriori ourselves?

Choosing appropriate minsup values might be tricky, too.
So we will just look at the top itemsets in each run.
With just 50 MB of uncompressed data, we should be able to keep it all in memory!

Loading the data with python

Python is for lazy people. Loading the text data is easy:

#!/usr/bin/env python
import gzip

db = []
for line in gzip.open("all-tags.txt.gz"):
    db.append(line.strip().split(" "))

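A quick sanity check (a hypothetical snippet, not part of the tutorial):

print len(db) # number of tagged questions
print db[0]   # first record, e.g. ['c#', 'winforms']
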
Class to represent an itemset:

class itemset():
    def __init__(self, tokens, support=0):
        self.tokens = sorted(tokens) # keep sorted, so equal tag sets match
        self.support = support

    def tokenstr(self):
        return "+".join(self.tokens)

    def __hash__(self):
        return hash(self.tokenstr())

    def __cmp__(self, other): # order (and compare) by tokens
        return cmp(self.tokens, other.tokens)

    def __str__(self): # e.g. "c#: 211338"
        return self.tokenstr() + ": " + str(self.support)

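A small illustration (not from the tutorial) of the dictionary behaviour the counting code below relies on: two itemsets over the same tags act as a single key.

a = itemset(["c#", "winforms"])
b = itemset(["winforms", "c#"])
counts = {a: a}
print counts[b] is a # True: same hash, equal by tags
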
Computing the 1-Itemsets

oneitems = dict()
for rec in db:
    for tag in rec:
        item = itemset([tag])
        item = oneitems.setdefault(item, item)
        item.support += 1
oneitems = list(oneitems.keys())

# Inspect:
oneitems.sort(lambda a,b: cmp(b.support, a.support))
print len(oneitems), map(str, oneitems[:10])
print str(oneitems[100]), str(oneitems[200])

Output:
29551 ['c#: 211338', 'java: 153561', 'php: 142125',
'javascript: 126296', 'jquery: 109129', 'iphone: 96748',
'android: 93247', '.net: 89646', 'asp.net: 88938',
'c++: 84777']
asp.net-mvc-2: 7219 c#-4.0: 4055

Computing the 1-Itemsets

So we will try with minsupport = 1000:

minsupport = 1000
oneitems = filter(
    lambda x: x.support >= minsupport,
    oneitems)
itemsets = [oneitems]
print len(oneitems)

Output:
777

Apriori-Gen

Generating candidates:

def apriorigen(curitems):
    curitems.sort() # by tags
    for i in range(0, len(curitems) - 1):
        toka = curitems[i].tokens
        for j in range(i + 1, len(curitems)):
            tokb = curitems[j].tokens
            # Prefix test:
            if not toka[:-1] == tokb[:-1]: break
            cand = toka + tokb[-1:] # Extend with last
            # Pruning test:
            ok = True
            for k in range(len(cand) - 2):
                t = cand[:k] + cand[k+1:] # without item k
                if not contains(curitems, itemset(t)):
                    ok = False
                    break
            if ok: yield itemset(cand) # generate itemset

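The contains() helper used in the pruning test is not shown on this slide. Since curitems has just been sorted, one possible version (a sketch, relying on the itemset ordering defined above) is a binary search; a plain "itemset(t) in curitems" membership test would also work, just in linear time:

import bisect

def contains(sortedlist, item):
    # Binary search for item in the sorted list of previous-level itemsets
    pos = bisect.bisect_left(sortedlist, item)
    return pos < len(sortedlist) and sortedlist[pos] == item
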
Apriori FIM

Main loop for FIM:

import itertools

while True:
    size = len(itemsets) + 1
    cand = dict()
    for c in apriorigen(itemsets[-1]): cand[c] = c
    if len(cand) == 0: break
    for rec in db:
        for subset in itertools.combinations(rec, size):
            subset = cand.get(itemset(subset))
            if subset: subset.support += 1
    itemsets.append(filter(
        lambda i: i.support >= minsupport, cand))
    itemsets[-1].sort(lambda a,b: cmp(b.support, a.support))
    print size, map(str, itemsets[-1][:5])

Output:
['ruby+ruby-on-rails: 12266', 'c#+winforms: 12236',
'android+java: 11252', 'c#+wpf: 11143', 'ios+iphone: 10431']
['c#+wpf+xaml: 1600', 'mysql+php+sql: 1064']

More frequent itemsets

Setting minsupport = 500 finds more itemsets:

Output:
['ruby+ruby-on-rails: 12266', 'c#+winforms: 12236', 'android+java: 11252',
'c#+wpf: 11143', 'ios+iphone: 10431', 'cocoa-touch+iphone: 10371',
'c#+linq: 8184', 'c+c++: 8131', 'ajax+javascript: 7951', 'java+swing: 7296',
'mysql+sql: 7018', 'cocoa+objective-c: 6670', 'cocoa-touch+objective-c: 6598',
'jquery+php: 6564', 'ipad+iphone: 6311', 'asp.net+javascript: 5761',
'sql-server+tsql: 5728', 'jquery+jquery-ui: 5728', 'eclipse+java: 5537',
'hibernate+java: 5336']

Conclusions

- The results were okay, but not very surprising (in fact, most results are very obvious!)
- The data contains redundant tags (mysql, sql)
- The 5-tag limit affects the output
- Data mining does not guarantee new results, unfortunately