Unsolved problems in machine learning
associated with the Open Mind Initiative
David G. Stork
Ricoh Silicon Valley
stork@[Link]
The Open Mind Initiative: A collaborative
framework for collecting and learning from netizen
contributions to open source 'intelligent' software
Outline
One-sentence description of Open Mind
Background
Open Mind Initiative
Projects
Unsolved problems in machine learning
Conclusions
Open Mind Initiative
A collaborative framework (based on
traditional open source methodology)
for developing “intelligent” software,
where...
» domain experts provide algorithms,
» tool developers provide software
infrastructure and tools, and
» non-expert netizens provide raw data.
Description Background Open Mind Projects Unsolved problems Conclusions
E-community & Open Source waves
GNU, SendMail, emacs, Apache
Linux
» 10M lines; 20M seats; dbl. time 6 mo.,
» 10^5 contributors
Newhoo! [Link]
» Open web directory
» 1.5M sites; 22,229 editors; 218,651 categories
Infomedia
» Open source encyclopedia
Growth of new software methods
1990: 10^5 programmers → 1995: Linux
1995: 10^6 web authors → 1999: Newhoo!
1999: 10^9 netizens → 2003: Open Mind
New communication allows communities and
collaboration, and thus new software methods
Opportunities expand to less-skilled users
Pattern recognition
and intelligent systems
Recognizer = Theory + Model + Data
Theory: excellent
Models: depend on problem
Data: there’s never enough!
» “the group with the most data wins”
» the internet lowers the cost of data
collection
Software tools
Tools for customization/experimentation
» CSLU
» Nuance
» HTK
Non-experts can use many of these!
Open Mind Initiative
Three main functions, provided by
» Domain Experts
– fundamental algorithms, process control,
education/proselytizing, ...
» Tool developers
– software infrastructure, tools, ...
» Netizens
– raw data, low-level bug reports, ...
Domain Experts
Provide algorithms (e.g., grammatical inference,
HMMs, neural nets,...)
Process control, data truthing
» detect outliers for review/rejection
» data “voting”
» catch trials
» signal detection theory (d′)
» bias avoidance
Trend to publish data and algorithms on the web
More university work is being done in open source
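The signal detection measure d′ listed under data truthing can be sketched as follows. This is a minimal illustration, not the Initiative's procedure: the function name, the four count arguments, and the 0.5 correction for extreme rates are all assumptions made for the example.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    # Assumed correction: add 0.5 to each cell so that rates of exactly
    # 0 or 1 do not produce infinite z-scores.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)
```

A netizen who responds at chance gets d′ near 0; larger values indicate that the netizen's labels separate the two classes more reliably.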
Tool/infrastructure developers
Get maximum information for minimum
netizen effort
» learning with queries (e.g., informative
patterns)
Make it easy (fast) for contributors
Web infrastructure
Collaborative software (version control)
Establish communities
Reward contributors
Netizens
Incentives
» benefits in used system
» fun (games: 10^13 clicks on Solitaire; Marathon, MUDD, ...)
» recognition (post names by amount of info. accepted,
organized by different criteria)
» general interest (note progress: data and performance)
» altruism/philanthropy (cf. OED, SETI 10^6 hits/day, ...)
» education (linguistics in schools, ...)
» lottery, gifts, frequent flier miles, ...
» money
1.5M inmates, 1M in nursing homes, ...
“Generic project” structure
Isolated character recognition
Recognizer: “off the shelf”
Isolated characters displayed on netizens’
browsers
Synthetic data (noise, rate, ...)
Learning with queries (present informative
patterns); each pattern more valuable than an
iid sampled one
Improved recognition
Collecting labels of isolated characters
[Figure: the Open Mind host presents isolated characters (4s and 9s) to netizens, who return labels]
Relation to traditional open source
Open Source                                   | Open Mind
• no netizens                                 | • netizens crucial
• expert knowledge (C++filt, gdbm)            | • informal knowledge (read, hear)
• machine learning irrelevant                 | • machine learning essential
• web infrastructure useful                   | • web infrastructure essential
• most work is directly on the final software | • most work is on the infrastructure
• hacker culture (10^5)                       | • netizen culture (10^8)
• software released                           | • software and data released
Relation to data mining
Data Mining                                          | Open Mind
• type of data may not be available for the          | • data tailored to the project desired
  project desired (e.g., OCR)                        |   (e.g., OCR)
• no interactive queries                             | • interactive queries
    slower learning                                  |     faster learning
    ambiguities not resolved                         |     ambiguities resolved
• relatively fixed amount of data                    | • new data encouraged
• model data as it exists                            | • collect data for best classifier
• little or no netizen support                       | • netizen support
Open framework allows cross-project integration
Use Open Mind linguistic constraints for
Open Mind OCR
Use Open Mind speech as front end for
games used in other projects
Use Open Mind common sense for
Open Mind language understanding
Open Mind handwriting
University of Nijmegen (Netherlands)
Segmentation data capture software
Transcription capture software
Seven-classifier system
Large set of unlabelled characters
Porting to the web and bulletproofing
Open Mind speech
U. Sherbrooke (Canada) & Carnegie Mellon
Engine: Sphinx2 system
First target: isolated Linux commands
Database software
Generic tools:
» HMM, FFT, LPC, FIR, IIR
» k-means clustering
» maximum mutual information VQ
Working software:
» speaker identification
» gender identification
Open Mind common sense
MIT Media Lab
Netizens contribute:
» Assertions (“grass is green”)
» Ontology information (“all chairs are
furniture”)
» Inferencing rules (“if all A are B and all B
are C, then all A are C”)
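The inferencing rule quoted above ("if all A are B and all B are C, then all A are C") can be applied mechanically to contributed ontology facts. The sketch below is only an illustration of that one rule; the function name and the pair representation are assumptions, and "artifact" in the usage note is a made-up category.

```python
def transitive_closure(all_are):
    """Given pairs (A, B) meaning 'all A are B', add every implied pair."""
    closure = set(all_are)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # all a are b, all b (== c) are d  =>  all a are d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```

For example, from ("chair", "furniture") and the hypothetical ("furniture", "artifact"), the function also derives ("chair", "artifact").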
Possible Project (1)
Open Mind text-to-speech
Text-to-speech generator on desktops
Netizens type their favorite sentences,
spoken by 2 models having different
parameters (prosody, pitch, ...)
Two-alternative forced choice
Parameters set via learning with queries
Preferences may cluster
More natural interfaces
Possible Project (2)
Open Mind spam filter
Netizens forward spam and non-spam
to the Open Mind spam site
Classifier learns features
» “!”, “$”, “free”, maximum length of ALL
CAPS, average length of ALL CAPS, ...
Semantics
Better spam filter
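The surface features named above could be extracted along these lines. This is a sketch under stated assumptions: an "ALL CAPS" run is taken to be a whole word of two or more uppercase ASCII letters, "free" is matched case-insensitively as a whole word, and the function and key names are invented for the example.

```python
import re


def spam_features(text):
    """Count simple surface features of a message."""
    # Assumption: an ALL CAPS run is a whole word of 2+ uppercase letters.
    caps_runs = [len(word) for word in re.findall(r"\b[A-Z]{2,}\b", text)]
    return {
        "exclamations": text.count("!"),
        "dollars": text.count("$"),
        "free": len(re.findall(r"\bfree\b", text, flags=re.IGNORECASE)),
        "max_caps_len": max(caps_runs, default=0),
        "avg_caps_len": sum(caps_runs) / len(caps_runs) if caps_runs else 0.0,
    }
```

The resulting dictionary would be one input vector for whatever classifier the project trains on the forwarded mail.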
Possible Project (3)
Open Mind chatbot
Netizen answers questions online
» complete sentences
» choose most natural paragraph
Game interface (Dungeons & Dragons)
» choose “most natural” paragraph
Better text generation; Loebner Prize
Possible Project (4)
Open Mind object recognition
Netizens submit digital photos and
labels (“cat sitting,” “horse running,” ...)
Semi-automatic segmentation
3D models trained
Better object recognition software; search by
image content
Possible Project (5)
Open Mind GO
Contributors read tutorial and take test
Score board positions taken from database
Netizens play games against each other and
the current system
Scores used to guide massive search
System implemented in parallel on netizens’
computers, networked
» user interest!
Better GO software ($1M prize)
Problems in machine learning
Goal: Optimal classifier/AI system
Differs from traditional learning in noise (but
both usable)
» Interactive learning means we want to reward
good data, punish bad
» on-line (rapid) so as to improve interactive queries
Learn reliability of netizens
Game theory/decision theory/machine
learning/pattern recognition
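One simple way to "learn reliability of netizens" is to track how often each netizen agrees with answers that are already known. The sketch below assumes a Beta prior over a netizen's accuracy and returns the posterior mean; the function name, arguments, and the uniform default prior are illustrative choices, not something the slides specify.

```python
def estimate_reliability(n_correct, n_checked, prior_correct=1.0, prior_wrong=1.0):
    """Posterior mean accuracy under a Beta(prior_correct, prior_wrong) prior."""
    # With no checked answers yet, this returns the prior mean.
    return (n_correct + prior_correct) / (n_checked + prior_correct + prior_wrong)
```

A new netizen starts at 0.5 under the default prior and moves toward the observed agreement rate as more checked answers accumulate.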
General issues
Relative value of learning with queries vs. iid samples
Data truthing/outlier detection
Optimal learning strategies given...
» Bayes error
» probability of hostile data
» probability of data error
Somewhat like Educational Testing Service: checks
the predictability of an SAT question by correlating
responses with those on other test questions.
Open Mind Animals (Stork & Lam, 00)
[Figure: decision tree of yes/no questions ("2 legs?", "can fly?", "can swim?", "feathers?", "mane?") with animals at the leaves (human, parrot, bat, elephant, dog, horse)]
Ensuring data quality in Open Mind Animals
Must correct errors early!
Misspelled animal
» check in pre-compiled lexicon, also...
Detection of questionable data (bug report)
» lock node; notice to domain expert who arbitrates
Submission of animal already in tree
» warning and lowest parent node question listed; if player
persists, both nodes locked until arbitrated by 3rd player
Submission collisions
» lock node for 30 seconds
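The first and third checks above (lexicon lookup and duplicate detection) might look like the following. This is a hypothetical sketch: the function name, the return strings, and the use of plain sets for the lexicon and the tree's current animals are assumptions, and node locking and arbitration are not modeled.

```python
def check_submission(name, lexicon, animals_in_tree):
    """Screen a newly submitted animal name before it enters the tree."""
    name = name.strip().lower()
    if name not in lexicon:
        # Possible misspelling: not found in the pre-compiled lexicon.
        return "reject: not in lexicon"
    if name in animals_in_tree:
        # Animal already present: warn the player before accepting.
        return "warn: already in tree"
    return "accept"
```

In a full system the "warn" case would also show the player the lowest parent node question, as the slide describes.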
General data “voting”
Present each query to k netizens,
accept iff all k agree
k large → small amount of reliable data
k small → large amount of unreliable data
Estimate netizen reliabilities
Find the optimal k* for different reliabilities,
state of classifier, Bayes error, ...
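The trade-off in k can be made concrete with a small calculation. The sketch assumes binary labels and netizens who answer independently, each correct with the same probability p (with 0 < p < 1); these assumptions and the function name are not from the slides.

```python
def unanimous_vote_stats(p, k):
    """Return (P(query accepted), P(accepted label is correct)).

    A query is accepted only if all k netizens give the same label.
    """
    all_correct = p ** k        # every netizen gives the true label
    all_wrong = (1 - p) ** k    # every netizen gives the wrong label
    p_accept = all_correct + all_wrong
    return p_accept, all_correct / p_accept
```

With p = 0.9, going from k = 1 to k = 3 lowers the acceptance probability from 1.0 to 0.73 while raising the accuracy of accepted labels from 0.9 to about 0.9986, which is the "small amount of reliable data" end of the trade-off.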
“Catch” trials
If low submission rate, we do not have k netizens
online simultaneously; must estimate reliability
individually
Make 1 out of every q samples be unambiguous
(given by classifier or precompiled set); if netizen fails
on this “catch” trial he is unreliable, and data
discarded
q small → small amount of reliable data
q large → large amount of unreliable data
What is the optimal q*?
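The effect of q can be sketched numerically. The example assumes an unreliable netizen fails each catch trial independently with probability p_fail; that model, the function name, and its arguments are illustrative assumptions.

```python
def catch_trial_tradeoff(q, n_samples, p_fail):
    """Return (fraction of samples that are new data, P(unreliable netizen caught)).

    One out of every q samples is a catch trial with a known answer.
    """
    n_catch = n_samples // q                  # catch trials seen so far
    useful_fraction = (q - 1) / q             # samples that yield new labels
    p_detected = 1 - (1 - p_fail) ** n_catch  # fails at least one catch trial
    return useful_fraction, p_detected
```

For 10 samples and p_fail = 0.5, q = 2 gives half the samples as new data and a 0.96875 chance of catching the netizen, while q = 10 gives 90% new data but only a 0.5 chance of detection.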
Exploratory learning (Thrun, 95)
Learning with queries (not iid!)
Decision theory: each action (query) has an
expected cost/payoff. Choose the query
which, when answered, will lead to the
greatest improvement in the classifier.
How does it depend upon the state of the
classifier? Netizen reliabilities?
Classifier sensitivity analysis
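Computing the true expected improvement of each query is problem-specific. A common stand-in, used here in place of the full decision-theoretic criterion, is uncertainty sampling: ask about the pattern whose current class estimate is least certain. The sketch assumes a two-class problem and a hypothetical dict of the classifier's probabilities; it ignores netizen reliability.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a class-1 probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def choose_query(candidates):
    """candidates: dict mapping pattern id -> classifier's P(class 1).

    Returns the id of the pattern the classifier is least sure about.
    """
    return max(candidates, key=lambda pid: entropy(candidates[pid]))
```

Patterns the classifier already labels confidently are skipped, so each netizen answer is spent where it is expected to be most informative.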
Sensitivity analysis
Game theory
Seek strategy to reward/teach netizens
to give “good” data
But “good” depends upon the
classifier...which depends upon the
data...
But adversaries learn too!
Take-home messages
Era of large data sets
» needed for further progress
Open software development
» leads to high-quality software
» integrate components
» can be used for many projects in pattern recognition and AI
Netizens
» can contribute large amounts of informal data
Data collection
» opportunities for algorithms and theory
Open Mind is inevitable
• Need is here • Web is here • Theory/Machine learning is here
• (Networked) computer power and memory are here
[Figure: Open Mind linking netizens’ knowledge and intelligent systems]
This collaboration is going to happen!
» Less radical than Richard Stallman or Linus Torvalds...
Initiative status
[Link], mailing list
Logo (Imageworks, Inc.)
Homepage and project template designed (Diamond
Bullet Designs)
Public relations (Ruder Finn)
Legal counsel
Solicited corporate donations (e.g., books, CDs, ...)
Demonstration project and infrastructure: Animals
Three projects, several infrastructure developers
Summary
Open Mind
» Collaborative framework for developing
“intelligent systems”
» Experts, tool developers, netizens
New areas in machine learning
Vision of the future
Questions/Comments...
“teaching computers the stuff we all know”
[Link]
to subscribe to mailing list, send mail to: majordomo@[Link]
in message body: subscribe openmind-general <your e-mail>