An Introduction To Data Mining IIT Bombay
Prof. S. Sudarshan
CSE Dept, IIT Bombay
Data mining: the process of semi-automatically analyzing large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the pattern
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
identify fraudulent events from an online stream of events
Manufacturing and production:
automatically adjust knobs when process parameters change
Applications (continued)
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Classification (supervised learning)
[Figure: classification example with tests Prof = teacher and Age < 30]
$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based:
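The distance-based partitional approach named above (K-means) can be illustrated with a minimal sketch. The function names `kmeans`, `assign`, `sq_dist` and `mean`, the random choice of initial centers, and the iteration cap are assumptions made for illustration; they do not come from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(pts):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(pts)
    return tuple(sum(col) / n for col in zip(*pts))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]


def kmeans(points, k, iters=20, seed=0):
    """Plain K-means; initial centers are k input points picked at random (an assumption)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical 2-D data: two well-separated groups of three points each.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
```

On this toy data the first three points should share one label and the last three the other; which group gets label 0 depends on the random initialization.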
Agglomerative hierarchical clustering
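As a rough sketch of the agglomerative idea: start with every point in its own cluster and repeatedly merge the two closest clusters. Single link measures cluster distance by the closest pair of members, complete link by the farthest pair. The function name `agglomerative`, the stopping rule (a target number of clusters), and the sample points are assumptions for illustration only.

```python
import math


def agglomerative(points, num_clusters, linkage="single"):
    """Naive bottom-up clustering; returns clusters as lists of point indices."""
    # Single link uses the minimum pairwise distance, complete link the maximum.
    link = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        return link(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(pairs, key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]))
        # Merge cluster y into cluster x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical 1-D points chosen so that the two linkages disagree.
line = [(0.0,), (1.0,), (2.5,), (4.5,)]
single = agglomerative(line, 2, linkage="single")      # [[0, 1, 2], [3]]
complete = agglomerative(line, 2, linkage="complete")  # [[0, 1], [2, 3]]
```

Single link chains the third point onto the first two because its nearest neighbour is only 1.5 away, while complete link pairs it with the fourth point because the farthest-member distance is smaller there. This version recomputes all distances on every merge, which is fine for a handful of points but not for large data.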
[Table: preferences of people (rows) for movies (columns)]
Movies: Rangeela, QSQT, 100 days, Anand, Sholay, Deewar, Vertigo
People: Smita, Vijay, Mohan, Rajesh, Nina
Nitin: ? ? ? ? ? ?
Cluster-based approaches
External attributes of people and movies to cluster
age, gender of people
actors and directors of movies.
[May not be available]
Cluster people based on movie preferences
misses information about similarity of movies
Repeated clustering:
cluster movies based on people, then people
based on movies, and repeat
ad hoc, might smear out groups
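One way to picture the "cluster people based on movie preferences" option is the sketch below. It uses a simple greedy (leader) clustering, which is a stand-in chosen for brevity and not necessarily what the slides intend. The like (1) / dislike (0) values, the 0.75 threshold, and the helper names `agreement`, `cluster_people` and `predict` are all hypothetical.

```python
def agreement(a, b):
    """Fraction of commonly rated movies on which two people agree."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    return sum(a[m] == b[m] for m in common) / len(common)


def cluster_people(prefs, threshold=0.75):
    """Greedy leader clustering: join the first cluster whose leader agrees enough."""
    clusters = []  # each cluster is a list of names; the first name is its leader
    for name in prefs:
        for cluster in clusters:
            if agreement(prefs[name], prefs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def predict(prefs, clusters, known, movie):
    """Majority vote on `movie` inside the cluster that best matches `known`."""
    def score(cluster):
        return sum(agreement(known, prefs[n]) for n in cluster) / len(cluster)

    best = max(clusters, key=score)
    votes = [prefs[n][movie] for n in best if movie in prefs[n]]
    if not votes:
        return None
    return int(2 * sum(votes) >= len(votes))


# Hypothetical preferences (1 = likes, 0 = dislikes); not the values from the slide.
prefs = {
    "Smita":  {"Rangeela": 1, "QSQT": 1, "Sholay": 0, "Deewar": 0},
    "Vijay":  {"Rangeela": 1, "QSQT": 1, "Sholay": 0, "Deewar": 0},
    "Mohan":  {"Rangeela": 0, "QSQT": 0, "Sholay": 1, "Deewar": 1},
    "Rajesh": {"Rangeela": 0, "QSQT": 0, "Sholay": 1, "Deewar": 1},
    "Nina":   {"Rangeela": 1, "QSQT": 0, "Sholay": 0, "Deewar": 0},
}
clusters = cluster_people(prefs)
nitin_known = {"Rangeela": 1, "Sholay": 0}  # assumed partial preferences for Nitin
nitin_qsqt = predict(prefs, clusters, nitin_known, "QSQT")
nitin_deewar = predict(prefs, clusters, nitin_known, "Deewar")
```

With this made-up data the people fall into two groups, and Nitin's assumed partial preferences match the first group, so the vote predicts that he likes QSQT and dislikes Deewar. As the list above notes, clustering people alone ignores similarity between movies.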
Example of clustering
[Figure: the preference table with movies (Rangeela, QSQT, 100 days, Anand, Sholay, Deewar, Vertigo) and people (Smita, Vijay, Mohan, Rajesh, Nina) reordered into clusters; Nitin's entries are unknown (?)]
Model-based approach
Industry                Application
Finance                 Credit card analysis
Insurance               Claims, fraud analysis
Telecommunication       Call record analysis
Transport               Logistics management
Consumer goods          Promotion analysis
Data service providers  Value added data
Utilities               Power usage analysis
Why Now?