Data mining is…
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. It is:
- A user-centric, interactive process that leverages analysis technologies and computing power
- A group of techniques that find relationships that have not previously been discovered
- Not reliant on an existing database
- A relatively easy task that requires knowledge of the business problem / subject-matter expertise
Data mining is not…
- “Blind” application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database-intensive task
- A difficult-to-understand technology requiring an advanced degree in computer science
The Evolution of Data Analysis
Results from data mining…
- Forecasting: what may happen in the future
- Classifying: people or things into groups by recognizing patterns
- Clustering: people or things into groups based on their attributes
- Associating: what events are likely to occur together
- Sequencing: what events are likely to lead to later events
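Clustering, one of the result types above, can be sketched in a few lines. The following is a toy one-dimensional k-means in pure Python (the function name, data, and parameters are all illustrative, not from the source):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: cluster numeric values into k groups."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # move each center to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# e.g. hypothetical customer ages falling into two natural groups
ages = [22, 25, 24, 23, 61, 64, 62, 60]
print(kmeans(ages, 2))  # → [23.5, 61.75]
```

The algorithm discovers the two age groups from the attribute values alone, with no pre-existing labels - the sense in which clustering "finds relationships that have not previously been discovered."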
Why Should There Be a Standard Process?
- Framework for recording experience
- Allows projects to be replicated
- Aid to project planning and management
- “Comfort factor” for new adopters
- Demonstrates maturity of data mining
- Reduces dependency on “stars”
The data mining process must be reliable and repeatable by people with little data mining background.
Data mining vs OLAP
OLAP (On-line Analytical Processing) provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening.
Data mining vs statistical analysis

Data mining:
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique: if it makes sense, then let's use it
- Does not require assumptions to be made about the data
- Can find patterns in very large amounts of data
- Requires understanding of the data and the business problem

Statistical analysis:
- Tests for statistical correctness of models: are the statistical assumptions of the model correct (e.g. is the R-square good)?
- Hypothesis testing: is the relationship significant? Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills
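The R-square and t-test checks mentioned for statistical analysis can be computed by hand. A small sketch using only the standard library (the data are invented for illustration; the t-statistic here is the standard one for testing a Pearson correlation against zero):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]      # roughly y = 2x with noise
r = pearson_r(xs, ys)
r2 = r * r                                 # R-square: variance explained
n = len(xs)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r2)  # t-statistic for H0: no correlation
print(round(r2, 3), round(t, 1))
```

A data miner would simply note that y grows with x and move on; the statistician computes R-square and compares t against a t-distribution with n-2 degrees of freedom before trusting the relationship.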
Examples of what people are doing with data mining
- Fraud/non-compliance: anomaly detection; isolate the factors that lead to fraud, waste, and abuse; target auditing and investigative efforts more effectively
- Credit/risk scoring
- Intrusion detection
- Parts-failure prediction
- Recruiting/attracting customers
- Maximizing profitability (cross-selling, identifying profitable customers)
- Service delivery and customer retention: build profiles of customers likely to use which services
- Web mining
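The anomaly-detection use case above can be illustrated with a simple z-score rule: flag transactions whose amount is unusually far from the mean. This is a toy sketch with invented data, not a production fraud model (the threshold of 2 standard deviations is chosen because the sample z-score is bounded in very small samples):

```python
import statistics

def anomalies(amounts, z_cut=2.0):
    """Flag values more than z_cut sample standard deviations from the mean."""
    mu = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sd > z_cut]

# hypothetical credit card charges with one suspicious purchase
charges = [120, 135, 110, 125, 130, 118, 122, 5400]
print(anomalies(charges))  # → [5400]
```

Real systems build per-customer profiles and use far richer features, but the principle is the same: purchases that do not match past patterns get routed to investigators.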
What is data mining and privacy preservation all about?
Data mining is extensively used for knowledge discovery from large databases. The problem is that, from the available non-sensitive information, one may be able to infer sensitive information that is not to be disclosed. Privacy is therefore becoming an increasingly important issue in many data mining applications, which has led to the development of privacy-preserving data mining (PPDM).
Objective
Perform data mining on the union of two private databases while the data stays private, i.e. no party learns anything but the output, meeting the privacy requirements. The aim is to achieve the data mining goals - producing valid results - without sacrificing the privacy of the individuals who provide the data.
How Do We Do It?
There are two approaches to preserving privacy:
- The first protects the privacy of the data using an extended role-based access control approach, in which identified sensitive objects are shielded to protect an individual's privacy.
- The second uses cryptographic techniques.
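The first approach can be sketched as roles that see sensitive fields only when explicitly granted access. This is a hypothetical minimal illustration (field names, roles, and the masking convention are all invented, not from the source):

```python
# Sensitive objects identified in advance; access is granted per role.
SENSITIVE = {"ssn", "diagnosis"}
GRANTS = {"auditor": {"ssn"}, "researcher": set()}  # role -> sensitive fields allowed

def view(record, role):
    """Return the record with any non-granted sensitive field masked out."""
    allowed = GRANTS.get(role, set())
    return {k: (v if k not in SENSITIVE or k in allowed else "***")
            for k, v in record.items()}

patient = {"name": "A. Jones", "ssn": "123-45-6789", "diagnosis": "flu"}
print(view(patient, "researcher"))  # both sensitive fields masked
print(view(patient, "auditor"))     # ssn visible, diagnosis still masked
```

A real extended-RBAC system would carry richer policies (purpose, context, hierarchies), but the core idea is the same: privacy protection is enforced at the access layer, before any mining runs.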
Cryptographic techniques for PPDM
The goal is to run the data mining algorithm on the union of the parties' databases without revealing any unnecessary information. Consider, for example, separate medical institutions that wish to conduct joint research while preserving the privacy of their patients: privileged information is protected, yet still usable for research.
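A classic toy building block for this setting is the secure-sum protocol: parties compute an aggregate over the union of their data while each party's own value stays hidden behind a random mask. The sketch below simulates the protocol in one process (in reality each addition happens at a different party, and only the masked running total travels between them; names and numbers here are illustrative):

```python
import random

def secure_sum(private_values, modulus=10**6):
    """Ring-based secure sum: parties only ever see a random-looking running total."""
    r = random.randrange(modulus)        # initiator's secret random mask
    running = r
    for v in private_values:             # each party adds its private value mod m
        running = (running + v) % modulus
    return (running - r) % modulus       # initiator removes the mask at the end

# e.g. three hospitals learn the total patient count without revealing their own
print(secure_sum([120, 340, 95]))  # → 555
```

Each intermediate total is uniformly random from any single party's viewpoint, so nothing beyond the designated output - the sum - is learned, which is exactly the privacy notion discussed below.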
How much privacy?
If a data mining algorithm is run against the union of the databases and its output becomes known to one or more of the parties, it reveals something about the contents of the other databases. This leak of information is inevitable, however, if the parties need to learn the output.
What is cryptography?
The common definition of privacy in the cryptographic community limits the information leaked by the distributed computation to what can be learned from the designated output of the computation.
Specific data mining applications
What data mining has done for…
The US Internal Revenue Service needed to improve customer service, and scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for…
The US Drug Enforcement Agency needed to be more effective in its drug “busts”, and analyzed suspects' cell phone usage to focus investigations.
What data mining has done for…
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments, and reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Final comments
Data mining can be utilized in any organization that needs to find patterns or relationships in its data. By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their data mining efforts will render useful, repeatable, and valid results.
Five dimensions of PPDM
(1) the distribution of the basic data;
(2) how the basic data are modified;
(3) which mining method is being used;
(4) whether the basic data or the rules are to be hidden; and
(5) which additional methods for privacy preservation are used.
Privacy guidelines
1. Collection limitation principle - too general to be enforced in PPDM
2. Data quality principle - most of today's PPDM methods or algorithms assume that data are already prepared to an appropriate quality to be mined
3. Purpose specification principle - extremely relevant for PPDM
4. Use limitation principle - fundamental for PPDM
5. Security safeguard principle - unenforceable in the context of PPDM
6. Openness principle - relevant for PPDM
7. Individual participation principle - Oliveira and Zaïane suggest that the implications of this principle for PPDM should be carefully weighed in light of the ownership of the basic data; otherwise the principle could be too rigid in PPDM applications
8. Accountability principle - too general for PPDM

Data mining and privacy-preserving data mining


Editor's Notes

  • #18 The US Internal Revenue Service is using data mining to improve customer service. [Click] By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.
  • #19 The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions. [Click] Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.
  • #20 Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies. With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to entice existing customers to "roll over" maturing products, or to cross-sell new ones. [Click] Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign's revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don't receive junk mail from the bank.