Data mining is…
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. It is:
- A user-centric, interactive process that leverages analysis technologies and computing power
- A group of techniques that find relationships that have not previously been discovered
- Not reliant on an existing database
- A relatively easy task that requires knowledge of the business problem / subject-matter expertise
Data mining is not…
- “Blind” application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database-intensive task
- A difficult-to-understand technology requiring an advanced degree in computer science
The Evolution of Data Analysis
Results from data mining…
- Forecasting: what may happen in the future
- Classifying: people or things into groups by recognizing patterns
- Clustering: people or things into groups based on their attributes
- Associating: what events are likely to occur together
- Sequencing: what events are likely to lead to later events
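Clustering, one of the result types above, can be sketched in a few lines. The following is a toy one-dimensional k-means in pure Python (the function name, data, and parameters are all illustrative, not from the source):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: cluster numeric values into k groups."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # move each center to the mean of its assigned points
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# e.g. hypothetical customer ages falling into two natural groups
ages = [22, 25, 24, 23, 61, 64, 62, 60]
print(kmeans(ages, 2))  # → [23.5, 61.75]
```

The algorithm discovers the two age groups from the attribute values alone, with no pre-existing labels - the sense in which clustering "finds relationships that have not previously been discovered."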
Why Should There Be a Standard Process?
- Framework for recording experience
- Allows projects to be replicated
- Aid to project planning and management
- “Comfort factor” for new adopters
- Demonstrates maturity of data mining
- Reduces dependency on “stars”
The data mining process must be reliable and repeatable by people with little data mining background.
Data mining vs OLAP
OLAP (On-line Analytical Processing) provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening.
Data mining vs statistical analysis

Data mining:
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique: if it makes sense, then let's use it
- Does not require assumptions to be made about the data
- Can find patterns in very large amounts of data
- Requires understanding of the data and the business problem

Statistical analysis:
- Tests for statistical correctness of models: are the statistical assumptions of the model correct (e.g. is the R-square good)?
- Hypothesis testing: is the relationship significant? Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills
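The R-square and t-test checks mentioned for statistical analysis can be computed by hand. A small sketch using only the standard library (the data are invented for illustration; the t-statistic here is the standard one for testing a Pearson correlation against zero):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]      # roughly y = 2x with noise
r = pearson_r(xs, ys)
r2 = r * r                                 # R-square: variance explained
n = len(xs)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r2)  # t-statistic for H0: no correlation
print(round(r2, 3), round(t, 1))
```

A data miner would simply note that y grows with x and move on; the statistician computes R-square and compares t against a t-distribution with n-2 degrees of freedom before trusting the relationship.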
Examples of what people are doing with data mining
- Fraud/non-compliance: anomaly detection; isolate the factors that lead to fraud, waste, and abuse; target auditing and investigative efforts more effectively
- Credit/risk scoring
- Intrusion detection
- Parts-failure prediction
- Recruiting/attracting customers
- Maximizing profitability (cross-selling, identifying profitable customers)
- Service delivery and customer retention: build profiles of customers likely to use which services
- Web mining
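The anomaly-detection use case above can be illustrated with a simple z-score rule: flag transactions whose amount is unusually far from the mean. This is a toy sketch with invented data, not a production fraud model (the threshold of 2 standard deviations is chosen because the sample z-score is bounded in very small samples):

```python
import statistics

def anomalies(amounts, z_cut=2.0):
    """Flag values more than z_cut sample standard deviations from the mean."""
    mu = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    return [a for a in amounts if abs(a - mu) / sd > z_cut]

# hypothetical credit card charges with one suspicious purchase
charges = [120, 135, 110, 125, 130, 118, 122, 5400]
print(anomalies(charges))  # → [5400]
```

Real systems build per-customer profiles and use far richer features, but the principle is the same: purchases that do not match past patterns get routed to investigators.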
What is data mining and privacy preservation all about?
Data mining is extensively used for knowledge discovery from large databases. The problem is that, from the available non-sensitive information, one may be able to infer sensitive information that is not to be disclosed. Privacy is therefore becoming an increasingly important issue in many data mining applications, which has led to the development of privacy-preserving data mining (PPDM).
Objective
Perform data mining on the union of two private databases while the data stays private, i.e. no party learns anything but the output, meeting the privacy requirements. The aim is to achieve the data mining goals - producing valid results - without sacrificing the privacy of the individuals who provide the data.
How Do We Do It?
There are two approaches to preserving privacy:
- The first protects the privacy of the data using an extended role-based access control approach, in which identified sensitive objects are shielded to protect an individual's privacy.
- The second uses cryptographic techniques.
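The first approach can be sketched as roles that see sensitive fields only when explicitly granted access. This is a hypothetical minimal illustration (field names, roles, and the masking convention are all invented, not from the source):

```python
# Sensitive objects identified in advance; access is granted per role.
SENSITIVE = {"ssn", "diagnosis"}
GRANTS = {"auditor": {"ssn"}, "researcher": set()}  # role -> sensitive fields allowed

def view(record, role):
    """Return the record with any non-granted sensitive field masked out."""
    allowed = GRANTS.get(role, set())
    return {k: (v if k not in SENSITIVE or k in allowed else "***")
            for k, v in record.items()}

patient = {"name": "A. Jones", "ssn": "123-45-6789", "diagnosis": "flu"}
print(view(patient, "researcher"))  # both sensitive fields masked
print(view(patient, "auditor"))     # ssn visible, diagnosis still masked
```

A real extended-RBAC system would carry richer policies (purpose, context, hierarchies), but the core idea is the same: privacy protection is enforced at the access layer, before any mining runs.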
Cryptographic techniques for PPDM
The goal is to run the data mining algorithm on the union of the parties' databases without revealing any unnecessary information. Consider, for example, separate medical institutions that wish to conduct joint research while preserving the privacy of their patients: privileged information is protected, yet still usable for research.
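A classic toy building block for this setting is the secure-sum protocol: parties compute an aggregate over the union of their data while each party's own value stays hidden behind a random mask. The sketch below simulates the protocol in one process (in reality each addition happens at a different party, and only the masked running total travels between them; names and numbers here are illustrative):

```python
import random

def secure_sum(private_values, modulus=10**6):
    """Ring-based secure sum: parties only ever see a random-looking running total."""
    r = random.randrange(modulus)        # initiator's secret random mask
    running = r
    for v in private_values:             # each party adds its private value mod m
        running = (running + v) % modulus
    return (running - r) % modulus       # initiator removes the mask at the end

# e.g. three hospitals learn the total patient count without revealing their own
print(secure_sum([120, 340, 95]))  # → 555
```

Each intermediate total is uniformly random from any single party's viewpoint, so nothing beyond the designated output - the sum - is learned, which is exactly the privacy notion discussed below.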
How much privacy?
If a data mining algorithm is run against the union of the databases and its output becomes known to one or more of the parties, it reveals something about the contents of the other databases. This leak of information is inevitable, however, if the parties need to learn the output.
What is cryptography?
The common definition of privacy in the cryptographic community limits the information leaked by the distributed computation to what can be learned from the designated output of the computation.
Specific data mining applications
What data mining has done for…
The US Internal Revenue Service needed to improve customer service, and scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for…
The US Drug Enforcement Agency needed to be more effective in its drug “busts”, and analyzed suspects' cell phone usage to focus investigations.
What data mining has done for…
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments, and reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Final comments
Data mining can be utilized in any organization that needs to find patterns or relationships in its data. By using the CRISP-DM methodology, analysts can have a reasonable level of assurance that their data mining efforts will render useful, repeatable, and valid results.
Five dimensions of PPDM
(1) the distribution of the basic data;
(2) how the basic data are modified;
(3) which mining method is being used;
(4) whether the basic data or the rules are to be hidden; and
(5) which additional methods for privacy preservation are used.
Privacy guidelines
1. Collection limitation principle - too general to be enforced in PPDM
2. Data quality principle - most of today's PPDM methods or algorithms assume that data are already prepared to an appropriate quality to be mined
3. Purpose specification principle - extremely relevant for PPDM
4. Use limitation principle - fundamental for PPDM
5. Security safeguard principle - unenforceable in the context of PPDM
6. Openness principle - relevant for PPDM
7. Individual participation principle - Oliveira and Zaïane suggest that the implications of this principle for PPDM should be carefully weighed in light of the ownership of the basic data; otherwise the principle could be too rigid in PPDM applications
8. Accountability principle - too general for PPDM

Data mining and privacy-preserving data mining


Editor's Notes

  • #18 The US Internal Revenue Service is using data mining to improve customer service. [Click] By analyzing incoming requests for help and information, the IRS hopes to schedule its workforce to provide faster, more accurate answers to questions.
  • #19 The US DFAS needs to search through 2.5 million financial transactions that may indicate inaccurate charges. Instead of relying on tips to point out fraud, the DFAS is mining the data to identify suspicious transactions. [Click] Using Clementine, the agency examined credit card transactions and was able to identify purchases that did not match past patterns. Using this information, DFAS could focus investigations, finding fraud more cost-effectively.
  • #20 Retail banking is a highly competitive business. In addition to competition from other banks, banks also see intense competition from financial services companies of all kinds, from stockbrokers to mortgage companies. With so many organizations working the same customer base, the value of customer retention is greater than ever before. As a result, HSBC Bank USA looks to entice existing customers to "roll over" maturing products, or to cross-sell new ones. [Click] Using SPSS products, HSBC found that it could reduce direct mail costs by 30% while still bringing in 95% of the campaign's revenue. Because HSBC is sending out fewer mail pieces, customers are likely to be more loyal because they don't receive junk mail from the bank.