Solved DM Questions

Data mining solved questions

Uploaded by

chudarybushra

Question 1: What is data mining?

In your answer, address the following:

(a) Is it another hype?
(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
(c) Explain how the evolution of database technology led to data mining.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.

Data mining refers to the process of extracting or mining interesting knowledge or patterns from
large amounts of data.

(a) No, data mining is not another hype. "We are living in the information age" is a popular
saying; however, we are actually living in the data age. Terabytes or petabytes of data pour
into our computer networks, the World Wide Web (WWW), and various data storage
devices every day from business, society, science and engineering, medicine, and almost
every other aspect of daily life. Powerful and versatile tools are badly needed to
automatically uncover valuable information from the tremendous amounts of data and to
transform such data into organized knowledge. This necessity has led to the birth of data
mining.

(b) No. Data mining is not a simple transformation of technology developed from databases,
statistics, and machine learning. Instead, it involves an integration, rather than a simple
transformation, of techniques from multiple disciplines such as database technology,
statistics, machine learning, high-performance computing, pattern recognition, neural
networks, data visualization, and information retrieval.

(c) Database technology evolved from primitive file processing to sophisticated database
systems offering data modeling, indexing, query processing, and transaction management. As
huge amounts of data accumulated in databases and data warehouses over the years, retrieving
useful knowledge from them became increasingly complex, and the need for tools that go
beyond simple retrieval toward automated analysis gave rise to data mining, often called the
"knowledge discovery in databases" (KDD) process.

Data mining has its roots in three family lines:

Classical statistics - standard statistical methods for analyzing data and making numerical
predictions.

Artificial intelligence - heuristic, human-like reasoning applied to data problems.

Machine learning - a combination of classical statistics and AI, building systems that learn
and improve from data.

The complexity of retrieving useful information from ever-growing stored data led to the term
"data mining", first coined in the 1990s.

(d) Steps involved in data mining when viewed as a knowledge discovery process:

Data Cleaning - a process that removes or transforms noise and inconsistent data.

Data Integration - where data from heterogeneous data sources is combined for mining
purposes.

Data Selection - where data relevant to the analysis task are retrieved from the database.

Data Transformation - where data is transformed or consolidated into forms suitable for
mining.

Data Mining - an essential process where intelligent and efficient methods are applied in
order to extract patterns.

Pattern Evaluation - a process that identifies the truly interesting patterns representing
knowledge, based on some interestingness measures.

Knowledge Presentation - where visualization and knowledge representation techniques are
used to present the mined knowledge to the user.
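The steps above can be sketched on toy in-memory data. This is a minimal illustrative sketch, not a real mining system; the record layout, the duplicate-id rule, and the "age >= 18" selection criterion are all assumptions made for the example.

```python
# A minimal sketch of the KDD pipeline on toy in-memory records.
# Column names and the selection rule (age >= 18) are illustrative assumptions.

# Raw records from two hypothetical heterogeneous sources
source_a = [{"id": 1, "age": 25}, {"id": 2, "age": None}, {"id": 3, "age": 17}]
source_b = [{"id": 4, "age": 40}, {"id": 1, "age": 25}]  # id 1 duplicated

# 1. Data cleaning: drop records with missing values
cleaned = [r for r in source_a + source_b if r["age"] is not None]

# 2. Data integration: merge the sources, removing duplicate ids
seen, integrated = set(), []
for r in cleaned:
    if r["id"] not in seen:
        seen.add(r["id"])
        integrated.append(r)

# 3. Data selection: keep only records relevant to the analysis task
selected = [r for r in integrated if r["age"] >= 18]

# 4. Data transformation: consolidate into a form suitable for mining
ages = [r["age"] for r in selected]

# 5. Data mining (a trivial stand-in): extract a summary pattern
pattern = sum(ages) / len(ages)  # mean adult age

# 6./7. Pattern evaluation and knowledge presentation
print(f"Mean adult age across sources: {pattern}")  # 32.5
```

Each list comprehension stands in for what would, in practice, be a much larger cleaning, integration, or mining component.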

Question 2: How is a database different from a data warehouse?


Database System:
A database system is used in the traditional way of storing and retrieving data. The major
task of a database system is to perform query processing. These systems are generally
referred to as online transaction processing (OLTP) systems and are used for the day-to-day
operations of an organization.
Data Warehouse:
A data warehouse is a place where huge amounts of data are stored. It is meant for users or
knowledge workers in the role of data analysis and decision making. These systems organize
and present data in different formats and forms in order to serve the needs of specific
users for specific purposes. These systems are referred to as online analytical processing
(OLAP) systems.
Difference between Database System (DB) and Data Warehouse (DW):

1. DB: supports operational processes. DW: supports analysis and performance reporting.

2. DB: captures and maintains the data. DW: explores the data.

3. DB: current data. DW: multiple years of history.

4. DB: data is balanced within the scope of this one system. DW: data must be integrated and
balanced from multiple systems.

5. DB: data is updated when a transaction occurs. DW: data is updated on a schedule.

6. DB: data verification occurs when entry is done. DW: data verification occurs after the
fact.

7. DB: 100 MB to GB. DW: 100 GB to TB.

8. DB: ER based. DW: star/snowflake schema.

9. DB: application oriented. DW: subject oriented.

10. DB: primitive and highly detailed. DW: summarized and consolidated.

11. DB: flat relational. DW: multidimensional.

Second answer:

What is a Database?
A database is a collection of related data which represents some elements of the real world.
It is designed to be built and populated with data for a specific task, and it is a building
block of your data solution.
What is a Data Warehouse?
A data warehouse is an information system which stores historical and cumulative data from
single or multiple sources. It is designed to analyze, report on, and integrate transaction
data from different sources.

A data warehouse eases the analysis and reporting processes of an organization. It also
serves as a single version of the truth for the organization's decision-making and
forecasting processes.
Parameter-by-parameter comparison of Database (DB) and Data Warehouse (DW):

Purpose: DB is designed to record. DW is designed to analyze.

Processing method: DB uses online transaction processing (OLTP). DW uses online analytical
processing (OLAP).

Usage: DB helps to perform fundamental operations for your business. DW allows you to analyze
your business.

Tables and joins: DB tables and joins are complex because they are normalized. DW tables and
joins are simple because they are denormalized.

Orientation: DB is an application-oriented collection of data. DW is a subject-oriented
collection of data.

Storage limit: DB is generally limited to a single application. DW stores data from any
number of applications.

Availability: DB data is available in real time. DW data is refreshed from source systems as
and when needed.

Modeling: DB uses ER modeling techniques for designing. DW uses data modeling techniques for
designing.

Technique: DB captures data. DW analyzes data.

Data type: data stored in a DB is up to date. DW stores current and historical data, which
may not be up to date.

Storage of data: DB uses a flat relational approach for data storage. DW uses a dimensional
approach for the data structure, e.g. star and snowflake schemas.

Query type: DB uses simple transaction queries. DW uses complex queries for analysis
purposes.

Data summary: detailed data is stored in a DB. DW stores highly summarized data.

Question 3: Briefly explain the steps of making a decision tree.


1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for all of its categorical values.
2.2. Calculate the information gain for the feature.
3. Split on the feature with the maximum information gain.
4. Repeat recursively on each subset until the desired tree is obtained.
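Steps 1-3 can be sketched with a small ID3-style helper. The toy "outlook" feature and the yes/no play labels below are illustrative assumptions, not data from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the whole set minus the weighted entropy of the
    subsets produced by splitting on `feature`."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return total - remainder

# Toy dataset: "outlook" perfectly separates the two classes
rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]

print(entropy(labels))                             # 1.0 (perfectly mixed)
print(information_gain(rows, labels, "outlook"))   # 1.0 (perfect split)
```

A real tree builder would call `information_gain` for every candidate feature, split on the best one, and recurse on each subset (step 4).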
Question 5: Explain the importance of evaluation criteria for classification
methods.

Performance evaluation of a classification model is important for understanding the quality
of the model, for refining the model, and for choosing an adequate model. The performance
evaluation criteria used for classification models are:
• Predictive (classification) accuracy: the ability of the model to correctly predict the
class label of new or previously unseen data; accuracy = % of test-set examples correctly
classified by the classifier.
• Speed: the computational cost involved in generating and using the model.
• Robustness: the ability of the model to make correct predictions given noisy data or data
with missing values.
• Scalability: the ability to construct the model efficiently given large amounts of data.
• Interpretability: the level of understanding and insight provided by the model.
• Simplicity: e.g. decision tree size and rule compactness.
• Domain-dependent quality indicators.
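The accuracy criterion above can be sketched directly. The spam/ham labels are a hypothetical test set chosen for illustration.

```python
# A minimal sketch of the accuracy criterion:
# accuracy = fraction of test-set examples the classifier labels correctly.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical test-set labels and classifier predictions
y_true = ["spam", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]  # one mistake

print(f"accuracy = {accuracy(y_true, y_pred):.0%}")  # accuracy = 80%
```

In practice accuracy alone can mislead on imbalanced classes, which is one reason the other criteria (robustness, interpretability, domain-dependent indicators) matter as well.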
Question 6: How to improve the accuracy of classification?

1 - Cross-validation: split the training dataset into groups, always holding one group out
for evaluation and rotating the held-out group on each run. This shows which data trains a
more accurate model.

2 - Cross-dataset validation: the same idea as cross-validation, but using different
datasets.

3 - Tuning the model: adjust the parameters used to train the classification model (which
parameters matter depends on the classification algorithm in use).

4 - Improve, or introduce, a normalization process: find the techniques that produce more
consistent data for training.

5 - Understand the problem better, and try other methods for solving it. There is almost
always more than one way to solve the same problem, and the current approach may not be the
best one.

6 - More data: more variety and more volume generally give better results.

7 - Ensemble methods: (probably the easiest and most effective) combine multiple weak
models into a strong model with better predictions, with the models compensating for each
other's errors.
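Item 1 above can be sketched from scratch. The "classifier" here is a trivial majority-class predictor, an illustrative stand-in for a real model, and the yes/no labels are made-up data.

```python
import random

# A from-scratch sketch of k-fold cross-validation.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(0).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]

def majority_class(labels):
    """Trivial 'classifier': always predict the most common training label."""
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=5):
    """Mean held-out accuracy of the majority-class predictor over k folds."""
    scores = []
    for fold in k_fold_indices(len(labels), k):
        train = [labels[i] for i in range(len(labels)) if i not in fold]
        test = [labels[i] for i in fold]
        pred = majority_class(train)
        scores.append(sum(l == pred for l in test) / len(test))
    return sum(scores) / len(scores)

labels = ["yes"] * 8 + ["no"] * 2
print(f"mean CV accuracy: {cross_validate(labels):.2f}")  # 0.80
```

Rotating the held-out fold means every example is tested exactly once, which gives a less biased accuracy estimate than a single train/test split.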
