DWM Paper 1
Paper / Subject Code: 48894 / Data Warehousing & Mining
1T01875 - T.E. Computer Science & Engineering (Artificial Intelligence & Machine Learning) (Choice
Based) (R-2019 'C' Scheme) SEMESTER - V / 48894 - Data Warehousing & Mining QP CODE:
10029875 DATE: 31/05/2023.
Time: 3 hours Max. Marks: 80
N.B. (1) Question one is Compulsory.
(2) Attempt any 3 questions out of the remaining.
(3) Assume suitable data if required.
Q. 1 (a) Every data structure in the data warehouse contains the time element. Why? 05
(b) Calculate Accuracy, Recall and Precision with the help of the following data: 05
True Positive (TP) = 50, True Negative (TN) = 20, False Positive (FP) = 20, False Negative (FN) = 10
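The arithmetic these counts imply can be sketched directly, using the standard definitions of the three metrics:

```python
# Confusion-matrix counts given in the question
TP, TN, FP, FN = 50, 20, 20, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct predictions
precision = TP / (TP + FP)                   # of predicted positives, how many are real
recall = TP / (TP + FN)                      # of real positives, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# → 0.7 0.714 0.833
```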
(c) What is Market basket analysis? 05
(d) Draw and explain the KDD process. 05
Q. 2 a) Suppose that a data warehouse consists of the four dimensions, date, spectator,
location, and game, and the two measures, count and charge, where charge is
the fare that a spectator pays when watching a game on a given date. Spectators
may be students, adults, or seniors, with each category having its own charge rate.
b) Draw the base cuboid [date, spectator, location] and apply any four OLAP
operations. 10
b) What is clustering? Explain the K-means clustering algorithm. Suppose that the data is
{2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23}. Apply the k-means algorithm. 10
Q.3 a) A database has five transactions. Let min sup = 50% and min conf = 70%.
T_id   Items
T100   a, b
T200   a, c, d
T300   e, c, a
T400   c, d, b
T500   a, c, d, b, e
Find all frequent itemsets and strong association rules using Apriori Algorithm. 10
29875 Page 1 of 2
Q. 4 a) The following table contains a training set D, of class-labeled tuples randomly
selected from the AllElectronics customer database. Let buys_computer be the class label
attribute. Using Naïve Bayesian classification, predict the class label of the tuple
X = (age = youth, income = medium, student = yes, credit rating = fair).
RID  age          income  student  credit_rating  buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle-aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle-aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
11   youth        medium  yes      excellent      yes
b) What is web mining? Explain the HITS algorithm. 10
mining. 10
**********************************
29875 Page 2 of 2
Multilevel association mining explores hierarchical levels of data abstraction, such as mining frequent itemsets from categories (e.g., electronics > laptops > gaming laptops). It enables the detection of associations at various abstraction levels, uncovering broad as well as domain-specific patterns. Multidimensional rule mining examines relationships across multiple dimensions, like time, location, and product types, allowing for a comprehensive analysis of interrelated factors influencing sales, enhancing predictive insights and strategic alignment in complex data landscapes.
Naïve Bayesian classification predicts a class label for a tuple by calculating the posterior probabilities for each class, given the tuple's attributes, using Bayes' Theorem. It assumes independence between attributes. For instance, for the tuple X = (age = youth, income = medium, student = yes, credit rating = fair), the classifier multiplies the probabilities of these attributes given the class ‘buys_computer’ and selects the class with the highest product as the predicted class label.
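The counting behind this can be sketched in a few lines of Python. Note that the Q.4 table as extracted above is missing RIDs 9 and 10, so the probabilities here are over the nine rows shown rather than the full fourteen-row AllElectronics set:

```python
# Naive Bayes by plain counting over the nine rows listed in Q.4.
# Each row: (age, income, student, credit_rating, buys_computer)
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "medium", "yes", "excellent", "yes"),
]
X = ("youth", "medium", "yes", "fair")  # the tuple from the question

scores = {}
for label in ("yes", "no"):
    subset = [r for r in rows if r[-1] == label]
    score = len(subset) / len(rows)        # prior P(label)
    for i, value in enumerate(X):          # independence assumption:
        score *= sum(r[i] == value for r in subset) / len(subset)
    scores[label] = score                  # P(X | label) * P(label)

prediction = max(scores, key=scores.get)
```

On these nine rows the "yes" score (≈0.016) beats the "no" score (≈0.010), so X is predicted to buy a computer.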
Strong association rules and frequent itemsets can be derived using the Apriori algorithm, which identifies frequent itemsets level by level, pruning any candidate that contains an infrequent subset. With a minimum support of 50%, only itemsets that appear in at least half of the transactions are considered frequent. With a minimum confidence of 70%, only rules whose conditional probability meets this threshold are deemed strong. The process alternates between generating candidate itemsets, counting their support, and pruning non-compliant ones.
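For the five transactions in Q.3, the result can be checked with a brute-force enumeration. This is not the level-wise candidate generation Apriori itself uses, but under these thresholds it yields the identical frequent itemsets and strong rules:

```python
from itertools import combinations

# The five transactions from Q.3
transactions = [{'a', 'b'}, {'a', 'c', 'd'}, {'e', 'c', 'a'},
                {'c', 'd', 'b'}, {'a', 'c', 'd', 'b', 'e'}]
min_sup, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / n

# Enumerate every candidate itemset (Apriori would prune level-wise instead)
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_sup:
            frequent[frozenset(cand)] = s

# Strong rules X -> Y: confidence = sup(X ∪ Y) / sup(X) >= min_conf
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            conf = frequent[itemset] / support(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset - lhs), conf))
```

This finds {a}, {b}, {c}, {d}, {a,c} and {c,d} frequent, and four strong rules: a→c, c→a, c→d (confidence 75% each) and d→c (100%).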
Common data cleaning techniques include removing duplicates, correcting data errors, handling missing values, standardizing data formats, and filtering out irrelevant data. These techniques improve data quality by ensuring accuracy, consistency, and completeness, which are crucial for reliable data analysis and decision-making. Clean data helps reduce biases, minimize errors in predictive modeling, and enhances the overall robustness of analytical insights.
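A minimal sketch of three of these techniques (deduplication, format standardization, mean imputation) on a small hypothetical record set:

```python
# Hypothetical raw records (illustrative only): an exact duplicate,
# a missing age, and inconsistent city formatting.
raw = [
    {"name": "Ana",   "age": "34", "city": " pune "},
    {"name": "Ana",   "age": "34", "city": " pune "},   # exact duplicate
    {"name": "Ravi",  "age": "",   "city": "MUMBAI"},   # missing age
    {"name": "Meera", "age": "29", "city": "Mumbai"},
]

# 1. Remove duplicates while preserving order
seen, deduped = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Standardize formats and handle missing values
known_ages = [int(r["age"]) for r in deduped if r["age"]]
mean_age = round(sum(known_ages) / len(known_ages))
for r in deduped:
    r["city"] = r["city"].strip().title()               # consistent casing/whitespace
    r["age"] = int(r["age"]) if r["age"] else mean_age  # mean imputation
```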
The K-means clustering algorithm organizes data into k clusters based on feature vectors. The process involves selecting k initial centroids, assigning each data point to the nearest centroid, updating each centroid to be the mean of all points assigned to it, and repeating these steps until convergence. For two clusters, the dataset {2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23} would be organized by first choosing initial centroids, then repeating the assignment and update steps until the cluster assignments stabilize.
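These steps, applied to the dataset from the question with k = 2. The initial centroids, taken here as the minimum and maximum values, are an assumption, since the question leaves the initialization open:

```python
# 1-D k-means with k = 2; initial centroids (min and max of the data)
# are an illustrative choice -- the question does not fix them.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23]
centroids = [min(data), max(data)]

while True:
    # Assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: each centroid becomes the mean of its cluster
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:   # converged: assignments stable
        break
    centroids = new_centroids
```

With this initialization the algorithm converges to {2, 3, 4, 10, 11, 12, 20, 23, 25} and {30, 56}, with centroids ≈12.2 and 43.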
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters based on data point density, detecting noise and outliers inherently. Unlike K-means, which requires a pre-defined number of clusters and is sensitive to cluster shapes, DBSCAN can find clusters of arbitrary shape and requires only two parameters: ε (eps), the neighborhood radius, and MinPts, the minimum number of neighbors required for a point to be a core point. It initiates a cluster from a core point having enough neighbors and expands it until no more points lie within ε distance, effectively distinguishing dense cluster regions from sparser noise.
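This expansion process can be sketched compactly on one-dimensional toy points; the data, eps, and min_pts values here are illustrative:

```python
# Compact DBSCAN sketch on 1-D toy points; eps / min_pts are illustrative.
points = [1, 2, 3, 4, 10, 11, 12, 13, 30]
eps, min_pts = 2, 3

def neighbors(i):
    # Indices of all points within eps of points[i] (includes i itself)
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

labels = [None] * len(points)   # None = unvisited, -1 = noise
cluster = 0
for i in range(len(points)):
    if labels[i] is not None:
        continue
    if len(neighbors(i)) < min_pts:
        labels[i] = -1          # not a core point: mark as noise (for now)
        continue
    labels[i] = cluster         # core point: start a new cluster
    queue = [j for j in neighbors(i) if j != i]
    while queue:                # expand to all density-reachable points
        j = queue.pop()
        if labels[j] == -1:
            labels[j] = cluster     # former noise becomes a border point
        if labels[j] is not None:
            continue                # already assigned: do not expand
        labels[j] = cluster
        if len(neighbors(j)) >= min_pts:
            queue.extend(neighbors(j))  # j is core too: keep expanding
    cluster += 1
```

The two dense runs become clusters 0 and 1, while the isolated point 30 is labeled noise (-1), without the number of clusters ever being specified.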
The time element is crucial in data warehouse data structures because it allows for the historical analysis of data over different time periods. This enables trend analysis, forecasting, and time series analysis, which are essential for strategic decision-making. By maintaining a time dimension, organizations can compare metrics across various temporal snapshots, providing insights into patterns and changes in behavior or performance over time.
Market Basket Analysis uses association rule mining to uncover relationships between items purchased together, enabling the identification of frequent item combinations within retail transactions. By deriving rules such as "if item A is purchased, item B is likely to be purchased," businesses can understand consumer preferences and behavior patterns, optimize product placement, cross-sell, and tailor marketing strategies to increase sales and customer satisfaction.
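The support/confidence arithmetic behind such a rule can be illustrated on a hypothetical set of baskets:

```python
# Hypothetical basket data (illustrative): how strong is "bread -> butter"?
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(baskets)
sup_bread = sum("bread" in b for b in baskets) / n             # 4/5
sup_both = sum({"bread", "butter"} <= b for b in baskets) / n  # 3/5
confidence = sup_both / sup_bread                              # 3/4
```

Here the rule holds in 3 of the 4 baskets containing bread, so a retailer would read it as a 75%-confidence association.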
The HITS algorithm computes two scores for each webpage: an authority score and a hub score. It assumes a mutually reinforcing relationship in which good hubs point to many good authorities, and good authorities are pointed to by many good hubs. In web mining, HITS identifies authoritative web pages on a given topic by analyzing link structures, enhancing the relevance and accuracy of search results and contributing to efficient information retrieval and ranking.
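The mutually reinforcing update can be sketched as power iteration on a small hypothetical link graph:

```python
# HITS on a tiny hypothetical link graph: page -> pages it links to.
links = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"], "p4": ["p3"]}
pages = sorted(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until the scores stabilize
    # Authority update: sum of hub scores of pages linking in
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum of authority scores of pages linked to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not grow without bound
    an = sum(v * v for v in auth.values()) ** 0.5
    hn = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / an for p, v in auth.items()}
    hub = {p: v / hn for p, v in hub.items()}
```

As expected, p3 (linked to by three pages) ends with the top authority score, and p1 (which links to both p2 and p3) ends as the top hub.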
OLTP (Online Transaction Processing) systems are designed to manage day-to-day transaction data, optimized for fast query processing and maintaining data integrity in multi-access environments, while OLAP (Online Analytical Processing) systems support complex queries to analyze historical data and generate reports, aiding strategic decision-making. OLTP focuses on transactional efficiency, whereas OLAP prioritizes analytical insight, impacting aspects such as system architecture, database design, and end-user interaction.