DWM Paper 1
Paper / Subject Code: 48894 / Data Warehousing & Mining

1T01875 - T.E. Computer Science & Engineering (Artificial Intelligence & Machine Learning) (Choice Based) (R-2019 'C' Scheme) SEMESTER - V / 48894 - Data Warehousing & Mining
QP CODE: 10029875  DATE: 31/05/2023

Time: 3 hours Max. Marks: 80

N.B. (1) Question No. 1 is compulsory.
     (2) Attempt any three questions out of the remaining.
     (3) Assume suitable data if required.

Q. 1 (a) Every data structure in the data warehouse contains the time element. Why? 05

(b) Calculate Accuracy, Recall and Precision with the help of following data: 05

True Positive (TP) = 50, True Negative (TN) = 20, False Positive (FP) = 20, False Negative (FN) = 10
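The three metrics follow directly from these four counts; a quick worked check:

```python
# Worked check of Q.1(b) from the given confusion-matrix counts.
TP, TN, FP, FN = 50, 20, 20, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 70 / 100
precision = TP / (TP + FP)                   # 50 / 70
recall = TP / (TP + FN)                      # 50 / 60

print(f"Accuracy:  {accuracy:.3f}")   # 0.700
print(f"Precision: {precision:.3f}")  # 0.714
print(f"Recall:    {recall:.3f}")     # 0.833
```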

(c) What is Market basket analysis? 05

(d) Draw and explain the KDD process. 05
Q. 2 a) Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
        i) Draw a star schema diagram for the data warehouse.
        ii) Draw the base cuboid [date, spectator, location] and apply any four OLAP operations. 10
b) What is clustering? Explain the K-means clustering algorithm. Suppose that the data mining task is to cluster the following items into two clusters: {2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23}. Apply the k-means algorithm. 10
Q.3 a) A database has five transactions. Let min sup = 50% and min conf = 70%.
T_id   Items
T100   a, b
T200   a, c, d
T300   e, c, a
T400   c, d, b
T500   a, c, d, b, e

Find all frequent itemsets and strong association rules using the Apriori algorithm. 10
b) What is data preprocessing? Explain different data cleaning techniques. 10


29875 Page 1 of 2
Q. 4 a) The following table contains a training set D of class-labeled tuples randomly selected from the AllElectronics customer database. Let buys_computer be the class label attribute. Using Naïve Bayesian classification, predict the class label of the tuple
X = (age = youth, income = medium, student = yes, credit_rating = fair).
RID  age          income  student  credit_rating  buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle-aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle-aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle-aged  medium  no       excellent      yes
13   middle-aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

b) What is web mining? Explain the HITS algorithm. 10

Q. 5 a) Explain with an example multilevel association mining and multidimensional rule mining. 10
b) Clearly explain the working of the DBSCAN algorithm using an appropriate diagram. 10


Q.6 (a) Explain with examples different data sampling techniques. 10


(b) Write short notes on any two: 10


i. Differentiate between OLTP and OLAP


ii. Web Content mining


iii. Data Loading in ETL


**********************************
29875 Page 2 of 2
Common questions


Multilevel association mining explores hierarchical levels of data abstraction, such as mining frequent itemsets at different category levels (e.g., electronics > laptops > gaming laptops). It detects associations at each abstraction level, uncovering broad as well as highly specific patterns. Multidimensional rule mining examines relationships across multiple dimensions, such as time, location, and product type, allowing analysis of the interrelated factors that influence, for example, sales.

Naïve Bayesian classification predicts a class label for a tuple by using Bayes' theorem to compute the posterior probability of each class given the tuple's attribute values, assuming the attributes are conditionally independent given the class. For the tuple X = (age = youth, income = medium, student = yes, credit_rating = fair), the classifier multiplies the prior probability of each buys_computer class by the conditional probabilities of these attribute values given that class, and predicts the class with the highest product.
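As a sketch, this calculation can be reproduced on the training table from Q.4(a) (the values are hard-coded from the question's table; no Laplace smoothing is applied):

```python
# Naive Bayes sketch for Q.4(a) on the AllElectronics training table.
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]
X = ("youth", "medium", "yes", "fair")  # tuple to classify

def posterior(label):
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)              # prior P(label)
    for i, value in enumerate(X):          # conditional independence
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

scores = {c: posterior(c) for c in ("yes", "no")}
print(max(scores, key=scores.get))  # predicted buys_computer: yes
```

The product for "yes" (about 0.028) exceeds that for "no" (about 0.007), so the classifier predicts buys_computer = yes for X.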

Strong association rules and frequent itemsets can be derived using the Apriori algorithm, which builds candidate itemsets level by level and prunes any candidate that contains an infrequent subset. With a minimum support of 50%, only itemsets that appear in at least half of the transactions are frequent. With a minimum confidence of 70%, only rules whose confidence meets this threshold are deemed strong. The process alternates between generating candidates, counting their support, and pruning, until no further frequent itemsets are found.
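A compact sketch of the Q.3(a) computation follows; brute-force enumeration of itemsets is used here in place of Apriori's levelwise candidate pruning, which gives the same frequent itemsets and rules on data this small:

```python
from itertools import combinations

# Q.3(a): five transactions, min_sup = 50% (>= 3 of 5), min_conf = 70%.
T = [{"a", "b"}, {"a", "c", "d"}, {"e", "c", "a"},
     {"c", "d", "b"}, {"a", "b", "c", "d", "e"}]
min_sup, min_conf = 0.5, 0.7

def support(items):
    return sum(items <= t for t in T) / len(T)

universe = sorted(set().union(*T))
frequent = [frozenset(c)
            for k in range(1, len(universe) + 1)
            for c in combinations(universe, k)
            if support(set(c)) >= min_sup]

rules = []
for s in frequent:
    for k in range(1, len(s)):
        for lhs in combinations(sorted(s), k):
            lhs = frozenset(lhs)
            conf = support(s) / support(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(s - lhs), conf))

print([set(f) for f in frequent])   # singletons a, b, c, d plus {a,c}, {c,d}
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, f"conf={conf:.0%}")
```

The frequent itemsets of size two are {a, c} and {c, d} (support 3/5 each), and the strong rules are a→c, c→a, c→d (confidence 75% each) and d→c (confidence 100%).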

Common data cleaning techniques include removing duplicates, correcting data errors, handling missing values, standardizing data formats, and filtering out irrelevant data. These techniques improve data quality by ensuring accuracy, consistency, and completeness, which are crucial for reliable data analysis and decision-making. Clean data reduces bias, minimizes errors in predictive modeling, and enhances the overall robustness of analytical insight.
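A minimal sketch of three of these steps on made-up records (the field names and the fill-with-mean strategy are illustrative choices, not prescribed by the question):

```python
# Illustrative data cleaning: standardize format, remove duplicates,
# fill missing values with the attribute mean.
records = [
    {"name": "Alice", "age": 30},
    {"name": "alice ", "age": 30},   # duplicate once standardized
    {"name": "Bob", "age": None},    # missing value
    {"name": "Carol", "age": 50},
]

# 1. Standardize format: trim whitespace, lower-case names.
for r in records:
    r["name"] = r["name"].strip().lower()

# 2. Remove duplicates (keep first occurrence).
seen, unique = set(), []
for r in records:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# 3. Fill missing ages with the mean of the known ones.
known = [r["age"] for r in unique if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in unique:
    if r["age"] is None:
        r["age"] = mean_age

print(unique)  # 3 cleaned records; Bob's age imputed as 40.0
```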

The K-means clustering algorithm partitions data into k clusters. It selects k initial centroids, assigns each data point to the nearest centroid, updates each centroid to the mean of the points assigned to it, and repeats the assignment and update steps until the cluster assignments stabilize. For two clusters, the data set {2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23} is clustered by choosing two initial centroids and iterating these steps until convergence.
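The steps above can be sketched on the 1-D data set from Q.2(b); taking the minimum and maximum values as the two initial centroids is an illustrative choice, since the question does not fix the seeds:

```python
# K-means sketch for the 1-D data set in Q.2(b), k = 2.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23]
centroids = [min(data), max(data)]  # deterministic initial choice: [2, 56]

while True:
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: move each centroid to its cluster mean.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:   # assignments stabilized
        break
    centroids = new_centroids

print(sorted(clusters[0]))  # [2, 3, 4, 10, 11, 12, 20, 23, 25]
print(sorted(clusters[1]))  # [30, 56]
```

With these seeds the algorithm converges in two passes; a different initialization could produce a different split, which is a known sensitivity of K-means.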

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters based on data point density, detecting noise and outliers inherently. Unlike K-means, which requires a pre-defined number of clusters and is sensitive to cluster shape, DBSCAN can find clusters of arbitrary shape and needs only two parameters: ε (eps), the neighborhood radius, and MinPts, the minimum number of points required to form a dense region. It starts from a core point that has enough neighbors and expands the cluster until no more points lie within ε distance, distinguishing dense cluster regions from sparser noise.
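A minimal sketch with made-up 2-D points (the point set, eps, and MinPts values are illustrative):

```python
# Minimal DBSCAN sketch: two dense groups plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),   # dense group A
          (8, 8), (8, 9), (9, 8), (9, 9),   # dense group B
          (20, 20)]                          # noise
eps, min_pts = 1.5, 3

def neighbors(i):
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

labels = [None] * len(points)   # None = unvisited, -1 = noise
cluster = 0
for i in range(len(points)):
    if labels[i] is not None:
        continue
    nbrs = neighbors(i)
    if len(nbrs) < min_pts:
        labels[i] = -1          # not a core point: tentatively noise
        continue
    labels[i] = cluster         # new cluster starts at this core point
    queue = list(nbrs)
    while queue:                # expand through density-reachable points
        j = queue.pop()
        if labels[j] == -1:
            labels[j] = cluster  # border point reclaimed from noise
        if labels[j] is not None:
            continue
        labels[j] = cluster
        jn = neighbors(j)
        if len(jn) >= min_pts:   # j is itself a core point: keep expanding
            queue.extend(jn)
    cluster += 1

print(labels)  # two clusters found; (20, 20) remains noise (-1)
```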

The time element is crucial in data warehouse data structures because it allows for the historical analysis of data over different time periods. This enables trend analysis, forecasting, and time series analysis, which are essential for strategic decision-making. By maintaining a time dimension, organizations can compare metrics across various temporal snapshots, providing insights into patterns and changes in behavior or performance over time.

Market basket analysis uses association rule mining to uncover relationships between items purchased together, identifying frequent item combinations within retail transactions. From rules such as "if item A is purchased, item B is likely to be purchased," businesses can understand consumer preferences and behavior patterns, optimize product placement, cross-sell, and tailor marketing strategies to increase sales and customer satisfaction.

The HITS algorithm computes two scores for each web page: an authority score and a hub score. It assumes a mutually reinforcing relationship: good hubs point to many good authorities, and good authorities are pointed to by many good hubs. In web mining, HITS identifies authoritative pages on a topic by analyzing the link structure, improving the relevance of search results and the ranking of retrieved pages.
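A sketch of this mutual reinforcement on a made-up four-page link graph (the graph and the fixed iteration count are illustrative choices):

```python
# HITS sketch: alternate authority and hub updates with normalization.
links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):            # fixed iteration count for simplicity
    # Authority update: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not grow without bound.
    na = sum(v * v for v in auth.values()) ** 0.5
    nh = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

best = max(auth, key=auth.get)
print(best)  # C: three pages link to it, so it is the top authority
```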

OLTP (Online Transaction Processing) systems are designed to manage day-to-day transaction data, optimized for fast query processing and maintaining data integrity in multi-access environments, while OLAP (Online Analytical Processing) systems support complex queries to analyze historical data and generate reports, aiding strategic decision-making. OLTP focuses on transactional efficiency, whereas OLAP prioritizes analytical insight, impacting aspects such as system architecture, database design, and end-user interaction.
