DWM Paper 1
Paper / Subject Code: 48894 / Data Warehousing & Mining
1T01875 - T.E. Computer Science & Engineering (Artificial Intelligence & Machine Learning) (Choice
Based) (R-2019 'C' Scheme) SEMESTER - V / 48894 - Data Warehousing & Mining QP CODE:
10029875 DATE: 31/05/2023.
Time: 3 hours Max. Marks: 80
N.B. (1) Question one is Compulsory.
(2) Attempt any 3 questions out of the remaining.
(3) Assume suitable data if required.
Q. 1 (a) Every data structure in the data warehouse contains the time element. Why? 05
(b) Calculate Accuracy, Recall and Precision with the help of the following data: 05
True Positive (TP) = 50, True Negative (TN) = 20, False Positive (FP) = 20, False Negative (FN) = 10
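The arithmetic these counts imply can be sketched directly, using the standard definitions of the three metrics:

```python
# Confusion-matrix counts given in the question
TP, TN, FP, FN = 50, 20, 20, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of correct predictions
precision = TP / (TP + FP)                   # of predicted positives, how many are real
recall = TP / (TP + FN)                      # of real positives, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# → 0.7 0.714 0.833
```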
(c) What is Market basket analysis? 05
(d) Draw and explain the KDD process. 05
Q. 2 a) Suppose that a data warehouse consists of the four dimensions, date, spectator,
location, and game, and the two measures, count and charge, where charge is
the fare that a spectator pays when watching a game on a given date. Spectators
may be students, adults, or seniors, with each category having its own charge rate.
b) Draw the base cuboid [date, spectator, location] and apply any four OLAP
operations. 10
b) What is clustering? Explain the K-means clustering algorithm. Suppose that the data is
{2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23}. Apply the k-means algorithm. 10
Q.3 a) A database has five transactions. Let min sup = 50% and min conf = 70%.
T_id   Items
T100   a, b
T200   a, c, d
T300   e, c, a
T400   c, d, b
T500   a, c, d, b, e
Find all frequent itemsets and strong association rules using Apriori Algorithm. 10
29875 Page 1 of 2
Q. 4 a) The following table contains a training set D, of class-labeled tuples randomly
selected from the AllElectronics customer database. Let buys_computer be the class label
attribute. Using Naïve Bayesian classification, predict the class label of the tuple
X = (age = youth, income = medium, student = yes, credit rating = fair).
RID  age          income  student  credit_rating  buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle-aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle-aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
11   youth        medium  yes      excellent      yes
b) What is web mining? Explain the HITS algorithm. 10
mining. 10
**********************************
29875 Page 2 of 2
Multilevel association mining explores hierarchical levels of data abstraction, such as mining frequent itemsets from categories (e.g., electronics > laptops > gaming laptops). It enables the detection of associations at various abstraction levels, uncovering broad as well as domain-specific patterns. Multidimensional rule mining examines relationships across multiple dimensions, like time, location, and product types, allowing for a comprehensive analysis of interrelated factors influencing sales, enhancing predictive insights and strategic alignment in complex data landscapes.
Naïve Bayesian classification predicts a class label for a tuple by calculating the posterior probabilities for each class, given the tuple's attributes, using Bayes' Theorem. It assumes independence between attributes. For instance, for the tuple X = (age = youth, income = medium, student = yes, credit rating = fair), the classifier multiplies the probabilities of these attributes given the class ‘buys_computer’ and selects the class with the highest product as the predicted class label.
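The counting behind this can be sketched in a few lines of Python. Note that the Q.4 table as extracted above is missing RIDs 9 and 10, so the probabilities here are over the nine rows shown rather than the full fourteen-row AllElectronics set:

```python
# Naive Bayes by plain counting over the nine rows listed in Q.4.
# Each row: (age, income, student, credit_rating, buys_computer)
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "medium", "yes", "excellent", "yes"),
]
X = ("youth", "medium", "yes", "fair")  # the tuple from the question

scores = {}
for label in ("yes", "no"):
    subset = [r for r in rows if r[-1] == label]
    score = len(subset) / len(rows)        # prior P(label)
    for i, value in enumerate(X):          # independence assumption:
        score *= sum(r[i] == value for r in subset) / len(subset)
    scores[label] = score                  # P(X | label) * P(label)

prediction = max(scores, key=scores.get)
```

On these nine rows the "yes" score (≈0.016) beats the "no" score (≈0.010), so X is predicted to buy a computer.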
Strong association rules and frequent itemsets can be derived using the Apriori algorithm, which identifies frequent itemsets level by level, pruning any candidate that contains an infrequent subset. With a minimum support of 50%, only itemsets that appear in at least half of the transactions are considered frequent. With a minimum confidence of 70%, only rules whose conditional probability meets this threshold are deemed strong. The process alternates between generating candidate itemsets, counting their support, and pruning non-compliant ones.
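For the five transactions in Q.3, the result can be checked with a brute-force enumeration. This is not the level-wise candidate generation Apriori itself uses, but under these thresholds it yields the identical frequent itemsets and strong rules:

```python
from itertools import combinations

# The five transactions from Q.3
transactions = [{'a', 'b'}, {'a', 'c', 'd'}, {'e', 'c', 'a'},
                {'c', 'd', 'b'}, {'a', 'c', 'd', 'b', 'e'}]
min_sup, min_conf = 0.5, 0.7
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / n

# Enumerate every candidate itemset (Apriori would prune level-wise instead)
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_sup:
            frequent[frozenset(cand)] = s

# Strong rules X -> Y: confidence = sup(X ∪ Y) / sup(X) >= min_conf
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            conf = frequent[itemset] / support(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset - lhs), conf))
```

This finds {a}, {b}, {c}, {d}, {a,c} and {c,d} frequent, and four strong rules: a→c, c→a, c→d (confidence 75% each) and d→c (100%).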
Common data cleaning techniques include removing duplicates, correcting data errors, handling missing values, standardizing data formats, and filtering out irrelevant data. These techniques improve data quality by ensuring accuracy, consistency, and completeness, which are crucial for reliable data analysis and decision-making. Clean data helps reduce biases, minimize errors in predictive modeling, and enhances the overall robustness of analytical insights.
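A minimal sketch of three of these techniques (deduplication, format standardization, mean imputation) on a small hypothetical record set:

```python
# Hypothetical raw records (illustrative only): an exact duplicate,
# a missing age, and inconsistent city formatting.
raw = [
    {"name": "Ana",   "age": "34", "city": " pune "},
    {"name": "Ana",   "age": "34", "city": " pune "},   # exact duplicate
    {"name": "Ravi",  "age": "",   "city": "MUMBAI"},   # missing age
    {"name": "Meera", "age": "29", "city": "Mumbai"},
]

# 1. Remove duplicates while preserving order
seen, deduped = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Standardize formats and handle missing values
known_ages = [int(r["age"]) for r in deduped if r["age"]]
mean_age = round(sum(known_ages) / len(known_ages))
for r in deduped:
    r["city"] = r["city"].strip().title()               # consistent casing/whitespace
    r["age"] = int(r["age"]) if r["age"] else mean_age  # mean imputation
```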
The K-means clustering algorithm organizes data into k clusters based on feature vectors. The process involves selecting k initial centroids, assigning each data point to the nearest centroid, updating each centroid to be the mean of all points assigned to it, and repeating these steps until convergence. For two clusters, the dataset {2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23} would be organized by first choosing initial centroids, then repeating the assignment and update steps until the cluster assignments stabilize.
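These steps, applied to the dataset from the question with k = 2. The initial centroids, taken here as the minimum and maximum values, are an assumption, since the question leaves the initialization open:

```python
# 1-D k-means with k = 2; initial centroids (min and max of the data)
# are an illustrative choice -- the question does not fix them.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25, 56, 23]
centroids = [min(data), max(data)]

while True:
    # Assignment step: each point joins its nearest centroid
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Update step: each centroid becomes the mean of its cluster
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:   # converged: assignments stable
        break
    centroids = new_centroids
```

With this initialization the algorithm converges to {2, 3, 4, 10, 11, 12, 20, 23, 25} and {30, 56}, with centroids ≈12.2 and 43.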
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters based on data point density, detecting noise and outliers inherently. Unlike K-means, which requires a pre-defined number of clusters and is sensitive to cluster shapes, DBSCAN can find clusters of arbitrary shape and requires only two parameters: ε (eps), the neighborhood radius, and MinPts, the minimum number of neighbors required for a point to be a core point. It initiates a cluster from a core point having enough neighbors and expands it until no more points lie within ε distance, effectively distinguishing dense cluster regions from sparser noise.
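This expansion process can be sketched compactly on one-dimensional toy points; the data, eps, and min_pts values here are illustrative:

```python
# Compact DBSCAN sketch on 1-D toy points; eps / min_pts are illustrative.
points = [1, 2, 3, 4, 10, 11, 12, 13, 30]
eps, min_pts = 2, 3

def neighbors(i):
    # Indices of all points within eps of points[i] (includes i itself)
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

labels = [None] * len(points)   # None = unvisited, -1 = noise
cluster = 0
for i in range(len(points)):
    if labels[i] is not None:
        continue
    if len(neighbors(i)) < min_pts:
        labels[i] = -1          # not a core point: mark as noise (for now)
        continue
    labels[i] = cluster         # core point: start a new cluster
    queue = [j for j in neighbors(i) if j != i]
    while queue:                # expand to all density-reachable points
        j = queue.pop()
        if labels[j] == -1:
            labels[j] = cluster     # former noise becomes a border point
        if labels[j] is not None:
            continue                # already assigned: do not expand
        labels[j] = cluster
        if len(neighbors(j)) >= min_pts:
            queue.extend(neighbors(j))  # j is core too: keep expanding
    cluster += 1
```

The two dense runs become clusters 0 and 1, while the isolated point 30 is labeled noise (-1), without the number of clusters ever being specified.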
The time element is crucial in data warehouse data structures because it allows for the historical analysis of data over different time periods. This enables trend analysis, forecasting, and time series analysis, which are essential for strategic decision-making. By maintaining a time dimension, organizations can compare metrics across various temporal snapshots, providing insights into patterns and changes in behavior or performance over time.
Market Basket Analysis uses association rule mining to uncover relationships between items purchased together, enabling the identification of frequent item combinations within retail transactions. By deriving rules such as "if item A is purchased, item B is likely to be purchased," businesses can understand consumer preferences and behavior patterns, optimize product placement, cross-sell, and tailor marketing strategies to increase sales and customer satisfaction.
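The support/confidence arithmetic behind such a rule can be illustrated on a hypothetical set of baskets:

```python
# Hypothetical basket data (illustrative): how strong is "bread -> butter"?
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(baskets)
sup_bread = sum("bread" in b for b in baskets) / n             # 4/5
sup_both = sum({"bread", "butter"} <= b for b in baskets) / n  # 3/5
confidence = sup_both / sup_bread                              # 3/4
```

Here the rule holds in 3 of the 4 baskets containing bread, so a retailer would read it as a 75%-confidence association.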
The HITS algorithm computes two scores for each webpage: an authority score and a hub score. It assumes a mutually reinforcing relationship in which good hubs point to many good authorities, and good authorities are pointed to by many good hubs. In web mining, HITS identifies authoritative web pages on a given topic by analyzing link structures, enhancing the relevance and accuracy of search results and contributing to efficient information retrieval and ranking.
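The mutually reinforcing update can be sketched as power iteration on a small hypothetical link graph:

```python
# HITS on a tiny hypothetical link graph: page -> pages it links to.
links = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"], "p4": ["p3"]}
pages = sorted(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):  # iterate until the scores stabilize
    # Authority update: sum of hub scores of pages linking in
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub update: sum of authority scores of pages linked to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not grow without bound
    an = sum(v * v for v in auth.values()) ** 0.5
    hn = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / an for p, v in auth.items()}
    hub = {p: v / hn for p, v in hub.items()}
```

As expected, p3 (linked to by three pages) ends with the top authority score, and p1 (which links to both p2 and p3) ends as the top hub.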
OLTP (Online Transaction Processing) systems are designed to manage day-to-day transaction data, optimized for fast query processing and maintaining data integrity in multi-access environments, while OLAP (Online Analytical Processing) systems support complex queries to analyze historical data and generate reports, aiding strategic decision-making. OLTP focuses on transactional efficiency, whereas OLAP prioritizes analytical insight, impacting aspects such as system architecture, database design, and end-user interaction.