IPU University 6th Sem Questions
DATA-ANALYTICS-Question-Paper-21-22 Download
Ans.
S. No. | Predictive analytics | Prescriptive analytics
1. | It provides insight into what is likely to happen in the future and how things are progressing. | It provides insight into what to do and how to do it.
2. | It measures each metric individually and does not evaluate the overall impact. | It evaluates the whole impact by measuring the metrics while taking into account all inputs, outputs, and processes.
Ans.
2. It transforms data and information into insights. It transforms raw data into information.
https://bachelorexam.com/data-analytics/important-aktu-question-paper-with-notes/#Section-4-Time-Series-Data-Analysis 2/18
3/26/24, 9:35 AM Data Analytics: Solution of Aktu Question Paper with Important Notes - Bachelor Exam
Ans. Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of
predictors that minimizes prediction error for a quantitative response variable.
Ans.
S. No. | Univariate analysis | Multivariate analysis
1. | It summarizes only one variable at a time. | It summarizes more than two variables at a time.
2. | Basic logic of univariate analysis is by means of contingency tables, distributions, continuous and discrete variables etc. | Basic logic of multivariate analysis is by means of contingency tables only.
Ans.
S. No. | Stream processing | Traditional processing
1. | It involves complex operations on multiple input streams when data is being processed. | It involves simple computations on data when data is being processed.
Ans. The sliding window technique is used to control the transmission of streaming data packets. It is used when the packets must be delivered reliably and in sequence. In a sliding window, tuples are gathered within a window that slides over the data stream at a given interval.
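The windowing idea can be sketched as follows. This is a minimal illustration, not a network protocol implementation; the function name `sliding_windows` and the parameters `size` and `step` are assumptions introduced here.

```python
from collections import deque

def sliding_windows(stream, size, step=1):
    """Yield successive windows of `size` tuples, advancing `step` tuples each time."""
    window = deque(maxlen=size)  # the oldest tuple drops out automatically
    for index, item in enumerate(stream, start=1):
        window.append(item)
        # Emit once the window is full, then again after every `step` arrivals.
        if index >= size and (index - size) % step == 0:
            yield list(window)

# Hypothetical sensor readings arriving one at a time.
readings = [3, 5, 4, 8, 6, 7]
averages = [sum(w) / len(w) for w in sliding_windows(readings, size=3)]
```

Each window overlaps the previous one, so an aggregate such as the average is always computed over the most recent `size` tuples.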
Ans. Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps:
1. Determine the two clusters that are the most closely related.
2. Combine the two groups that are the most similar. This iterative process is repeated until all of the clusters have been blended together.
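The two steps can be sketched on one-dimensional data as below. The choice of single linkage (distance between the closest members) is an assumption made for brevity; the answer above does not fix a linkage rule, and `agglomerative` is a hypothetical name.

```python
def agglomerative(points, target_clusters=1):
    """Agglomerative clustering of 1-D points using single linkage."""
    clusters = [[p] for p in points]  # every observation starts as its own cluster
    merges = []
    while len(clusters) > target_clusters:
        # Step 1: determine the two clusters that are closest to each other.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        # Step 2: combine those two clusters into one.
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        merges.append(merged)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges

clusters, merges = agglomerative([1, 2, 9, 10, 25], target_clusters=2)
```

The `merges` list records the order in which clusters were combined, which is the information a dendrogram would display.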
Ans. Lift is a measure of a targeting model’s (association rules) success at predicting or classifying cases as having an enhanced response as compared to a
random choice targeting model in association rule learning.
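For an association rule A -> C, lift is commonly computed as support(A and C) / (support(A) x support(C)); a value above 1 suggests the rule does better than random choice. A minimal sketch, with hypothetical baskets and a hypothetical helper named `lift`:

```python
def lift(transactions, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    support_a = sum(a <= t for t in transactions) / n
    support_c = sum(c <= t for t in transactions) / n
    support_both = sum((a | c) <= t for t in transactions) / n
    # Assumes both items occur at least once, so the denominator is non-zero.
    return support_both / (support_a * support_c)

# Illustrative data only; not the transaction table from the question paper.
baskets = [{"beer", "diaper"}, {"beer", "diaper", "milk"}, {"milk"}, {"beer"}]
beer_diaper_lift = lift(baskets, {"beer"}, {"diaper"})
```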
i. What is the basic description of a box plot in R?
Ans. Box plots are used to determine how evenly dispersed the data in a data set is. A box plot divides the data into quartiles and depicts the data set’s minimum, first quartile, median, third quartile, and maximum.
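In R the plot itself is drawn with `boxplot()`. The five values it displays can be sketched in Python as below; the helper name `five_number_summary` is an assumption, and quartile conventions differ between tools, so results may not match R exactly on every data set.

```python
import statistics

def five_number_summary(values):
    """Return the five values a box plot displays."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
```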
1. Tableau
2. Looker
a. Explain the Process Model and Computation Model of Big Data platform.
1. MapReduce is a distributed computing technique for processing enormous amounts of data and is used throughout the whole Hadoop ecosystem.
2. With this structure, data processing in massive distributed systems is made simpler for developers.
ii. The big file is split into multiple small files with the same size.
iii. These small files are processed in parallel by multiple map processes.
iv. The outputs of the processing are immediately passed on to the reduce process, which will quickly sum up and compute the map results.
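The split, map, and reduce steps can be sketched with a word count, the usual introductory example. This is a single-process illustration under assumed names (`split_file`, `map_chunk`, `reduce_pairs`); in Hadoop the map tasks would run in parallel on different nodes.

```python
from collections import defaultdict

def split_file(lines, chunk_size):
    """Split the input into chunks of equal size (the last may be shorter)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def reduce_pairs(mapped_outputs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for pairs in mapped_outputs:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

lines = ["big data", "data analytics", "big big data"]
chunks = split_file(lines, chunk_size=1)
word_counts = reduce_pairs(map(map_chunk, chunks))
```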
1. The technology that aids in data analysis, processing, and management to produce meaningful information is computational modelling.
2. The difficulty facing the modern industry is how to deal with identifying challenges in computational models by incorporating knowledge into Big Data
applications.
3. To enable analysts to adapt models swiftly to new insights, the methodologies and models are provided with instructions.
4. The decision support system is a powerful system that has a big impact on how Big Data is shaped for long-term effectiveness and performance.
5. Computational modelling decision-making is also a potent mechanism for enabling effective tools for Big Data management for influential application.
b. Explain the working of an Artificial Neural Network for image classification task.
Ans.
1. Image classification is the process of detecting images and classifying them into one of several distinct, preset categories.
2. Among the tasks in which artificial neural networks (ANNs) excel is image categorization.
3. Computer systems that can recognise patterns are known as neural networks.
4. Their namesake, the network of neurons in the human brain, served as the inspiration for their construction.
6. A signal is received by the input layer, processed by the hidden layer, and then a judgment or forecast is made regarding the input data by the output layer.
7. Each network layer is made up of artificial neurons that are connected nodes.
8. A system must first understand the features of a pattern in order to recognise it. To determine if an object is X or Z, it must be trained.
9. Artificial neural networks train on data sets from which they directly learn features.
10. There are numerous examples of each image class in the training data, which is a sizable dataset.
11. Every node layer trains using the output (feature set) generated by the layer before it.
12. As a result, nodes in each subsequent layer are able to distinguish increasingly intricate, specific features, which are visual representations of what the image shows.
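The input, hidden, and output layers described above can be sketched as a forward pass. This is only an illustration: the weights are hand-picked rather than learned, the two-pixel "image" and the labels X and Z are hypothetical, and training (adjusting the weights from labelled examples) is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

def classify(pixels, hidden_w, hidden_b, output_w, output_b, labels):
    hidden = layer(pixels, hidden_w, hidden_b)    # hidden layer extracts features
    scores = layer(hidden, output_w, output_b)    # output layer scores each class
    return labels[scores.index(max(scores))], scores

# Hand-picked (untrained) weights for a toy two-pixel input.
hidden_w, hidden_b = [[4, -4], [-4, 4]], [0, 0]
output_w, output_b = [[4, -4], [-4, 4]], [0, 0]
label_a, _ = classify([1, 0], hidden_w, hidden_b, output_w, output_b, ["X", "Z"])
label_b, _ = classify([0, 1], hidden_w, hidden_b, output_w, output_b, ["X", "Z"])
```

Real image classifiers use many more inputs (one per pixel and channel) and typically convolutional layers, but the flow of data from layer to layer is the same.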
Ans.
1. The architectural design pattern known as the Publish/Subscribe pattern, or pub/sub, offers a framework for message exchange between publishers and
subscribers.
2. In this pattern, the publisher and the subscribers communicate through a message broker, which passes messages from the publisher to the subscribers.
3. The channel’s subscribers can sign up to receive communications (events) that the host (publisher) posts to it.
5. Pub/sub systems are first categorized by the data model and query language that they support.
A. Subject-based:
1. Each communication is given a subject label from a predefined list (such as a stock quote) or hierarchy (such as sports/cricket).
3. In order to narrow down the collection of pertinent messages within a given subject, these queries can also include a filter on the data elements of the
message header.
B. Complex predicate-based:
1. Certain pub/sub systems allow user queries to contain predicates coupled using “and” and “or” operators to provide constraints over the values of the
attributes. These systems model the message content (payload) as a set of attribute-value pairs.
2. For example, a predicate-based query applied to the stock quotes can be “Symbol=’ABC’ and (Change > 1 or Volume > 50000)”.
1. In more recent pub/sub systems, the richness of XML-encoded messages is being utilised.
2. A pre-existing XML query language, such as XQuery, can be used to create user queries.
3. Messages can be further restructured for customized result delivery and perhaps more accurate filtering thanks to the rich XML structure and usage of an
XML query language.
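A broker with subject-based subscriptions and an optional predicate over message attributes can be sketched as follows. The class `Broker`, its methods, and the subject name `stock_quote` are assumptions for illustration; the predicate is the stock-quote query quoted above.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory message broker (illustrative only)."""

    def __init__(self):
        # subject -> list of (predicate, callback) pairs
        self._subscriptions = defaultdict(list)

    def subscribe(self, subject, callback, predicate=lambda message: True):
        self._subscriptions[subject].append((predicate, callback))

    def publish(self, subject, message):
        # Deliver only to subscribers of this subject whose predicate holds.
        for predicate, callback in self._subscriptions[subject]:
            if predicate(message):
                callback(message)

broker = Broker()
received = []
broker.subscribe(
    "stock_quote",
    received.append,
    predicate=lambda m: m["Symbol"] == "ABC" and (m["Change"] > 1 or m["Volume"] > 50000),
)
broker.publish("stock_quote", {"Symbol": "ABC", "Change": 0.4, "Volume": 80000})  # delivered
broker.publish("stock_quote", {"Symbol": "ABC", "Change": 0.4, "Volume": 100})    # filtered out
broker.publish("stock_quote", {"Symbol": "XYZ", "Change": 5.0, "Volume": 90000})  # filtered out
```

The publisher never refers to a subscriber directly; both sides know only the broker and the subject.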
Ans. The PCY Algorithm makes use of the fact that a lot of main memory is often available during the first pass of A-Priori but is not required for the counting of
single items.
During the two passes to find L2, the main memory is laid out as in Fig.
Assume that data is stored as a flat file, with records consisting of a basket ID and a list of its items.
1. Pass 1:
b. For each basket, consisting of items {i1,…..,ik}, hash each pair to a bucket of the hash table, and increment the count of the bucket by 1.
c. At the end of the pass, determine L1, the items with counts at least s.
Key point: a pair (i, j) cannot be frequent unless it hashes to a frequent bucket, so pairs that hash to other buckets need not be candidates in C2.
Replace the hash table by a bitmap, with one bit per bucket: 1 if the bucket was frequent, 0 if not.
2. Pass 2:
a. Main memory holds a list of all the frequent items, i.e., L1.
b. Main memory also holds the bit map summarizing the results of the hashing from pass 1.
Key point: The buckets must use 16 or 32 bits for a count, but these are compressed to 1 bit. Thus, even if the hash table occupied almost the entire main
memory on pass 1, its bitmap occupies no more than 1/16 of main memory on pass 2.
c. Finally, main memory also holds a table with all the candidate pairs and their counts. A pair (i, j) can be a candidate in C2 only if all of the following are true:
(i). i is in L1. (ii). j is in L1. (iii). (i, j) hashes to a frequent bucket. It is the last condition that distinguishes PCY from straight a-priori and reduces the requirements for memory in pass 2.
d. During pass 2, we consider each basket, and each pair of its items, making the test outlined above. If a pair meets all three conditions, add to its count in memory, or create an entry for it if one does not yet exist.
When does PCY beat a-priori? When there are too many pairs of items from L1 to fit a table of candidate pairs and their counts in main memory, yet the number of frequent buckets in the PCY algorithm is sufficiently small that it reduces the size of C2 below what can fit in memory (even with 1/16 of it given over to the bitmap).
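The two passes can be sketched as below. This is a small in-memory illustration, assuming integer item IDs and a hypothetical hash function `bucket_of`; a real implementation would stream baskets from disk and pack the bitmap into actual bits.

```python
from collections import Counter
from itertools import combinations

def bucket_of(pair, num_buckets):
    """Hypothetical hash function for a pair of integer item IDs."""
    i, j = pair
    return (i * 31 + j) % num_buckets

def pcy_frequent_pairs(baskets, support, num_buckets=11):
    # Pass 1: count single items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= support}  # L1
    bitmap = [count >= support for count in bucket_counts]  # one flag per bucket

    # Pass 2: count only pairs meeting all three candidate conditions.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[bucket_of(pair, num_buckets)]:
                pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 4]]
frequent_pairs = pcy_frequent_pairs(baskets, support=3)
```

The hash function affects only how many candidate pairs are counted in pass 2, not the final answer, because a frequent pair always lands in a frequent bucket.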
Ans.
1. It is used to handle data coming in high velocity. It is used to handle data coming in low velocity.
2. It gives both read and write scalability. It gives only read scalability.
4. Data arrives from many locations. Data arrives from one or few locations.
2. Collection of data:
b. Many tools, including computers, the internet, cameras, environmental sources, and human employees, can be used to accomplish this.
3. Organization of data:
c. A spreadsheet or other piece of software that can handle statistical data may be used for organization.
4. Cleaning of data:
b. This implies it has been cleaned up and examined to make sure there are no errors or duplicates and that it is not missing anything.
c. Before the data is sent to a data analyst to be analyzed, this phase helps to correct any inaccuracies.
S. No. | Traditional Analytics | Modern Analytics
1. | Traditional analytics is based on a fixed schema. | Modern analytics uses a dynamic schema.
2. | It could only work with structured data. | It can include structured as well as unstructured data.
3. | Analytics have always been performed after the event or time period being studied. | In modern analytics, analysis takes place in real-time.
4. | Traditional analytics is based on a centralized architecture. | Modern analytics is based on a distributed architecture.
5. | Traditionally, the sources of data were fairly limited. | There is a data explosion in modern analytics as a result of the numerous sources that record data almost constantly.
a. Discuss different types of Time Series Data Analysis along with its major application area.
2. Collection of data:
b. This can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or through personnel.
3. Organization of data:
c. Organization may take place on a spreadsheet or other form of software that can take statistical data.
4. Cleaning of data:
b. This means it is scrubbed and checked to ensure there is no duplication or error, and that it is not incomplete.
c. This step helps correct any errors before it goes on to a data analyst to be analyzed.
1. Retail sales:
a. A clothes retailer wants to predict future monthly sales for several product lines.
b. The seasonal influences on customers’ purchase decisions must be taken into consideration in these forecasts.
c. Demand fluctuations over the course of the year must be taken into account by a suitable time series model.
a. To ensure a sufficient supply of parts to fix consumer products, companies’ service groups must estimate future spare part requests. The spares
inventory frequently includes thousands of unique part numbers.
b. Complex models for each component number can be created to predict future demand using input variables including anticipated part failure rates,
the effectiveness of service diagnostics, and anticipated new product shipments.
c. Yet, time series analysis can produce precise short-term estimates based just on the past history of spare part demand.
3. Stock trading:
b. In pairs trading, a market opportunity is spotted using a strong positive correlation between the prices of two equities.
d. The variation in these companies’ stock values over time can be analysed using a time series approach.
e. If the price gap is statistically higher than predicted, it may be a smart idea to buy Company A stock and sell Company B stock, or vice versa.
b. Differentiate different types of support vector and kernel methods of data analysis.
1. Data input is transformed into the format needed for processing data using the kernel approach.
2. Kernel is utilised because it gives the Support Vector Machine (SVM) a window through which to change the data.
3. Following are major kernel methods:
i. Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about data.
ii. Gaussian Kernel Radial Basis Function (RBF): It is similar to the Gaussian kernel, but it also includes the radial basis approach to enhance the
transformation.
iii. Sigmoid Kernel: When employed as an activation function for artificial neurons, this function is comparable to a two-layer perceptron model of the
neural network.
iv. Polynomial Kernel: In a feature space over polynomials of the original variables used in the kernel, it depicts the similarity of vectors in the training set
of data.
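The kernels listed above can be written as plain functions that score the similarity of two vectors. The formulas are the commonly used ones; the parameter names and default values (`gamma`, `degree`, `coef0`, `alpha`) are assumptions chosen for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def gaussian_rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2): equals 1 for identical points, decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, alpha=0.1, coef0=0.0):
    """tanh(alpha * x . y + coef0)"""
    return math.tanh(alpha * dot(x, y) + coef0)

similarity = polynomial_kernel([1, 2], [3, 4])  # (1*3 + 2*4 + 1) ** 2
```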
1. A Support Vector Machine (SVM) is a supervised machine learning method in which we look for the optimum hyperplane to divide the two classes.
i. Linear SVM: The data must be linearly separable in order to be divided into two groups by a single straight line (if 2D).
ii. Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be divided into two classes by a straight line (in the case of 2D), which
calls for the employment of more sophisticated approaches like kernel tricks. Since linearly separable datapoints are rare in real-world applications, we
apply the kernel method to overcome these problems.
a. Discuss the components of a General Stream Processing Model. List few sources of Streaming Data.
a. A stream processor constantly streams data for consumption by other components after collecting it from its source and converting it to a common
message format.
b. A component that stores streaming data, such as an ETL tool or a data lake or warehouse.
c. Stream processors have a fast throughput, but they don’t perform task scheduling or data transformation.
a. Before data can be evaluated with SQL-based analytics tools, it must first be aggregated, processed, and structured from streams coming from one or
more message brokers.
b. An ETL tool or platform performs this by receiving user queries, retrieving events from message queues, and then applying the query to produce a
result.
c. The outcome could be a new data stream, an API call, an action, a visualization, or an alarm.
d. Apache Storm, Spark Streaming, and WSO2 Stream Processor are three examples of open-source ETL solutions for streaming data.
a. After streaming data is ready for the stream processor to consume, it needs to be analyzed to add value.
b. Streaming data analytics can be done in a variety of ways. Some of the most popular tools for streaming data analytics are Amazon Athena, Amazon
Redshift, and Cassandra.
1. Sensor data:
a. Sensor data are the information generated by sensors that are located in various locations.
b. Several sensors, including temperature sensors, GPS sensors, and other sensors, are installed at various locations to record the location’s temperature,
height, and other data.
d. The main memory stores the data or information provided by the sensor. Every tenth of a second, these sensors send a significant amount of data.
2. Image data:
a. Daily streams of many terabytes of photos are frequently sent from satellites to earth.
b. Although surveillance cameras’ image resolution is lower than that of satellites, there can be a lot of them, and each one can create a stream of photos
at intervals as short as one second.
a. An Internet switching node receives streams of IP packets from numerous inputs and routes them to its outputs.
b. The switch’s function is to convey data, not to store it, search for it, or give it greater power.
c. Different streams are received by websites. For instance, Google gets a few hundred million search requests every day. Yahoo’s numerous websites
receive billions of clicks per day.
b. Explain and apply Flajolet-Martin algorithm on the following stream of data to identify unique elements in the stream.
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1 S = 1, 3, 2, 1, 2, 3, 1, 2, 3, 1
1. Create a bit vector (bit array) of sufficient length L, such that 2^L > n, the number of elements in the stream. Usually a 64-bit vector is sufficient since 2^64 is quite large for most purposes.
2. The i-th bit in this vector/array represents whether we have seen a hash function value whose binary representation ends in i zeros. So initialize each bit to 0.
3. Generate a good, random hash function that maps input (usually strings) to natural numbers.
4. Read input. For each word, hash it and determine the number of trailing zeros. If the number of trailing zeros is k, set the k-th bit in the bit vector to 1.
5. Once input is exhausted, get the index of the first 0 in the bit array (call this R). By the way, this is just the number of consecutive 1s plus one.
Numerical:
S = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3
h(1) = (6 x 1 + 1) mod 5 = 2
h(2) = (6 x 2 + 1) mod 5 = 3
h(3) = (6 x 3 + 1) mod 5 = 4
h(1) = (6 x 1 + 1) mod 5 = 2
h(2) = (6 x 2 + 1) mod 5 = 3
h(3) = (6 x 3 + 1) mod 5 = 4
h(4) = (6 x 4 + 1) mod 5 = 0
h(3) = (6 x 3 + 1) mod 5 = 4
h(1) = (6 x 1 + 1) mod 5 = 2
h(2) = (6 x 2 + 1) mod 5 = 3
h(3) = (6 x 3 + 1) mod 5 = 4
h(1) = 2 = (0010)
h(2) = 3 = (0011)
h(3) = 4 = (0100)
h(4) = 0 = (0000)
Trailing zeros:
R(max) = h(4) = 4
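The numerical can be checked with a short sketch. It uses the simpler form of the algorithm that tracks only the maximum number of trailing zeros R and estimates the distinct count as 2^R. Two assumptions are made here: hash values are read as 4-bit numbers as in the listing above, and a hash value of 0 (0000) is counted as having 4 trailing zeros. Other treatments of a zero hash value exist and would give a different R.

```python
def trailing_zeros(value, width=4):
    """Count trailing zero bits in a fixed-width binary representation."""
    if value == 0:
        return width  # assumption: 0000 counts as `width` trailing zeros
    count = 0
    while value % 2 == 0:
        value //= 2
        count += 1
    return count

def flajolet_martin(stream, hash_fn, width=4):
    """Return (R, 2**R), where R is the maximum trailing-zero count seen."""
    r = max(trailing_zeros(hash_fn(x), width) for x in stream)
    return r, 2 ** r

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3]
r, estimate = flajolet_martin(stream, lambda x: (6 * x + 1) % 5)
```

Under these assumptions R = 4 and the estimate is 2^4 = 16, while the stream actually contains 4 distinct elements. A single small hash function gives a crude estimate; in practice several hash functions are combined.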
Ans.
S. No. | CLIQUE | PROCLUS
1. | CLIQUE is a density-based and grid-based subspace clustering technique. | PROCLUS is a typical dimension-reduction subspace clustering technique.
2. | CLIQUE allows overlap among clusters in different subspaces. | PROCLUS finds non-overlapped partitions of points in the clusters.
3. | The CLIQUE algorithm divides the data space into grids and then identifies dense units. | The PROCLUS algorithm includes initialization, iteration, and cluster refinement.
4. | Clusters are then generated from all dense subspaces using the a-priori approach. | Cluster generation does not use the a-priori approach.
5. | CLIQUE proceeds in a bottom-up manner. | PROCLUS searches subspaces for clusters in a top-down manner.
6. | CLIQUE automatically finds the highest-dimensional subspaces in which high-density clusters exist. | The clusters found help subsequent studies and can help us understand high-dimensional data.
7. | CLIQUE assigns one object to multiple clusters. | PROCLUS assigns one object to only one cluster.
b.
Find all the association rule from the above given transaction with
We will remove items coffee, milk because support value of these items is less than 50 %.
For Rules:
(Beer, Diaper)
Since all the rules have a confidence of more than 50%, all the rules are good.
Ans.
1. The Hadoop Ecosystem’s central element or skeleton is the Hadoop Distributed File System.
2. HDFS is the one that enables the storage of various kinds of huge data collections (i.e., structured, unstructured and semi structured data).
3. HDFS introduces a degree of resource abstraction that allows us to see the entire HDFS as a single entity.
4. It enables us to store our data across multiple nodes and to maintain a log file about the stored data (metadata).
a. Name node:
i. The name node is the master node and does not store the actual data.
ii. It holds metadata, or details about the stored data. As a result, it needs high computational resources but little storage.
b. Data node:
i. Data node stores the actual data in HDFS.
iii. It is responsible for read and write operations as per the request.
c. Block:
i. Generally the user data is stored in the files of HDFS.
ii. In a file system, the file will be split into one or more segments and/or kept in separate data nodes. Blocks are the name given to these file chunks.
iii. In other words, the minimum amount of data that HDFS can read or write is called a Block.
1. Mean():
2. Median():
a. It is the middle value of the data set. It splits the data into two halves.
b. If the number of elements in the data set is odd then the center element is median and if it is even then the median would be the average of two central
elements.
3. Mode():
a. It is the value that has the highest frequency in the given data set.
b. The data set may have no mode if the frequency of all data points is the same.
c. Also, we can have more than one mode if we have two or more data points having the same frequency.
4. Range():
a. The range describes the difference between the largest and smallest data point in our data set.
b. The bigger the range, the more is the spread of data and vice versa.
5. Variance():
b. It is computed by calculating the difference between each data point and the average, also referred to as the mean, squaring the difference, adding all the
data points together, and then dividing by the total number of data points in our data set.
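The measures above can be computed with Python's standard `statistics` module, as a quick check of the definitions. The data set is made up for illustration. `pvariance` is used because the description divides by the total number of data points (population variance), and `multimode` is used because a data set can have more than one mode.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),          # sum of values / number of values
    "median": statistics.median(data),      # average of the two central elements here
    "modes": statistics.multimode(data),    # list of all most frequent values
    "range": max(data) - min(data),         # largest minus smallest data point
    "variance": statistics.pvariance(data), # mean of squared deviations from the mean
}
```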