Recent Trends In IT
a) What is OLAP?
OLAP (Online Analytical Processing) is software for performing multidimensional analysis
at high speeds on large volumes of data from a data warehouse, data mart, or some other
unified, centralized data store.
b) Define ‘State Space’ in artificial intelligence.
A state space is the set of all configurations that a given problem and its environment could
achieve. Each configuration is called a state. Static information shared by all states is often
extracted and held separately, e.g., in the knowledge base of the agent.
c) What is Data frame?
A data frame is a collection of vectors of identical length. Each vector represents a
column, and each vector can be of a different data type (e.g., characters, integers, factors).
The str() function is useful to inspect the data types of the columns.
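To make this concrete, here is a minimal sketch in Python using pandas, whose DataFrame is the closest analogue of R's data frame; the column names and values are made up for illustration, and df.info() plays a role similar to R's str().

```python
# Illustrative pandas DataFrame: each column is like a vector of one data type.
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],         # character-like column
    "age": [23, 31, 27],                       # integer column
    "grade": pd.Categorical(["A", "B", "A"]),  # factor-like column
})

df.info()  # reports each column's dtype, much like str() in R
```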
d) What is RDD?
Resilient Distributed Datasets (RDDs). RDDs are the main logical data units in Spark.
They are a distributed collection of objects, which are stored in memory or on disks of
different machines of a cluster. A single RDD can be divided into multiple logical partitions
so that these partitions can be stored and processed on different machines of a cluster.
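A minimal PySpark sketch of these ideas, assuming a local Spark installation; the data and partition count are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

# parallelize() turns a local collection into an RDD split into 4 logical partitions.
rdd = sc.parallelize(range(10), numSlices=4)

print(rdd.getNumPartitions())           # 4 partitions, processable on different machines
print(rdd.map(lambda x: x * x).sum())   # transformations are lazy; sum() triggers execution

sc.stop()
```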
e) What is Data Mart?
A data mart is a simple form of a data warehouse that is focused on a single subject or line
of business, such as sales, finance, or marketing. Given their focus, data marts draw data
from fewer sources than data warehouses.
f) Define ETL tools.
ETL, which stands for extract, transform, and load, is a data integration process that
combines data from multiple data sources into a single, consistent data store that is loaded
into a data warehouse or other target system.
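As a hedged illustration of the flow (not of any particular ETL tool), here is a toy sketch in Python; the file name orders.csv, the sales table, and the transformation rule are all hypothetical.

```python
# Toy ETL pipeline: extract from a CSV, transform to a standard format,
# load into a SQLite database standing in for the warehouse.
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source system (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize into a standard format (trimmed, uppercased regions).
    return [(r["order_id"], r["region"].strip().upper(), float(r["amount"]))
            for r in rows]

def load(rows, db="warehouse.db"):
    # Load: write the consistent records into the target store.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales(order_id, region, amount)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("orders.csv")))  # hypothetical source file
```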
g) What is a Plateau in artificial intelligence?
The hill climbing algorithm is a technique used for optimizing mathematical problems.
Plateau: A plateau is a flat area of the search space in which all the neighbour states of the
current state have the same value; because of this, the algorithm cannot find any best
direction in which to move.
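A minimal hill-climbing sketch showing how a plateau stalls the search; the objective function and neighbourhood below are made up purely for illustration.

```python
def hill_climb(start, objective, neighbours):
    current = start
    while True:
        best = max(neighbours(current), key=objective, default=current)
        if objective(best) <= objective(current):
            # On a plateau every neighbour has the same value, so the
            # comparison fails and the search stops without a best direction.
            return current
        current = best

# A flat region between 3 and 7 acts as a plateau in this toy objective.
f = lambda x: 5 if 3 <= x <= 7 else x
print(hill_climb(4, f, lambda x: [x - 1, x + 1]))  # stalls at 4, inside the plateau
```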
h) Define OLTP.
OLTP or Online Transaction Processing is a type of data processing that consists of executing a
number of transactions occurring concurrently—online banking, shopping, order entry, or
sending text messages, for example.
i) Which language is not supported by Spark?
The Pascal language is not supported by Spark.
j) Define Ridge.
In hill climbing, a ridge is a special kind of local maximum: an area of the search space that
is higher than its surrounding areas, but that itself has a slope. Intuitively, it is a long,
narrow, elevated strip, a line that rises above what it is attached to; because no single move
appears to go uphill, a ridge is difficult for the search to traverse.
a) What are the components of Spark? Explain.
Apache Spark is an open-source, distributed processing system used for big data
workloads.
1. Apache Spark Core – Spark Core is the underlying general execution engine
for the Spark platform upon which all other functionality is built. It provides in-
memory computing and the ability to reference datasets in external storage systems.
2. Spark SQL – Spark SQL is Apache Spark’s module for working with
structured data. The interfaces offered by Spark SQL provide Spark with more
information about the structure of both the data and the computation being
performed (a short sketch follows this list).
3. Spark Streaming – This component allows Spark to process real-time
streaming data. Data can be ingested from many sources like Kafka, Flume,
and HDFS (Hadoop Distributed File System). Then the data can be processed
using complex algorithms and pushed out to file systems, databases, and live
dashboards.
4. MLlib (Machine Learning Library) – Apache Spark is equipped with a rich
library known as MLlib. This library contains a wide array of machine learning
algorithms: classification, regression, clustering, and collaborative filtering. It
also includes other tools for constructing, evaluating, and tuning ML Pipelines.
All these functionalities help Spark scale out across a cluster.
5. GraphX – Spark also comes with a library called GraphX for manipulating
graphs and performing graph-parallel computation. GraphX unifies the ETL
(Extract, Transform, and Load) process, exploratory analysis, and iterative graph
computation within a single system.
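A minimal PySpark sketch touching Spark Core and Spark SQL, assuming a local Spark installation; the department data is made up for illustration.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("components-demo")
         .getOrCreate())

# Spark SQL: structured data with a schema, queryable via SQL.
df = spark.createDataFrame([("sales", 100), ("finance", 80)], ["dept", "headcount"])
df.createOrReplaceTempView("depts")
spark.sql("SELECT dept FROM depts WHERE headcount > 90").show()

# Spark Core: the underlying RDD API is still accessible from a DataFrame.
print(df.rdd.map(lambda row: row["headcount"]).sum())

spark.stop()
```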
b) Explain Architecture of Data Warehouse
A data warehouse is a heterogeneous collection of different data sources organised under a unified
schema. There are two approaches for constructing a data warehouse, the top-down approach and the
bottom-up approach, explained below.
Top-down approach:
1. External Sources –
External source is a source from where data is collected irrespective of the type of
data. Data can be structured, semi structured and unstructured as well.
2. Staging Area –
Since the data extracted from the external sources does not follow a particular format,
it needs to be validated before being loaded into the data warehouse. For this purpose, it
is recommended to use an ETL tool.
E (Extract): Data is extracted from the external data source.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after being transformed into the standard
format.
3. Data warehouse –
After cleansing, the data is stored in the data warehouse as the central repository. It
actually stores the metadata, while the actual data is stored in the data marts. Note
that the data warehouse stores the data in its purest form in this top-down approach.
4. Data Marts –
A data mart is also part of the storage component. It stores the information of a particular
function of an organisation, handled by a single authority. There can be as many data
marts in an organisation as there are functions. We can also say that a data mart
contains a subset of the data stored in the data warehouse.
5. Data Mining –
Data mining is the practice of analysing the big data present in the data warehouse. It is
used to find the hidden patterns present in the database or data warehouse with the
help of data mining algorithms.
c) Describe technique of data mining.
Various techniques can be used to mine data for different data science applications. Pattern
recognition is a common data mining use case that's enabled by multiple techniques, as is
anomaly detection, which aims to identify outlier values in data sets. Popular data mining
techniques include the following types:
Association rule mining. In data mining, association rules are if-then statements that identify
relationships between data elements. Support and confidence criteria are used to assess the
relationships: support measures how frequently the related elements appear in a data set,
while confidence reflects how often an if-then statement proves accurate (a small computational
sketch follows this list).
Classification. This approach assigns the elements in data sets to different categories defined
as part of the data mining process. Decision trees, Naive Bayes classifiers, k-nearest neighbor
and logistic regression are some examples of classification methods.
Clustering. In this case, data elements that share particular characteristics are grouped
together into clusters as part of data mining applications. Examples include k-means clustering,
hierarchical clustering and Gaussian mixture models.
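As a concrete sketch of the support and confidence criteria mentioned above, here is a small Python example over a made-up transaction list (the rule {bread} → {butter} is purely illustrative).

```python
# Support and confidence for the association rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# confidence(A -> B) = support(A union B) / support(A)
confidence = support({"bread", "butter"}) / support({"bread"})

print(support({"bread", "butter"}))  # 0.5  (2 of 4 transactions)
print(confidence)                    # 0.666... (2 of the 3 bread transactions)
```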
d) Write the advantages of Bidirectional Search.
Advantages
Below are the advantages:
One of the main advantages of bidirectional searches is the speed at which we get the desired
results.
It drastically reduces the time taken by the search by having simultaneous searches.
It also saves resources for users as it requires less memory capacity to store all the searches.
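A hedged sketch of the idea in Python: run two breadth-first frontiers, one from the start and one from the goal, and stop as soon as they meet. The graph is an illustrative adjacency list.

```python
def bidirectional_search(graph, start, goal):
    if start == goal:
        return True
    front, back = {start}, {goal}        # the two simultaneous frontiers
    seen_f, seen_b = {start}, {goal}     # everything each side has reached
    while front and back:
        # Expanding the smaller frontier keeps both searches shallow,
        # which is where the memory savings come from.
        if len(front) > len(back):
            front, back = back, front
            seen_f, seen_b = seen_b, seen_f
        nxt = set()
        for node in front:
            for nbr in graph.get(node, []):
                if nbr in seen_b:        # the two searches have met
                    return True
                if nbr not in seen_f:
                    seen_f.add(nbr)
                    nxt.add(nbr)
        front = nxt
    return False

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(bidirectional_search(graph, "A", "D"))  # True
```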
e) What is the philosophy of artificial intelligence?
Well, as boring, logical, and commercial as we may think, or hope, AI to be, AI can get a lot
more meta, a lot more abstract, and a lot more complex than what it looks like on the surface.
There are two sides to AI: reasoning-based AI and behaviour-based AI. One dimension is
whether the goal is to match human performance or, instead, ideal rationality. The other
dimension covers the aspect of purpose: to build systems that reason/think, or rather systems
that act.
That’s where the question of ethics also trickles in. The moral side of AI, the doubts about robot
ethics, and the highly consequential impact of decisions made by machines on human life: it is
an iceberg waiting to be drilled into.
Where all this takes a unique turn is where the road forks between Strong and Weak AI.
a) What is data cleaning? Describe various method of data cleaning.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,
or incomplete data within a dataset. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or irrelevant
observations.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming conventions, typos, or
incorrect capitalization.
Step 3: Filter unwanted outliers
Often, there will be one-off observations where, at a glance, they do not appear to fit within the data you
are analyzing.
Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not accept missing values. There are a
couple of ways to deal with missing data. Neither is optimal, but both can be considered.
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer basic validation
questions, such as whether the data makes sense and whether it follows the appropriate rules
for its field.
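The first four steps can be sketched with pandas; the column names, the duplicate rows, and the outlier threshold below are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", "pune ", "Delhi", "Delhi", None, "Mumbai"],
    "sales": [120.0, 130.0, 95.0, 95.0, 88.0, 5000.0],
})

df = df.drop_duplicates()                        # Step 1: remove duplicate observations
df["city"] = df["city"].str.strip().str.title()  # Step 2: fix structural errors (typos, case)
df = df[df["sales"] < 1000]                      # Step 3: filter unwanted outliers (assumed cutoff)
df["city"] = df["city"].fillna("Unknown")        # Step 4: handle missing data
print(df)                                        # Step 5: inspect and validate the result
```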
b) Explain any two Types of OLAP Servers.
Relational OLAP (ROLAP):
Relational On-Line Analytical Processing (ROLAP) is primarily used for data stored in a
relational database, where both the base data and dimension tables are stored as
relational tables. ROLAP servers are used to bridge the gap between the relational back-
end server and the client’s front-end tools. ROLAP servers store and manage warehouse
data using RDBMS, and OLAP middleware fills in the gaps.
Hybrid OLAP (HOLAP):
Hybrid On-Line Analytical Processing (HOLAP) combines ROLAP and MOLAP, offering
the greater scalability of ROLAP and the faster computation of MOLAP. HOLAP servers
are capable of storing large amounts of detailed data. On the one hand, HOLAP benefits
from ROLAP’s greater scalability; on the other hand, it makes use of cube technology for
faster performance on summary-type information. Because detailed data is stored in a
relational database, the cubes are smaller than in MOLAP.
c) Elaborate the Spark installation steps.
Step 1: Verify the Java installation. Java is one of the mandatory prerequisites for installing Spark.
Step 2: Verify the Scala installation.
Step 3: Download Scala.
Step 4: Install Scala.
Step 5: Download Apache Spark.
Step 6: Install Spark.
Step 7: Verify the Spark installation.
d) Explain Breadth First Search technique of artificial intelligence.
The breadth-first search or BFS algorithm is used to search a tree or graph data structure for a
node that meets a set of criteria. It begins at the root of the tree or graph and investigates all
nodes at the current depth level before moving on to nodes at the next depth level. You can
solve many problems in graph theory via breadth-first search. For example, BFS finds the
shortest path between two vertices a and b as measured by the number of edges. In a flow
network, the Ford–Fulkerson method uses breadth-first search to calculate the maximum flow,
and when a binary tree is serialized/deserialized in BFS order instead of sorted order, the tree
can be reconstructed quickly.
Breadth-First Search Algorithm or BFS is the most widely utilized method.
BFS is a graph traversal approach in which you start at a source node and move layer by layer
through the graph, analyzing the nodes directly related to the source node. Then, in BFS
traversal, you move on to the next-level neighbour nodes.
According to BFS, you must traverse the graph in a breadthwise direction, as the sketch below shows:
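A minimal BFS sketch over an illustrative adjacency list, returning the order in which nodes are visited level by level.

```python
from collections import deque

def bfs(graph, source):
    visited, order = {source}, []
    queue = deque([source])
    while queue:
        node = queue.popleft()            # FIFO queue yields level-by-level order
        order.append(node)
        for nbr in graph.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D'] -- all depth-1 nodes before depth-2
```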
e) Write any four applications of Data Mining.
Education: For analyzing the education sector, data mining uses the Educational
Data Mining (EDM) method. This method generates patterns that can be used both by
learners and educators. By using EDM, we can perform various educational tasks.
Research: A data mining technique can perform predictions, classification, clustering,
associations, and grouping of data with precision in the research area. The rules generated
by data mining are uniquely suited to finding results. In most technical research in data
mining, we create a training model and a testing model.
Healthcare and Insurance: A pharmaceutical company can examine its recent sales force
activity and its outcomes to improve the targeting of high-value physicians and figure out
which promotional activities will have the best effect in the upcoming months, whereas in
the insurance sector, data mining can help predict which customers will buy new policies,
identify behaviour patterns of risky customers, and identify fraudulent behaviour of
customers.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
a) Differentiate between MOLAP and HOLAP
1. Full form:
ROLAP stands for Relational Online Analytical Processing. MOLAP stands for
Multidimensional Online Analytical Processing. HOLAP stands for Hybrid Online
Analytical Processing.
2. Storage of aggregations:
The ROLAP storage mode causes the aggregations of the division to be stored in indexed
views in the relational database that was specified in the partition's data source. The
MOLAP storage mode causes the aggregations of the division and a copy of its source
information to be saved in a multidimensional operation in analysis services when the
partition is processed. The HOLAP storage mode connects attributes of both MOLAP and
ROLAP: like MOLAP, HOLAP causes the aggregations of the division to be stored in a
multidimensional operation in an SQL Server analysis services instance.
3. Copy of the source information:
ROLAP does not cause a copy of the source information to be stored in the Analysis
services data folders; instead, when the outcome cannot be derived from the query cache,
the indexed views in the record source are accessed to answer queries. The MOLAP
operation is highly optimized to maximize query performance; the storage area can be on
the computer where the partition is described or on another computer running Analysis
services, and because a copy of the source information resides in the multidimensional
operation, queries can be resolved without accessing the partition's source record.
HOLAP does not cause a copy of the source information to be stored; for queries that
access only the summary records in the aggregations of a division, HOLAP is the
equivalent of MOLAP.
4. Query performance:
Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP
storage modes, and processing time is also frequently slower with ROLAP. With MOLAP,
query response times can be reduced substantially by using aggregations, though the
records in the partition's MOLAP operation are only as current as the most recent
processing of the partition. With HOLAP, queries that access source records (for example,
drilling down to an atomic cube cell for which there is no aggregation information) must
retrieve data from the relational database and will not be as fast as they would be if the
source information were stored in the MOLAP architecture.
b) What is the Missionaries and Cannibals Problem Statement? Write its
solution.
In the missionaries and cannibals problem, three missionaries and three cannibals must cross a river using a
boat which can carry at most two people, under the constraint that, for both banks, if there are missionaries
present on the bank, they cannot be outnumbered by cannibals (if they were, the cannibals would eat the
missionaries). The boat cannot cross the river by itself with no people on board. And, in some variations, one of
the cannibals has only one arm and cannot row. [1]
A system for solving the Missionaries and Cannibals problem can represent the current state by a
simple vector ⟨m, c, b⟩. The vector's elements represent the number of missionaries, cannibals, and
whether the boat is on the wrong side, respectively. Since the boat and all of the missionaries and cannibals start on the
wrong side, the vector is initialized to ⟨3,3,1⟩. Actions are represented using vector subtraction/addition to
manipulate the state vector. For instance, if a lone cannibal crossed the river, the vector ⟨0,1,1⟩ would be
subtracted from the state to yield ⟨3,2,0⟩. The state would reflect that there are still three missionaries and two
cannibals on the wrong side, and that the boat is now on the opposite bank. To fully solve the problem, a
simple tree is formed with the initial state as the root. The five possible actions (⟨1,0,1⟩, ⟨2,0,1⟩, ⟨0,1,1⟩, ⟨0,2,1⟩,
and ⟨1,1,1⟩) are then subtracted from the initial state, with the result forming children nodes of the root. Any
node that has more cannibals than missionaries on either bank is in an invalid state, and is therefore removed
from further consideration. The valid children nodes generated would be ⟨3,2,0⟩, ⟨3,1,0⟩, and ⟨2,2,0⟩. For each
of these remaining nodes, children nodes are generated by adding each of the possible action vectors. The
algorithm continues alternating subtraction and addition for each level of the tree until a node is generated with
the vector ⟨0,0,0⟩ as its value. This is the goal state, and the path from the root of the tree to this node
represents a sequence of actions that solves the problem.
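A hedged Python sketch of the vector-based search just described, using breadth-first search over states ⟨m, c, b⟩ and the five action vectors; the helper names are my own.

```python
from collections import deque

ACTIONS = [(1, 0, 1), (2, 0, 1), (0, 1, 1), (0, 2, 1), (1, 1, 1)]

def valid(m, c, b):
    if not (0 <= m <= 3 and 0 <= c <= 3 and b in (0, 1)):
        return False
    # Missionaries present on a bank must not be outnumbered by cannibals.
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def solve(start=(3, 3, 1), goal=(0, 0, 0)):
    parent, queue = {start: None}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:                 # reconstruct the action sequence
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        # Subtract actions while the boat is on the wrong side, add otherwise.
        sign = -1 if state[2] == 1 else 1
        for action in ACTIONS:
            nxt = tuple(s + sign * a for s, a in zip(state, action))
            if valid(*nxt) and nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

print(solve())  # e.g. [(3, 3, 1), (3, 1, 0), ..., (0, 0, 0)]
```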
c) How is Apache Spark different from MapReduce?
1. Hadoop MapReduce is an open-source framework used for writing data into the Hadoop
Distributed File System, whereas Spark is an open-source framework used for faster data
processing.
2. MapReduce has a very slow speed as compared to Apache Spark, whereas Spark is much
faster than MapReduce.
3. MapReduce is unable to handle real-time processing, whereas Spark can deal with
real-time processing.
4. MapReduce is difficult to program, as you must write code for every process, whereas
Spark is easy to program.
5. MapReduce supports more security projects, whereas Spark's security is not as good as
MapReduce's, and work on its security issues is ongoing.
d) What is Data warehouse? Describe any two applications in brief.
A data warehouse is a type of data management system used to store vast amounts of data
gathered from many sources within an organization for reporting and analysis. It supports
business managers in creating reports that assist them in dealing with complex queries while
making key business decisions. With a well-functioning data warehouse powered by cutting-
edge technology, it becomes much easier for businesses to obtain all their company data,
which secures their growth and success.
1. Banking Industry
Bankers can better manage all of their available resources with the right Data Warehousing
solution. They can better analyze consumer data, government regulations, and market trends to
facilitate better decision-making.
2. Government and Education
The government uses data warehouses to store and analyze tax records, health policy records,
and their respective providers. The entire criminal law database is also connected to the state's data
warehouse. Illegal activity is predicted from the patterns, trends, and results of analyzing
historical data associated with past criminals.
Universities employ data warehouses to collect information for grant proposals, student
demographic analysis, and human resource management. Most colleges' financial departments,
including the Financial Aid department, rely on data warehouses.
3. Healthcare
The healthcare industry is another important application of data warehouses. All clinical,
financial, and personnel data is saved in the warehouse, and analysis is performed to provide
useful insights into allocating resources effectively.
4. Telephone Industry
The telephone sector deals with both offline and online data, resulting in a large amount of
historical data to be consolidated and integrated.
A data warehouse is also required to study fixed assets, monitor customer calling patterns for
salespeople to push advertising campaigns, and track consumer queries.
e) Write in detail the various blind search techniques in artificial intelligence.
Uninformed/Blind Search:
The uninformed search does not contain any domain knowledge, such as closeness or the
location of the goal. It operates in a brute-force way, as it only includes information about how
to traverse the tree and how to identify leaf and goal nodes. Uninformed search applies a way
in which the search tree is searched without any information about the search space, like the
initial state operators and the test for the goal, so it is also called blind search. It examines
each node of the tree until it achieves the goal node.
It can be divided into five main types:
o Breadth-first search
o Uniform cost search
o Depth-first search
o Iterative deepening depth-first search
o Bidirectional Search
1. Breadth-first Search:
o Breadth-first search is the most common search strategy for traversing a tree or graph.
This algorithm searches breadthwise in a tree or graph, so it is called breadth-first search.
2. Depth-first Search
o Depth-first search is a recursive algorithm for traversing a tree or graph data structure.
o It is called the depth-first search because it starts from the root node and follows each
path to its greatest depth node before moving to the next path.
3. Depth-Limited Search Algorithm:
A depth-limited search algorithm is similar to depth-first search with a predetermined limit.
Depth-limited search can solve the drawback of the infinite path in depth-first search. In this
algorithm, the node at the depth limit is treated as if it has no further successor nodes.
4. Uniform-cost Search Algorithm:
Uniform-cost search is a searching algorithm used for traversing a weighted tree or graph. This
algorithm comes into play when a different cost is available for each edge. The primary goal of
the uniform-cost search is to find a path to the goal node which has the lowest cumulative cost.
Uniform-cost search expands nodes according to their path costs from the root node.
5. Iterative deepening depth-first Search:
The iterative deepening algorithm is a combination of the DFS and BFS algorithms. This search
algorithm finds the best depth limit by gradually increasing the limit until a goal is found
(a short sketch follows).
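A hedged sketch of the fifth technique (which also illustrates depth-limited search): repeated depth-limited searches with a gradually growing limit. The graph is an illustrative acyclic adjacency list; cycle checking is omitted for brevity.

```python
def depth_limited(graph, node, goal, limit):
    if node == goal:
        return True
    if limit == 0:                       # node at the depth limit: no further successors
        return False
    return any(depth_limited(graph, nbr, goal, limit - 1)
               for nbr in graph.get(node, []))

def iddfs(graph, start, goal, max_depth=10):
    for limit in range(max_depth + 1):   # gradually increase the depth limit
        if depth_limited(graph, start, goal, limit):
            return limit                 # the best (shallowest) depth of the goal
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
print(iddfs(graph, "A", "D"))  # 2
```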
a) Explain the ‘Water Jug Problem’ in artificial intelligence with the help of diagrams and propose a
solution to the problem.
Problem: There are two jugs of volume A gallons and B gallons. Neither has any measuring
mark on it. There is a pump that can be used to fill the jugs with water. How can you get
exactly x gallons of water into the A-gallon jug, assuming that we have an unlimited supply
of water?
Note: Let's assume we have A = 4-gallon and B = 3-gallon jugs, and we want exactly 2 gallons
of water in jug A (i.e., the 4-gallon jug). How will we do this?
Solution:
The state space for this problem can be described as the set of ordered pairs of integers (x,y)
Where,
x represents the quantity of water in the 4-gallon jug x= 0,1,2,3,4
y represents the quantity of water in 3-gallon jug y=0,1,2,3
Start State: (0,0)
Goal State: (2,0)
Generate production rules for the water jug problem
We basically perform three operations to achieve the goal:
1. Fill a water jug.
2. Empty a water jug.
3. Transfer water between jugs.
Rule 1: (X,Y | X<4) → (4,Y) {Fill the 4-gallon jug}
Rule 2: (X,Y | Y<3) → (X,3) {Fill the 3-gallon jug}
Rule 3: (X,Y | X>0) → (0,Y) {Empty the 4-gallon jug}
Rule 4: (X,Y | Y>0) → (X,0) {Empty the 3-gallon jug}
Rule 5: (X,Y | X+Y>=4 ^ Y>0) → (4, Y-(4-X)) {Pour water from the 3-gallon jug into the
4-gallon jug until the 4-gallon jug is full}
Rule 6: (X,Y | X+Y>=3 ^ X>0) → (X-(3-Y), 3) {Pour water from the 4-gallon jug into the
3-gallon jug until the 3-gallon jug is full}
Rule 7: (X,Y | X+Y<=4 ^ Y>0) → (X+Y, 0) {Pour all water from the 3-gallon jug into the
4-gallon jug}
Rule 8: (X,Y | X+Y<=3 ^ X>0) → (0, X+Y) {Pour all water from the 4-gallon jug into the
3-gallon jug}
Rule 9: (0,2) → (2,0) {Pour the 2 gallons of water from the 3-gallon jug into the 4-gallon jug}
Initialization:
Start State: (0,0)
Apply Rule 2:
Fill the 3-gallon jug
Now the state is (0,3)
Iteration 1:
Current State: (0,3)
Apply Rule 7:
Pour all water from the 3-gallon jug into the 4-gallon jug
Now the state is (3,0)
Iteration 2:
Current State : (3,0)
Apply Rule 2:
Fill 3-gallon jug
Now the state is (3,3)
Iteration 3:
Current State:(3,3)
Apply Rule 5:
Pour water from 3-gallon jug into 4-gallon jug until 4-gallon jug is full
Now the state is (4,2)
Iteration 4:
Current State : (4,2)
Apply Rule 3:
Empty 4-gallon jug
Now the state is (0,2)
Iteration 5:
Current State : (0,2)
Apply Rule 9:
Pour 2 gallon water from 3 gallon jug into 4 gallon jug
Now the state is (2,0)-- Goal Achieved.
Water Jug Solution using DFS (Depth First Search)
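A hedged DFS sketch for the (4, 3) problem described above; states are pairs (x, y) for the water in the 4-gallon and 3-gallon jugs, and the successor function encodes the fill/empty/pour rules.

```python
def successors(x, y):
    return {
        (4, y), (x, 3),                        # fill either jug
        (0, y), (x, 0),                        # empty either jug
        (min(4, x + y), max(0, x + y - 4)),    # pour the 3-gal jug into the 4-gal jug
        (max(0, x + y - 3), min(3, x + y)),    # pour the 4-gal jug into the 3-gal jug
    }

def dfs(state, goal, visited, path):
    if state == goal:
        return path
    for nxt in successors(*state):
        if nxt not in visited:                 # avoid revisiting states
            visited.add(nxt)
            result = dfs(nxt, goal, visited, path + [nxt])
            if result:
                return result
    return None

# Start at (0, 0) and search for the goal state (2, 0).
print(dfs((0, 0), (2, 0), {(0, 0)}, [(0, 0)]))
```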
b) Action Selection
Action selection in AI systems is the basic mechanism by which an AI machine analyzes a
problem to understand what it has to do next to get closer to the solution of the problem.
Since AI systems are very complex, the problem of action selection arises: more time and
computation are required to achieve a task, and the agents responsible for it have to work
through more data to get the job done. The ASM, or Action Selection System, is used to
determine the actions of the agents while paying attention to the perceptual behaviour of
the AI system. This causes modifications in the agent's behaviour when learning novel
things and helps it adapt to new things better. AI agents and action selection are very
important entities in devising an intelligent solution to a problem.
c) Snowflake Schema
A snowflake schema is a multi-dimensional data model that is an extension of
a star schema, where dimension tables are broken down into subdimensions.
Snowflake schemas are commonly used for business intelligence and reporting in
OLAP data warehouses, data marts, and relational databases.
In a snowflake schema, engineers break down individual dimension tables into
logical subdimensions. This makes the data model more complex, but it can be
easier for analysts to work with, especially for certain data types.
It's called a snowflake schema because its entity-relationship diagram (ERD) resembles a
snowflake.