BUSINESS INTELLIGENCE PYQs Answers
ENDSEM PYQs
Example:
Suppose we are analyzing sales data in a company.
Dimensions:
o Time: Year → Month → Day
o Product: Category → Sub-category → Item
o Location: Country → State → City
Fact:
o Sales Amount
So, if a business wants to know "What were the total sales
of mobiles in Pune in the month of January 2023?", it can
be answered easily using this model.
Advantages:
Allows quick data retrieval
Supports operations like roll-up, drill-down, slicing,
and dicing
Best suited for decision-making processes
2. Data Sorting:
Definition:
Sorting is the process of arranging data in a particular
order, either ascending (ASC) or descending (DESC), based
on one or more columns. Sorting helps in analyzing data
more easily.
Example (SQL):
SELECT *
FROM SalesData
ORDER BY Sales DESC;
✅ Que Explain Different Types of Reports in Detail
In Business Intelligence (BI), reports help organizations
analyze, interpret, and visualize their data for decision-
making. There are different types of reports based on the
purpose and audience.
1. Operational Reports
Focus on day-to-day business operations.
Show real-time or recent data.
Used by frontline staff or managers to take
immediate action.
Example: Daily sales report, stock level report.
2. Strategic Reports
Used for long-term planning and decision-making.
Often reviewed by top management or executives.
Based on historical and trend data.
Example: Annual performance report, 5-year
financial trends.
3. Analytical Reports
Focused on deep data analysis.
Use charts, graphs, trends, and statistical
techniques.
Helps in understanding why something happened.
Used by data analysts or BI professionals.
Example: Customer churn analysis, sales pattern
analysis.
4. Tactical Reports
Used for short- or medium-term planning.
Often used by middle management.
Helps in optimizing processes and teams.
Example: Monthly sales performance by team,
weekly marketing ROI.
5. Ad-hoc Reports
Created for a specific, one-time question or issue.
Not regularly scheduled.
Example: “How many customers bought product X
in Pune last month?”
Advantages:
Easy to query using SQL
Data integrity through keys and constraints
Supports relationships like one-to-many, many-to-
many, etc.
✅ Que Write a Short Note on Filtering Reports
Definition:
Filtering in reports means displaying only specific or
relevant data based on conditions. It helps in narrowing
down large datasets and focusing only on what’s important.
Purpose:
To hide unnecessary information
To help users focus on meaningful insights
Makes reports clear, concise, and readable
Types of Filtering:
1. Text Filtering: E.g., show only rows where Name =
'Rahul'.
2. Date Filtering: E.g., data between '1 Jan 2023' to '31
Mar 2023'.
3. Numeric Filtering: E.g., sales > ₹10,000.
4. Top N Filtering: E.g., Top 5 performing products.
Example in SQL:
SELECT *
FROM SalesData
WHERE Region = 'West' AND Sales > 50000;
This will show only the records from the 'West' region
where Sales are greater than ₹50,000.
Use in BI Tools:
In tools like Power BI, Tableau, Excel, filters can be applied
through:
Dropdowns
Slicers
Checkbox filters
Custom formulas
Benefits:
Increases report readability
Saves time in data analysis
Helps make better decisions
✅ What are the Best Practices in Dashboard Design?
A dashboard is a visual display of key information and
metrics used to monitor performance, trends, or data
insights in business intelligence (BI) tools like Power BI,
Tableau, etc.
🔷 1. Understand the Purpose
Know who will use the dashboard (executive,
analyst, team leader).
Identify what key questions the dashboard must
answer.
Use KPIs (Key Performance Indicators) relevant to
the user's goals.
🔷 5. Make it Interactive
Allow users to filter data, drill down, or switch
views.
Use slicers, drop-downs, and drill-through features
for interactivity.
🔷 6. Highlight Key Insights
Use color indicators (e.g., red for decline, green for
growth).
Use icons or arrows to show performance changes.
Add summary numbers to highlight totals, averages,
or alerts.
🔷 8. Mobile-Friendly Design
Ensure the dashboard is responsive and fits on
mobile/tablet screens.
Avoid wide layouts or large text blocks.
🔷 3. Filtering Reports
What it is:
Filtering is selecting only the specific subset of data based
on certain conditions.
Use in BI:
Allows users to focus only on relevant data.
Removes unwanted or irrelevant rows from the
report.
Example Use Case:
Show sales only for the 'Electronics' category in January.
SELECT *
FROM SalesData
WHERE Category = 'Electronics' AND Month = 'January';
Why It’s Useful:
Makes dashboards clean and focused
Supports custom views for different users or
departments
Improves performance by reducing data load
✅ Q: What is a File Extension? Explain the Structure of a
CSV File
🔷 What is a File Extension?
Definition:
A file extension is the suffix at the end of a file
name, indicating the file type and associated
program.
It typically consists of three or four characters,
separated from the file name by a dot (.).
Examples:
.docx → Microsoft Word document
.xlsx → Excel spreadsheet
.csv → Comma-Separated Values file
.jpg → Image file
.pdf → Portable Document Format
Why File Extensions are Useful:
Help the operating system identify and open files
with the right application.
Allow users to recognize file types quickly.
Basic Structure:
Name,Age,City
Rahul,22,Pune
Priya,21,Mumbai
Amit,23,Nashik
Header Row: The first line contains column names
(e.g., Name, Age, City).
Data Rows: Each subsequent line is a record.
Comma (,): Used as a delimiter to separate
columns.
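A tiny Python sketch of reading this structure, assuming the sample above is saved as a file named people.csv (the file name is only illustrative):
Example (Python):
import csv

# Read the CSV shown above (assumed to be saved as "people.csv")
with open("people.csv", newline="") as f:
    reader = csv.DictReader(f)   # the first line is treated as the header row
    for row in reader:           # every following line becomes one record
        print(row["Name"], row["Age"], row["City"])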
🔷 Together, they:
Make reports interactive, dynamic, and insightful.
Turn static reports into intelligent, action-based
documents.
Increase efficiency and data accuracy.
🔷 Example:
Suppose you're exploring a Sales dataset with columns like:
Product_ID, Region, Sales_Amount, Date, Quantity_Sold
Use mean(Sales_Amount) to check average revenue
Create a bar chart of Sales by Region
Use a scatter plot to find correlation between
Quantity_Sold and Sales_Amount
Identify if any products have unusually high returns
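A minimal pandas sketch of these exploration steps, assuming the dataset is stored in a file named sales.csv with the column names listed above (the file name is an assumption):
Example (Python):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                    # assumed file name

print(df["Sales_Amount"].mean())                 # average revenue

# Bar chart of total sales by region
df.groupby("Region")["Sales_Amount"].sum().plot(kind="bar")
plt.show()

# Scatter plot to inspect the Quantity_Sold vs. Sales_Amount relationship
df.plot(kind="scatter", x="Quantity_Sold", y="Sales_Amount")
plt.show()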
🔷 Purpose:
Understand the nature of the data
Decide on data cleaning, transformation, or
modelling methods
Find patterns or problems early
🔷 2. Incompleteness
Definition:
Data is considered incomplete when some required fields
are missing.
Example:
Customer entry without email or phone number
Sales record with missing transaction amount
Effect:
Leads to incorrect insights and biased results
Solution:
Fill using average values or imputation
Remove incomplete rows if not significant
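A small pandas sketch of both fixes (mean imputation, or dropping the incomplete rows); the column names and values are made up:
Example (Python):
import pandas as pd

df = pd.DataFrame({"Customer": ["A", "B", "C"],
                   "Amount":   [500, None, 700]})   # missing transaction amount

# Option 1: fill the gap with the column average (mean imputation)
df["Amount"] = df["Amount"].fillna(df["Amount"].mean())

# Option 2: drop rows that still miss a required field
df = df.dropna(subset=["Amount"])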
🔷 3. Noise
Definition:
Noise refers to random errors or meaningless data that
doesn't reflect the actual values.
Example:
Outlier in salary: ₹5,000,000 when most values are
around ₹50,000
Misspelled values: "Indai" instead of "India"
Effect:
Skews analysis and misleads models
Solution:
Use smoothing techniques
Remove or correct outliers
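A rough sketch of removing outliers with the common 1.5 × IQR rule (the rule and the salary figures are illustrative assumptions, not the only way to treat noise):
Example (Python):
import pandas as pd

salaries = pd.Series([48000, 50000, 52000, 51000, 5000000])  # one extreme outlier

# Keep only values within 1.5 * IQR of the middle 50% of the data
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
clean = salaries[(salaries >= q1 - 1.5 * iqr) & (salaries <= q3 + 1.5 * iqr)]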
🔷 4. Inconsistency
Definition:
Occurs when the same data is represented in different
formats or contains conflicting information.
Example:
Date in DD-MM-YYYY in one record and MM-DD-
YYYY in another
Customer name as “Rahul Sharma” in one record
and “R. Sharma” in another
Effect:
Creates confusion, duplicate records, and invalid
summaries
Solution:
Apply standard formatting rules
Use data cleaning or data integration tools
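A minimal pandas sketch of standardizing the kinds of inconsistencies mentioned above (variant spellings and date formats); the column names and values are illustrative:
Example (Python):
import pandas as pd

df = pd.DataFrame({"gender": ["M", "Male", "male"],
                   "date":   ["01-02-2023", "15-02-2023", "28-02-2023"]})

# Map every variant spelling onto one canonical label
df["gender"] = df["gender"].str.lower().map({"m": "Male", "male": "Male"})

# Parse dates with an explicit format so all records are stored the same way
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")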
2. Numerosity Reduction:
o Reduces data volume, not dimensions.
o Methods: Histograms, clustering, sampling
Example:
Instead of storing individual temperature readings for every
minute, store average temperature per hour.
3. Data Compression:
o Stores data in compact format.
o Techniques: Lossless or lossy compression,
encoding formats like .zip, .gz
4. Aggregation:
o Replaces raw data with summarized forms.
Example:
Replace daily sales data with monthly total sales.
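A brief pandas sketch of this kind of aggregation (daily rows replaced by monthly totals); the data is generated only for illustration:
Example (Python):
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.date_range("2023-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregation / numerosity reduction: one summarized row per month
monthly = daily.resample("MS", on="date")["sales"].sum()
print(monthly)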
🔷 Benefits:
Reduces data storage costs
Increases analysis speed
Removes noise or redundancy
Helps in building faster and simpler models
🔷 1. Univariate Analysis
Focus: One variable
Techniques: Frequency table, histogram, boxplot
Use: Summarize or describe
🔷 2. Bivariate Analysis
Focus: Two variables
Techniques: Correlation, scatter plot, cross-tab
Use: Identify relationship (positive/negative)
🔷 3. Multivariate Analysis
Focus: Multiple variables
Techniques: Multiple regression, clustering, PCA
(principal component analysis)
Use: Prediction, classification, segmentation
🔷 Types of Discretization:
1. Equal-Width Binning:
o Divides range into equal-sized intervals.
o Example: Age 0–10, 11–20, 21–30
2. Equal-Frequency Binning:
o Each bin has the same number of records.
3. Cluster-Based Discretization:
o Groups values based on clustering (e.g., K-
means)
🔷 Example:
Continuous Age Data:
18, 22, 24, 29, 35, 40
Discretized into:
Age Group 1: 18–25 → Young
Age Group 2: 26–35 → Middle-aged
Age Group 3: 36–45 → Older
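A small pandas sketch reproducing this grouping with pd.cut; the bin edges are chosen to match the age groups above:
Example (Python):
import pandas as pd

ages = pd.Series([18, 22, 24, 29, 35, 40])

groups = pd.cut(ages, bins=[17, 25, 35, 45],
                labels=["Young", "Middle-aged", "Older"])
print(groups.tolist())
# ['Young', 'Young', 'Young', 'Middle-aged', 'Middle-aged', 'Older']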
🔷 Benefits:
Reduces data complexity
Improves algorithm performance
Makes patterns more visible in charts and reports
🔷 Importance:
Prepares data for BI tools and machine learning
Removes inconsistencies
Improves analysis accuracy
🔷 2. Bivariate Analysis
Involves two variables
Purpose: Study relationship or comparison between
two fields
Example: Study hours vs. exam marks
Tools: Scatter plot, correlation, cross-tab
Application: Find patterns (e.g., positive correlation)
🔷 3. Multivariate Analysis
Involves more than two variables
Purpose: Analyze complex interactions between
variables
Common in predictive modeling
Example: Predict sales based on price, season, advertising
Tools: Multiple regression, PCA, clustering
Application: Forecasting, classification, segmentation
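A rough sketch of the multivariate idea using a multiple linear regression fitted with NumPy least squares; the sales, price and advertising figures are invented, and the categorical "season" variable is left out for brevity:
Example (Python):
import numpy as np

# Hypothetical records: price, advertising spend -> sales
X = np.array([[10, 200], [12, 150], [9, 300], [11, 250], [13, 100]], dtype=float)
y = np.array([520, 400, 640, 580, 350], dtype=float)

# Multiple regression: intercept plus one coefficient per input variable
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)   # [intercept, effect of price, effect of advertising]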
🔷 Example:
Let’s say a company wants to analyze customer satisfaction
based on Gender and Feedback.
🔷 Marginal Distribution:
Marginal distribution is the total frequency (or percentage)
of each category in rows or columns of a contingency table.
Marginal totals are usually found at the bottom row and
rightmost column of the table.
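A minimal pandas sketch of such a contingency table; margins=True adds the marginal totals as the last row and column (the Gender/Feedback values are made up):
Example (Python):
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["M", "F", "F", "M", "F", "M"],
    "Feedback": ["Good", "Good", "Poor", "Poor", "Good", "Good"],
})

# Cross-tab with marginal (row and column) totals
table = pd.crosstab(df["Gender"], df["Feedback"], margins=True)
print(table)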
🔷 Use in BI:
Used in data analysis, decision-making, and pattern
detection
Helps businesses find links (e.g., which group is
more satisfied)
🔷 2. Incompleteness
Definition: Occurs when some required values are missing
from the dataset.
Example:
Customer record missing phone number
Sales record without date
Impact:
Leads to incorrect analysis and weak models
Fix: Use techniques like data imputation or delete rows with
missing data.
🔷 3. Noise
Definition: Refers to random errors or irrelevant data in the
dataset.
Example:
Outliers like a ₹5,000,000 salary among ₹50,000
range
Spelling mistakes like "Indai" for "India"
Impact:
Misleads charts, averages, and algorithms
Fix: Use smoothing techniques or detect/remove outliers.
🔷 4. Inconsistency
Definition: Occurs when the same data is represented
differently or incorrectly across entries.
Example:
Date as "01/02/2023" in one place and "2023-02-
01" in another
Gender written as "M", "Male", "male"
Impact:
Causes confusion, errors in grouping or analysis
Fix: Use standardization and data cleaning techniques
🔷 2. Feature Selection
Definition:
Choosing the most important input variables (features)
while removing irrelevant or redundant ones.
Techniques:
Filter methods (correlation, chi-square test)
Wrapper methods (forward selection)
Embedded methods (Lasso Regression)
Example:
Remove features like “Customer Middle Name”
from sales prediction model.
Benefit:
Improves model performance, reduces complexity,
and avoids overfitting
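A minimal sketch of a filter method: keep only the features whose absolute correlation with the target crosses a hand-picked threshold (the dataset, column names and the 0.5 threshold are all illustrative assumptions):
Example (Python):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price":              rng.normal(100, 10, 50),
    "advertising":        rng.normal(50, 15, 50),
    "middle_name_length": rng.integers(3, 10, 50),   # irrelevant feature
})
df["sales"] = 3 * df["price"] + 2 * df["advertising"] + rng.normal(0, 5, 50)

# Filter method: keep features strongly correlated with the target
corr = df.corr()["sales"].drop("sales").abs()
selected = corr[corr > 0.5].index.tolist()
print(selected)   # expected to keep 'price' and 'advertising' only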
✅ 2. Normalization / Scaling
Adjusts numerical data to a common scale without
distorting differences.
Example: Salary values scaled from ₹10,000–
₹1,00,000 to 0–1 range
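A short min-max scaling sketch of that idea (the salary values are illustrative):
Example (Python):
import pandas as pd

salary = pd.Series([10_000, 25_000, 40_000, 100_000])

# Min-max normalization: rescale the values into the 0-1 range
scaled = (salary - salary.min()) / (salary.max() - salary.min())
print(scaled.tolist())   # [0.0, 0.1666..., 0.3333..., 1.0]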
🔷 Types of Binning:
1. Equal-Width Binning
o Divides the range of values into equal-size
intervals.
o Example: Divide ages 0–60 into 3 bins: 0–20,
21–40, 41–60
2. Equal-Frequency Binning
o Each bin contains approximately the same
number of data points.
3. Custom Binning
o Bins defined manually based on domain
knowledge.
o Example: 18–25 = “Young”, 26–40 = “Adult”,
41+ = “Senior”
🔷 Example:
Original Data (Ages):
[18, 22, 25, 28, 31, 34, 45, 48, 52]
Equal-Width Binning (Width = 10):
Bin 1 (18–27): 18, 22, 25
Bin 2 (28–37): 28, 31, 34
Bin 3 (38–47): 45
Bin 4 (48–57): 48, 52
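A short pandas sketch of the same equal-width binning using pd.cut, with the width-10 bin edges written out explicitly:
Example (Python):
import pandas as pd

ages = pd.Series([18, 22, 25, 28, 31, 34, 45, 48, 52])

# Four bins of width 10 starting at 18; right=False makes bins like [18, 28)
bins = pd.cut(ages, bins=[18, 28, 38, 48, 58], right=False)
print(bins.value_counts().sort_index())   # 3, 3, 1 and 2 values per bin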
🔷 Benefits:
Helps in data smoothing
Makes visualizations clearer
Required for categorical analysis
✅ 2. Scatter Plot
A graphical method to visualize the relationship between two numeric variables.
X-axis: Independent variable
Y-axis: Dependent variable
Example: Age vs. Income
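A minimal matplotlib sketch of such a plot; the age and income values are made up purely for illustration:
Example (Python):
import matplotlib.pyplot as plt

age    = [22, 25, 30, 35, 40, 45, 50]      # independent variable (X-axis)
income = [18, 24, 32, 40, 45, 52, 58]      # dependent variable (Y-axis), in thousands

plt.scatter(age, income)
plt.xlabel("Age")
plt.ylabel("Income (thousands)")
plt.title("Age vs. Income")
plt.show()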
✅ Q1: What is Association Rule Mining? Explain the Terms:
Support, Confidence, Lift
🔷 What is Association Rule Mining?
Association Rule Mining is a data mining technique used to
discover interesting relationships or patterns between
items in large datasets.
It is widely used in:
Market Basket Analysis
Retail sales
Recommendation systems
🔷 Example:
In a supermarket dataset, the rule:
{Bread} ⇒ {Butter}
Means: If a customer buys Bread, they are also likely to buy
Butter.
🔷 Key Terms:
✅ 1. Support
Support shows how frequently an itemset occurs in
the dataset.
Formula:
Support(A ⇒ B) = (Transactions containing both A and B) / (Total transactions)
Example:
o If 100 total transactions and 20 contain
{Bread, Butter}, then:
Support = 20 / 100 = 0.20 (20%)
✅ 2. Confidence
Confidence indicates the likelihood of buying B
given A.
Formula:
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
Example:
o If 40 transactions have Bread and 20 have
both Bread and Butter:
Confidence = 20 / 40 = 0.50 (50%)
✅ 3. Lift
Lift measures the strength of a rule over random
chance.
It tells whether the presence of A increases the
likelihood of B.
Formula:
Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)
Interpretation:
o Lift = 1 → No association
o Lift > 1 → Positive association
o Lift < 1 → Negative association
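A small Python sketch computing the three measures for the {Bread} ⇒ {Butter} rule from a list of transactions; the five transactions are invented only to make the arithmetic concrete:
Example (Python):
transactions = [
    {"Bread", "Butter"}, {"Bread", "Milk"}, {"Bread", "Butter", "Jam"},
    {"Milk", "Butter"},  {"Bread"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

sup_ab = support({"Bread", "Butter"})
conf   = sup_ab / support({"Bread"})      # Confidence(Bread => Butter)
lift   = conf / support({"Butter"})       # Lift(Bread => Butter)
print(sup_ab, conf, lift)                 # 0.4, 0.5, 0.833...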
✅ Frequent 2-itemsets:
{I1, I2}, {I1, I3}, {I1, I5}
{I2, I3}, {I2, I4}, {I2, I5}
🔷 L3 – Frequent 3-itemsets
Now try combinations of frequent 2-itemsets that appear together at least 2 times.
The frequent 3-itemsets are {I1, I2, I3} and {I1, I2, I5}, each with support count 2.
✅ Step 3: Generate Association Rules from Frequent
Itemsets
Use only rules with confidence ≥ 60%.
Example Rule 1:
{I1} ⇒ {I2}
Support(I1, I2) = 4
Support(I1) = 6
Confidence = 4/6 = 66.67% ✅
Rule 2:
{I2} ⇒ {I1}
Support(I1, I2) = 4
Support(I2) = 7
Confidence = 4/7 = 57.14% ❌
Rule 3:
{I2} ⇒ {I3}
Support(I2, I3) = 4
Support(I2) = 7
Confidence = 4/7 = 57.14% ❌
Rule 4:
{I3} ⇒ {I2}
Support(I2, I3) = 4
Support(I3) = 5
Confidence = 4/5 = 80% ✅
Rule 5:
{I1, I2} ⇒ {I3}
Support(I1, I2, I3) = 2
Support(I1, I2) = 4
Confidence = 2/4 = 50% ❌
Rule 6:
{I2, I3} ⇒ {I1}
Support(I1, I2, I3) = 2
Support(I2, I3) = 4
Confidence = 2/4 = 50% ❌
Rule 7:
{I1, I2} ⇒ {I5}
Support(I1, I2, I5) = 2
Support(I1, I2) = 4
Confidence = 2/4 = 50% ❌
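A tiny Python sketch re-checking some of these confidence values directly from the support counts used above:
Example (Python):
# Support counts taken from the worked example above
support = {
    ("I1",): 6, ("I2",): 7, ("I3",): 5,
    ("I1", "I2"): 4, ("I2", "I3"): 4,
    ("I1", "I2", "I3"): 2,
}

def confidence(antecedent, full_itemset):
    return support[full_itemset] / support[antecedent]

print(confidence(("I1",), ("I1", "I2")))            # 0.666... -> accepted (>= 60%)
print(confidence(("I2",), ("I1", "I2")))            # 0.571... -> rejected
print(confidence(("I3",), ("I2", "I3")))            # 0.8      -> accepted
print(confidence(("I1", "I2"), ("I1", "I2", "I3"))) # 0.5      -> rejected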
Bayes' Theorem: P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
P(A|B) is the probability of event A given that B has
occurred (posterior probability).
P(B|A) is the probability of event B given that A has
occurred (likelihood).
P(A) is the probability of event A occurring (prior
probability).
P(B) is the probability of event B occurring
(evidence).
Example: Imagine you have a disease test. The probability of
having the disease (event A) is 1% (prior probability). If you
test positive (event B), the test has a 90% true positive rate
(likelihood), and there is a 5% chance of a false positive in
people without the disease, i.e. P(B|not A) = 0.05.
Bayes' Theorem will help you update the probability of
actually having the disease after getting a positive test
result.
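A short Python calculation of that update, using the numbers from the example (0.05 is treated as the false positive rate P(B|not A)):
Example (Python):
p_disease     = 0.01    # prior P(A)
p_pos_disease = 0.90    # likelihood P(B|A), the true positive rate
p_pos_healthy = 0.05    # false positive rate P(B|not A)

# Evidence P(B): total probability of testing positive
p_pos = p_pos_disease * p_disease + p_pos_healthy * (1 - p_disease)

posterior = p_pos_disease * p_disease / p_pos   # Bayes' Theorem
print(round(posterior, 3))   # ~0.154, i.e. roughly a 15% chance of having the disease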
b) Difference Between Classification and Clustering
Both classification and clustering are types of machine
learning, but they have distinct differences:
Classification:
o Involves supervised learning.
o The goal is to predict a categorical label
(class) for a given data point based on
labeled training data.
o Example: Email spam detection, where
emails are classified as "spam" or "not
spam."
o Algorithms: Logistic regression, decision
trees, random forests, etc.
Clustering:
o Involves unsupervised learning.
o The goal is to group similar data points
together without predefined labels.
o Example: Customer segmentation, where
customers are grouped based on their
purchasing behavior without predefined
categories.
o Algorithms: K-means, hierarchical clustering,
DBSCAN, etc.
c) Logistic Regression
Logistic Regression is a statistical model used for binary
classification (predicting one of two outcomes).
It predicts the probability that a given input belongs to a
certain class (typically 0 or 1). Unlike linear regression, which
predicts a continuous value, logistic regression uses the
sigmoid function to squash the output between 0 and 1.
The logistic regression model equation is:
P(y = 1 | x) = 1 / (1 + e^-(b0 + b1x1 + ... + bnxn))
For an association rule of the form X ⇒ Y:
X is the antecedent (the item(s) we know).
Y is the consequent (the item(s) we want to predict).
Support and Confidence are used to evaluate the strength
and usefulness of association rules:
Support: It measures how frequently the itemset
appears in the dataset.
Formula:
Support(X ⇒ Y) = (Transactions containing both X and Y) / (Total transactions)
c) Clustering with K-Means (K=2)
To perform K-means clustering with K=2, we need to find
two clusters based on the ages of visitors.
Given the data:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45,
61, 62, 66.
1. Step 1: Initialize two random centroids. We
randomly pick two data points as the initial centroids, say
16 and 45.
2. Step 2: Assign each point to the nearest centroid.
o Cluster 1: Contains ages closer to 16.
o Cluster 2: Contains ages closer to 45.
After assignment:
o Cluster 1: 16, 16, 17, 20, 20, 21, 21, 22, 23,
29
o Cluster 2: 36, 41, 42, 43, 44, 45, 61, 62, 66
3. Step 3: Recalculate the centroids.
o New Centroid for Cluster 1: Average of [16,
16, 17, 20, 20, 21, 21, 22, 23, 29] = 20.5.
o New Centroid for Cluster 2: Average of [36,
41, 42, 43, 44, 45, 61, 62, 66] ≈ 48.9.
4. Step 4: Reassign points to the nearest centroid.
o After reassigning, the clusters may change
slightly. We iterate this process until the
assignments no longer change.
After a few iterations, we get stable clusters:
o Cluster 1: 16, 16, 17, 20, 20, 21, 21, 22, 23,
29.
o Cluster 2: 36, 41, 42, 43, 44, 45, 61, 62, 66.
These two clusters represent groups of visitors with ages
closer to the centroids 20.5 and 48.9.
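A short scikit-learn sketch of the same clustering. It starts from the same initial centroids as Step 1, so it reproduces the clusters above; with other random initializations K-means can settle on a slightly different split:
Example (Python):
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

# Fix the starting centroids at 16 and 45, as in Step 1
km = KMeans(n_clusters=2, init=np.array([[16.0], [45.0]]), n_init=1).fit(ages)

print(km.cluster_centers_.ravel())   # final centroids, close to 20.5 and 48.9
print(km.labels_)                    # cluster assignment for each visitor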
1. Types of Logistic Regression:
Logistic regression is mainly used for binary classification,
but there are different types based on the number of classes
and the model used. The main types are:
Binary Logistic Regression: This is the simplest form
of logistic regression, where the dependent variable
has two classes (e.g., 0 and 1, Yes and No).
Multinomial Logistic Regression: Used when the
dependent variable has more than two classes. For
example, predicting types of fruits (apple, orange,
banana).
Ordinal Logistic Regression: This is used when the
dependent variable is ordinal (i.e., it has ordered
categories, but the intervals between them are not
necessarily equal). For example, predicting customer
satisfaction levels like "Very dissatisfied",
"Dissatisfied", "Neutral", "Satisfied", "Very satisfied".
5. Definitions:
Frequent Itemset: An itemset (a set of items) that
appears frequently in a transaction dataset, meeting
a minimum support threshold.
Minimum Support Count: The minimum number of
times an itemset must appear in the dataset to be
considered frequent; when expressed as a fraction of
the total transactions it is called the minimum support.
Hierarchical Clustering: A type of clustering
algorithm that creates a hierarchy of clusters by
either successively merging small clusters
(agglomerative) or splitting large clusters (divisive).
The result is usually represented as a tree-like
structure called a dendrogram.
Regression: A type of supervised learning algorithm
used for predicting continuous values. The output of
a regression model is a numerical value, such as
predicting house prices based on features like size,
location, etc.
3. BI Applications in Logistics:
Business Intelligence (BI) is increasingly being used in the
logistics industry to optimize operations, reduce costs, and
improve customer satisfaction. BI tools help logistics
companies analyze vast amounts of data from different
sources to make informed decisions and enhance efficiency
across the supply chain. Here are several key applications of
BI in logistics:
1. Supply Chain Optimization: Logistics companies use
BI to analyze data related to inventory levels,
supplier performance, and delivery schedules. By
identifying inefficiencies or bottlenecks in the supply
chain, companies can optimize their processes,
reduce lead times, and lower costs. BI tools can
predict demand more accurately, helping
organizations manage their inventory and
production levels effectively.
2. Route Optimization: BI tools help logistics
companies analyze traffic patterns, weather
conditions, and delivery times to find the most
efficient routes. This not only reduces delivery time
but also saves fuel and ensures timely deliveries. BI
can be used to assess delivery performance and
adjust routes dynamically based on real-time data,
improving operational efficiency.
3. Fleet Management: BI applications in logistics help
track the performance of vehicles, monitor fuel
consumption, and maintain a record of vehicle
maintenance. By using historical data, companies
can optimize their fleet usage, predict when vehicles
need maintenance, and improve the overall fleet
performance. Real-time analytics enable logistics
companies to make proactive decisions regarding
their fleet operations.
4. Predictive Maintenance: With BI, logistics
companies can predict when equipment or vehicles
are likely to fail based on historical data and
performance metrics. By performing predictive
maintenance, companies can reduce downtime,
avoid costly repairs, and extend the life of their
equipment.
5. Inventory and Warehouse Management: BI tools
help logistics companies monitor inventory levels
and track warehouse operations in real time. This
allows them to make decisions based on accurate,
up-to-date data, ensuring that stock levels are
optimized and warehouses are efficiently managed.
BI can also track the movement of goods, helping to
reduce stockouts and overstocking situations.
6. Customer Satisfaction and Service Level
Monitoring: Logistics companies use BI to track and
analyze customer orders, delivery times, and service
levels. This helps identify patterns in customer
satisfaction and address issues such as late
deliveries or damaged goods. By monitoring key
metrics, logistics companies can improve their
service offerings and ensure higher customer
satisfaction.
7. Cost Management and Optimization: BI can help
logistics companies analyze and optimize various
cost factors such as fuel, labor, transportation, and
inventory storage. By tracking costs and comparing
them against performance metrics, businesses can
identify areas where they can cut costs, negotiate
better rates, or streamline operations.
8. Risk Management: Logistics companies face various
risks, such as fluctuating fuel prices, supply chain
disruptions, or regulatory changes. BI helps identify
potential risks by analyzing historical data and
market trends. This enables logistics companies to
take preventative actions, reduce exposure to risks,
and ensure smooth operations.