Data Mining for Business Intelligence – 2 Marks Q&A
1. What is Data Mining?
Data mining is the process of extracting meaningful patterns, trends, and knowledge from large
datasets using statistical, machine learning, and database techniques.
2. Define Business Intelligence (BI).
Business Intelligence refers to technologies, applications, and processes for collecting, integrating,
analyzing, and presenting business information to support decision-making.
3. List two major goals of Data Mining.
Prediction (forecasting unknown values)
Description (finding patterns and relationships)
4. Name any two Data Mining techniques.
Classification
Clustering
5. What is the difference between Data Warehousing and Data Mining?
Data warehousing stores and organizes large amounts of data, while data mining analyzes that data
to discover patterns and insights.
6. What is clustering in Data Mining?
Clustering is the process of grouping similar data objects together without predefined labels.
7. Give two examples of Business Intelligence tools.
Microsoft Power BI
Tableau
8. What is association rule mining?
A technique for finding "if A, then B" rules among items that co-occur in large datasets, e.g.,
market basket analysis.
9. Mention two challenges in Data Mining.
Handling noisy or incomplete data
Scalability with large datasets
10. What is predictive analytics?
Predictive analytics uses historical data and statistical models to predict future events or trends.
Data Mining for Business Intelligence – 2 Mark Q&A (Extended)
Basics & Concepts
What is Knowledge Discovery in Databases (KDD)?
KDD is the overall process of discovering useful knowledge from data, where data mining is one of
the steps.
List the main steps in the KDD process.
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation
What is OLAP in Business Intelligence?
OLAP (Online Analytical Processing) is a technology for fast, multidimensional analysis of business
data.
Differentiate OLTP and OLAP.
OLTP handles day-to-day transaction processing; OLAP supports analytical queries for
decision-making.
Define data preprocessing.
Data preprocessing is the process of cleaning, integrating, and transforming raw data into a usable
format for analysis.
Techniques
Name two classification algorithms.
Decision Tree (C4.5, ID3)
Naïve Bayes
What is regression in data mining?
A technique that models the relationship between a dependent variable and one or more
independent variables to predict numerical outcomes.
What is the difference between supervised and unsupervised learning?
Supervised learning uses labeled data; unsupervised learning uses unlabeled data.
What is a decision tree?
A classification model that splits data into branches based on attribute values, leading to decision
outcomes.
What is a confusion matrix?
A table used to measure the performance of a classification model by showing actual vs. predicted
results.
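As a rough illustration, the four cells of a binary confusion matrix can be counted directly from
paired actual and predicted labels. The function name and the sample labels below are made up for
this sketch.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss (FN) or correct (TN).
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
```

For these labels the counts come out as 3 true positives, 1 false positive, 1 false negative, and
3 true negatives.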
Applications
Mention two business applications of data mining.
Customer churn prediction
Fraud detection
What is market basket analysis?
An association rule mining technique to identify items that are frequently bought together.
Name two industries where BI is widely used.
Retail
Banking
What is text mining?
Extracting meaningful information and patterns from unstructured text data.
What is web mining?
Discovering patterns from web data, such as website logs or social media content.
Data Quality & Challenges
List two issues that affect data quality.
Missing values
Duplicate records
What is data cleaning?
The process of detecting and correcting inaccurate, incomplete, or irrelevant parts of data.
Why is scalability important in data mining?
Because large datasets require algorithms that can handle high volumes efficiently.
What is data integration?
Combining data from multiple sources into a coherent data store.
Define dimensionality reduction.
The process of reducing the number of variables under consideration, e.g., using Principal
Component Analysis (PCA).
Evaluation & Visualization
What is lift in association rule mining?
The ratio of how often two items are bought together to how often they would be bought together if
they were independent; a lift above 1 indicates a positive association.
What is an ROC curve?
A graphical plot of a binary classifier’s performance, showing True Positive Rate against False
Positive Rate as the decision threshold varies.
What is support in association rule mining?
The proportion of transactions containing a particular itemset.
What is confidence in association rule mining?
For a rule A → B, the proportion of transactions containing A that also contain B.
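Support, confidence, and lift can be worked out by hand on a small example. The sketch below
assumes a made-up list of five transactions; the item names and helper functions are illustrative
only.

```python
# Hypothetical toy transactions; item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

Here bread and milk appear together in 3 of 5 transactions, and the lift of bread → milk comes
out slightly below 1, so in this toy data buying bread does not make milk more likely.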
Why is data visualization important in BI?
It helps decision-makers understand complex data patterns quickly and clearly.
Data Mining for Business Intelligence – Additional 2-Mark Questions & Answers
Core Concepts
What is metadata in BI?
Metadata is “data about data,” describing the structure, origin, and meaning of stored data.
What is ETL in BI?
ETL stands for Extract, Transform, Load — a process to move data from source systems into a data
warehouse.
Differentiate structured and unstructured data.
Structured data is organized in fixed fields (e.g., tables); unstructured data lacks a predefined format
(e.g., emails, videos).
What is a data mart?
A subset of a data warehouse, designed for a specific business function or department.
What is real-time BI?
BI systems that deliver up-to-the-minute data and analytics for immediate decision-making.
Data Mining Methods
What is anomaly detection?
Identifying rare items, events, or patterns that differ from the majority of data.
Name two clustering algorithms.
K-Means
DBSCAN
What is hierarchical clustering?
A clustering method that builds a hierarchy of clusters using either an agglomerative (bottom-up)
or a divisive (top-down) approach.
What is time-series analysis?
Analyzing data points collected over time to identify trends, cycles, and seasonal patterns.
What is sequence mining?
Finding patterns in data whose values or events occur in an ordered sequence.
Business Applications
What is CRM in the context of BI?
Customer Relationship Management uses BI tools to analyze customer behavior and improve
retention.
Give two examples of predictive analytics in business.
Sales forecasting
Risk assessment in insurance
What is credit scoring?
A predictive model used in finance to estimate the likelihood that a borrower will repay a loan.
What is sentiment analysis?
Analyzing text data to determine the writer’s or speaker’s emotional tone.
What is demand forecasting?
Predicting future product or service demand using historical sales and market data.
Data Quality & Management
What is data redundancy?
The unnecessary repetition of data, which can cause storage inefficiency and inconsistencies.
What is data transformation?
Converting data into a suitable format or structure for analysis.
What is noise in data?
Random errors or irrelevant information that can distort analysis.
Why is data normalization important?
It puts variables on a comparable scale, so that features with large ranges do not dominate
distance-based or gradient-based models.
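One common way to normalize is min-max scaling, which maps each value into the range 0 to 1. This
is a minimal sketch; the function name and the income figures are assumptions for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical annual incomes.
incomes = [20000, 35000, 50000, 80000]
print(min_max_scale(incomes))
```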
What is data lineage?
Tracing the origin and movement of data through systems over time.
Evaluation & Measures
What is precision in classification?
The ratio of correctly predicted positives to all predicted positives: TP / (TP + FP).
What is recall in classification?
The ratio of correctly predicted positives to all actual positives: TP / (TP + FN).
What is the F1-score?
The harmonic mean of precision and recall: 2 × Precision × Recall / (Precision + Recall).
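Precision, recall, and F1 can be computed from three confusion-matrix counts. The sketch below
uses a hypothetical helper and made-up counts (8 true positives, 2 false positives, 8 false
negatives).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # Harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))
```

With these counts precision is 0.8 and recall is 0.5; the F1-score (about 0.615) sits closer to
the lower of the two, which is the point of using a harmonic mean.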
What is overfitting?
When a model learns noise along with the patterns, performing well on training data but poorly on
unseen data.
What is underfitting?
When a model is too simple to capture the underlying patterns in the data.
Data Mining for Business Intelligence – More 2-Mark Q&A
Concepts & Architecture
What is data profiling?
The process of examining data to understand its structure, quality, and content.
What is the purpose of a data warehouse?
To store integrated, subject-oriented, time-variant, and non-volatile data for analysis.
What is a star schema?
A data warehouse schema with a central fact table connected to dimension tables.
What is a snowflake schema?
A more normalized version of the star schema where dimension tables are split into sub-dimensions.
What is a fact table in BI?
A table containing quantitative business metrics (facts) linked to dimension tables.
Data Mining Tasks
What is descriptive analytics?
Analytics that focuses on summarizing past data to understand what has happened.
What is diagnostic analytics?
Analytics that investigates why a certain event or trend occurred.
Name two distance measures used in clustering.
Euclidean distance
Manhattan distance
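Both measures are easy to compute for two points with the same number of coordinates. A minimal
sketch, with function names chosen for illustration:

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
```

For the points (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7.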
What is K-means clustering?
A clustering algorithm that partitions data into k clusters by assigning each point to the nearest
cluster centroid and recomputing the centroids until the assignments stabilize.
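The usual iterative procedure for K-means (often called Lloyd's algorithm) can be sketched in a few
lines. This version assumes the starting centroids are supplied by the caller and uses squared
Euclidean distance; the function name and sample points are illustrative, not a reference
implementation.

```python
def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is why practical implementations usually try
several initializations.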
What is the Apriori algorithm used for?
To mine frequent itemsets and generate association rules.
Applications in BI
What is churn analysis?
Predicting which customers are likely to stop using a product or service.
What is clickstream analysis?
Analyzing user navigation paths on websites for behavior patterns.
What is fraud detection in BI?
Identifying suspicious transactions or activities using pattern recognition and anomaly detection.
What is a recommendation system?
A system that suggests items to users based on their past behavior or the preferences of similar
users.
What is supply chain analytics?
Using BI tools to optimize procurement, inventory, and distribution processes.
Data Issues & Management
What is imbalanced data?
A dataset where the number of instances in different classes is highly unequal.
What is sampling in data mining?
Selecting a representative subset of data for analysis.
What is stratified sampling?
Sampling that ensures proportional representation of different categories.
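A simple way to picture stratified sampling is to group the records by category and draw the same
fraction from each group. The sketch below assumes hypothetical records tagged "A" or "B" and a
fixed random seed so the draw is repeatable.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every category returned by key(record)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        # Take at least one record per category so small groups are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical dataset: 80 records of class "A" and 20 of class "B".
records = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(sample))
```

The 10% sample keeps the original 80:20 class ratio (8 records of "A" and 2 of "B"), which a
plain random sample would not guarantee.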
What is data security in BI?
Protecting sensitive business data from unauthorized access and breaches.
What is data governance?
A set of policies and processes ensuring data quality, security, and compliance.
Model Evaluation & Metrics
What is a training dataset?
A dataset used to build and train a machine learning model.
What is a test dataset?
A dataset used to evaluate the performance of a trained model.
What is cross-validation?
A technique for assessing model performance by repeatedly splitting the data into different
train/test folds and averaging the results.
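In k-fold cross-validation, each record is used for testing exactly once. The sketch below only
generates the index splits (no model is trained); the function name is an assumption for
illustration, and in practice the data would normally be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


for train, test in k_fold_indices(n=10, k=3):
    print("train:", train, "test:", test)
```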
What is AUC in classification?
Area Under the ROC Curve — measures classifier performance across thresholds.
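AUC can also be read as the probability that a randomly chosen positive example gets a higher
score than a randomly chosen negative one. The sketch below computes it that way by comparing
every positive/negative pair (ties count as half); the labels and scores are made up.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

Three of the four positive/negative pairs are ranked correctly here, giving an AUC of 0.75. A
value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.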
What is mean absolute error (MAE)?
The average of absolute differences between predicted and actual values.
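A short worked example of MAE, using made-up actual and predicted values:

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values: absolute errors are 1, 0, 2, and 1.
actual = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]
print(mean_absolute_error(actual, predicted))
```

The absolute errors sum to 4 over 4 observations, so the MAE is 1.0.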
Data Mining for Business Intelligence –
Advanced Concepts
What is ensemble learning?
A technique that combines multiple models to improve prediction accuracy.
Name two ensemble methods.
Bagging
Boosting
What is bagging?
Bootstrap Aggregating — training multiple models on different random samples and combining their
results.
What is boosting?
Sequentially training models, giving more weight to previously misclassified instances.
What is a random forest?
An ensemble of decision trees, each trained on a random sample of the data and features, whose
predictions are combined by voting or averaging to improve accuracy and reduce overfitting.
Data Mining Process & Tools
What is CRISP-DM?
Cross-Industry Standard Process for Data Mining — a methodology for planning and executing data
mining projects.
List the main phases of CRISP-DM.
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
What is SAS Enterprise Miner?
A data mining software package for building predictive and descriptive models.
What is IBM SPSS Modeler?
A data mining tool for building models without coding, using a visual interface.
What is Weka?
An open-source collection of machine learning algorithms for data mining tasks.
Practical Applications
What is customer segmentation?
Dividing customers into distinct groups based on shared characteristics.
What is inventory optimization?
Using analytics to maintain the right stock levels at minimum cost.
What is price optimization?
Determining the most profitable price point using data analysis.
What is sentiment trend analysis?
Tracking changes in public opinion over time from text data.
What is campaign analysis in BI?
Evaluating the success of marketing campaigns based on performance metrics.
Data Issues & Ethics
What is data anonymization?
Modifying personal data to protect individual identities.
What is data masking?
Hiding sensitive data elements by replacing them with fictional but realistic data.
What is bias in data mining?
Systematic errors that lead to inaccurate or unfair model outcomes.
What is GDPR?
General Data Protection Regulation: a European Union regulation governing the processing and
protection of personal data.
What is algorithm transparency?
The degree to which a model’s decision-making process can be understood by humans.
Evaluation & Optimization
What is hyperparameter tuning?
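The process of choosing the best values for hyperparameters, i.e., settings fixed before training
(such as tree depth or learning rate) rather than learned from the data.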
What is grid search?
An exhaustive search over specified hyperparameter values for a model.
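Grid search amounts to trying every combination of the listed values and keeping the best-scoring
one. In the sketch below, toy_score is a made-up stand-in for "train a model and return its
validation score"; the parameter names are also assumptions.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(depth, learning_rate):
    # Hypothetical scoring function that peaks at depth=3, learning_rate=0.1.
    return -((depth - 3) ** 2) - (learning_rate - 0.1) ** 2


grid = {"depth": [1, 3, 5], "learning_rate": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

With 3 values for each of 2 hyperparameters the search evaluates 9 combinations; the cost grows
multiplicatively with every hyperparameter added.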
What is a learning curve in machine learning?
A graph of model performance plotted against training iterations or training-set size.
What is regularization?
A technique to reduce overfitting by adding a penalty on large model coefficients to the loss
function (e.g., L1 or L2 regularization).
What is logistic regression used for in data mining?
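Predicting the probability of a binary outcome (e.g., churn vs. no churn) by passing a linear
combination of the features through the logistic (sigmoid) function.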
Core Data Concepts
What is data granularity?
The level of detail or summarization in a dataset.
What is temporal data?
Data that represents time-related information.
What is spatial data mining?
Discovering patterns from spatial or geographical data.
What is heterogeneous data?
Data that comes in different formats or structures, or from different sources.
What is incremental learning?
A method where the model updates itself as new data arrives without retraining from scratch.
Data Mining Algorithms
What is k-nearest neighbors (KNN)?
A classification algorithm that labels a new data point with the majority class among its k closest
training points.
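The voting idea behind KNN fits in a few lines. This sketch assumes Euclidean distance (via
math.dist, available from Python 3.8) and a made-up training set of labeled 2-D points.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; return the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points.
train = [
    ((1, 1), "low"),
    ((1, 2), "low"),
    ((2, 2), "low"),
    ((8, 8), "high"),
    ((9, 8), "high"),
]
print(knn_predict(train, query=(2, 1)))
print(knn_predict(train, query=(8, 9)))
```

For the second query only two "high" points exist, so the third neighbor is a distant "low" point,
but the majority vote still returns "high".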
What is a naïve Bayes classifier?
A probabilistic classifier based on Bayes’ theorem that assumes the features are conditionally
independent given the class.
What is a support vector machine (SVM)?
A supervised learning algorithm that finds the boundary (hyperplane) separating the classes with
the largest possible margin.
What is principal component analysis (PCA)?
A dimensionality reduction technique that transforms correlated variables into a smaller set of
uncorrelated components, ordered by the amount of variance they explain.
What is deep learning?
A subset of machine learning using multi-layer neural networks for pattern recognition.
Business Applications
What is KPI in BI?
Key Performance Indicator — a measurable value indicating business performance toward
objectives.
What is sales funnel analysis?
Analyzing the stages customers go through before making a purchase.
What is workforce analytics?
Using data mining to improve employee performance and HR decisions.
What is financial risk modeling?
Predicting the likelihood of financial losses using statistical models.
What is healthcare analytics?
Using BI tools to improve patient care and hospital efficiency.
Data Quality & Preprocessing
What is feature selection?
Choosing the most relevant variables for model building.
What is feature engineering?
Creating new features from raw data to improve model performance.
What is outlier detection?
Identifying data points that deviate significantly from the norm.
What is data balancing?
Adjusting the dataset to handle class imbalance.
What is binning in data preprocessing?
Grouping continuous data into intervals for simplification.
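Equal-width binning is one simple form of this: the range between the minimum and maximum is cut
into intervals of the same size. The function name and the ages below are assumptions for
illustration.

```python
def equal_width_bins(values, n_bins):
    """Return a bin index (0 to n_bins - 1) for each value using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls in the first bin.
        return [0 for _ in values]
    # The maximum value would land on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical ages, split into 3 bins of width 14 (18-32, 32-46, 46-60).
ages = [18, 25, 33, 47, 52, 60]
print(equal_width_bins(ages, n_bins=3))
```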
Model Evaluation & Deployment
What is model drift?
The decline in a model’s performance due to changes in underlying data patterns.
What is confusion cost?
The business impact of false positives or false negatives in a prediction model.
What is holdout validation?
Splitting data into training and testing sets to assess performance.
What is model deployment in BI?
Integrating a trained model into a live business environment for real-time or batch use.
What is post-deployment monitoring?
Tracking model performance after deployment to detect issues early.
Data Mining for Business Intelligence – Extended Q&A (126–150)
Core BI & Data Concepts
What is drill-down in BI?
Moving from summarized data to more detailed data in analysis.
What is roll-up in BI?
Aggregating detailed data into higher-level summaries.
What is slicing in OLAP?
Selecting a sub-cube by fixing one dimension to a single value, e.g., sales for Q1 only.
What is dicing in OLAP?
Selecting a sub-cube by choosing specific values on two or more dimensions, e.g., Q1 and Q2 sales
for two regions.
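The four OLAP operations can be imitated on a small list of rows. In this sketch roll-up sums a
measure over fewer dimensions, drill-down is the same aggregation with more dimensions, slice
fixes one dimension to one value, and dice restricts several dimensions at once. The sales rows
and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales cube stored as flat rows (dimensions plus one measure).
rows = [
    {"region": "North", "quarter": "Q1", "product": "Pens", "revenue": 100},
    {"region": "North", "quarter": "Q2", "product": "Pens", "revenue": 120},
    {"region": "North", "quarter": "Q1", "product": "Ink", "revenue": 40},
    {"region": "South", "quarter": "Q1", "product": "Pens", "revenue": 80},
    {"region": "South", "quarter": "Q2", "product": "Ink", "revenue": 60},
]


def roll_up(rows, dims, measure="revenue"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Keep rows where one dimension equals a single value."""
    return [row for row in rows if row[dim] == value]


def dice_cube(rows, allowed):
    """Keep rows whose value on each listed dimension is in the allowed set."""
    return [row for row in rows if all(row[d] in vals for d, vals in allowed.items())]


print(roll_up(rows, ["region"]))             # roll-up to region level
print(roll_up(rows, ["region", "quarter"]))  # drill-down to region and quarter
print(slice_cube(rows, "quarter", "Q1"))
print(dice_cube(rows, {"region": {"North"}, "product": {"Pens"}}))
```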
What is a measure in BI?
A numeric value that can be aggregated for analysis, e.g., sales revenue.
Data Mining Tasks & Algorithms
What is classification?
Assigning items to predefined categories based on their features.
What is clustering?
Grouping data into clusters where items in the same group are more similar to each other than to
those in other groups.
What is association analysis?
Finding relationships between variables in large datasets.
What is sequential pattern mining?
Discovering frequently occurring ordered events in data.
What is a neural network?
A computational model inspired by the human brain used for pattern recognition and prediction.
Business Applications
What is inventory forecasting?
Predicting future stock requirements using historical sales data.
What is product affinity analysis?
Identifying products often bought together.
What is profitability analysis?
Assessing the profit generated by different business segments or products.
What is real-time fraud monitoring?
Detecting fraudulent transactions as they occur using live data analysis.
What is predictive maintenance?
Using data mining to predict equipment failures before they happen.
Data Quality & Governance
What is master data management (MDM)?
A process of ensuring consistency, accuracy, and accountability for key business data.
What is data stewardship?
The role responsible for managing data quality and compliance.
What is duplicate detection?
Identifying and removing repeated data entries in a dataset.
What is schema mapping?
Aligning different database schemas to enable integration.
What is referential integrity?
Ensuring relationships between tables remain consistent.
Model Evaluation & Optimization
What is model generalization?
The model’s ability to perform well on unseen data.
What is early stopping?
Halting training once performance on a validation set stops improving, to prevent overfitting.
What is feature scaling?
Adjusting the range of features to improve model performance.
What is stochastic gradient descent (SGD)?
An optimization method that updates model parameters using one sample (or a small mini-batch) at a
time instead of the full dataset.
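As a minimal sketch, SGD can fit a straight line y = w*x + b by nudging w and b after every single
sample. The learning rate, epoch count, and data below are assumptions chosen so that this toy
example converges; they are not general recommendations.

```python
import random


def sgd_linear(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b with per-sample gradient steps on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = sgd_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

Because the data lie exactly on a line, the fitted w and b approach 2 and 1. On noisy data the
per-sample updates keep fluctuating, which is why the learning rate is usually decreased over
time.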
What is batch processing in BI?
Running data processing tasks on a large set of records all at once, rather than in real-time.