Data Mining & Business Intelligence Q&A

Data mining for business intelligence two marks

Uploaded by

hariharan21m21
Copyright
© All Rights Reserved

Data Mining for Business Intelligence – 2 Marks Q&A

1. What is Data Mining?

Data mining is the process of extracting meaningful patterns, trends, and knowledge from large
datasets using statistical, machine learning, and database techniques.

2. Define Business Intelligence (BI).

Business Intelligence refers to technologies, applications, and processes for collecting, integrating,
analyzing, and presenting business information to support decision-making.

3. List two major goals of Data Mining.

Prediction (forecasting unknown values)

Description (finding patterns and relationships)

4. Name any two Data Mining techniques.

Classification

Clustering

5. What is the difference between Data Warehousing and Data Mining?

Data warehousing stores and organizes large amounts of data, while data mining analyzes that data
to discover patterns and insights.

6. What is clustering in Data Mining?

Clustering is the process of grouping similar data objects together without predefined labels.

7. Give two examples of Business Intelligence tools.

Microsoft Power BI

Tableau

8. What is association rule mining?

A technique to find relationships between variables in large datasets, e.g., “Market Basket Analysis.”

9. Mention two challenges in Data Mining.

Handling noisy or incomplete data

Scalability with large datasets

10. What is predictive analytics?

Predictive analytics uses historical data and statistical models to predict future events or trends.

Data Mining for Business Intelligence – 2 Mark Q&A (Extended)


Basics & Concepts

What is Knowledge Discovery in Databases (KDD)?

KDD is the overall process of discovering useful knowledge from data, where data mining is one of
the steps.

List the main steps in KDD process.

Data cleaning

Data integration

Data selection

Data transformation

Data mining

Pattern evaluation

Knowledge presentation

What is OLAP in Business Intelligence?

OLAP (Online Analytical Processing) is a technology for fast, multidimensional analysis of business
data.

Differentiate OLTP and OLAP.

OLTP handles day-to-day transaction processing; OLAP supports analytical queries for decision-making.

Define data preprocessing.

Data preprocessing is the process of cleaning, integrating, and transforming raw data into a usable
format for analysis.

Techniques

Name two classification algorithms.

Decision Tree (C4.5, ID3)

Naïve Bayes

What is regression in data mining?

A technique that models the relationship between a dependent variable and one or more
independent variables to predict numerical outcomes.

What is the difference between supervised and unsupervised learning?

Supervised learning uses labeled data; unsupervised learning uses unlabeled data.

What is a decision tree?

A classification model that splits data into branches based on attribute values, leading to decision
outcomes.

What is a confusion matrix?

A table used to measure the performance of a classification model by showing actual vs. predicted
results.

Applications

Mention two business applications of data mining.

Customer churn prediction

Fraud detection

What is market basket analysis?

An association rule mining technique to identify items that are frequently bought together.

Name two industries where BI is widely used.

Retail

Banking

What is text mining?

Extracting meaningful information and patterns from unstructured text data.

What is web mining?

Discovering patterns from web data, such as website logs or social media content.

Data Quality & Challenges

List two issues that affect data quality.

Missing values

Duplicate records

What is data cleaning?

The process of detecting and correcting inaccurate, incomplete, or irrelevant parts of data.

Why is scalability important in data mining?

Because large datasets require algorithms that can handle high volumes efficiently.

What is data integration?

Combining data from multiple sources into a coherent data store.

Define dimensionality reduction.

The process of reducing the number of variables under consideration, e.g., using Principal
Component Analysis (PCA).

Evaluation & Visualization

What is lift in association rule mining?

A measure of how much more likely two items are bought together compared to being bought
independently.

What is ROC curve?

A graphical plot showing the performance of a binary classifier, comparing True Positive Rate and
False Positive Rate.

What is support in association rule mining?

The proportion of transactions containing a particular itemset.

What is confidence in association rule mining?

The likelihood that item B is purchased when item A is purchased.
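
The three association rule measures defined above can be computed directly from transaction counts. The following is a minimal sketch, assuming a small made-up list of shopping baskets; the data and the function names `support`, `confidence`, and `lift` are illustrative and not taken from any particular library. For a rule A -> B, confidence is support(A and B) divided by support(A), and lift is that confidence divided by support(B).

```python
# Toy transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support.

    A value of 1.0 means the two sides occur together as often as
    independence would predict.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the lift of bread -> milk comes out slightly below 1, so buying bread does not make milk more likely than its baseline rate.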

Why is data visualization important in BI?

It helps decision-makers understand complex data patterns quickly and clearly.

Data Mining for Business Intelligence – Additional 2-Mark Questions & Answers

Core Concepts

What is metadata in BI?

Metadata is “data about data,” describing the structure, origin, and meaning of stored data.

What is ETL in BI?

ETL stands for Extract, Transform, Load — a process to move data from source systems into a data
warehouse.

Differentiate structured and unstructured data.

Structured data is organized in fixed fields (e.g., tables); unstructured data lacks a predefined format
(e.g., emails, videos).

What is a data mart?

A subset of a data warehouse, designed for a specific business function or department.

What is real-time BI?

BI systems that deliver up-to-the-minute data and analytics for immediate decision-making.

Data Mining Methods

What is anomaly detection?

Identifying rare items, events, or patterns that differ from the majority of data.

Name two clustering algorithms.

K-Means

DBSCAN

What is hierarchical clustering?

A clustering method that builds a hierarchy of clusters either via agglomerative or divisive
approaches.

What is time-series analysis?

Analyzing data points collected over time to identify trends, cycles, and seasonal patterns.

What is sequence mining?

Finding patterns in data where the values or events are delivered in a sequence.

Business Applications

What is CRM in the context of BI?

Customer Relationship Management uses BI tools to analyze customer behavior and improve
retention.

Give two examples of predictive analytics in business.

Sales forecasting

Risk assessment in insurance

What is credit scoring?

A predictive model in finance to determine the likelihood of a borrower repaying a loan.

What is sentiment analysis?

Analyzing text data to determine the writer’s or speaker’s emotional tone.

What is demand forecasting?

Predicting future product or service demand using historical sales and market data.

Data Quality & Management

What is data redundancy?

The unnecessary repetition of data, which can cause storage inefficiency and inconsistencies.

What is data transformation?

Converting data into a suitable format or structure for analysis.

What is noise in data?

Random errors or irrelevant information that can distort analysis.

Why is data normalization important?

It removes scale differences among variables and improves model performance.
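
One common way to remove scale differences is min-max scaling, which maps each variable onto the range 0 to 1. The sketch below assumes that method (other choices, such as z-score standardization, are equally valid); the function name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customer attributes on very different scales.
incomes = [20000, 35000, 50000, 80000]
ages = [25, 40, 55, 30]

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)
print(scaled_incomes)
print(scaled_ages)
```

After scaling, income and age both lie between 0 and 1, so a distance-based method no longer lets income dominate simply because its raw numbers are larger.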

What is data lineage?

Tracing the origin and movement of data through systems over time.

Evaluation & Measures

What is precision in classification?

The proportion of correctly predicted positive observations to total predicted positives.

What is recall in classification?

The proportion of correctly predicted positives to all actual positives.

What is the F1-score?

The harmonic mean of precision and recall.
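
Precision, recall, and the F1-score all derive from the four cells of a confusion matrix. The following sketch counts those cells from a pair of made-up label lists and then applies the definitions above; the helper names and the data are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall, f1)
```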

What is overfitting?

When a model learns noise along with the patterns, performing well on training data but poorly on
unseen data.

What is underfitting?

When a model is too simple to capture the underlying patterns in the data.

Data Mining for Business Intelligence – More 2-Mark Q&A

Concepts & Architecture
What is data profiling?

The process of examining data to understand its structure, quality, and content.

What is the purpose of a data warehouse?

To store integrated, subject-oriented, time-variant, and non-volatile data for analysis.

What is a star schema?

A data warehouse schema with a central fact table connected to dimension tables.

What is a snowflake schema?

A more normalized version of the star schema where dimension tables are split into sub-dimensions.

What is fact table in BI?

A table containing quantitative business metrics (facts) linked to dimension tables.

Data Mining Tasks

What is descriptive analytics?

Analytics that focuses on summarizing past data to understand what has happened.

What is diagnostic analytics?

Analytics that investigates why a certain event or trend occurred.

Name two distance measures used in clustering.

Euclidean distance

Manhattan distance

What is K-means clustering?

A clustering algorithm that partitions data into k clusters based on similarity.
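
K-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below is a bare-bones version using Euclidean distance. It assumes the starting centroids are supplied by the caller so the run is repeatable, whereas practical implementations usually pick them at random; the function name `kmeans` and the points are illustrative.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Cluster points (tuples of numbers) around the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place

        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```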

What is Apriori algorithm used for?

To mine frequent itemsets and generate association rules.

Applications in BI

What is churn analysis?

Predicting which customers are likely to stop using a product or service.

What is clickstream analysis?

Analyzing user navigation paths on websites for behavior patterns.

What is fraud detection in BI?

Identifying suspicious transactions or activities using pattern recognition and anomaly detection.

What is recommendation system?

A system that suggests items to users based on past behavior or similar users.

What is supply chain analytics?

Using BI tools to optimize procurement, inventory, and distribution processes.

Data Issues & Management

What is imbalanced data?

A dataset where the number of instances in different classes is highly unequal.

What is sampling in data mining?

Selecting a representative subset of data for analysis.


What is stratified sampling?

Sampling that ensures proportional representation of different categories.
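
As a rough illustration, stratified sampling can be done by grouping records by category and drawing the same fraction from each group. The sketch below assumes a made-up customer list with a `segment` field; the function name `stratified_sample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for group in strata.values():
        # Take at least one record so small strata are never dropped entirely.
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample


# Hypothetical population: 80 retail customers and 20 corporate customers.
customers = [{"id": i, "segment": "retail"} for i in range(80)]
customers += [{"id": i, "segment": "corporate"} for i in range(80, 100)]

sample = stratified_sample(customers, key=lambda r: r["segment"], fraction=0.1)
print(len(sample))
```

The sample keeps the population's 80:20 split between segments, which a simple random sample of the same size would not guarantee.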

What is data security in BI?

Protecting sensitive business data from unauthorized access and breaches.

What is data governance?

A set of policies and processes ensuring data quality, security, and compliance.

Model Evaluation & Metrics

What is a training dataset?

A dataset used to build and train a machine learning model.

What is a test dataset?

A dataset used to evaluate the performance of a trained model.

What is cross-validation?

A technique for assessing model performance by splitting data into multiple train/test sets.
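
The most common form is k-fold cross-validation, where each fold serves once as the test set while the remaining folds form the training set. The sketch below only produces the index splits, under the assumption that the data has already been shuffled; the function name `k_fold_indices` is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
for train, test in folds:
    print("train:", train, "test:", test)
```

A model would be trained and scored once per pair, and the k scores averaged to give the overall estimate.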

What is AUC in classification?

Area Under the ROC Curve — measures classifier performance across thresholds.
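
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair, counting ties as half; the labels and scores are made up, and the function name `auc` is illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores; 1 = fraud, 0 = legitimate.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_value = auc(labels, scores)
print(auc_value)
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scoring would give.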

What is mean absolute error (MAE)?

The average of absolute differences between predicted and actual values.
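
A short sketch of the calculation, using made-up sales figures and a hypothetical function name:

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly unit sales and the model's forecasts.
actual_sales = [100, 150, 200, 130]
forecast = [110, 140, 195, 150]

mae = mean_absolute_error(actual_sales, forecast)
print(mae)  # absolute errors are 10, 10, 5 and 20, so the mean is 11.25
```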

Data Mining for Business Intelligence – Advanced Concepts

What is ensemble learning?

A technique that combines multiple models to improve prediction accuracy.

Name two ensemble methods.

Bagging

Boosting

What is bagging?

Bootstrap Aggregating — training multiple models on different random samples and combining their
results.

What is boosting?

Sequentially training models, giving more weight to previously misclassified instances.

What is random forest?

An ensemble method that uses multiple decision trees to improve accuracy and reduce overfitting.

Data Mining Process & Tools

What is CRISP-DM?

Cross-Industry Standard Process for Data Mining: a methodology for planning and executing data mining projects.

List the main phases of CRISP-DM.

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

What is SAS Enterprise Miner?

A data mining software for building predictive and descriptive models.

What is IBM SPSS Modeler?

A data mining tool for building models without coding, using a visual interface.

What is Weka?

An open-source collection of machine learning algorithms for data mining tasks.

Practical Applications

What is customer segmentation?

Dividing customers into distinct groups based on shared characteristics.

What is inventory optimization?

Using analytics to maintain the right stock levels at minimum cost.

What is price optimization?

Determining the most profitable price point using data analysis.

What is sentiment trend analysis?

Tracking changes in public opinion over time from text data.

What is campaign analysis in BI?

Evaluating the success of marketing campaigns based on performance metrics.

Data Issues & Ethics

What is data anonymization?

Modifying personal data to protect individual identities.


What is data masking?

Hiding sensitive data elements by replacing them with fictional but realistic data.

What is bias in data mining?

Systematic errors that lead to inaccurate or unfair model outcomes.

What is GDPR?

General Data Protection Regulation: a European Union regulation governing the protection and privacy of personal data.

What is algorithm transparency?

The degree to which a model’s decision-making process can be understood by humans.

Evaluation & Optimization

What is hyperparameter tuning?

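The process of selecting the best values for hyperparameters, the settings fixed before training (e.g., tree depth or learning rate) rather than learned from the data.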

What is grid search?

An exhaustive search over specified hyperparameter values for a model.
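
The idea can be sketched as a loop over every combination in the grid, keeping the combination with the best score. In the sketch below, `toy_validation_score` is a hypothetical stand-in for "train a model with these settings and return its validation score"; the hyperparameter names `max_depth` and `min_samples` are likewise only illustrative.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination of values in grid and return the best one.

    evaluate: function taking the hyperparameters as keyword arguments
              and returning a score where higher is better.
    grid:     dict mapping each hyperparameter name to its candidate values.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(max_depth, min_samples):
    """Made-up score surface that peaks at max_depth=4, min_samples=10."""
    return -((max_depth - 4) ** 2) - ((min_samples - 10) ** 2) / 100


grid = {"max_depth": [2, 4, 6, 8], "min_samples": [5, 10, 20]}
best_params, best_score = grid_search(toy_validation_score, grid)
print(best_params, best_score)
```

Because every combination is evaluated (12 here), the cost grows multiplicatively with each added hyperparameter.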

What is learning curve in machine learning?

A graph showing model performance over time or as training size increases.

What is regularization?

A technique to reduce overfitting by adding a penalty to large model coefficients.

What is logistic regression used for in data mining?

Predicting binary outcomes using a statistical model.


Core Data Concepts

What is data granularity?

The level of detail or summarization in a dataset.

What is temporal data?

Data that represents time-related information.

What is spatial data mining?

Discovering patterns from spatial or geographical data.

What is heterogeneous data?

Data coming from different formats, sources, or structures.

What is incremental learning?

A method where the model updates itself as new data arrives without retraining from scratch.

Data Mining Algorithms

What is k-nearest neighbors (KNN)?

A classification algorithm that assigns a label based on the majority class of its closest neighbors.
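
A minimal sketch of the majority-vote idea, assuming Euclidean distance and a tiny made-up training set; the function name `knn_predict` and the labels are illustrative.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    train: list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical customers described by two scaled features, labelled by spend level.
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 7.0), "high"),
    ((6.5, 8.0), "high"),
]

print(knn_predict(train, (2.0, 2.0)))
print(knn_predict(train, (6.5, 7.0)))
```

Because the prediction depends on distances, features are normally scaled to comparable ranges first.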

What is naïve Bayes classifier?

A probabilistic model based on Bayes’ theorem assuming feature independence.

What is support vector machine (SVM)?

A supervised learning algorithm that finds the best boundary (hyperplane) to separate classes.

What is principal component analysis (PCA)?

A dimensionality reduction technique that transforms variables into uncorrelated components.


What is deep learning?

A subset of machine learning using multi-layer neural networks for pattern recognition.

Business Applications

What is KPI in BI?

Key Performance Indicator: a measurable value indicating business performance toward objectives.

What is sales funnel analysis?

Analyzing the stages customers go through before making a purchase.

What is workforce analytics?

Using data mining to improve employee performance and HR decisions.

What is financial risk modeling?

Predicting the likelihood of financial losses using statistical models.

What is healthcare analytics?

Using BI tools to improve patient care and hospital efficiency.

Data Quality & Preprocessing

What is feature selection?

Choosing the most relevant variables for model building.

What is feature engineering?

Creating new features from raw data to improve model performance.


What is outlier detection?

Identifying data points that deviate significantly from the norm.

What is data balancing?

Adjusting the dataset to handle class imbalance.

What is binning in data preprocessing?

Grouping continuous data into intervals for simplification.
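
One simple scheme is equal-width binning, which divides the range between the minimum and maximum into intervals of the same size. The sketch below assumes that scheme (equal-frequency binning is a common alternative); the function name and the ages are made up.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index (0-based) of the equal-width bin it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width) if width else 0
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        bins.append(min(index, n_bins - 1))
    return bins


# Hypothetical customer ages grouped into three bands of 20 years each.
ages = [18, 22, 25, 31, 38, 45, 52, 60, 67, 78]
age_bins = equal_width_bins(ages, n_bins=3)
print(age_bins)
```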

Model Evaluation & Deployment

What is model drift?

The decline in a model’s performance due to changes in underlying data patterns.

What is confusion cost?

The business impact of false positives or false negatives in a prediction model.

What is holdout validation?

Splitting data into training and testing sets to assess performance.

What is model deployment in BI?

Integrating a trained model into a live business environment for real-time use.

What is post-deployment monitoring?

Tracking model performance after deployment to detect issues early.


Data Mining for Business Intelligence – Extended Q&A (126–150)

Core BI & Data Concepts

What is drill-down in BI?

Moving from summarized data to more detailed data in analysis.

What is roll-up in BI?

Aggregating detailed data into higher-level summaries.

What is slicing in OLAP?

Fixing one dimension of a cube to a single value (e.g., Year = 2024), which selects a single layer of the cube for analysis.

What is dicing in OLAP?

Selecting a sub-cube by restricting two or more dimensions to specific values or ranges (e.g., certain regions and certain products).

What is a measure in BI?

A numeric value that can be aggregated for analysis, e.g., sales revenue.
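
The OLAP operations in this section can be illustrated on a tiny in-memory "cube" held as a list of rows, with `revenue` as the measure. This is only a sketch under that assumption: real OLAP engines work on pre-aggregated multidimensional storage, and the function names and sales figures here are hypothetical.

```python
from collections import defaultdict

# Each row is one cell of the cube: three dimensions plus one measure.
sales = [
    {"year": 2023, "region": "North", "product": "Laptop", "revenue": 120},
    {"year": 2023, "region": "South", "product": "Laptop", "revenue": 80},
    {"year": 2023, "region": "North", "product": "Phone", "revenue": 60},
    {"year": 2024, "region": "North", "product": "Laptop", "revenue": 150},
    {"year": 2024, "region": "South", "product": "Phone", "revenue": 90},
]


def roll_up(rows, dims, measure="revenue"):
    """Aggregate the measure over every dimension not listed in dims."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice_cube(rows, **allowed):
    """Dice: keep rows whose dimensions fall in the allowed sets of values."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


by_year = roll_up(sales, ["year"])                   # roll-up to yearly totals
by_year_region = roll_up(sales, ["year", "region"])  # drill-down adds region detail
only_2023 = slice_cube(sales, "year", 2023)
north_laptops = dice_cube(sales, region={"North"}, product={"Laptop"})

print(by_year)
print(by_year_region)
print(len(only_2023), len(north_laptops))
```

Drill-down is simply the reverse direction of roll-up: grouping by more dimensions exposes the detail behind each total.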

Data Mining Tasks & Algorithms

What is classification?

Assigning items to predefined categories based on their features.

What is clustering?

Grouping data into clusters where items in the same group are more similar to each other than to
those in other groups.

What is association analysis?

Finding relationships between variables in large datasets.


What is sequential pattern mining?

Discovering frequently occurring ordered events in data.

What is a neural network?

A computational model inspired by the human brain used for pattern recognition and prediction.

Business Applications

What is inventory forecasting?

Predicting future stock requirements using historical sales data.

What is product affinity analysis?

Identifying products often bought together.

What is profitability analysis?

Assessing the profit generated by different business segments or products.

What is real-time fraud monitoring?

Detecting fraudulent transactions as they occur using live data analysis.

What is predictive maintenance?

Using data mining to predict equipment failures before they happen.

Data Quality & Governance

What is master data management (MDM)?

A process of ensuring consistency, accuracy, and accountability for key business data.

What is data stewardship?

The role responsible for managing data quality and compliance.


What is duplicate detection?

Identifying and removing repeated data entries in a dataset.

What is schema mapping?

Aligning different database schemas to enable integration.

What is referential integrity?

Ensuring relationships between tables remain consistent.

Model Evaluation & Optimization

What is model generalization?

The model’s ability to perform well on unseen data.

What is early stopping?

Halting training before overfitting occurs in machine learning.

What is feature scaling?

Adjusting the range of features to improve model performance.

What is stochastic gradient descent (SGD)?

An optimization method that updates model parameters using one sample at a time.
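
To make the per-sample update concrete, the sketch below fits a straight line y = w*x + b by SGD on squared error. The learning rate, epoch count, data, and function name are all assumptions chosen so this toy problem converges; they are not recommended defaults.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by stochastic gradient descent, one sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

Variants that update on a small batch of samples at a time (mini-batch SGD) follow the same pattern with the gradient averaged over the batch.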

What is batch processing in BI?

Running data processing tasks on a large set of records all at once, rather than in real-time.

Common questions

Handling noisy or incomplete data in data mining is challenging because it can lead to inaccurate models and insights. Strategies to mitigate these issues include data cleaning processes to remove or correct noise and using algorithms robust to missing or incomplete data. Another approach is employing data imputation techniques to estimate missing values. These strategies maintain data quality, ensuring more reliable and valid insights.

Association rule mining in market basket analysis identifies relationships between items purchased together in transactions. For retail businesses, this adds value by optimizing product placements, improving cross-selling strategies, and enhancing inventory management. It helps in understanding consumer purchasing behavior, thus enabling retailers to tailor marketing and promotional efforts effectively.

Data warehousing and data mining serve distinct yet complementary roles in a business intelligence framework. Data warehousing is concerned with storing and organizing large datasets, ensuring consistent, integrated, and time-variant data storage for analysis. Data mining, on the other hand, analyzes this data to extract meaningful patterns and insights. Together, they provide a comprehensive approach to data management: warehousing ensures data availability and integrity, while mining transforms this data into actionable insights.

Data preprocessing in the KDD process, which includes data cleaning, integration, and transformation, is critical for ensuring high-quality data mining outcomes. It prepares raw data by correcting inaccuracies, integrating data from different sources, and transforming it into a suitable format for analysis. This stage enhances the quality of data inputs and can significantly improve model accuracy, making the results more reliable for decision-making.

Scalability is critical in data mining due to the increasing volume of data businesses need to analyze. Achieving scalability ensures that algorithms can handle large datasets efficiently and within reasonable timeframes. Techniques such as parallel computing, distributed database systems, and advanced algorithmic strategies like MapReduce can enhance scalability. These methods allow for handling big data challenges without compromising performance or speed.

Ensemble learning methods like bagging and boosting improve model performance by combining multiple models to reduce variance (bagging) and bias (boosting). Bagging, or bootstrap aggregating, trains multiple models on various samples, aggregating their outputs to improve stability and accuracy. Boosting sequentially trains models, focusing on difficult cases by giving them more weight. These techniques enhance predictive power and robustness, providing more accurate and generalizable models.

Supervised learning involves using labeled data to train models, allowing for specific outcomes prediction, such as in classification and regression problems. Unsupervised learning uses unlabeled data to identify patterns or groupings, as seen in clustering and association tasks. The implications of these differences lie in their application: supervised methods require historical labeled data and are valuable for predictive tasks, while unsupervised methods are powerful for exploring unknown data structures and identifying intrinsic patterns.

The major goals of data mining are prediction and description. Prediction involves forecasting unknown values, which helps businesses anticipate future trends and make proactive decisions. Description involves finding patterns and relationships among data, which enables deeper insights into business operations and consumer behavior. These goals contribute to business decision-making by providing data-driven insights that enhance strategic planning and operational efficiency.

Ethical considerations in data mining include data privacy, ensuring that personal information is protected, and addressing algorithmic bias, which can lead to unfair or inaccurate model outcomes. To address these concerns, organizations can implement stringent data anonymization and security protocols, enforce compliance with regulations like GDPR, and conduct regular audits to identify and mitigate bias in data and algorithms. Transparency in the decision-making processes of models is also essential to maintain trust and accountability.

Dimensionality reduction techniques like Principal Component Analysis (PCA) are significant in data mining as they reduce the number of variables under consideration, thus simplifying models and enhancing computational efficiency. They address challenges such as multicollinearity, overfitting, and high-dimensional space complexities, improving model interpretability and performance.