BD V
• Answer: Big Data refers to extremely large datasets that cannot be processed using
traditional data-processing techniques. It is important because it allows businesses
and organizations to extract insights from data on a massive scale, driving decision-
making, optimization, and new business models.
• Answer: Challenges include data privacy and security, managing data quality,
scalability of infrastructure, integrating heterogeneous data sources, and ensuring
real-time processing and analysis.
4. What is Big Data Analytics, and how is it different from traditional data analysis?
• Answer: Big Data Analytics involves examining large and varied datasets to uncover
hidden patterns, correlations, and insights, using advanced technologies like Hadoop,
Spark, and machine learning. It differs from traditional data analysis by handling
much larger volumes of data and often requiring distributed computing and real-
time processing.
Introduction to NoSQL
5. What are the key differences between NoSQL and traditional SQL databases?
• Answer: NoSQL databases are schema-less, providing flexibility in data modeling and
are horizontally scalable, making them suitable for Big Data. In contrast, SQL
databases are relational, with structured schemas and typically vertically scalable.
• Answer:
• Key-Value Databases: Store data as (key, value) pairs and are suitable for simple, fast lookups. Example: Redis.
• Document Databases: Store semi-structured documents such as JSON. Example: MongoDB.
• Column-Family Databases: Store data by column families, suited to large, sparse datasets. Example: Cassandra.
• Graph Databases: Store data in graph structures, ideal for relationships and interconnected data. Example: Neo4j.
8. What are some popular NoSQL databases used for Big Data solutions?
• Answer: Popular choices include MongoDB (document store), Apache Cassandra (wide-column store), Redis (key-value store), Apache HBase (column-oriented store on Hadoop), and Neo4j (graph database).
10. Can you discuss a real-world case study where Big Data solutions provided significant
value?
• Answer: In healthcare, Big Data solutions are used for predictive analytics, such as
predicting disease outbreaks or patient health outcomes. For example, the use of Big
Data at the Mayo Clinic has improved patient care through more accurate diagnosis
and personalized treatment plans.
Introduction to Hadoop
12. How does Hadoop work, and what makes it suitable for processing Big Data?
• Answer: Hadoop works by breaking down large datasets into smaller blocks stored
across a cluster. MapReduce processes these blocks in parallel, making Hadoop
highly scalable and fault-tolerant, which is ideal for Big Data processing.
• Answer: HDFS is a distributed file system used by Hadoop to store large datasets
across multiple machines. Key features include data replication for fault tolerance,
high throughput, and the ability to handle large files.
MapReduce Applications
15. What is MapReduce, and how does it work?
• Answer: MapReduce is a programming model for processing large datasets in parallel across a cluster. The Map phase transforms input records into intermediate key-value pairs, the framework shuffles and sorts these pairs by key, and the Reduce phase aggregates the values for each key to produce the final output.
16. What are some use cases of MapReduce in processing large datasets?
• Answer: Use cases include log file analysis, indexing data for search engines,
machine learning model training, and large-scale data aggregation.
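The map-shuffle-reduce flow described above can be sketched in plain Python. This is a toy simulation of the pattern, not the Hadoop API, using the classic word-count job as the example:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit (key, value) pairs -- here, (word, 1) for a word count.
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the grouped values -- here, sum the counts.
    return key, sum(values)

docs = ["big data big insight", "big cluster"]
pairs = [p for text in docs for p in map_phase(text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts["big"] == 3, counts["cluster"] == 1
```

In a real Hadoop job, the map and reduce functions run on different machines and the shuffle moves data across the network; the logic per record is the same.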
17. What is Hadoop Streaming, and how is it used in MapReduce workflows?
• Answer: Hadoop Streaming is a utility that lets you write the map and reduce functions as executables in any language (e.g., Python or shell scripts); Hadoop passes input records to them via standard input and reads their output from standard output.
18. What is Apache Spark, and how does it relate to MapReduce?
• Answer: Apache Spark is a fast, in-memory data processing engine that can handle batch and real-time data. It is used as an alternative or complement to MapReduce for faster processing.
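Assuming a word-count job, a Hadoop Streaming mapper/reducer pair might look like the sketch below. It is written as plain functions over line iterables so it can be tested locally; in an actual job each function would be a separate script reading stdin and writing stdout:

```python
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word, as a Streaming mapper writes to stdout.
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    # Hadoop sorts mapper output by key before the reducer sees it;
    # group consecutive identical keys and sum their counts.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

output = list(reducer(sorted(mapper(["the cat the"]))))
# output == ["cat\t1", "the\t2"]
```

The `sorted(...)` call stands in for the shuffle-and-sort step the framework performs between the two scripts.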
19. What are the key components of the Hadoop ecosystem (HBase, Sqoop, Flume, Pig Latin, Hive)?
• Answer:
• HBase: A distributed, column-oriented NoSQL database that runs on top of HDFS for real-time read/write access.
• Sqoop: A tool for transferring bulk data between Hadoop and relational databases.
• Flume: A service for collecting, aggregating, and moving large amounts of log data.
• Pig Latin: The scripting language of Apache Pig, used to express data-flow transformations that compile to MapReduce jobs.
• Hive: A data warehouse system that provides SQL-like querying over large datasets in Hadoop.
Web Intelligence
• Answer: Web Intelligence refers to the ability to gather, analyze, and interpret vast
amounts of web-based data to gain insights. It benefits businesses by improving
decision-making, understanding customer behavior, and enabling better-targeted
marketing.
22. What is the long tail in social networks, and how does it relate to Web Intelligence?
• Answer: The long tail refers to a large number of niche interests or products that,
individually, may have low demand but collectively represent a significant market.
Web Intelligence helps in identifying and catering to these niche interests effectively.
Information Retrieval
24. What is the Term-Document Matrix, and how is it used in information retrieval?
• Answer: A Term-Document Matrix is a table in which each row represents a term, each column represents a document, and each cell holds that term's frequency (or a weight such as TF-IDF) in that document. It is the basis of vector-based retrieval, where documents and queries are compared as vectors of term weights.
25. What is stemming, and why is it important in text processing for information retrieval?
• Answer: Stemming reduces words to their root form (e.g., "running" becomes
"run"). It improves search efficiency by ensuring variations of a word are treated as
the same term.
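As an illustration, a toy suffix-stripping stemmer (far simpler than real algorithms such as Porter's, which handle many more cases) might look like:

```python
def naive_stem(word):
    # Toy stemmer: strip a common suffix, then undo consonant doubling,
    # e.g. "running" -> "runn" -> "run". Not a real stemming algorithm.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]  # drop the doubled consonant
    return word

naive_stem("running")  # -> "run"
naive_stem("runs")     # -> "run"
```

Because "running" and "runs" now map to the same stem, a query for either form matches documents containing the other.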
26. What is the Boolean Retrieval Model, and how does it work?
• Answer: The Boolean Retrieval Model uses Boolean operators (AND, OR, NOT) to query documents. It retrieves documents that satisfy all conditions (AND), any condition (OR), or exclude certain conditions (NOT).
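With an inverted index mapping each term to the set of documents containing it, the Boolean operators translate directly into set operations (the index below is a hypothetical three-document example):

```python
# Inverted index: term -> set of document ids containing it.
index = {
    "big":  {1, 2, 3},
    "data": {1, 3},
    "sql":  {2},
}

and_result = index["big"] & index["data"]  # "big AND data"     -> {1, 3}
or_result  = index["big"] | index["sql"]   # "big OR sql"       -> {1, 2, 3}
not_result = index["big"] - index["sql"]   # "big AND NOT sql"  -> {1, 3}
```

Note the model is all-or-nothing: every document either matches the expression or it does not, with no ranking among the matches.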
27. What is the Vector Space Model, and how does it differ from the Boolean model?
• Answer: The Vector Space Model represents documents and queries as vectors in a
multi-dimensional space, with each term as a dimension. Unlike the Boolean model,
it allows for partial matching and ranking based on term weights.
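A minimal sketch of vector-space scoring, using raw term frequencies and cosine similarity (real systems typically weight terms with TF-IDF rather than raw counts):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors stored as dicts.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc = Counter("big data needs big clusters".split())
query = Counter("big data".split())
score = cosine(query, doc)  # a graded score in (0, 1], not a yes/no match
```

Unlike the Boolean model, a document sharing only some query terms still receives a nonzero score, which is what makes ranked retrieval possible.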
28. What is the Probabilistic Model in information retrieval?
• Answer: The Probabilistic Model ranks documents based on the probability that a document is relevant to a query, often using techniques like BM25 to estimate relevance.
29. How do precision and recall help evaluate the performance of an information retrieval
system?
• Answer: Precision measures the fraction of retrieved documents that are relevant,
while recall measures the fraction of relevant documents that are retrieved. Both
metrics are used to balance the effectiveness of a retrieval system.
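Both metrics follow directly from the retrieved and relevant document sets:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
# p == 0.5 (2 of 4 retrieved are relevant), r == 2/3 (2 of 3 relevant retrieved)
```

The two pull against each other: retrieving more documents tends to raise recall but lower precision, which is why they are usually reported together (or combined into an F-score).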
30. What is a confusion matrix, and how is it used to evaluate information retrieval models?
• Answer: A confusion matrix is a table that cross-tabulates predicted versus actual outcomes into true positives, false positives, false negatives, and true negatives. Metrics such as precision, recall, and accuracy are computed directly from its cells.
31. What are the key components of a web search engine architecture?
• Answer: Key components include the crawler (to gather web data), the indexer (to
organize data), and the ranking algorithm (to rank pages based on relevance).
32. How does link analysis help in the ranking of search results?
• Answer: Link analysis, such as PageRank, assigns a score to web pages based on the
number and quality of links pointing to them. This helps rank pages by their
authority and relevance.
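A simplified PageRank power iteration over a tiny hypothetical link graph is sketched below (it assumes every page has at least one outgoing link; production implementations also handle dangling nodes and convergence checks):

```python
def pagerank(links, damping=0.85, iters=50):
    # links: page -> list of pages it links to.
    # Each page shares damping * rank equally among its outgoing links;
    # (1 - damping) is distributed uniformly (the "random jump").
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            share = damping * rank[page] / len(outs)
            for target in outs:
                new[target] += share
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# ranks sum to ~1; "a" scores highest, since both "b" and "c" link to it
```

The key property is recursive: a link from a high-ranked page transfers more authority than a link from a low-ranked one.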
Multimedia Search
33. How are images, audio, and video handled in multimedia search systems?
• Answer: Multimedia search systems use specialized techniques for indexing and
retrieval, such as image recognition, audio transcription, and video metadata
extraction, to enable efficient searching.
Text Pre-processing
• Answer: Text pre-processing involves cleaning and transforming raw text data to
make it suitable for analysis. This includes tasks like tokenization, stemming, and
stop-word removal, which improve the quality of text mining results.
• Answer: Segmentation divides text into smaller meaningful units, such as sentences
or words. It is crucial for processing and analyzing large documents.
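A minimal sketch of the pre-processing steps above, combining word-level segmentation (tokenization) with stop-word removal (the stop-word list here is a small illustrative sample, not a standard one):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is"}

def preprocess(text):
    # Tokenize on alphanumeric runs, lowercase, then drop stop words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The quality of the results is improved.")
# tokens == ["quality", "results", "improved"]
```

Stemming (see the earlier question) would typically be applied as a further step after this filtering.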
Image Pre-processing
Classification Algorithms
37. Explain the linear regression algorithm and its use cases.
• Answer: Linear regression models the relationship between a dependent variable
and one or more independent variables. It is used in prediction tasks, such as
forecasting sales or housing prices.
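For a single independent variable, the least-squares fit of y = a·x + b has a closed form, shown here on a small made-up dataset:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b:
    # a = cov(x, y) / var(x), b = mean(y) - a * mean(x).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# the data lie exactly on y = 2x + 1, so slope == 2.0, intercept == 1.0
```

With multiple independent variables the same idea generalizes to the matrix normal equations, usually solved with a linear-algebra library.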
38. How does the decision tree algorithm work?
• Answer: Decision trees split data into subsets based on the most significant attributes, forming a tree-like structure that makes decisions by following branches to leaf nodes, which represent outcomes.
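A one-level decision tree (a "stump") illustrates the splitting idea on a single numeric attribute; real decision-tree learners choose splits by criteria such as information gain or Gini impurity and recurse on each subset:

```python
def best_split(values, labels):
    # Try each candidate threshold on one numeric attribute and keep the
    # one with the fewest misclassifications (predict 1 when value > t).
    best = None
    for threshold in sorted(set(values)):
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        errors = min(errors, len(values) - errors)  # allow flipped polarity
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best

threshold, errors = best_split([1, 2, 8, 9], [0, 0, 1, 1])
# splitting at 2 separates the two classes perfectly (errors == 0)
```

A full tree repeats this search over all attributes at every node until the subsets are pure (or a depth limit is reached), yielding the branch-to-leaf structure described above.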