BD V: Hadoop Viva Questions

Here are the answers to the viva questions, organized by unit and topic:

Unit I: Big Data and NoSQL

Understanding Big Data

1. What is Big Data, and why is it important?

• Answer: Big Data refers to extremely large datasets that cannot be processed using
traditional data-processing techniques. It is important because it allows businesses
and organizations to extract insights from data on a massive scale, driving decision-
making, optimization, and new business models.

2. What are some common applications of Big Data?

• Answer: Big Data applications include predictive analytics, fraud detection,
recommendation systems, social media analysis, and real-time analytics in industries
like healthcare, finance, retail, and government.

3. What are the major challenges of working with Big Data?

• Answer: Challenges include data privacy and security, managing data quality,
scalability of infrastructure, integrating heterogeneous data sources, and ensuring
real-time processing and analysis.

4. What is Big Data Analytics, and how is it different from traditional data analysis?

• Answer: Big Data Analytics involves examining large and varied datasets to uncover
hidden patterns, correlations, and insights, using advanced technologies like Hadoop,
Spark, and machine learning. It differs from traditional data analysis by handling
much larger volumes of data and often requiring distributed computing and real-
time processing.

Introduction to NoSQL

5. What are the key differences between NoSQL and traditional SQL databases?

• Answer: NoSQL databases are schema-less, offering flexible data modeling, and
scale horizontally, which makes them suitable for Big Data. SQL databases, by
contrast, are relational, use structured schemas, and typically scale vertically.

6. Explain the different types of NoSQL data models.

• Answer:

• Key-Value Databases: Store data as (key, value) pairs and are suitable for
simple queries and fast lookups. Example: Redis.

• Document Databases: Store semi-structured data in documents (usually
JSON). Example: MongoDB.

• Column-Family (Wide-Column) Databases: Store data in rows whose columns are
grouped into families, suited to large, sparse datasets. Example: Cassandra.

• Graph Databases: Store data in graph structures, ideal for relationships and
interconnected data. Example: Neo4j.
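Below is a minimal pure-Python sketch of these access patterns. The data and names
are illustrative only; real systems such as Redis, MongoDB, Cassandra, and Neo4j
provide the same patterns at distributed scale.

# Key-value: an opaque value looked up by key (Redis-style).
kv_store = {"session:42": "user=asha;ttl=3600"}
print(kv_store["session:42"])

# Document: semi-structured, schema-less records (MongoDB-style).
doc_store = {"users": [{"_id": 1, "name": "Asha", "tags": ["bigdata", "nosql"]}]}
print([u for u in doc_store["users"] if "nosql" in u["tags"]])

# Column-family: rows keyed by ID, columns grouped into families (Cassandra-style).
col_store = {"user:1": {"profile:name": "Asha", "profile:city": "Pune"}}
print(col_store["user:1"]["profile:city"])

# Graph: nodes and edges, traversed by relationship (Neo4j-style).
follows = {"Asha": ["Ravi"], "Ravi": ["Meera"]}
print(follows["Asha"])              # direct relationships
print(follows[follows["Asha"][0]])  # one hop further (friends of friends)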

7. What is the role of schema-less databases in Big Data solutions?


• Answer: Schema-less databases allow for flexible storage of data without needing to
define a fixed structure, making them ideal for handling large and unstructured
datasets.

8. What are some popular NoSQL databases used for Big Data solutions?

• Answer: Popular NoSQL databases include MongoDB (document-based), Cassandra
(wide-column store), Redis (key-value store), and Neo4j (graph database).

9. How does Big Data benefit from NoSQL solutions?

• Answer: NoSQL databases can handle large volumes of unstructured or
semi-structured data, scale horizontally across distributed systems, and provide
high availability and fault tolerance, all of which are critical for Big Data
applications.

Case Studies on Big Data

10. Can you discuss a real-world case study where Big Data solutions provided significant
value?

• Answer: In healthcare, Big Data solutions are used for predictive analytics, such as
predicting disease outbreaks or patient health outcomes. For example, the use of Big
Data at the Mayo Clinic has improved patient care through more accurate diagnosis
and personalized treatment plans.

Unit IV: Hadoop and MapReduce

Introduction to Hadoop

11. What is Hadoop, and what is its architecture?

• Answer: Hadoop is an open-source framework for distributed storage and processing
of large datasets across clusters of computers. Its architecture includes HDFS
(Hadoop Distributed File System) for storage and MapReduce for processing.

12. How does Hadoop work, and what makes it suitable for processing Big Data?

• Answer: Hadoop works by breaking down large datasets into smaller blocks stored
across a cluster. MapReduce processes these blocks in parallel, making Hadoop
highly scalable and fault-tolerant, which is ideal for Big Data processing.

13. What are the advantages of using Hadoop?

• Answer: Hadoop offers scalability, fault tolerance, cost-effectiveness (using
commodity hardware), and flexibility in processing unstructured or semi-structured
data.

14. What is HDFS, and what are its key features?

• Answer: HDFS is a distributed file system used by Hadoop to store large datasets
across multiple machines. Key features include data replication for fault tolerance,
high throughput, and the ability to handle large files.

MapReduce Applications

15. What is MapReduce, and how does it work?

• Answer: MapReduce is a programming model used to process large datasets in
parallel. It consists of two phases: the Map phase, where data is processed and
mapped into key-value pairs, and the Reduce phase, where the mapped data is
aggregated and reduced to meaningful output.
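A minimal word-count sketch of the two phases in plain Python (toy input; on a real
cluster, Hadoop runs many map and reduce tasks in parallel and performs the
shuffle/sort between them):

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # The shuffle/sort step groups pairs by key; Reduce sums counts per word.
    pairs.sort(key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big insights", "data drives decisions"]
mapped = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(mapped))
# {'big': 2, 'data': 2, 'decisions': 1, 'drives': 1, 'insights': 1}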

16. What are some use cases of MapReduce in processing large datasets?

• Answer: Use cases include log file analysis, indexing data for search engines,
machine learning model training, and large-scale data aggregation.

17. What is Hadoop Streaming, and how is it used in MapReduce workflows?

• Answer: Hadoop Streaming allows developers to write MapReduce programs in
languages other than Java (e.g., Python, Ruby). It lets any executable or script
serve as the map and reduce functions, making workflows more flexible, as sketched
below.
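A sketch of the classic streaming word count: two illustrative scripts that read
stdin and write tab-separated key-value lines to stdout, which is the contract
Hadoop Streaming expects. They would be supplied to the hadoop-streaming JAR as the
-mapper and -reducer commands.

# mapper.py: read raw lines from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")

# reducer.py: Hadoop sorts mapper output by key before this runs,
# so equal words arrive on consecutive lines; sum each run of lines.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, int(value)
if current is not None:
    print(f"{current}\t{count}")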

18. What is the role of Apache Spark in the Hadoop ecosystem?

• Answer: Apache Spark is a fast, in-memory data processing engine that can handle
batch and real-time data. It is used as an alternative or complement to MapReduce
for faster processing.

19. What are the key components of the Hadoop ecosystem (HBase, Sqoop, Flume, Pig,
Hive)?

• Answer:

• HBase: A NoSQL database for real-time access to large datasets.

• Sqoop: A tool for importing/exporting data between Hadoop and relational
databases.

• Flume: A service for collecting, aggregating, and moving large amounts of log
data.

• Pig: A high-level platform whose scripting language, Pig Latin, compiles into MapReduce programs.

• Hive: A data warehouse system that provides SQL-like querying over large
datasets in Hadoop.

Web Intelligence and Information Retrieval

Web Intelligence

20. What is Web Intelligence, and how does it benefit businesses?

• Answer: Web Intelligence refers to the ability to gather, analyze, and interpret vast
amounts of web-based data to gain insights. It benefits businesses by improving
decision-making, understanding customer behavior, and enabling better-targeted
marketing.

21. What are the key ingredients of Web Intelligence?


• Answer: Key ingredients include data mining, web analytics, natural language
processing (NLP), machine learning, and visualization techniques to extract
meaningful insights from web data.

22. What is the long tail in social networks, and how does it relate to Web Intelligence?

• Answer: The long tail refers to a large number of niche interests or products that,
individually, may have low demand but collectively represent a significant market.
Web Intelligence helps in identifying and catering to these niche interests effectively.

Information Retrieval

23. What is document representation in information retrieval?

• Answer: Document representation involves converting text data into a structured
format (such as a vector or matrix) for easier analysis and retrieval. This often
involves techniques like tokenization, stemming, and stop-word removal.
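A minimal pre-processing sketch; the stop-word list and the suffix-stripping
"stemmer" are deliberately crude stand-ins for real tools such as the Porter
stemmer:

STOP_WORDS = {"the", "is", "a", "of", "and", "to"}

def stem(word):
    # Crude suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(preprocess("The running of the engines is logged"))
# ['runn', 'engine', 'logg']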

24. What is the Term-Document Matrix, and how is it used in information retrieval?

• Answer: A Term-Document Matrix is a matrix where rows represent terms, columns
represent documents, and the values indicate the frequency or presence of terms in
documents. It is used to find relationships between terms and documents for
efficient retrieval.
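A small sketch building a raw-frequency term-document matrix from three toy
documents:

docs = ["big data needs big storage",
        "nosql stores big data",
        "search engines index documents"]

terms = sorted({word for doc in docs for word in doc.split()})
matrix = [[doc.split().count(term) for doc in docs] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
# e.g. big -> [2, 1, 0]: twice in document 0, once in document 1, absent from document 2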

25. What is stemming, and why is it important in text processing for information retrieval?

• Answer: Stemming reduces words to their root form (e.g., "running" becomes
"run"). It improves search efficiency by ensuring variations of a word are treated as
the same term.

26. Describe the Boolean Retrieval Model in Information Retrieval.

• Answer: The Boolean Retrieval Model uses Boolean operators (AND, OR, NOT) to
query documents. It retrieves documents that satisfy all conditions (AND), any
condition (OR), or exclude certain conditions (NOT).
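A sketch of Boolean retrieval as set operations over a toy inverted index (the
document IDs are invented):

index = {
    "hadoop": {0, 1},
    "spark":  {1, 2},
    "hive":   {0, 2},
}
all_docs = {0, 1, 2}

print(index["hadoop"] & index["spark"])  # hadoop AND spark -> {1}
print(index["hadoop"] | index["hive"])   # hadoop OR hive   -> {0, 1, 2}
print(all_docs - index["hive"])          # NOT hive         -> {1}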

27. What is the Vector Space Model, and how does it differ from the Boolean model?

• Answer: The Vector Space Model represents documents and queries as vectors in a
multi-dimensional space, with each term as a dimension. Unlike the Boolean model,
it allows for partial matching and ranking based on term weights.
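A sketch of ranking by cosine similarity between a query vector and a document
vector; the raw-frequency weights are illustrative (real systems usually use TF-IDF
weights):

import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Dimensions: term weights for (hadoop, spark, hive).
doc   = [2, 1, 0]
query = [1, 1, 0]
print(round(cosine(doc, query), 3))  # 0.949: a partial, rankable match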

28. What is the Probabilistic Information Retrieval Model?

• Answer: The Probabilistic Model ranks documents based on the probability that a
document is relevant to a query, often using techniques like BM25 to estimate
relevance.
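A sketch of one common BM25 variant, scoring a single term's contribution to a
document's relevance; the parameter defaults k1=1.5 and b=0.75 are conventional
choices, not from the original answer:

import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    # Inverse document frequency, weighted by a saturated,
    # length-normalized term frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# A term occurring 3 times in a 100-word document, in a corpus of 1000
# documents (50 contain the term, average length 120 words):
print(round(bm25_term(tf=3, df=50, n_docs=1000, doc_len=100, avg_len=120), 3))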

29. How do precision and recall help evaluate the performance of an information retrieval
system?

• Answer: Precision measures the fraction of retrieved documents that are relevant,
while recall measures the fraction of relevant documents that are retrieved. Both
metrics are used to balance the effectiveness of a retrieval system.
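A minimal sketch computing both metrics from sets of retrieved and relevant document
IDs:

def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant  = {2, 4, 5}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall ~0.667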
30. What is a confusion matrix, and how is it used to evaluate information retrieval models?

• Answer: A confusion matrix is a table that summarizes the performance of a
classification algorithm by comparing predicted and actual outcomes. In information
retrieval, it helps evaluate false positives, false negatives, true positives, and
true negatives.

Web Search Engine Architecture

31. What are the key components of a web search engine architecture?

• Answer: Key components include the crawler (to gather web data), the indexer (to
organize data), and the ranking algorithm (to rank pages based on relevance).

32. How does link analysis help in the ranking of search results?

• Answer: Link analysis, such as PageRank, assigns a score to web pages based on the
number and quality of links pointing to them. This helps rank pages by their
authority and relevance.
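A minimal PageRank sketch using power iteration on a three-page toy graph; the
damping factor 0.85 is the conventional choice:

# links[p] lists the pages that p links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85  # damping factor

for _ in range(50):
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)  # p shares its rank among its links
    rank = new

print({p: round(rank[p], 3) for p in sorted(rank)})
# C ranks highest: it is linked from both A and B.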

Multimedia Search and Text/Image Mining

Multimedia Search

33. How are images, audio, and video handled in multimedia search systems?

• Answer: Multimedia search systems use specialized techniques for indexing and
retrieval, such as image recognition, audio transcription, and video metadata
extraction, to enable efficient searching.

Text Pre-processing

34. What is text pre-processing, and why is it important in data mining?

• Answer: Text pre-processing involves cleaning and transforming raw text data to
make it suitable for analysis. This includes tasks like tokenization, stemming, and
stop-word removal, which improve the quality of text mining results.

35. What is segmentation in text mining, and how is it performed?

• Answer: Segmentation divides text into smaller meaningful units, such as sentences
or words. It is crucial for processing and analyzing large documents.

Image Pre-processing

36. What are the steps involved in image pre-processing?

• Answer: Steps include histogram analysis (to adjust brightness/contrast), noise
cleaning (to remove unwanted pixels), and segmentation (to identify and isolate
objects of interest in an image).

Classification Algorithms

37. Explain the linear regression algorithm and its use cases.

• Answer: Linear regression models the relationship between a dependent variable
and one or more independent variables. It is used in prediction tasks, such as
forecasting sales or housing prices.
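A closed-form least-squares sketch for a single feature; the data is invented for
illustration:

xs = [1, 2, 3, 4, 5]       # e.g. house size
ys = [15, 21, 25, 32, 35]  # e.g. price

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"price ~ {slope:.2f} * size + {intercept:.2f}")         # price ~ 5.10 * size + 10.30
print("prediction for size 6:", round(slope * 6 + intercept, 1))  # 40.9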

38. How does the decision tree algorithm work?

• Answer: Decision trees split data into subsets based on the most significant
attributes, forming a tree-like structure that makes decisions by following branches
to leaf nodes, which represent outcomes.
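A minimal sketch using scikit-learn, assuming it is installed; the toy data and
feature names are invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [age, income]; label 1 = "buys", 0 = "does not buy".
X = [[25, 30], [32, 60], [47, 80], [51, 20], [62, 90], [23, 25]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the learned splits
print(tree.predict([[40, 70]]))  # follow branches to a leaf -> [1]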

39. What is K-means clustering, and when is it used?

• Answer: K-means is a clustering algorithm that partitions data into K clusters by
minimizing the variance within each cluster. It is used in unsupervised learning to
group similar data points together.
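A from-scratch K-means sketch on 2-D toy points:

import random

def kmeans(points, k, iters=20):
    # Start from k random points as centroids, then alternate between
    # assigning points to their nearest centroid and recomputing the means.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

random.seed(1)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # two centers, roughly (1.3, 1.3) and (8.3, 8.3), in some order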

40. What is Naive Bayes classification, and how does it work?

• Answer: Naive Bayes is a probabilistic classifier based on Bayes' Theorem, assuming
feature independence. It is simple and effective for text classification tasks, such
as spam detection.
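A from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing on
a toy spam/ham example:

import math
from collections import Counter

# P(class | words) is proportional to P(class) * product of P(word | class);
# work in log space to avoid numerical underflow.
spam = "win money now claim free prize money".split()
ham = "meeting notes for the project review meeting".split()
vocab = set(spam) | set(ham)

def log_score(words, class_words, prior):
    counts = Counter(class_words)
    total = len(class_words)
    return math.log(prior) + sum(
        math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

message = "claim free money".split()
s = log_score(message, spam, prior=0.5)
h = log_score(message, ham, prior=0.5)
print("spam" if s > h else "ham")  # spam: its words are far likelier under that class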
