BD V: Hadoop Viva Questions

Here are the answers to the viva questions, organized by unit and topic:

Unit I: Big Data and NoSQL

Understanding Big Data

1. What is Big Data, and why is it important?

• Answer: Big Data refers to extremely large datasets that cannot be processed using
traditional data-processing techniques. It is important because it allows businesses
and organizations to extract insights from data on a massive scale, driving decision-
making, optimization, and new business models.

2. What are some common applications of Big Data?

• Answer: Big Data applications include predictive analytics, fraud detection,
recommendation systems, social media analysis, and real-time analytics in industries
like healthcare, finance, retail, and government.

3. What are the major challenges of working with Big Data?

• Answer: Challenges include data privacy and security, managing data quality,
scalability of infrastructure, integrating heterogeneous data sources, and ensuring
real-time processing and analysis.

4. What is Big Data Analytics, and how is it different from traditional data analysis?

• Answer: Big Data Analytics involves examining large and varied datasets to uncover
hidden patterns, correlations, and insights, using advanced technologies like Hadoop,
Spark, and machine learning. It differs from traditional data analysis by handling
much larger volumes of data and often requiring distributed computing and real-
time processing.

Introduction to NoSQL

5. What are the key differences between NoSQL and traditional SQL databases?

• Answer: NoSQL databases are schema-less, offering flexible data modeling, and
scale horizontally, which makes them suitable for Big Data. SQL databases, by
contrast, are relational, use structured schemas, and typically scale vertically.

6. Explain the different types of NoSQL data models.

• Answer:

• Key-Value Databases: Store data as (key, value) pairs and are suitable for
simple queries and fast lookups. Example: Redis.

• Document Databases: Store semi-structured data in documents (usually
JSON). Example: MongoDB.

• Column-Family (Wide-Column) Databases: Store data in rows whose columns are
grouped into families, suited to large, sparse datasets. Example: Cassandra.

• Graph Databases: Store data in graph structures, ideal for relationships and
interconnected data. Example: Neo4j.
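Below is a minimal pure-Python sketch of these access patterns. The data and names
are illustrative only; real systems such as Redis, MongoDB, Cassandra, and Neo4j
provide the same patterns at distributed scale.

# Key-value: an opaque value looked up by key (Redis-style).
kv_store = {"session:42": "user=asha;ttl=3600"}
print(kv_store["session:42"])

# Document: semi-structured, schema-less records (MongoDB-style).
doc_store = {"users": [{"_id": 1, "name": "Asha", "tags": ["bigdata", "nosql"]}]}
print([u for u in doc_store["users"] if "nosql" in u["tags"]])

# Column-family: rows keyed by ID, columns grouped into families (Cassandra-style).
col_store = {"user:1": {"profile:name": "Asha", "profile:city": "Pune"}}
print(col_store["user:1"]["profile:city"])

# Graph: nodes and edges, traversed by relationship (Neo4j-style).
follows = {"Asha": ["Ravi"], "Ravi": ["Meera"]}
print(follows["Asha"])              # direct relationships
print(follows[follows["Asha"][0]])  # one hop further (friends of friends)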

7. What is the role of schema-less databases in Big Data solutions?


• Answer: Schema-less databases allow for flexible storage of data without needing to
define a fixed structure, making them ideal for handling large and unstructured
datasets.

8. What are some popular NoSQL databases used for Big Data solutions?

• Answer: Popular NoSQL databases include MongoDB (document-based), Cassandra
(wide-column store), Redis (key-value store), and Neo4j (graph database).

9. How does Big Data benefit from NoSQL solutions?

• Answer: NoSQL databases can handle large volumes of unstructured or
semi-structured data, scale horizontally across distributed systems, and provide
high availability and fault tolerance, all of which are critical for Big Data
applications.

Case Studies on Big Data

10. Can you discuss a real-world case study where Big Data solutions provided significant
value?

• Answer: In healthcare, Big Data solutions are used for predictive analytics, such as
predicting disease outbreaks or patient health outcomes. For example, the use of Big
Data at the Mayo Clinic has improved patient care through more accurate diagnosis
and personalized treatment plans.

Unit IV: Hadoop and MapReduce

Introduction to Hadoop

11. What is Hadoop, and what is its architecture?

• Answer: Hadoop is an open-source framework for distributed storage and processing
of large datasets across clusters of computers. Its architecture includes HDFS
(Hadoop Distributed File System) for storage and MapReduce for processing.

12. How does Hadoop work, and what makes it suitable for processing Big Data?

• Answer: Hadoop works by breaking down large datasets into smaller blocks stored
across a cluster. MapReduce processes these blocks in parallel, making Hadoop
highly scalable and fault-tolerant, which is ideal for Big Data processing.

13. What are the advantages of using Hadoop?

• Answer: Hadoop offers scalability, fault tolerance, cost-effectiveness (using
commodity hardware), and flexibility in processing unstructured or semi-structured
data.

14. What is HDFS, and what are its key features?

• Answer: HDFS is a distributed file system used by Hadoop to store large datasets
across multiple machines. Key features include data replication for fault tolerance,
high throughput, and the ability to handle large files.

MapReduce Applications

15. What is MapReduce, and how does it work?

• Answer: MapReduce is a programming model used to process large datasets in
parallel. It consists of two phases: the Map phase, where data is processed and
mapped into key-value pairs, and the Reduce phase, where the mapped data is
aggregated and reduced to meaningful output.
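A minimal word-count sketch of the two phases in plain Python (toy input; on a real
cluster, Hadoop runs many map and reduce tasks in parallel and performs the
shuffle/sort between them):

from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # The shuffle/sort step groups pairs by key; Reduce sums counts per word.
    pairs.sort(key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big insights", "data drives decisions"]
mapped = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(mapped))
# {'big': 2, 'data': 2, 'decisions': 1, 'drives': 1, 'insights': 1}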

16. What are some use cases of MapReduce in processing large datasets?

• Answer: Use cases include log file analysis, indexing data for search engines,
machine learning model training, and large-scale data aggregation.

17. What is Hadoop Streaming, and how is it used in MapReduce workflows?

• Answer: Hadoop Streaming allows developers to write MapReduce programs in
languages other than Java (e.g., Python, Ruby). It lets any executable or script
serve as the map and reduce functions, making workflows more flexible, as sketched
below.
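A sketch of the classic streaming word count: two illustrative scripts that read
stdin and write tab-separated key-value lines to stdout, which is the contract
Hadoop Streaming expects. They would be supplied to the hadoop-streaming JAR as the
-mapper and -reducer commands.

# mapper.py: read raw lines from stdin, emit "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")

# reducer.py: Hadoop sorts mapper output by key before this runs,
# so equal words arrive on consecutive lines; sum each run of lines.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current:
        count += int(value)
    else:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, int(value)
if current is not None:
    print(f"{current}\t{count}")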

18. What is the role of Apache Spark in the Hadoop ecosystem?

• Answer: Apache Spark is a fast, in-memory data processing engine that can handle
batch and real-time data. It is used as an alternative or complement to MapReduce
for faster processing.

19. What are the key components of the Hadoop ecosystem (HBase, Sqoop, Flume, Pig,
Hive)?

• Answer:

• HBase: A NoSQL database for real-time access to large datasets.

• Sqoop: A tool for importing/exporting data between Hadoop and relational
databases.

• Flume: A service for collecting, aggregating, and moving large amounts of log
data.

• Pig: A high-level platform whose scripting language, Pig Latin, compiles into MapReduce programs.

• Hive: A data warehouse system that provides SQL-like querying over large
datasets in Hadoop.

Web Intelligence and Information Retrieval

Web Intelligence

20. What is Web Intelligence, and how does it benefit businesses?

• Answer: Web Intelligence refers to the ability to gather, analyze, and interpret vast
amounts of web-based data to gain insights. It benefits businesses by improving
decision-making, understanding customer behavior, and enabling better-targeted
marketing.

21. What are the key ingredients of Web Intelligence?


• Answer: Key ingredients include data mining, web analytics, natural language
processing (NLP), machine learning, and visualization techniques to extract
meaningful insights from web data.

22. What is the long tail in social networks, and how does it relate to Web Intelligence?

• Answer: The long tail refers to a large number of niche interests or products that,
individually, may have low demand but collectively represent a significant market.
Web Intelligence helps in identifying and catering to these niche interests effectively.

Information Retrieval

23. What is document representation in information retrieval?

• Answer: Document representation involves converting text data into a structured
format (such as a vector or matrix) for easier analysis and retrieval. This often
involves techniques like tokenization, stemming, and stop-word removal.
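A minimal pre-processing sketch; the stop-word list and the suffix-stripping
"stemmer" are deliberately crude stand-ins for real tools such as the Porter
stemmer:

STOP_WORDS = {"the", "is", "a", "of", "and", "to"}

def stem(word):
    # Crude suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in tokens]                     # stemming

print(preprocess("The running of the engines is logged"))
# ['runn', 'engine', 'logg']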

24. What is the Term-Document Matrix, and how is it used in information retrieval?

• Answer: A Term-Document Matrix is a matrix where rows represent terms, columns
represent documents, and the values indicate the frequency or presence of terms in
documents. It is used to find relationships between terms and documents for
efficient retrieval.
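A small sketch building a raw-frequency term-document matrix from three toy
documents:

docs = ["big data needs big storage",
        "nosql stores big data",
        "search engines index documents"]

terms = sorted({word for doc in docs for word in doc.split()})
matrix = [[doc.split().count(term) for doc in docs] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
# e.g. big -> [2, 1, 0]: twice in document 0, once in document 1, absent from document 2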

25. What is stemming, and why is it important in text processing for information retrieval?

• Answer: Stemming reduces words to their root form (e.g., "running" becomes
"run"). It improves search efficiency by ensuring variations of a word are treated as
the same term.

26. Describe the Boolean Retrieval Model in Information Retrieval.

• Answer: The Boolean Retrieval Model uses Boolean operators (AND, OR, NOT) to
query documents. It retrieves documents that satisfy all conditions (AND), any
condition (OR), or exclude certain conditions (NOT).
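A sketch of Boolean retrieval as set operations over a toy inverted index (the
document IDs are invented):

index = {
    "hadoop": {0, 1},
    "spark":  {1, 2},
    "hive":   {0, 2},
}
all_docs = {0, 1, 2}

print(index["hadoop"] & index["spark"])  # hadoop AND spark -> {1}
print(index["hadoop"] | index["hive"])   # hadoop OR hive   -> {0, 1, 2}
print(all_docs - index["hive"])          # NOT hive         -> {1}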

27. What is the Vector Space Model, and how does it differ from the Boolean model?

• Answer: The Vector Space Model represents documents and queries as vectors in a
multi-dimensional space, with each term as a dimension. Unlike the Boolean model,
it allows for partial matching and ranking based on term weights.
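A sketch of ranking by cosine similarity between a query vector and a document
vector; the raw-frequency weights are illustrative (real systems usually use TF-IDF
weights):

import math

def cosine(u, v):
    # Cosine of the angle between two term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Dimensions: term weights for (hadoop, spark, hive).
doc   = [2, 1, 0]
query = [1, 1, 0]
print(round(cosine(doc, query), 3))  # 0.949: a partial, rankable match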

28. What is the Probabilistic Information Retrieval Model?

• Answer: The Probabilistic Model ranks documents based on the probability that a
document is relevant to a query, often using techniques like BM25 to estimate
relevance.
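A sketch of one common BM25 variant, scoring a single term's contribution to a
document's relevance; the parameter defaults k1=1.5 and b=0.75 are conventional
choices, not from the original answer:

import math

def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.5, b=0.75):
    # Inverse document frequency, weighted by a saturated,
    # length-normalized term frequency.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# A term occurring 3 times in a 100-word document, in a corpus of 1000
# documents (50 contain the term, average length 120 words):
print(round(bm25_term(tf=3, df=50, n_docs=1000, doc_len=100, avg_len=120), 3))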

29. How do precision and recall help evaluate the performance of an information retrieval
system?

• Answer: Precision measures the fraction of retrieved documents that are relevant,
while recall measures the fraction of relevant documents that are retrieved. Both
metrics are used to balance the effectiveness of a retrieval system.
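A minimal sketch computing both metrics from sets of retrieved and relevant document
IDs:

def precision_recall(retrieved, relevant):
    tp = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4}
relevant  = {2, 4, 5}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall ~0.667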
30. What is a confusion matrix, and how is it used to evaluate information retrieval models?

• Answer: A confusion matrix is a table that summarizes the performance of a
classification algorithm by comparing predicted and actual outcomes. In information
retrieval, it helps evaluate false positives, false negatives, true positives, and
true negatives.

Web Search Engine Architecture

31. What are the key components of a web search engine architecture?

• Answer: Key components include the crawler (to gather web data), the indexer (to
organize data), and the ranking algorithm (to rank pages based on relevance).

32. How does link analysis help in the ranking of search results?

• Answer: Link analysis, such as PageRank, assigns a score to web pages based on the
number and quality of links pointing to them. This helps rank pages by their
authority and relevance.
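A minimal PageRank sketch using power iteration on a three-page toy graph; the
damping factor 0.85 is the conventional choice:

# links[p] lists the pages that p links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
d = 0.85  # damping factor

for _ in range(50):
    new = {p: (1 - d) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += d * rank[p] / len(outs)  # p shares its rank among its links
    rank = new

print({p: round(rank[p], 3) for p in sorted(rank)})
# C ranks highest: it is linked from both A and B.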

Multimedia Search and Text/Image Mining

Multimedia Search

33. How are images, audio, and video handled in multimedia search systems?

• Answer: Multimedia search systems use specialized techniques for indexing and
retrieval, such as image recognition, audio transcription, and video metadata
extraction, to enable efficient searching.

Text Pre-processing

34. What is text pre-processing, and why is it important in data mining?

• Answer: Text pre-processing involves cleaning and transforming raw text data to
make it suitable for analysis. This includes tasks like tokenization, stemming, and
stop-word removal, which improve the quality of text mining results.

35. What is segmentation in text mining, and how is it performed?

• Answer: Segmentation divides text into smaller meaningful units, such as sentences
or words. It is crucial for processing and analyzing large documents.

Image Pre-processing

36. What are the steps involved in image pre-processing?

• Answer: Steps include histogram analysis (to adjust brightness/contrast), noise
cleaning (to remove unwanted pixels), and segmentation (to identify and isolate
objects of interest in an image).

Classification Algorithms

37. Explain the linear regression algorithm and its use cases.

• Answer: Linear regression models the relationship between a dependent variable
and one or more independent variables. It is used in prediction tasks, such as
forecasting sales or housing prices.
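A closed-form least-squares sketch for a single feature; the data is invented for
illustration:

xs = [1, 2, 3, 4, 5]       # e.g. house size
ys = [15, 21, 25, 32, 35]  # e.g. price

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"price ~ {slope:.2f} * size + {intercept:.2f}")         # price ~ 5.10 * size + 10.30
print("prediction for size 6:", round(slope * 6 + intercept, 1))  # 40.9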

38. How does the decision tree algorithm work?

• Answer: Decision trees split data into subsets based on the most significant
attributes, forming a tree-like structure that makes decisions by following branches
to leaf nodes, which represent outcomes.
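A minimal sketch using scikit-learn, assuming it is installed; the toy data and
feature names are invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [age, income]; label 1 = "buys", 0 = "does not buy".
X = [[25, 30], [32, 60], [47, 80], [51, 20], [62, 90], [23, 25]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the learned splits
print(tree.predict([[40, 70]]))  # follow branches to a leaf -> [1]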

39. What is K-means clustering, and when is it used?

• Answer: K-means is a clustering algorithm that partitions data into K clusters by
minimizing the variance within each cluster. It is used in unsupervised learning to
group similar data points together.
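A from-scratch K-means sketch on 2-D toy points:

import random

def kmeans(points, k, iters=20):
    # Start from k random points as centroids, then alternate between
    # assigning points to their nearest centroid and recomputing the means.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

random.seed(1)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # two centers, roughly (1.3, 1.3) and (8.3, 8.3), in some order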

40. What is Naive Bayes classification, and how does it work?

• Answer: Naive Bayes is a probabilistic classifier based on Bayes' Theorem, assuming
feature independence. It is simple and effective for text classification tasks, such
as spam detection.
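A from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing on
a toy spam/ham example:

import math
from collections import Counter

# P(class | words) is proportional to P(class) * product of P(word | class);
# work in log space to avoid numerical underflow.
spam = "win money now claim free prize money".split()
ham = "meeting notes for the project review meeting".split()
vocab = set(spam) | set(ham)

def log_score(words, class_words, prior):
    counts = Counter(class_words)
    total = len(class_words)
    return math.log(prior) + sum(
        math.log((counts[w] + 1) / (total + len(vocab))) for w in words)

message = "claim free money".split()
s = log_score(message, spam, prior=0.5)
h = log_score(message, ham, prior=0.5)
print("spam" if s > h else "ham")  # spam: its words are far likelier under that class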
