Big Data Analytics - Exam Answers
**Answers to Big Data Analytics Exam Questions (MSc IT)**
---
**Q.1 (A) (ii): What are the three characteristics of big data? Explain the differences
between BI and Data Science.**
**Three Characteristics of Big Data (3Vs):**
1. **Volume:** Big data involves massive amounts of data generated from various sources
like social media, sensors, digital transactions, etc. Traditional tools cannot handle such
huge volumes.
2. **Velocity:** Refers to the speed at which data is generated and processed. Big data
systems must handle real-time or near real-time data flows.
3. **Variety:** Big data comes in various formats including structured (databases), semi-structured (XML, JSON), and unstructured (videos, images, text).
**Differences between BI and Data Science:**
| Feature | Business Intelligence (BI) | Data Science |
|-------------------|-----------------------------------------------------|-------------------------------------------------------------|
| **Focus** | Past and present data analysis for decision-making | Predictive and prescriptive analytics |
| **Tools** | Excel, SQL, Power BI, Tableau | Python, R, machine learning libraries (e.g., scikit-learn) |
| **Data Handling** | Mostly structured data | Structured, semi-structured, and unstructured data |
| **Output** | Dashboards, reports, KPIs | Predictive models, algorithms, actionable insights |
| **Goal** | Business reporting and tracking | Discovering patterns, automating decision-making |
---
**Q.1 (B) (ii): Write a short note on data science and data science process.**
**Data Science:**
Data Science is an interdisciplinary field that combines techniques from statistics, computer
science, and domain expertise to extract meaningful insights and knowledge from data. It
involves collecting, cleaning, analyzing, and interpreting large volumes of data to support
decision-making and create predictive models.
**Data Science Process:**
1. **Problem Understanding:** Define the objective and understand the business need.
2. **Data Collection:** Gather relevant data from different sources.
3. **Data Cleaning:** Remove inconsistencies, handle missing values, and correct errors.
4. **Exploratory Data Analysis (EDA):** Analyze trends, patterns, and relationships in the
data.
5. **Feature Engineering:** Create new features that can improve model performance.
6. **Model Building:** Apply algorithms such as regression, classification, or clustering.
7. **Evaluation:** Test the model using accuracy, precision, recall, etc.
8. **Deployment:** Integrate the model into the production environment.
9. **Monitoring and Maintenance:** Ensure continued model performance over time.
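A minimal sketch of steps 2 through 7 in Python with pandas and scikit-learn is shown below; the file name `customers.csv` and the target column `churn` are hypothetical, and the deployment and monitoring steps are omitted.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Data Collection: load raw data (hypothetical file and columns)
df = pd.read_csv("customers.csv")

# Data Cleaning: drop duplicates and fill missing numeric values with medians
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Split features and target (assumes a binary "churn" column and numeric features)
X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Building
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluation
y_pred = model.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```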
---
**Q.2 (A) (ii): What is Logistic Regression? Explain in detail. Also explain any two of its
applications.**
**Logistic Regression:**
Logistic regression is a statistical method used for binary classification problems, where the outcome is a categorical variable with two possible values (e.g., yes/no, true/false). It uses the logistic (sigmoid) function to model the probability of the positive class based on one or more independent variables.
**Logistic Function:**
\[ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n)}} \]
**Working:**
- It calculates the probability of a data point belonging to a certain class.
- Based on a threshold (usually 0.5), it classifies the data point into class 0 or class 1.
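A small numeric sketch, with made-up coefficient values, shows how the linear score is turned into a probability and then a class label:
```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters: beta_0 (intercept), beta_1, beta_2
beta = np.array([-1.5, 0.8, 2.0])

# One data point with two features x1, x2 (leading 1 multiplies the intercept)
x = np.array([1.0, 0.5, 1.2])

p = sigmoid(beta @ x)          # P(Y = 1 | x)
label = 1 if p >= 0.5 else 0   # classify using the usual 0.5 threshold

print(f"P(Y=1) = {p:.3f} -> class {label}")
```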
**Applications:**
1. **Spam Detection:** Identifying whether an email is spam or not.
2. **Medical Diagnosis:** Predicting whether a patient has a disease based on symptoms
and test results.
---
**Q.3 (A) (i): Write a short note on: The Data Science Pipeline**
**Data Science Pipeline:**
A data science pipeline outlines the sequence of steps followed in a data science project. It
helps streamline the process from data collection to model deployment.
**Phases:**
1. **Data Collection:** Gather data from APIs, databases, or web scraping.
2. **Data Preparation:** Clean, normalize, and transform raw data for analysis.
3. **Exploratory Data Analysis (EDA):** Visualize and explore data to find trends, patterns,
and correlations.
4. **Modeling:** Choose and apply appropriate algorithms to build predictive or
classification models.
5. **Evaluation:** Validate model accuracy using metrics like confusion matrix, ROC curve,
F1-score.
6. **Deployment:** Integrate the model into a business application or web interface.
7. **Monitoring:** Track model performance and update as needed.
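As a minimal sketch, the preparation, modeling, and evaluation phases can be chained with scikit-learn's `Pipeline` object; the dataset here is synthetic, and the collection, deployment, and monitoring phases are assumed to sit around this code.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Stand-in for Data Collection: a synthetic binary-classification dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Data Preparation + Modeling chained into one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),        # normalize features
    ("model", LogisticRegression()),    # fit a classifier
])
pipe.fit(X_train, y_train)

# Evaluation: precision, recall, F1-score per class
print(classification_report(y_test, pipe.predict(X_test)))
```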
---
**Q.3 (B) (ii): Write a short note on Hadoop architecture.**
**Hadoop Architecture:**
Hadoop has two primary components:
1. **HDFS (Hadoop Distributed File System):**
- Manages distributed storage.
- Consists of NameNode (master) and DataNodes (slaves).
 - Stores large files by splitting them into blocks and distributing the blocks across DataNodes.
2. **YARN (Yet Another Resource Negotiator):**
- Resource management and job scheduling layer.
- Contains ResourceManager and NodeManagers.
**Features:**
- Supports parallel processing.
- Ensures fault tolerance by replicating data.
- Allows scalability by adding new nodes.
- Data locality optimization ensures that computation occurs near data.
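A back-of-the-envelope sketch (using the common defaults of a 128 MB block size and a replication factor of 3, both of which are configurable) shows how HDFS would split and replicate a hypothetical 1 GB file:
```python
import math

BLOCK_SIZE_MB = 128       # common HDFS default block size
REPLICATION = 3           # common default replication factor

file_size_mb = 1024       # hypothetical 1 GB file

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
stored_copies = blocks * REPLICATION

print(f"{file_size_mb} MB file -> {blocks} blocks, "
      f"{stored_copies} block replicas spread across DataNodes")
# The NameNode keeps only the metadata (which DataNodes hold which blocks);
# the DataNodes store the actual block data.
```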
---
**Q.4 (A) (i): What are the Distributed Analysis and Patterns?**
**Distributed Analysis:**
It involves processing large data sets by distributing the data across multiple systems or
nodes. This enables faster computation and the ability to handle big data volumes.
**Common Patterns:**
1. **MapReduce Pattern:** Splits data into small chunks (Map), processes them in parallel,
and then combines the output (Reduce).
2. **Master-Slave Pattern:** One central master node manages multiple slave nodes which
perform the actual processing.
3. **Pipeline Pattern:** Data passes through a series of processing stages, each transforming
the data incrementally.
These patterns help optimize performance, improve fault tolerance, and make systems
scalable.
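As a toy, single-machine illustration of the MapReduce pattern, the sketch below counts words by running the map, shuffle, and reduce phases sequentially; a real Hadoop or Spark job would run the map and reduce steps in parallel across many nodes.
```python
from collections import defaultdict

documents = [
    "big data needs distributed processing",
    "distributed processing needs big clusters",
]

# Map phase: emit (word, 1) pairs from each input record
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
```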
---
**Q.4 (B) (ii): Explain Spark SQL interface architecture with a neat diagram.**
**Spark SQL Interface Architecture:**
**Components:**
1. **Data Sources:** Includes Hive, JSON, JDBC, Parquet, Avro, etc.
2. **DataFrame API:** Allows developers to perform SQL-like operations on data.
3. **Catalyst Optimizer:** Optimizes queries using rule-based and cost-based strategies.
4. **Tungsten Execution Engine:** Improves execution using whole-stage code generation
and memory management.
5. **Query Execution:** Optimized query plans are executed over the distributed Spark
Core.
**Workflow:**
- User writes SQL or uses DataFrame API.
- Catalyst builds a logical plan and optimizes it.
- Tungsten executes the plan efficiently.
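A minimal PySpark sketch (assuming a local Spark installation) shows the same query expressed through both the DataFrame API and SQL; Catalyst optimizes and Tungsten executes both in the same way, and `explain()` prints the optimized plan.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Create a small DataFrame (in practice this could come from Hive, JSON,
# Parquet, JDBC, etc.)
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# DataFrame API
df.filter(df.age > 30).select("name").show()

# Equivalent SQL over a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# Inspect the optimized plan produced by the Catalyst optimizer
spark.sql("SELECT name FROM people WHERE age > 30").explain()

spark.stop()
```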