
Untitled Document (1)

The document provides answers and explanations for a series of questions across various topics including Advanced Machine Learning, Complex Statistics & Probability, Big Data & Distributed Systems, and more. Each question is followed by a correct answer and a brief explanation of the concept involved. The content serves as a comprehensive guide for understanding key concepts in data science and related fields.


Here are the answers and detailed explanations for each question:

---

Advanced Machine Learning

1. A) All neurons are active


Explanation: During training with dropout, some neurons are randomly "dropped" to prevent
overfitting. At test time, all neurons are active, and the weights are scaled accordingly.
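The train/test asymmetry can be sketched with "inverted dropout" in NumPy (the rate, input values, and seed here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.ones(10)          # activations from some layer
p_drop = 0.5             # dropout probability

# Training: randomly zero activations and scale the survivors by
# 1/(1 - p) so the expected activation matches test time.
mask = rng.random(x.shape) >= p_drop
train_out = x * mask / (1 - p_drop)

# Test time: all neurons are active, and no extra scaling is needed.
test_out = x
```

Because the scaling happens during training, inference is just a plain forward pass.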

2. D) Loss function
Explanation: Batch normalization normalizes the inputs of layers (input or hidden) to stabilize
learning. The loss function is not normalized.

3. B) Mode collapse
Explanation: In GANs, mode collapse happens when the generator produces limited varieties of
outputs, failing to capture the diversity of real data.

4. B) Minimum loss reduction for split


Explanation: The gamma parameter in XGBoost controls the minimum loss reduction required to
make a further partition on a leaf node. It helps control model complexity.

5. B) To speed up convergence
Explanation: Teacher forcing feeds the ground-truth previous output into the model at each
training step instead of the model's own prediction, which stabilizes training and speeds up convergence.

---

Complex Statistics & Probability

6. B) P(parameters|data)
Explanation: The posterior distribution is the probability of the parameters given the observed
data, central to Bayesian inference.
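A tiny worked example of computing P(parameters|data) with Bayes' rule (the two coin-bias hypotheses and the data are made up for illustration):

```python
# Two hypotheses for a coin's heads probability, with a uniform prior.
prior = {0.5: 0.5, 0.8: 0.5}

# Observed data: 3 heads in 3 flips, so P(data | theta) = theta**3.
likelihood = {theta: theta ** 3 for theta in prior}

# P(data): total probability of the data under the prior.
evidence = sum(prior[t] * likelihood[t] for t in prior)

# Posterior: P(theta | data) = P(data | theta) * P(theta) / P(data).
posterior = {t: prior[t] * likelihood[t] / evidence for t in prior}
```

Three heads in a row shift belief toward the biased coin, exactly as the posterior formula dictates.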

7. B) Decreases statistical power


Explanation: Bonferroni correction is conservative. While it reduces Type I errors, it increases
Type II errors, thus lowering statistical power.
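The correction itself is one division; the p-values below are hypothetical, chosen to show how a result that would pass at the raw threshold fails the stricter one:

```python
m = 20                        # number of hypotheses tested
alpha = 0.05                  # desired family-wise error rate
alpha_per_test = alpha / m    # Bonferroni-adjusted threshold per test

p_values = [0.001, 0.004, 0.03, 0.2]
rejected = [p for p in p_values if p < alpha_per_test]
# 0.004 and 0.03 would be "significant" at 0.05 but survive no longer —
# that lost sensitivity is the reduction in statistical power.
```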

8. B) Normality of residuals
Explanation: A QQ plot compares the quantiles of the sample against those of a theoretical
distribution (typically the normal); points falling near the diagonal suggest the residuals are
normally distributed.

9. A) To generate candidate samples


Explanation: In MCMC, the proposal distribution suggests the next point to sample, which is
then accepted or rejected based on a probability.
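A minimal Metropolis-Hastings sketch makes the proposal's role concrete; here the target is an unnormalized standard normal and the proposal is a random walk (both are illustrative choices):

```python
import math
import random

random.seed(0)

def target(x):
    # Unnormalized target density: standard normal.
    return math.exp(-x * x / 2)

x, samples = 0.0, []
for _ in range(5000):
    candidate = x + random.gauss(0, 1)      # proposal generates the candidate
    accept_prob = min(1.0, target(candidate) / target(x))
    if random.random() < accept_prob:       # accept or reject the candidate
        x = candidate
    samples.append(x)

mean = sum(samples) / len(samples)          # should be near 0
```

The proposal only suggests moves; the accept/reject step is what makes the chain converge to the target.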

10. B) Test if a sample follows a specified distribution


Explanation: The KS test compares the empirical distribution function of the sample with the
expected cumulative distribution function.
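The KS statistic itself is easy to compute by hand; this sketch tests a small made-up sample against a standard normal CDF (the `ks_statistic` helper is illustrative, not a library function):

```python
import math

def normal_cdf(x):
    # CDF of the standard normal via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks_statistic(sample, cdf):
    """D = sup over x of |ECDF(x) - CDF(x)|, evaluated at the sample points."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The ECDF jumps at each point; check both sides of the step.
        d = max(d, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return d

sample = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5]
d = ks_statistic(sample, normal_cdf)
```

In practice `scipy.stats.kstest` computes both the statistic and its p-value.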

---

Big Data & Distributed Systems

11. A) To share large read-only variables across nodes


Explanation: Spark broadcast variables efficiently distribute large data like lookup tables to all
worker nodes.

12. A) In-Sync Replicas


Explanation: ISR in Kafka refers to replicas that are fully caught up with the leader and ready to
take over in case of failure.

13. A) Systems can only guarantee 2 of 3: Consistency, Availability, Partition tolerance


Explanation: The CAP theorem states a distributed system can at most guarantee two of the
three properties.

14. B) Allocates cluster resources


Explanation: Hadoop YARN’s ResourceManager manages resource allocation among
applications in the cluster.
15. A) Better compression and query performance
Explanation: Columnar storage allows efficient queries on selected columns and better
compression compared to row-based formats.

---

Advanced Algorithms & Optimization

16. C) O(n³)
Explanation: The Hungarian algorithm for solving assignment problems has a cubic time
complexity, suitable for small to moderate datasets.

17. B) Model becomes more sensitive to individual points


Explanation: A high gamma makes the RBF kernel more focused on individual data points,
potentially leading to overfitting.
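The effect of gamma is visible directly in the kernel formula, K(x, z) = exp(-gamma * ||x - z||²); the points and gamma values below are arbitrary:

```python
import math

def rbf(x, z, gamma):
    # RBF kernel similarity between two 1-D points.
    return math.exp(-gamma * (x - z) ** 2)

# Low gamma: distant points still look fairly similar.
low = rbf(0.0, 2.0, gamma=0.1)     # roughly 0.67
# High gamma: similarity decays sharply, so each training point only
# influences its immediate neighborhood — the overfitting risk.
high = rbf(0.0, 2.0, gamma=10.0)   # essentially zero
```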

18. B) Discounted future rewards


Explanation: The Bellman equation expresses the value of a policy as the expected sum of
current and future (discounted) rewards.
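The discounted return G_t = r_t + gamma * G_(t+1) can be evaluated with one backward pass over a trajectory (the rewards and discount factor here are made up):

```python
gamma = 0.9                       # discount factor
rewards = [1.0, 1.0, 1.0, 10.0]   # rewards along one trajectory

# Work backwards: each step's value is its reward plus the
# discounted value of everything that follows.
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g
# g now equals r_0 + gamma*r_1 + gamma^2*r_2 + gamma^3*r_3
```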

19. B) Self-attention mechanisms


Explanation: Transformers use self-attention to capture dependencies without recurrence,
making them faster and better at long-range modeling.
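A stripped-down sketch of scaled dot-product self-attention in NumPy; it omits the learned Q/K/V projections (using the raw inputs for all three), so it shows only the attention mechanics, not a full Transformer layer:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity Q = K = V projections."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)    # pairwise similarity between all tokens
    # Softmax over keys (numerically stabilized): attention weights per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X               # each token is a mix of all tokens

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy token vectors
out = self_attention(X)
```

Every output token attends to every input token in one step, which is why no recurrence is needed to capture long-range dependencies.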

20. A) Identifying confounders


Explanation: The backdoor criterion is used in causal inference to find variables that block
spurious paths and help estimate causal effects correctly.

---

Machine Learning

21. B) Random classifier


Explanation: An ROC AUC of 0.5 means the model is no better than random guessing in
distinguishing between classes.

22. A) Select optimal k


Explanation: The elbow method helps choose the number of clusters by finding the "elbow"
point where adding more clusters doesn't improve much.

23. B) F1-score
Explanation: The F1-score balances precision and recall, making it suitable for imbalanced
datasets where accuracy can be misleading.
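The F1 computation from hypothetical confusion-matrix counts:

```python
tp, fp, fn = 8, 2, 4   # hypothetical true positives, false positives, false negatives

precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
```

The harmonic mean drags F1 toward the weaker of the two, so a model cannot score well by sacrificing recall for precision or vice versa.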

24. B) Dimensionality reduction for visualization


Explanation: t-SNE is a nonlinear technique for reducing high-dimensional data into 2 or 3
dimensions to visualize patterns.

25. B) Word importance in a corpus


Explanation: TF-IDF assigns importance to words based on their frequency in a document
relative to the corpus.
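A from-scratch TF-IDF for a toy three-document corpus (real libraries such as scikit-learn add smoothing; this `tf_idf` helper is the bare textbook formula):

```python
import math

docs = [
    "the cat sat".split(),
    "the dog ran".split(),
    "the cat ran".split(),
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)    # term frequency within the document
    df = sum(term in d for d in docs)  # number of documents containing it
    idf = math.log(len(docs) / df)     # inverse document frequency
    return tf * idf

# "the" appears in every document, so its idf — and tf-idf — is zero;
# "cat" is rarer across the corpus, so it scores higher.
score_the = tf_idf("the", docs[0], docs)
score_cat = tf_idf("cat", docs[0], docs)
```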

---

Data Engineering

26. B) Stores quantitative metrics


Explanation: A fact table contains the measurable metrics (facts) and foreign keys to dimension
tables in a star schema.

27. B) Store raw data in native format


Explanation: A data lake allows storing structured, semi-structured, and unstructured data in its
original form.

28. B) HAVING
Explanation: HAVING is used to filter grouped data after aggregation, unlike WHERE, which
filters rows before aggregation.
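The distinction runs end to end in a few lines of SQL against an in-memory SQLite table (the `sales` table and its rows are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 250), ("west", 50), ("west", 30)],
)

# WHERE would filter individual rows before grouping; HAVING filters
# the aggregated groups — here, only regions whose total exceeds 100.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM sales GROUP BY region HAVING total > 100"
).fetchall()
```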
29. A) Data modeling approaches
Explanation: Schema-on-read means data is interpreted at query time, while schema-on-write
enforces structure when data is ingested.

30. A) Directed Acyclic Graph (workflow)


Explanation: In Airflow, a DAG defines the structure and dependencies of tasks to be executed
in a scheduled workflow.

---

Statistics & Probability

31. C) 95%
Explanation: In a normal distribution, about 95% of the data falls within ±2 standard deviations
from the mean.
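The exact figure follows from the normal CDF: P(|X - mu| <= k*sigma) = erf(k / sqrt(2)), which for k = 2 gives about 95.45%:

```python
import math

# Probability mass of a normal distribution within 2 standard deviations.
within_2sd = math.erf(2 / math.sqrt(2))   # approximately 0.9545
```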

32. B) Probability of observed result given null is true


Explanation: A p-value is the likelihood of seeing the observed data (or more extreme)
assuming the null hypothesis is true.

33. A) Sample means converge to population mean as n increases


Explanation: The Law of Large Numbers ensures that as sample size grows, the average of the
sample gets closer to the population mean.
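A quick simulation with fair die rolls (population mean 3.5); the seed and sample sizes are arbitrary:

```python
import random

random.seed(42)

def sample_mean(n):
    # Average of n fair die rolls.
    return sum(random.randint(1, 6) for _ in range(n)) / n

small = sample_mean(10)        # can land far from 3.5
large = sample_mean(100_000)   # reliably lands very close to 3.5
```

The large-sample average clusters tightly around 3.5, exactly as the Law of Large Numbers promises.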

34. B) ANOVA
Explanation: ANOVA (Analysis of Variance) is used to test differences between three or more
group means.

35. B) Constant variance of residuals


Explanation: Heteroscedasticity violates the assumption of homoscedasticity, where residuals
should have constant variance.
---

Programming & Tools

36. B) Creates getter method


Explanation: The @property decorator in Python turns a method into a read-only attribute getter.
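A minimal example (the `Circle` class is invented for illustration):

```python
class Circle:
    def __init__(self, radius):
        self.radius = radius

    @property
    def area(self):
        # Computed on access, but read like a plain attribute.
        return 3.141592653589793 * self.radius ** 2

c = Circle(2)
value = c.area   # no parentheses — the decorated method acts as a getter
```

Assigning to `c.area` would raise an `AttributeError` unless a matching setter is defined.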

37. B) O(log n)
Explanation: A balanced Binary Search Tree supports search, insertion, and deletion in
logarithmic time, because each comparison halves the remaining search space.
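Python's standard library has no built-in BST, but `bisect` gives the same O(log n) lookup bound via binary search on a sorted list, which illustrates the halving idea:

```python
import bisect

sorted_vals = [1, 3, 5, 7, 9, 11]

# Binary search: O(log n) per lookup, the same bound a balanced BST
# gives for its queries.
idx = bisect.bisect_left(sorted_vals, 7)
found = idx < len(sorted_vals) and sorted_vals[idx] == 7
```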

38. B) git add .


Explanation: This command stages all modified and new files for commit in Git.

39. A) COPY is faster but ADD handles URLs


Explanation: ADD supports more features like extracting tar files or downloading from URLs, but
COPY is preferred for simplicity and performance.

40. B) Runs regardless of exceptions


Explanation: The finally block in Python executes no matter what, ensuring cleanup or closing
resources.
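A small demonstration that `finally` runs even when the `except` block returns (the `read_config` function and its failure are contrived for the demo):

```python
log = []

def read_config(path):
    try:
        raise FileNotFoundError(path)   # simulate a failing operation
    except FileNotFoundError:
        log.append("handled")
        return None
    finally:
        log.append("cleanup")   # runs regardless — even after the return above

read_config("missing.cfg")
```

The `finally` block fires after the `return` statement but before control actually leaves the function, which is why it is the right place for cleanup.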

---

Basic Concepts

41. B) Understand data patterns


Explanation: Exploratory Data Analysis (EDA) involves summarizing the data's main
characteristics, often visually.

42. C) Correct positive predictions


Explanation: True positives are cases where the model correctly predicts the positive class.

43. A) Labels vs no labels


Explanation: Supervised learning uses labeled data, while unsupervised learning deals with
unlabeled data to find patterns.
44. C) RMSE
Explanation: Root Mean Squared Error measures the average magnitude of a regression model's
prediction errors, expressed in the same units as the target variable.
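The computation in full, on made-up predictions:

```python
import math

y_true = [3.0, 5.0, 2.5, 7.0]   # hypothetical targets
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical model outputs

# Square the errors, average them, take the root — so the result is in
# the target's units and large errors are penalized more heavily.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
rmse = math.sqrt(mse)
```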

45. B) Creating informative input variables


Explanation: Feature engineering transforms raw data into features that better represent the
underlying problem.

---

Tools & Libraries

46. B) Data manipulation


Explanation: Pandas is used for data cleaning, transformation, and analysis in tabular form.

47. A) df.head()
Explanation: This command shows the first five rows of a DataFrame by default; an optional
argument changes the count.

48. B) Correlation matrix visualization


Explanation: sns.heatmap() displays values in a matrix as color-encoded grid cells, often used
with correlation matrices.

49. B) Figure and axes objects


Explanation: plt.subplots() returns a figure and a set of subplots (axes), useful for custom
layouts.

50. B) TensorFlow
Explanation: TensorFlow is a popular deep learning library for building and training neural
networks.

---
Simple Statistics

51. B) -1 to 1
Explanation: The correlation coefficient ranges from -1 (perfect negative) to 1 (perfect positive),
with 0 indicating no correlation.

52. B) 10
Explanation: Mean = (5 + 10 + 15) / 3 = 10

53. B) 4
Explanation: Sorted list = [1, 3, 5, 7]; Median = (3 + 5)/2 = 4
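Both of the arithmetic answers above can be checked with Python's standard `statistics` module:

```python
from statistics import mean, median

result_mean = mean([5, 10, 15])      # (5 + 10 + 15) / 3
# With an even count, the median averages the two middle values.
result_median = median([7, 1, 5, 3])  # sorted: [1, 3, 5, 7] -> (3 + 5) / 2
```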

54. C) Mean
Explanation: The mean is heavily influenced by extreme values, unlike median or mode.

55. B) Data spread


Explanation: Standard deviation quantifies the amount of variation or dispersion in a dataset.

---
