Predictive Numericals 20 Questions
(a) Calculate the mean, median, and mode (if it exists) for the following data.
Data: 78, 85, 92, 67, 70, 88, 92, 81, 85, 79
(b) Determine the missing value x for the following data, given that the mean is 27.
(c) Calculate all three measures of central tendency, and identify which measure(s) is/are
affected by the outlier (250).
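As a way to check part (a), the three measures can be computed with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative, and `multimode` is used because it returns every value tied for the highest frequency, which covers the case where the mode is not unique.

```python
from statistics import mean, median, multimode

# Data from part (a)
data = [78, 85, 92, 67, 70, 88, 92, 81, 85, 79]

data_mean = mean(data)        # sum of the values divided by their count
data_median = median(data)    # average of the two middle values (n is even)
data_modes = multimode(data)  # all values tied for the highest count

print(data_mean, data_median, data_modes)  # 81.7 83.0 [85, 92]
```

Two values appear twice each, so the data set is bimodal rather than having a single mode.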
• Estimate the mean number of books read using the midpoint method.
• Determine the modal class.
• Find the median class.
3. A company conducted a survey of annual incomes (in thousands) of employees in two
departments:
Department A: 35, 40, 38, 50, 100, 120
Department B: 35, 38, 39, 40, 41, 42
Data: 32, 35, 40, 38, 42, 45, 39, 44, 46, 41, 48, 50, 36, 43, 49
• Calculate the first quartile Q1, the second quartile Q2 (median), and the third quartile Q3.
• Find the five-number summary and construct a box plot based on it.
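The quartiles and five-number summary for the 15 values above can be checked with a short sketch. Quartile conventions differ between textbooks; the one assumed here places the p-th quantile at position (n + 1)p in the sorted data, which corresponds to the "exclusive" method of `statistics.quantiles`. Other conventions may give slightly different Q1 and Q3.

```python
from statistics import quantiles

# Data from the question above (15 values)
data = [32, 35, 40, 38, 42, 45, 39, 44, 46, 41, 48, 50, 36, 43, 49]

# Assumed convention: quantile position (n + 1) * p in the sorted data
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

# Minimum, Q1, median, Q3, maximum
five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)  # (32, 38.0, 42.0, 46.0, 50)
```

These five values are the ones a box plot would use: the box spans Q1 to Q3 with a line at the median, and the whiskers extend to the minimum and maximum.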
Class A: 55, 60, 65, 70, 75, 80, 85, 90
Class B: 45, 50, 55, 60, 75, 80, 85, 95
6. The following are the yearly expenses (in thousands) of a group of individuals:
Data: 150, 160, 165, 170, 175, 180, 185, 200, 250, 500
• Estimate the first quartile Q1, the median/second quartile Q2, and the third quartile Q3.
8. Two employees track their monthly sales over the past year:
Employee A: 12, 15, 14, 16, 18, 17, 19, 20, 22, 24, 21, 25
Employee B: 10, 30, 20, 40, 50, 25, 35, 45, 55, 30, 25, 60
9. Given the following dataset of the sizes (in square feet) and corresponding prices (in thousands)
of 8 houses:
10. The following table shows the distribution of marks obtained by students in a test:
Data: 45, 50, 55, 60, 62, 63, 65, 68, 70, 72, 74, 75, 78, 80, 82, 83, 85, 88, 90, 95
• Construct a histogram for the data using class intervals of width 10.
• Describe the shape of the distribution (e.g., symmetric, skewed).
12. Given the following data set of incomes (in thousands):
Data: 22, 25, 28, 30, 35, 40, 42, 45, 48, 50
Data: 150, 152, 154, 156, 158, 160, 162, 164, 166, 168
14. The following table provides data on hours studied and exam scores for 8 students:
15. The following table provides data on the size of houses (in square feet), the number of bedrooms,
and the corresponding house prices (in thousands) for 6 houses:
• Calculate the correlation coefficient between the size of houses and their prices. Interpret
the result. Does it indicate positive, negative, or no correlation?
• Calculate the covariance between the size of houses and their prices.
• Calculate the covariance matrix for the variables: Size, Number of Bedrooms, and Price.
Interpret the signs of the covariances.
• Apply standardization to the house prices.
• Apply normalization (Min-Max scaling) to the house prices.
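The quantities asked for in question 15 can be sketched as small helper functions. The house table itself is not reproduced here, so no values are filled in; the function names are illustrative. Two assumptions are made: covariance uses the sample form (dividing by n − 1), and standardization uses the population standard deviation. A course may use either convention, so adjust as needed.

```python
from math import sqrt
from statistics import mean, pstdev

def covariance(xs, ys):
    """Sample covariance (assumption: divide by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation coefficient: covariance scaled by both spreads."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def covariance_matrix(columns):
    """Covariance of every pair of variables; the diagonal holds variances."""
    return [[covariance(a, b) for b in columns] for a in columns]

def standardize(values):
    """z-scores (assumption: population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```

For the covariance matrix, `columns` would be the three lists Size, Number of Bedrooms, and Price, in that order. A positive entry means the two variables tend to increase together, and a negative entry means one tends to decrease as the other increases.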
16. Consider the following two data points representing the ratings of two users on seven different
movies:
User A: (5, 4, 3, 2, 1, 3, 5)
User B: (1, 2, 3, 4, 5, 2, 4)
• Compute the Manhattan distance, Euclidean distance, and Minkowski distance (with h = 3)
between the two users’ ratings across all seven movies.
• Discuss how these distance metrics reflect the similarity or dissimilarity between User A
and User B.
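All three distances in question 16 are instances of the Minkowski distance with order h, where h = 1 gives Manhattan and h = 2 gives Euclidean. A minimal sketch, with an illustrative function name:

```python
def minkowski_distance(u, v, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(u, v)) ** (1 / h)

# Ratings from question 16
user_a = (5, 4, 3, 2, 1, 3, 5)
user_b = (1, 2, 3, 4, 5, 2, 4)

manhattan = minkowski_distance(user_a, user_b, 1)    # sum of |differences| = 14
euclidean = minkowski_distance(user_a, user_b, 2)    # sqrt(42), about 6.48
minkowski_3 = minkowski_distance(user_a, user_b, 3)  # cube root of 146, about 5.27

print(manhattan, round(euclidean, 4), round(minkowski_3, 4))
```

As h increases, the largest per-movie differences dominate the total, which is why the distance shrinks from 14 toward the maximum single difference of 4.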
17. Consider the following two data points representing house prices (in thousands), house sizes (in
square feet), number of bedrooms, and number of bathrooms:
• Calculate the Euclidean distance between the two houses without any feature scaling.
• Discuss the importance of feature scaling when calculating distance measures in machine
learning and re-calculate the Euclidean distance after scaling the features using min-max
normalization.
18. You are given the following data points representing two documents in a text classification task,
with each value representing the frequency of a certain term across six terms:
• Calculate the cosine similarity between the two documents across all six terms.
• Compute the Euclidean distance and discuss how it differs from cosine similarity in
interpreting document similarity.
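The two measures in question 18 can be sketched as follows. The term-frequency vectors are not reproduced here, so the functions are left without sample input; the names are illustrative.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine similarity depends only on direction, so a document and a copy with every term frequency doubled have similarity 1 even though their Euclidean distance is nonzero. This is the usual reason cosine similarity is preferred when documents differ in length.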
19. Consider the following binary vectors representing the presence (1) or absence (0) of certain
features for three users in a machine learning dataset:
20. The following table provides data on the observed frequency of customers’ preference for three
types of products (A, B, and C) based on their income levels (Low, Medium, and High):
• Formulate the null hypothesis H0 and alternative hypothesis H1 for testing the
independence between Income Level and Product Preference.
• Calculate the expected frequencies for each cell under the assumption of independence.
• Perform the Chi-square (χ²) test by calculating the χ² statistic:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where O is the observed frequency and E is the expected frequency.
• Given a significance level of α = 0.05 and appropriate degrees of freedom, compare the
calculated χ² statistic with the critical value from the χ² distribution table.
• Interpret the result and conclude whether there is a significant association between Income
Level and Product Preference.
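The steps of question 20 can be sketched in plain Python. The observed table is not reproduced here, so the functions take it as a list of rows (one row per income level, one column per product); the function names and the constant name are illustrative. The critical value 9.488 is the standard table value for α = 0.05 with 4 degrees of freedom, which is what a 3 × 3 table gives.

```python
def expected_frequencies(observed):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def degrees_of_freedom(observed):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(observed) - 1) * (len(observed[0]) - 1)

# Table value for alpha = 0.05 and 4 degrees of freedom (3 x 3 table)
CRITICAL_VALUE_DF4_ALPHA_005 = 9.488
```

If the computed statistic exceeds the critical value, H0 (independence) is rejected at the 5% level; otherwise the data do not provide sufficient evidence of an association.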