Ds Notes

Uploaded by

kishore.s2206vlr

Data science grade 10 notes

Chapter 1

(2) Which is a more accurate measure of central tendency
when there are outliers in the data set?
(a) Mean
(b) Median
Ans: (b) Median
(3) Mean absolute deviation is an identifier of the variability
of the data set. Is this a correct statement?

(a) Yes
(b) No
Ans: (a) Yes
(4) The mean absolute deviation is divided by coefficient of
mean absolute deviation to calculate
(a) Variance
(b) Median
(c) Arithmetic Mean
(d) Coefficient of Variation
Ans: (c) Arithmetic Mean
(5) In a manufacturing company, the number of employees in
unit A is 40 and the mean is Rs. 6,400, and the number of
employees in unit B is 30 with a mean of Rs. 5,500. The
combined arithmetic mean is –
(a) 9500
(b) 8000
(c) 7014.29
(d) 6014.29
Ans: (d) 6014.29
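The combined mean in question (5) is a weighted average of the two unit means. A minimal Python sketch of that arithmetic, using only the figures given in the question (the variable names are illustrative):

```python
# Group sizes and group means taken from the question above.
n_a, mean_a = 40, 6400
n_b, mean_b = 30, 5500

# Weight each group mean by its group size, then divide by the total size.
combined_mean = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
print(round(combined_mean, 2))  # 6014.29
```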
(6) The mean deviation about the mean for the following
data: 5, 6, 7, 8, 9, 13, 12, 15 is
(a) 1.5
(b) 3.2
(c) 2.89
(d) 5
Ans: (c) 2.89 (nearest option; the exact value is 23.75/8 ≈ 2.97)
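The mean deviation in question (6) can be checked directly. The sketch below computes it from the listed data; it gives about 2.97, so option (c) is the nearest of the choices offered rather than an exact match.

```python
data = [5, 6, 7, 8, 9, 13, 12, 15]

# Mean of the data set.
mean = sum(data) / len(data)

# Average of the absolute distances from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print(mean, round(mean_deviation, 2))  # 9.375 2.97
```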
(7) The arithmetic mean of the numerical values of the
deviations of items from some average value is called the
(a) Standard Deviation
(b) Range
(c) Quartile Deviation
(d) Mean Deviation
Ans: (d) Mean Deviation
Standard Questions
(1) Explain the different ways of subsetting data.
Ans: There are three basic ways of subsetting data:
(a) Row based subsetting: In row based subsetting we keep only
some rows of the table, counted from the top. For example, if a
table has 8 rows and 6 columns, we might keep only the first 4 rows.
(b) Column based subsetting: An original data set often contains a
large number of columns, and not all of them are necessary for the
analysis. In that case we select only the required columns from the
original data set. This method is called column based subsetting.
(c) Data based subsetting: In this type of subsetting, the data is
subsetted on the basis of specific values, so only the rows that
match a chosen condition are selected. The selected rows are
highlighted in colour.
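The three kinds of subsetting can be sketched in plain Python. The table below is hypothetical (its names, columns and values are made up for illustration), and a real project would more likely use a spreadsheet or a data-frame library.

```python
# Hypothetical table: each row is a dict, each key is a column.
table = [
    {"name": "Asha", "grade": 10, "marks": 82},
    {"name": "Ravi", "grade": 10, "marks": 67},
    {"name": "Meena", "grade": 9, "marks": 91},
    {"name": "John", "grade": 10, "marks": 74},
]

# (a) Row based: keep the first two rows from the top.
first_rows = table[:2]

# (b) Column based: keep only the columns needed for the analysis.
name_and_marks = [{"name": r["name"], "marks": r["marks"]} for r in table]

# (c) Data based: keep only rows whose values satisfy a condition.
high_scorers = [r for r in table if r["marks"] >= 80]

print(first_rows)
print(name_and_marks)
print(high_scorers)
```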
(2) When should we use median over mean?
Ans: The median is the more accurate measure of central tendency
when the data set contains irregular values. Such values are called
outliers.
For example: Rahul's father gets his blood pressure checked every
week. In one week, because of a defect in the machine, the blood
pressure was recorded as much higher than usual.
From the above illustration we can observe that the mean of the
readings is pulled away from his regular blood pressure values by
the faulty reading, whereas the median still correctly shows the
centre point of the data set. So, in a data set where outliers are
present, the median is a more effective measure of central tendency
than the mean.
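The effect of an outlier on the mean and the median can be seen with a few numbers. The readings below are hypothetical (the values from the original illustration are not available); the last reading stands in for the faulty measurement.

```python
from statistics import mean, median

# Hypothetical weekly readings; the last one is the faulty reading.
readings = [120, 122, 118, 121, 119, 180]

# Without the outlier, mean and median agree.
print(mean(readings[:-1]), median(readings[:-1]))  # 120 120

# With the outlier, the mean jumps but the median barely moves.
print(mean(readings), median(readings))  # 130 120.5
```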
(3) What is Mean Absolute Deviation?
Ans: Mean Absolute Deviation (MAD) is the average distance between
each value of the data set and the mean.
Let us consider the following data set:
12 16 10 18 11 19
Step 1: Calculate the mean.
Mean = (12 + 16 + 10 + 18 + 11 + 19)/6 = 86/6 = 14.33, rounded to 14
Step 2: Calculate the distance of each value from the mean. We take
the absolute value, so if the difference from the mean is -2 we
ignore the negative sign and use 2.
The following table shows the distance of each data point from the
mean.
Value    Distance from the mean (14)
12       2
16       2
10       4
18       4
11       3
19       5
Total    20
Step 3: Calculate the mean of the distances.
Mean of distances = (2 + 2 + 4 + 4 + 3 + 5)/6 = 20/6 = 3.33
So the mean absolute deviation is 3.33 and the mean is 14.
The mean absolute deviation gives us an idea of the variation in
the data set.
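The three steps can be reproduced in a few lines of Python. This sketch uses the unrounded mean (14.33...) instead of 14; for this data set the distances still add up to 20, so the result is the same.

```python
values = [12, 16, 10, 18, 11, 19]

# Step 1: the mean (14.33..., which the notes round to 14).
mean = sum(values) / len(values)

# Step 2: the absolute distance of each value from the mean.
distances = [abs(v - mean) for v in values]

# Step 3: the mean of those distances is the MAD.
mad = sum(distances) / len(distances)

print(round(mean, 2), round(mad, 2))  # 14.33 3.33
```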
(4) What is a two way relative frequency table? How is it
different from two way frequency table?
Ans: A two-way relative frequency table is similar to a two-way
frequency table. The difference is that it shows percentages instead
of counts. A two-way frequency table shows the number of data points
that fit in each category. Depending on the problem, we can use
column relative frequencies or row relative frequencies.
Consider the following two-way table, in which preferences for
indoor and outdoor games are recorded:
Two-way frequency table
Preferences      Girls   Boys
Indoor games     70      20
Outdoor games    30      80
Total            100     100
To convert it into a two-way relative frequency table, we change
each cell into a percentage of its column total.
Two-way relative frequency table
Preferences      Girls   Boys
Indoor games     70%     20%
Outdoor games    30%     80%
Total            100%    100%
A two-way relative frequency table is most useful when the sample
sizes of the groups differ, because preferences can then be compared
using percentages.
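The conversion from counts to column relative frequencies can be sketched as follows. The nested dictionary is just one illustrative way to hold the table; the counts are the ones given above.

```python
# Counts from the two-way frequency table, one inner dict per column.
counts = {
    "Girls": {"Indoor games": 70, "Outdoor games": 30},
    "Boys": {"Indoor games": 20, "Outdoor games": 80},
}

# Divide every cell by its column total to get a percentage.
relative = {}
for group, prefs in counts.items():
    total = sum(prefs.values())
    relative[group] = {pref: 100 * n / total for pref, n in prefs.items()}

for group, prefs in relative.items():
    print(group, prefs)
```

Because both columns here happen to total 100, the percentages equal the counts; with unequal group sizes the two tables would differ.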
(5) What are two way frequency tables beneficial for?
Ans: A two-way relative frequency table is most useful when the
sample sizes of the groups differ, because preferences can then be
compared using percentages.
(6) What is Standard Deviation?
Ans: Standard deviation measures how spread out the numbers are. In
other words, it shows how far the data is spread around the mean
(average).
For example: It tells us whether the data points lie close to the
average or far above and below it.
(7) How to calculate Standard Deviation?
Ans: We can use the following steps to find the standard deviation.
For example, take the values 1, 2, 3, 5 and 8.
(a) Calculate the mean by adding up all the pieces of data
and dividing by the number of pieces of data.
1 + 2 + 3 + 5 + 8 = 19
19/5 = 3.8 (mean)
(b) Subtract the mean from every single value.
1 - 3.8 = -2.8
2 - 3.8 = -1.8
3 - 3.8 = -0.8
5 - 3.8 = 1.2
8 - 3.8 = 4.2
(c) Square each of the differences.
-2.8 * -2.8 = 7.84
-1.8 * -1.8 = 3.24
-0.8 * -0.8 = 0.64
1.2 * 1.2 = 1.44
4.2 * 4.2 = 17.64
(d) To find the variance, calculate the average of the
squared differences found in step (c).
7.84 + 3.24 + 0.64 + 1.44 + 17.64 = 30.8
30.8/5 = 6.16 (variance)
(e) Find the standard deviation by taking the square root of
the variance.
Square root of 6.16 = 2.48
Hence the standard deviation of the values 1, 2, 3, 5 and 8 is 2.48.
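The same five steps can be written as a short Python sketch. It divides by the number of values, as the notes do, which is the population standard deviation; `statistics.pstdev` from the standard library is used only as a cross-check.

```python
from math import sqrt
from statistics import pstdev

values = [1, 2, 3, 5, 8]

# (a) Mean.
mean = sum(values) / len(values)

# (b) and (c) Subtract the mean from each value and square the result.
squared_diffs = [(v - mean) ** 2 for v in values]

# (d) Variance: the average of the squared differences (divide by n).
variance = sum(squared_diffs) / len(values)

# (e) Standard deviation: the square root of the variance.
std_dev = sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))  # 3.8 6.16 2.48
print(round(pstdev(values), 2))  # 2.48
```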
(8) Name five real-life applications of Standard Deviation
Ans: The five real-life applications of Standard Deviation are:
(a) Grading Tests: A teacher can use it to see whether the students
are performing at about the same level (low standard deviation) or
whether their marks vary widely (high standard deviation).
(b) Calculating the results of a survey: If a person has received
responses from a survey and wants to measure their reliability, the
standard deviation helps to predict how a larger group of people may
answer.
(c) Weather Forecasting: If a person analyses the low temperatures
forecast for three different cities, a low standard deviation
indicates a reliable weather forecast.
(d) Marketing: Marketers calculate the standard deviation of the
revenue earned from each advertisement, so that they know how much
variation in revenue to expect from a given advertisement.
(e) Real Estate: Real estate agents use the standard deviation of
house prices per square foot in a particular area, so that they can
inform their clients about how much house prices are likely to
differ from their expectations.
High Order Thinking Skills (HOTS)
(1) Draw a graph to represent Standard Deviation of 4, 6
Ans:


Here is a graph representing a normal distribution with a standard deviation of 4.6. The
shaded area shows the range of ±1 standard deviation from the mean, which covers
approximately 68% of the data in a normal distribution
(2) Calculate the mean of the data set – [56, 89, 76, 58, 58,
65]
Ans: Sum = 56 + 89 + 76 + 58 + 58 + 65 = 402
Mean = 402 / 6
Mean = 67
(3) Calculate the median of this data set – [56, 89, 76, 58,
58, 65]
Ans: 56, 89, 76, 58, 58, 65
Here the number of observations (n) = 6, which is even.
First we arrange all the observations in ascending order:
56, 58, 58, 65, 76, 89
The two middle scores are the 3rd and 4th values, 58 and 65, so we
add them together and divide the total by 2.
Median = (58 + 65)/2 = 123/2
Median = 61.5
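Both HOTS calculations can be checked with Python's standard `statistics` module. For an even number of observations, `median` averages the two middle values of the sorted data, which here are 58 and 65.

```python
from statistics import mean, median

scores = [56, 89, 76, 58, 58, 65]

print(mean(scores))    # 67
print(sorted(scores))  # [56, 58, 58, 65, 76, 89]
print(median(scores))  # 61.5
```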

Chapter 2

(1) If a card is chosen from a standard deck of cards, what is
the probability of getting a five or a seven?
(a) 4/52
(b) 1/26
(c) 8/52
(d) 1/169
Ans: (c) 8/52
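The answer to question (1) follows from counting: a standard deck has four fives and four sevens, and a single card cannot be both. A small sketch with exact fractions:

```python
from fractions import Fraction

deck_size = 52
fives = 4   # one five in each suit
sevens = 4  # one seven in each suit

# The two events cannot happen together, so their counts simply add.
p_five_or_seven = Fraction(fives + sevens, deck_size)

print(p_five_or_seven)  # 2/13, the same value as 8/52
```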
(2) Which of the following is the condition for Uniform
Distributions?
(a) Each value in the set of possible values has the exact same
possibility of happening.
(b) Have a constant probability of success
(c) Has only two possible outcomes
(d) Must have at least 3 trials
Ans: (a) Each value in the set of possible values has the exact same
possibility of happening.
(3) The collection of one or more outcomes from an
experiment is called
(a) Probability
(b) Distribution
(c) Event
(d) Random Experiment
Ans: (c) Event
(4) Which of the following are types of distributions?
(a) Continuous
(b) Discrete
(c) Both of them
Ans: (c) Both of them
(5) Which of the following is not an example of discrete
probability distribution?
(a) The sale or purchase price of a house
(b) The number of bedrooms in a house
(c) The number of bathrooms in a house
(d) Whether or not a home has a swimming pool in it
Ans: (a) The sale or purchase price of a house
(6) A discrete probability distribution may be represented
by
(a) A table
(b) A graph
(c) A Mathematical Equation
(d) All of these
Ans: (d) All of these
(7) What is the probability that a ball is drawn at random
from a jar?
(a) 0.1
(b) 1
(c) 0.5
(d) 0
(e) Cannot be determined from given information
Ans: (e) Cannot be determined from given information
(8) Statistical investigative process has which of the
following components:
(a) Formulate /Statistical Investigate Questions
(b) Collect/ Consider the Data
(c) Interpret Data
(d) All of the above
Ans: (d) All of the above
Standard Questions
(1) Explain what distributions are in data science with the
help of two examples.
Ans:

In data science, a distribution shows how values in a dataset are spread out or arranged.
It helps us understand how often each value appears or how data is spread over a range.

Think of it like sorting your exam scores to see how many students scored 90, 80, 70, and so
on.

Example 1: Exam Scores of Students

Imagine a class of 30 students took a math test, and their scores (out of 100) are:

[80, 85, 90, 70, 75, 85, 80, 70, 90, 85, ...]

If we count how many students scored in different ranges:

• 70–79 → 5 students
• 80–89 → 15 students
• 90–100 → 10 students

This summary is called a distribution of scores.


It shows how the data (scores) are distributed across different ranges.

This helps us understand:

• Most students scored between 80 and 89.
• Fewer students scored between 70 and 79.

🎲 Example 2: Rolling a Fair 6-sided Die

When you roll a fair die many times, the results are numbers from 1 to 6.
If you roll the die 60 times, you may get something like:

Die Face    Count of Occurrences
1           10
2           12
3           8
4           11
5           9
6           10

This is the distribution of outcomes.


If the die is fair, we expect each number to appear roughly the same number of times.

👉 The distribution helps us check if the die is fair or biased.

✅ Summary:

• Distribution tells us how data values are spread or arranged.
• It helps analyze patterns and trends in data.

These simple examples show how important understanding distribution is in data science!
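A distribution like the die table can be built by counting how often each value occurs. The sketch below rebuilds the 60 rolls from the counts in Example 2 (the order of the rolls is not given, so it is an assumption) and tallies them with `collections.Counter`.

```python
from collections import Counter

# 60 rolls reconstructed from the counts in the table above.
rolls = [1] * 10 + [2] * 12 + [3] * 8 + [4] * 11 + [5] * 9 + [6] * 10

# Count how many times each face occurs.
distribution = Counter(rolls)

for face in sorted(distribution):
    share = distribution[face] / len(rolls)
    print(face, distribution[face], f"{share:.0%}")
```

A fair die would give each face a share of about 17% (1 in 6), so shares far from that over many rolls would suggest a biased die.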

(2) Explain what is a Statistical Problem-Solving process.
Ans: The statistical problem-solving process is the method of
collecting and analysing data in order to answer statistical
investigative questions.
This method includes four components which are:
(a) Formulate Statistical Investigative Questions: This step
involves anticipating variability before the actual investigation
begins. Framing statistical questions helps us to identify the
variability we expect, which leads to productive investigations.
Below are some examples of statistical questions that anticipate
variability and guide the subsequent data collection and analysis.
How fast can my plant grow?
Do plants that are exposed to more sunlight grow faster?
Does sunlight affect the growth of a plant? How?
Some questions, such as "How tall is the plant?", are asked in order
to collect data. Many such data collection questions can be asked in
order to answer a statistical investigative question such as "Do
plants that are exposed to sunlight grow faster?" Different heights
are observed for different exposures to sunlight, which means that
the growth of the plants may vary with the amount of sunlight they
receive. While statistical investigative questions begin worthwhile
studies, questioning is prominent throughout all four components of
the statistical problem-solving process. This pattern of questioning
can be explained in detail with the help of examples at different
levels. Statistical investigative questions have some important
features that need to be understood before anticipating variability:
the variables of interest must be clear; the group or population
that the question is focused on must be clear; the intent must be
clear, that is, whether the question asks for a description of the
data, compares a variable across two or more groups, or looks at an
association between two variables; the question should be about the
whole group and not about an individual; the question should be
answerable through data collection or with the data in hand; and the
question should be purposeful.
(b) Collect/consider the data: This step is described as
acknowledging variability while designing for differences.
Data collection designs must take the variability in the data into
account. Statistical process control and random sampling are two
methods that help to detect variability in the data and reduce it.
Design of experiments is a method used to test induced variability.
Experimental designs are chosen in order to understand the
differences between groups that receive different treatments. Random
assignment to the groups also helps to reduce differences between
the groups caused by factors that are not manipulated or controlled
in the experiment. In all of these designs, the statistical focus is
on looking for variability and explaining it. The data collected,
whether first-hand (new data) or second-hand (collected from other
sources), needs interrogation. For example, we need to ask what type
each variable is, what the possible outcomes of the variables are,
and how the data was collected. Such questions are needed to judge
whether the data can answer the statistical investigative questions.
The data collection design also affects the scope of
generalisability and the possible limitations in analysis and
interpretation.
(c) Analyse the data: This step is described as accounting for
variability while using distributions. When analysing data, we have
to understand its variability. Reasoning about distributions is the
key to accounting for and describing variability at all levels.
Graphical displays and numerical summaries are used to describe,
explore and compare the variability of distributions. For example,
box plots or comparative dot plots can be used to show the batting
averages of two teams, the Indian cricket team and the Australian
cricket team, for a specific year. These graphs help us to compare
the distributions of the two teams' batting averages. We can
describe the variability by noting how far the two distributions are
separated or how much they overlap.
(d) Interpret the data: This step is described as allowing for
variability while considering the data. Most statistical
interpretations are made in the presence of variability and must
take it into account. For example, when interpreting the results of
a randomised comparative medical experiment, two sources of
variability must be remembered: the randomisation to treatment
groups, and the variability from individual to individual. We
consider these sources of variability when we generalise the results
and when we look back at how the data was collected and studied.
(3) Explain how distributions are broadly categorized,
support your answer with appropriate example for each
category.
Ans:

Distributions in Data Science Are Broadly Categorized Into Two Types:

1. 🎯 Discrete Distribution
2. 📈 Continuous Distribution

1️⃣ Discrete Distribution

In a discrete distribution, the data can take only specific values, usually whole numbers.
These values are countable.
Example of Discrete Distribution:

• Rolling a Fair 6-sided Die

Possible outcomes: {1, 2, 3, 4, 5, 6}
Each number appears with equal chance when the die is fair.

Die Face    Frequency (from 60 rolls)
1           9
2           12
3           10
4           11
5           8
6           10

👉 This shows a discrete distribution, because only whole numbers (1 to 6) are possible
outcomes.

2️⃣ Continuous Distribution

🔹 In a continuous distribution, the data can take any value within a range, including
decimals.
🔹 These values are uncountable and usually measured.

✅ Example of Continuous Distribution:

• Measuring Heights of Students in a Class

Heights could be: 150.2 cm, 162.5 cm, 158.7 cm, etc.
The values are not just integers but any number in a range.

👉 We can represent the distribution of heights using a graph like a histogram or a smooth
curve (e.g., a bell-shaped curve in the case of normal distribution).

✅ Summary Table:

Type of Distribution   Data Type                        Example
Discrete               Countable numbers (integers)     Number of students passing in a test
Continuous             Any value in a range (decimals)  Heights of students in cm


(4) Explain in detail how do we formulate statistical
investigative questions
Ans: Formulating statistical investigative questions involves
anticipating variability before the actual investigation begins.
Framing statistical questions helps us to identify the variability
we expect, which leads to productive investigations. Below are some
examples of statistical questions that anticipate variability and
guide the subsequent data collection and analysis.
How fast can my plant grow?
Do plants that are exposed to more sunlight grow faster?
Does sunlight affect the growth of a plant? How?
A question such as "How tall is the plant?" is answered with a
single height, so it is not a statistical question. Such questions
are asked in order to collect data, and many data collection
questions of this type can be asked in order to answer a statistical
investigative question such as "Do plants that are exposed to
sunlight grow faster?"
Different heights are observed for different exposures to sunlight,
which means that the growth of the plants may vary with the amount
of sunlight they receive.
While statistical investigative questions begin worthwhile studies,
questioning is prominent throughout all four components of the
statistical problem-solving process. This pattern of questioning can
be explained in detail with the help of examples at different
levels. Statistical investigative questions have some important
features that need to be understood before anticipating variability:
the variables of interest must be clear; the group or population
that the question is focused on must be clear; the intent must be
clear, that is, whether the question asks for a description of the
data, compares a variable across two or more groups, or looks at an
association between two variables; the question should be about the
whole group and not about an individual; the question should be
answerable through data collection or with the data in hand; and the
question should be purposeful.
(5) Name five instances where you have observed a uniform
distribution.
Ans:

Here are five real-life examples where uniform distribution is commonly observed:

1. 🎲 Rolling a Fair Die


A standard 6-sided die has outcomes {1, 2, 3, 4, 5, 6}.
Every number has an equal chance of appearing, so the distribution of outcomes is
uniform.
2. 🎰 Drawing a Card from a Well-Shuffled Deck
When drawing a single card from a standard deck of 52 playing cards, each card has
an equal probability of 1/52.
So, the outcome follows a uniform distribution.
3. 🎯 Random Number Generation in Computer Programming
Many programming languages provide functions to generate random numbers in a
range, e.g., randint(1, 100).
If the generator is good, all numbers between 1 and 100 have an equal chance of
being selected — this is a discrete uniform distribution.
4. 🚦 Random Selection of a Lottery Ticket
In a fair lottery where each ticket is equally likely to be drawn, the selection of a
ticket follows a uniform distribution.
5. Waiting Time for a Bus Arriving Randomly
Suppose a bus arrives randomly between 8:00 AM and 9:00 AM, and you arrive at the
stop at a random time.
The arrival time of the bus follows a continuous uniform distribution over the
interval [8:00, 9:00].

These examples reflect either discrete uniform distributions (e.g., die roll, card drawing) or
continuous uniform distributions (e.g., random time interval).
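The discrete uniform case can be illustrated with Python's `random.randint`, which returns each integer in the given range with equal probability. This is a sketch only: the seed and the number of rolls are arbitrary choices made so that the run is repeatable.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed, only so the run is repeatable

# Simulate many rolls of a fair six-sided die.
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

# Under a uniform distribution each face should appear about 10,000 times.
for face in range(1, 7):
    print(face, counts[face])
```

The counts are not exactly equal, because each roll is random, but they stay close to one sixth of the total.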

HOTS

Consider that there are 60 students in your class, out of which 20 get affected with cold and flu
every semester. Note down five statistical investigative questions for determining students'
immunity to catching cold and flu.

Five Statistical Investigative Questions:

1. What is the average number of days students are affected by cold and flu during
a semester?
2. Is there a relationship between the number of colds/flu and students’ age or
gender?
3. 🍎 Does the diet (e.g., frequency of eating fruits and vegetables) affect the
likelihood of getting a cold or flu?
4. Does regular physical exercise help reduce the number of students affected by
cold and flu?
5. 🛌 Do students who sleep more than 7 hours per night get affected by cold and flu
less frequently than those who sleep less?

Why These Questions Matter:

• They help us investigate patterns or causes that affect immunity.
• They help to collect useful data for improving health and immunity.
• They are measurable and can be answered by collecting data and applying statistical
analysis.
Consider you are taking part in an animal welfare campaign. One of the most recent concerns
raised by people is dogs not being able to tolerate a sudden rise in temperature due to global
warming. Note down five statistical investigative questions to understand how dogs react to changing
weather.

Five Statistical Investigative Questions:

1. How does the number of dogs showing signs of heat stress (e.g., excessive
panting, lethargy) vary with daily temperature changes?
2. 🐕 Is there a difference in heat tolerance between different dog breeds (e.g., small
vs. large breeds, short-haired vs. long-haired)?
3. 💧 How does access to water or shade affect the frequency of heat-related
symptoms in dogs during hot days?
4. 🏥 What percentage of dogs require medical attention due to heat exhaustion
during high-temperature periods compared to cooler periods?
5. 📊 Is there a correlation between sudden temperature rise (e.g., +5°C in a day)
and the increase in heat-related behavioral problems (e.g., aggression, anxiety) in
dogs?

Why These Questions Are Useful:

• They help us investigate how dogs are impacted by rising temperatures in a
measurable way.
• They help to suggest solutions like improving shade, providing water, or raising public
awareness.
• They allow us to collect data that can support the campaign with solid evidence.

Chapter 3

(1) What is the Data Science term used to describe
partiality, preference, and prejudice?
(a) Bias
(b) Favouritism
(c) Influence
(d) Unfairness
Ans: (a) Bias
(2) Which of the following is NOT a type of bias?
(a) Selection Bias
(b) Linearity Bias
(c) Recall Bias
(d) Trail Bias
Ans: (d) Trail Bias
(3) Which of the following is not a correct statement about a
probability?
(a) It must have a value between 0 and 1
(b) It can be reported as a decimal or a fraction
(c) A value near 0 means that the event is not likely to occur/happen
(d) It is the collection of several experiments.
Ans: (d) It is the collection of several experiments.
(4) The central limit theorem states that the sampling
distribution of the sample mean is approximately normal if
(a) All possible samples are selected
(b) The sample size is large
(c) The standard error of the sampling distribution is small.
Ans: (b) The sample size is large
(5) The central limit theorem says that the mean of the
sampling distribution of the sample mean is
(a) Equal to the population mean divided by the square root of the
sample size
(b) Close to the population mean if the sample size is large
(c) Exactly equal to the population mean
Ans: (c) Exactly equal to the population mean

(6) Samples of size 25 are selected from a population with
mean 40 and standard deviation 7.5. The mean of the
sampling distribution of the sample mean is
(a) 7.5
(b) 8
(c) 40
Ans: (c) 40
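Question (6) uses two standard facts about the sampling distribution of the sample mean: its mean equals the population mean, and its standard deviation (the standard error) equals the population standard deviation divided by the square root of the sample size. A short sketch with the numbers from the question:

```python
from math import sqrt

population_mean = 40
population_sd = 7.5
n = 25  # sample size

# The mean of the sampling distribution equals the population mean.
mean_of_sample_means = population_mean

# The standard error is the population SD divided by sqrt(n).
standard_error = population_sd / sqrt(n)

print(mean_of_sample_means, standard_error)  # 40 1.5
```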
Standard Questions
(1) Explain what is Bias and why it occurs in data science?
Ans: When someone is fond of a particular thing, that person tends
to become slightly partial towards it, and this partiality may
affect the result. This is not the right way of dealing with data,
especially when the data is large. Partiality, preference or
prejudice towards a particular set of data is termed Bias. In data
science, bias is a deviation of the data from the expected outcome.
In other words, bias can be defined as an error in the data. Such an
error is often subtle and goes unnoticed. So why does bias occur in
the first place? Sampling and estimation are the reasons for the
occurrence of bias. Bias could be avoided if we knew everything
about the data entities and stored information on every entity.
However, data science is rarely carried out under carefully
controlled conditions. It is mostly done on found data, which is
usually collected for a purpose other than modelling. This is why
biases are so likely to occur in such data.
The next question is why bias really matters.
Predictive models consider only the data that is used for training.
They know no reality other than the data fed into the system. If the
data fed into the system contains bias, the accuracy of the model is
compromised. Biased models can also discriminate against certain
groups of people. To avoid these risks, it is necessary to avoid
such bias.
(2) Explain the Selection Bias with the help of an example
Ans: Selection bias happens when the model itself influences the
creation of the data that is used to train it. It takes place when
the sample data that is collected fails to represent the population
of cases that the model will see in the future. For example, it
occurs in systems that rank content, such as recommendation systems,
polls or personalised advertisements. User responses are collected
for the content that is displayed, but the response to the content
that is not displayed remains unknown.
(3) Explain Recall Bias with the help of an example.
Ans: Recall bias is a type of measurement bias and is common in the
data-labelling stage of a project. It occurs when the same type of
data is labelled inconsistently, which results in lower accuracy.
For example, imagine a team that labels images of damaged laptops.
The labels are meant to distinguish between damaged and undamaged
laptops. The data will be inconsistent if a team member labels one
image as damaged and a similar image as partially damaged.
(4) Explain Linearity Bias with the help of an example
Ans: Linearity bias assumes that a change in one quantity creates an
equal and proportionate change in another. Unlike selection bias,
linearity bias is a cognitive bias. It is not created by a
statistical process but by how we mistakenly perceive the world
around us.
(5) Explain Confirmation Bias with the help of an example
Ans: Confirmation bias, also termed observer bias, is the result of
seeing in the data what you want to see. It takes place when
researchers go into a project with subjective thoughts about their
study, either conscious or unconscious. We can also notice it when
labellers allow their subjective thoughts to control their labelling
habits, which leads to inaccurate data.
(6) What is the central limit theorem?
Ans: The Central Limit Theorem states that the sampling distribution
of the sample mean approaches a normal distribution as the sample
size gets larger, irrespective of the shape of the population
distribution. The mean of the sample means will be roughly equal to
the mean of the population. This holds whether the source population
is normal or skewed, provided that the sample size is large. The
following points summarise the Central Limit Theorem:
(a) The distribution of sample means gets closer to a normal
distribution as the sample size gets larger.
(b) Sample sizes equal to or greater than 30 are considered
sufficient for the Central Limit Theorem to hold.
(c) The average of the sample means will be equal to the population
mean, and the standard deviation of the sample means will be equal
to the population standard deviation divided by the square root of
the sample size.
(d) A population can be predicted more accurately with the help of a
large sample size.
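The theorem can be seen in a small simulation. This is a sketch under stated assumptions: the skewed population (exponential values with mean 10), the seed, the sample size of 30 and the number of samples are all illustrative choices, not part of the notes.

```python
import random
from statistics import fmean, pstdev

random.seed(1)  # fixed seed, only so the run is repeatable

# Hypothetical skewed population: exponential values with mean 10.
population = [random.expovariate(1 / 10) for _ in range(100_000)]

# Draw many random samples of size 30 and record each sample mean.
sample_size = 30
sample_means = [
    fmean(random.sample(population, sample_size)) for _ in range(2_000)
]

# The mean of the sample means is close to the population mean.
print(round(fmean(population), 2), round(fmean(sample_means), 2))

# Their spread is close to the population SD divided by sqrt(sample size).
print(round(pstdev(population) / sample_size ** 0.5, 2),
      round(pstdev(sample_means), 2))
```

Even though the population is strongly skewed, a histogram of `sample_means` would look roughly bell-shaped.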
(7) What is the formula for Central Limit Theorem?

Ans: Following is the formula for Central Limit Theorem
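The Central Limit Theorem is commonly written as follows. This is the standard textbook form, assuming a population with mean \mu and standard deviation \sigma, samples of size n, and sample mean \bar{x}; the symbols are the usual conventions and are not taken from the notes.

```latex
\mu_{\bar{x}} = \mu, \qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \qquad
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

Here \mu_{\bar{x}} and \sigma_{\bar{x}} are the mean and standard deviation of the sample means, and Z is approximately standard normal when n is large.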

(8) What are the real life applications of the Central Limit
Theorem?
Ans: The real life applications of the Central Limit Theorem are:
(a) Voting Polls: In elections, voting polls give an estimate of the
number of supporters of a candidate. By making use of the Central
Limit Theorem, news channels report the results with confidence
intervals.
(b) Family Income Calculation: It is helpful in calculating the
mean family income in a particular region.
(c) Economics: Many economists make use of the Central Limit
Theorem when sample data is used to draw conclusions about a
population.
(d) Manufacturing: The Central Limit Theorem is used in
manufacturing plants to estimate how many defective products the
plant produces.
(9) Why is the central limit theorem important?
Ans: The Central Limit Theorem tells us that whatever the
distribution of the population might be, the shape of the sampling
distribution of the mean will approach a normal distribution as the
sample size increases. This is helpful when a researcher does not
know the population mean: by drawing random samples from the
population, the sample means will cluster around the population
mean, which will allow the researcher to estimate it.
Hence it is observed that when the size of the sample increases,
the sampling error decreases.
(10) The coaches of various sports around the world use
probability to better their game and create gaming
strategies. Can you explain how probability is applied in this
case and how does it help players?
Ans: In every game, the coach uses probability to judge where the
team is strong and where improvement is required to win the match.
Ex: On the basis of a player's previous performance in matches, the
coach studies the average results related to the batting and bowling
skills of that player and then decides where to place him in the
line-up.

Hots

1. As per reports in October 2019, researchers found that an algorithm used on more than 200 million
people in US hospitals to predict which patients would likely need extra medical care heavily
favored white patients over black patients. Can you reason about what must have caused this bias
and categorize it into the type of bias that you have learnt?

The correct answer is: ✅ Selection Bias

Reasoning:

• The algorithm was trained using historical healthcare data of patients.
• If the data mostly included white patients receiving extra medical care in the
past, and fewer black patients received care (not necessarily because they didn't
need it, but due to systemic inequality), then the training data was not
representative of the true medical needs of all patients.

👉 This happens because of Selection Bias:


The data used for training was biased due to the way samples (patients) were selected,
leading to an overrepresentation of white patients who received extra care and
underrepresentation of black patients.

2. The recorded percentage of the population who speak English in India follows a normal
distribution. The mean and the standard deviation are 62 and 5 respectively. If a person wants to
take a sample of 50 records from the population, what would be the mean and the standard
deviation of the chosen sample?

Given:

• Population mean μ = 62
• Population standard deviation σ = 5
• Sample size n = 50

Mean of the Sample (Sampling Distribution Mean):

The mean of the sampling distribution of the sample mean is the same as the population
mean:

μ(sample) = μ = 62

Standard Deviation of the Sample (Standard Error):

The standard deviation of the sampling distribution of the sample mean (called the
Standard Error, SE) is calculated as:

SE = σ / √n

Let us compute it step by step:

• σ = 5
• n = 50
• √50 ≈ 7.071

So:

SE = 5 / 7.071 ≈ 0.707

✅ Final Answer:

• Mean of the sample distribution = 62
• Standard deviation (Standard Error) = 0.707

This means if you take samples of 50 people repeatedly, the average of their
English-speaking percentage will be around 62%, with a typical variation of about
±0.707 percentage points from sample to sample.
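The same arithmetic can be written as a short Python sketch. The function name standard_error and the variable names are chosen here only for illustration; the numbers are the ones given in the question.

```python
import math

def standard_error(population_sd, sample_size):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return population_sd / math.sqrt(sample_size)

mu = 62      # population mean given in the question
sigma = 5    # population standard deviation given in the question
n = 50       # sample size given in the question

sampling_mean = mu                 # mean of the sample means equals mu
se = standard_error(sigma, n)

print(sampling_mean)   # 62
print(round(se, 3))    # 0.707
```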
Chapter 4

Objective Type Questions


Please choose the correct option in the questions below:
(1) The pth percentile of a distribution is such that:
(a) p percent of the observations fall at it
(b) p percent of the observations fall below it
(c) p percent of the observations fall at or below it
(d) the value is p.
Ans: (c) p percent of the observations fall at or below it
(2) Which of the following function is used for quantiles of
quantitative values?
(a) Quantile
(b) Quantity
(c) Quantiles
(d) All of the mentioned
Ans: (a) Quantile
(3) The distribution of heights of Indian women aged 18 to
24 is approximately normally distributed with a mean of 65.5
inches and standard deviation of 2.5 inches. Calculate the z-
score for a woman six feet tall.
(a) 2.60
(b) 4.11
(c) 1.04
(d) 1.33
Ans: (a) 2.60
(4) What is a z-score?
(a) It is the number of standard deviation a particular score lies
above or below the mean of the set of scores.
(b) It is a standardised measure of the mean of a set of data
(c) It is the average frequency of scores in a sample
(d) It is the measure of central tendency in the data.
Ans: (a) It is the number of standard deviation a particular score lies
above or below the mean of the set of scores.
(5) The median, mode, deciles and percentiles are all
considered as measures of
(a) Mathematical averages
(b) Population averages
(c) Sample averages
(d) Averages of position
Ans: (d) Averages of position
(6) According to percentiles, the median to be measured
must lie in
(a) 80th
(b) 40th
(c) 50th
(d) 100th
Ans: (c) 50th
(7) Which measure of position divides the distribution into
10 equal parts?
(a) Quartiles
(b) Deciles
(c) Percentiles
(d) Range
Ans: (b) Deciles
(8) Which measure of position divides the distribution into 4
equal parts?
(a) Quartiles
(b) Deciles
(c) Percentiles
(d) Range
Ans: (a) Quartiles
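Questions (5) to (8) above can be illustrated with Python's standard statistics module. This is only a sketch: the data set (the whole numbers 1 to 99) is made up, and statistics.quantiles uses its default "exclusive" method, which is one of several common conventions for computing cut points.

```python
import statistics

# Hypothetical data set: the whole numbers 1 to 99.
data = list(range(1, 100))

quartiles = statistics.quantiles(data, n=4)      # 3 cut points -> 4 equal parts
deciles = statistics.quantiles(data, n=10)       # 9 cut points -> 10 equal parts
percentiles = statistics.quantiles(data, n=100)  # 99 cut points -> 100 equal parts
median = statistics.median(data)

# The median is the 2nd quartile, the 5th decile and the 50th percentile.
print(quartiles)
print(median, quartiles[1], deciles[4], percentiles[49])
```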
Standard Questions
Please answer the questions below in no less than 100
words:
(1) What is data merging?
Ans: Data merging is the process of combining two or more data
sets into a single data frame. This process is important when we
have raw data stored in multiple files or data tables which we want
to analyse together.
(2) Why is data merging required in data science?
Ans: In data science, data merging is needed when raw data is
stored in multiple files or data tables that we want to analyse as
one. Merging data taken from different sources creates a number of
problems, which need to be corrected for accurate data merging.
Different data sources may follow different naming conventions from
the main data source, and may group the data in different ways.
Sometimes different data sources are also created by different
people for different purposes. It should therefore not surprise us
that multiple data sources contain many differences.
(3) Name the different ways of merging data sets
Ans: There are three different ways of merging data sets which are
(i) One to One
(ii) One to Many
(iii) Many to Many
(4) Explain one-to-one join with the help of an example
Ans: This method of data merging is one of the simplest techniques.
In this technique, every row in one table is connected with a
single row in the other table with the help of a key column. For
example, in the database of a company, each employee has only one
Employee ID, and each Employee ID is given to only one employee. An
Employees table and a Contact Info table that share the Employee ID
column form a one-to-one relationship.
In this database, the "key" field in each table is "Employee ID".
This key field contains unique values. In the Employees table,
Employee ID is the primary key; in the Contact Info table, the
Employee ID field is a foreign key. Thus the one-to-one
relationship returns the related records when the value in the
Employee ID field in the Contact Info table is the same as the
Employee ID field in the Employees table. In this manner, the
one-to-one data merging technique works by making use of the
primary key.
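The one-to-one relationship described above can be sketched in plain Python. The rows, names and phone numbers below are made up for illustration, and one_to_one_join is a hypothetical helper, not a function from any library.

```python
# Hypothetical sample rows for the Employees and Contact Info tables.
employees = [
    {"employee_id": 101, "name": "Asha"},
    {"employee_id": 102, "name": "Ravi"},
    {"employee_id": 103, "name": "Meena"},
]
contact_info = [
    {"employee_id": 101, "phone": "555-0101"},
    {"employee_id": 102, "phone": "555-0102"},
    {"employee_id": 103, "phone": "555-0103"},
]

def one_to_one_join(left, right, key):
    """Merge rows that share the same key; each key may appear once per table."""
    right_by_key = {row[key]: row for row in right}
    if len(right_by_key) != len(right):
        raise ValueError("key is not unique in the right table")
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})  # combine the two matching rows
    return merged

merged_rows = one_to_one_join(employees, contact_info, "employee_id")
for row in merged_rows:
    print(row)
```

Each employee appears exactly once in the result, because every Employee ID matches a single row in each table.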
(5) Explain one-to-many join with the help of an example
Ans: In this technique, one record in a table can be linked to one
or many records of the other table. For example, in a school
library each student can borrow multiple books, so one row of a
Students table can match many rows of a Library table.
In this example, the primary key is in the Students table, where
Student ID contains unique values. In the Library table, Student ID
is a foreign key and allows multiple instances of the same value.
In a one-to-many relationship, related records are returned when
the value in the Student ID field in the Library table is the same
as the value in the Student ID field in the Students table.
Hence, one-to-many joins work by merging the tables using the
primary key.
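A minimal sketch of the one-to-many case, again in plain Python. The student names, book titles and the helper one_to_many_join are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows for the Students and Library tables.
students = [
    {"student_id": 1, "name": "Anu"},
    {"student_id": 2, "name": "Bala"},
]
library = [
    {"student_id": 1, "book": "Algebra"},
    {"student_id": 1, "book": "Physics"},
    {"student_id": 2, "book": "History"},
]

def one_to_many_join(parents, children, key):
    """Attach each child row to the single parent row that shares its key."""
    parent_by_key = {row[key]: row for row in parents}  # primary key is unique
    merged = []
    for child in children:
        parent = parent_by_key.get(child[key])  # foreign key lookup
        if parent is not None:
            merged.append({**parent, **child})
    return merged

issued = one_to_many_join(students, library, "student_id")

# Group the merged rows to see every book held by each student.
books_per_student = defaultdict(list)
for row in issued:
    books_per_student[row["name"]].append(row["book"])

print(dict(books_per_student))
```

The student with ID 1 appears once in the Students table but twice in the merged result, once for each book.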

(6) Explain many-to-many join with the help of an example
Ans: In this technique, more than one record of one table is linked
with many records of another table. For example, this relationship
generally occurs between students and courses. Each student can
register for many courses, whereas a course can have many students.
It is not easy to link the tables directly when the relationship is
many to many. To perform this join, you can break a many-to-many
relationship into two one-to-many relationships by making use of a
third table known as a join table. In the join table, every record
has match fields which contain the primary key values of the two
tables it joins. These match fields in the join table are termed
foreign keys. The foreign keys are populated with data as records
in the join table are created from either of the tables it joins.
In this example there is a Students table, which holds a record for
every student, and a Course table, which holds a record for each
course. The join table, called Enrollments, creates two one-to-many
relationships, one with each of the two tables.
The Student ID is a unique primary key and identifier for every
student in the Students table. Likewise, Course ID is a unique
primary key and identifier of every course in the Course table. The
Enrollments table carries the foreign keys, that is, Student ID and
Course ID.
How to set up the join table for many-to-many relationships:
(1) Using the above example, create a table called Enrollments
which will work as the join table.
(2) In the Enrollments table, make fields for Student ID and Course
ID.
(3) Now make a relationship between the two Student ID fields in
the tables. Then make a relationship between the two Course ID
fields in the tables.
This design is useful because if a student registers for four
courses, we can make sure that the student has only one record in
the Students table and four records in the Enrollments table, one
for each course the student is enrolled in.
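The join-table idea can be sketched as follows. The student names, course titles, IDs and the helper join_enrollments are all hypothetical; only the Students / Course / Enrollments structure comes from the answer above.

```python
# Hypothetical sample data: primary key -> value for each table.
students = {1: "Anu", 2: "Bala"}            # Student ID -> student name
courses = {"C1": "Maths", "C2": "Science"}  # Course ID -> course title

# Join table: each row holds two foreign keys (Student ID, Course ID).
enrollments = [
    (1, "C1"),
    (1, "C2"),
    (2, "C1"),
]

def join_enrollments(students, courses, enrollments):
    """Resolve both foreign keys of every enrollment row."""
    return [(students[s_id], courses[c_id]) for s_id, c_id in enrollments]

pairs = join_enrollments(students, courses, enrollments)

# One student can have many courses, and one course can have many students.
courses_of_anu = [course for name, course in pairs if name == "Anu"]
students_in_maths = [name for name, course in pairs if course == "Maths"]

print(courses_of_anu)
print(students_in_maths)
```

Each student and each course is stored only once; only the Enrollments rows grow as students register for more courses.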
(7) Think and explain how Z-score can be used to determine
average lifespan of car?
Ans:

A Z-score tells us how far a particular value is from the average (mean), in terms of
standard deviations.

Formula of Z-score:

Z = (X − μ) / σ

Where:

• X = the value we are interested in (e.g., lifespan of a tire)
• μ = the average (mean) lifespan of tires
• σ = the standard deviation (how much variation there is)

How Can We Use Z-Score for Car Tire Lifespan?

Scenario:

• Suppose we collect data from many car tires and find:
  o Average lifespan (mean) μ = 50,000 km
  o Standard deviation σ = 5,000 km
• Now, we want to check how good a specific tire is if its lifespan is 60,000 km.

Step 1: Calculate the Z-score

Z = (60,000 − 50,000) / 5,000 = 10,000 / 5,000 = 2

Interpretation:

• A Z-score of 2 means this tire's lifespan is 2 standard deviations above the
average.
• In general, most values fall within ±2 standard deviations in a normal distribution
(about 95% of the data).

👉 So, a Z-score of 2 shows that the tire is better than average, and only a small percentage
of tires last this long.

Example Conclusion:

If a tire has a lifespan of 40,000 km, the Z-score would be:

Z = (40,000 − 50,000) / 5,000 = −10,000 / 5,000 = −2

This means the tire is below average lifespan.
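The tire example can be reproduced with a short sketch. The function z_score is a name chosen here for illustration, and the last step assumes that tire lifespans really are normally distributed, which the scenario above only supposes.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies above or below the mean."""
    return (value - mean) / sd

mean_km = 50_000   # average tire lifespan from the scenario
sd_km = 5_000      # standard deviation from the scenario

z_long = z_score(60_000, mean_km, sd_km)    # 2.0
z_short = z_score(40_000, mean_km, sd_km)   # -2.0

# Assuming a normal distribution, the share of tires that last even longer
# than the 60,000 km tire is the area to the right of z = 2.
share_longer = 1 - NormalDist().cdf(z_long)

print(z_long, z_short)
print(round(share_longer, 4))   # about 0.0228, i.e. roughly 2.3% of tires
```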

Hots

Suppose heights of 2nd graders follow a normal distribution with a mean of 48 inches and a standard
deviation of 2 inches. What is the z-score of a 2nd grader who is 40 inches tall?

Given Data:

• Mean μ = 48 inches
• Standard deviation σ = 2 inches
• Height of the student X = 40 inches

Z-Score Formula:

Z = (X − μ) / σ

Z = (40 − 48) / 2

Z = −4

The Z-score is −4.

This means that a 2nd grader who is 40 inches tall is 4 standard deviations below
the average height for their grade, which is quite rare in a normal distribution.

HOTS

Consider a population of plants whose heights are normally distributed. Further, consider that
the mean of the distribution is 13 cm and the standard deviation is 1.3 cm. Now consider the
questions below.

1. What is the z-score of 9 cm?
2. What is the z-score of 2 cm?
3. How many centimeters correspond to a z-score of 2.25?

Given Data:

• Population mean μ = 13 cm
• Population standard deviation σ = 1.3 cm

1. Z-score of 9 cm:

Using the formula:

Z = (X − μ) / σ

Where X = 9 cm:

Z = (9 − 13) / 1.3 = −4 / 1.3 ≈ −3.08

👉 So, the Z-score for 9 cm is approximately −3.08.

2. Z-score of 2 cm:

Using the same formula, where X = 2 cm:

Z = (2 − 13) / 1.3 = −11 / 1.3 ≈ −8.46

👉 So, the Z-score for 2 cm is approximately −8.46.

3. Centimeters Corresponding to Z-score of 2.25:

Using the formula rearranged to solve for X:

X = μ + Z × σ

X = 13 + (2.25 × 1.3) = 13 + 2.925 = 15.925 cm

So, a Z-score of 2.25 corresponds to approximately 15.93 cm.
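The three answers can be checked with a small sketch that goes in both directions, from a height to a Z-score and from a Z-score back to a height. The function names z_score and value_from_z are chosen here for illustration.

```python
def z_score(value, mean, sd):
    """Z = (X - mu) / sigma"""
    return (value - mean) / sd

def value_from_z(z, mean, sd):
    """Rearranged formula: X = mu + Z * sigma"""
    return mean + z * sd

mean_cm = 13    # population mean from the question
sd_cm = 1.3     # population standard deviation from the question

z_9 = z_score(9, mean_cm, sd_cm)
z_2 = z_score(2, mean_cm, sd_cm)
height_cm = value_from_z(2.25, mean_cm, sd_cm)

print(round(z_9, 2))        # -3.08
print(round(z_2, 2))        # -8.46
print(round(height_cm, 3))  # 15.925
```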


Chapter 5

(1) Which of the following is not one of the principles in data
governance framework?
(a) Protect your customer
(b) Data should never institutionalize unfair biases
(c) Never collect confidential data from users
Ans: (c) Never collect confidential data from users
(2) The private information that is shared should always be
handled with confidentiality
(a) True
(b) False
Ans: (a) True
(3) If you are done with using the confidential data collected
from users, you should :
(a) Safely store it. We may need it in future for some analysis or
reports
(b) Effectively destroy it in a way that it is unreadable
Ans: (b) Effectively destroy it in a way that it is unreadable
(4) Confidential data can be stored in which of the following
format?
(a) Digital data
(b) Physical copies
(c) Both
Ans: (c) Both
(5) Data should never institutionalize unfair biases
(a) True
(b) False
Ans: (a) True
(6) Digital confidential data should be discarded by
(a) Formatting the drive in which data was stored
(b) Temporarily deleting the data
Ans: (a) Formatting the drive in which data was stored
(7) Which of the following is not the appropriate way of
discarding the confidential data
(a) Shredding the data
(b) Cutting the files which contain confidential data
(c) Burning the confidential data
(d) Crumpling the papers which contain confidential data and
throwing them in the dustbin.
Ans: (d) Crumpling the papers which contain confidential data and
throwing them in the dustbin.
Standard Questions
Please answer the questions in no less than 200 words
(1) Explain the significance of data governance framework.
Ans: A data governance framework creates methods, sets up
responsibilities and defines processes to standardise, integrate,
protect and store data. Data analytics can create ethical problems
when someone starts making money from data externally, for purposes
that are different from the ones for which the data was actually
collected.
(2) Explain principles on ethics that one should follow while
performing data analysis.
Ans: The ethical principles that one should follow while performing
data analysis are:
(a) Protect Your Customer: Privacy relates to the confidentiality
of personal/private data, which should be accessed only when it is
needed. Private data obtained from a person should not be leaked
for business or individual purposes.
(b) The private information should always be handled with
confidentiality: Many third party companies share sensitive data in
fields such as finance and medicine, which they should not do. When
passing on information, they should restrict what is shared as much
as possible.
(c) Customers should always have a clear view: When third party
companies share information, they should inform the person whenever
that person's information is used or traded.
(d) Data should never interfere with a human will: Data analysis
can define or even predict who we are before we make up our own
minds. Organisations should decide in advance which types of
predictions and conclusions are allowed and which are not.
(e) Data should never institutionalize unfair biases: There
should not be any institutionalised unfair biases such as sexism or
racism. Analytical systems can absorb unconscious biases in a
population and amplify them through training samples.
(3) What are various techniques of safely discarding digital
confidential data?
Ans: Following are the methods of safely discarding digital data:
(a) When you have finished your work and do not require the data in
the future, you can erase the digital data from memory.
(b) Whenever you store digital data, keep it encrypted, so that it
stays protected if it is leaked through cyber-attacks or hackers.
(c) If the confidential data is stored on the client's hard disk or
drive, format or wipe that drive to discard the data safely.
(d) On many devices, a soft delete only removes the file from its
original location and moves it to a temporary folder, from where
anyone can easily restore it. To avoid this risk, delete the file
completely and permanently so that no one is able to restore it
again.
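Point (d) can be illustrated with a rough sketch that overwrites a file with random bytes before deleting it, instead of relying on a soft delete. The function overwrite_and_delete and the demo file are made up for this example. This is not a guaranteed secure erase: on SSDs, journaling file systems, backups or cloud storage, older copies of the data may survive, so dedicated wiping tools or full-disk encryption are safer in practice.

```python
import os
import secrets
import tempfile

def overwrite_and_delete(path, passes=1):
    """Overwrite a file's contents with random bytes, then remove the file.

    Illustrative only: this does not guarantee that no copy of the data
    remains on the storage device.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))  # replace the old contents
            f.flush()
            os.fsync(f.fileno())                # ask the OS to write to disk
    os.remove(path)                             # delete directly, no recycle bin

# Demo with a throwaway temporary file standing in for confidential data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"confidential demo record")
    demo_path = tmp.name

overwrite_and_delete(demo_path)
print(os.path.exists(demo_path))   # False
```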
(4) What are various techniques of safely discarding
physical confidential data?
Ans: Following are the methods for discarding physical
confidential data:
(a) Shredding the documents: This is a very important and
effective method for discarding physical confidential data. Use a
shredder and make sure that the shredded copy is not readable by
others. The document must be shredded in such a manner that no
other person is able to reconstruct or read it. Once the document
is shredded properly, we have assurance that the data has been
discarded.
(b) Cutting up the documents: If you have the confidential
information on one page or in one file, you can cut up that
document, which can be the correct option for discarding it. When
cutting documents, make sure that you cut them into very small
pieces so that no information, however minor, is readable or
reconstructable.
(c) Burning of documents: Burning documents completely is also a
very effective method of discarding them. No remains should be left
over from which the data or document can be reconstructed or read.
High Order Thinking Skills (HOTS)
(1) What, according to you, should be the technique used to
discard the confidential data collected from users while
making an online transaction?
Ans: If the data is collected from users while making an online
transaction, then I will delete the data from its original
location, from the temporary folder and from the history, and clear
the cache and cookies, so that no one can easily retrace or restore
the data again.
(2) How can you make sure that the data that you collected
from users while conducting a poll is stored securely?
Ans: How to ensure data collected from users during a poll is stored
securely:
1. Use Strong Passwords and Authentication:
Make sure the system where the data is stored is protected by strong passwords and
multi-factor authentication (like a code sent to a phone).
2. Store Data in Encrypted Form:
Encrypt the data so that even if someone tries to access it illegally, they cannot read it
without a special key.
3. Use Secure Servers:
Store the poll data on trusted and secure servers (for example, cloud services with
good security practices or dedicated secure local servers).
4. Limit Access to Data:
Only authorized people (like poll administrators) should be able to access the data.
Use user roles and permissions to control who sees or edits the data.
5. Regular Backups and Updates:
Regularly back up the data to prevent data loss and always keep the software updated
to protect against new security threats.

Applied Project
Suppose you were working with an NGO to help in vaccinating all the
kids in your area against Polio. Now that vaccination drive is
completed and all kids are vaccinated, you wanted to make sure
that you discard the data about kids that you collected during
vaccination. Explain the technique that you would use to discard this
data and how will you implement this action.
Ans: If I am working with an NGO to help in vaccinating all the kids
in my area against Polio, and the vaccination drive is completed
and all kids are vaccinated, then I will discard the data about the
kids that I collected during vaccination by erasing it from memory
and by formatting or wiping the hard disk or hard drive of the
client on which it was stored.
