Ad3301 Dev QB-3,4,5
UNIT-III / PART-A
1. What is sampling?
Sampling is a method that allows us to get information about the population based on the
statistics from a subset of the population (sample or case), without having to investigate every
individual.
2. What are the two basic units of data analysis?
Cases and variables are the two organizing concepts that are considered the basis of
data analysis.
The cases are the samples about which information is collected.
The information is collected on certain features of all the cases.
These features are the variables that vary across different cases.
Example: In a survey of individuals, their income, sex, and age are some of the variables that
might be recorded.
AD8502- Data Exploration and Visualization Department of ADS 2022-
9. What are the four important aspects of any distribution inspected by histograms?
Level: What are typical values in the distribution?
Spread: How widely dispersed are the values? Do they differ very much from one another?
Shape: Is the distribution flat or peaked? Symmetrical or skewed?
Outliers: Are there any particularly unusual values?
Where Y is conventionally used to refer to an actual variable. The subscript ‘i’ is an index that
indicates which case is being referred to, and N is the number of data points.
24. Write a brief note on the usefulness of the mean and the standard deviation measures.
The mean and the standard deviation are less resistant to extreme values than measures such
as the median and the midspread.
So, the resistant measures are often preferable for much descriptive and exploratory work,
especially when there are measurement errors.
The mean and the standard deviation are nevertheless useful because they can be used to make
very precise statements about the likely degree of sampling error in any data.
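29. Calculate the standard deviation of the hours worked by the small sample of men given
below:
54, 30, 47, 39, 50, 48, 45, 40, 37, 48, 67, 55, 55, 80, 70.
With N = 15 and mean \bar{Y} = 765/15 = 51:
s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}
= \sqrt{\frac{2472}{14}}
= 13.29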
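The calculation can be checked with a short script. This is a minimal sketch, assuming the N - 1 divisor (the divisor that reproduces the quoted answer of 13.29); the function name `sample_std` is illustrative and not part of the question bank.

```python
import math

# Hours worked by the small sample of men (from the question).
hours = [54, 30, 47, 39, 50, 48, 45, 40, 37, 48, 67, 55, 55, 80, 70]


def sample_std(values):
    """Standard deviation using the N - 1 divisor (assumed here)."""
    n = len(values)
    mean = sum(values) / n
    # Sum of squared deviations of each case from the mean.
    sum_sq = sum((y - mean) ** 2 for y in values)
    return math.sqrt(sum_sq / (n - 1))


s = sample_std(hours)
print(round(s, 2))
```

The mean is 51 and the sum of squared deviations is 2472, so the script prints the same value as the worked answer.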
causal processes. Moving averages are a simple and common type of smoothing used in time series
analysis and time series forecasting.
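As an illustration of smoothing with a moving average, the sketch below replaces each run of `window` consecutive values with their mean. The series and the window length of 3 are hypothetical values chosen only for the example.

```python
def moving_average(series, window):
    """Return the simple moving average of `series` over `window` points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    # One smoothed value for each full window of consecutive points.
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical time series, for illustration only.
values = [3, 5, 4, 8, 6, 9]
smoothed = moving_average(values, 3)
print(smoothed)
```

The smoothed series is shorter than the original by `window - 1` points, because only complete windows are averaged.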
34. What is the effect on distribution aspects of adding or subtracting a constant from every data
value? Why should we add or subtract a constant?
The change made to the data by adding or subtracting a constant is fairly trivial.
Only the level is affected; spread, shape, and outliers remain unaltered.
The reason for adding or subtracting a constant from every data value is to make a
division above and below a particular point.
This is also done to bring the data within a particular range.
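The effect can be seen in a small check. This sketch reuses the hours-worked sample from question 29 and subtracts 40, an arbitrary constant chosen here so that values above and below 40 hours become positive and negative; the median stands in for level, and the standard deviation and range stand in for spread.

```python
import math
import statistics

hours = [54, 30, 47, 39, 50, 48, 45, 40, 37, 48, 67, 55, 55, 80, 70]
constant = 40  # arbitrary reference point, assumed for illustration
shifted = [h - constant for h in hours]

# Level moves by exactly the constant.
level_before = statistics.median(hours)
level_after = statistics.median(shifted)

# Spread is unchanged.
spread_before = statistics.stdev(hours)
spread_after = statistics.stdev(shifted)
range_before = max(hours) - min(hours)
range_after = max(shifted) - min(shifted)

print(level_before, level_after)
print(math.isclose(spread_before, spread_after), range_before == range_after)
```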
UNIT-III / PART-B
1. The dataset below shows the gross earnings in pounds per week of twenty men and twenty women
drawn randomly from the 1979 New Earnings Survey. The respondents are all full-time adult
workers. Men are deemed to be adult when they reach age 21, women when they reach age 18.
Men            Women
150    58      90     39
 55   122      76     47
 82   120      87     80
107    83      58     42
102   115      50     40
 78    69      46     99
154    99      63     77
 85    94      68     67
123   144     116     49
 66    55      60     54
Calculate the median and dQ of both male and female earnings, and compare the two distributions.
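A possible way to check a hand calculation is sketched below. It assumes that the first two columns of the table are the twenty men and the last two are the twenty women, and it takes the quartiles as the medians of the lower and upper halves of the sorted data, which splits cleanly for an even number of cases such as N = 20. Other quartile conventions can give slightly different values; the function name `midspread` is illustrative.

```python
import statistics

# Gross weekly earnings in pounds (column assignment is an assumption).
men = [150, 55, 82, 107, 102, 78, 154, 85, 123, 66,
       58, 122, 120, 83, 115, 69, 99, 94, 144, 55]
women = [90, 76, 87, 58, 50, 46, 63, 68, 116, 60,
         39, 47, 80, 42, 40, 99, 77, 67, 49, 54]


def midspread(values):
    """Return (lower quartile, upper quartile, dQ) from the two halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q_lower = statistics.median(ordered[:half])
    q_upper = statistics.median(ordered[-half:])
    return q_lower, q_upper, q_upper - q_lower


for label, data in (("Men", men), ("Women", women)):
    q_lower, q_upper, d_q = midspread(data)
    print(label, statistics.median(data), q_lower, q_upper, d_q)
```

Under these assumptions the men have a median of 96.5 and a dQ of 47.5, and the women have a median of 61.5 and a dQ of 30.5, so the male distribution is both higher in level and more spread out.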
2. Describe histograms, bar graphs and pie charts in detail. Draw charts for the below table that
shows a specimen case by variable data matrix. It contains the first few cases in a subset of the
2005 GHS.
UNIT-IV&V / PART-A
1. Write briefly about the contingency table.
A contingency table shows the distribution of each variable conditional upon each category
of the other.
The categories of one of the variables form the rows, and the categories of the other
variable form the columns.
Each individual case is then tallied in the appropriate cell depending on its value on both
variables.
The number of cases in each cell is called the cell frequency.
3. What are bounded numbers?
Proportions and percentages are bounded numbers, in that they have a floor of zero, below
which they cannot go, and a ceiling of 1.0 and 100 respectively.
5. Write short notes on contingency table. Draw a schematic four-by-four contingency table.
A contingency table shows the distribution of each variable conditional upon each category
of the other.
The categories of one of the variables form the rows, and the categories of the other
variable form the columns.
Each individual case is then tallied in the appropriate pigeonhole depending on its value on
both variables.
The pigeonholes are called cells, and the number of cases in each cell is called the cell
frequency.
Each row and column can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals.
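The tallying described above can be sketched in a few lines. The cases below are hypothetical (a made-up row variable and column variable), and the function name `contingency_table` is illustrative; the sketch only shows how cell frequencies and marginals arise from individual cases.

```python
from collections import Counter

# Hypothetical cases: (value on the row variable, value on the column variable).
cases = [
    ("male", "yes"), ("male", "no"), ("female", "yes"),
    ("female", "yes"), ("male", "yes"), ("female", "no"),
]


def contingency_table(cases):
    """Tally cases into cells and compute the marginals."""
    cells = Counter(cases)                           # cell frequencies
    row_totals = Counter(row for row, _ in cases)    # right-hand marginals
    col_totals = Counter(col for _, col in cases)    # bottom marginals
    return cells, row_totals, col_totals, len(cases)


cells, row_totals, col_totals, grand_total = contingency_table(cases)
print(cells[("male", "yes")], row_totals["male"], col_totals["yes"], grand_total)
```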
6. What are three different ways of representing contingency table in percentage form?
The three different ways of representing a contingency table in percentage form are:
Total percentage table: The table that is constructed by dividing each cell frequency by the
grand total.
Outflow table: The table that is constructed by dividing each cell frequency by its
appropriate row total.
Inflow table: The table that is constructed by dividing each cell frequency by its
appropriate column total.
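The three percentage forms differ only in the divisor, which the following sketch makes explicit. The 2-by-2 table of frequencies is hypothetical, and `percentage_tables` is an illustrative name.

```python
def percentage_tables(table):
    """Return (total, outflow, inflow) percentage tables for a frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Each cell divided by the grand total.
    total_pct = [[100 * cell / grand_total for cell in row] for row in table]
    # Outflow: each cell divided by its row total.
    outflow = [[100 * cell / rt for cell in row]
               for row, rt in zip(table, row_totals)]
    # Inflow: each cell divided by its column total.
    inflow = [[100 * cell / ct for cell, ct in zip(row, col_totals)]
              for row in table]
    return total_pct, outflow, inflow


# Hypothetical cell frequencies, for illustration only.
table = [[30, 10], [20, 40]]
total_pct, outflow, inflow = percentage_tables(table)
print(total_pct)
print(outflow)
print(inflow)
```

In the outflow table every row sums to 100, in the inflow table every column sums to 100, and in the total percentage table all cells together sum to 100.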
10. What are the considerations to make a decision about which variable to put in the rows and which
in the columns?
Closer figures are easier to compare.
Comparisons are more easily made down a column.
A variable with more than three categories is best put in the rows so that there is plenty
of room for category labels.
17. What are the elementary details which must always appear in a table?
Labelling
Sources
Sample data
Missing data
Layout
Definitions
Opinion data
Ensuring frequencies can be reconstructed
Showing the way percentages run
19. What is the limitation of the chi-square statistic for tables with more than one degree of
freedom?
Chi-square only gives an overall measure of whether the two variables are likely to be
associated, but it does not provide information on the locations of the differences within the
table.
It is therefore necessary to recode the variables or to select specific groups for more detailed
analysis.
24. Provide two reasons why some data points are outliers.
Outliers occur when the whole distribution is skewed.
The particular data points do not really belong substantively to the same data batch.
27. What are the two types of hypotheses used by statistical tests?
Whenever researchers use a statistical test, two hypotheses are involved:
Null hypothesis
Alternative hypothesis
28. What is GDP?
If one focuses on all the production that takes place within national boundaries, the measure
is termed the Gross Domestic Product (GDP).
30. Define cause.
A cause is defined as an object followed by another, where all the objects similar to the first
are followed by objects similar to the second. In other words, if the first object had not been,
the second never would have existed.
31. Define causality.
Causality can be defined in terms of constant conjunction or statistical association. However, it
is clearly not sufficient for one event invariably to precede another for us to be convinced that
the first event causes the second.
A cohort has been defined as an 'aggregate of individuals who experienced the same event
within the same time interval'. The most obvious type of cohort used in longitudinal quantitative
research is the birth cohort, i.e., a sample of individuals born within a relatively short time
period.
42. Write short notes on cohort studies.
Cohort studies allow an explicit focus on the social and cultural context that frames the
experiences, behavior, and decisions of individuals.
For example, in the case of the 1958 British Birth Cohort study, it is important to
understand the cohort's educational experiences in the context of profound changes in the
organization of secondary education during the 1960s and 1970s, and the rapid expansion
of higher education, which was well underway by the time cohort members left school in
the mid-1970s.
UNIT-IV / PART-B
1. Explain the following:
Causality
a. Multiple causality
3. Consider a hypothetical example of the causes of absenteeism from work. Suppose previous
research had shown a positive bivariate relationship between low social status jobs and
absenteeism. Is there something about such jobs that directly causes the people who do them to
go off sick more than others? Discuss assumptions and possible outcomes.
5. Consider as a test factor, a variable that represents the extent to which the respondent suffers
from chronic nervous disorders, such as sleeplessness, anxiety, and so on. Such conditions would
be likely to lead to absence from work. It is also quite conceivable that they could be caused in
part by stressful, low-status jobs. Therefore, assume that nervous disorders act as an intervening
variable. What will happen to the original relationship if we control for this test factor? Show
similar possible outcomes.
6. Explain longitudinal data collection in detail. Give examples of longitudinal studies.
7. Explain event history modelling. Discuss the various approaches to event history
modelling.
(i) Randomly generate a normalized time series dataset using the NumPy library.
(ii) Plot the time series data using the seaborn library.
(iii) Generate an array of the cumulative sum of the data.
(iv) Plot the data using a time series plot.
The Open Power System data consists of 4 columns: 'Consumption', 'Wind', 'Solar', and
'Wind+Solar'. Write the code to visualize the Open Power System dataset.
(i) Generate a line plot of the full time series of Germany's daily electricity consumption.
(ii) Plot the data for all the other columns.
(iii) Visualize the electricity consumption between 2016-12-23 and 2016-12-30.
11 The data given below relate to the percentage of households that are headed by a lone parent and
contain dependent children, and the percentage of households that have no car or van for each of the
ten Government Office Regions of England and Wales.
Summarize the linear relationship between the two interval-level variables and explain in detail.
Also discuss the rules to draw the line.
12 Describe in detail contingency tables and percentage tables with suitable examples.
13 Explain the guidelines to construct a lucid table of numerical data.
14 Examine the given data. Which group of women is most likely to have high levels of worry about
violent crime? And which group is least likely to have high levels of worry about violent crime?
Using the data in the table create a simpler table with just three age groups; 16-34; 35-54; and 55+.
Decide which should be the reference or base category and use the table to construct a causal path
diagram.
Women          High levels of worry    Not worried         Total
age group      P        N              P        N          P     N
16-24          0.32     686            0.68     1447       1     2133
25-34          0.27     1040           0.73     2815       1     3855
35-44          0.25     1224           0.75     3725       1     4949
45-54          0.24     952            0.76     3008       1     3960
55-64          0.22     943            0.78     3430       1     4373
65-74          0.22     781            0.78     2735       1     3516
75 or older    0.14     511            0.86     3044       1     3555
Total                   6,137                   20,204           26,341
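The collapsing step asked for in the question can be sketched as follows. The counts are read from the table above, with digits that were split across lines in the source rejoined (an assumption, although each row then adds up to its stated total); the dictionary names are illustrative.

```python
# Counts (N) by age group, read from the table; rejoined digits are assumed.
high_worry = {"16-24": 686, "25-34": 1040, "35-44": 1224, "45-54": 952,
              "55-64": 943, "65-74": 781, "75 or older": 511}
not_worried = {"16-24": 1447, "25-34": 2815, "35-44": 3725, "45-54": 3008,
               "55-64": 3430, "65-74": 2735, "75 or older": 3044}

# The three broader age groups requested in the question.
groups = {
    "16-34": ["16-24", "25-34"],
    "35-54": ["35-44", "45-54"],
    "55+": ["55-64", "65-74", "75 or older"],
}

collapsed = {}
for label, bands in groups.items():
    high = sum(high_worry[b] for b in bands)
    low = sum(not_worried[b] for b in bands)
    total = high + low
    collapsed[label] = {"high": high, "not_worried": low, "total": total,
                        "p_high": round(high / total, 2)}

for label, row in collapsed.items():
    print(label, row)
```

The proportions must be recomputed from the summed frequencies, as here, rather than averaged across the original age bands, because the bands contain different numbers of women.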
15 Consider an imaginary piece of research in which 100 men and 100 women are asked about their
fear of walking alone after dark. Until we conduct the survey, we have no information other than the
number of men and women in our sample (see the below table).
Feeling safe walking alone after dark by gender
          Very safe, fairly safe,    Very unsafe    Total
          or a bit unsafe
          P      N                   P      N       P    N
Male      ?      ?                   ?      ?       1    100
Female    ?      ?                   ?      ?       1    100
Total     ?      ?                   ?      ?       1    200
Find the following:
a. After the survey, imagine that we find that in total 20 individuals i.e., 0.1 of the sample
state that they feel very unsafe when walking alone after dark. Add this information to the
given table.
b. If, in the population as a whole, the proportion of men who feel very unsafe walking alone
after dark is the same as the proportion of women who feel very unsafe walking alone after
dark, we would expect this to be reflected in our sample survey. Find the expected
proportions and frequencies.
c. Find the observed values after the survey has been carried out and fear of walking alone
after dark has been cross-tabulated by gender.
d. Compute the chi-square.
Briefly explain the statistically significant relationship between the variables.
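Parts (b) and (d) can be sketched in code. The marginals come from the question (100 men, 100 women, and 20 of the 200 respondents feeling very unsafe); the observed cell counts are not given in this document, so the sketch only defines the chi-square calculation and leaves the observed table to be supplied. The function names are illustrative.

```python
def expected_frequencies(row_totals, col_totals):
    """Expected cell counts if the row and column variables are unrelated."""
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


row_totals = [100, 100]   # men, women
col_totals = [180, 20]    # not very unsafe, very unsafe (20 of 200 = 0.1)
expected = expected_frequencies(row_totals, col_totals)
print(expected)
```

With these marginals the expected frequencies are 90 and 10 in each row, that is, proportions of 0.9 and 0.1 for both men and women. Once the observed counts are known, `chi_square(observed, expected)` gives the statistic, which for a 2-by-2 table is assessed on one degree of freedom.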
c. Degrees of freedom