Ad3301 Dev QB-3,4,5

AD3301-Data Exploration and Visualization

Uploaded by

dhananjeyans41
Copyright
© © All Rights Reserved

AD8502- Data Exploration and Visualization Department of ADS 2022-

UNIT III INTRODUCTION TO DATA EXPLORATION


Introduction to Single variable: Distributions and Variables - Numerical Summaries of Level and Spread
- Scaling and Standardizing – Inequality - Smoothing Time Series.

UNIT-III/PART-A

1. What is sampling?
Sampling is a method that allows us to get information about the population based on the
statistics from a subset of the population (sample or case), without having to investigate every
individual.
2. What are the two basic units of data analysis?
 Cases and variables are the two organizing concepts that are considered the basis of
data analysis.
 The cases are the samples about which information is collected.
 The information is collected on certain features of all the cases.
 These features are the variables that vary across different cases.
Example: In a survey of individuals, their income, sex, and age are some of the variables that
might be recorded.

3. What are the measurement scales used by social scientists?


Many of the variables used by social scientists are measured on nominal scales or ordinal scales
(also referred to as categorical variables), rather than interval scales (also referred to as
continuous variables).

4. What are two techniques for reducing the number of digits?


 Rounding and truncating are two methods for reducing the number of digits.
 Rounding method: Digits from zero to four are rounded down, and six to nine are rounded up.
The value 5 can be arbitrarily rounded up or down according to a fixed rule, or it could be
rounded up after an odd digit and down after an even digit.
 Truncating method: Values that are not needed are simply ‘cut off’ or ‘truncated’. Thus,
when cutting, all the numbers from 899.0 to 899.9 become 899. This procedure is much
quicker and does not run the extra risk of large mistakes.
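
The difference between the two methods can be sketched in a few lines of Python. This is only a minimal illustration: the helper names `round_half_up` and `truncate` are hypothetical, "always round 5 up" is just one possible fixed rule, and the sketch assumes non-negative values.

```python
import math

def round_half_up(value: float) -> int:
    """Round to the nearest whole number, always rounding .5 upwards (one possible fixed rule)."""
    return math.floor(value + 0.5)

def truncate(value: float) -> int:
    """Cut off the decimal part without looking at its size."""
    return math.trunc(value)

values = [899.0, 899.4, 899.6, 899.9]
rounded = [round_half_up(v) for v in values]
truncated = [truncate(v) for v in values]

print(rounded)    # [899, 899, 900, 900]
print(truncated)  # [899, 899, 899, 899]
```

Truncation maps every value from 899.0 to 899.9 to 899, whereas rounding needs a decision for each value.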

5. Write the drawback of the rounding method.


 The rounding of the digit five causes a problem; it can be arbitrarily rounded up or down
according to a fixed rule, or it could be rounded up after an odd digit and down after an even
digit.
 The trouble with such fuzzy rules is that people tend to make mistakes, and often they are
not trivial.
 It is an easy mistake to round 899.6 to 890.

6. What is a bar chart? Give example.


A bar chart is a visual display in which bars are drawn to represent each category of a variable
such that the length of the bar is proportional to the number of cases in the category. For
instance, a bar chart of the drinking classification variable is shown in the figure below.


7. When to prefer pie charts?


Pie charts are to be preferred when there are only a few categories and when the sizes of the
categories are very different.

8. What are the two types of distributions in histograms?


The two types of distribution in histograms are unimodal and bimodal, depending on the frequency
of the occurring values.

9. What are the four important aspects of any distribution inspected by histograms?
 Level: What are typical values in the distribution?
 Spread: How widely dispersed are the values? Do they differ very much from one another?
 Shape: Is the distribution flat or peaked? Symmetrical or skewed?
 Outliers: Are there any particularly unusual values?

10. What is SPSS?


 SPSS is an acronym for Statistical Package for the Social Sciences.
 SPSS is a very useful computer package that includes hundreds of different procedures for
displaying and analyzing data.

11. What is unimodal distribution?


 A unimodal distribution is a single-peaked distribution in which one value occurs with
greater frequency than the other values.
 It is a distribution with a single clearly visible peak or a single most frequent value.
 The distribution’s shape in the unimodal distribution has only one main high point.

12. What is bimodal distribution?


 A bimodal distribution is a distribution in which two values occur with the greatest
frequency, so that two frequent values are separated by a gap in between.
 This type of distribution has two fairly equal high points (the modes).
 The two modes are usually separated by a big gap, and the distribution contains
more data around the modes than elsewhere.

13. What are histograms?


 Histograms are charts that are similar to bar charts that can be used to display
interval-level variables grouped into categories.
 They are constructed in exactly the same way as bar charts except that the ordering of
the categories is fixed.


14. What are the three main windows of SPSS?


SPSS has three main windows:
 The Data Editor window
 The Output window
 The Syntax window

15. What are the advantages and disadvantages of a summary?


Advantages:
 Summaries focus the attention of the data analyst on one thing at a time and
prevent aimless exploration of a display of the data.
 They also help focus the process of comparison from one dataset to another and make it
more rigorous.
Disadvantages:
 Summaries always involve some loss of information.
 They do not contain the richness of information that existed in the original picture.

16. How do we define 'typical' value for summarization?


 The value halfway between the extremes might be chosen.
 The single most common number might be chosen.
 A summary of the middle portion of the distribution might be used.
17. What is a residual? Give example.
 A residual can be defined as the difference between a data point and the observed typical, or
average, value.
 For example, if 40 hours a week is chosen as the typical level of men's working hours, using
data from the General Household Survey in 2005, then a man who was recorded in the
survey as working 45 hours a week would have a residual of 5 hours.
 Another way of expressing this is that the residual is the observed data value minus
the predicted value and in this case, 45 - 40 = 5.

18. What is data in terms of summarization?


 Any data value (such as a measurement of hours worked or income earned) is composed of
two components: a fitted part and a residual part.
 This can be expressed as an equation:
Data = Fit + Residual
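
The decomposition can be checked with a short Python sketch. The fit of 40 hours and the observation of 45 hours come from the example above; the other two observations are hypothetical values added for illustration.

```python
fit = 40                      # typical weekly working hours chosen as the fit
observed = [45, 40, 38]       # 45 is from the text; 40 and 38 are hypothetical

# Residual = observed data value minus the fitted (predicted) value
residuals = [y - fit for y in observed]

# Data = Fit + Residual recovers every original value
reconstructed = [fit + r for r in residuals]

print(residuals)       # [5, 0, -2]
print(reconstructed)   # [45, 40, 38]
```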

19. What are the measures of central tendency?


Mean, median, and mode are the measures of central tendency.
 Mean: the sum of all values divided by the total number of values.
 Median: the middle number in an ordered data set.
 Mode: the most frequent value.

20. What is arithmetic mean?


 The mean value of a dataset is the average value, i.e. a number around which the whole
dataset is spread out.
 To calculate it, first, all of the values are summed, and then the total is divided by
the number of data points.
 In mathematical terms,

$$ \bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N} $$

where Y is conventionally used to refer to an actual variable, the subscript i is an index that
indicates which case is being referred to, and N is the number of data points.

21. What is a median?


 The median is the middle value of the dataset, i.e. the data are sorted from smallest to biggest
(or biggest to smallest) and then the value in the middle of the set is taken.
 It is the value of the case that has equal numbers of data points above and below it.
 With N data points, the median M is the value at depth (N + 1)/2.

22. What are quartiles and midspread?


 The points which divide the distribution into quarters are called the quartiles (or hinges
or fourths).
 The lower quartile is usually denoted QL and the upper quartile QU. The middle quartile is
the median.
 The distance between QL and QU is called the midspread (dQ or interquartile range).
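
As a rough sketch, the median, quartiles and midspread can be computed by counting depths into the sorted data. The sample is the men's working hours used elsewhere in this question bank. The depth rules are assumptions stated here for illustration: median depth (N + 1)/2 and quartile depth (truncated median depth + 1)/2, with a half depth meaning the average of the two neighbouring values. The helper `value_at_depth` is a hypothetical name.

```python
import math

def value_at_depth(sorted_values, depth):
    """Value at a 1-based depth; a half depth averages the two neighbouring values."""
    lower = sorted_values[math.floor(depth) - 1]
    upper = sorted_values[math.ceil(depth) - 1]
    return (lower + upper) / 2

hours = sorted([54, 30, 47, 39, 50, 48, 45, 40, 37, 48, 67, 55, 55, 80, 70])
n = len(hours)

median_depth = (n + 1) / 2                            # 8.0 for N = 15
quartile_depth = (math.floor(median_depth) + 1) / 2   # 4.5 (assumed rule)

median = value_at_depth(hours, median_depth)
q_lower = value_at_depth(hours, quartile_depth)          # counted from the bottom
q_upper = value_at_depth(hours[::-1], quartile_depth)    # counted from the top
midspread = q_upper - q_lower

print(median, q_lower, q_upper, midspread)   # 48.0 42.5 55.0 12.5
```

Other software may use slightly different quartile conventions, so small differences in QL and QU are to be expected.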
23. Why range cannot be recommended as a summary measure of spread?
 Range only uses information from two data points, and these are drawn from the most
unreliable part of the data.
 Therefore, despite its intuitive appeal, it cannot be recommended as a summary measure
of spread.

24. Write a brief note on the usefulness of the mean and the standard deviation measures.
 The mean and the standard deviation are less resistant than the median and the midspread.
 So, the median and the midspread are often preferable for much descriptive and exploratory
work, especially when there are measurement errors.
 The mean and the standard deviation remain useful because they can be used to make very
precise statements about the likely degree of sampling error in any data.

25. State Twyman's law for data analysis.


The more unusual or interesting the data, the more likely they are to have been the result of an
error of one kind or another.

26. What is the general principle in comparing different measures?


The general principle in comparing different measures is: one measure is more resistant than
another if it tends to be less influenced by a change in any small part of the data.
27. How do we decide between the median and mean to summarize a typical value, or between the
range, the midspread, and the standard deviation to summarize the spread?
 Locational statistics such as the median and midspread generally fare better than
the more abstract means and standard deviations.
 Means and standard deviations are more influenced by unusual data values than medians
and midspreads.
 Means and standard deviations are usually more influenced by a change in any individual
data point than the medians and midspreads.

28. What is smoothing?


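Smoothing is the process of separating a time series into an underlying pattern (the smooth) and
the irregular variation around it (the rough). This decomposition parallels the decomposition of a
message into signal and noise:
Message = Signal + Noise
Data = Smooth + Rough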


29. Calculate the standard deviation of the hours worked by the small sample of men given
below:
54,30,47,39,50,48,45,40,37,48,67,55,55,80,70.

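$$ s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}} = \sqrt{\frac{2472}{14}} = 13.29 $$

where N = 15 and the sample mean is \bar{Y} = 765/15 = 51.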
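
The arithmetic can be verified with a short Python sketch. It assumes the sample form of the standard deviation (divisor N - 1), which is the form that reproduces 13.29 for these data.

```python
import math
import statistics

hours = [54, 30, 47, 39, 50, 48, 45, 40, 37, 48, 67, 55, 55, 80, 70]

mean = sum(hours) / len(hours)                      # 51.0
sum_sq = sum((y - mean) ** 2 for y in hours)        # sum of squared residuals from the mean
s = math.sqrt(sum_sq / (len(hours) - 1))            # divisor N - 1 (sample standard deviation)

print(mean, sum_sq, round(s, 2))   # 51.0 2472.0 13.29

# The standard library gives the same result.
print(round(statistics.stdev(hours), 2))   # 13.29
```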

30. Write a brief note on Gaussian distribution.


 Gaussian distributions are bell-shaped and have the convenient property of being
reproducible from their mean and standard deviation.
 Given these two pieces of information, the exact shape of the curve can be reconstructed,
and the proportion of the area under the curve falling between various points can be
calculated.
 Gaussian distribution is the one that, when used to represent a sample, involves the simplest
calculations from sample values.

31. Explain Lorenz curves.


 Lorenz curves have visual appeal because they portray how near total equality or
total inequality a particular distribution falls.
 The degree of inequality in two distributions can be compared by superimposing their
Lorenz curves.

32. Define the Gini Coefficient.


 A measure that summarizes what is happening across all the distribution is the Gini
coefficient.
 The Gini coefficient expresses the ratio between the area between the Lorenz curve and the
line of total equality and the total area in the triangle formed between the perfect
equality and perfect inequality lines.
 It therefore varies between 0 and 1 although it is sometimes multiplied by 100 to express
the coefficient in percentage form.
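
A minimal sketch of this area ratio follows, assuming the Lorenz curve is drawn through the cumulative shares of a small list of incomes and that the area under it is approximated with trapezoids. The function names and the income lists are hypothetical.

```python
def lorenz_points(incomes):
    """Cumulative (population share, income share) points, starting at (0, 0)."""
    ordered = sorted(incomes)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, income in enumerate(ordered, start=1):
        running += income
        points.append((i / n, running / total))
    return points

def gini(incomes):
    """Area between the equality line and the Lorenz curve, divided by the triangle area 0.5."""
    points = lorenz_points(incomes)
    area_under_lorenz = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area_under_lorenz += (x1 - x0) * (y0 + y1) / 2   # trapezoid between two points
    return (0.5 - area_under_lorenz) / 0.5

print(gini([10, 10, 10, 10]))   # 0.0  (total equality)
print(gini([0, 0, 0, 100]))     # 0.75 (one person holds everything)
```

With only four cases the most unequal distribution gives 0.75 rather than 1; the value approaches 1 as the number of cases grows.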

33. Explain smoothing in time series.


Smoothing is a technique applied to time series to remove the fine-grained variation between
time steps. The hope of smoothing is to remove noise and better expose the signal of the
underlying causal processes. Moving averages are a simple and common type of smoothing used in
time series analysis and time series forecasting.
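
A moving average can be sketched as follows. This is an illustration under stated assumptions: a centred window of odd length, endpoints left unsmoothed (as `None`), and a hypothetical series of seven values.

```python
def moving_average(series, window=3):
    """Centred moving mean; points without a full window are left as None."""
    half = window // 2
    smooth = [None] * len(series)
    for i in range(half, len(series) - half):
        chunk = series[i - half:i + half + 1]
        smooth[i] = sum(chunk) / window
    return smooth

data = [4, 8, 6, 10, 5, 9, 7]          # hypothetical time series
smooth = moving_average(data, window=3)

# Data = Smooth + Rough, so the rough is what is left after removing the smooth
rough = [None if s is None else d - s for d, s in zip(data, smooth)]

print(smooth)   # [None, 6.0, 8.0, 7.0, 8.0, 7.0, None]
print(rough)    # [None, 2.0, -2.0, 3.0, -3.0, 2.0, None]
```

The smooth varies much less from step to step than the raw data, while the rough collects the short-term fluctuations.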

34. What is the effect on distribution aspects of adding or subtracting a constant from every data
value? Why should we add or subtract a constant?
 The change made to the data by adding or subtracting a constant is fairly trivial.
 Only the level is affected; spread, shape, and outliers remain unaltered.
 The reason for adding or subtracting a constant from every data value is to make a
division above and below a particular point.
 This is also done to bring the data within a particular range.
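
The effect of subtracting a constant can be checked with a small sketch. The seven values are taken from the ranked men's working hours listed in this question bank, and the constant 40 is an arbitrary choice for illustration.

```python
import statistics

hours = [30, 37, 39, 40, 45, 47, 48]
shifted = [h - 40 for h in hours]       # subtract the constant 40 from every value

# Level: the median moves by exactly the constant
level_change = statistics.median(shifted) - statistics.median(hours)

# Spread: the range is unchanged
range_before = max(hours) - min(hours)
range_after = max(shifted) - min(shifted)

# Shape: distances from the median are identical before and after
dev_before = [h - statistics.median(hours) for h in hours]
dev_after = [s - statistics.median(shifted) for s in shifted]

print(level_change)                 # -40
print(range_before, range_after)    # 18 18
print(dev_before == dev_after)      # True
```

After the shift, negative values are those below 40 hours and positive values are those above it.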

35. List the different refinements of the smoothing process.


 Endpoint Smoothing
 Breaking the smooth

UNIT-III/PART-B
1. The dataset below shows the gross earnings in pounds per week of twenty men and twenty women
drawn randomly from the 1979 New Earnings Survey. The respondents are all full-time adult
workers. Men are deemed to be adult when they reach age 21, women when they reach age 18.
Men          Women
150   58     90   39
 55  122     76   47
 82  120     87   80
107   83     58   42
102  115     50   40
 78   69     46   99
154   99     63   77
 85   94     68   67
123  144    116   49
 66   55     60   54
Calculate the median and dQ of both male and female earnings, and compare the two distributions.

2. Describe histograms, bar graphs and pie charts in detail. Draw charts for the below table that
shows a specimen case by variable data matrix. It contains the first few cases in a subset of the
2005 GHS.


3. Explain in detail numerical summaries of level and spread.


4. Explain in detail the concepts of Scaling and Standardizing.
5. Write in detail about Inequalities.
6. Write a detailed explanation about time series smoothing.
7. Explain various smoothing Techniques.
8. Pick any three numbers and calculate their mean and median. Calculate the residuals and squared
residuals from each, and sum them. Confirm that the median produces smaller absolute residuals
and the mean produces smaller squared residuals.
9. Explain about variables in a data matrix with an example.
10. Calculate the mean and standard deviation of the male earnings of the data given below. Compare
them with the median and midspread you calculated. Why do they differ?

Men's working hours (ranked)


30
37
39
40
45
47
48
Median value 48
50
54
55
55
67
70
80


UNIT IV & V INTRODUCING TWO VARIABLES AND THIRD VARIABLE


Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling
Several Batches - Scatterplots and Resistant Lines – Transformations.
Unit V
Introducing a Third Variable - Causal Explanations - Three-Variable Contingency Tables and Beyond - Longitudinal
Data – Fundamentals of TSA – Characteristics of time series data – Data Cleaning – Time-based indexing –
Visualizing – Grouping – Resampling.

UNIT-IV&V/PART-A
1. Write briefly about the contingency table.
A contingency table shows the distribution of each variable conditional upon each category of the
other. The categories of one of the variables form the rows, and the categories of the other
variable form the columns. Each individual case is then tallied in the appropriate cell depending
on its value on both variables. The number of cases in each cell is called the cell frequency.

2. What are the two types of variables?


 The two types of variables are explanatory and response variables.
 The variable that is presumed to be the cause is the explanatory variable (denoted as X).
 The one that is presumed to be the effect is the response variable (denoted as Y).
 They are also termed independent and dependent variables respectively.

3. What are bounded numbers?
Proportions and percentages are bounded numbers, in that they have a floor of zero, below which
they cannot go, and a ceiling of 1.0 and 100 respectively.

4. Draw the general diagram of the causal path model.


5. Write short notes on contingency table. Draw a schematic four-by-four contingency table.
 A contingency table shows the distribution of each variable conditional upon each category
of the other.
 The categories of one of the variables form the rows, and the categories of the other
variable form the columns.
 Each individual case is then tallied in the appropriate pigeonhole depending on its value on
both variables.
 The pigeonholes are called cells, and the number of cases in each cell is called the cell
frequency.
 Each row and column can have a total presented at the right-hand end and at the bottom
respectively; these are called the marginals.


6. What are three different ways of representing a contingency table in percentage form?
The three different ways of representing a contingency table in percentage form are:
 The table that is constructed by dividing each cell frequency by the grand total.
 Outflow table: The table that is constructed by dividing each cell frequency by its
appropriate row total.
 Inflow table: The table that is constructed by dividing each cell frequency by its
appropriate column total.
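
The three percentage tables can be sketched in plain Python. The 2 x 2 table of cell frequencies below is hypothetical and the variable names are chosen for illustration only.

```python
counts = [[30, 10],
          [20, 40]]                     # hypothetical cell frequencies (rows x columns)

row_totals = [sum(row) for row in counts]           # right-hand marginals: [40, 60]
col_totals = [sum(col) for col in zip(*counts)]     # bottom marginals: [50, 50]
grand_total = sum(row_totals)                       # 100

# Each cell as a percentage of the grand total
total_pct = [[100 * cell / grand_total for cell in row] for row in counts]

# Outflow table: each cell as a percentage of its row total
outflow = [[100 * cell / row_totals[i] for cell in row]
           for i, row in enumerate(counts)]

# Inflow table: each cell as a percentage of its column total
inflow = [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
          for row in counts]

print(total_pct)    # [[30.0, 10.0], [20.0, 40.0]]
print(outflow[0])   # [75.0, 25.0]
print(inflow)       # [[60.0, 20.0], [40.0, 80.0]]
```

In the outflow table every row sums to 100, and in the inflow table every column sums to 100.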

7. What are marginals?


 Each row and column in a contingency table can have a total presented at the right-hand
end and at the bottom respectively; these are called the marginals.
 The univariate distributions can be obtained from the marginal distributions.

8. Write a brief note on labeling a table.


 The title of a table should be clear and concise, summarising the contents.
 It should be as short as possible, while at the same time making clear when the data were
collected, the geographical unit covered, and the unit of analysis.
 It helps to number tables and figures so that they can be referred to more succinctly in the text.
 Other parts of a table also need clear, informative labels.
 The variables included in the rows and columns must be clearly identified.

9. What is the importance of using a layout in a table?


 The effective use of space and grid lines can make the difference between a table that is
easy to read and one which is not.
 Grid lines can help indicate how far a heading or subheading extends in a complex table.

10. What are the considerations to make a decision about which variable to put in the rows and which
in the columns?
 Closer figures are easier to compare.
 Comparisons are more easily made down a column.
 A variable with more than three categories is best put in the rows so that there is plenty
of room for category labels.

11. What is the difference in proportions?


The difference in proportions, d, is used to summarize the effect of being in a category of one
variable upon the chances of being in a category of another.


12. Write the properties of difference in proportions?


 Symmetric measures of association have the same value regardless of which way the
causal effect is assumed to run.
 Asymmetric measures have varying values depending on which variable is presumed to be
the cause of the other.

13. What are inferential statistics?


The analysis of data from samples of individuals to infer information about the population as
a whole is called inferential statistics.

14. Write the equation for the chi-square statistic.


The equation for chi-square is given by,

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

where O is the observed frequency and E is the expected frequency.
The difference between the observed and expected frequencies for each cell of the table is
calculated. Then, this value is squared before dividing it by the expected frequency for that cell.
Finally, these values are summed over all the cells of the table.
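
The calculation described above can be sketched as follows. The observed table is hypothetical, and the expected frequencies are computed in the usual way for a test of no association (row total times column total divided by the grand total), which is an assumption not spelled out in the answer above.

```python
def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

observed = [[30, 10],
            [20, 40]]          # hypothetical cell frequencies
chi2 = chi_square(observed)
print(round(chi2, 2))          # 16.67
```

Here the expected frequencies are 20, 20, 30 and 30, so each cell differs from its expectation by 10.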

15. What is a null hypothesis?


The null hypothesis is that the two variables under analysis are not associated in the population
as a whole and that the relationship observed between the variables in the sample is small
enough to have occurred due to random error, i.e. the null hypothesis states that, in the population
of interest, changes in the explanatory variable have no impact on the outcome of the response
variable.
16. What are outliers?
Some datasets contain points that are a lot higher or lower than the main body of the data.
Outliers are points that are unusually distant from the rest of the data.

17. What are the elementary details which must always appear in a table?
 Labelling
 Sources
 Sample data
 Missing data
 Layout
 Definitions
 Opinion data
 Ensuring frequencies can be reconstructed
 Showing the way percentages run

18. What is a degree of freedom? Give example.


The number of degrees of freedom for a table with r rows and c columns is given by the equation
below:
Degrees of freedom (Df) = (r - 1)(c - 1)
Example: A table with two rows and two columns is said to have one degree of freedom. A table
with two columns and three rows is said to have two degrees of freedom.

19. What is the limitation of the chi-square statistic for tables with more than one degree of
freedom?

 Chi-square only gives an overall measure of whether the two variables are likely to be
associated, but it does not provide information on the locations of the differences within the
table.
 It is therefore necessary to recode the variables or to select specific groups for more detailed
analysis.

20. What are the limitations of using chi-square statistic?


 The probability associated with a specific value of chi-square can only be calculated reliably
if all the expected frequencies in the table are at least 5. (i.e.) the size of sample required
partly depends on the distribution of the variables of interest.
 It only focuses on the relationship between two categorical variables. It does not examine the
relationship between a number of different categorical variables.

21. How to identify the outliers in a particular dataset?


 To identify the outliers in a particular dataset, a value 1.5 times the dQ, or a step,
is calculated.
 Fractions other than one-half are ignored.
 Then the points beyond which the outliers fall (the inner fences) and the points beyond which
the far outliers fall (the outer fences) are identified.
 The inner fences lie one step beyond the quartiles and outer fences lie two steps beyond the
quartiles.
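
The fence calculation can be sketched as follows. The quartiles 42.5 and 55 are assumed values for the men's working hours sample used elsewhere in this question bank, and reading "fractions other than one-half are ignored" as rounding the step down to the nearest half is an interpretation made for this sketch. The function name `fences` is hypothetical.

```python
import math

def fences(q_lower, q_upper):
    """Return the step, the inner fences and the outer fences."""
    d_q = q_upper - q_lower
    step = math.floor(1.5 * d_q * 2) / 2          # keep only whole or half values
    inner = (q_lower - step, q_upper + step)      # one step beyond the quartiles
    outer = (q_lower - 2 * step, q_upper + 2 * step)   # two steps beyond the quartiles
    return step, inner, outer

hours = [30, 37, 39, 40, 45, 47, 48, 48, 50, 54, 55, 55, 67, 70, 80]
step, inner, outer = fences(42.5, 55.0)           # assumed quartiles for this sample

far_outliers = [h for h in hours if h < outer[0] or h > outer[1]]
outliers = [h for h in hours
            if (h < inner[0] or h > inner[1]) and h not in far_outliers]

print(step, inner, outer)      # 18.5 (24.0, 73.5) (5.5, 92.0)
print(outliers, far_outliers)  # [80] []
```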

22. What are the five principal advantages of transforming data?


1. Data batches can be made more symmetrical.
2. The shape of data batches can be made more Gaussian.
3. Outliers that arise simply from the skewness of the distribution can be removed, and
previously hidden outliers may be forced into view.
4. Multiple batches can be made to have more similar spreads.
5. Linear, additive models may be fitted to the data.

23. What is a boxplot?


The boxplot is a device for conveying the information in the five number summaries economically
and effectively.
Example:

24. Provide two reasons why some data points are outliers.
 Outliers occur when the whole distribution is skewed.

 The particular data points do not really belong substantively to the same data batch.

25. Draw the anatomy of a boxplot.

26. What is GNI?


GNI is the sum of values of both final goods and services and investment goods in a country. If one
focuses on the production that is undertaken by the residents of that country, the income earned by
nationals from abroad has to be added to the gross domestic product, to arrive at the gross national
income.

27. What are the two types of hypotheses used by statistical tests?
Whenever researchers use a statistical test, two hypotheses are involved:
 Null hypothesis
 Alternative hypothesis
28. What is GDP?
If one focuses on all the production that takes place within national boundaries, the measure
is termed the Gross Domestic Product (GDP).

29. Write the formula for a two-sample t-test.


The formula for a two-sample t-test where the samples are independent is,

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

where \bar{x}_1 and \bar{x}_2 are the means of the two samples and s_p is the pooled standard
deviation, which is calculated as

$$ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} $$

s_1, s_2 - standard deviations of the first and second samples
n_1, n_2 - sample sizes of the first and second samples
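
A minimal sketch of the pooled two-sample t statistic follows, assuming the standard pooled-variance form (the difference in means divided by the pooled standard deviation times the square root of 1/n1 + 1/n2). The two small samples and the function name `two_sample_t` are hypothetical.

```python
import math
import statistics

def two_sample_t(sample1, sample2):
    """t statistic for two independent samples using a pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))

t = two_sample_t([2, 4, 6], [1, 2, 3])   # hypothetical samples
print(round(t, 3))                       # 1.549
```

For these samples the means are 4 and 2, the sample variances are 4 and 1, and the pooled variance is 2.5.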

30 Define cause.
A cause is defined as an object followed by another, and where all the objects, similar to the first,
are followed by objects similar to the second. (i.e.) if the first object had not been, the second
never had existed.

31 Define causality.
Causality cannot be defined purely in terms of constant conjunction or statistical association:
it is clearly not sufficient for one event invariably to precede another for us to be convinced
that the first event causes the second.

32 What is multiple causality?


Multiple causality is a process where many different component causes can combine to produce a
specific outcome.

33 What are direct and indirect causal effects?


Direct causal effects are effects that go directly from one variable to another. Indirect effects
occur when the relationship between two variables is mediated by one or more variables.

34 List the different causal relationships between variables.


The different causal relationships between variables are prior, intervening and ensuing.

35 What is Simpson’s paradox?


Simpson’s paradox states that any statistical relationship between two variables may be reversed
by including additional factors in the analysis.
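
The reversal can be illustrated with a small sketch. The counts below are illustrative numbers chosen so that the treated cases do better within each group but worse overall; the group and variable names are hypothetical.

```python
# (successes, cases) for each combination of group and treatment status
groups = {
    "group A": {"treated": (81, 87), "untreated": (234, 270)},
    "group B": {"treated": (192, 263), "untreated": (55, 80)},
}

def rate(successes, cases):
    return successes / cases

# Within each group the treated rate is higher than the untreated rate
within = {name: rate(*g["treated"]) > rate(*g["untreated"])
          for name, g in groups.items()}

# Pool the groups by adding up successes and cases
treated_total = tuple(sum(v) for v in zip(*(g["treated"] for g in groups.values())))
untreated_total = tuple(sum(v) for v in zip(*(g["untreated"] for g in groups.values())))
overall = rate(*treated_total) > rate(*untreated_total)

print(within)                           # {'group A': True, 'group B': True}
print(treated_total, untreated_total)   # (273, 350) (289, 350)
print(overall)                          # False
```

The reversal arises because group membership is related both to treatment status and to the success rate, so ignoring it distorts the pooled comparison.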

36 What are enhancer and suppressor variables?


 Test factors which are either positively related to both the other variables or
negatively related to both of them are called enhancer variables.
 Test factors which are positively associated with one variable and negatively with
the other are called suppressor variables.

37 Write the equation for logistic regression.

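$$ \log\left(\frac{p}{1 - p}\right) = a + bX $$

where p is the probability of the outcome, a and b are the coefficients, and X denotes the single
explanatory variable.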
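
A minimal sketch of a logistic regression with a single explanatory variable follows, assuming the usual logit form log(p / (1 - p)) = a + bX. The coefficients a and b and the X values are hypothetical, and the function names are chosen for illustration.

```python
import math

def logistic_probability(x, a, b):
    """Probability implied by log(p / (1 - p)) = a + b * x."""
    return 1 / (1 + math.exp(-(a + b * x)))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))

a, b = -2.0, 0.5               # hypothetical coefficients
xs = [0, 2, 4, 6, 8]
probs = [logistic_probability(x, a, b) for x in xs]

print(probs[2])                # 0.5, because a + b * 4 = 0

# The log odds are linear in X even though the probabilities are not
for x, p in zip(xs, probs):
    print(x, round(p, 3), round(log_odds(p), 3))
```

The probabilities always stay between 0 and 1, which is why this form suits a bounded response.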

38 What is a panel study?


The participants in a research study are contacted by researchers and asked to provide
information about themselves and their circumstances on a number of different occasions. This is
referred to as a panel study.

39 What are transition tables?


Transition tables have a longitudinal dimension in that the two variables that are being cross-
tabulated can be understood as a single categorical variable that has been measured at two time
points.

40 Define cohort. Give example.


A cohort has been defined as an 'aggregate of individuals who experienced the same event
within the same time interval'. The most obvious type of cohort used in longitudinal quantitative
research is the birth cohort (i.e.) a sample of individuals born within a relatively short time
period.

41 What are the characteristics of time series data?


When working with time series data, there are several unique characteristics that can be observed.
 Trend
 Outliers
 Seasonality
 Abrupt changes
 Constant variance

42 Write short notes on cohort studies.
 Cohort studies allow an explicit focus on the social and cultural context that frames the
experiences, behavior, and decisions of individuals.
 For example, in the case of the 1958 British Birth Cohort study, it is important to
understand the cohort's educational experiences in the context of profound changes in the
organization of secondary education during the 1960s and 1970s, and the rapid expansion
of higher education, which was well underway by the time cohort members left school in
the mid-1970s.

43 What is cross-sectional survey?


The change over time is determined by conducting two surveys asking the same questions at
different points in historical time. This is known as a repeated cross-sectional survey.

44 Write short notes on Time-based indexing.


Time-based indexing is a very powerful method of the pandas library when it comes to time
series data. It allows using a formatted string to select data. Example:
df_power.loc['2015-10-02']
Output:
Consumption 1391.05
Wind 81.229
Solar 160.641
Wind+Solar 241.87
Year 2015
Month 10
Weekday Name Friday
Name: 2015-10-02 00:00:00, dtype: object
Here, the pandas dataframe loc accessor is used. The date is used as a string to select a row. All
sorts of techniques can be used to access rows just as we can do with a normal dataframe index.
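
A self-contained sketch of time-based indexing is given below. The row for 2015-10-02 reuses the Consumption, Wind and Solar values from the output above; the other two rows are hypothetical, and the small frame only stands in for the real Open Power System dataset.

```python
import pandas as pd

# Small stand-in for df_power with a DatetimeIndex (only the middle row is from the text)
df_power = pd.DataFrame(
    {
        "Consumption": [1350.00, 1391.05, 1200.00],
        "Wind": [90.000, 81.229, 150.000],
        "Solar": [140.000, 160.641, 60.000],
    },
    index=pd.to_datetime(["2015-10-01", "2015-10-02", "2015-11-01"]),
)

# Select a single day with a date string (returns one row as a Series)
one_day = df_power.loc["2015-10-02"]

# Partial-string indexing: every row in October 2015
october = df_power.loc["2015-10"]

# Slicing by date strings includes both endpoints
two_days = df_power.loc["2015-10-01":"2015-10-02"]

print(one_day["Consumption"])        # 1391.05
print(len(october), len(two_days))   # 2 2
```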

45 What is the major issue in longitudinal studies?


A major methodological issue in longitudinal studies is the problem of attrition, i.e., the dropout
of participants through successive waves of a prospective study.

46 What is univariate time series? Give example.


 The series that captures a sequence of observations for the same variable over a
particular duration of time is referred to as univariate time series.
 In general, the observations are taken over regular time periods, such as the change in
temperature over time throughout a day.

47 What is event history analysis?


Event history analysis focuses on the timing of events or the duration until a particular
event occurs, rather than changes in attributes over time.

48 What are the two main approaches to longitudinal data analysis?


 Repeated measures analysis
 Event history analysis or Event history modeling

49 What is repeated measures analysis?


Repeated measures analysis focuses on the changes in an individual attribute over time. For
example, weight, performance score, attitude, voting behavior, reaction time, depression, etc.

50 What are time series?


An ordered sequence of timestamp values at equally spaced intervals is referred to as a
time series. It is a collection of observations made sequentially in time.

51 List the applications of time series analysis.


Analysis of a time series is used in many applications such as sales forecasting, utility studies,
budget analysis, economic forecasting, inventory studies, etc.

UNIT-IV/PART-B
1. Explain the following:

a. Causality

b. Multiple causality

c. Direct and indirect effects

2. Describe in detail about the assumptions required to infer causes.

3. Consider a hypothetical example of the causes of absenteeism from work. Suppose previous
research had shown a positive bivariate relationship between low social status jobs and
absenteeism. Is there something about such jobs that directly causes the people who do them to
go off sick more than others? Discuss assumptions and possible outcomes.

4. Explain Simpson's paradox with an example.

5. Consider as a test factor, a variable that represents the extent to which the respondent suffers
from chronic nervous disorders, such as sleeplessness, anxiety, and so on. Such conditions would
be likely to lead to absence from work. It is also quite conceivable that they could be caused in
part by stressful, low-status jobs. Therefore, assume that nervous disorders act as an intervening
variable. What will happen to the original relationship if we control for this test factor? Show
similar possible outcomes.

6. Explain longitudinal data collection in detail. Give examples of longitudinal studies.

7. Explain event history modelling. Discuss the various approaches to event history
modelling.

8. Perform the following and show the outputs:

(i) Randomly generate a normalized time series dataset using the NumPy library.

(ii) Plot the time series data using the seaborn library.

(iii) Generate an array of the cumulative sum of the data.

(iv) Plot the data using a time series plot.

9. Perform time series analysis (TSA) with Open Power System Data.

10. Describe the characteristics of time series data.

The Open Power System data consists of 4 columns: 'Consumption', 'Wind', 'Solar', and
'Wind+Solar'. Write the code to visualize the Open Power System dataset.

(i) Generate a line plot of the full time series of Germany's daily electricity consumption.

(ii) Plot the data for all the other columns.

(iii) Visualize the electricity consumption between 2016-12-23 and 2016-12-30.
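
A hedged sketch of the three plotting steps follows. Because the actual file is not part of this question bank, the code builds a synthetic stand-in DataFrame with the same four column names and a daily DatetimeIndex; the numbers are random placeholders, not real electricity data. With the real file, the stand-in would be replaced by something like `pd.read_csv(path, index_col=0, parse_dates=True)`, where `path` is a hypothetical location of the downloaded CSV.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open plot windows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (ASSUMPTION: placeholder values only).
rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
n = len(dates)
df = pd.DataFrame(
    {
        "Consumption": rng.uniform(1000, 1600, n),
        "Wind": rng.uniform(10, 500, n),
        "Solar": rng.uniform(5, 200, n),
    },
    index=dates,
)
df["Wind+Solar"] = df["Wind"] + df["Solar"]

# (i) Line plot of the full daily consumption series.
fig1, ax1 = plt.subplots(figsize=(11, 4))
df["Consumption"].plot(ax=ax1, linewidth=0.5)
ax1.set(title="Daily electricity consumption", ylabel="Consumption")

# (ii) One subplot for each of the other columns.
other_columns = ["Wind", "Solar", "Wind+Solar"]
axes = df[other_columns].plot(subplots=True, figsize=(11, 8), linewidth=0.5)

# (iii) Time-based slicing: both end dates are included in the selection.
week = df.loc["2016-12-23":"2016-12-30", "Consumption"]
fig3, ax3 = plt.subplots(figsize=(11, 4))
week.plot(ax=ax3, marker="o", linestyle="-")
ax3.set(title="Consumption, 2016-12-23 to 2016-12-30", ylabel="Consumption")
```

Step (iii) relies on the index being a DatetimeIndex, which is why `parse_dates=True` matters when loading the real file: date strings can then be used directly as slice bounds in `.loc`.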

11. The data given below relate to the percentage of households that are headed by a lone parent and
contain dependent children, and the percentage of households that have no car or van for each of the
ten Government Office Regions of England and Wales.

Summarize the linear relationship between the two interval-level variables and explain in detail.
Also discuss the rules to draw the line.

12. Describe contingency tables and percentage tables in detail, with suitable examples.
13. Explain the guidelines for constructing a lucid table of numerical data.

14. Examine the given data. Which group of women is most likely to have high levels of worry about
violent crime? And which group is least likely to have high levels of worry about violent crime?
Using the data in the table, create a simpler table with just three age groups: 16-34, 35-54, and 55+.
Decide which should be the reference or base category and use the table to construct a causal path
diagram.
Women          High levels of worry     Not worried        Total
age group      P        N               P       N          P    N
16-24          0.32     686             0.68    1,447      1    2,133
25-34          0.27     1,040           0.73    2,815      1    3,855
35-44          0.25     1,224           0.75    3,725      1    4,949
45-54          0.24     952             0.76    3,008      1    3,960
55-64          0.22     943             0.78    3,430      1    4,373
65-74          0.22     781             0.78    2,735      1    3,516
75 or older    0.14     511             0.86    3,044      1    3,555
Total                   6,137                   20,204          26,341
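
The collapsing step asked for in question 14 can be sketched with pandas as follows. The frequencies are transcribed from the table above; the column names (`high_worry`, `not_worried`) and the mapping dictionary are naming choices made for this illustration. The sketch only rebuilds the three-group table and does not decide the reference category.

```python
import pandas as pd

# Frequencies (N) transcribed from the table above.
table = pd.DataFrame(
    {
        "high_worry": [686, 1040, 1224, 952, 943, 781, 511],
        "not_worried": [1447, 2815, 3725, 3008, 3430, 2735, 3044],
    },
    index=["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75 or older"],
)

# Map each original age group to one of the three broader groups.
broad_group = {
    "16-24": "16-34", "25-34": "16-34",
    "35-44": "35-54", "45-54": "35-54",
    "55-64": "55+", "65-74": "55+", "75 or older": "55+",
}

# Sum the frequencies within each broader group.
collapsed = table.groupby(table.index.map(broad_group)).sum()
collapsed["total"] = collapsed["high_worry"] + collapsed["not_worried"]

# Proportion with high levels of worry in each broader group.
collapsed["p_high_worry"] = collapsed["high_worry"] / collapsed["total"]

print(collapsed.round(2))
```

The proportions are recomputed from the summed frequencies rather than by averaging the original proportions, because the original age groups have different sizes. They come out at roughly 0.29, 0.24, and 0.20 for the three groups.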

15. Consider an imaginary piece of research in which 100 men and 100 women are asked about their
fear of walking alone after dark. Until we conduct the survey, we have no information other than the
number of men and women in our sample (see the table below).

Feeling safe walking alone after dark by gender

          Very safe, fairly safe,   Very unsafe   Total
          or a bit unsafe
          P       N                 P      N      P    N
Male      ?       ?                 ?      ?      1    100
Female    ?       ?                 ?      ?      1    100
Total     ?       ?                 ?      ?      1    200
Find the following:
a. After the survey, imagine that we find that in total 20 individuals, i.e., 0.1 of the sample,
state that they feel very unsafe when walking alone after dark. Add this information to the
given table.
b. If, in the population as a whole, the proportion of men who feel very unsafe walking alone
after dark is the same as the proportion of women who feel very unsafe walking alone after
dark, we would expect this to be reflected in our sample survey. Find the expected
proportions and frequencies.
c. Find the observed values after the survey has been carried out and fear of walking alone
after dark has been cross-tabulated by gender.
d. Compute the chi-square.
Briefly explain the statistically significant relationship between the variables.
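
The arithmetic behind parts (b) and (d) can be sketched as below. The row and column totals come from the question (100 men, 100 women, 20 respondents feeling very unsafe). The observed counts are HYPOTHETICAL, chosen only to make the calculation concrete, because the question does not supply them.

```python
import numpy as np

# Marginal totals taken from the question.
row_totals = np.array([100, 100])   # men, women
col_totals = np.array([180, 20])    # not very unsafe, very unsafe
n = row_totals.sum()                # 200 respondents

# (b) Expected frequencies if gender and fear were unrelated:
# expected = row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

# HYPOTHETICAL observed counts (rows: men, women), for illustration only.
observed = np.array([[95, 5],
                     [85, 15]])

# (d) Chi-square statistic: sum over cells of (O - E)^2 / E.
chi_square = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("Expected frequencies:")
print(expected)
print("Chi-square:", round(float(chi_square), 3), "with", dof, "degree of freedom")
```

Under the no-association assumption, each gender is expected to contain 10 respondents who feel very unsafe and 90 who do not. The computed statistic would then be compared with the critical value for 1 degree of freedom (3.841 at the 0.05 level) to judge statistical significance.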

16. Explain the following:


a. Null Hypothesis
b. Type 1 and Type 2 errors

c. Degrees of freedom

17. Describe the essentials of interpreting contingency tables.


18. Consider the problem of feeling safe walking alone after dark by gender (sample restricted to those
of Black Caribbean ethnic origin).

For the above table, chi-square is calculated using SPSS.

Describe the probability of making a Type 1 error.


10. Explain the characteristics of time series data.
11. Explain data cleaning and time-based indexing.
19. Explain boxplots in detail with an example.
Explain fitting a resistant line with the data given below.
Characteristics of time series data – Data Cleaning – Time-based indexing – Visualizing – Grouping – Resampling
