CONNECT WITH US
WEBSITE: www.eduengineering.in
TELEGRAM: @eduengineering
Best website for Anna University Affiliated College Students
Regular Updates for all Semesters
All Department Notes AVAILABLE
All Lab Manuals AVAILABLE
Handwritten Notes AVAILABLE
Printed Notes AVAILABLE
Past Year Question Papers AVAILABLE
Subject wise Question Banks AVAILABLE
Important Questions for Semesters AVAILABLE
Various Author Books AVAILABLE
UNIVARIATE ANALYSIS Introduction to Single variable: Distributions and Variables -
Numerical Summaries of Level and Spread - Scaling and Standardizing – Inequality - Smoothing
Time Series.
1. What is a case and variable?
A Dataset consists of cases. Cases are nothing but the objects in the collection or are the basic
units of analysis, the things about which information is collected. Each case has one or more
attributes or qualities, called variables which are characteristics of cases.
Example: The following dataset contains 6 cases and 3 variables:
2. What is sample and population?
A population is the entire group that you want to draw conclusions about.
A sample is the specific group that you will collect data from. The size of the sample is always
less than the total size of the population.
Population vs sample
Population | Sample
Advertisements for IT jobs in the Netherlands | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020
Undergraduate students in the Netherlands | 300 undergraduate students from three Dutch universities who volunteer for your psychology research study
3. What is random sampling?
Random sampling is a part of the sampling technique in which each sample has an equal
probability of being chosen. A sample chosen randomly is meant to be an unbiased representation
of the total population.
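A minimal sketch of simple random sampling, assuming a hypothetical population of 1,000 student ID numbers (the IDs, the sample size of 50 and the seed are illustrative choices, not part of the notes):

```python
import random

# Hypothetical population: ID numbers for 1,000 students (illustrative only).
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the draw can be repeated
sample = rng.sample(population, k=50)  # simple random sample without replacement

print(len(sample), "cases drawn from a population of", len(population))
```

Because every case has the same chance of selection, no group in the population is systematically favoured.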
4. Explain the household Survey in London?
Household surveys, such as the General Household Survey (GHS), have a long tradition in Britain.
Historically, the details were usually collected from employers, rather than from the
'untrustworthy' testimony of the poor themselves. The GHS is conducted by the Office for National
Statistics (ONS). The information collected includes household, family and individual information.
Government departments and other organizations use this information for planning, policy and
monitoring purposes, and to present a picture of households, families and people in Great Britain.
A simple data matrix is shown in the following table.
The survey classifies drinkers into the following categories.
1. Hardly drink at all
2. Drink a little
3. Drink a moderate amount
4. Drink quite a lot
5. Drink heavily
Also, the National Statistics Socio-economic Classification (NS-SEC), or Social Class based on
Occupation (SC), or Social Class Classification, or Socio-economic Groups, uses five classes.
1. Managerial and Professional Occupations.
2. Intermediate Occupations.
3. Small employers and own account workers.
4. Lower supervisory and technical occupations.
5. Semi-routine occupations.
The numbers -1 and -9 are frequently used to represent a missing value. There are often a number
of different reasons why a case may have missing data on a particular variable. One reason is that
some householders are not willing to disclose the information.
When numbers are used to represent categories that have no inherent order, this is called a nominal
scale. When the numbered categories do have an inherent order, as with the drinking classification,
this is called an ordinal scale. When numbers are used to convey full arithmetic properties, this is
called an interval scale. Many of the variables used by social scientists are measured on nominal or
ordinal scales (categorical variables), rather than interval scales (also referred to as continuous
variables).
5. How to reduce the number of digits?
Two different mechanisms are available in statistics. The first is rounding the variable values to
the nearest number.
Example: 199.99 may be rounded to 200.00
The second method is cutting or truncating the number.
Example: 899.9 becomes 899.
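The difference between the two methods can be sketched in a few lines of Python, using the two values from the examples above (rounding to whole numbers is an assumption made for the illustration):

```python
import math

values = [199.99, 899.9]

rounded = [round(v) for v in values]         # nearest whole number
truncated = [math.trunc(v) for v in values]  # digits after the point are cut off

print(rounded)    # [200, 900]
print(truncated)  # [199, 899]
```

Rounding can move a value up or down, whereas truncating a positive number always moves it down.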
6. Draw the bar chart and pie chart for the above survey dataset.
A bar chart is a display of bars, one for each category of a variable, such that the length of each
bar is proportional to the number of cases in the category. For instance, a bar chart of the drinking
classification variable is shown in the following figure.
A pie chart can also be used to display the same information. In general, pie charts are to be
preferred when there are only a few categories and when the sizes of the categories are very
different. The display of pie chart for the drinking classification is shown in the following charts.
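As a rough illustration of how the bars are built, the sketch below counts the cases in each category and prints a text bar whose length is proportional to the count. The responses are hypothetical, not the real survey counts; only the five category labels come from the drinking classification above.

```python
from collections import Counter

# Category codes 1-5 follow the drinking classification in these notes.
labels = {
    1: "Hardly drink at all",
    2: "Drink a little",
    3: "Drink a moderate amount",
    4: "Drink quite a lot",
    5: "Drink heavily",
}

# Hypothetical responses (illustrative only, not the survey data).
responses = [1, 2, 2, 3, 3, 3, 3, 2, 4, 1, 3, 2, 5, 3, 2]

counts = Counter(responses)
for code, label in labels.items():
    n = counts[code]
    share = 100 * n / len(responses)
    # Bar length is proportional to the number of cases in the category.
    print(f"{label:<24} {'#' * n} {n} ({share:.0f}%)")
```

The percentage shares printed at the end of each line are the same quantities a pie chart would show as slices.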
7. What is Histogram and explain?
Charts that are somewhat similar to bar charts can be used to display interval level variables
grouped into categories and these are called histograms. They are constructed in exactly the same
way as bar charts except, of course, that the ordering of the categories is fixed, and care has to be
taken to show exactly how the data were grouped. A sample General Household Survey of age is
shown in the following figure.
Histograms allow inspection of four important aspects of any distribution:
• Level: What are typical values in the distribution?
• Spread: How widely dispersed are the values? Do they differ very much from one another?
• Shape: Is the distribution flat or peaked? Symmetrical or skewed?
• Outliers: Are there any particularly unusual values?
8. Explain the typical shapes of a frequency polygon or histogram.
The typical shapes of a frequency polygon or histogram are shown in the following figure.
9. What are all the characteristics of normal distributions?
Characteristics of Normal Distribution
The normal distribution has the following characteristics that distinguish it from other forms of
probability distribution:
• Empirical Rule: In a normal distribution, about 68% of the observations lie within ± one standard deviation of the mean, about 95% lie within ± two standard deviations, and about 99.7% lie within ± three standard deviations.
• Bell-shaped Curve: Most of the values lie at the center, and fewer values lie at the tail extremities. This results in a bell-shaped curve.
• Mean and Standard Deviation: The shape of the distribution is determined entirely by its mean and standard deviation.
• Equal Central Tendencies: The mean, median, and mode of the distribution are equal.
• Symmetric: The normal distribution curve is symmetric about its center. Therefore, half of the values are to the left of the center, and the remaining half are to the right.
• Skewness and Kurtosis: Skewness measures the asymmetry of a distribution; for a normal distribution it is zero. Kurtosis describes the tails of a distribution; for a normal distribution it is 3.
• Total Area = 1: The total area under the curve of this probability density function is one. For the standard normal distribution, the mean is zero and the standard deviation is one.
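The empirical rule can be checked numerically. The sketch below uses the standard relationship between the normal distribution and the error function; the helper name within_k_sd is an illustrative choice.

```python
import math

def within_k_sd(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} SD: {within_k_sd(k):.4f}")
# within ±1 SD: 0.6827
# within ±2 SD: 0.9545
# within ±3 SD: 0.9973
```

These are the exact proportions that the 68%, 95% and 99.7% figures approximate.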
10. Explain the frequency distribution of data sets.
A frequency distribution is a collection of observations produced by sorting observations into
classes and showing their frequency of occurrence in each class.
A frequency distribution helps us to detect any pattern in the data (assuming a pattern exists)
by superimposing some order on the inevitable variability among observations.
• It can be represented as graph or table.
• Example - (Frequency Distributions)
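A frequency distribution in table form can be sketched as follows. The weights and the class width of 10 are hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical weights in kg (illustrative data, not from the notes).
weights = [52, 58, 61, 63, 64, 67, 68, 70, 71, 75, 79, 84]
width = 10  # assumed class width

# Map each observation to the lower bound of its class, e.g. 63 -> 60.
freq = Counter((w // width) * width for w in weights)

for lower in sorted(freq):
    print(f"{lower}-{lower + width - 1}: {freq[lower]}")
# 50-59: 2
# 60-69: 5
# 70-79: 4
# 80-89: 1
```

Sorting the observations into classes makes the pattern (most weights in the 60s and 70s) visible at a glance.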
11. Explain the Household level Income data in UK?
The General Household Survey also collects data at the level of the household. In this survey,
the household data include weekly household income, car ownership and the number of persons in the
household. The region of each household is coded as follows:
1. North East
2. North West
3. Yorks and Humber
4. East Midlands
5. West Midlands
6. Eastern
7. London
8. South East
9. South West
10. Wales
11. Scotland
The bar chart of household weekly income is shown in the following diagram.
12. What is SPSS?
SPSS (Statistical Package for the Social Sciences) is a statistical analysis package that can also
produce good graphical displays. It is frequently used by researchers in market research companies,
local authorities, health authorities and government departments.
13. How to access the SPSS through Data Editor?
When you first start using the program, don't be overwhelmed by the number of different menus
and options that are available. Rather than trying to discover and understand all the facilities that
SPSS provides, it is better to start by focusing on mastering just a few procedures. The Data Editor
screen is shown in the following figure.
SPSS has three main windows:
• The Data Editor window
• The Output window
• The Syntax window
When you first open SPSS, the Data Editor window will be displayed. This will be empty until
you either open an existing data file or type in your own data - in the same way that you would
enter data into a spreadsheet like Excel. You can open an existing data file using the File menu
and then selecting 'Open'
and 'Data'. You are then able to browse the directories on your computer until you find the data
file that you need.
The data in the SPSS Data Editor are displayed in rows and columns. Each row provides the
information about a single case in the dataset. As we have seen in this chapter this could be an
individual person or a household. Each column comprises the information about a specific
variable, and the name of the variable appears at the top of each column. The menus across the top
of the Data Editor allow you to access a range of procedures so that you can analyse your data,
modify your data and produce tables, pie charts, histograms and other graphical displays.
When you use SPSS to produce a graph, a table or some statistical analysis, your results will appear
in an output viewer window. It is straightforward to copy and paste results from the output viewer
into a word processing package so that you can integrate tables and charts into a report.
There are two ways of getting SPSS to perform procedures for you. One is to use the menus and
then the Dialog boxes that SPSS provides to choose exactly the variables that you want to work
with. The second is to type instructions into the SPSS Syntax window. SPSS syntax consists of
keywords and commands that need to be entered very precisely and in the correct order.
In order to produce a pie chart of drinking behavior, similar to that shown in figure 1.3, from the
menus along the top, choose 'Graphs' and then 'Pie'.
An SPSS 'dialogue box' will now appear. By default, this specifies that the data in the chart
represent summaries for groups of cases and this is what you want. Next click on the 'Define'
button.
Finally, if you click on the button in the top right corner labelled 'OK', SPSS will automatically
open an Output Viewer window for you, which will display your first pie chart. This is shown in
the above figure.
Drawing Bar chart
In the next dialogue box, you can specify for which variable you want a bar chart displayed and
also choose whether the bars represent the number of cases (N of cases) or the percentage of cases
(% of cases) in each category of the variable. Select the variable genhlth from the variable list and
move it to the 'Category Axis' window by clicking on the arrow button. Finally click on the OK
button and the bar chart for the variable genhlth will appear in the SPSS output window.
The SPSS package itself provides a good introduction to all the main aspects of the program via
the Tutorials. In order to view a tutorial, choose 'Tutorial' from the Help menu (see below).
Defining missing values
Before producing graphical displays of single variables in SPSS you will normally have to tell the
computer which values of the variable correspond to 'missing values'. To do this, in the variable
view of the data editor in SPSS click on the appropriate row 'genhlth' of the column headed
'Missing' and use the dialogue box that appears to specify the three missing values, -9, -8 and -6
and then click the OK button.
Next, from the menus along the top, choose 'Graphs' and then 'Bar'. From the dialogue box that
appears, choose the first option 'Simple' and specify that data in the chart are summaries for groups
of cases. Next click on the 'Define' button.
14. Compare the General Household Survey?
The histograms of the working hours distributions of men and women in the 2005 General
Household Survey are shown in the figures.
• The male batch is at a higher level than the female batch
• The two distributions are somewhat similarly spread out.
• The female batch is bimodal suggesting there are two rather different underlying populations.
• The male batch is unimodal.
15. What is the difference between the sample and population?
A population is the entire group that you want to draw conclusions about. A sample is the specific
group that you will collect data from. The size of the sample is always less than the total size of
the population.
16. What are residuals?
A residual can be defined as the difference between a data point and the observed typical, or
average, value. For example, if we had chosen 40 hours a week as the typical level of men's
working hours, using data from the General Household Survey in 2005, then a man who was
recorded in the survey as working 45 hours a week would have a residual of 5 hours. Another way
of expressing this is to say that the residual is the observed data value minus the predicted value
and in this case 45-40 = 5.
17. What is mode? Give example.
The mode reflects the value of the most frequently occurring score or the mode is the value that
appears most often in a set of data. The mode of a discrete probability distribution is the value x at
which its probability mass function takes its maximum value.
In this case there exist two modes, 4 and 8: 4 occurs 7 times and 8 occurs 6 times. Such a
distribution is known as bimodal. Bimodal describes any distribution with two obvious peaks.
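A small sketch of finding the most frequent values, assuming a hypothetical set of scores constructed so that 4 occurs 7 times and 8 occurs 6 times (the remaining scores are invented for illustration):

```python
from collections import Counter
from statistics import multimode

# Hypothetical scores: 4 occurs 7 times, 8 occurs 6 times, others less often.
scores = [4] * 7 + [8] * 6 + [2, 3, 5, 6, 6, 7, 9]

print(Counter(scores).most_common(2))  # [(4, 7), (8, 6)]
print(multimode(scores))               # [4]
```

Note that `multimode` reports only the value(s) with the single highest frequency, so it returns 4 here. Describing the batch as bimodal is a statement about its two peaks, which the frequency counts reveal.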
18. What is median? Give example.
The median reflects the middle value when observations are ordered from least to most.
19. How to compute or find median?
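One common procedure, sketched here under the assumption that the middle two values are averaged when the number of observations is even, is to sort the observations and pick the middle one. The function name find_median and the data are illustrative.

```python
from statistics import median

def find_median(values):
    """Return the middle value of the ordered observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the middle pair

odd_batch = [7, 3, 9, 1, 5]
even_batch = [7, 3, 9, 1, 5, 11]

print(find_median(odd_batch))   # 5
print(find_median(even_batch))  # 6.0
```

The standard library function `statistics.median` follows the same rule and can be used directly.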
20. How to compute mean value?
The mean is found by adding all scores and then dividing by the number of scores.
21. What is the difference between the population mean and sample mean?
Sample Mean (X̄): The balance point for a sample, found by dividing the sum for the values
of all scores in the sample by the number of scores in the sample.
Population Mean (μ): The balance point for a population, found by dividing the sum for all
scores in the population by the number of scores in the population.
Population Size (N): The total number of scores in the population.
22. What is the purpose and nature of mean?
The mean serves as the balance point for its distribution because of a special property:
The sum of all scores, expressed as positive and negative deviations from the mean, always equals
zero.
The mean reflects the values of all scores, not just those that are middle ranked (as with the
median), or those that occur most frequently.
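The balance-point property can be verified on a small batch. The five scores below are hypothetical weekly working hours.

```python
# Hypothetical weekly working hours (illustrative only).
scores = [35, 38, 40, 42, 45]

mean = sum(scores) / len(scores)         # add all scores, divide by their number
deviations = [x - mean for x in scores]  # positive and negative distances from the mean

print(mean)             # 40.0
print(deviations)       # [-5.0, -2.0, 0.0, 2.0, 5.0]
print(sum(deviations))  # 0.0
```

The negative deviations exactly cancel the positive ones, which is why the mean acts as the balance point.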
23. How to Interpret the Differences between Mean and Median?
Ideally, when a distribution is skewed, report both the mean and the median.
Appreciable differences between the values of the mean and median signal the presence of a
skewed distribution.
If the mean exceeds the median, as it does for the infant death rates, the underlying distribution is
positively skewed because of one or more scores with relatively large values, such as the very high
infant death rates for a number of countries, especially Sierra Leone.
On the other hand, if the median exceeds the mean, the underlying distribution is negatively
skewed because of one or more scores with relatively small values.
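The comparison can be sketched with a hypothetical batch containing one very large value (the numbers are invented, not the infant death rates referred to above):

```python
from statistics import mean, median

# Hypothetical rates with one unusually large value (illustrative only).
rates = [4, 5, 6, 7, 8, 10, 12, 95]

m = mean(rates)     # pulled upwards by the value 95
md = median(rates)  # unaffected by how large the top value is

print(m, md)  # 18.375 7.5
if m > md:
    print("mean exceeds median: positively skewed")
elif m < md:
    print("median exceeds mean: negatively skewed")
else:
    print("mean equals median: no skew indicated")
```

Here the mean is more than twice the median, signalling a strong positive skew.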
Special Status of the Mean
As has been seen, the mean sometimes fails to describe the typical or middle-ranked value of a
distribution.
Therefore, it should be used in conjunction with another average, such as the median.
In the long run, however, the mean is the single most preferred average for quantitative data.
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and mean are
similar, and any of them can be used to describe the central tendency of the distribution.
24. What is mid-spread and quartiles?
The range of the middle 50 per cent of the distribution is a commonly used measure of spread
because it concentrates on the middle cases. It is quite stable from sample to sample. The points
which divide the distribution into quarters are called the quartiles (or sometimes 'hinges' or
'fourths'). The lower quartile is usually denoted QL and the upper quartile QU. (The middle quartile
is of course the median.) The distance between QL and QU is called the mid-spread (sometimes
the 'interquartile range'), or the dQ for short.
25. What are all the methods for computing variability?
Three measures serve as valid measures of variability: the first is the interquartile range (IQR),
the second is the variance and the third is the standard deviation.
The most important roles are reserved for the variance and particularly for its square root, the
standard deviation, because these measures serve as key components for other important statistical
measures. The variance and standard deviation occupy the same exalted position among measures of
variability as does the mean among measures of central tendency.
26. Explain the method of computation of the IQR with an example.
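A minimal sketch of the computation, assuming a hypothetical batch of 11 weekly working hours and the default quartile convention of Python's `statistics.quantiles` (other textbooks and packages use slightly different conventions, so the quartiles may differ a little on other data):

```python
from statistics import quantiles

# Hypothetical weekly working hours (illustrative only).
hours = [30, 35, 37, 38, 40, 40, 42, 45, 48, 50, 55]

q_lower, q_median, q_upper = quantiles(hours, n=4)  # the three quartiles
iqr = q_upper - q_lower  # range of the middle 50 per cent of the batch

print(q_lower, q_median, q_upper)  # 37.0 40.0 48.0
print(iqr)                         # 11.0
```

The IQR ignores the lowest and highest quarters of the batch, so it is not affected by extreme values.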
27. What is standard deviation and variance? Explain with an example.
The standard deviation essentially calculates a typical value of the distances of the data values
from the mean. It is conventionally denoted s, and defined as:
$$ s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}} $$
The deviations from the mean are squared, summed and divided by the sample size (well, N - 1
actually, for technical reasons), and then the square root is taken to return to the original units. The
order in which the calculations are performed is very important. As always, calculations within
brackets are performed first, then multiplication and division, then addition (including summation)
and subtraction. Without the square root, the measure is called the variance, s^2.
The layout for a worksheet to calculate the standard deviation of the hours worked by this small
sample of men is shown in the table.
The original data values are written in the first column, and the sum and mean calculated at the
bottom. The residuals are calculated and displayed in column 2, and their squared values are placed
in column 3. The sum of these squared values is shown at the foot of column 3, and from it the
standard deviation is calculated.
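The worksheet layout described above can be mirrored in code. The hours below are hypothetical, since the original table is not reproduced here; the three lists correspond to the three worksheet columns.

```python
import math
from statistics import stdev

# Column 1: hypothetical hours worked by a small sample of men.
hours = [35, 38, 40, 42, 45]
n = len(hours)
mean = sum(hours) / n

# Column 2: residuals (data value minus mean).
residuals = [y - mean for y in hours]
# Column 3: squared residuals.
squared = [r ** 2 for r in residuals]

variance = sum(squared) / (n - 1)  # divide by N - 1, not N
s = math.sqrt(variance)            # square root returns to the original units

print(sum(squared))  # 58.0
print(variance)      # 14.5
print(round(s, 2))   # 3.81
```

The result agrees with the standard library function `statistics.stdev`, which also divides by N - 1.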
28. What are Cumulative Frequency Distributions? Give an example.
A cumulative frequency distribution is a frequency distribution showing the total number of
observations in each class and all lower-ranked classes.
The weights of different persons, with their frequency, cumulative frequency and cumulative
percent, are shown in the following table.
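A sketch of how the cumulative columns are built from the frequency column, assuming hypothetical weight classes and frequencies (the original table is not reproduced here):

```python
from itertools import accumulate

# Hypothetical weight classes (kg) and their frequencies, lowest class first.
classes = ["50-59", "60-69", "70-79", "80-89"]
freqs = [2, 5, 4, 1]

cum_freqs = list(accumulate(freqs))  # running total of the frequencies
total = cum_freqs[-1]
cum_pcts = [100 * cf / total for cf in cum_freqs]

for c, f, cf, cp in zip(classes, freqs, cum_freqs, cum_pcts):
    print(f"{c}  f={f}  cum f={cf}  cum %={cp:.1f}")
```

The last cumulative frequency always equals the total number of observations, and the last cumulative percent is always 100.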
29. What is Percentile Rank of an Observation?
The percentile rank of a score indicates the percentage of scores in the entire distribution with
similar or smaller values than that score.
The percentile rank (PR) of a score is the percentage of scores in its frequency distribution that
are less than that score. Its mathematical formula is
$$ PR = \frac{CF - 0.5 \times F}{N} \times 100 $$
where CF, the cumulative frequency, is the count of all scores less than or equal to the score of
interest, F is the frequency for the score of interest, and N is the number of scores in the
distribution.
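A minimal sketch of the calculation, using the commonly quoted form PR = (CF - 0.5 × F) / N × 100 with CF, F and N as defined above. The function name and the exam scores are hypothetical.

```python
def percentile_rank(scores, x):
    """Percentile rank of score x, counting half of the scores tied with x."""
    cf = sum(1 for s in scores if s <= x)  # cumulative frequency up to and including x
    f = scores.count(x)                    # frequency of x itself
    n = len(scores)                        # number of scores in the distribution
    return 100 * (cf - 0.5 * f) / n

# Hypothetical exam scores (illustrative only).
scores = [50, 60, 60, 70, 80, 80, 80, 90, 95, 100]

print(percentile_rank(scores, 80))  # 55.0
```

For a score of 80, CF = 7, F = 3 and N = 10, giving (7 - 1.5) / 10 × 100 = 55.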
30. What are extremes?
The top and bottom data points are referred to as the extremes.
31. What are deciles?
The deciles are the points which divide the distribution into ten equal parts.
32. What are the percentiles?
The percentiles are the points which divide the distribution into one hundred equal parts.
33. Explain the term quantiles.
The general word given to such dividing points is quantiles. Deciles are the quantiles at depth
N/10, the percentiles are the quantiles at depth N/100, and so on.
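Quantiles of any order can be computed with the same function. The sketch below assumes a hypothetical batch of the whole numbers 1 to 200 and the default interpolation convention of Python's `statistics.quantiles`.

```python
from statistics import median, quantiles

# Hypothetical batch of 200 data values (illustrative only).
data = list(range(1, 201))

deciles = quantiles(data, n=10)       # 9 cut points -> 10 equal-sized groups
percentiles = quantiles(data, n=100)  # 99 cut points -> 100 equal-sized groups

print(len(deciles), len(percentiles))  # 9 99
# The 5th decile and the 50th percentile both coincide with the median.
print(deciles[4], percentiles[49], median(data))
```

Dividing a distribution into k groups always needs k - 1 dividing points, which is why there are nine deciles and ninety-nine percentiles.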
34. Explain the Individual and aggregate level data with suitable example.
'Micro' or 'individual level' data are the specific entries for individual cases. Because the
General Household Survey data are micro data, it is possible to extract a small sample and examine
the actual working hours of specific individuals within the dataset.
Aggregate means that some analysis has already been carried out, and that the data are summarized
in some way rather than being provided in a raw form. For example, in contrast to the General
Household Survey, data from the Annual Survey of Hours and Earnings in Britain (which replaced
the New Earnings Survey in 2004) are not generally available at the individual level.
35. What are all the duties of Data Analysts.
Data analysts have to learn to be critical of the measures available to them, but in a constructive
manner. As well as asking 'Are there any errors in this measure?' we also have to ask 'Is there
anything better available?' and, if not, 'How can I improve what I've got?'
36. Explain the concept of Adding or subtracting a constant.
One way of focusing attention on a particular feature of a dataset is to add or subtract a constant
from every data value. For example, in a set of data on weekly family incomes it would be possible
to subtract the median from each of the data values, thus drawing attention to which families had
incomes below or above a hypothetical typical family.
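The effect of subtracting the median can be sketched with a hypothetical batch of weekly family incomes (the figures are invented for illustration):

```python
from statistics import median

# Hypothetical weekly family incomes in pounds (illustrative only).
incomes = [220, 310, 400, 450, 520, 610, 980]

typical = median(incomes)                  # 450
centered = [y - typical for y in incomes]  # negative = below the typical family

print(centered)  # [-230, -140, -50, 0, 70, 160, 530]
# The level moves to zero, but the spread is unchanged.
print(max(incomes) - min(incomes), max(centered) - min(centered))  # 760 760
```

Subtracting a constant shifts every value by the same amount, so the level changes while distances between cases stay the same.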
37. Explain the concept of Multiplying or dividing by a constant.
We could change each data point by multiplying or dividing it by a constant. A common example
of this is the re-expression of one currency in terms of another. For example, in order to convert
pounds to US dollars, the pounds are multiplied by the current exchange rate. Multiplying or
dividing each of the values has a more powerful effect than adding or subtracting. The result of
multiplying or dividing by a constant is to scale the entire variable by a factor, evenly stretching
or shrinking the axis like a piece of elastic. To illustrate this, let us see what happens if data from
the General Household Survey on the weekly alcohol consumption of men who classify themselves
as moderate or heavy drinkers are divided by seven to give the average daily alcohol consumption.
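The division by seven can be sketched as follows. The weekly consumption figures are hypothetical, not the survey values.

```python
from statistics import median

# Hypothetical weekly alcohol consumption in units (illustrative only).
weekly = [14, 21, 28, 35, 56]

daily = [w / 7 for w in weekly]  # re-express as average daily consumption

print(daily)                                 # [2.0, 3.0, 4.0, 5.0, 8.0]
print(median(weekly), median(daily))         # 28 4.0
print(max(weekly) - min(weekly), max(daily) - min(daily))  # 42 6.0
```

Unlike adding or subtracting a constant, dividing by seven changes both the level (median 28 becomes 4) and the spread (range 42 becomes 6) by the same factor.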
38. Explain the variables Standardization.
To standardize a variable, a typical value is first subtracted from each data point, and then each
point is divided by a measure of spread. It is not crucial which numerical summaries of level and
spread are picked. The mean and standard deviation could be used, or the median and mid-spread.
A variable which has been standardized in this way is forced to have a mean or median of 0 and a
standard deviation or mid-spread of 1.
This is achieved by standardizing each score. One common way of standardizing is to first subtract
the mean from each data value, and then divide the result by the standard deviation. This process
is summarized by the following formula, where the original variable 'Y' becomes the standardized
variable 'Z':
$$ Z_i = \frac{Y_i - \bar{Y}}{s} $$
This is shown in the following example.
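A minimal sketch of standardizing with the mean and standard deviation, assuming a hypothetical batch of five values of Y:

```python
from statistics import mean, stdev

# Hypothetical values of the original variable Y (illustrative only).
y = [35, 38, 40, 42, 45]

y_bar = mean(y)
s = stdev(y)  # sample standard deviation (divides by N - 1)

# Subtract the mean from each value, then divide by the standard deviation.
z = [(value - y_bar) / s for value in y]

print([round(v, 2) for v in z])
print(round(mean(z), 6), round(stdev(z), 6))  # mean of Z is 0, SD of Z is 1
```

Whatever the units of Y, the standardized variable Z is expressed in standard deviations from the mean, which makes different variables directly comparable.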
39. Explain about the Gaussian distribution.
The Gaussian distribution is also known as the normal distribution. It is a bell-shaped curve,
not a flat (uniform) distribution.
In order to summarize the shape of a distribution succinctly, it would need to be simple enough to
be able to specify how it should be drawn in a very few statements. For example, if the distribution
were completely flat (a uniform distribution), this would be possible. We would only need to
specify the value of the extremes and the number of cases for it to be reproduced accurately, and
it would be possible to say exactly what proportion of the cases fell above and below a certain
level. The Gaussian distribution is another shape that can be specified this simply: it is a
symmetrical, bell-shaped curve which contains fixed proportions of the distribution at different
distances from the center. The two curves in the following figure look different - (a) has a smaller
spread than (b) - but in fact they only differ by a scaling factor.
Example for standard deviation.
40. Explain the Standardizing distributions with respect to an appropriate base.
In the scaling and standardizing techniques considered up to now, the same numerical adjustment
has been made to each of the values in a batch of data. Sometimes, however, it can be useful to
make the same conceptual adjustment to each data value, which may involve a different number
in each case.
As the figures stand, the most dominant feature of the dataset is a rather uninteresting one: the
change in the value of the pound. While the median and mid-spreads of the money incomes each
year have increased substantially in this period, real incomes and differentials almost certainly
have not. How could we present the data in order to focus on the trend in real income differentials
over time?
A dataset which can be viewed from several angles is shown in figure 3.13: the value of the lower
quartile, the median and the upper quartile of male and female earnings in the period between 1990
and 2000. The data are drawn from the New Earnings Survey that collects information about
earnings in a fixed period each year from the employers of a large sample of employees.
One approach would be to treat the distribution of incomes for each sex in each year as a separate
distribution, and express each of the quartiles relative to the median. The result of doing this is
given in figure 3.14. The figure of 75 for the QL for men in 1990, for example, was obtained by
dividing £193 by £258 and multiplying the result by 100. All of the results have been rounded to
the nearest whole number. When incomes are expressed relative to the median in this way, the male
and female distributions are almost identical to one another. This is one method of standardization
with respect to a base.
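The calculation for the 1990 male lower quartile quoted above can be reproduced directly. The function name relative_to_median is an illustrative choice.

```python
def relative_to_median(value: float, median_value: float) -> int:
    """Express a value as a percentage of the median, rounded to a whole number."""
    return round(100 * value / median_value)

q_lower_1990 = 193  # lower quartile of male earnings, from the text (pounds)
median_1990 = 258   # median of male earnings, from the text (pounds)

print(relative_to_median(q_lower_1990, median_1990))  # 75
print(relative_to_median(median_1990, median_1990))   # 100
```

On this scale the median is always 100, so changes in the value of the pound drop out and only the relative position of the quartiles remains.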