1 - Basic Concepts
Statistics
Objectives
At the end of this chapter, the students will be able to explain basic
statistical concepts and measures. Specifically,
1. Define statistics;
(https://onlinecourses.science.psu.edu/stat200/node/113)
What is Statistics?
The art and science of designing studies and analyzing the
data that those studies produce. Its ultimate goal is
translating data into knowledge and understanding of the
world around us. In short, statistics is the art and science
of learning from data.
(Agresti and Franklin, 2013)
What is statistics?
Statistics (Singular)
science that deals with techniques for collecting,
presenting, analyzing, and drawing conclusions from data
science of data
Statistics (Plural)
numerical descriptions by which we enhance understanding
of data
summary measures used to describe a sample
Example
1. Statistics are facts or data, either numerical or
nonnumerical.
2. Statistics is the science of organizing and
summarizing numerical or nonnumerical
information.
Variables
When we obtain our sample, we obtain data values on one or
more variables.
A variable is a characteristic or attribute that varies from one
entity to another (tensile strength, number of buildings on the VSU
campus, income of engineers).
A qualitative variable is one that classifies, identifies, or describes
an element of a sample or population (sex of an engineer,
academic rank of faculty).
A quantitative variable is one that quantifies an element of a
sample or population (age, length of service).
A quantitative variable that can assume only a finite or countably
infinite number of possible values (usually integers) is called a
discrete variable (no. of board passers, enrolment in BSCE per
semester)
A quantitative variable that can theoretically assume any value in a
specified interval (i.e., continuum) is called a continuous variable
(temperature, wind speed, weight of beams)
in practice, measuring instruments limit the number of decimal places
recorded for values of continuous variables
Summary
Variables
    Qualitative
    Quantitative
        Discrete
        Continuous
Measurement
the process or rule of assigning labels or values to a variable
Importance:
Identifying the level of measurement of a variable is one of
the factors in choosing a statistical method
Levels or Scales of Measurement
Ratio
Interval
Ordinal
Nominal
Nominal
objects are classified into categories based on some defined
characteristics
categories are mutually exclusive
numbers are sometimes used as category codes but arithmetic
should not be performed on these number codes
frequencies or counts of observations that belong to each
category are usually obtained to summarize the data
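As a minimal sketch of summarizing nominal data by frequency counts, the Python snippet below tallies a short list of blood types. The data values are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical nominal data: blood types of ten people (illustrative only)
blood_types = ["O", "A", "B", "O", "AB", "A", "O", "B", "O", "A"]

# Count how many observations fall in each category
counts = Counter(blood_types)

for category, frequency in sorted(counts.items()):
    print(category, frequency)
```

Only counting is performed here; no arithmetic is done on the category labels themselves.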
Examples (interval scale):
temperature (in °C), achievement test score, IQ
Ratio
Has all the properties of the interval scale
the zero point reflects an absence of the characteristic (absolute
zero point)
ratios of two values in the scale are meaningful
Examples:
weight, price, distance
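The difference between the interval and ratio scales can be illustrated with a small calculation. This is a sketch with made-up values: the weights and temperatures are assumptions, and the kelvin conversion is included only to show what changes when the zero point is absolute.

```python
# Ratio scale: weight has an absolute zero, so ratios are meaningful
weight_a_kg = 10.0
weight_b_kg = 20.0
weight_ratio = weight_b_kg / weight_a_kg  # b is twice as heavy as a

# Interval scale: 0 °C is an arbitrary zero, so the same arithmetic misleads
temp_a_c = 10.0
temp_b_c = 20.0
naive_ratio = temp_b_c / temp_a_c

# Converting to kelvin (absolute zero) changes the ratio completely
kelvin_ratio = (temp_b_c + 273.15) / (temp_a_c + 273.15)

print(weight_ratio, naive_ratio, round(kelvin_ratio, 3))
```

The naive Celsius ratio suggests "twice as hot", while the kelvin ratio is only slightly above 1, which is why ratios are not interpreted on an interval scale.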
Ratio Measurement
Example: Weight in pounds
5. ID number
Population and sample
A population is the entire collection of objects or
outcomes about which data are collected
A sample is a subset of the population containing the
observed objects or the outcomes and the resulting data
Numerical values computed from population data are called
parameters
Numbers computed using the data obtained from a
sample are called statistics
Statistics are used to estimate parameters
Parameters & Statistics
Example:
Decide whether the numerical value describes a population
parameter or a sample statistic.
a.) A recent survey of a sample of 450 engineering
students reported that the average weekly
income for students is $325.
Because the average of $325 is based on a sample, this
is a sample statistic.
b.) The average weekly income for all students is $405.
Because the average of $405 is based on a population,
this is a population parameter.
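The distinction in the example above can be sketched in Python. The population here is simulated with made-up incomes (an assumption for illustration, not data from the survey mentioned), and the seed is fixed only so the run is repeatable.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: weekly incomes of 5,000 students (made-up values)
population = [rng.uniform(200, 600) for _ in range(5000)]

# Parameter: computed from every element of the population
population_mean = statistics.mean(population)

# Statistic: computed from a sample of 450, used to estimate the parameter
sample = rng.sample(population, 450)
sample_mean = statistics.mean(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean will generally be close to, but not exactly equal to, the population mean.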
[Diagram: sampling draws a sample from the population; inference (generalizations) runs from the sample back to the population]
Sampling
is the process of selecting a small number of elements from a
larger defined target group of elements such that the
information gathered from the small group will allow judgments
to be made about the larger group
is the process of selecting a number of individuals for a study in
such a way that the individuals represent the larger group from
which they were selected
Sampling
A physician would like to know the characteristics of a person’s
blood (blood type, Rh factor, blood sugar, etc).
To be able to do this, the physician extracts a few milliliters of
blood from the person's arm.
He subjects this to laboratory analysis and concludes that the
characteristic obtained from the blood sample is the characteristic
of the person’s blood.
Reasons for sampling
1. Reduced cost
2. Greater speed or timeliness
3. Greater scope
4. Convenience
5. Complete enumeration may be physically impossible
Probability sampling
each element in the population has a known chance of being
included in the sample
inclusion in the sample is determined by a random mechanism (e.g., a
device that generates random numbers) together with the selection
probability assigned to each unit
requires a listing of the population units and assigning a unique
label or identifier (usually counting numbers) to each one
(sampling frame)
the resulting samples are generally referred to as random samples
allows drawing of valid generalizations about the population
Non-probability sampling
the manner in which the units are selected from the
population depends on some inclusion rule as specified by the
sampler.
a sampling frame is not always required and the operational
cost is relatively low
inability to provide objective measurement of accuracy
inference can only be made by making assumptions
regarding the representativeness of the sample
Sampling Methods
Probability sampling        Non-probability sampling
simple random sampling      purposive sampling
systematic sampling         convenience sampling
stratified sampling         quota sampling
cluster sampling
Simple Random Sampling
each element in the population has a known and equal probability of
selection
each possible sample of a given size (n) has a known and equal
probability of being the sample actually selected
may not be practical to implement, especially for large populations,
because a good-quality sampling frame may be unavailable and the
selected units may be widely scattered
may be done with replacement or without replacement
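A minimal sketch of simple random sampling, using Python's standard `random` module. The frame of 100 labeled units and the sample size of 10 are assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: units labeled 1..N
N = 100
frame = list(range(1, N + 1))
n = 10

# Without replacement: no unit can appear more than once
srs_without = rng.sample(frame, n)

# With replacement: a unit may be drawn more than once
srs_with = rng.choices(frame, k=n)

print(sorted(srs_without))
print(sorted(srs_with))
```

In both cases every unit in the frame has the same chance of being picked on each draw.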
Systematic Sampling
the sample is chosen by selecting a random starting point and then
picking every kth element in succession from the sampling frame
k is the sampling interval, determined by dividing the
population size N by the sample size n and rounding to the nearest
integer
a random number (called the random start) is selected from 1 to k;
the unit assigned this number is included in the sample, along with
every kth unit thereafter
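The procedure can be sketched in a few lines of Python. The helper name `systematic_sample` and the frame of 100 units are hypothetical, introduced only for this illustration.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start in 1..k, then every kth unit from the frame."""
    N = len(frame)
    k = max(1, round(N / n))           # sampling interval
    start = rng.randint(1, k)          # random start, inclusive on both ends
    # Positions are 1-based in the text, so subtract 1 to index the list
    return [frame[i - 1] for i in range(start, N + 1, k)]

rng = random.Random(7)
frame = list(range(1, 101))            # hypothetical frame of N = 100 units
sample = systematic_sample(frame, 10, rng)
print(sample)
```

When N/n is not a whole number, rounding k means the realized sample size can differ slightly from n.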
Stratified Sampling
the population is divided into mutually exclusive sub-populations
called strata based on a stratification variable that is closely related
to the characteristic of interest
the elements within a stratum should be as homogeneous as
possible, but the elements in different strata should be as
heterogeneous as possible.
independent simple random samples are obtained from each stratum
the overall sample size (n) can be distributed among the strata (with
stratum sample sizes n_h) using equal allocation, proportional
allocation, or optimum allocation
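A sketch of stratified sampling with proportional allocation, where each stratum receives n_h = n × N_h / N units. The strata (year levels) and their sizes are made up for illustration.

```python
import random

rng = random.Random(3)  # fixed seed so the run is repeatable

# Hypothetical strata: year level -> list of student IDs (made-up sizes)
strata = {
    "first_year": list(range(0, 500)),
    "second_year": list(range(500, 800)),
    "third_year": list(range(800, 1000)),
}
N = sum(len(units) for units in strata.values())
n = 100  # overall sample size

sample = {}
for name, units in strata.items():
    # Proportional allocation: n_h = n * N_h / N
    n_h = round(n * len(units) / N)
    # Independent simple random sample within the stratum
    sample[name] = rng.sample(units, n_h)

print({name: len(chosen) for name, chosen in sample.items()})
```

With other stratum sizes, rounding each n_h may make the stratum samples sum to slightly more or less than n, so an adjustment step would be needed.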
Cluster Sampling
in many applications, units of the population are naturally grouped
(e.g. villages); these groupings are referred to as clusters
a random sample of clusters is selected, based on a probability
sampling technique such as SRS
for each selected cluster, either all the elements are included in the
sample (one-stage) or a sample of elements is drawn
probabilistically (two-stage)
elements within a cluster should be as heterogeneous as possible,
but clusters themselves should be as homogeneous as possible.
Ideally, each cluster should be a small-scale representation of the
population
it is administratively convenient to implement
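One-stage and two-stage cluster sampling can be sketched as follows. The villages, household labels, and sample sizes are hypothetical, and simple random sampling is assumed at each stage.

```python
import random

rng = random.Random(11)  # fixed seed so the run is repeatable

# Hypothetical population grouped into clusters: village -> household IDs
clusters = {
    f"village_{v}": [f"v{v}_h{h}" for h in range(1, 21)]
    for v in range(1, 13)
}

# Stage 1: simple random sample of clusters
chosen_clusters = rng.sample(sorted(clusters), 3)

# One-stage cluster sampling: take every element of each chosen cluster
one_stage = [unit for c in chosen_clusters for unit in clusters[c]]

# Two-stage variant: draw a simple random sample inside each chosen cluster
two_stage = [unit for c in chosen_clusters for unit in rng.sample(clusters[c], 5)]

print(chosen_clusters, len(one_stage), len(two_stage))
```

Only the chosen villages need to be visited, which is the administrative convenience noted above.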
Descriptive statistics
Use of numerical information to summarize, simplify, and present
data.
Organize and summarize data for clear presentation and easy
interpretation
Computation of measures of location and variation
Construction of tables and graphs
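As an illustration of measures of location and variation, the sketch below uses Python's standard `statistics` module on a small set of tensile strength readings. The readings are made-up values.

```python
import statistics

# Hypothetical sample: tensile strength readings in MPa (made-up values)
strengths = [410, 425, 398, 440, 415, 430, 405, 420]

# Measures of location
mean_strength = statistics.mean(strengths)
median_strength = statistics.median(strengths)

# Measures of variation
range_strength = max(strengths) - min(strengths)
stdev_strength = statistics.stdev(strengths)  # sample standard deviation

print(mean_strength, median_strength, range_strength, round(stdev_strength, 2))
```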
Inferential statistics
techniques that use sample data to make general statements
about a population
making decisions and drawing conclusions about a population
based on data obtained from a sample taken from the population
allows meaningful generalizations only if the subjects in the
sample are representative of the population
Estimation and hypothesis testing
Data
refers to facts or figures from which conclusions can be drawn
refers to the values or labels assigned to a variable
it is information collected, organized, analyzed, and interpreted by
statisticians
it is needed whenever we undertake studies or research designed
to answer particular problems, or to provide a basis on
which certain decisions may be made
Data
Can be quantitative or qualitative
Qualitative data can be transformed into quantitative data
using number codes
Can be primary or secondary
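Transforming qualitative data into number codes can be sketched as a simple lookup. The ranks listed and the particular code numbers are assumptions chosen for illustration.

```python
# Hypothetical qualitative data: academic rank of six faculty members
ranks = ["Instructor", "Professor", "Assistant Professor",
         "Instructor", "Associate Professor", "Professor"]

# Number codes are labels only; the particular numbers are an assumption
rank_codes = {
    "Instructor": 1,
    "Assistant Professor": 2,
    "Associate Professor": 3,
    "Professor": 4,
}

# Replace each label with its number code
coded = [rank_codes[r] for r in ranks]
print(coded)
```

Coding changes how the data are stored, not their level of measurement, so the codes should still be treated as categories.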
Methods of data collection
Sample survey (personal, phone, online)
Controlled experiments (field or lab)
Observation (e.g., in psychiatric wards)
Registration method (e. g. as required by law)
Focus group discussion (qualitative data)
Use of existing records (secondary data)
Presenting data
Generally, there are three ways of presenting statistical data:
textual, tabular, and graphical
In a textual presentation, statistics are incorporated in a text or
paragraph.
In a tabular presentation, statistics are organized in rows and
columns with appropriate labels
In a graphical presentation, statistics are shown in pictorial form.
Textual presentation
Poverty incidence among Filipinos in 2015 was estimated at 21.6
percent. During the same period in 2012, poverty incidence among
Filipinos was recorded at 25.2 percent. On the other hand,
subsistence incidence among Filipinos, or the proportion of Filipinos
whose incomes fall below the food threshold, was estimated at 8.1
percent in 2015. In 2012, the subsistence incidence among Filipinos
was 10.4 percent. Subsistence incidence among Filipinos is often
referred to as the proportion of Filipinos in extreme or subsistence
poverty.
Advantages
1. There is a better comprehension of data than is possible with
textual matter alone.
2. There is a more penetrating analysis of the subject than is
possible in written text.
3. There is a check of accuracy
Some Commonly Used Charts
Line Chart
oldest, simplest, most familiar, and most widely used method
of presenting statistics graphically
the plotted points of the data are connected by a line.
the fluctuations of this line show the variations in the trend.
the distance of the plotting from the base line of the graph
indicates the quantity
Uses:
1. for emphasizing movement rather than actual amount
2. for depicting time series (data across time)
3. for comparing several series
4. when data cover a long period of time
5. when estimates or forecasts are to be shown
Source: PSA
Column Chart
to depict numerical values of a given item over a period of
time
values are represented by the height of the column
preferable to the line chart when a sharper explanation of
trend is to be shown
Types:
1. Grouped-column chart - used to compare two or sometimes
three independent series over a period of time.
2. Subdivided-column chart - shows the component parts of a total.
These should be few in number and each should carry a
distinctive pattern so that it may be readily identified.
Source: PSA
Horizontal Bar Chart
simplest form of graph comparing different items at a
specified date
especially suited to represent categorical data
bars may be arranged in numerical or alphabetical order,
depending on the purpose of the chart and the given data
Pie Chart
a circular diagram that is divided into sections to show the
composition of a whole
size of each section is indicative of the proportion to the
total of the corresponding component
useful when there are few components to a whole
many components (more than six) would diminish the
visual impact of the chart
Statistical Map
used to present geographical statistics
should be used only when geographic distribution is of
permanent importance and when data can be readily and
correctly interpreted in this form
Types
1. Shaded or cross-hatched map
2. Dot-map chart
Source: PSA
Graphical presentation of frequency distributions
Frequency histogram
bar graph showing the class boundaries on the x-axis and
the frequencies on the y-axis
the sides of each bar are erected at the class boundaries
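The counts behind a frequency histogram can be sketched without any plotting library. The beam weights and class boundaries below are made-up values; the boundaries are chosen so that no observation falls exactly on one.

```python
# Hypothetical data: weights of 12 beams in kg (made-up values)
weights = [52, 57, 61, 64, 66, 68, 71, 73, 75, 78, 83, 88]

# Class boundaries mark where each bar's sides are drawn on the x-axis
boundaries = [49.5, 59.5, 69.5, 79.5, 89.5]

# Frequency of each class = number of values between consecutive boundaries
frequencies = [
    sum(1 for w in weights if lower < w < upper)
    for lower, upper in zip(boundaries, boundaries[1:])
]

# Text rendering of the histogram: one row per class, one '*' per observation
for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
    print(f"{lower:5.1f}-{upper:5.1f} | {'*' * f}")
```

Each row corresponds to one bar: the two boundaries give its position on the x-axis and the number of stars gives its height on the y-axis.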