Unit 1: Statistical data and concepts
Statistical data and concepts involve collecting, organizing, analyzing, interpreting,
and presenting data to draw meaningful conclusions and make informed decisions.
Key concepts include descriptive statistics, inferential statistics, probability, and
various analytical techniques.
Here's a more detailed explanation of statistical data and concepts:
1. What is Statistics?
Statistics is the science of collecting, organizing, analyzing, interpreting, and
presenting data.
It helps us understand data, identify patterns, and make informed decisions based
on evidence.
It is used in various fields, including data science, research, business, and
government.
Types of Statistics:
Descriptive Statistics:
Summarizes and describes the main features of a dataset.
Examples: Mean, median, mode, standard deviation, range.
Helps in understanding the characteristics of a sample or population.
Inferential Statistics:
Draws conclusions or makes inferences about a population based on a sample.
Examples: Hypothesis testing, confidence intervals, regression analysis.
Used to generalize findings from a sample to a larger population.
Key Statistical Concepts:
Data:
Data is the collection of facts, observations, or measurements.
Can be numerical (e.g., counts, measurements) or categorical (e.g., labels,
classifications).
Population:
The entire group of individuals or objects that are of interest in a study.
Sample:
A subset of the population that is selected for analysis.
Variable:
A characteristic or attribute that can vary among individuals or objects in a dataset.
Probability:
The likelihood of an event occurring.
Used to make predictions and assess uncertainty.
Central Tendency:
Measures that describe the "center" of a dataset.
Examples: Mean, median, mode.
Variability:
Measures the spread or dispersion of data points in a dataset.
Examples: Standard deviation, variance.
Hypothesis Testing:
A statistical method used to determine whether there is enough evidence to reject a
null hypothesis.
Regression:
A statistical method used to model the relationship between a dependent variable
and one or more independent variables.
Correlation:
A measure of the relationship between two variables.
Sampling:
The process of selecting a sample from a population.
Data Distributions:
The shape of the distribution of data.
Examples: Normal distribution, skewed distribution.
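The measures of central tendency and variability listed above can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the exam scores are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical exam scores, used only for illustration.
scores = [70, 85, 85, 90, 100]

# Central tendency
mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

# Variability
score_range = max(scores) - min(scores)        # largest minus smallest value
sample_variance = statistics.variance(scores)  # divides by n - 1
sample_stdev = statistics.stdev(scores)        # square root of the sample variance

print(mean_score, median_score, mode_score, score_range)
print(round(sample_variance, 2), round(sample_stdev, 2))
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.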
Importance of Statistical Data and Concepts:
Decision-Making: Statistics helps in making informed decisions based on data
analysis.
Problem Solving: Statistical methods can be used to identify and solve problems.
Research: Statistics is essential for conducting research and drawing valid
conclusions.
Data Science: Statistical concepts are fundamental to data science and machine
learning.
Statistical methods are used to analyze data and draw conclusions. They include
descriptive statistics, inferential statistics, predictive analysis, and exploratory
data analysis.
Descriptive statistics
Summarize data using indexes like the mean, median, and standard deviation
Present data in charts, graphs, and tables
Make complex data easier to read and understand
Inferential statistics
Draw conclusions from data that are subject to random variation
Study the relationship between different variables
Make predictions for the whole population
Use techniques like hypothesis testing and regression analysis
Predictive analysis
Analyze data to derive past trends and predict future events
Use machine learning algorithms, data mining, data modeling, and artificial
intelligence
Exploratory data analysis
Explore unknown associations in the data
Analyze potential relationships within the data
Identify patterns
Statistical methods are used in research, data analysis, and to make informed
decisions. They can help to eliminate bias from data evaluation and improve
research designs.
Misusing statistics can involve cherry-picking data, ignoring outliers,
overgeneralizing, or making faulty causal claims. For example, claiming that a
specific product is "better" based on a small, biased sample is a misuse of statistics.
In this section we provide guidance on the kinds of problems that may be
encountered, and comment on how some of these can be avoided or minimized.
The main categories of misuse can be summarized as:
• inadequate or unrepresentative data
• misleading visualization of results
• inadequate reasoning on the basis of results
• deliberate falsification of data
Sampling involves selecting a subset of a population to study and determine
characteristics of the whole population, while sample size refers to the number of
observations or individuals included in that subset, which directly impacts the
accuracy and reliability of the study's results.
Here's a more detailed explanation:
Sampling:
Definition: Sampling is the process of choosing a representative group (the sample)
from a larger population to gather data and make inferences about the entire
population.
Purpose: Instead of studying every individual in a population (which can be
impractical or impossible), sampling allows researchers to efficiently collect data
and draw conclusions about the larger group.
Example: If you want to know the average height of students in a university, you
might sample 100 students instead of measuring everyone.
Sample Size:
Definition: Sample size is the number of individuals or observations included in
the sample.
Importance: A larger sample size generally leads to more accurate and reliable
results, as it reduces the likelihood of sampling errors and provides a better
representation of the population.
Example: In the student height example, a sample of 1000 students would likely be
more accurate than a sample of 100 students.
Factors affecting sample size: The appropriate sample size depends on factors like
the research question, the population being studied, the desired level of statistical
power and confidence, and the available resources.
Sample size determination: Determining the appropriate sample size involves
careful consideration of these factors and may involve using statistical formulas or
guidelines.
In essence: Sampling is the method of selecting a group, and sample size is the
number of individuals in that group.
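The effect of sample size in the student height example can be illustrated with a small simulation. This is a sketch under stated assumptions: the population (20,000 heights drawn from a normal distribution with mean 170 cm and standard deviation 10 cm) is hypothetical, and `average_error` is an illustrative helper, not a standard function.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 20,000 students.
population = [random.gauss(170, 10) for _ in range(20_000)]
true_mean = statistics.fmean(population)

def average_error(sample_size, repeats=200):
    """Average absolute gap between the sample mean and the population mean."""
    gaps = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        gaps.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(gaps)

error_100 = average_error(100)
error_1000 = average_error(1000)

# Theoretical standard error of the mean: sd / sqrt(n).
sd = statistics.pstdev(population)
se_100 = sd / math.sqrt(100)
se_1000 = sd / math.sqrt(1000)

print(f"n=100:  average error {error_100:.2f}, standard error {se_100:.2f}")
print(f"n=1000: average error {error_1000:.2f}, standard error {se_1000:.2f}")
```

Because the standard error shrinks with the square root of the sample size, a tenfold increase in sample size reduces it by a factor of about 3.2, not 10.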
Data preparation is the process of cleaning and transforming raw data so it's
ready for analysis. Data cleaning is a key step in data preparation that involves
fixing errors and improving data quality.
Data preparation steps:
Data collection: Gather raw data from various sources, such as databases, web
APIs, or manual entry
Data cleaning: Identify and correct errors, inconsistencies, and anomalies
Data integration: Combine and merge datasets to create a unified dataset
Data transformation: Convert the structure and format of the data
Data normalization: Scale numeric data to a standard range
Data profiling: Identify relationships, connections, and other attributes in data sets
Data cleaning techniques: filling in missing values, filtering out duplicates or
invalid entries, standardizing formats, cross-checking information, and adding
more details.
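Several of the cleaning techniques above, together with the normalization step, can be sketched in plain Python. The records and field names below are hypothetical, and filling missing values with the mean is only one of several possible choices.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"name": " Alice ", "city": "new york", "age": 30},
    {"name": "Bob", "city": "Boston", "age": None},
    {"name": " Alice ", "city": "new york", "age": 30},  # exact duplicate
    {"name": "Cara", "city": "CHICAGO", "age": 50},
]

# 1. Standardize formats: trim whitespace and use consistent capitalization.
standardized = [
    {"name": r["name"].strip(), "city": r["city"].strip().title(), "age": r["age"]}
    for r in raw_records
]

# 2. Filter out duplicates while keeping the original order.
seen = set()
deduplicated = []
for r in standardized:
    key = (r["name"], r["city"], r["age"])
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)

# 3. Fill in missing ages with the mean of the known ages.
known_ages = [r["age"] for r in deduplicated if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
cleaned = [
    {**r, "age": mean_age if r["age"] is None else r["age"]} for r in deduplicated
]

# 4. Normalize ages to the range 0..1 (min-max scaling).
ages = [r["age"] for r in cleaned]
low, high = min(ages), max(ages)
for r in cleaned:
    r["age_scaled"] = (r["age"] - low) / (high - low)

for r in cleaned:
    print(r)
```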
Benefits of data preparation
Improves data quality
Reduces noise and irrelevant data
Streamlines data for faster and easier processing
Provides clear and well-organized data for better business decisions
You can automate data preparation tasks with tools like KNIME.
Missing data and errors, if not addressed, can significantly impact data analysis
and machine learning model performance, potentially leading to biased or
inaccurate results. Understanding the types of missing data and errors is crucial for
effective data cleaning and handling.
Statistical errors occur when the data collected from a study doesn't match the true
value of the population being studied. This can happen due to a number of
reasons, including sampling, measurement, or bias.
Types of statistical errors
Sampling error: The difference between a statistic computed from a sample and the
true value for the population.
Measurement error: Occurs when the measuring device used is inaccurate.
Gross error: A large, obvious mistake. Statistically unlikely, but can be caused by
human error or a malfunctioning instrument.
Type I error: A false positive result from a hypothesis test, that is, rejecting a
null hypothesis that is actually true.
Non-response error: Occurs when some members of a sample don't respond or
can't provide the required data.
How errors impact results
The greater the error, the less representative the data is of the population, and the
less reliable the study's results.
How to reduce errors
To reduce errors, you can:
   ● Correct gross errors before performing statistical adjustments
   ● Consider the margin of error when interpreting poll results
   ● Be aware of bias in the sampling, measurement, or analysis process
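As an illustration of the margin of error for poll results, the sketch below uses the common approximation z * sqrt(p(1 - p) / n) for a sample proportion. It assumes a simple random sample and a 95% confidence level (z = 1.96); the poll figures and the helper name `margin_of_error` are hypothetical.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate margin of error for a sample proportion (z = 1.96 for 95%)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical poll: 52% support among 1,000 respondents.
support = 0.52
moe = margin_of_error(support, 1000)
low, high = support - moe, support + moe

print(f"{support:.0%} +/- {moe:.1%} (between {low:.1%} and {high:.1%})")
```

Here the margin of error is about 3 percentage points, so the interval includes 50% and the poll alone does not show that a majority supports the option.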
Statistical Modeling:
Statistical modeling refers to the process of applying statistical analysis techniques
to observe, analyze, interpret, and predict trends and patterns in data.
In data science, statistical models help data scientists and analysts uncover valuable
insights from complex datasets. By building mathematical representations using
historical data, statistical models identify important relationships between different
variables in a system. These models are then used to forecast future outcomes and
events through predictive analytics.
Types of Statistical Models
There are many types of statistical models used in data science, each serving a
different analytical purpose:
   ● Linear Regression Models: Used to model the relationship between a
      dependent variable and one or more independent variables. This is one of the
      most widely used statistical techniques.
   ● Logistic Regression Models: Used when the response or dependent variable
      is categorical, such as pass/fail, win/lose, etc. It calculates the probability of
      an event occurring.
   ● Time Series Models: Capture the changes in data over time to forecast
      future values using historical trends and seasonality. Extremely useful for
      analysis of financial, weather, traffic, and census data.
   ● Survival Models: Estimate the expected duration of time until one or more
      events happen, such as mechanical failure or death. Useful in medical
      research and industrial engineering.
   ● Decision Tree Models: Use a tree-like model with branches to
      illustrate every possible outcome of a decision given certain constraints.
      Assist in calculating the consequences of choices.
   ● Neural Network Models: Inspired by biological neural networks, these AI
      models detect complex nonlinear relationships between independent and
      dependent variables for pattern classification and recognition problems.
   ● Ensemble Models: Improve overall model performance by strategically
      combining multiple statistical models together. For example, the same model
      can be trained on many randomly selected sub-samples of the data and the
      resulting predictions combined.
   ● Multivariate Analysis: Examines relationships among multiple variables at
      the same time. Useful for gaining deeper insights from intricate datasets.
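As a concrete instance of the first model type, the sketch below fits a simple linear regression (one independent variable) by ordinary least squares. The data are hypothetical and `fit_line` is an illustrative helper written for this example, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 59, 61, 64]

slope, intercept = fit_line(hours, scores)
predicted_6h = intercept + slope * 6  # forecast for a new value of x

print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score after 6 hours: {predicted_6h:.1f}")
```

The fitted slope is the estimated change in the dependent variable for each one-unit increase in the independent variable.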
What are Statistical Modeling Techniques?
Statistical modeling techniques refer to the methods applied to build and train
statistical models using historical data. The whole process comprises:
   ● Data Exploration: This first step focuses on cleaning the raw data and
      getting familiar with attributes and variables. Summary statistics are
      generated in this phase. Graphical analysis also assists in this initial visual
      inspection.
   ● Feature Engineering: Appropriate predictive variables or features are
      selected to be used as inputs for modeling. Irrelevant or redundant attributes
      are discarded. Feature scaling transforms attributes to comparable scales.
   ● Model Development: Based on the problem statement, analytical goals, and
      data properties, an appropriate statistical model type and algorithm is
      selected. Hyperparameters are tuned to improve model performance.
   ● Model Evaluation: The model is tested on an unseen holdout dataset to
      determine its real-world effectiveness. Evaluation metrics like R-squared,
      confusion matrix, etc., quantify the model performance.
   ● Interpretation: Finally, the model outputs and results are visually inspected
      to derive meaningful, data-driven insights and findings. These become the
      basis for business decisions.
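To make the model evaluation step more concrete, here is a minimal sketch of a confusion matrix for a two-class model, along with three metrics commonly derived from it. The holdout labels and predictions are hypothetical, and `confusion_matrix` is an illustrative helper written for this example.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for labels 1 (positive) and 0."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts

# Hypothetical holdout labels (1 = pass, 0 = fail) and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)  # share of correct predictions
precision = cm["tp"] / (cm["tp"] + cm["fp"])    # predicted positives that are right
recall = cm["tp"] / (cm["tp"] + cm["fn"])       # actual positives that are found

print(cm, accuracy, precision, recall)
```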
Computational Statistics:
Computational statistics, also known as statistical computing, is a field that
combines statistics and computer science, focusing on using computational
methods to solve statistical problems, especially those involving large or complex
datasets.
In the context of research and data analysis, bias and confounding are errors of
inference that distort the relationship between an exposure and an outcome, leading
to potentially misleading conclusions. Understanding and addressing these issues
is crucial for drawing valid causal inferences.
   Bias:
    Bias refers to systematic errors that distort the true relationship between an
    exposure and an outcome.
       ● Examples: Selection bias (participants are selected in a way that skews
           the results), information bias (errors in collecting data), or measurement
           bias (using tools that are not accurate).
   Confounding:
    Confounding occurs when a third variable, a "confounder," influences both the
    exposure and the outcome, creating a spurious association.
       ● Example: If a study shows an association between coffee drinking and
           heart disease, but coffee drinkers also tend to smoke more, smoking
           could be a confounder, obscuring the true relationship between coffee
           and heart disease.
   Importance of Distinguishing Between Bias and Confounding:
       ● Valid Causality: Understanding the difference is critical to accurately
           interpret research findings and establish causal relationships.
       ● Policy and Practice: Misinterpretations due to bias or confounding can
           lead to ineffective or harmful policies and practices.
       ● Causal inference: In causal inference, understanding and adjusting for
           bias and confounding is essential to accurately estimate the causal effect
           of an intervention or exposure.
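The coffee and smoking example can be illustrated with a small simulation. All probabilities below are hypothetical, and the simulation is built so that coffee has no effect on disease at all: any association between coffee and disease comes only from smoking. Comparing coffee drinkers with non-drinkers separately within smokers and non-smokers (stratification) is one simple way to adjust for the confounder.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical probabilities: smoking raises both coffee drinking and disease
# risk, while coffee itself has NO effect on disease in this simulation.
people = []
for _ in range(20_000):
    smoker = random.random() < 0.5
    coffee = random.random() < (0.8 if smoker else 0.2)
    disease = random.random() < (0.3 if smoker else 0.1)
    people.append((smoker, coffee, disease))

def disease_rate(rows):
    """Share of rows in which the disease flag is set."""
    return sum(d for _, _, d in rows) / len(rows)

def risk_difference(rows):
    """Disease rate among coffee drinkers minus the rate among non-drinkers."""
    drinkers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return disease_rate(drinkers) - disease_rate(others)

crude = risk_difference(people)                                      # ignores smoking
among_smokers = risk_difference([r for r in people if r[0]])         # stratum 1
among_nonsmokers = risk_difference([r for r in people if not r[0]])  # stratum 2

print(f"crude difference:        {crude:+.3f}")
print(f"within smokers only:     {among_smokers:+.3f}")
print(f"within non-smokers only: {among_nonsmokers:+.3f}")
```

The crude comparison suggests that coffee drinkers have clearly more disease, while within each smoking group the difference is close to zero.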
Hypothesis testing is a statistical method that uses sample data to determine if a
theory applies to a larger population. It involves making assumptions, collecting
data, and analyzing the results to either support or reject the theory.
Steps in hypothesis testing:
   ● State the null hypothesis, which assumes no difference between groups or
      conditions
   ● State the alternative hypothesis, which predicts a relationship between
      variables
   ● Collect data
   ● Calculate a test statistic
   ● Determine acceptance and rejection regions
   ● Draw a conclusion based on the test statistic and acceptance and rejection
      regions
Key terms
   ● Null hypothesis: The default assumption that there is no effect or
      difference between groups or conditions
   ● Alternative hypothesis: The theory that there is a relationship between
      variables
   ● P-value: The probability of getting results at least as extreme as the
      observed results if the null hypothesis is true
   ● Significance level: The probability of rejecting the null hypothesis when it
      is true
Example
You might use hypothesis testing to determine if the average weight of a dumbbell
in a gym is higher than 90 lbs.
What are the 5 steps of a hypothesis test?
      Step 1: State your null and alternate hypothesis. ...
      Step 2: Collect data. ...
      Step 3: Perform a statistical test. ...
      Step 4: Decide whether to reject or fail to reject your null hypothesis. ...
      Step 5: Present your findings.
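The steps can be traced on the dumbbell example with a one-sample t-test. This is a sketch under stated assumptions: the ten weights are hypothetical, the test is one-sided at a significance level of 0.05, and the critical value 1.833 is the standard t-table value for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical sample of 10 dumbbell weights in lbs.
weights = [92, 88, 95, 91, 93, 89, 94, 96, 90, 92]

# Step 1: H0: mean weight = 90 lbs; H1: mean weight > 90 lbs (one-sided).
hypothesized_mean = 90

# Steps 2-3: compute the one-sample t statistic from the collected data.
n = len(weights)
sample_mean = statistics.mean(weights)
standard_error = statistics.stdev(weights) / math.sqrt(n)
t_statistic = (sample_mean - hypothesized_mean) / standard_error

# Step 4: compare with the critical value for alpha = 0.05, one-sided,
# n - 1 = 9 degrees of freedom (1.833, taken from a standard t table).
critical_value = 1.833
reject_null = t_statistic > critical_value

# Step 5: report the finding.
print(f"mean = {sample_mean}, t = {t_statistic:.3f}, reject H0: {reject_null}")
```

With these hypothetical data the t statistic falls in the rejection region, so the null hypothesis of a 90 lb average is rejected in favor of a higher average.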
Statistical significance refers to the claim that a result from data generated by
testing or experimentation is likely to be attributable to a specific cause. A high
degree of statistical significance indicates that an observed relationship is unlikely
to be due to chance.
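Statistical significance is most practically used in hypothesis testing. For example,
suppose you want to know whether changing the color of a button on your website
from red to green will result in more people clicking on it. A significance test
indicates whether the observed difference in click rates is larger than chance alone
would plausibly produce.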
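One common way to test the button example is a two-proportion z-test. The sketch below assumes hypothetical click counts, independent visitors, and samples large enough for the normal approximation to be reasonable.

```python
import math
from statistics import NormalDist

# Hypothetical click counts for each button color.
red_clicks, red_visitors = 200, 2000
green_clicks, green_visitors = 250, 2000

p_red = red_clicks / red_visitors
p_green = green_clicks / green_visitors

# Pooled click rate under the null hypothesis of no difference.
pooled = (red_clicks + green_clicks) / (red_visitors + green_visitors)
standard_error = math.sqrt(
    pooled * (1 - pooled) * (1 / red_visitors + 1 / green_visitors)
)

z = (p_green - p_red) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

significant = p_value < 0.05  # significance level of 0.05
print(f"z = {z:.2f}, p-value = {p_value:.4f}, significant: {significant}")
```

With these hypothetical counts the p-value is below 0.05, so the difference between 10.0% and 12.5% click rates would be called statistically significant at that level.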
Confidence intervals and statistical significance are related tools in statistics: a
confidence interval shows a range where the true population value likely lies,
while significance is determined by whether the interval excludes the "no effect"
value (often zero).
Confidence Interval:
A confidence interval is a range of values within which a population parameter
(like the mean) is likely to fall, given a certain level of confidence (e.g., 95%).
         ● It's a way to quantify the uncertainty or precision of an estimate.
         ● A narrower interval indicates a more precise estimate, while a wider
            interval suggests greater uncertainty.
         ● A 95% confidence interval, for example, means that if you were to
            repeat the study many times, 95% of the intervals would contain the
            true population value.
         ● You can calculate confidence intervals for many kinds of statistical
            estimates, including means, differences between means, and odds
            ratios.
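The "repeat the study many times" interpretation can be checked by simulation. This sketch assumes a hypothetical normal population with a known standard deviation, so the interval is the simple z-interval: sample mean plus or minus z * sigma / sqrt(n).

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(2)  # fixed seed so the sketch is repeatable

TRUE_MEAN, SIGMA, N = 50.0, 10.0, 30  # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)       # about 1.96 for a 95% interval

def confidence_interval(sample):
    """95% interval for the mean, assuming the population sd (SIGMA) is known."""
    center = fmean(sample)
    half_width = z * SIGMA / math.sqrt(len(sample))
    return center - half_width, center + half_width

repeats = 2000
hits = 0
for _ in range(repeats):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    low, high = confidence_interval(sample)
    hits += low <= TRUE_MEAN <= high  # True counts as 1

coverage = hits / repeats
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The share of intervals that contain the true mean should come out close to 0.95.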
Power and robustness: In the context of systems or models, "power" refers to the
ability to detect true effects or differences when they exist, while "robustness"
refers to the ability to maintain performance and stability under various conditions
and variations in data or the environment.
Power:
   Definition:
The power of a test, model, or system is its capacity to identify a true effect or
signal when it is present. For statistical tests, it's the probability of rejecting a false
null hypothesis, or in other words, the likelihood of a test finding a difference that
truly exists.
   Importance:
    A powerful system or test is better at uncovering important insights or detecting
    true anomalies.
   Example:
    In the context of statistical analysis, a test is considered powerful if it is good at
    correctly identifying a statistically significant difference between two groups
    when a real difference exists.
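For a simple case, power can be computed directly. The sketch below assumes a one-sided z-test with a known standard deviation; the effect size, sample sizes, and the helper name `power_one_sided_z` are illustrative choices.

```python
import math
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test to detect a true mean shift of `effect`."""
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha)    # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma  # true effect in standard errors
    return 1 - normal.cdf(z_alpha - shift)

# Same hypothetical effect (half a standard deviation), two sample sizes.
power_n25 = power_one_sided_z(effect=0.5, sigma=1.0, n=25)
power_n100 = power_one_sided_z(effect=0.5, sigma=1.0, n=100)

print(f"power with n=25:  {power_n25:.3f}")
print(f"power with n=100: {power_n100:.3f}")
```

Power rises with sample size and with the size of the true effect; when the true effect is zero, the formula returns the significance level itself.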
Robustness:
       Definition:
        Robustness means a system or model remains reliable and performs well
        despite changes, uncertainties, or variations in its inputs or operating
        environment.
       Importance:
        Robustness ensures a system's reliability and stability even when faced with
        unexpected inputs, changes in conditions, or errors.
       Examples:
           ● A model is considered robust if it continues to make accurate
              predictions even with slight variations in the data it's trained on.
           ● A system is robust if it can maintain its functionality even when
              facing network issues or partial component failures.
           ● In AI, a robust model can perform well on new, unseen data, even if it
              contains noise or variations not present in the training data.
           ● In statistics, a robust statistical test is one that is not highly sensitive to
              deviations from assumptions (e.g., normality) or outliers in the data
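A small illustration of statistical robustness is to compare how the mean and the median react to a single outlier. The numbers are hypothetical; the value 1000 stands in for a data-entry error.

```python
import statistics

clean = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 1000]  # one hypothetical data-entry error

# How far each summary moves when the outlier is introduced.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean moves by {mean_shift}, median moves by {median_shift}")
```

The mean jumps from 12 to 209.2, while the median stays at 12, which is why the median is described as a robust measure of the center.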
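Degrees of freedom refer to the maximum number of logically independent
values that are free to vary in a data sample. In the simplest case, such as
estimating a sample variance or running a one-sample t-test, degrees of freedom are
calculated by subtracting one from the number of items within the data sample.
Other analyses use different formulas.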
The degrees of freedom (DF) in statistics indicate the number of independent
values that can vary in an analysis without breaking any constraints. It is an
essential idea that appears in many contexts throughout statistics including
hypothesis tests, probability distributions, and linear regression.
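The sample variance is a convenient place to see degrees of freedom at work. In this sketch (with hypothetical data), the deviations from the sample mean must sum to zero, which is the constraint that removes one degree of freedom.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(data)
mean = statistics.mean(data)

# The deviations from the sample mean always sum to zero, so once n - 1 of
# them are known the last one is fixed: only n - 1 are free to vary.
deviations = [x - mean for x in data]
last_deviation_implied = -sum(deviations[:-1])

sum_of_squares = sum(d ** 2 for d in deviations)
variance_n_minus_1 = sum_of_squares / (n - 1)  # sample variance, df = n - 1
variance_n = sum_of_squares / n                # divides by n instead

print(last_deviation_implied, deviations[-1])
print(round(variance_n_minus_1, 3), variance_n)
```

Dividing by n - 1 rather than n gives the usual sample variance, which is what `statistics.variance` returns.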
Nonparametric analysis is a set of statistical methods that use ranks or signs
instead of numerical values to analyze data. It's used when the assumptions of
parametric tests aren't met or when the data is inherently in categories.
When to use nonparametric analysis
   ● When the data distribution doesn't meet the assumptions of parametric tests
   ● When analyzing data that's in categories or ranks, like gender, race, or
      employment status
   ● When dealing with outliers (unexpected observations) that might be
      problematic with parametric methods
Examples of nonparametric tests
   ● Mann-Whitney U Test: A nonparametric version of the independent
      samples t-test
   ● Wilcoxon Signed Rank Test: A nonparametric counterpart of the paired
      samples t-test
   ● Kruskal-Wallis Test: A nonparametric alternative to the one-way ANOVA
   ● Sign Test: A nonparametric test similar to the Wilcoxon signed rank test
      that uses only the sign of each paired difference, not its size
Limitations of nonparametric tests
   ● Nonparametric tests usually have slightly less power than parametric tests
      when the assumptions of the parametric test are met
   ● Nonparametric tests may not be available when more complex analyses are
      required
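To show how a rank-based test ignores the size of the numerical values, the sketch below computes the Mann-Whitney U statistic by counting pairs, which is equivalent to the usual rank-sum formula. The data are hypothetical and `mann_whitney_u` is an illustrative helper; it returns only the statistic, which would then be compared with a table of critical values or a normal approximation.

```python
def mann_whitney_u(group_a, group_b):
    """U statistics by pair counting: a win counts 1, a tie counts 0.5."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two U values always add up to the number of pairs.
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b

# Hypothetical reaction times (seconds) for two independent groups.
control = [1.1, 2.3, 3.0, 4.2]
treated = [3.5, 5.1, 6.0, 7.4]

u_control, u_treated = mann_whitney_u(control, treated)
u_statistic = min(u_control, u_treated)  # the smaller U is usually reported

print(u_control, u_treated, u_statistic)
```

Only the ordering of the observations matters here: replacing 7.4 with 740 would leave the U statistic unchanged.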