Comprehensive Statistics and Data Analytics Notes
Statistics: Introduction & Descriptive Statistics
Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions.
Descriptive statistics summarize data using numerical measures and graphical tools.
- Mean: The arithmetic average of a data set.
- Median: The middle value separating the higher half from the lower half.
- Mode: The most frequently occurring value.
- Variance: The average of the squared differences from the mean (dividing by n for a population, or by n - 1 for a sample estimate).
- Standard Deviation: The square root of the variance, indicating data spread.
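The measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the data values are made up for illustration and do not come from these notes.

```python
import statistics

# Hypothetical data set, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)               # arithmetic average: 5
median = statistics.median(data)           # midpoint of the two middle values: 4.5
mode = statistics.mode(data)               # most frequent value: 4
pop_variance = statistics.pvariance(data)  # divides by n: 4
pop_std_dev = statistics.pstdev(data)      # square root of the population variance: 2.0

# Sample versions divide by n - 1 instead of n.
sample_variance = statistics.variance(data)  # 32 / 7, about 4.571

print(mean, median, mode, pop_variance, pop_std_dev, sample_variance)
```

Use `pvariance`/`pstdev` when the data are the whole population, and `variance`/`stdev` when they are a sample from a larger one.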
Data Visualization
Data visualization involves presenting data in graphical format to identify patterns, trends, and outliers.
Common tools include:
- Histograms
- Boxplots
- Scatter plots
- Bar charts
- Line graphs
Introduction to Probability Distributions
Probability distributions describe how probabilities are distributed over values of a random variable.
- Discrete distributions: Binomial, Poisson
- Continuous distributions: Normal, Exponential
Each has properties like mean, variance, skewness, and kurtosis.
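As a rough illustration of a discrete distribution and its properties, the sketch below builds the Binomial and Poisson probability mass functions from their textbook formulas and checks the Binomial mean and variance numerically. The parameter values (n = 10, p = 0.3) and function names are assumptions chosen for the example.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical parameters for illustration.
n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                 # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))   # should match n * p
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # n * p * (1 - p)

print(total, mean, variance)
print(poisson_pmf(0, 2.0))  # P(X = 0) for Poisson(2) equals exp(-2)
```

The computed mean and variance agree with the closed forms np and np(1 - p) up to floating-point rounding.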
Hypothesis Testing
Hypothesis testing is a statistical method for deciding whether sample data provide enough evidence to reject an assumption about a population.
- Null Hypothesis (H0): Assumes no effect or difference.
- Alternative Hypothesis (H1): Assumes an effect or difference exists.
- Test statistics: z-test, t-test, chi-square test, etc.
- P-value: Probability of observing data at least as extreme as the data actually observed, assuming H0 is true.
- Significance level (α): Commonly set at 0.05
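The pieces listed above fit together as in the following sketch of a two-sided one-sample z-test. The sample values, the hypothesized mean, and the assumption that the population standard deviation is known are all hypothetical; with an unknown standard deviation a t-test would be the usual choice.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical measurements and test setup.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.4, 5.0, 5.7, 5.5]
mu0 = 5.0      # mean under H0
sigma = 0.3    # population standard deviation, assumed known
alpha = 0.05   # significance level

# Test statistic: how many standard errors the sample mean lies from mu0.
standard_error = sigma / sqrt(len(sample))
z = (mean(sample) - mu0) / standard_error

# Two-sided p-value: probability of a |Z| at least this large under H0.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = p_value < alpha
print(f"z = {z:.3f}, p = {p_value:.5f}, reject H0: {reject_h0}")
```

Here z is about 3.67, so the p-value falls well below 0.05 and H0 is rejected at that level.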
Linear Algebra and Population Statistics
Linear algebra involves vectors, matrices, and linear transformations used in population modeling and
statistical analysis.
- Population statistics: Mean, variance, and correlation structures modeled using matrices.
- Matrix operations support regression and multivariate analyses.
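One way to see how matrices carry population statistics is to compute a mean vector and covariance matrix directly. The sketch below uses plain Python lists; the helper names and the tiny two-variable data set are assumptions made for illustration.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise means of a data matrix given as a list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows):
    """Population covariance matrix: (1/n) * Xc^T Xc, with Xc the centered data."""
    n = len(rows)
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    p = len(mu)
    return [
        [sum(r[i] * r[j] for r in centered) / n for j in range(p)]
        for i in range(p)
    ]

# Hypothetical population: the second variable is exactly twice the first.
population = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

mu = mean_vector(population)
cov = covariance_matrix(population)

# Correlation is the covariance scaled by both standard deviations.
correlation = cov[0][1] / sqrt(cov[0][0] * cov[1][1])
print(mu, cov, correlation)
```

Because the two variables are perfectly linearly related, the correlation comes out as 1 up to rounding.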
Mathematical Methods and Probability Theory
Probability theory underpins statistical inference.
- Set theory, combinatorics
- Conditional probability, Bayes' theorem
- Random variables and expectation
- Law of large numbers, Central Limit Theorem
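Conditional probability and Bayes' theorem can be illustrated with a standard diagnostic-test calculation. The function name and the rates used (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical values chosen for the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem.

    prior               = P(condition)
    sensitivity         = P(positive | condition)
    false_positive_rate = P(positive | no condition)
    """
    # Law of total probability gives P(positive).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical rates for illustration.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")  # about 0.161
```

Even with an accurate test, the posterior stays low here because the condition is rare, which is the usual point of this example.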
Sampling Distributions and Statistical Inference
- Sampling distributions describe the distribution of sample statistics.
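- The Central Limit Theorem states that, for a population with finite variance, the mean of a sufficiently large sample is approximately normally distributed; this enables inference from samples to populations.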
- Confidence intervals and p-values form the basis of inference.
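A small simulation gives a feel for a sampling distribution. The sketch below repeatedly draws samples from a skewed Exponential(1) population and looks at the distribution of the sample means; the sample size, number of repetitions, and seed are arbitrary choices for illustration.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # observations per sample
NUM_SAMPLES = 2000   # number of repeated samples

# Exponential(1) has mean 1 and standard deviation 1, and is strongly skewed.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {theoretical_se:.3f})")
```

The sample means cluster around the population mean with spread close to sigma / sqrt(n), which is the quantity that confidence intervals for a mean are built on.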
Quantitative Analysis
Involves the use of mathematical and statistical modeling, measurement, and research to understand
behavior.
- Descriptive and inferential statistics
- Optimization
- Time series analysis
Unit 2: Statistical Modelling
- Linear models & Regression: Predictive models using linear relationships
- ANOVA: Analyzes variance across groups
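- Gauss-Markov Theorem: If the errors have zero mean, constant variance, and are uncorrelated, OLS estimators are BLUE (Best Linear Unbiased Estimators)
- Least Squares Geometry: Minimizing the sum of squared residuals is equivalent to orthogonally projecting the response vector onto the column space of the design matrix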
- Model diagnostics: Residual analysis, influence, multicollinearity
- Transformations: e.g., Box-Cox
- Logistic & Poisson Regression for binary/count data
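For the linear regression and least squares items above, a minimal sketch of simple (one-predictor) OLS using the closed-form slope and intercept is shown below. The function name and data points are hypothetical; a real analysis would typically use a statistics library and include the diagnostics listed above.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying close to the line y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")  # 0.05 and 1.99
print(f"sum of residuals = {sum(residuals):.2e}")           # zero up to rounding
```

The residuals of an OLS fit with an intercept always sum to zero, which is a quick sanity check and a first step in residual analysis.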
Unit 3: Data Analytics
- Open & Closed Sets: An open set contains none of its boundary points; a closed set contains all of them
- Compactness: Every open cover has a finite subcover
- Metric Space: Defines distance (e.g., Euclidean in R^n)
- Cauchy Sequences: Sequences whose terms get arbitrarily close to each other as the sequence progresses
- Completeness: Every Cauchy sequence converges to a point in the space
- Connectedness: Space can't be divided into two disjoint nonempty open sets
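The metric, Cauchy, and completeness ideas above can be made concrete with a short sketch. It checks metric properties of the Euclidean distance on a few hypothetical points, then builds a sequence of rational numbers (Newton's iteration for the square root of 2, chosen here as an illustration) whose terms bunch together although no rational number squares to 2.

```python
from fractions import Fraction
from math import dist

# Euclidean metric in R^2 on three hypothetical points.
a, b, c = (0, 0), (3, 4), (6, 0)
d_ab, d_bc, d_ac = dist(a, b), dist(b, c), dist(a, c)
print(d_ab, d_bc, d_ac)       # 5.0 5.0 6.0
print(d_ac <= d_ab + d_bc)    # triangle inequality holds: True

# Newton's iteration x -> (x + 2/x) / 2, carried out exactly in the rationals.
x = Fraction(1)
terms = [x]
for _ in range(5):
    x = (x + 2 / x) / 2
    terms.append(x)

# Gaps between consecutive terms shrink rapidly.
gaps = [abs(terms[i + 1] - terms[i]) for i in range(len(terms) - 1)]
print([str(t) for t in terms[:4]])   # ['1', '3/2', '17/12', '577/408']
print([float(g) for g in gaps])
```

Shrinking consecutive gaps alone do not prove a sequence is Cauchy, but this one is, and its limit (the square root of 2) is irrational. That is the standard example of why the rationals are not complete while the real numbers are.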
Unit 4: Advanced Data Analytics
- Vector Space: Collection of vectors closed under vector addition and scalar multiplication
- Subspaces: Subsets that are also vector spaces
- Independence, Basis & Dimension: A basis is a linearly independent set of vectors that spans the space; the dimension is the number of vectors in a basis
- Eigenvalues & Eigenvectors: Solve Av = λv for nonzero v, important in PCA and systems analysis
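As a small worked case of the eigenvalue equation, the sketch below finds the eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial and verifies one eigenvector. The function names and the matrix are hypothetical; larger matrices, such as covariance matrices in PCA, would normally be handled by a numerical linear algebra library.

```python
from math import sqrt

def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], largest first.

    They are the roots of lambda^2 - trace * lambda + det = 0.
    For a symmetric matrix the discriminant (a - d)^2 + 4 b^2 is never negative.
    """
    trace = a + d
    det = a * d - b * b
    root = sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def mat_vec(matrix, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in matrix]

# Hypothetical symmetric matrix.
A = [[2.0, 1.0], [1.0, 2.0]]
lam1, lam2 = eigenvalues_2x2_symmetric(2.0, 1.0, 2.0)
print(lam1, lam2)  # 3.0 1.0

# v = (1, 1) is an eigenvector for the eigenvalue 3: A v equals 3 v.
v = [1.0, 1.0]
print(mat_vec(A, v))  # [3.0, 3.0]
```

In PCA the matrix would be a covariance matrix, and the eigenvector with the largest eigenvalue gives the direction of greatest variance.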