Data Analytics - Unit 4 Notes
1. Supervised vs Unsupervised Learning (Tabular Format)
| Feature | Supervised Learning | Unsupervised Learning |
|-------------------------------|----------------------------------------------------------|----------------------------------------------------------|
| Definition | Learning with labeled data | Learning with unlabeled data |
| Input Data | Input has output labels | Input has no output labels |
| Goal | Predict output | Discover hidden patterns |
| Output Type | Predictive (classification/regression) | Descriptive (clusters/associations) |
| Examples of Tasks | Classification, Regression | Clustering, Association |
| Evaluation | Accuracy, RMSE, etc. | Silhouette score, manual interpretation |
| Algorithms | Decision Trees, SVM, Linear Regression | K-Means, DBSCAN, PCA |
| Use Cases | Email spam detection, loan approval | Customer segmentation, anomaly detection |
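A minimal sketch of the contrast, assuming scikit-learn and its bundled Iris data (not part of the notes): the supervised model is trained with the label vector, while K-Means only ever sees the features.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels y guide training; output is a prediction
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised: no labels; output is a discovered grouping
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Assigned cluster:", km.labels_[0])
```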
2. Segmentation
Segmentation divides a large dataset into smaller, meaningful subgroups based on similar behavior or attributes.
Types: Demographic, Geographic, Behavioral, Psychographic
Techniques: K-Means, Hierarchical, DBSCAN
Applications: Marketing, Healthcare, Finance, E-commerce
Purpose: Discover patterns, target specific user groups, improve model performance.
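A hedged sketch of behavioral segmentation with K-Means, assuming scikit-learn; the spend/visit columns and the choice of three segments are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([[500, 2], [520, 3], [3000, 12], [2900, 10], [60, 1], [75, 1]])

# Scale features so spend does not dominate the distance metric
X = StandardScaler().fit_transform(customers)

# Segment customers into 3 groups (k chosen for illustration)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(segments)  # one segment label per customer
```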
3. Decision Trees
A tree-like structure used for classification or regression.
Types:
- Classification Tree: Categorical output
- Regression Tree: Numerical output
Process:
1. Select the best splitting attribute using a criterion (e.g., Gini impurity, entropy/information gain)
2. Split the data into subsets based on that attribute
3. Recurse on each subset until leaf nodes are pure (or a stopping criterion is met)
Overfitting: Deep trees that memorize training data
Pruning: Reduces tree size to prevent overfitting
Applications: Loan approval, diagnosis, HR attrition
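A minimal classification-tree sketch, assuming scikit-learn and its bundled Iris sample; the Gini criterion mirrors the splitting step above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# At each node, split on the attribute that best reduces Gini impurity
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print("Tree depth:", tree.get_depth())
```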
4. Overfitting and Pruning
Overfitting: When a model fits the training data too well, including its noise
Symptoms: High training accuracy, poor test accuracy
Pruning Types:
- Pre-pruning: Stop early (e.g., max depth, min samples)
- Post-pruning: Build full tree, then cut weak branches
Goal: Improve generalization, reduce complexity
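A sketch of both pruning styles, assuming scikit-learn and its bundled breast-cancer data: max_depth and min_samples_leaf act as pre-pruning stops, while ccp_alpha applies cost-complexity (post-)pruning of weak branches.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth and leaf-size limits
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10, random_state=0).fit(X_train, y_train)

# Post-pruning: cost-complexity pruning removes weak branches (alpha chosen for illustration)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "train:", model.score(X_train, y_train), "test:", model.score(X_test, y_test))
```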
5. Measures of Forecast Accuracy
Used to evaluate time series model performance:
- MAE = Mean Absolute Error
- MSE = Mean Squared Error
- RMSE = Root Mean Squared Error
- MAPE = Mean Absolute Percentage Error
- sMAPE = Symmetric MAPE
Applications: Retail demand, finance, weather forecasting
Lower values = Better accuracy
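A small NumPy sketch of these error measures on made-up actual/forecast values; the arrays are illustrative only.

```python
import numpy as np

actual = np.array([100.0, 120.0, 130.0, 110.0])
forecast = np.array([98.0, 125.0, 128.0, 115.0])
err = actual - forecast

mae = np.mean(np.abs(err))                  # Mean Absolute Error
mse = np.mean(err ** 2)                     # Mean Squared Error
rmse = np.sqrt(mse)                         # Root Mean Squared Error
mape = np.mean(np.abs(err / actual)) * 100  # Mean Absolute Percentage Error
smape = np.mean(2 * np.abs(err) / (np.abs(actual) + np.abs(forecast))) * 100  # Symmetric MAPE

print(mae, mse, rmse, mape, smape)
```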
6. STL Decomposition
STL = Seasonal and Trend decomposition using Loess
Components:
- Trend: Long-term movement
- Seasonality: Repeated cycles
- Residual: Noise
Uses Loess smoothing for flexible decomposition
Applications: Sales trends, stock prices, weather patterns
Helps clean and analyze time series data before forecasting.
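A minimal STL sketch, assuming statsmodels and a synthetic monthly series; period=12 (yearly seasonality) is an assumption about the data, not something fixed by the notes.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic monthly series: upward trend + yearly seasonality + noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = np.linspace(100, 160, 48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + np.random.normal(0, 2, 48)
series = pd.Series(values, index=idx)

# Loess-based decomposition into trend, seasonal, and residual components
result = STL(series, period=12).fit()
print(result.trend.head())
print(result.seasonal.head())
print(result.resid.head())
```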