In data mining, measures are quantitative techniques used to summarize, describe, and analyze large datasets. They help transform raw data into meaningful statistics by capturing properties such as central tendency, dispersion, and distribution. These measures are essential for data-driven decision-making, pattern discovery, and predictive analysis.
Data mining measures are classified based on how their aggregate functions behave when data is partitioned.
1. Holistic Measures
A measure is holistic if it cannot be computed from fixed-size summaries of partitions and requires access to the full dataset.
Key properties:
- Cannot be derived exactly by combining fixed-size sub-aggregates.
- Require complete data for exact computation.
Examples: median(), mode(), rank().
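A minimal sketch of why median() is holistic, using Python's standard statistics module. The two partitions and their values are made up for illustration; the point is only that combining per-partition medians does not, in general, reproduce the median of the full data.

```python
from statistics import median

# Hypothetical data split into two partitions of unequal size.
partition_a = [1, 2, 3, 4, 100]
partition_b = [5, 6]

# Attempt: combine the per-partition medians (3 and 5.5).
median_of_medians = median([median(partition_a), median(partition_b)])

# Exact answer: the partitions must be merged so every value is available.
true_median = median(partition_a + partition_b)

print(median_of_medians)  # 4.25
print(true_median)        # 4
```

The two results differ, so a single median per partition is not a sufficient summary here.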
2. Distributive Measures
A measure is distributive if it can be computed on data partitions and the partial results can be combined to obtain the final result.
Key properties:
- Works by dividing data into subsets.
- Final result equals the aggregate computed over the full dataset.
Examples: sum(), count(), min(), max().
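The following sketch illustrates the distributive property in plain Python. The partition layout and values are assumptions chosen for the example. Each partition is aggregated on its own, and the partial results are then combined. Note that partial counts are combined by summing them.

```python
# Hypothetical dataset already split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]

# Step 1: aggregate each partition independently.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
partial_mins = [min(p) for p in partitions]
partial_maxs = [max(p) for p in partitions]

# Step 2: combine the partial results.
total = sum(partial_sums)      # sum of sums
count = sum(partial_counts)    # counts are combined by summing
lowest = min(partial_mins)     # min of mins
highest = max(partial_maxs)    # max of maxes

# Check against the same aggregates computed over the full dataset.
full = [x for p in partitions for x in p]
print(total == sum(full), count == len(full))       # True True
print(lowest == min(full), highest == max(full))    # True True
```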
3. Algebraic Measures
A measure is algebraic if it can be computed by applying an algebraic function to a fixed number of arguments, each of which is obtained from a distributive measure.
Key properties:
- Combines results of distributive functions.
- Requires a fixed-size summary.
Examples: avg() (uses sum() and count()), MinN(), MaxN(), centerOfMass().
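As an illustration, avg() can be sketched as a fixed-size summary (sum, count) that is merged across partitions and finalized at the end. The helper names summarize, merge, and finalize are hypothetical, chosen for this example only.

```python
def summarize(values):
    """Return the fixed-size summary (sum, count) for one partition."""
    return (sum(values), len(values))

def merge(a, b):
    """Combine two (sum, count) summaries component-wise."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(summary):
    """Turn a (sum, count) summary into the average."""
    total, count = summary
    return total / count

# Hypothetical partitions of unequal size.
partitions = [[10, 20, 30], [40, 50]]

summary = (0, 0)
for p in partitions:
    summary = merge(summary, summarize(p))

average = finalize(summary)
print(average)  # 30.0

# Averaging the per-partition averages gives a different, incorrect value
# because the partitions have different sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print(naive)  # 32.5
```

Carrying both sum and count is what makes the result exact. The average alone is not enough to combine partitions of different sizes.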
Computation of Measures
Computing measures involves applying mathematical operations and aggregation logic to structured data. The steps in measure computation are:
1. Data Collection and Preprocessing
- Remove duplicates and handle missing values.
- Convert data to consistent formats.
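A minimal preprocessing sketch in plain Python. The record layout (id and amount fields), the sample values, and the choice to drop rows with missing values instead of imputing them are all assumptions made for illustration.

```python
# Hypothetical raw records with a duplicate, a missing value,
# and an inconsistently formatted number.
raw_records = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": None},      # missing value
    {"id": 1, "amount": "10.5"},    # exact duplicate of the first record
    {"id": 3, "amount": " 7 "},     # stray whitespace
]

def clean(records):
    """Remove duplicates, drop missing values, and convert amounts to float."""
    seen = set()
    cleaned = []
    for rec in records:
        key = (rec["id"], rec["amount"])
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        if rec["amount"] is None:
            continue  # assumption: drop rather than impute
        cleaned.append({"id": rec["id"], "amount": float(rec["amount"].strip())})
    return cleaned

cleaned = clean(raw_records)
print(cleaned)  # [{'id': 1, 'amount': 10.5}, {'id': 3, 'amount': 7.0}]
```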
2. Measure Selection
- Choose appropriate measures based on analysis goals (e.g., sum() for totals, avg() for typical values, variance() for dispersion).
3. Formula Application
- Apply statistical formulas:
  - Mean = sum of values ÷ number of values
  - Standard deviation = square root of the average squared deviation from the mean (population form)
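The formulas can be sketched directly in Python and checked against the standard statistics module. The sample values are made up, and the population form of the standard deviation (dividing by n) is assumed; the sample form divides by n - 1 instead.

```python
import math
from statistics import mean, pstdev

# Hypothetical sample values.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Mean = sum of values / number of values
manual_mean = sum(values) / n

# Population standard deviation = sqrt(average squared deviation from the mean)
manual_std = math.sqrt(sum((x - manual_mean) ** 2 for x in values) / n)

print(manual_mean)  # 5.0
print(manual_std)   # 2.0

# The library functions should agree with the manual formulas.
print(math.isclose(manual_mean, mean(values)))    # True
print(math.isclose(manual_std, pstdev(values)))   # True
```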
4. Aggregation
- Combine results from partitions or subsets for large datasets.
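One way to combine partition results for a measure such as standard deviation is to carry the summary (count, sum, sum of squares) per partition. This is a sketch under that assumption, with hypothetical helper names and data. The sum-of-squares form is simple but can lose precision on large or widely scaled values, so it is shown for illustration only.

```python
import math
from functools import reduce

def summarize(values):
    """Fixed-size summary for one partition: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(x * x for x in values))

def merge(a, b):
    """Combine two summaries component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def population_std(summary):
    """Population standard deviation from (count, sum, sum of squares)."""
    n, total, total_sq = summary
    mean = total / n
    variance = total_sq / n - mean * mean
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding

# Hypothetical partitions, e.g. held on different nodes.
partitions = [[2, 4, 4], [4, 5, 5], [7, 9]]

combined = reduce(merge, (summarize(p) for p in partitions))
std = population_std(combined)

print(combined)  # (8, 40, 232)
print(std)       # 2.0
```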
5. Interpretation and Reporting
- Analyze computed values against benchmarks or historical data.
- Use results to identify trends, patterns, and anomalies.
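As one possible illustration of comparing computed values against historical data, the sketch below flags values that lie more than a chosen number of standard deviations from a historical mean. The data, the variable names, and the threshold of 3 are assumptions, not a recommended rule.

```python
from statistics import mean, pstdev

# Hypothetical historical values and newly observed values.
historical = [100, 102, 98, 101, 99]
current = [100, 97, 130]

baseline = mean(historical)   # historical benchmark
spread = pstdev(historical)   # historical dispersion
threshold = 3                 # assumed cutoff, in standard deviations

# Flag values that deviate from the benchmark by more than the cutoff.
anomalies = [x for x in current if abs(x - baseline) > threshold * spread]

print(baseline)   # 100
print(anomalies)  # [130]
```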