A Guide to Data Quality Testing for AI Applications based on Standards
CITCOM.AI
This report provides a detailed exploration of the necessary prerequisites for effective
data quality testing, including the identification of key data attributes and the
establishment of specific quality benchmarks. It discusses various data quality
characteristics and metrics for assessing and improving the quality of data used in AI
systems. In particular, the report discusses the relevant standards and guidelines that
govern data quality testing, offering a structured framework for organizations to adhere
to these practices.
ISO/IEC 22989 outlines that machine learning (ML) involves refining model parameters
through computational methods so that the model accurately represents the data or
experiences it is exposed to. Further expounded by ISO/IEC 23053, machine learning
is identified as a subset of artificial intelligence that utilizes computational methods to
allow systems to derive insights from data or experiences. ML is applicable to an array of
tasks reliant on data and ML algorithms. Data within ML is differentiated into several
types: training data, validation data, testing data, and production data. In the case of
supervised ML, a model is developed through the training of an algorithm using training
data. Validation and testing data are subsequently employed to confirm the model's
operation within acceptable bounds. Following this, the model applies what it has
learned to make predictions or decisions based on new, unseen production data. The
efficacy of a trained ML model is tied to the data quality across all these categories.
ISO/IEC 23053 outlines a variety of general ML algorithms, noting that each may be
differently affected by the quality attributes of the data they process.
Example 1:
Representativeness is a crucial data quality attribute for machine learning. If the training data
fails to adequately mirror the population seen in the production data, there's a heightened risk
that the trained ML model will draw incorrect conclusions from that production data. This issue
becomes particularly significant when the model's decisions impact people, potentially leading to
biased outcomes against underrepresented groups.
Example 2:
It is worth noting that the same dataset can serve multiple analytics or machine learning
purposes. For instance, a data holder might distribute data to various users, both within
and outside their organization. Similarly, a data user may be permitted to employ the
data for several different tasks.
Fig. 1 Data quality elements in data life cycle for analytics and ML (ISO/IEC 5259-2).
Objective: Ensure that the data meet the requirements identified in the previous stage
and support the objectives of the analytics and ML projects. This includes designing data
architecture, estimating efforts for data acquisition and preparation, and planning for
data quality management.
Objective: Collect data (both live and historic) identified in the planning stage. This
involves:
3 ISO/IEC 5259-2 Artificial intelligence — Data quality for analytics and machine learning (ML).
Available at https://www.iso.org/standard/81860.html
Objective: Process the collected data into a form suitable for input into analytics and ML
models, ensuring data quality through:
- Enriching data by linking diverse sources and annotating data for supervised learning
tasks.
Objective: Apply prepared data to analytics and ML projects and assess if they meet the
performance requirements. If not, analyze potential data or algorithmic issues,
communicate these issues for upstream quality improvement, and possibly repeat earlier
stages to enhance data quality.
These stages illustrate a comprehensive framework for handling data from its initial
requirement gathering through to decommissioning, emphasizing continuous
improvement in data quality to meet the specific needs of analytics and ML applications.
Data will be utilized to train a deep neural network ML model that forecasts product
sales, leveraging the attributes of a marketing strategy. This model will undergo training
and deployment through cloud-based services. In this context, a "data quality subject"
refers to any entity impacted by data quality. According to ISO/IEC 5259-1, a data
quality characteristic encompasses a group of data quality attributes related to the overall
data quality, such as accuracy, completeness, and precision. Data quality requirements
detail the necessary properties or attributes of data, complete with acceptance criteria
that are tailored to how the data will be used. These criteria might be quantitative,
qualitative, or described in other terms.
4 ISO/IEC 5259-1:2024 Artificial intelligence — Data quality for analytics and machine learning
1. Assigning a singular readiness level (DRL 1-9) to large and complex datasets that
often contain imperfections such as missing values, inaccuracies, or incomplete
readings.
2. The context-sensitive nature of data readiness which may imply different things
to different users.
3. Variability in the methods available to users for handling data imperfections,
which can affect the accuracy of the assigned DRL.
4. The continuous production of sensor data from sources like real cars, which are
essential yet challenging to quality control.
The DRL system used in the ROADVIEW project assigns a scale from 1 to 9 to describe the
effort, time, and cost required to address or rectify data issues.
The initial concept of Lawrence's framework divides data readiness into three bands.
These bands are detailed in the ROADVIEW project in a structure analogous to the 9-level
scale used in TRLs, providing a comprehensive measure of data readiness and
highlighting the importance of understanding and preparing data thoroughly to ensure
successful project outcomes.
content/uploads/sites/59/2024/05/ROADVIEW_Deliverable-4.5_v04.pdf
Lastly, rigorous data governance practices are mandated for all high-risk AI systems
to ensure compliance with established data quality requirements, regardless of
whether they involve training models or other methodologies.
Various analytics and machine learning tasks may have distinct data quality needs. These
differing requirements can influence the selection of a data quality model, along with the
corresponding data quality measures and evaluation criteria.
The AI Act proposal outlines the requirements of data quality in Article 10 on data
governance regulation with stringent guidelines for the development of high-risk AI
systems, emphasizing the necessity of utilizing high-quality training, validation, and
testing datasets. These datasets must adhere to rigorous data governance and
management practices as discussed in section 1.6. These practices are essential to ensure
the reliability and fairness of high-risk AI systems. The primary requirements include:
• Syntactic accuracy: This involves the extent to which data values conform to a set
of syntactically correct values within a specific domain.
• Semantic accuracy: This pertains to how closely data values align with a set of
semantically correct values within a relevant domain.
A data item is considered syntactically correct when its data value matches its explicit
data type, and semantically correct when the data value aligns with expected values
useful for the machine learning (ML) task at hand. Given that ML models are based on
mathematical frameworks, low syntactic or semantic accuracy in the training, validation,
testing, or production datasets can lead to inaccuracies in the model or the conclusions
it draws.
• Correctness of label names: Ensuring that labels are correctly named according
to what they signify.
• Correctness of labeled tags: Verifying that tags attached to labels are accurate.
• Correctness of label sequence contents: Ensuring the sequence of labels is
correctly ordered and appropriate for the dataset.
Example 1:
If the phrase “lazy dog” is entered as “lzy dg” an ML-based natural language understanding
system can fail to correctly interpret the phrase.
Example 2:
If the number 100 is entered as 1000 in the training data, a regression model can fail to
correctly calculate the weight of the related feature; if the error occurs in the
production data, inferences can be incorrect.
Syntactic data accuracy: Ratio of closeness of the data values to a set of values
defined in a domain:
(number of data items whose values are syntactically accurate) /
(total number of data items for which syntactic accuracy is required)
Applies to all data life cycle stages except data design; target objects: data file, data item, and data value.
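As an illustrative sketch (the date pattern and sample values below are hypothetical, not taken from the standard), the syntactic accuracy ratio can be computed by validating each value against its domain's syntax rule:

```python
import re

# Hypothetical domain rule: dates must be written as YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def syntactic_accuracy(values, is_valid):
    """Ratio of data items whose value is syntactically accurate."""
    checked = [v for v in values if v is not None]  # items where the rule applies
    if not checked:
        return 1.0
    return sum(1 for v in checked if is_valid(v)) / len(checked)

dates = ["2024-01-31", "2024-13-01", "31/01/2024", "2023-07-04"]
ratio = syntactic_accuracy(dates, lambda v: bool(DATE_RE.match(v)))  # 3 of 4 match
```

Note that "2024-13-01" passes the syntactic check even though month 13 does not exist; catching it is the job of a semantic accuracy measure, which is exactly the distinction the standard draws.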
Semantic data accuracy: Ratio of data values that are semantically accurate in a
specific context:
(number of data values that are semantically accurate) /
(total number of data values for which semantic accuracy is required)
Applies to all data life cycle stages except data design; target objects: data file and data value.
Data accuracy range: Are data values included in the required interval?
(number of data items having a value included in the specified interval, from minimum to maximum) /
(number of data items for which a required interval of values can be defined)
Applies to all data life cycle stages except data design; target objects: data file, data item, and data value.
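A minimal sketch of the range measure (the feature name and interval bounds are illustrative assumptions):

```python
def accuracy_range_ratio(values, lo, hi):
    """Share of data items whose value lies within the required [lo, hi] interval."""
    applicable = [v for v in values if v is not None]
    if not applicable:
        return 1.0
    return sum(1 for v in applicable if lo <= v <= hi) / len(applicable)

ages = [25, 41, -3, 130, 58]                # hypothetical 'age' feature
ratio = accuracy_range_ratio(ages, 0, 120)  # 3 of 5 values fall in [0, 120]
```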
2.1.2 Completeness
ISO/IEC 25012 defines data completeness as having values for all required attributes and
entity instances. ML algorithms might experience failure when they come across any
empty data entries in training, validation, or testing datasets. Similarly, trained ML
models might also malfunction when they encounter null values in production data.
Completeness measures are critical for ML practitioners to ensure their data meets
necessary standards, and they provide guidance on whether to implement additional
data imputation methods as outlined in ISO/IEC 5259-48. The concept of data
completeness varies across different scenarios and must be evaluated within the context
of its specific application. Criteria for assessing data completeness might include:
- For ML-based image classification, checking for unlabeled samples that are unsuitable
for use in supervised learning.
- For ML-based object detection, evaluating any incompleteness in the labeling of
bounding boxes around objects.
In practice, it is common to encounter samples containing multiple objects across various
categories, making it challenging to obtain images with a single isolated object
dominating the frame. Thus, when evaluating the completeness of a dataset for ML-based
image recognition, considerations should include the presence of any intended object
within a sample, the categorization of all intended objects, and the labeling of all
detected objects with bounding boxes or other identification methods.
Example 1:
A completeness assessment shows that over half of the data values for the zip code
feature are missing. Considering that the zip code is not essential for their classification
task, the data scientist opts to exclude this feature from the training, validation, testing,
and production datasets.
Example 2:
8 ISO/IEC 5259-4:2024 Artificial intelligence — Data quality for analytics and machine learning
Example 3:
In a dataset used for an ML clustering task, a completeness check finds a few records
with empty data items. The data scientist chooses to remove these incomplete records
from the training dataset.
Example 4:
For an ML classification task assessing plant images across the United States, a
completeness measure is used to evaluate the proportion of missing data relative to the
expected number of data items for proper dataset fidelity. For example, if the dataset is
missing ten plant types from the northeastern U.S., this would be noted in the
completeness evaluation.
Feature completeness: Ratio of data items for a given feature in a dataset that have
non-null data values.
(number of data items associated with the given feature with an associated data value that is not null) /
(number of data items associated with the given feature in the dataset)
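Feature completeness can be sketched as follows (the record layout is hypothetical; compare Example 1 above, where a mostly-empty zip code feature was dropped):

```python
def feature_completeness(records, feature):
    """Ratio of records whose value for `feature` is present (not None or empty)."""
    values = [rec.get(feature) for rec in records]
    present = [v for v in values if v not in (None, "")]
    return len(present) / len(values) if values else 1.0

rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": None,    "city": "Paris"},
    {"zip": "75001", "city": "Lyon"},
    {"zip": "",      "city": "Oslo"},
]
score = feature_completeness(rows, "zip")  # 2 of 4 records have a zip value
```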
2.1.3 Consistency:
ISO/IEC 25012 defines consistency in terms of data being in agreement with other data
and lacking contradictions. Consistency is crucial for machine learning because the
features utilized in the training data need to collectively support a model that can
generalize to production data.
Distribution of data values: Statistical distribution of data values for a given feature
in the dataset.
Data format consistency: Consistency of data format of the same data item
(according to ISO/IEC 25024).
(number of data items where the format of all properties is consistent across different data files) /
(total number of data items for which format consistency can be defined)
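One way to approximate format consistency against a single agreed format (the ISO-style date format here is an assumption chosen for illustration):

```python
from datetime import datetime

def format_consistency(values, parse):
    """Ratio of values that conform to the single agreed format."""
    ok = 0
    for v in values:
        try:
            parse(v)  # the agreed format's parser; raises ValueError on mismatch
            ok += 1
        except ValueError:
            pass
    return ok / len(values) if values else 1.0

dates = ["2024-05-01", "01.05.2024", "2024-05-02"]
score = format_consistency(dates, lambda v: datetime.strptime(v, "%Y-%m-%d"))
```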
2.1.4 Credibility:
ISO/IEC 25012 defines credibility as the extent to which data attributes are considered
believable by users within a specific usage context. This applies to individual data items,
related items within a data record, and entire datasets. The context in which the data is
used can affect its perceived accuracy and trustworthiness. Data may be altered during
processes such as transit, storage, or computation, either by authorized or unauthorized
parties. A particular concern in machine learning is the risk of unauthorized parties
tampering with training, validation, testing, and production data, potentially rendering
trained models ineffective or influencing the outcomes they produce.
Example 1:
A dataset intended for training, validating, and testing an ML model does not perform as
expected on production data. A security audit reveals that unauthorized changes were
made to the data in the training set by an intruder.
Example 2:
A training dataset features numerical data with significantly different ranges. To achieve
uniformity, a data scientist decides to normalize these data values. While normalization
alters the data values, their credibility remains intact within the machine learning
context as their underlying meaning is preserved.
Value credibility: Degree to which information items are regarded as true, real, and
credible (according to ISO/IEC 25024).
(number of information items whose values are validated/certified by a specific process) /
(total number of information items to be validated/certified)
Data model credibility: Degree to which data model provides credible information
(according to ISO/IEC 25024).
(number of elements of a data model with an appropriate definition validated/certified by a specific process) /
(total number of elements of the data model)
2.1.5 Currentness
Data currentness is the time difference (ΔT) between the time a data sample is recorded
and the time it is used. It ensures that the data is of the correct age relative to its intended
usage. For machine learning, currentness may relate to an appropriate age range for the
ML task. For instance, data concerning demographic groups may be outdated due to
changes in regulations and societal norms. Similarly, economic data spanning several
decades may lead to inaccurate ML models if not adjusted for inflation, exchange rates,
and other time-sensitive factors. These variances, often referred to as data-drift, impact
the data used in production compared to that used in training and testing phases. This
can be mitigated by maintaining data currentness. Dataset currentness can be assessed
at the level of individual features and of whole records, as in the measures below.
9 ISO/IEC 25024:2015 Systems and software engineering — Systems and software Quality
Feature currentness: Ratio of data items for a feature in the dataset that fall within the
required age range.
(number of data items for a feature that fall within the required age range) /
(total number of data items for the feature)
Record currentness: Ratio of data records in the dataset where all data items in the
record fall within the required age range.
(number of data records that fall within the required age range) /
(total number of data records in the dataset)
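Record currentness can be sketched with recording dates (the 90-day window is an assumed requirement, not one from the standard):

```python
from datetime import date, timedelta

def record_currentness(recorded_on, today, max_age_days):
    """Ratio of records whose recording date falls within the required age range."""
    fresh = sum(1 for d in recorded_on if today - d <= timedelta(days=max_age_days))
    return fresh / len(recorded_on) if recorded_on else 1.0

today = date(2024, 6, 1)
dates = [date(2024, 5, 20), date(2023, 1, 1), date(2024, 6, 1)]
ratio = record_currentness(dates, today, max_age_days=90)  # 2 of 3 records are fresh
```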
2.1.6 Accessibility
Accessibility involves the extent to which data is reachable in a given usage context,
especially for individuals requiring assistive technologies or special setups due to
disabilities. Furthermore, it is essential that datasets are readily accessible and
seamlessly deployable through suitable tools for analytics and machine learning
applications.
Data format accessibility: Degree to which data or information are not accessible by the
intended users due to a specific format.
(number of data items not accessible due to their format) /
(total number of data items for which format accessibility can be defined)
2.1.7 Compliance
Data item compliance: Degree to which data items meet compliance requirements
2.1.8 Confidentiality
Confidentiality is the degree to which data has attributes that ensure that access and
interpretation are restricted to authorized users within a specific usage context.
Confidentiality can be evaluated from both inherent and system-dependent perspectives.
Encryption usage: Degree to which data values fulfil the requirement of
encryption.
Non-vulnerability: Degree to which data items defined as confidential can be accessed
by authorized users only.
2.1.9 Efficiency
Efficiency is the degree to which data has attributes that can be processed and provide
the expected levels of performance by using the appropriate amounts and types of
resources in a specific context of use.
Data format efficiency: Rate of space unnecessarily occupied due to the data format
definition.
1 − (size in bytes of a record in a data file unnecessarily occupied due to the data format definition) /
(size in bytes of the record in the data file due to the data format definition)
Risk of wasted space: Wasted space in comparison with benchmarked average space.
Sum(size in bytes used for data in all physical data files of the database) −
size in bytes assumed as target (i.e., from a benchmark) for efficient data storage of the database
2.1.10 Precision
ISO/IEC 25012 defines precision as the exactness or ability of data to discriminate.
ISO/IEC 25024 illustrates this through examples such as the number of decimal places
in real numbers. In machine learning contexts, the precision of data—such as the decimal
places in data values—can influence the significance of a feature in a trained ML model.
For instance, a feature with multiple data items at 99.4 may carry more weight than a
feature with values rounded to 99. Conversely, features with values rounded up may
weigh more than those with finer precision. Data users need to consider how precision
impacts the performance of the ML model when setting data precision requirements.
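The effect described above can be checked mechanically. The sketch below counts how many raw values (kept as strings so trailing precision is preserved) carry at least a required number of decimal places; the threshold and readings are illustrative:

```python
def decimal_places(raw):
    """Number of decimal places in a raw (string) data value."""
    return len(raw.split(".")[1]) if "." in raw else 0

def precision_ratio(raw_values, required_places):
    """Share of values recorded with at least the required precision."""
    meets = sum(1 for v in raw_values if decimal_places(v) >= required_places)
    return meets / len(raw_values) if raw_values else 1.0

readings = ["99.4", "99", "100.25"]  # raw values as captured at the source
ratio = precision_ratio(readings, required_places=1)  # "99" fails the requirement
```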
2.1.11 Traceability
Traceability measures indicate the extent to which data possesses attributes that enable
an audit trail, documenting access and any modifications to the data within a particular
usage context.
Traceability of data values: Degree to which the information of user access to the
data value was traced.
(number of data values for which the required access traceability of values exists) /
(number of data values for which access traceability is expected)
User access traceability: Possibility to keep information about users' access to data
using system capabilities, for investigating who read/wrote data.
(number of data items for which user access traceability is expected and realized) /
(number of data items for which user access traceability is expected)
Data values traceability: Possibility to trace the history of a data item value using
system capabilities.
(number of data items whose values are traceable using system capabilities) /
(number of data items whose values are expected to be traceable using system capabilities)
2.1.13 Availability
Availability measures assess the extent to which data attributes allow it to be accessed by
authorized users and applications within a defined usage context.
Data availability ratio: Ratio of data items available when required (e.g., during
backup/restore procedures).
(number of data items available in a specific period of time) /
(number of data items requested in the same period of time)
2.1.14 Portability
ISO/IEC 25012 defines the data quality characteristic of portability as the ability to
transfer data from one system to another within a specific context, while maintaining its
quality. In the realm of analytics and machine learning, data might be processed across
various systems—for instance, data might be collected on one system, undergo quality
processing on a second system, and then transferred to a third system for training an ML
model. If the data does not retain its quality during these transfers, the effectiveness of
the trained ML model could be compromised. It's crucial that data portability
requirements are clearly established to ensure data maintains its integrity throughout its
movement across systems.
Data portability ratio: Data quality does not decrease after porting (or migration).
(number of data items that preserve their existing quality after porting) /
(total number of data items ported)
2.1.15 Recoverability
Recoverability measures assess the extent to which data attributes support the
maintenance and preservation of a specified level of operations and quality, even during
failures, within a particular usage context.
Data recoverability: Degree to which data stored in a device are successfully and
correctly recovered
(number of data items successfully and correctly recovered by the system) /
(total number of data items that are required to be recovered)
2.2.1 Auditability
Audited records: Ratio of the records in the dataset that have been audited.
(number of records in the dataset that have been audited) /
(total number of records in the dataset)
Auditable records: Ratio of the records in the dataset that are available for audit.
(number of records in the dataset available for audit) /
(total number of records in the dataset)
2.2.2 Identifiability
ISO/IEC 29100 defines identifiability as the ability to recognize an individual either
directly or indirectly through specific personally identifiable information (PII) within a
dataset. It is crucial to determine if any PII within a dataset can identify an individual, as
legal constraints in various regions may regulate or prohibit such activities. To mitigate
risks of identifiability, de-identification processes can be implemented across training,
validation, testing, and production data sets.
For instance, an ML model designed for targeted advertising might use data from search
engine queries, including users' IP addresses, which are recognized as PII under certain
legal frameworks. To ensure compliance and enhance privacy, anonymization
techniques are employed to remove the IP addresses before dividing the dataset into
training, validation, and testing sets.
Identifiability ratio: Ratio of data records in the dataset that can be used for
identifiability.
(number of data records that contain data items that can be used for identifiability, either on their own or in conjunction with other data items) /
(total number of data records in the dataset)
2.2.4 Balance
Balance refers to the equitable distribution of samples across all aspects relevant to the
dataset. For instance, in a dataset with multiple categories, balance would mean having
a roughly equal number of samples in each category. In image datasets, important factors
might include label relevance to business logic, image resolution, brightness, and
attributes like the width-to-height ratio and size of labeled bounding boxes. These factors
are critical because they can significantly impact the performance of a machine learning
(ML) model.
Bounding box height-to-width ratio balance: Reciprocal of the maximal ratio of the
bounding box height-to-width ratio difference over the averaged bounding box
height-to-width ratio of the samples in a dataset.
(average bounding box height-to-width ratio A over all the samples in the dataset) /
(maximum absolute difference between the bounding box height-to-width ratio of each sample in the dataset and A)
Category of bounding box area balance: Reciprocal of the maximal ratio of the
averaged bounding box area of a category over the averaged bounding box area of all the
samples in a dataset.
Samples bounding box area balance: Reciprocal of the maximal ratio of the
bounding box area of a sample over the averaged bounding box area of all the samples in
a dataset.
(average bounding box area A over all the samples in the dataset) /
(maximum absolute difference between the averaged bounding box area of each sample in the dataset and A)
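The bounding-box balance measures above share one shape: the dataset average divided by the largest absolute deviation from it. A sketch of that shape, applied to hypothetical height-to-width ratios:

```python
def deviation_balance(values):
    """Average of `values` divided by the largest absolute deviation from that
    average; larger results indicate a more balanced dataset."""
    avg = sum(values) / len(values)
    max_dev = max(abs(v - avg) for v in values)
    return avg / max_dev if max_dev else float("inf")

balance = deviation_balance([1.0, 1.2, 0.8])  # made-up height-to-width ratios
```

The same function applies unchanged to the bounding box area variant by passing per-sample areas instead of ratios.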
2.2.5 Diversity
Diversity in a dataset signifies the variety among samples regarding the target data. For
machine learning models, it is crucial that samples differ sufficiently. Homogeneous
datasets can lead to overfitting, reducing the model’s ability to generalize. Diversity in a
dataset indicates a range of different value domains, labels, clusters, and distributions
among the data entries. Using generative ML models to enhance data diversity can be
beneficial, but these methods may not be effective if the original dataset lacks sufficient
diversity. Diversity, closely linked with representativeness and balance, is a data quality
attribute that helps assess the fidelity of a dataset. The measurement of diversity should
be tailored to the specific requirements of the ML task concerning the target data.
Relative label abundance: Portion of the individual data (i.e., items, records,
frames) having the same label in a dataset.
(number of individual data having the target label) /
(total number of individual data in the dataset)
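Relative label abundance can be tabulated for every label at once, which makes dominating labels easy to spot (the labels below are illustrative):

```python
from collections import Counter

def relative_label_abundance(labels):
    """Fraction of individual data carrying each label in the dataset."""
    n = len(labels)
    return {label: count / n for label, count in Counter(labels).items()}

abundance = relative_label_abundance(["cat", "dog", "cat", "cat"])
```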
2.2.6 Relevancy
Relevance is defined as how suitable a dataset is for a specific context, assuming it meets
other quality criteria like accuracy, completeness, consistency, and currentness. For
machine learning, relevance means that the features selected in the training data, and
their values, effectively predict the target variable.
Feature relevance: Ratio of features in the dataset that are relevant to the given
context.
(number of features in the dataset deemed to be relevant in the context of the use of the data) /
(total number of features in the dataset)
Record relevance: Ratio of records in the dataset that are relevant to the given
context.
(number of records in the dataset deemed to be relevant in the context of the use of the data) /
(total number of records in the dataset)
2.2.7 Representativeness
ISO 20252 defines representativeness as the extent to which a sample accurately mirrors
the target population under investigation. In supervised machine learning, the training
dataset serves as the sample, while the production data represents the broader
population. If the training data inadequately reflects the production data, the resulting
ML model may not perform as intended. Representativeness is closely tied to the
relevance data quality characteristic, as a dataset that does not faithfully represent the
population being studied is unlikely to yield reliable predictions for the target variable.
For instance, a facial recognition system trained solely on images of individuals with light
skin tones may struggle to accurately identify individuals with darker skin tones.
Sample similarity: Ratio of similar samples in a dataset; the lower, the better.
1 − (total number of clusters, resulting from a clustering algorithm, on all samples of the dataset) /
(number of all samples in the dataset)
Where G is a matrix with M rows and M columns and is equal to Φ_norm^T Φ_norm.
NOTE 1 Φ_norm is the normalized dataset, calculated from Φ_NxM (NOTE 2) by subtracting from each column
its mean and normalizing to 1. Visually, the normalized data lie on the surface of a hypersphere of
radius 1 centered at the origin (M ≤ N).
NOTE 2 Φ_NxM is an N-by-M matrix, with N data records (vectors) and M features (dimensions).
NOTE 3 The number of principal components K ≤ M is the smallest number of eigenvalues of C_MxM
(NOTE 4), starting from the biggest, chosen in order to represent 95% of their sum.
NOTE 4 C_MxM is an M-by-M matrix, with M rows and M columns, equal to Φ_mean^T Φ_mean (NOTE 5).
NOTE 5 Φ_mean is calculated from Φ_NxM by subtracting from each column its mean. Visually, the data
Φ_mean fit a (hyper)ellipsoid with the eigenvectors as axes, centered at the origin.
NOTE 6 Principal components can be selected with different criteria or percentages; see Annex A in
ISO/IEC DIS 5259-2 for measure modifications.
NOTE 7 A measurement of zero means the least similarity. The similarity measure yields zero when
the number of clusters equals the number of samples, indicating that no sample is similar to another.
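As a toy illustration of the measure only — substituting exact-duplicate grouping for the PCA-plus-clustering procedure described in the notes above — the behavior at the boundary cases is easy to see:

```python
def sample_similarity(samples):
    """1 - (number of clusters / number of samples); here a 'cluster' is simply a
    group of identical samples, a stand-in for a real clustering algorithm."""
    clusters = len(set(samples))
    return 1 - clusters / len(samples)

score = sample_similarity([(1, 2), (1, 2), (3, 4), (5, 6)])  # one duplicated sample
all_unique = sample_similarity([(1, 2), (3, 4)])             # every sample unique
```

When every sample is unique, clusters equal samples and the measure is zero, matching NOTE 7.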
2.2.9 Timeliness
Timeliness, also called latency, is the ΔT between the time when a phenomenon occurs
and the time when the recorded data for that phenomenon are available for use. This
distinguishes it from currentness (i.e., the ΔT between the time a data sample is
recorded and the time it is used).
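The two ΔT notions can be expressed side by side (the timestamps are illustrative):

```python
from datetime import datetime

def timeliness(event_time, available_time):
    """Latency: seconds between a phenomenon and its data becoming usable."""
    return (available_time - event_time).total_seconds()

def currentness(recorded_time, used_time):
    """Age of a data sample, in seconds, at the moment it is used."""
    return (used_time - recorded_time).total_seconds()

lat = timeliness(datetime(2024, 6, 1, 12, 0, 0), datetime(2024, 6, 1, 12, 0, 45))
age = currentness(datetime(2024, 6, 1, 12, 0, 45), datetime(2024, 6, 1, 13, 0, 45))
```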
Tabular data is well-suited for statistical analysis and traditional machine learning
algorithms such as decision trees, ensemble methods, and linear regression. Some
relevant standards:
Available at https://www.iso.org/standard/81684.html
Image data requires specific preprocessing steps to ensure that the data is in a suitable
form for analysis by machine learning models. Common preprocessing tasks include
resizing images to a uniform dimension, normalizing pixel values (scaling pixel values to
a range, typically 0 to 1), and augmenting the dataset through techniques such as
rotation, scaling, and cropping to improve the robustness of the model. Advanced
computer vision applications often utilize convolutional neural networks (CNNs), which
are specifically designed to process pixel data and recognize spatial hierarchies in images,
making them effective for tasks such as image recognition, object detection, and
semantic segmentation. Some relevant image data specific standards:
Available at https://www.iso.org/standard/50867.html
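A minimal sketch of the pixel-normalization step mentioned above, in pure Python for illustration (real pipelines typically use array libraries for this):

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values (0-255) into the [0, 1] range."""
    return [[px / 255.0 for px in row] for row in image]

img = [[0, 128, 255],
       [64, 32, 16]]
norm = normalize_pixels(img)  # all values now lie in [0, 1]
```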