Workstream 1 RFC: Signing ML Artifacts: Building towards tamper-proof ML metadata records #4

EWickens opened this issue Nov 15, 2024 · 5 comments

@EWickens

Signing ML Artifacts: Building towards tamper-proof ML metadata records

Authors:

  • Mihai Maruseac
  • Daniel Major
  • Eoin Wickens

Summary

Cryptographic signing is widely used throughout our digital ecosystem, providing a reliable mechanism to ensure integrity and verify the provenance of network communications, executable binaries, and more. However, for machine learning models and associated artifacts, no standard method currently exists for cryptographically verifying model origins. This gap leaves model artifacts without a means to prove their source or guarantee they are tamper-proof.

Building on the work of the OpenSSF Model Signing Project, we propose adopting a PKI-agnostic method for creating claims on bundles of ML artifacts, and we solicit feedback on the design of the library and signing specification, most notably with an eye towards model-specific information that can be embedded within the claim.

Leveraging model signing and attestation, we can begin to build more robust supply chain trust in ML development, including chaining claims, such as those involved with hardware attestation, to build fully signed ML development ecosystems. Additionally, we propose leveraging signatures to embed machine-readable model card information that can be built into the claim. This foundational approach will be a critical step toward achieving provable, tamper-proof ML metadata records and pave the way toward verifiable ML model provenance.

At a very high level, ML development proceeds as in the following diagram:

[Diagram 1: high-level view of the ML development pipeline]

We see supply-chain risks in every component of the diagram, and we can protect against them by adding cryptographic signatures both to protect the integrity of the models and datasets and to record ML metadata and provenance information in a tamper-proof way. Model signing, in this approach, enables us to efficiently sign large numbers of arbitrary files, leveraging a single manifest and associated signature; a similar approach can also be taken for datasets. The approach is PKI-agnostic, with the signer choosing among the four supported PKI types (public or private Sigstore, bare key, self-signed certificate, or bring-your-own PKI). Our examples leverage Sigstore to sign models, as shown in the following diagram:

[Diagram 2: signing a model with Sigstore]
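To make the single-manifest idea concrete, here is a minimal sketch of manifest-based signing. It is illustrative only, not the model-signing library's actual implementation: the directory path is a placeholder, and a locally generated Ed25519 key stands in for whichever of the supported PKI options the signer chooses.

```python
# Minimal sketch of manifest-based signing (illustrative only; not the
# model-signing library's actual implementation): hash every file in a
# model directory, then sign the serialized manifest once.
import hashlib
import json
from pathlib import Path

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def build_manifest(model_dir: str) -> dict:
    """Map each file path under model_dir to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


key = Ed25519PrivateKey.generate()  # stand-in for any of the supported PKIs
manifest = build_manifest("./my-model")  # "./my-model" is a placeholder path
signature = key.sign(json.dumps(manifest, sort_keys=True).encode())
```

However many files the model comprises, only one signature is produced, which is what lets this scale to large models and datasets.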

Once we have the trust layer established, we can add supply-chain metadata to the model signature. For example, we can add SLSA predicates to the model signature and, in this way, record information about both the inputs and the outputs of every training process. This enables answering questions such as “What datasets has this model been trained on?” or “What hardware was used to train this model?”
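As a hedged illustration of what such a claim could look like, the following builds an in-toto Statement carrying a SLSA v1 provenance predicate. All digests, the buildType URI, and the builder ID are placeholders of our own invention, not values mandated by either specification.

```python
# Sketch of a training-provenance claim as an in-toto Statement with a
# SLSA v1 provenance predicate. Digests, URIs, and IDs are placeholders.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "my-model", "digest": {"sha256": "<manifest-digest>"}},
    ],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {
        "buildDefinition": {
            "buildType": "https://example.com/ml-training/v1",  # hypothetical
            "resolvedDependencies": [
                # answers: "What datasets has this model been trained on?"
                {
                    "name": "training-dataset",
                    "digest": {"sha256": "<dataset-digest>"},
                },
            ],
            # answers: "What hardware was used to train this model?"
            "internalParameters": {"hardware": "8x TPU"},
        },
        "runDetails": {
            "builder": {"id": "https://example.com/trainer"},  # hypothetical
        },
    },
}
```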

By coupling this information with GUAC, we can analyze the entire ML supply chain from a security perspective or leverage this data for incident response (e.g., when a model is discovered to have been improperly trained, we can identify all models that have been fine-tuned from it). For example, if a dataset is discovered to have been poisoned, we can create policies that would flag its inclusion in a training process and automation that raises an alert before the training even begins.
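A minimal sketch of such a policy gate follows, under the assumption that provenance claims use the statement shape from the previous sketch; in practice a tool like GUAC would drive this query, and the denylist digest is a placeholder.

```python
# Sketch of a pre-training policy gate: refuse to start training if any
# resolved dependency in a provenance claim is a known-poisoned dataset.
# The denylist contents and the claim shape are illustrative assumptions.
KNOWN_POISONED_DIGESTS = {"<digest-of-poisoned-dataset>"}


def gate_training(statement: dict) -> None:
    """Raise an alert before training begins if a poisoned dataset is an input."""
    deps = statement["predicate"]["buildDefinition"]["resolvedDependencies"]
    for dep in deps:
        if dep["digest"]["sha256"] in KNOWN_POISONED_DIGESTS:
            raise RuntimeError(
                f"{dep['name']} is on the poisoned-dataset denylist; aborting"
            )


gate_training(statement)  # `statement` from the previous sketch
```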

Priority

  • P0: This is critical to include in the next release from this workstream.

Level of Effort

  • Medium: This will take a week or two to document.

Drawbacks

Adoption of model signing will require a cross-industry effort. However, CoSAI is well-positioned to help drive its adoption and ensure that it is designed to suit as many use cases as possible. This will enable the community to build towards provable provenance of ML assets and chain claims together from various sources.

Alternatives

The two alternatives are:

  • Signing all files as a singular data blob. This approach has issues with very large models and does not scale to datasets at all. It also does not account for a changing subset of component files, especially when a consumer needs only a subset of the files in the directory (e.g., a single file type); see the sketch after this list.

  • Signing each model file separately, so that each file gets its own hash; however, this loses the context that multiple model artifacts together are required for at least ‘one inference pass’ of the model.
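For contrast with the first alternative, here is a short sketch of how the manifest approach supports verifying just the subset of files a consumer needs. It reuses the names from the earlier signing sketch, and the file name is a hypothetical example.

```python
# Continuing the earlier signing sketch: verify the single manifest
# signature, then re-hash only the files the consumer actually needs.
def verify_subset(model_dir, wanted_files, manifest, signature, public_key):
    payload = json.dumps(manifest, sort_keys=True).encode()
    public_key.verify(signature, payload)  # raises InvalidSignature if tampered
    for name in wanted_files:
        digest = hashlib.sha256((Path(model_dir) / name).read_bytes()).hexdigest()
        if digest != manifest[name]:
            raise ValueError(f"{name} does not match its signed digest")


# e.g. check only the tokenizer without re-hashing multi-GB weight files
verify_subset("./my-model", ["tokenizer.json"], manifest, signature, key.public_key())
```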

Reference Material & Prior Art

Model Signing Code repository: https://github.com/sigstore/model-transparency/

Talk at SOSS Fusion: https://www.youtube.com/watch?v=DqJz4qvYrTg

Google whitepaper on securing the AI software supply chain: https://research.google/pubs/securing-the-ai-software-supply-chain/

Unresolved questions

What information does your organization require to be present in a provenance claim?

Does this specification meet your needs?

Will your organization help support the development and adoption of this as our basis for signing model artifacts?

@mihaimaruseac

We also have v0.1 of the library released to quickly iterate on it while we get users: https://pypi.org/project/model-signing/

@mihaimaruseac

As another resource, here's a talk showing how GUAC could be used to ingest complete SLSA provenances and then get a view over the entire ML supply chain.

https://www.youtube.com/watch?v=uqU3fnmK0BA

All of this vision builds on top of model signatures, so this is just the first step.

@ollol88 commented Dec 2, 2024

> We also have v0.1 of the library released to quickly iterate on it while we get users: https://pypi.org/project/model-signing/

Hello there! We are looking into a couple of internal use cases where we'd try to test this. We'll keep you up to date and reach out once we have the scope well defined.

@mihaimaruseac

That sounds awesome! Our plan is to keep releasing patch versions of the library with every bug fix, to help resolve any remaining issues before a v1 release.

@Akilsrin commented Jan 2, 2025

This RFC shows strong potential as a cornerstone whitepaper for the first WS1 milestone, particularly when combined with the threat modeling work. It aligns well with the goals for software supply chain security in AI systems, focusing on composition, provenance, and practical applications (milestone link).

Regarding signing training data, I feel the recommended path would need some refinement. SLSA principles, while excellent for software artifacts, don't map effectively to training data's unique challenges: distribution shifts, quality variations, privacy considerations, and bias implications. The dynamic nature of ML datasets, combined with complex preprocessing pipelines and streaming data sources, makes SLSA-style attestation harder to implement. I would recommend pursuing this after the other low-hanging fruit has been attested.

Instead, I believe we should start with model signing at three critical points:

  • when a foundational/base model is completed and ready for distribution
  • after significant fine-tuning operations that materially change the model
  • at the pre-deployment stage, before the model moves to production and is used by customers
