Workstream 1 RFC: Signing ML Artifacts: Building towards tamper-proof ML metadata records #4

EWickens opened this issue Nov 15, 2024 · 5 comments

@EWickens

Signing ML Artifacts: Building towards tamper-proof ML metadata records

Authors:

  • Mihai Maruseac
  • Daniel Major
  • Eoin Wickens

Summary

Cryptographic signing is widely used throughout our digital ecosystem, providing a reliable mechanism to ensure integrity and verify the provenance of network communications, executable binaries, and more. However, for machine learning models and associated artifacts, no standard method currently exists for cryptographically verifying model origins. This gap leaves model artifacts without a means to prove their source or guarantee they are tamper-proof.

Building on the work of the OpenSSF Model Signing Project, we propose adopting a PKI-agnostic method for creating claims on bundles of ML artifacts, and we solicit feedback on the design of the library and signing specification, most notably with an eye towards model-specific information that can be embedded within the claim.

Leveraging model signing and attestation, we can begin to build more robust supply chain trust in ML development, including chaining claims, such as those involved with hardware attestation, to build fully signed ML development ecosystems. Additionally, we propose leveraging signatures to embed machine-readable model card information that can be built into the claim. This foundational approach will be a critical step toward achieving provable, tamper-proof ML metadata records and pave the way toward verifiable ML model provenance.

At a very high level, ML development proceeds as in the following diagram:

[Diagram 1: high-level view of the ML development pipeline]

We see supply-chain risks in every component of the diagram, and we can protect against them by adding cryptographic signatures both to protect the integrity of the models and datasets and to record ML metadata and provenance information in a tamper-proof way. Model signing, in this approach, enables us to efficiently sign large numbers of arbitrary files, leveraging a single manifest and associated signature; a similar approach can also be taken for datasets. The approach is PKI-agnostic, with the signer choosing among the four supported PKI types (public or private Sigstore, bare key, self-signed certificate, or bring-your-own PKI). Our examples leverage Sigstore to sign models, as shown in the following diagram:

[Diagram 2: signing a model with Sigstore]
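To make the single-manifest idea concrete, here is a minimal sketch of manifest-based signing. It is illustrative only, not the model-signing library's actual implementation: the directory path is a placeholder, and a locally generated Ed25519 key stands in for whichever of the supported PKI options the signer chooses.

```python
# Minimal sketch of manifest-based signing (illustrative only; not the
# model-signing library's actual implementation): hash every file in a
# model directory, then sign the serialized manifest once.
import hashlib
import json
from pathlib import Path

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def build_manifest(model_dir: str) -> dict:
    """Map each file path under model_dir to its SHA-256 digest."""
    root = Path(model_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


key = Ed25519PrivateKey.generate()  # stand-in for any of the supported PKIs
manifest = build_manifest("./my-model")  # "./my-model" is a placeholder path
signature = key.sign(json.dumps(manifest, sort_keys=True).encode())
```

However many files the model comprises, only one signature is produced, which is what lets this scale to large models and datasets.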

Once we have the trust layer established, we can add supply-chain metadata to the model signature. For example, we can add SLSA predicates to the model signature and, in this way, record information about both the inputs and the outputs of every training process. This enables answering questions such as “What datasets has this model been trained on?” or “What hardware was used to train this model?”
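As a hedged illustration of what such a claim could look like, the following builds an in-toto Statement carrying a SLSA v1 provenance predicate. All digests, the buildType URI, and the builder ID are placeholders of our own invention, not values mandated by either specification.

```python
# Sketch of a training-provenance claim as an in-toto Statement with a
# SLSA v1 provenance predicate. Digests, URIs, and IDs are placeholders.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "my-model", "digest": {"sha256": "<manifest-digest>"}},
    ],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {
        "buildDefinition": {
            "buildType": "https://example.com/ml-training/v1",  # hypothetical
            "resolvedDependencies": [
                # answers: "What datasets has this model been trained on?"
                {
                    "name": "training-dataset",
                    "digest": {"sha256": "<dataset-digest>"},
                },
            ],
            # answers: "What hardware was used to train this model?"
            "internalParameters": {"hardware": "8x TPU"},
        },
        "runDetails": {
            "builder": {"id": "https://example.com/trainer"},  # hypothetical
        },
    },
}
```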

By coupling this information with GUAC, we can analyze the entire ML supply chain from a security perspective or leverage this data for incident response (e.g., when a model is discovered to have been improperly trained, we can identify all models that have been fine-tuned from it). For example, if a dataset is discovered to have been poisoned, we can create policies that would flag its inclusion in a training process and automation that raises an alert before the training even begins.
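A minimal sketch of such a policy gate follows, under the assumption that provenance claims use the statement shape from the previous sketch; in practice a tool like GUAC would drive this query, and the denylist digest is a placeholder.

```python
# Sketch of a pre-training policy gate: refuse to start training if any
# resolved dependency in a provenance claim is a known-poisoned dataset.
# The denylist contents and the claim shape are illustrative assumptions.
KNOWN_POISONED_DIGESTS = {"<digest-of-poisoned-dataset>"}


def gate_training(statement: dict) -> None:
    """Raise an alert before training begins if a poisoned dataset is an input."""
    deps = statement["predicate"]["buildDefinition"]["resolvedDependencies"]
    for dep in deps:
        if dep["digest"]["sha256"] in KNOWN_POISONED_DIGESTS:
            raise RuntimeError(
                f"{dep['name']} is on the poisoned-dataset denylist; aborting"
            )


gate_training(statement)  # `statement` from the previous sketch
```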

Priority

  • P0: This is critical to include in the next release from this workstream.

Level of Effort

  • Medium: This will take a week or two to document.

Drawbacks

Adoption of model signing will require a cross-industry effort. However, CoSAI is well-positioned to help drive its adoption and ensure that it is designed to suit as many use cases as possible. This will enable the community to build towards provable provenance of ML assets and chain claims together from various sources.

Alternatives

The two alternatives are:

  • Signing all files as a singular data blob. This approach has issues with very large models and does not scale to datasets at all. It also does not account for a changing subset of component files, especially when a consumer needs only a subset of the files in the directory (e.g., a single file type); see the sketch after this list.

  • Signing each model file separately, so that each file gets its own hash; however, this loses the context that multiple model artifacts together are required for at least ‘one inference pass’ of the model.
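For contrast with the first alternative, here is a short sketch of how the manifest approach supports verifying just the subset of files a consumer needs. It reuses the names from the earlier signing sketch, and the file name is a hypothetical example.

```python
# Continuing the earlier signing sketch: verify the single manifest
# signature, then re-hash only the files the consumer actually needs.
def verify_subset(model_dir, wanted_files, manifest, signature, public_key):
    payload = json.dumps(manifest, sort_keys=True).encode()
    public_key.verify(signature, payload)  # raises InvalidSignature if tampered
    for name in wanted_files:
        digest = hashlib.sha256((Path(model_dir) / name).read_bytes()).hexdigest()
        if digest != manifest[name]:
            raise ValueError(f"{name} does not match its signed digest")


# e.g. check only the tokenizer without re-hashing multi-GB weight files
verify_subset("./my-model", ["tokenizer.json"], manifest, signature, key.public_key())
```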

Reference Material & Prior Art

Model Signing Code repository: https://github.com/sigstore/model-transparency/

Talk at SOSS Fusion: https://www.youtube.com/watch?v=DqJz4qvYrTg

Google whitepaper on securing the AI software supply chain: https://research.google/pubs/securing-the-ai-software-supply-chain/

Unresolved questions

What information does your organization require to be present in a provenance claim?

Does this specification meet your needs?

Will your organization help support the development and adoption of this as our basis for signing model artifacts?

@mihaimaruseac

We also have v0.1 of the library released to quickly iterate on it while we get users: https://pypi.org/project/model-signing/

@mihaimaruseac

As another resource, here's a talk showing how GUAC could be used to ingest complete SLSA provenances and then get a view over the entire ML supply chain.

https://www.youtube.com/watch?v=uqU3fnmK0BA

All of this vision builds on top of model signatures, so this is just the first step.

@ollol88 commented Dec 2, 2024

> We also have v0.1 of the library released to quickly iterate on it while we get users: https://pypi.org/project/model-signing/

Hello there! We are looking into a couple of internal use cases where we'd try to test this. We'll keep you up to date and reach out once we have the scope well defined.

@mihaimaruseac

That sounds awesome! Our plan is to keep releasing patch versions of the library with every bug fix, to help resolve any remaining issues before a v1 release.

@Akilsrin commented Jan 2, 2025

This RFC shows strong potential as a cornerstone whitepaper for the first WS1 milestone, particularly when combined with the threat modeling work. It aligns well with the goals for software supply chain security in AI systems, focusing on composition, provenance, and practical applications (milestone link).

Regarding signing training data, I feel the recommended path would need some refinement. SLSA principles, while excellent for software artifacts, don't map effectively to training data's unique challenges: distribution shifts, quality variations, privacy considerations, and bias implications. The dynamic nature of ML datasets, combined with complex preprocessing pipelines and streaming data sources, makes SLSA-style attestation harder to implement. I would recommend pursuing this after the other low-hanging fruit has been attested.

Instead, I believe we should start with model signing at three critical points:

  • when a foundational/base model is completed and ready for distribution
  • after significant fine-tuning operations that materially change the model
  • at the pre-deployment stage, before the model moves to production and is used by customers
