Workstream 1 RFC: Signing ML Artifacts: Building towards tamper-proof ML metadata records #4
Comments
We also have v0.1 of the library released so we can iterate on it quickly while we gather users: https://pypi.org/project/model-signing/
As another resource, here's a talk showing how GUAC could be used to ingest complete SLSA provenances and then get a view over the entire ML supply chain: https://www.youtube.com/watch?v=uqU3fnmK0BA All of this vision builds on top of model signatures, so this is just the first step.
Hello there! We are looking into a couple of internal use cases where we'd try to test this. We'll keep you up to date and reach out once we have the scope well defined.
That sounds awesome! Our plan is to keep releasing patch versions of the library with every bug fix, to help iron out any remaining issues before a v1 release.
This RFC shows strong potential as a cornerstone whitepaper for the first milestone of WS1, particularly when combined with the threat modeling work. It aligns well with the goals for software supply chain security in AI systems, focusing on composition, provenance, and practical applications (milestone link).

Regarding signing training data, I feel the recommended path would need some refinement. SLSA principles, while excellent for software artifacts, don't map effectively to training data's unique challenges of distribution shifts, quality variations, privacy considerations, and bias implications. The dynamic nature of ML datasets, combined with complex preprocessing pipelines and streaming data sources, makes SLSA-style attestation harder to implement. I would recommend pursuing this after the other low-hanging fruit has been attested. Instead, I believe we should start with model signing on three critical points:
Signing ML Artifacts: Building towards tamper-proof ML metadata records
Authors:
Summary
Cryptographic signing is widely used throughout our digital ecosystem, providing a reliable mechanism to ensure integrity and verify the provenance of network communications, executable binaries, and more. However, for machine learning models and associated artifacts, no standard method currently exists for cryptographically verifying model origins. This gap leaves model artifacts without a means to prove their source or to guarantee that they have not been tampered with.
Building on the work of the OpenSSF Model Signing Project, we propose adopting a PKI-agnostic method for creating claims on bundles of ML artifacts, and we solicit feedback on the design of the library and signing specification, most notably with an eye towards model-specific information that can be embedded within the claim.
Leveraging model signing and attestation, we can begin to build more robust supply chain trust in ML development, including chaining claims, such as those involved with hardware attestation, to build fully signed ML development ecosystems. Additionally, we propose leveraging signatures to embed machine-readable model card information that can be built into the claim. This foundational approach will be a critical step toward achieving provable, tamper-proof ML metadata records and pave the way toward verifiable ML model provenance.
From a very high level, ML is developed as in the following diagram:
We see supply-chain risks in every component of the diagram, and we can protect against them by adding cryptographic signatures, both to protect the integrity of models and datasets and to record ML metadata and provenance information in a tamper-proof way. Model signing, in this approach, enables us to efficiently sign large numbers of arbitrary files, leveraging a single manifest and an associated signature; a similar approach can also be taken for datasets. The approach is PKI-agnostic, with the signer choosing among the four supported PKI options (public or private Sigstore, bare key, self-signed certificate, or bring-your-own PKI). Our examples leverage Sigstore to sign models, as shown in the following diagram:
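To make the manifest idea concrete, here is a minimal, illustrative sketch; it is not the model_signing library's API or manifest format. Every file in a model directory is hashed, the hashes are collected into one manifest, and only the manifest is signed, with a bare Ed25519 key standing in for whichever PKI the signer chooses. The directory name and key are placeholders.

```python
# Conceptual sketch of manifest-based signing: hash every file in a model
# directory, record the hashes in a single manifest, and sign only the
# manifest. Illustrative only; the model_signing library defines its own
# manifest format and supports several PKI options.
import hashlib
import json
from pathlib import Path

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def build_manifest(model_dir: Path) -> bytes:
    """Hash each file under model_dir and serialize the results as one manifest."""
    entries = {
        str(path.relative_to(model_dir)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(model_dir.rglob("*"))
        if path.is_file()
    }
    return json.dumps(entries, sort_keys=True).encode()


# Sign the manifest once, instead of signing each file or one huge blob.
manifest = build_manifest(Path("my-model"))     # hypothetical model directory
private_key = Ed25519PrivateKey.generate()      # stands in for any supported PKI
signature = private_key.sign(manifest)

# A verifier recomputes the manifest and checks the single signature;
# verify() raises cryptography.exceptions.InvalidSignature on any mismatch.
private_key.public_key().verify(signature, manifest)
```

Because only the manifest is signed, a verifier can check an individual file against its recorded hash without re-reading the whole bundle, which is what makes the scheme scale to large models.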
Once we have the trust layer established, we can add supply-chain metadata to the model signature. For example, we can add SLSA predicates to the model signature and, in this way, record information about both the inputs and the outputs of every training process. This enables answering questions such as “What datasets has this model been trained on?” or “What hardware was used to train this model?”
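As a hedged illustration of what such supply-chain metadata could look like, the sketch below builds an in-toto statement carrying a SLSA v1 provenance predicate: the subject is the signed model bundle, and the resolved dependencies record the training dataset. All names, example.com URIs, digests, and parameter values are placeholders; the exact predicate schema to embed is one of the design questions this RFC raises.

```python
# Illustrative in-toto statement with a SLSA v1 provenance predicate.
# The subject is the model bundle; resolvedDependencies records the training
# inputs, which is what lets a verifier answer questions like "what datasets
# was this model trained on?". All values below are placeholders.
statement = {
    "_type": "https://in-toto.io/Statement/v1",
    "subject": [
        {"name": "my-model", "digest": {"sha256": "<manifest digest>"}},
    ],
    "predicateType": "https://slsa.dev/provenance/v1",
    "predicate": {
        "buildDefinition": {
            "buildType": "https://example.com/ml-training/v1",  # hypothetical
            "externalParameters": {"hardware": "8x TPU v5e"},    # illustrative
            "resolvedDependencies": [
                {
                    "uri": "https://example.com/datasets/training-set",
                    "digest": {"sha256": "<dataset digest>"},
                },
            ],
        },
        "runDetails": {"builder": {"id": "https://example.com/training-pipeline"}},
    },
}
```

A signature over a statement like this binds the training metadata to the model itself, so downstream tooling can index the dependencies and trace them across the supply chain.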
By coupling this information with GUAC, we can analyze the entire ML supply chain from a security perspective or leverage this data for incident response (e.g., when a model is discovered to have been improperly trained, we can identify all models that have been fine-tuned from it). For example, if a dataset is discovered to have been poisoned, we can create policies that flag its inclusion in a training process and create automation that raises an alert before the training even begins.
Priority
Level of Effort
Drawbacks
Adoption of model signing will require a cross-industry effort. However, CoSAI is well-positioned to help drive its adoption and ensure that it is designed to suit as many use cases as possible. This will enable the community to build towards provable provenance of ML assets and chain claims together from various sources.
Alternatives
The two alternatives are:
Signing all files as a single data blob. This approach has issues with very large models and does not scale to datasets at all. It also does not account for a changing subset of component files, especially if a consumer only needs a subset of the files (for example, a single file type) in the directory.
Signing each model binary separately, where each file gets its own hash and signature. However, this loses the context that multiple model artifacts are required together for even a single inference pass of the model.
Reference Material & Prior Art
Model Signing Code repository: https://github.com/sigstore/model-transparency/
Talk at SOSS Fusion: https://www.youtube.com/watch?v=DqJz4qvYrTg
Google whitepaper on securing the AI software supply chain: https://research.google/pubs/securing-the-ai-software-supply-chain/
Unresolved questions
What information does your organization require to be present in a provenance claim?
Does this specification adhere to your needs?
Will your organization help support the development and adoption of this as our basis for signing model artifacts?