NOTE: The following explanations are based on the latest version of the OmniBOR specification. If any conflicts arise between the explanations given here and the OmniBOR specification, the specification supersedes these explanations.
Artifact IDs are how OmniBOR solves the challenge of reproducibly identifying software artifacts. What we mean by "reproducible" is that anyone with access to the artifact can produce an identifier for it, and that identifier will always be the same.
OmniBOR Artifact IDs achieve this by choosing Git Object Identifiers (GitOIDs) as their identification scheme. GitOIDs are how Git identifies objects it tracks, and they're formed using a choice of a hash function, and then producing hashes using what we'll call the "GitOID construction".
Today, Git supports three hash functions:
SHA-1 is the hash function most people are familiar with in Git, and for a long time it was the only hash function Git supported. However, since the discovery of the SHAttered attack against SHA-1, and Git project did two things:
SHA-1CD is almost equivalent to SHA-1. In essence, SHA-1CD attempts to detect attempts to engineer collisions (hence the "CD," for "collision detection") and modifies the output of the hash in those cases to break the collision. For Git, this kind of modification is fine, and so Git in recent versions usually uses SHA-1CD under the hood by default, though Git documentation still just calls in SHA-1. In the context of a single Git repository, the distinction doesn't generally matter.
However, for the purposes of achieving a universally reproducible identifier, we do need to care about the distinction between SHA-1 and SHA-1CD, which is why we list them separately here.
The SHA-256 transition in Git has been moving slowly, with successive versions periodically adding more support to smooth the transition. Nonetheless, progress is generally recognized as slow.
OmniBOR Artifact IDs only support SHA-256 today.
This is important. While SHA-1 (really, SHA-1CD) is in wide use in existing Git data today, we expect that in the long run, it will be phased out. The Git project itself continues its slow movement along the SHA-256 transition, and (perhaps more importantly) we anticipate there will likely be government standards in the future which mandate a move away from SHA-1, in a similar fashion to prior widespread mandates to move away from MD5.
While we could in theory support multiple identifiers at the same time, even just supporting two would double the complexity of OmniBOR operationally for producers and consumers. Worse, given the SHA-1 / SHA-1CD split, we'd likely need to support all three if we're going to support SHA-1.
The specification does reserve the right to add support for alternative hashes in the future if, for example, SHA-256 is later found to be broken in a manner similar to how SHA-1 can be broken today.
GitOIDs are constructed not just by hashing the data of the object itself. Instead, a small "prefix string" is hashed in first, with the following structure:
<git object type>⎵<length of the object in bytes>\0
Here ⎵
refers to the ASCII space character (0x20
), replaced with a visual
character for clarity.
This prefix string has two purposes in Git. First, the object type (which can
be blob
, tree
, commit
, or tag
) indicates the type of the data being
stored based on Git's object model. This helps differentiate hashes for the
different types of objects. Second, the length being hashed in helps provide
additional protection against collisions. With this length included, an attacker
trying to engineer a collision in Git's object storage would need to account
for how the length of the colliding data impacts the hash as well. The
SHAttered attack specifically relies on extensions of the original data in
highly flexible formats like PDFs, which this is an effective protection
against.
For Artifact IDs, the object type is always blob
, so for our purposes
the prefix string is
blob⎵<length of the object in bytes>\0
Thus, the "GitOID construction" for Artifact IDs is to:
The textual representation of an Artifact ID looks like this:
gitoid:blob:sha256:9f64df92367881be21e23567a31a8ce01994d98b69d28917b5c132ce32a8e6c8
This is a URI using the gitoid
scheme, registered with IANA. In this
scheme, the representation includes four parts, each separate by a colon (:
).
The first is the string gitoid
, indicating the URI scheme. The second is
the object type, which for Artifact IDs is always blob
. Then it's the hash
algorithm, which for Artifact IDs is sha256
. Finally, it's a lowercase
hexadecimal representation of the SHA-256 hash of the object made with the
GitOID construction.
Artifact IDs are used for uniquely and reproducibly identifying software artifacts. Because the construction of an Artifact ID relies only on the contents of an artifact itself, anyone who has access to the artifact can derive its Artifact ID, and the Artifact ID they derive will be exactly equal to one derived by anyone else with access to the same artifact.
This means that, as an identifier scheme, Artifact IDs can scale without limits! Other identification systems, like Common Platform Enumerations (CPE) or Package URLs (pURLs), rely on some form of centralization. CPEs are identifiers which rely on a centralized dictionary, maintained by the United States' National Institute of Standards and Technology (NIST). Package URLs rely on a central list of known package hosts. In either case, while these identifier schemes are very useful (and we view OmniBOR's Artifact IDs as complementary to these other identifier schemes), they lack the property of independent reproducibility that makes Artifact IDs so powerful!
Of course, Artifact IDs by themselves are only one part of the equation. To understand more, learn about Input Manifests next!