CODEC-335: Add DigestUtils.gitBlob and DigestUtils.gitTree methods #427
This change adds two methods to `DigestUtils` that compute generalized Git object identifiers using an arbitrary `MessageDigest`, rather than being restricted to SHA-1:

- `gitBlob(digest, input)`: computes a generalized [Git blob object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given file or byte content.
- `gitTree(digest, file)`: computes a generalized [Git tree object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given directory.

### Motivation

The standard Git object identifiers use SHA-1, which is [in the process of being replaced by SHA-256](https://git-scm.com/docs/hash-function-transition) in Git itself. These methods generalize the identifier computation to support any `MessageDigest`, enabling both forward compatibility and use with external standards.

In particular, the `swh:1:cnt:` (content) and `swh:1:dir:` (directory) identifier types defined by [SWHID (ISO/IEC 18670)](https://www.swhid.org/specification/v1.2/5.Core_identifiers/) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.
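As a rough illustration of what "generalized" means here, the sketch below hashes the Git blob framing (`blob <length>\0` followed by the content) with whatever `MessageDigest` is passed in. The class `GitBlobSketch` and its method names are hypothetical and exist only for this example; this is not the code from the PR, and the actual `DigestUtils.gitBlob` signature may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch only (hypothetical class); not the implementation in this PR. */
final class GitBlobSketch {

    /** Hashes the header "blob <length>\0" followed by the content, as Git does for blobs. */
    static byte[] gitBlobId(MessageDigest digest, byte[] content) {
        digest.reset();
        digest.update(("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(content);
    }

    /** Lower-case hex encoding of a digest value. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Convenience wrapper: looks up the algorithm by name and returns the id as hex. */
    static String gitBlobHex(String algorithm, byte[] content) {
        try {
            return toHex(gitBlobId(MessageDigest.getInstance(algorithm), content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "hello\n".getBytes(StandardCharsets.UTF_8);
        // With SHA-1 this matches `git hash-object` on the same bytes:
        // ce013625030ba8dba906f756967f9e9ca394464a
        System.out.println(gitBlobHex("SHA-1", content));
        // Same framing, different digest.
        System.out.println(gitBlobHex("SHA-256", content));
    }
}
```

Only the digest changes between the two calls in `main`; the object framing stays the same, which is why the SHA-1 result coincides with the ordinary Git blob id.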
Hi @ppkarwasz
Should all this git related code be in a new GitDigest class instead?
Curious: isn't all this in jgit?
You'll need to run 'mvn' by itself and fix build issues before you push.
JGit does provide the building blocks via
For reference, here is the equivalent JGit code for a two-file tree:

    final byte[] aBytes = ...; // a.txt
    final byte[] bBytes = ...; // nested/b.txt
    try (ObjectInserter inserter = new ObjectInserter.Formatter()) {
        final ObjectId aBlob = inserter.idFor(OBJ_BLOB, aBytes);
        final ObjectId bBlob = inserter.idFor(OBJ_BLOB, bBytes);
        final TreeFormatter nestedTreeFormatter = new TreeFormatter();
        nestedTreeFormatter.append("b.txt", FileMode.REGULAR_FILE, bBlob);
        final ObjectId nestedTree = inserter.idFor(nestedTreeFormatter);
        final TreeFormatter rootTreeFormatter = new TreeFormatter();
        rootTreeFormatter.append("a.txt", FileMode.REGULAR_FILE, aBlob);
        rootTreeFormatter.append("nested", FileMode.TREE, nestedTree);
        return inserter.idFor(rootTreeFormatter).name();
    }
What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example Commons Compress could use it to compute a SWHID without extracting an archive.

    public final class GitId {
        public enum FileMode {
            /** Regular, non-executable file ({@code 100644}). */
            REGULAR_FILE("100644"),
            /** Executable file ({@code 100755}). */
            EXECUTABLE_FILE("100755"),
            /** Symbolic link ({@code 120000}). */
            SYMBOLIC_LINK("120000"),
            /** Directory / subtree ({@code 40000}). */
            DIRECTORY("40000");
        }
        public static byte[] blobId(MessageDigest digest, byte[] content);
        public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;
        public static byte[] blobId(MessageDigest digest, Path path) throws IOException;
        public static TreeBuilder treeBuilder(MessageDigest digest);
        public static final class TreeBuilder {
            public TreeBuilder addFile(String name, FileMode mode, byte[] content);
            public TreeBuilder addFile(String name, FileMode mode, InputStream input) throws IOException;
            public TreeBuilder addFile(String name, FileMode mode, Path path) throws IOException;
            public TreeBuilder addDirectory(String name, TreeBuilder subtree);
            public byte[] build();
        }
    }
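To make the shape of such a builder more concrete, here is a minimal sketch of what the tree computation has to do: collect entries, sort them the way Git does (bytewise by name, with directories compared as if their name ended in `/`), serialize each as `<mode> <name>\0<raw id>`, and hash the result behind a `tree <length>\0` header. `GitTreeSketch` is a hypothetical stand-in, not the proposed `GitId.TreeBuilder` and not the code in this PR. It assumes ASCII entry names and handles only regular files (`100644`) and directories (`40000`).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

/** Illustrative sketch only (hypothetical class); assumes ASCII entry names. */
final class GitTreeSketch {

    /** One tree entry: name, octal mode string, and the raw (binary) object id. */
    private static final class Entry {
        final String name;
        final String mode;
        final byte[] id;

        Entry(String name, String mode, byte[] id) {
            this.name = name;
            this.mode = mode;
            this.id = id;
        }
    }

    private final MessageDigest digest;
    // Sort key is the name, with "/" appended for directories, mirroring Git's ordering.
    private final TreeMap<String, Entry> entries = new TreeMap<>();

    GitTreeSketch(MessageDigest digest) {
        this.digest = digest;
    }

    /** Convenience factory so callers need not handle NoSuchAlgorithmException. */
    static GitTreeSketch forAlgorithm(String algorithm) {
        try {
            return new GitTreeSketch(MessageDigest.getInstance(algorithm));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    GitTreeSketch addFile(String name, byte[] content) {
        entries.put(name, new Entry(name, "100644", hash("blob", content)));
        return this;
    }

    GitTreeSketch addDirectory(String name, GitTreeSketch subtree) {
        entries.put(name + "/", new Entry(name, "40000", subtree.build()));
        return this;
    }

    /** Serializes the sorted entries and hashes them as a Git tree object. */
    byte[] build() {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (Entry e : entries.values()) {
            byte[] header = (e.mode + " " + e.name + "\0").getBytes(StandardCharsets.UTF_8);
            body.write(header, 0, header.length);
            body.write(e.id, 0, e.id.length);
        }
        return hash("tree", body.toByteArray());
    }

    /** Hashes "<type> <length>\0" followed by the payload. */
    private byte[] hash(String type, byte[] payload) {
        digest.reset();
        digest.update((type + " " + payload.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(payload);
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] a = "a\n".getBytes(StandardCharsets.UTF_8);
        byte[] b = "b\n".getBytes(StandardCharsets.UTF_8);
        GitTreeSketch nested = forAlgorithm("SHA-1").addFile("b.txt", b);
        GitTreeSketch root = forAlgorithm("SHA-1").addFile("a.txt", a).addDirectory("nested", nested);
        System.out.println(toHex(root.build()));
    }
}
```

Because entries are sorted before serialization, the order of the `addFile` and `addDirectory` calls does not affect the resulting id, which is one property a reusable builder in this style would need to guarantee.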
Hi @ppkarwasz

I'm not sure which Commons component the above should belong to. I think you mean it to belong in Codec, but I can't tell what's supposed to be an interface vs. an implementation. Would this PR be reimplemented in terms of the above? Or would this PR provide the implementation for the above?

The name TreeBuilder is confusing to me without Javadoc. It's not building a tree, it's building a byte array. Do you mean it processes a directory tree? I can't tell.

In the PR description, you write:
Since Git has been migrating to SHA-256, does this still matter? You only mention SHA-1 in the above.

From an API design perspective, the API inflation is already present with byte[], InputStream, and Path, and hints that File, Channel, Buffer, and URI should also be available, which is the problem Commons IO's builder package attempts to solve.

Aside from that, the current PR seems focused on narrow functionality without introducing framework code, so it fits in nicely. Let me review it again in the morning.