Skip to content

CODEC-335: Add DigestUtils.gitBlob and DigestUtils.gitTree methods#427

Open
ppkarwasz wants to merge 2 commits intomasterfrom
feat/git-blob
Open

CODEC-335: Add DigestUtils.gitBlob and DigestUtils.gitTree methods#427
ppkarwasz wants to merge 2 commits intomasterfrom
feat/git-blob

Conversation

@ppkarwasz
Copy link

This change adds two methods to DigestUtils that compute generalized Git object identifiers using an arbitrary MessageDigest, rather than being restricted to SHA-1:

Motivation

The standard Git object identifiers use SHA-1, which is in the process of being replaced by SHA-256 in Git itself. These methods generalize the identifier computation to support any MessageDigest, enabling both forward compatibility and use with external standards.

In particular, the swh:1:cnt: (content) and swh:1:dir: (directory) identifier types defined by SWHID (ISO/IEC 18670) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.

Before you push a pull request, review this list:

  • Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request. Which AI tool was used to create this pull request, and to what extent did it contribute? Claude Code was used for tests and to review the main code
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

This change adds two methods to `DigestUtils` that compute generalized Git object
identifiers using an arbitrary `MessageDigest`, rather than being restricted to SHA-1:

- `gitBlob(digest, input)`: computes a generalized
  [Git blob object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given file or byte content.
- `gitTree(digest, file)`: computes a generalized
  [Git tree object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given directory.

### Motivation

The standard Git object identifiers use SHA-1, which is
[in the process of being replaced by SHA-256](https://git-scm.com/docs/hash-function-transition) in Git itself.
These methods generalize the identifier computation to support any `MessageDigest`,
enabling both forward compatibility and use with external standards.

In particular, the `swh:1:cnt:` (content) and `swh:1:dir:` (directory) identifier
types defined by [SWHID (ISO/IEC 18670)](https://www.swhid.org/specification/v1.2/5.Core_identifiers/) are currently compatible with
Git blob and tree identifiers respectively (using SHA-1), and can be used to generate
canonical, persistent identifiers for unpacked source and binary distributions.
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ppkarwasz

Should all this git related code be in a new GitDigest class instead?

Curious: isn't all this in jgit?

You'll need to run 'mvn' by itself and fix build issues before you push.

@ppkarwasz
Copy link
Author

Should all this git related code be in a new GitDigest class instead?

gitBlob fits naturally in DigestUtils: it follows the same pattern as the existing digest methods (wrap content with a header, hash it).

gitTree is more specialised: it only accepts a directory and recursively computes hashes for the entire tree, which is arguably outside DigestUtils's scope. I'm open to moving both to a new GitDigest class.

Curious: isn't all this in jgit?

JGit does provide the building blocks via ObjectInserter, but it has two significant limitations that make it unsuitable here:

  1. ObjectInserter is hardcoded to SHA-1.
  2. The caller is responsible for walking the directory, sorting entries, and building each sub-tree formatter correctly. The proposed gitTree method handles all of that automatically.

For reference, here is the equivalent JGit code for a two-file tree:

final byte[] aBytes = ...; // a.txt
final byte[] bBytes = ...; // nested/b.txt
try (ObjectInserter inserter = new ObjectInserter.Formatter()) {
    final ObjectId aBlob = inserter.idFor(OBJ_BLOB, aBytes);
    final ObjectId bBlob = inserter.idFor(OBJ_BLOB, bBytes);
    final TreeFormatter nestedTreeFormatter = new TreeFormatter();
    nestedTreeFormatter.append("b.txt", FileMode.REGULAR_FILE, bBlob);
    final ObjectId nestedTree = inserter.idFor(nestedTreeFormatter);
    final TreeFormatter rootTreeFormatter = new TreeFormatter();
    rootTreeFormatter.append("a.txt", FileMode.REGULAR_FILE, aBlob);
    rootTreeFormatter.append("nested", FileMode.TREE, nestedTree);
    return inserter.idFor(rootTreeFormatter).name();
}

@ppkarwasz
Copy link
Author

@garydgregory,

What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example Commons Compress could use it to compute a SWHID without extracting an archive.

public final class GitId {

    public enum FileMode {

        /** Regular, non-executable file ({@code 100644}). */
        REGULAR_FILE("100644"),

        /** Executable file ({@code 100755}). */
        EXECUTABLE_FILE("100755"),

        /** Symbolic link ({@code 120000}). */
        SYMBOLIC_LINK("120000"),

        /** Directory / subtree ({@code 40000}). */
        DIRECTORY("40000");

    }

    public static byte[] blobId(MessageDigest digest, byte[] content);

    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;

    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;

    public static TreeBuilder treeBuilder(MessageDigest digest);


    public static final class TreeBuilder {

        public TreeBuilder addFile(String name, FileMode mode, byte[] content);

        public TreeBuilder addFile(String name, FileMode mode, InputStream input) throws IOException;

        public TreeBuilder addFile(String name, FileMode mode, Path path) throws IOException;

        public TreeBuilder addDirectory(String name, TreeBuilder subtree);

        public byte[] build();
    }
}

@garydgregory
Copy link
Member

@garydgregory,

What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example Commons Compress could use it to compute a SWHID without extracting an archive.

public final class GitId {

    public enum FileMode {

        /** Regular, non-executable file ({@code 100644}). */
        REGULAR_FILE("100644"),

        /** Executable file ({@code 100755}). */
        EXECUTABLE_FILE("100755"),

        /** Symbolic link ({@code 120000}). */
        SYMBOLIC_LINK("120000"),

        /** Directory / subtree ({@code 40000}). */
        DIRECTORY("40000");

    }

    public static byte[] blobId(MessageDigest digest, byte[] content);

    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;

    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;

    public static TreeBuilder treeBuilder(MessageDigest digest);


    public static final class TreeBuilder {

        public TreeBuilder addFile(String name, FileMode mode, byte[] content);

        public TreeBuilder addFile(String name, FileMode mode, InputStream input) throws IOException;

        public TreeBuilder addFile(String name, FileMode mode, Path path) throws IOException;

        public TreeBuilder addDirectory(String name, TreeBuilder subtree);

        public byte[] build();
    }
}

Hi @ppkarwasz

I'm not sure what Commons component the above should belong. I think you mean it to belong in Codec but I can't tell what's supposed to be an interface vs. implementation. Would this PR be reimplemented in terms of the above? Or would this PR provide the implementation for the above?

The name TreeBuilder is confusing to me without Javadoc. It's not building a tree, it's building a byte array. Do you mean it processes a directory tree? I can't tell.

In the PR description, you write:

In particular, the swh:1:cnt: (content) and swh:1:dir: (directory) identifier types defined by SWHID (ISO/IEC 18670) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.

Since Git has been migrating to SHA-256, does this still matter? You only mention SHA-1 in the above.

From API design, the API inflation is already present with byte[], InputStream, Path, and hints that File, Channel, Buffer, and URI should also be available, which is the problem Commons IOs builder package attempts to solve.

Aside from that, the current PR seems focused on narrow functionality without introducing framework code, so it fits in nicely. Let me review it again in the morning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants