Bazel hashing of external directory output

Background

We allow repository rules to follow freely floating targets, provided they return a modified set of arguments that yields a reproducible version of themselves (design, blog post). As writing rules that correctly provide those arguments is a hard and potentially error-prone task, bazel will support this by computing the hash of the directory generated by an external repository and by supporting verification of that hash. This is similar to the way sandboxing supports writing rules by detecting missing dependencies in actions.

Data included in the repository hash

It is important to understand that this hash is not a security feature. Certain data of the directory structure, in particular the owner of the files, are deliberately not included. And while a reasonable build will not depend on the ownership of its source files, it is technically possible for the behavior of the resulting binary to depend on the owner of a source file, even in a malicious way.

Owners and timestamps ignored

We expect users to build as ordinary users (i.e., not as the privileged user). Therefore, files are always created with the current user as owner, so the ownership of files in external repositories differs from developer to developer. This is fine as long as the external repository is built in a reasonable way.

For the same reason, we do not include the time stamp in the hash of the directory. Files generated by ctx.file, as well as files checked out by git, will have the current time as time stamp, which is not reproducible, but a sensible build process will not depend on the time stamps.

Executable bit stored

We do include the information whether the file is executable for the owner. Wrong permissions can be the cause of annoying build failures, and most ways of providing an external repository either track this information explicitly or, at least, set it to a reproducible value (not executable).
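The choice of included data can be illustrated with a minimal Python sketch. It is not bazel's implementation: the function name `file_entry_digest`, the use of SHA-256, and the one-byte encoding of the executable bit are assumptions made for illustration only.

```python
import hashlib
import os
import stat


def file_entry_digest(path):
    """Digest a regular file from its content and owner-executable bit only."""
    st = os.stat(path)
    h = hashlib.sha256()
    # Owner, group, and time stamps are available in `st`, but are
    # deliberately never fed into the hash.
    h.update(b"x" if st.st_mode & stat.S_IXUSR else b"-")
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Under these assumptions, changing a file's time stamp leaves the digest unchanged, while toggling the owner's executable bit changes it.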

Symlinks partially expanded

Symlinks provide two kinds of information. Primarily, they are just a string that can be read via readlink(2). In addition, if a symlink is accessed as a file or directory, e.g., via fopen(3), the string is interpreted as a filename and the respective file or directory, if it exists, is opened instead.

Examples of symlinks in external repositories

http_archive and other bazel-provided repository rules symlink the BUILD file into the generated repository. The symlink is absolute, so, in particular, it depends on the location of the workspace on the local machine. Hence we cannot just readlink(2) all links and expect a reproducible hash.
Some external projects come with cyclic symlinks. E.g., the alsa library (alsa-lib-1.1.2.tar.bz2 with sha256 d38dacd9892b06b8bff04923c380b38fb2e379ee5538935ff37e45b395d861d6) has, in the include subdirectory, a symlink alsa pointing to `..`. So replacing all symlinks by (the hash of) what they point to does not work, as symlink cycles exist in the real world.

Proposal on the hash

For absolute paths pointing to files, the file being pointed to is hashed, including the information whether it is executable.
For all other symlinks (relative paths, absolute links to directories, dangling symlinks), the link itself is hashed.
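The two cases of the proposal can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `symlink_digest`, the `file:` and `link:` prefixes, and the use of SHA-256 are hypothetical and not taken from bazel.

```python
import hashlib
import os
import stat


def symlink_digest(path):
    """Digest a symlink according to the two cases of the proposal."""
    target = os.readlink(path)
    h = hashlib.sha256()
    if os.path.isabs(target) and os.path.isfile(target):
        # Absolute link to a file: hash the file being pointed to, so the
        # machine-specific link text does not enter the result.
        mode = os.stat(target).st_mode
        h.update(b"file:")
        h.update(b"x" if mode & stat.S_IXUSR else b"-")
        with open(target, "rb") as f:
            h.update(f.read())
    else:
        # Relative link, absolute link to a directory, or dangling link:
        # hash the link text itself and never follow it.
        h.update(b"link:")
        h.update(os.fsencode(target))
    return h.hexdigest()
```

With this rule, a BUILD file linked from two different workspace locations yields the same digest, and a cyclic link such as `alsa` pointing to `..` is hashed as the string `..` without being traversed.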

Unrelated files

For external repository rules like git_repository, additional directories are created besides the actual source code, e.g., the .git subdirectory. And while the actual source is determined by the specified commit id, the contents of those subdirectories are not. Knowledge of which such additional files and directories are created is specific to the individual rule.

Proposal: rules clean up themselves

We propose that the rules are in charge of removing all unrelated files and directories; at the very least they must remove all parts that are not byte-for-byte reproducible.
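As an illustration of what such a cleanup step might do, here is a minimal Python sketch. It is not how any bazel-provided rule is implemented; the function name `remove_unrelated` and the default list of directory names are assumptions, and a real rule would perform the equivalent step inside its own implementation.

```python
import os
import shutil


def remove_unrelated(repo_root, names=(".git",)):
    """Delete rule-specific bookkeeping directories from a fetched repository.

    Returns the removed paths, relative to repo_root, in sorted order.
    """
    removed = []
    for dirpath, dirnames, _ in os.walk(repo_root):
        for name in list(dirnames):
            if name not in names:
                continue
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                os.unlink(full)  # rmtree refuses to operate on symlinks
            else:
                shutil.rmtree(full)
            dirnames.remove(name)  # do not descend into the deleted directory
            removed.append(os.path.relpath(full, repo_root))
    return sorted(removed)
```

Calling `remove_unrelated(root)` on a checkout removes every `.git` directory below `root` and leaves the source files untouched, so only the content determined by the commit id remains to be hashed.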

Alternative considered: rules tell bazel to ignore certain parts

An alternative considered was that the rules would declare which parts of the created directory are not part of the code and should be ignored by bazel. This would, however, imply an even more complicated interface, as rules would then have to return two kinds of information: the actual resolved information (i.e., the new dict of keyword arguments), as well as the set of objects to ignore.

This seems like quite an extension of the interface for unclear benefit; as the directory of an external repository is completely removed before another call to the rule, we cannot save bandwidth by keeping the .git repository around.

Proposed rollout

Hash included in the `resolved` value

The resolved value in the --experimental_repository_resolved_file will contain an additional key output_tree_hash in every entry of the repositories field of every entry describing a call to a Skylark repository rule. As only a new field is added (and the value is experimental anyway), this change does not break any legitimate use cases: users are free to ignore the additional value.
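The following Python sketch shows what such a value might look like and how a consumer could read the new key. Only `repositories`, `output_tree_hash`, and the `name` attribute come from the text above; the remaining keys (`original_rule_class`, `original_attributes`, `rule_class`, `attributes`), the repository name, and all values are assumptions or placeholders for illustration.

```python
# Hypothetical resolved value; names and values are invented placeholders.
resolved = [
    {
        "original_rule_class": "git_repository",
        "original_attributes": {"name": "example_lib", "branch": "master"},
        "repositories": [
            {
                "rule_class": "git_repository",
                "attributes": {
                    "name": "example_lib",
                    "commit": "<resolved commit id>",
                },
                "output_tree_hash": "<hash of the generated directory>",
            },
        ],
    },
]


def output_tree_hashes(resolved):
    """Map each repository name to its recorded output_tree_hash.

    Entries without the key are skipped, so values written before the
    key existed can still be read.
    """
    hashes = {}
    for entry in resolved:
        for repo in entry.get("repositories", []):
            if "output_tree_hash" in repo:
                hashes[repo["attributes"]["name"]] = repo["output_tree_hash"]
    return hashes
```

A consumer that does not care about the hash can simply never look up the key, which is why adding it is backwards compatible.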

Hash verification

In the long run, all source-like repositories will be taken from the WORKSPACE.resolved file, where hashes are provided separately and can be checked. This check will only be done for source-like rules, and we will add an option to ignore it for individual repositories.

To allow for a clean transition period, the output-directory verification will be opt-in initially, even before we fully switch to the WORKSPACE versus WORKSPACE.resolved distinction, and also before the source-like/configure-like distinction is made. We add a new option. This option will specify a Skylark file from which the value resolved is taken; that value is expected to have the same structure as the resolved value in a file written by the --experimental_repository_resolved_file option. Repositories are associated via the name attribute. As rules will only become reproducible one by one, during the transition period an option will be used to specify the repository rules for which verification should happen (defaulting to the empty list, so this feature is opt-in as well).
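The opt-in check described above can be sketched in Python as follows. This is a hedged illustration, not bazel's verification code: the function name `verify_output_hashes`, its parameters, and the plain-dict inputs are all hypothetical.

```python
def verify_output_hashes(expected, actual, rule_classes, verify_rules=()):
    """Return the names of repositories whose output hash is not as expected.

    expected, actual: dicts mapping repository name to output tree hash.
    rule_classes:     dict mapping repository name to its rule class.
    verify_rules:     rule classes opted in to verification; the empty
                      default means that nothing is verified.
    """
    mismatches = []
    for name, got in sorted(actual.items()):
        if rule_classes.get(name) not in verify_rules:
            continue  # this rule has not been opted in
        want = expected.get(name)
        # Repositories are associated by name; those without an expected
        # hash are not checked.
        if want is not None and want != got:
            mismatches.append(name)
    return mismatches
```

With the default empty `verify_rules`, the function reports no mismatches regardless of the hashes, mirroring the opt-in nature of the feature during the transition period.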
