created | last updated | status | reviewers | title | authors
---|---|---|---|---|---
2018-07-13 | 2018-07-13 | implemented | | Bazel hashing of external directory output |
We allow repository rules to follow freely floating targets, provided they return a modified set of arguments that yields a reproducible version of themselves (design, blog post). As writing rules that correctly provide those arguments is a hard and potentially error-prone task, bazel will support this by computing, and supporting verification of, hashes of the directory generated by an external repository. This is similar to the way sandboxing supports writing rules by detecting missing dependencies in actions.
It is important to understand that this hash is not a security feature. Certain data of the directory structure, in particular the owner of the files, are deliberately not included. And while a reasonable build will not depend on the ownership of its source files, it is technically possible to have a build where the behavior of the resulting binary depends on the owner of a source file, even in a malicious way.
We expect users to build as ordinary users (i.e., not as the privileged user). Therefore, files can only be created with the current user as owner, so file ownership in external repositories differs from developer to developer. This is fine as long as the external repository is built in a reasonable way.
For the same reason, we do not include the time stamp in the hash of the directory. Files generated by `ctx.file`, as well as files checked out by `git`, will have the current time as time stamp, which is not reproducible, but a sensible build process will not depend on the time stamps.
We do include the information whether the file is executable for the owner. Wrong permissions can be the cause of annoying build failures, and most ways of providing an external repository either track this information explicitly or, at least, set it to a reproducible value (not executable).
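To make the choice of inputs concrete, here is a minimal Python sketch of a per-file digest that covers the file contents and the owner-executable bit while ignoring ownership and time stamps. The function name `file_entry_digest`, the use of SHA-256, and the one-byte tag for the executable bit are assumptions made for illustration; they are not taken from bazel's implementation.

```python
import hashlib
import os
import stat


def file_entry_digest(path):
    """Digest a regular file from its contents and owner-executable bit.

    Owner, group, and time stamps are deliberately left out, so the
    result is the same for every developer who fetches the same files.
    """
    mode = os.stat(path).st_mode
    executable = bool(mode & stat.S_IXUSR)

    h = hashlib.sha256()
    # Hypothetical framing: one tag byte records the executable bit.
    h.update(b"x" if executable else b"-")
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

With this sketch, changing a file's modification time leaves the digest unchanged, whereas toggling the owner-executable bit changes it.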
Symlinks provide two kinds of information. Primarily, they are just a string that can be read via `readlink(2)`; but if a symlink is accessed as a file or directory, e.g., via `fopen(3)`, the string is interpreted as a filename and the respective file or directory, if it exists, is opened instead.
- `http_archive` and other bazel-provided repository rules symlink the `BUILD` file into the generated repository. The symlink is absolute, so in particular it depends on the location of the workspace on the local machine. Hence we cannot just `readlink(2)` all links and expect a reproducible hash.
- Some external projects come with cyclic symlinks. E.g., the alsa library (alsa-lib-1.1.2.tar.bz2 with sha256 d38dacd9892b06b8bff04923c380b38fb2e379ee5538935ff37e45b395d861d6) has, in the `include` subdirectory, a symlink `alsa` pointing to `.`. So replacing all symlinks by (the hash of) what they point to does not work, as symlink cycles exist in the real world.

Given these two observations, symlinks are hashed as follows.

- For absolute paths pointing to files, the file being pointed to is hashed, including the information whether it is executable.
- For all other symlinks (relative paths, absolute links to directories, dangling symlinks), the link itself is hashed.
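The two symlink cases can be sketched as a small Python classifier. The name `classify_symlink` and the tuple return shape are hypothetical; the sketch only decides what would be fed into the hash, not how bazel encodes it.

```python
import os


def classify_symlink(path):
    """Decide what to hash for the symlink at `path`.

    Returns ("file", target) when the link is an absolute path to an
    existing regular file, meaning the target file itself is hashed.
    Returns ("link", target) in every other case (relative link,
    absolute link to a directory, dangling link), meaning only the
    link string is hashed.
    """
    target = os.readlink(path)
    # os.path.isfile follows links and is False for directories and
    # for targets that do not exist.
    if os.path.isabs(target) and os.path.isfile(target):
        return ("file", target)
    return ("link", target)
```

Because relative links are never followed, a cyclic link such as `alsa` pointing to `.` is handled by hashing the string `.` and cannot cause unbounded recursion.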
For external repository rules like `git_repository`, additional directories are created besides the actual source code, e.g., the `.git` subdirectory. And while the actual source is determined by the specified commit id, the contents of those subdirectories are not. Knowing which additional files and directories are created is specific to the individual rule.
We propose that the rules are in charge of removing all unrelated files and directories; at the very least they must remove all parts that are not byte-for-byte reproducible.
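As an illustration of this clean-up responsibility, the following Python sketch removes bookkeeping directories from a fetched tree before it would be hashed. The helper `strip_unrelated` and its default list of names are assumptions; a real repository rule would perform the equivalent step in Skylark and would know its own set of directories.

```python
import os
import shutil


def strip_unrelated(repo_dir, names=(".git",)):
    """Remove rule-specific bookkeeping directories from `repo_dir`.

    Only top-level entries listed in `names` are removed (an assumption
    of this sketch). Returns the names that were actually deleted.
    """
    removed = []
    for name in names:
        path = os.path.join(repo_dir, name)
        # Skip symlinks: shutil.rmtree refuses to operate on them.
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed
```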
An alternative considered was that the rules would declare which parts of the created directory are not part of the code and should be ignored by bazel. This would, however, imply an even more complicated interface, as rules would then have to return two kinds of information: the actual resolved information (i.e., the new dict of keyword arguments), as well as the set of objects to ignore.
This seems quite an extension of the interface for unclear benefits: as the directory of an external repository is completely removed before another call to the rule, we cannot save bandwidth by keeping the `.git` repository around.
The `resolved` value in the file written by `--experimental_repository_resolved_file` will contain an additional key `output_tree_hash` for every entry in the `repositories` field of every entry indicating the call to a Skylark repository rule. As only a new field is added (and the value is experimental anyway), this change does not break any legitimate use cases: users are free to ignore the additional value.
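The shape of such a value can be sketched in Python syntax, which Skylark shares. Only `resolved`, `repositories`, `output_tree_hash`, and the `name` attribute are named in the text; the remaining keys, the rule class string, and the hash value are placeholders assumed for illustration. The hypothetical helper `hashes_by_name` shows that a consumer can read the new key when it is present and ignore it otherwise.

```python
# Hypothetical excerpt of a resolved value. Keys other than
# "repositories" and "output_tree_hash" are assumptions of this sketch.
resolved = [
    {
        "original_rule_class": "example.bzl%example_repo",
        "repositories": [
            {
                "rule_class": "example.bzl%example_repo",
                "attributes": {"name": "example"},
                "output_tree_hash": "a" * 64,  # placeholder digest
            },
        ],
    },
]


def hashes_by_name(resolved_value):
    """Map repository name to its recorded output_tree_hash.

    Entries without the key are skipped, so files written before the
    key existed are still accepted.
    """
    result = {}
    for entry in resolved_value:
        for repo in entry.get("repositories", []):
            name = repo.get("attributes", {}).get("name")
            tree_hash = repo.get("output_tree_hash")
            if name is not None and tree_hash is not None:
                result[name] = tree_hash
    return result
```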
In the long run, all source-like repositories will be taken from the `WORKSPACE.resolved` file, where hashes are provided separately and can be checked. This check will only be done for source-like rules, and we will add an option to ignore it for individual repositories.
To allow for a clean transition period, the output-directory verification will be opt-in initially, even before we fully switch to the `WORKSPACE` versus `WORKSPACE.resolved` distinction, and also before the source-like/configure-like distinction is done. We add a new option. This option will specify a Skylark file, from which the value `resolved` is taken; it is expected to have the same structure as a resolved value in a file written by the `--experimental_repository_resolved_file` option. Repositories are associated via the `name` attribute. As rules will only start to become reproducible one by one, in the transition period an option will be used to specify the repository rules for which verification should happen (defaulting to the empty list, so this feature is opt-in as well).
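The opt-in check described above might look roughly like the following Python sketch. The function `verify_output_hashes`, its parameters, and the plain-dict inputs are hypothetical stand-ins for whatever bazel uses internally; the point is only that repositories are matched by name and that an empty list of rules disables verification.

```python
def verify_output_hashes(expected, actual, rule_classes, verify_rules=()):
    """Return the names of repositories whose output hash mismatches.

    expected:     name -> hash taken from the provided `resolved` value
    actual:       name -> hash computed for the generated directory
    rule_classes: name -> rule class that produced the repository
    verify_rules: rule classes opted in; empty means nothing is checked
    """
    mismatches = []
    for name, actual_hash in sorted(actual.items()):
        if rule_classes.get(name) not in verify_rules:
            continue  # this rule has not been opted in
        if name in expected and expected[name] != actual_hash:
            mismatches.append(name)
    return mismatches
```

In this sketch a repository without an expected hash is not reported; whether a missing entry should count as an error is a design choice the text leaves open.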