Using a new Docker version (27) to run an old-ish docker image saved in the dataset causes DataLad error (image ID mismatch) #269

Open
mslw opened this issue Feb 4, 2025 · 0 comments

With Docker 27, trying to run a Docker container that was saved with an older version of Docker results in an error:

>python -m datalad_container.adapters.docker run container/image sh -c "echo 123"
(...)
RuntimeError: docker image sha256:f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3 was not successfully loaded

Docker loads an image, but its ID does not match what DataLad expects based on the image that was stored:

>docker image ls
REPOSITORY   TAG       IMAGE ID       CREATED         SIZE
remodnav     latest    81aaa31870f5   16 months ago   3.8GB

This was observed when trying to reproduce paper-remodnav (versioned link), and snippets in this issue are based on that dataset.

Which software versions are affected?

Unclear. The problem was observed and later confirmed on Windows with Docker version 27.5.1. For me, the problem does not replicate on Debian 12 (bookworm) with Docker version 20.10.4 (docker.io package). @mih reports that it still works on his laptop, with v26.1.5.

As far as saving the image goes, I don't know which Docker version was used; however, I suppose < 25 for reasons explained below.

Where in the code does the problem happen?

The error message comes from the load function in datalad_container.adapters.docker:

def load(path, repo_tag, config):
    """Load the Docker image from `path`.

    Parameters
    ----------
    path : str
        A directory with an extracted tar archive.
    repo_tag : str or None
        `image:tag` of image to load
    config : str or None
        "Config" value or prefix of image to load

    Returns
    -------
    The image ID (str)
    """
    # FIXME: If we load a dataset, it may overwrite the current tag. Say that
    # (1) a dataset has a saved neurodebian:latest from a month ago, (2) a
    # newer neurodebian:latest has been pulled, and (3) the old image have been
    # deleted (e.g., with 'docker image prune --all'). Given all three of these
    # things, loading the image from the dataset will tag the old neurodebian
    # image as the latest.
    image_id = "sha256:" + get_image(path, repo_tag, config)
    if image_id not in _list_images():
        lgr.debug("Loading %s", image_id)
        cmd = ["docker", "load"]
        p = sp.Popen(cmd, stdin=sp.PIPE, stdout=sp.PIPE, stderr=sp.PIPE)
        with tarfile.open(fileobj=p.stdin, mode="w|", dereference=True) as tar:
            tar.add(path, arcname="")
        out, err = p.communicate()
        return_code = p.poll()
        if return_code:
            lgr.warning("Running %r failed: %s", cmd, err.decode())
            raise sp.CalledProcessError(return_code, cmd, output=out)
    else:
        lgr.debug("Image %s is already present", image_id)

    if image_id not in _list_images():
        raise RuntimeError(
            "docker image {} was not successfully loaded".format(image_id))
    return image_id

The function performs a relatively simple operation: it creates a tar file object from the contents of the requested directory, and pipes it directly into docker load (all done with streams, without saving intermediate files). It then compares the image ID reported by docker to the one inferred from the image stored in the dataset - this is where the error is raised.

The expected ID is returned by get_image:

def get_image(path, repo_tag=None, config=None):
    """Return the image ID of the image extracted at `path`.
    """
    manifest_path = op.join(path, "manifest.json")
    with open(manifest_path) as fp:
        manifest = json.load(fp)
    if repo_tag is not None:
        manifest = [img for img in manifest if repo_tag in (img.get("RepoTags") or [])]
    if config is not None:
        manifest = [img for img in manifest if img["Config"].startswith(config)]
    if len(manifest) == 0:
        raise ValueError(f"No matching images found in {manifest_path}")
    elif len(manifest) > 1:
        raise ValueError(
            f"Multiple images found in {manifest_path}; disambiguate with"
            " --repo-tag or --config"
        )

    with open(op.join(path, manifest[0]["Config"]), "rb") as stream:
        return hashlib.sha256(stream.read()).hexdigest()

Again, the operation is relatively simple. The function opens the image manifest stored in the dataset, opens the config file it points to, and hashes its content.
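
To see that computation in isolation, here is a minimal sketch that runs the same "read manifest, hash config" steps on a throwaway directory. The helper names (expected_image_id, make_toy_layout) and the toy config contents are made up for illustration and are not part of datalad-container; the sketch also ignores the repo_tag / config filtering that get_image performs.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def expected_image_id(image_dir):
    """Hash the config file referenced by the first manifest.json entry."""
    image_dir = Path(image_dir)
    manifest = json.loads((image_dir / "manifest.json").read_text())
    config_bytes = (image_dir / manifest[0]["Config"]).read_bytes()
    return "sha256:" + hashlib.sha256(config_bytes).hexdigest()


def make_toy_layout(image_dir):
    """Write a made-up, minimal old-style layout; return the config bytes."""
    image_dir = Path(image_dir)
    config_bytes = b'{"architecture":"amd64","os":"linux"}'
    # In the old layout seen in this issue, the config file is named
    # after its own SHA-256 (f881b...json hashes to f881b...).
    config_name = hashlib.sha256(config_bytes).hexdigest() + ".json"
    (image_dir / config_name).write_bytes(config_bytes)
    manifest = [{"Config": config_name, "RepoTags": ["toy:latest"], "Layers": []}]
    (image_dir / "manifest.json").write_text(json.dumps(manifest))
    return config_bytes


with tempfile.TemporaryDirectory() as tmp:
    config_bytes = make_toy_layout(tmp)
    image_id = expected_image_id(tmp)
    print(image_id)
```

The point of the toy example is only that the expected ID depends on nothing but the bytes of the config file, so any Docker version that derives its ID from a different file will disagree with it.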

Investigating the docker save layout and speculation about IDs

With that dataset, I am able to mimic DataLad's approach in creating the tar file, and save it to a file for further inspection and for loading with docker load -i:

>>> with tarfile.open("img.tar", mode="w|", dereference=True) as tar:
...     tar.add("container\\image", arcname="")

Note: I tried writing the tar file on both GNU/Linux and Windows. The files had different checksums (new line characters? tar header?) but both produced the same image ID when loaded on Windows.

With that, I also tried a docker load - docker save round-trip. Docker 27 has no problem loading an image generated from the dataset content in the manner above. When saving, it produces a different layout - one that is OCI compatible in fact. See OCI image format specification and, in particular, the part about Image layout.

The change in save layout was most likely introduced in Docker 25 - the release notes for Docker Engine 25.0.0 include "The docker image save tarball output is now OCI compliant".

This is the layout of a tar file created from the dataset:

img_dataset
├── 360338cd2a802f4812f06fbc50237a42bc0303390efa7fa321c381e6ec36d1ae
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── 705094a41713537ec5205e79423114633a7225bae388e7ba823d92126c6b36c0
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3.json
├── manifest.json
└── repositories

And this is the one created after running docker load and docker save:

img_load_save
├── blobs
│   └── sha256
│       ├── 81aaa31870f52a6265bef39d0be0df7f82bab3839344ec8da54cc6c18e3fd7a0
│       ├── d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e
│       ├── e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb
│       └── f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3
├── index.json
├── manifest.json
└── oci-layout

Note that the blobs include both 81aaa (which matches the image ID reported by Docker 27) and f881b (which matches the ID that DataLad expected to see, and more than likely also the ID that Docker 20 would report).

Let's explore the new layout then (note: all JSON contents below are presented with jq for readability). First, there is manifest.json:

[
  {
    "Config": "blobs/sha256/f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3",
    "RepoTags": [
      "remodnav:latest"
    ],
    "Layers": [
      "blobs/sha256/d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e",
      "blobs/sha256/e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb"
    ]
  }
]

The manifest references the config with f881b checksum - this is the "old" config, and the one DataLad would look at when determining the expected image ID! However, according to the OCI Image Layout Specification, this manifest is a "file associated with a backwards compatible docker save format", and is not part of the spec.

The mandatory file, according to the OCI spec, is index.json, and here are its contents:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
      "digest": "sha256:81aaa31870f52a6265bef39d0be0df7f82bab3839344ec8da54cc6c18e3fd7a0",
      "size": 586,
      "annotations": {
        "io.containerd.image.name": "docker.io/library/remodnav:latest",
        "org.opencontainers.image.ref.name": "latest"
      }
    }
  ]
}

This index file points to a manifest with a digest (81aaa) matching the image ID reported by Docker 27.

Here is the content of that manifest, i.e. blobs/sha256/81aaa...:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "digest": "sha256:f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3",
    "size": 3157
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "digest": "sha256:d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e",
      "size": 77814784
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar",
      "digest": "sha256:e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb",
      "size": 1750877184
    }
  ]
}

This manifest points to a config file with the f881b digest, i.e. exactly the one from the dataset!

It would seem that Docker now uses this manifest, rather than the config file, as the basis for the image ID. Given that it is checksums (of the config and the layers) all the way down, the two seem equivalent, with Docker now hashing a "higher-level" metadata file. However, I wasn't able to find any mention of the ID change in Docker's release notes or documentation, so this is speculation based on comparing the save layouts and reading the OCI spec.
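
The chain index.json -> manifest blob -> config digest can be followed programmatically. Below is a minimal sketch, assuming a single-image layout like the one above; the function names (blob_path, candidate_ids, write_blob) are hypothetical, and the demo runs on a toy layout with made-up contents rather than on the real image.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def blob_path(layout_dir, digest):
    """Map 'sha256:<hex>' to blobs/sha256/<hex> inside the layout."""
    algorithm, hex_digest = digest.split(":", 1)
    return Path(layout_dir) / "blobs" / algorithm / hex_digest


def candidate_ids(layout_dir):
    """Return (manifest_digest, config_digest) for a single-image layout."""
    index = json.loads((Path(layout_dir) / "index.json").read_text())
    manifest_digest = index["manifests"][0]["digest"]
    manifest_bytes = blob_path(layout_dir, manifest_digest).read_bytes()
    # Blobs are content-addressed, so the file name must equal the hash.
    actual = "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()
    if actual != manifest_digest:
        raise ValueError(f"manifest blob does not match {manifest_digest}")
    config_digest = json.loads(manifest_bytes)["config"]["digest"]
    return manifest_digest, config_digest


def write_blob(layout_dir, data):
    """Store bytes under their SHA-256 and return the digest string."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    target = blob_path(layout_dir, digest)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return digest


# Toy layout with made-up contents, just to exercise the lookup.
with tempfile.TemporaryDirectory() as tmp:
    config_digest = write_blob(tmp, b'{"os":"linux"}')
    manifest = {"schemaVersion": 2, "config": {"digest": config_digest}, "layers": []}
    manifest_digest = write_blob(tmp, json.dumps(manifest).encode())
    index = {"schemaVersion": 2, "manifests": [{"digest": manifest_digest}]}
    (Path(tmp) / "index.json").write_text(json.dumps(index))
    ids = candidate_ids(tmp)
```

Going by the JSON shown above, running candidate_ids on the extracted img_load_save directory should return the 81aaa... digest first (what Docker 27 lists) and the f881b... digest second (what DataLad expects).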

How can we fix this?

This is unclear at the moment.

If I am right that Docker 27 bases the ID on a metadata representation which is equivalent to, but different from, the file saved in the dataset, then with the old layout we can't know the ID upfront (unless we try to create the manifest ourselves, which seems doable but finicky).
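
As a rough illustration of what "creating the manifest ourselves" could look like, here is a sketch that serialises a manifest from digests and sizes. The key order and the whitespace-free JSON encoding are assumptions guessed from a single docker save output, and build_manifest_bytes is a hypothetical helper; for an old-layout image the digests and sizes would presumably have to be obtained by hashing and measuring the config file and each layer.tar.

```python
import json

# Media types copied from the manifest blob shown earlier in this issue.
MANIFEST_TYPE = "application/vnd.docker.distribution.manifest.v2+json"
CONFIG_TYPE = "application/vnd.docker.container.image.v1+json"
LAYER_TYPE = "application/vnd.docker.image.rootfs.diff.tar"


def build_manifest_bytes(config, layers):
    """Serialise a manifest from (digest, size) pairs as compact JSON.

    ASSUMPTION: key order and whitespace-free encoding are guessed from a
    single `docker save` output; Docker's real encoder may differ.
    """
    def descriptor(media_type, digest, size):
        return {"mediaType": media_type, "digest": digest, "size": size}

    manifest = {
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": descriptor(CONFIG_TYPE, *config),
        "layers": [descriptor(LAYER_TYPE, *layer) for layer in layers],
    }
    return json.dumps(manifest, separators=(",", ":")).encode()


# Digests and sizes as listed in the manifest blob above.
manifest_bytes = build_manifest_bytes(
    config=("sha256:f881bd4db45ac9775f5a5377485a7c939fea4685d7482eed4809cb83fc3b51a3", 3157),
    layers=[
        ("sha256:d310e774110ab038b30c6a5f7b7f7dd527dbe527854496bd30194b9ee6ea496e", 77814784),
        ("sha256:e2728fc6d2c404f7b41e0fa4f889117090f4476eefab2bca48d7164dcbf7a0cb", 1750877184),
    ],
)
print(len(manifest_bytes))
```

With these values the compact serialisation is 586 bytes long, the same as the "size" recorded for the manifest in index.json. That is encouraging, but it does not prove the bytes (and hence the SHA-256) are identical to what Docker produces; that would still need to be checked against the 81aaa... digest.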

One possible workaround would be to simply drop the ID check which produced an error. We would still rely on an exit code from docker load giving us some assurance that loading succeeded, so it does not sound entirely wrong.

However, the expected ID is being checked (against a list of Docker images being present) twice. The first time, it is done to decide whether the image needs to be loaded in the first place. So not changing that part would mean loading the image every time the function is called, which sounds bad.
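
The effect of dropping only the second check can be simulated without Docker. In this sketch, load_if_missing is a simplified, hypothetical stand-in for the adapter's load function, and list_images / run_load are injected placeholders for its real helpers; the shortened IDs are made up.

```python
def load_if_missing(expected_id, list_images, run_load, verify_after=True):
    """Load an image unless `expected_id` is already listed.

    `list_images` returns the IDs Docker currently reports; `run_load`
    performs the load and raises on a non-zero exit code. Both are
    hypothetical stand-ins for the adapter's real helpers.
    """
    if expected_id not in list_images():
        run_load()
        if verify_after and expected_id not in list_images():
            raise RuntimeError(
                f"docker image {expected_id} was not successfully loaded")
    return expected_id


# Simulate Docker 27: the image loads fine but is listed under another ID.
listed = set()
load_calls = []


def fake_load():
    load_calls.append(1)
    listed.add("sha256:81aa")  # shortened, made-up stand-in for the new ID


for _ in range(3):
    load_if_missing("sha256:f881", lambda: listed, fake_load, verify_after=False)

print(len(load_calls))  # 3: the image is reloaded on every call
```

With verify_after=False the error goes away, but because the expected ID never shows up in the listing, every call triggers another load, which is the drawback described above.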

mslw added a commit to mslw/datalad-container that referenced this issue Feb 5, 2025
It appears that while Docker 27 has no problem loading images saved with
older versions, it generates the ID based on the "new style"
(OCI-compliant) manifest that it would save starting with v25, and not
the config file stored in the dataset. This causes DataLad to error out
due to ID mismatch, although the ID is most likely equivalent; see datalad#269

This commit is the first attempt to solve this issue. Since the manifest
is a structured file, an attempt is made to generate a "new" style
manifest based on the contents of the saved image, and derive the ID
from that.

The manifest needs file types, sizes, and checksums. While we could copy
checksums from the previous manifest / config, we do not seem to have
the sizes. To solve that problem, we get both through ls_file_collection
from datalad-next. This is convenient and quick, but introduces a new
dependency.

The generated structure and content are a guesswork based on reading the
OCI spec and seeing docker save output from a single container - it sure
works from that container and tries to be applicable more broadly, but
most likely won't cover more complicated cases, or those where I'm not
even sure what behavior to expect (e.g. multi-arch manifest?). Layers
are assumed to always be rootfs_diff (I currently don't know if there
are other types possible).

This commit focuses on reading older images with new Docker, and does
not address reading new images (reading images saved with Docker 26
would still fail, because it already uses the new save format which our
adapter does not expect). So the combinatorics around that will need to
be addressed later.

The new code would only trigger for Docker 27. It introduces one small
regression, where get_image_id raises a NotImplementedError for two
arguments which can be given to the old get_image.