zfs_snapshot_tar exited with code: 1 - Stale NFS file handle #134

Closed
teutat3s opened this issue May 21, 2018 · 8 comments

@teutat3s (Member) commented May 21, 2018

Creating a new issue because an error slightly similar to issue #100 appears in our TritonDataCenter private cloud.
We did a fresh basic setup with the latest 20180510 USB image; the post-setup steps for docker, CNS and HA are done.
1 headnode, 4 compute nodes. sdc-healthcheck and sdcadm check-health show no errors.

I'm stuck debugging the following error when running docker-compose up -d:

ERROR: Service 'nginx' failed to build: Build failed: zfs_snapshot_tar exited with code: 1 (zfs_snapshot_tar: zfs exited non-zero (1) with: Unable to determine path or stats for object 6 in zones/e0fc9ca0-8fcb-e7fe-fc1f-912f453fe004@buildlayer1: Stale NFS file handle

In the adminui I can see a stopped "build_XXXXXXX" docker container. It happens with different docker-compose.yml files - the last ones I tried are from the mattermost and autopilotpattern hello-world examples. In both cases an alpine-linux image is pulled and then modified. After the apk packages are downloaded and installed, the build process crashes. How should I proceed with debugging?
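For reference, a minimal sketch of commands that could be run in the compute node's global zone to inspect the leftover build layer (illustrative only, not from the original report; the zone UUID is taken from the error above):

vmadm list -o uuid,alias,state | grep build    # locate the stopped build_XXXXXXX container
zfs list -t snapshot -r zones/e0fc9ca0-8fcb-e7fe-fc1f-912f453fe004    # list snapshots left behind by the build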

Full history

triton-compose up -d

Building nginx
Step 1/15 : FROM alpine:3.4
---> c7fc7faf8c28
Step 2/15 : RUN apk update && apk add nginx curl unzip && rm -rf /var/cache/apk/*
---> Running in e0fc9ca08fcb
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
v3.4.6-299-ge10ec9b [http://dl-cdn.alpinelinux.org/alpine/v3.4/main]
v3.4.6-160-g14ad2a3 [http://dl-cdn.alpinelinux.org/alpine/v3.4/community]
OK: 5973 distinct packages available
(1/8) Installing ca-certificates (20161130-r0)
(2/8) Installing libssh2 (1.7.0-r0)
(3/8) Installing libcurl (7.59.0-r0)
(4/8) Installing curl (7.59.0-r0)
(5/8) Installing nginx-common (1.10.3-r0)
Executing nginx-common-1.10.3-r0.pre-install
(6/8) Installing pcre (8.38-r1)
(7/8) Installing nginx (1.10.3-r0)
(8/8) Installing unzip (6.0-r1)
Executing busybox-1.24.2-r14.trigger
Executing ca-certificates-20161130-r0.trigger
OK: 8 MiB in 19 packages
ERROR: Service 'nginx' failed to build: Build failed: zfs_snapshot_tar exited with code: 1 (zfs_snapshot_tar: zfs exited non-zero (1) with: Unable to determine path or stats for object 6 in zones/e0fc9ca0-8fcb-e7fe-fc1f-912f453fe004@buildlayer1: Stale NFS file handle

) (req_id: 06b0002a-4c88-453e-9026-a96486c6ed6d)

docker-compose.yml

Dockerfile

EDIT: I also tried what @twhiteman suggested for checking the snapshots, but a zfs diff doesn't work in our case: we get stuck after the first snapshot, so there is no second snapshot to diff against ;(

@twhiteman (Contributor)

Docker build may have issues on platform-20180503 and newer.

See https://smartos.org/bugview/TRITON-372 for the issue details.

Note: for the diff, if you only have one build snapshot, you can perform a zfs diff against the base image snapshot instead (which is what docker build would do in this case). E.g.

IMAGE=$(vmadm get e0fc9ca0-8fcb-e7fe-fc1f-912f453fe004 | json image_uuid)
zfs diff zones/$IMAGE@final zones/e0fc9ca0-8fcb-e7fe-fc1f-912f453fe004@buildlayer1

@teutat3s (Member, Author)

@twhiteman thanks for the heads-up - I hadn't found that bug.

Do you think a platform rollback to 20180426 would work as a temporary workaround? I still have that image ready on the usbkey. All agents were updated to the latest release version, too - I don't know whether that would cause a problem with the older platform?

@twhiteman (Contributor)

Yes, an older platform should work correctly.
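Roughly, such a rollback could look like this from the headnode's global zone (a sketch, assuming the 20180426 platform image is still installed; the timestamp placeholder is hypothetical):

sdcadm platform list    # confirm the 20180426 build is still installed
sdcadm platform assign 20180426TXXXXXXZ --all    # assign the older platform to all servers
# reboot the compute nodes afterwards so the assignment takes effect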

Alternatively, you could run your builds on a local dev machine, push them to a docker registry, and pull them down using docker pull.
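A minimal sketch of that workflow (the image name and registry are hypothetical examples):

# on a local dev machine, from the project directory
docker build -t myorg/nginx:latest .
docker push myorg/nginx:latest
# in the compose file, replace the build: section with image: myorg/nginx:latest,
# then on Triton the image is pulled instead of built:
docker pull myorg/nginx:latest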

@jclulow (Contributor) commented May 23, 2018

We've backed out the upstream illumos change that was causing these issues. Today's release will not include the regression that causes this issue.

@teutat3s (Member, Author)

I can confirm building works again with the newest Triton version - thanks!

@Adel-Magebinary

Hi guys,

I'm still having this issue on the latest build.

OK: 389 MiB in 110 packages
Build failed: zfs_snapshot_tar exited with code: 1 (zfs_snapshot_tar: readlinkat: No such file or directory
) (req_id: 051ab5c2-d4fb-493d-bd5c-2015b1a8146a)

Current platform: 20180913T001833Z

@jclulow (Contributor) commented Sep 26, 2018

@Adel-Magebinary, I believe that is a different issue. Note in particular that the message is zfs_snapshot_tar: readlinkat: No such file or directory rather than Stale NFS file handle. #98 looks a bit closer to what you're seeing.

@Adel-Magebinary commented Sep 26, 2018

Hello @jclulow,

I can successfully build the image when I remove the lines below from the Dockerfile:

RUN apk del --no-network .ruby-builddeps \
    && cd / \
    && rm -r /usr/src/ruby \
    && rm -r /root/.gem/
