race between blkid and destroy_vbd_frontend can cause hang #41
djs55 encountered the same issue, which he described in a ticket that no longer exists. Here's a copy in case Google's cache entry expires. djs55 opened this issue 4 months ago: It looks like qemu isn't resilient to the guest writing error nodes in xenstore:
Well, this time around it occurred when a VM whose install failed before the disk was partitioned rebooted. The bootloader was set to 'pygrub' from 'eliloader', and pygrub failed, unable to find the partition. The difference this time is that there was no hung 'blkid' process, nor anything holding the device open identifiable by 'lsof'.
My preferred long-term fix for this is to avoid attaching the device to dom0, and instead use a userspace app to read it, possibly via the NBD protocol talking to tapdisk or qemu. In the short term, a 'udevadm settle' like you suggest sounds good. I think it should live in xenopsd, just before we attempt to unplug the device, probably here: https://github.com/xapi-project/xenopsd/blob/master/xc/device.ml#L147 and https://github.com/xapi-project/xenopsd/blob/master/xl/xenops_server_xenlight.ml#L859 What do you think, @robhoes?
@djs55 Yes, that sounds good to me.
On EL6:
When building a PV VM with pygrub, create_vbd_frontend attaches the VM's boot block device to dom0 so that pygrub can read it. The attach event triggers udev, which starts blkid.
If blkid does not finish before pygrub, destroy_vbd_frontend will fail to close the device, since blkid is holding it open.
After this, things go badly: the task hangs, the VDI remains attached to dom0, the blkid process cannot be killed, and a reboot is required; but the reboot then hangs while stopping the 'blk-availability' service, so the host must be power cycled.
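To confirm that something is still holding the block device open before blaming the frontend teardown, a quick check like the following can help. This is a sketch, not from the thread: the device path is a hypothetical example, and you would substitute the node that create_vbd_frontend actually attached to dom0.

```shell
# Hypothetical device node; substitute the one attached by create_vbd_frontend.
DEV=/dev/xvdb
# fuser prints the PIDs holding the device open and exits non-zero when
# none do (or when the device node does not exist / fuser is unavailable).
fuser -v "$DEV" 2>/dev/null || echo "no process currently holds $DEV open"
```

If this prints a blkid PID just before destroy_vbd_frontend runs, that is the race described above.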
The following links suggest running something like 'udevadm settle', which waits for the udev event queue to empty and then exits:
https://www.redhat.com/archives/libguestfs/2012-February/msg00023.html
https://rwmj.wordpress.com/2012/01/19/udev-unexpectedness/#content
As a cheap hack, I added this to the end of the pygrub script, and the problem seems to have disappeared. Of course pygrub isn't the right place for this, but I'm not sure what is. The links above suggest it's possible to run 'udevadm settle' too early, before the event has been placed in the udev queue, so perhaps it should live in destroy_vbd_frontend.
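A minimal sketch of that hack, appended to the end of the pygrub script: the guard and the 30-second cap are my assumptions, not details from the thread.

```shell
# Sketch of the cheap hack: wait for udev to drain its event queue
# (which may still be running blkid against the freshly attached disk)
# before xenopsd goes on to call destroy_vbd_frontend.
# The 30-second timeout is an arbitrary bound, chosen so a stuck udev
# can never hang the bootloader path forever.
if command -v udevadm >/dev/null 2>&1; then
    # Blocks until the udev event queue is empty or the timeout expires.
    udevadm settle --timeout=30
fi
```

Note the caveat from the links above still applies: if settle runs before the add event even reaches the queue, it returns immediately and the race remains, which is why destroy_vbd_frontend is arguably the better home for it.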