race between blkid and destroy_vbd_frontend can cause hang #41
djs55 encountered the same issue, which he described in a ticket that no longer exists. Here's a copy in case Google's cache entry expires. djs55 opened this issue 4 months ago: It looks like qemu isn't resilient to the guest writing error nodes in xenstore:
Well, this time around it occurred when a VM whose install failed before the disk was partitioned rebooted. The bootloader was set to 'pygrub' from 'eliloader', and pygrub failed, unable to find the partition. The difference this time is that there was no hung 'blkid' process, nor anything holding the device open identifiable by 'lsof'.
My preferred long-term fix for this is to avoid attaching the device to dom0, and instead use a userspace app to read it, possibly via the NBD protocol talking to tapdisk or qemu. In the short term, a 'udevadm settle' like you suggest sounds good. I think it should live in xenopsd, just before we attempt to unplug the device, probably here: https://github.com/xapi-project/xenopsd/blob/master/xc/device.ml#L147 and https://github.com/xapi-project/xenopsd/blob/master/xl/xenops_server_xenlight.ml#L859 What do you think, @robhoes?
@djs55 Yes, that sounds good to me.
On EL6:
When building a PV VM with pygrub, create_vbd_frontend attaches the VM's boot block device to dom0 so that pygrub can read it. The attach event triggers udev, which starts blkid.
If blkid does not finish before pygrub, destroy_vbd_frontend will fail to close the device, since blkid is holding it open.
After this, things go badly: the task hangs, the VDI remains attached to dom0, the blkid process cannot be killed, and a reboot is required; but the reboot then hangs while stopping the 'blk-availability' service, so the host must be power cycled.
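To confirm that something is still holding the block device open before blaming the frontend teardown, a quick check like the following can help. This is a sketch, not from the thread: the device path is a hypothetical example, and you would substitute the node that create_vbd_frontend actually attached to dom0.

```shell
# Hypothetical device node; substitute the one attached by create_vbd_frontend.
DEV=/dev/xvdb
# fuser prints the PIDs holding the device open and exits non-zero when
# none do (or when the device node does not exist / fuser is unavailable).
fuser -v "$DEV" 2>/dev/null || echo "no process currently holds $DEV open"
```

If this prints a blkid PID just before destroy_vbd_frontend runs, that is the race described above.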
The following links suggest running something like 'udevadm settle', which waits for the udev event queue to empty and then exits:
https://www.redhat.com/archives/libguestfs/2012-February/msg00023.html
https://rwmj.wordpress.com/2012/01/19/udev-unexpectedness/#content
As a cheap hack, I added this to the end of the pygrub script, and the problem seems to have disappeared. Of course pygrub isn't the right place for this, but I'm not sure what is. The links above suggest it's possible to run 'udevadm settle' too early, before the event has been placed in the udev queue, so perhaps it should live in destroy_vbd_frontend.
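A minimal sketch of that hack, appended to the end of the pygrub script: the guard and the 30-second cap are my assumptions, not details from the thread.

```shell
# Sketch of the cheap hack: wait for udev to drain its event queue
# (which may still be running blkid against the freshly attached disk)
# before xenopsd goes on to call destroy_vbd_frontend.
# The 30-second timeout is an arbitrary bound, chosen so a stuck udev
# can never hang the bootloader path forever.
if command -v udevadm >/dev/null 2>&1; then
    # Blocks until the udev event queue is empty or the timeout expires.
    udevadm settle --timeout=30
fi
```

Note the caveat from the links above still applies: if settle runs before the add event even reaches the queue, it returns immediately and the race remains, which is why destroy_vbd_frontend is arguably the better home for it.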