
libzutil: optimize zpool_read_label with AIO #11467

Closed
wants to merge 2 commits

3 changes: 2 additions & 1 deletion lib/libzutil/Makefile.am
@@ -45,7 +45,8 @@ libzutil_la_LIBADD = \
 
 if BUILD_LINUX
 libzutil_la_LIBADD += \
-	$(abs_top_builddir)/lib/libefi/libefi.la
+	$(abs_top_builddir)/lib/libefi/libefi.la \
+	-lrt
Contributor

Just a thought, but rather than pulling in a new dependency on librt, which could cause us some new portability problems, should we instead parallelize this using the thread pool library, libtpool, which has already been added above? Or would the setup/teardown costs offset any real gains? In particular, I'm thinking about macOS. @lundman, am I needlessly concerned here?

On a related note, librt already exists in my default Ubuntu initrd so I don't see any issue there. We shouldn't be making it any bigger for most Linux distributions.
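
For comparison, a rough sketch of what the libtpool-based alternative mentioned above might look like; the read_label_arg_t struct and the read_labels_tpool() helper are hypothetical names, not part of this PR, and error handling is omitted for brevity:

#include <thread_pool.h>
#include <unistd.h>

/* Hypothetical per-label task argument; not part of this PR. */
typedef struct read_label_arg {
	int		rla_fd;		/* vdev file descriptor */
	off_t		rla_offset;	/* byte offset of this label */
	vdev_phys_t	*rla_buf;	/* destination buffer */
	ssize_t		rla_result;	/* bytes read, or -1 on error */
} read_label_arg_t;

static void
read_label_task(void *arg)
{
	read_label_arg_t *rla = arg;

	rla->rla_result = pread64(rla->rla_fd, rla->rla_buf,
	    sizeof (vdev_phys_t), rla->rla_offset);
}

/* Dispatch one read per label and wait for all of them to finish. */
static void
read_labels_tpool(int fd, uint64_t size, vdev_phys_t *labels,
    read_label_arg_t *args)
{
	tpool_t *tp = tpool_create(1, VDEV_LABELS, 0, NULL);

	for (int l = 0; l < VDEV_LABELS; l++) {
		args[l].rla_fd = fd;
		args[l].rla_offset = label_offset(size, l) + VDEV_SKIP_SIZE;
		args[l].rla_buf = &labels[l];
		(void) tpool_dispatch(tp, read_label_task, &args[l]);
	}
	tpool_wait(tp);
	tpool_destroy(tp);
}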

Contributor Author

On FreeBSD and I think OSX too, lio_listio will be more efficient than using a thread pool. On Linux it's probably about the same.

Contributor

One other possible concern with using AIO here is that there may be some niche filesystems, or maybe even block device drivers, on Linux which don't fully support it. There are so many that it's impossible to keep track. It's possible this could break some existing setups.

Contributor Author

Well, I know of one. On FreeBSD, using file-backed vdevs with the files stored on NFS will fail. There's a sysctl that you can change to allow it, though. But nobody has complained about it since that change originally landed in FreeBSD in 2017.

I could easily add a serialized fallback path for the case of ENOSYS or EAGAIN. But my concern is that using the fallback path for all error types would double the amount of disk activity even for non-recoverable errors like EACCES. On Linux, I don't think there should be trouble with any particular file system, because POSIX AIO is implemented by glibc using threads. So as long as librt is available, it should work.
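
For illustration, a minimal sketch of what such a serialized fallback could look like, assuming the same labels buffer and label_offset() helper as in this PR; read_labels_serial() and the ok[] array are hypothetical:

/*
 * Hypothetical fallback: read each label with a plain pread64() when
 * lio_listio() fails with EAGAIN or ENOSYS.  Not part of this PR.
 */
static void
read_labels_serial(int fd, uint64_t size, vdev_phys_t *labels,
    boolean_t *ok)
{
	for (int l = 0; l < VDEV_LABELS; l++) {
		off_t offset = label_offset(size, l) + VDEV_SKIP_SIZE;

		/* A short read or an error just marks the label unusable. */
		ok[l] = (pread64(fd, &labels[l], sizeof (vdev_phys_t),
		    offset) == sizeof (vdev_phys_t));
	}
}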

Contributor

Good point. Then I'm happy with this as long as we confirm we're not causing a real problem for macOS.

Contributor

Just shoot me a message if you need something tested on the macOS or Windows ports. librt is a bit of a pain; I've yet to solve it in a way I like :)

Contributor Author

It would definitely be helpful if you could test this PR on OSX. If you don't want to run the full test suite, then just do "zpool import" with both file-backed and disk-backed pools. That should be sufficient. And there's a Windows port too now? Sheesh; I had no idea.

Contributor

I encountered no issues:

# ./scripts/cmd-macos.sh zpool import

   pool: BOOM
     id: 5628038252777279005
  state: ONLINE

	BOOM                                          ONLINE
	  media-3628A8FE-7BD0-8B42-89B9-6D46D57607EF  ONLINE

# ./scripts/cmd-macos.sh zpool import -d ~/
   pool: test
     id: 7967859940384804015
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

	test                       ONLINE
	  /Users/lundman/disk.img  ONLINE

It is interesting to note that userland still uses module/os/linux/zfs/vdev_file.c on all platforms; I'm not sure how I feel about that yet. But that isn't connected to the changes you made.

Contributor

still uses module/os/linux/zfs/vdev_file.c on all platforms

Yeah, that's odd. We should probably either make it clear it's common by moving it to module/zfs/vdev_file.c, or decide it really isn't and have per-platform versions.

Contributor

I came across it when working on Windows, where vdev_file.c was quite different (the 2.0 port now uses the zfs_file_* API). I believe one other os/linux file is also taken from Linux; was it abd_os.c or arc_os.c, hmm... At the time I couldn't get my own vdev_file.c to work in userland (I didn't want to spend more than 5 minutes shaving yaks). It is clearly working at the moment, but some Linux-specific change in the future could surprise someone. So a userland-only vdev_file.c could work, or I'll try again to switch to os/macos/vdev_file.c (and Windows + FreeBSD).

 endif
 
 libzutil_la_LIBADD += -lm $(LIBBLKID_LIBS) $(LIBUDEV_LIBS)
52 changes: 43 additions & 9 deletions lib/libzutil/zutil_import.c
@@ -46,6 +46,7 @@
  * using our derived config, and record the results.
  */
 
+#include <aio.h>
 #include <ctype.h>
 #include <dirent.h>
 #include <errno.h>
@@ -887,31 +888,64 @@ int
 zpool_read_label(int fd, nvlist_t **config, int *num_labels)
 {
 	struct stat64 statbuf;
-	int l, count = 0;
-	vdev_label_t *label;
+	struct aiocb aiocbs[VDEV_LABELS];
+	struct aiocb *aiocbps[VDEV_LABELS];
+	vdev_phys_t *labels;
 	nvlist_t *expected_config = NULL;
 	uint64_t expected_guid = 0, size;
-	int error;
+	int error, l, count = 0;
 
 	*config = NULL;
 
 	if (fstat64_blk(fd, &statbuf) == -1)
 		return (0);
 	size = P2ALIGN_TYPED(statbuf.st_size, sizeof (vdev_label_t), uint64_t);
 
-	error = posix_memalign((void **)&label, PAGESIZE, sizeof (*label));
+	error = posix_memalign((void **)&labels, PAGESIZE,
+	    VDEV_LABELS * sizeof (*labels));
 	if (error)
 		return (-1);
 
+	memset(aiocbs, 0, sizeof (aiocbs));
+	for (l = 0; l < VDEV_LABELS; l++) {
+		off_t offset = label_offset(size, l) + VDEV_SKIP_SIZE;
+
+		aiocbs[l].aio_fildes = fd;
+		aiocbs[l].aio_offset = offset;
+		aiocbs[l].aio_buf = &labels[l];
+		aiocbs[l].aio_nbytes = sizeof (vdev_phys_t);
+		aiocbs[l].aio_lio_opcode = LIO_READ;
+		aiocbps[l] = &aiocbs[l];
+	}
+
+	if (lio_listio(LIO_WAIT, aiocbps, VDEV_LABELS, NULL) != 0) {
+		int saved_errno = errno;
+
+		if (errno == EAGAIN || errno == EINTR || errno == EIO) {
+			/*
+			 * A portion of the requests may have been submitted.
+			 * Clean them up.
+			 */
+			for (l = 0; l < VDEV_LABELS; l++) {
+				errno = 0;
+				int r = aio_error(&aiocbs[l]);
+				if (r != EINVAL)
+					(void) aio_return(&aiocbs[l]);
+			}
+		}
+		free(labels);
+		errno = saved_errno;
+		return (-1);
+	}
+
 	for (l = 0; l < VDEV_LABELS; l++) {
 		uint64_t state, guid, txg;
 
-		if (pread64(fd, label, sizeof (vdev_label_t),
-		    label_offset(size, l)) != sizeof (vdev_label_t))
+		if (aio_return(&aiocbs[l]) != sizeof (vdev_phys_t))
 			continue;
 
-		if (nvlist_unpack(label->vl_vdev_phys.vp_nvlist,
-		    sizeof (label->vl_vdev_phys.vp_nvlist), config, 0) != 0)
+		if (nvlist_unpack(labels[l].vp_nvlist,
+		    sizeof (labels[l].vp_nvlist), config, 0) != 0)
 			continue;
 
 		if (nvlist_lookup_uint64(*config, ZPOOL_CONFIG_GUID,
@@ -948,7 +982,7 @@ zpool_read_label(int fd, nvlist_t **config, int *num_labels)
 	if (num_labels != NULL)
 		*num_labels = count;
 
-	free(label);
+	free(labels);
 	*config = expected_config;
 
 	return (0);
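
For context, a minimal sketch of how a caller might use zpool_read_label() after this change; the check_device_labels() helper and its error handling are illustrative only, not part of libzutil:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libzutil.h>

/* Hypothetical caller: open a device and report how many labels it has. */
static int
check_device_labels(const char *path)
{
	nvlist_t *config = NULL;
	int num_labels = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return (-1);
	if (zpool_read_label(fd, &config, &num_labels) != 0) {
		(void) close(fd);
		return (-1);
	}
	(void) close(fd);

	/* config is NULL when no valid label was found. */
	if (config != NULL) {
		(void) printf("%d valid label(s) found\n", num_labels);
		nvlist_free(config);
	}
	return (0);
}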