
Writes sometimes out of order, not in sequence #1359

Open
rsms opened this issue Mar 3, 2025 · 5 comments

Comments

rsms commented Mar 3, 2025

After a nearly month-long bug hunt in our codebase, we've found that writes via io_uring sometimes complete out of order. This is observed both at the reader end and in io_uring completions on the writer end. We've read the documentation carefully and written many test programs to try to disprove our theory that there is a bug in io_uring.

Attached are two test programs which reproduce the bug:

  • writeorder.c takes an optional file path argument (defaults to /dev/null), then opens it with open(path, O_WRONLY) and writes chunks of data in an infinite loop. It verifies that completions are sequenced, i.e. that each write() is reported as completing in order relative to other writes (a simplified sketch of this loop follows the list).

  • writeorder-pipe.c works like writeorder.c but uses a pipe() instead of writing to a file. It additionally verifies the sequence of the written data on the reader end.
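
For orientation, the core loop of writeorder.c has roughly the following shape. This is a simplified sketch, not the exact attached code: unlike the real program, it drains each batch fully before submitting the next, and constants such as BATCH_SIZE are only indicative.

#include <assert.h>
#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <stdio.h>

#define BATCH_SIZE 10

int main(int argc, char *argv[]) {
    const char *path = argc > 1 ? argv[1] : "/dev/null";
    int fd = open(path, O_WRONLY);
    assert(fd >= 0);

    struct io_uring ring;
    assert(io_uring_queue_init(64, &ring, 0) == 0);

    uint64_t data[BATCH_SIZE];
    uint64_t next_seq = 0, expect_seq = 0;

    for (;;) {
        /* queue one batch of writes, linked so they execute back to back */
        for (int i = 0; i < BATCH_SIZE; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            assert(sqe != NULL);
            data[i] = next_seq;
            io_uring_prep_write(sqe, fd, &data[i], sizeof(data[i]), 0);
            io_uring_sqe_set_data64(sqe, next_seq);
            if (i + 1 < BATCH_SIZE)
                io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);
            next_seq++;
        }
        assert(io_uring_submit(&ring) == BATCH_SIZE);

        /* reap the whole batch and check that completions arrive in order */
        for (int i = 0; i < BATCH_SIZE; i++) {
            struct io_uring_cqe *cqe;
            assert(io_uring_wait_cqe(&ring, &cqe) == 0);
            uint64_t seq = io_uring_cqe_get_data64(cqe);
            if (seq != expect_seq)
                fprintf(stderr, "Expected %llu but got %llu\n",
                        (unsigned long long)expect_seq, (unsigned long long)seq);
            expect_seq = seq + 1;
            io_uring_cqe_seen(&ring, cqe);
        }
    }
}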

Expected behavior: each write to complete in the order it was submitted.

Actual behavior: sometimes a later write completes before a prior write.
E.g. submissions: A B C D; completions: A C B D. Additionally, the writeorder-pipe.c program reports that the read data is out of sequence as well (so it's not just completions that are reported out of order).

This out-of-order event sometimes happens after 100s of millions of writes and sometimes after just a few thousand writes.

Reproduction: cc -g -O2 -luring writeorder-pipe.c -o writeorder-pipe && ./writeorder-pipe

Tested with:

  • liburing 2.9
  • Linux 6.8.9 aarch64 (unpatched, built from source from kernel.org)
  • Linux 6.8.0 x86_64 (ubuntu)
  • Linux 6.13.5 x86_64 (arch linux)

DylanZA (Contributor) commented Mar 4, 2025

Are you chaining writes across submission boundaries? It seems like you are in the example, but this is not supported:

https://github.com/axboe/liburing/blob/master/man/io_uring_enter.2#L1615
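
To spell out what I mean, here is a sketch with placeholder names (not your code): a link chain only covers the SQEs queued ahead of a single io_uring_submit() call, so two batches submitted separately form two independent chains with no ordering between them.

#include <liburing.h>

/* Each call submits one linked chain; two successive calls produce two
 * independent chains, even if IOSQE_IO_LINK is set on every SQE. */
static void submit_linked_writes(struct io_uring *ring, int fd,
                                 const void *bufs[], unsigned lens[], int n)
{
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, bufs[i], lens[i], 0);
        io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK); /* chain ends at submit */
    }
    io_uring_submit(ring);
}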

rsms (Author) commented Mar 4, 2025

It says right there that “Even if the last SQE in a submission has this flag set, it will still terminate the current chain.”

I.e. there’s no semantic difference whether or not we take extra care to not set the link flag on the last SQE in each batch. We of course tested that too, just to be sure, and the bug reproduces in that case as well.

DylanZA (Contributor) commented Mar 4, 2025

OK, sure. And to be clear, I can't test this today, so I'm just going from reading the code:
you build with -O2 so the assertion assert(submitted == ncqe); won't fire, but you submit 10 SQEs at a time and only wait for 1.
So presumably you can have two chains in flight at one time, and those chains have no ordering guarantees between them? That would cause this behaviour, I think, with no bug in io_uring.

Given you write a small amount at a time, it's very unlikely for the pipe to be full, so you almost never actually queue those SQEs onto a background kernel thread (which is what would cause this).

I guess you can get this to trigger earlier with bigger writes? Or maybe a slower consumer?

rsms (Author) commented Mar 4, 2025

Assertions are disabled with the CPP macro NDEBUG, not optimization flags, so assert is enabled in these programs.

We can't wait for all submissions; that would defeat the purpose of using io_uring, as it would make the program process I/O synchronously. (I.e. any other work going on, like reading some other file descriptor, would be blocked en masse.) ncqe is used to make sure that we don't submit more work than has completed. In other words: the test programs ensure that we don't grow the internal write/work queues in the kernel beyond BATCH_SIZE.
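
Schematically, the bookkeeping looks something like this (a sketch with approximate names, not the exact test code):

#include <liburing.h>

#define BATCH_SIZE 10

/* Reap completions so that at most BATCH_SIZE writes stay outstanding:
 * block only until at least one CQE arrives when the cap is hit, then
 * drain whatever else has already finished without blocking. */
static unsigned reap_completions(struct io_uring *ring, unsigned inflight)
{
    struct io_uring_cqe *cqe;

    if (inflight >= BATCH_SIZE && io_uring_wait_cqe(ring, &cqe) == 0) {
        /* sequence check on io_uring_cqe_get_data64(cqe) would go here */
        io_uring_cqe_seen(ring, cqe);
        inflight--;
    }
    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        io_uring_cqe_seen(ring, cqe);
        inflight--;
    }
    return inflight;
}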

Re small amount of data: You can try it yourself by changing WRITE_NBYTE to larger amounts. The issue seems to be just as likely to happen with large writes as with small ones. As I mentioned, we have spent many days (weeks of calendar time) testing this in all ways we can imagine.

> I guess you can get this to trigger earlier with bigger writes? Or maybe a slower consumer?

Not in our tests so far. The variance of the failure point is high: sometimes we get an out-of-order write after just a few thousand writes, and sometimes it takes many hundreds of millions of writes, independent of the "write target." For example, we ran qemu with a serial port set up to write to /dev/null (-chardev file,id=pipe6,path=/dev/null -device virtserialport,chardev=pipe6,nr=6) and then ran writeorder /dev/vport1p6, which is much slower than writing directly to the local /dev/null: we saw no clear difference in failure timing.

DylanZA (Contributor) commented Mar 5, 2025

So I've run it locally. There are at least two bugs in your example.
(1) is what I pointed out before: you have multiple submission chains running concurrently. Chain (a) might only be partially completed when you submit chain (b). There's no reason to expect (b) will only start after (a).
You can see this if you put this printf after the submission loop:
printf("submitted chain of %lu -> %lu\n", batch_submission_head, batch_submission_tail);

I see this output:

submitted chain of 1396 -> 1403
submitted chain of 1403 -> 1406
submitted chain of 1406 -> 1409
Expected 1402 but got 1403 (middle of batch)

So the chain 1396->1403 was not finished when 1403->1406 was submitted, and the latter ran first.

Bug (2) is probably confusing things more. You submit writes from the data array, but you don't wait for them to complete before clobbering the data. I have the following diff which fixes it:

4a5
> #include <stdlib.h>
27c28,29
<     uint64_t data[BATCH_SIZE];
---
>     uint64_t datasize = BATCH_SIZE * 1000000;
>     uint64_t* data = (uint64_t*)malloc(datasize * 8);
45,46c47,49
<             data[i] = submission_seq;
<             io_uring_prep_write(sqe, fd, &data[i], sizeof(*data), 0);
---
>             uint64_t at = submission_seq % datasize;
>             data[at] = submission_seq;
>             io_uring_prep_write(sqe, fd, &data[at], sizeof(*data), 0);
48a52
>         printf("submitted chain of %lu -> %lu\n", batch_submission_head, batch_submission_tail);
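
(The reason the much larger array helps, presumably, is that each sequence number now gets its own slot and a slot isn't reused until roughly BATCH_SIZE * 1000000 writes later, long after that write's CQE has been reaped, so in-flight buffers are no longer overwritten by the next batch.)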
