Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests which are expected to fail / known broken #224

Open
eli-schwartz opened this issue Dec 10, 2024 · 5 comments
Open

Tests which are expected to fail / known broken #224

eli-schwartz opened this issue Dec 10, 2024 · 5 comments

Comments

@eli-schwartz
Copy link

Cross-post from pytest-dev/pytest#13045

It appears that there isn't currently a provision for this. Various testing frameworks support this status (off the top of my head this includes at least GNU Automake, the Test Anything Protocol, Pytest, and the Meson Build System).

I'll quote GNU Automake's manual page on "Generalities about testing" because IMHO the GNU project is, as usual, very expressive about the reasoning behind concepts. Paragraph 2 is key here:

https://www.gnu.org/software/automake/manual/html_node/Generalities-about-Testing.html

Sometimes, tests can rely on non-portable tools or prerequisites, or simply make no sense on a given system (for example, a test checking a Windows-specific feature makes no sense on a GNU/Linux system). In this case, accordingly to the definition above, the tests can neither be considered passed nor failed; instead, they are skipped, that is, they are not run, or their result is in any case ignored for what concerns the count of failures and successes. Skips are usually explicitly reported though, so that the user will be aware that not all of the testsuite has been run.

It’s not uncommon, especially during early development stages, that some tests fail for known reasons, and that the developer doesn’t want to tackle these failures immediately (this is especially true when the failing tests deal with corner cases). In this situation, the better policy is to declare that each of those failures is an expected failure (or xfail). In case a test that is expected to fail ends up passing instead, many testing environments will flag the result as a special kind of failure called unexpected pass (or xpass).

Many testing environments and frameworks distinguish between test failures and hard errors. As we’ve seen, a test failure happens when some invariant or expected behavior of the software under test is not met. A hard error happens when e.g., the set-up of a test case scenario fails, or when some other unexpected or highly undesirable condition is encountered (for example, the program under test experiences a segmentation fault).

They are usually called XFAIL and XPASS (though meson calls them EXPECTEDFAIL and EXPECTEDPASS).

Both should be reported as distinct from regular passes/failures.

@marcphilipp
Copy link
Member

Spock is a Java ecosystem testing framework maintained by @leonard84. I has a @PendingFeature annotation that you can put on tests ("features" in their lingo):

The use case is to annotate tests that can not yet run but should already be committed. The main difference to @Ignore is that the test are executed, but test failures are ignored. If the test passes without an error, then it will be reported as failure since the @PendingFeature annotation should be removed. This way the tests will become part of the normal tests instead of being ignored forever.

Instead of using special xfail and xpass statuses it reports tests as aborted and failed, respectively.

@leonard84 What's your opinion about using special status for these cases?

@eli-schwartz
Copy link
Author

What does it mean for a test to be aborted? There are two mesonbuild statuses which might be possible to interpret as aborted:

  • what GNU calls a "hard error". Could be caused by e.g. broken CFLAGS, a bug in the runner itself, a pathological bug in your library... much more dangerous to ignore than a failed test
  • a timeout, which indicates that either the test needs to carefully mark how long it runs for, or else that it is hanging forever (meson by default doesn't allow tests to run for longer than 30 seconds unless annotated, to prevent tests from hanging forever). As a rule, it's not worth looking at the output logs for a timeout, you need a very different investigative approach. (And it might not be a problem at all -- it could be that you're just running on an OS/CPU combo that is really slow. Maybe you're building MIPS under qemu emulation. On the other hand, maybe that new feature is deadlocking.) Meson quite literally aborted that test from the outside -- it didn't have the opportunity to run to completion.

Is "aborted" intended as a kind of catchall?

If Spock is treating expected failures that are ignored, by mapping them to "aborted", then that indicates that aborted tests are not considered errors at all, which means my assumption about what to use "aborted" for was completely off base!

@marcphilipp
Copy link
Member

What does it mean for a test to be aborted?

We should definitely define these statuses clearly and explicitly!

In JUnit, "aborted" is similar to "skipped". While "skipped" means there was a condition that prevented the test from even being started (could be as simple as explicitly ignoring a test, checking for an env var, etc.), "aborted" means the test was started but during execution the test code decided that it couldn't complete, e.g. because of a more complicated condition that could only be evaluated midway.

@eli-schwartz
Copy link
Author

That's an interesting definition...

Test frameworks such as python unittest/pytest allow marking a test as SKIP either by decorating the test definition with some condition evaluated during collection, such as an environment variable, or by raising a skipTest() halfway through the test. No distinction is drawn between the two -- the logic is probably that "we figured there's no semantic difference regarding which stage it detected the need to skip".

Test frameworks such as GNU automake or Meson don't operate by testing functions at all -- instead they collect and track test programs, and expect a test program to signal whether it gets skipped. I guess those always count as "ABORTED to use your terminology, since there's no actual way to prevent a test from even starting, and the test protocol doesn't have a way to signal whether the test program decided to signal a SKIP based on checking an env var in main() or based on more complicated conditions.

...

With regard specifically to mapping xfail -> aborted. Reporters which use the term "xfail" traditionally consider this important to draw attention to because xfail tests indicate missing functionality which isn't a regression from the previous state of the code. When running the tests to see what shape the project is in, you want to be able to see on the dashboard that they do need fixing and maybe you should go ahead and fix them if you can / have time, you just don't want to e.g. gate CI on those failures. I wouldn't want to have them be intermingled together with tests that signaled halfway through their execution that the feature being tested isn't available on the current system and therefore cannot be unit-tested.

So I would definitely recommend adding new special statuses for this.

@webknjaz
Copy link

FTR, here's how many pythonistas see the use of xfail and xfail_strict: https://ganssle.io/talks/#xfail-lightning / https://youtu.be/YmAXhlcNbys?t=512 / https://pganssle-talks.github.io/xfail-lightning / https://github.com/pganssle-talks/xfail-lightning / https://blog.ganssle.io/articles/2021/11/pytest-xfail.html.

It's something that can be used as an indicator/alert when a test starts passing suddenly. It's something that can be used in TDD. It can be used for accepting PRs with bug reproduces, integrating them as regression and acceptance tests for when somebody gets to fixing those bugs intentionally or by accident.

Like pytest.skip() there's a pytest.xfail() callable in the middle of test functions. Unlike the decorator-based markers (@pytest.mark.skip, @pytest.mark.skipif, @pytest.mark.xfail) that are computed and evaluated at the test collection time, these helper functions interrupt the test at runtime, while assigning equivalent semantic meaning in the report.

Note that there's @pytest.mark.xfail(..., run=False) that makes the test get skipped at runtime while still marking it as an expected failure (this is sometimes useful when the test can SEGFAULT your entire test process). When it runs, though, that creates more code coverage and various logs output / internal state changes. It can also produce a traceback that can be recorded in the test run entry. And while it may be treated as a success on the test report level, it might also have that traceback / stdout / stderr that would be useful to include. The old JUnit has that <failure> node for such tracebacks.

So it sounds to me that an aborted status has a widely different semantic meaning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants