Add the possibility to compute statistics #369
Draft
This adds the possibility to compute statistics for all tests: count all runs of a property, along with the number of times it fails and the number of times it triggers an exception. The main goal is to provide a sound argument as to whether a given change has a (positive or negative) impact on the ability of a test to detect issues. It is an incomplete draft, but advanced enough that we can decide whether it should be completed.

As we have no statistics to back the assumption that this PR itself has no impact, it tries to be as non-intrusive as possible. Here is how it proposes to do this:

- statistics are enabled by setting the environment variable `QCHECK_STATISTICS_FILE` to the path of a file in which the stats will be recorded (I think that adding a command-line option would require duplicating `QCheck_base_runner.Raw.parse_cli`, which I would rather avoid);
- a couple of QCheck functions are “routed” through `Util` so that their behaviours can be adapted when computing stats:
  - `fail_reportf` is reexported so that it returns `false` rather than raising an exception (so that failures and exceptions can be distinguished),
  - `make_neg_test` wraps `Test.make_neg` so that it uses `Test.make` when statistics are enabled (because tests will always report success, and failures are counted separately),
  - `make_test` wraps `Test.make` for uniformity;
- `run_tests_main` is duplicated in order to run each test separately when statistics are enabled, so that we can record the stats for each test separately, and to enable the long version of tests (so `$QCHECK_LONG_FACTOR` can be used to multiply the number of tests to run). Note that this routing could also be useful to add optional hooks before and after each test (what Weak test revisions #365 and Ephemeron test revision #367 have done by hand for weak and ephemeron tests).
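To make the routing idea concrete, here is a minimal, hypothetical sketch of how `Util` could gate behaviour on the environment variable and reroute `make_neg_test`. The QCheck constructors are stubbed out as plain variants for illustration; only `QCHECK_STATISTICS_FILE` and the function names come from this PR, everything else is invented for the example:

```ocaml
(* Statistics are enabled iff QCHECK_STATISTICS_FILE is set (sketch). *)
let stats_file = Sys.getenv_opt "QCHECK_STATISTICS_FILE"
let stats_enabled = Option.is_some stats_file

(* Stand-ins for QCheck.Test.make / QCheck.Test.make_neg. *)
type test = Pos of string | Neg of string
let qcheck_make ~name _prop = Pos name
let qcheck_make_neg ~name _prop = Neg name

(* When statistics are enabled, a negative test is built as a regular test:
   it then always reports success, and failures are counted separately. *)
let make_neg_test ~name prop =
  if stats_enabled then qcheck_make ~name prop
  else qcheck_make_neg ~name prop

(* For uniformity, positive tests go through the same routing point. *)
let make_test ~name prop = qcheck_make ~name prop
```

The point of the indirection is that test files only ever call `Util.make_test` / `Util.make_neg_test`, so the behaviour switch stays in one place.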
- most of the counting logic is done in `Util.repeat`: when stats are enabled, it uses a different loop that counts iterations and negative outcomes rather than stopping at the first failure; this means that even the tests not using repetitions are now wrapped into a `repeat 1` to get statistics working;
- `Util.Stats.enabled` is exposed so that tests can choose a different repetition count depending on it.

A possible alternate design that would not rely quite so much on global state in `Util` (and would thus allow using `run_tests` without splitting test lists) would be to have `Util.make_test` and `Util.make_neg_test` expect a `Stats.t -> 'a -> bool` function (that would be the result of `repeat`, with the added benefit that the change of type detects places that are not updated...) and to generate fresh stats for each test. It is not yet in this version of the PR because I'm not sure which one is the best choice.

To do before merging, if this proposal is good enough:
- `Util` to avoid useless history noise,
- (`repeat` and `make_test` signatures),
- `Util` or the changes of behaviours,

To do in a later PR:
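As an illustration of the counting logic described above, here is a minimal, hypothetical sketch of the two behaviours of `Util.repeat`: the usual short-circuiting loop, and a statistics loop that runs all iterations, counts negative outcomes, and always reports success. The actual PR also distinguishes exceptions from failures and records the counts to the stats file; that is omitted here, and all names and types are invented for the example:

```ocaml
(* Illustrative stand-in for Util.Stats.t. *)
type stats = { mutable runs : int; mutable failures : int }

let repeat ?(stats_enabled = false) n prop stats x =
  if not stats_enabled then
    (* Normal mode: stop at the first failure, as before this PR. *)
    let rec loop i = i >= n || (prop x && loop (i + 1)) in
    loop 0
  else begin
    (* Statistics mode: run all [n] iterations, count negative outcomes,
       and always report success so the runner keeps going. *)
    for _ = 1 to n do
      stats.runs <- stats.runs + 1;
      if not (prop x) then stats.failures <- stats.failures + 1
    done;
    true
  end
```

This shape also shows why tests without repetitions would be wrapped into a `repeat 1`: it is the single place where every run gets counted.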