-
Notifications
You must be signed in to change notification settings - Fork 43
Intermittent test failures and flaky tests #427
Comments
It seems to me that the frame navigations cause most of these issues. What about making a frame-specific halt-the-world event that stops processing the events until preconditions are met (similar to the older Go garbage collector's stop-the-world mechanism—but for a specific frame). For example, imagine a frame navigates, in that case, we're going to stop receiving events both from the browser and our internals until we set up the relevant event listeners and execution contexts in place. Then we'll unlock the world and let the system components of that frame start processing the events. Hopefully, maybe we can do it using #428. Like Puppeteer and Playwright, we can put the relevant actions into the same promise. That would hopefully remove the need for a frame-specific barrier. |
I've been exploring two potential solutions to this issue. Here's a summary that we can discuss in a meeting I'll schedule for Monday:
Internal event emitters (producers) and consumers
NOTES: The general approach of tackling the major refactor would be:
Even though it would be a much more laborious process than any focused fixes, and it might take us a few weeks to finish, possibly the entire cycle for one person, I think we'd end up with a simpler and overall easier to understand system. We don't need to go to extreme lengths of implementing something unique like an FSM, and considering this is not the approach taken by either Puppeteer nor Playwright, I think it would serve us well to not deviate too far from their work. |
@inancgumus I considered your suggestion, but I don't think we want to stop receiving events. You're describing the functionality of a queue, i.e. don't process new events until a previous one completes, which is what channels essentially give us. So, yes, we ultimately want to synchronize in that way, but we can do that by not blocking and using a That said, I don't think we should do any focused fixes like this, but instead take the time to work on a major refactor that will address this and other race conditions, while ending up with a more maintainable and understandable code base. |
I was suggesting something else, nevertheless, yes, I was also on the side of using Promises to do some part of it 👍
And, as I explained in #142, +1 for major refactoring. Anyway, let's discuss the rest of the details in the architectural refactoring and async API meeting 🤝 |
As you can see, the test has been refactored to work with promises. Click and WaitForNavigation race with each other until the event loop continues. They race because each runs in a goroutine as a promise. After several trials, I couldn't make the test PASS. There is always a timeout due to EventFrameNavigation flakiness bug (Issue #427). Since the JS examples has been updated (examples/ folder), we can see whether there is a problem in CI E2E tests. All are updated for the async click and waitForNavigation.
The main change here to fix the `Page.goto()` race condition (see the example in #427) is removing the `waitForEvent` call after calling the CDP `Page.navigate`. We already waited for `NavigationEvent` to determine if it's a navigation within the same document (`chSameDoc`), but that's not necessary if we get an empty `newDocumentID` back (see the `loaderId` description in [1]), so we can return early in that case. It still doesn't eliminate race conditions entirely, as there's a minor chance we could miss the lifecycle event after we get the `NavigationEvent` and timeout waiting on `lifecycleEvtCh`, but I haven't run into that case during my stress testing. I chose not to call `frame.hasSubtreeLifecycleEventFired()` as that's racy/buggy itself and results in returning early due to not clearing previous lifecycle events on time (e.g. returning because 'load' fired for 'about:blank', but not for the current navigation). We should remove this and `Frame.recalculateLifecycle()` altogether... [1]: https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-navigate
@grafana/k6browserdev What should we do with this issue? The main race conditions mentioned in the description have now been fixed, and the major refactoring approach discussed above is not a high priority to address this, and has been downscoped from our initial plans. We do still have intermittent test failures and flaky tests, but it's not clear what the "definition of Done" is for this particular issue. So I vote to close this, and open new issues for specific flaky behavior. Or we could reuse this issue as a general one for all such behavior, but there's a lot of noise in the discussion above, and we'd need to change the description and add a TODO list, so we might as well open a new issue even in that case. |
I also prefer to have issues per flaky test (or behaviour) as that way it could be discussed per case what hte solution can be |
Agreed with both of these 👍 |
IMO, we can close this one and give a reference to this from the new issues when needed. |
OK, closing this, and let's open one issue per flaky behavior we find. Though maybe not immediately if it happens once in CI, but if it really becomes a problem. |
For a few months now we've noticed an increase in intermittent test failures (e.g. grafana/k6#4442), as well very frequent flaky tests and CI failures (#272, #254). This has become a major issue for both users and internal devs working on the extension.
From initial work explored in #281 that aimed to fix #272, it became apparent that these issues stem from the racy nature of how we process CDP events, which can trigger one or more internal events. This is explained in detail in this discussion, along with a proposed architecture to address these issues.
An ancillary problem with the current architecture is that these event flows and race conditions are difficult to trace and understand, and since they're intermittent and difficult to test we're never sure if we're introducing more of them as we make changes.
Example
To illustrate an example of a race condition that should be addressed, consider the following scenario from
TestPageGoto
(failure example in CI build grafana/xk6-browser#1123):xk6-browser sends the
Page.navigate
CDP command:Done here:
xk6-browser/common/frame_manager.go
Line 617 in 6baeca0
The browser responds with the
load
Page.lifecycleEvent
:... which we process before we've had the chance to setup the waiter for it here:
xk6-browser/common/helpers.go
Lines 156 to 169 in 6baeca0
This leads to us missing the event, and waiting for a second
load
event which never arrives, ultimately timing out at the default 30s.This is a very simple scenario, and other more complex ones are common as well. For example, the race condition between calling
ElementHandle.click()
andPage.waitForNavigation()
, or callingPage.waitForSelector()
while a hard navigation is in progress, which resets theExecutionContext
, so we end up evaluating theinjected.waitForSelector()
JS call with a staleExecutionContext
, causing the crypticCannot find context with specified id (-32000)
error, etc.Proposed Solution
In the #142 discussion we mentioned removing the internal event-based system in favor of an event-driven finite-state machine. This might be overkill and would introduce additional complexity, so we might want to look into simpler solutions by e.g. synchronizing CDP and internal event processing in such a way that waiters are setup before the CDP event is sent, ensuring we never have a chance to miss an incoming event. This solution was explored in #281, but we had other test issues with that approach and never merged it.
OTOH, targetted solutions like this would have to be done on a case-by-case basis, whereas a general removal of the internal event-based system and introduction of an FSM should work in all cases, so that might be preferable. Some investigation is needed to settle on the better approach.
Another solution (#428) that would improve things is exposing more async APIs. E.g. in the
click()
/waitForNavigation()
example above, if both of these methods returned aPromise
, we could document that the user must usePromise.all()
to synchronize both calls, in order to setupwaitForNavigation()
beforeclick()
is called. You can see that Playwright mentions this approach in their documentation.The text was updated successfully, but these errors were encountered: