[Incident] Windows builds of plugins don't start on ci.jenkins.io
#4490
Comments
Working on it: this requires blocking all builds on ci.jenkins.io (putting them in the queue).
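(For reference, one minimal way to hold new builds in the queue on a controller is Jenkins' quiet-down mode; this is a hedged sketch assuming script console access, not necessarily the exact steps taken on ci.jenkins.io.)

```groovy
// Script console sketch (illustrative only, not necessarily what was done here):
// quiet-down mode lets running builds finish while newly triggered builds stay in the queue.
import jenkins.model.Jenkins

Jenkins.get().doQuietDown()          // stop scheduling new builds; they accumulate in the queue
// Once the outage is over:
// Jenkins.get().doCancelQuietDown() // resume scheduling the queued builds
```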
Since this incident has gone unnoticed by Microsoft for the past 2 days, we most probably won't see a full recovery for a few more days. Let's wait and see, as the team is not available to perform a full region change for the time being and it's not a critical blocker.
Also sent an email to the developers mailing list for awareness: https://groups.google.com/g/jenkinsci-dev/c/VQcRiUYu92o
Workaround for stuck builds: set
Unfortunately that banner is not visible on the actual build pages; the user needs to be on the main Jenkins site to see it... And it seems like a good feature request candidate :-) EDIT: feature request created - JENKINS-75122
Good point; unfortunately we don't have anything that allows displaying a top banner on every ci.jenkins.io page, AFAIK.
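(For reference, the banner in question appears to be the controller's system message, which is only rendered on top-level pages; a minimal script console sketch, assuming that mechanism, would be:)

```groovy
// Script console sketch (assumes the banner is the standard Jenkins system message).
import jenkins.model.Jenkins

def jenkins = Jenkins.get()
jenkins.setSystemMessage('Windows container agents are impacted by an Azure outage - see https://status.jenkins.io')
jenkins.save()
// The system message is shown on the main dashboard, not on individual build pages,
// which is why users landing directly on a build page never see it (hence JENKINS-75122).
```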
(FTR, noticed earlier in jenkinsci/jenkins-test-harness#893)
Or just a zonal change? According to the announcement image, other zones in the same region are unaffected.
Requires temporarily editing
Implemented by @MarkEWaite in jenkins-infra/pipeline-library#898; jenkins-infra/pipeline-library#899 opened as
jenkins-infra/helpdesk#4490 notes that Windows builds for plugins that use container agents are not starting. https://status.jenkins.io/issues/2025-01-08-ci.jenkins.io-azure-outage/ reports the outage is due to an Azure Container Instance (ACI) outage in the region where we host the ci.jenkins.io agents.
Thanks @MarkEWaite for jenkins-infra/pipeline-library#898, which switches all Windows builds to VMs by default. Alas, it uncovered 2 issues:
Update: the incident has been over since 10 January around 20:30 UTC on the Azure side, as per https://azure.status.microsoft/en-us/status/history/ (check the tracking ID). Details from the Azure status history:

Mitigated - Networking issues impacting Azure Services in East US 2

What happened?
Between 22:00 UTC on 08 Jan 2025 and 04:30 UTC on 11 Jan 2025, a networking issue in a single zone in East US 2 resulted in impact to multiple Azure Services in the region. This may have resulted in intermittent Virtual Machine connectivity issues, failures in allocating resources or communicating with resources in the hosted region. The services impacted include but were not limited to Azure Databricks, Azure Container Apps, Azure Function Apps, Azure App Service, Azure Logic Apps, SQL Managed Instances, Azure Synapse, Azure Data Factory, Azure Container Instances, API Management, Azure NetApp Files, DevOps, Azure Stream Analytics, PowerBI, VMSS, PostgreSQL flexible servers, and Azure Red Hat OpenShift. Customers using resources with Private Endpoint Network Security Groups communicating with other services may have also been impacted. The impact was limited to a single zone in the East US 2 region. No other regions were impacted by this issue.

What went wrong and why?
We determined that a configuration change in our regional networking service resulted in an inconsistent service state, with three of the partitions turning unhealthy, which caused requests from multiple services to fail.

How did we respond?
Service monitoring alerted us to this networking issue at 22:00 UTC on 08 Jan 2025, with all the impacted services raising additional alerts as well based on their failure rates. As part of the investigation, it was identified that a network configuration issue in one of the zones resulted in three of the partitions becoming unhealthy, causing widespread impact. As an immediate remediation measure, traffic was re-routed away from the impacted zone, which brought some relief to the non-zonal services and helped with newer allocations. However, services that sent zonal requests to the impacted zone continued to be unhealthy. Some of the impacted services initiated their own Disaster Recovery options to help mitigate some of their impact. For customers impacted due to Private Link, a patch was applied, and we confirmed dependent services were available. Additional workstreams to rehydrate the impacted zone by bringing the impacted partitions back to a healthy state were completed. To avoid further impact, we validated this fix on one partition, and while we encountered some challenges which required more time than expected to step through the validation process of this fix, causing a delay, the mitigation workstream progressed successfully. We brought two partitions back online and continued to monitor the health of these partitions, in parallel bringing the final partition back to a healthy state. Once all of the partitions were brought back to a healthy state, we completed an end-to-end validation to ensure that the resources were responding as expected. By 11:18 UTC on 10 Jan 2025, all three impacted partitions were fully recovered. Following this, we worked with the impacted services to validate mitigation on all of their resources. By 00:44 UTC on 11 Jan 2025, all services confirmed mitigation. At 00:30 UTC, we initiated a phased approach to rebalance traffic across all of the zones to ensure that the networking traffic is flowing as expected in the region.
After monitoring service health, we determined that this incident was fully mitigated at 04:30 UTC on 11 Jan 2025. The impact for Azure Services varied between the following timelines (the below list doesn't cover all impacted services, but the timelines may be similar):

What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.

We saw successful Windows builds in the past 2 days. As such:
Thanks for the suggestion. As explained by Hervé, it is not possible. Do not hesitate to open an issue on Jenkins for such a feature request!
Thanks @jglick! The commit jenkinsci/jenkins-test-harness@87bc0ce in jenkinsci/jenkins-test-harness#893 is inside the network incident window reported by Azure: 22:00 UTC on 08 Jan 2025 - 20:30 UTC on 10 Jan 2025 - Azure Container Instances.
Absolutely, but the concept of "Zones" in Azure does not exist for Azure Container Instances: it only exists for Azure Container "Groups" (as per https://learn.microsoft.com/en-us/azure/reliability/reliability-containers#availability-zone-support). As such, the Azure Container Agents do not have the concept of availability zones.
Absolutely. @MarkEWaite did continue this mitigation effort (as described in #4490 (comment)) as I was short on time. Alas, the team was not really available at that moment: this is an area for future improvement.
Are things healthy? https://ci.jenkins.io/job/Plugins/job/kubernetes-credentials-provider-plugin/view/change-requests/job/PR-101/4/execution/node/14/log/ has been unable to obtain a Windows agent, and yet there does not appear to be any load on ci.jenkins.io.
We don't, that's a good suggestion.
We do have this (in the Infra folder, the acceptance tests job). It did not catch the failure this time, as the failure is random and always ends up succeeding. It could be improved to use allocation time as a metric and alert when there is a change in the average time.
Nope, I don’t see how Datadog could catch this.
I thought that the timeout failures reported in https://ci.jenkins.io/job/Infra/job/acceptance-tests/job/check-agent-availability/ would be enough to tell us that there is an issue. It checks every 4 hours and fails the job if an agent of each type cannot be allocated within 30 minutes. Unfortunately, that job relies on a human being to check its output, and I rarely check it. I've added it to my plugin jobs status page in hopes that I will have one more location to see it, in addition to the RSS feed reader where I view those checks.
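(A rough sketch of what such a periodic allocation check can look like, including the allocation-time metric suggested earlier; the labels here are illustrative and the real check-agent-availability job may be implemented differently.)

```groovy
// Hypothetical scripted-pipeline probe: try to allocate an agent of each type and
// fail the stage if allocation takes longer than 30 minutes.
def labels = ['maven-17-windows', 'maven-17']   // example labels only

for (label in labels) {
    stage("Allocate ${label}") {
        def start = System.currentTimeMillis()
        timeout(time: 30, unit: 'MINUTES') {
            node(label) {
                def waitedSeconds = (System.currentTimeMillis() - start) / 1000
                echo "Agent for '${label}' allocated after ${waitedSeconds}s"
                // Possible improvement: publish waitedSeconds to a metrics backend and
                // alert when it deviates from the historical average.
            }
        }
    }
}
```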
jenkinsci/workflow-support-plugin#295 is failing to build on Windows. https://ci.jenkins.io/label/maven-17-windows/ shows several failed nodes.
Thanks! We are still in the outage :'(
But not every time 🤦 => we most probably will need to start moving these workloads to Windows Kubernetes workloads ASAP.
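(For illustration, a Windows Kubernetes agent would look roughly like the sketch below, using the Kubernetes plugin's podTemplate step; the cloud name, image, and node selector values are assumptions, not the actual ci.jenkins.io configuration.)

```groovy
// Hypothetical Windows pod template sketch (Kubernetes plugin); names and image are placeholders.
podTemplate(
    cloud: 'cik8s-windows',     // assumed cloud name
    yaml: '''
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: jnlp
    image: example.registry/inbound-agent:windowsservercore-ltsc2022
'''
) {
    node(POD_LABEL) {
        bat 'mvn --version'     // Windows workloads use bat/powershell steps
    }
}
```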
Ping @mawinter69 @rantoniuk for information: we're working on #4494 following your comments!
Update: the outage is still present (see above messages such as Jesse's). As such, we are:
I guess there is a fallback to VM-based agents for now? I have been seeing a lot of Windows test flakes in various plugins, like jenkinsci/workflow-job-plugin#499 and jenkinsci/support-core-plugin#612, presumably due to changes in timing.
Yes, that's correct. We're using VM agents even with
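(For context, plugin Jenkinsfiles typically request container agents through the jenkins-infra/pipeline-library buildPlugin step, roughly as sketched below; whether this is exactly the setting referred to above is an assumption. The fallback to VMs was done on the library side, so plugin repositories did not need changes.)

```groovy
// Sketch of a typical plugin Jenkinsfile using jenkins-infra/pipeline-library.
// With pipeline-library#898 in place, Windows configurations are served by VM agents
// even when a container agent is requested.
buildPlugin(
    useContainerAgent: true,     // normally requests container (ACI) agents where available
    configurations: [
        [platform: 'linux',   jdk: 21],
        [platform: 'windows', jdk: 17],  // currently runs on a Windows VM agent
    ]
)
```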
That is interesting; I would not have expected tests to become flaky, as VMs have more power than container instances (better disks, more memory, and more powerful CPUs). Could this be related to a certain type of parallelisation?
Some could simply be related to changing timing. jenkinsci/workflow-job-plugin#502 seems to have something to do with the filesystem, though I do not understand it. Not a concern for the infra team, just noting it.
Not sure if this is directly relevant, but the Windows test suite for
@basil, I confirm it is unrelated. I'm commenting in the PR with the result of my first (superficial) check of the build errors.
Update: given how close we are to migrating ci.jenkins.io to AWS, I'm prioritizing work on #4318 to get rid of ACI containers (in favor of Kubernetes Windows containers). This means we'll keep using Windows VM agents on Azure until the migration to AWS is performed.
Well, only somewhat unrelated: the test suite was relying on being executed in a container on Windows so that the Docker-based tests would be skipped on Windows and CI builds would pass. That assumption stopped holding once all Windows-based CI jobs started running on VMs, without any workaround from the infrastructure team, until a recent PR from a user improved the test suite to properly detect this configuration and skip the relevant tests when executed on a Windows VM, restoring the status quo of a green CI build.
Service(s)
ci.jenkins.io
Summary
Following my PR on the antexec plugin, https://ci.jenkins.io/job/Plugins/job/antexec-plugin/view/change-requests/job/PR-110/1/ is taking forever to get a Windows VM.
@timja has detected the same kind of issue.
Reproduction steps
No response