[CI] HeapAttackIT testAggTooManyMvLongs failing #120433

Open
elasticsearchmachine opened this issue Jan 18, 2025 · 6 comments
Assignees
Labels
:Analytics/ES|QL AKA ESQL medium-risk An open issue or test failure that is a medium risk to future releases Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test-failure Triaged test failures from CI

Comments

@elasticsearchmachine
Collaborator

elasticsearchmachine commented Jan 18, 2025

Build Scans:

Reproduction Line:

./gradlew ":test:external-modules:test-esql-heap-attack:javaRestTest" --tests "org.elasticsearch.xpack.esql.heap_attack.HeapAttackIT.testAggTooManyMvLongs" -Dtests.seed=864E912505774F34 -Dtests.configure_test_clusters_with_one_processor=true -Dtests.locale=kw-GB -Dtests.timezone=Asia/Barnaul -Druntime.java=23

Applicable branches:
8.x

Reproduces locally?:
N/A

Failure History:
See dashboard

Failure Message:

java.lang.Exception: Test abandoned because suite timeout was reached.

Issue Reasons:

  • [8.x] 4 failures in test testAggTooManyMvLongs (1.4% fail rate in 294 executions)
  • [8.x] 3 failures in pipeline elasticsearch-periodic-platform-support (60.0% fail rate in 5 executions)

Note:
This issue was created using new test triage automation. Please report issues or feedback to es-delivery.

@elasticsearchmachine elasticsearchmachine added :Analytics/ES|QL AKA ESQL >test-failure Triaged test failures from CI Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Jan 18, 2025
@elasticsearchmachine
Collaborator Author

Pinging @elastic/es-analytical-engine (Team:Analytics)

@nik9000 nik9000 self-assigned this Jan 22, 2025
@nik9000
Member

nik9000 commented Jan 22, 2025

[2025-01-16T09:24:49,792][ERROR][o.e.t.e.h.RestTriggerOutOfMemoryAction] [test-cluster-1] triggering out of memory

@nik9000
Member

nik9000 commented Jan 22, 2025

That gets triggered after a query runs for five minutes. Here's a better log:

[2025-01-16T09:19:45,855][INFO ][o.e.x.e.h.HeapAttackIT   ] [testHugeManyConcatFromRow] before test
[2025-01-16T09:19:46,005][INFO ][o.e.x.e.h.HeapAttackIT   ] [testHugeManyConcatFromRow] --> test testHugeManyConcatFromRow started querying
[2025-01-16T09:24:46,041][INFO ][o.e.x.e.h.HeapAttackIT   ] [testHugeManyConcatFromRow] --> test testHugeManyConcatFromRow triggering OOM after 5m
[2025-01-16T09:24:46,055][ERROR][o.e.t.e.h.RestTriggerOutOfMemoryAction] [test-cluster-0] triggering out of memory
[2025-01-16T09:24:46,752][WARN ][o.e.m.j.JvmGcMonitorService] [test-cluster-0] [gc][348] overhead, spent [602ms] collecting in the last [1.1s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/local-ssd/buildkite/builds/bk-agent-prod-gcp-1737034493961442245/elastic/elasticsearch-periodic-platform-support/test/external-modules/esql-heap-attack/build/testrun/javaRestTest/temp/test-cluster1517469627125710648/test-cluster-0/logs/java_pid310470.hprof ...
Heap dump file created [371775153 bytes in 2.161 secs]
Terminating due to java.lang.OutOfMemoryError: Java heap space
[2025-01-16T09:24:49,783][INFO ][o.e.t.ClusterConnectionManager] [test-cluster-1] transport connection to [{test-cluster-0}{oopqlc7iR22Rup8m56YP0w}{G3KG0deFTBmvgSVPe1j5xA}{test-cluster-0}{127.0.0.1}{127.0.0.1:37765}{cdfhilmrstw}{8.18.0}{7000099-8525000}] closed by remote; if unexpected, see [https://www.elastic.co/guide/en/elasticsearch/reference/master/troubleshooting-unstable-cluster.html#troubleshooting-unstable-cluster-network] for troubleshooting guidance
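
The sequence in that log (start querying, wait five minutes, then trigger an out of memory error) is a watchdog pattern. Here is a minimal sketch of that pattern in plain Java. It is not the actual HeapAttackIT code: the class `QueryWatchdog`, the method `runWithWatchdog`, and the `onTimeout` callback are all hypothetical names. In the real test the timeout action presumably calls the endpoint behind `RestTriggerOutOfMemoryAction`, judging only by the logger name above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class QueryWatchdog {
    /**
     * Runs {@code query} on the calling thread. If it is still running once
     * {@code timeout} has elapsed, {@code onTimeout} runs on a scheduler thread.
     * Returns true if the timeout action fired.
     */
    static boolean runWithWatchdog(Runnable query, long timeout, TimeUnit unit, Runnable onTimeout) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean fired = new AtomicBoolean(false);
        ScheduledFuture<?> pending = scheduler.schedule(() -> {
            fired.set(true);
            onTimeout.run(); // assumption: here the real test would trigger the OOM on the cluster
        }, timeout, unit);
        try {
            query.run();
        } finally {
            // The query returned (or threw) first: make sure the timeout action never fires late.
            pending.cancel(false);
            scheduler.shutdown();
        }
        return fired.get();
    }
}
```

With a five minute timeout, a query that takes 47 seconds would return before the action fires, while one that takes longer than five minutes would produce a log like the one above.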

@nik9000
Member

nik9000 commented Jan 22, 2025

It looks like the test genuinely took five minutes and we aborted it. Checking how long that takes locally. And, I guess, I can look at the times from our build stats cluster. And, because we caused a heap dump, I can look at the threads and heap of the test cluster.
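
To double check the "genuinely took five minutes" reading, the gap between the "started querying" and "triggering OOM after 5m" lines in the log from the earlier comment can be computed directly. This is a small sketch; `LogElapsed` is a hypothetical helper, and the pattern string is an assumption based on how the timestamps appear in those log lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Elasticsearch.
final class LogElapsed {
    // Assumed layout of the log timestamps, e.g. 2025-01-16T09:19:46,005
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss,SSS");

    /** Time elapsed between two log timestamps in the layout above. */
    static Duration between(String start, String end) {
        return Duration.between(
            LocalDateTime.parse(start, LOG_TIME),
            LocalDateTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        // "started querying" and "triggering OOM after 5m" timestamps from the log
        Duration elapsed = between("2025-01-16T09:19:46,005", "2025-01-16T09:24:46,041");
        System.out.println(elapsed.toMillis() + " ms"); // prints "300036 ms"
    }
}
```

That is five minutes plus 36 milliseconds, so the query was still running when the five minute limit hit.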

@nik9000
Member

nik9000 commented Jan 22, 2025

Locally testHugeManyConcatFromRow takes 47.029 seconds. A long time for sure. Not five minutes.

From the build system:

FROM gradle-tests*
| WHERE className.keyword == "org.elasticsearch.xpack.esql.heap_attack.HeapAttackIT"
    AND name.keyword != "org.elasticsearch.xpack.esql.heap_attack.HeapAttackIT"
| STATS MAX(duration) BY name.keyword

From 2025-01-08 onwards that gets me:

Image

with the big green area on the left being testHugeManyConcatFromRow. That area is five minutes.

The average duration is not too dissimilar:

Image

I believe our options are to bump the timeout or make it faster. Making it faster is obviously good.

@nik9000
Member

nik9000 commented Jan 22, 2025

This is important but I'm going to call it a "for later" thing.

@not-napoleon not-napoleon added medium-risk An open issue or test failure that is a medium risk to future releases and removed needs:risk Requires assignment of a risk label (low, medium, blocker) labels Jan 27, 2025

3 participants