failed to capture cluster diagnostic #54

Open
milanage opened this issue Oct 14, 2021 · 3 comments

milanage commented Oct 14, 2021

We tried to capture ECK diagnostics. The command succeeded and we got the tarball, but the cluster (stack) diagnostic appears to have failed; the ECK dump part was captured correctly.

In eck-diagnostic-errors.txt

Delete "https://xxxxxxx.xxx.us-west-2.eks.amazonaws.com/api/v1/namespaces/abc-namespace/pods/xxx-elasticsearch-elasticsearch-diag": net/http: TLS handshake timeout

in eck-diagnostics.log

2021/10/13 19:03:28 ECK diagnostics with parameters: {DiagnosticImage:docker.elastic.co/eck-dev/support-diagnostics:8.1.4 ECKVersion: Kubeconfig: OperatorNamespaces:[elastic-system] ResourcesNamespaces:[abc-namespace] OutputDir: RunStackDiagnostics:true Verbose:false}
2021/10/13 19:03:54 Extracting Kubernetes diagnostics from elastic-system
2021/10/13 19:04:25 ECK version is 1.6.0
2021/10/13 19:04:25 Extracting Kubernetes diagnostics from abc-namespace
2021/10/13 19:58:46 Kibana diagnostics extracted for abc-namespace/xxx-kibana-external

in kibana diagnostics.log

23:57:41.774 [main] INFO  com.elastic.support.BaseService - Diagnostic logger reconfigured for inclusion into archive
23:57:41.776 [main] INFO  com.elastic.support.diagnostics.commands.CheckKibanaVersion - Getting Kibana Version.
23:58:41.875 [main] ERROR com.elastic.support.rest.RestClient - Unexpected Execution Error
org.apache.http.conn.ConnectTimeoutException: Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/172.20.50.239] failed: connect timed out
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72) ~[httpclient-4.5.10.jar:4.5.10]
	at com.elastic.support.rest.RestClient.execRequest(RestClient.java:73) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.rest.RestClient.execGet(RestClient.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.rest.RestClient.execQuery(RestClient.java:58) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.commands.CheckKibanaVersion.getKibanaVersion(CheckKibanaVersion.java:95) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.commands.CheckKibanaVersion.execute(CheckKibanaVersion.java:64) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.chain.DiagnosticChainExec.runDiagnostic(DiagnosticChainExec.java:111) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.DiagnosticService.exec(DiagnosticService.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.DiagnosticApp.main(DiagnosticApp.java:42) [support-diagnostics-8.1.4.jar:8.1.4]
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403) ~[?:?]
	at java.net.Socket.connect(Socket.java:591) ~[?:?]
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368) ~[httpclient-4.5.10.jar:4.5.10]
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[httpclient-4.5.10.jar:4.5.10]
	... 16 more
23:58:41.882 [main] ERROR com.elastic.support.diagnostics.commands.CheckKibanaVersion - Unanticipated error:
java.lang.RuntimeException: Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/xxx.xx.xx.xxx] failed: connect timed out
	at com.elastic.support.rest.RestClient.execRequest(RestClient.java:79) ~[support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.rest.RestClient.execGet(RestClient.java:68) ~[support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.rest.RestClient.execQuery(RestClient.java:58) ~[support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.commands.CheckKibanaVersion.getKibanaVersion(CheckKibanaVersion.java:95) ~[support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.commands.CheckKibanaVersion.execute(CheckKibanaVersion.java:64) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.chain.DiagnosticChainExec.runDiagnostic(DiagnosticChainExec.java:111) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.DiagnosticService.exec(DiagnosticService.java:68) [support-diagnostics-8.1.4.jar:8.1.4]
	at com.elastic.support.diagnostics.DiagnosticApp.main(DiagnosticApp.java:42) [support-diagnostics-8.1.4.jar:8.1.4]
23:58:41.882 [main] ERROR com.elastic.support.diagnostics.DiagnosticService - Could't retrieve Kibana version due to a system or network error. Connect to xxx-kibana-external-kb-http:5601 [xxx-kibana-external-kb-http/xxx.xx.xx.xxx] failed: connect timed out
Check diagnostics.log in the archive file for more detail.
23:58:41.883 [main] INFO  com.elastic.support.BaseService - Closing loggers.
23:58:41.883 [main] INFO  com.elastic.support.BaseService - Archiving diagnostic results.

Is there any other flag that we need to specify apart from -o -r?

kunisen commented Oct 14, 2021

It seems the timeout is 1 minute when collecting the Kibana diagnostics; the two log lines below are about 60 seconds apart.

23:57:41.776 [main] INFO  com.elastic.support.diagnostics.commands.CheckKibanaVersion - Getting Kibana Version.
23:58:41.875 [main] ERROR com.elastic.support.rest.RestClient - Unexpected Execution Error
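The gap between those two log lines can be checked with a few lines of Python. This is only a sketch: the helper name `elapsed_seconds` is made up for illustration, and it assumes both timestamps fall on the same day, as they do in the log above.

```python
from datetime import datetime

def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two HH:MM:SS.mmm timestamps.

    Assumes both timestamps are on the same day (no midnight wrap-around).
    """
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# Timestamps copied from the Kibana diagnostics.log lines quoted above.
print(elapsed_seconds("23:57:41.776", "23:58:41.875"))
```

The result is roughly 60.1 seconds, which is consistent with the 1-minute estimate.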

I'm not yet sure this is purely a timeout issue, but the timeout value is hard to tweak right now because it isn't exposed as a parameter.
Could we first expose this option and see whether simply increasing the timeout solves the issue?

Alternatively, we could use 5 minutes by default. Depending on the situation, making the timeout tunable and also making the default a bit longer may be the better option.

Collaborator

pebrc commented Oct 14, 2021

These timeouts are defaults in the stack diagnostics tool, not in eck-diagnostics: https://github.com/elastic/support-diagnostics/blob/bad8fe76f2d2be716c14ffc5455f8fb51d78d280/src/main/resources/diags.yml#L24-L30

They are read from the class path, so I think we would have to either rebuild the support-diagnostics tool with different settings or inject a different configuration file into the JVM class path.

The other question is: is there any reason to believe the Kibana diagnostics extraction would have succeeded if we had waited longer?
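One way to approach that question is to check whether the Kibana service accepts TCP connections at all from inside the cluster, independently of the diagnostics tool. The sketch below is an assumption-laden illustration, not part of eck-diagnostics or support-diagnostics: `can_connect` is a hypothetical helper, and it would need to run from a pod in the same namespace so that the service name from the log resolves.

```python
import socket

def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    Any failure (DNS error, refusal, timeout) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # socket.timeout and socket.gaierror are both subclasses of OSError.
        return False

# Hypothetical usage from a pod in the affected namespace, using the
# (redacted) service name and port that appear in the log:
#   can_connect("xxx-kibana-external-kb-http", 5601, timeout_s=10.0)
```

If this returns False even with a generous timeout, a longer timeout in the diagnostics tool is unlikely to help, and the cause is more likely network-level (for example a network policy or a service without ready endpoints).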

@milanage
Author

I'm not sure about the Kibana diagnostics part, but we attempted an ES diagnostic (same API mode) and it was successful.
The uncompressed size of the ES diagnostics is quite large (~660MB, with a 108MB cluster_state.json). I guess the failure could be related to the large size? On the other hand, if the standalone diag tool and the one in eck-diagnostics do exactly the same thing, why was the outcome different?
