Spark rewrite_data_files failing with java.lang.IllegalStateException: Connection pool shut down #12046
I tried to trace where the connection pool is being closed. Aside from calls stemming from finalizers on Thread shutdown (which seem perfectly legitimate), I see:
Where I would pick out the relevant line:
Line 69 in 7781360
My suspicion is that this IO object (created/obtained e.g. here, I believe:
Since we are using the Glue catalog, I believe this IO object will likely come all the way from GlueTableOperations. I am not completely familiar with the internals of Spark here, but it looks to me like this is basically trying to free up memory because it is possibly running up against some limits. As such, I could imagine this would really only happen in very particular cases. For us, this could also explain why we saw this sometimes with Glue 4.0, and now more often with Glue 5.0, because the behavior with respect to memory could have changed between versions.
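To make the object relationships concrete, here is a minimal Java sketch of how I understand the pieces fit together; the catalog, database, table, and bucket names are placeholders, and actually running it would require AWS credentials:

```java
import java.util.Map;

import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.io.FileIO;

public class BroadcastIoSketch {
  public static void main(String[] args) {
    // A Glue catalog configured roughly like our job's (warehouse path is a placeholder).
    GlueCatalog catalog = new GlueCatalog();
    catalog.initialize("glue", Map.of("warehouse", "s3://my-bucket/warehouse"));

    // Loading the table goes through GlueTableOperations, which supplies the S3FileIO.
    Table table = catalog.loadTable(TableIdentifier.of("my_db", "my_table"));
    FileIO io = table.io(); // the S3FileIO instance whose connection pool later shuts down

    // Spark broadcasts a serializable copy of the table (SerializableTableWithSize in the
    // Spark module). Each executor deserializes one copy, and its io() is shared by every
    // task on that executor, so closing it affects tasks that are still running.
    Table broadcastCopy = SerializableTable.copyOf(table);
    System.out.println(broadcastCopy.io().getClass().getName());
  }
}
```

The point is just that the FileIO carried by the broadcast table is a shared, long-lived object, not something owned by a single task.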
Ok, I can confirm that commenting out the code: Line 69 in 7781360
allows the job to run to completion.
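To illustrate why removing that close makes the error disappear, here is a rough sketch (the S3 path is hypothetical and running this needs AWS credentials): once a shared S3FileIO has been closed, later reads through it fail with exactly this exception.

```java
import java.util.Map;

import org.apache.iceberg.aws.s3.S3FileIO;

public class ClosedIoDemo {
  public static void main(String[] args) throws Exception {
    S3FileIO io = new S3FileIO();
    io.initialize(Map.of());

    // Hypothetical data file path; stands in for a file a rewrite task is reading.
    String path = "s3://my-bucket/warehouse/my_db/my_table/data/part-00000.parquet";

    // First read works (assuming the object exists and credentials are configured).
    io.newInputFile(path).newStream().close();

    // This is effectively what closing the serializable table does to the shared IO.
    io.close();

    // A task that still holds the same FileIO and reads after the close fails with the
    // symptom from our stack traces (the exact call where it surfaces may vary by version):
    // java.lang.IllegalStateException: Connection pool shut down
    io.newInputFile(path).newStream();
  }
}
```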
Just for documentation, something similar seems to have been discussed here when SerializableTableWithSize was made closeable:
Just to note, I have backed out the "workaround" (commenting out the closure of the S3FileIO), and have started seeing fewer errors on Glue. I'm not sure if something happened here, but since this seems to be dependent on memory, perhaps AWS tweaked some settings that lead to the broadcast table not being deleted before the task is in fact done. I will continue running this and have steadily added additional logging to help me trace where this is coming from, but it looks like:
Ok, I finally have a full explanation. The issue is that Spark is cleaning up memory, moving broadcast variables to disk, and this results in the closure of the I/O even if it is currently being used. This is the relevant Spark code:

This is what I see in the logs:
where the last line is logging that I have added to track how this is being called. I was also tracking calls to e.g. getInputFile and can see this being called after close has been called,
by adding:
I would summarize by saying that, unless it is possible to guarantee the serializable table is not removed from memory and persisted to disk, it is not possible to close the IO.
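A toy sketch of the race, with no Spark or Iceberg APIs involved (the Pool class here is purely illustrative): a task keeps reading through a shared resource while a second thread, standing in for Spark's broadcast eviction, closes it mid-run.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class EvictionRaceSketch {

  // Stand-in for the S3 connection pool behind the broadcast table's FileIO.
  static class Pool implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void read() {
      if (closed.get()) {
        throw new IllegalStateException("Connection pool shut down");
      }
    }

    @Override
    public void close() {
      closed.set(true);
    }
  }

  public static void main(String[] args) throws Exception {
    Pool shared = new Pool(); // one instance per executor, shared by its tasks
    ExecutorService executor = Executors.newSingleThreadExecutor();

    // The "rewrite task": keeps reading data files through the shared pool.
    Future<?> task = executor.submit(() -> {
      for (int i = 0; i < 1_000; i++) {
        shared.read();
        Thread.sleep(1);
      }
      return null;
    });

    // The "broadcast eviction": Spark frees memory and closes the value mid-task.
    Thread.sleep(50);
    shared.close();

    try {
      task.get();
    } catch (ExecutionException e) {
      // Prints: java.lang.IllegalStateException: Connection pool shut down
      System.out.println("Task failed mid-run: " + e.getCause());
    } finally {
      executor.shutdown();
    }
  }
}
```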
This is to fix: apache#12046. To summarize, the issue is that Spark can remove broadcast variables from memory and persist them to disk when memory needs to be freed. When this happens, the IO object would be closed even if it was still being used by tasks. This fixes the issue by removing the closure of the IO object when the serializable table is closed. The IO objects should instead be closed on thread finalizers.
Apache Iceberg version
1.7.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
We are running a maintenance job to rewrite data files (in parallel) on AWS Glue, calling the rewrite_data_files procedure (a sketch of the general call shape is included after the points below). We are getting errors like the following:
A few points:
- We also tried the 1.7.x branch with the recent changes, but the error still remained.
- This suggests to me that the lifecycle of this pool connection is simply not working correctly.
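For reference, the call we make has this general shape; the catalog, database, table names, and option values below are placeholders rather than our exact job:

```java
import org.apache.spark.sql.SparkSession;

public class RewriteDataFilesCall {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-rewrite-data-files")
        .getOrCreate();

    // Placeholder catalog/table/options; 'max-concurrent-file-group-rewrites' is what
    // gives us the parallel file-group rewrites mentioned above.
    String call =
        "CALL glue_catalog.system.rewrite_data_files("
            + " table => 'my_db.my_table',"
            + " strategy => 'binpack',"
            + " options => map("
            + "   'max-concurrent-file-group-rewrites', '10',"
            + "   'partial-progress.enabled', 'true'"
            + " )"
            + ")";

    spark.sql(call).show();

    spark.stop();
  }
}
```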
I am happy to try to provide some additional information here and help with a fix, but I'd need some guidance on how to do this.
Willingness to contribute