-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix jl_active_task_stack to return correct values #56477
Conversation
This would be really nice to have :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me to the extent that I can evaluate it. I really hope people like @JeffBezanson @d-netto @gbaraldi @udesou might find time to have a look so it can be merged
typedef struct _jl_excstack_t jl_excstack_t; | ||
typedef struct _jl_handler_t jl_handler_t; | ||
|
||
typedef struct _jl_task_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A major chunk of this PR seems to be about moving _jl_task_t
from julia.h
to here. Is there a functional reason why this is necessary for this PR? Perhaps something got added to _jl_task_t
-- but it's difficult to say because everything is in a single commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made PR #57117 to take care of this
@@ -3335,6 +3335,7 @@ JL_DLLEXPORT void jl_gc_collect(jl_gc_collection_t collection) | |||
} | |||
jl_gc_debug_print(); | |||
|
|||
ct->ctx.activefp = (char*)jl_get_frame_addr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that one seems simple enough :-)
@@ -1720,18 +1720,6 @@ JL_DLLEXPORT unsigned jl_special_vector_alignment(size_t nfields, jl_value_t *fi | |||
void register_eh_frames(uint8_t *Addr, size_t Size) JL_NOTSAFEPOINT; | |||
void deregister_eh_frames(uint8_t *Addr, size_t Size) JL_NOTSAFEPOINT; | |||
|
|||
STATIC_INLINE void *jl_get_frame_addr(void) JL_NOTSAFEPOINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to julia_threads.h
so it can be called there, OK
@@ -270,6 +345,9 @@ STATIC_INLINE int8_t jl_gc_state_set(jl_ptls_t ptls, int8_t state, | |||
{ | |||
assert(old_state != JL_GC_PARALLEL_COLLECTOR_THREAD); | |||
assert(old_state != JL_GC_CONCURRENT_COLLECTOR_THREAD); | |||
if (__builtin_constant_p(state) ? state : // required if !old_state && state, otherwise optional | |||
__builtin_constant_p(old_state) ? !old_state : 1) | |||
jl_atomic_load_relaxed(&ptls->current_task)->ctx.activefp = (char*)jl_get_frame_addr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea what the semantics for jl_gc_state_set
are, when it is supposed to be called, etc., so I can't judge whether this is sensible or effective.
This seems to store the current frame pointer into the task. Presumably that means this function is being called from the task as it is about to switch out? At least that would make sense.
My personal dream is that the wonderful people adding MMTk and doing other GC restructuring work, like @gbaraldi @d-netto @udesou), add some more comments describing such things as they figure this stuff out for themselves ;-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should be updating the activefp
in jl_gc_state_set
?
Makes more sense to me to update this in the segv_handler
or jl_set_gc_and_wait
.
Also, I believe we should fetch the frame pointer from the context the thread was running with before it got trapped in the signal handler -- note that we run the signal handlers with a different stack.
This is more or less what I started doing in #55003.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should be updating the activefp in jl_gc_state_set? ... Also, I believe we should fetch the frame pointer from the context the thread was running with before it got trapped in the signal handler -- note that we run the signal handlers with a different stack.
It ends up being mandatory, since we don't stop all threads when reading their pgcstack, although that does mean it may be often unreliable / unsafe to conservatively scan the stack of the other threads. The signal handler issue is a good point. We would need to pass through the ucontext_t to jl_set_gc_and_wait
, like we do on Apple, to get a stack pointer from there (plus redzone).
Even with those changed though, we know now that in v1.11+ the llvm optimizations have reached a point where we know many cases that will necessarily cause the GC to segfault when doing conservative stack scanning, so there isn't any safe way to do this. I am going to go ahead and close out this PR, though someone could revive it under a build flag if they really care about it, and don't care if using conservative GC makes their code segfault sometimes or are willing to run with -O0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we already do use conservative GC in Julia 1.11 and master... though we do it by catching the "segfaults", though really the only ones we had so far were due to incorrect values for the stack start being reported causing us to read into guard pages.
Though perhaps we'd be crashing more if we used tasks extensively? I don't know enough about what the problem is right now / how "wrong" the currently reported values are? I thought they are just "too big" as in the stack is overestimated? But maybe that's not true anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The observation now is that values won't be allocated correctly once they are provably garbage (during optimization), which may lead to the GC trying to access memory that is not valid to access. So you need to either turn off optimizations (-O0) or use a precise stack scan, or just hope to be lucky that it doesn't crash often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't quite understand what that means. Anyway based on the bits I do understand I hope we won't be affected because we don't scan for general Julia objects, just for those of a specific "foreign" type. Since these are allocated via the C ABI from C code I don't think the LLVM optimizer has any chance of optimizing them away in any way...
Value *currentfp = builder.CreateConstInBoundsGEP1_32(T_int8, task, offset, "gc_state"); | ||
Value *fp = builder.CreateIntrinsic(Intrinsic::frameaddress, {builder.getPtrTy()}, {builder.getInt32(0)}); | ||
builder.CreateAlignedStore(fp, currentfp, Align(sizeof(void*))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea how this code is supposed to work, so I can't review it other than on a superficial level: My guess would be that this function is a code-emitting counterpart to jl_gc_state_set
, and so a similar modification as there is applied here, to update activefp
to contain the current framepointer. The inserted code looks plausibly like it might be doing that
*size = task->ctx.bufsz - off; | ||
return (void *)((char *)task->ctx.stkbuf + off); | ||
} | ||
|
||
JL_DLLEXPORT void jl_active_task_stack(jl_task_t *task, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THe changes in jl_active_task_stack
look plausible to me
557fcb7
to
f1b0405
Compare
Seems potentially slightly costly to need to generate these extra writes for every safepoint transition to keep track of the current stack location, but that is potentially an unavoidable cost to doing conservative stack scanning for gc.
f1b0405
to
df68965
Compare
This was removed previously in PR #54527 but had to be reverted in PR #54559 as one usage remained (more by accident then by design). This has since then been resolved. This is also part of PR #56477 but that seems stalled right now, and in fact merging parts of it now may make the review of it easier later on...
This is a subset of PR #56477 . I hope that it will be easier to get this merged first, as it just moves things around, afterwards we can update #56477 and it will be considerably smaller, and will thus hopefully break less often (and in less bad ways) over time. (I can perform the required update of that PR, too). See also PR #57116 which contains another independent part of that PR.
This was removed previously in PR JuliaLang#54527 but had to be reverted in PR JuliaLang#54559 as one usage remained (more by accident then by design). This has since then been resolved. This is also part of PR JuliaLang#56477 but that seems stalled right now, and in fact merging parts of it now may make the review of it easier later on...
This is a subset of PR JuliaLang#56477 . I hope that it will be easier to get this merged first, as it just moves things around, afterwards we can update JuliaLang#56477 and it will be considerably smaller, and will thus hopefully break less often (and in less bad ways) over time. (I can perform the required update of that PR, too). See also PR JuliaLang#57116 which contains another independent part of that PR.
Store the activefp at every point where it might be observed to have changed. Seems potentially slightly costly to need to generate these extra writes for every safepoint transition to keep track of the current stack location, but that is potentially an unavoidable cost to doing conservative stack scanning for gc.
Replace #54923