Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix jl_active_task_stack to return correct values #56477

Closed
wants to merge 3 commits into from

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Nov 6, 2024

Store the activefp at every point where it might be observed to have changed. Seems potentially slightly costly to need to generate these extra writes for every safepoint transition to keep track of the current stack location, but that is potentially an unavoidable cost to doing conservative stack scanning for gc.

Replace #54923

@vtjnash vtjnash added the GC Garbage collector label Nov 6, 2024
@vtjnash vtjnash requested a review from JeffBezanson November 6, 2024 18:39
@fingolfin
Copy link
Member

This would be really nice to have :-)

Copy link
Member

@fingolfin fingolfin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me to the extent that I can evaluate it. I really hope people like @JeffBezanson @d-netto @gbaraldi @udesou might find time to have a look so it can be merged

typedef struct _jl_excstack_t jl_excstack_t;
typedef struct _jl_handler_t jl_handler_t;

typedef struct _jl_task_t {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A major chunk of this PR seems to be about moving _jl_task_t from julia.h to here. Is there a functional reason why this is necessary for this PR? Perhaps something got added to _jl_task_t -- but it's difficult to say because everything is in a single commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made PR #57117 to take care of this

@@ -3335,6 +3335,7 @@ JL_DLLEXPORT void jl_gc_collect(jl_gc_collection_t collection)
}
jl_gc_debug_print();

ct->ctx.activefp = (char*)jl_get_frame_addr();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that one seems simple enough :-)

@@ -1720,18 +1720,6 @@ JL_DLLEXPORT unsigned jl_special_vector_alignment(size_t nfields, jl_value_t *fi
void register_eh_frames(uint8_t *Addr, size_t Size) JL_NOTSAFEPOINT;
void deregister_eh_frames(uint8_t *Addr, size_t Size) JL_NOTSAFEPOINT;

STATIC_INLINE void *jl_get_frame_addr(void) JL_NOTSAFEPOINT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to julia_threads.h so it can be called there, OK

@@ -270,6 +345,9 @@ STATIC_INLINE int8_t jl_gc_state_set(jl_ptls_t ptls, int8_t state,
{
assert(old_state != JL_GC_PARALLEL_COLLECTOR_THREAD);
assert(old_state != JL_GC_CONCURRENT_COLLECTOR_THREAD);
if (__builtin_constant_p(state) ? state : // required if !old_state && state, otherwise optional
__builtin_constant_p(old_state) ? !old_state : 1)
jl_atomic_load_relaxed(&ptls->current_task)->ctx.activefp = (char*)jl_get_frame_addr();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what the semantics for jl_gc_state_set are, when it is supposed to be called, etc., so I can't judge whether this is sensible or effective.

This seems to store the current frame pointer into the task. Presumably that means this function is being called from the task as it is about to switch out? At least that would make sense.

My personal dream is that the wonderful people adding MMTk and doing other GC restructuring work, like @gbaraldi @d-netto @udesou), add some more comments describing such things as they figure this stuff out for themselves ;-).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should be updating the activefp in jl_gc_state_set?

Makes more sense to me to update this in the segv_handler or jl_set_gc_and_wait.

Also, I believe we should fetch the frame pointer from the context the thread was running with before it got trapped in the signal handler -- note that we run the signal handlers with a different stack.

This is more or less what I started doing in #55003.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should be updating the activefp in jl_gc_state_set? ... Also, I believe we should fetch the frame pointer from the context the thread was running with before it got trapped in the signal handler -- note that we run the signal handlers with a different stack.

It ends up being mandatory, since we don't stop all threads when reading their pgcstack, although that does mean it may be often unreliable / unsafe to conservatively scan the stack of the other threads. The signal handler issue is a good point. We would need to pass through the ucontext_t to jl_set_gc_and_wait, like we do on Apple, to get a stack pointer from there (plus redzone).

Even with those changed though, we know now that in v1.11+ the llvm optimizations have reached a point where we know many cases that will necessarily cause the GC to segfault when doing conservative stack scanning, so there isn't any safe way to do this. I am going to go ahead and close out this PR, though someone could revive it under a build flag if they really care about it, and don't care if using conservative GC makes their code segfault sometimes or are willing to run with -O0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we already do use conservative GC in Julia 1.11 and master... though we do it by catching the "segfaults", though really the only ones we had so far were due to incorrect values for the stack start being reported causing us to read into guard pages.

Though perhaps we'd be crashing more if we used tasks extensively? I don't know enough about what the problem is right now / how "wrong" the currently reported values are? I thought they are just "too big" as in the stack is overestimated? But maybe that's not true anymore?

Copy link
Member Author

@vtjnash vtjnash Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The observation now is that values won't be allocated correctly once they are provably garbage (during optimization), which may lead to the GC trying to access memory that is not valid to access. So you need to either turn off optimizations (-O0) or use a precise stack scan, or just hope to be lucky that it doesn't crash often.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't quite understand what that means. Anyway based on the bits I do understand I hope we won't be affected because we don't scan for general Julia objects, just for those of a specific "foreign" type. Since these are allocated via the C ABI from C code I don't think the LLVM optimizer has any chance of optimizing them away in any way...

Value *currentfp = builder.CreateConstInBoundsGEP1_32(T_int8, task, offset, "gc_state");
Value *fp = builder.CreateIntrinsic(Intrinsic::frameaddress, {builder.getPtrTy()}, {builder.getInt32(0)});
builder.CreateAlignedStore(fp, currentfp, Align(sizeof(void*)));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how this code is supposed to work, so I can't review it other than on a superficial level: My guess would be that this function is a code-emitting counterpart to jl_gc_state_set, and so a similar modification as there is applied here, to update activefp to contain the current framepointer. The inserted code looks plausibly like it might be doing that

*size = task->ctx.bufsz - off;
return (void *)((char *)task->ctx.stkbuf + off);
}

JL_DLLEXPORT void jl_active_task_stack(jl_task_t *task,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THe changes in jl_active_task_stack look plausible to me

This was removed previously in PR #54527 but had to be reverted in
PR #54559 as one usage remained (more by accident then by design).
This has since then been resolved.
fingolfin and others added 2 commits January 21, 2025 15:34
Seems potentially slightly costly to need to generate these extra
writes for every safepoint transition to keep track of the current
stack location, but that is potentially an unavoidable cost to doing
conservative stack scanning for gc.
@fingolfin fingolfin force-pushed the jn/jl_active_task_stack branch from f1b0405 to df68965 Compare January 21, 2025 14:37
d-netto pushed a commit that referenced this pull request Jan 21, 2025
This was removed previously in PR #54527 but had to be reverted in PR
#54559 as one usage remained (more by accident then by design). This has
since then been resolved.

This is also part of PR #56477 but that seems stalled right now, and in
fact merging parts of it now may make the review of it easier later
on...
@vtjnash vtjnash closed this Jan 22, 2025
vtjnash pushed a commit that referenced this pull request Jan 22, 2025
This is a subset of PR #56477 . I hope that it will be easier to get
this merged first, as it just moves things around, afterwards we can
update #56477 and it will be considerably smaller, and will thus
hopefully break less often (and in less bad ways) over time. (I can
perform the required update of that PR, too).

See also PR #57116 which contains another independent part of that PR.
xal-0 pushed a commit to xal-0/julia that referenced this pull request Jan 24, 2025
This was removed previously in PR JuliaLang#54527 but had to be reverted in PR
JuliaLang#54559 as one usage remained (more by accident then by design). This has
since then been resolved.

This is also part of PR JuliaLang#56477 but that seems stalled right now, and in
fact merging parts of it now may make the review of it easier later
on...
xal-0 pushed a commit to xal-0/julia that referenced this pull request Jan 24, 2025
This is a subset of PR JuliaLang#56477 . I hope that it will be easier to get
this merged first, as it just moves things around, afterwards we can
update JuliaLang#56477 and it will be considerably smaller, and will thus
hopefully break less often (and in less bad ways) over time. (I can
perform the required update of that PR, too).

See also PR JuliaLang#57116 which contains another independent part of that PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GC Garbage collector
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants