Known Issue: Reusing the result of aot.compile leads to segfault upon Python exiting #42

Open
slyubomirsky opened this issue Jul 23, 2020 · 3 comments

Comments

@slyubomirsky
Contributor

slyubomirsky commented Jul 23, 2020

Here is a test case from the unit tests, which passes:

def test_compose():
    mod = Module()
    p = Prelude(mod)
    add_nat_definitions(p)
    x = relay.Var('x')
    inc = GlobalVar('inc')
    mod[inc] = Function([x], p.s(x))
    x = relay.Var('x')
    func = GlobalVar('func')
    f = Function([x], relay.Call(p.compose(inc, p.double), [x]))
    mod[func] = f
    cfunc = compile(func, mod)
    assert nat_to_int(cfunc(p.s(p.s(p.z())))) == 5

However, this variant, which only repeats the final assertion, segfaults when the Python interpreter exits (even though all tests pass):

def test_compose():
    mod = Module()
    p = Prelude(mod)
    add_nat_definitions(p)
    x = relay.Var('x')
    inc = GlobalVar('inc')
    mod[inc] = Function([x], p.s(x))
    x = relay.Var('x')
    func = GlobalVar('func')
    f = Function([x], relay.Call(p.compose(inc, p.double), [x]))
    mod[func] = f
    cfunc = compile(func, mod)
    assert nat_to_int(cfunc(p.s(p.s(p.z())))) == 5
    # The two repeated calls below are the only change from the passing test.
    assert nat_to_int(cfunc(p.s(p.s(p.z())))) == 5
    assert nat_to_int(cfunc(p.s(p.s(p.z())))) == 5

The GDB backtrace reveals the following:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
malloc_consolidate (av=av@entry=0x7ffff7dcfc40 <main_arena>) at malloc.c:4439
4439    malloc.c: No such file or directory.
(gdb) bt
#0  malloc_consolidate (av=av@entry=0x7ffff7dcfc40 <main_arena>) at malloc.c:4439
#1  0x00007ffff7a7c0ab in _int_free (have_lock=0, p=<optimized out>, av=0x7ffff7dcfc40 <main_arena>) at malloc.c:4362
#2  __GI___libc_free (mem=0x1b72750) at malloc.c:3124
#3  0x00007fff89044e69 in dmlc::parameter::FieldEntry<int>::~FieldEntry() ()
   from /home/sslyu/.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
#4  0x00007fff89044557 in dmlc::parameter::ParamManager::~ParamManager() ()
   from /home/sslyu/.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
#5  0x00007ffff7a270f1 in __run_exit_handlers (status=0, listp=0x7ffff7dcf718 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#6  0x00007ffff7a271ea in __GI_exit (status=<optimized out>) at exit.c:139
#7  0x00007ffff7a05b9e in __libc_start_main (main=0x4b0c20 <main>, argc=2, argv=0x7fffffffdcb8,
    init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdca8)
    at ../csu/libc-start.c:344
#8  0x00000000005b250a in _start ()

The crash happens in the process exit handlers, while a dmlc destructor loaded from libxgboost.so is freeing memory. There seems to be some kind of nasty interaction happening somewhere inside TVM's memory (I have also had this happen upon exiting Python). This was observed on TVM commit 5046ff25116d66032f5d1b69d240f0a655a1ed92; I do not know exactly which TVM commit introduced the bug.

Note also that the bug is inconsistent: sometimes duplicating one call to a compiled function is enough to get a segfault; other times, I have to duplicate a call to a different compiled function.
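Since the segfault happens at interpreter exit and the tests themselves report success, one way to notice it reliably is to run the tests in a child process and inspect how that process ended. The following is only a minimal sketch under stated assumptions: the helper name `exit_signal` and the example command line are hypothetical and not part of the AoT compiler or TVM, and the signal check relies on POSIX behavior.

```python
import signal
import subprocess
import sys


def exit_signal(cmd):
    """Run cmd in a child process.

    Return the name of the signal that killed it, or None if it exited normally.
    """
    returncode = subprocess.run(cmd).returncode
    # On POSIX, a return code of -N means the child was terminated by signal N.
    if returncode < 0:
        return signal.Signals(-returncode).name
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage (the pytest invocation is an assumption):
    #   python3 check_exit.py python3 -m pytest test/test_aot.py
    killed_by = exit_signal(sys.argv[1:])
    print("killed by:", killed_by if killed_by else "no signal")
```

A result of `SIGSEGV` here would correspond to the crash in the backtrace above, even when every assertion in the test file passed.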

@slyubomirsky
Contributor Author

Thanks to the magic of git bisect, I believe this error began with TVM commit 4a262eca5570fab5ef23530c10671d7d503a5152. (PR #5517)

Repro steps:

  1. I changed test_nat_add() in test/test_aot.py so that it ends with this line, repeated twice: assert nat_to_int(cfunc()) == 7
  2. I ran git bisect starting from TVM commit 22db299b33f05570db2a5a406bdb37b57198a822 (from the date of the last update to the AoT compiler), using test/test_aot.py to decide whether each commit was good or bad. This took about 8 or 9 iterations.

I worry that this bug may be probabilistic, in which case a single run per commit could mark a bad commit as good, so git bisect is not a foolproof method for it.
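One way to make the bisection less sensitive to a probabilistic failure is to run the test several times per commit and mark the commit bad if any run fails. The sketch below is an illustration, not what was used above: the script name, the `classify` helper, and the repeat count are hypothetical, and it assumes TVM has already been rebuilt for the commit under test. It relies on the documented `git bisect run` convention that exit code 0 means good and 1 means bad.

```python
import subprocess
import sys

GOOD, BAD = 0, 1  # exit codes that `git bisect run` reads as good / bad


def classify(cmd, repeats=10):
    """Return BAD as soon as any of `repeats` runs of cmd fails, else GOOD."""
    for _ in range(repeats):
        # A non-zero return code covers both test failures and death by a
        # signal such as SIGSEGV (reported as a negative code on POSIX).
        if subprocess.run(cmd).returncode != 0:
            return BAD
    return GOOD


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage (script name and test command are assumptions):
    #   git bisect run python3 repeat_test.py python3 -m pytest test/test_aot.py
    sys.exit(classify(sys.argv[1:]))
```

Repeating runs only lowers the chance of a false "good" verdict; it cannot rule it out, so the commit that bisect reports is still worth confirming by hand.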

@slyubomirsky
Contributor Author

slyubomirsky commented Jul 24, 2020

Infuriating plot twist: Compiling TVM with GDB flags makes this error go away. It appears to be necessary to repeat calls a specific number of times to trigger the segfaults. It does not seem to be completely deterministic.

@slyubomirsky
Contributor Author

Managed to trigger a segfault (in the same manner, by repeating calls to a given compiled function in the tests) with GDB flags on. The backtrace is nearly identical to the last one; only frame #4 differs (a ParamManagerSingleton destructor instead of ParamManager::~ParamManager):

#0  malloc_consolidate (av=av@entry=0x7ffff7dcfc40 <main_arena>) at malloc.c:4439
#1  0x00007ffff7a7c0ab in _int_free (have_lock=0, p=<optimized out>,
    av=0x7ffff7dcfc40 <main_arena>) at malloc.c:4362
#2  __GI___libc_free (mem=0x1f46480) at malloc.c:3124
#3  0x00007fff89044e69 in dmlc::parameter::FieldEntry<int>::~FieldEntry() ()
   from /home/sslyu/.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
#4  0x00007fff89263d47 in dmlc::parameter::ParamManagerSingleton<xgboost::tree::GPUHistMakerTrainParam>::~ParamManagerSingleton() ()
   from /home/sslyu/.local/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so
#5  0x00007ffff7a270f1 in __run_exit_handlers (status=0,
    listp=0x7ffff7dcf718 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#6  0x00007ffff7a271ea in __GI_exit (status=<optimized out>) at exit.c:139
#7  0x00007ffff7a05b9e in __libc_start_main (main=0x4b0ce0 <main>, argc=2,
    argv=0x7fffffffe268, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffe258) at ../csu/libc-start.c:344
#8  0x00000000005b26fa in _start ()
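Both backtraces crash inside destructors from libxgboost.so, so it may help to confirm whether that library is actually mapped into the test process. The following is a rough diagnostic sketch, not a fix and not a claim about the root cause: `loaded_libraries` is a hypothetical helper, and it assumes a Linux system where `/proc/self/maps` lists the files mapped into the current process.

```python
import os


def loaded_libraries(substring, maps_path="/proc/self/maps"):
    """Return sorted paths of mapped files whose file name contains substring.

    Linux only; returns an empty list if maps_path does not exist.
    """
    if not os.path.exists(maps_path):
        return []
    paths = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, optional pathname.
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                if substring in os.path.basename(path):
                    paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    print(loaded_libraries("xgboost"))
```

Calling `loaded_libraries("xgboost")` at the end of a test would show whether libxgboost.so was pulled into the process, which is a precondition for its destructors running at exit.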
