Interpret? #19
I'd say we might as well make it the default for this package, i.e. have the higher-level function call …
Cool idea @shashi! Doing this would put extra pressure on the interpreter (correctness-wise & performance-wise), which is probably a good thing. CCing @KristofferC. Memory consumption is going to be high with the interpreter, if for no reason other than that nothing is inferrable, so everything gets boxed. There are a couple of lines I'm a bit surprised about, but most of that looks utterly unavoidable with the current design. Speaking purely of performance, unless there have been regressions we appear to be pretty near a local optimum now, but a more radical change could conceivably improve matters quite a bit. My favorite is JuliaDebug/JuliaInterpreter.jl#309, but that will require someone (me?) carving out a month or so.
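For context, running a call through JuliaInterpreter instead of compiling it is done via the exported `@interpret` macro; the proposal above would amount to the package's entry point wrapping its workload in something like this (a minimal sketch, not the package's actual code):

```julia
using JuliaInterpreter

# Evaluate the call in the interpreter rather than compiling it.
# Avoids native-code compilation for one-shot workloads, at the cost of
# interpreted execution speed (and boxed, uninferred values).
result = @interpret sum([1, 2, 3])
```

Because nothing is type-inferred inside the interpreter, intermediate values are boxed, which is where much of the allocation overhead discussed below comes from.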
Also worth pointing out that your memory consumption analysis appears to include initialization of the module.
Is the memory consumption here including the compilation of the first call to …? Also, why are we profiling memory instead of time? :)
@timholy Thanks for looking at this! Good to know it's a local minimum already :) I don't understand what's happening in your serialization PR yet, but it sounds good!

@KristofferC Haha, I was profiling with:

```julia
julia> @time @interpret sparsity!((y,x)->y[1:2] .= x[2:end], [1,2,3], [1,2,3])
Explored path: SparsityDetection.Path(Bool[], 1)
 12.001681 seconds (25.35 M allocations: 1.281 GiB, 3.33% gc time)
3×3 SparseArrays.SparseMatrixCSC{Bool,Int64} with 2 stored entries:
  [1, 2]  =  1
  [2, 3]  =  1
```

1.28 GiB (second run) is a lot.
Yes, I did it again for the second run only:
That's how it is: to avoid #265-like behavior, the interpreter clears all its old FrameCodes after each exit. So all those lines in …
JuliaInterpreter could speed up this package in theory. We usually need to run it once, and the compilation is garbage anyway.
A simple problem such as … takes 7 seconds, which is 2x faster than the compiled first run.
I did a track-allocation run, and below are the worst-offending lines. The 3rd column is the byte count (the 1st is the line number). If anyone from JuliaInterpreter is interested in looking into this together, or would suggest something, that would be awesome! cc @timholy
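For reference, the usual workflow for producing per-line byte counts like these looks roughly as follows (a sketch; the exact invocation used for this run isn't shown in the thread, and `workload()` is a hypothetical stand-in for the interpreted call):

```julia
# Start Julia with allocation tracking enabled for user code:
#   julia --track-allocation=user

using Profile

workload()                    # hypothetical: run once so compilation-time
                              # allocations are attributed and out of the way
Profile.clear_malloc_data()   # reset the per-line allocation counters
workload()                    # measured run: only this run's allocations count

# On exit, Julia writes per-line byte counts to *.mem files
# alongside each tracked source file.
```

The `Profile.clear_malloc_data()` step is what makes the numbers reflect a "second run" rather than including first-call compilation, which is the distinction being discussed above.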