v0.17~preview.129.11+135

janestreet · Mar 7, 2024 · a852eed · a852eed
1 parent 3aa49ef
commit a852eed
Show file tree

Hide file tree

Showing 29 changed files with 44,302 additions and 53,322 deletions.
diff --git a/README.md b/README.md
@@ -134,6 +134,11 @@ Running the pre-trained models on sample images can the easily be done via:
 dune exec examples/pretrained/predict.exe path/to/resnet18.ot images/tiger.jpg
 ```
 
+## Internals
+
+ocaml-torch uses extensive code generation to produce bindings to thousands of torch C++ functions.
+Read [internals.md](./internals.md) for details.
+
 ## Acknowledgements
 
 Many thanks to [@LaurentMazare](https://github.com/LaurentMazare) for the [original

diff --git a/images/codegen_graph.dot b/images/codegen_graph.dot
@@ -0,0 +1,59 @@
+digraph {
+  subgraph cluster_binding_gen {
+    style=filled
+    color=lightgrey
+    label="binding generation"
+    declarations [label="declarations (yaml)"]
+    bindinggen [label="binding gen exe"];
+  }
+  subgraph cluster_bindings {
+    style=filled
+    color=lightgrey
+    label="stub generation and bindings"
+    bindings [label="bindings (manual)"]
+    bindingsg [label="bindings (generated)"]
+    stubgen [label="ctypes stub gen exe"];
+  }
+  subgraph cluster_wrapper {
+    style=filled
+    color=lightgrey
+    label="wrapper"
+    {rank=same;
+    stubsml [label="OCaml stubs (manual)", group=g1];
+    stubsmlg [label="OCaml stubs (generated)", group=g2];
+    }
+    {rank=same;
+    stubsc [label="C stubs (manual)", group=g1];
+    stubscg [label="C stubs (generated)", group=g2];
+    }
+    {rank=same;
+    apiml [label="OCaml wrapper (manual)", group=g1];
+    apimlg [label="OCaml wrapper (generated)", group=g2];
+    }
+    {rank=same;
+    apic [label="C/C++ API (manual)", group=g1];
+    apicg [label="C/C++ API (generated)", group=g2];
+    }
+  }
+
+
+  // GENERATION
+  bindinggen -> bindingsg [penwidth=2];
+  bindinggen -> apimlg [penwidth=2];
+  bindinggen -> apicg [penwidth=2];
+  stubgen -> stubscg [penwidth=2];
+  stubgen -> stubsmlg [penwidth=2];
+
+  // DEPENDENCY
+  declarations -> bindinggen[style="dashed"];
+  bindings -> stubgen[style="dashed"];
+  bindingsg -> stubgen[style="dashed"];
+  apic -> stubsc[style="dashed"];
+  apic -> stubscg[style="dashed"];
+  apicg -> stubscg[style="dashed"];
+  stubsc -> stubsml[style="dashed"];
+  stubscg -> stubsmlg[style="dashed"];
+  stubsml -> apiml[style="dashed"];
+  stubsml -> apimlg[style="dashed"];
+  stubsmlg -> apimlg[style="dashed"];
+}
diff --git a/internals.md b/internals.md
@@ -0,0 +1,96 @@
+# ocaml-torch internals
+
+ocaml-torch faces several challenges, including:
+* binding to thousands of functions
+* avoiding any minor memory leaks in these functions
+* quickly cleaning up the memory allocations of tensors when OCaml is done with them
+
+In order to solve this, we have 2 steps of code generation. In this diagram, solid arrows
+represent the code generation DAG and dashed arrows represent the code dependency DAG:
+
+![code generation DAGs](./images/codegen_graph.png)
+
+At a high level,
+
+* Declarations.yaml contains the function signatures for the whole Torch C++ API.
+* Custom binding generation reads all the declarations, and whenever possible, generating
+  * glue code for crossing between C/C++ (the generated C/C++ API),
+  * glue code for using the (yet to be generated) OCaml `foreign` functions in OCaml (the generated OCaml wrapper),
+  * and `ctypes` bindings.
+* Stub generation uses the `ctypes` library, reading the bindings and generating C and
+  OCaml stubs. These are just glue code to handle C/OCaml FFI. Note that we have some
+  manually-written C++ functions and bindings that get generated stubs.
+* There are an extremely small number of manually-written stubs (just 1 as of writing)
+  that ctypes cannot handle.
+* A combination of the generated OCaml wrapper and manually written wrapper provide an
+  actually usable OCaml API. These are further built upon in the main library (not
+  pictured).
+
+# Memory management
+
+A large part of this complexity is driven by memory management.
+
+## Avoiding memory leaks
+
+It is challenging to write manual FFI stubs without memory leaks or race conditions. We
+use `ctypes` to make sure we get this right on the vast majority of functions. Although it
+requires a second code generation step, this spares us from reinventing stub generation.
+
+## Cleaning up tensors
+
+We ensure that tensors are freed when OCaml garbage collects them. To do this, each Tensor
+is equipped with a custom finalizer. This could be done on either the C++ or OCaml side.
+However, the API to inform OCaml of a tensor's true size in memory only exists in C++ (the
+custom block API). Without this, OCaml would not know when to garbage collect on CPU and
+would OOM easily.
+
+
+Note that:
+
+* We have not yet informed OCaml of each tensor's true size, but this is coming soon.
+* OCaml is unaware of GPU memory usage. GPU users may need to manually free tensors or
+  manaully garbage collect.
+
+### Raw tensors and GC tensors
+
+One wrinkle in this setup is that ctypes cannot handle custom blocks. Since we want the
+bulk of our stubs to be generated by ctypes, we create a distinction between `raw_tensor`s
+and `gc_tensor`s.
+
+|                    | raw tensor | GC tensor   |
+|--------------------|------------|-------------|
+| has finalizer?     | no         | yes         |
+| GC knows its size? | no         | coming soon |
+| FFI input for C?   | no         | yes         |
+| FFI output from C? | yes        | no          |
+| ctypes type        | void ptr   | void ptr    |
+
+The only way to convert from a `raw_tensor` to `gc_tensor` is with the hand-written,
+non-ctypes function `with_tensor_gc`. It is used copiously in the generated OCaml wrapper
+code to ensure we only surface GC tensors to the user.
+
+The lifecycle of each tensor looks like this:
+
+1. Some wrapper function `let t = Tensor.foo ()` gets invoked, which makes its way into C++.
+2. C++ returns a `raw_tensor` that goes through a regular ctypes stub and makes its way
+   back to the OCaml `Tensor.foo` call.
+3. Still in `Tensor.foo`, `with_tensor_gc` gets invoked. This goes back into C++ and
+   copies the pointer (but not the data) of the tensor to a new custom block. It now has
+   known off-heap size and a finalizer to free its memory. This gets returned to OCaml
+   with the same memory layout ctypes uses but without going through ctypes.
+4. Now `let () = Tensor.bar t` gets invoked. This goes through usual ctypes stubs, since
+   `t` looks just like a regular `void ptr` to ctypes.
+5. Eventually `t` gets garbage collected. OCaml traverses its blocks and runs the
+   finalizer on each one, freeing the tensor's data.
+
+The memory of each tensor (raw or GC) looks like this:
+
+```
+             block 1              block2
+        ------------------      ----------
+root -> | ctypes fat ptr |----> | void * |----> tensor
+        ------------------      ----------
+
+```
+
+For GC tensors, `block2` is the one with finalizer and off-heap memory.
diff --git a/src/bindings/dune b/src/bindings/dune
@@ -0,0 +1,6 @@
+(library
+ (name torch_bindings)
+ (public_name torch.bindings)
+ (libraries ctypes.stubs)
+ (preprocess
+  (pps ppx_jane)))