This guide describes the recommended high-level architecture and steps to add hardware-specific optimized kernels to TfLite Micro.
The goal of these optimizations, and of the process we recommend for getting them merged into the TfLite Micro codebase, is a measurable and documented performance improvement on a benchmark of interest.
Once merged, the optimizations will be used for more than the benchmark, but the context for why they were added remains very important.
1. Pick a benchmark that you would like to measure the performance for.
   - Existing benchmarks are in the benchmarks directory.
   - If none of the existing benchmarks capture your use case, then please create a GitHub issue or start a thread on [email protected] to figure out how to add a new benchmark.
   - If adding a publicly-available benchmark to the TFLM codebase is determined to be infeasible, then a fallback would be to have an internal benchmark that can be used to document the benefits of adding the optimizations via PR descriptions.
   - Adding optimized code without any associated benchmarks will need very strong justification and will most likely not be permitted.
2. Do the groundwork and architecture needed to be able to add optimizations for your target (more details in the software architecture section).
3. Create one pull request for each optimized kernel, with the PR description clearly stating the commands that were used to measure the performance improvement.
   - This context is important even if the toolchain is proprietary and there are currently a small number of users.
   - See this PR as an example.
   - At minimum, the latency with and without the particular optimized kernel should be documented. Additional context may also be desirable.
   - Here is some general guidance on writing good PR descriptions.
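To illustrate the minimum latency documentation asked for above, here is a small host-side sketch that times a reference-style loop against a stand-in for an optimized kernel. All names (`ReferenceDot`, `OptimizedDot`, `MeasureMicros`) are hypothetical and are not TFLM APIs; on a real target you would rely on the TFLM benchmarks and the platform's own tick counter rather than `std::chrono`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical reference-style kernel: plain scalar int8 dot product.
int32_t ReferenceDot(const int8_t* a, const int8_t* b, size_t n) {
  int32_t acc = 0;
  for (size_t i = 0; i < n; ++i) {
    acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  }
  return acc;
}

// Hypothetical stand-in for an optimized kernel: same arithmetic, unrolled 4x.
int32_t OptimizedDot(const int8_t* a, const int8_t* b, size_t n) {
  int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc0 += a[i] * b[i];
    acc1 += a[i + 1] * b[i + 1];
    acc2 += a[i + 2] * b[i + 2];
    acc3 += a[i + 3] * b[i + 3];
  }
  for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
    acc0 += a[i] * b[i];
  }
  return acc0 + acc1 + acc2 + acc3;
}

// Calls fn() `iterations` times and returns the average wall-clock time per
// call in microseconds. The volatile sink discourages the compiler from
// discarding the calls; it is not a guarantee against all optimizations.
template <typename Fn>
double MeasureMicros(Fn fn, int iterations) {
  volatile int32_t sink = 0;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    sink = fn();
  }
  const auto stop = std::chrono::steady_clock::now();
  (void)sink;
  return std::chrono::duration<double, std::micro>(stop - start).count() /
         iterations;
}
```

In this sketch, one would call `MeasureMicros` once per implementation on the same inputs and report both numbers, together with the exact commands used, in the PR description.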
We would like to explicitly point out (as have others) that the reference kernel implementations are not performant and there are plenty of opportunities to speed them up. This is by design: the reference kernels are meant to be a shared starting point that is then optimized in a target-specific kernel implementation.
Two previous discussions on this topic are on PR #42477 and PR #45227.
Our current point of view on this topic is that while optimizing shared reference code in a portable manner is attractive, we are making an explicit choice to not go down that path and instead rely on target-specific optimized implementations. The TFLM codebase has a growing list of optimized kernel implementations, and we are investing in making the process of adding new implementations smoother.
The optimized kernel architecture is composed of the following three modules:
- Hardware-specific NN library
- Optimized Kernels
- Build System Integration
The hardware-specific NN library uses knowledge of the hardware and compiler to implement the underlying operations. Examples of this are CMSIS-NN from ARM and NNLib from Cadence.
The benefits of having this API separation are:
- The NN library does not need to follow the style guide of the rest of the TFLM code.
- Releases of the NN library can be made independently of TFLM.
- The same NN library can be used and tested independently of TFLM.
- The maintainers of the NN library have full control over the development process that they would like to follow.
The optimized kernels will be (hopefully thin) wrappers that act as the glue between TFLM and the NN library.
The goal here is to delegate as much work as possible to the NN library while still allowing the two APIs (TFLM and NN library) to be independent of each other. If there is a performance degradation due to this (for example, unnecessary memory copies) then we can evaluate those on a case-by-case basis.
This code will be reviewed and merged in the TFLM GitHub repository and must follow the development style of the TFLM codebase.
Some amount of refactoring of the existing code may be needed to ensure that code is suitably shared between the reference and optimized kernels. There is currently no fixed recipe for this refactor and we will evaluate on a case-by-case basis during the PR review.
For example, to add an optimized implementation for fully_connected for the Xtensa Fusion F1, the steps were:
- PR 1: refactor for reference fallbacks and a baseline latency.
- PR 2: refactor to share code between reference and optimized kernels.
- PR 3: add the code needed to use the optimized NN lib and document the latency improvement.
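As a rough illustration of the layering described above, the sketch below shows an optimized kernel acting as thin glue: it translates TFLM-side parameters into the NN library's own types, delegates the computation, and falls back to shared reference code for cases the library does not handle. Everything here is hypothetical: the `nnlib` namespace, `nnlib::FullyConnectedS8`, `FullyConnectedParams`, and `ReferenceFullyConnected` are stand-ins invented for this example and are not actual TFLM, CMSIS-NN, or NNLib APIs. Requantization of the accumulators is omitted for brevity.

```cpp
#include <cstdint>

// --- Hypothetical NN library, developed and released independently ---
namespace nnlib {
struct FcDims {
  int input_depth;
  int output_depth;
};

// Returns 0 on success, or -1 if the configuration is not supported.
// For illustration, only input depths that are multiples of 4 are "accelerated".
int FullyConnectedS8(const int8_t* input, const int8_t* weights,
                     const int32_t* bias, FcDims dims, int32_t* output) {
  if (dims.input_depth % 4 != 0) return -1;
  for (int o = 0; o < dims.output_depth; ++o) {
    const int8_t* row = weights + o * dims.input_depth;
    int32_t acc = bias[o];
    for (int i = 0; i < dims.input_depth; i += 4) {
      acc += input[i] * row[i] + input[i + 1] * row[i + 1] +
             input[i + 2] * row[i + 2] + input[i + 3] * row[i + 3];
    }
    output[o] = acc;
  }
  return 0;
}
}  // namespace nnlib

// --- Hypothetical TFLM-side code ---
struct FullyConnectedParams {
  int input_depth;
  int output_depth;
};

// Shared reference code, usable by both the reference and optimized kernels.
void ReferenceFullyConnected(const FullyConnectedParams& params,
                             const int8_t* input, const int8_t* weights,
                             const int32_t* bias, int32_t* output) {
  for (int o = 0; o < params.output_depth; ++o) {
    int32_t acc = bias[o];
    for (int i = 0; i < params.input_depth; ++i) {
      acc += input[i] * weights[o * params.input_depth + i];
    }
    output[o] = acc;
  }
}

// Optimized kernel: converts parameters, delegates to the NN library, and
// uses the reference fallback otherwise. Returns true if the optimized path ran.
bool OptimizedFullyConnected(const FullyConnectedParams& params,
                             const int8_t* input, const int8_t* weights,
                             const int32_t* bias, int32_t* output) {
  const nnlib::FcDims dims{params.input_depth, params.output_depth};
  if (nnlib::FullyConnectedS8(input, weights, bias, dims, output) == 0) {
    return true;
  }
  ReferenceFullyConnected(params, input, weights, bias, output);
  return false;
}
```

Keeping the translation from `FullyConnectedParams` to `nnlib::FcDims` inside the wrapper is what lets the two APIs evolve independently; neither side needs to include the other's headers in its public interface.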
Build system integration is the least defined module, but we strongly recommend the following:
1. A single target makefile.inc for all the architectures that you would like to support, along with an optional target-specific system_setup.cc. See cortex_m_generic_makefile.inc and xtensa_makefile.inc as examples.
2. A single ext_libs.inc (and associated scripts) that downloads any external dependencies (including the NN library).
3. The optimized kernels will then live in a kernels subdirectory (e.g. kernels/cmsis_nn and kernels/xtensa).
The TFLM team would like to encourage and support two development workflows:
1. Export static library + headers into target-specific development environment
   - Build a static libtensorflow-microlite.a using the TFLM makefile with:

         make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite

   - Use the static library and any TFLM headers as part of the overall application (with its own build system).
2. Integrate TFLM with IDE:
   - This has historically been done using the TFLM Makefile’s support for project generation.
   - However, given the learning curve and high maintenance overhead, we are moving away from supporting project generation via the Makefile and are encouraging future IDE integrations to be done outside of the TFLM Makefiles.
   - The TFLM team is currently working through the details on this topic.
The kernel tests are the primary method of ensuring that the optimized kernel implementations are accurate.
Currently, most of the tests require the optimizations to be bit-exact to the quantized reference implementation. We can revisit this requirement if it ends up imposing a high latency cost.
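To make the bit-exactness requirement concrete, the following sketch compares a reference-style saturating int8 add against a stand-in optimized version and counts the output elements that differ. The function names are hypothetical, the arithmetic is deliberately simplified (no rescaling or zero points), and this is not how the TFLM kernel tests are actually written; it only illustrates what "bit-exact" means for quantized outputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical reference kernel: element-wise int8 add with saturation.
void ReferenceAddS8(const int8_t* a, const int8_t* b, size_t n, int8_t* out) {
  for (size_t i = 0; i < n; ++i) {
    int32_t sum = static_cast<int32_t>(a[i]) + static_cast<int32_t>(b[i]);
    if (sum > 127) sum = 127;
    if (sum < -128) sum = -128;
    out[i] = static_cast<int8_t>(sum);
  }
}

// Hypothetical optimized kernel: same arithmetic, clamped via std::min/std::max.
void OptimizedAddS8(const int8_t* a, const int8_t* b, size_t n, int8_t* out) {
  for (size_t i = 0; i < n; ++i) {
    const int32_t sum = static_cast<int32_t>(a[i]) + static_cast<int32_t>(b[i]);
    out[i] = static_cast<int8_t>(
        std::min<int32_t>(127, std::max<int32_t>(-128, sum)));
  }
}

// Returns the number of elements where `actual` differs from `expected`.
// A bit-exact optimized kernel yields 0 for every tested input.
size_t CountMismatches(const int8_t* expected, const int8_t* actual, size_t n) {
  size_t mismatches = 0;
  for (size_t i = 0; i < n; ++i) {
    if (expected[i] != actual[i]) ++mismatches;
  }
  return mismatches;
}
```

A test in this style would feed both kernels the same inputs, including values that trigger saturation, and require `CountMismatches` to return zero.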
We strongly encourage optimized kernel implementations to have an associated continuous build that runs through all the unit tests and publishes a build badge to the TFLM community supported builds table. Running the unit tests once a day is often a good place to start.